diff --git a/docs/clusters-at-yale/guides/cesm.md b/docs/clusters-at-yale/guides/cesm.md index 14373eb57..829dec6c2 100644 --- a/docs/clusters-at-yale/guides/cesm.md +++ b/docs/clusters-at-yale/guides/cesm.md @@ -163,7 +163,7 @@ In your case directory, there will be a file that looks like `slurm-.log #### CESM Run Logs -If the last few lines of the slurm log direct you to look at `cpl.log.` file, change directory to your case “run” directory (usually in your project directory): +If the last few lines of the Slurm log direct you to look at the `cpl.log.` file, change directory to your case "run" directory (usually in your project directory): ``` bash cd ~/project/CESM/$CASE/run diff --git a/docs/clusters-at-yale/guides/clustershell.md b/docs/clusters-at-yale/guides/clustershell.md index 31cfe4c56..365464bf7 100644 --- a/docs/clusters-at-yale/guides/clustershell.md +++ b/docs/clusters-at-yale/guides/clustershell.md @@ -55,7 +55,7 @@ nodeset -f @job:1234567 #### State group -List expanded node names that are idle according to slurm +List expanded node names that are idle according to Slurm ``` bash # similar to sinfo -t IDLE -o "%N" diff --git a/docs/clusters-at-yale/guides/cryoem.md b/docs/clusters-at-yale/guides/cryoem.md index 80b4914d6..42acefd16 100644 --- a/docs/clusters-at-yale/guides/cryoem.md +++ b/docs/clusters-at-yale/guides/cryoem.md @@ -30,7 +30,7 @@ We have GPU-enabled versions of RELION available on McCleary as [software module #### Example Job Parameters -RELION reserves one worker (slurm task) for orchestrating an MPI-based job, which +RELION reserves one worker (Slurm task) for orchestrating an MPI-based job, which they call the "master". This can lead to inefficient jobs where there are tasks that could be using a GPU but are stuck being the master process. 
You can request a better layout for your job with a [heterogeneous job](https://slurm.schedmd.com/heterogeneous_jobs.html), allocating CPUs on a cpu-only compute node for the task that will not use GPUs. Here is an example 3D refinement job submission script (replace `choose_a_version` with the version you want to use): ``` bash diff --git a/docs/clusters-at-yale/guides/cryosparc.md b/docs/clusters-at-yale/guides/cryosparc.md index 92b2d3b13..933f960ff 100644 --- a/docs/clusters-at-yale/guides/cryosparc.md +++ b/docs/clusters-at-yale/guides/cryosparc.md @@ -180,7 +180,7 @@ wait #### b. Adjust the script contents as desired for memory, CPU, time, and partition. -#### c. Submit the script to SLURM. +#### c. Submit the script to Slurm. ``` bash sbatch YourScriptName diff --git a/docs/clusters-at-yale/guides/jupyter_ssh.md b/docs/clusters-at-yale/guides/jupyter_ssh.md index f5a9ebedc..ca2fc47b9 100644 --- a/docs/clusters-at-yale/guides/jupyter_ssh.md +++ b/docs/clusters-at-yale/guides/jupyter_ssh.md @@ -11,7 +11,7 @@ The main steps are: ### Start the Server Here is a template for submitting a jupyter-notebook server as a batch job. -You may need to edit some of the slurm options, including the time limit or the partition. +You may need to edit some of the Slurm options, including the time limit or the partition. You will also need to either load a module that contains `jupyter-notebook`. !!! tip @@ -69,7 +69,7 @@ jupyter lab --no-browser --port=${port} --ip=${node} Once you have submitted your job and it starts, your notebook server will be ready for you to connect. You can run `squeue -u${USER}` to check. You will see an "R" in the ST or status column for your notebook job if it is running. If you see a "PD" in the status column, you will have to wait for your job to start running to connect. 
-The log file with information about how to connect will be in the directory you submitted the script from, and be named jupyter-notebook-[jobid].log where jobid is the slurm id for your job. +The log file with information about how to connect will be in the directory you submitted the script from, and be named jupyter-notebook-[jobid].log where jobid is the Slurm ID for your job. #### MacOS and Linux diff --git a/docs/clusters-at-yale/guides/mathematica.md b/docs/clusters-at-yale/guides/mathematica.md index c1fabb0ef..cf652c194 100644 --- a/docs/clusters-at-yale/guides/mathematica.md +++ b/docs/clusters-at-yale/guides/mathematica.md @@ -47,7 +47,7 @@ Mathematica & ## Parallel Jobs -Mathematica installed on Yale HPC clusters includes our proprietary scripts to run parallel jobs in SLURM environments. These scripts are designed in a way to allow users to access up to 450 parallel kernels. +Mathematica installed on Yale HPC clusters includes our proprietary scripts to run parallel jobs in Slurm environments. These scripts are designed in a way to allow users to access up to 450 parallel kernels. When a user asks for a specific number of kernels, the wait time to get them might differ dramatically depending on requested computing resources as well as on how busy the HPC cluster is at that moment. To reduce waiting time, our scripts try to launch as many kernels as possible at the moment the user asks for them. Most of the time you will not get launched with the same number of kernels as you requested. We recommend checking the final number of parallel kernels you’ve gotten after the launching command has completed no matter if you run a Front End Mathematica session or execute Wolfram script. One of the ways to check this would be the Mathematica command `Length[Kernels[]]`. @@ -61,7 +61,7 @@ where `n` is the number of kernels you want to use. 
This command launches as man ``` LaunchSlurmKernels[n,"SlurmOptions"] ``` -where `SlurmOptions` specifies the [job request options for SLURM](https://docs.ycrc.yale.edu/clusters-at-yale/job-scheduling/#common-job-request-options). Here are some examples: `LaunchSlurmKernels[40,"-t 12:00:00"]` launches 40 kernels for 12 hours in the default partition; `LaunchSlurmKernels[30,"-p week -t 2-12:00:00"]` launches 30 kernels for 2 days and 12 hours in the week partition; and `LaunchSlurmKernels[100,"--mem 30G"]` launches 100 kernels with 30GB of RAM per kernel and default runtime in the default partition. If the SLURM options violate the restrictions on the partition, it will result in an error. The wall time for your parallel kernels should not exceed the remaining wall time of your main Mathematica session. Since the parallel kernels are child processes of your main session, they will be terminated when your session ends. +where `SlurmOptions` specifies the [job request options for Slurm](https://docs.ycrc.yale.edu/clusters-at-yale/job-scheduling/#common-job-request-options). Here are some examples: `LaunchSlurmKernels[40,"-t 12:00:00"]` launches 40 kernels for 12 hours in the default partition; `LaunchSlurmKernels[30,"-p week -t 2-12:00:00"]` launches 30 kernels for 2 days and 12 hours in the week partition; and `LaunchSlurmKernels[100,"--mem 30G"]` launches 100 kernels with 30GB of RAM per kernel and default runtime in the default partition. If the Slurm options violate the restrictions on the partition, it will result in an error. The wall time for your parallel kernels should not exceed the remaining wall time of your main Mathematica session. Since the parallel kernels are child processes of your main session, they will be terminated when your session ends. You can also manually close the parallel kernels during your session with the following command: ``` @@ -71,7 +71,7 @@ which will shut down all currently launched parallel kernels. 
If you need to ter ![file_browser](/img/kernel_object.png){: .medium} -When the process running your parallel kernel is terminated by SLURM due to exceeding its wall time, there may not be any indication of this in your main session. However, you will receive error messages when you try to run any Parallel commands. If this is the case, make sure to close the terminated kernels with `CloseKernels[]` command. +When the process running your parallel kernel is terminated by Slurm due to exceeding its wall time, there may not be any indication of this in your main session. However, you will receive error messages when you try to run any Parallel commands. If this is the case, make sure to close the terminated kernels with `CloseKernels[]` command. You can add more parallel kernels to the already launched kernels by using the same command `LaunchSlurmKernels[n]`. The termination time of the newly added parallel kernels will be different from that of the existing kernels. diff --git a/docs/clusters-at-yale/guides/namd.md b/docs/clusters-at-yale/guides/namd.md index d409c77eb..01ef7a837 100644 --- a/docs/clusters-at-yale/guides/namd.md +++ b/docs/clusters-at-yale/guides/namd.md @@ -49,7 +49,7 @@ NAMD uses charm++ parallel objects for multinode parallelization and the program ### GPUs -To use the GPU-accelerated version, request GPU resources for your SLURM job using salloc or via a submission script, and load a CUDA-enabled version of NAMD: +To use the GPU-accelerated version, request GPU resources for your Slurm job using salloc or via a submission script, and load a CUDA-enabled version of NAMD: ``` bash module load NAMD/2.13-multicore-CUDA diff --git a/docs/clusters-at-yale/guides/nextflow.md b/docs/clusters-at-yale/guides/nextflow.md index 4017a9af5..9a7ec59d2 100644 --- a/docs/clusters-at-yale/guides/nextflow.md +++ b/docs/clusters-at-yale/guides/nextflow.md @@ -1,9 +1,9 @@ # Nextflow [Nextflow](https://www.nextflow.io/) is a very popular workflow tool, especially in 
bioinformatics. It automates workflow processing, is very portable, and has excellent reporting. -Nextflow is able to make effective use of slurm when running on our clusters, using slurm submissions for running processes and achieving a high level of parallelism. However, there are a few gotchas and things to know about. +Nextflow is able to make effective use of Slurm when running on our clusters, using Slurm submissions for running processes and achieving a high level of parallelism. However, there are a few gotchas and things to know about. -First, to specify slurm as the executor, add the following executor default to the process specification in nextflow.config. +First, to specify Slurm as the executor, add the following executor default to the process specification in nextflow.config. ``` bash Process { @@ -13,7 +13,7 @@ Process { } ``` -You can add other slurm-related options, for example: +You can add other Slurm-related options, for example: ``` bash process { @@ -25,12 +25,12 @@ process { } ``` -This sets the initial default for the slurm partition, memory, cpus, and time. Note that nextflow uses different names for many of these values than slurm. +This sets the initial default for the Slurm partition, memory, cpus, and time. Note that nextflow uses different names for many of these values than Slurm. These same options can be added to specific processes or labels to customize processes more specifically. -Arbitrary slurm options can be added using clusterOptions, e.g. clusterOptions = '--qos priority' +Arbitrary Slurm options can be added using clusterOptions, e.g. clusterOptions = '--qos priority' More information can be found on nextflow's [slurm](https://www.nextflow.io/docs/latest/executor.html#slurm) page. 
-To respect our SLURM queue limits, we recommend adding an executor block to your nextflow.config: +To respect our Slurm queue limits, we recommend adding an executor block to your nextflow.config: ``` bash executor { @@ -43,7 +43,7 @@ executor { This limits Nextflow to 50 queued jobs at a time and a maximum of 190 submissions per 60 minutes, keeping you safely under the cluster's submission threshold. -Setting executor to slurm will cause all processes to be submitted as slurm jobs, unless otherwise specified (see below). +Setting executor to `slurm` will cause all processes to be submitted as Slurm jobs, unless otherwise specified (see below). ## Nextflow installation @@ -55,18 +55,18 @@ It is common for nextflow pipelines to use a containerization to manage code, su ## Scheduling quirks -When running nextflow with slurm executor, you may notice some scheduling oddities. This is due to the fact that multiple barriers can pend jobs. +When running nextflow with the Slurm executor, you may notice some scheduling oddities. This is due to the fact that multiple barriers can pend jobs. -Internally, nextflow limits the number of submitted slurm jobs to the value of ‘queueSize’, by default 100. This can be modified in the configuration or using the -qs command line option. This is why nextflow can report a large number of pending jobs, while squeue only shows 100. Slurm (squeue) will not show the jobs pended by queueSize, since nextflow has not actually submitted them yet. +Internally, nextflow limits the number of submitted Slurm jobs to the value of 'queueSize', by default 100. This can be modified in the configuration or using the -qs command line option. This is why nextflow can report a large number of pending jobs, while squeue only shows 100. Slurm (squeue) will not show the jobs pended by queueSize, since nextflow has not actually submitted them yet. -Once jobs are actually submitted, the usual slurm queuing will occur. 
This can include: per user or group limits, lack of available resources, etc. These pended jobs will show in squeue as PD. +Once jobs are actually submitted, the usual Slurm queuing will occur. This can include: per user or group limits, lack of available resources, etc. These pended jobs will show in squeue as PD. ## Submission threshold -In order to prevent abusive job submission, most of our partitions have a limit of 200 individual submissions per user per hour. Normally not a problem, this limit can cause problems with nextflow, since by default when using slurm all processes are submissions, and many workflows submit hundreds of very short jobs. When the threshold is exceeded, subsequent submissions fail, and typically the nextflow workflow fails. In addition, running very short processes as slurm submissions is very inefficient. +In order to prevent abusive job submission, most of our partitions have a limit of 200 individual submissions per user per hour. Normally not a problem, this limit can cause problems with nextflow, since by default when using Slurm all processes are submissions, and many workflows submit hundreds of very short jobs. When the threshold is exceeded, subsequent submissions fail, and typically the nextflow workflow fails. In addition, running very short processes as Slurm submissions is very inefficient. -We recommend that you configure your workflow to specify small, short jobs as using the local executor, leaving the larger and longer running jobs to the slurm executor. +We recommend that you configure your workflow to specify small, short jobs as using the local executor, leaving the larger and longer running jobs to the Slurm executor. The cleanest way to do this is to give those processes a specific ‘label’, e.g. process_local, and then set the executor for that label to be ‘local’. 
For example: @@ -102,7 +102,7 @@ Process { ## Running a hybrid workflow -When running a workflow that uses both slurm and local executors, you should submit the run as a batch job of a single task and multiple cpus. Give the batch job a reasonable number of cpus, depending on how many local processes you want to run simultaneously (e.g. 32). The local processes will all run within the main batch job, and the slurm processes will be submitted as separate slurm jobs. If done correctly this should keep you below the 200 submissions/hour threshold. +When running a workflow that uses both Slurm and local executors, you should submit the run as a batch job of a single task and multiple cpus. Give the batch job a reasonable number of cpus, depending on how many local processes you want to run simultaneously (e.g. 32). The local processes will all run within the main batch job, and the Slurm processes will be submitted as separate Slurm jobs. If done correctly this should keep you below the 200 submissions/hour threshold. You may also try reducing queueSize to a value less than 100. If you cannot find a way to reduce your submission rate to an acceptable level, we will consider turning off the submission threshold. Contact us for more information. diff --git a/docs/clusters-at-yale/guides/parallel.md b/docs/clusters-at-yale/guides/parallel.md index 3ec83baf8..7d7a5d1cf 100644 --- a/docs/clusters-at-yale/guides/parallel.md +++ b/docs/clusters-at-yale/guides/parallel.md @@ -26,7 +26,7 @@ e f ``` -To achieve the same result, `parallel` starts some number of workers and then runs tasks on them. The number of workers and tasks need not be the same. You specify the number of workers with `-j`. The tasks can be generated with a list of arguments specified after the separator `:::`. For parallel to perform well, you should allocate at least the same number of CPUs as workers with the slurm option `--cpus-per-task` or more simply `-c`. 
+To achieve the same result, `parallel` starts some number of workers and then runs tasks on them. The number of workers and tasks need not be the same. You specify the number of workers with `-j`. The tasks can be generated with a list of arguments specified after the separator `:::`. For parallel to perform well, you should allocate at least the same number of CPUs as workers with the Slurm option `--cpus-per-task` or more simply `-c`. ``` bash salloc -c 4 diff --git a/docs/clusters-at-yale/job-scheduling/dependency.md b/docs/clusters-at-yale/job-scheduling/dependency.md index 91a98838a..872d4ce75 100644 --- a/docs/clusters-at-yale/job-scheduling/dependency.md +++ b/docs/clusters-at-yale/job-scheduling/dependency.md @@ -1,6 +1,6 @@ # Jobs with Dependencies -SLURM offers a tool which can help string jobs together via dependencies. +Slurm offers a tool which can help string jobs together via dependencies. When submitting a job, you can specify that it should wait to run until a specified job has finished. This provides a mechanism to create simple pipelines for managing complicated workflows. @@ -72,8 +72,8 @@ This last job will wait to run until all previous jobs with name `JobName` finis ## Further Reading -SLURM provides a number of options for logic controlling dependencies. +Slurm provides a number of options for logic controlling dependencies. Most common are the two discussed above, but `--dependency=afternotok:` can be useful to control behavior if a job fails. -Full discussion of the options can be found on the SLURM manual page for `sbatch` (https://slurm.schedmd.com/sbatch.html). +Full discussion of the options can be found on the Slurm manual page for `sbatch` (https://slurm.schedmd.com/sbatch.html). A very detailed overview, with examples in both bash and python, can also be found at the NIH computing reference: https://hpc.nih.gov/docs/job_dependencies.html. 
diff --git a/docs/clusters-at-yale/job-scheduling/dsq.md b/docs/clusters-at-yale/job-scheduling/dsq.md index 19b2899ec..a2902819d 100644 --- a/docs/clusters-at-yale/job-scheduling/dsq.md +++ b/docs/clusters-at-yale/job-scheduling/dsq.md @@ -73,7 +73,7 @@ You can monitor the status of your jobs in Slurm by using `squeue -u ` an First, you'll need to generate a job file. Each line of this job file needs to specify exactly what you want run for each job, including any modules that need to be loaded or modifications to your environment variables. Empty lines or lines that begin with `#` will be ignored when submitting your job array. -**Note:** slurm jobs start in the directory from which your job was submitted. +**Note:** Slurm jobs start in the directory from which your job was submitted. Create a file with the jobs you want to run, one per line. A simple loop that prints your jobs should usually suffice. @@ -121,7 +121,7 @@ Optional Arguments: Name of your job array. Defaults to dsq-jobfile --max-jobs number Maximum number of simultaneously running jobs from the job array. -o fmt_string, --output fmt_string - Slurm output file pattern. There will be one file per line in your job file. To suppress slurm out files, set this to /dev/null. Defaults to dsq-jobfile-%A_%a-%N.out + Slurm output file pattern. There will be one file per line in your job file. To suppress Slurm out files, set this to /dev/null. Defaults to dsq-jobfile-%A_%a-%N.out --status-dir dir Directory to save the job_jobid_status.tsv file to. Defaults to working directory. --suppress-stats-file Don't save job stats to job_jobid_status.tsv --submit Submit the job array on the fly instead of creating a submission script. 
diff --git a/docs/clusters-at-yale/job-scheduling/index.md b/docs/clusters-at-yale/job-scheduling/index.md index 8feb3fa06..07abf4586 100644 --- a/docs/clusters-at-yale/job-scheduling/index.md +++ b/docs/clusters-at-yale/job-scheduling/index.md @@ -121,14 +121,14 @@ See our page of [Submission Script Examples](/clusters-at-yale/job-scheduling/sl ``` !!! warning "No Space After #SBATCH" - When writing SLURM directives, make sure there is no space between the `#` and `SBATCH`. + When writing Slurm directives, make sure there is no space between the `#` and `SBATCH`. ```bash #SBATCH --option=value # Correct # SBATCH --option=value # Incorrect - will be ignored ``` - Directives with a space after the `#` will be treated as comments and ignored by SLURM. + Directives with a space after the `#` will be treated as comments and ignored by Slurm. diff --git a/docs/clusters-at-yale/job-scheduling/scrontab.md b/docs/clusters-at-yale/job-scheduling/scrontab.md index 68b0b56a2..f43b5fbd9 100644 --- a/docs/clusters-at-yale/job-scheduling/scrontab.md +++ b/docs/clusters-at-yale/job-scheduling/scrontab.md @@ -22,7 +22,7 @@ EDITOR=nano scrontab -e Lines that start with `#SCRON` are treated like the beginning of a new batch job, and work like `#SBATCH` directives for batch jobs. Slurm will ignore `#SBATCH` directives in scripts you run as `scrontab` jobs. You can use most [common `sbatch` options](/clusters-at-yale/job-scheduling/#common-job-request-options) just as you would [using sbatch on the command line](https://slurm.schedmd.com/sbatch.html). The first line after your `SCRON` directives specifies the schedule for your job and the command to run. !!! info "Note" - All of your `scrontab` jobs will start with your home directory as the working directory. You can change this with the `--chdir` slurm option. + All of your `scrontab` jobs will start with your home directory as the working directory. You can change this with the `--chdir` Slurm option. 
### Cron syntax @@ -48,7 +48,7 @@ To your script to ensure that your environment is set up correctly. python my_script.py > $(date +"%Y-%m-%d")_myjob_scrontab.out ``` -If you want to see slurm accounting of a job handled by scrontab, for example job `12345` run: +If you want to see Slurm accounting of a job handled by scrontab, for example job `12345` run: ``` bash sacct --duplicates --jobs 12345 diff --git a/docs/data/ycga-data.md b/docs/data/ycga-data.md index 447eee0ad..476d9608d 100644 --- a/docs/data/ycga-data.md +++ b/docs/data/ycga-data.md @@ -5,7 +5,7 @@ Data associated with YCGA projects and sequencers are located on the YCGA storag ## YCGA Access Policy The McCleary high-performance computing system has specific resources that are dedicated to -YCGA users. This includes a slurm partition (‘ycga’) and a large parallel storage system +YCGA users. This includes a Slurm partition ('ycga') and a large parallel storage system (/gpfs/ycga). The following policy guidelines govern the use of these resources on McCleary for data storage and analysis.