2 changes: 1 addition & 1 deletion docs/clusters-at-yale/guides/cesm.md
Original file line number Diff line number Diff line change
@@ -163,7 +163,7 @@ In your case directory, there will be a file that looks like `slurm-<job_id>.log

#### CESM Run Logs

If the last few lines of the slurm log direct you to look at `cpl.log.<some_number>` file, change directory to your case run directory (usually in your project directory):
If the last few lines of the Slurm log direct you to look at `cpl.log.<some_number>` file, change directory to your case "run" directory (usually in your project directory):

``` bash
cd ~/project/CESM/$CASE/run
2 changes: 1 addition & 1 deletion docs/clusters-at-yale/guides/clustershell.md
Original file line number Diff line number Diff line change
@@ -55,7 +55,7 @@ nodeset -f @job:1234567

#### State group

List expanded node names that are idle according to slurm
List expanded node names that are idle according to Slurm

``` bash
# similar to sinfo -t IDLE -o "%N"
2 changes: 1 addition & 1 deletion docs/clusters-at-yale/guides/cryoem.md
Original file line number Diff line number Diff line change
@@ -30,7 +30,7 @@ We have GPU-enabled versions of RELION available on McCleary as [software module

#### Example Job Parameters

RELION reserves one worker (slurm task) for orchestrating an MPI-based job, which
RELION reserves one worker (Slurm task) for orchestrating an MPI-based job, which
they call the "master". This can lead to inefficient jobs where there are tasks that could be using a GPU but are stuck being the master process. You can request a better layout for your job with a [heterogeneous job](https://slurm.schedmd.com/heterogeneous_jobs.html), allocating CPUs on a cpu-only compute node for the task that will not use GPUs. Here is an example 3D refinement job submission script (replace `choose_a_version` with the version you want to use):

``` bash
2 changes: 1 addition & 1 deletion docs/clusters-at-yale/guides/cryosparc.md
Original file line number Diff line number Diff line change
@@ -180,7 +180,7 @@ wait

#### b. Adjust the script contents as desired for memory, CPU, time, and partition.

#### c. Submit the script to SLURM.
#### c. Submit the script to Slurm.

``` bash
sbatch YourScriptName
4 changes: 2 additions & 2 deletions docs/clusters-at-yale/guides/jupyter_ssh.md
Original file line number Diff line number Diff line change
@@ -11,7 +11,7 @@ The main steps are:
### Start the Server

Here is a template for submitting a jupyter-notebook server as a batch job.
You may need to edit some of the slurm options, including the time limit or the partition.
You may need to edit some of the Slurm options, including the time limit or the partition.
You will also need to either load a module that contains `jupyter-notebook`.

!!! tip
@@ -69,7 +69,7 @@ jupyter lab --no-browser --port=${port} --ip=${node}
Once you have submitted your job and it starts, your notebook server will be ready for you to connect.
You can run `squeue -u${USER}` to check. You will see an "R" in the ST or status column for your notebook job if it is running.
If you see a "PD" in the status column, you will have to wait for your job to start running to connect.
The log file with information about how to connect will be in the directory you submitted the script from, and be named jupyter-notebook-[jobid].log where jobid is the slurm id for your job.
The log file with information about how to connect will be in the directory you submitted the script from, and be named jupyter-notebook-[jobid].log where jobid is the Slurm ID for your job.

#### MacOS and Linux

6 changes: 3 additions & 3 deletions docs/clusters-at-yale/guides/mathematica.md
Original file line number Diff line number Diff line change
@@ -47,7 +47,7 @@ Mathematica &

## Parallel Jobs

Mathematica installed on Yale HPC clusters includes our proprietary scripts to run parallel jobs in SLURM environments. These scripts are designed in a way to allow users to access up to 450 parallel kernels.
Mathematica installed on Yale HPC clusters includes our proprietary scripts to run parallel jobs in Slurm environments. These scripts are designed in a way to allow users to access up to 450 parallel kernels.

When a user asks for a specific number of kernels, the wait time to get them might differ dramatically depending on requested computing resources as well as on how busy the HPC cluster is at that moment. To reduce waiting time, our scripts try to launch as many kernels as possible at the moment the user asks for them. Most of the time you will not get launched with the same number of kernels as you requested. We recommend checking the final number of parallel kernels you’ve gotten after the launching command has completed no matter if you run a Front End Mathematica session or execute Wolfram script. One of the ways to check this would be the Mathematica command `Length[Kernels[]]`.

@@ -61,7 +61,7 @@ where `n` is the number of kernels you want to use. This command launches as man
```
LaunchSlurmKernels[n,"SlurmOptions"]
```
where `SlurmOptions` specifies the [job request options for SLURM](https://docs.ycrc.yale.edu/clusters-at-yale/job-scheduling/#common-job-request-options). Here are some examples: `LaunchSlurmKernels[40,"-t 12:00:00"]` launches 40 kernels for 12 hours in the default partition; `LaunchSlurmKernels[30,"-p week -t 2-12:00:00"]` launches 30 kernels for 2 days and 12 hours in the week partition; and `LaunchSlurmKernels[100,"--mem 30G"]` launches 100 kernels with 30GB of RAM per kernel and default runtime in the default partition. If the SLURM options violate the restrictions on the partition, it will result in an error. The wall time for your parallel kernels should not exceed the remaining wall time of your main Mathematica session. Since the parallel kernels are child processes of your main session, they will be terminated when your session ends.
where `SlurmOptions` specifies the [job request options for Slurm](https://docs.ycrc.yale.edu/clusters-at-yale/job-scheduling/#common-job-request-options). Here are some examples: `LaunchSlurmKernels[40,"-t 12:00:00"]` launches 40 kernels for 12 hours in the default partition; `LaunchSlurmKernels[30,"-p week -t 2-12:00:00"]` launches 30 kernels for 2 days and 12 hours in the week partition; and `LaunchSlurmKernels[100,"--mem 30G"]` launches 100 kernels with 30GB of RAM per kernel and default runtime in the default partition. If the Slurm options violate the restrictions on the partition, it will result in an error. The wall time for your parallel kernels should not exceed the remaining wall time of your main Mathematica session. Since the parallel kernels are child processes of your main session, they will be terminated when your session ends.

You can also manually close the parallel kernels during your session with the following command:
```
@@ -71,7 +71,7 @@ which will shut down all currently launched parallel kernels. If you need to ter

![file_browser](/img/kernel_object.png){: .medium}

When the process running your parallel kernel is terminated by SLURM due to exceeding its wall time, there may not be any indication of this in your main session. However, you will receive error messages when you try to run any Parallel commands. If this is the case, make sure to close the terminated kernels with `CloseKernels[]` command.
When the process running your parallel kernel is terminated by Slurm due to exceeding its wall time, there may not be any indication of this in your main session. However, you will receive error messages when you try to run any Parallel commands. If this is the case, make sure to close the terminated kernels with `CloseKernels[]` command.

You can add more parallel kernels to the already launched kernels by using the same command `LaunchSlurmKernels[n]`. The termination time of the newly added parallel kernels will be different from that of the existing kernels.

2 changes: 1 addition & 1 deletion docs/clusters-at-yale/guides/namd.md
Original file line number Diff line number Diff line change
@@ -49,7 +49,7 @@ NAMD uses charm++ parallel objects for multinode parallelization and the program

### GPUs

To use the GPU-accelerated version, request GPU resources for your SLURM job using salloc or via a submission script, and load a CUDA-enabled version of NAMD:
To use the GPU-accelerated version, request GPU resources for your Slurm job using salloc or via a submission script, and load a CUDA-enabled version of NAMD:

``` bash
module load NAMD/2.13-multicore-CUDA
26 changes: 13 additions & 13 deletions docs/clusters-at-yale/guides/nextflow.md
Original file line number Diff line number Diff line change
@@ -1,9 +1,9 @@
# Nextflow

[Nextflow](https://www.nextflow.io/) is a very popular workflow tool, especially in bioinformatics. It automates workflow processing, is very portable, and has excellent reporting.
Nextflow is able to make effective use of slurm when running on our clusters, using slurm submissions for running processes and achieving a high level of parallelism. However, there are a few gotchas and things to know about.
Nextflow is able to make effective use of Slurm when running on our clusters, using Slurm submissions for running processes and achieving a high level of parallelism. However, there are a few gotchas and things to know about.

First, to specify slurm as the executor, add the following executor default to the process specification in nextflow.config.
First, to specify Slurm as the executor, add the following executor default to the process specification in nextflow.config.

``` bash
Process {
@@ -13,7 +13,7 @@ Process {
}
```

You can add other slurm-related options, for example:
You can add other Slurm-related options, for example:

``` bash
process {
@@ -25,12 +25,12 @@ process {
}
```

This sets the initial default for the slurm partition, memory, cpus, and time. Note that nextflow uses different names for many of these values than slurm.
This sets the initial default for the Slurm partition, memory, cpus, and time. Note that nextflow uses different names for many of these values than Slurm.
These same options can be added to specific processes or labels to customize processes more specifically.
Arbitrary slurm options can be added using clusterOptions, e.g. clusterOptions = '--qos priority'
Arbitrary Slurm options can be added using clusterOptions, e.g. clusterOptions = '--qos priority'
More information can be found on nextflow's [slurm](https://www.nextflow.io/docs/latest/executor.html#slurm) page.

To respect our SLURM queue limits, we recommend adding an executor block to your nextflow.config:
To respect our Slurm queue limits, we recommend adding an executor block to your nextflow.config:

``` bash
executor {
@@ -43,7 +43,7 @@ executor {

This limits Nextflow to 50 queued jobs at a time and a maximum of 190 submissions per 60 minutes, keeping you safely under the cluster's submission threshold.

Setting executor to slurm will cause all processes to be submitted as slurm jobs, unless otherwise specified (see below).
Setting executor to Slurm will cause all processes to be submitted as Slurm jobs, unless otherwise specified (see below).

## Nextflow installation

@@ -55,18 +55,18 @@ It is common for nextflow pipelines to use a containerization to manage code, su

## Scheduling quirks

When running nextflow with slurm executor, you may notice some scheduling oddities. This is due to the fact that multiple barriers can pend jobs.
When running nextflow with Slurm executor, you may notice some scheduling oddities. This is due to the fact that multiple barriers can pend jobs.

Internally, nextflow limits the number of submitted slurm jobs to the value of queueSize, by default 100. This can be modified in the configuration or using the -qs command line option. This is why nextflow can report a large number of pending jobs, while squeue only shows 100. Slurm (squeue) will not show the jobs pended by queueSize, since nextflow has not actually submitted them yet.
Internally, nextflow limits the number of submitted Slurm jobs to the value of 'queueSize', by default 100. This can be modified in the configuration or using the -qs command line option. This is why nextflow can report a large number of pending jobs, while squeue only shows 100. Slurm (squeue) will not show the jobs pended by queueSize, since nextflow has not actually submitted them yet.

Once jobs are actually submitted, the usual slurm queuing will occur. This can include: per user or group limits, lack of available resources, etc. These pended jobs will show in squeue as PD.
Once jobs are actually submitted, the usual Slurm queuing will occur. This can include: per user or group limits, lack of available resources, etc. These pended jobs will show in squeue as PD.


## Submission threshold

In order to prevent abusive job submission, most of our partitions have a limit of 200 individual submissions per user per hour. Normally not a problem, this limit can cause problems with nextflow, since by default when using slurm all processes are submissions, and many workflows submit hundreds of very short jobs. When the threshold is exceeded, subsequent submissions fail, and typically the nextflow workflow fails. In addition, running very short processes as slurm submissions is very inefficient.
In order to prevent abusive job submission, most of our partitions have a limit of 200 individual submissions per user per hour. Normally not a problem, this limit can cause problems with nextflow, since by default when using Slurm all processes are submissions, and many workflows submit hundreds of very short jobs. When the threshold is exceeded, subsequent submissions fail, and typically the nextflow workflow fails. In addition, running very short processes as Slurm submissions is very inefficient.

We recommend that you configure your workflow to specify small, short jobs as using the local executor, leaving the larger and longer running jobs to the slurm executor.
We recommend that you configure your workflow to specify small, short jobs as using the local executor, leaving the larger and longer running jobs to the Slurm executor.
The cleanest way to do this is to give those processes a specific ‘label’, e.g. process_local, and then set the executor for that label to be ‘local’.

For example:
@@ -102,7 +102,7 @@ Process {

## Running a hybrid workflow

When running a workflow that uses both slurm and local executors, you should submit the run as a batch job of a single task and multiple cpus. Give the batch job a reasonable number of cpus, depending on how many local processes you want to run simultaneously (e.g. 32). The local processes will all run within the main batch job, and the slurm processes will be submitted as separate slurm jobs. If done correctly this should keep you below the 200 submissions/hour threshold.
When running a workflow that uses both Slurm and local executors, you should submit the run as a batch job of a single task and multiple cpus. Give the batch job a reasonable number of cpus, depending on how many local processes you want to run simultaneously (e.g. 32). The local processes will all run within the main batch job, and the Slurm processes will be submitted as separate Slurm jobs. If done correctly this should keep you below the 200 submissions/hour threshold.

You may also try reducing queueSize to a value less than 100. If you cannot find a way to reduce your submission rate to an acceptable level, we will consider turning off the submission threshold. Contact us for more information.

2 changes: 1 addition & 1 deletion docs/clusters-at-yale/guides/parallel.md
Original file line number Diff line number Diff line change
@@ -26,7 +26,7 @@ e
f
```

To achieve the same result, `parallel` starts some number of workers and then runs tasks on them. The number of workers and tasks need not be the same. You specify the number of workers with `-j`. The tasks can be generated with a list of arguments specified after the separator `:::`. For parallel to perform well, you should allocate at least the same number of CPUs as workers with the slurm option `--cpus-per-task` or more simply `-c`.
To achieve the same result, `parallel` starts some number of workers and then runs tasks on them. The number of workers and tasks need not be the same. You specify the number of workers with `-j`. The tasks can be generated with a list of arguments specified after the separator `:::`. For parallel to perform well, you should allocate at least the same number of CPUs as workers with the Slurm option `--cpus-per-task` or more simply `-c`.

``` bash
salloc -c 4
6 changes: 3 additions & 3 deletions docs/clusters-at-yale/job-scheduling/dependency.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
# Jobs with Dependencies

SLURM offers a tool which can help string jobs together via dependencies.
Slurm offers a tool which can help string jobs together via dependencies.
When submitting a job, you can specify that it should wait to run until a specified job has finished.
This provides a mechanism to create simple pipelines for managing complicated workflows.

@@ -72,8 +72,8 @@ This last job will wait to run until all previous jobs with name `JobName` finis

## Further Reading

SLURM provides a number of options for logic controlling dependencies.
Slurm provides a number of options for logic controlling dependencies.
Most common are the two discussed above, but `--dependency=afternotok:<job_id>` can be useful to control behavior if a job fails.
Full discussion of the options can be found on the SLURM manual page for `sbatch` (https://slurm.schedmd.com/sbatch.html).
Full discussion of the options can be found on the Slurm manual page for `sbatch` (https://slurm.schedmd.com/sbatch.html).
A very detailed overview, with examples in both bash and python, can also be found at the NIH computing reference: https://hpc.nih.gov/docs/job_dependencies.html.
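
As a rough illustration of the `afternotok` option mentioned above, the sketch below builds a `--dependency` flag from a dependency type and one or more job IDs. The helper name `dependency_flag` and the script name `cleanup.sh` are hypothetical and are not part of Slurm or of these docs; the job ID is a placeholder.

``` bash
# Hypothetical helper (not part of Slurm): build a --dependency flag
# from a dependency type and one or more job IDs.
dependency_flag() {
    local dep_type="$1"
    shift
    local IFS=':'   # "$*" joins the remaining arguments with ':'
    printf -- '--dependency=%s:%s\n' "$dep_type" "$*"
}

# Possible usage on a cluster (cleanup.sh is a placeholder script name):
#   sbatch "$(dependency_flag afternotok 1234567)" cleanup.sh
```

Keeping the flag construction in one small function makes it easier to switch between dependency types such as `afterok` and `afternotok` without retyping the job IDs.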

4 changes: 2 additions & 2 deletions docs/clusters-at-yale/job-scheduling/dsq.md
Original file line number Diff line number Diff line change
@@ -73,7 +73,7 @@ You can monitor the status of your jobs in Slurm by using `squeue -u <netid>` an
First, you'll need to generate a job file.
Each line of this job file needs to specify exactly what you want run for each job, including any modules that need to be loaded or modifications to your environment variables.
Empty lines or lines that begin with `#` will be ignored when submitting your job array.
**Note:** slurm jobs start in the directory from which your job was submitted.
**Note:** Slurm jobs start in the directory from which your job was submitted.

Create a file with the jobs you want to run, one per line.
A simple loop that prints your jobs should usually suffice.
@@ -121,7 +121,7 @@ Optional Arguments:
Name of your job array. Defaults to dsq-jobfile
--max-jobs number Maximum number of simultaneously running jobs from the job array.
-o fmt_string, --output fmt_string
Slurm output file pattern. There will be one file per line in your job file. To suppress slurm out files, set this to /dev/null. Defaults to dsq-jobfile-%A_%a-%N.out
Slurm output file pattern. There will be one file per line in your job file. To suppress Slurm out files, set this to /dev/null. Defaults to dsq-jobfile-%A_%a-%N.out
--status-dir dir Directory to save the job_jobid_status.tsv file to. Defaults to working directory.
--suppress-stats-file Don't save job stats to job_jobid_status.tsv
--submit Submit the job array on the fly instead of creating a submission script.
4 changes: 2 additions & 2 deletions docs/clusters-at-yale/job-scheduling/index.md
Original file line number Diff line number Diff line change
@@ -121,14 +121,14 @@ See our page of [Submission Script Examples](/clusters-at-yale/job-scheduling/sl
```

!!! warning "No Space After #SBATCH"
When writing SLURM directives, make sure there is no space between the `#` and `SBATCH`.
When writing Slurm directives, make sure there is no space between the `#` and `SBATCH`.

```bash
#SBATCH --option=value # Correct
# SBATCH --option=value # Incorrect - will be ignored
```

Directives with a space after the `#` will be treated as comments and ignored by SLURM.
Directives with a space after the `#` will be treated as comments and ignored by Slurm.
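
One way to catch this mistake before submitting is to search the script for directive lines that have whitespace between `#` and `SBATCH`. The following is a minimal sketch; the function name `find_spaced_directives` and the script name `my_job.sh` are hypothetical.

``` bash
# Hypothetical helper: print (with line numbers) any line that starts with
# '#', then whitespace, then 'SBATCH'. Such lines are treated as comments.
find_spaced_directives() {
    grep -nE '^#[[:space:]]+SBATCH' "$1"
}

# Possible usage (my_job.sh is a placeholder script name):
#   find_spaced_directives my_job.sh
```

If the function prints nothing, no directive in the script has a stray space after the `#`.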



4 changes: 2 additions & 2 deletions docs/clusters-at-yale/job-scheduling/scrontab.md
Original file line number Diff line number Diff line change
@@ -22,7 +22,7 @@ EDITOR=nano scrontab -e
Lines that start with `#SCRON` are treated like the beginning of a new batch job, and work like `#SBATCH` directives for batch jobs. Slurm will ignore `#SBATCH` directives in scripts you run as `scrontab` jobs. You can use most [common `sbatch` options](/clusters-at-yale/job-scheduling/#common-job-request-options) just as you would [using sbatch on the command line](https://slurm.schedmd.com/sbatch.html). The first line after your `SCRON` directives specifies the schedule for your job and the command to run.

!!! info "Note"
All of your `scrontab` jobs will start with your home directory as the working directory. You can change this with the `--chdir` slurm option.
All of your `scrontab` jobs will start with your home directory as the working directory. You can change this with the `--chdir` Slurm option.

### Cron syntax

@@ -48,7 +48,7 @@ To your script to ensure that your environment is set up correctly.
python my_script.py > $(date +"%Y-%m-%d")_myjob_scrontab.out
```

If you want to see slurm accounting of a job handled by scrontab, for example job `12345` run:
If you want to see Slurm accounting of a job handled by scrontab, for example job `12345` run:

``` bash
sacct --duplicates --jobs 12345