Using the HPC queuing systems¶
Slurm is the primary queuing system available on the Wildwood HPC infrastructure. If you have used the CQLS compute infrastructure for some time, you may be comfortable using the SGE commands to start and stop jobs, check on progress, and identify available compute resources. This guide will help you become more accustomed to the Slurm commands and hopefully reduce downtime when switching your workflows over to Slurm.
Note
SGE is available for legacy pipelines on specific compute nodes, but we recommend moving over to Slurm whenever possible. `hpcman` commands will continue to support both SGE and Slurm whenever possible.
Command overview¶
Purpose | SGE | Slurm | hpcman | Notes |
---|---|---|---|---|
Non-interactive job submission | `qsub` | `sbatch` | `hqsub` | `SGE_Batch` and `SGE_Array` previously worked for this purpose |
Interactive jobs | `qrsh` | `salloc` | N/A | `srun --pty $SHELL` is also acceptable |
Terminating jobs | `qdel` | `scancel` | N/A | |
Monitoring job status | `qstat` | `squeue` | `hqstat` | |
Checking job details | `qstat -j $JOBID` | `scontrol show job $JOBID` | N/A | `scontrol show job` is more informative than Slurm's `sstat` command |
Getting available compute resources | `qstat -f` | `sinfo -Nl` | `hqavail` | |
Submitting jobs¶
We recommend most users transition from `SGE_Batch` and `SGE_Array` workflows to using `hqsub` for job submission. `hqsub` is part of the `hpcman` software developed at the CQLS to help manage HPC environments, software, and jobs. Under the hood, `hqsub` can submit scripts to both the SGE and Slurm queueing systems, using `qsub` and `sbatch`, respectively.

Advanced users and those who previously wrote their own `qsub` scripts should find migration to `sbatch` relatively painless. See the Rosetta Stone of Workload Managers for more information.
Tip
When translating the number of cores/CPUs from SGE to Slurm, the flag to use is `-c` (`--cpus-per-task`) rather than `-n` (`--ntasks`).
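As a rough sketch, a minimal `sbatch` script might look like the following (the job name, partition, resource values, and command are placeholders; adjust them for your workload):

```bash
#!/usr/bin/env bash
#SBATCH -J myjob              # job name (placeholder)
#SBATCH -p core               # partition to submit to
#SBATCH -c 4                  # CPUs per task (note: -c, not -n)
#SBATCH --mem=8G              # memory request
#SBATCH -o myjob_%j.out       # stdout log; %j expands to the job ID
#SBATCH -e myjob_%j.err       # stderr log

# placeholder command; pass the allocated CPU count through to your program
my_program --threads "$SLURM_CPUS_PER_TASK" input.dat
```

Submit the script with `sbatch myjob.sh` (or via `hqsub`, which wraps the same mechanism).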
Interactive jobs¶
In order to check out an interactive session on a node using Slurm, instead of using `qrsh`, users can use the `salloc` command (or `srun --pty $SHELL`). You can also specify a queue (called a partition in Slurm, and for the remainder of this document) and/or a specific node.
Here's an example:
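```bash
# sketch: request an interactive session with 8 CPUs on the core partition, on node chrom1
salloc -c 8 -p core -w chrom1
```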
I have checked out 8 CPUs (`-c` flag), on the core partition (`-p` flag), on the node chrom1 (`-w` flag).
Tip
Unlike SGE, Slurm does not limit job submission to a submit host. Because of this, you can check out a node interactively, then submit jobs to/from that node (using `sbatch` or `hqsub`). You can monitor job outputs and resource usage more directly using this job submission paradigm.
See the Slurm documentation on interactive jobs for more details.
Terminating jobs¶
Users have a few options for filtering and/or selecting jobs to terminate using `scancel` compared to the `qdel` command of SGE. The general protocol is the same, with `scancel $JOBID` canceling a single job by the specified job ID.

In general, I suggest always providing the `-v` flag to `scancel`, as `scancel` provides no user feedback by default.
Option | Purpose |
---|---|
`--me` | Restrict canceling to your own jobs |
`-t STATE` | Cancel jobs in a particular STATE, i.e. pending, running, or suspended |
`-p PARTITION` | Restrict to the specified partition |
`-w NODE` | Restrict to the specified node |
`-n NAME` | Cancel jobs with the specified name |
So, as an example, if I wanted to verbosely terminate all of my pending jobs in the all.q partition, I would run:
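```bash
# verbosely cancel my own pending jobs in the all.q partition
scancel -v --me -t pending -p all.q
```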
Job status¶
The `squeue` command prints out information about all running jobs on the infrastructure. `squeue` acts as a replacement for the `qstat` command. In general, the `hqstat` command should suffice for most of your needs in terms of job status monitoring.
Tip
Use the `hqstat --watch` command to watch job status over time. Use ctrl+c to cancel the watch.
Job details¶
While we previously could get additional information about job details using the `qstat -j $JOBID` command, the closest command to replicate this functionality in Slurm is the `scontrol show job $JOBID` command.
Tip
To monitor job status and details programmatically, you can use either the `squeue --json -j $JOBID` or `scontrol show job --json $JOBID` commands. The outputs are nearly identical.
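For example, assuming `jq` is available, you can pull fields out of that JSON directly (the exact field names vary between Slurm versions):

```bash
# pretty-print the full record for the job; drill into individual fields
# (e.g. '.jobs[0].job_state') as needed
scontrol show job --json $JOBID | jq '.jobs[0]'
```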
Node status and availability¶
In order to monitor node availability, we previously provided the `SGE_Avail` command, which ran `qstat -j` and `qhost` and aggregated the results in a table. In Slurm, these details can be gathered using the `sinfo` command, and are wrapped using the `hqavail` command of `hpcman`. To find out more information about a specific node or partition, you can use `scontrol show node $NODE` or `scontrol show partition $PARTITION`, respectively. Add the `--json` flag to either of those commands for programmatic access using JSON.
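For instance, reusing the node and partition names from the interactive-job example above:

```bash
# details for a single node and a single partition
scontrol show node chrom1
scontrol show partition core
```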
Accounts and partitions¶
In Slurm, membership of users in Slurm accounts, which are unrelated to your Linux groups, is what controls access to the Slurm partitions. To see which accounts you have access to, you can run this command:
➜ sacctmgr show user -s davised format=User,DefaultAccount%15,Account%15
User Def Acct Account
---------- --------------- ---------------
davised core grace
davised core dmplx
davised core jackson
davised core cqls_gpu
davised core core
davised core cqls
davised core ceoas
Note
The `-s` flag is required to show the different associations between users and accounts. If you are missing access to a partition that you think you should have access to, you can see which accounts are allowed for a partition using `scontrol show partition $PARTITION`.
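For example, to see which accounts are allowed on a partition (the partition name here is a placeholder):

```bash
# the AllowAccounts field lists the Slurm accounts permitted on the partition
scontrol show partition core | grep -i allow
```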
Guidance for fair use of the queueing system¶
As we migrate our workflows to Slurm, we want to ensure everyone has access to the compute resources they need to complete their research projects. Please be mindful of how many resources your jobs are using at any given time on the infrastructure, especially on shared partitions.
Our primary goal for fair use should be that jobs that can complete within 24-48h are provided the resources to do so. Lack of resources should not inhibit jobs from finishing in a timely manner.
Here are some general guidelines that you can follow when submitting your jobs:
- Use array jobs to group your submission of related jobs when processing multiple samples at once. If you are submitting tens to thousands of jobs at a time in a loop, please convert your scripts to submit an array job.
- When using array jobs, control the concurrency (`-b` flag of `hqsub`). Concurrency is the setting that controls the maximum number of tasks in that array that will run at once. Multiply CPUs × concurrency to see the potential CPU usage of your array job, and leave space on the partition for other folks to use. (A sketch of the equivalent `sbatch` syntax follows this list.)
- When using multiple CPUs (`-p` flag of `hqsub`), make sure to set the CPU usage in your command as well. Most programs will not automatically use the number of CPUs provided by Slurm.
- Use the local drives (/scratch) when possible. Using more CPUs does not always lead to an increase in compute speed, and not all programs support using multiple CPUs. Often, using the local drives can lead to reduced runtime due to the speed-up in program I/O. If the CPUs are waiting for data, then providing more CPUs will never speed up your processing.
- If you have a processing job that will require the majority of a partition's resources, submit the jobs during lower use times, i.e. after hours or on weekends. This will ensure jobs are moving through the queuing system more quickly during the work day.
- Use the departmental partitions rather than lab partitions for most of your jobs, and only use lab partitions for high priority jobs so that priority queuing can work.
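For reference, the same concurrency cap expressed directly in `sbatch` terms looks like the sketch below (`hqsub`'s own flags differ, and `process_sample.sh` is a placeholder script):

```bash
# 100-task array, at most 10 tasks running at once (the %10 suffix), 4 CPUs per task:
# at most 10 * 4 = 40 CPUs of the partition are in use at any one time
sbatch --array=1-100%10 -c 4 process_sample.sh
```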
Priority queuing¶
In order to facilitate our shared goals of fair use, we have enabled the Slurm priority plugin. The weights for the priority are currently:
➜ sprio -w
JOBID PARTITION PRIORITY SITE AGE FAIRSHARE JOBSIZE PARTITION
Weights 1 1000 2000 1000 10000
As you can see, we have implemented a system where the partition factor carries more weight than the default. In this way, you can choose which partition to submit your jobs to in order to control their priority.
The departmental partitions (e.g. `bpp`) will have a lower priority than the lab-specific partitions within that department. In general, your standard/lower-priority jobs should be targeted to the departmental partition, with your high-priority jobs targeted to your lab-specific partition.
We have set the preempt mode to `GANG/SUSPEND`, meaning that lower-priority jobs may be suspended or fail to start if higher-priority jobs are already scheduled to run. Suspended jobs will remain in memory on the node so that they can later be resumed (so their memory will not be freed). The CPUs and the remaining memory available on the machine will be available for the higher-priority job.
Please let us know, or contact me (Ed) over email, Slack, Teams, etc., if these settings appear to be working (or not!).
Note
If your jobs list a reason of `(PRIORITY)` in `squeue`, it means priority queuing is affecting when the job starts or stops.
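One way to check is to print the scheduling reason for your own jobs, or to break down the priority factors for a specific job:

```bash
# job ID, partition, state, and scheduling reason for your jobs
squeue --me -o "%.12i %.12P %.10T %r"

# priority factor breakdown for a single job
sprio -j $JOBID
```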