Using the HPC queuing systems

Slurm is the primary queuing system available on the Wildwood HPC infrastructure. If you have used the CQLS compute infrastructure for some time, you may be comfortable using the SGE commands to start and stop jobs, check on progress, and identify available compute resources. This guide will help you become more accustomed to the Slurm commands and hopefully reduce downtime when switching your workflows over to Slurm.

Note

SGE is available for legacy pipelines on specific compute nodes, but we recommend moving over to Slurm whenever possible. hpcman commands will continue to support both SGE and Slurm whenever possible.

Command overview

| Purpose | SGE | Slurm | hpcman | Notes |
| --- | --- | --- | --- | --- |
| Non-interactive job submission | qsub | sbatch | hqsub | SGE_Batch and SGE_Avail previously worked for this purpose |
| Interactive jobs | qrsh | salloc | N/A | srun --pty $SHELL is also acceptable |
| Terminating jobs | qdel | scancel | N/A | |
| Monitoring job status | qstat | squeue | hqstat | |
| Checking job details | qstat -j $JOBID | scontrol show job $JOBID | N/A | The scontrol show job option is more informative than the sstat command, which is available in Slurm |
| Getting available compute resources | qstat -f | sinfo -Nl | hqavail | |

Submitting jobs

We recommend most users transition from SGE_Batch and SGE_Array workflows to using hqsub for job submission. hqsub is part of the hpcman software developed at the CQLS to help manage HPC environments, software, and jobs. Under the hood, hqsub can submit scripts to both SGE and Slurm queueing systems, using qsub and sbatch, respectively.

Advanced users and those who previously wrote their own qsub scripts should find migration to sbatch relatively painless. See the Rosetta Stone of Workload Managers for more information.

Tip

When translating the number of cores/CPUs requested, the flag you want is -c (CPUs per task) rather than -n (which sets the number of tasks).
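
As a point of reference, here is a minimal sbatch script sketch; the script name, job name, resource values, and log filename are placeholders, and the core partition is borrowed from the examples later in this guide.

```bash
#!/bin/bash
#SBATCH -J example_job         # job name (placeholder)
#SBATCH -p core                # partition; adjust to one you have access to
#SBATCH -c 4                   # CPUs per task (the -c discussed in the tip above)
#SBATCH --mem=8G               # memory for the job
#SBATCH -o example_job.%j.log  # output log; %j expands to the job ID

# Your commands go here
echo "Running on $(hostname) with $SLURM_CPUS_PER_TASK CPUs"
```

Submit it with sbatch example_job.sh (again, a placeholder filename).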

Interactive jobs

To check out an interactive session on a node using Slurm, instead of using qrsh, users can use the salloc command (or srun --pty $SHELL). You can also specify a queue (called a partition in Slurm, the term used for the remainder of this document) and/or a specific node.

Here's an example:

```bash
salloc -c 8 -p core -w chrom1
```

Here I have checked out 8 CPUs (the -c flag) on the core partition (the -p flag) on the node chrom1 (the -w flag).
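
The srun --pty $SHELL form accepts the same resource flags, so an equivalent request would look like this:

```bash
# Equivalent interactive shell using srun; same resources as the salloc example above
srun -c 8 -p core -w chrom1 --pty $SHELL
```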

Tip

Unlike SGE, Slurm does not limit job submission to a submit host. Because of this, you can check out a node interactively, then submit jobs to/from that node (using sbatch or hqsub). You can monitor job outputs and resource usage more directly using this job submission paradigm.

See the Slurm documentation on interactive jobs for more details.
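
As a minimal sketch of the submit-from-a-node workflow described in the tip above (the script name is a placeholder):

```bash
# Check out an interactive session (core partition, as in the example above)
salloc -c 2 -p core

# From within that session, batch jobs can still be submitted with sbatch (or hqsub)
sbatch my_analysis.sh   # placeholder script name
```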

Terminating jobs

Compared to SGE's qdel command, scancel offers a few more options for filtering and selecting which jobs to terminate. The general protocol is the same, with scancel $JOBID canceling a single job by the specified job ID.

In general, I suggest always providing the -v flag to scancel as scancel provides no user feedback by default.

| Option | Purpose |
| --- | --- |
| --me | Restrict canceling to your own jobs |
| -t STATE | Cancel jobs in a particular STATE, i.e. pending, running, or suspended |
| -p PARTITION | Restrict to the specified partition |
| -w NODE | Restrict to the specified node |
| -n NAME | Cancel jobs with the specified name |

So, as an example, if I wanted to verbosely terminate all of my pending jobs in the all.q partition, I would run:

```bash
scancel -v --me -p all.q -t pending
```

Job status

The squeue command prints information about the jobs currently running or pending on the infrastructure, and acts as a replacement for the qstat command. In general, the hqstat command should suffice for most of your job status monitoring needs.
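
squeue also accepts a few common filters; for example (the core partition here is just an example):

```bash
squeue --me                  # only your jobs
squeue --me -p core          # only your jobs on the core partition
squeue -u $USER -t running   # only running jobs for a given user
```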

Tip

Use the hqstat --watch command to watch job status over time. Use ctrl+c to cancel the watch.

Job details

While we previously could get additional information about job details using the qstat -j $JOBID command, the closest command to replicate this functionality in Slurm is the scontrol show job $JOBID command.

Tip

To monitor job status and details programmatically, you can use either the squeue --json -j $JOBID or scontrol show job --json $JOBID commands. The outputs are nearly identical.
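
For example, if jq is installed (an assumption on my part), you can pull a single field out of the JSON output; the exact field names can vary between Slurm releases:

```bash
# Print the state of a single job; the .jobs[0].job_state path is an assumption
# and may differ between Slurm versions
squeue --json -j $JOBID | jq '.jobs[0].job_state'
```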

Node status and availability

In order to monitor node availability, we previously provided the SGE_Avail command, which ran qstat -j and qhost and aggregated the results in a table. In Slurm, these details can be gathered using the sinfo command, and are wrapped using the hqavail command of hpcman. To find out more information about a specific node or partition, you can use scontrol show node $NODE or scontrol show partition $PARTITION, respectively. Add the --json flag to either of those commands for programmatic access using JSON.
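
For example, using the node and partition from the interactive example earlier in this guide:

```bash
sinfo -Nl -p core             # per-node view of the core partition
scontrol show node chrom1     # details for a single node
scontrol show partition core  # details for a single partition
```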

Accounts and partitions

In Slurm, membership in Slurm accounts, which are unrelated to your Linux groups, is what controls access to the Slurm partitions. To see which accounts you have access to, you can run this command:

```
➜ sacctmgr show user -s davised format=User,DefaultAccount%15,Account%15
      User        Def Acct         Account
---------- --------------- ---------------
   davised            core           grace
   davised            core           dmplx
   davised            core         jackson
   davised            core        cqls_gpu
   davised            core            core
   davised            core            cqls
   davised            core           ceoas
```

Note

The -s flag is required to show different associations between users and accounts. If you are missing access to a partition that you think you should have access to, you can see which accounts are allowed for a partition using scontrol show partition $PARTITION.
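
For example, to see the allowed accounts for a partition (the AllowAccounts field in the scontrol output):

```bash
# Replace core with the partition you are interested in
scontrol show partition core | grep -E 'AllowAccounts|AllowGroups'
```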

Guidance for fair use of the queueing system

As we migrate our workflows to Slurm, we want to ensure everyone has access to the compute resources they need to complete their research projects. Please be mindful of how many resources your jobs are using at any given time on the infrastructure, especially on shared partitions.

Our primary goal for fair use should be that jobs that can complete within 24-48h are provided the resources to do so. Lack of resources should not inhibit jobs from finishing in a timely manner.

Here are some general guidelines that you can follow when submitting your jobs:

  1. Use array jobs to group your submission of related jobs when processing multiple samples at once (see the array-job sketch after this list). If you are submitting tens to thousands of jobs at a time in a loop, please convert your scripts to submit an array job.
  2. When using array jobs, control the concurrency (-b flag of hqsub). The concurrency setting controls the maximum number of tasks in the array that will run at once. Multiply the CPUs per task by the concurrency to estimate the peak CPU usage of your array job, and leave space on the partition for other folks to use.
  3. When using multiple CPUs (-p flag of hqsub), make sure to set the CPU usage in your command as well. Most programs will not automatically use the number of CPUs provided by Slurm.
  4. Use the local drives (/scratch) when possible. Not all programs support multiple CPUs, and using more CPUs does not always increase compute speed. Often, using the local drives reduces runtime thanks to faster program I/O; if the CPUs are waiting for data, providing more CPUs will never speed up your processing.
  5. If you have a processing job that will require the majority of a partition's resources, submit the jobs during lower use times, i.e. after hours or on weekends. This will ensure jobs are moving through the queuing system more quickly during the work day.
  6. Use the departmental partitions rather than lab partitions for most of your jobs, and only use lab partitions for high priority jobs so that priority queuing can work.
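
Here is a minimal sbatch array-job sketch that follows guidelines 1-4; the job name, sample list, output paths, and my_tool command are placeholders, and the core partition is just an example.

```bash
#!/bin/bash
#SBATCH -J array_example            # placeholder job name
#SBATCH -p core                     # shared/departmental partition (guideline 6)
#SBATCH -c 4                        # CPUs per task
#SBATCH --array=1-100%10            # 100 tasks, at most 10 running at once (concurrency)
#SBATCH -o array_example.%A_%a.log  # %A = array job ID, %a = task ID

# Pick the sample for this task from a list of sample names (placeholder path)
SAMPLE=$(sed -n "${SLURM_ARRAY_TASK_ID}p" samples.txt)

# Work on local scratch space to reduce shared-filesystem I/O (guideline 4)
WORKDIR=/scratch/$USER/${SLURM_JOB_ID}
mkdir -p "$WORKDIR"

# Pass the allocated CPU count to the program instead of hard-coding it (guideline 3)
my_tool --threads "$SLURM_CPUS_PER_TASK" --input "$SAMPLE" --outdir "$WORKDIR"
```

When submitted with sbatch, the %10 throttle in --array plays the same role as hqsub's -b concurrency flag.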

Priority queuing

In order to facilitate our shared goals of fair use, we have enabled the Slurm priority plugin. The weights for the priority are currently:

```
➜ sprio -w
          JOBID PARTITION   PRIORITY       SITE        AGE  FAIRSHARE    JOBSIZE  PARTITION
        Weights                               1       1000       2000       1000      10000
```

As you can see, the partition factor carries more weight than the others. In this way, you can control the priority of your jobs by choosing which partition you submit them to.

The departmental partitions (e.g. bpp) will have a lower priority than the lab-specific partitions within that department. In general, your standard/lower-priority jobs should be targeted to the departmental partition, with your high-priority jobs targeted to your lab-specific partition.

We have set the preempt mode to GANG/SUSPEND, meaning that lower priority jobs may be suspended or fail to start if higher priority jobs are already scheduled to run. Suspended jobs will remain in memory on the node so that they can later be resumed (so memory will not be freed). CPUs and the remaining memory available on a machine will be available for the higher priority job.

Please let us know or contact me (Ed) over email, slack, teams, etc. if these settings appear to be working (or not!).

Note

If your jobs list a reason of (PRIORITY) in squeue, it means priority queuing is affecting when they start or stop.
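
To check the reason column for your own pending jobs, something like the following works (%r is the squeue format code for the reason):

```bash
# List your pending jobs along with the scheduler's reason for each
squeue --me -t pending -o "%.10i %.9P %.20j %.10r"
```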