Slurm¶

The CEOAS HPC servers use the Slurm job scheduler for scheduling, allocating, and controlling jobs. This document aims to provide basic information, guidelines, and a reference for the core portions used most often by CEOAS researchers. For more in-depth information, the official documentation can be found at https://slurm.schedmd.com/documentation.html.

The first two sections form a basic reference for commands: the first covers interacting with jobs, and the second covers viewing information about jobs and the cluster. The rest is additional reference material on batch scripts and general usage notes, plus an appendix on common job options.

Allocating, running, and cancelling jobs¶

Slurm treats allocation and execution as separate but interdependent steps. The three ways to request an allocation and execute a job are srun, salloc, and sbatch. They share most of their options, although a few options are valid only for one or two of them. Due to the sheer volume of useful options available, the condensed reference is deferred to the end of this document rather than given in-line with the commands.

sbatch¶

sbatch is the most fundamental way to run a job on the cluster: it submits the job as a batch script. Batch scripts consist of one or more directives starting with #SBATCH followed by the commands to be executed. The script is queued in a partition until sufficient resources are available for allocation. After the resources are allocated, the script is run on one of the allocated nodes. Most sbatch scripts contain one or more srun calls to execute the actual program, particularly for multinode jobs. Since the script contains all the complexity, sbatch is generally executed simply as:

sbatch <script>

Any option that can be defined in the batch script can also be passed directly to sbatch on the command line and vice-versa. For example, including #SBATCH --partition=ceoas is equivalent to calling sbatch with the --partition=ceoas option.
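As a minimal sketch of that equivalence, the following writes a small batch script and submits it. The file name hello.sbatch and the job contents are placeholders rather than anything site-specific, and the submission is skipped when sbatch is not on the PATH. When the same option appears in both places, the value given on the command line takes precedence over the #SBATCH directive.

```shell
# Write a minimal batch script (hello.sbatch is a hypothetical file name).
cat > hello.sbatch <<'EOF'
#!/bin/bash
#SBATCH --partition=ceoas
#SBATCH --ntasks=1
#SBATCH --time=0:05:00

srun hostname
EOF

# Submit only where sbatch is available (i.e. on the cluster).
if command -v sbatch >/dev/null 2>&1; then
    # Uses the partition named in the script's #SBATCH directive.
    sbatch hello.sbatch
    # To send the same script elsewhere without editing it, override the
    # directive on the command line instead:
    #   sbatch --partition=ceoas-lowprio hello.sbatch
fi
```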

salloc¶

For interactive work, salloc is the preferred option. It works similarly to sbatch by providing an allocation and running on one of the nodes, but instead of running asynchronously in the background it runs interactively and allows user input. This makes it quite useful for writing and debugging the steps in an sbatch script by allowing them to be executed one-by-one manually. It can be run with an optional command to execute, and will otherwise run the user's default shell. An example salloc command is:

salloc -p ceoas -N 4 --ntasks 32

This requests an interactive allocation of four nodes and thirty-two tasks, and then runs the user's default shell on the first of the nodes. After that, regular setup commands can be run and the core program can be executed with srun to run on all four of the nodes.

srun¶

srun executes job steps within an allocation provided by salloc or sbatch. For people familiar with MPI, it's the Slurm equivalent of mpirun. Oftentimes mpirun and srun are interchangeable, but it's recommended to use srun by default for better compatibility with the scheduler. In the special case of running srun outside of an allocation (such as directly from a login node), it will request an allocation and then execute inside of it. Within an salloc or sbatch, srun is generally just executed as:

srun <program>

You should never nest srun within mpirun or vice-versa. They replicate each other's functionality, so nesting them launches many redundant copies of your application, which waste resources and clobber each other's output.

NOTE: Historically using the --pty option with srun was the recommended way to get an interactive shell. Due to changes in how Slurm handles nested sruns, it is highly recommended to not use this functionality because it can cause unexpected behavior when trying to run an sbatch or srun inside of the shell. It is always recommended to use salloc to request an interactive shell.

scancel¶

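scancel, as you might have guessed, is used to cancel jobs. By default it terminates the job (Slurm sends SIGTERM, followed by SIGKILL if the job has not exited after the configured grace period), but it can also send a specific signal instead. Since scancel has no output by default, it's recommended to run it with -v: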

[olsont@shell-hpc ~]$ scancel -v 4010489
scancel: Terminating job 4010489

Some useful options are:

-n, --name=<jobname>
    Cancel jobs with the given name
-p, --partition=<partitions>
    Cancel jobs on the specific partition(s)
-w, --nodelist=<nodes>
    Cancel jobs on the specific node(s)
-u, --user=<username>
    Cancel jobs by the listed user
--me
    Cancel jobs by the current user
-t, --state=<state>
    Cancel jobs in the given state. Valid states are PENDING, RUNNING, and SUSPENDED
-s, --signal=<signame>
    Sends the given signal to the job(s) instead of terminating them
-v, --verbose
    Show additional information when run

When run with no options and no job ID, scancel does nothing. Otherwise, scancel cancels every job that matches all of the given options. For example, running with --user=<user> and --state=PENDING will cancel that user's pending jobs but none of their other jobs. Of course, regular users cannot cancel other users' jobs.
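Because the filters combine, it can be worth previewing a bulk cancel before running it. The following is a rough sketch of a wrapper that prints the scancel command by default and only executes it when asked to; the function name cancel_my_pending and the SCANCEL_RUN variable are made up for this example.

```shell
# Cancel the current user's pending jobs on one partition.
# Prints the command by default; set SCANCEL_RUN=1 to actually execute it.
cancel_my_pending() {
    partition="$1"
    # Build the argument list in the positional parameters (POSIX sh friendly).
    set -- scancel -v --me --state=PENDING --partition="$partition"
    if [ "${SCANCEL_RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

# Dry run: shows what would be executed for the ceoas partition.
cancel_my_pending ceoas
```

Running `SCANCEL_RUN=1 cancel_my_pending ceoas` on the cluster would then perform the cancellation.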

Job/partition/node/account management¶

This is a summary of the commands used for viewing and managing jobs, nodes, and partitions. Some examples and lists of common options are provided. Aside from the listed information, all of these (except for sprio) also accept the --json and --yaml options for structured output, which is convenient for scripting and processing.

sinfo¶

sinfo is used to view the general state of partitions and nodes in the cluster. The most common form is sinfo -p <partition>, which gives a summary of all nodes in the partition, grouped by state:

[olsont@shell-hpc ~]$ sinfo -p ceoas,ceoas-lowprio
PARTITION     AVAIL  TIMELIMIT  NODES  STATE NODELIST
ceoas            up   infinite      1  down* hina02
ceoas            up   infinite      7    mix hina01,kawashiro[03-04,07],yatagarasu[02,04,06]
ceoas            up   infinite     26  alloc amaterasu[01-16],kawashiro[01-02],yatagarasu[01,03,05,07-09,11-12]
ceoas            up   infinite      4   idle kawashiro[05-06,08],yatagarasu10
ceoas-lowprio    up   infinite      1    mix brizo03
ceoas-lowprio    up   infinite      6   idle brizo[01-02,04-07]

Some of the more common and useful options are:

-p, --partition=<partition>
    Filters output to the specific partition(s)
-n, --nodes=<nodes>
    Filters output to the specific node(s)
-t, --states=<states>
    Filters output to nodes in a specific state or group of states, such as IDLE, MIX, and ALLOC
-e, --exact
    Groups output by nodes with matching configurations
-S, --sort=<fields>
    Sort output by one or more fields (check man sinfo for field specifiers)
-O, --Format=<format>
    Allows specifying a format string for information about specific fields
--helpFormat
    Provides a list of fields that can be specified with --Format

The following example command gets all the nodes in the ceoas and ceoas-gpu partitions; groups nodes with the exact same configurations; outputs the nodelist, memory, sockets, cores, CPUs, and GRES (which lists any GPUs); and sorts them in descending order of memory capacity:

[olsont@shell-hpc ~]$ sinfo -p ceoas,ceoas-gpu -e -O "NodeList:20,Memory:10,Sockets:8,Cores:7,CPUs:7,Gres" -S -m
NODELIST            MEMORY    SOCKETS CORES  CPUS   GRES
hina[01-02]         770000    2       18     72     (null)
aerosmith           384000    2       16     64     gpu:a100:2(S:1)
amaterasu[01-16]    250000    2       8      32     (null)
yatagarasu[01-12]   250000    2       12     48     (null)
ayaya[01,03-06]     190000    2       16     32     gpu:nvidia_geforce_g
ayaya02             190000    2       16     32     gpu:nvidia_geforce_g
kawashiro[01-08]    190000    2       18     72     (null)
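For scripting, the header can be dropped with -h and the output reduced to just the fields of interest with -o. As a small sketch, the helper below (nodes_by_state is a name invented for this example) totals node counts per state across partitions; the sinfo call itself is skipped when sinfo is not installed.

```shell
# Sum node counts per state from lines of the form "<count> <state>".
nodes_by_state() {
    awk '{ total[$2] += $1 } END { for (s in total) print s, total[s] }' | sort
}

if command -v sinfo >/dev/null 2>&1; then
    # -h drops the header; %D is the node count and %T the long-form state.
    sinfo -p ceoas,ceoas-lowprio -h -o "%D %T" | nodes_by_state
fi
```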

squeue¶

squeue is the basic command for inspecting the state of the job queue. It takes many of the same options as sinfo does, with the notable difference of using -w instead of -n for filtering on nodes:

[olsont@shell-hpc ~]$ squeue -p ceoas-gpu
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           3999197 ceoas-gpu amip-aes    user1 PD       0:00      1 (Resources)
           4001021 ceoas-gpu     bash    user2  R 2-09:38:45      1 aerosmith
           4006505 ceoas-gpu yolo_exp    user3  R   21:36:21      1 aerosmith
           3999667 ceoas-gpu     bash    user4  R 2-23:44:52      1 aerosmith
           4004127 ceoas-gpu misr-ps-    user5  R 1-22:29:09      1 ayaya01
           3999133 ceoas-gpu     bash    user4  R 3-08:15:36      1 aerosmith

Aside from the options already listed for sinfo, the other useful option is --me, which filters the output to only the executing user's jobs.
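As a quick illustration of using squeue in a script, the sketch below counts the current user's jobs by state. The helper name count_states is invented here, and the squeue call is guarded so the snippet does nothing off the cluster.

```shell
# Summarise job states read from stdin (one state per line) as STATE=count.
count_states() {
    sort | uniq -c | awk '{ printf "%s=%s\n", $2, $1 }'
}

if command -v squeue >/dev/null 2>&1; then
    # -h drops the header; %T prints the job state in long form (e.g. RUNNING).
    squeue --me -h -o "%T" | count_states
fi
```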

scontrol¶

scontrol is the catch-all tool for managing and viewing aspects of the cluster. For regular users, the most common subcommand will be show, which displays information about specific parts of the cluster. scontrol generally outputs large amounts of information, so in the interest of brevity examples of output will not be given here. Instead, users are encouraged to try the following commands themselves to see the output:

scontrol show node <nodes>
    Shows information about the listed node(s)
scontrol show partition <partitions>
    Shows information about the listed partition(s)
scontrol show job <jobids>
    Shows information about the listed job(s)
scontrol show config
    Shows the configuration of the Slurm cluster
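Since scontrol show prints its information as Key=Value pairs, a single value can be pulled out with standard text tools. The helper below is one possible sketch (scontrol_field is an invented name); it handles only values without spaces, so free-text fields such as a node's Reason would be cut short.

```shell
# Print the value of one Key=Value field from scontrol-style output on stdin.
scontrol_field() {
    # Put each space-separated Key=Value token on its own line, then match the key.
    tr ' ' '\n' | awk -F= -v key="$1" '$1 == key { print substr($0, length(key) + 2); exit }'
}

if command -v scontrol >/dev/null 2>&1; then
    # Configured memory (in MB) of one node.
    scontrol show node kawashiro01 | scontrol_field RealMemory
fi
```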

sacct¶

sacct is used for getting detailed information about past and current jobs from Slurm's accounting database. It can be useful for seeing historical information about usage of certain partitions, nodes, and users:

[olsont@shell-hpc ~]$ sacct
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
4009455      interacti+      all.q      ceoas          1  COMPLETED      0:0
4009455.int+ interacti+                 ceoas          1  COMPLETED      0:0
4009455.ext+     extern                 ceoas          1  COMPLETED      0:0
4009463      interacti+      ceoas      ceoas          2    RUNNING      0:0
4009463.int+ interacti+                 ceoas          2    RUNNING      0:0
4009463.ext+     extern                 ceoas          2    RUNNING      0:0

By default it shows the current user's jobs from today. Like the other Slurm commands, sacct has a vast array of options. The ones I've found the most useful are:

-j, --jobs=<jobids>
    Specify a specific job to get accounting information for
-r, --partition=<partitions>
    Restricts output to jobs run on the listed partition(s)
-u, --user=<username>
    Select jobs from the given user
-S, --starttime=<time>
    Select jobs that were active at or after the given time
-E, --endtime=<time>
    Select jobs that were active before the given time
-X, --allocations
    Only show allocations instead of individual steps
    Prevents showing step-specific information, such as memory usage
-o, --format=<format>
    Specifies the fields to be displayed for each job
--helpformat
    Get the list of fields available to use with --format

An example command that I regularly use, adjusting as necessary, gets the jobs for a specific user on a partition between two dates:

[olsont@shell-hpc ~]$ sacct -u olsont -r ceoas -S 2026-05-01 -E 2026-05-29 -o JobID,JobName,Start,End,State,Elapsed -X
JobID           JobName               Start                 End      State    Elapsed
------------ ---------- ------------------- ------------------- ---------- ----------
3966828            bash 2026-05-21T22:07:44 2026-05-21T22:08:14  COMPLETED   00:00:30
3966894            bash 2026-05-21T22:08:20 2026-05-21T22:08:21  COMPLETED   00:00:01
3966897            bash 2026-05-21T22:18:47 2026-05-21T22:23:21  COMPLETED   00:04:34
3967163            bash                None 2026-05-21T22:41:35 CANCELLED+   00:00:00

Another useful command for getting certain post-job information:

[olsont@hpc ~]$ sacct -j 2963013 -o JobName,Start,Elapsed,AllocCPUs,AllocNodes,AveCPU,MaxRSS,MaxDiskRead,MaxDiskWrite,Nodelist%20
   JobName               Start    Elapsed  AllocCPUS AllocNodes     AveCPU     MaxRSS  MaxDiskRead MaxDiskWrite             NodeList
---------- ------------------- ---------- ---------- ---------- ---------- ---------- ------------ ------------ --------------------
NeverWorl+ 2026-03-20T11:54:34   01:21:35        576          8                                                     kawashiro[01-08]
     batch 2026-03-20T11:54:34   01:21:35         72          1   00:00:01     13084K       17.42M        0.38M          kawashiro01
    extern 2026-03-20T11:54:34   01:21:35        576          8   00:00:00       256K        0.01M        0.00M     kawashiro[01-08]
hydra_bst+ 2026-03-20T11:54:34   01:21:35        576          8 4-01:17:14  11477724K     3409.48M     3880.06M     kawashiro[01-08]

Of particular note is the MaxRSS field, which is the maximum memory used by a single task across all tasks in that step. Knowing this can be useful for fine-tuning memory allocation for jobs, allowing more efficient packing of jobs onto unused resources.
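One way to turn a MaxRSS value into a memory request is to convert it to megabytes and add some headroom. The sketch below does that arithmetic; the function name suggest_mem_mb and the 20% default margin are arbitrary choices for illustration, not a site recommendation. Because MaxRSS is reported per task, the result is best read as a per-task starting point rather than a guaranteed limit.

```shell
# Convert a MaxRSS value such as 11477724K into megabytes, add a percentage
# of headroom (default 20), and round up to a whole number.
suggest_mem_mb() {
    awk -v rss="$1" -v pct="${2:-20}" 'BEGIN {
        unit = substr(rss, length(rss), 1)   # trailing unit letter
        value = rss + 0                      # numeric prefix of the string
        if (unit == "K")      mb = value / 1024
        else if (unit == "M") mb = value
        else if (unit == "G") mb = value * 1024
        else { print "unrecognised unit in " rss > "/dev/stderr"; exit 1 }
        mb = mb * (1 + pct / 100)
        printf "%d\n", (mb == int(mb)) ? mb : int(mb) + 1
    }'
}

# MaxRSS of the largest step in the sacct example above.
suggest_mem_mb 11477724K
```

For the value above this suggests 13451 MB, which could then be passed to --mem-per-cpu for a job that runs one task per CPU.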

sacctmgr¶

sacctmgr is used for viewing and modifying information about Slurm accounts and associations. As with scontrol, most users will only have read-only access and will likely only use the list and show subcommands. Its output can also be extensive and is omitted here for brevity, but here are some generally useful commands:

sacctmgr show associations user=<username> format=Account
    Shows the accounts that a user is associated with
sacctmgr show associations account=<account> format=User
    Shows the users that are associated with an account
sacctmgr show accounts
    Show all the accounts in the cluster
sacctmgr show qos format=Name%-20
    Show all the QoS in the cluster
sacctmgr show qos <name> format=Name,MaxTRESPU%20,MaxJobsPU,UsageFactor,Flags
    Show a specific QoS in the cluster with some particular information

sprio¶

sprio is used to view the priority of jobs that are in the queue and waiting to be scheduled. If multiple jobs are sitting in the queue with the state PD, this lets you see what their priorities are and how they're calculated.

[olsont@shell-hpc ~]$ sprio
          JOBID PARTITION   PRIORITY       SITE        AGE  FAIRSHARE    JOBSIZE  PARTITION
        3855697 phyc_lab       11225          0        471        751          4      10000
        3962493 coe-arm        13003          0       1000       2000          3      10000
        3962518 coe-arm        13003          0       1000       2000          3      10000
        3962554 coe-arm        13003          0       1000       2000          3      10000
        3998778 coe-arm        12474          0        471       2000          3      10000
        3999197 ceoas-gpu      10880          0        479        396          5      10000

There are only a handful of notable options for sprio:

-p, --partition=<partitions>
    Filters output to jobs queued on the listed partition(s)
-u, --user=<usernames>
    Select jobs from the given user(s)
-j, --jobs=<jobs>
    Show only the specified job(s)
-l, --long
    Show additional information, such as usernames and additional priority factors
-w, --weights
    List the current weights as configured for the cluster

As of this writing (May 31st, 2026), here are the current weights on the cluster:

[olsont@shell-hpc ~]$ sprio -w
          JOBID PARTITION   PRIORITY       SITE        AGE  FAIRSHARE    JOBSIZE  PARTITION
        Weights                               1       1000       2000       1000      10000

Detailed information on how priorities are calculated with the given weights can be found at https://slurm.schedmd.com/priority_multifactor.html.
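As a sanity check on the sprio output shown earlier, the per-factor columns can be added up and compared with the reported priority. The sketch below assumes the default column layout from the example (job ID, partition, priority, then the factors); check_priority_sum is a name made up for this illustration.

```shell
# Sum the per-factor columns (field 4 onward) of a headerless sprio line and
# print: job ID, reported priority, computed sum.
check_priority_sum() {
    awk '{ sum = 0; for (i = 4; i <= NF; i++) sum += $i; print $1, $3, sum }'
}

# The ceoas-gpu job from the sprio example above.
printf '3999197 ceoas-gpu 10880 0 479 396 5 10000\n' | check_priority_sum
```

This prints "3999197 10880 10880". For some rows the sum can differ from the reported priority by one (job 3855697 above adds up to 11226 against a reported 11225), presumably because the displayed factors are rounded.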

sshare¶

sshare is used to view details on the FairShare portion of priority calculations. For a quick look at a user's FairShare value under each of their accounts:

[olsont@shell-hpc ~]$ sshare -U olsont
Account                    User  RawShares  NormShares    RawUsage  EffectvUsage  FairShare
-------------------- ---------- ---------- ----------- ----------- ------------- ----------
ceoas                    olsont          1    0.000751    19012658      0.001978   0.160980
brizo                    olsont          1    0.000063           0      0.000000   1.000000
coe-arm                  olsont          1    0.004444           0      0.000000   1.000000
coe                      olsont          1    0.002849           0      0.000000   1.000000
cqls                     olsont          1    0.000055           0      0.000000   1.000000

Jobs submitted by a user under a given account use that user's FairShare value for that account to calculate its contribution to the job's priority. Information on how FairShare is calculated can be found at https://slurm.schedmd.com/fair_tree.html.

Batch scripts¶

Batch scripts used with sbatch are the core of getting the most out of Slurm. They consist of one or more #SBATCH options at the top, which are passed to the scheduler and runtime, followed by the rest of the script to be executed on the allocated nodes. Batch scripts generally look like the following:

#!/bin/bash
#SBATCH --job-name=mpijob
#SBATCH --output=mpijob-%j.out
#SBATCH --nodes=4
#SBATCH --ntasks=32
#SBATCH --time=1:00:00
#SBATCH --partition=ceoas
#SBATCH --exclusive
#SBATCH --mem=0

# Set up environment
. /local/ceoas/opt/spack/current/share/spack/setup-env.sh
spack env activate intel-intelmpi-cpu-x86
export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi2.so

# Execute command
srun mpijob

# Copy results to other storage
cp /storage/ceoas-scratch/workdir/output /ceoas/mydir/mpijob/

When a batch script is submitted, Slurm reads each line starting with #SBATCH as a directive to the job scheduler/running environment. It continues until it finds the first non-blank line that does not start with a #. Any #SBATCH directives placed after a non-blank, non-comment line are ignored (so you can't set #SBATCH directives using shell commands within the batch script itself).
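A misplaced directive is easy to miss because it fails silently. As a rough aid, the helper below lists the #SBATCH lines that appear before the first command in a script. It only approximates the rule described above and is not Slurm's own parser; effective_directives and directive-demo.sbatch are names invented for this sketch.

```shell
# Print the #SBATCH lines that come before the first command in a script.
effective_directives() {
    awk '
        /^#SBATCH/  { print; next }   # directive: keep it
        /^[ \t]*$/  { next }          # blank line: keep scanning
        /^#/        { next }          # shebang or comment: keep scanning
                    { exit }          # first command: stop here
    ' "$1"
}

# Hypothetical script with a misplaced directive at the end.
cat > directive-demo.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=directive-demo

# Comments and blank lines do not end the directive section.
#SBATCH --ntasks=1

echo "starting"
#SBATCH --mem=8G
EOF

effective_directives directive-demo.sbatch
```

Only the --job-name and --ntasks lines are printed; the --mem line comes after the echo command and so would be ignored.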

Most batch scripts can generally be broken down into sections for directives, pre-execution/setup, execution, and post-execution/cleanup. Keeping this general structure is recommended to improve readability and maintainability. It's also best to put as much in the batch script itself as possible for ease of replication and distribution instead of relying on external scripts or options passed to the sbatch command.

There is a vast variety of sbatch options available, covering everything from administrative tasks to core binding masks. Most of those options are also applicable to salloc and srun. The official list of them can be found at https://slurm.schedmd.com/sbatch.html, and the appendix at the end of this document covers the ones most commonly used by CEOAS researchers.

Job Arrays¶

WIP. Job Arrays are super convenient for breaking down large jobs into short jobs and submitting many short jobs to the cluster (which improves cluster operation as long as each job is more than a few minutes long).
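Until this section is filled in, here is a minimal sketch of what an array script can look like. The --array option, the %A and %a output patterns, and the SLURM_ARRAY_TASK_ID variable are standard Slurm; the job name, index range, and input file naming scheme are placeholders, and the actual program invocation is left out.

```shell
#!/bin/bash
#SBATCH --job-name=array-demo
#SBATCH --array=1-10
#SBATCH --ntasks=1
#SBATCH --output=array-demo-%A_%a.out

# Slurm sets SLURM_ARRAY_TASK_ID for each array element; default to 1 so the
# script can also be tried outside of Slurm.
task_id="${SLURM_ARRAY_TASK_ID:-1}"

# Hypothetical naming scheme: one input file per array index.
input_file="$(printf 'input_%03d.nc' "$task_id")"
echo "Task ${task_id} would process ${input_file}"
```

Submitting this once with sbatch queues ten independent jobs, each of which sees a different task ID and therefore picks a different input file.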

Miscellaneous notes and issues¶

Checkpointing¶

(TODO: Need to look into DMTCP for agnostic checkpointing; https://dmtcp.sourceforge.io/, https://docs.nersc.gov/development/checkpoint-restart/dmtcp/, https://github.com/mpickpt/mana seems interesting.)

FAQ¶

Q: Why do I get "Requested node configuration is not available" when I submit my job?¶

A: The most common reason by far is that you requested many or all of the CPUs on a system but didn't specify how much memory to allocate with --mem or --mem-per-cpu. This is especially common with the --exclusive option, because Slurm will allocate all the CPUs on the node even if you don't use them. By default Slurm will request ~4GB of RAM per CPU (see DefMemPerCPU in scontrol show config). For example, if you submit a job with 16 tasks to a node with 384 threads and use --exclusive, Slurm will try to allocate 1.5TB (4GB/CPU * 384 CPUs) of RAM on that server. The simple fix is to either add --mem=0 if you're using --exclusive, or adjust --mem-per-cpu to better match what your job actually uses.
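The arithmetic behind that example can be reproduced with a one-line helper. This is only a sketch: default_mem_mb is an invented name, and the 4000 MB figure is the DefMemPerCPU value quoted in the appendix, which may change.

```shell
# Default memory request in MB: allocated CPUs times DefMemPerCPU.
default_mem_mb() {
    echo $(( $1 * $2 ))
}

# 384 CPUs allocated by --exclusive at 4000 MB per CPU.
default_mem_mb 384 4000
```

This prints 1536000 (MB), i.e. roughly 1.5TB, which is more than the node has and is why the request is rejected.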

Appendix A: Options for sbatch/salloc/srun¶

WIP, please wait warmly

Shared options

-A, --account=<account>
    Select the account to use for submitting to the partition
-a, --array=<indices>
    Specify indexes for running a job array. See the section on job arrays for details
-D, --chdir=<directory>
    Change to <directory> before executing the batch script
-c, --cpus-per-task=<ncpus>
    Specify how many CPUs to allocate per task. Used for multithreaded jobs (OpenMP, pthreads, etc.) and almost never used for MPI jobs.
-m, --distribution=<distribution>
    Used to specify how tasks are allocated across nodes and CPUs are allocated across sockets.
    See https://slurm.schedmd.com/sbatch.html#OPT_distribution for details
-e, --error=<filename_pattern>
    Set the file to attach standard error (stderr) to for the job
-x, --exclude=<nodelist>
    Exclude the listed nodes from execution
--exclusive
    Run the job exclusively on a node, preventing others from using it simultaneously
    This allocates all the cores but does not allocate all the memory
    Highly recommended to use --mem=0 with --exclusive to prevent memory issues
-B, --extra-node-info=<sockets>[:cores[:threads]]
    A more compact way of specifying the --(sockets/cores/threads)-per-(node/socket/core) options
-G, --gpus=[type:]<number>
    Specify how many GPUs are required for the job, with an optional type specification
--gpus-per-task=[type:]<number>
    Specify how many GPUs are required for each task in the job
-i, --input=<filename_pattern>
    Connects the given file to the script's input. Useful for some things that may require some level of interactive input
--mail-type=<types>
    Specify on which events to send an e-mail. Common options are BEGIN, END, FAIL, and ALL. Defaults to NONE
--mail-user=<address>
    Address to send e-mail notifications to
--mem=<size>[units]
    Specifies how much memory is required per node, defaulting to megabytes. --mem=0 reserves all the memory on an allocated node
--mem-per-cpu=<size>[units]
    Specifies how much memory is required per allocated CPU. Defaults to DefMemPerCPU in the cluster config (4000MB as of May 31st, 2026)
-w, --nodelist=<nodelist>
    Requests the specific nodes. The job will contain these nodes as a *minimum* but can contain additional hosts if needed to satisfy requirements.
-F, --nodefile=<file>
    The same as --nodelist but in a file
-N, --nodes=<minnodes>[-maxnodes]|<size_string>
    Requests the given number of nodes. If maxnodes is specified, allocates a number of nodes between minnodes and maxnodes, otherwise allocates minnodes.
-n, --ntasks=<num>
    For srun, specifies the number of tasks to be run. For sbatch, informs the scheduler
    how many tasks will be run for scheduling purposes.
--ntasks-per-node=<ntasks>
    Specifies the number of tasks to run on each node. If used with --ntasks, specifies a maximum number of tasks per node.
--ntasks-per-socket=<ntasks>
    Specifies the maximum number of tasks to be run on each socket.
-o, --output=<filename_pattern>
    Set the file to attach standard output (stdout) to for the job
-p, --partition=<partitions>
    Specify the partition to run the job on. Multiple partitions can be specified, separated by commas, which is useful for certain types of batch jobs.
--propagate[=rlimit[,rlimit...]]
    Specify which resource limits to propagate to nodes. Some jobs will require --propagate=NONE.
-q, --qos=<qos>
    Specify a QOS to be applied to a job. No QOS are currently available, but some may be added in the future.
--requeue
    Specify that a job can be requeued in case of node failure or preemption. Currently the cluster default.
    Generally only used for software that supports resume or short jobs that can be restarted with no issue.
    Useful for running jobs that can be restarted on lowprio queues, since it can allow them to use those resources and automatically resume when available.
--no-requeue
    Specify that a job cannot be requeued
-t, --time=<time>
    Specify how long the job is expected to run for. Jobs running longer than this will be cancelled.
    Currently we don't impose time limits so it's not strictly necessary, but still useful for preventing jobs from accidentally running indefinitely.

srun-specific

--async

--cpu-bind
    Specify how tasks are bound to CPUs.
    See https://slurm.schedmd.com/srun.html#OPT_cpu-bind for detailed usage
--exclusive
    TODO: Note how it's different from sbatch exclusive
--mpi=<mpi_type>
    Specify the type of MPI for process communication.
    Generally not needed, but sometimes required for using Intel MPI.