Slurm¶
The CEOAS HPC servers use the Slurm job scheduler for scheduling, allocating, and controlling jobs. This document aims to provide basic information, guidelines, and a reference for the core portions used most often by CEOAS researchers. For more in-depth information, the official documentation can be found at https://slurm.schedmd.com/documentation.html.
The first two sections form a basic reference for commands: the first covers interacting with jobs, and the second covers viewing information about jobs and the cluster. The rest is additional reference material on batch scripts and general usage notes, followed by an appendix on common job options.
Allocating, running, and cancelling jobs¶
Slurm treats allocation and execution as separate but interdependent steps. The three ways to request an allocation and execute a job are srun, salloc, and sbatch. They share most of their options, although a few options are valid for only one or two of them. Due to the sheer volume of useful options available, the condensed option reference is deferred to the appendix at the end of this document rather than listed in-line with the commands.
sbatch¶
sbatch is the most fundamental way to run a job on a cluster: it submits the job as a batch script. Batch scripts consist of one or more directives starting with #SBATCH, followed by the commands to be executed. The script is queued in a partition until sufficient resources are available for allocation. After the resources are allocated, the script is run on one of the allocated nodes. Most sbatch scripts contain one or more srun calls to execute the actual program, particularly for multinode jobs. Since the script contains all the complexity, sbatch is generally executed simply as sbatch <script>.
Any option that can be defined in the batch script can also be passed directly to sbatch on the command line and vice-versa. For example, including #SBATCH --partition=ceoas is equivalent to calling sbatch with the --partition=ceoas option.
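As a minimal sketch of that equivalence, the snippet below writes a throwaway batch script and lists the directives sbatch would read from it. The script name hello.sh and the five-minute time limit are illustrative assumptions, not site defaults. The submission itself is only attempted when sbatch is actually on the PATH.

```shell
#!/bin/bash
# Work in a scratch directory so nothing in the current directory is overwritten.
workdir=$(mktemp -d)
script="$workdir/hello.sh"   # hypothetical script name

# The partition and time limit are set inside the script with directives.
cat > "$script" <<'EOF'
#!/bin/bash
#SBATCH --partition=ceoas
#SBATCH --time=0:05:00
srun hostname
EOF

# Show the directives that sbatch would pick up from the script.
grep '^#SBATCH' "$script"

if command -v sbatch >/dev/null 2>&1; then
    sbatch "$script"
    # Without the directives in the file, the same request would be:
    #   sbatch --partition=ceoas --time=0:05:00 hello.sh
else
    echo "sbatch not found; run this on a cluster login node"
fi
```

Keeping the options in the script makes the job easier to resubmit later, while the command-line form is handy for one-off overrides.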
salloc¶
For interactive work, salloc is the preferred option. It works similarly to sbatch by providing an allocation and running on one of the nodes, but instead of running asynchronously in the background it runs interactively and accepts user input. This makes it quite useful for writing and debugging the steps of an sbatch script by executing them manually one by one. It can be run with an optional command to execute, and will otherwise run the user's default shell. An example salloc command is salloc --nodes=4 --ntasks=32.
This requests an interactive allocation of four nodes and thirty-two tasks, and then runs the user's default shell on the first of the nodes. After that, regular setup commands can be run and the core program can be executed with srun to run on all four nodes.
srun¶
srun executes job steps within an allocation provided by salloc or sbatch. For people familiar with MPI, it's the Slurm equivalent of mpirun. Oftentimes mpirun and srun are interchangeable, but it's recommended to use srun by default for better compatibility with the scheduler. In the special case of running srun outside of an allocation (such as directly from a login node), it will request an allocation and then execute inside of it. Within an salloc or sbatch allocation, srun is generally just executed as srun <command>.
You should never nest srun within mpirun or vice-versa. They replicate each other's functionality, so every copy started by the outer launcher starts a full set of copies of its own. The result is far more copies of your application than intended, which consume resources and clobber each other's output.
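To put a number on the nesting problem, here is a small sketch of the arithmetic. It assumes a hypothetical 32-task job and assumes the inner launcher inherits the full task count; the exact behavior depends on the MPI implementation and environment.

```shell
#!/bin/bash
# Hypothetical job size: 32 tasks.
ntasks=32

# Intended: one launcher starts one copy of the program per task.
intended=$ntasks

# Nested by mistake: each of the 32 tasks started by the outer launcher
# runs the inner launcher, which starts 32 copies of its own.
nested=$(( ntasks * ntasks ))

echo "intended copies:    $intended"
echo "copies when nested: $nested"
```

With these numbers the nested case starts 1024 copies competing for resources that were sized for 32.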
NOTE: Historically, using the --pty option with srun was the recommended way to get an interactive shell. Due to changes in how Slurm handles nested sruns, this is now strongly discouraged because it can cause unexpected behavior when trying to run an sbatch or srun inside of that shell. Always use salloc to request an interactive shell.
scancel¶
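scancel, as you might have guessed, is used to cancel jobs. By default it terminates the job (Slurm sends SIGTERM first, then SIGKILL if the job has not exited after the configured KillWait period), but it can also send other signals. Since scancel has no output by default, it's recommended to run it with -v, for example scancel -v <jobid>. Commonly used options: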
-n, --name=<jobname>
Cancel jobs with the given name
-p, --partition=<partitions>
Cancel jobs on the specific partition(s)
-w, --nodelist=<nodes>
Cancel jobs on the specific node(s)
-u, --user=<username>
Cancel jobs by the listed user
--me
Cancel jobs by the current user
-t, --state=<state>
Cancel jobs in the given state. Valid states are PENDING, RUNNING, and SUSPENDED
-s, --signal=<signame>
Sends the given signal to the job(s) instead of terminating them
-v, --verbose
Show additional information when run
When run with no options and no jobid, scancel does nothing. Otherwise, scancel will cancel all jobs matching all the given options. For example, running with --user=<user> and --state=PENDING will cancel all jobs by that user that are pending, but not any of the user's other jobs. Of course, regular users cannot cancel other users' jobs.
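The "matching all the given options" rule can be illustrated without touching real jobs. The sketch below filters a made-up queue snapshot (the job IDs and the users alice and bob are hypothetical) the way a combined --user and --state filter would.

```shell
#!/bin/bash
# Hypothetical queue snapshot: job ID, user, state.
queue='101 alice PENDING
102 alice RUNNING
103 bob PENDING
104 alice PENDING'

# Print the job IDs that match BOTH filters, mirroring how scancel
# combines its options (every filter must match).
matching_jobs() {
    local user=$1 state=$2
    awk -v u="$user" -v s="$state" '$2 == u && $3 == s { print $1 }' <<< "$queue"
}

matching_jobs alice PENDING
# The corresponding real command would be:
#   scancel -v --user=alice --state=PENDING
```

Only jobs 101 and 104 are selected: job 102 belongs to alice but is running, and job 103 is pending but belongs to bob.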
Job/partition/node/account management¶
This is the summary of commands used for viewing and managing jobs, nodes, and partitions. Some examples and lists of common options are provided. Aside from the listed information, all of these (except for sprio) also accept the --json and --yaml options for structured output, convenient for scripting and processing.
sinfo¶
sinfo is used to view the general state of partitions and nodes in the cluster. The most common form is sinfo -p <partition>, which gives a summary of all nodes in the partition, grouped by state:
[olsont@shell-hpc ~]$ sinfo -p ceoas,ceoas-lowprio
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
ceoas up infinite 1 down* hina02
ceoas up infinite 7 mix hina01,kawashiro[03-04,07],yatagarasu[02,04,06]
ceoas up infinite 26 alloc amaterasu[01-16],kawashiro[01-02],yatagarasu[01,03,05,07-09,11-12]
ceoas up infinite 4 idle kawashiro[05-06,08],yatagarasu10
ceoas-lowprio up infinite 1 mix brizo03
ceoas-lowprio up infinite 6 idle brizo[01-02,04-07]
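Because this output is plain columns, it is easy to post-process. The following sketch totals the nodes in a given state per partition. The helper name nodes_in_state is made up for illustration, and it assumes sinfo's default column order as shown above.

```shell
#!/bin/bash
# Sum the NODES column for rows in a given state, per partition.
# Expects sinfo's default columns: PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
nodes_in_state() {
    local state=$1
    awk -v s="$state" '$5 == s { n[$1] += $4 } END { for (p in n) print p, n[p] }' | sort
}

# Sample input copied from the listing above (header removed).
sample='ceoas up infinite 1 down* hina02
ceoas up infinite 7 mix hina01,kawashiro[03-04,07],yatagarasu[02,04,06]
ceoas up infinite 26 alloc amaterasu[01-16],kawashiro[01-02],yatagarasu[01,03,05,07-09,11-12]
ceoas up infinite 4 idle kawashiro[05-06,08],yatagarasu10
ceoas-lowprio up infinite 1 mix brizo03
ceoas-lowprio up infinite 6 idle brizo[01-02,04-07]'

nodes_in_state idle <<< "$sample"
# On the cluster, feed it live data instead (-h suppresses the header):
#   sinfo -h -p ceoas,ceoas-lowprio | nodes_in_state idle
```

For the sample this reports 4 idle nodes in ceoas and 6 in ceoas-lowprio.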
-p, --partition=<partition>
Filters output to the specific partition(s)
-n, --nodes=<nodes>
Filters output to the specific node(s)
-t, --states=<states>
Filters output to nodes in a specific state or group of states, such as IDLE, MIX, and ALLOC
-e, --exact
Groups output by nodes with matching configurations
-S, --sort=<fields>
Sort output by one or more fields (check man sinfo for field specifiers)
-O, --Format=<format>
Allows specifying a format string for information about specific fields
--helpFormat
Provides a list of fields that can be specified with --Format
The following example command gets all the nodes in the ceoas and ceoas-gpu partitions; groups nodes with the exact same configurations; outputs the nodelist, memory, sockets, cores per socket, CPUs, and generic resources such as GPUs; and sorts them in descending order of memory capacity:
[olsont@shell-hpc ~]$ sinfo -p ceoas,ceoas-gpu -e -O "NodeList:20,Memory:10,Sockets:8,Cores:7,CPUs:7,Gres" -S -m
NODELIST MEMORY SOCKETS CORES CPUS GRES
hina[01-02] 770000 2 18 72 (null)
aerosmith 384000 2 16 64 gpu:a100:2(S:1)
amaterasu[01-16] 250000 2 8 32 (null)
yatagarasu[01-12] 250000 2 12 48 (null)
ayaya[01,03-06] 190000 2 16 32 gpu:nvidia_geforce_g
ayaya02 190000 2 16 32 gpu:nvidia_geforce_g
kawashiro[01-08] 190000 2 18 72 (null)
squeue¶
squeue is the basic command for inspecting the state of the job queue. It takes many of the same options as sinfo does, with the notable difference of using -w instead of -n for filtering on nodes:
[olsont@shell-hpc ~]$ squeue -p ceoas-gpu
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
3999197 ceoas-gpu amip-aes user1 PD 0:00 1 (Resources)
4001021 ceoas-gpu bash user2 R 2-09:38:45 1 aerosmith
4006505 ceoas-gpu yolo_exp user3 R 21:36:21 1 aerosmith
3999667 ceoas-gpu bash user4 R 2-23:44:52 1 aerosmith
4004127 ceoas-gpu misr-ps- user5 R 1-22:29:09 1 ayaya01
3999133 ceoas-gpu bash user4 R 3-08:15:36 1 aerosmith
Another useful option is --me, which filters the output to only the executing user's jobs.
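For a quick summary of a busy queue, the default squeue columns can be tallied by state. This is a rough sketch: the helper name jobs_per_state is hypothetical, and it assumes job names contain no spaces, since a space would shift the columns.

```shell
#!/bin/bash
# Count jobs per state code; expects squeue's default columns,
# where the state (ST) is the fifth field.
jobs_per_state() {
    awk '{ c[$5]++ } END { for (s in c) print s, c[s] }' | sort
}

# Sample input copied from the listing above (header removed).
sample='3999197 ceoas-gpu amip-aes user1 PD 0:00 1 (Resources)
4001021 ceoas-gpu bash user2 R 2-09:38:45 1 aerosmith
4006505 ceoas-gpu yolo_exp user3 R 21:36:21 1 aerosmith
3999667 ceoas-gpu bash user4 R 2-23:44:52 1 aerosmith
4004127 ceoas-gpu misr-ps- user5 R 1-22:29:09 1 ayaya01
3999133 ceoas-gpu bash user4 R 3-08:15:36 1 aerosmith'

jobs_per_state <<< "$sample"
# On the cluster (-h suppresses the header):
#   squeue -h -p ceoas-gpu | jobs_per_state
```

For the sample this shows one pending job (PD) and five running jobs (R).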
scontrol¶
scontrol is the catch-all tool for managing and viewing aspects of the cluster. For regular users, the most common subcommand will be show, which displays information about specific parts of the cluster. scontrol generally outputs large amounts of information, so in the interest of brevity examples of output are not given here. Instead, users are encouraged to try the following commands themselves to see the output:
scontrol show node <nodes>
Shows information about the listed node(s)
scontrol show partition <partitions>
Shows information about the listed partition(s)
scontrol show job <jobids>
Shows information about the listed job(s)
scontrol show config
Shows the configuration of the Slurm cluster
sacct¶
sacct is used for getting detailed information about past and current jobs from Slurm's accounting database. It can be useful for seeing historical information about usage of certain partitions, nodes, and users:
[olsont@shell-hpc ~]$ sacct
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
4009455      interacti+      all.q      ceoas          1  COMPLETED      0:0
4009455.int+ interacti+                 ceoas          1  COMPLETED      0:0
4009455.ext+     extern                 ceoas          1  COMPLETED      0:0
4009463      interacti+      ceoas      ceoas          2    RUNNING      0:0
4009463.int+ interacti+                 ceoas          2    RUNNING      0:0
4009463.ext+     extern                 ceoas          2    RUNNING      0:0
-r, --partition=<partitions>
Restricts output to jobs run on the listed partition(s)
-u, --user=<username>
Select jobs from the given user
-S, --starttime=<time>
Select jobs that were active at or after the given time
-E, --endtime=<time>
Select jobs that were active before the given time
-X, --allocations
Only show allocations instead of individual steps
-o, --format=<format>
Specifies the fields to be displayed for each job
--helpformat
Get the list of fields available to use with --format
[olsont@shell-hpc ~]$ sacct -u olsont -r ceoas -S 2026-05-01 -E 2026-05-29 -o JobID,JobName,Start,End,State,Elapsed -X
JobID JobName Start End State Elapsed
------------ ---------- ------------------- ------------------- ---------- ----------
3966828 bash 2026-05-21T22:07:44 2026-05-21T22:08:14 COMPLETED 00:00:30
3966894 bash 2026-05-21T22:08:20 2026-05-21T22:08:21 COMPLETED 00:00:01
3966897 bash 2026-05-21T22:18:47 2026-05-21T22:23:21 COMPLETED 00:04:34
3967163 bash None 2026-05-21T22:41:35 CANCELLED+ 00:00:00
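For scripting, sacct's pipe-separated output (-P) is easier to parse than the padded table. The sketch below adds up the Elapsed column in seconds. The helper name total_elapsed_seconds is hypothetical, and the sample rows are the jobs from the listing above rewritten by hand in the JobID|State|Elapsed form.

```shell
#!/bin/bash
# Convert sacct Elapsed values ([D-]HH:MM:SS) to seconds and add them up.
# Expects pipe-separated lines of JobID|State|Elapsed, as produced by
#   sacct -X -n -P -o JobID,State,Elapsed
total_elapsed_seconds() {
    awk -F'|' '
        {
            n = split($3, t, /[-:]/)
            if (n == 4)   # D-HH:MM:SS
                secs = ((t[1] * 24 + t[2]) * 60 + t[3]) * 60 + t[4]
            else          # HH:MM:SS
                secs = (t[1] * 60 + t[2]) * 60 + t[3]
            total += secs
        }
        END { print total + 0 }'
}

# Sample rows based on the listing above.
sample='3966828|COMPLETED|00:00:30
3966894|COMPLETED|00:00:01
3966897|COMPLETED|00:04:34
3967163|CANCELLED|00:00:00'

total_elapsed_seconds <<< "$sample"
```

The four sample jobs add up to 305 seconds of elapsed time.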
sacctmgr¶
sacctmgr is used for viewing and modifying information about Slurm accounts and associations. Similar to scontrol, most users will only have read-only access and will likely only use the list and show subcommands. Similar to scontrol, it can also output an abundance of information that is omitted for brevity, but here are some generally useful commands:
sacctmgr show associations user=<username> format=Account
Shows the accounts that a user is associated with
sacctmgr show associations account=<account> format=User
Shows the users that an account is associated with
sacctmgr show accounts
Show all the accounts in the cluster
sacctmgr show qos format=Name%-20
Show all the QoS in the cluster
sacctmgr show qos <name> format=Name,MaxTRESPU%20,MaxJobsPU,UsageFactor,Flags
Show a specific QoS in the cluster with some particular information
sprio¶
sprio is used to view the priority of jobs that are in the queue and waiting to be scheduled. If multiple jobs are sitting in the queue with the state PD, this lets you see what their priorities are and how they're calculated.
[olsont@shell-hpc ~]$ sprio
JOBID PARTITION PRIORITY SITE AGE FAIRSHARE JOBSIZE PARTITION
3855697 phyc_lab 11225 0 471 751 4 10000
3962493 coe-arm 13003 0 1000 2000 3 10000
3962518 coe-arm 13003 0 1000 2000 3 10000
3962554 coe-arm 13003 0 1000 2000 3 10000
3998778 coe-arm 12474 0 471 2000 3 10000
3999197 ceoas-gpu 10880 0 479 396 5 10000
There are only a handful of notable options for sprio:
-p, --partition=<partitions>
Filters output to jobs queued on the listed partition(s)
-u, --user=<usernames>
Select jobs from the given user(s)
-j, --jobs=<jobs>
Show only the specified job(s)
-l, --long
Show additional information, such as usernames and additional priority factors
-w, --weights
List the current weights as configured for the cluster
[olsont@shell-hpc ~]$ sprio -w
JOBID PARTITION PRIORITY SITE AGE FAIRSHARE JOBSIZE PARTITION
Weights 1 1000 2000 1000 10000
Detailed information on how priorities are calculated with the given weights can be found at https://slurm.schedmd.com/priority_multifactor.html.
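As a rough cross-check of the first sprio listing above, the factor columns appear to be already weighted, so adding them up should land on the PRIORITY column (give or take one, presumably from rounding). The sketch below does that sum for a few rows copied from the listing; the helper name check_priority is made up for illustration.

```shell
#!/bin/bash
# For each job, add the factor columns and compare with PRIORITY.
# Columns: JOBID PARTITION PRIORITY SITE AGE FAIRSHARE JOBSIZE PARTITION(factor)
# Output:  JOBID PRIORITY SUM DIFFERENCE
check_priority() {
    awk '{ sum = $4 + $5 + $6 + $7 + $8; print $1, $3, sum, $3 - sum }'
}

# Rows copied from the first sprio listing above (header removed).
sample='3855697 phyc_lab 11225 0 471 751 4 10000
3962493 coe-arm 13003 0 1000 2000 3 10000
3998778 coe-arm 12474 0 471 2000 3 10000
3999197 ceoas-gpu 10880 0 479 396 5 10000'

check_priority <<< "$sample"
# On the cluster (-h suppresses the header):
#   sprio -h | check_priority
```

Three of the four sample rows match exactly, and job 3855697 differs by one.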
sshare¶
sshare is used to view details on the FairShare portion of priority calculations. For a quick look at a user's FairShare value under each of their accounts:
[olsont@shell-hpc ~]$ sshare -U olsont
Account User RawShares NormShares RawUsage EffectvUsage FairShare
-------------------- ---------- ---------- ----------- ----------- ------------- ----------
ceoas olsont 1 0.000751 19012658 0.001978 0.160980
brizo olsont 1 0.000063 0 0.000000 1.000000
coe-arm olsont 1 0.004444 0 0.000000 1.000000
coe olsont 1 0.002849 0 0.000000 1.000000
cqls olsont 1 0.000055 0 0.000000 1.000000
Batch scripts¶
Batch scripts used with sbatch are the core of getting the most out of Slurm. They consist of one or more #SBATCH options at the top, which are passed to the scheduler and runtime, followed by the rest of the script to be executed on the allocated nodes. Batch scripts generally look like the following:
#!/bin/bash
#SBATCH --job-name=mpijob
#SBATCH --output=mpijob-%j.out
#SBATCH --nodes=4
#SBATCH --ntasks=32
#SBATCH --time=1:00:00
#SBATCH --partition=ceoas
#SBATCH --exclusive
#SBATCH --mem=0
# Set up environment
. /local/ceoas/opt/spack/current/share/spack/setup-env.sh
spack env activate intel-intelmpi-cpu-x86
export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi2.so
# Execute command
srun mpijob
# Copy results to other storage
cp /storage/ceoas-scratch/workdir/output /ceoas/mydir/mpijob/
Slurm treats every line that starts with #SBATCH as a directive to the job scheduler/running environment. It keeps reading directives until it finds the first non-blank line that does not start with a #. Any #SBATCH directives placed after a non-blank, non-comment line are ignored (so you can't set #SBATCH directives using shell commands within the batch script itself).
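The scanning rule can be imitated with a few lines of awk. This is a simplified sketch of the behavior described above, not Slurm's actual parser, and the helper name leading_directives is hypothetical.

```shell
#!/bin/bash
# Print the #SBATCH directives that appear before the first command.
leading_directives() {
    awk '
        /^[[:space:]]*$/ { next }    # blank lines do not end the scan
        !/^#/            { exit }    # first non-comment line ends the scan
        /^#SBATCH/       { print }   # directive seen before any command
    '
}

# Demo script with one directive placed too late to be honored.
demo='#!/bin/bash
#SBATCH --job-name=demo
# an ordinary comment

#SBATCH --ntasks=4
echo "starting"
#SBATCH --time=1:00:00'

leading_directives <<< "$demo"
```

Only the --job-name and --ntasks directives are printed. The --time line comes after the echo command, so under the rule above it would be ignored.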
Most batch scripts can generally be broken down into sections for directives, pre-execution/setup, execution, and post-execution/cleanup. Keeping this general structure is recommended to improve readability and maintainability. It's also best to put as much in the batch script itself as possible for ease of replication and distribution instead of relying on external scripts or options passed to the sbatch command.
There is a vast variety of sbatch options available, covering everything from administrative tasks to core binding masks. Most of those options are also applicable to salloc and srun. The official list of them can be found at https://slurm.schedmd.com/sbatch.html, and the appendix at the end of this document covers the ones most commonly used by CEOAS researchers.
Job Arrays¶
WIP. Job Arrays are super convenient for breaking down large jobs into short jobs and submitting many short jobs to the cluster (which improves cluster operation as long as each job is more than a few minutes long).
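Until this section is filled in, here is a minimal sketch of an array job. The job name, the three input file names, and the helper input_for_task are illustrative assumptions. The --array option, the SLURM_ARRAY_TASK_ID variable, and the %A (array job ID) and %a (array index) filename patterns are standard Slurm features.

```shell
#!/bin/bash
#SBATCH --job-name=array-demo
#SBATCH --array=0-2
#SBATCH --output=array-demo-%A_%a.out
#SBATCH --ntasks=1
#SBATCH --time=0:10:00

# Hypothetical input files, one per array task.
inputs=(case_a.nc case_b.nc case_c.nc)

# Map an array index to its input file.
input_for_task() {
    echo "${inputs[$1]}"
}

# Slurm sets SLURM_ARRAY_TASK_ID for each array task; default to 0 so
# the script can also be tried outside of Slurm.
task_id=${SLURM_ARRAY_TASK_ID:-0}

echo "array task $task_id would process $(input_for_task "$task_id")"
```

Submitting this once with sbatch queues three independent tasks, each of which sees a different SLURM_ARRAY_TASK_ID and therefore picks a different input.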
Miscellaneous notes and issues¶
Checkpointing¶
(TODO: Need to look into DMTCP for agnostic checkpointing; https://dmtcp.sourceforge.io/, https://docs.nersc.gov/development/checkpoint-restart/dmtcp/, https://github.com/mpickpt/mana seems interesting.)
FAQ¶
Q: Why do I get "Requested node configuration is not available" when I submit my job?¶
A: The most common reason by far is that you requested many or all of the CPUs on a system but didn't specify how much memory to allocate with --mem or --mem-per-cpu. This is especially common with the --exclusive option, because Slurm will allocate all the CPUs on the node even if you don't use them. By default Slurm will request ~4GB of RAM per CPU (see DefMemPerCPU in scontrol show config). For example, if you submit a job with 16 tasks to a node with 384 threads and use --exclusive, Slurm will try to allocate 1.5TB (4GB/CPU * 384 CPUs) of RAM on that server. The simple fix is to either add --mem=0 if you're running with --exclusive, or adjust --mem-per-cpu to better match what your job actually uses.
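The arithmetic behind this answer can be sketched as follows. It assumes the DefMemPerCPU value of 4000 MB quoted in this document and uses CPU counts and memory sizes from the sinfo listing earlier on; the helper names are hypothetical.

```shell
#!/bin/bash
# Default memory (MB) requested for a whole node when neither --mem nor
# --mem-per-cpu is given: CPUs on the node times DefMemPerCPU.
def_mem_per_cpu=4000   # MB, the value quoted in this document

default_request_mb() {
    local cpus=$1
    echo $(( cpus * def_mem_per_cpu ))
}

# Report whether that default request fits in the node's memory.
fits_on_node() {
    local cpus=$1 node_mem_mb=$2
    if (( $(default_request_mb "$cpus") <= node_mem_mb )); then
        echo "fits"
    else
        echo "too large"
    fi
}

# CPU counts and memory sizes taken from the sinfo listing earlier on.
echo "kawashiro: $(default_request_mb 72) MB requested, $(fits_on_node 72 190000)"
echo "amaterasu: $(default_request_mb 32) MB requested, $(fits_on_node 32 250000)"
```

Under these assumptions an exclusive job on a 72-CPU kawashiro node would default to 288000 MB, more than the 190000 MB the node has, while the same default on a 32-CPU amaterasu node (128000 MB) fits.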
Appendix A: Options for sbatch/salloc/srun¶
WIP, please wait warmly
-A, --account
Select the account to use for submitting to the partition
-a, --array=<indices>
-D, --chdir=<directory>
Change to <directory> before executing the batch script
-c, --cpus-per-task=<ncpus>
Specify how many CPUs to allocate per task. Used for multithreaded jobs (OpenMP, pthreads, etc.) and almost never used for MPI jobs.
-m, --distribution=<distribution>
Used to specify how tasks are allocated across nodes and CPUs are allocated across sockets.
See https://slurm.schedmd.com/sbatch.html#OPT_distribution for details
-e, --error=<filename_pattern>
-x, --exclude=<nodelist>
Exclude the listed nodes from execution
--exclusive
-B, --extra-node-info=<sockets>[:cores[:threads]]
A more compact way of specifying the --(sockets/cores/threads)-per-(node/socket/core) options
-G, --gpus=[type:]<number>
Specify how many GPUs are required for the job, with an optional type specification
--gpus-per-task=[type:]<number>
Specify how many GPUs are required for each task in the job
-i, --input=<filename_pattern>
Connects the given file to the script's input. Useful for some things that may require some level of interactive input
--mail-type=<types>
Specify on which events to send an e-mail. Common options are BEGIN, END, FAIL, and ALL. Defaults to NONE
--mail-user=<address>
Address to send e-mail notifications to
--mem=<size>[units]
Specifies how much memory is required per node. The units default to megabytes. A value of 0 requests all of the memory on each node.
--mem-per-cpu=<size>[units]
Specifies how much memory is required per allocated CPU. Defaults to DefMemPerCPU in the cluster config (4000MB as of May 31st, 2026)
-w, --nodelist=<nodelist>
Requests the specific nodes. The job will contain these nodes as a *minimum* but can contain additional hosts if needed to satisfy requirements.
-F, --nodefile=<file>
The same as --nodelist but in a file
-N, --nodes=<minnodes>[-maxnodes]|<size_string>
-n, --ntasks=<num>
--ntasks-per-node=<ntasks>
--ntasks-per-socket=<ntasks>
-o, --output=<filename_pattern>
-p, --partition=<partitions>
--propagate[=rlimit[,rlimit...]]
-q, --qos=<qos>
--requeue
--no-requeue
-t, --time=<time>