# Using local drives on the compute nodes
The Wildwood HPC cluster has local drives mapped to the /scratch volume. This is a change from the old infrastructure,
where local drives were mapped as /data. We made this change for two main reasons. First, /scratch is the more
intuitive name: files stored in /scratch should be considered temporary and should be cleaned up fairly routinely,
while /data should be considered non-temporary. Second, the /data volume was previously used for different purposes on
different types of machines. On webservers, the /data drive stores the resources needed to run the websites hosted on
the device, so in some cases /data was not writable by users, which caused software error messages, including on the
login and file nodes (previously shell and files, now shell-hpc and files-hpc). On compute nodes, the /data drive was
expected to be available for writing temporary outputs. We now apply the following general rule for local disk mounting:
| volume | purpose | notes |
|---|---|---|
| /tmp | Small OS temporary space | Used for OS temp work, and as temp space on the login and file nodes |
| /data | Non-temporary data storage | Used for storing web-server resources. On compute nodes with secondary disks, leftover OS-drive space is mapped as /data |
| /scratch | Temporary processing space | Used for temporarily storing processing output. Linked as /data if no secondary drive is available. |
> **Note**
>
> /scratch should be used preferentially as processing space. /data may be present, and may even point to the same
> physical drive to reduce legacy issues, but /scratch should be preferred for new pipelines and procedures.
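If you want to confirm which of these volumes exist on the node you are logged in to, a quick check of the mount points works on any node. This is just an illustrative command using the paths described above; /scratch may be absent (or point at the same drive as /data) depending on the node:

```bash
# Show which local volumes exist on this node and how much space is free.
# Errors for missing mount points are suppressed.
df -h /tmp /data /scratch 2>/dev/null
```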
## Dynamic $TMPDIR settings to facilitate usage of /scratch
If you are using the updated dotfiles, you may have already noticed that $TMPDIR is set dynamically for you, depending
on which drive is available. On the login and file-transfer nodes, your $TMPDIR will be /tmp, because /data is used for
hosting websites and other data there, and /scratch is unavailable to discourage accidental data processing on those
nodes. If you log in to a compute node with salloc, you should find that your $TMPDIR automatically updates to
/scratch. This happens because the updated configuration files contain this small section of code, which finds a
writable directory and updates $TMPDIR on login:
```bash
# Point compilers and system facilities at the fastest writable local temp space
if [ -w "/scratch" ]; then
    TMPDIR=/scratch
elif [ -w "/data" ]; then
    TMPDIR=/data
else
    TMPDIR=/tmp
fi
export TMPDIR
```
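You can verify the behaviour from an interactive allocation. The salloc resource options below are only placeholders; request whatever your group normally uses:

```bash
# On a login node, TMPDIR falls back to /tmp
echo "$TMPDIR"                        # -> /tmp

# Request an interactive session on a compute node (options are illustrative)
salloc --ntasks=1 --time=00:30:00
echo "$TMPDIR"                        # -> /scratch (or /data on nodes without a secondary drive)
```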
## The problem with Slurm batch jobs and $TMPDIR
One change we have had to work around in the move from SGE to Slurm is how each queuing system inherits the submitting
environment. SGE re-loaded the user's environment even for batch jobs, so the dynamic $TMPDIR logic above would find
the appropriate $TMPDIR on the compute node. Slurm batch jobs submitted with sbatch do not re-load the user's
environment; they inherit the submit environment, where $TMPDIR is /tmp.
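If you submit plain sbatch scripts without hqsub, one workaround is to repeat the same writability check at the top of the batch script so that $TMPDIR is re-derived on the compute node instead of inherited from the login node. This is a sketch, not a site-supported mechanism, and the #SBATCH options are placeholders:

```bash
#!/bin/bash
#SBATCH --job-name=tmpdir-demo
#SBATCH --ntasks=1

# Re-derive TMPDIR on the compute node; the value inherited from the
# submit environment on a login node would otherwise be /tmp.
if [ -w "/scratch" ]; then
    export TMPDIR=/scratch
elif [ -w "/data" ]; then
    export TMPDIR=/data
else
    export TMPDIR=/tmp
fi

echo "Job is using TMPDIR=$TMPDIR"
```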
## A solution using hqsub
Starting in version 1.5.0, hqsub has an autotmp setting, enabled by default, that runs the automatic $TMPDIR code shown
above on the compute node, so that $TMPDIR is updated before any data processing occurs. The setting can be disabled
with --no-autotmp for an individual run, or you can set the HQSUB_AUTOTMP environment variable to False or 0 in your
shell config file to disable the feature permanently.
In this way, interactive jobs started with salloc and batch/array jobs started with hqsub will have the same
expected $TMPDIR settings.
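If you do need the old behaviour, disabling autotmp looks roughly like this. The job-script argument to hqsub is assumed here purely for illustration; only the --no-autotmp flag and HQSUB_AUTOTMP variable are the documented controls:

```bash
# Disable autotmp for a single submission
hqsub --no-autotmp my_job_script.sh

# Or disable it for all submissions by adding this to your shell config file
export HQSUB_AUTOTMP=0
```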
## Using the --local-drive option of hqsub
The --local-drive option of hqsub also benefits from the autotmp setting, assuming the feature has not been disabled.
You can still specify a non-dynamic prefix for --local-drive using the --local-prefix option, but by default both
--local-drive pertask and --local-drive shared will work as expected with the dynamic $TMPDIR updates.
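As a rough illustration of the two styles, with the caveat that the job-script argument and the fixed prefix path below are assumptions rather than documented defaults:

```bash
# Per-task local working space under the dynamically chosen $TMPDIR
hqsub --local-drive pertask my_job_script.sh

# Pin the local drive to a fixed location instead of the dynamic $TMPDIR
hqsub --local-drive shared --local-prefix /data my_job_script.sh
```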