GPU / high-memory jobs#
The cluster currently includes 1 node with 2x A100 Nvidia GPUs and 4 TB of RAM, and 3 nodes with 2x H100 Nvidia GPUs and 2 TB of RAM. We refer to these as the GPU and/or high-memory nodes, and this page describes how to use them.
These nodes are intended for tasks that make use of GPUs, and for individual jobs that require more than the 2 TB of RAM available on the regular compute nodes.
Running jobs on the GPU / high-memory node#
By default, jobs submitted via Slurm will only run on regular nodes, even if you ask for more than 2 TB of RAM or ask for a GPU. Attempting to run such a task will instead result in a Requested node configuration is not available error message.
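If, for example, a job asking for 2.5 TB of RAM is submitted without selecting the GPU queue, the rejection looks roughly like this (output illustrative):
$ sbatch my_hi_mem_job.sh
sbatch: error: Batch job submission failed: Requested node configuration is not available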
This is because the GPU / high-memory nodes are located on their own queue, in order to prevent normal use of the cluster from blocking access to these resources. You must therefore use the --partition=gpuqueue option to select the correct queue. This might look as follows in an sbatch script:
#!/bin/bash
#SBATCH --partition=gpuqueue
my-memory-hungry-command
While running on the GPU queue, you can reserve up to 3920 GB of RAM and up to two GPUs (see below) per job. The GPU / high-memory nodes otherwise use the same defaults as the other nodes (~16 GB of RAM per CPU reserved).
For example, to run a job using 2.5 TB of RAM on the GPU / high-memory node:
#!/bin/bash
#SBATCH --partition=gpuqueue
#SBATCH --mem 2560G
my-memory-hungry-command
This script can then be submitted as usual:
$ sbatch my_hi_mem_job.sh
Submitted batch job 217217
See the Basic Slurm jobs and Advanced Slurm jobs pages for information about reserving additional CPUs and RAM, and about other Slurm settings for your jobs.
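For example, a job on the GPU queue that reserves several CPUs, extra RAM, and a time limit might be sketched as follows; the specific values and the command are placeholders, and the individual options are described on the pages above:
#!/bin/bash
#SBATCH --partition=gpuqueue
#SBATCH --cpus-per-task=8
#SBATCH --mem 512G
#SBATCH --time=12:00:00
my-memory-hungry-command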
We ask that you do not reserve all available CPUs or all RAM on the GPU / high-memory nodes unless your analyses actually require it, since leaving some resources unused allows other users to utilize the GPUs while your jobs are running.
Reserving GPUs#
Requesting GPUs is done with the --gres option, and also requires the --partition=gpuqueue option to select the correct queue, as described above. This might look as follows in an sbatch script:
#!/bin/bash
#SBATCH --partition=gpuqueue --gres=gpu:1
nvidia-smi -L
The --gres=gpu:1 option in the above asks Slurm to make 1 GPU available to our job. This can be increased to 2 to reserve both GPUs on the node, but because of the limited number of GPUs we ask that you only reserve 1 GPU per job, which is normally also more efficient.
This script can then be submitted as usual:
$ sbatch my_gpu_job.sh
Submitted batch job 217218
$ cat slurm-217218.out
GPU 0: NVIDIA A100 80GB PCIe (UUID: GPU-4f2ff8df-0d18-a99b-9fb8-67aa0867f7a3)
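Depending on the cluster's GRES configuration, Slurm will typically also set the CUDA_VISIBLE_DEVICES variable inside the job, so that only the reserved GPU is visible to your software. A minimal sketch for checking this from within a job, assuming that behavior:
#!/bin/bash
#SBATCH --partition=gpuqueue --gres=gpu:1
# Show which GPU index (if any) Slurm has exposed to this job
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset}"
# List the GPU(s) actually visible to CUDA-aware tools
nvidia-smi -L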
Requesting specific GPUs#
As indicated above, the GPU nodes include both Nvidia H100 and A100 GPUs. By default, your job will be assigned to the first idle GPU(s), but it is also possible to request a specific GPU type.
To request an A100 GPU, replace the --gres=gpu:1 option with --gres=gpu:a100:1, and to request an H100 GPU, replace it with --gres=gpu:h100:1. For example,
#!/bin/bash
#SBATCH --partition=gpuqueue --gres=gpu:h100:1
nvidia-smi -L
This script can then be submitted as usual:
$ sbatch my_h100_job.sh
Submitted batch job 217219
$ cat slurm-217219.out
GPU 0: NVIDIA H100 NVL (UUID: GPU-c43d0655-2d15-7e66-90b3-9b732a1d13ba)
We recommend looking at current GPU utilization before submitting your job, as any time saved by running on a faster (H100) GPU may be lost from having to wait for one to become idle. The slurmboard utility described in the Monitoring the cluster section provides a simple way to see GPU reservations.
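Alternatively, the standard Slurm tools can give a rough overview of what is currently reserved on the GPU queue. For example, the following squeue command lists jobs on the queue, with the %b field showing the GPUs (GRES) requested by each job; the format string is just one possible choice:
$ squeue --partition=gpuqueue --format="%.10i %.8u %.2t %.10M %b"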
Running an interactive session#
While it is possible to run an interactive session on the GPU / high-memory nodes, we ask that you limit the usage of such sessions as much as possible. If at all possible, prefer using sbatch or non-interactive srun instead. This ensures that the resources remain available to other users when you are not actively using them.
To start an interactive session, simply apply the same --partition option as above, the same --gres option if you need a GPU, and any other resource options described in the Reserving resources section:
$ srun --pty --partition=gpuqueue -- /bin/bash
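If the session needs a GPU, the --gres option is added in the same way as for batch jobs; the time limit shown here is illustrative, but helps ensure the reservation is not left hanging indefinitely:
$ srun --pty --partition=gpuqueue --gres=gpu:1 --time=2:00:00 -- /bin/bash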
See the Interactive sessions section for information about interactive sessions, including information about running programs with graphical interfaces.
Warning
Interactive sessions left running on the GPU node may be terminated without warning.
Monitoring GPU utilization#
Please see the Monitoring GPU utilization section on the Monitoring slurm jobs page.
Troubleshooting#
Error: Requested node configuration is not available#
See the Slurm Basics Troubleshooting section.