Monitoring Slurm jobs#
This section describes techniques for monitoring jobs running through Slurm: the amount of resources they are currently using (CPUs, RAM, and GPUs), the overall amount of resources used by completed jobs, and whether jobs have started, finished, or failed.
It also describes how to monitor the overall activity level of the cluster, to help you judge how many resources you can reasonably reserve for a set of jobs. See the Best practice for reserving resources section.
E-mail notifications on job completion#
In addition to actively monitoring your jobs using squeue, it is possible to receive email notifications when your jobs start, finish, fail, are re-queued, or some combination thereof. This is accomplished by using the --mail-user and --mail-type options:
$ sbatch --mail-user=abc123@ku.dk --mail-type=END,FAIL my_script.sh
Submitted batch job 8503
These options can naturally also be embedded in your sbatch script:
#!/bin/bash
#SBATCH --mail-user=abc123@ku.dk --mail-type=END,FAIL
my-commands
and queued as usual:
$ sbatch my_script.sh
Submitted batch job 8504
When these options are enabled, Slurm will send a notification to the abc123@ku.dk address when the job completes or fails. The possible values for --mail-type are NONE (the default), BEGIN, END, FAIL, REQUEUE, ALL, or a comma-separated combination as shown above.
Warning
Remember to use your own @ku.dk email address as the recipient, instead of abc123@ku.dk. It is possible to use email addresses outside @ku.dk, but some providers will silently block these emails, and we therefore recommend using your @ku.dk address.
Monitoring overall resource usage by jobs#
The sacct command may be used to review the average CPU usage, the peak memory usage, disk I/O, and more for completed jobs. This makes it easier to verify that you are not needlessly reserving resources:
$ sacct -o JobID,Elapsed,State,AllocCPUS,AveCPU,ReqMem,MaxVMSize
A full description of the data printed by the sacct command can be found in the sacct manual, but briefly, this prints the job ID, the amount of time the job has been running, the state of the job (queued, running, completed, etc.), the number of CPUs allocated, the average CPU time used (ideally this should approach the number of allocated CPUs times the elapsed time), the amount of memory requested, and the peak virtual memory size.
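By default, sacct lists your recent jobs. To review a specific job instead, the same columns can be requested for a single job ID via the -j option; for example, using job 8503 from the e-mail notification example above:
$ sacct -j 8503 -o JobID,Elapsed,State,AllocCPUS,AveCPU,ReqMem,MaxVMSize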
Alternatively, we provide a helper that summarizes some of this information in a more easily readable form:
$ module load sacct-usage
$ sacct-usage
Age User Job State Elapsed CPUs CPUsWasted ExtraMem ExtraMemWasted CPUHoursWasted
13:32:04s abc123 1 FAILED 252:04:52s 8 6.9 131.4 131.4 4012.14
10:54:32s abc123 2[1] COMPLETED 02:49:25s 32 15.7 0.0 0.0 44.38
01:48:43s abc123 3 COMPLETED 01:00:53s 24 2.4 0.0 0.0 2.43
The important information is found in the CPUsWasted and ExtraMemWasted columns, which show the number of CPUs that went unused on average and the amount of extra memory that went unused. Note that ExtraMem only counts memory above the default allocation of ~16 GB of RAM per CPU, as our policy is that you shouldn't have to worry about using less than that. If you want to see the full memory usage, then use the --verbose option.
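For example, assuming the module is already loaded as shown above:
$ sacct-usage --verbose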
The final column indicates the number of CPU hours your job wasted, calculated as the length of time your job ran multiplied by the sum of the number of unused CPUs and the number of CPUs that could have been allocated the default ~16 GB of RAM had ExtraMemWasted been zero.
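For example, the second job above ran for roughly 2.82 hours (02:49:25) with 15.7 CPUs unused on average and no wasted extra memory, giving 2.82 × 15.7 ≈ 44.38 wasted CPU hours, matching the final column.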
Aim for your jobs to resemble the third job, not the second job and especially not the first job in the example!
Warning
The Wasted statistics are based on snapshots of resource usage produced by Slurm and are therefore not 100% accurate. Notably, the memory usage statistics are based on the maximum memory usage of individual processes, rather than the maximum cumulative memory usage, and may therefore greatly overestimate wasted memory if you are running multiple simultaneous processes in a pipeline.
Monitoring individual processes in a job#
While sacct can report on the overall resource usage of your job, it can also be helpful to track resource usage for individual commands that you are running. This is particularly useful when attempting to optimize the number of CPUs used by commands run in a job.
One way of doing this is via the time command, which can report the efficiency gained from using multiple threads and show how much memory a program used. This is accomplished by pre-pending /usr/bin/time -f "CPU = %P, MEM = %MKB" to the command that you want to measure, as shown in this example, where we wish to measure the resource usage of the my-command program:
$ /usr/bin/time -f "CPU = %P, MEM = %MKB" my-command --threads 1 ...
CPU = 99%, MEM = 840563KB
$ /usr/bin/time -f "CPU = %P, MEM = %MKB" my-command --threads 4 ...
CPU = 345%, MEM = 892341KB
$ /usr/bin/time -f "CPU = %P, MEM = %MKB" my-command --threads 8 ...
CPU = 605%, MEM = 936324KB
In this example, increasing the number of threads did not result in a proportional increase in CPU usage: we see only a 3.5x increase with 4 CPUs and only a 6x increase with 8 CPUs. This means that it would be more efficient to run two tasks with 4 CPUs each in parallel, rather than running one task with 8 CPUs.
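To compare several thread counts systematically, the measurement above can be wrapped in a simple shell loop. This is a minimal sketch, in which my-command, its --threads option, and the trailing ... remain placeholders for your own program and its arguments:
$ for threads in 1 2 4 8; do
>     /usr/bin/time -f "threads = ${threads}: CPU = %P, MEM = %MKB" my-command --threads ${threads} ...
> done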
Live monitoring of processes in jobs#
In addition to monitoring jobs at a high level, it is possible to actively monitor the processes running in your jobs via (interactive) shells running on the same node as the job you wish to monitor. This allows us to estimate resource usage before a job has finished running. In this example we will use the htop command to monitor our jobs, but you can use basic top, a bash shell, or any other command you prefer.
The first option for directly monitoring jobs is to request a job on the same server, using the --nodelist option to specify the node your job is running on, as sketched below. However, this will not work if all resources on the node are reserved, and for that reason we recommend running htop inside your existing job.
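For example, assuming your job runs on the node esrumcmpn03fl (as in the squeue output below), such a request might look as follows:
$ srun --pty --nodelist=esrumcmpn03fl htop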
Running htop inside your existing job is done using the --overlap and --jobid command-line options for srun, which tell Slurm that your new job should overlap an existing job and which job to overlap. The job ID can be obtained using, for example, the squeue --me command (from the JOBID column), as shown here:
$ squeue --me
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
8503 standardq my_scrip abc123 R 0:02 1 esrumcmpn03fl
$ srun --pty --overlap --jobid 8503 --gres=none htop
The --pty option gives us an interactive session, which allows us to interact directly with htop. See the Interactive sessions section for more information. The --gres=none option is required to overlap jobs that reserve GPUs, since Slurm does not permit those to be shared, even for overlapping jobs. See below for instructions on how to monitor GPU utilization.
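The same technique can be used to start a shell inside the job instead of htop, for example to inspect files or run other monitoring tools; a sketch using the job ID from the example above:
$ srun --pty --overlap --jobid 8503 --gres=none -- /bin/bash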
Monitoring GPU utilization#
Monitoring of GPU utilization is highly recommended when you run jobs on the GPU node: to make full use of the hardware, you want to keep GPU utilization at 100%, and to do so you typically want to load as much data into GPU memory as possible. The exact way in which you can accomplish this depends on the software you are running, but it can often be done by increasing the size of the batches you are processing.
The way in which you are using the GPUs will affect how you can monitor them, depending on whether you have reserved a GPU for an interactive session:
Monitoring an interactive session#
If you are running a job in an interactive session, then you can monitor the reserved GPU(s) directly using the nvidia-smi command:
$ nvidia-smi -l 5
Thu Apr 4 14:30:46 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08 Driver Version: 545.23.08 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100 80GB PCIe On | 00000000:27:00.0 Off | 0 |
| N/A 57C P0 307W / 300W | 52357MiB / 81920MiB | 99% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A100 80GB PCIe On | 00000000:A3:00.0 Off | 0 |
| N/A 56C P0 298W / 300W | 58893MiB / 81920MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 2807877 C dorado 52344MiB |
| 1 N/A N/A 2807849 C dorado 58880MiB |
+---------------------------------------------------------------------------------------+
This will print resource usage for the GPUs you have reserved for your interactive session (and only for those GPUs), and will continue to print it every 5 seconds via the -l 5 option. Other monitoring tools are available (for example gpustat), but they are outside the scope of this documentation.
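If you prefer a more compact, log-friendly output, nvidia-smi can also print selected fields in CSV format; a minimal sketch, likewise refreshing every 5 seconds:
$ nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used,memory.total --format=csv -l 5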
Monitoring a Slurm job#
If you have started a standard (non-interactive) job via Slurm, then you will not be able to run nvidia-smi directly, nor will you be able to run it by joining the job using srun --jobid, due to the way Slurm handles special resources. We have therefore set up log-files on the GPU nodes that contain the output from the nvidia-smi command as shown above.
To watch the content of this log-file, first determine the job ID of your job running on the GPU node:
$ squeue --me --partition=gpuqueue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
570316 gpuqueue bash abc123 R 13:55 1 esrumgpun01fl
Then we use srun with the --overlap option to run a command inside this job, which we specify using the --jobid 570316 option. The --gres=none option is required, since otherwise Slurm would try to reserve the GPU our job already uses and eventually time out.
$ srun --overlap --jobid 570316 --gres=none --pty -- watch -n 15 -d cat /scratch/gpus/nvidia-smi.txt
Warning
Remember to replace the 570316 with the ID of your own job!
This prints the contents of the log-file every 15 seconds (which is how often the files are updated) and highlights the changes since the last nvidia-smi run. To disable the highlighting, simply remove the -d option from the command.
This command does not take up additional resources on the GPU node and will automatically exit when your job finishes. See the Live monitoring of processes in jobs section for more information.
Monitoring the cluster#
The slurmboard utility is made available as part of the cbmr_shared project folder, in order to make it easy to monitor activity on the cluster, for example to decide how many resources you can reasonably use for a job (see Best practice for reserving resources):
$ module load slurmboard
$ slurmboard

Briefly, this utility displays every node in the cluster, its status, and the available resources on each node. The resource columns (CPUs, Memory, and GPUs) are colored as follows: yellow indicates resources that have been reserved; green indicates resources that are actively being used; purple indicates resources that may be inaccessible due to other resources being reserved (e.g. RAM being inaccessible due to all CPUs being reserved, or vice versa); and black indicates resources that are unavailable due to nodes being offline or under maintenance.
Note
The Data Analytics Platform uses this utility to monitor how busy the cluster is and how jobs are performing. In particular, we may reach out to you if we notice that your jobs consistently use significantly fewer resources than the amount reserved, in order to optimize resource utilization on the cluster.