Batching commands in bash#
This section describes ways to batch commands without using Slurm. This is intended to be used both with single jobs and in conjunction with Slurm job arrays (see the Monitoring your jobs section), for example when you need to run a large number of small jobs.
Running tasks as described on this page has a lower overhead than scheduling each task as a separate Slurm job, and is therefore well suited for running many small tasks. However, for larger jobs you should still prefer job arrays if possible.
This section covers basic loops in bash and the parallel command. Other options include xargs, make, snakemake, and much, much more.
Running commands sequentially in bash#
If you need to run a number of (very) short-running commands, then it is likely more efficient to do so with a simple loop in your sbatch script. For example, this script runs the plonk command on a number of population VCF files:
#!/bin/bash
module load plonk/3.14
for pop in CHB FIN GBR JPT PUR YRI; do
    plonk --input "./my_data/${pop}.vcf" --output "./my_results/${pop}.out"
done
The following script indexes a number of (small) BAM files in the current directory:
#!/bin/bash
module load samtools/1.17
for filename in ./*.bam; do
    samtools index "${filename}"
done
However, it is important to remember that the total runtime will be the sum of the run-times of the individual tasks, since they are run one after the other; for example, 100 tasks that each take 5 minutes would need more than 8 hours in total. It is therefore not recommended to use loops like this for commands that take more than a few minutes to complete!
Running commands in parallel in bash#
The GNU parallel command offers a range of options for running commands in parallel. For example, the plonk loop from above can be parallelized as follows:
#!/bin/bash
module load plonk/3.14
module load parallel/20230822
parallel -P ${SLURM_CPUS_PER_TASK} \
    plonk --input "./my_data/{}.vcf" --output "./my_results/{}.out" \
    ::: CHB FIN GBR JPT PUR YRI
The parallel command will then execute plonk once for each of the values we specified after the :::, replacing the text {} with the current value.
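If you are unsure which commands parallel will generate from a template like this, the --dry-run option makes it print each command instead of running it:
#!/bin/bash
module load parallel/20230822
# Print the six plonk commands that would be run, without
# actually executing them
parallel --dry-run \
    plonk --input "./my_data/{}.vcf" --output "./my_results/{}.out" \
    ::: CHB FIN GBR JPT PUR YRI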
The second example above, indexing the BAM files, can be run in parallel as follows:
#!/bin/bash
module load samtools/1.17
module load parallel/20230822
parallel -P ${SLURM_CPUS_PER_TASK} \
    samtools index "{}" \
    ::: ./*.bam
If no {} is specified, the value is appended to the end of the command.
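For example, the BAM indexing job can therefore be written without an explicit {}, since each filename is then simply appended to samtools index:
#!/bin/bash
module load samtools/1.17
module load parallel/20230822
# Equivalent to the version above: each filename is appended to
# the command, as if {} had been written at the end
parallel -P ${SLURM_CPUS_PER_TASK} samtools index ::: ./*.bam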
Additionally, parallel can read values from STDIN, meaning that the above could also be written as:
#!/bin/bash
module load samtools/1.17
module load parallel/20230822
ls ./*.bam | parallel -P ${SLURM_CPUS_PER_TASK} samtools index
Each line on STDIN is treated as one value.
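This also makes it easy to keep the list of values in a text file, one value per line. As a sketch, assuming a (hypothetical) file populations.txt listing the six population names, the plonk example could be written as:
#!/bin/bash
module load plonk/3.14
module load parallel/20230822
# populations.txt is assumed to contain one population name per
# line (CHB, FIN, GBR, JPT, PUR, YRI)
parallel -P ${SLURM_CPUS_PER_TASK} \
    plonk --input "./my_data/{}.vcf" --output "./my_results/{}.out" \
    < populations.txt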
Best practices for reserving resources#
Note that when you reserve resources for a job using parallel, you generally should not reserve enough cores to run all tasks at once. This is because tasks are likely to take different amounts of time to run, sometimes significantly so, resulting in a (potentially large) number of CPUs sitting idle until the last task has finished.
For this reason we advise that you do not reserve more CPUs than are needed to run 1/3 to 1/2 of your tasks at once. This also allows you to queue that many more simultaneous jobs on Slurm, and will typically result in an overall greater throughput than simply using the maximum number of processes with parallel.
Using the plonk example from above:
#!/bin/bash
module load plonk/3.14
module load parallel/20230822
parallel -P ${SLURM_CPUS_PER_TASK} \
    plonk --input "./my_data/{}.vcf" --output "./my_results/{}.out" \
    ::: CHB FIN GBR JPT PUR YRI
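For example, with these six tasks you might reserve three CPUs, so that at most half of the tasks run at the same time:
#!/bin/bash
#SBATCH --cpus-per-task=3
module load plonk/3.14
module load parallel/20230822
# At most three of the six tasks run at a time; parallel starts
# the next task as soon as a running one finishes
parallel -P ${SLURM_CPUS_PER_TASK} \
    plonk --input "./my_data/{}.vcf" --output "./my_results/{}.out" \
    ::: CHB FIN GBR JPT PUR YRI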
Let's say that plonk is able to use multiple threads and that I decide to use 4 threads per process. In that case, I could reserve 12 CPUs for my job and then run 3 instances of plonk using parallel:
#!/bin/bash
#SBATCH --cpus-per-task=12
module load plonk/3.14
module load parallel/20230822
parallel -P 3 \
    plonk --threads 4 --input "./my_data/{}.vcf" --output "./my_results/{}.out" \
    ::: CHB FIN GBR JPT PUR YRI
This, however, has the disadvantage that you have to make sure that --cpus-per-task, -P, and --threads (or whatever option your software uses) all line up.
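One way to reduce the risk of a mismatch is to compute the -P value from the reservation instead of hard-coding it. A minimal sketch, still assuming that plonk takes a --threads option:
#!/bin/bash
#SBATCH --cpus-per-task=12
module load plonk/3.14
module load parallel/20230822
# Threads per plonk process; only this value and --cpus-per-task
# need to be kept in sync
THREADS=4
# Slurm sets SLURM_CPUS_PER_TASK from --cpus-per-task, so this
# runs 12 / 4 = 3 processes at a time
parallel -P $(( SLURM_CPUS_PER_TASK / THREADS )) \
    plonk --threads ${THREADS} --input "./my_data/{}.vcf" --output "./my_results/{}.out" \
    ::: CHB FIN GBR JPT PUR YRI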