# Slurm Batch system usage

## Simple job submission and the concept of queues (partitions)
On the Tier-3 you run jobs by submitting them to the Slurm job scheduler. The cluster's worker nodes are only accessible through this batch system.
The user interface nodes t3ui0* provide an environment to compose, test, and submit batch jobs to Slurm.
Jobs are submitted to job queues, which are called partitions in Slurm. We provide the following partitions (often also called batch queues), which differ in maximal runtime, features, and available worker nodes.
You can get the available queues, their properties, and their nodes by running the following command on one of the UI nodes:

```
sinfo -o "%.12P %.16F %.16C %.14l %.16L %.20G %N"
```

This table captures a snapshot of that output:
| PARTITION | NODES(A/I/O/T) | CPUS(A/I/O/T) | TIMELIMIT | DEFAULTTIME | GRES | NODELIST |
|---|---|---|---|---|---|---|
| short | 5/8/1/14 | 81/3119/128/3328 | 1:00:00 | 45:00 | (null) | t3wn[72-73,80-91] |
| standard* | 5/8/1/14 | 81/3119/128/3328 | 12:00:00 | 12:00:00 | (null) | t3wn[72-73,80-91] |
| long | 5/8/1/14 | 81/3119/128/3328 | 7-00:00:00 | 1-00:00:00 | (null) | t3wn[72-73,80-91] |
| qgpu | 2/0/0/2 | 32/48/0/80 | 1:00:00 | 30:00 | gpu:geforce_gtx_1080 | t3gpu[01-02] |
| gpu | 2/0/0/2 | 32/48/0/80 | 7-00:00:00 | 1-00:00:00 | gpu:geforce_gtx_1080 | t3gpu[01-02] |
Explanations:
- A/I/O/T: abbreviations for Allocated/Idle/Other/Total (nodes or CPUs in each state).
- Timelimit: maximal runtime in the format d-hh:mm:ss.
- GRES: generic resources of the nodes. This is where GPUs get listed.
## Submitting a batch job
To launch a batch job, you first prepare a batch script: a normal shell script that launches your executable and provides some additional information to the Slurm scheduler.
You then use the `sbatch` command to submit it to Slurm. You also need to provide an account name (`t3` for multicore nodes, `gpu_gres` for the GPU partitions). Here we submit to the short multicore queue and therefore use the account `t3`:

```
sbatch -p short --account=t3 my-script.sh
```
To submit GPU jobs to the GPU queues, you also need to specify the `gpu_gres` account:

```
sbatch -p qgpu --account=gpu_gres my-gpu-script.sh
```
The `sbatch` command supports many additional configuration options (refer to its man page); e.g. you may want to specify the memory requirements of your job in MB:

```
sbatch -p standard --account=t3 --mem=3000 job.py
```
Instead of passing all these options on the command line, you can also set them inside the batch script itself (usually in the header part), by prefixing an option line with an `#SBATCH` comment, like here:
```
#!/bin/bash
# Example slurm script with Slurm options
#SBATCH --mem=3000
#SBATCH --account=t3
#SBATCH --time=04:00:00
#SBATCH --partition=standard

# now start our executable
myexecutable
```
We provide a list of useful Slurm commands to check jobs and the status of nodes.
The detailed Slurm configuration can be examined on any Slurm node by listing the configuration file `/etc/slurm/slurm.conf`.
## Storage best practices within a batch job
You should try to do all intensive I/O on the local storage of the node, which is available under `/scratch`. A typical strategy is:
- For intensive read I/O, prefer reading input files from our SE (dCache, under `/pnfs`).
- If you have to read files from remote sites, this may work for low intensity I/O, but it is better to either first bring them to our SE (e.g. by rucio), or download the whole file to `/scratch` at the beginning of the job.
    - Explanation: if you know that your application will read the whole file, it is more efficient to transfer it fully in one go by `xrdcp` or `gfal-copy` to `/scratch` (or to our SE using rucio) instead of reading it piecewise in many small packets from a remote site.
- Create result files in `/scratch` and move them to the SE (`/pnfs`) only at the end of your job.
- `/work` area: you can use this area for mid intensity I/O. It has the advantage that it is shared with all nodes, while `/scratch` is only local to each node. But be aware that `/work` is on a central NFS share, and if too many jobs start to use it intensively, it may get overloaded and block, affecting all users.
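As a sketch, the strategy above could look like the following job-script skeleton. All file names and the `RESULT_DIR` destination are made-up illustrations, plain shell commands stand in for the actual transfer tools (`xrdcp`, `gfal-copy`, rucio), and a local temporary directory stands in for `/scratch`, so the sketch can be tried on any machine:

```shell
#!/bin/bash
# Illustrative sketch only: file names, RESULT_DIR and the use of plain
# cp/mv instead of xrdcp/gfal-copy/rucio are assumptions for the demo.
set -e
USER="${USER:-$(id -un)}"

# Stand-in for the node-local /scratch area so the sketch runs anywhere.
SCRATCH_ROOT="${SCRATCH_ROOT:-${TMPDIR:-/tmp}}"
JOB_SCRATCH="$SCRATCH_ROOT/$USER/${SLURM_JOB_ID:-demo-$$}"
mkdir -p "$JOB_SCRATCH"

# 1) Stage the whole input file into local scratch in one go
#    (on the cluster: xrdcp or gfal-copy from /pnfs or a remote SE).
printf 'some event data\n' > "$JOB_SCRATCH/input.dat"

# 2) Do all intensive I/O on the node-local scratch area.
tr 'a-z' 'A-Z' < "$JOB_SCRATCH/input.dat" > "$JOB_SCRATCH/result.dat"

# 3) Move the result out only at the end of the job
#    (on the cluster: copy back to the SE under /pnfs).
RESULT_DIR="${RESULT_DIR:-$SCRATCH_ROOT/$USER/results}"
mkdir -p "$RESULT_DIR"
mv "$JOB_SCRATCH/result.dat" "$RESULT_DIR/"

# 4) Clean up the job-specific scratch directory.
rm -rf "$JOB_SCRATCH"
```

On the real cluster you would replace the `printf` and `mv` steps with transfers against `/pnfs`, but the shape of the job stays the same: stage in once, work locally, stage out once.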
For using `/scratch`, it is best that you create a user- and job-specific directory, like in this example:
```
JOB_SCRATCH=/scratch/$USER/$SLURM_JOB_ID
mkdir -p "$JOB_SCRATCH"
export TMPDIR="$JOB_SCRATCH"

#########################################################
# Here comes your code
#########################################################

# cleaning of temporary working dir after job is completed:
rm -rf "$JOB_SCRATCH"
```
**IMPORTANT**
We also set `TMPDIR`, a variable which instructs compliant applications to use that location for temporary files. If you do not set this variable, your jobs may overfill `/tmp`, which is also needed by system programs, and this may cause the node to fail.
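As a quick illustration of why setting `TMPDIR` matters: tools that follow the standard temp-directory convention, such as `mktemp`, create their files under `$TMPDIR` once it is exported. The demo directory below is hypothetical and only stands in for the job-specific scratch directory:

```shell
#!/bin/bash
# Demo: after exporting TMPDIR, mktemp places its files there, not in /tmp.
# demo_scratch is a made-up stand-in for /scratch/$USER/$SLURM_JOB_ID.
demo_scratch=$(mktemp -d)
export TMPDIR="$demo_scratch"

tmpfile=$(mktemp)   # now lands under $TMPDIR instead of /tmp
case "$tmpfile" in
    "$demo_scratch"/*) echo "temporary file created under TMPDIR" ;;
    *)                 echo "temporary file created elsewhere" ;;
esac
```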
## Job priorities and fair share
Slurm regularly calculates priorities of queued jobs. The job with the highest priority is the next job to be run when there is a free slot. The goal is to provide a system that fills the resources efficiently, but that is also fair. The priority calculation takes into account a number of factors, among them:
- FairShare: based on past cluster usage by the user and a decay function.
- Age of Job: time the job has been waiting in the queue. Priority increases with the waiting time.
- Job Size: size of the requested resources (CPUs, memory).
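To inspect these factors for yourself, something like the following sketch could be used. `sshare` and `sprio` are standard Slurm client commands, though their availability on the UI nodes here is an assumption; the guard keeps the snippet harmless on hosts without Slurm:

```shell
# Hedged sketch: show your fair-share standing and the per-factor
# priority of your pending jobs (assumes Slurm client tools are installed).
if command -v sshare >/dev/null 2>&1; then
    fairshare_info=$(sshare -U 2>/dev/null) \
        || fairshare_info="sshare could not contact the Slurm controller"
    sprio -u "${USER:-$(id -un)}" || true   # priority factors of pending jobs
else
    fairshare_info="Slurm client tools not available on this host"
fi
echo "$fairshare_info"
```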
## Example Job submission scripts

## Slurm FAQ

### Is there a way to increase the maximum time of a job while it is running?
Jobs can in general be modified by the `scontrol update` command. A job that is still in the queue can be updated, e.g. with:

```
scontrol update jobid=7798 TimeLimit=48:00:00
scontrol update jobid=7798 partition=long
```
But as soon as the job is running, only an admin user is allowed to change its settings. The reasoning for this is easily explained: the maximal runtime of a job is used for fitting the job into "holes" within the scheduling plan. So, if users were allowed to extend the runtime, they could submit jobs with super-short time limits and, once they are running, increase the time.