
Slurm Batch system usage

Simple job submission and the concept of queues (partitions)

On the Tier-3 you run jobs by submitting them to the Slurm job scheduler. The cluster's worker nodes are only accessible through this batch system.

The user interface nodes t3ui0* provide an environment to compose, test, and submit batch jobs to Slurm.

Jobs are submitted to job queues, which are called partitions in Slurm. We provide the following partitions (often also called batch queues), which differ in maximal runtime, features, and available worker nodes.

You can get the available queues, their properties, and their nodes by running the following command on one of the UI nodes:

sinfo -o "%.12P %.16F %.16C %.14l %.16L %.20G %N"

This table captures a snapshot of that state:

PARTITION NODES(A/I/O/T) CPUS(A/I/O/T) TIMELIMIT DEFAULTTIME GRES NODELIST
short 5/8/1/14 81/3119/128/3328 1:00:00 45:00 (null) t3wn[72-73,80-91]
standard* 5/8/1/14 81/3119/128/3328 12:00:00 12:00:00 (null) t3wn[72-73,80-91]
long 5/8/1/14 81/3119/128/3328 7-00:00:00 1-00:00:00 (null) t3wn[72-73,80-91]
qgpu 2/0/0/2 32/48/0/80 1:00:00 30:00 gpu:geforce_gtx_1080 t3gpu[01-02]
gpu 2/0/0/2 32/48/0/80 7-00:00:00 1-00:00:00 gpu:geforce_gtx_1080 t3gpu[01-02]

Explanations:

  • A/I/O/T: abbreviations for Allocated/Idle/Other/Total.
  • Timelimit: maximum runtime, in the format d-hh:mm:ss.
  • GRES: generic resources of the nodes; this is where GPUs are listed.

Submitting a batch job

To launch a batch job, you first prepare a batch script: a normal shell script that launches your executable and provides some additional information to the Slurm scheduler.

You then use the sbatch command to submit it to Slurm. You also need to provide an account name (t3 for the multicore nodes, gpu_gres for the GPU partitions). Here we submit to the short multicore queue and therefore use the account t3:

sbatch -p short --account=t3 my-script.sh

To submit GPU jobs to the GPU queues, you also need to specify the gpu_gres account:

sbatch -p qgpu --account=gpu_gres my-gpu-script.sh

The sbatch command supports many additional configuration options (refer to its man page); for example, you may want to specify the memory requirement of your job in MB:

sbatch -p standard --account=t3 --mem=3000 job.py

Instead of passing all these options on the command line, you can also set them inside the batch script itself (usually in its header) by prefixing each option line with the special #SBATCH comment, like this:

#!/bin/bash
# Example batch script with Slurm options in the header
#SBATCH --mem=3000
#SBATCH --account=t3
#SBATCH --time=04:00:00
#SBATCH --partition=standard

# now start our executable
myexecutable

We provide a list of useful Slurm commands to check jobs and the status of nodes.
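A few of those commands, with the options we use most often, as a sketch (see each command's man page for the full set of options; job id 7798 is just an example):

```shell
# list your own queued and running jobs
squeue -u $USER

# show the full record of a single job (state, time limit, node list, ...)
scontrol show job 7798

# cancel one job, or all of your jobs in a given partition
scancel 7798
scancel -u $USER -p short

# accounting data of finished jobs (requires Slurm accounting to be enabled)
sacct -u $USER --starttime=today
```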

The detailed slurm configuration can be examined on any Slurm node by listing the configuration file /etc/slurm/slurm.conf.

Storage best practices within a batch job

You should try to do all intensive I/O on the local storage of the node, which is available under /scratch. A typical strategy is:

  • For intensive read I/O, prefer reading input files from our SE (dCache, under /pnfs).
  • Reading files from remote sites may work for low-intensity I/O, but it is better to either bring them to our SE first (e.g. via rucio) or download the whole file to /scratch at the beginning of the job.
    • Explanation: if you know that your application will read the whole file, it is more efficient to transfer it in one go with xrdcp or gfal-copy to /scratch (or to our SE using rucio) than to read it piecewise in many small packets from a remote site.
  • Create result files in /scratch and move them to the SE (/pnfs) only at the end of your job.
  • /work area: you can use this area for mid-intensity I/O. It has the advantage that it is shared between all nodes, while /scratch is local to each node. But be aware that /work is a central NFS share; if too many jobs use it intensively, it may get overloaded and block, affecting all users.
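The stage-in/stage-out pattern above can be sketched as follows. The remote source URL and the SE destination are placeholders (assumptions), so substitute the actual dCache door and the real paths for your files:

```shell
#!/bin/bash
JOB_SCRATCH=/scratch/$USER/$SLURM_JOB_ID
mkdir -p "$JOB_SCRATCH"

# stage in: transfer the whole input file in one go instead of issuing
# many small remote reads during the job (placeholder source URL)
xrdcp root://remote-site.example//store/user/input.root "$JOB_SCRATCH/input.root"

# ... run your application on the local copy in $JOB_SCRATCH ...

# stage out: results are written to /scratch first and moved to the SE
# only at the end of the job (placeholder destination)
gfal-copy "file://$JOB_SCRATCH/result.root" "davs://se.example//pnfs/result.root"

rm -rf "$JOB_SCRATCH"
```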

When using /scratch, it is best to create a user- and job-specific directory, as in this example:

JOB_SCRATCH=/scratch/$USER/$SLURM_JOB_ID
mkdir -p "$JOB_SCRATCH"
export TMPDIR="$JOB_SCRATCH"

#########################################################
# Here comes your code                        
#########################################################

# clean up the temporary working directory after the job has completed:
rm -rf "$JOB_SCRATCH"
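A slightly more defensive variant of the same snippet (a sketch, not a site requirement) uses a shell trap, so the scratch directory is removed even if your code exits early or the job is cancelled. Outside a batch job, where SLURM_JOB_ID is unset, it falls back to a /tmp path purely so it can be tried interactively:

```shell
#!/bin/bash
# inside a job: /scratch/$USER/$SLURM_JOB_ID; interactively: a PID-based
# fallback under /tmp (an assumption for testing outside the cluster)
JOB_SCRATCH="${SLURM_JOB_ID:+/scratch/$USER/$SLURM_JOB_ID}"
JOB_SCRATCH="${JOB_SCRATCH:-/tmp/$USER-$$}"
mkdir -p "$JOB_SCRATCH"
export TMPDIR="$JOB_SCRATCH"

# remove the scratch directory on any exit path (normal end, error, or
# scancel, which sends SIGTERM before killing the job)
trap 'rm -rf "$JOB_SCRATCH"' EXIT TERM

# ... your payload here; temporary files of compliant tools land in $TMPDIR ...
touch "$TMPDIR/example.tmp"
```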

IMPORTANT

We also set TMPDIR, a variable that instructs compliant applications to use that location for temporary files. If you do not set this variable, your jobs may overfill /tmp, which is also needed by system programs, and this can cause the node to fail.

Job priorities and fair share

Slurm regularly calculates priorities of queued jobs. The job with the highest priority is the next job to be run when there is a free slot. The goal is to provide a system that fills the resources efficiently but is also fair. The priority calculation takes into account a number of factors, among them:

  • FairShare: based on the user's past cluster usage, weighted by a decay function.
  • Job age: the time the job has been waiting in the queue; priority increases with waiting time.
  • Job size: the size of the resource request (CPUs, memory).
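To see how these factors combine for your own pending jobs, the sprio command prints the computed priority together with its per-factor breakdown:

```shell
# per-factor priority breakdown of your pending jobs
sprio -l -u $USER
```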

Example Job submission scripts

GPU Example

CPU Example

Slurm FAQ

Is there a way to increase the maximum time of a job while it is running?

Jobs can in general be modified with the scontrol update command. A job that is still in the queue can be updated, e.g. with:

scontrol update jobid=7798 TimeLimit=48:00:00
scontrol update jobid=7798 partition=long

But as soon as the job is running, only an admin user is allowed to change its settings. The reasoning is easily explained: the maximum runtime of a job is used to fit it into "holes" in the scheduling plan. If users were allowed to extend the runtime, they could submit jobs with super-short time limits and then, once the jobs are running, increase the time.