Slurm Workload Manager (Fimm)


Overview

Slurm is an open-source workload manager designed for Linux clusters of all sizes. It provides three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (typically a parallel job) on a set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.

Commands

sinfo - reports the state of partitions and nodes managed by SLURM.

squeue - reports the state of jobs or job steps.

scontrol - views and modifies Slurm configuration and state; for example, scontrol show partition lists the available partitions.

sbatch - submits a job script for later execution.

scancel - cancels a pending or running job or job step.

srun - submits a job for execution or initiates job steps in real time.

For more information about any of these commands, please check its man page:

man <command>
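
A few typical invocations are shown below (the partition name hpc is only an example; it appears in the scontrol output further down):

sinfo                          # state of all partitions and nodes
sinfo -p hpc                   # state of a single partition
squeue -u $USER                # your own pending and running jobs
scontrol show partition hpc    # detailed settings of one partition
scontrol show job <jobid>      # detailed information about one job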

Job submission

Job status

Sequential job submission

#!/bin/bash
#SBATCH --nodes=1                 # run on a single node
#SBATCH --ntasks=1                # run a single task
#SBATCH --mem-per-cpu=1G          # memory per allocated CPU
#SBATCH --time=30:00              # walltime limit; the default is 15 minutes
#SBATCH --output=my.stdout        # stdout (and stderr) are written to this file
#SBATCH --mail-user=saerda@uib.no
#SBATCH --mail-type=ALL           # send mail on job begin, end and failure
#SBATCH --job-name="slurm_job"
#
# Put the commands for executing the job below this line
#
sleep 30
hostname


Interpreting scontrol show job information

[saerda@login3 SLURM_TEST]$ sbatch test_slurm.sh
Submitted batch job 13763010
[saerda@login3 SLURM_TEST]$ scontrol show job 13763010
JobId=13763010 JobName=slurm_test
  UserId=saerda(52569) GroupId=hpcadmin(1999)
  Priority=4294113377 Nice=0 Account=t1 QOS=normal
  JobState=RUNNING Reason=None Dependency=(null)
  Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
  RunTime=00:00:06 TimeLimit=00:30:00 TimeMin=N/A
  SubmitTime=2015-11-08T22:01:51 EligibleTime=2015-11-08T22:01:51
  StartTime=2015-11-08T22:01:51 EndTime=2015-11-08T22:31:51
  PreemptTime=None SuspendTime=None SecsPreSuspend=0
  Partition=hpc AllocNode:Sid=login3:10120
  ReqNodeList=(null) ExcNodeList=(null)
  NodeList=compute-3-7
  BatchHost=compute-3-7
  NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
  TRES=cpu=1,mem=1024,node=1
  Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
  MinCPUsNode=1 MinMemoryCPU=1024M MinTmpDiskNode=0
  Features=(null) Gres=(null) Reservation=(null)
  Shared=OK Contiguous=0 Licenses=(null) Network=(null)
  Command=/fimm/home/vd/saerda/SLURM_TEST/test_slurm.sh
  WorkDir=/fimm/home/vd/saerda/SLURM_TEST
  StdErr=/fimm/home/vd/saerda/SLURM_TEST/my.stdout
  StdIn=/dev/null
  StdOut=/fimm/home/vd/saerda/SLURM_TEST/my.stdout
  Power= SICP=0
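
The most useful fields in this output are JobState (together with Reason, which explains why a job is still pending), RunTime and TimeLimit, NodeList and BatchHost (where the job runs), TRES (the allocated CPUs, memory and nodes), and StdOut/StdErr (where the job's output is written). Note that StdErr points to the same file as StdOut here because the script only set --output.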


MPI job submission

Take a simple "Hello World" MPI test program written in C and save it as wiki_mpi_example.c.
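
If you do not have such a program at hand, a minimal sketch could look like the following (an illustration only, not necessarily the exact program originally linked from this page):

/* wiki_mpi_example.c - minimal MPI "Hello World" */
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank, size;
    char hostname[256];

    MPI_Init(&argc, &argv);                  /* start MPI */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* this task's rank */
    MPI_Comm_size(MPI_COMM_WORLD, &size);    /* total number of tasks */
    gethostname(hostname, sizeof(hostname)); /* node this task runs on */

    printf("Hello world from rank %d of %d on %s\n", rank, size, hostname);

    MPI_Finalize();
    return 0;
}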

Compile it as:

module load openmpi
mpicc wiki_mpi_example.c -o hello_world_wiki.mpi

Then submit it with a batch script along these lines:

#!/bin/bash
# CPU accounting is not enforced currently
#SBATCH -N 2                 # request two nodes
# use --exclusive to get whole nodes exclusively for this job
#SBATCH --exclusive
#SBATCH --time=01:00:00
#SBATCH -c 2                 # two CPUs per task

srun -n 10 ./hello_world_wiki.mpi
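
Submit the script with sbatch as before. Since the script does not set --output, both stdout and stderr end up in slurm-<jobid>.out in the directory you submitted from (mpi_job.sh below is just a placeholder for whatever you named the script):

sbatch mpi_job.sh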

Idle queue

To make efficient use of the computing resources we have set up a special "idle" queue which includes all compute nodes, including those normally dedicated to specific groups.

Jobs submitted to the "idle" queue will be able to run on dedicated nodes if they are free.

Important: if the dedicated nodes are needed by the groups that own them (i.e. the owners submit a job to those nodes), the "idle" queue jobs running on those nodes will be killed and re-queued to run at a later time.

The "idle" queue is accessible to everyone who has an account on fimm.bccs.uib.no.

The "idle" queue gives you access to the following extra resources:

Number of nodes   CPU type                                             Cores per node   Memory per node
2                 Quad-Core Intel(R) Xeon(R) CPU E5420 @ 2.50GHz       8                32GB
30                Quad-Core Intel(R) Xeon(R) CPU E5420 @ 2.50GHz       8                16GB
32                Quad-Core Intel(R) Xeon(R) CPU L5430 @ 2.66GHz       8                16GB
12                Six-Core AMD Opteron(tm) Processor 2431              12               32GB
21                Quad-Core Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz   32               128GB

The best situations in which to use the "idle" queue are:

  • The "default" queue is fully utilized and special queues are free.
  • You have short jobs with large resource requirements.
  • Your jobs can be re-run without manual intervention. If they cannot, submit them with the "#SBATCH --no-requeue" option so that they are not automatically re-queued.

Please keep in mind that when you submit your job to the "idle" queue it is not guaranteed to finish successfully, since the owners of the hardware can take the resources back at any time by submitting jobs to their specific queues.
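
As a sketch, a batch script for the "idle" queue could look like the following. It assumes the queue is exposed as a Slurm partition named idle (check scontrol show partition for the exact name on Fimm), and my_program is a placeholder for your own re-runnable command; --requeue lets Slurm put the job back in the queue if it is preempted by the owning group.

#!/bin/bash
#SBATCH --partition=idle      # assumed name of the "idle" queue
#SBATCH --ntasks=1
#SBATCH --mem-per-cpu=1G
#SBATCH --time=02:00:00
#SBATCH --requeue             # allow the job to be re-queued after preemption
#SBATCH --job-name="idle_test"

./my_program                  # placeholder for a re-runnable command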

Job Array

In many cases a computational analysis job consists of a number of similar, independent subtasks: the user may have several datasets that are analysed in the same way, or the same simulation code is run with a number of different parameters. These kinds of tasks are often called "embarrassingly parallel" jobs, since in principle they can be distributed over as many processors as there are subtasks. On Fimm such tasks can be run efficiently using the job array function of the Slurm batch system.

#!/bin/sh
#SBATCH --array=0-31
#SBATCH --time=03:15:00          # Run time in hh:mm:ss
#SBATCH --mem-per-cpu=1024       # Minimum memory required per CPU (in megabytes)
#SBATCH --job-name=hello-world
#SBATCH --error=job.%J.out
#SBATCH --output=job.%J.out
 
 

echo "I am task $SLURM_ARRAY_TASK_ID on node `hostname`"
sleep 60
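
The array index is typically used to select a different input for each subtask, for example (my_analysis and the input files are hypothetical):

# each array task processes its own input file
./my_analysis input_${SLURM_ARRAY_TASK_ID}.txt > result_${SLURM_ARRAY_TASK_ID}.txt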

Job dependency