Job execution (Hexagon)
Batch system
To ensure a fair use of the clusters, all users have to run their computations via the batch system. A batch system is a program that manages the queuing, scheduling, starting and stopping of jobs/programs users run on the cluster. Usually it is divided into a resource-manager part and a scheduler part. To start jobs, users specify to the batch system which executable(s) they want to run, the amount of processors and memory needed, and the maximum amount of time the execution should take.
Hexagon uses "SLURM" as the resource manager. In addition hexagon uses aprun to execute jobs on the compute nodes, independent of the job being a MPI job or sequential job. The user therefore has to make sure to call "aprun ./executable", or "srun ./excutable"and not just the executable if it is to run on the compute part instead of the login-node part of the Cray.
Node configuration
Each Hexagon node has the following configuration:
- 2 x 16-core Interlagos CPUs
- 32 GB of RAM
For core to memory allocation please refer to the following illustration:
NOTE: Only one job can run on each node (nodes are dedicated). Therefore, for better node utilization, please specify as few limitations as possible in the job and leave the rest to be decided by the batch system.
Batch job submission
There are essentially two ways to execute jobs via the batch system.
- Interactive. The batch system allocates the requested resources or waits until these are available. Once the resources are allocated, interaction with these resources and your application is via the command-line and very similar to what you normally would do on your local (Linux) desktop. Note that you will be charged for the entire time your interactive session is open, not just during the time your application is running.
- Batch. One writes a job script that specifies the required resources and executables and arguments. This script is then given to the batch system that will then schedule this job and start it as soon as the resources are available.
Running jobs in batch is the more common way on a compute cluster. Here, one can e.g. log off and log on again later to see what the status of a job is. We recommend running jobs in batch mode.
Create a job (scripts)
Jobs are normally submitted to the batch system via shell scripts, and are often called job scripts or batch scripts. Lines in the scripts that start with #SBATCH are interpreted by SLURM as instructions for the batch system. (Please note that these lines are interpreted as comments when the script is run in the shell, so there is no magic here: a batch script is a shell script.)
Scripts can be created in any text editor, e.g. vim or emacs.
A job script should start with an interpreter line, such as:
#!/bin/bash
Next it should contain directives to the queue system, at least the execution time and the number of CPUs requested:
#SBATCH --time=30:00
#SBATCH --ntasks=64
The rest is regular shell commands. Please note: all #SBATCH directives are ignored after the first regular command.
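For example, in the following sketch the last directive is silently ignored, because it appears after a regular command:
#!/bin/bash
#SBATCH --time=30:00      # parsed by SLURM
cd /work/$USER            # first regular command
#SBATCH --ntasks=64       # ignored: appears after a regular command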
All commands written in the script will be executed on the login node. This is important to remember for several reasons:
- Commands like gzip/bzip2, or even cp for many files, can create a heavy load on the CPU and network interface. This will result in low or unstable performance for such operations.
- Overuse of memory or CPU resources on the login node can crash it. This means all jobs (from all users) which were started from that login node will crash.
With this in mind, all IO/CPU-intensive tasks should be prefixed with the aprun command. aprun will execute the command on the compute nodes, resulting in higher performance. Note that this should also reduce the charge for the job, since the total time the script is running should be less (charging does not take into account whether the compute nodes are actually used during the time the script is run).
Real computational tasks (the main program) should of course be prefixed with aprun as well.
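For instance, a compression step can be moved to a compute node like this (the file name is only illustrative):
aprun -n 1 gzip /work/$USER/large_output.dat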
You can find examples below.
Manage a job (submission, monitoring, suspend/resume, canceling)
Please find below the most important batch system job management commands:
To submit a job use the sbatch command.
sbatch job.sh # submit the script job.sh
Queues and priorities are chosen automatically by the system. The command sbatch returns a job identifier (number) that can be used to monitor the status of the job (in the queue and during execution). This number may also be requested by the support staff.
sinfo - reports the state of partitions and nodes.
squeue - reports the state of jobs or job steps.
sbatch - submits a job script for later execution.
scancel - cancels a pending or running job or job step.
srun - submits a job for execution or initiates job steps in real time.
apstat - provides status information for Cray XT system applications.
xtnodestat - shows information about compute and service partition processors and the jobs running in each partition.
For more information regarding the SLURM commands, please check the man pages.
General commands
Get documentation on a command:
man sbatch
man squeue
Information on jobs
List all current jobs for a user:
squeue -u <username>
List all running jobs for a user:
squeue -u <username> -t RUNNING
List all pending jobs for a user:
squeue -u <username> -t PENDING
List detailed information for a job:
scontrol show jobid <jobid> -dd
List status info for a currently running job:
sstat --format=AveCPU,AvePages,AveRSS,AveVMSize,JobID -j <jobid> --allsteps
When your job is completed you get more information, including run time, memory used, etc. To get statistics on completed jobs:
sacct -j <jobid> --format=JobID,JobName,MaxRSS,Elapsed
To view the same information for all jobs of a user:
sacct -u <username> --format=JobID,JobName,MaxRSS,Elapsed
Controlling jobs
To cancel one job:
scancel <jobid>
To temporarily hold a job:
scontrol hold <jobid>
Then you can release it with:
scontrol release <jobid>
List of useful job script parameters
-A - a job script must specify a valid project name for accounting, otherwise it will not be possible to submit jobs to the batch system.
-t - a job script must specify the time the job needs, otherwise it will run with the default limit of 15 minutes.
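Put together, a minimal set of directives could look like this (the project name vd is taken from the other examples on this page):
#!/bin/bash
#SBATCH -A vd          # project/account to be charged
#SBATCH -t 01:00:00    # requested wall-clock time (hh:mm:ss)
#SBATCH -n 1           # number of cores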
For additional sbatch switches please refer to:
man sbatch
srun and aprun
On Hexagon the user has two different ways of starting executables in a SLURM script. The user can start executables with srun, which requires fine-tuning the --ntasks and --ntasks-per-node parameters in the sbatch script, for example:
#!/bin/bash
#SBATCH --comment="MPI"
#SBATCH --ntasks=21
#SBATCH --ntasks-per-node=7
cd /work/user/jack
srun ./mpitest
Alternatively, the user can start executables with aprun. In that case only the total number of cores has to be specified with --ntasks, and aprun can then be used to fine-tune the rest of the parameters:
#!/bin/bash
#SBATCH --account=vd
#SBATCH --time=00:10:00
#SBATCH --comment="MPI"
#SBATCH --ntasks=21
cd /work/users/jack
aprun -n3 -N1 ./mpitest
APRUN arguments
The resources you requested in PBS have to match the arguments for aprun. So if you ask for "#PBS -l mppmem=900mb" you will need to add the argument "-m 900M" to aprun, or use the new "-B" option.
| Option | Description | Note |
|---|---|---|
| -B | use parameters from the batch system (mppwidth, mppnppn, mppmem, mppdepth) | |
| -N | processors per node | should be equal to the value of mppnppn |
| -n | processing elements | should be equal to the value of mppwidth |
| -d | number of threads | should be equal to the value of mppdepth |
| -m | memory per element (with suffix) | should be equal to the amount of memory requested by mppmem; the suffix should be M |
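For example, with the legacy PBS-style directives the mapping could look like this (the values are only illustrative):
#PBS -l mppwidth=64,mppnppn=32,mppdepth=1,mppmem=900mb
aprun -n 64 -N 32 -d 1 -m 900M ./program
Alternatively, "aprun -B ./program" picks these values up from the batch system automatically.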
A complete list of aprun arguments can be found on the man page of aprun.
List of classes/queues, incl. short description and limitations
Hexagon uses a default batch queue named "hpc". It is a routing queue which, based on job attributes, forwards jobs to the debug, small or normal queues. Therefore there is no need to specify any execution queue in the sbatch script.
Please keep in mind that we have priority-based job scheduling. This means that, based on the requested amount of CPUs and time, as well as previous usage history, jobs will get a higher or lower priority in the queue. Please find a more detailed explanation in Job execution (Hexagon)#Scheduling policy on the machine.
Relevant examples
We illustrate the use of sbatch job scripts and submission with a few examples.
Sequential jobs
To use 1 processor (CPU) for at most 60 hours wall-clock time and all memory on 1 node (32000mb), the job script must contain the lines:
#SBATCH --ntasks=1
#SBATCH --time=60:00:00
Please note that Stallo and Titan are much better suited for sequential jobs. Hexagon should therefore only be used for parallel jobs.
Below is a complete example of a job script for executing a sequential job.
#!/bin/bash #SBATCH -J "seqjob" #Give the job a name (optional) #SBATCH -A vd #Specify the project the job should be accounted on (obligatory) #SBATCH -n 1 # one core needed #SBATCH -t 60:00:00 #run time in hh:mm:ss #SBATCH --mail-type=ALL # Mail events (NONE, BEGIN, END, FAIL, ALL) #SBATCH --mail-user=forexample@uib.no # email to user #SBATCH --output=serial_test_%j.out #Standard output and error log cd /work/$USER/ # Make sure I am in the correct directory aprun -B ./program
Parallel/MPI jobs
To use 512 CPUs (cores) for at most 60 hours wall-clock time, below is an example:
#!/bin/bash #SBATCH -J "mpijob" # Give the job a name #SBATCH -n 512 # we need 512 cores #SBATCH -t 60:00:00 # time needed
- SBATCH --mail-type=ALL # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=forexample@uib.no # email to user #SBATCH --output=serial_test_%j.out #Standard output and error log cd /work/$USER/ aprun -B ./program
Please refer to the IMPORTANT statements in the Job execution (Hexagon)#Parallel/OpenMP jobs paragraph.
Creating dependencies between jobs
By default, basic single step job dependencies are supported through completed/failed step evaluation. Basic dependency support does not require special configuration and is activated by default. For TORQUE's qsub and the Moab msub command, the semantics listed in the table below can be used with the -W depend=<DEPENDENCY>:<STRING> flag.
Job Dependency Syntax:
| Dependency | Format | Description |
|---|---|---|
| after | after:<job>[:<job>]... | The job may start at any time after the specified jobs have started execution. |
| afterany | afterany:<job>[:<job>]... | The job may start at any time after all the specified jobs have completed regardless of completion status. |
| afterok | afterok:<job>[:<job>]... | The job may start at any time after all the specified jobs have successfully completed. |
| afternotok | afternotok:<job>[:<job>]... | The job may start at any time after all the specified jobs have completed unsuccessfully. |
| before | before:<job>[:<job>]... | The job may start at any time before the specified jobs have started execution. |
| beforeany | beforeany:<job>[:<job>]... | The job may start at any time before all the specified jobs have completed regardless of completion status. |
| beforeok | beforeok:<job>[:<job>]... | The job may start at any time before all the specified jobs have successfully completed. |
| beforenotok | beforenotok:<job>[:<job>]... | The job may start at any time before any of the specified jobs have completed unsuccessfully. |
| on | on:<count> | The job may start after <count> dependencies on other jobs have been satisfied. |
where <job>={jobname|jobid}
Any of the dependencies containing "before" must be used in conjunction with the "on" dependency. So, if job A must run before job B, job B must be submitted with depend=on:1, as well as job A having depend=before:B. This means job B cannot run until one dependency of another job on job B has been fulfilled. This prevents job B from running until job A can be successfully submitted.
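As a sketch (the job script names jobA.pbs and jobB.pbs are hypothetical), submitting a job so that it only starts after another job has completed successfully could look like this:
FIRST=$(qsub jobA.pbs)
qsub -W depend=afterok:$FIRST jobB.pbs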
Combining multiple tasks in a single job
In some cases it is preferable to combine several aprun calls inside one batch script. This can be useful in the following cases:
- Several executions must be started one after the other with the same number of CPUs (a better use of resources for this can be to use dependencies in your qsub script).
- The runtime of each aprun is shorter than e.g. one minute. By combining several of these short tasks you avoid the job spending more time waiting in the queue and starting up than actually executing.
It should be written like this in the script:
aprun -B ./cmd args
aprun -B ./cmd args
...
Interactive job submission
PBS allows the user to use compute nodes interactively. Interactive jobs are typically used for:
- testing batch scripts
- debugging applications/code
- code development that involves e.g. a sequence of short test jobs
A job run with the interactive option will run normally, but stdout and stderr will be connected directly to the user's terminal. This also allows stdin from the user to be sent directly to the application.
To request one processor, 1000MB memory and reserve it for 2 hours, use the command:
qsub -l mppwidth=1,walltime=2:00:00,mppmem=1000mb -A replace_with_correct_cpuaccount -I
To request 2 CPUs, with 1000mb per process/cpu and reserve them for 2 hours, use the command:
qsub -l mppwidth=2,walltime=2:00:00,mppmem=1000mb -A replace_with_correct_cpuaccount -I
Where, replace_with_correct_cpuaccount must be replaced with your project name for accounting. The option that specifies the interactive use is -I.
You can use it with scripts as well:
qsub -I ~/myscript
Note that you will be charged for the full time this job allocates the CPUs/nodes, even if you are not actively using these resources. Therefore, exit the job (shell) as soon as the interactive work is done. To launch your program on the compute node, go to /work/$USER and then you HAVE to use "aprun". If "aprun" is omitted, the program is executed on the login node, which in the worst case can crash the login node. Since /home is not mounted on the compute nodes, the job has to be started from /work/$USER.
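A minimal sketch of the workflow inside the interactive shell:
cd /work/$USER
aprun -n 1 ./program     # run on a compute node, not on the login node
exit                     # end the interactive session as soon as the work is done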
General job limitations
The default values if not specified are 1 CPU with 1000mb of memory for 60 minutes.
Maximum amount of resources per user:
- 4096 cpu cores, total number used by all running jobs
- 8-22 running (active) jobs (depending on load)
- 2 idle jobs
Asking for an amount of resources greater than specified above will result in the jobs being blocked. You can see this by the fact that the job is in the "showq -b" output. You can also use the "checkjob jobnumber" command to get the status and reason for any blocking.
Default CPU and job maximums may be changed by sending an application to Support.
Recommended environment variable settings
All regular shell recommended environment variables are loaded automatically. The exception is if your default shell is tcsh and your job script has the header #!/bin/bash or #!/bin/sh; in that case you have to add to the job script:
#PBS -S /bin/bash
If your job script is in tcsh you don't need to apply the procedure above.
Please avoid using PBS -S in all other situations; using this directive can lead to issues with the environment variables, especially with /bin/ksh.
Sometimes there can be problems with the proper export of module functions. If you get "module: command not found", try to add the following to your job script:
export -f module
If you still cannot get the module functions in your job script, try to add this:
#PBS -V
source ${MODULESHOME}/init/REPLACE_WITH_YOUR_SHELL_NAME
# ksh example:
# source ${MODULESHOME}/init/ksh
MPI on Hexagon is highly tunable. Sometimes you can receive messages that some MPI variables have to be adjusted. In this case just add the recommended export line into your job script on a line before the aprun command. Normally the recommended messages are quite verbose. For example (bash syntax):
export MPICH_UNEX_BUFFER_SIZE=90000000
Redirect the output of the running application to the /work file system. See: Data (Hexagon)#Disk quota and accounting:
aprun .... >& /work/$USER/combined.out
aprun .... >/work/$USER/app.out 2>/work/$USER/app.err
aprun .... >/work/$USER/app.out 2>/dev/null
Scheduling policy on the machine
The scheduler on Hexagon has a fairshare setup in place. This ensures that all users get adjusted priorities, based on initial and historical data from running jobs. Please check Queue priorities (Hexagon) for a better understanding of the queuing system on Hexagon.
Types of jobs that are prioritized
Jobs with a high number of cores are always prioritized.
Exceptions are short, small and debugging jobs during working days and working hours:
- up to 128 cores and walltime up to 20 minutes
- up to 512 cores and walltime up to 1 hour
These types of jobs will then get a higher priority.
Types of jobs that are discouraged
All types of serial or sequential jobs. Hexagon is optimized for massively parallel MPI jobs.
If you have serial or sequential code to run, please consider applying for other NOTUR resources at http://www.notur.no.
Types of jobs that are not allowed (will be rejected or never start)
The following types of jobs will never start:
- The CPU account requested does not have enough resources (CPU-hour credits).
- The user has no access to the specified CPU account.
- The CPU account requested doesn't exist on Hexagon.
- The memory requested per node is higher than memory available on high-mem node, see Job execution (Hexagon)#Use of large memory nodes.
- The total memory requested is more than available, see Job execution (Hexagon)#Use of large memory nodes.
- Due to other limits, e.g. total number of cores requested, see Job execution (Hexagon)#General job limitations.
CPU-hour quota and accounting
To execute jobs on the supercomputer facilities one needs a user account plus password. People working at UiB, Uni Research AS or IMR can apply via this link. Others at http://www.notur.no.
Each user account is connected to at least one project (account). Each project has allocated a number of CPU hours (or quota). CPU-hour usage is defined as the elapsed (wall-clock) time of the user's job multiplied by the number of processors that is used. The quota is the maximum number of CPU hours that all the users connected to this project together can consume. After the quota is exhausted, it is no longer possible to submit jobs and one needs to apply for additional quota first.
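For example, a job that uses 512 cores for 10 hours of wall-clock time consumes 512 x 10 = 5120 CPU hours from the project quota.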
How to list quota and usage per user and per project
Before using these commands please load the "mam" module (module load mam).
You can use the command gstatement to generate a usage report per project:
gstatement -h -a my_projectname
A report for only a specified period can be generated as well (-s - start time, -e - end time):
gstatement -h -a my_projectname -s 2010-01-01 -e 2010-02-01
To display project usage per user use:
glsusage -h -a my_projectname
Example with time-period limit:
glsusage -h -a my_projectname -s 2010-01-01 -e 2010-02-01
You can see your available cpuaccounts and how much quota they have by using either the "cost" command or:
gbalance -h -u $USER
Cost command
The cost command will show the used CPU hours for all accounts connected to the current user in the current allocation period.
cost
To show total used CPU hours of your connected projects:
cost -p
To show usage for another NOTUR period add -P, e.g.:
cost -P 2010.1 -p
Other commands
| Command | Description |
|---|---|
| cost | will show CPU-hour accounting for the current user |
| gbalance | will show resources per project |
| gstatement | is used to generate statement for account activities |
| glsusage | will show charge totals for each user that completed jobs |
| greserve | is used to obtain a lien for usage |
For all commands mentioned above, the mam module should be loaded.
FAQ / trouble shooting
Please refer to our general FAQ (Hexagon)