Old job execution (Fimm)

Batch system

To ensure a fair use of the clusters, all users have to run their computations via the batch system. A batch system is a program that manages the queuing, scheduling, starting, and stopping of the jobs/programs users run on the cluster. Usually it is divided into a resource-manager part and a scheduler part. To start jobs, users specify to the batch system which executable(s) they want to run, the number of processors and amount of memory needed, and the maximum amount of time the execution should take.

Fimm uses "Torque" as the resource manager, the same as on the Hexagon cluster. For scheduling, Fimm uses "Maui", whereas Hexagon uses "Moab", the commercial version of Maui.

Batch job submission

There are essentially two ways to execute jobs via the batch system.

  • Interactive. The batch system allocates the requested resources or waits until these are available. Once the resources are allocated, interaction with these resources and your application is via the command-line and very similar to what you normally would do on your local (Linux) desktop. Note that you will be charged for the entire time your interactive session is open, not just during the time your application is running.
  • Batch. One writes a job script that specifies the required resources, executables, and arguments. This script is then handed to the batch system, which schedules the job and starts it as soon as the resources are available.

Running jobs in batch mode is the more common way of working on a compute cluster: you can, for example, log off and log back on later to check the status of a job. We recommend running jobs in batch mode.

Create a job (scripts)

Jobs are normally submitted to the batch system via shell scripts, often called job scripts or batch scripts. Lines in the scripts that start with #PBS are interpreted by Torque as instructions for the batch system. (Please note that these lines are interpreted as comments when the script is run in the shell, so there is no magic here: a batch script is a shell script.)

Scripts can be created in any text editor, e.g. vim or emacs.

A job script should start with an interpreter line, for example:

#!/bin/bash

Next, it should contain directives to the queue system, specifying at least the execution time and the number of CPUs requested:

#PBS -l walltime=01:00:00
#PBS -l nodes=1:ppn=1,mem=500mb

The rest consists of regular shell commands. Note that all commands written in the script will be executed on the login node. This is important to remember for several reasons:

  1. Commands like gzip/bzip2, or even cp for many files, can create a heavy load on the CPU and the network interface. This will result in low or unstable performance for such operations.
  2. Overuse of memory or CPU resources on the login node can crash it. This means that all jobs (from all users) which were started from that login node will crash.

With this in mind, all IO/CPU-intensive tasks should be prefixed with the aprun command. aprun executes the command on the compute nodes, resulting in higher performance. Note that this should also reduce the amount charged for the job, since the total time the script runs should be shorter (charging does not take into account whether the compute nodes are actually used while the script is running).

Real computational tasks (the main program) should of course be prefixed with aprun as well.
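
For instance, a minimal sketch of such a script, with placeholder paths and program names, could look like this:

#!/bin/bash
#PBS -l walltime=01:00:00
#PBS -l nodes=1:ppn=1,mem=500mb
cd /work/$USER/mywork             # placeholder work directory
aprun gzip large_output.dat       # IO/CPU-intensive task, executed on a compute node
aprun ./program                   # the main program, also executed on a compute node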

You can find more complete examples below.

Manage a job (submission, monitoring, suspend/resume, canceling)

Please find below the most important batch system job management commands:

To submit a job use the qsub command.

qsub job.pbs   # submit the script job.pbs

Queues and priorities are chosen automatically by the system. The command qsub returns a job identifier (number) that can be used to monitor the status of the job (in the queue and during execution). This number may also be requested by the support staff.
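
For example, if qsub returns the identifier 123456.fimm (only illustrative; the exact form may differ), the job can be followed with the monitoring commands below, e.g.:

qstat -f 123456   # full status of that specific job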

To monitor job status, use the "qstat" command.

qstat            # display a listing of jobs
qstat -a         # display all jobs in alternative format
qstat -f         # display full status of all jobs (long output)
qstat -f <jobid> # display full status of a specific job

To cancel a job, use the "qdel" command.

qdel <jobid>     # delete a specific batch job

To display the actual job ordering for the scheduler, separated into three lists (active, eligible, and blocked jobs):

showq            # display jobs
showq -u $USER   # display jobs for $USER
showq -i         # display only jobs in the eligible (idle) queue waiting for execution

To display detailed job state information, use:

checkjob <jobid> # display status for job

List of useful commands, incl. short description

Here is a list of the most important commands in tabular form (manual pages are recommended):

Command            Purpose
qsub               Submit a job
qdel               Cancel a job
qdelmine           Delete all of your own jobs
qstat              Get job status
qstat -Q           Get available queues
qstat -Q -f        Show queue information
qstat -B -f        Show PBS server status
qhold              Temporarily stop (hold) a job; checkpoints the job if checkpointing is enabled
qrls               Resume a held job; restarts it from the checkpoint if one exists
qcat <jobid>       Display the standard output or the standard error of a running job
showq              Display the job ordering of the scheduler
showq -u $USER     Display the job ordering for $USER
showstart <jobid>  Display the estimated start time of a job
checkjob <jobid>   Display the status of a job

List of useful job script parameters

-A : a job script must specify a valid project name for accounting, otherwise it will not be possible to submit jobs to the batch system.

-l : resources are specified with the -l option (lowercase L). There are a number of resources that can be specified; see the example above for the correct syntax. Jobs must specify the number of processors (CPUs) and the maximum allowed wall-clock time for execution. Make sure that you specify a correct amount of memory, or you will risk crashing the node for lack of memory. Note that mppmem=XXXmb is a per-process amount and that the nodes have 4000 or 8000mb in total (not 4096 and 8192). You can find all attributes and their descriptions in:

man pbs_resources_linux

Below are the most important attributes:


-l walltime : the maximum allowed wall-clock time for execution of the job. If the specified time is too short, the job will be killed before it completes.

-l mppmem : an upper limit for the memory usage per process for a job. An explanation of how to request more memory can be found here. If the memory requirement is exceeded, the job may get killed by the system.

-l mppdepth : the number of OpenMP threads per node. This must match the -d argument to aprun.

-o, -e : see the examples below. If these attributes are not used and thus filenames are not specified, the standard output and standard error from the job will be stored in the files mpijob.o## and mpijob.e##, where ## is the job number assigned by PBS when submitting the job.

-c enabled : enables the checkpoint feature for the job. When this option is specified, the job can be checkpointed during execution and switched to the hold state; later it can be released and execution will continue from the point where it was stopped (or after a machine/node crash). To use this option the application must be compiled with the checkpointing libraries. See Application development (Hexagon)#Checkpoint and restart of applications and man qsub for more info.

-c periodic,interval=120,depth=2 : enables periodic checkpoints for the job, with an interval of 2 hours (120 minutes), keeping only the two latest checkpoint images. See man qsub and Application development (Hexagon)#Checkpoint and restart of applications for more info.
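
As an illustration, a job script header combining several of the parameters above might look like the following sketch (the project name, file names and resource amounts are placeholders):

#PBS -A myproject                                    # project name for accounting (placeholder)
#PBS -l nodes=1:ppn=4,walltime=02:00:00,pmem=900mb   # CPUs, wall-clock time and memory per process
#PBS -o myjob.out                                    # file for standard output
#PBS -e myjob.err                                    # file for standard error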

For additional PBS switches please refer to:

man qsub

List of classes/queues, incl. short description and limitations

Fimm uses a default batch queue named "batch". It is a routing queue which, based on the job attributes, can forward jobs to the debug, small, or normal queues. Therefore there is no need to specify any execution queue in the PBS script.


List of queues:
Name        Description                                       Limitations
idle        default queue for all users                       for short jobs
t1          queue for normal jobs                             limited only by the general job limitations
kjem        queue for small jobs; jobs get higher priority    max 256 CPUs, max 1 hour walltime
nanobasic   queue for debugging; jobs get higher priority     max 16 CPUs, max 20 minutes walltime

NOTE: There is no need to specify a queue in the job script, the correct queue will automatically be selected.

Relevant examples

We illustrate the use of batch job scripts and submission with a few examples.

Sequential jobs

To use 1 processor (CPU) for at most 60 hours of wall-clock time and 900MB of memory, the PBS job script must contain the line:

#PBS -l nodes=1:ppn=1,walltime=60:00:00,mem=900mb

Please note that Fimm, Stallo and Titan are much better suited for sequential jobs than Hexagon; Hexagon should therefore only be used for parallel jobs.

Below is a complete example of a PBS script for executing a sequential job.

#! /bin/sh -
#
# Give the job a name (optional)
#PBS -N "seqjob"
#
# Specify the project the job should be accounted on (obligatory).
# The "cost" command will tell you which account you should use.
#PBS -A account_no
#
# The job needs at most 60 hours wall-clock time on 1 core on one node. (obligatory)
#PBS -l nodes=1:ppn=1,walltime=60:00:00
#
# The job needs at most 900mb of memory (obligatory)
#
#PBS -l mem=900mb
#
# Write the standard output of the job to file 'seqjob.out' (optional)
#PBS -o seqjob.out
#
# Write the standard error of the job to file 'seqjob.err' (optional)
#PBS -e seqjob.err
#
# Make sure I am in the correct directory
cd /work/janfrode/seqwork
# Invoke the (sequential!) executable
./program

Parallel/MPI jobs

To use 6 processors (CPUs) for at most 60 hours wall-clock time, the PBS job script must contain the line

#PBS -l nodes=3:ppn=2,walltime=60:00:00


Below is an example of a PBS script for the execution of an MPI job.

#! /bin/sh -
#
#  Make sure I use the correct shell.
#
#PBS -S /bin/sh
#
#  Give the job a name
#
#PBS -N "mpijob"
#
#  Specify the project the job belongs to
#
#PBS -A nn2117k
#
#  We want 60 hours on 6 cpu's:
#
#PBS -l walltime=60:00:00,nodes=3:ppn=2
#
#  The job needs 900 MB memory per process:
#PBS -l pmem=900mb
#
#  Send me an email on  a=abort, b=begin, e=end
#
#PBS -m abe
#
#  Use this email address (check that it is correct):
#PBS -M your@email.address.com
#
#  Write the standard output of the job to file 'mpijob.out' (optional)
#PBS -o mpijob.out
#
#  Write the standard error of the job to file 'mpijob.err' (optional)
#PBS -e mpijob.err
#
#  Make sure I am in the correct directory
mkdir -p /work/$USER/mpiwork
cd /work/$USER/mpiwork
# For fimm use:
/usr/bin/mpiexec ./program
# Return the exit status of mpiexec as the exit status of the job:
exit $?

Interactive job submission

PBS allows the user to use compute nodes interactively. Interactive jobs are typically used for:

  • testing batch scripts
  • debugging applications/code
  • code development that involves e.g. a sequence of short test jobs

A job run with the interactive option will run normally, but stdout and stderr will be connected directly to the user's terminal. This also allows stdin from the user to be sent directly to the application.

To request one processor, 1000MB memory and reserve it for 2 hours, use the command:

qsub -l nodes=1:ppn=1,walltime=2:00:00,pmem=1000mb -A replace_with_correct_cpuaccount -I

To request 2 CPUs, with 1000mb per process/cpu and reserve them for 2 hours, use the command:

qsub -l nodes=1:ppn=2,walltime=2:00:00,pmem=1000mb -A replace_with_correct_cpuaccount -I

Here, replace_with_correct_cpuaccount must be replaced with your project name for accounting. The option that requests interactive use is -I.

You can use it with scripts as well:

qsub -I ~/myscript

Note that you will be charged for the full time this job allocates the CPUs/nodes, even if you are not actively using these resources. Therefore, exit the job (shell) as soon as the interactive work is done. To launch your program on the compute nodes, go to /work/$USER and then you HAVE to use "aprun". If "aprun" is omitted, the program is executed on the login node, which in the worst case can crash the login node. Since /home is not mounted on the compute nodes, the job has to be started from /work/$USER.
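
A typical interactive session might therefore look like this sketch (the program name is a placeholder):

cd /work/$USER        # /home is not mounted on the compute nodes
aprun ./program       # run the program on the allocated compute node(s)
exit                  # leave the shell as soon as the work is done, to stop the charging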

General job limitations

The default values, if not specified, are 1 CPU with 500mb of memory for 15 minutes.

Maximum amount of resources per user, in the default queue:

  • 64 running (active) jobs (depending on load)
  • 10 Idle jobs
  • 400 blocked jobs

In the idle queue:

  • 512 running (active) jobs (depending on load)
  • 10 Idle jobs
  • 400 blocked jobs


Asking for an amount of resources greater than what is specified above will result in the jobs being blocked. You can see this by the fact that the job appears in the "showq -b" output. You can also use the "checkjob jobnumber" command to get the status and the reason for any blocking.
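
For example (the job number is only illustrative):

showq -b          # list blocked jobs
checkjob 123456   # show the status of, and the reason for blocking, job 123456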

Default CPU and job maximums may be changed by sending an application to Support.

Recommended environment variable settings

All regular recommended shell environment variables are loaded automatically. An exception is if your default shell is tcsh and your job script has the header #!/bin/bash or #!/bin/sh; in that case you have to add to the job script:

#PBS -S /bin/bash

If your job script is in tcsh you don't need to apply the procedure above.

Sometimes there can be problems with the proper export of the module function. If you get "module: command not found", try adding this to your job script:

export -f module

If you still cannot get the module function to work in your job script, try adding this:

#PBS -V
source ${MODULESHOME}/init/REPLACE_WITH_YOUR_SHELL_NAME
# ksh example:
# source ${MODULESHOME}/init/ksh

Scheduling policy on the machine

The scheduler on Hexagon has a fairshare setup in place. This ensures that all users get adjusted priorities, based on initial and historical data from running jobs.

Types of jobs that are not allowed (will be rejected or never start)

The following type of jobs will never start:

  • The requested CPU account does not have enough resources (CPU-hour credits).
  • The user has no access to the specified CPU account and queue.
  • The requested CPU account does not exist on Fimm.
  • The memory requested per node is higher than the memory available on Fimm.

CPU-hour quota and accounting

To execute jobs on the supercomputer facilities, one needs a user account and password. People working at UiB, Uni Research AS or IMR can apply via this link; others via http://www.notur.no.

Each user account is connected to at least one project (account). Each project has been allocated a number of CPU hours (a quota). CPU-hour usage is defined as the elapsed (wall-clock) time of the user's job multiplied by the number of processors used; for example, a job that runs for 10 hours on 6 processors consumes 60 CPU hours. The quota is the maximum number of CPU hours that all the users connected to the project together can consume. After the quota is exhausted, it is no longer possible to submit jobs, and one needs to apply for additional quota first.

How to list quota and usage per user and per project

One can check the number of CPU hours that can be used by issuing the following command:

 cost  

On Fimm we do not have CPU hours per user; CPU hours are accounted per project/group. Thus you can only see how many hours are left for your CPU account.

You can see your available CPU accounts and how much quota they have by using either the "cost" command or:

qbalance -a $CPU_ACCOUNT

Cost command

The cost command will show you the CPU hours used and how many are left available in the CPU account(s) you belong to.

cost

Idle queue

To efficiently use the computing resources, we have set up a special "idle" queue in the cluster that spans all compute nodes, including those normally dedicated to specific groups.

Jobs submitted to the "idle" queue will be able to run on dedicated nodes if they are free.

Important: if the dedicated nodes are needed by the groups that own them (i.e. they submit a job to them), the "idle queue" jobs using those nodes will be killed and re-queued to try to run at a later time.

The "idle" queue is accessible to everyone who has an account on fimm.bccs.uib.no.

The "idle" queue gives you access to the following extra resources:

Number of nodes   CPU type                                         Cores per node   Memory per node
1                 Dual-Core AMD Opteron(tm) Processor 2220        4                32GB
2                 Quad-Core Intel(R) Xeon(R) CPU E5420 @ 2.50GHz  8                32GB
30                Quad-Core Intel(R) Xeon(R) CPU E5420 @ 2.50GHz  8                16GB
32                Quad-Core Intel(R) Xeon(R) CPU L5430 @ 2.66GHz  8                16GB
12                Six-Core AMD Opteron(tm) Processor 2431         12               32GB

The best situations in which to use the "idle" queue are:

  • The "default" queue is fully utilized and special queues are free.
  • You have short jobs which need high resource specification.
  • Your jobs are re-runnable without manual intervention. If not please set the "#PBS -r n" flag.

You can do the following to check which queues are available on Fimm:

qstat -q 

The following will submit your job to the "idle" queue in interactive mode:

qsub -I -q idle 

In your PBS script you can add the following to submit your job to the "idle" queue:

#PBS -q idle 

Please keep in mind that when you submit your job to the "idle" queue, it is not guaranteed that your job will finish successfully, since the owners of the hardware can "take the resources back" at any time by submitting a job to their specific queues.

Using InfiniBand in the idle queue

We have 16 nodes with Mellanox Technologies MT25204 (InfiniHost III Lx HCA) cards, connected to each other with a 24-port Mellanox MT47396 Infiniscale-III InfiniBand switch. Those nodes belong to the nanobasic group.

If you do not belong to the nanobasic group, the only way to access the InfiniBand nodes is through the idle queue.

One can access the InfiniBand nodes through the idle queue with the following line in the PBS script:

#PBS -l nodes=2:ppn=8:ib 

All InfiniBand nodes have "ib" as a node feature. When your job lands on the InfiniBand nodes, mpiexec will automatically pick up the InfiniBand connection instead of the regular Ethernet connection.
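
As a sketch, a complete resource request for two InfiniBand nodes through the idle queue, assuming a one-hour job, could be:

#PBS -q idle
#PBS -l nodes=2:ppn=8:ib,walltime=01:00:00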

Other commands

Command    Description
cost       will show CPU-hour accounting for the current user
qbalance   will show resources per project

For all the commands mentioned above, the gold module should be loaded.
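
For instance (the project name is a placeholder):

module load gold           # make the accounting commands available
cost                       # show CPU-hour accounting for the current user
qbalance -a my_project     # show the remaining quota for the project "my_project"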

FAQ / troubleshooting

Please refer to our general FAQ (Fimm)