
Running Jobs on ShaRCS with TORQUE and Moab

Jobs to be run on the ShaRCS cluster are submitted via the TORQUE workload manager to one of the batch queues listed below. The Moab scheduler then prioritizes jobs in the queue and schedules them to run on the compute nodes as resources become available.

Table of Job Queues and Characteristics

This table identifies the queues available for user job submissions. Four queues are defined on each cluster: express, short, normal, and long. Each queue has specific resource and time limits that help balance utilization with timely execution of jobs. There may be other restrictions or requirements as well; please see the details and examples below.

Queue Name   Priority Modifier   Wallclock Limit   Max Jobs/User   Other Attributes
Express      1000                00:30:00          6
Short        0                   06:00:00          4
Normal       0                   24:00:00          6
Long         0                   72:00:00          2               MAX.NODE=96

Global Job Limits: Maximum 192 nodes/user, 192 nodes/job
Default Job Parameters: nodes=1:ppn=1 walltime=00:05:00

Job Scheduling Policies

Because of these queue structures and resource limits, users are encouraged to request resources as accurately as possible to ensure efficient job turnaround. At minimum, you must specify the node count, processors per node, and wallclock limit on every job submission, whether interactive or batch.
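
For example, a minimal submission that specifies all three required values on the command line might look like the following (the script name my_job.sh and the resource values are only illustrative):

$ msub -l nodes=2:ppn=2,walltime=01:00:00 my_job.sh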

Job submissions are only permitted through the login nodes. No job submissions are permitted from any compute node. Thus, it is not possible for one job to submit another job.

Of the 272 nodes on each cluster, four are used as login nodes and the remaining 268 nodes are compute nodes accessible via the queues mentioned above. Do not run any production batch jobs on the login nodes. Limit the run time of tests on the login nodes to 15 minutes.

Running Interactive Parallel Jobs

Interactive jobs on the batch queues are intended for developing and debugging parallel codes, as well as for running interactive applications such as IDL. To run an interactive parallel job on the ShaRCS clusters, use the interactive (-I) flag with the msub command:

$ msub -I

This provides a login prompt on the launch node and creates the file referenced by the PBS_NODEFILE environment variable, which lists all nodes assigned to the interactive job.

The exit command will end the interactive job.

$ exit

Example

To run an interactive job with a wall clock limit of 30 minutes, using two nodes and two processors per node:

$ msub -I -V -l walltime=00:30:00,nodes=2:ppn=2
msub: waiting for job 57.sched to start
msub: job 57.sched ready

$ echo $PBS_NODEFILE
/var/spool/torque/aux/57.sched

$ more /var/spool/torque/aux/57.sched
n0000.north
n0000.north
n0001.north
n0001.north

$ mpirun -np 4 <job_name>
n0000.north
n0000.north
n0001.north
n0001.north
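
Depending on the MPI installation, the node list may need to be passed to mpirun explicitly. A minimal sketch, assuming an mpirun that accepts the common -machinefile option (check the documentation for the MPI stack installed on ShaRCS):

$ mpirun -np 4 -machinefile $PBS_NODEFILE <job_name>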

Submitting Batch Jobs

To submit a script to TORQUE:

$ msub <batch_script>

The following is an example of a TORQUE batch script for running an MPI job. The line numbers refer to the comments that follow and are not part of the script.

1 #!/bin/csh
2 #PBS -q <queue_name>
3 #PBS -N <job_name>
4 #PBS -l nodes=10:ppn=2
5 #PBS -l walltime=0:50:00
6 #PBS -o <file.out>
7 #PBS -e <file.err>
8 #PBS -V
9 #PBS -M <username@email.address>
10 #PBS -m abe
11 #PBS -A <your-account-number>
12 cd /lustre/<username>
13 mpirun -v -np 20 <./mpi.out>

Comments for the above script:

  1. interpret the script with the csh shell
  2. submit the job to the queue <queue_name>
  3. assign the name <job_name> to the job
  4. request 10 nodes and 2 processors per node
  5. reserve the requested nodes for 50 minutes
  6. send standard output to file.out
  7. send standard error to file.err
  8. export all environment variables to the job
  9. comma-separated list of email addresses to which job-related mail is sent
  10. conditions under which the execution server sends email about the job: (a)bort, (b)egin, (e)nd
  11. account to be charged for running the job; optional if the user has only one account; if more than one account is available and this line is omitted, the job is charged to the default account
  12. change to the working directory /lustre/<username>
  13. run the executable ./mpi.out as a parallel job on 20 processors (10 nodes x 2 processors per node)
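
As an illustration, if the script above were saved as mpi_job.pbs (a hypothetical file name), it could be submitted and then checked as follows; the job ID shown is only an example:

$ msub mpi_job.pbs
58.sched

$ checkjob 58.sched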

Please note that the following information is required in your submission scripts; a minimal set of directives is shown after the list:

  • Number of nodes
  • Processors per node
  • Wallclock time
  • Account name
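
For example, the required items above map to the following directives (the resource values and account placeholder are illustrative):

#PBS -l nodes=2:ppn=2
#PBS -l walltime=00:30:00
#PBS -A <your-account-number>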

To reduce email load on the mailservers, please specify an email address in your TORQUE script. For example,

#!/bin/bash
#PBS -l walltime=00:20:00
#PBS -M <your_username@email.address>
#PBS -m mail_options

or use the mail flags on the command line:

$ msub -m mail_options -M <your_email_address>

These mail_options are available:

n    no mail
a    mail is sent when the job is aborted by the batch system
b    mail is sent when the job begins execution
e    mail is sent when the job terminates
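
For example, to receive mail when a job begins and ends (the email address and script name are placeholders):

$ msub -m be -M <your_username@email.address> <batch_script>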

Following is a table showing some of the most common PBS scripting commands:

PBS Commands

Option                           Description
#PBS -N JobName                  Assigns a job name; the default is the name of the PBS job script
#PBS -l nodes=4:ppn=2            Requests the number of nodes and processors per node
#PBS -q queuename                Assigns the queue your job will use
#PBS -l walltime=01:00:00        Sets the maximum wall-clock time the job can run (800 is interpreted as seconds, 80:00 as minutes and seconds, and 1:00:00 as hours, minutes, and seconds)
#PBS -o path/job.out             The path and file name for standard output
#PBS -e path/job.err             The path and file name for standard error
#PBS -j oe                       Merges the standard error stream into the standard output stream of the job
#PBS -M username@email.address   Defines the mail address of the user
#PBS -m b                        Sends mail to the user when the job begins
#PBS -m e                        Sends mail to the user when the job ends
#PBS -m a                        Sends mail to the user when the job aborts (with an error)
#PBS -m ba                       Combines multiple mail options in one directive (here, begin and abort); if the -m flag is repeated, only the last occurrence takes effect
#PBS -r n                        Indicates that the job should not be rerun if it fails
#PBS -V                          Exports all environment variables to the job
#PBS -A AccountProject           Specifies the project (account) to charge the job to
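
As an illustration, a hypothetical script header that combines several of these options to merge the error stream into the output file and prevent an automatic rerun:

#PBS -N <job_name>
#PBS -j oe
#PBS -o <path/job.out>
#PBS -r n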

Monitoring Batch Queues

Users can monitor the queues and their jobs using the following commands (an example appears after the list):

  • canceljob <job_id> - cancel a queued or running job
  • showq -p <queue_name> - display the jobs in the queues (optionally for a single queue)
  • showbf - show resources currently available for immediate use
  • showstart <job_id> - show the estimated start time of a job
  • checkjob -v <job_id> - show detailed information about a job
  • mshow - show available resources
  • showstate - show a summary of cluster and job state
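
For example, to list the jobs in the normal queue and then examine one job in detail (the job ID 57.sched is taken from the interactive example above and is only illustrative):

$ showq -p normal
$ checkjob -v 57.sched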

Following is a table showing some of the most common TORQUE commands:

TORQUE Commands

Command                   Description
qstat -a                  Display the status of batch jobs
qdel <pbs_jobid>          Delete (cancel) a queued job
qstat -r                  Show all running jobs on the system
qstat -f <pbs_jobid>      Show detailed information about the specified job
qstat -q                  Show all queues on the system
qstat -Q                  Show queue limits for all queues
qstat -B                  Show summary information about the server
pbsnodes -a               Show node status
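
For example, to inspect a job in detail and then cancel it (the job ID is illustrative):

$ qstat -f 58.sched
$ qdel 58.sched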

View the qstat manpage from the ShaRCS login nodes for more options.