Job Scheduler

This documentation assumes that you have submitted some simple jobs to the POD HPC clusters. If you need a brief introduction to building job scripts for job submission, please see POD 101: Quick Start for POD.

POD Job Resource Scheduling

All compute node resources on the MT1 and MT2 clusters must be reserved before they can be used for running jobs. Both clusters use the PBS TORQUE resource manager and Moab workload manager to manage, reserve, schedule, and allocate compute node resources for dispatching user jobs. The documentation on this page describes the capabilities of the job scheduler and provides examples for submitting and monitoring your jobs.

Scheduler Commands

POD MT1 and MT2 use the PBS TORQUE scheduler for job submission to the computational cluster. The following commands, available to all users, allow you to easily submit, monitor, and delete jobs on the command line; a brief example session follows the list below.

  • Use the qsub command to submit your jobs to the Moab workload manager.

  • Use the qstat command to monitor and display status for all your jobs.

  • Use the qdel command to delete or cancel specific jobs not in a completed state.
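
A typical session might look like the following sketch. The submission script job.sub, the job id 23301.pod, and the username penguin are placeholders; the request stays within the FREE queue's restrictions (24 cores for 5 minutes):

$ qsub -q FREE -l walltime=00:05:00 -l nodes=1:ppn=12 job.sub
23301.pod

$ qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
23301.pod                 job.sub          penguin                0 Q FREE

$ qdel 23301.pod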

Choosing your Queue

Choosing a scheduler queue for a compute job is required, as different queues provide different compute node types. The following queues are available on the POD MT1 cluster. All nodes have QDR InfiniBand interconnects for MPI as well as 10 GbE data networks.

Please note: Submitting to the FREE queue with a resource request larger than the restrictions will result in a job that will never run.

MT1 Queue   CPU Architecture                  Cores/Node   RAM/Node   Restrictions
---------   -------------------------------   ----------   --------   -------------------
FREE        Intel® 2.9 GHz Westmere           12           48 GB      24 cores for 5 mins
M40         Intel® 2.9 GHz Westmere           12           48 GB      None
H30         Intel® 2.6 GHz Sandy Bridge       16           64 GB      None
T30         Intel® 2.6 GHz Haswell            20           128 GB     None
H30G        Intel® 2.6 GHz Sandy Bridge       16           64 GB      None
            + dual NVIDIA® Tesla® K40 GPUs

The following queues are available on the POD MT2 cluster. All nodes have Intel Omni-Path interconnects for MPI and storage networking.

MT2 Queue   CPU Architecture           Cores/Node   RAM/Node   Restrictions
---------   ------------------------   ----------   --------   ------------
B30         Intel® 2.4 GHz Broadwell   28           256 GB     None
S30         Intel® 2.4 GHz Skylake     40           384 GB     None

For pricing details, please see here.

PBS Directives

When submitting jobs on both MT1 and MT2, the following directives are required (unless noted otherwise). The directives can be specified as command-line options to the qsub command or as comments in your job submission script. For example, the following two job submissions are equivalent:

$ qsub -q M40 -l walltime=24:00:00 -l nodes=5:ppn=12 /path/to/binary
#PBS -q M40
#PBS -l walltime=24:00:00
#PBS -l nodes=5:ppn=12

/path/to/binary

Queue Name

The queue name determines the CPU architecture, node hardware, and per-core-hour pricing for your job. The queue names differ between the MT1 & MT2 clusters. Use the -q option with the queue name to specify your job submission queue. For example, to submit to the FREE queue on MT1, use the following option:

#PBS -q FREE

Processor (Core) Count

The number of available processor cores for each node depends on the CPU architecture for that node. The syntax for this option specifies the number of nodes and processors-per-node together as -l nodes=X:ppn=Y, where X is the number of nodes and Y is the number of processor cores per node. The total number of processors for the job is the product of the two values. For example, to submit a 96-core job across 8 nodes (8 x 12 cores), use the following option:

#PBS -l nodes=8:ppn=12

Estimated Walltime

Estimated walltime, or the estimated elapsed runtime of your job, helps the scheduler identify appropriate run windows for your job. You can specify this estimated runtime as any combination of hours, minutes, and seconds as -l walltime=HH:MM:SS. See the Walltime section below for more information on selecting an appropriate walltime estimate. For example, to submit a job with a walltime of 24 hours, use the following option:

#PBS -l walltime=24:00:00

Shell Environment (Optional)

If your job submission script needs a different shell for execution than your default login shell, you should specify the shell as part of your job submission. Use the -S option with the path to the shell needed by the submission script. For example, to have the Bourne-Again Shell (BASH) execute the submission script, set the following option:

#PBS -S /bin/bash

Job Name (Optional)

It may be a good idea to specify a name for your job. This will help identify individual jobs when displaying job status using the qstat command. The job name is also used to name the job’s output and error files. Use the -N option to specify a unique or relevant name for your job. For example, to submit a job with a name of My_Job_Name use the following option:

#PBS -N My_Job_Name

Script Output (Optional)

PBS will capture your job’s output written to the Standard Output (STDOUT) and Standard Error (STDERR) streams. This output is useful for examining the results of your job as well as troubleshooting any issues you may encounter. Depending on your application, it may make sense to merge these streams into a single output file. Use the -j oe option to have PBS write a single output file containing all output from your job.

#PBS -j oe

If your application is known for writing excessive or unnecessary error messages, it may be a good idea to keep your output streams separate. Use the -o and -e options to specify unique filenames for your job’s STDOUT and STDERR files. For example, to have your job write STDOUT to mystdout.txt and STDERR to mystderr.txt, use the following options:

#PBS -o mystdout.txt
#PBS -e mystderr.txt

Project Account (Optional)

POD provides utilization and account billing information broken down to individual jobs. You may find it useful to tag your jobs with a project or account name, which can help in accounting for jobs that are related to a specific project, task, case, billing account, end customer, etc. This option is user-defined; it is not used by the scheduler in any way or associated with any system user or group name. Use the -A option to specify a unique or relevant project name for your job. For example, to submit 3 jobs against the different project accounts Project_X, Department A, and BillingAccount-001, run the following commands:

$ qsub -A Project_X job.sub
$ qsub -A "Department A" job.sub
$ qsub -A BillingAccount-001 job.sub

Job Arrays

For jobs where the same executable is going to be run over and over again, with only a change in input parameters, it can be much more efficient to use job arrays. Job arrays allow you to write and submit one submission script to launch multiple jobs. The arguments for each job in the array can be specified using the array ID.

PBS Directive

To submit your job as an array, you must specify a range of ID values, with each ID corresponding to an instance of your job. Use the -t option to specify the ID range for your jobs. For example, to submit 4 instances of your job using a range of 0-3, producing the ID values 0, 1, 2, and 3, use the following option:

#PBS -t 0-3

Simple Example

Your job submission script can be written to use the ID values to help identify or specify input parameters for your job. Reference the $PBS_ARRAYID environment variable in your script to identify the ID value of the currently running instance. For example, to submit a job array of 4 instances that sleep for different intervals, use the $PBS_ARRAYID variable to index into a list of sleep times:

#PBS -l nodes=1:ppn=1
#PBS -l walltime=00:05:00
#PBS -j oe
#PBS -N array
#PBS -t 0-3

# Sleep intervals, indexed by $PBS_ARRAYID
times=( 0 10 20 30 )

echo $PBS_ARRAYID
echo Sleeping for ${times[$PBS_ARRAYID]}
sleep ${times[$PBS_ARRAYID]}

Submitting and running this job array should result in 4 individual jobs that sleep for progressively longer intervals: the first instance sleeps for zero seconds, the second for 10 seconds, the third for 20 seconds, and the last for 30 seconds. The resulting output from qsub, qstat -t, and STDOUT will look like this:

$ qsub testarrays.sub
195610[].pod

$ qstat -t
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
195610[0].pod             array-0          test            00:00:00 C batch
195610[1].pod             array-1          test            00:00:00 R batch
195610[2].pod             array-2          test                   0 R batch
195610[3].pod             array-3          test                   0 R batch

$ cat array.o195610-*
0
Sleeping for 0
1
Sleeping for 10
2
Sleeping for 20
3
Sleeping for 30

In the example above, the qstat -t command was used to see the individual elements of the array. Without -t, the job array is condensed into one status line for the entire array. Similarly, if your input files were named in a successive fashion, like input-1, input-2, input-3, etc., you could use them as input parameters to the same executable, as sketched below.
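
For example, a minimal sketch of such a submission script, assuming a hypothetical executable my_app and input files named input-1 through input-4 in the working directory:

#PBS -l nodes=1:ppn=1
#PBS -l walltime=00:05:00
#PBS -j oe
#PBS -N array-inputs
#PBS -t 1-4

# Each array element processes the input file matching its ID
/path/to/my_app input-$PBS_ARRAYID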

Walltime

As of July 30, 2017, POD compute jobs require a walltime option specified along with your request for nodes, processors-per-node, and queue. A walltime defines the maximum amount of time a job may run and is a hard limit: any job that exceeds its specified walltime will be killed. Users are only billed for the actual runtime of a job. A job that completes in 5 minutes, but was submitted with a maximum walltime of 1 hour, will only be charged for the 5 minutes of actual runtime.

Optimize Scheduling

Accurate walltime estimates help optimize job scheduling and ensure that your job starts as quickly as possible. Without a specified walltime, the scheduler defaults to estimating a 99-day runtime as part of its accounting. This creates unnecessary wait times in the queue, especially for large-core jobs. For example, if a 1,000-core job should run no longer than three hours, not setting a walltime forces the scheduler to ensure 99 days of continuous access to 1,000 cores, resulting in undesired waiting in the queues due to the “fair share” scheduling algorithm for compute resources.

Control Billing

Since walltime is a hard limit, it also enables you to control the maximum cost any job can be billed. A walltime can also prevent experimental code or models from running longer than expected. For instance, if you launch a large array of jobs that should only take a maximum of 15 minutes each, setting a walltime of 20 minutes per job will prevent “run-away” jobs as you experiment with new code, workflows, input files, parameters, and models.
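
As a sketch, that 20-minute cap for an array of 100 jobs might look like the following, where the submission script job.sub and the M40 queue are placeholders:

$ qsub -t 1-100 -q M40 -l nodes=1:ppn=1 -l walltime=00:20:00 job.sub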

Best Practices for Estimating Walltime

When running a new application or workload for the first time, you will want to ensure that your job runs to completion, so it is recommended that you overestimate your walltime for these new jobs. For example, if you think your new job will finish in 4 hours, you should set a walltime estimate of 8 to 10 hours for the first few runs.

Once you have determined your job’s walltime needs, it is recommended that you only pad your walltime estimate by 10-15% moving forward. For production runs, request a walltime that is reasonable and not excessive. This helps the scheduler fit your jobs into smaller windows of availability and minimizes queue wait times.
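
For example, if a job reliably finishes in about 4 hours (240 minutes), padding by roughly 12% gives a walltime request of 4 hours 30 minutes:

#PBS -l walltime=04:30:00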

Walltime and Job Arrays

For job arrays, the walltime is applied to each job element. For example, this job array of 10 elements sets a 2-hour walltime for each element, with each element running on 20 cores of a single node.

$ qsub -t 1-10 -q T30 -l nodes=1:ppn=20 -l walltime=02:00:00 /path/to/submission/script

Monitoring Jobs

A user’s jobs can be monitored from the command line using the qstat command. The information reported for each job includes a unique job id, job name, submission user, runtime, status, and submission queue. The job id is the best way to identify or reference a specific job when contacting support for assistance.

$ qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
23224.pod                 XHPL-8x12        penguin                0 Q M40

A job’s status can be any of the following. Please note that the Complete job status is informational, indicating that your job has finished and billing has ended. Complete jobs cannot be deleted using the qdel command as they are no longer running.

State   Description
-----   ----------------------------------
Q       Job is queued, eligible to run
R       Job is running
E       Job is exiting after having run
H       Job is held
V       Job is vacated
C       Job is complete, billing has ended

To view only running jobs with more detail, use the qstat -r command:

$ qstat -r

pod:
                                                                               Req'd    Req'd      Elap
Job ID               Username    Queue    Jobname          SessID NDS   TSK    Memory   Time   S   Time
-------------------- ----------- -------- ---------------- ------ ----- ------ ------ -------- - --------
23224.pod            penguin     M40      XHPL-8x12         56282     8     96    --  24:00:00 R 00:14:02

To see detailed information about a specific job, use the qstat -f command with a job id:

$ qstat -f 23224.pod
Job Id: 23224.pod
    Job_Name = XHPL-8x12
    Job_Owner = penguin@podmt1
    resources_used.cput = 00:12:34
    resources_used.mem = 2792kb
    resources_used.vmem = 59624kb
    resources_used.walltime = 00:14:02
    job_state = R
    queue = M40
    server = pod
    Checkpoint = u
    ctime = Wed Aug 28 19:17:24 2019
    Error_Path = podmt1:/home/penguin/XHPL-8x12.e23224
    ...

Monitoring System Group Jobs

POD accounts can monitor and display the status of jobs submitted by their managed users. You must set up and configure the PODTools command-line utility and add managed users to a system group. Please view the PODTools documentation for more information. The steps below outline the process for inviting users to create accounts and setting up a new system group.

Setup a new System Group

  1. Login to the POD portal as the top-level account user.

  2. Click the Users & Groups link in the POD portal menu. Alternatively, follow this link to bring up the My POD Users & Groups page directly.

  3. Click the blue Create a System Group button to display the configuration form.

  4. Enter a unique name for your group in the Groupname textbox and use the checkboxes to select existing managed users as members of the new group.

  5. Click the blue Create Group button to create the new system group.

Invite new Managed Users

  1. On the My POD Users & Groups page click the blue Invite users to join your POD Account button. Alternatively, follow this link to bring up the Invite New Users to Join Your POD Account page directly.

  2. Fill out the table specifying the Full Name and E-mail Address for each new user. You can use the Add more rows button to add more rows if you want to add more than two new accounts. Once the form is filled out, click the Done adding invitations button to expand the form.

  3. Make sure to select the checkbox next to your new group name so that these accounts will automatically be joined to the new system group.

  4. Grant specific permissions for your new users on MT1 & MT2 including: Login Node Permissions, Create Instances, Storage Allocation Maximums, and Storage Usage Quotas.

  5. Click the blue Send Invitations button to send e-mail invitations to your new users.

Usage

First, load the podtools module and use the podsh status command with the -g option to display job status for users in the specified group. Please view the podsh documentation for more information.

$ module load podtools
$ podsh status -g mygroup

T30 Haswell Queue (MT1)

The MT1 and MT2 clusters are made up of many different CPU architectures. Compute jobs will be dispatched to specific compute nodes determined by their submission queue. For example, if you want your job to run on a server with an Intel® Haswell CPU architecture then you must submit your job to the T30 queue on MT1.

T30 Server - Intel® Haswell

  • POD Location: MT1

  • Intel® Xeon® CPU E5-2660 v3 @ 2.6 GHz

  • 20 Bare-metal Compute Cores

  • 128 GB DDR4 RAM

  • QDR InfiniBand non-blocking fabric

  • 10 GbE storage network

T30 Batch Jobs

The T30 Haswell queue allows single-node jobs to use up to 20 cores, with each core allotted 6.4 GB of RAM. For example, to submit a job to a single node using 10 cores, 64 GB of RAM, for a maximum runtime of 1 hour, use this command:

$ qsub -q T30 -l nodes=1:ppn=10 -l walltime=01:00:00 /path/to/job-script

When requesting multiple T30 nodes for an MPI-enabled job you must request all 20 cores for each server. For example, to submit a job to five nodes using 100 cores, 640 GB of RAM, for a maximum runtime of 1 hour use this command:

$ qsub -q T30 -l nodes=5:ppn=20 -l walltime=01:00:00 /path/to/mpi-job-script

T30 Interactive Jobs

It is recommended to use a T30 node for building any applications that you plan on running on T30 nodes. You can run an interactive job by using the qsub command with the -I option. Please note that runtime for interactive jobs is billed until you exit the interactive shell, terminating the job. The example below shows a simple interactive job running on node n492.

[penguin@loginnode ~]$ qsub -q T30 -I -l nodes=1:ppn=20

qsub: waiting for job 14701947.pod to start
qsub: job 14701947.pod ready

[penguin@n492 ~]$
[penguin@n492 ~]$ logout

qsub: job 14701947.pod completed

Compiling for Haswell

The T30 queue’s Haswell processors allow for both the AVX and AVX2 extended instruction sets. The use of AVX2 extensions requires a GCC compiler with version >= 4.9.0. When loading the gcc/4.9.0 environment module, compatible binutils and zlib dependency modules will be loaded as well. The example below shows a single interactive job running on node n492, compiling a C program with the AVX and AVX2 extended instruction sets as well as a native build targeting the detected CPU.

[penguin@loginnode ~]$ qsub -q T30 -I -l nodes=1:ppn=20

qsub: waiting for job 14701948.pod to start
qsub: job 14701948.pod ready

[penguin@n492 ~]$ module load gcc/4.9.0
[penguin@n492 ~]$ module list
Currently Loaded Modulefiles:
  1) zlib/1.2.8/gcc.4.9.0        3) gcc/4.9.0
  2) binutils/2.25.1/gcc.4.9.0

[penguin@n492 ~]$ gcc -mavx mycode.c                        # AVX extensions
[penguin@n492 ~]$ gcc -mavx2 mycode.c                       # AVX2 extensions
[penguin@n492 ~]$ gcc -march=native -mtune=native mycode.c  # CPU Native
[penguin@n492 ~]$ logout

qsub: job 14701948.pod completed

S30 Skylake Queue (MT2)

The MT1 and MT2 clusters are made up of many different CPU architectures. Compute jobs will be dispatched to specific compute nodes determined by their submission queue. For example, if you want your job to run on a server with an Intel® Skylake CPU architecture then you must submit to the S30 queue on MT2.

S30 Server: Intel® Xeon® Scalable Processor (Skylake-SP)

  • POD Location: MT2

  • Intel® Xeon® Gold 6148 @ 2.4 GHz

  • 40 Bare-metal Compute Cores

  • 384 GB DDR4 RAM

  • Intel® Omni-Path 100 Gb/s non-blocking fabric

  • Lustre parallel file system

S30 Batch Jobs

The S30 Skylake-SP queue allows for single node jobs to use up to 40 cores. Each core is allotted 9.6 GB of RAM. For example, to submit a job to a single node using 10 cores, 96 GB of RAM, for a maximum runtime of 1 hour use this command:

$ qsub -q S30 -l nodes=1:ppn=10 -l walltime=01:00:00 /path/to/job-script

When requesting multiple S30 nodes for an MPI-enabled job you must request all 40 cores for each server. For example, to submit a job to five nodes using 200 cores, 1920 GB of RAM, for a maximum runtime of 1 hour use this command:

$ qsub -q S30 -l nodes=5:ppn=40 -l walltime=01:00:00 /path/to/mpi-job-script

S30 Interactive Jobs

It is recommended to use an S30 node for building any applications that you plan on running on S30 nodes. You can run an interactive job by using the qsub command with the -I option. Please note that runtime for interactive jobs is billed until you exit the interactive shell, terminating the job. The example below shows a simple interactive job running on node n700.

[penguin@loginnode ~]$ qsub -q S30 -I -l nodes=1:ppn=40
qsub: waiting for job 14701947.pod to start
qsub: job 14701947.pod ready

[penguin@n700 ~]$
[penguin@n700 ~]$ logout

qsub: job 14701947.pod completed

Compiling for Skylake-SP

The S30 queue’s Skylake-SP processors allow for the AVX2 and AVX512 extended instruction sets. The use of AVX2 or AVX512 extensions requires a GCC compiler with version >= 4.9.0. Since AVX512 is a new feature of Skylake-SP processors, AVX512-enabled binaries will only execute on S30 nodes. Use AVX2 extensions to build binaries targeting portability, as AVX2 is supported on T30, B30, and S30 nodes. The example below shows a single interactive job running on node n700, compiling a C program with AVX512 extended instruction sets.

[penguin@loginnode ~]$ qsub -q S30 -I -l walltime=01:00:00 -l nodes=1:ppn=40

qsub: waiting for job 4287980 to start
qsub: job 4287980 ready

[penguin@n700 ~]$ module load gcc/4.9.0
[penguin@n700 ~]$ gcc -mavx512f -mavx512pf -mavx512er -mavx512cd mycode.c
[penguin@n700 ~]$ logout

qsub: job 4287980 completed

GCC compiler users can leverage a number of compile flags for AVX512 extensions:

  • AVX-512 foundation instructions: -mavx512f

  • AVX-512 prefetch instructions: -mavx512pf

  • AVX-512 exponential and reciprocal instructions: -mavx512er

  • AVX-512 conflict detection instructions: -mavx512cd

Intel® compiler users can use -xCORE-AVX512 to generate AVX-512F, AVX-512CD, AVX-512BW, AVX-512DQ, and AVX-512VL binaries. For more information, read Intel’s documentation on AVX512 optimizations using the Intel® C, C++ and Fortran compilers.
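
For example, a minimal sketch, assuming the Intel C compiler (icc) is available in your environment:

$ icc -xCORE-AVX512 mycode.c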

Tundra Open Compute Platform

The Intel® Haswell and Skylake cores on POD are delivered through Tundra, Penguin Computing’s Open Compute HPC platform. Penguin Computing’s Tundra Extreme Scale Series provides density, serviceability, reliability, and an optimized total cost of ownership for highly demanding computing requirements, allowing Penguin Computing to pass these savings on to POD customers.

Read more below about Penguin Computing’s Tundra solution, which was awarded the Department of Energy’s National Nuclear Security Administration (NNSA) Advanced Simulation and Computing (ASC) CTS-1 contract.