Quick Start

POD 101: Quick Start for POD

Welcome to POD 101, a brief tutorial describing the process for running jobs using simple submission scripts. To follow along, you will need to have created and set up your POD user account. See the Account Setup documentation or contact a POD Sales Representative at podsales@penguincomputing.com for more information. If you do not have an account on POD, please visit the registration request form.

This tutorial focuses on the command line interface (CLI) and job submission script syntax, which is the best way to organize and automate your workflow on POD. Submission scripts can be used from the CLI as well as from the web-based Job Manager in the POD portal. To follow along using the CLI, open an SSH connection to your POD login node using the appropriate application for your operating system. The detailed instructions for setting up SSH and PuTTY will walk you through accessing your POD login node.
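For example, from a Linux or macOS terminal, the connection generally takes the form below, where the username and login node address are placeholders for the values from your account setup:

$ ssh <your-pod-username>@<your-login-node-address>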

Job Submission Scripts

All jobs on POD should be submitted to the job scheduler from your login node. The job scheduler will allocate the requested resources from a pool of high performance compute nodes with CPU and memory well beyond what is available on your login node. Since this submission request can be complex, it is best to record everything in a script to aid reproducibility.

Most job submission scripts have two sections. The top section uses PBS directives (#PBS) to describe HOW the scheduler should run your job. As the user, you will need to define the job submission queue, the total number of nodes, the number of processors per node, the maximum runtime of your job, how the output should be organized, and so on. The bottom section describes WHAT your job does when it runs. As the user, you will need to set up the runtime environment and call the application command with the desired arguments.
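As a rough template, a submission script follows the shape below. This is a sketch, not runnable as-is: the bracketed values are placeholders to replace with your own queue, resources, modules, and commands.

# HOW the scheduler should run the job: PBS directives
#PBS -q <queue>
#PBS -l nodes=<nodes>:ppn=<processors-per-node>
#PBS -l walltime=<HH:MM:SS>

# WHAT the job does: environment setup and application commands
module load <environment-module>
cd $PBS_O_WORKDIR
<application-command> <arguments>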

Simple Example Job

Shown below is a simple job submission script that runs the date and hostname commands on 1 processor, from 1 node, submitting to the FREE queue, with a maximum runtime of 1 minute.

Please Note: The FREE queue is only available on MT1. If you are trying this example on MT2 use B30 for your queue instead of FREE.

#PBS -q FREE
#PBS -l nodes=1:ppn=1
#PBS -l walltime=00:01:00

/bin/date
/bin/hostname

Use your favorite CLI editor to create this simple submission script. Some useful editors installed by default on your login node are nano, vim, and emacs. You can name your script whatever you want, but the best practice is to use a relevant name with a .sub extension. For this example, use something like simple101.sub.
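If you prefer to stay on the command line, you can also create the file directly with a shell heredoc:

$ cat > simple101.sub <<'EOF'
#PBS -q FREE
#PBS -l nodes=1:ppn=1
#PBS -l walltime=00:01:00

/bin/date
/bin/hostname
EOF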

Job Submission

Once your submission script has been created, use the qsub command to submit the job to the scheduler. This should return a jobid that you can use to keep track of the job. Use the qstat command to display the status of all your recent jobs. Your jobid should show up in the status listing along with the name of the job, your username, any accumulated runtime, the current status, and the submission queue.

A job’s status will typically progress from Queued (Q), to Running (R), and finish in a Completed (C) state. Keep in mind that when a user kills a job before its completion, it will show a Vacated (V) state. For more information on scheduler states and their descriptions, see the scheduler documentation. To monitor your submitted jobs, re-run qstat to see status changes over time.

$ qsub simple101.sub
17430680.pod

$ qstat
Job ID                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
17430680.pod              simple101.sub    penguin         00:00:00 Q FREE

$ qstat
Job ID                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
17430680.pod              simple101.sub    penguin                0 R FREE

$ qstat
Job ID                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
17430680.pod              simple101.sub    penguin         00:00:00 C FREE
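
Rather than re-running qstat by hand, you can limit the listing to your own jobs and refresh it automatically. The -u option is a standard qstat flag and watch is a common Linux utility; both should be available on your login node, but verify before relying on them:

$ qstat -u $USER
$ watch -n 10 qstat -u $USER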

Job Output

Once the job starts running and generating output, you should see two separate output files in the same directory as your job submission script. One file name will include .o followed by the jobid, while the other will have .e. Any output from your job commands is recorded in the .o file, while any errors are recorded in the .e file. It is best practice to examine both files to see how your job ran. Use the cat and file commands to display the contents of the files on the CLI. The output from this job should include the date and hostname, as executed on the compute node, with no errors.

$ ls -1
simple101.sub
simple101.sub.e17430680
simple101.sub.o17430680

$ cat simple101.sub.o17430680
Thu Jul 18 01:13:43 UTC 2019
n33

$ file simple101.sub.e17430680
simple101.sub.e17430680: empty

If you see any errors in the .e file or in the qsub output, double check the contents of your submission script. The job scheduler is very particular about the formatting, spelling, and capitalization of the directives and commands in the job submission script. Once you have a working script, you can use it as a basis for building more complex application workflows.
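
For example, two standard PBS directives can tidy up the output handling described above: -N gives the job a meaningful name in qstat listings and output file names, and -j oe merges the error stream into the output stream so only a single .o file is produced. Added to the top section of the script:

#PBS -N simple101
#PBS -j oe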

Simple MPI Example Job

The previous example ran one set of commands, on one processor, using one compute node. We will now build a more complex example that runs on multiple processors using an MPI library available on POD.

The job submission script shown below runs an MPI-enabled executable on 12 processors, using 1 node, submitting to the FREE queue, with a maximum runtime of 5 minutes. This is similar to what was run before, but now the script loads the OpenMPI version 2.0.0 environment module and also displays more information before and after the timed call to mpirun. This information can help when trying to debug a job failure.

Please Note: The FREE queue is only available on MT1. If you are trying this example on MT2 use B30 for your queue instead of FREE.

#PBS -q FREE
#PBS -l nodes=1:ppn=12
#PBS -l walltime=00:05:00

# load the MPI environment that matches the compiled binary
module load openmpi/2.0.0/gcc.6.2.0

echo "================== nodes ===================="
# $PBS_NODEFILE lists the allocated nodes, one line per core
cat $PBS_NODEFILE
echo "================= job info  ================="
echo "Date:   $(date)"
echo "Job ID: $PBS_JOBID"
echo "Queue:  $PBS_QUEUE"
echo "Cores:  $PBS_NP"
echo "mpirun: $(which mpirun)"
echo

# run from the directory the job was submitted from
cd $PBS_O_WORKDIR

# run and time the MPI executable, then capture its exit status
time mpirun ./helloworld.bin
retval=$?

echo
echo "================== done ====================="
echo "Date:   $(date)"
echo "retval: $retval"

Just as before, use your favorite CLI editor to create a simple MPI submission script. For this example use something like mpi101.sub.

MPI Job Submission

Before running, you will need to copy the MPI-enabled helloworld.bin binary executable to your home directory from /public/examples/pod101/. This binary was compiled using the openmpi/2.0.0/gcc.6.2.0 library, and the submission script should load the matching environment module. Just as before, use qsub to submit this MPI-enabled job to the scheduler. Note the jobid to keep track of your job and use qstat to watch the job status change through the Queued (Q), Running (R), and Completed (C) states.

$ cp /public/examples/pod101/helloworld.bin .
$ qsub mpi101.sub
17430686.pod

$ qstat
Job ID                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
17430686.pod              mpi101.sub       penguin         00:00:00 Q FREE

$ qstat
Job ID                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
17430686.pod              mpi101.sub       penguin                0 R FREE

$ qstat
Job ID                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
17430686.pod              mpi101.sub       penguin         00:00:00 C FREE
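
The provided binary is ready to run, but for reference, a minimal MPI hello world that prints the same style of output could be written and compiled with the mpicc wrapper from the loaded module. This is a sketch, not the actual source of the provided binary:

$ module load openmpi/2.0.0/gcc.6.2.0
$ cat > helloworld.c <<'EOF'
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    /* initialize MPI and look up this process's rank and the world size */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);

    printf("Hello world from node %s, rank %d out of %d processors\n",
           name, rank, size);

    MPI_Finalize();
    return 0;
}
EOF
$ mpicc -o helloworld.bin helloworld.c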

MPI Job Output

Like before, once the job starts running and generating output, you should see the output and error files named with the new jobid. Use the cat command to examine both files, both while the job is running and after it has completed. You should see the output from the script documenting the nodes used, the date the job ran, the jobid, the submission queue, the total number of processors, the path to the mpirun executable used, and the output of the MPI-wrapped call to the application. This example application prints “Hello world” from each of the running processes.

$ ls -1
helloworld.bin
mpi101.sub
mpi101.sub.e17430686
mpi101.sub.o17430686

$ cat mpi101.sub.o17430686
================== nodes ====================
n33
n33
n33
n33
n33
n33
n33
n33
n33
n33
n33
n33
================= job info  =================
Date:   Thu Jul 18 01:34:35 UTC 2019
Job ID: 17430686.pod
Queue:  FREE
Cores:  12
mpirun: /public/apps/openmpi/2.0.0/gcc.6.2.0/bin/mpirun

Hello world from node n33, rank 0 out of 12 processors
Hello world from node n33, rank 1 out of 12 processors
Hello world from node n33, rank 2 out of 12 processors
Hello world from node n33, rank 3 out of 12 processors
Hello world from node n33, rank 4 out of 12 processors
Hello world from node n33, rank 5 out of 12 processors
Hello world from node n33, rank 6 out of 12 processors
Hello world from node n33, rank 7 out of 12 processors
Hello world from node n33, rank 8 out of 12 processors
Hello world from node n33, rank 9 out of 12 processors
Hello world from node n33, rank 10 out of 12 processors
Hello world from node n33, rank 11 out of 12 processors

================== done =====================
Date:   Thu Jul 18 01:34:35 UTC 2019
retval: 0

$ cat mpi101.sub.e17430686

real    0m0.279s
user    0m0.591s
sys     0m0.198s

Just as before, if you see errors in the .e file or returned from qsub, double check the contents of your submission script. Note that the real/user/sys lines in the .e file above are the timing information from the time command, which is written to standard error; they are expected and not a sign of failure. Once you have a working script, you can use it as a basis for building more complex scripts for your MPI-enabled application workflow.
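
For example, scaling the same job across two nodes is, in principle, just a change to the resource request. The sketch below assumes the target queue allows multi-node jobs and that the OpenMPI build is integrated with the scheduler, so mpirun picks up all 24 allocated cores from the job environment automatically; verify both assumptions for your queue before relying on it:

#PBS -q FREE
#PBS -l nodes=2:ppn=12
#PBS -l walltime=00:05:00

module load openmpi/2.0.0/gcc.6.2.0
cd $PBS_O_WORKDIR
time mpirun ./helloworld.bin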