Running MPI Jobs

MPI Job Syntax

OpenMPI is readily available on Penguin POD for MPI jobs. Because OpenMPI is optimized for POD, there is no need to specify a machine file or -np when using mpirun: by default, mpirun launches one MPI rank per requested core on the nodes provided by the scheduler. For instance, the following example launches 96 MPI ranks (8 nodes x 12 cores) using OpenMPI 1.5.5:

#PBS -S /bin/bash
#PBS -l nodes=8:ppn=12,walltime=48:00:00

module load openmpi/1.5.5/gcc.4.4.6
mpirun /path/to/binary

exit $?
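
Assuming the script above is saved under a name such as openmpi_job.sub (a placeholder filename), it is handed to the scheduler with qsub, which prints the job ID on success:

# submit the job script from the login node
qsub openmpi_job.sub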

Using More Than 4GB per Core

A single POD core provides 4GB of dedicated RAM. If your application requires more than 4GB per MPI rank, you must request enough cores to satisfy your total RAM requirement and then limit the number of MPI ranks launched, using the mpirun arguments --npernode or --loadbalance along with -np. For instance, a job that runs 24 MPI ranks needing 8GB each must check out 48 cores: 48 cores provide 192GB of RAM, or 8GB for each of the 24 ranks.

#PBS -S /bin/bash

# 4 nodes x 12 cores = 48 cores checked out
#PBS -l nodes=4:ppn=12,walltime=24:00:00

# load OpenMPI
module load openmpi/1.5.5/gcc.4.4.6

# limit mpirun to 24 ranks and balance them across all 4 nodes
mpirun -np 24 --loadbalance /path/to/binary

# alternatively, use --npernode
# mpirun -np 24 --npernode 6 /path/to/binary

exit $?
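
In general, the number of cores to request is the number of MPI ranks multiplied by the per-rank RAM requirement divided by 4GB, rounded up. The short bash sketch below works through that arithmetic; ranks and gb_per_rank are example values, not fixed parameters:

#!/bin/bash
# POD provides 4GB of RAM per core, so check out enough cores
# that every MPI rank gets its required share.
ranks=24          # MPI ranks to launch with -np
gb_per_rank=8     # RAM needed per rank, in GB

cores_per_rank=$(( (gb_per_rank + 3) / 4 ))    # ceil(gb_per_rank / 4)
cores_to_request=$(( ranks * cores_per_rank ))

echo "Request $cores_to_request cores for $ranks ranks at ${gb_per_rank}GB each"
# prints: Request 48 cores for 24 ranks at 8GB each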

Example OpenMPI Job

More example templates can be found in /public/examples.

#PBS -N MPI-EXAMPLE
#PBS -S /bin/bash
#PBS -q FREE
#PBS -j oe
#PBS -l nodes=2:ppn=2
#PBS -l walltime=00:05:00

# Load the ompi environment.  Use 'module avail' from the
# command line to see all available modules.

module load openmpi/1.5.5/gcc.4.7.2

# Display some basics about the job

echo
echo "================== nodes ===================="
cat $PBS_NODEFILE
echo
echo "================= job info  ================="
echo "Date:   $(date)"
echo "Job ID: $PBS_JOBID"
echo "Queue:  $PBS_QUEUE"
echo "Cores:  $PBS_NP"
echo "mpirun: $(which mpirun)"
echo
echo "=================== run ====================="

# Change to the directory from which the job was submitted with qsub

cd $PBS_O_WORKDIR

# Run your application with mpirun. Note that no -mca btl options 
# should be used to ensure optimal performance.  Jobs will use 
# Infiniband by default.

time mpirun /path/to/your/mpi/binary
retval=$?

# Display end date and return value

echo
echo "================== done ====================="
echo "Date:   $(date)"
echo "retval: $retval"
echo

# vim: syntax=sh
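
After submitting the script with qsub, standard TORQUE commands can be used to follow the job. The job ID below (12345) is a placeholder for whatever qsub returns:

# check the job state (Q = queued, R = running, C = complete)
qstat 12345

# with '#PBS -j oe' and '#PBS -N MPI-EXAMPLE', stdout and stderr are joined
# into one file in the submission directory once the job finishes
cat MPI-EXAMPLE.o12345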

Special MPI Considerations

OpenMPI is strongly encouraged on POD because the OpenMPI builds listed by 'module avail' are optimized for the POD Infiniband environment. In the rare case where a commercial application requires a different MPI implementation, the following considerations apply.

Platform MPI (formerly HP-MPI)

If your application requires Platform MPI, it is necessary to set these variables:

export MPI_MAX_REMSH=32
export MPI_REMSH=/usr/bin/bprsh

Then call mpirun with the following options. $PBS_NODEFILE is generated automatically inside the PBS/TORQUE job.

mpirun -psm -hostfile $PBS_NODEFILE <your application>
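
Putting these pieces together, a minimal Platform MPI job script might look like the sketch below; it assumes the Platform MPI mpirun shipped with your application is on your PATH, and the node counts, walltime, and binary path are placeholders:

#PBS -S /bin/bash
#PBS -l nodes=4:ppn=12,walltime=24:00:00

# required Platform MPI settings on POD
export MPI_MAX_REMSH=32
export MPI_REMSH=/usr/bin/bprsh

cd $PBS_O_WORKDIR

# launch over the scheduler-provided node list using the PSM interconnect
mpirun -psm -hostfile $PBS_NODEFILE /path/to/your/binary

exit $?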

Intel MPI

If your application requires Intel MPI, you will need to configure your job to use a TMI configuration that leverages Infiniband and the QLogic/Intel PSM libraries. The following example can be used as a template to run Intel MPI jobs with the optimal TMI configuration.

#PBS -S /bin/bash
#PBS -N MPI
#PBS -j oe
#PBS -q M40
#PBS -l nodes=4:ppn=12
#PBS -l walltime=48:00:00

module load intel/11.1.0
cd $PBS_O_WORKDIR

# These are required settings

export I_MPI_FABRICS=shm:tmi 
export TMI_CONFIG=/etc/tmi.conf 
export I_MPI_TMI_LIBRARY=/usr/lib64/libtmi.so 
export I_MPI_TMI_PROVIDER=psm 
export I_MPI_MPD_RSH=/usr/bin/bprsh
export I_MPI_DEBUG=5  # optional

# PBS_NP = number of MPI ranks (cores in nodes=X:ppn=Y)
# PBS_NODEFILE = /var/spool/torque/aux/<jobid> from mother superior

mpdboot -n $PBS_NP -f $PBS_NODEFILE -r $I_MPI_MPD_RSH

## <run program here>

mpdallexit

## EOF ##
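
At the '<run program here>' step, a typical launch through the MPD ring started by mpdboot would use mpiexec; the binary path below is a placeholder:

# launch one MPI rank per requested core across the MPD ring
mpiexec -n $PBS_NP /path/to/your/binary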