Hands-On Session Messaging Fundamentals #1:
Introduction to the NCI Raijin System and MPI

Objective: To ensure everyone can use the remote machine, to learn how to run MPI jobs interactively and in batch mode, and to learn the basics of MPI.
The code for this exercise is downloaded as described below.

Downloading code

You may use wget to download the code for each Hands-on session to your working directory. Run the following commands after logging into the remote machine:

wget http://cs.anu.edu.au/courses/distMemHPC/sessions/MF1.tar

tar -xvf MF1.tar

Note: remember to change the target link for each hands-on session, e.g. .../MF2.tar for Hands-on session MF-2, etc.

Accessing the System

There is comprehensive documentation for the NCI Raijin system available online. Briefly browse it to find the main topics.

Log on to the Raijin system using your given username: ssh raijin.nci.org.au -l <username>

Raijin uses environment modules to customize user environments. Run the command module avail to see what modules are available, and module list to see what modules you are currently using.

Do module load openmpi to add MPI. What version of OpenMPI are you using by default? Add the above command to your ~/.profile so that it is run automatically in later sessions. Standard UNIX editors are installed, including nano, vim and emacs.
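
One quick way to check the version is ompi_info, a utility that ships with OpenMPI (module list will also show it):

ompi_info | head -2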


MPI Programs

Open the file helloMPI.c. Note there are 3 basic requirements for all MPI codes:

#include "mpi.h"
MPI_Init(&argc, &argv);
MPI_Finalize(); 
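
For orientation, here is a minimal sketch of a complete program with this structure (the actual helloMPI.c will differ in detail; printing the rank and process count is an assumption):

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[]) {
    int rank, nprocs;
    MPI_Init(&argc, &argv);                  /* first executable statement */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* this process's id within MPI_COMM_WORLD */
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);  /* total number of processes */
    printf("Hello from rank %d of %d\n", rank, nprocs);
    MPI_Finalize();                          /* last executable statement */
    return 0;
}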

You can find the header file in /apps/openmpi/1.6.3/include/mpi.h. (Do you know what version of OpenMPI you are using now?) Take a look at it. It provides the definition of MPI_COMM_WORLD in a complicated fashion involving a global structure that is initialized in another function in the library (it used to be easier!).

MPI_Init() and MPI_Finalize() should be the first and last executable statements in your code -- basically because it is not clear what happens before or after calls to these functions!! man MPI_Init says:

The MPI Standard does not say what a program can do before an MPI_Init or after an MPI_Finalize. In the Open MPI implementation, it should do as little as possible. In particular, avoid anything that changes the external state of the program, such as opening files, reading standard input, or writing to standard output.

If you want to know what an MPI function does, consult its man page, e.g. man MPI_Send, or look it up in the MPI standard.

Note that at the moment we are only interested in MPI-1.


Running MPI Programs Interactively

Compile the code: make helloMPI

This will result in:

mpicc -c helloMPI.c
mpicc -o helloMPI helloMPI.o 

mpicc is a wrapper that will end up calling a standard C compiler (in this case gcc). Do mpicc -v helloMPI.c to see all the details. mpicc also ensures that the program links with the MPI library.

Run the code interactively by typing ./helloMPI.

You should find the executable runs using just one process. With some MPI implementations the code will fail because you have not defined the number of processes to be used. Using OpenMPI this is done using the command mpirun.

Try running the code interactively again but this time by typing mpirun -np 2 ./helloMPI.
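
Assuming a print statement like the sketch above, the output should look something like this (the exact wording depends on helloMPI.c):

Hello from rank 0 of 2
Hello from rank 1 of 2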

Now try: mpirun -np 6 ./helloMPI.

Try using -np 20; it will fail. Why? What is the maximum number of MPI processes you can create interactively?

If you run this program enough times you may see that the order in which the output appears changes. Output to stdout is line buffered, but beyond that the lines can appear in any order.

mpirun has a host of different options; do man mpirun for more information. The -np option specifies the number of processes that you wish to spawn.


Running MPI Programs via the Batch Queuing System

So far we have only been running our code on one of the Raijin nodes. In total Raijin has 3592 nodes (and 57,472 cores). Six of these are reserved for interactive logins; the remaining nodes are only available via a batch queuing system. (Which of the six interactive nodes are you logged on to? Run the command hostname if unsure.)

Now we will run the same job, but using the PBS batch queuing system. To submit a job to the queuing system we have to write a batch script. An example of this is given in the file batch_job. Take a look at it. Lines starting with #PBS are directives to the queuing system, informing it of what resources you require and how your job should be executed. One of these lines sets the number of processors you want to use. Particularly important is the line that limits the walltime (see the sketch below).
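
For reference, a minimal sketch of what a script like batch_job might contain (the queue name, resource amounts and the use of $PBS_NCPUS are assumptions; check the provided file for the exact directives):

#!/bin/bash
# request 4 processors, 5 minutes of walltime, and start in the submission directory
#PBS -q express
#PBS -l ncpus=4
#PBS -l walltime=00:05:00
#PBS -l wd

# $PBS_NCPUS holds the number of processors the queuing system actually allocated
mpirun -np $PBS_NCPUS ./helloMPI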

Please ensure you limit the walltime similarly for any batch job that you run. After all this setup information, the script runs the job by issuing the mpirun command, taking the number of processes from the number of processors allocated by the queuing system.

To submit your job to the queuing system, run qsub batch_job.

It will respond with something like:

$ qsub batch_job
9485672.r-man2

where 9485672.r-man2 is the id of the job in the queuing system. To see what is happening to this job, run qstat 9485672; or, more simply, to see all of your current jobs, use qstat -u $USER.

To delete a job from the queue, run qdel 9485672.r-man2.

When your job completes, the combined standard output and error will be put in a file, in this case named batch_job.o9485672. Inspect this file.

Exercise 1

Modify the code in helloMPI.c to also print out the name of the node each process is executing on. Do this by using the system call gethostname(), declared in <unistd.h>, which fills a character buffer with the node's name (see the sketch after the list below):

gethostname(name, sizeof(name));
  1. Run your modified version of helloMPI interactively. What nodes of the cluster are being used?
  2. Repeat the above, but now use the batch file. What nodes are now being used?
  3. Modify the batch script so that your MPI code has enough processes to run on at least two different nodes of the Raijin system. After you know how to do this, return to using one node.
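
For reference, one possible shape of the Exercise 1 modification, assuming the rank and process count are already held in variables rank and nprocs (the buffer name and message format are assumptions). Add #include <unistd.h> near the top of the file for gethostname(), then:

char name[256];
gethostname(name, sizeof(name));   /* the node this process is executing on */
printf("Hello from rank %d of %d on node %s\n", rank, nprocs, name);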

Exercise 2: Basic Messaging

We will now modify the MPI program so that process 0 prints the messages for all processes.
  1. First, modify the program to print the messages in order. Processes (including rank 0) should send msgBuf to rank 0 with something like:
    MPI_Send(msgBuf, sizeof(msgBuf), MPI_CHAR, 0, 999, MPI_COMM_WORLD);
    
    Process 0 should run a simple loop with an index src, where 0 ≤ src < nprocs, with a receive call for each of these messages, something like (a sketch combining the two calls is given after this list):
    MPI_Recv(msgBuf, sizeof(msgBuf), MPI_CHAR, src, 999, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    
  2. Modify your program so the messages are printed out in the order they are received. Hint: use MPI_ANY_SOURCE instead of src as the source of the message.
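
Putting the pieces of step 1 together, one possible structure is the following (the buffer size and message text are assumptions; the tag 999 and the two calls are as given above):

char msgBuf[256];
int src;
snprintf(msgBuf, sizeof(msgBuf), "Hello from rank %d of %d", rank, nprocs);
/* every process, including rank 0, sends its message to rank 0; a small
   message like this is buffered by MPI, so the self-send returns immediately */
MPI_Send(msgBuf, sizeof(msgBuf), MPI_CHAR, 0, 999, MPI_COMM_WORLD);
if (rank == 0) {
    for (src = 0; src < nprocs; src++) {   /* receive and print in rank order */
        MPI_Recv(msgBuf, sizeof(msgBuf), MPI_CHAR, src, 999,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("%s\n", msgBuf);
    }
}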