Hands-On Session PG-4: Fault Tolerance:
Checkpointing Stencil and ULFM

Objective: To better understand the checkpoint/restart fault tolerance technique and the semantics of MPI-ULFM.
The code for this exercise can be found here.
Instructions on how to log into the remote machine and how to download the source code to your working directory (using wget) can be found here.

Exercise 1: Checkpointing the Heat Diffusion Program

In this exercise we will extend the heat diffusion program with disk-based coordinated checkpointing. The files of this exercise are in sessionPG4/heat.

Start by loading MPI: module load openmpi. This loads a standard Open MPI implementation without fault tolerance support.

In heat_cr.c, checkpointing is done as follows:

Read through heat_cr.c and familiarize yourself with the program structure. Compile the program and run it with 4 ranks (mpirun -np 4 ./heat_cr). Test it with a small matrix (100x100), running for 100 iterations and checkpointing every 10 iterations. Check the generated checkpoint files in ckpt/. The matrix files should be empty because data checkpointing is not implemented yet. Delete the generated files: rm ckpt/*
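As a point of reference for the missing data checkpointing, the following is a minimal sketch of writing one rank's local array to disk and reading it back. The function names, the file layout (iteration number, element count, raw doubles) and the write-then-rename step are assumptions made for illustration; heat_cr.c may organise its checkpoint files differently.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: store the iteration number and `n` doubles in `path`.
 * The data is written to a temporary file and renamed afterwards, so a crash
 * in the middle of a write does not damage the previous checkpoint (on POSIX
 * systems rename() replaces an existing file). Returns 0 on success. */
int save_checkpoint(const char *path, int iter, const double *data, size_t n)
{
    assert(path != NULL && data != NULL && iter >= 0);
    char tmp[512];
    int len = snprintf(tmp, sizeof tmp, "%s.tmp", path);
    if (len < 0 || len >= (int)sizeof tmp)
        return -1;                          /* path does not fit the buffer */

    FILE *f = fopen(tmp, "wb");
    if (f == NULL)
        return -1;
    int ok = fwrite(&iter, sizeof iter, 1, f) == 1 &&
             fwrite(&n, sizeof n, 1, f) == 1 &&
             fwrite(data, sizeof *data, n, f) == n;
    if (fclose(f) != 0)
        ok = 0;
    if (!ok) {
        remove(tmp);                        /* discard the partial file */
        return -1;
    }
    return rename(tmp, path) == 0 ? 0 : -1;
}

/* Hypothetical helper: read a checkpoint written by save_checkpoint().
 * Returns the saved iteration number, or -1 if the file is missing,
 * truncated, or holds a different number of elements than expected. */
int load_checkpoint(const char *path, double *data, size_t n)
{
    assert(path != NULL && data != NULL);
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return -1;
    int iter = 0;
    size_t stored = 0;
    int ok = fread(&iter, sizeof iter, 1, f) == 1 &&
             fread(&stored, sizeof stored, 1, f) == 1 &&
             stored == n &&
             fread(data, sizeof *data, n, f) == n;
    fclose(f);
    return ok ? iter : -1;
}
```

In an MPI program each rank would presumably call such a helper with its own file name, and at restart every rank would resume from the iteration returned by the load function. For the checkpoint to be coordinated, all ranks have to save at the same iteration.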

Task
  1. Complete the code to correctly checkpoint the told array, and restore it at restart time.
  2. Try the following scenarios to test your program:
  3. How does checkpointing impact the performance of this program? Assuming an MTTI (mean time to interrupt) of 1 minute, what is the optimal checkpointing interval?
  4. Modify the program to delete any old checkpoint file that is not needed for restart. How many old checkpoints do you need to keep?
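For the question on the optimal checkpointing interval, one common first-order model (Young's approximation) treats the time lost as the sum of the time spent writing checkpoints and the work that has to be redone after a failure. The sketch below evaluates that model and scans candidate intervals for the minimum. The function names and the checkpoint cost used in the example are assumptions; the real cost has to be measured for heat_cr.

```c
#include <assert.h>

/* First-order model of the fraction of run time lost to fault tolerance:
 *   ckpt_cost / interval     time spent writing checkpoints
 *   interval / (2 * mtti)    work redone after a failure, on average
 * All arguments are in seconds. The model assumes ckpt_cost << mtti. */
double waste(double interval, double ckpt_cost, double mtti)
{
    assert(interval > 0.0 && ckpt_cost > 0.0 && mtti > 0.0);
    return ckpt_cost / interval + interval / (2.0 * mtti);
}

/* Try the intervals lo, lo + step, ... up to hi and return the one with
 * the smallest modelled waste. */
double best_interval(double ckpt_cost, double mtti,
                     double lo, double hi, double step)
{
    assert(lo > 0.0 && hi >= lo && step > 0.0);
    double best = lo;
    for (int i = 0; lo + i * step <= hi; i++) {
        double t = lo + i * step;
        if (waste(t, ckpt_cost, mtti) < waste(best, ckpt_cost, mtti))
            best = t;
    }
    return best;
}
```

With a hypothetical checkpoint cost of 1.2 s and an MTTI of 60 s, best_interval(1.2, 60.0, 1.0, 60.0, 1.0) selects 12 s, which agrees with the closed form sqrt(2 * 1.2 * 60) = 12 given by Young's approximation. To express the result in iterations, divide it by the measured time per iteration.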

Exercise 2: Mandelbrot with Failed Slaves

In this exercise we will use MPI-ULFM to implement a forward recovery mechanism for the dynamically load balanced Mandelbrot Set solver that you developed in Q3 of PS1. An incomplete implementation is provided in sessionPG4/mandel/ftmandel.c, which uses the master-slave approach for load balancing. The master rank (rank 0) is responsible for allocating tasks to slaves and, when a rank fails, for reassigning its tasks to other slaves.

Using MPI-ULFM on Raijin:

Master-slave communication:

In the given program, the master partitions the Mandelbrot set problem into columns and distributes these columns to idle slaves upon request. When a slave requests a task, the master assigns it the next unassigned column, obtained from nextY(). If all columns are assigned, the master responds by sending '-1' to indicate that no more work is available, and the slave stops requesting tasks. The master uses a simple data structure, y2Slave, to record the mapping between columns and slaves. When a slave fails, the master marks that slave's columns as unassigned so that they can be reassigned to other live slaves.

Task:

  1. Read through ftmandel.c and familiarize yourself with the program structure.
  2. To test a failure-free scenario, type mpirun -np 4 ./ftmandel < mandel.input.
  3. To test a failure scenario, type mpirun -np 4 -am ft-enable-mpi ./ftmandel < mandel.input.kill. This will kill rank 2 after 1 second.
  4. The failure scenario will fail because the fault tolerance code is incomplete. Write the missing code segments, marked by the string "ULFMTODO", and test the failure scenario again.
  5. If your implementation is correct, the program prints "# All pixels appear to be correct!". Otherwise, it prints "Parallel results are wrong".
  6. What is the effect of a slave failure on performance?
  7. Use batch_mandel to measure the performance using 4, 8, and 16 ranks.