Hands-On Session PG-4: Fault Tolerance

Checkpointing Stencil and ULFM

Objective: To better understand the checkpoint/restart fault tolerance technique and the semantics of MPI-ULFM.
The code for this exercise can be found here.
Instructions on how to log into the remote machine and how to download the source code to your working directory (using wget) can be found here.

Exercise 1: Checkpointing the Heat Diffusion Program

In this exercise we will extend the heat diffusion program with disk-based coordinated checkpointing. The files of this exercise are in sessionPG4/heat.

Start by loading MPI: module load openmpi. This loads a standard, non-fault-tolerant OpenMPI implementation.

In heat_cr.c, checkpointing is done as follows:

The checkpoint interval, in terms of number of iterations, is stored in the variable ckpt_interval.
At the beginning of a checkpointing iteration, each rank writes its intermediate state to disk in two files:
- a matrix file containing the local values of the told array, named ckpt_r[rank]_iter[iter_number].bin.
- a small metadata file that stores the last checkpoint iteration and the program arguments, named metadata_r[rank].bin.
When heat_cr.c starts for the first time, rank 0 prompts the user to enter the program arguments (i.e., Nx, Ny, Max_Iter, ...) and broadcasts these values to the other ranks. However, when it restarts after a failed execution, each rank reads the program arguments from its local metadata file and initializes its told array from the last consistent checkpoint.
To simulate a process failure, a victim rank invokes the system call kill(process_id, SIGKILL).
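
The per-rank matrix file described above can be written and read back with plain binary I/O. The following is a minimal sketch, not the code in heat_cr.c: the helper names write_ckpt and read_ckpt, the dir argument, and the exact file-name pattern are assumptions made for illustration, and the array is treated as a flat buffer of n doubles.

```c
#include <stdio.h>

/* Build the matrix file name. The exact pattern is an assumption
   based on the naming scheme ckpt_r[rank]_iter[iter_number].bin. */
static void ckpt_path(char *buf, size_t len, const char *dir, int rank, int iter)
{
    snprintf(buf, len, "%s/ckpt_r%d_iter%d.bin", dir, rank, iter);
}

/* Hypothetical helper: write n doubles of the local array to disk.
   Returns 0 on success, -1 on any I/O error. */
int write_ckpt(const char *dir, int rank, int iter, const double *told, size_t n)
{
    char path[512];
    ckpt_path(path, sizeof path, dir, rank, iter);

    FILE *f = fopen(path, "wb");
    if (f == NULL)
        return -1;

    size_t written = fwrite(told, sizeof told[0], n, f);

    /* fclose flushes buffered data, so its result matters as well. */
    if (fclose(f) != 0 || written != n)
        return -1;
    return 0;
}

/* Hypothetical helper: read n doubles back into the local array.
   Returns 0 on success, -1 if the file is missing or too short. */
int read_ckpt(const char *dir, int rank, int iter, double *told, size_t n)
{
    char path[512];
    ckpt_path(path, sizeof path, dir, rank, iter);

    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return -1;

    size_t got = fread(told, sizeof told[0], n, f);
    fclose(f);
    return got == n ? 0 : -1;
}
```

One reasonable design choice is to update the metadata file only after write_ckpt has returned 0, so that the iteration recorded in the metadata always refers to a complete matrix file.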

Read through heat_cr.c and familiarize yourself with the program structure. Compile the program and run it on 4 ranks with mpirun -np 4 ./heat_cr . Test it with a small matrix (100x100), running for 100 iterations and checkpointing every 10 iterations. Check the generated checkpoint files in ckpt/. The matrix files should be empty because data checkpointing is not implemented yet. Delete the generated files: rm ckpt/*

Task

Complete the code to correctly checkpoint the told array, and restore it at restart time.
Try the following scenarios to test your program:
- Run the program without checkpointing: mpirun -np 4 ./heat_cr < heat.input.nockpt , and note the convergence value reported for the last iteration.
- Run a failure-free scenario with checkpointing: mpirun -np 4 ./heat_cr < heat.input.ckpt . Delete the generated checkpoint files.
- Run a failure scenario: mpirun -np 4 ./heat_cr < heat.input.kill . This will kill rank 1 after 2 seconds.
- Restart the program without any input arguments: mpirun -np 4 ./heat_cr . The input arguments should be taken from the metadata file. Ensure that your program generates the same convergence value as the failure-free scenario.
How does checkpointing impact the performance of this program? Assuming an MTTI (mean time to interrupt) of 1 minute, what is the optimal checkpointing interval?
Modify the program to delete any old checkpoint file that is not needed for restart. How many old checkpoints do you need to keep?
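
For the question on the optimal checkpointing interval, one common starting point is Young's first-order approximation, which balances the time spent writing checkpoints against the work lost at a failure. In the block below, C is the time to write one checkpoint and M is the MTTI. The value C = 0.5 s is a made-up figure used only to show the arithmetic; measure the actual checkpoint cost of your own program.

```latex
\tau_{\text{opt}} \approx \sqrt{2\,C\,M}
\qquad
\text{e.g. } C = 0.5\ \text{s},\ M = 60\ \text{s}
\;\Rightarrow\;
\tau_{\text{opt}} \approx \sqrt{2 \times 0.5 \times 60} = \sqrt{60} \approx 7.7\ \text{s}
```

Because ckpt_interval is expressed in iterations, divide the resulting time by the measured time per iteration to obtain an iteration count.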

Exercise 2: Mandelbrot with Failed Slaves

In this exercise we will use MPI-ULFM to implement a forward recovery mechanism for the dynamically load-balanced Mandelbrot set solver that you developed in Q3 of PS1. An incomplete implementation is provided in sessionPG4/mandel/ftmandel.c, which uses the master-slave approach for load balancing. The master rank (rank 0) is responsible for allocating tasks to slaves and, when a slave fails, for reassigning its tasks to other slaves.

Using MPI-ULFM on Raijin:

Load MPI-ULFM: module unload openmpi; module load /short/c37/modules/openmpi-ulfm
MPI-ULFM is used in the same way as any other OpenMPI implementation: to compile use mpicc, and to run use mpirun -np N ./a.out .
By default, mpirun starts MPI-ULFM without fault tolerance support. To enable fault tolerance, add -am ft-enable-mpi to the runtime parameters as follows: mpirun -np N -am ft-enable-mpi ./a.out

Master-slave communication:

In the given program, the master partitions the Mandelbrot set problem into columns and distributes these columns to idle slaves on request. When a slave requests a task, the master assigns it the next unassigned column (nextY()). If all columns are assigned, the master responds with -1 to indicate that no more work is available, and the slave stops requesting tasks. The master uses a simple data structure, y2Slave, to record the mapping between columns and slaves. When a slave fails, the master marks that slave's columns as unassigned so that they can be reassigned to other live slaves.
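
The bookkeeping described above can be illustrated without any MPI calls. The sketch below is not the code in ftmandel.c: it assumes the column-to-slave map is a plain int array, and the function names init_map, assign_next and release_failed are hypothetical. It also does not model whether a failed slave had already returned results for some of its columns.

```c
#define UNASSIGNED (-1)  /* column has no owner yet */
#define NO_WORK    (-1)  /* reply sent when every column is assigned */

/* Mark every column as unassigned. */
void init_map(int *y2slave, int ncols)
{
    for (int y = 0; y < ncols; y++)
        y2slave[y] = UNASSIGNED;
}

/* Give the lowest-numbered unassigned column to `slave` and record
   the mapping. Returns the column index, or NO_WORK if none is left. */
int assign_next(int *y2slave, int ncols, int slave)
{
    for (int y = 0; y < ncols; y++) {
        if (y2slave[y] == UNASSIGNED) {
            y2slave[y] = slave;
            return y;
        }
    }
    return NO_WORK;
}

/* Mark all columns owned by a failed slave as unassigned so that
   later requests pick them up. Returns the number of columns released. */
int release_failed(int *y2slave, int ncols, int failed)
{
    int released = 0;
    for (int y = 0; y < ncols; y++) {
        if (y2slave[y] == failed) {
            y2slave[y] = UNASSIGNED;
            released++;
        }
    }
    return released;
}
```

With this layout, a slave that has already been told that no work is available would never ask again, so a real implementation also has to make sure that released columns still reach a live slave.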

Task:

Read through ftmandel.c and familiarize yourself with the program structure.
To test a failure-free scenario, type mpirun -np 4 ./ftmandel < mandel.input .
To test a failure scenario, type mpirun -np 4 -am ft-enable-mpi ./ftmandel < mandel.input.kill . This will kill rank 2 after 1 second.
The failure scenario will fail because the fault tolerance code is incomplete. Write the missing code segments, marked by the string "ULFMTODO", and test the failure scenario again.
If your implementation is correct, you should see the message "# All pixels appear to be correct!"; otherwise, you will see the message "Parallel results are wrong".
What is the effect of a slave failure on performance?
Use batch_mandel to measure the performance using 4, 8, and 16 ranks.