In this exercise we will extend the heat diffusion program with disk-based coordinated checkpointing. The files of this exercise are in sessionPG4/heat.
Start by loading MPI: module load openmpi. This will load a standard, non-fault-tolerant OpenMPI implementation.
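In heat_cr.c, checkpointing is done as follows:
A checkpoint is taken every ckpt_interval iterations.
At each checkpoint, every rank saves its told array to a file named ckpt_r[rank]_iter[iter_number].bin.
At restart time, every rank reloads its told array from the last consistent checkpoint.
A failure is simulated by calling kill(process_id, SIGKILL) from a victim rank.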
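As a minimal sketch of the save and restore steps described above, the following C functions write an array to a file named after the ckpt_r[rank]_iter[iter_number].bin pattern and read it back. The function names (ckpt_path, save_told, load_told), the dir parameter, and the assumption that told is stored as a contiguous array of doubles are illustrative only and are not taken from heat_cr.c.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Build "<dir>/ckpt_r<rank>_iter<iter>.bin" in buf.
 * Returns 0 on success, -1 if the name does not fit in buf.
 * (Hypothetical helper; `dir` is an assumed parameter.) */
int ckpt_path(char *buf, size_t len, const char *dir, int rank, int iter)
{
    assert(buf != NULL && dir != NULL);
    int n = snprintf(buf, len, "%s/ckpt_r%d_iter%d.bin", dir, rank, iter);
    return (n > 0 && (size_t)n < len) ? 0 : -1;
}

/* Write `count` doubles from `told` to `path` as raw binary.
 * Returns 0 only if every element was written and the file closed cleanly. */
int save_told(const char *path, const double *told, size_t count)
{
    FILE *f = fopen(path, "wb");
    if (f == NULL)
        return -1;
    size_t written = fwrite(told, sizeof(double), count, f);
    int close_err = fclose(f); /* fclose flushes buffered data, so check it */
    return (written == count && close_err == 0) ? 0 : -1;
}

/* Read `count` doubles from `path` into `told`.
 * Returns 0 only if the file held at least `count` elements. */
int load_told(const char *path, double *told, size_t count)
{
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return -1;
    size_t nread = fread(told, sizeof(double), count, f);
    fclose(f);
    return (nread == count) ? 0 : -1;
}
```

In the exercise, each rank would call something like save_told on its own portion of told every ckpt_interval iterations, and load_told once at restart. Checking the element count on load is one simple way to reject a file that was only partially written when a rank was killed.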
Read through heat_cr.c and familiarize yourself with the program structure. Compile the program and run it using 4 ranks: mpirun -np 4 ./heat_cr. Test it with a small matrix (100x100), running for 100 iterations and checkpointing every 10 iterations.
Check the generated checkpoint files in ckpt/. The matrix files should be empty because data checkpointing is not implemented yet.
Delete the generated files: rm ckpt/*
Complete the code that saves the told array at each checkpoint and restores it at restart time.
Run the program without checkpointing: mpirun -np 4 ./heat_cr < heat.input.nockpt, and take note of the resulting convergence value of the last iteration.
Run the program with checkpointing: mpirun -np 4 ./heat_cr < heat.input.ckpt. Delete the generated checkpoint files.
Run the program with a simulated failure: mpirun -np 4 ./heat_cr < heat.input.kill. This will kill rank 1 after 2 seconds.
Restart the program: mpirun -np 4 ./heat_cr. The input arguments should be taken from the metadata file.
Ensure that your program generates the same convergence value as the failure-free scenario.
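Restarting correctly depends on every rank loading the same checkpoint. As a hedged sketch, assuming all ranks checkpoint at the same iterations and keep their older checkpoint files, the last consistent checkpoint is the most recent iteration that every rank finished writing, which is the minimum of the per-rank latest iterations. The function name last_consistent_iter and the use of -1 for "no checkpoint" are assumptions made for illustration; how heat_cr.c and its metadata file actually record this is not shown here.

```c
#include <assert.h>
#include <stddef.h>

/* latest_iter[r] holds the most recent iteration for which rank r finished
 * writing its checkpoint, or -1 if rank r has no complete checkpoint.
 * Returns the most recent iteration that ALL ranks completed, or -1 if at
 * least one rank has none. (Hypothetical helper for illustration.) */
int last_consistent_iter(const int *latest_iter, int nranks)
{
    assert(latest_iter != NULL && nranks > 0);
    int min_iter = latest_iter[0];
    for (int r = 1; r < nranks; r++) {
        if (latest_iter[r] < min_iter)
            min_iter = latest_iter[r];
    }
    return min_iter;
}
```

For example, if ranks 0, 2 and 3 completed the checkpoint of iteration 30 but rank 1 was killed after completing only iteration 20, all ranks should restart from iteration 20. In an MPI program the same minimum can be computed across ranks with MPI_Allreduce and the MPI_MIN operation instead of a local array.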
In this exercise we will use MPI-ULFM to implement a forward recovery mechanism for the dynamically load-balanced Mandelbrot set solver that you developed in Q3 of PS1. An incomplete implementation is provided in sessionPG4/mandel/ftmandel.c, which uses the master-slave approach for load balancing. The master rank (rank 0) is responsible for allocating tasks to slaves and, when a rank fails, for reassigning its tasks to other slaves.
Using MPI-ULFM on Raijin:
Load the MPI-ULFM module: module unload openmpi; module load /short/c37/modules/openmpi-ulfm
To compile use mpicc, and to run use mpirun -np N ./a.out.
Without extra parameters, mpirun starts MPI-ULFM without fault tolerance support.
To enable fault tolerance, add -am ft-enable-mpi to the runtime parameters as follows: mpirun -np N -am ft-enable-mpi ./a.out
In the given program, the master partitions the Mandelbrot set problem into columns and distributes these columns to idle slaves upon request.
When a slave requests a task, the master assigns the next unassigned column, given by nextY(), to the slave.
If all columns are assigned, the master responds by sending '-1' to indicate that no more work is available.
As a result, the slave stops requesting tasks.
The master uses a simple data structure, y2Slave, to record the mapping between columns and slaves.
When a slave fails, the master marks that slave's columns as unassigned so that they can be reassigned to other live slaves.
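The bookkeeping described above can be sketched as follows. This is a minimal illustration, assuming y2Slave is an integer array indexed by column that stores the owning slave's rank; the helper names (init_columns, next_column, release_columns), the UNASSIGNED marker, and the linear scan are hypothetical, and next_column only stands in for the role of nextY() rather than reproducing the code in ftmandel.c.

```c
#include <assert.h>
#include <stddef.h>

#define UNASSIGNED (-1) /* column currently owned by no slave (assumed marker) */
#define NO_WORK    (-1) /* reply meaning "no more columns", as in the text */

/* Mark every column as unassigned. */
void init_columns(int *y2Slave, int width)
{
    assert(y2Slave != NULL && width > 0);
    for (int y = 0; y < width; y++)
        y2Slave[y] = UNASSIGNED;
}

/* Give the lowest-numbered unassigned column to `slave` and return its index,
 * or NO_WORK if every column already has an owner. */
int next_column(int *y2Slave, int width, int slave)
{
    assert(slave > 0); /* rank 0 is the master and takes no columns */
    for (int y = 0; y < width; y++) {
        if (y2Slave[y] == UNASSIGNED) {
            y2Slave[y] = slave;
            return y;
        }
    }
    return NO_WORK;
}

/* Called when `failed_slave` is detected as dead: put all of its columns back
 * in the unassigned pool. Returns how many columns were released. */
int release_columns(int *y2Slave, int width, int failed_slave)
{
    int released = 0;
    for (int y = 0; y < width; y++) {
        if (y2Slave[y] == failed_slave) {
            y2Slave[y] = UNASSIGNED;
            released++;
        }
    }
    return released;
}
```

With four columns, if slave 1 holds columns 0 and 2 when it fails, release_columns returns 2 and the next two requests from live slaves receive columns 0 and 2 again. Note that this sketch releases every column mapped to the failed slave; an implementation that wants to avoid recomputing finished columns would need an additional state for completed work.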
Run the program with mpirun -np 4 ./ftmandel < mandel.input.
Then run it with fault tolerance enabled: mpirun -np 4 -am ft-enable-mpi ./ftmandel < mandel.input.kill. This will kill rank 2 after 1 second.
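Both exercises inject a failure by having a victim rank kill itself after a given time. The sketch below shows one way this can look on a POSIX system, using the standard kill and getpid calls. The function names (maybe_fail, child_was_killed) and their parameters are hypothetical and are not taken from the provided sources; the second function runs the check in a forked child only so that the effect can be observed without terminating the calling process.

```c
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* If this process is the victim and the deadline has passed, terminate it
 * abruptly. SIGKILL cannot be caught or ignored, so no cleanup code runs. */
void maybe_fail(int rank, int victim_rank, double elapsed_s, double kill_after_s)
{
    if (rank == victim_rank && elapsed_s >= kill_after_s)
        kill(getpid(), SIGKILL);
}

/* Run maybe_fail in a child process.
 * Returns 1 if the child was terminated by SIGKILL, 0 if it exited normally,
 * and -1 if fork or waitpid failed. */
int child_was_killed(int rank, int victim_rank,
                     double elapsed_s, double kill_after_s)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        maybe_fail(rank, victim_rank, elapsed_s, kill_after_s);
        _exit(0); /* reached only when the child was not killed */
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return (WIFSIGNALED(status) && WTERMSIG(status) == SIGKILL) ? 1 : 0;
}
```

In an MPI program, a rank would typically call something like maybe_fail inside its main loop with the elapsed wall-clock time, so that the surviving ranks see the failure as an abrupt process loss rather than an orderly shutdown.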