Hands-On Session Parallelization Strategies #1:
Load Balancing Embarrassingly Parallel Problems

Objective: To better understand issues with load balancing and to further MPI programming proficiency.
The code for this exercise can be found here. Instructions on how to log into the remote machine and how to download the source code to your working directory (using wget) can be found here.

We will use the Mandelbrot Set as an example of a load balancing problem as discussed in lectures.

[Figure: the Mandelbrot set on the complex plane]

The image is generated from values associated with a set of points in the complex plane. The points are quasi-stable, with values computed by iterating the function Z_{k+1} = Z_k^2 + C, where Z and C are complex numbers (i.e. Z = A + Bi), Z is initially set to zero, and C is the position of the point in the complex plane. Iteration continues until either |Z| > 2 or some arbitrary iteration limit is reached, where |Z| = sqrt(A^2 + B^2). The set of points is enclosed by a circle of radius 2 centered at (0,0).

A slightly modified program to evaluate the Mandelbrot set is contained in mandel.c. Computing a Mandelbrot pixel as implemented takes from 1 to 256 iterations; however, even when 256 iterations are required the total time taken is still very small. The code has therefore been modified to do extra work proportional to the square of the number of Mandelbrot iterations. This extra work is purely to increase the task time so that we have a meaningful load balancing problem (see the sketch below).
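
To make this concrete, here is a minimal sketch of the per-pixel computation, assuming an iteration limit of 256. The function name and the exact form of the extra-work loop are illustrative assumptions, not the code in mandel.c:

    #include <complex.h>

    #define MAXITER 256   /* iteration limit assumed; see mandel.c */

    /* Returns the iteration count for the point C = cre + cim*i. */
    int mandel_pixel(double cre, double cim)
    {
        double complex c = cre + cim * I;
        double complex z = 0.0;
        volatile double dummy = 0.0;
        int k;

        /* Iterate Z_{k+1} = Z_k^2 + C until |Z| > 2 or the limit is hit. */
        for (k = 0; k < MAXITER && cabs(z) <= 2.0; k++)
            z = z * z + c;

        /* Extra work proportional to k^2, added purely to lengthen the
           task time so that load balancing becomes meaningful. */
        for (long j = 0; j < (long)k * k; j++)
            dummy += 1.0e-6;

        return k;
    }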

The program requires as input:

Nx       the number of data points along the x axis
Ny       the number of data points along the y axis
Rx_min   the minimum x coordinate
Rx_max   the maximum x coordinate
Ry_min   the minimum y coordinate
Ry_max   the maximum y coordinate
Nexpt    the number of timing experiments to run

NOTE: To ensure the points are enclosed within the circle of radius 2, the values of Rx/Ry_min/max should be between -2 and +2. The Mandelbrot computation is run Nexpt times and the minimum time is printed out; this should minimize the need for you to rerun the computation several times by hand.

The code is written to solve the Mandelbrot problem twice: first sequentially, then in parallel. Both runs are timed, and as a sanity check the results from the sequential code are verified to be identical to those from the parallel computation. You should only change the parallel part of the code.

Run the code: mpirun -np 1 mandel

with input

10 10
-2 2
-2 2
4

to make sure it works.

For problem sizes with Nx and Ny less than or equal to 50 the resulting pixels are printed; this is really just for interest. A small gnuplot script, mandelbrot.plot, is included that plots the data in a file named mandelbrot.dat. You can use it to visualize your output by running:

module load gnuplot
./mandel < mandel.in > mandelbrot.dat
gnuplot < mandelbrot.plot

Note that you might need to connect to Raijin using ssh -X ... so that X11 can forward the window that gnuplot opens. In the above, the input data is read from the file mandel.in; for what follows you might find it more convenient to always read the input data from a file rather than retype it each time.

Can you see that the printout looks something like the picture above (noting that the center of the picture should be black)?

  1. Complete the table below for the following two input data sets. Inspect the file batch_mandel, which will generate the required runs (as these are longer runs, you should submit them to the batch system using the normal queue).
         #1 Nx=100, Ny=10, Rx_min=-2, Rx_max=0, Ry_min=-2, Ry_max=0, Nexpt=10
         #2 Nx=10, Ny=100, Rx_min=-2, Rx_max=0, Ry_min=-2, Ry_max=0, Nexpt=10
         -----------------------------------------------------------------
                                  Number of MPI processes and cores used
         Input  Time          1        2        4        8          
         -----------------------------------------------------------------
         #1     Sequential             -        -        -     
         #1     Parallel
         #1     Speedup
         #2     Sequential             -        -        -     
         #2     Parallel
         #2     Speedup
         -----------------------------------------------------------------
    
    Both data sets have the same number of data points and the same Rx/y min/max values. Explain why the performance is so different. (Hint: it may help to run this region with a smaller number of data points so that the pixel values are printed.) Your answer should include comments on how the code has been parallelized.
     
  2. As it stands, the parallel code records the time from when all the processes start until the last one finishes. Insert code to record the time taken to perform all the pixel computations on each process: this is the time from the current initial timing call until just BEFORE the MPI_Reduce() call. Each process should determine its minimum task time over the Nexpt experiments. When all experiments are done, collect the minimum task times from the different processes onto process 0 using MPI_Gather() and store them in the ttsk_all array (already declared). Modify the code so that process 0 prints these values when it prints the sequential and parallel total times. (A sketch of this timing-and-gather pattern is given after question 4.) Using the same input data as above, complete the following table.
         -----------------------------------------------------------------
                         MPI      Number of MPI processes and cores used
         Input  Time     Rank       2           4           8
         -----------------------------------------------------------------
         #1     Parallel   0         
                           1
                           2        -  
                           3        -  
                           4        -           -  
                           5        -           -  
                           6        -           -  
                           7        -           -  
         #2     Parallel   0
                           1
                           2        -  
                           3        -  
                           4        -           -  
                           5        -           -  
                           6        -           -  
                           7        -           -  
         -----------------------------------------------------------------
    
    Does this explain the previous results?
  3. Implement dynamic load balancing using a master-slave approach in which the slaves request tasks from the master and (in the parallel section) the master does no useful work other than allocating tasks to the slaves; you effectively lose a process, so -np 2 is the minimum number of processes. (A sketch of this pattern is also given after question 4.) Gather performance data for the two data sets given above on up to 16 processors.
  4. In mandel.c, all processes use an Ny x Nx array, pix_tmp, initialized to 0.0, into which each stores the results of its Ny/nprocs allocated rows; an MPI_Reduce() is then used to collect all rows onto rank 0. What is an alternative to this (see lecture notes)? What are the performance pros and cons of each approach?
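
For question 2, the following is a minimal, self-contained sketch of the timing-and-gather pattern. Apart from ttsk_all, which the exercise says is already declared, every name here is an assumption, and the placement of MPI_Reduce() inside the experiment loop follows the description above rather than the actual structure of mandel.c:

    #include <mpi.h>
    #include <stdio.h>

    #define NEXPT 4        /* number of timing experiments (assumed) */

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        double ttsk_all[64];   /* already declared in mandel.c; size assumed */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        double ttsk_min = 1.0e30;
        for (int e = 0; e < NEXPT; e++) {
            double t0 = MPI_Wtime();
            /* ... this process's pixel computations go here ... */
            double ttsk = MPI_Wtime() - t0;  /* stop just BEFORE MPI_Reduce() */
            if (ttsk < ttsk_min)
                ttsk_min = ttsk;
            /* ... existing MPI_Reduce() of the pixel array ... */
        }

        /* Collect every process's minimum task time onto process 0. */
        MPI_Gather(&ttsk_min, 1, MPI_DOUBLE,
                   ttsk_all, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        if (rank == 0)
            for (int p = 0; p < nprocs; p++)
                printf("rank %2d: min task time = %g s\n", p, ttsk_all[p]);

        MPI_Finalize();
        return 0;
    }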
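
For question 3, here is one possible shape of the master-slave scheme, assuming tasks are handed out one row at a time; all names and tags are assumptions, and returning the computed pixels to the master is elided. Seeding every slave once and then replying to each completion message implements the "slaves request tasks" protocol: a completion doubles as a request for the next task.

    #include <mpi.h>

    #define TAG_WORK 1
    #define TAG_STOP 2

    void master(int nprocs, int ny)
    {
        MPI_Status st;
        int row = 0, active = 0, done;

        /* Seed each slave with one row; stop slaves we have no work for. */
        for (int p = 1; p < nprocs; p++) {
            if (row < ny) {
                MPI_Send(&row, 1, MPI_INT, p, TAG_WORK, MPI_COMM_WORLD);
                row++; active++;
            } else {
                MPI_Send(&row, 1, MPI_INT, p, TAG_STOP, MPI_COMM_WORLD);
            }
        }

        /* Hand out remaining rows as slaves report back. */
        while (active > 0) {
            MPI_Recv(&done, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);    /* pixel results elided */
            if (row < ny) {
                MPI_Send(&row, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                         MPI_COMM_WORLD);
                row++;
            } else {
                MPI_Send(&row, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP,
                         MPI_COMM_WORLD);
                active--;
            }
        }
    }

    void slave(void)
    {
        MPI_Status st;
        int row, done = 0;

        for (;;) {
            MPI_Recv(&row, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP)
                break;
            /* ... compute all Nx pixels of this row ... */
            MPI_Send(&done, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD);
        }
    }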
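
For question 4, one alternative consistent with the block-of-rows decomposition is for each process to fill only a local buffer holding its own rows, with rank 0 concatenating the blocks using MPI_Gather(), rather than reducing full-size, mostly zero arrays. A minimal sketch, assuming Ny divides evenly by nprocs; the names pix_loc, pix and collect_rows are illustrative:

    #include <mpi.h>
    #include <stdlib.h>

    /* Assumes ny is divisible by nprocs and a block-of-rows decomposition. */
    void collect_rows(double *pix, int nx, int ny, int nprocs)
    {
        int rows = ny / nprocs;
        double *pix_loc = malloc((size_t)rows * nx * sizeof *pix_loc);

        /* ... this process fills only its own rows of pix_loc ... */

        /* Each process contributes rows*nx values; process 0 receives the
           blocks in rank order, so they land contiguously in pix. */
        MPI_Gather(pix_loc, rows * nx, MPI_DOUBLE,
                   pix, rows * nx, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        free(pix_loc);
    }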