Notes for Lab 2


Debugging a “Real” Program

The three bug fixes are:

  • An invalid parameter passed to an MPI routine: Delete B=NULL on line 2555.
  • A segmentation fault: Delete A[100000] = 0.0 on line 3959.
  • A deadlock: Processes 0 and 3 are trying to send and receive a message (on lines 963 and 3986 respectively) with different tags, so the receive can never match the send. This can be fixed by setting the two tags to the same value; a minimal illustration of the tag mismatch follows the list.
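
For illustration only, here is a hypothetical two-process fragment (not the lab's code) showing how a tag mismatch of this kind hangs: the receive is posted with tag 1 but the only send uses tag 0, so the receive can never match and rank 1 blocks forever. Run with at least two processes, e.g. mpirun -np 2.

    /* Sketch of the tag-mismatch deadlock; not taken from the lab source. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, value = 42;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            MPI_Send(&value, 1, MPI_INT, 1, /* tag */ 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* BUG: tag 1 never matches the sender's tag 0, so this blocks
             * forever.  Fix: use tag 0 here (or MPI_ANY_TAG). */
            MPI_Recv(&value, 1, MPI_INT, 0, /* tag */ 1, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }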

Balancing Your Loads

  1. You should get results like:
    ---------------------------------------------------------
                            Number of processors
    Input  Time             1        2        4        8
    ---------------------------------------------------------
    #1     Sequential       0.0204   0.0203   0.0204   0.0203
    #1     Parallel         0.0204   0.0140   0.0104   0.0103
    #1     Speedup          1.0x     1.45x    1.99x    1.97x
    #2     Sequential       0.0156   0.0156   0.0156   0.0156
    #2     Parallel         0.0156   0.0081   0.0042   0.0022
    #2     Speedup          1.0x     1.92x    3.71x    7.09x
    ---------------------------------------------------------

    Expt 1 is on a 10 x 100 array whereas Expt 2 is on a 100 x 10 array. The second scheme shows much better scalability. The code is parallelised over Ny, so in the first case there are just 10 large parallel tasks, whereas in the second case there are 100 small parallel tasks. Note that the division of tasks is round-robin (row % nprocs == rank), so some processes have to compute more tasks than others when the tasks do not divide evenly across the processes; large tasks amplify this load imbalance and result in worse performance, because the processes with fewer or cheaper tasks must wait for the rest. A sketch of this static round-robin mapping is shown below.
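
    As a rough sketch of that mapping (the function names here are placeholders, not copied from mandel.c):

     #include <mpi.h>

     /* Static round-robin mapping (sketch): rank r computes rows r, r + nprocs,
      * r + 2*nprocs, ...  With Ny = 10 rows on 8 processes, ranks 0 and 1 get
      * two rows each while the rest get one, and the rows themselves cost
      * different amounts of work, hence the imbalance measured in question 2. */
     void compute_static(int Ny, void (*compute_row)(int)) {
         int rank, nprocs;
         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
         MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
         for (int row = 0; row < Ny; row++) {
             if (row % nprocs == rank)
                 compute_row(row);          /* placeholder for the per-row work */
         }
     }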

  2. You should get results which clearly show that the degree of imbalance is much less in Expt 2, something like the table below; a sketch of how the per-rank parallel times can be measured follows it.
    ---------------------------------------------------------
                         Task    Number of processors
    Input  Time          Rank    2        4        8
    ---------------------------------------------------------
    #1     Parallel      0       0.0064   0.0055   0.0055
                         1       0.0140   0.0105   0.0103
                         2       N/A      0.0009   0.0000
                         3       N/A      0.0035   0.0000
                         4       N/A      N/A      0.0000
                         5       N/A      N/A      0.0001
                         6       N/A      N/A      0.0009
                         7       N/A      N/A      0.0035
    #2     Parallel      0       0.0076   0.0036   0.0020
                         1       0.0081   0.0038   0.0021
                         2       N/A      0.0040   0.0021
                         3       N/A      0.0042   0.0023
                         4       N/A      N/A      0.0016
                         5       N/A      N/A      0.0017
                         6       N/A      N/A      0.0018
                         7       N/A      N/A      0.0019
    ---------------------------------------------------------
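
    One way to obtain such per-rank numbers (an assumption about the instrumentation, not necessarily what the lab code does) is to time each rank's parallel section with MPI_Wtime and gather the times onto rank 0:

     #include <mpi.h>
     #include <stdio.h>
     #include <stdlib.h>

     /* Sketch: time this rank's share of the parallel work and collect the
      * per-rank times on rank 0 for printing (MPI_Wtime returns seconds). */
     void report_rank_times(void (*do_my_work)(void)) {
         int rank, nprocs;
         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
         MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

         double t0 = MPI_Wtime();
         do_my_work();                                 /* this rank's rows */
         double my_time = MPI_Wtime() - t0;

         double *times = (rank == 0) ? malloc(nprocs * sizeof(double)) : NULL;
         MPI_Gather(&my_time, 1, MPI_DOUBLE, times, 1, MPI_DOUBLE,
                    0, MPI_COMM_WORLD);
         if (rank == 0) {
             for (int r = 0; r < nprocs; r++)
                 printf("rank %d parallel time: %.4f\n", r, times[r]);
             free(times);
         }
     }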
    
  3. You have to have a specialized MPI process, i.e., the coordinator (say rank 0), devoted to handing out the next task. This rank has a counter (corresponding to the next row to be allocated), initially zero, and (essentially) sits in a receive call in an infinite loop. When a request comes in, it replies with the current value of the counter, increments the counter, and goes back into the receive call (note that it must be able to receive from any process). Workers sit in their own loops requesting tasks from the coordinator. The coordinator also controls termination: once the counter exceeds the maximum row index, it starts responding to requests with -1 to indicate there are no more tasks, and on receipt of a -1 a worker leaves its loop. The coordinator itself terminates once it has handed out nprocs-1 (i.e., the number of workers) -1 values. A simplified sketch of this pattern is given below.
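
    The following is only a simplified sketch of that coordinator/worker loop (the function names and single-integer messages are assumptions; mandel.c contains the actual solution):

     #include <mpi.h>

     void compute_row(int row);                 /* hypothetical per-row work */

     /* Coordinator (rank 0): hand out row indices on request; -1 = no more work. */
     void coordinator(int Ny, int nprocs) {
         int next_row = 0, finished = 0, dummy;
         MPI_Status status;
         while (finished < nprocs - 1) {        /* one -1 per worker */
             MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, 0,
                      MPI_COMM_WORLD, &status);
             int reply = (next_row < Ny) ? next_row++ : -1;
             if (reply == -1) finished++;
             MPI_Send(&reply, 1, MPI_INT, status.MPI_SOURCE, 0, MPI_COMM_WORLD);
         }
     }

     /* Worker (ranks 1..nprocs-1): request rows until told to stop. */
     void worker(void) {
         int request = 0, row;
         for (;;) {
             MPI_Send(&request, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
             MPI_Recv(&row, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
             if (row == -1) break;              /* no more tasks */
             compute_row(row);
         }
     }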

    See the attached solution code mandel.c.

    The following lines were changed in the batch file.

     #PBS -l ncpus=16
     ...
       for p in 2 4 8 16 ; do
    

    You should get results like:

    ---------------------------------------------------------
                            Number of processors
    Input  Time             2        4        8        16
    ---------------------------------------------------------
    #1     Sequential       0.0196   0.0196   0.0196   0.0196
    #1     Parallel         0.0196   0.0109   0.0100   0.0100
    #1     Speedup          1.0x     1.79x    1.96x    1.96x
    #2     Sequential       0.0151   0.0151   0.0151   0.0151
    #2     Parallel         0.0151   0.0055   0.0026   0.0016
    #2     Speedup          1.0x     2.74x    5.80x    9.43x
    ---------------------------------------------------------
    
  4. The alternative approach is for the workers to send each row back to the coordinator as it is computed, and the coordinator then stores it in its global pixel array. In the static mapping case this is effectively a gather operation, so it is better implemented with a collective operation than with point-to-point communication. This reduces memory usage on the workers (each only needs to allocate one row or block of rows instead of the whole grid) and also reduces communication overhead; a sketch of this gather-based variant follows the next paragraph.

    The reduce approach consumes much more memory (the whole grid of pixels is replicated in every process!) and also increases communication overhead, since the blocks of rows not mapped to a given process contribute nothing but zeros to the reduction.
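
    A minimal sketch of the gather variant under the static round-robin mapping (the float pixel type, the buffer names, and the assumption that Ny divides evenly by nprocs are all illustrative; otherwise MPI_Gatherv would be needed):

     #include <mpi.h>
     #include <stdlib.h>
     #include <string.h>

     void compute_row_into(int row, float *out);    /* hypothetical per-row work */

     /* Each rank computes only its round-robin rows into a small packed buffer;
      * rank 0 gathers the buffers and un-interleaves them into the full image. */
     void gather_image(int Ny, int Nx, float *image /* used on rank 0 only */) {
         int rank, nprocs;
         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
         MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

         int my_rows = Ny / nprocs;              /* assumes Ny % nprocs == 0 */
         float *local = malloc(my_rows * Nx * sizeof(float));
         for (int i = 0; i < my_rows; i++)
             compute_row_into(i * nprocs + rank, &local[i * Nx]);

         float *packed = (rank == 0) ? malloc((size_t)Ny * Nx * sizeof(float)) : NULL;
         MPI_Gather(local, my_rows * Nx, MPI_FLOAT,
                    packed, my_rows * Nx, MPI_FLOAT, 0, MPI_COMM_WORLD);

         if (rank == 0) {      /* packed = rank 0's rows, then rank 1's, ... */
             for (int r = 0; r < nprocs; r++)
                 for (int i = 0; i < my_rows; i++)
                     memcpy(&image[(size_t)(i * nprocs + r) * Nx],
                            &packed[(size_t)(r * my_rows + i) * Nx],
                            Nx * sizeof(float));
             free(packed);
         }
         free(local);
     }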
