Notes for Lab 3


Please note that there was a typo in the lab code that produced incorrect numerical output but did not alter the parallelisability of the problem. The fix is to change line 18 of heat.c to

    grid[j*Nx+(Nx-1)] = T_edge;

where Ny has been replaced with Nx. This has been fixed in the solution file.
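
This matters because, with the indexing grid[j*Nx + i], each row of the grid holds Nx contiguous elements, so the last element of row j is at index j*Nx + (Nx-1); using Ny there touches the wrong elements whenever Nx != Ny. A minimal sketch of the surrounding loop (the loop structure shown here is an assumption made for illustration; see heat.c for the actual code):

    /* Hold the right-hand boundary at the fixed edge temperature.
       With storage grid[j*Nx + i], the last element of row j is at
       index j*Nx + (Nx-1).                                          */
    for (int j = 0; j < Ny; j++) {
        grid[j*Nx + (Nx-1)] = T_edge;   /* edge held at T_edge */
    }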

  1. 5 iterations for point (7,5); n iterations for point (n,n). Information moves one grid point per iteration, so the number of iterations required equals the distance to the closest edge.
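
A minimal sketch of that rule (the helper name and the 0-based Nx x Ny indexing are assumptions made for illustration):

    /* Iterations needed for edge information to first reach point (i, j),
       with 0-based indices i in [0, Nx) and j in [0, Ny).                 */
    static int iterations_to_reach(int i, int j, int Nx, int Ny) {
        int dx = (i < Nx - 1 - i) ? i : Nx - 1 - i;  /* distance to nearest edge in x */
        int dy = (j < Ny - 1 - j) ? j : Ny - 1 - j;  /* distance to nearest edge in y */
        return (dx < dy) ? dx : dy;                  /* the closest edge dominates    */
    }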

  2. You should get something like:

    Number of Processors      1      2      4      8
    Time (s)               10.7    7.3    6.1    7.8
    Speedup                 1.0    1.4    1.7    1.4

We are a long way from ideal speedup, and performance actually decreases when going from 4 to 8 processors. We will see why this is the case when the code is profiled in the second part of the lab.

  3. The amount of memory for data per process is 2*Nx*Ny, so the total memory is 2*p*Nx*Ny, where p is the number of processes; i.e. it scales poorly, since every process stores a redundant copy of the full grid. To scale better, the arrays grid_old and grid_new could each be allocated with Ny*Nx_loc elements, where Nx_loc ~= Nx / p, as sketched below. We would also need to allocate extra storage for the left and right halo columns, and adjust j_start (and j_end) wherever they are used.
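
A minimal sketch of that allocation change (assuming <stdlib.h> is included, that rank and size are the usual MPI rank and communicator size used in the code below, and that the halos are stored as one extra column on each side; the solution file may differ):

    /* Each rank owns Nx_loc of the Nx columns plus one halo column per
       neighbour, instead of a copy of the full Nx x Ny grid.            */
    int Nx_loc = p.Nx / size + (rank < p.Nx % size ? 1 : 0);
    size_t local_elems = (size_t)(Nx_loc + 2) * p.Ny;   /* +2 for left/right halos */
    double *grid_old = malloc(local_elems * sizeof(double));
    double *grid_new = malloc(local_elems * sizeof(double));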

  4. The code is parallelised by dividing the grid of points into vertical strips or chunks (along the x dimension). Each process sends the right boundary of its chunk to process (rank + 1) and waits for that process to return the boundary values it needs. With blocking communication everywhere, process (nproc-2) cannot complete its send until process (nproc-1) posts its receive, process (nproc-3) cannot complete its send until process (nproc-2) does, and so on. The result is a cascade of blocked sends and, effectively, a serialisation of the boundary exchange. There are a number of ways to avoid this (another one, MPI_Sendrecv, is sketched after the timing table below). The most common is to stagger the exchange by rank parity: even-ranked processes send while odd-ranked processes receive, then the roles swap. A simple way to do this is to test rank % 2 on the sends. For example, sending to rank+1 (full implementation in the solution file):

    /* Even ranks send their boundary to rank+1 first ...                     */
    if (rank+1 < size && rank % 2 == 0) {
      MPI_Send(&grid_new[(j_end-1)*p.Nx], ..., rank+1, ...);
    }
    /* ... every rank with a left neighbour then receives its halo ...        */
    if (rank-1 >= 0) {
      MPI_Recv(&grid_new[(j_start-1)*p.Nx], ..., rank-1, ...);
    }
    /* ... and odd ranks send to rank+1 only after receiving, so the
       blocking sends can no longer cascade.                                  */
    if (rank+1 < size && rank % 2 != 0) {
      MPI_Send(&grid_new[(j_end-1)*p.Nx], ..., rank+1, ...);
    }
    

    This will not result in a significant improvement as the application is not particularly send/recv communication-bound.

You might get something like:

    Number of Processors      1      2      4      8
    Time (s)               10.8    7.7    6.7    8.6
    Speedup                 1.0    1.4    1.6    1.3
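
As mentioned above, there are other ways to break the send/receive ordering. One common alternative (not used in the solution file) is MPI_Sendrecv, which pairs each send with a receive and lets the MPI library schedule the exchange safely. A minimal sketch for the send-right/receive-from-left direction, assuming each exchange is one contiguous slice of p.Nx doubles (the opposite direction is analogous):

    /* Hypothetical alternative: paired send/receive of the boundary slice.
       Ranks without a neighbour pass MPI_PROC_NULL, which turns that half
       of the exchange into a no-op.                                        */
    int right = (rank + 1 < size) ? rank + 1 : MPI_PROC_NULL;
    int left  = (rank - 1 >= 0)   ? rank - 1 : MPI_PROC_NULL;
    MPI_Sendrecv(&grid_new[(j_end-1)*p.Nx],   p.Nx, MPI_DOUBLE, right, 0,
                 &grid_new[(j_start-1)*p.Nx], p.Nx, MPI_DOUBLE, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
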
  5. Replace the blocking sends and receives with non-blocking calls, so that all four transfers can be posted immediately, and wait on every outstanding request before the halo data is used (arguments elided as before):

    MPI_Request request[4];
    int num_requests = 0;
    /* Exchange with the right-hand neighbour, if there is one. */
    if (rank+1 < size) {
      MPI_Isend(&grid_new[(j_end-1)*p.Nx], ..., rank+1, ..., &request[num_requests++]);
      MPI_Irecv(&grid_new[j_end*p.Nx], ..., rank+1, ..., &request[num_requests++]);
    }
    /* Exchange with the left-hand neighbour, if there is one. */
    if (rank-1 >= 0) {
      MPI_Isend(&grid_new[j_start*p.Nx], ..., rank-1, ..., &request[num_requests++]);
      MPI_Irecv(&grid_new[(j_start-1)*p.Nx], ..., rank-1, ..., &request[num_requests++]);
    }
    /* Block until all posted sends and receives have completed. */
    MPI_Waitall(num_requests, request, MPI_STATUSES_IGNORE);

You should get something like:

    Number of Processors      1      2      4      8
    Time (s)               10.7    7.9    5.9    5.5
    Speedup                 1.0    1.4    1.8    1.9

  6. Running with 8 processors, update_stencil takes only ~11% of the total processing time. Updating the stencil should be the computational bottleneck of the problem, so it is unexpected that it accounts for such a small proportion of the overall execution time.

The most time-consuming section of the program is copying the grid for the next iteration! This is very expensive because every process copies the entire grid, rather than just its own fraction of it (as described in Question 3). It is therefore a factor of O(p) more expensive than update_stencil.
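
A minimal sketch of restricting the copy to the locally owned range plus its halos (assuming j_start >= 1 and j_end <= Ny-1 so the indices stay in range, and that the halo slices of grid_new have already been received from the neighbours as in the exchange above):

    /* Copy only the slice this rank actually uses next iteration:
       its owned range [j_start, j_end) plus the slices at j_start-1 and j_end. */
    for (int i = (j_start-1)*p.Nx; i < (j_end+1)*p.Nx; i++)
      grid_old[i] = grid_new[i];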

Among the MPI calls, MPI_Allreduce() dominates, which is strange considering it only reduces a single word. If you put an MPI_Barrier() before the MPI_Allreduce(), you will see that almost all of the time moves into the former, with almost none left in the latter. This suggests a load imbalance in the computation rather than a problem with MPI_Allreduce() itself.
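
A minimal diagnostic sketch, using the reduction call from the profile below: the barrier absorbs the waiting caused by the load imbalance, so the profiler then attributes that time to MPI_Barrier() rather than to the reduction itself.

    MPI_Barrier(MPI_COMM_WORLD);   /* ranks that finish early wait here ...       */
    MPI_Allreduce(&local_max_difference, &max_difference, 1,
                  MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);   /* ... so this looks cheap */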

  7. The line between overhead and replicated/sequential work is a bit blurry at times. The four most time-consuming operations were (R = replicated, P = parallel, O = overhead):
    R (72%)   for (int i = 0; i < p.Nx*p.Ny; i++) grid_old[i] = grid_new[i];
    P (11%)   update_stencil(grid_old, grid_new, p.Nx, j_start, j_end);
    P (7.3%)  double local_max_difference = find_max_diff(grid_old, grid_new, p.Nx, j_start, j_end);
    O (3.2%)  MPI_Allreduce(&local_max_difference, &max_difference, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);
    
  8. Indeed, we see the proportion of time spent in the parallel sections decrease while the replicated proportion increases. For example, the grid copy takes 54% of the execution time on 4 processors but only 46% on 2. Conversely, the percentages of time taken by the update_stencil and find_max_diff functions decrease from 26% and 22% (on 2 processors) to 17% and 12% (on 4), respectively.