Notes for Lab 3
Please note that there was a typo in the lab that produced incorrect numerical output but did not alter the parallelisability of the problem. The fix is changing line 18 of heat.c to

```c
grid[j*Nx + (Nx-1)] = T_edge;
```

where Ny has been replaced with Nx. This has been fixed in the solution file.
- 5 for point (7,5). For point (n,n), n. It takes one iteration to move information one grid position, so the number of iterations required equals the distance to the closest edge.
- You should get something like:
| Number of Processors | 1 | 2 | 4 | 8 |
|---|---|---|---|---|
| Time (s) | 10.7 | 7.3 | 6.1 | 7.8 |
| Speedup | 1.0 | 1.4 | 1.7 | 1.4 |
We are a long way from ideal speedup, and performance actually decreases with 8 processors. We will see why this is the case when the code is profiled in the second part of the lab.
- The amount of memory for data per process is 2*Nx*Ny words. Thus the total memory is 2*p*Nx*Ny, where p is the number of processes (i.e. it scales poorly, since every processor stores redundant copies of the full grid). To scale better, the arrays `grid_old` and `grid_new` could be allocated with Ny*Nx_loc elements, where Nx_loc ~= Nx / p. We would also need to allocate extra storage for the left and right halo columns, and to adjust `j_start` (and `j_end`) wherever they are used.
- The code is parallelized by dividing the grid of points into vertical strips or chunks (by the x dimension). Each process sends the right boundary of its chunk to process (rank + 1) and waits for that process to return the boundary condition it needs. Under globally blocking communication, process (nproc-2) can't complete its send before process (nproc-1) posts its receive; process (nproc-3) can't complete its send before that, and so forth. The result is a cascade of blocked communication and, effectively, the serialization of the boundary exchange. There are a number of approaches to avoid this. The most common is to pair up odd and even processes: even ranks send first while odd ranks receive, then the odd ranks send. Simply test the rank with mod 2 (% 2) on the sends. For example, sending to rank+1 (full implementation in the solution file):

```c
if (rank+1 < size && rank % 2 == 0) {
    MPI_Send(&grid_new[(j_end-1)*p.Nx], ..., rank+1, ...);
}
if (rank-1 >= 0) {
    MPI_Recv(&grid_new[(j_start-1)*p.Nx], ..., rank-1, ...);
}
if (rank+1 < size && rank % 2 != 0) {
    MPI_Send(&grid_new[(j_end-1)*p.Nx], ..., rank+1, ...);
}
```

This will not result in a significant improvement, as the application is not particularly send/recv communication-bound.
You might get something like:
| Number of Processors | 1 | 2 | 4 | 8 |
|---|---|---|---|---|
| Time (s) | 10.8 | 7.7 | 6.7 | 8.6 |
| Speedup | 1.0 | 1.4 | 1.6 | 1.3 |
- Using non-blocking communication (full implementation in the solution file):

```c
MPI_Request request[4];
int num_requests = 0;
if (rank+1 < size) {
    MPI_Isend(&grid_new[(j_end-1)*p.Nx], ..., rank+1, ..., &request[num_requests++]);
    MPI_Irecv(&grid_new[j_end*p.Nx],     ..., rank+1, ..., &request[num_requests++]);
}
if (rank-1 >= 0) {
    MPI_Isend(&grid_new[j_start*p.Nx],     ..., rank-1, ..., &request[num_requests++]);
    MPI_Irecv(&grid_new[(j_start-1)*p.Nx], ..., rank-1, ..., &request[num_requests++]);
}
MPI_Waitall(num_requests, request, MPI_STATUSES_IGNORE);
```
You should get something like:
| Number of Processors | 1 | 2 | 4 | 8 |
|---|---|---|---|---|
| Time (s) | 10.7 | 7.9 | 5.9 | 5.5 |
| Speedup | 1.0 | 1.4 | 1.8 | 1.9 |
- Running with 8 processors, `update_stencil` takes only ~11% of the total processing time. Updating the stencil should be the computational bottleneck of the problem, so it is unexpected that it takes such a small proportion of the overall execution time.
The most time consuming section of the program is copying the grid for the next iteration! This is very expensive because every process copies the entire grid (rather than just its own fraction of it, as described in Question 3). It is therefore O(p) more expensive than update_stencil.
MPI_Allreduce() dominates, which is strange considering it reduces just one word. If you put an MPI_Barrier() before the MPI_Allreduce(), you will see that almost all of the time is now spent in the barrier, with almost none left in the reduction. This suggests a load imbalance in the computation rather than an issue with MPI_Allreduce() itself.
- The line between Overhead and Replicated/Sequential work is a bit blurry at times. The 4 most time consuming operations were:
  - R (72%): `for (int i = 0; i < p.Nx*p.Ny; i++) grid_old[i] = grid_new[i];`
  - P (11%): `update_stencil(grid_old, grid_new, p.Nx, j_start, j_end);`
  - P (7.3%): `double local_max_difference = find_max_diff(grid_old, grid_new, p.Nx, j_start, j_end);`
  - O (3.2%): `MPI_Allreduce(&local_max_difference, &max_difference, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);`
- Indeed, we see the proportion of time spent in the parallel sections decrease while the replicated proportion increases. For example, the grid copy takes 54% of the time on 4 processors but only 46% of execution time on 2. Conversely, the percentage of time taken by the `update_stencil` and `find_max_diff` functions decreases from 26% and 22% to 17% and 12% respectively.