Notes for Lab 6


Exercise 1#

Just above the for (y...) loop to compute pix_par[], add:

#pragma omp parallel for schedule(dynamic) default(shared) private(x,c)

Note that the variables x (inner loop variable) and c (pixel info variable) are set to be private variables, and thus each thread has its own copy of the variable. If this was not the case, we would have data race conditions (and thus an incorrect program) when updating x and c from different threads.
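For reference, a minimal sketch of how this pragma sits on the loop nest; width, height, compute_pixel() and the pix_par[] indexing are placeholders, not the lab's exact mandel code:

  /* Sketch only: width, height and compute_pixel() are assumed names. */
  #pragma omp parallel for schedule(dynamic) default(shared) private(x, c)
  for (y = 0; y < height; y++) {      /* y, the parallel loop index, is made private automatically */
    for (x = 0; x < width; x++) {     /* x must be declared private: one copy per thread */
      c = compute_pixel(x, y);        /* c likewise holds per-thread pixel data */
      pix_par[y*width + x] = c;       /* writes go to distinct elements, so no race here */
    }
  }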

Exercise 2#

Using a 40x40 image size (mandel.in from Lab 2), the times from a batch run are:

p              1       2       4       8       16      32      p=32, mandel2.in
MPI            0.0133  0.0067  0.0033  0.0018  0.0014  0.0014  0.0008
OMP dynamic    0.0133  0.0076  0.0040  0.0026  0.0063  0.0075  0.0057
OMP static     0.0132  0.0083  0.0092  0.0076  0.0069  0.0074  0.0087
OMP static,1   0.0132  0.0084  0.0058  0.0024  0.0061  0.0061  0.0056

The MPI program corresponds to OMP static but somehow gives better performance. On this data, the OMP dynamic mapping strategy shows no clear advantage, as the load imbalance for this input is small.

Exercise 3#

Just above the j loop where we update tnew, add:

#pragma omp parallel for default(shared) private(i)

(the private clause is needed, as in Exercise 1) and further below:

  #pragma omp parallel for
  for (i = 0; i < Nx*Ny; i++) told[i] = tnew[i];

The remaining loop to parallelize calculates mxdiff, which we can handle using a reduction clause:

#pragma omp parallel for default(shared) private(i,tdiff) reduction(max: mxdiff)
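Putting this together, a sketch of the convergence-check loop inside heat.c; the loop body (the absolute difference between tnew and told) is an assumption about what the lab code computes, and fabs() requires <math.h>:

  mxdiff = 0.0;
  #pragma omp parallel for default(shared) private(i, tdiff) reduction(max: mxdiff)
  for (i = 0; i < Nx*Ny; i++) {
    tdiff = fabs(tnew[i] - told[i]);      /* assumed loop body */
    if (tdiff > mxdiff) mxdiff = tdiff;   /* each thread updates its private copy of mxdiff ... */
  }
  /* ... and OpenMP combines the private copies with max at the end of the loop */

Note that reduction(max: ...) requires OpenMP 3.1 or later.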

Exercise 4#

We use heat.input (Ny=Nx=5000). These results are from a batch run:

p     1       2       4       8       16
MPI   1.1632  0.8156  0.6371  0.5382  0.5946
OMP   1.0022  0.5739  0.4417  0.5907  0.7008

OMP is initially a little faster. However, both show scalability limits despite this being a relatively large problem size: MPI stops improving beyond p=8, and OMP beyond p=4.

Exercise 5#

The main thread exits without waiting for the child threads, so the child threads sometimes do not get a chance to write their messages to standard output. Sometimes messages even get repeated, due to incorrect cleanup of the I/O buffers. The fix is to add the following after the loop creating the threads:

  for (i=0; i < n; i++)
    pthread_join(threads[i], NULL);
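A minimal sketch of the fixed structure; struct parm, p[], MAXTHRD and the hard-coded thread count are assumed stand-ins for whatever the lab's hello.c actually uses:

  /* Sketch only: the declarations are assumed, not copied from the lab code. */
  #include <stdio.h>
  #include <pthread.h>

  #define MAXTHRD 16

  struct parm { int id; };

  void *hello(void *arg) {
    struct parm *p = (struct parm *) arg;
    printf("Hello from thread %d\n", p->id);
    return NULL;
  }

  int main(void) {
    pthread_t threads[MAXTHRD];
    struct parm p[MAXTHRD];
    int i, n = 4;                            /* assumed thread count */
    for (i = 0; i < n; i++) {
      p[i].id = i;
      pthread_create(&threads[i], NULL, hello, &p[i]);
    }
    for (i = 0; i < n; i++)                  /* the added loop: wait for every child */
      pthread_join(threads[i], NULL);
    return 0;                                /* main now exits only after all threads have printed */
  }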

Exercise 6#

In hello(), we add a new member n to the struct type parm; we change the printf() in hello(void *arg) to printf("Hello from thread %d of %d threads\n", p->id, p->n); and in main(), just before the call to pthread_create(), we add p[i].n = n;.
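Relative to the sketch in Exercise 5, the three edits would look roughly like this (again a sketch, not the lab's exact code):

  struct parm { int id; int n; };            /* new member n: the total thread count */

  void *hello(void *arg) {
    struct parm *p = (struct parm *) arg;
    printf("Hello from thread %d of %d threads\n", p->id, p->n);   /* modified message */
    return NULL;
  }

    /* in main(), just before the call to pthread_create(): */
    p[i].id = i;
    p[i].n  = n;                             /* every thread also receives the thread count */
    pthread_create(&threads[i], NULL, hello, &p[i]);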

Exercise 7#

In cpi(), add the declarations for the block distribution’s number of points and starting point:

  int mynumpoint = numpoint / nthread +
    (myid < nthread-1? 0: numpoint % nthread);
  int my1stpoint = myid * (numpoint / nthread);

and change the loop to:

  for (i = 1; i <= mynumpoint; i++) {
    x = h * ((double) (my1stpoint + i) - 0.5);
    sum += 4.0 / (1.0 + x * x);
  }

You may have found that simply dividing pi by nthread before printing the final value (line 81) also leads to the correct result, without any of the changes above. This is because the original cpi() called from every thread computes the full pi independently. However, this does NOT actually partition/parallelize the computation, so you would not get any speed-up from it. The point of this exercise is to grasp how to partition the workload (the numpoint iterations) across the thread ids.
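For completeness, a self-contained sketch of the partitioned computation; NTHREAD, NUMPOINT and the partial[] array are assumed names, and the real cpi.c passes its arguments and accumulates the thread sums in its own way:

  /* Sketch only: a block-partitioned pi computation with per-thread partial sums. */
  #include <stdio.h>
  #include <pthread.h>

  #define NTHREAD  4
  #define NUMPOINT 100000

  static double partial[NTHREAD];             /* one partial sum per thread */

  static void *cpi(void *arg) {
    int myid = *(int *) arg;
    double h = 1.0 / (double) NUMPOINT;
    /* block distribution: the last thread also takes the remainder points */
    int mynumpoint = NUMPOINT / NTHREAD +
        (myid < NTHREAD - 1 ? 0 : NUMPOINT % NTHREAD);
    int my1stpoint = myid * (NUMPOINT / NTHREAD);
    double sum = 0.0, x;
    for (int i = 1; i <= mynumpoint; i++) {
      x = h * ((double) (my1stpoint + i) - 0.5);
      sum += 4.0 / (1.0 + x * x);
    }
    partial[myid] = h * sum;                  /* this thread's share of pi */
    return NULL;
  }

  int main(void) {
    pthread_t threads[NTHREAD];
    int ids[NTHREAD];
    double pi = 0.0;
    for (int i = 0; i < NTHREAD; i++) {
      ids[i] = i;
      pthread_create(&threads[i], NULL, cpi, &ids[i]);
    }
    for (int i = 0; i < NTHREAD; i++) {
      pthread_join(threads[i], NULL);
      pi += partial[i];                       /* the partial sums add up to the full pi */
    }
    printf("pi is approximately %.16f\n", pi);
    return 0;
  }

With NUMPOINT = 100000 and NTHREAD = 4, each thread handles 25000 points; any remainder from the integer division goes to the last thread.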
