Notes for Lab 1


mpiexample1: The error message we see when trying to run with -np 64 implies there are not enough processor slots available on the node to satisfy a request for 64 processes. Running the command lscpu shows:

[ag9761@gadi-login-01 ~]$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              48
On-line CPU(s) list: 0-47
Thread(s) per core:  1
Core(s) per socket:  24
Socket(s):           2
NUMA node(s):        4
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Platinum 8268 CPU @ 2.90GHz
Stepping:            7
CPU MHz:             2900.000
CPU max MHz:         3900.0000
CPU min MHz:         1200.0000
BogoMIPS:            5800.00
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            36608K
NUMA node0 CPU(s):   0-3,7,8,12-14,18-20
NUMA node1 CPU(s):   4-6,9-11,15-17,21-23
NUMA node2 CPU(s):   24-27,31-33,37-39,43,44
NUMA node3 CPU(s):   28-30,34-36,40-42,45-47
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req pku ospke avx512_vnni md_clear flush_l1d arch_capabilities

In particular, note the following lines:

CPU(s):              48
Core(s) per socket:  24
Socket(s):           2
Model name:          Intel(R) Xeon(R) Platinum 8268 CPU @ 2.90GHz

This implies the maximum number of processes we can run on one node is 48 (one per core), and indeed when we lower the request to 48 processes we no longer get the error. However, on the login node we observe that increasing the number of processes increases the delay before the output is printed. This is because the cores on the login node are a shared resource, so our processes compete with other users' work.
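As a side note, the same core count can be read from inside a program; a minimal sketch (not part of the lab code, using the common Linux sysconf() query):

    /* Sketch: query the online CPU count from C, as a cross-check against lscpu. */
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
      long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
      printf("Online CPUs: %ld\n", ncpus);   /* expect 48 on a Gadi compute node */
      return 0;
    }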

Exercise 1#

  • Your modified version of mpiexample1.c should look like this (gethostname() requires #include <unistd.h>; an alternative using MPI_Get_processor_name() is sketched at the end of this exercise):
    const int hostname_len = 64;
    char name[hostname_len];
    gethostname(name, hostname_len);
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("Hello world from process %d of %d and hostname %s\n", rank, nprocs, name);
    

    With 4 processes, this should print something like:

     Hello world from process 0 of 4 and hostname gadi-login-05.gadi.nci.org.au
     Hello world from process 1 of 4 and hostname gadi-login-05.gadi.nci.org.au
     Hello world from process 2 of 4 and hostname gadi-login-05.gadi.nci.org.au
     Hello world from process 3 of 4 and hostname gadi-login-05.gadi.nci.org.au
    
  • Your code should now be executing on a numbered node of Gadi, for example:
    Hello world from process 0 of 4 and hostname gadi-cpu-clx-1547.gadi.nci.org.au
    Hello world from process 1 of 4 and hostname gadi-cpu-clx-1547.gadi.nci.org.au
    Hello world from process 2 of 4 and hostname gadi-cpu-clx-1547.gadi.nci.org.au
    Hello world from process 3 of 4 and hostname gadi-cpu-clx-1547.gadi.nci.org.au
    
  • Your modified batch_job script should look something like:
    #PBS -q express
    #PBS -j oe
    #PBS -l walltime=00:00:10,mem=64GB
    #PBS -wd
    #PBS -l ncpus=96
    mpirun -np 96 ./mpiexample1
    cat $PBS_NODEFILE
    

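As mentioned above, an alternative to gethostname() is MPI's own MPI_Get_processor_name(), which on most clusters returns the node's hostname; a minimal sketch (to be placed after MPI_Init, reusing rank and nprocs from the code above):

    /* Alternative sketch: MPI's portable call instead of the POSIX gethostname(). */
    char procname[MPI_MAX_PROCESSOR_NAME];
    int namelen;
    MPI_Get_processor_name(procname, &namelen);
    printf("Hello world from process %d of %d and hostname %s\n", rank, nprocs, procname);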
Exercise 2#

  • Timer overhead is the time it takes to call the timer function itself; the total run time of your program increases by this value multiplied by the number of timer calls. Timer resolution is the smallest time interval the timer can distinguish; any interval shorter than this may be reported as zero.

Example output:

First 200 timings:
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Average: 0.016356 us

We notice that the smallest non-zero reading is 1, so the resolution of the timer is 1 us. A reading of 1 appears roughly once in every 50 readings, implying an overhead of about 1/50 = 0.02 us. Averaging over all 1000000 iterations gives a more accurate estimate of the overhead, 0.016 us. A sketch of the measurement loop used to produce output like this is shown below.
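A minimal sketch of the gettimeofday() measurement loop (not the lab's own code; the iteration count and output layout are assumptions chosen to match the output above):

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>

    int main(void) {
      const int num_measurements = 1000000;
      long *dt = malloc(num_measurements * sizeof(long));
      double avg = 0.0;

      for (int i = 0; i < num_measurements; i++) {
        struct timeval start, end;
        /* Call the timer twice in quick succession and record the difference in us */
        gettimeofday(&start, NULL);
        gettimeofday(&end, NULL);
        dt[i] = (end.tv_sec - start.tv_sec) * 1000000L + (end.tv_usec - start.tv_usec);
        avg += dt[i];
      }
      avg /= num_measurements;

      printf("First 200 timings:\n");
      for (int i = 0; i < 10; i++) {
        for (int j = 0; j < 20; j++)
          printf("%ld ", dt[i * 20 + j]);
        printf("\n");
      }
      printf("Average: %f us\n", avg);
      free(dt);
      return 0;
    }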

  • The following code tests the resolution and overhead of MPI_Wtime(), which returns a time in seconds as a double:
    const int num_measurements = 1000000;   /* e.g. one million repeats */
    double *time = (double *)calloc(num_measurements, sizeof(double));
    double avg = 0;
    for (int i = 0; i < num_measurements; i++) {
      // Measure the time twice in quick succession
      double start = MPI_Wtime();
      double end = MPI_Wtime();
      time[i] = end - start;
      avg += time[i];
    }
    avg = avg / num_measurements;

    printf("First 200 timings:\n");
    for (int i = 0; i < 10; i++) {
      for (int j = 0; j < 20; j++) {
        printf("%.1e ", time[i * 20 + j]);
      }
      printf("\n");
    }
    printf("Average: %.2e s\n", avg);
    free(time);

Running this produces output like:

First 200 timings:
1.8e-07 3.9e-08 2.4e-08 2.7e-08 2.5e-08 2.5e-08 2.6e-08 2.5e-08 2.5e-08 2.6e-08 2.4e-08 2.4e-08 2.6e-08 2.4e-08 2.5e-08 2.7e-08 2.5e-08 2.4e-08 2.6e-08 2.6e-08 
2.4e-08 2.7e-08 2.6e-08 2.4e-08 2.6e-08 2.6e-08 2.4e-08 2.7e-08 2.7e-08 2.4e-08 2.5e-08 2.6e-08 2.4e-08 2.6e-08 2.6e-08 2.5e-08 2.6e-08 2.6e-08 2.4e-08 2.7e-08 
2.7e-08 2.4e-08 2.5e-08 2.6e-08 2.4e-08 2.6e-08 2.7e-08 2.5e-08 2.6e-08 2.6e-08 2.3e-08 2.5e-08 2.6e-08 2.4e-08 2.5e-08 2.6e-08 2.4e-08 2.4e-08 2.6e-08 2.4e-08 
2.5e-08 2.7e-08 2.4e-08 2.4e-08 2.6e-08 2.4e-08 2.5e-08 2.6e-08 2.4e-08 2.4e-08 2.7e-08 2.4e-08 2.4e-08 2.6e-08 2.4e-08 2.5e-08 2.6e-08 2.4e-08 2.5e-08 2.6e-08 
2.4e-08 2.4e-08 2.6e-08 2.5e-08 2.4e-08 2.6e-08 2.4e-08 2.5e-08 2.6e-08 2.6e-08 2.5e-08 2.6e-08 2.6e-08 2.5e-08 2.7e-08 2.6e-08 2.5e-08 2.6e-08 2.6e-08 2.5e-08 
2.6e-08 2.7e-08 2.5e-08 2.7e-08 2.7e-08 2.4e-08 2.6e-08 2.6e-08 2.4e-08 2.6e-08 2.6e-08 2.5e-08 2.6e-08 2.6e-08 2.4e-08 2.6e-08 2.5e-08 2.4e-08 2.7e-08 2.7e-08 
2.5e-08 2.4e-08 2.6e-08 2.5e-08 2.4e-08 2.6e-08 2.4e-08 2.5e-08 2.6e-08 2.3e-08 2.5e-08 2.6e-08 2.5e-08 2.4e-08 2.6e-08 2.4e-08 2.4e-08 2.6e-08 2.4e-08 2.4e-08 
2.6e-08 2.5e-08 2.5e-08 2.6e-08 2.4e-08 2.3e-08 2.6e-08 2.4e-08 2.4e-08 2.6e-08 2.4e-08 2.4e-08 2.7e-08 2.7e-08 2.5e-08 2.6e-08 2.6e-08 2.5e-08 2.6e-08 2.6e-08 
2.5e-08 2.7e-08 2.7e-08 2.5e-08 2.6e-08 2.6e-08 2.5e-08 2.6e-08 2.6e-08 2.5e-08 2.7e-08 2.6e-08 2.4e-08 2.5e-08 2.6e-08 2.4e-08 2.6e-08 2.5e-08 2.4e-08 2.7e-08 
2.7e-08 2.4e-08 2.6e-08 2.6e-08 2.4e-08 2.6e-08 2.7e-08 2.4e-08 2.7e-08 2.6e-08 2.4e-08 2.5e-08 2.6e-08 2.5e-08 2.5e-08 2.7e-08 2.4e-08 2.4e-08 2.5e-08 2.4e-08 
Average: 2.47e-08 s

Here we can read off the average overhead as 2.47e-8 s (about 25 ns). The variation between readings is of the order of 0.1e-08 s (1 ns), which we take to be the approximate resolution. Both figures are better than gettimeofday(), so we will use MPI_Wtime() from now on.

  • It is stated that MPI_Wtick() returns the resolution of MPI_Wtime(). Trying this, we add
     printf("MPI_Wtick=%.1e\n", MPI_Wtick());

and we find:

    MPI_Wtick=1.0e-09

(Wow!) This appears to be in agreement with the resolution estimated above.

Exercise 3#

  • The code deadlocks for buffer_length=67336 because both processes try to send a large message at the same time. For a small message, the data fits in the internal buffer provided by the MPI implementation; once it has been copied into that buffer, MPI_Send can return and the send proceeds asynchronously. However, if the message is too large to fit in the internal buffer (as is the case with buffer_length=67336), the send becomes effectively synchronous: each MPI_Send blocks until the matching receive is posted, and since both processes are blocked in MPI_Send, neither ever reaches its MPI_Recv. It is very important that you understand this, and appreciate that we have code that works fine for a small problem but deadlocks as soon as the problem becomes too large. This sort of bug is not easy to find. The fix is easy: just reorder the send and receive for one process (an alternative using MPI_Sendrecv is sketched after the reordered code):
...
else if (rank == 1) {
  MPI_Recv(recv_buffer, buffer_length, MPI_INTEGER, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  MPI_Send(send_buffer, buffer_length, MPI_INTEGER, 0, 0, MPI_COMM_WORLD);
}
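As mentioned above, another way to avoid the deadlock regardless of message size is MPI_Sendrecv, which posts the send and the receive as a single call; a sketch using the same variables as the exercise code:

if (rank == 0 || rank == 1) {
  int other = 1 - rank;   /* each of the two processes exchanges with the other */
  /* The send and receive are handled together, so neither process can be
     left blocked in a send waiting for the other to post its receive. */
  MPI_Sendrecv(send_buffer, buffer_length, MPI_INTEGER, other, 0,
               recv_buffer, buffer_length, MPI_INTEGER, other, 0,
               MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}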

Exercise 4#

The potential problem is that the time for a single ping-pong of 64 bytes is not much larger than the overhead of MPI_Wtime(). We observe that this ping-pong takes on the order of 330 ns, only an order of magnitude greater than the ~25 ns timer overhead, so the timer itself introduces an error of roughly 10%. We can reduce this error by timing many repeats of the ping-pong and taking the average time.

  • Modify the code to do a ping-pong up to 4MB and fix the timing problem:
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
  const int max_len = 4 * 1024 * 1024;
  const int msg_reps = 100;

  MPI_Init(&argc, &argv);

  int rank, nprocs;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
  for (int len = 1; len <= max_len; len *= 4) {
    double timeMean;
    int *buffer = calloc(len, sizeof(int));
    double t1 = MPI_Wtime();
    for (int i = 0; i < msg_reps; i++) {
      if (rank == 0) {
        MPI_Send(buffer, len, MPI_INTEGER, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(buffer, len, MPI_INTEGER, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      } else if (rank == 1) {
        MPI_Recv(buffer, len, MPI_INTEGER, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(buffer, len, MPI_INTEGER, 0, 0, MPI_COMM_WORLD);
      }
    }
    double t2 = MPI_Wtime() - t1;
    timeMean = t2 / msg_reps;
    if (rank == 0)
      printf("n=%10d  t=%.2es   bw=%.2e GB/s\n", len, timeMean,
             2.0 * len * sizeof(int) / timeMean / 1e9);
    MPI_Barrier(MPI_COMM_WORLD);
    free(buffer);
  }
  MPI_Finalize();
  return 0;
}

You may get something like this:

n=         1  t=5.24e-07s   bw=1.53e-02 GB/s
n=         4  t=4.21e-07s   bw=7.60e-02 GB/s
n=        16  t=4.97e-07s   bw=2.57e-01 GB/s
n=        64  t=1.98e-05s   bw=2.58e-02 GB/s
n=       256  t=2.00e-06s   bw=1.02e+00 GB/s
n=      1024  t=5.30e-06s   bw=1.55e+00 GB/s
n=      4096  t=5.99e-06s   bw=5.47e+00 GB/s
n=     16384  t=1.55e-05s   bw=8.46e+00 GB/s
n=     65536  t=5.53e-05s   bw=9.49e+00 GB/s
n=    262144  t=2.13e-04s   bw=9.83e+00 GB/s
n=   1048576  t=8.08e-04s   bw=1.04e+01 GB/s
n=   4194304  t=4.79e-03s   bw=7.01e+00 GB/s
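As a check on the bandwidth column: for n = 1048576 ints, each ping-pong moves 2 × 1048576 × 4 bytes ≈ 8.39e6 bytes, and 8.39e6 bytes / 8.08e-4 s ≈ 1.04e10 B/s ≈ 10.4 GB/s, matching the printed value.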
  • The latency is the time to send an empty message; for the purpose of this test we can measure it as half the round-trip time of the fastest 'small' message (here, 4 integers, since each ping-pong is two messages). From the above results:
  Latency = 4.21e-7 / 2 ~= 0.2 us
  Peak bandwidth ~= 10.4 GB/s  (for 1048576 ints)
  • The modified code will be something like the following (with process 0 sending a sample message to, and receiving it back from, each of the other processes in turn):
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
  const int max_len = 4 * 1024 * 1024;
  const int msg_reps = 100;

  MPI_Init(&argc, &argv);

  int rank, nprocs;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
  for (int len = 1; len <= max_len; len *= 1024) {
    double timeMean;
    int *buffer = calloc(len, sizeof(int));
    for (int r = 1; r < nprocs; r++) {
      double t1 = MPI_Wtime();
      for (int i = 0; i < msg_reps; i++) {
        if (rank == 0) {
          MPI_Send(buffer, len, MPI_INTEGER, r, 0, MPI_COMM_WORLD);
          MPI_Recv(buffer, len, MPI_INTEGER, r, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == r) {
          MPI_Recv(buffer, len, MPI_INTEGER, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          MPI_Send(buffer, len, MPI_INTEGER, 0, 0, MPI_COMM_WORLD);
        }
      }
      double t2 = MPI_Wtime() - t1;
      timeMean = t2 / msg_reps;
      if (rank == 0)
        printf("dest=%d n=%10d  t=%.2es   bw=%.2e GB/s\n", r, len, timeMean,
               2.0 * len * sizeof(int) / timeMean / 1e9);
      MPI_Barrier(MPI_COMM_WORLD);
    }
    free(buffer);
  }
  MPI_Finalize();
  return 0;
}
  • With 96 processes over two 48-core nodes, rank 0 sits on the first node, so the destination ranks (dest) fall into two groups:
  • 1-47: within the same node (covering messages passed within the same socket and between the two sockets)
  • 48-95: messages passed between the two nodes

You may see something like this:

| Message size (ints) | Ping-pong time within the node (s) | Ping-pong time between two nodes (s) |
|---------------------|------------------------------------|--------------------------------------|
| 1                   | 4.43e-07                           | 2.38e-06                             |
| 1024                | 5.16e-06                           | 5.60e-06                             |
| 1048576             | 8.78e-04                           | 6.35e-04                             |

Except where otherwise noted, there were slight variations between different destination ranks within each group; the values above, taken from the sample output, were among the lowest for each case.

Note: for 1048576 ints, the time between the two sockets of a single node was consistently anomalous, being about 3x slower than the inter-node time.
