The one-to-all broadcast operation on a hypercube of dimension d can be
specified as follows:
for each i = d-1, ..., 0, each process whose
rank r satisfies r % (2w) = 0 sends the data to
process r+w (and, correspondingly, each process whose rank r
satisfies r % (2w) = w receives the data from process r-w),
where w = 2^i.
oneAllSFhypercube.c is a program which simulates a one-to-all
broadcast communication on a hypercube (with root process 0). It can
be invoked by
mpirun -np p ./oneAllSFhypercube n
where
the number of processes p must be a power of 2,
i.e. p = 2^d, and the message size is specified by n.
The batch script batch_oneAllhyper
compares the hypercube algorithm with a baseline (a synchronized pipelined
broadcast, from the program oneAllFlits.c). The theory predicts
that the latter will be about p/d times slower (about 4 times for
p = 16 and about 10 times for p = 64). What do you observe?
oneAllFlits.c is a program which performs a (series of)
pipelined broadcast(s) with root process 0. By default, it is
synchronized in the sense that, when the message reaches the final
process p-1, that process sends an
empty message back to process 0, so that the next broadcast cannot start
before this message arrives (this enforces the measurement of the
end-to-end time of the whole operation). The program can be invoked by
mpirun -np p ./oneAllFlits n nf
where nf is the number of flits the message is broken down into
(default value 1).
The batch script batch_oneAllFlits
compares the synchronized broadcasts with nf = 1 and nf = 8, and
an unsynchronized broadcast (which could indicate an upper bound on
the benefit of using flits). How does the benefit (if any) of
breaking the message into flits vary with p and n?
What is the reason that you can get a benefit in the first place?