Notes for Lab 7
Exercise 1#
To compute a - b as well, assuming that we pass an array c of length 2, we add the following operation in __global__ void add():
*(c+1) = a - b;
and in main(), the following lines are changed or added:
int c[2];
HANDLE_ERROR( cudaMalloc( (void**)&dev_c, 2*sizeof(int) ) );
HANDLE_ERROR( cudaMemcpy( c, dev_c, 2*sizeof(int), cudaMemcpyDeviceToHost ) );
printf( "2 + 7 = %d\n", c[0] );
printf( "2 - 7 = %d\n", c[1] );
Exercise 2#
stugpu2 has 4 GeForce RTX 2080 Ti GPUs with a 1665 MHz clock. Each GPU has 68 SMs, a maximum of 49152 bytes of shared memory per block, 65536 registers per block, a maximum of 1024 threads per block, and a warp size of 32 threads.
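These figures can also be read off programmatically with cudaGetDeviceProperties; a minimal sketch (compile with nvcc as a .cu file):
#include <stdio.h>

int main( void ) {
    int count;
    cudaGetDeviceCount( &count );                     // 4 on stugpu2
    for (int i = 0; i < count; i++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties( &prop, i );
        printf( "GPU %d: %s\n", i, prop.name );
        printf( "  clock rate (kHz):        %d\n",  prop.clockRate );
        printf( "  SMs:                     %d\n",  prop.multiProcessorCount );
        printf( "  shared memory per block: %zu\n", prop.sharedMemPerBlock );
        printf( "  registers per block:     %d\n",  prop.regsPerBlock );
        printf( "  max threads per block:   %d\n",  prop.maxThreadsPerBlock );
        printf( "  warp size:               %d\n",  prop.warpSize );
    }
    return 0;
}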
Exercise 3#
The kernel is:
__global__ void add( int *a, int *b, int *c, int *d ) {
    int tid = blockIdx.x;   // this thread handles the data at its block index, since each block has only one thread
    if (tid < N) {
        c[tid] = a[tid] + b[tid];
        d[tid] = a[tid] - b[tid];
    }
}
and the lines modified or added in main() are:
int a[N], b[N], c[N], d[N];
int *dev_a, *dev_b, *dev_c, *dev_d;
HANDLE_ERROR( cudaMalloc( (void**)&dev_d, N * sizeof(int) ) );
add<<<N,1>>>( dev_a, dev_b, dev_c, dev_d );
HANDLE_ERROR( cudaMemcpy( d, dev_d, N * sizeof(int), cudaMemcpyDeviceToHost ) );
printf( "%d - %d = %d\n", a[i], b[i], d[i] );
Exercise 4#
The modified add kernel is:
int tid = blockIdx.x * (N/32);                     // first element handled by this block (32 blocks assumed)
int tidLast = (tid + N/32 > N) ? N : tid + N/32;   // one past the last element for this block
while (tid < tidLast) {
    c[tid] = a[tid] + b[tid];
    d[tid] = a[tid] - b[tid];
    tid++;
}
which can be found in add_loop_gpu2.cu.
Alternatively, a more common way to write this is
int tid = blockIdx.x * blockDim.x + threadIdx.x;
int nthreads = gridDim.x * blockDim.x;   // total number of threads in the grid
for (int i = tid; i < N; i += nthreads) {
    c[i] = a[i] + b[i];
    d[i] = a[i] - b[i];
}
since consecutive threads of a warp then access contiguous memory locations (coalesced accesses) when there are multiple threads per block.
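Written out as a complete kernel, this variant could look as follows; the launch configuration of 32 blocks of 128 threads is only an illustrative choice, not one prescribed by the exercise:
__global__ void add( int *a, int *b, int *c, int *d ) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    int nthreads = gridDim.x * blockDim.x;             // total number of threads
    for (int i = tid; i < N; i += nthreads) {
        c[i] = a[i] + b[i];
        d[i] = a[i] - b[i];
    }
}
// launched, for example, as:
add<<<32,128>>>( dev_a, dev_b, dev_c, dev_d );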
Exercise 5#
Note: here gridDim.x = 128 and blockDim.x = 128, giving 128 * 128 threads in total. The loop in add() becomes:
while (tid < N) {
    if (tid % 2 == 0)
        c[tid] = a[tid] + b[tid];
    else
        c[tid] = a[tid] - b[tid];
    tid += blockDim.x * gridDim.x;
}
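For reference, a sketch of the full kernel with the usual index initialization, launched as add<<<128,128>>>( dev_a, dev_b, dev_c ); the three-pointer signature is an assumption based on the loop above:
__global__ void add( int *a, int *b, int *c ) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    while (tid < N) {
        if (tid % 2 == 0)
            c[tid] = a[tid] + b[tid];   // even indices: sum
        else
            c[tid] = a[tid] - b[tid];   // odd indices: difference
        tid += blockDim.x * gridDim.x;  // stride by the total number of threads
    }
}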
Exercise 6#
We add a kernel:
__global__ void scale( double v, int *c ) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    while (tid < N) {
        c[tid] = v * c[tid];
        tid += blockDim.x * gridDim.x;
    }
}
and add the following code to main() after the verification code:
scale<<<128,128>>>( 1.0/Vnorm, dev_c );
HANDLE_ERROR( cudaMemcpy( c, dev_c, N * sizeof(int),
cudaMemcpyDeviceToHost ) );
Note: to make this example more meaningful, all arrays should be changed to double.
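A sketch of the scale kernel after that change (purely illustrative; the host arrays, device allocations and copies would need the same int-to-double change):
__global__ void scale( double v, double *c ) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    while (tid < N) {
        c[tid] = v * c[tid];             // now a genuine floating-point scaling
        tid += blockDim.x * gridDim.x;
    }
}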
Exercise 7#
The code for the timing calls:
#include <sys/time.h>
#define TIMEVAL(tvs) ((tvs).tv_sec + 1.0E-06 * (tvs).tv_usec)
...
struct timeval tv;
double timeTot, timeKernel;
...
gettimeofday(&tv, NULL);
timeTot = TIMEVAL(tv);
// copy the arrays ...
...
gettimeofday(&tv, NULL);
timeKernel = TIMEVAL(tv);
dot<<<blocksPerGrid,threadsPerBlock>>>( dev_a, dev_b, dev_partial_c );
cudaDeviceSynchronize(); //wait for the kernel to finish!
gettimeofday(&tv, NULL);
timeKernel = TIMEVAL(tv) - timeKernel;
... // end of loop to calculate c
gettimeofday(&tv, NULL);
timeTot = TIMEVAL(tv) - timeTot;
printf("time (s) for total %.2e, dot %.2e\n", timeTot, timeKernel);
Exercise 8#
For N = 33 * 1024 * 1024, the following table gives the kernel time in seconds for threadsPerBlock = 8, 16, 32, 64, 128, 256, 512, 1024 (first column) and blocksPerGrid ~= N/(M*threadsPerBlock) for M = 1, 2, 4, 16 (remaining columns; M is the number of elements each thread processes):
threadsPerBlock | M=1 | M=2 | M=4 | M=16
---|---|---|---|---
1024 | 8.12e-04 | 8.21e-04 | 8.09e-04 | 8.04e-04
512 | 1.13e-03 | 1.14e-03 | 1.13e-03 | 1.13e-03
256 | 1.87e-03 | 1.86e-03 | 1.83e-03 | 1.83e-03
128 | 3.24e-03 | 3.21e-03 | 3.22e-03 | 3.22e-03
64 | 6.16e-03 | 6.15e-03 | 6.17e-03 | 6.20e-03
32 | 1.21e-02 | 1.21e-02 | 1.21e-02 | 1.21e-02
16 | 2.38e-02 | 2.37e-02 | 2.37e-02 | 2.37e-02
8 | 4.64e-02 | 4.64e-02 | 4.64e-02 | 4.64e-02
We see no real difference when varying blocksPerGrid (i.e. varying M), but the kernel time drops roughly in proportion to threadsPerBlock up to about 256; from 256 to 1024 it continues to improve, though less than linearly.
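For reference, the launch parameters in this table would be computed along these lines; the variable M and the example values are illustrative, and the ceiling division mirrors the blocksPerGrid ~= N/(M*threadsPerBlock) relation above:
const int M = 4;                 // elements handled per thread
const int threadsPerBlock = 256;
const int blocksPerGrid = (N + M*threadsPerBlock - 1) / (M*threadsPerBlock);
dot<<<blocksPerGrid,threadsPerBlock>>>( dev_a, dev_b, dev_partial_c );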
Exercise 9#
We add:
double dotHost( int n, float *a, float *b ) {
double sum = 0.0;
for (int i=0; i < n; i++)
sum += a[i]*b[i];
return sum;
}
and in the main program:
gettimeofday(&tv, NULL);
timeTot = TIMEVAL(tv);
c = dotHost(N, a, b);
gettimeofday(&tv, NULL);
timeTot = TIMEVAL(tv) - timeTot;
printf("CPU time (s) for dot %.2e\n", timeTot);
For threadsPerBlock = 1024 and N = 33 * n * 1024, the total times in seconds are:
n | 1024 | 256 | 64 | 8 | 1
---|---|---|---|---|---
t_GPU (s) | 9.19e-02 | 2.84e-02 | 7.77e-03 | 2.10e-03 | 1.26e-03
t_CPU (s) | 2.22e-01 | 6.16e-02 | 1.40e-02 | 2.66e-03 | 3.85e-04
i.e. the CPU is faster than the GPU only for the smallest size, N = 33 * 1024 (n = 1).