Notes for Lab 7
Exercise 1#
To compute a - b as well, assuming that we pass an array c of length 2, we add the following operation in __global__ void add():
*(c+1) = a - b;
and in main(), the following lines are changed or added:
int c[2];
HANDLE_ERROR( cudaMalloc( (void**)&dev_c, 2*sizeof(int) ) );
HANDLE_ERROR( cudaMemcpy( c, dev_c, 2*sizeof(int), cudaMemcpyDeviceToHost ) );
printf( "2 + 7 = %d\n", c[0] );
printf( "2 - 7 = %d\n", c[1] );
Exercise 2#
stugpu2 has 4 GeForce RTX 2080 Ti GPUs with a 1665 MHz clock. Each GPU has 68 SMs, a maximum of 49152 bytes of shared memory per block, 65536 registers per block, a maximum of 1024 threads per block, and a warp size of 32 threads.
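These figures can also be read off programmatically with cudaGetDeviceProperties; a minimal sketch (compile with nvcc as a .cu file):
#include <stdio.h>

int main( void ) {
    int count;
    cudaGetDeviceCount( &count );                     // 4 on stugpu2
    for (int i = 0; i < count; i++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties( &prop, i );
        printf( "GPU %d: %s\n", i, prop.name );
        printf( "  clock rate (kHz):        %d\n",  prop.clockRate );
        printf( "  SMs:                     %d\n",  prop.multiProcessorCount );
        printf( "  shared memory per block: %zu\n", prop.sharedMemPerBlock );
        printf( "  registers per block:     %d\n",  prop.regsPerBlock );
        printf( "  max threads per block:   %d\n",  prop.maxThreadsPerBlock );
        printf( "  warp size:               %d\n",  prop.warpSize );
    }
    return 0;
}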
Exercise 3#
The kernel is:
__global__ void add( int *a, int *b, int *c, int *d ) {
    int tid = blockIdx.x;   // this thread handles the data at its block index, since each block has only one thread
    if (tid < N) {
        c[tid] = a[tid] + b[tid];
        d[tid] = a[tid] - b[tid];
    }
}
and the lines modified or added in main() are:
int a[N], b[N], c[N], d[N];
int *dev_a, *dev_b, *dev_c, *dev_d;
HANDLE_ERROR( cudaMalloc( (void**)&dev_d, N * sizeof(int) ) );
add<<<N,1>>>( dev_a, dev_b, dev_c, dev_d );
HANDLE_ERROR( cudaMemcpy( d, dev_d, N * sizeof(int), cudaMemcpyDeviceToHost ) );
printf( "%d - %d = %d\n", a[i], b[i], d[i] );
Exercise 4#
The modified add kernel is:
int tid = blockIdx.x * (N/32);                     // first element handled by this block (32 blocks assumed)
int tidLast = (tid + N/32 > N) ? N : tid + N/32;   // one past the last element for this block
while (tid < tidLast) {
    c[tid] = a[tid] + b[tid];
    d[tid] = a[tid] - b[tid];
    tid++;
}
which can be found in add_loop_gpu2.cu.
Alternatively, a more common way to write this is
int tid = blockIdx.x * blockDim.x + threadIdx.x;
int nthreads = gridDim.x * blockDim.x;   // total number of threads in the grid
for (int i = tid; i < N; i += nthreads) {
    c[i] = a[i] + b[i];
    d[i] = a[i] - b[i];
}
since consecutive threads of a warp then access contiguous memory locations (coalesced accesses) when there are multiple threads per block.
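Written out as a complete kernel, this variant could look as follows; the launch configuration of 32 blocks of 128 threads is only an illustrative choice, not one prescribed by the exercise:
__global__ void add( int *a, int *b, int *c, int *d ) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    int nthreads = gridDim.x * blockDim.x;             // total number of threads
    for (int i = tid; i < N; i += nthreads) {
        c[i] = a[i] + b[i];
        d[i] = a[i] - b[i];
    }
}
// launched, for example, as:
add<<<32,128>>>( dev_a, dev_b, dev_c, dev_d );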
Exercise 5#
Note: here gridDim.x = 128 and blockDim.x = 128, giving 128 * 128 threads in total. The loop in add() becomes:
while (tid < N) {
    if (tid % 2 == 0)
        c[tid] = a[tid] + b[tid];
    else
        c[tid] = a[tid] - b[tid];
    tid += blockDim.x * gridDim.x;
}
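For reference, a sketch of the full kernel with the usual index initialization, launched as add<<<128,128>>>( dev_a, dev_b, dev_c ); the three-pointer signature is an assumption based on the loop above:
__global__ void add( int *a, int *b, int *c ) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    while (tid < N) {
        if (tid % 2 == 0)
            c[tid] = a[tid] + b[tid];   // even indices: sum
        else
            c[tid] = a[tid] - b[tid];   // odd indices: difference
        tid += blockDim.x * gridDim.x;  // stride by the total number of threads
    }
}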
Exercise 6#
We add a kernel:
__global__ void scale( double v, int *c ) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    while (tid < N) {
        c[tid] = v * c[tid];
        tid += blockDim.x * gridDim.x;
    }
}
and add the following code to main() after the verification code:
scale<<<128,128>>>( 1.0/Vnorm, dev_c );
HANDLE_ERROR( cudaMemcpy( c, dev_c, N * sizeof(int),
cudaMemcpyDeviceToHost ) );
Note: to make this example more meaningful, all arrays should be changed to double.
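A sketch of the scale kernel after that change (purely illustrative; the host arrays, device allocations and copies would need the same int-to-double change):
__global__ void scale( double v, double *c ) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    while (tid < N) {
        c[tid] = v * c[tid];             // now a genuine floating-point scaling
        tid += blockDim.x * gridDim.x;
    }
}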
Exercise 7#
The code for the timing calls:
#include <sys/time.h>
#define TIMEVAL(tvs) ((tvs).tv_sec + 1.0E-06 * (tvs).tv_usec)
...
struct timeval tv;
double timeTot, timeKernel;
...
gettimeofday(&tv, NULL);
timeTot = TIMEVAL(tv);
// copy the arrays ...
...
gettimeofday(&tv, NULL);
timeKernel = TIMEVAL(tv);
dot<<<blocksPerGrid,threadsPerBlock>>>( dev_a, dev_b, dev_partial_c );
cudaDeviceSynchronize(); //wait for the kernel to finish!
gettimeofday(&tv, NULL);
timeKernel = TIMEVAL(tv) - timeKernel;
... // end of loop to calculate c
gettimeofday(&tv, NULL);
timeTot = TIMEVAL(tv) - timeTot;
printf("time (s) for total %.2e, dot %.2e\n", timeTot, timeKernel);
Exercise 8#
For N = 33 * 1024 * 1024, the following table gives the kernel time in seconds for threadsPerBlock = 8, 16, 32, 64, 128, 256, 512, 1024 (first column) and blocksPerGrid ~= N/(M*threadsPerBlock) for M = 1, 2, 4, 16 (remaining columns; M is the number of elements each thread processes):
threadsPerBlock | M=1 | M=2 | M=4 | M=16
---|---|---|---|---
1024 | 8.12e-04 | 8.21e-04 | 8.09e-04 | 8.04e-04
512 | 1.13e-03 | 1.14e-03 | 1.13e-03 | 1.13e-03
256 | 1.87e-03 | 1.86e-03 | 1.83e-03 | 1.83e-03
128 | 3.24e-03 | 3.21e-03 | 3.22e-03 | 3.22e-03
64 | 6.16e-03 | 6.15e-03 | 6.17e-03 | 6.20e-03
32 | 1.21e-02 | 1.21e-02 | 1.21e-02 | 1.21e-02
16 | 2.38e-02 | 2.37e-02 | 2.37e-02 | 2.37e-02
8 | 4.64e-02 | 4.64e-02 | 4.64e-02 | 4.64e-02
We see no real difference when varying blocksPerGrid (i.e. varying M), but the kernel time drops roughly in proportion to threadsPerBlock up to about 256; from 256 to 1024 it continues to improve, though less than linearly.
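For reference, the launch parameters in this table would be computed along these lines; the variable M and the example values are illustrative, and the ceiling division mirrors the blocksPerGrid ~= N/(M*threadsPerBlock) relation above:
const int M = 4;                 // elements handled per thread
const int threadsPerBlock = 256;
const int blocksPerGrid = (N + M*threadsPerBlock - 1) / (M*threadsPerBlock);
dot<<<blocksPerGrid,threadsPerBlock>>>( dev_a, dev_b, dev_partial_c );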
Exercise 9#
We add:
double dotHost( int n, float *a, float *b ) {
double sum = 0.0;
for (int i=0; i < n; i++)
sum += a[i]*b[i];
return sum;
}
and in the main program:
gettimeofday(&tv, NULL);
timeTot = TIMEVAL(tv);
c = dotHost(N, a, b);
gettimeofday(&tv, NULL);
timeTot = TIMEVAL(tv) - timeTot;
printf("CPU time (s) for dot %.2e\n", timeTot);
For threadsPerBlock = 1024 and N = 33 * n * 1024, the total times in seconds are:
n | 1024 | 256 | 64 | 8 | 1
---|---|---|---|---|---
t_GPU (s) | 9.19e-02 | 2.84e-02 | 7.77e-03 | 2.10e-03 | 1.26e-03
t_CPU (s) | 2.22e-01 | 6.16e-02 | 1.40e-02 | 2.66e-03 | 3.85e-04
i.e. the CPU is faster than the GPU only for the smallest size, N = 33 * 1024 (n = 1).