# GPU SM ARCHITECTURE & PROGRAMMING MODEL

### COMP4300/8300 PARALLEL SYSTEMS

# PROF. JOHN TAYLOR

MAY 2024

TEOGAMPROVIDER ID: PRV IDDOZ (A USTRAU AN UNI PRDVI DER CODE: 001.002

2

4

Australian National University

# Logistics

Attendance to the Lab sessions is highly encouraged. Most of the practical aspects of the programming models are covered in the Labs.

### NVIDIA V100 GPU AT NCI -2D MATRIX SUM

| Device | Version                   | Grid           | Block   | Time      | Speedup |
|--------|---------------------------|----------------|---------|-----------|---------|
| CPU    | matrix-add-cpu            | Nx, Ny = 32768 | N/a     | 31,366 ms | 1       |
| CPU    | matrix-add-<br>openmp-avx | Nx, Ny = 32768 | N/A     | 3302 ms   | 9.5     |
| CPU    | matrix-add-<br>openmp-gcc | Nx, Ny = 32768 | N/A     | 550 ms    | 57      |
| GPU    | matrix-add-gpu            | 1024 x 1024    | 32 x 32 | 19.23 ms  | 1631    |
| GPU    | matrix-add-gpu            | 1024 x 2048    | 32 x 16 | 18.40 ms  | 1704    |
| GPU    | matrix-add-gpu            | 2048 x 1024    | 16 x 32 | 21.38 ms  | 1467    |
| GPU    | matrix-add-gpu            | 2048 x 2048    | 16 x 16 | 18.73 ms  | 1674    |
|        |                           |                |         |           |         |

# Heterogeneous Computing



# **Reference Material**

- > NVIDIA's CUDA C++ Best Practices Guide, <u>https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/</u>
- Nvidia H100 TensorCore GPU Architecture https://resources.nvidia.com/enus-tensor-core
- Jia, Z., Maggioni, M., Staiger, B., & Scarpazza, D. P. (2018). Dissecting the NVIDIA volta GPU architecture via microbenchmarking. arXiv preprint arXiv:1804.06826.
- Professional CUDA c programming. Cheng, John, Max Grossman, and Ty McKercher. John Wiley & Sons, 2014.
- CUDA by Example: An Introduction to General-Purpose GPU Programming, Sanders, Jason, and Edward Kandrot, Addison-Wesley Professional, 2010.
- Tesla V100 Performance Optimization Guide, <u>https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/</u><u>tesla-</u> product-literature/v100-application-performance-guide.pdf

# **Heterogeneous Computing**

- Terminology:
  - *Host* The CPU and its memory (host memory)
  - Device The GPU and its memory (device memory)



# Heterogeneous Computing



### The End of The Road for General-Purpose Processors



- End of Dennard scaling caused the end of the generalpurpose processor era (both uniprocessor and multicore)
- Use of domain specific architectures (DSAs): programmable but designed for a class of problems with specific structures.
- GPUs are designed for data-parallel algorithms (especially linear algebra)
- More transistors are devoted to data processing rather than data caching and flow control
- Require domain specific programming model that makes it possible for the software to match the hardware (e.g. CUDA)
- Extracting performance requires the programmer to expose parallelism, to manage memory efficiently (e.g. caching), to tailor the algorithm to the hardware

# TOP 500 List November 2023

| Cores     | Rmax<br>(PFlop/s) | Rpeak<br>(PFlop/s) | Power<br>(kW)           |
|-----------|-------------------|--------------------|-------------------------|
| 8,699,904 | 1,194.00          | 1,679.82           | 22,703                  |
| 4,742,808 | 585.34            | 1,059.33           | 24,687                  |
| 1,123,200 | 561.20            | 846.84             |                         |
| 7,630,848 | 442.01            | 537.21             | 29,899                  |
| 2,752,704 | 379.70            | 531.51             | 7,107                   |
|           | 2,752,704         | 2,752,704 379.70   | 2,752,704 379.70 531.51 |



# **CPU versus GPU - FLOP rates**



GPU FLOP Rates have been growing exponentially:-

- 2010's GFLOP/s see the graph opposite
- 2020's TFLOPS/s to PFLOP/s e.g. H100 GPU

# Impact of Heterogenous Computing



# **Simple Processing Flow**



# **Simple Processing Flow** PCI Bus CPU Memory 1. Copy input data from CPU memory to GPU memory 2. Load GPU program and execute, caching data on chip for performance DRAN

# **Simple Processing Flow**



# A Simplistic View of the GPU Architecture



13

A scalable array of complex "cores" called Streaming Multiprocessors (SM)

- > Each core has an array of functional units (e.g. ALUs) with SIMD execution
- Instructions operate in groups of 32 "SIMD" threads called warps
- > On the NVIDIA H100 GPU up to 64 warps can be executed concurrently (interleaved) on a single SM
- Up to 132 SMs × 128 CUDA cores/SM = 16896 Cuda cores per device
- ▶ H100 includes Tensor cores + Transformer engine for training large language models
- > This is why GPUs are called throughput-oriented architectures

# **Heterogenous Computing**





#### > GPU computing is not meant to replace CPU computing

- > CPU computing is good for controlintensive tasks, and GPU computing is good for data-parallel computationintensive tasks
- Modern high-end HPC systems are heterogenous: They combine CPUs and GPUs, mapping tasks to the most suitable ΡU
- > A typical heterogeneous compute node consists of two multicore CPU sockets and two or more many-core GPUs
- > GPUs operate in conjunction with a CPU-based host typically through a PCI-Express bus

# **Heterogenous Computing**



# Heterogenous Computing



- In a heterogeneous, the CPU is called the host and the GPU is called the device
- A heterogeneous application consists of two parts: Host code (runs on CPU) and device code (runs on GPU)
- Applications are initialized by the CPU: the CPU code is responsible for managing the environment, code, and data for the device before loading compute-intensive tasks onto the device.
- Host and device have distinct and separate virtual memory address spaces!
- ➢ Host ↔ device communication is slow and becomes easily a performance bottleneck.

### **Compute Unified Device Architecture (CUDA)**



- CUDA C is an is an extension of standard ANSI providing APIs and a programming model for NVIDIA GPUs
- A CUDA program consists of a mixture host and device code
- NVIDIA's CUDA nvcc compiler separates the device code from the host code during the compilation process
- The device code is written using CUDA C extended with keywords for labeling dataparallel functions, called kernels

# Hello World from a GPU



20

- The qualifier global tells the compiler the function is a device kernel and will be called from the CPU and executed on the GPU
- The kernel is launched with the triple angle brackets notation (helloFromGPU <<<1, 10>>>())
- The parameters within the triple angle brackets specify how many threads will execute the kernel (10 GPU threads).
- The function cudaDeviceReset() cleans up all resources associated with the current device
- The flag -arch=sm 70 tells the nvcc compiler to produce a binary for the Volta V100 architecture

### **CUDA Programming Structure**



- A typical processing flow of a CUDA program follows this pattern:
- Copy data from CPU memory to GPU memory
- Invoke kernels to operate on the data stored in GPU memory
- Copy data back from GPU memory to CPU memory
- When a kernel has been launched, control is returned immediately to the host.
- The host can operate independently of the device for most operations. CUDA is an asynchronous model.

### **CUDA Memory Management**



- CUDA provides functions to allocate device memory, release device memory, and transfer data between the host memory and device memory
- > GPU memory allocation  $\rightarrow$  synchronous

```
cuda Error_t cuda Malloc ( void ** devPtr, size_t size )

Transfer data between the host and device →
```

- Kinds of transfer: cudaMemcpyKind = {
- cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost,
- cudaMemcpyDeviceToDevice }

synchronous

 $\succ$  cudaMemset and cudaFree are also synchronous

### **CUDA Memory Management**

| Control ALU ALU<br>Cache | ALU ALU       |      |  |
|--------------------------|---------------|------|--|
|                          | ALU ALU       |      |  |
|                          |               |      |  |
| DRAM <                   | PCle Bus DRAM | DRAM |  |

| STANDARD C FUNCTIONS | CUDA C FUNCTIONS |
|----------------------|------------------|
| nalloc               | cudaMalloc       |
| nencpy               | cudaMemcpy       |
| nenset               | cudaMenset       |
| free                 | cudaFree         |



- CUDA provides functions to allocate device memory, release device memory, and transfer data between the host memory and device memory
- ▶ GPU memory allocation  $\rightarrow$  synchronous

cuda Error\_t cuda Malloc ( void \*\* devPtr, size\_t size )

WARNING: device pointers (e.g devPtr) may not be dereferenced in the host code.

# **CUDA Thread Organization**

| Host         | D                | evice            |                  |                  |                  |  |
|--------------|------------------|------------------|------------------|------------------|------------------|--|
|              |                  | Grid             |                  |                  |                  |  |
| Kernel       | <b> &gt;</b>     | Bloc<br>(0, 0    |                  | Block<br>(1, 0)  | Block<br>(2, 0)  |  |
|              |                  |                  | :k<br>I)         | Block<br>(1, 1)  | Block<br>(2, 1)  |  |
|              |                  |                  |                  |                  |                  |  |
| Block (1, 1) |                  |                  |                  |                  |                  |  |
| V            | Thread<br>(0, 0) |                  | Thread<br>(2, 0) | Thread<br>(3, 0) | Thread<br>(4, 0) |  |
|              | Thread<br>(0, 1) | Thread<br>(1, 1) | Thread<br>(2, 1) | Thread<br>(3, 1) | Thread<br>(4, 1) |  |
|              | Thread<br>(0, 2) | Thread<br>(1, 2) | Thread<br>(2, 2) | Thread<br>(3, 2) | Thread<br>(4, 2) |  |

24

- Two-level thread hierarchy decomposed into blocks of threads and grids of blocks
- All threads spawned by a single kernel form a thread grid
- Threads in a grid are grouped in thread blocks
- Threads in the same block can cooperate using block-local sychronization and shared memory
- Threads from different blocks cannot synchronize!
- Each block has a unique ID, bblockIdx, within the grid
- Each thread has a unique ID, threadIdx, within its block (local)

23

TEOSA PROVIDERIO: PRV12002 (AUSTRALIAN UNIVERSITY) CRCOS PROVIDER CODE

# **Defining Grids and Blocks**

nElem = define grid and block structure dim3 block (3); dim3 grid ((nElem+block.x=1) / block.x); check grid and block dimension from host side printf ("grid.x %d grid.y %d grid.z %d\n", grid.x, grid.y, grid.z); rintf ("block.x %d block.y %d block.z %d\n", block.x, block.y, block.z);
/ check grid and block dimension from device side checkIndex <<<grid , block >>> (); global void checkIndex (void) { printf (" thread Idx :(%d, %d, %d) block Idx :(%d, %d %d) block Dim :(%d, %d, %d) grid Dim :(%d, %d, &d)\n". thread Idx.x, thread Idx.y, thread Idx.z, block Idx . x, block Idx . v, block Idx . z, block Dim.x, blockDim.y, blockDim.z, gridDim.x, grid Dim.v. grid Dim.z);

grid.x 2 grid.y 1 grid.z 1 block.x 3 block.y 1 block.z 1 threadIdx:(0, 0, 0) blockIdx:(1, 0, 0) blockDim:(3, 1, 1) gridDim:(2, 1, 1) threadIdx:(1, 0, 0) blockIdx:(1, 0, 0) blockDim:(3, 1, 1) gridDim:(2, 1, 1 threadIdx: (2, 0, 0) blockIdx: (1, 0, 0) blockDim: (3, 1, 1) gridDim: (2, 1, 1) threadIdx:(0, 0, 0) blockIdx:(0, 0, 0) blockDim:(3, 1, 1) gridDim:(2, 1, 1 threadIdx:(1, 0, 0) blockIdx:(0, 0, 0) blockDim:(3, 1, 1) gridDim:(2, 1, threadIdx:(2, 0, 0) blockIdx:(0, 0, 0) blockDim:(3, 1, 1) gridDim:(2, 1, 1

- CUDA organizes grids and blocks in three dimensions
- > uint3 blockIdx = {blockIdx.x, blockIdx.y, blockIdx.z}
- > uint3 threadIdx = {threadIdx.x, threadIdx.y, threadIdx.z}
- When defined on the host grids and blocks use the dim3type (and not uint3) with 3 unsigned integer fields
- Note that the grid size is rounded up to the multiple of the block size
- For a given kernel, the grid and block dimensions are decided based on performance characteristics and limitations of GPU resources

## **CUDA Kernel Semantics**

- > The definition of a CUDA kernel requires special function gualifiers
  - ➢ global → Executed on device, callable from host and device, must have void return type
  - $\blacktriangleright$  device  $\rightarrow$  Executed on device, callable from device only
  - $\blacktriangleright$  host  $\rightarrow$  Executed on host, callable from host only
- GPU kernels use implicit parallelism!
- For example, from the host code

Matrix dimensions nx = ny = 16.384

Running on an NVIDIA Kepler K80

2D grid and 2D block between lines 9-12

> sumMatrixOnGPU2D <<<(512,512),</p>

oid sum Arrays On Host (float \*A, float \*B, float \*C, const int N) { for (int i = 0; i < N; i++) {</pre> C[i] = A[i] + B[i];

You can obtain a GPU parallel kernel by peeling off the forloop and assigning work to different threads

\_global void sum Arrays On GPU (float \* A, float \* B, float \* C) { int i = thread Idx . x; C[i] = A[i] + B[i];

### **Organizing Threads: Matrix Addition**



- $\blacktriangleright$  We want to perform the matrix sum C = A + B in parallel on the GPU.
- > The matrices have dimensions nx and ny
- Each thread performs the addition

C(ix, iy) = A(ix, iy) + B(ix, iy)

for a distinct element of A, B and C with row and column indices (ix, iy)

- We can map a single thread to each matrix element in the A, B or C arrays at position idx using a 2D grid of thread blocks where
  - > ix = threadIdx.x + blockIdx.x \* blockDim.x
  - > ix = threadIdx.y + blockIdx.y \*
  - > blockDim.v  $\geq$  idx = iv \* nx + ix

# Matrix Addition with 2D Grid and 2D Blocks

// malloc device global memory float \*d\_MatA, \*d\_MatB, \*d\_MatC; cuda Malloc ((void \*\*) & d\_MatA, nBytes); cuda Malloc ((void \*\*) & d MatB, nBytes); cuda Malloc ((void \*\*) & d MatC, n Bytes) // transfer data from host to device cuda Memcpy (d\_MatA , h\_A , nBytes , cuda Memcpy Host To Device ); > Kernel execution configuration set to use a cuda Memcpy (d MatB, h B, nBytes, cuda Memcpy Host To Device ); // invoke kernel at host side int dimx = 32; int dimy = 32; dim3 block (dimx, dimy); dim3 grid ((nx+block.x-1) / block.x, (ny+block.y-1) / block . y); iStart = cpuSecond (); (32,32)>>> elapsed 0.060323 sec sum Matrix On GPU 2 D <<< grid , block >>>( d MatA , sumMatrixOnGPU2D <<<(512,1024),</p> d\_MatB, d\_MatC, nx, ny); (32,16)>>> elapsed 0.038041 sec cuda Device Synchronize (); iElaps = cpuSecond () - iStart. sumMatrixOnGPU2D <<< (1024,1024),</p> \_global\_\_void sum Matrix On GPU 2D (float \*MatA, float \*MatB, float \*MatC, int nx, int ny) { unsigned int ix = threadIdx.x + blockIdx.x \* (16,16) >>> elapsed 0.045535 sec blockDim.x; unsigned int iy = threadIdx.y + blockIdx.y \* block Dim .y; unsigned int idx = iy\*nx + ix; if (ix < nx && iy < ny) MatC[idx] = MatA[idx] + MatB[idx];

27

# Matrix Addition with 1D Grid and 1D Blocks



29

- > Matrix dimensions nx = ny = 16,384
- Now we use a 1D grid with 1D blocks
- Each thread in the new kernel handles ny elements

30

- Running on an NVIDIA Kepler K80
- > sumMatrixOnGPU1D <<<(512,1), (32,1)>>>
  elapsed 0.061352 sec
- > sumMatrixOnGPU1D <<<(128,1),(128,1)>>>
  elapsed 0.044701 sec

# Matrix Addition with 2D Grid and 1D Blocks



global void sum Matrix On GPUMix (float \*MatA, float \*MatB, float \* MatC, int nx, int ny) { unsigned int ix = thread Idx .x + block Idx .x \*block Dim .x; unsigned int iy = block Idx .y; unsigned int idx = iy\*nx + ix; if (ix < nx && iy < ny) MatC[idx] = MatA[idx] + MatB[idx];

#### > Now we use a 2D grid with 1D blocks

- Each thread takes care of only one data element and the second dimension of grid equals ny
- Running on an NVIDIA Kepler K80 sumMatrixOnGPUMix <<<(512,16384), (32,1)>>> elapsed 0.073727 s
  - > sumMatrixOnGPUMix <<<(64,16384), (256,1)>>> elapsed 0.030765 s (best performance so far)
- Changing execution configurations affects performance
- A naive kernel implementation does not generally yield the best performance
- For a given kernel, trying different grid and block dimensions may yield better performance