# INTRODUCTION TO GPU ARCHITECTURE & PROGRAMMING

COMP4300/8300 PARALLEL SYSTEMS

PROF. JOHN TAYLOR

**APRIL 2024** 



Introduction to the key concepts of the CUDA Programming Model

## Logistics

> Attendance to the Lab sessions is highly encouraged. Most of the practical aspects of the programming models are covered in the Labs.



TEGGA PROVIDER ID: PRV12002 (AUSTRALIAN UNIVERSITY) CRICOS PROVIDER CODE: 001200



- > Programming for the GPU is not an extension of CPU programming
- > GPU hardware is changing rapidly, ever more *massive* parallelism
- > You need to understand the scale of a problem that a GPU can address



#### GPU and the CUDA Programming Model



Modified from original source: Maggie Zhang, Nvidia



- > The reality for the GPU is that there are two fundamental types of parallelism
- ➤ Also referred too as fine- and coarse-grained parallelism





> There are lots of different types of parallelism that are referred to in the literature





- To achieve high performance on the GPU you need to address both types of parallelism
- > If you address only one you will see only a fraction of the possible performance

TECKA DECIMINED IN 201/12002 (ALIKTEMIAN) INIVESSITY (DECIM DECIMINES CITIES 001200)



- $\rightarrow$  Task A  $\rightarrow$  Task B  $\rightarrow$  Task C (Task B can only start after Task A completes)
- > We must understand how the hardware implements parallelism
- Future lectures will delve in to this in more detail



TEQSA PROVIDER ID: PRV12002 (AUSTRALIAN UNIVERSITY) ORCOS PROVIDER CODE: 00120C



- > A Wave is the ideal number of blocks that fills a GPU
- ➤ Wave quantization is a key challenge



- ➤ What happens when we double the size or number of GPUs?
- > Task A and the last step of B fit in half the expanded GPU
- > For a fixed problem size that fits in one GPU we do not get a 2x gain



The Scourge of the Ninja: Wave Quantization

Tiny little part of A that didn't fit in the first wave

But what if A were two blocks bigger?

Extra wave of A can waste a lot of resources

There are no partial waves, hence "wave quantization"

- ➤ Here is an extreme example of the problem of wave quantization
- > The problem size is just a little larger than a wave
- Lots of resources will be wasted



\_\_\_\_\_



- ➤ Here is an extreme example of the problem of wave quantization
- The problem is just a little larger than a wave (2 blocks)



Most of the GPU will be idle when running A', Task B cannot start

Yes, More Waves Mitigates This, But Bigger GPUs = Fewer Waves 2012: Kepler GK110 architecture = 15 SMs 2022: Hopper H100 architecture = 132 SMs 9x SM count increase in 10 years So a kernel which ran in 10 waves on GK110 now runs in 1.1 waves on H100 We went from 10% overhead from wave Hopper H100 Full Chip

- More waves can reduce the impact of wave quantization –
- > original design assumption was for 100 waves
- > The dramatic increase in the size of GPUs (the number of SMs) has reduced the number of waves for a fixed workload and increased the overhead



- ➤ Wave quantization Statistics: on average you will lose 50% of the performance
- > Without planning, you may lose much more performance



Ninjas Use Single-Wave Kernels Don't map threads to data; map data to threads Image suggests 16x16 tiles = 256 blocks rdware suggests √132 SMs = 11.5 x11.5 tiles H100 with 132 SMs

- > Natural assumption is to map threads to data
- > Correct mapping is the reverse data to threads
- ➤ Divide your tasks across 132 SMs





> The consequence of poor mapping is that we have an imbalanced workload



TOOM PROVIDED IN TROOME (ALTERNATIVE MANUFACTURE) TO THE TOOME (ALTERNATIVE MANUFACTURE) T



- > Data parallelism alone will rarely be sufficient to achieve top performance
- > Task parallelism will help, but it is harder to implement





- ➤ The optimal programming approach is to produce single-wave kernels
- > This will not always be possible, watch out for load imbalance
- ➤ Bulk data parallelism will not typically achieve 100% efficiency





- > Task A does not fill the GPU and Task B cannot run until A finishes
- > Task X is independent of Task A, so Task X can now fill the GPU





- > CUDA streams concurrent execution
- > Stream = A sequence of operations that execute in issue-order on the GPU
- > CUDA operations from different streams may be interleaved



TEOGRA PROVIDER IO: PRIVI2002 (AUSTRALIAN UNIVERSITY) CRICOS PROVIDER CODE: 00120C





- > Complex task parallelism can be represented in a CUDA graph
- ➤ A CUDA graph enables multiple GPU operations to be launched through a single CPU operation
- > Build and launch CUDA graphs



> Throughput is faster with task parallelism



> Not all problems you may encounter can be divided into multiple independent tasks ...





> Pipeline parallelism allows you to create and take advantage of parallel tasks



- > Elementwise operations are a rare opportunity
- > Convolutions are an example where surrounding data is required

Example: 1D Stencil, radix-5



> Always take advantage of elementwise kernels if they are greater than one wave



TICÇA PROJECTIO PRIVIZEZ PALIZITALIANIANIAN



➤ Pipeline parallelism by splitting tasks can lead to chained dependencies and undermine any performance gains





- > Avoid the dependencies by reducing the size of dependent tasks (B&C)
- > Reducing the size of tasks increases the number of tasks



The Real Problem: All-to-All Algorithms For example: sorting, fourier transforms, and unfortunately many other useful things Scalar operations, mappings, indexing Convolutions, derivatives, filters, stencil operations Transitive functions, searching, sorting, reductions... and many others

- ➤ All to all algorithms require extensive communication and synchronization
- > Memory usage and bandwidth can limit performance





- > This approach also introduces redundant computation at the edges
- > The impact of redundant computation can be small as a few % over large arrays





- > The pipelining solution delivers no benefits in this case
- You will often be working with all-to-all algorithms





- > You may be able to break chunks of your problem into pipelines
- ➤ All-to-all will act as a synchronization point



, in to an win det as a synteme inzation point

Parallel tasks

Parallel tasks

Task 1 Task 2 Task 3 Task 4

Postbool Attended Femore Femore Femore Femore Femore Add & Norm

Food

Food

Add & Norm

Add & Norm

Food

Food

Food

Add & Norm

Add & Norm

Food

Food

Food

Food

Food

Add & Norm

Food

- ➤ We can divide the model into seperate parts
- > This is a form of task parallelism for complex workflows



- ➤ Model parallelism divides a model into separate tasks
- > The example is a multi-layer deep learning model



TICGA PROVICES DI PRIZZEZ PARTICIO DI PRIZZEZ PARTICIO DI PRIZZEZ PROVIES COLO SUZZO



- > A simple split may not work well if you ignore dependencies between tasks
- > In this example Task 3 will act as a bottleneck



\_\_\_\_

TEGGA PROVIDER DI: PRI/2000 (AUSTRALIAN UNIVERSITY) CRICOS PROVIDER CI

\_\_\_\_\_



Reducing synchronization can be more efficient that attempting to balance the task workload

<u></u>

SA PROVIDER ID: PRV12002 (AUSTRALIAN UNIVERSITY) CRICOS PROVIDER CODE: 00120C



- A key goal when implementing model parallelism is to minimize inter-task synchronization ie Reduce waiting time and keep the GPU busy
- > This applies to task parallelism in general



> Once you have identified a split that minimizes synchronization you can then further split based on that hierarchy



TEGSA PROVIDERIO: PRV12002 (AUSTRALIAN UNIVERSITY) CRICOS PROVIDER CODE: 001200



- > As with many compute architectures, GPU memory is a critical resource
- > The more tasks the less cache available, the more cache misses that undermine efficiency



-





- > A high cache hit rate produces the highest performing code
- > L2 cache has higher bandwidth and lower latency than HBM memory



- > Row-major finishes at the bottom
- > Task B will start again at the top left ....



Aside: Stop running all your kernels row-major from the top left Then B misses in cache when starting from the top HBM Memory (80GB) Finishes here

> ➤ When switching to Task B you will always generate a cache-miss with row-major kernels



- > Running with B in reverse order will produce a cache hit (~10x faster)
- ➤ Managing cache effectively can deliver significant benefits





- > Identifying whether your program is bandwidth limited is essential to producing high performing code
- For most problems that you encounter this will be the case
- > Can we run our problem in L2 cache?





To Keep Data in Cache We Run Each Task in Series, NOT in Parallel

- > Previous examples demonstrated how to split problems into smaller tasks
- > Split the tasks into L2 cache-sized chunks
- > Run each Task in series on the cache size chunk!

This is Known As "Tiling" Your Execution in Cache You'll really want to design your program for this up-front Input data and working set must be sized to fit entirely in L2 cache HBM Memory (80GB) Split data into cache-sized chunks Run tasks in series

- > Running tasks in series is known as tiling e.g tile-based graphics rendering
- Choosing the optimal tiling size is crucial for achieving good performance



So We Can Task-Parallelise Our Task-Based Cache Tiling

- > Can take advantage of both task and data parallelism, all running in cache
- Programming complexity increases
- > Design from the start, refactoring to achieve this most likely will be hard





## The grid stride loop pattern in CUDA

> The **grid stride loop pattern** is a technique used in CUDA programming to ensure that a kernel can efficiently process data arrays of any size

```
_global__ void saxpy(int n, float a, float *x, float *y) {
	for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x) {
		 y[i] = a * x[i] + y[i];
	}
}
```

- In this example, each thread calculates its unique index in the array ('i = blockldx.x \* blockDim.x + threadldx.x'), and then processes the element at that index.
- ➤ The thread then increments its index by the total number of threads in the grid ('blockDim.x \* gridDim.x'), and processes the next element, repeating this process until all elements have been processed
- > This pattern allows the kernel to handle data arrays of any size, even when the number of threads launched is less than the number of data elements.
- > It also makes your CUDA kernels more flexible and scalable

---

### Summary



- > Programming a streaming multiprocessor is not an extension of CPU programming!
- > Is a GPU required based on the scale of the task and the ability to expose parallelism?
- > Data and task parallelism concepts are the GPU fundamentals that you should master
- > Seek to achieve wave quantization on the target GPU
- > Task parallelism
- > All-to-all algorithms break task parallelism, use higher level model parallelism
- > Create a CUDA graph of complex model parallelism tasks, reduce dependencies
- > Avoid bandwidth limitations, tile execution in cache



#### **Optimization Workflow in CUDA**

# Wrong View of Optimization!

- Try all the optimization methods in the book
- ...optimization is endless...

