#### PERFORMANCE ANALYSIS WITH THE ROOFLINE MODEL (CPU & GPU)

#### COMP4300/8300 PARALLEL SYSTEMS

#### **PROF. JOHN TAYLOR**

MAY 2024





#### Semester 1 SELT - survey journey

respectful feedback (your teacher

can't identify you)

The Student Experience of Learning & Teaching survey allows students to give feedback on their courses and teachers. It is **voluntary** and **confidential**, and run by the Institutional Research (IR) team.





#### Find out more on the *Info for Students* webpage:

https://services.anu.edu.au/learning-teaching/education-data/student-experience-of-learning-teaching-selt/information-for



#### **SELT - Frequently asked questions**

#### What kind of feedback is helpful?

Think about your experience of the course and teaching, and what worked or didn't work for you.

When writing feedback, focus on respectful and constructive language – if you were a teacher, what type of feedback would help you improve the class?

#### Can teachers see who left specific feedback?

SELT is confidential, and teachers cannot see, or ask to see, the identity of a respondent. Unless you self-identify, for example by using names or describing specific events, teachers cannot identify you.

If your class has very few enrolments, it may be difficult to remain completely anonymous.



#### Find out more on the *Info for Students* webpage:

https://services.anu.edu.au/learning-teaching/education-data/student-experience-of-learning-teaching-selt/information-for



#### References

Accelerating HPC Applications with NVIDIA Nsight Compute Roofline Analysis <u>https://developer.nvidia.com/blog/accelerating-hpc-applications-with-nsight-compute-roofline-analysis/</u>

C. Yang, T. Kurth, and S. Williams, Hierarchical Roofline analysis for GPUs:
Accelerating performance optimization for the NERSC-9 Perlmutter system,
Concurrency and Computation: Practice and Experience, e5547,
2019. <u>https://doi.org/10.1002/cpe.5547</u>

Analyze CPU Roofline

4

https://www.intel.com/content/www/us/en/docs/advisor/userguide/2024-1/analyze-cpu-roofline.html

# **CPU and GPU Code Optimisation**

Assess, Parallelize, Optimize, **Deploy**(APOD)

- Having completed the GPU acceleration of one or more components of the application it is possible to compare the outcome with the original expectations developed in the assess step
- The partially parallelized implementation can be carried through to production as it allows the user to profit from their investment as early as possible (the speedup may be partial but is still valuable)



Assess, Parallelize, Optimize, Deploy(APOD)

- > APOD is a cyclical process
- initial speedups can be achieved, tested, and deployed with only minimal initial investment of time
- the cycle can begin again by identifying further optimization opportunities, seeing additional speedups
- Incremental deployment of the even faster versions of the application into production



Assess, Parallelize, Optimize, Deploy(APOD)

By understanding the end-user's requirements and constraints and by applying Amdahl's and Gustafson's laws, the developer can determine the upper bound of performance improvement from acceleration of the identified portions of the application.



Assess, Parallelize, Optimize, Deploy(APOD)

- Having identified the hotspots and having done the basic exercises to set goals and expectations, the developer needs to parallelize the code.
- Depending on the original code, this can be as simple as calling into an existing GPU-optimized library such as cuBLAS or adding a few preprocessor directives as hints to a parallelizing compiler.
- Some applications' designs will require some amount of refactoring to expose their inherent parallelism



Assess, Parallelize, Optimize, Deploy(APOD)

- After each round of application parallelization is complete, the developer can move to optimizing the implementation to improve performance
- Optimizations can be applied at various levels, from overlapping data transfers with computation all the way down to fine-tuning floating-point operation sequences
- The available profiling tools are invaluable for guiding this process



## **Performance Models**

We need a quantitative model that defines good performance with reference to the specific hardware available.

Good performance is defined by two fundamental requirements:

- Must not be significantly penalized by serialization (Amdahl's Law), load imbalance or communication (parallel overhead, synchronization) (scalability requirement)
- Must attain high utilization of the CPU's compute and/or bandwidth capabilities (efficiency requirement)



- In order to find such a model let's start looking at some simple algorithmic motifs which are common in scientific computing.
- Clearly, a first limit for their execution speed is the peak *floating point operations* (FLOPs) performance of the hardware, measured in FLOPs/s (Figure: Cascade Lake <u>https://www.microway.com</u>)
- Another limit is main memory bandwidth, which restricts the speed of data transfers from and to the CPU.
- The more FLOPs/s an application performs per unit of data transferred, the more likely the code is to reach peak performance.



The key metric is the **arithmetic intensity**:

$$AI = \frac{FLOP/s}{Bytes/s}$$
(1)

where the denominator is the number of Bytes read from and written to main memory per second.

- The AI can be calculated taking the total number of FLOPs divided by the Bytes transferred from and to main memory during program execution.
- An especially important value is the machine balance B<sub>m</sub>: the ratio between the peak FLOP performance and the memory bandwidth

B<sub>m</sub> = <u>Peak Performance (FLOP/s)</u> Memory bandwidth (Bytes/s)

(2)

# **The Roofline Model**

- This is a simple but very powerful model that ties FLOP performance, Al and memory performance in a 2D graph. [William, Waterman, and Patterson, Communications of the ACM Volume 52, Number 4 (2009), Pages 65-76]
- One good way to find peak memory performance is to use the STREAM benchmarks.
- For a given code snippet (kernel) one can find a point on the horizontal axis by measuring its AI.
- The performance of that kernel lies on the vertical line through that point.



## **The Roofline Model**

For each given AI, the "roofline" is the maximum FLOP performance achievable by the code on the specific hardware architecture used.

Max Attainable GFLOP/s = min - Peak Memory BW \*AI Peak FLOP Performance

- There are four performance regions: -
  - > unattainable performance
  - bandwidth bound performance
  - compute bound performance
  - poor performance
- The ridge point has abscissa equal to the machine balance.



### **The Roofline Model**

- Kernels that lie close to the roofline are making good use of the hardware resources.
- Kernels can have low performance (GFLOPs/s), but make good use (% STREAM) of a machine.
- Kernels can have a relatively high performance (GFLOPs/s), but still make poor use (% Peak GFLOPs/s) of a machine.



How can your kernel lie in the poor performance region?



How can you kernel lie in the poor performance region?

A likely reason is the lack of some specific optimizations that enable the program to use the underlying hardware efficiently.



- Each optimization constitutes a "performance ceiling" below the appropriate roofline.
- The horizontal lines are performance rooflines associated with FLOP performance.
- The tilted lines are associated with memory performance.



- The first FLOP performance ceiling is associated with Thread Level Parallelism (TLP).
- In order to achieve reasonable performance on modern processors one must use all cores in parallel.
- The second FLOP performance ceiling is associated with Instruction Level Parallelism (ILP) and SIMD parallelism.
- Maximize ILP is about hiding completely functional unit latency: loop unrolling, loop fusion, avoid branching in inner loops, etc.
- Need to use SIMD instructions in order to use all processors floating point arithmetic units.



- The final FLOP performance ceiling is associated with *floating point instruction mix*.
- Peak FLOP performance requires an equal number of nearly simultaneous additions and multiplications.
- This balance is necessary typically because the computer supports a fused multiply-add (FMA) instruction.
- Another reason may be also that FP units have an equal number of FP adders and FP multipliers.



- The first memory performance ceiling is associated with strided access.
- In order to reduce main memory traffic, inner loops must have a *unit stride* access pattern.
- The second is about *memory affinity*, that is NUMA effects.
- Allocate data and the threads tasked to operate on that data to the same memory/numa domain pair







1

1/2

1/8

1/4

ernel

1

1/2

TEQSA PROVIDER ID: PRV12002 (AUSTRALIAN UNIVERSITY) CRICOS PROVIDER CODE: 00120C

2

Operational Intensity (Flops/Byte)

2

ernel

8

16

4

# **GPU Code Optimisation**

GPU Performance optimization revolves around four basic strategies:

- Maximize parallel execution to achieve maximum utilization
- Optimize memory usage to achieve maximum memory throughput
- Optimize instruction usage to achieve maximum instruction throughput
- Minimize memory thrashing by not constantly allocating and freeing memory



- Typically reasons for low performance are architecture (and kernel) dependent
- Why could the kernel corresponding to red dots have low performance?
- How can we increase performance?



- Expose more parallelism to maximise SM usage and hide arithmetic latency
  - Change grid and block sizes (kernel configuration) to increase block-level parallelism
  - > Unroll loops
  - Tuning number of registers per thread
- Avoid thread divergence



What if we want better bandwidth utilization?



AI: FLOPs/Byte Ratio

- What if we want better bandwidth utilization?
  - Optimize for obtaining aligned and coalesced memory transactions
  - This may require a redesign of data structures (e.g. AoS to SoA) and better indexing
  - Expose more parallelism to maximise SM usage and hide memory latency



# **High Performance GPU Code**

- Assume your kernels yields the yellow dot performance.
  - Are you happy with this level of performance...?



## **Increasing AI for GPU Code**

- Assume your kernels yields the yellow dot performance.
  - Are you happy with this level of performance...?
  - How can we increase AI?



### **Increasing AI for GPU Code**

- Assume your kernels yields the yellow dot performance. Are you happy with it?
- How can we increase AI?
  - Increasing the data reuse and reducing the data movement increases AI i.e. reducing data movement



### **Increasing AI for GPU Code**

- Assume your kernels yields the yellow dot performance. Are you happy with it?
- How can we increase AI?
  - Increasing the data reuse and reducing the data movement increases AI
  - ➢ Better caching → Optimize code to be cache friendly (avoid false sharing on CPU)
  - Usage of shared memory on GPU to reduce global memory transactions
  - Also, you may be able to achieve better performance with a fundamental algorithm redesign!



#### **Roofline Analysis with Intel Advisor**

Performance Metrics Summary Cores: 1 on 1 socket(s) 🔮 🔻 🔻 DRAM 🔻 🖄 Guidance 🔻  $\equiv$ N Q sm 🔺 GFLOPS SP Vector FMA Peak: 177.3 GFLOPS п 100 DP Vector FMA Peak: 88.79 GFLOPS DP Vector Add Peak: 44 39 GELOPS 10 Scalar Add Peak: 6,87 GFLOPS 0 0 • • 0 00 0 0.1 0.01 0.001 0.0001 Bound by com 1.00e-5 ي ر بان Memory bound? and memory re Compute bound? FLOP/Byte (Arithmetic Intensity) 0.001 0.01 0.1 1 10 100 1000 10000 1.00e+5 1.00e+6 1.00e+7

You can use the Intel Advisor tool to produce a roofline plot for CPU

# **Roofline Analysis with Intel Advisor**

- Ideally, compile your program with icx, not gcc
- You need full isolation from other users when profiling
- module load intel-advisor
- OMP NUM THREADS=NT advixe-cl --collect roofline --project-dir ./results dir -search-dir src:=./ --search-dir bin:=./ --./testAdvect -P M N
- > advixe-guito launch the GUI







34

COS PROVIDER CODE: 00120C

You can use NVIDIA Nsight Compute to produce a roofline plot for GPU



**Kernel\_A** is an instruction throughput bound kernel: with each thread we do 10000 double precision adds, for an arithmetic intensity of 10000 / 8 (bytes per double precision word) = 1250.

```
__global__ void kernel_A(double* A, int N, int M)
{
    double d = 0.0;
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < N) {
        #pragma unroll(100)
            for (int j = 0; j < M; ++j) {
                d += A[idx];
                }
            A[idx] = d;
            }
    }
}</pre>
```

**Kernel\_B** is identical to kernel\_A, except that we artificially throttle occupancy on the GPU by allocating 96 kB of shared memory per thread block, which means that only one thread block can be resident on an SM at any one time, for an occupancy of 1/32 = 3.125%.

```
__global__ void kernel_A(double* A, int N, int M)
{
    double d = 0.0;
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < N) {
        #pragma unroll(100)
            for (int j = 0; j < M; ++j) {
                d += A[idx];
                }
            A[idx] = d;
            }
}</pre>
```

```
cudaFuncSetAttribute(kernel_B,
cudaFuncAttributeMaxDynamicSharedMemorySize, 96 * 1024);
kernel B<<<numBlocks, threadsPerBlock, 96 * 1024>>>(A,N,M);
```

**Kernel C** is primarily memory-bandwidth bound -we just do a single double precision add, combined with a load and a store (for an arithmetic intensity of 1/16= .0625). The memory access pattern is strided -- we load every element of B exactly once and store every element of A exactly once, but any given warp is accessing memory locations 32 bytes apart between each thread

```
__global__ void kernel_C(double* A, const double* B, int N)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    // Strided memory access: warp 0 accesses (0,stride, 2*stride,
    ...), warp 1 accesses
    // (1, stride + 1, 2*stride + 1, ...).
    const int stride = 16;
    int strided_idx = threadIdx.x * stride + blockIdx.x % stride +
    (blockIdx.x / stride) * stride * blockDim.x;
    if (strided_idx < N) {
        A[idx] = B[strided_idx] + B[strided_idx];
    }
}</pre>
```

- ➤ module load cuda
- > ncu --set full -o ProfileOutput
   <application>
- > ncu-ui to launch the GUI



#### DEMO

#### **Roofline Analysis with NVIDIA Nsight Compute**

