Outline#

In this week’s lab you will:

  1. Investigate how the dispatch width affects the CPI and power consumption of a processor.
  2. Investigate how cache parameters impact performance, and the tradeoffs involved in implementing caches.
  3. Learn how to simulate multicore processors, and investigate their performance.

Introduction#

In last week’s lab, you experimented with the Sniper simulator. You set various CPU parameters, ran simulations, and learned how to interpret the results. We assume you are comfortable running basic simulations with Sniper and familiar with the names of the different parameters. In this week’s lab, you will use those skills to investigate different CPU configurations, gain insights into the relationships between parameters, and quantitatively analyse the results to evaluate performance.

The lab is split into three tasks. Each task consists of running several simulations to collect data, making observations, and answering a series of open-ended questions. You should discuss your insights with classmates or tutors to get the most out of this lab. We estimate each task should take approximately an hour, but feel free to spend longer if you find a particular task exciting and want to dive deeper. For each task, choose a subset of the traces to use. At a minimum, try to use five traces per task for a good sample size, and run more if you have time.

Automating simulations#

In the last lab, we provided you with 11 SIFT traces, and in each task of this lab you will need to run many simulations across them. Manually modifying the Sniper command-line arguments for each simulation will very quickly become tedious, time-consuming, and error-prone. This lab is an excellent chance to start developing scripts that automate repetitive tasks and organize results for easy analysis. These scripts will come in handy for the second assignment: you can quickly reuse them to test various configurations and focus your effort on providing comprehensive and detailed insights.

There are many ways you could go about scripting: you could write a simple shell script or use the scripting language of your choice (Python is a good choice). Whatever you choose, here are some good tasks to automate:

  • For each configuration in a list, running every trace and saving the output to a separate directory
  • Automatically generating CPI stacks and McPAT reports
  • Parsing the outputs of these tools to extract the relevant data and assembling it into a single table/report

The first two items are highly recommended; for the third, weigh whether the time invested in writing the script will be less than the time spent collating the data manually.
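As a concrete starting point, here is a minimal shell sketch of the first two items. The trace names, the results/ directory layout, and the cpistack.py and mcpat.py helper locations are assumptions based on a typical Sniper checkout; adjust them to match your setup.

#!/bin/bash
# Sketch: run every trace under every dispatch width and archive the results.
# Trace names and tool paths are assumptions; adapt them to your installation.
TRACES="trace1 trace2 trace3 trace4 trace5"

for width in 2 4; do
  for trace in $TRACES; do
    outdir="results/width-${width}/${trace}"
    mkdir -p "$outdir"
    # run the simulation, writing all output files into $outdir
    ./run-sniper -c gainestown -c rob \
      -g --perf_model/core/interval_timer/dispatch_width=${width} \
      -d "$outdir" --traces=${trace}
    # generate the CPI stack and McPAT report next to the raw output
    ./tools/cpistack.py -d "$outdir" > "${outdir}/cpi-stack.txt"
    ./tools/mcpat.py -d "$outdir" > "${outdir}/mcpat.txt"
  done
done

You can reuse the same loop structure to sweep cache parameters in the later tasks.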

Dispatch Width#

You will examine the impact of changing the dispatch width of a processor and its cache hierarchy. For this task, use the gainestown and rob configurations. Consider the following three processor core configurations (a command-line sketch for setting these up follows the list):

  1. one with a 3-level cache hierarchy and dispatch width of 4 (default Gainestown)
  2. dispatch width of 2 (3-level cache hierarchy)
  3. dispatch width of 2 and a 2-level cache hierarchy
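If you prefer not to create new .cfg files for each variant, Sniper’s -g flag can override individual parameters on the command line. The dispatch_width and perf_model/cache/levels parameter names below are taken from the stock Gainestown configuration; verify them against your Sniper version:

# 1. default Gainestown: dispatch width 4, 3-level cache hierarchy
./run-sniper -c gainestown -c rob --traces=trace1
# 2. dispatch width 2, 3-level cache hierarchy
./run-sniper -c gainestown -c rob -g --perf_model/core/interval_timer/dispatch_width=2 --traces=trace1
# 3. dispatch width 2, 2-level cache hierarchy (drop the L3)
./run-sniper -c gainestown -c rob -g --perf_model/core/interval_timer/dispatch_width=2 -g --perf_model/cache/levels=2 --traces=trace1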

Using the configurations above, answer the following questions:

  1. For a given benchmark trace, which core type delivers the best IPC?
  2. What is the impact of changing the dispatch width on total power and energy? What about EDP?
  3. What is the impact of changing the number of levels in the cache hierarchy on total power and energy? What about EDP?
  4. How does the CPI stack change as the dispatch width is increased? Why do you think this is?

Caches#

You will investigate fundamental tradeoffs in designing set-associative caches. The goal is to understand the impact of various cache parameters on hit rate and overall performance. For this task, you will need to modify the following parameters (a sketch of sweeping them from the command line follows the block):

[perf_model/l1_dcache]
perfect = false (make this true to simulate a perfect cache)
cache_size = 32 (in KB)
associativity = 8
replacement_policy = lru
data_access_time = 4 (in cycles)
tags_access_time = 1 (in cycles)
cache_block_size = 64 (in bytes)
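A sketch of such a sweep, assuming the gainestown base configuration and the parameter names listed above (verify both against your Sniper version):

# sweep the associativity of the fixed-size 32 KB L1 data cache
for assoc in 1 2 4 8 16; do
  ./run-sniper -c gainestown \
    -g --perf_model/l1_dcache/associativity=${assoc} \
    -d results/l1-assoc-${assoc} --traces=trace1
done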

Consider the following questions:

  1. For a fixed size L1 data cache, what is the relationship between the degree of associativity and hit rate (i.e., an n-way set-associative cache has the degree n)?
  2. For a fixed size L1 data cache, what is the relationship between the block size and hit rate?
  3. What is the impact on hit rate of different replacement policies? Sniper provides the following replacement policies: round_robin, lru, lru_qbs, nru, mru, nmru, plru, srrip, srrip_qbs, and random. Use Google to familiarize yourself with the different replacement policies. (You do not need to understand each one of these. We just want you to pick one or two and observe the cache hit rates.)
  4. What is the impact of the L1 data cache access time (both tag and data) on overall execution time and CPI? Use CPI stacks to understand the impacts.
  5. What is the improvement in overall CPI and execution time due to a perfect level-1 data cache?

You can limit your analysis to LRU, MRU, and random; only if you have the time, pick another replacement policy to analyze deeply.

As you answer the above questions, consider how changing one parameter may affect others. For example, increasing the cache size will likely increase the data access time, and increasing the associativity will increase the tag access time. Consider these tradeoffs as you determine your simulation parameters.
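Concretely, a run that doubles the cache size might raise the data access time in the same command; the latency value below is an illustrative assumption, not a measured number:

# 64 KB L1 data cache with a correspondingly higher (assumed) 6-cycle data access time
./run-sniper -c gainestown \
  -g --perf_model/l1_dcache/cache_size=64 \
  -g --perf_model/l1_dcache/data_access_time=6 \
  --traces=trace1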

Does the outcome of the above analysis match your intuition? If that is not the case, do you have a hypothesis about the potential reasons for (any) surprising results? Can you set up an experiment to verify your hypothesis?

You can watch the COMP2310 cache lecture if you want a quick overview of caches and what associativity means.

Heterogeneous Multicores#

In this task, you will simulate multicore processors with Sniper. You will gain insight into how best to schedule two different applications on heterogeneous multicore processors. You will come up with static schedules that maximize system throughput and compare your scheduling algorithms against the optimal schedule. First, we provide some background on multicore processors and on quantifying their performance.

Multicores#

Multicore processors, or multicores, are ubiquitous. They consist of two or more independent cores, each resembling a separate pipeline similar to the MIPS pipeline we discussed in class. The cores typically share the last level of the cache (Level 3 in high-end processors), while each core has private Level-1 and Level-2 caches. The cores also share a bus-based or network-based interconnect. The interconnect is needed to support multithreaded applications: such applications share program data, and the most recent copy of the data may reside in any core’s private cache. We will study an example interconnection network in class. Note that contention for shared resources leads to interference when multiple applications run together on a multicore processor; interference can potentially degrade performance.

Heterogeneous Multicores#

Unlike a homogeneous multicore, which consists of cores of a single type, a heterogeneous multicore combines power-efficient and high-performance cores. An example of such an architecture is a two-core processor comprising an in-order MIPS pipelined core and an out-of-order pipelined core.

System Throughput#

For a uniprocessor, IPC quantifies the instruction throughput. We need a metric to quantify the instruction throughput of a multicore processor. Simply adding the IPCs of the two applications does not reveal much insight. Similarly, computing the arithmetic mean of the two IPCs hides the true impact of the multicore, and neither helps compare alternative scheduling policies or different workloads’ behavior on the same processor. Instead, we measure the system throughput of a two-core homogeneous multicore processor running two applications as follows:

  1. Find the isolated IPC of each application running alone on a single core (IPC_isolated,1 and IPC_isolated,2).
  2. Find the IPCs when the two applications co-execute on the shared multicore processor (IPC_shared,1 and IPC_shared,2).
  3. Compute the ratio of each shared IPC to the corresponding isolated IPC, e.g., IPC_shared,1 / IPC_isolated,1.
  4. The system throughput is the sum of the two ratios.

The extension of system throughput to a heterogeneous multicore processor is straightforward; we leave it to you to figure out. For the isolated IPCs, use the big core as the reference.
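Written compactly, the steps above give, for n co-running applications (n = 2 here):

\[
\mathrm{STP} = \sum_{i=1}^{n} \frac{\mathrm{IPC}_{\mathrm{shared},i}}{\mathrm{IPC}_{\mathrm{isolated},i}}
\]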

Note that system throughput is a weighted-IPC metric: we first normalize each IPC against a reference core. It is interpreted as the number of jobs per unit of time the multicore can perform. For example, a system throughput of two for a specific workload means the multicore performs two jobs per unit of time, compared to one for a uniprocessor.

To interpret system throughput, suppose the isolated IPCs of two applications are both equal to one, and that each shared IPC is also equal to one. The system throughput in this scenario is two, meaning the multicore processor is operating at full capacity (twice the throughput of the uniprocessor). The shared IPC cannot be greater than the isolated IPC. If, however, due to interference for shared resources, the shared IPCs equal 0.5 for both applications, then the system throughput is one. This result implies the multicore processor is operating at half of the ideal (or expected) throughput.

To obtain an even more meaningful metric, we can divide the system throughput by the core count n. The maximum (normalized) system throughput is then one. A normalized throughput below 0.5 means it is better to run the two applications one after the other on a uniprocessor than to run them together on a multicore processor.
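In the same notation, the normalized version described above is simply:

\[
\mathrm{STP}_{\mathrm{norm}} = \frac{\mathrm{STP}}{n}, \qquad 0 < \mathrm{STP}_{\mathrm{norm}} \le 1
\]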

Simulating Heterogeneous Multicores#

Let us simulate a heterogeneous multicore processor with big and LITTLE cores, similar to the ARM big.LITTLE architecture. You can simulate such a processor with Sniper using the following command:

./run-sniper -n 2 -c gainestown -c rob -c big,LITTLE --sim-end=last-restart --traces=trace1,trace2

The above command assumes there are two new configuration files, (1) big.cfg and (2) LITTLE.cfg, in Sniper’s configurations directory.

Commercial big.LITTLE processors use performance-focused cores for big cores and energy-efficient cores for LITTLE ones. Some examples of commercial big.LITTLE processors include the ARM-based Apple M1 system on a chip, Samsung Exynos, and Qualcomm Snapdragon.

We suggest the following configurations for simulating big and LITTLE cores:

  1. big uses a dispatch width of four.
  2. LITTLE uses a dispatch width of two.

You can create big.cfg and LITTLE.cfg in the configurations directory and use the following to override the dispatch width settings.

# big.cfg
[perf_model/core/interval_timer]
dispatch_width = 4

# LITTLE.cfg
[perf_model/core/interval_timer]
dispatch_width = 2

Although you can simulate any number of processor cores, we will limit ourselves to two. The “-n 2” in the above command tells Sniper to create a processor with two cores. The “-c big,LITTLE” tells Sniper that the first core (core 0) is the big core and the second core (core 1) is the LITTLE core. Knowing this helps to interpret the results in sim.out.

Simulating multicore processors leads to a challenging methodological problem: one of the two applications could finish earlier than the other. One application could finish early because different applications have different IPCs, and in a heterogeneous multicore processor the two cores have different capabilities. We can tell Sniper to terminate the simulation when (1) the first application finishes execution or (2) the last application finishes execution. Note that in both cases, one of the cores is idle some of the time. To simulate a more realistic scenario in which both cores are busy, Sniper provides a third option called last-restart. In this case, Sniper restarts any application that finishes execution early, and the simulation terminates when both applications have finished executing at least once.
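Concretely, these three termination modes map onto the --sim-end flag used in the command above. The first and last value spellings are assumptions inferred from the last-restart value; check run-sniper --help to confirm them:

# stop when the first application finishes (the other core then sits idle)
./run-sniper -n 2 -c gainestown -c rob -c big,LITTLE --sim-end=first --traces=trace1,trace2
# stop when the last application finishes
./run-sniper -n 2 -c gainestown -c rob -c big,LITTLE --sim-end=last --traces=trace1,trace2
# restart early finishers; stop once every application has completed at least once
./run-sniper -n 2 -c gainestown -c rob -c big,LITTLE --sim-end=last-restart --traces=trace1,trace2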

We can test different scenarios using the above three options. Which one to use depends on the target environment and the target metric. We ask you to use last-restart in this exercise because we are interested in observing the system throughput (in terms of instructions per second); first would also work. The last setting is of interest when the focus is on the total execution time of the workload.

Note that in computer architecture, we use the term workload to refer to a combination of applications or benchmarks running together or one after the other. A workload thus consists of many applications. A benchmark is a single instance of a representative application.

Estimating ILP#

A reduction in the base component of the CPI stack when moving from the LITTLE core to the big core indicates that the application exposes high levels of instruction-level parallelism (ILP) to the processor. At the same time, a higher dispatch width enables the processor to find more independent instructions every cycle.

  1. Find two benchmarks with a large reduction in the base component from the LITTLE to the big core. Is there a corresponding increase in the IPC? If not, why not?
  2. Suppose you have the CPI stacks for two benchmarks on the LITTLE core. You now need to pick one benchmark to migrate to the big core. Which one would you choose? What missing information do you need to make a more informed choice?

Scheduling#

Assigning programs to cores in heterogeneous multicores is not a trivial task. More specifically, assigning programs to core types (big or LITTLE) significantly impacts the system’s performance (throughput). Sniper uses a default scheduling algorithm called pinned that statically assigns a core to each benchmark for the duration of the workload execution. In this task, you will investigate the best way to schedule two programs on a big.LITTLE processor.

  1. Given the CPI stacks for two benchmarks on the LITTLE core, devise a simple algorithm to schedule the two-benchmark workload on a big.LITTLE processor.
  2. Create four two-benchmark workloads by randomly pairing benchmarks. Using your algorithm, schedule each workload on a big.LITTLE processor using the command mentioned above, and note down the system throughput.
  3. Sniper includes several scheduling algorithms: static, pinned, roaming, big_small, and sequential. Investigate the throughput for the different scheduling algorithms.
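A sketch of selecting a different scheduler for a run; the scheduler/type option name is an assumption based on Sniper’s [scheduler] configuration section, so confirm it against your Sniper version:

# try the big_small scheduler instead of the default pinned
./run-sniper -n 2 -c gainestown -c rob -c big,LITTLE \
  -g --scheduler/type=big_small \
  --sim-end=last-restart --traces=trace1,trace2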