Outline

In this week’s lab you will:

  1. Add and implement a new cache replacement policy in Sniper’s C++ backend code
  2. Add configurable parameters to your cache replacement policy
  3. Add a metric to gain insights into program behavior

Introduction

This lab will get you started working with the backend of Sniper. Recall from lab 7 that Sniper is split into two parts: a frontend, which allows you to specify processor configurations from predefined components and policies, and a backend, in which you can implement new components and policies in C++. Over the last two labs you should have become well-versed in working with the frontend to gain insights into processor behavior on a set of benchmarks. In this lab you’ll learn about the backend by implementing a cache replacement policy.

Make sure you have a good understanding of the fundamentals of caches before starting this lab. At a minimum you should know what the key parameters of a cache are and understand the role of a cache replacement policy.

This lab is directly relevant to assignment 2 where you are asked to implement branch predictors. The interface and design of the cache replacement subsystem are similar enough that you should be able to apply what you do in this lab to the assignment. There is a video on the assignment 2 page which goes into more detail explaining the specifics of the branch predictor subsystem.

Although the backend is written in C++, it doesn’t use many of the language’s advanced features. With a strong programming background and a good understanding of object-oriented programming (e.g. Java in COMP1110), you should be able to navigate and understand the code.

Getting Started

Navigate to common/core/memory_subsystem/cache. In this directory you will find the code responsible for simulating a cache. Take some time to look through the code; you don’t need a precise understanding of how all of it works, but see if you can identify how it simulates a cache. Have a look at cache_set.h and cache_set.cc in particular.

We are going to use the Most Recently Used (MRU) policy as a case study to see how cache replacement policies are implemented. Find the header file for this replacement policy. In the header you should identify a few important functions:

  • The constructor
  • The destructor
  • getReplacementIndex
  • updateReplacementPolicy

The purposes of the constructor and destructor are self-explanatory. getReplacementIndex is called when the cache needs to load a new block into the set; it returns the index of the cache block to be replaced. updateReplacementPolicy is called whenever a cache block in the set is accessed so that the replacement policy can update its internal data structures. Go to the implementation file for the MRU policy and have a look at how they are implemented. Note that when deciding which block to replace, it first checks whether there are any invalid blocks, prioritizing filling the cache before evicting an existing block.
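To make the shape of the interface concrete, here is a stand-alone sketch of the two key functions for an MRU-style policy. The class name, types, and fill-on-demand bookkeeping are simplifications for illustration; the real Sniper code derives from a CacheSet base class and tracks validity in the cache block info instead.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative stand-alone sketch, not Sniper's actual class.
class CacheSetMRU {
public:
    explicit CacheSetMRU(uint32_t associativity)
        : m_valid(associativity, false), m_mru_index(0) {}

    // Called when a new block must be loaded: return the victim's index.
    uint32_t getReplacementIndex() {
        // Prefer filling an invalid (empty) way before evicting anything.
        for (uint32_t i = 0; i < m_valid.size(); ++i)
            if (!m_valid[i]) { m_valid[i] = true; return i; }
        // Otherwise evict the most recently used block.
        return m_mru_index;
    }

    // Called on every access so the policy can update its bookkeeping.
    void updateReplacementPolicy(uint32_t accessed_index) {
        m_mru_index = accessed_index;
    }

private:
    std::vector<bool> m_valid;
    uint32_t m_mru_index;
};
```

The division of labor is the important part: getReplacementIndex makes the eviction decision, while updateReplacementPolicy only records access history.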

Implementing Your Own Policy

You are going to implement a Least Frequently Used (LFU) policy. In an LFU policy each cache block has a counter associated with it. Each time the block is accessed, the counter is incremented. When the cache needs to evict a block, it chooses the one with the lowest count, i.e. the one that has been accessed the least. When it loads a new block, it resets the block’s counter to zero.
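One possible shape of the LFU bookkeeping, again as a simplified stand-alone sketch rather than a drop-in Sniper class (your version should derive from Sniper's CacheSet base and follow the MRU files you copied):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative stand-alone sketch of LFU, not Sniper's actual class.
class CacheSetLFU {
public:
    explicit CacheSetLFU(uint32_t associativity)
        : m_valid(associativity, false), m_counts(associativity, 0) {}

    uint32_t getReplacementIndex() {
        // Fill an invalid way first, as the MRU policy does.
        for (uint32_t i = 0; i < m_valid.size(); ++i)
            if (!m_valid[i]) { m_valid[i] = true; m_counts[i] = 0; return i; }
        // Otherwise evict the least frequently used block.
        uint32_t victim = 0;
        for (uint32_t i = 1; i < m_counts.size(); ++i)
            if (m_counts[i] < m_counts[victim]) victim = i;
        m_counts[victim] = 0;   // the incoming block starts with a fresh count
        return victim;
    }

    void updateReplacementPolicy(uint32_t accessed_index) {
        ++m_counts[accessed_index];
    }

private:
    std::vector<bool> m_valid;
    std::vector<uint64_t> m_counts;
};
```

Note that ties are broken here by picking the lowest index; any deterministic tie-break is acceptable, but it is worth being aware that the choice can affect your results.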

Start by copying and renaming the header and implementation files for the MRU policy. Delete code in these new files that is specific to the MRU policy and implement the LFU policy described above. You’ll need to define any new data structures in the header file and initialize them in the constructor. It’s also good practice to delete them in the destructor. When implementing getReplacementIndex remember to check for invalid cache blocks before replacing a block.

The size of the type you use in your implementation does not reflect what might actually be required in hardware. Consider that the cache associativity is represented with an unsigned 32-bit integer. This does not mean that a 32-bit register is required to store a cache block index: if the cache is 4-way associative, a 2-bit register would suffice in hardware. Be careful of this when you are calculating the hardware budget (as you are asked to do in assignment 2). You can’t just look at the size of the types in Sniper; you need to analyze your implementation from first principles.
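For example, the number of bits needed to store a way index is the smallest b with 2^b ≥ associativity, not the width of the C++ type. A small helper (purely illustrative) makes this concrete:

```cpp
#include <cassert>
#include <cstdint>

// Bits needed in hardware to store an index in the range [0, ways).
uint32_t indexBits(uint32_t ways) {
    uint32_t bits = 0;
    while ((1u << bits) < ways)   // find smallest b with 2^b >= ways
        ++bits;
    return bits;
}
```

So a 4-way set needs 2 bits per stored index, an 8-way set needs 3, even though Sniper holds both in a 32-bit integer.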

You now need to link your cache replacement policy with the frontend so that you can use it. Find the function createCacheSet in cache_set.cc. This should tell you everything you need to do to add your replacement policy to the list of options. You’ll need to:

  • Add an element to the ReplacementPolicy enum,
  • Update the function parsePolicyType to interpret the name of the policy provided by the user, and
  • Add a case to createCacheSet to return a new instance of your cache set.
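The three steps above can be sketched as follows. The names ReplacementPolicy, parsePolicyType, and createCacheSet come from cache_set.cc, but this is a self-contained mock, so the enum members and signatures here are illustrative and will differ from the real code.

```cpp
#include <cassert>
#include <string>

// 1. Add an element to the ReplacementPolicy enum (members illustrative).
enum ReplacementPolicy { ROUND_ROBIN, LRU, MRU, LFU };

// 2. Teach the parser to recognize the new policy's name.
ReplacementPolicy parsePolicyType(const std::string& name) {
    if (name == "round_robin") return ROUND_ROBIN;
    if (name == "lru") return LRU;
    if (name == "mru") return MRU;
    if (name == "lfu") return LFU;   // the new option
    return ROUND_ROBIN;              // real code errors out on unknown names
}

// 3. In createCacheSet, the matching case would construct your new set, e.g.
//    case LFU: return new CacheSetLFU(...);
```

Once all three places agree on the new name, the frontend string "lfu" flows through to your constructor.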

Linking the predictors with the frontend is already done for you in the assignment 2 Sniper template. You only need to do this process if you want to add additional predictor policies beyond the three required in the assignment. The interface for adding predictors is slightly different, refer to the video on implementing branch predictors in Sniper on the assignment 2 page for more details.

Once you have done this, recompile Sniper to include your new policy. You should be able to test your new policy.

./run-sniper -c gainestown -c rob -g perf_model/l1_dcache/replacement_policy="lfu" --traces=trace_name

How does your policy perform? Run some simulations and observe the miss rate. Here are some possible questions for you to investigate:

  • How does your policy perform on different benchmarks? Which benchmarks benefit from this policy? Which are hindered? Can you formulate a hypothesis as to why?
  • How does your policy compare against other policies in Sniper, e.g. lru, mru, etc.?
  • How does your policy perform with different degrees of associativity?

There are some access patterns which cause a cache with an LFU replacement policy to perform badly. Can you come up with an example? Think about this carefully before moving on to the next section.

Adding a Parameter

Consider the access pattern where one block is accessed repeatedly at the start of a program but then never accessed again. Other blocks which have been accessed more recently may be evicted because they have not been accessed as much, even though they are likely to be accessed in the future. One way to combat this is to add a decay to the frequency counts over time.

Improve your cache replacement policy by decrementing the frequency counts of each cache block every 10 times a block in a cache set is accessed.
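The decay bookkeeping only needs a shared access counter per set. A minimal sketch of that piece, assuming a hard-coded interval of 10 (which the next step turns into a parameter):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative sketch of the decay logic only; in your policy this would
// live inside updateReplacementPolicy alongside the LFU counters.
class LFUDecay {
public:
    explicit LFUDecay(uint32_t ways, uint64_t decay_interval = 10)
        : m_counts(ways, 0), m_accesses(0), m_interval(decay_interval) {}

    void updateReplacementPolicy(uint32_t accessed_index) {
        ++m_counts[accessed_index];
        // Every m_interval accesses to the set, decay every block's count.
        if (++m_accesses % m_interval == 0)
            for (auto& c : m_counts)
                if (c > 0) --c;
    }

    uint64_t count(uint32_t i) const { return m_counts[i]; }

private:
    std::vector<uint64_t> m_counts;
    uint64_t m_accesses;
    uint64_t m_interval;
};
```

With this scheme a block that stops being accessed gradually loses its accumulated count and eventually becomes a candidate for eviction again.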

How does this new policy perform? To gather more data, you should experiment with different decay rates. To do this effectively without recompiling Sniper for each configuration, you can add a parameter. To do this, just use Sim()->getCfg()->getIntArray("name/of/parameter", core_id);. Replace name/of/parameter with whatever you want the parameter to be called; this is the name you will use with the -g flag to set the parameter from the frontend.

Make the decay rate a parameter and run simulations to try to determine the ideal decay rate. Is this a good solution to the problem? Can you come up with a better one?
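A sweep over a few candidate rates might look like the following, where the parameter name lfu_decay_rate is hypothetical and should match whatever name you chose in your code:

```shell
# Sweep some decay rates (parameter name is illustrative)
for rate in 4 8 16 32; do
  ./run-sniper -c gainestown -c rob \
      -g perf_model/l1_dcache/replacement_policy="lfu" \
      -g perf_model/l1_dcache/lfu_decay_rate=$rate \
      --traces=trace_name
done
```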

If you want to have a look at inputting parameters of other types (e.g. strings or booleans), go to config.hpp.

Adding a Metric

There are three types of cache misses: cold misses, where the block has never been accessed before; capacity misses, where the block was previously in the cache but was evicted because the cache was full; and conflict misses, where the block was previously in the cache but was evicted because the cache set it maps to was full. Capacity and conflict misses can be reduced by increasing the size and associativity of the cache, but cold misses seem unavoidable. However, they can be reduced by using a prefetcher, which preemptively loads blocks into the cache before they are accessed. To do this the prefetcher needs to predict which memory locations a program is likely to use. We can exploit common access patterns to make good predictions. For example, sequential access is a common pattern: if addresses in block n are accessed, then addresses in block n+1 are also likely to be accessed. A prefetcher that exploits this is called a stride-1 prefetcher.

You have been tasked with investigating whether this access pattern is common enough to justify the implementation of such a prefetcher. You need to count the number of times that a block is loaded into the cache given that the previous block was the one just loaded. To add a metric to Sniper you need to do three things:

  1. Add a field to the relevant class. Have a look at cache.h and cache.cc.
  2. Register the metric using the registerStatsMetric function. The parameters are:
    • The name of the component registering the metric; this is conveniently passed to the constructor of the Cache class.
    • The core ID that the component is running on, also passed to the constructor.
    • The name of the metric; this is what it will be called in the output database and when you run dumpstats.
    • A pointer to the field you created.
  3. Finally, modify the field to track your metric (e.g. increment it when something happens).
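The three steps above can be sketched with a self-contained mock. Here registerStatsMetric, the metric name "sequential-loads", and the onBlockLoad hook are all stand-ins: the real registerStatsMetric is declared in stats.h with its own signature, and you will need to find the real function in cache.cc that fires on block loads.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Mock of the stats registry: it keeps a POINTER to your counter field and
// reads through it at dump time, so you just keep updating the field.
static std::map<std::string, uint64_t*> g_metrics;

void registerStatsMetric(const std::string& component, int core_id,
                         const std::string& name, uint64_t* field) {
    g_metrics[component + "." + std::to_string(core_id) + "." + name] = field;
}

struct Cache {
    uint64_t m_sequential_loads = 0;        // 1. the new field
    uint64_t m_last_block = UINT64_MAX;     // block number of the last load

    Cache(const std::string& name, int core_id) {
        // 2. register the field once, in the constructor
        registerStatsMetric(name, core_id, "sequential-loads",
                            &m_sequential_loads);
    }

    // 3. update the field whenever the tracked event occurs
    void onBlockLoad(uint64_t block) {
        if (m_last_block != UINT64_MAX && block == m_last_block + 1)
            ++m_sequential_loads;
        m_last_block = block;
    }
};
```

Because the stats system holds a pointer to the field, no extra calls are needed after registration; the value is simply read out when the simulation dumps its statistics.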

You’ll need to find the function that is called whenever a cache block is loaded into the cache and increment your counter. Once you’ve done these steps and recompiled Sniper you should be able to run a simulation and see your metric in the output of tools/dumpstats.py.

Make sure you include stats.h so that you can call registerStatsMetric, if it is not already included.

Using this method you can only register scalar metrics which are unsigned 64-bit integers. If you want to track metrics of other types you’ll need to do a bit more work. The code which does this is in stats.h and stats.cc.

Given the results of your simulation do you think that implementing this prefetcher is worth it? What about a stride-n prefetcher?

What Next

In this lab you’ve gained experience modifying Sniper’s backend to add your own implementations. You should be able to apply what you have learned to assignment 2 to implement your branch predictors and gain good insights. The last three labs have been an introduction to Sniper and processor simulation in general. If you are interested in taking this further, refer to the Sniper website to see what else is available. You could investigate how Sniper implements other components of the processor, implement your own policies, or add more advanced statistics tracking features.
