Cache

  • The level of the memory hierarchy closest to the CPU
  • Exploits the principle of locality

Tags and Valid Bits

  • How we know what memory is currently stored in the cache
  • Store the block address as well as the data
    • Tag
    • Index
      • Not stored explicitly; implicit in the row number of a direct-mapped cache
    • Offset
      • Bits at the end of the address that select a byte within the block; they are not stored in the cache
      • Two parts
        • Word offset: selects the word within the block
        • Byte offset: selects the byte within the word
      • A block holds 2^(number of offset bits) bytes, all brought in by a single memory access (see the address-splitting sketch after this list)
      • Since word accesses are aligned to multiples of 4 on 32-bit systems, the last 2 bits of a word address are always 00 and don’t need to be remembered
      • Since word accesses are aligned to multiples of 8 on 64-bit systems, the last 3 bits of a word address are always 000
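
A minimal sketch in C of splitting an address into its tag, index, and offset fields, assuming a hypothetical geometry (16-byte blocks, so 4 offset bits; 256 rows, so 8 index bits):

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical geometry: 16-byte blocks (4 offset bits) and
       256 rows (8 index bits); the remaining high bits are the tag. */
    #define OFFSET_BITS 4
    #define INDEX_BITS  8

    int main(void) {
        uint32_t addr   = 0x12345678;
        uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
        uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
        uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);
        printf("addr=0x%08x tag=0x%05x index=%u offset=%u\n",
               (unsigned)addr, (unsigned)tag, (unsigned)index, (unsigned)offset);
        return 0;
    }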

Types

  • Direct-Mapped Cache
    • Structure
      • The index bits of the address (the bits just above the offset) are implicit in the row of the cache
        • a cache with 2^n rows uses n index bits
      • The rest of the address is stored in the tag
      • A valid bit that is 1 when the row holds data
    • When accessing
      • Take the index bits of the accessed address and go to that row of the cache
      • Check that the valid bit is 1
      • Check that the remaining bits of the accessed address match the tag stored at that row (see the lookup sketch after this list)
      • If they match, then the cache is storing data for that address
    • Problem: certain code can create lots of conflicts if different addresses with the same index bits are accessed in a loop
  • Fully-Associative Cache
    • Structure
      • Store the data anywhere
      • Store the entire address (excluding the offset) in the tag
    • Tradeoff
      • Reduce the number of conflicts
      • Tags are larger
      • Lookup time is linear in the number of blocks; it can be parallelized to constant time with a comparator per tag, but that hardware is very expensive
  • Set-Associative Cache
    • Combination of the other two
    • Structure
      • Create multiple direct-mapped caches, called ways
        • a cache with n ways is called an n-way set-associative cache
        • 1-way = a direct-mapped cache
        • number of ways = number of blocks = a fully associative cache
      • When adding, pick a way using a replacement policy, then add to it like a normal direct-mapped cache
        • Least Recently Used (LRU): choose the way whose block was used least recently
        • Pick randomly
    • Accessing
      • Check the indexed row of every way (in parallel) for a matching tag
  • Typical Usage
    • L1: direct-mapped
    • L2: set-associative
    • Main memory: fully associative (not a cache, but accessed in the same way)
  • Finding the index number fast
    • index = (address / block size) mod (number of rows)
    • since both are powers of two, this reduces to a shift and a mask rather than a real divide
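
A sketch in C of a direct-mapped lookup under the same hypothetical geometry; each row holds a valid bit, a tag, and one block of data:

    #include <stdbool.h>
    #include <stdint.h>

    #define OFFSET_BITS 4
    #define INDEX_BITS  8
    #define NUM_ROWS    (1u << INDEX_BITS)
    #define BLOCK_SIZE  (1u << OFFSET_BITS)

    struct row {
        bool     valid;             /* 1 when the row holds data */
        uint32_t tag;               /* upper bits of the block's address */
        uint8_t  data[BLOCK_SIZE];  /* one block */
    };

    static struct row cache[NUM_ROWS];

    /* Hit when the row selected by the index bits is valid and its
       stored tag matches the tag bits of the requested address. */
    bool lookup(uint32_t addr) {
        uint32_t index = (addr >> OFFSET_BITS) & (NUM_ROWS - 1);
        uint32_t tag   = addr >> (OFFSET_BITS + INDEX_BITS);
        return cache[index].valid && cache[index].tag == tag;
    }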

Blocks

  • We can store multiple words of data in one row of the cache (see the sketch after this list)
    • Need to adjust the offset so any address in the same block points to the same row of the cache
  • Optimizes for spatial locality
  • When putting data into the cache, also put nearby data into the cache (the block)
  • Size considerations
    • But in a fixed-size cache, larger blocks mean space for fewer blocks
    • That means each load evicts more data, which can negatively affect temporal locality
    • There is a balance that optimizes both depending on the size of the cache, found experimentally
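
A small sketch showing that dropping the offset bits makes every address inside a block share one block address (and therefore one cache row), again assuming hypothetical 16-byte blocks:

    #include <stdio.h>

    #define OFFSET_BITS 4  /* hypothetical 16-byte blocks */

    int main(void) {
        /* 0x1000 through 0x100F all collapse to block address 0x100,
           so fetching any one of them brings in all 16 bytes. */
        for (unsigned addr = 0x1000; addr < 0x1010; addr++)
            printf("addr=0x%04x -> block=0x%03x\n", addr, addr >> OFFSET_BITS);
        return 0;
    }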

Misses

  • On hit, the CPU proceeds normally
  • On cache miss
    1. Stall the CPU pipeline
    2. Fetch block from next level of hierarchy
    3. Instruction cache miss: restart the instruction fetch
    4. Data cache miss: complete the data access (see the sketch after this list)
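
The miss path in software terms, extending the lookup sketch above; fetch_from_next_level is a hypothetical stand-in for the next level of the hierarchy (real hardware stalls the pipeline instead of calling a function):

    #include <stdbool.h>
    #include <stdint.h>

    #define OFFSET_BITS 4
    #define INDEX_BITS  8
    #define NUM_ROWS    (1u << INDEX_BITS)
    #define BLOCK_SIZE  (1u << OFFSET_BITS)

    struct row { bool valid; uint32_t tag; uint8_t data[BLOCK_SIZE]; };
    static struct row cache[NUM_ROWS];

    /* Hypothetical stand-in for the next level of the hierarchy. */
    extern void fetch_from_next_level(uint32_t block_addr, uint8_t *dst);

    uint8_t read_byte(uint32_t addr) {
        uint32_t index  = (addr >> OFFSET_BITS) & (NUM_ROWS - 1);
        uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);
        uint32_t offset = addr & (BLOCK_SIZE - 1);

        if (!cache[index].valid || cache[index].tag != tag) {
            /* Miss: (1) stall, (2) fetch the block from the next level,
               then (3)/(4) restart or complete the access. */
            fetch_from_next_level(addr >> OFFSET_BITS, cache[index].data);
            cache[index].valid = true;
            cache[index].tag   = tag;
        }
        return cache[index].data[offset];  /* hit path: proceed normally */
    }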

Writing

  • Write-Through
    • When you write to L1, the write also goes to all lower levels of the hierarchy
    • This is very resource intensive, so it is not often used
  • Write-Back
    • On data write hit, just update the block in the cache
    • Keep track of which blocks have been written to (dirty block)
    • Only write a dirty block back to the next level when it is evicted (see the write-back sketch after this list)
  • Write Allocation
    • Policies for what happens on a write miss
    • Write-Through
      • Allocate on miss: fetch the block
      • Write around: don’t fetch the block
    • Write-Back
      • Fetch the block
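
A write-back sketch with write allocation: each row gains a dirty bit, and a dirty block is written down a level only when it is evicted. fetch_from_next_level and write_to_next_level are hypothetical stand-ins for the next level of the hierarchy:

    #include <stdbool.h>
    #include <stdint.h>

    #define OFFSET_BITS 4
    #define INDEX_BITS  8
    #define NUM_ROWS    (1u << INDEX_BITS)
    #define BLOCK_SIZE  (1u << OFFSET_BITS)

    struct row { bool valid, dirty; uint32_t tag; uint8_t data[BLOCK_SIZE]; };
    static struct row cache[NUM_ROWS];

    /* Hypothetical stand-ins for the next level of the hierarchy. */
    extern void fetch_from_next_level(uint32_t block_addr, uint8_t *dst);
    extern void write_to_next_level(uint32_t block_addr, const uint8_t *src);

    void write_byte(uint32_t addr, uint8_t value) {
        uint32_t index = (addr >> OFFSET_BITS) & (NUM_ROWS - 1);
        uint32_t tag   = addr >> (OFFSET_BITS + INDEX_BITS);

        if (!cache[index].valid || cache[index].tag != tag) {
            if (cache[index].valid && cache[index].dirty) {
                /* Evicting a dirty block: only now does it go down a level. */
                uint32_t old_block = (cache[index].tag << INDEX_BITS) | index;
                write_to_next_level(old_block, cache[index].data);
            }
            /* Write allocation: fetch the block on a write miss. */
            fetch_from_next_level(addr >> OFFSET_BITS, cache[index].data);
            cache[index].valid = true;
            cache[index].tag   = tag;
        }
        cache[index].data[addr & (BLOCK_SIZE - 1)] = value;
        cache[index].dirty = true;  /* written back only on eviction */
    }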

Instruction vs Data Caches

  • Instructions and data have different patterns of temporal and spatial locality
    • Also instructions are generally read-only
  • Can have separate instruction and data caches
    • Advantages
      • Doubles bandwidth between CPU and memory hierarchy
      • Each cache can be optimized for its pattern of locality
    • Disadvantages
      • Slightly more complex design
      • Cannot dynamically adjust the split of capacity between instructions and data
  • 100% of instructions will access the instruction cache

Performance

  • Components of CPU Time
    • Program execution cycles (includes cache hit time)
    • Memory stall cycles (mainly from cache misses)
  • With simplifying assumptions: memory stall cycles = memory accesses * miss rate * miss penalty
  • Average memory access time (AMAT)
    • Hit time + (Miss rate * Miss penalty)
    • Access time of L2 from the perspective of L1 is the L1 miss penalty (a worked example follows this list)
      • L1 Miss penalty = L2 hit time + (L2 Miss rate * L2 Miss penalty)
  • Overall Cycles Per Instruction (CPI)
    • CPI = base CPI + memory stall cycles per instruction
    • Remember that 100% of instructions access the instruction cache, while only loads and stores access the data cache
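
A worked example plugging made-up numbers (not measurements) into the formulas above:

    #include <stdio.h>

    int main(void) {
        /* Made-up example numbers. */
        double l1_hit = 1.0, l1_miss_rate = 0.05;
        double l2_hit = 10.0, l2_miss_rate = 0.20, l2_penalty = 100.0;

        /* The L1 miss penalty is the AMAT of L2. */
        double l1_penalty = l2_hit + l2_miss_rate * l2_penalty; /* 10 + 0.2*100 = 30 */
        double amat       = l1_hit + l1_miss_rate * l1_penalty; /* 1 + 0.05*30 = 2.5 */

        printf("L1 miss penalty = %.1f cycles\n", l1_penalty);
        printf("AMAT            = %.2f cycles\n", amat);
        return 0;
    }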

Types of Misses

  • Compulsory
    • The first time an address is accessed, it will not be in the cache
    • Unless it was prefetched
  • Capacity
    • The working set of blocks accessed by the program is too large to fit in the cache
  • Conflict
    • When a block is evicted too early because another block mapped to the same place replaced it (see the sketch after this list)
    • Does not happen with fully associative caches
      • When unsure whether a miss is capacity or conflict, ask “would this still happen if the cache were fully associative?”; if the answer is no, it is a conflict miss
  • Coherence
    • Only for multicore systems
    • When a different core has modified a value that is also stored in this core’s cache
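
A sketch of code that could produce conflict misses: assuming a hypothetical 4 KiB direct-mapped cache, forcing two 4 KiB arrays onto 4 KiB boundaries makes a[i] and b[i] share index bits, so they can evict each other on every iteration; a fully associative cache would not have this problem:

    #include <stdint.h>

    /* Both arrays are 4 KiB and 4 KiB-aligned, so a[i] and b[i] always
       differ by a multiple of 4096 and map to the same row of a
       hypothetical 4 KiB direct-mapped cache. */
    static _Alignas(4096) int32_t a[1024];
    static _Alignas(4096) int32_t b[1024];

    int64_t sum_pairs(void) {
        int64_t sum = 0;
        for (int i = 0; i < 1024; i++)
            sum += a[i] + b[i];  /* each access can evict the other's block */
        return sum;
    }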

Interactions with Software

  • Misses depend on access patterns
    • Algorithm behavior, e.g. traversal order (see the sketch below)
    • Compiler optimization for memory access
      • The compiler can populate (prefetch) the cache by inserting loads whose results are discarded, e.g. loads to XZR on ARM64
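
A classic example of algorithm behavior changing the miss rate: traversing a 2-D array row by row touches consecutive addresses (roughly one miss per block), while traversing it column by column jumps a full row each step (potentially one miss per access):

    #include <stdint.h>

    #define N 1024
    static int32_t m[N][N];

    /* Row-major traversal: consecutive addresses, good spatial locality. */
    int64_t sum_rows(void) {
        int64_t s = 0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += m[i][j];
        return s;
    }

    /* Column-major traversal: a stride of N*4 bytes, poor spatial locality. */
    int64_t sum_cols(void) {
        int64_t s = 0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += m[i][j];
        return s;
    }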