Cache

  • The level of the memory hierarchy closest to the CPU
  • Exploits the principle of locality

Tags and Valid Bits

  • How we know what memory is currently stored in the cache
  • Store the block address as well as the data
    • Tag
    • Index
      • Not stored explicitly; implicit in the row number of a direct-mapped cache
    • Offset
      • Bits at the end of the address that select a byte within the block; they are not stored in the cache
      • Two parts
        • Word offset: selects the word within the block
        • Byte offset: selects the byte within the word
      • A block holds 2^(number of offset bits) bytes, all brought in by a single memory access (see the address-splitting sketch after this list)
      • Since word accesses are aligned to multiples of 4 on 32-bit systems, the last 2 bits of a word address are always 00 and don’t need to be remembered
      • Since word accesses are aligned to multiples of 8 on 64-bit systems, the last 3 bits of a word address are always 000
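
A minimal sketch in C of splitting an address into its tag, index, and offset fields, assuming a hypothetical geometry (16-byte blocks, so 4 offset bits; 256 rows, so 8 index bits):

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical geometry: 16-byte blocks (4 offset bits) and
       256 rows (8 index bits); the remaining high bits are the tag. */
    #define OFFSET_BITS 4
    #define INDEX_BITS  8

    int main(void) {
        uint32_t addr   = 0x12345678;
        uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
        uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
        uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);
        printf("addr=0x%08x tag=0x%05x index=%u offset=%u\n",
               (unsigned)addr, (unsigned)tag, (unsigned)index, (unsigned)offset);
        return 0;
    }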

Types

  • Direct-Mapped Cache
    • Structure
      • The index bits of the address (the bits just above the offset) are implicit in the row of the cache
        • a cache with 2^n rows uses n index bits
      • The rest of the address is stored in the tag
      • A valid bit that is 1 when the row holds data
    • When accessing
      • Take the index bits of the accessed address and go to that row of the cache
      • Check that the valid bit is 1
      • Check that the remaining bits of the accessed address match the tag stored at that row (see the lookup sketch after this list)
      • If they match, then the cache is storing data for that address
    • Problem: certain code can create lots of conflicts if different addresses with the same index bits are accessed in a loop
  • Fully-Associative Cache
    • Structure
      • Store the data anywhere
      • Store the entire address (excluding the offset) in the tag
    • Tradeoff
      • Reduce the number of conflicts
      • Tags are larger
      • Lookup time is linear in the number of blocks; it can be parallelized to constant time with a comparator per tag, but that hardware is very expensive
  • Set-Associative Cache
    • Combination of the other two
    • Structure
      • Create multiple direct-mapped caches, called ways
        • a cache with n ways is called an n-way set-associative cache
        • 1-way = a direct-mapped cache
        • number of ways = number of blocks = a fully associative cache
      • When adding, pick a way using a replacement policy, then add to it like a normal direct-mapped cache
        • Least Recently Used (LRU): choose the way whose block was used least recently
        • Pick randomly
    • Accessing
      • Check the indexed row of every way (in parallel) for a matching tag
  • Typical Usage
    • L1: direct-mapped
    • L2: set-associative
    • Main memory: fully associative (not a cache, but accessed in the same way)
  • Finding the index number fast
    • index = (address / block size) mod (number of rows)
    • since both are powers of two, this reduces to a shift and a mask rather than a real divide
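
A sketch in C of a direct-mapped lookup under the same hypothetical geometry; each row holds a valid bit, a tag, and one block of data:

    #include <stdbool.h>
    #include <stdint.h>

    #define OFFSET_BITS 4
    #define INDEX_BITS  8
    #define NUM_ROWS    (1u << INDEX_BITS)
    #define BLOCK_SIZE  (1u << OFFSET_BITS)

    struct row {
        bool     valid;             /* 1 when the row holds data */
        uint32_t tag;               /* upper bits of the block's address */
        uint8_t  data[BLOCK_SIZE];  /* one block */
    };

    static struct row cache[NUM_ROWS];

    /* Hit when the row selected by the index bits is valid and its
       stored tag matches the tag bits of the requested address. */
    bool lookup(uint32_t addr) {
        uint32_t index = (addr >> OFFSET_BITS) & (NUM_ROWS - 1);
        uint32_t tag   = addr >> (OFFSET_BITS + INDEX_BITS);
        return cache[index].valid && cache[index].tag == tag;
    }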

Blocks

  • We can store multiple words of data in one row of the cache (see the sketch after this list)
    • Need to adjust the offset so any address in the same block points to the same row of the cache
  • Optimizes for spatial locality
  • When putting data into the cache, also put nearby data into the cache (the block)
  • Size considerations
    • But in a fixed-size cache, larger blocks mean space for fewer blocks
    • That means each load evicts more data, which can negatively affect temporal locality
    • There is a balance that optimizes both depending on the size of the cache, found experimentally
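
A small sketch showing that dropping the offset bits makes every address inside a block share one block address (and therefore one cache row), again assuming hypothetical 16-byte blocks:

    #include <stdio.h>

    #define OFFSET_BITS 4  /* hypothetical 16-byte blocks */

    int main(void) {
        /* 0x1000 through 0x100F all collapse to block address 0x100,
           so fetching any one of them brings in all 16 bytes. */
        for (unsigned addr = 0x1000; addr < 0x1010; addr++)
            printf("addr=0x%04x -> block=0x%03x\n", addr, addr >> OFFSET_BITS);
        return 0;
    }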

Misses

  • On hit, the CPU proceeds normally
  • On cache miss
    1. Stall the CPU pipeline
    2. Fetch block from next level of hierarchy
    3. Instruction cache miss: restart the instruction fetch
    4. Data cache miss: complete the data access (see the sketch after this list)
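
The miss path in software terms, extending the lookup sketch above; fetch_from_next_level is a hypothetical stand-in for the next level of the hierarchy (real hardware stalls the pipeline instead of calling a function):

    #include <stdbool.h>
    #include <stdint.h>

    #define OFFSET_BITS 4
    #define INDEX_BITS  8
    #define NUM_ROWS    (1u << INDEX_BITS)
    #define BLOCK_SIZE  (1u << OFFSET_BITS)

    struct row { bool valid; uint32_t tag; uint8_t data[BLOCK_SIZE]; };
    static struct row cache[NUM_ROWS];

    /* Hypothetical stand-in for the next level of the hierarchy. */
    extern void fetch_from_next_level(uint32_t block_addr, uint8_t *dst);

    uint8_t read_byte(uint32_t addr) {
        uint32_t index  = (addr >> OFFSET_BITS) & (NUM_ROWS - 1);
        uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);
        uint32_t offset = addr & (BLOCK_SIZE - 1);

        if (!cache[index].valid || cache[index].tag != tag) {
            /* Miss: (1) stall, (2) fetch the block from the next level,
               then (3)/(4) restart or complete the access. */
            fetch_from_next_level(addr >> OFFSET_BITS, cache[index].data);
            cache[index].valid = true;
            cache[index].tag   = tag;
        }
        return cache[index].data[offset];  /* hit path: proceed normally */
    }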

Writing

  • Write-Through
    • When you write to L1, the write also goes to all lower levels of the hierarchy
    • This is very resource intensive, so it is not often used
  • Write-Back
    • On data write hit, just update the block in the cache
    • Keep track of which blocks have been written to (dirty block)
    • Only write a dirty block back to the next level when it is evicted (see the write-back sketch after this list)
  • Write Allocation
    • Policies for what happens on a write miss
    • Write-Through
      • Allocate on miss: fetch the block
      • Write around: don’t fetch the block
    • Write-Back
      • Fetch the block
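
A write-back sketch with write allocation: each row gains a dirty bit, and a dirty block is written down a level only when it is evicted. fetch_from_next_level and write_to_next_level are hypothetical stand-ins for the next level of the hierarchy:

    #include <stdbool.h>
    #include <stdint.h>

    #define OFFSET_BITS 4
    #define INDEX_BITS  8
    #define NUM_ROWS    (1u << INDEX_BITS)
    #define BLOCK_SIZE  (1u << OFFSET_BITS)

    struct row { bool valid, dirty; uint32_t tag; uint8_t data[BLOCK_SIZE]; };
    static struct row cache[NUM_ROWS];

    /* Hypothetical stand-ins for the next level of the hierarchy. */
    extern void fetch_from_next_level(uint32_t block_addr, uint8_t *dst);
    extern void write_to_next_level(uint32_t block_addr, const uint8_t *src);

    void write_byte(uint32_t addr, uint8_t value) {
        uint32_t index = (addr >> OFFSET_BITS) & (NUM_ROWS - 1);
        uint32_t tag   = addr >> (OFFSET_BITS + INDEX_BITS);

        if (!cache[index].valid || cache[index].tag != tag) {
            if (cache[index].valid && cache[index].dirty) {
                /* Evicting a dirty block: only now does it go down a level. */
                uint32_t old_block = (cache[index].tag << INDEX_BITS) | index;
                write_to_next_level(old_block, cache[index].data);
            }
            /* Write allocation: fetch the block on a write miss. */
            fetch_from_next_level(addr >> OFFSET_BITS, cache[index].data);
            cache[index].valid = true;
            cache[index].tag   = tag;
        }
        cache[index].data[addr & (BLOCK_SIZE - 1)] = value;
        cache[index].dirty = true;  /* written back only on eviction */
    }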

Instruction vs Data Caches

  • Instructions and data have different patterns of temporal and spatial locality
    • Also instructions are generally read-only
  • Can have separate instruction and data caches
    • Advantages
      • Doubles bandwidth between CPU and memory hierarchy
      • Each cache can be optimized for its pattern of locality
    • Disadvantages
      • Slightly more complex design
      • Cannot dynamically adjust the split of capacity between instructions and data
  • 100% of instructions will access the instruction cache

Performance

  • Components of CPU Time
    • Program execution cycles (includes cache hit time)
    • Memory stall cycles (mainly from cache misses)
  • With simplifying assumptions: memory stall cycles = memory accesses * miss rate * miss penalty
  • Average memory access time (AMAT)
    • Hit time + (Miss rate * Miss penalty)
    • Access time of L2 from the perspective of L1 is the L1 miss penalty (a worked example follows this list)
      • L1 Miss penalty = L2 hit time + (L2 Miss rate * L2 Miss penalty)
  • Overall Cycles Per Instruction (CPI)
    • CPI = base CPI + memory stall cycles per instruction
    • Remember that 100% of instructions access the instruction cache, while only loads and stores access the data cache
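
A worked example plugging made-up numbers (not measurements) into the formulas above:

    #include <stdio.h>

    int main(void) {
        /* Made-up example numbers. */
        double l1_hit = 1.0, l1_miss_rate = 0.05;
        double l2_hit = 10.0, l2_miss_rate = 0.20, l2_penalty = 100.0;

        /* The L1 miss penalty is the AMAT of L2. */
        double l1_penalty = l2_hit + l2_miss_rate * l2_penalty; /* 10 + 0.2*100 = 30 */
        double amat       = l1_hit + l1_miss_rate * l1_penalty; /* 1 + 0.05*30 = 2.5 */

        printf("L1 miss penalty = %.1f cycles\n", l1_penalty);
        printf("AMAT            = %.2f cycles\n", amat);
        return 0;
    }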

Types of Misses

  • Compulsory
    • The first time an address is accessed, it will not be in the cache
    • Unless it was prefetched
  • Capacity
    • The working set of blocks accessed by the program is too large to fit in the cache
  • Conflict
    • When a block is evicted too early because another block mapped to the same place replaced it (see the sketch after this list)
    • Does not happen with fully associative caches
      • When unsure whether a miss is capacity or conflict, ask “would this still happen if the cache were fully associative?”; if the answer is no, it is a conflict miss
  • Coherence
    • Only for multicore systems
    • When a different core has modified a value that is also stored in this core’s cache
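
A sketch of code that could produce conflict misses: assuming a hypothetical 4 KiB direct-mapped cache, forcing two 4 KiB arrays onto 4 KiB boundaries makes a[i] and b[i] share index bits, so they can evict each other on every iteration; a fully associative cache would not have this problem:

    #include <stdint.h>

    /* Both arrays are 4 KiB and 4 KiB-aligned, so a[i] and b[i] always
       differ by a multiple of 4096 and map to the same row of a
       hypothetical 4 KiB direct-mapped cache. */
    static _Alignas(4096) int32_t a[1024];
    static _Alignas(4096) int32_t b[1024];

    int64_t sum_pairs(void) {
        int64_t sum = 0;
        for (int i = 0; i < 1024; i++)
            sum += a[i] + b[i];  /* each access can evict the other's block */
        return sum;
    }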

Interactions with Software

  • Misses depend on access patterns
    • Algorithm behavior, e.g. traversal order (see the sketch below)
    • Compiler optimization for memory access
      • The compiler can populate (prefetch) the cache by inserting loads whose results are discarded, e.g. loads to XZR on ARM64
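
A classic example of algorithm behavior changing the miss rate: traversing a 2-D array row by row touches consecutive addresses (roughly one miss per block), while traversing it column by column jumps a full row each step (potentially one miss per access):

    #include <stdint.h>

    #define N 1024
    static int32_t m[N][N];

    /* Row-major traversal: consecutive addresses, good spatial locality. */
    int64_t sum_rows(void) {
        int64_t s = 0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += m[i][j];
        return s;
    }

    /* Column-major traversal: a stride of N*4 bytes, poor spatial locality. */
    int64_t sum_cols(void) {
        int64_t s = 0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += m[i][j];
        return s;
    }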