A cache is a small, fast memory that holds copies of recently used data and instructions from main memory. It sits between the processor and main memory, transparently catching repeat accesses so they don’t have to go all the way to the slower main memory.

The processor first checks the cache for any memory access. If the data is there (“cache hit”), it’s returned in a few nanoseconds. If not (“cache miss”), the request goes to main memory, the data is returned to the processor, and a copy is placed in the cache for future reuse.

Caches work because of locality — programs tend to access the same data and code repeatedly over short time windows.

Levels of cache

Modern processors have multiple cache levels, each progressively larger and slower:

LevelTypical sizeAccess timeNotes
L1 instruction cache32 KB~1 nsHolds recently executed instructions
L1 data cache32 KB~1 nsHolds recently accessed data
L2 cache256 KB – 2 MB~5–10 nsUnified (instruction + data)
L3 cache8 – 64 MB~20–40 nsShared across cores in multi-core chips
Main memory (DRAM)gigabytes~100 nsWhat caches are caching

L1 is split into separate instruction and data caches to allow the processor to fetch an instruction and access data simultaneously (Harvard architecture at the cache level). L2 and L3 are usually unified.

The whole hierarchy is exposed to the programmer only through performance — the cache is otherwise invisible. Programs don’t issue “cache” instructions; the cache hardware decides what to keep based on access patterns.

Cache structure

Memory is divided into blocks (also called lines or cache lines), each typically 32 to 64 bytes. The cache stores blocks, not individual bytes — when there’s a miss, an entire block gets fetched from main memory.

Why blocks? Because of spatial locality — once a program touches address , it likely soon touches , , and so on. Pre-fetching the surrounding bytes into cache costs almost no extra time but pays off if any of them get accessed.

Each cache entry has:

  • Tag bits — identify which block of main memory this cache entry currently holds.
  • Valid bit (1 bit) — does this entry hold a real block, or is it empty/uninitialized?
  • Dirty/modified bit (1 bit) — has this entry been written to since being fetched? If yes, it must be written back to main memory before being evicted.
  • Data block — the actual cached bytes.

For the address-bit math behind tag, index, and offset, see Cache address mapping.

Hit and miss

When the processor requests an address, the cache hierarchy is checked level by level:

  1. Check L1. If the tag matches and the valid bit is set: hit, return in ~1 ns.
  2. L1 miss → check L2. If hit: ~5–10 ns, also fill the line into L1.
  3. L2 miss → check L3. If hit: ~20–40 ns, fill into L2 (and usually L1).
  4. L3 miss → fetch from main memory: ~100 ns, fill into L3, L2, L1.

So an “L1 miss” is not the same as “go to RAM” — it usually only costs an L2 access. Only an L3 miss (or last-level miss in general) actually reaches DRAM.

Hits are fast; misses are slow. Cache performance is summarized by the hit rate — fraction of accesses that hit. Modern caches achieve L1 hit rates above 95% for typical workloads.

Why misses happen — the 3C taxonomy

Hill (1987) classified cache misses into three categories that appear in every architecture textbook:

  • Compulsory (cold) misses — the first time a block is ever accessed. Even an infinitely large cache would still take this hit. Mitigation: prefetching, larger blocks (more spatial coverage per miss).
  • Capacity misses — the block was once in the cache, but evicted because the program’s working set is larger than the cache. Mitigation: bigger cache, or restructure code to fit a smaller working set.
  • Conflict misses — the block was once in the cache and was evicted even though there was room elsewhere, because two addresses map to the same set and the associativity ran out. Only happens in direct-mapped or low-associativity caches. Mitigation: higher associativity, or aligning hot data to avoid index collisions.

Some authors add a fourth C for coherence misses — a block was evicted by another core’s write in a multi-core system. This is the cost of cache coherence protocols like MESI.

A simple miss-rate estimate for an instruction cache, given a block size words: assuming a warm cache that holds the working set, every block-aligned fetch misses the first time the block is touched (one compulsory miss per instructions), so the miss rate for sequential code is approximately . This estimate ignores capacity and conflict misses; real workloads see higher miss rates due to data accesses, branches, evictions, and the cold-start phase.

Replacement and write policies

When a miss happens and the cache is full, some old block must be evicted to make room. Common policies:

  • LRU (Least Recently Used) — evict the block that hasn’t been touched in the longest time. Best performance, expensive to implement perfectly.
  • Pseudo-LRU — approximation of LRU, cheaper to implement.
  • Random — evict a random block. Surprisingly effective and very cheap.
  • FIFO — evict the oldest block.

For writes:

  • Write-through — every write updates both cache and main memory. Simple, but slow.
  • Write-back — writes update only the cache; the dirty block is written to main memory later, when evicted. Faster, but requires the dirty bit and complicates cache coherence in multi-core systems. Coherence is the property that all cores see a single, agreed-upon value for each memory address even though each core has its own private cache. With write-back, a core’s modified copy of a line lives only in its own cache for a while, so the system needs a protocol to track who has which copy and to invalidate or update other cores’ copies on a write. The standard cross-cache state machine is the MESI protocol (and its variants MOESI, MESIF).

Modern caches typically use write-back with sophisticated coherence protocols.

Why caches are crucial

A modern CPU at 3 GHz needs an instruction every 0.3 ns. Main memory is 100 ns away — 300 cycles. Without caches, the CPU would idle 99.7% of the time waiting for memory. Caches absorb most accesses at near-CPU speed, bringing the effective memory latency down to single-digit cycles for most workloads.

The whole CPU pipeline is designed assuming cache hits dominate. A cache miss often stalls the pipeline for tens or hundreds of cycles — the kind of cost that makes locality optimization in performance-critical code worthwhile.

For the related concepts of temporal and spatial locality (why caches work) and offset (how they work), see those notes.