The translation lookaside buffer (TLB) is a small, fast cache inside the memory management unit (MMU) that holds recently used virtual-to-physical address translations. It exists because, without it, every memory access would first require reading the page table from main memory, turning a single memory operation into at least two.

Why it’s needed

In a virtual memory system, every memory access requires translating a virtual address to a physical one. The translation lives in the page table, which itself lives in main memory.

Without a TLB, the sequence for a single memory access is:

  1. Start with a virtual address.
  2. Look up the address’s page number in the page table → physical frame number. (Memory access #1: to the page table.)
  3. Combine the frame number with the address’s offset to get the physical address.
  4. Access the physical address. (Memory access #2: the actual data.)

That’s two memory accesses per one “memory access” the program sees. Memory just got twice as slow.

The TLB caches the page table entries the program has used recently. With a TLB:

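  1. Look up the virtual address’s page number in the TLB. On a hit, the physical frame number is available almost immediately.
  2. Combine the frame number with the offset → physical address.
  3. Access the physical address.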

One memory access, plus a TLB lookup that takes maybe one cycle. The 2× overhead is gone.
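The difference between the two sequences can be sketched by counting main-memory accesses for a short trace. This is a toy model under stated assumptions: a single-level page table, 4 KB pages, and an unbounded TLB modelled as a Python set. The function name `count_memory_accesses` and the trace are illustrative only.

```python
PAGE_SIZE = 4096  # assumption: 4 KB pages


def count_memory_accesses(addresses, use_tlb):
    """Count main-memory accesses for a trace of virtual addresses.

    Assumes a single-level page table, so an uncached translation
    costs exactly one extra memory access.
    """
    tlb = set()      # page numbers whose translation is cached (unbounded here)
    accesses = 0
    for vaddr in addresses:
        vpn = vaddr // PAGE_SIZE
        if not use_tlb or vpn not in tlb:
            accesses += 1    # read the page-table entry from memory
            tlb.add(vpn)
        accesses += 1        # the data access itself
    return accesses


# 1024 sequential 8-byte reads: 8192 bytes, which span two pages.
trace = [i * 8 for i in range(1024)]
without_tlb = count_memory_accesses(trace, use_tlb=False)  # 2048
with_tlb = count_memory_accesses(trace, use_tlb=True)      # 1026
```

With the TLB, only the first touch of each page pays for a page-table read, so the total falls from two accesses per read to just over one.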

TLB structure

The TLB is small: a first-level TLB typically has 16 to 512 entries. Each entry stores:

  • Virtual page number (the lookup key).
  • Physical frame number.
  • Permission bits (read/write/execute).
  • Valid bit, dirty bit, etc.

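Lookup is parallel: the requested virtual page number is compared against all candidate entries simultaneously (every entry in a fully associative TLB, or every way of the indexed set in a set-associative one), giving fast access.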

TLB miss

When the requested virtual page isn’t in the TLB:

  1. The hardware (or OS, depending on architecture) walks the page table in main memory.
  2. The translation is fetched.
  3. The TLB is updated with the new entry (often evicting an older one).
  4. The original access proceeds.
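The four steps above can be sketched as a small software model. The class name `TinyTLB`, the dict standing in for the page table, and the least-recently-used eviction policy are all assumptions made for illustration; real hardware often uses cheaper approximations of LRU.

```python
from collections import OrderedDict


class TinyTLB:
    """Toy TLB model: a bounded VPN -> PFN cache with LRU eviction (assumed policy)."""

    def __init__(self, capacity, page_table):
        self.capacity = capacity
        self.page_table = page_table   # dict VPN -> PFN, stands in for the in-memory table
        self.entries = OrderedDict()   # oldest entry first
        self.hits = 0
        self.misses = 0

    def translate(self, vpn):
        if vpn in self.entries:
            self.hits += 1
            self.entries.move_to_end(vpn)      # mark as most recently used
            return self.entries[vpn]
        self.misses += 1
        pfn = self.page_table[vpn]             # steps 1-2: walk the table, fetch the translation
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)   # step 3: evict the least recently used entry
        self.entries[vpn] = pfn                # step 3: install the new entry
        return pfn                             # step 4: the original access can proceed


page_table = {vpn: vpn + 100 for vpn in range(8)}
tlb = TinyTLB(capacity=2, page_table=page_table)
frames = [tlb.translate(vpn) for vpn in [0, 1, 0, 2, 1]]
```

In this run only the third lookup hits: page 1 is evicted when page 2 is installed, so the final access to page 1 misses again.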

A TLB miss is not “one extra memory access” on a modern 64-bit system. The page table is hierarchical: x86_64 uses a 4-level page table (5-level with PML5), and walking it requires loading one entry from each level. Each load depends on the previous one (the entry tells you where the next-level table lives), so the loads can’t be parallelized. If the page-table entries themselves miss in the data cache, each level can stall for a full DRAM round-trip. A worst-case page-table walk on x86_64 can take hundreds of cycles, not tens. To mitigate this, the MMU may also keep page-walk caches (PWCs) that hold intermediate-level entries, similar to a TLB but for the upper levels.
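The dependent chain of loads can be sketched as follows. With 4 KB pages, a 48-bit x86_64 virtual address splits into four 9-bit table indices and a 12-bit offset. The nested dicts standing in for page tables, and the names `split_x86_64_vaddr` and `walk`, are illustrative assumptions, not a model of real page-table entry formats.

```python
def split_x86_64_vaddr(vaddr):
    """Split a 48-bit virtual address into four 9-bit indices and a 12-bit offset."""
    offset = vaddr & 0xFFF
    # Shifts select the index for each level, top level first.
    indices = [(vaddr >> shift) & 0x1FF for shift in (39, 30, 21, 12)]
    return indices, offset


def walk(root, vaddr):
    """Walk a toy 4-level table built from nested dicts; the leaf holds a frame number."""
    indices, offset = split_x86_64_vaddr(vaddr)
    node = root
    loads = 0
    for index in indices:
        node = node[index]   # each load needs the result of the previous one
        loads += 1
    return node * 4096 + offset, loads


# Hypothetical mapping: indices 1, 2, 3, 4 lead to frame 0x42.
root = {1: {2: {3: {4: 0x42}}}}
example_vaddr = (1 << 39) | (2 << 30) | (3 << 21) | (4 << 12) | 0x5
paddr, loads = walk(root, example_vaddr)   # paddr == 0x42005, loads == 4
```

The loop cannot start its next iteration until `node[index]` returns, which mirrors why the hardware walk is serialized.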

A page fault is different and far worse. In the case meant here (a major fault), the page isn’t in RAM at all and has to come from disk, which costs on the order of millions of cycles. By contrast, the TLB miss → page-table walk → physical memory access path stays in the hundreds of cycles at most.

Hit rates and impact

A typical TLB hit rate is above 99% for well-behaved programs, because the same handful of pages get used repeatedly thanks to locality. The remaining misses, fewer than 1 in 100 accesses, are amortized over the hits, so the average translation cost stays small.
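The amortization argument can be made concrete with a weighted average. The cycle counts below (1 cycle for a hit, 100 cycles for a miss including the walk) are assumed round numbers for illustration, not measurements of any particular CPU.

```python
def average_translation_cycles(hit_rate, hit_cycles, miss_cycles):
    """Expected cycles per translation: hit and miss costs weighted by probability."""
    return hit_rate * hit_cycles + (1.0 - hit_rate) * miss_cycles


typical = average_translation_cycles(0.99, 1, 100)   # about 1.99 cycles
poor = average_translation_cycles(0.90, 1, 100)      # about 10.9 cycles
```

Under these assumptions, dropping from a 99% to a 90% hit rate makes the average translation roughly five times more expensive, which is why the hit rate matters so much.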

When the working set spans many pages (e.g., walking large data structures, doing big linear scans), TLB misses can become a bottleneck. Modern CPUs use larger TLBs and multi-level TLBs (just like multi-level caches) to mitigate this.
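A rough sketch of how the access pattern drives the hit rate. The model assumes a single 64-entry, fully associative TLB with LRU replacement and 4 KB pages. The function name `tlb_hit_rate` and both traces are made up for illustration.

```python
from collections import OrderedDict


def tlb_hit_rate(addresses, entries=64, page_size=4096):
    """Fraction of accesses that hit in a toy fully associative LRU TLB."""
    tlb = OrderedDict()   # keys are virtual page numbers, oldest first
    hits = 0
    for vaddr in addresses:
        vpn = vaddr // page_size
        if vpn in tlb:
            hits += 1
            tlb.move_to_end(vpn)          # refresh recency
        else:
            if len(tlb) >= entries:
                tlb.popitem(last=False)   # evict the least recently used page
            tlb[vpn] = True
    return hits / len(addresses)


# Dense scan: 8-byte stride, so each page is reused 512 times before moving on.
sequential = [i * 8 for i in range(100_000)]
# Sparse scan: one access per page, cycling over 1000 pages (far more than 64 entries).
page_stride = [(i % 1000) * 4096 for i in range(100_000)]

dense_rate = tlb_hit_rate(sequential)
sparse_rate = tlb_hit_rate(page_stride)
```

The dense scan hits well above 99% of the time, while the sparse scan never hits in this model because each page is evicted before it is touched again.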

In context

The TLB is part of the memory hierarchy for address translations, parallel to the data cache hierarchy:

Data:         register → L1 → L2 → L3 → RAM → SSD/disk
Translations: L1 TLB → L2 TLB → page table in RAM → page fault (OS handler)

Both hierarchies aim for “translate or fetch in one cycle most of the time, fall back gracefully on miss.”

Worked example: address translation

Suppose 32-bit virtual addresses with 4 KB pages:

[ 20-bit virtual page number | 12-bit page offset ]

For a given virtual address:

  • VPN: the top 20 bits.
  • Offset: the bottom 12 bits.

The TLB is queried with the VPN:

  • Hit: the TLB entry returns the physical frame number (PFN). Combine it with the offset: physical address = (PFN << 12) | offset. Memory access proceeds at this physical address.

  • Miss: walk the page table. The walk takes 1 to 4 memory accesses depending on page-table depth (a two-level table, common for 32-bit addresses, needs 2; multi-level page tables in 64-bit systems use 4 levels). Once found, install the entry in the TLB and proceed.
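The hit path can be sketched with concrete numbers. The virtual address `0x12345678` and the frame number `0x00ABC` are arbitrary values chosen for this illustration, and the one-entry dict stands in for the TLB.

```python
PAGE_OFFSET_BITS = 12                        # 4 KB pages
OFFSET_MASK = (1 << PAGE_OFFSET_BITS) - 1    # 0xFFF


def split_vaddr(vaddr):
    """Return (VPN, offset) for a 32-bit virtual address with 4 KB pages."""
    return vaddr >> PAGE_OFFSET_BITS, vaddr & OFFSET_MASK


def physical_address(pfn, offset):
    """Concatenate the physical frame number with the page offset."""
    return (pfn << PAGE_OFFSET_BITS) | offset


vaddr = 0x12345678                      # assumed example address
vpn, offset = split_vaddr(vaddr)        # vpn == 0x12345, offset == 0x678

tlb = {0x12345: 0x00ABC}                # assumed TLB contents: VPN -> PFN
pfn = tlb[vpn]                          # TLB hit
paddr = physical_address(pfn, offset)   # 0x00ABC678
```

The offset passes through unchanged. Only the upper 20 bits are replaced by the frame number.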

TLB entry format

A typical TLB entry:

[ valid | VPN | PFN | permissions | dirty | ASID | ... ]
  • Valid: this entry holds a real translation.
  • VPN: virtual page number (the search key).
  • PFN: physical frame number (the result).
  • Permissions: read, write, execute, user/kernel.
  • Dirty: page has been written since loaded.
  • ASID (Address Space Identifier): which process this translation is for. Avoids the need to flush TLB on context switch — entries with the wrong ASID are simply ignored.

Some TLBs also store the page size (for systems supporting multiple sizes), and cache-related metadata.
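As a rough software picture of the entry layout above, the fields can be modelled as a record, with a lookup that requires the valid bit, VPN and ASID to all match. The field set is reduced and the names `TLBEntry` and `lookup` are assumptions. A real TLB performs the comparison in parallel hardware rather than in a loop.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TLBEntry:
    valid: bool      # entry holds a real translation
    vpn: int         # virtual page number (search key)
    pfn: int         # physical frame number (result)
    writable: bool   # one permission bit, standing in for the full set
    dirty: bool      # page written since it was loaded
    asid: int        # address space this translation belongs to


def lookup(entries: List[TLBEntry], vpn: int, asid: int) -> Optional[TLBEntry]:
    """Return the matching entry, or None on a miss."""
    for entry in entries:
        if entry.valid and entry.vpn == vpn and entry.asid == asid:
            return entry
    return None


entries = [
    TLBEntry(valid=True, vpn=0x10, pfn=0x200, writable=True, dirty=False, asid=1),
    TLBEntry(valid=True, vpn=0x10, pfn=0x300, writable=False, dirty=False, asid=2),
    TLBEntry(valid=False, vpn=0x11, pfn=0x400, writable=True, dirty=False, asid=1),
]
hit_for_asid_2 = lookup(entries, vpn=0x10, asid=2)   # second entry, pfn 0x300
miss_invalid = lookup(entries, vpn=0x11, asid=1)     # None: entry is not valid
```

The same VPN maps to different frames for ASID 1 and ASID 2, which is how translations from several processes can coexist.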

Multiple TLBs in practice

Modern processors have separate TLBs for instructions vs. data (Harvard architecture at the TLB level), and often multiple levels (L1 TLB and L2 TLB). Many also have separate TLBs, or separate groups of entries, for different page sizes (4 KB pages vs. larger 2 MB or 1 GB “huge pages” used by some applications).

Typical configurations vary widely between vendors and generations — these are illustrative only:

  • L1 ITLB (instruction): 64–128 entries, 4-way associative.
  • L1 DTLB (data): 64–128 entries, 4-way associative.
  • L2 TLB (unified): 1500–2000 entries, 8-way or more associative.

A current Intel client CPU might have a 64-entry L1 DTLB and a ~2K-entry L2 TLB; an Apple M-series core has different sizes again; a small embedded ARM Cortex might have a tenth of these. Check the specific datasheet rather than assuming.

The L1 TLBs are consulted on every instruction fetch and data access, potentially every clock cycle, so they have to be fast: small and searched in parallel. The L2 TLB catches L1 misses.
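One way to reason about these sizes is “TLB reach”: the amount of memory the TLB can map at once, which is the number of entries times the page size. The entry counts below reuse the illustrative figures from the text. Whether a given L1 TLB really holds 64 entries for 2 MB pages is an assumption.

```python
KB, MB = 1024, 1024 ** 2


def tlb_reach(entries, page_size):
    """Bytes of address space covered when every entry maps a distinct page."""
    return entries * page_size


reach_l1_4k = tlb_reach(64, 4 * KB)      # 256 KB
reach_l2_4k = tlb_reach(2048, 4 * KB)    # 8 MB
reach_l1_2m = tlb_reach(64, 2 * MB)      # 128 MB, assuming 64 huge-page entries
```

Under these assumptions, switching from 4 KB to 2 MB pages multiplies the reach of the same 64 entries by 512, which is the main appeal of huge pages for workloads with large working sets.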

Hardware vs software TLB management

Two approaches to handling TLB misses:

  • Hardware-walked TLB: the CPU has dedicated logic that walks the page table on a TLB miss. The page-table format is hard-coded into the hardware. Used by x86, ARM.

  • Software-managed TLB: the CPU traps on a TLB miss and jumps to an OS handler that walks the page table (in any format the OS chooses). The handler installs the new entry. Used by older MIPS, some embedded designs.

Hardware is faster (no trap overhead); software is more flexible (any page-table format).

Context switches

When the OS switches from one process to another, the TLB’s entries are now stale: they refer to the previous process’s address space. Two options:

  • Flush the TLB on every context switch. Simple, but the next process pays a TLB-miss storm as it warms up.
  • Tag entries with ASID (Address Space Identifier). Each entry remembers which process it belongs to; mismatches are ignored. Multiple processes’ translations coexist in the TLB. Modern OSes use this where the hardware supports it (for example, ASIDs on ARM and PCIDs on x86_64).

Either way, context switching has some TLB cost. This contributes to why frequent switching between processes hurts throughput.
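The two options can be compared with a toy miss counter. The model assumes an unbounded TLB, so it shows only the cost of flushing and not the capacity pressure that tagged entries from several processes put on each other. The schedule and the name `count_misses` are illustrative.

```python
def count_misses(schedule, use_asid):
    """Count TLB misses for a schedule of (pid, pages touched in that time slice)."""
    tlb = set()       # holds (tag, vpn) keys; unbounded for simplicity
    misses = 0
    current = None
    for pid, vpns in schedule:
        if pid != current and not use_asid:
            tlb.clear()                  # flush on every context switch
        current = pid
        tag = pid if use_asid else 0     # without ASIDs, entries carry no process tag
        for vpn in vpns:
            key = (tag, vpn)
            if key not in tlb:
                misses += 1
                tlb.add(key)
    return misses


# Two processes alternate, each touching the same three virtual pages of its own address space.
schedule = [(1, [0, 1, 2]), (2, [0, 1, 2]), (1, [0, 1, 2]), (2, [0, 1, 2])]
misses_with_flush = count_misses(schedule, use_asid=False)   # 12
misses_with_asid = count_misses(schedule, use_asid=True)     # 6
```

With flushing, every time slice starts cold. With ASIDs, each process pays for its pages once and finds them still cached when it runs again.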