Translation lookaside buffer

The translation lookaside buffer (TLB) is a small, fast cache inside the MMU (memory management unit) that holds recently used virtual-to-physical address translations. Without it, every memory access would first require reading the page table from main memory, turning a single memory operation into at least two.

Why it’s needed

In a virtual memory system, every memory access requires translating a virtual address to a physical one. The translation lives in the page table, which itself lives in main memory.

Without a TLB, the sequence for a single memory access is:

Start with virtual address $V$.
Look up $V$’s page number in the page table → physical frame number $F$. (Memory access #1: to the page table.)
Combine $F$ with $V$’s offset to get physical address $P$.
Access $P$. (Memory access #2: the actual data.)

That’s two memory accesses per one “memory access” the program sees. Memory just got twice as slow.

The TLB caches the page table entries the program has used recently. With a TLB:

Look up $V$’s page number in the TLB. If hit, get $F$ instantly.
Combine $F$ with offset → $P$.
Access $P$.

One memory access, plus a TLB lookup that takes maybe one cycle. The 2× overhead is gone.
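The saving can be put into numbers with a simple effective-access-time model. The sketch below assumes a single-level page table (a miss costs exactly one extra memory access) and made-up latencies of 100 cycles per memory access and 1 cycle per TLB lookup; `effective_access_time` and its parameters are illustrative names, not measurements of any real CPU.

```python
def effective_access_time(hit_rate, mem_cycles=100, tlb_cycles=1):
    """Average cycles per memory access with a TLB in front of a
    single-level page table (assumption: a miss costs one extra access)."""
    hit_cost = tlb_cycles + mem_cycles        # TLB lookup + data access
    miss_cost = tlb_cycles + 2 * mem_cycles   # TLB lookup + page table + data
    return hit_rate * hit_cost + (1.0 - hit_rate) * miss_cost


no_tlb = 2 * 100  # without a TLB every access reads the page table, then the data
for rate in (0.90, 0.99, 0.999):
    eat = effective_access_time(rate)
    print(f"hit rate {rate:.3f}: {eat:.1f} cycles (vs {no_tlb} without a TLB)")
```

Under these assumptions the average cost approaches a single memory access as the hit rate approaches 100%.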

TLB structure

The TLB is small: a first-level TLB typically holds 16 to 512 entries, and a second-level TLB up to a few thousand. Each entry stores:

Virtual page number (the lookup key).
Physical frame number.
Permission bits (read/write/execute).
Valid bit, dirty bit, etc.

Lookup is parallel: the requested virtual page number is compared against all entries at once (fully associative) or against every way of one indexed set (set-associative), giving fast access.

TLB miss

When the requested virtual page isn’t in the TLB:

The hardware (or OS, depending on architecture) walks the page table in main memory.
The translation is fetched.
The TLB is updated with the new entry (often evicting an older one).
The original access proceeds.
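The miss-handling steps above can be sketched as a toy model. This is a minimal illustration, not how any particular CPU works: `TinyTLB` is a hypothetical class, the page table is a plain dict, and LRU replacement is an assumption (real TLBs often use cheaper approximations).

```python
from collections import OrderedDict


class TinyTLB:
    """Toy fully associative TLB with LRU replacement (assumed policy)."""

    def __init__(self, capacity, page_table):
        self.capacity = capacity
        self.page_table = page_table   # dict VPN -> PFN, stands in for the in-memory table
        self.entries = OrderedDict()   # VPN -> PFN, least recently used first
        self.hits = 0
        self.misses = 0

    def translate(self, vpn):
        if vpn in self.entries:
            self.hits += 1
            self.entries.move_to_end(vpn)      # mark as most recently used
            return self.entries[vpn]
        # Miss: "walk" the page table. A KeyError here would be a page fault.
        self.misses += 1
        pfn = self.page_table[vpn]
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)   # evict the least recently used entry
        self.entries[vpn] = pfn                # install the new translation
        return pfn


page_table = {vpn: 0x100 + vpn for vpn in range(8)}
tlb = TinyTLB(capacity=2, page_table=page_table)
for vpn in (0, 1, 0, 2, 1):
    tlb.translate(vpn)
print(f"hits={tlb.hits} misses={tlb.misses}")
```

With only two entries, the access to page 2 evicts page 1, so the final access to page 1 misses again.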

A TLB miss is not “one extra memory access” on a modern 64-bit system. The page table is hierarchical: x86_64 uses a 4-level page table (5-level when the extra PML5 level is enabled), and walking it requires loading one entry from each level. Each load depends on the previous one (the entry tells you where the next-level table lives), so they can’t be parallelized. If the page-table entries themselves miss in the data cache, each level can stall for a full DRAM round-trip. A worst-case page-table walk on x86_64 can take hundreds of cycles, not tens. To mitigate this, the MMU’s page-table walker may keep a page-walk cache (PWC) of intermediate-level entries: like a TLB, but for the upper levels of the table.
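To make the dependent loads concrete, here is a rough sketch of a 4-level walk for 48-bit x86_64 addresses with 4 KB pages (four 9-bit indices plus a 12-bit offset). The nested dicts standing in for the table levels, and the names `split_va` and `walk`, are illustrative assumptions; real entries also carry flag bits that are ignored here.

```python
def split_va(va):
    """Split a 48-bit x86_64 virtual address (4-level paging, 4 KB pages)
    into four 9-bit table indices and a 12-bit page offset."""
    offset = va & 0xFFF
    indices = [(va >> shift) & 0x1FF for shift in (39, 30, 21, 12)]
    return indices, offset


def walk(root, va):
    """Walk nested dicts standing in for the four table levels.
    Returns (physical address, number of dependent loads)."""
    indices, offset = split_va(va)
    node, loads = root, 0
    for idx in indices:
        node = node[idx]   # the next level's location comes from this load
        loads += 1
    return (node << 12) | offset, loads


# Hypothetical mapping: indices 1, 2, 3, 4 lead to frame 0xABCDE.
root = {1: {2: {3: {4: 0xABCDE}}}}
va = (1 << 39) | (2 << 30) | (3 << 21) | (4 << 12) | 0x005
pa, loads = walk(root, va)
print(hex(pa), loads)
```

Each iteration of the loop needs the result of the previous one, which is why the four loads cannot overlap.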

A page fault is different and far worse: in the worst case (a major fault) the page isn’t in RAM at all and has to come from disk, which costs millions of cycles. A TLB miss followed by a page-table walk and the physical memory access stays at most in the hundreds of cycles.

Hit rates and impact

TLB hit rate depends entirely on the access pattern. Code that revisits a small set of pages (the common case: loops over the same arrays, repeated calls into the same functions) routinely sits above 99% because the working page set fits in even a small TLB. Workloads that touch many pages with little reuse, like hash-heavy random access into large data, big linear sweeps over multi-GB structures, or pointer-chasing across a heap larger than the TLB’s reach, can drop below 90% and stall on page-table walks. Hugepages (2 MB or 1 GB instead of 4 KB) are the standard workaround: each TLB entry now covers 512× or 262 144× more memory.
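The effect of page size is easiest to see as TLB reach: the amount of memory the TLB can map at once, which is entries times page size. The sketch below assumes a 64-entry TLB purely for illustration; `tlb_reach` is a hypothetical helper.

```python
KIB, MIB, GIB = 2**10, 2**20, 2**30


def tlb_reach(entries, page_size):
    """Bytes of memory covered when every TLB entry maps one page."""
    return entries * page_size


entries = 64  # assumed TLB size, for illustration only
for name, size in (("4 KB", 4 * KIB), ("2 MB", 2 * MIB), ("1 GB", GIB)):
    reach = tlb_reach(entries, size)
    factor = size // (4 * KIB)
    print(f"{name} pages: reach = {reach // KIB} KiB ({factor}x the 4 KB case)")
```

A working set larger than the reach cannot be fully covered by the TLB, whatever the access pattern.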

CPUs counter this by giving TLBs the same multi-level structure as data caches: a fast L1 ITLB and DTLB plus a unified L2 TLB sitting underneath them.

In context

The TLB is part of the memory hierarchy for address translations, parallel to the data cache hierarchy:

Data:         register → L1 → L2 → L3 → RAM → SSD/disk
Translations: L1 TLB → L2 TLB → page table in RAM → page fault (OS, possibly disk)

Both hierarchies aim for “translate or fetch in one cycle most of the time, fall back gracefully on miss.”

Worked example: address translation

Suppose 32-bit virtual addresses with 4 KB pages:

[ 20-bit virtual page number | 12-bit page offset ]

For virtual address $0x12345678$:

VPN: top 20 bits = $0x12345$.
Offset: bottom 12 bits = $0x678$.

The TLB is queried with VPN $0x12345$:

Hit: the TLB entry returns the physical frame number, say $0x4823$. Combine with the offset: physical address = $(0x4823 \ll 12) \mid 0x678 = 0x4823678$. The memory access proceeds at this physical address.
Miss: walk the page table. The walk takes one memory access per page-table level: 2 for a classic two-level 32-bit x86 table like the one in this example, 4 or 5 on x86_64. Once found, install the entry in the TLB and proceed.
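The same arithmetic as a short sketch. The dict standing in for the TLB and the function names `split` and `translate` are illustrative; only the hit path is modelled, and a miss simply returns `None`.

```python
PAGE_SHIFT = 12                       # 4 KB pages -> 12 offset bits
OFFSET_MASK = (1 << PAGE_SHIFT) - 1   # 0xFFF


def split(va):
    """Return (virtual page number, page offset) for a 32-bit address."""
    return va >> PAGE_SHIFT, va & OFFSET_MASK


def translate(va, tlb):
    """Translate via a dict VPN -> PFN; None signals a TLB miss."""
    vpn, offset = split(va)
    pfn = tlb.get(vpn)
    if pfn is None:
        return None                   # a real system would walk the page table here
    return (pfn << PAGE_SHIFT) | offset


tlb = {0x12345: 0x4823}               # the example mapping from the text
print(hex(translate(0x12345678, tlb)))
```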

TLB entry format

A typical TLB entry:

[ valid | VPN | PFN | permissions | dirty | ASID | ... ]

Valid: this entry holds a real translation.
VPN: virtual page number (the search key).
PFN: physical frame number (the result).
Permissions: read, write, execute, user/kernel.
Dirty: page has been written since loaded.
ASID (Address Space Identifier): which process this translation is for. Avoids the need to flush TLB on context switch; entries with the wrong ASID are ignored.

Some TLBs also store the page size (for systems supporting multiple sizes), and cache-related metadata.
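As a rough sketch, the fields above map onto a small record type. `TLBEntry` and `lookup` are hypothetical names, the permission model is reduced to three flags, and the sequential loop only stands in for what hardware does with parallel comparators.

```python
from dataclasses import dataclass


@dataclass
class TLBEntry:
    valid: bool
    vpn: int
    pfn: int
    asid: int
    readable: bool = True
    writable: bool = False
    executable: bool = False
    dirty: bool = False

    def matches(self, vpn, asid):
        # An entry only hits if it is valid and both VPN and ASID agree.
        return self.valid and self.vpn == vpn and self.asid == asid


def lookup(entries, vpn, asid, write=False):
    """Return the PFN on a hit, None on a miss; enforce write permission."""
    for entry in entries:              # hardware compares all entries at once
        if entry.matches(vpn, asid):
            if write and not entry.writable:
                raise PermissionError("write to a read-only page")
            if write:
                entry.dirty = True     # record that the page was modified
            return entry.pfn
    return None


entries = [
    TLBEntry(True, 0x12345, 0x4823, asid=1, writable=True),
    TLBEntry(True, 0x12345, 0x0999, asid=2),   # same VPN, different process
]
print(hex(lookup(entries, 0x12345, asid=1)), hex(lookup(entries, 0x12345, asid=2)))
```

The two entries share a VPN but belong to different address spaces, so the ASID decides which frame is returned.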

Multiple TLBs in practice

Out-of-order CPUs split the TLB the same way they split the cache: a separate ITLB for instruction fetches and a DTLB for loads/stores (Harvard organisation), each with its own associativity and ports. Underneath both sits a larger unified L2 TLB. Most cores also keep separate TLB sets per page size, since the 4 KB, 2 MB, and 1 GB page formats have different index/tag widths.

Typical configurations vary widely between vendors and generations — these are illustrative only:

L1 ITLB (instruction): 64–128 entries, 4-way associative.
L1 DTLB (data): 64–128 entries, 4-way associative.
L2 TLB (unified): 1500–2000 entries, 8-way or more associative.

A current Intel client CPU might have a 64-entry L1 DTLB and a ~2K-entry L2 TLB; an Apple M-series core has different sizes again; a small embedded ARM Cortex might be a tenth of these. Pull from the specific datasheet rather than assuming. The L1 TLBs are accessed every clock cycle (every memory access), so they have to be fast: small and parallel-search. The L2 TLB catches L1 misses.
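For a set-associative TLB, only one set is searched per lookup. A minimal sketch of how a VPN could be split into set index and tag, assuming the illustrative 64-entry, 4-way configuration and a plain modulo index (real designs may hash the index bits); `set_index_and_tag` is a hypothetical helper.

```python
def set_index_and_tag(vpn, entries=64, ways=4):
    """Split a VPN into (set index, tag) for a set-associative TLB.
    Assumes a simple modulo index; the number of sets is entries // ways."""
    sets = entries // ways
    return vpn % sets, vpn // sets


index, tag = set_index_and_tag(0x12345)
print(f"set {index}, tag {tag:#x}")
```

Only the four ways of the selected set need a tag comparison, which keeps the lookup fast even as the entry count grows.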

Hardware vs software TLB management

Two approaches to handling TLB misses:

Hardware-walked TLB: the CPU has dedicated logic that walks the page table on a TLB miss. The page-table format is hard-coded into the hardware. Used by x86, ARM.
Software-managed TLB: the CPU traps on a TLB miss and jumps to an OS handler that walks the page table (in any format the OS chooses). The handler installs the new entry. Used by older MIPS, some embedded designs.

Hardware is faster (no trap overhead); software is more flexible (any page-table format).

Context switches

When the OS swaps from one process to another, the TLB’s entries are now wrong: they refer to the previous process’s address space. Two options:

Flush the TLB on every context switch. Simple, but the next process pays a TLB-miss storm as it warms up.
Tag entries with ASID (Address Space Identifier). Each entry remembers which process it belongs to; mismatches are ignored. Multiple processes’ translations coexist in the TLB. Modern OSes use this where the hardware supports it (ASIDs on ARM, PCIDs on x86_64).

Either way, context switching has some TLB cost. This contributes to why frequent switching between processes hurts throughput.
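The difference between the two options can be illustrated with a toy miss counter. This sketch assumes a TLB large enough that nothing is ever evicted, so all misses come from cold entries or flushes; `count_misses` and the schedule are made up for illustration.

```python
def count_misses(schedule, use_asid):
    """Count TLB misses for a list of (pid, vpn) accesses.
    Assumes an unbounded TLB, so only flushes and cold entries cause misses."""
    tlb = set()
    misses = 0
    current = None
    for pid, vpn in schedule:
        if pid != current:             # context switch
            if not use_asid:
                tlb.clear()            # untagged TLB must be flushed
            current = pid
        key = (pid, vpn) if use_asid else vpn
        if key not in tlb:
            misses += 1
            tlb.add(key)
    return misses


# Two processes alternate for three rounds, each touching the same 4 pages per slice.
schedule = [(pid, vpn) for _ in range(3) for pid in (1, 2) for vpn in range(4)]
print("flush on switch:", count_misses(schedule, use_asid=False))
print("ASID-tagged:   ", count_misses(schedule, use_asid=True))
```

With flushing, every time slice starts cold; with ASID tags, each process pays its misses only once in this model.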

Idriss Rami — Notes
