Carry-lookahead adder

A carry-lookahead adder (CLA) computes all the carry bits in parallel directly from the inputs, instead of waiting for them to ripple stage by stage. The trick is to express each carry as a flat function of the inputs by decomposing per-bit behaviour into two simpler signals: generate and propagate.

The ripple-carry adder takes $O(n)$ gate delays for an $n$-bit add because each stage must wait for the previous stage's carry. For wide datapaths at high clock rates, that ripple is too slow. Lookahead breaks the dependency chain and brings worst-case carry delay down to $O(\log n)$ when applied hierarchically.

Generate and propagate

For each bit position $i$ with operand bits $A_{i}, B_{i}$:

Generate $G_{i} = A_{i} \cdot B_{i}$: bit $i$ produces a carry-out regardless of carry-in (both inputs are $1$, so the arithmetic sum $A_{i} + B_{i} = 10_{2}$ unconditionally).
Propagate $P_{i} = A_{i} \oplus B_{i}$: bit $i$ passes its carry-in through to its carry-out (exactly one input is $1$, so the arithmetic sum $A_{i} + B_{i} = 1$, and a carry-in of $1$ turns that into $10_{2}$).

These two signals capture everything about how stage $i$ contributes to the carry chain:

$C_{i + 1} = G_{i} + P_{i} \cdot C_{i} .$

Here $+$ is logical OR and $\cdot$ is logical AND. Read it as: stage $i$'s carry-out is high if it generates a carry by itself, or if it propagates an incoming carry. The sum bit comes from the propagate signal and the carry-in:

$S_{i} = P_{i} \oplus C_{i} .$

(For the generate-active case where $G_{i} = 1$, both $A_{i} = B_{i} = 1$, so $P_{i} = 0$ and the sum bit is $C_{i}$. The carry, not the operand sum, is what surfaces.)

$A_{i}$	$B_{i}$	$C_{i}$	$G_{i}$	$P_{i}$	$C_{i + 1}$	$S_{i}$
0	0	0	0	0	0	0
0	0	1	0	0	0	1
0	1	0	0	1	0	1
0	1	1	0	1	1	0
1	0	0	0	1	0	1
1	0	1	0	1	1	0
1	1	0	1	0	1	0
1	1	1	1	0	1	1
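The table can be checked mechanically. The following is a minimal Python sketch, not part of the original notes; the function name `full_adder_gp` is made up for illustration. It computes the four signals for one bit position and confirms that the carry-out and sum bit together equal the arithmetic sum of the three inputs.

```python
from itertools import product

def full_adder_gp(a, b, c_in):
    """Return (g, p, c_out, s) for one bit position (hypothetical helper)."""
    g = a & b               # generate: both operand bits are 1
    p = a ^ b               # propagate: exactly one operand bit is 1
    c_out = g | (p & c_in)  # C_{i+1} = G_i + P_i * C_i
    s = p ^ c_in            # S_i = P_i xor C_i
    return g, p, c_out, s

# Exhaustive check of all eight input rows against ordinary addition.
for a, b, c_in in product((0, 1), repeat=3):
    g, p, c_out, s = full_adder_gp(a, b, c_in)
    assert 2 * c_out + s == a + b + c_in
```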

Flattening the carry chain

Substitute the recurrence into itself to expand each carry as a function of the inputs and the original $C_{0}$:

$C_{1} = G_{0} + P_{0} C_{0}$

$C_{2} = G_{1} + P_{1} C_{1} = G_{1} + P_{1} G_{0} + P_{1} P_{0} C_{0}$

$C_{3} = G_{2} + P_{2} G_{1} + P_{2} P_{1} G_{0} + P_{2} P_{1} P_{0} C_{0}$

$C_{4} = G_{3} + P_{3} G_{2} + P_{3} P_{2} G_{1} + P_{3} P_{2} P_{1} G_{0} + P_{3} P_{2} P_{1} P_{0} C_{0}$

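Every carry is a two-level sum of products (an OR of ANDs) over the $G$, $P$ signals and $C_{0}$, and none depends on another carry. With enough silicon, all four can be computed in parallel from $A_{0..3}, B_{0..3}, C_{0}$ in three gate delays: one for $P$ and $G$ (an XOR and an AND, side by side), one for the AND terms, and one for the final OR. That delay is constant at this scale rather than linear in the bit width, although the widest AND and OR gates grow with the block size.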
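As a sanity check on the expansion, the sketch below (Python, with illustrative function names that do not come from the notes) evaluates the four flattened expressions and compares them with the step-by-step recurrence for every 4-bit operand pair and both values of $C_{0}$.

```python
from itertools import product

def lookahead_carries(g, p, c0):
    """Flattened C1..C4 for a 4-bit block; g and p are lists indexed by bit."""
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0)
    c4 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0])
          | (p[3] & p[2] & p[1] & p[0] & c0))
    return [c1, c2, c3, c4]

def ripple_carries(g, p, c0):
    """Same carries, computed one stage at a time from the recurrence."""
    carries, c = [], c0
    for i in range(4):
        c = g[i] | (p[i] & c)
        carries.append(c)
    return carries

# All 16 * 16 * 2 input combinations must agree.
for a, b, c0 in product(range(16), range(16), (0, 1)):
    g = [(a >> i) & (b >> i) & 1 for i in range(4)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(4)]
    assert lookahead_carries(g, p, c0) == ripple_carries(g, p, c0)
```

In software the two functions cost about the same; the point is only that the flattened expressions have no carry-to-carry dependency, which is what lets hardware evaluate them side by side.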

A 4-bit CLA combines:

Four per-bit adder cells (one per bit) that produce $G_{i}$, $P_{i}$ and the sum bit $S_{i} = P_{i} \oplus C_{i}$.
A carry-lookahead unit (CLU) computing $C_{1}, C_{2}, C_{3}, C_{4}$ in parallel using the flattened formulas above.

Group propagate and generate

The CLU also produces group signals that summarise the whole 4-bit block:

$P_{G} = P_{3} P_{2} P_{1} P_{0}$

The block as a whole propagates a carry only if every stage propagates.

$G_{G} = G_{3} + P_{3} G_{2} + P_{3} P_{2} G_{1} + P_{3} P_{2} P_{1} G_{0}$

The block as a whole generates a carry if any stage generates and all higher stages propagate.

These are the same recurrence applied at the next level up: a block behaves like a single wide bit whose carry-out is $G_{G} + P_{G} \cdot C_{0}$. Cascade four 4-bit CLAs with a higher-level lookahead carry unit (LCU) consuming their group $P_{G}, G_{G}$ and you get a 16-bit adder with $\log_{4}(16) = 2$ levels of lookahead. Stack again for 64 bits, 256 bits, and so on.
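A minimal Python sketch of that 16-bit arrangement follows. It is a behavioural model under stated assumptions, not a gate-level design: `bits`, `lookahead4` and `add16` are names invented here, and the same `lookahead4` routine is reused for both the per-block CLU and the LCU, since the formulas are identical.

```python
def bits(x, n):
    """Little-endian list of the low n bits of x."""
    return [(x >> i) & 1 for i in range(n)]

def lookahead4(g, p, c0):
    """Carry into each of 4 positions, plus group (P, G), from 4 (g, p) pairs."""
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0)
    group_p = p[3] & p[2] & p[1] & p[0]
    group_g = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
               | (p[3] & p[2] & p[1] & g[0]))
    return [c0, c1, c2, c3], group_p, group_g

def add16(a, b, c0=0):
    """Two-level lookahead add; returns (16-bit sum, carry-out)."""
    a_bits, b_bits = bits(a, 16), bits(b, 16)
    g = [x & y for x, y in zip(a_bits, b_bits)]
    p = [x ^ y for x, y in zip(a_bits, b_bits)]
    # Level 1: group signals of each 4-bit block (they do not depend on carry-in).
    groups = [lookahead4(g[k:k + 4], p[k:k + 4], 0) for k in range(0, 16, 4)]
    block_p = [gp for _, gp, _ in groups]
    block_g = [gg for _, _, gg in groups]
    # Level 2: the LCU turns block signals into the carry entering each block.
    block_cin, top_p, top_g = lookahead4(block_g, block_p, c0)
    carry_out = top_g | (top_p & c0)
    # Each block now resolves its internal carries and sum bits.
    total = 0
    for blk, cin in enumerate(block_cin):
        k = 4 * blk
        carries, _, _ = lookahead4(g[k:k + 4], p[k:k + 4], cin)
        for i in range(4):
            total |= (p[k + i] ^ carries[i]) << (k + i)
    return total, carry_out

assert add16(0xFFFF, 1) == (0, 1)
assert add16(0x1234, 0x0FCC, 1) == (0x2201, 0)
```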

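$4^{n}$-bit hierarchical CLA

For a $4^{n}$-bit CLA, four $4^{n - 1}$-bit CLAs feed an LCU. The LCU computes group $P$ and $G$ from the per-block signals, applying the same recurrence at the higher level. In the formulas below, each subscript is the index of the lowest bit of the sub-block whose group signal it names: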

$P_{G} = P_{3 \cdot 4^{n - 1}} \cdot P_{2 \cdot 4^{n - 1}} \cdot P_{4^{n - 1}} \cdot P_{0}$

$G_{G} = G_{3 \cdot 4^{n - 1}} + P_{3 \cdot 4^{n - 1}} G_{2 \cdot 4^{n - 1}} + P_{3 \cdot 4^{n - 1}} P_{2 \cdot 4^{n - 1}} G_{4^{n - 1}} + P_{3 \cdot 4^{n - 1}} P_{2 \cdot 4^{n - 1}} P_{4^{n - 1}} G_{0}$

The number of lookahead levels grows as $\log_{4} N$ for an $N$-bit adder, which is $n$ for $N = 4^{n}$: three levels for 64 bits, four for 256.
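The recursion can be sketched directly in Python. This is an illustrative model only: `group_pg` and `carry_out` are hypothetical names, the width is assumed to be an exact power of 4, and a single bit is treated as level 0 so that a 4-bit block counts as one level.

```python
def group_pg(p, g):
    """Return (P_G, G_G, levels) for per-bit lists whose length is a power of 4."""
    if len(p) == 1:
        return p[0], g[0], 0          # a single bit is its own group
    quarter = len(p) // 4
    subs = [group_pg(p[k:k + quarter], g[k:k + quarter])
            for k in range(0, len(p), quarter)]
    sp = [s[0] for s in subs]         # group propagate of each sub-block
    sg = [s[1] for s in subs]         # group generate of each sub-block
    big_p = sp[3] & sp[2] & sp[1] & sp[0]
    big_g = (sg[3] | (sp[3] & sg[2]) | (sp[3] & sp[2] & sg[1])
             | (sp[3] & sp[2] & sp[1] & sg[0]))
    return big_p, big_g, subs[0][2] + 1

def carry_out(a, b, c0, width):
    """Carry out of a width-bit add, from the top-level group signals."""
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]
    g = [(a >> i) & (b >> i) & 1 for i in range(width)]
    big_p, big_g, levels = group_pg(p, g)
    return big_g | (big_p & c0), levels

# An all-ones operand plus zero propagates the carry-in across all 64 bits.
assert carry_out(2**64 - 1, 0, 1, 64) == (1, 3)
assert carry_out(2**64 - 1, 0, 0, 64) == (0, 3)
```

The `levels` value returned alongside the carry is the recursion depth, which matches the three-level and four-level figures quoted for 64 and 256 bits.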

Cost

The price of $O(\log n)$ delay is more silicon. Within a single lookahead unit, the gate count grows superlinearly with the block width, because the higher-order carries need more product terms and wider AND gates. In practice, designers limit the lookahead block size (4 bits is the textbook unit; 8 occasionally) and use hierarchy for the rest.

Aspect	Ripple-carry	Carry-lookahead (hierarchical)
Worst-case carry delay	$O(n)$ stages	$O(\log n)$ levels
Gate count	$O(n)$	$O(n)$ with a larger constant for fixed-size blocks; roughly $O(n^{2})$ product terms if flattened into one level
Layout	Trivially regular, neighbour wiring	Group/level structure, longer wires
Fan-in pressure	Constant per stage	Grows with block size
When to use	Low-speed, narrow operands	Wide, high-speed datapaths

Other fast-adder schemes

Carry-lookahead is the textbook example, not the only fast adder:

Carry-skip adder — splits the operand into blocks; if a block has $P_{G} = 1$ (carries propagate through the entire block), the incoming carry “skips” past the block via a multiplexer. Simpler than full lookahead.
Carry-select adder — computes both possible sums for each block (assuming carry-in 0 and carry-in 1), then selects the right one once the carry arrives.
Kogge-Stone, Brent-Kung, Han-Carlson — parallel-prefix adders that organise the $G / P$ combination as a tree, optimising different points on the depth/wire/area trade-off. Modern CPU adders are usually parallel-prefix designs.

All of these reduce carry latency below the $O (n)$ of ripple, paying with more gates and more wiring. Real CPUs use deeply optimised parallel-prefix adders, but they all build on the lookahead idea: break the carry dependency by computing $G / P$ separately and combining them in parallel.
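To make the shared idea concrete, here is a rough Python sketch of a Kogge-Stone style prefix computation. It is a behavioural illustration, not a description of any particular CPU's adder; the function name `kogge_stone_add`, the default width of 16, and the choice to fold $C_{0}$ into bit 0's generate signal are all assumptions made for this example. Each pass combines a span's $(G, P)$ with that of the span `dist` positions below it, doubling the span length, so the loop runs $\log_{2}$ of the width times.

```python
def kogge_stone_add(a, b, width=16, c0=0):
    """Add two width-bit integers with a Kogge-Stone style prefix network."""
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]
    g = [(a >> i) & (b >> i) & 1 for i in range(width)]
    span_g, span_p = g[:], p[:]
    span_g[0] = g[0] | (p[0] & c0)    # fold the carry-in into bit 0
    dist = 1
    while dist < width:
        new_g, new_p = span_g[:], span_p[:]
        for i in range(dist, width):
            # (G, P) of the upper span combined with the span just below it.
            new_g[i] = span_g[i] | (span_p[i] & span_g[i - dist])
            new_p[i] = span_p[i] & span_p[i - dist]
        span_g, span_p = new_g, new_p
        dist *= 2
    # span_g[i] is now the carry out of bit i, so the carry into bit i
    # is c0 for bit 0 and span_g[i - 1] otherwise.
    carries = [c0] + span_g[:-1]
    total = sum((p[i] ^ carries[i]) << i for i in range(width))
    return total, span_g[-1]

assert kogge_stone_add(1234, 4321) == (5555, 0)
assert kogge_stone_add(0xFFFF, 1) == (0, 1)
```

The inner `for` loop is sequential here only because this is software; in the hardware structure every position in a pass is an independent cell, which is where the logarithmic depth comes from.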

Idriss Rami — Notes
