Instruction execution cycle

The instruction execution cycle is the sequence of steps a processor runs for every instruction: Fetch, Decode, Execute, Memory, Writeback. In a simple non-pipelined design the five stages take five clock cycles. Pipelined processors overlap the stages so that throughput approaches one instruction per cycle.

The cycle is the same for every RISC-style instruction. What varies between types is what each stage does. The Memory stage accesses memory only for loads and stores and merely passes the result along for plain ALU operations; the Writeback stage updates a register for ALU operations and loads but not for stores.

The five stages

Stage 1 — Fetch

Pull the instruction at PC from memory into IR; increment PC.

In register transfer notation:

$IR \leftarrow [[PC]]; PC \leftarrow [PC] + 4$

The brackets matter: $[PC]$ is “contents of PC” (the address); $[[PC]]$ is “contents of memory at that address” (the instruction itself).

This is always a memory read. See Memory Read and Write Operations.
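The two register transfers can be sketched in a few lines of Python. This is a minimal illustration, assuming a dictionary keyed by byte address as instruction memory and 4-byte instructions; the names `memory` and `fetch` and the instruction strings are hypothetical, not part of any real simulator.

```python
# Hypothetical instruction memory: byte address -> instruction text.
memory = {0: "add r3, r1, r2", 4: "ldw r4, 8(r3)", 8: "stw r4, 0(r1)"}

def fetch(pc, memory):
    """Model one Fetch stage; return (IR, updated PC)."""
    ir = memory[pc]   # IR <- [[PC]]: contents of memory at the address held in PC
    pc = pc + 4       # PC <- [PC] + 4: point at the next 4-byte instruction
    return ir, pc

ir, pc = fetch(0, memory)
print(ir, pc)
```

Here `pc` plays the role of $[PC]$ and `memory[pc]` the role of $[[PC]]$, which is the distinction the double brackets encode.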

Stage 2 — Decode / Read

Decode the opcode in IR; read the source register operands from the register file into the inter-stage registers RA and RB.

$RA \leftarrow [Rs]; RB \leftarrow [Rt]$

The control unit examines IR’s opcode to determine which type of instruction this is and which control signals to assert in the upcoming stages.

Stages 3–5 — Vary by instruction type

Each instruction type uses these stages differently:

Instruction type	Stage 3 (Execute)	Stage 4 (Memory)	Stage 5 (Writeback)
ALU (Add, Sub, etc.)	$RZ \leftarrow [RA] \mathbin{\mathrm{op}} [RB]$	$RY \leftarrow [RZ]$	$Rd \leftarrow [RY]$
Load $Rt, X(Rs)$	$RZ \leftarrow [RA] + X$	$RY \leftarrow [[RZ]]$	$Rt \leftarrow [RY]$
Store $Rt, X(Rs)$	$RZ \leftarrow [RA] + X$ ; $RM \leftarrow [RB]$	$[RZ] \leftarrow [RM]$	(no action)
Call address	$PC\_Temp \leftarrow [PC]$ ; $PC \leftarrow \text{address}$	$RY \leftarrow [PC\_Temp]$	$LINK \leftarrow [RY]$
Return	$PC \leftarrow [RA]$	(no action)	(no action)

What’s happening in each row

ALU operations (add, subtract, etc.):

Stage 3: ALU computes the result, latched into RZ.
Stage 4: RZ moves to RY. No memory access occurs; the pass-through means Writeback always reads its value from RY, whatever the instruction type.
Stage 5: RY’s value writes back to the destination register.

Loads (ldw):

Stage 3: ALU computes the effective memory address (base + offset), result in RZ.
Stage 4: Memory is read at that address; the data goes into RY.
Stage 5: RY writes back to the destination register.

Stores (stw):

Stage 3: ALU computes the effective address (RZ); the value to store loads into RM.
Stage 4: Memory writes RM to address RZ.
Stage 5: Nothing. Store has no destination register.

Call:

Stage 3: Save current PC into PC_Temp; set PC to the call target.
Stage 4: Move PC_Temp into RY.
Stage 5: Write RY to the link register.

Return:

Stage 3: Set PC to the contents of RA (which holds the link register’s value).
Stages 4–5: Nothing.
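The rows of the table can be traced in code. The sketch below is an assumption-laden model, not a real datapath: instructions are hypothetical Python dicts rather than encoded words, registers live in a dict that also holds a `"link"` entry, and each stage is a single assignment to the matching inter-stage variable.

```python
import operator

def run_instruction(inst, regs, mem, pc):
    """Run stages 2-5 for one already-fetched instruction; return the new PC.

    `inst` is a hypothetical dict encoding; `pc` is the PC value after
    Fetch has already incremented it.
    """
    kind = inst["type"]
    # Stage 2: read the source operands into RA and RB.
    ra = regs[inst.get("rs", 0)]
    rb = regs[inst.get("rt", 0)]

    if kind == "alu":
        rz = inst["op"](ra, rb)    # Stage 3: RZ <- [RA] op [RB]
        ry = rz                    # Stage 4: RY <- [RZ]
        regs[inst["rd"]] = ry      # Stage 5: Rd <- [RY]
    elif kind == "load":
        rz = ra + inst["x"]        # Stage 3: effective address
        ry = mem[rz]               # Stage 4: RY <- [[RZ]]
        regs[inst["rt"]] = ry      # Stage 5: Rt <- [RY]
    elif kind == "store":
        rz = ra + inst["x"]        # Stage 3: effective address ...
        rm = rb                    #          ... and RM <- [RB]
        mem[rz] = rm               # Stage 4: memory write; stage 5 does nothing
    elif kind == "call":
        pc_temp = pc               # Stage 3: save the return address
        pc = inst["address"]       #          and jump to the target
        ry = pc_temp               # Stage 4: RY <- [PC_Temp]
        regs["link"] = ry          # Stage 5: LINK <- [RY]
    elif kind == "return":
        pc = ra                    # Stage 3: PC <- [RA]; stages 4-5 do nothing
    return pc

regs = {0: 0, 1: 100, 2: 5, 3: 0, "link": 0}
mem = {108: 42}
run_instruction({"type": "load", "rs": 1, "x": 8, "rt": 3}, regs, mem, pc=4)
print(regs[3])  # value loaded from address 100 + 8
```

A return is modelled with `"rs": "link"`, so that RA receives the link register's value in stage 2, matching the note in the Return row.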

Inter-stage registers

The capital-letter registers (RA, RB, RZ, RY, RM) aren’t visible to the programmer. They hold values between pipeline stages. Each stage’s outputs get latched into one of these registers at the end of the stage; the next stage reads from the latch.

Why have them? In a pipelined design, multiple instructions are in flight at once. RA’s value for the instruction in stage 3 has to stay stable while the instruction in stage 2 computes a different RA for itself. The inter-stage registers keep the stages from interfering.

Pipelining

In a non-pipelined design, only one instruction is being processed at a time, and it takes 5 cycles to complete (5 stages × 1 cycle each). Throughput: one instruction per 5 cycles.

In a pipelined design, each stage works on a different instruction simultaneously:

Cycle 1: Inst1=F
Cycle 2: Inst1=D Inst2=F
Cycle 3: Inst1=E Inst2=D Inst3=F
Cycle 4: Inst1=M Inst2=E Inst3=D Inst4=F
Cycle 5: Inst1=W Inst2=M Inst3=E Inst4=D Inst5=F
Cycle 6:         Inst2=W Inst3=M Inst4=E Inst5=D ...
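The diagram follows one rule: instruction i enters Fetch in cycle i and moves one stage per cycle. The sketch below regenerates it under that rule, assuming no stalls, and compares total cycle counts; `pipeline_schedule` is an illustrative name.

```python
STAGES = "FDEMW"  # Fetch, Decode, Execute, Memory, Writeback

def pipeline_schedule(n_instructions):
    """Return one list per cycle of (instruction number, stage letter) pairs."""
    total_cycles = n_instructions + len(STAGES) - 1
    schedule = []
    for cycle in range(1, total_cycles + 1):
        row = []
        for inst in range(1, n_instructions + 1):
            stage_index = cycle - inst   # instruction i enters Fetch in cycle i
            if 0 <= stage_index < len(STAGES):
                row.append((inst, STAGES[stage_index]))
        schedule.append(row)
    return schedule

for cycle, row in enumerate(pipeline_schedule(5), start=1):
    print(f"Cycle {cycle}:", " ".join(f"Inst{i}={s}" for i, s in row))

n = 1000
pipelined = n + len(STAGES) - 1    # 4 cycles to fill, then one completion per cycle
non_pipelined = n * len(STAGES)    # 5 cycles per instruction, one at a time
print(non_pipelined / pipelined)   # approaches 5 as n grows
```

For n instructions the stall-free pipeline needs n + 4 cycles against 5n without pipelining, so the ratio tends to 5 only for long instruction streams.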

After the pipeline fills, one instruction completes per cycle, a 5× speedup over the non-pipelined version in the ideal case. Real pipelines never sustain that ideal because of three classes of hazard:

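Data hazards. An instruction needs a value that an earlier instruction hasn’t yet written back. The pipeline must either stall (insert a bubble) or forward the value directly from a later stage's inter-stage register. Forwarding costs extra hardware but removes most of these stalls; a load followed immediately by a use of the loaded value still needs a bubble.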
Control hazards. Branches don’t resolve until the Execute stage (or later), but the Fetch stage has already pulled in instructions assuming the branch wasn’t taken. On a misprediction, those fetched instructions get flushed, wasting one cycle per flushed instruction (two when the branch resolves in Execute).
Structural hazards. Two instructions need the same hardware resource (e.g. a single memory port for both fetch and load) at the same time. One must wait.
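As one concrete case of a data hazard, the sketch below counts load-use bubbles in a short program. It assumes full forwarding, so the only stall modelled is a one-cycle bubble when the instruction directly after a load reads the loaded register; the tuple encoding and the function name are hypothetical.

```python
def count_load_use_stalls(program):
    """Count one-cycle bubbles for load-use hazards, assuming full forwarding.

    Each instruction is a hypothetical (opcode, dest, sources) tuple; dest is
    None for instructions such as stores that write no register.
    """
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, curr_sources = curr
        # Loaded data exists only after the Memory stage, too late for the
        # Execute stage of the instruction immediately behind the load.
        if prev_op == "ldw" and prev_dest is not None and prev_dest in curr_sources:
            stalls += 1
    return stalls

program = [
    ("ldw", "r4", ("r3",)),        # r4 <- memory at r3 + X
    ("add", "r5", ("r4", "r1")),   # reads r4 right after the load: one bubble
    ("ldw", "r6", ("r1",)),
    ("sub", "r7", ("r2", "r1")),   # does not read r6: no bubble
    ("stw", None, ("r1", "r6")),   # r6 loaded two instructions earlier: forwarded
]
print(count_load_use_stalls(program))  # 1
```

Without forwarding, more of these dependent pairs would stall, since a consumer would have to wait until the producer reached Writeback.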

In practice a textbook 5-stage pipeline (the MIPS R2000 in Hennessy & Patterson) does not reach the ideal CPI of 1.0 on general-purpose code; it settles around 1.2–1.5 once load-use stalls and branch flushes are counted. Out-of-order superscalar designs push IPC (instructions per cycle, the inverse of CPI) above 1 by issuing multiple instructions per cycle.
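A CPI in that range falls out of a simple model: the ideal 1.0 plus the average stall cycles each instruction contributes. The sketch below uses made-up instruction-mix fractions purely for illustration; the penalties assume one bubble per load-use hazard and two flushed cycles per mispredicted branch.

```python
def effective_cpi(load_frac, load_use_frac, branch_frac, mispredict_frac,
                  load_use_penalty=1, branch_penalty=2):
    """Ideal CPI of 1 plus average stall cycles per instruction."""
    load_stalls = load_frac * load_use_frac * load_use_penalty
    branch_stalls = branch_frac * mispredict_frac * branch_penalty
    return 1.0 + load_stalls + branch_stalls

# Illustrative mix (assumed, not measured): 25% loads, 40% of them followed
# by a dependent instruction; 15% branches, half of them mispredicted.
cpi = effective_cpi(load_frac=0.25, load_use_frac=0.4,
                    branch_frac=0.15, mispredict_frac=0.5)
print(round(cpi, 2))      # 1.25
print(round(1 / cpi, 2))  # 0.8 (IPC is the inverse of CPI)
```

With these assumed fractions, load-use stalls add 0.10 and branch flushes add 0.15 cycles per instruction.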

For the hardware that implements this, see Hardware datapath and Control unit and control signals.

Idriss Rami — Notes
