Ch 7: Pipelined
Processor
5-stage pipeline, hazard detection, data forwarding, stalls, flushes, and performance — all the tricky details for your exam.
← Page 1: Single-CyclePipeline Fundamentals
Pipelining achieves temporal parallelism: while instruction N is in Execute, instruction N+1 is in Decode, and N+2 is in Fetch — all simultaneously.
The Five Pipeline Stages
IF — Instruction Fetch
PC → Instruction Memory → Fetch instruction. Compute PC+4. Store in IF/ID pipeline register.
ID — Instruction Decode / RF Read
Decode opcode → generate control signals. Read registers rs1, rs2 from register file. Sign-extend immediate. Store in ID/EX register.
EX — Execute
ALU performs operation. Mux selects SrcA (reg or forwarded) and SrcB (reg, forwarded, or imm). Compute branch target PCTarget = PCE + ImmExtE. PCSrcE = BranchE AND ZeroE.
MEM — Memory Access
lw: reads data memory at ALUResultM. sw: writes WriteDataM to memory. Non-memory instructions pass through. Pass results to EX/MEM register.
WB — Write Back
ResultSrcW selects: ReadDataW (lw), ALUResultW (R-type), PCPlus4W (jal). Write to rd in register file. RegWriteW enables the write.
Single-Cycle vs Pipelined — Key Differences
| Aspect | Single-Cycle | Pipelined |
|---|---|---|
| CPI | Always 1 | ~1.23 (with hazards) |
| Clock Period | 750 ps (lw path) | 350 ps (Execute stage) |
| Multiple instructions | One at a time | Up to 5 simultaneously |
| Hazards | None (one instruction) | Data & control hazards |
| Control signals | Decoded once | Travel with instruction through stages |
| Signal naming | No suffix | Suffix = stage (RegWriteD, ALUSrcE…) |
Control signals are generated in Decode and then travel with the instruction through pipeline registers. The suffix tells you where the signal currently lives:
RegWriteD → in Decode, RegWriteE → in Execute, RegWriteM → in Memory, RegWriteW → in Writeback.
Same for ResultSrc, MemWrite, ALUControl, etc.
Pipeline Registers — What They Hold
Between each stage there is a set of flip-flops (pipeline registers) that capture all needed values at the rising clock edge. This isolates stages.
IF/ID Register holds:
- InstrD (32-bit instruction)
- PCPlus4D (PC+4 for the instruction)
- PCD (current PC, for branch target)
ID/EX Register holds:
- RD1E, RD2E (register file outputs)
- PCE, PCPlus4E
- RdE (destination register number)
- ImmExtE (sign-extended immediate)
- Rs1E, Rs2E (source register numbers for hazard unit)
- All control signals: RegWriteE, ResultSrcE, MemWriteE, ALUControlE, ALUSrcE, JumpE, BranchE
EX/MEM Register holds:
- ALUResultM
- WriteDataM (value to store for sw)
- RdM, PCPlus4M
- RegWriteM, ResultSrcM, MemWriteM
MEM/WB Register holds:
- ReadDataW (data from memory for lw)
- ALUResultW
- RdW, PCPlus4W
- RegWriteW, ResultSrcW
The hazard unit needs to compare source registers in the Execute stage against destination registers in Memory and Writeback. So Rs1E, Rs2E (from the instruction in Execute) and RdM, RdW (from instructions further along) are all routed to the hazard unit simultaneously.
Pipeline Hazards
A hazard occurs when an instruction depends on a result that isn't yet available. Without handling, the processor produces wrong results.
Data Hazard (RAW)
Read After Write: instruction tries to read a register that a previous instruction hasn't written back yet. Most common hazard. Solved by forwarding or stalling.
Control Hazard
Branch decision not known until Execute stage. Instructions fetched after the branch (in IF and ID) may need to be flushed if the branch is taken.
Structural Hazard
Two instructions need the same hardware at the same time. Example: both Decode and Writeback use the register file. Handled by register file design (write on falling edge, read on rising edge).
Reading Pipeline Diagrams — Cycle-by-Cycle
Each row = one instruction. Each column = one clock cycle. Boxes show which stage the instruction is in. Blue arrows = data forwarding. Gaps = stalls.
In cycle 5: add is in WB (writing s8), sub is in EX (needs s8) → ForwardAE = 01 (forward from WB)
Data Hazard: The Classic Example
add s8, s4, s5 # writes s8 (available after EX, cycle 3) or s7, s7, s6 # no dependency on s8 sub s2, s3, s8 # reads s8 — needs forward from add's result
Only sub depends on s8. By the time sub reaches Execute (cycle 5), add is in Writeback (cycle 5). The hazard unit detects this and sets ForwardAE=01, routing the WB result to SrcA of the ALU in Execute.
Data Forwarding
Forwarding routes results directly from internal pipeline bus to the ALU inputs of the Execute stage — bypassing the register file. Eliminates most data hazards without stalls.
Forwarding Mux — 3-way selection
Both SrcA and SrcB of the ALU have a 3:1 mux (ForwardAE and ForwardBE). The hazard unit drives these based on register number comparisons.
// ForwardAE and ForwardBE encoding
Forwarding Logic Equations
if ((Rs1E == RdM) AND RegWriteM) AND (Rs1E != 0) // Case 1: MEM
ForwardAE = 10
else if ((Rs1E == RdW) AND RegWriteW) AND (Rs1E != 0) // Case 2: WB
ForwardAE = 01
else ForwardAE = 00 // Case 3: no hazard
// ForwardBE: same logic, replace Rs1E with Rs2E
Register x0 (zero register) is always 0 — it can never be written. If an instruction sources from x0, it should always get 0 from the register file, never forward a "written" value. The Rs1E != 0 condition prevents false forwarding to x0.
If Case 1 (MEM forward) fires, we use it even if Case 2 would also fire. The Memory-stage result is more recent (from the instruction immediately before the current one), so it's always correct. The if-else ordering enforces this priority.
Forwarding Worked Example
Cycle 5 state: Execute: sub s2, s3, s8 → Rs1E=s3, Rs2E=s8 Memory: or s7, s7, s6 → RdM=s7, RegWriteM=1 Writeback: add s8, s4, s5 → RdW=s8, RegWriteW=1 ForwardAE check for Rs1E=s3: Rs1E==RdM? s3==s7? NO Rs1E==RdW? s3==s8? NO → ForwardAE = 00 (use register file) ForwardBE check for Rs2E=s8: Rs2E==RdM? s8==s7? NO Rs2E==RdW? s8==s8? YES, RegWriteW=1, s8!=0 → ForwardBE = 01 (forward from Writeback)
The lw Stall — Load-Use Hazard
Forwarding cannot always help. When lw is immediately followed by an instruction that uses the loaded value, the data isn't ready until after Memory — one cycle too late for the next Execute.
Why forwarding fails for lw
The Problem
lw gets the data from memory in the Memory stage (cycle N+4). The next instruction needs it in Execute stage (cycle N+3). That's one cycle BEFORE the data exists — impossible to forward backwards in time.
The Solution: Stall
Freeze the Fetch and Decode stages for one cycle (insert a bubble/NOP into Execute). After one stall cycle, the lw data is available in the EX/MEM register and can be forwarded normally.
lw Stall Diagram
← The forwarding arrow in cycle 5→6: ReadData from lw's MEM forwarded to and's EX. One stall cycle added.
Stall Logic Equations
lwStall = ((Rs1D == RdE) OR (Rs2D == RdE)) AND ResultSrcE[0]
// Effect on pipeline
StallF = StallD = lwStall
FlushE = lwStall OR PCSrcE
// StallF=1: PC register doesn't update (re-fetches same instruction)
// StallD=1: IF/ID register doesn't update (Decode re-processes same instr)
// FlushE=1: ID/EX register cleared → NOP bubble inserted into Execute
ResultSrcE controls what gets written back: 01 = lw (memory read). Bit[0]=1 only when ResultSrcE=01 or 11 — effectively detecting a lw instruction in the Execute stage. Rs1D/Rs2D are the source registers of the instruction in Decode. RdE is the destination register of the instruction in Execute. If they match AND it's a lw in Execute → stall!
Branch (Control) Hazards
The branch target and taken/not-taken decision are resolved in the Execute stage. But by that time, 2 instructions have been fetched after the branch — and they must be flushed if the branch is taken.
Branch Resolution Timeline
Branch resolved in EX (cycle 3). PCSrcE=1 → flush sub and or. 2-cycle branch misprediction penalty.
Flush Logic Equations
PCSrcE = (BranchE AND ZeroE) OR JumpE
// Flush the instructions fetched after the branch
FlushD = PCSrcE // clear IF/ID register (instruction in Decode)
FlushE = lwStall OR PCSrcE // clear ID/EX register (instruction in Execute)
// When PCSrcE=1: PC gets PCTargetE, pipeline goes to correct path
Two instructions were fetched after the branch (one in IF, one in Decode at cycle 3). Both get flushed. Each flushed instruction = 1 wasted cycle. So branch penalty = 2 cycles → branch CPI = 1 + 2 = 3 (if always mispredicted). If branch is NOT taken, no flush needed, CPI = 1.
Complete Hazard Logic Reference
The Hazard Unit sits outside the pipeline and monitors register numbers and control signals to generate stall/flush/forward signals. This is the most exam-tested component.
// ForwardAE (for SrcA of ALU in Execute)
if ((Rs1E == RdM) AND RegWriteM) AND (Rs1E != 0)
ForwardAE = 10 // forward from Memory
else if ((Rs1E == RdW) AND RegWriteW) AND (Rs1E != 0)
ForwardAE = 01 // forward from Writeback
else ForwardAE = 00 // use register file
// ForwardBE: identical, use Rs2E instead of Rs1E
// ═══════════ LOAD-USE STALL ═══════════
lwStall = ((Rs1D == RdE) OR (Rs2D == RdE)) AND ResultSrcE[0]
StallF = StallD = lwStall
// ═══════════ CONTROL HAZARD FLUSH ═══════════
PCSrcE = (BranchE AND ZeroE) OR JumpE
FlushD = PCSrcE
FlushE = lwStall OR PCSrcE
Hazard Unit Inputs and Outputs
| Signal | Direction | Description |
|---|---|---|
| Rs1E, Rs2E | Input | Source register numbers in Execute stage |
| RdM | Input | Destination register in Memory stage |
| RdW | Input | Destination register in Writeback stage |
| RegWriteM | Input | Will Memory stage write to register file? |
| RegWriteW | Input | Will Writeback stage write to register file? |
| Rs1D, Rs2D | Input | Source register numbers in Decode stage |
| RdE | Input | Destination register in Execute stage |
| ResultSrcE[0] | Input | Is instruction in Execute a lw? (bit 0 of ResultSrc) |
| PCSrcE, ZeroE | Input | Branch taken? (from ALU zero and control) |
| ForwardAE, ForwardBE | Output | Select forwarded value for ALU SrcA, SrcB |
| StallF, StallD | Output | Freeze PC and IF/ID register |
| FlushD, FlushE | Output | Clear IF/ID or ID/EX pipeline register (insert NOP) |
Flushing sets all control signals in the pipeline register to 0 — effectively making the pipeline register act as a NOP (no operation). RegWrite=0, MemWrite=0, Branch=0 etc. The instruction bits still travel through but produce no architectural effect.
Pipelined Performance
The pipelined clock period is determined by the slowest pipeline stage. The Execute stage (with the ALU and forwarding mux chain) is usually the bottleneck.
Critical Path: Execute Stage
In the Execute stage, data passes through: ForwardAE mux (3-way) → ALUSrcE mux (select reg or imm for SrcB) → ForwardBE mux → PCTargetE mux. Each mux adds tmux=30ps to the path. The AND-OR gate computes PCSrcE from BranchE AND ZeroE.
Decode + Writeback = Half-Cycle Each
Both Decode and Writeback use the register file in the same cycle: Decode reads, Writeback writes. The register file is designed to write on the falling edge and read on the rising edge, so each effectively gets Tc/2. The formula for these stages:
CPI Calculation with Hazards
SPECINT2000 Assumptions (textbook)
- 25% loads, 10% stores, 13% branches, 52% R-type
- 40% of loads used by immediately next instruction → stall
- 50% of branches mispredicted → 2-cycle penalty
// lw: 60% no stall (CPI=1), 40% stall (CPI=2)
CPI_lw = 1(0.6) + 2(0.4) = 1.4
// beq: 50% not taken (CPI=1), 50% taken (CPI=1+2=3)
CPI_beq = 1(0.5) + 3(0.5) = 2.0
// Other instructions: CPI=1 (no hazards)
CPI_sw = CPI_R = 1.0
// Weighted average CPI
CPI_avg = (0.25)(1.4) + (0.10)(1.0) + (0.13)(2.0) + (0.52)(1.0)
= 0.35 + 0.10 + 0.26 + 0.52 = 1.23
// Execution time (100B instructions)
Time = 100×10⁹ × 1.23 × 350×10⁻¹² = 43 seconds
Performance Summary Table
| Processor | Tc | CPI | Time (100B instr) | Speedup |
|---|---|---|---|---|
| Single-Cycle | 750 ps | 1.0 | 75 s | 1× (baseline) |
| Pipelined | 350 ps | 1.23 | 43 s | 1.75× faster |
The clock period drops from 750 ps to 350 ps (2.14× faster clock). Even though CPI rises from 1.0 to 1.23 due to hazards, the faster clock wins: 75s → 43s. The lesson: a faster clock period beats a slightly worse CPI.
Pipeline Diagram Tracing Skills
The most common 2-mark exam question: draw/trace which stages each instruction is in per clock cycle, including stalls and flushes. Master this process.
Process: Code → Diagram
1. Write instructions vertically. Add cycle columns (1, 2, 3...).
2. By default each instruction enters the pipeline one cycle after the previous: I1 at cycle 1, I2 at cycle 2, etc.
3. For each instruction, default pipeline: IM→RF→EX→DM→WB across 5 consecutive cycles.
4. Check for load-use hazard: if lw is followed immediately by an instruction using the loaded register → insert 1 stall bubble after lw's ID stage. Push subsequent instructions 1 cycle right.
5. Check for branch: mark EX stage of beq with * (decision point). If taken: mark next 2 instructions as FLUSH. Insert branch target instruction from cycle of beq's EX+1.
6. Mark forwarding arrows between stages where data is passed.
Worked Trace: Q10 from Exam Images
Code: add s8, s4, s5 / or s7, s7, s6 / sub s2, s3, s8
Cycle 5: sub is in EX, add is in WB. sub needs s8 (written by add). ForwardBE=01 (from WB). or doesn't need s8 so no hazard for or. No stalls needed here!
Worked Trace: lw stall
Code: lw s7, 4(s0) / and s5, s5, s7
✦ = lw data available here (end of MEM, cycle 4). ↑ = forwarded into EX in cycle 5. One stall inserted. Execution takes 1 extra cycle.
Worked Trace: beq taken
Code: beq s0,s1,target / sub s8,t1,s3 / or s9,t6,s5 / ... / target: add s7,s3,s4
★ = branch resolved in cycle 3 EX. FlushD=1 (clears sub in Decode), FlushE=1 (clears or in Execute). add fetched in cycle 4. 2-cycle penalty.
Practice MCQs — Click to Reveal
Modelled on the Q2, Q10, Q14, Q15 patterns from your past papers. Click to show each answer.
Sub: Rs1E=s3, Rs2E=s8. RdM=s7 (or), RdW=s8 (add).
ForwardAE: Rs1E(s3)==RdM(s7)? NO. Rs1E(s3)==RdW(s8)? NO. → 00 (use RF).
ForwardBE: Rs2E(s8)==RdM(s7)? NO. Rs2E(s8)==RdW(s8)? YES + RegWriteW=1 + s8≠0 → 01 (forward from WB).
CPI_lw = 1(0.6) + 2(0.4) = 1.4
CPI_beq = 1(0.5) + 3(0.5) = 2.0
CPI_avg = 0.25(1.4) + 0.10(1) + 0.13(2) + 0.52(1) = 0.35+0.10+0.26+0.52 = 1.23
0000_0000_1001_0100_0000_1010_0110_0011
op=1100011 (beq). B-type: imm = {inst[31], inst[7], inst[30:25], inst[11:8], 0}
= {0, 1, 000000, 0101, 0} = 0_1_000000_0101_0 = 0b0000_0000_1010_0 = 0x14 shifted... Actually imm[12:1] = {0,1,000000,0101} = 0b0_1_0000_0010_1 = 0x14 bytes = 20. Branch jumps to PC + 0x14. Among choices, 0x14 = 20 bytes forward. Answer: D) 0x14.
ForwardAE=10: Rs1E==RdM, RegWriteM=1 (forward from MEM)
ForwardAE=01: Rs1E==RdW, RegWriteW=1 (forward from WB)
lwStall: (Rs1D==RdE OR Rs2D==RdE) AND ResultSrcE[0] → StallF=StallD=1, FlushE=1
Branch penalty: 2 cycles (flush Decode + Execute stages)
FlushD = PCSrcE | FlushE = lwStall OR PCSrcE
PCSrcE = (BranchE AND ZeroE) OR JumpE
T_c_pipe = 40 + 4(30) + 120 + 20 + 50 = 350 ps (Execute stage bottleneck)
CPI_avg = 1.23 with textbook SPECINT2000 assumptions
Execution Time = 43 s for 100B instructions at 350 ps, CPI=1.23
