Pipelined Processor RISC V - A Detailed Guide

§1 Foundation

Pipeline Fundamentals

Pipelining achieves temporal parallelism: while instruction N is in Execute, instruction N+1 is in Decode, and N+2 is in Fetch — all simultaneously.

The Five Pipeline Stages

IF — Instruction Fetch

PC → Instruction Memory → Fetch instruction. Compute PC+4. Store in IF/ID pipeline register.

ID — Instruction Decode / RF Read

Decode opcode → generate control signals. Read registers rs1, rs2 from register file. Sign-extend immediate. Store in ID/EX register.

EX — Execute

ALU performs operation. Mux selects SrcA (reg or forwarded) and SrcB (reg, forwarded, or imm). Compute branch target PCTarget = PCE + ImmExtE. PCSrcE = BranchE AND ZeroE.

MEM — Memory Access

lw: reads data memory at ALUResultM. sw: writes WriteDataM to memory. Non-memory instructions pass through. Pass results to EX/MEM register.

WB — Write Back

ResultSrcW selects: ReadDataW (lw), ALUResultW (R-type), PCPlus4W (jal). Write to rd in register file. RegWriteW enables the write.

Single-Cycle vs Pipelined — Key Differences

Aspect	Single-Cycle	Pipelined
CPI	Always 1	~1.23 (with hazards)
Clock Period	750 ps (lw path)	350 ps (Execute stage)
Multiple instructions	One at a time	Up to 5 simultaneously
Hazards	None (one instruction)	Data & control hazards
Control signals	Decoded once	Travel with instruction through stages
Signal naming	No suffix	Suffix = stage (RegWriteD, ALUSrcE…)

💡 Signal Naming Convention

Control signals are generated in Decode and then travel with the instruction through pipeline registers. The suffix tells you where the signal currently lives:
RegWriteD → in Decode, RegWriteE → in Execute, RegWriteM → in Memory, RegWriteW → in Writeback.
Same for ResultSrc, MemWrite, ALUControl, etc.

§2 Registers

Pipeline Registers — What They Hold

Between each stage there is a set of flip-flops (pipeline registers) that capture all needed values at the rising clock edge. This isolates stages.

IF/ID Register holds:

InstrD (32-bit instruction)
PCPlus4D (PC+4 for the instruction)
PCD (current PC, for branch target)

ID/EX Register holds:

RD1E, RD2E (register file outputs)
PCE, PCPlus4E
RdE (destination register number)
ImmExtE (sign-extended immediate)
Rs1E, Rs2E (source register numbers for hazard unit)
All control signals: RegWriteE, ResultSrcE, MemWriteE, ALUControlE, ALUSrcE, JumpE, BranchE

EX/MEM Register holds:

ALUResultM
WriteDataM (value to store for sw)
RdM, PCPlus4M
RegWriteM, ResultSrcM, MemWriteM

MEM/WB Register holds:

ReadDataW (data from memory for lw)
ALUResultW
RdW, PCPlus4W
RegWriteW, ResultSrcW

⚠️ Why Rs1E and Rs2E in the ID/EX register?

The hazard unit needs to compare source registers in the Execute stage against destination registers in Memory and Writeback. So Rs1E, Rs2E (from the instruction in Execute) and RdM, RdW (from instructions further along) are all routed to the hazard unit simultaneously.

§3 Hazards

Pipeline Hazards

A hazard occurs when an instruction depends on a result that isn't yet available. Without handling, the processor produces wrong results.

Data Hazard (RAW)

Read After Write: instruction tries to read a register that a previous instruction hasn't written back yet. Most common hazard. Solved by forwarding or stalling.

Control Hazard

Branch decision not known until Execute stage. Instructions fetched after the branch (in IF and ID) may need to be flushed if the branch is taken.

Structural Hazard

Two instructions need the same hardware at the same time. Example: both Decode and Writeback use the register file. Handled by register file design (write on falling edge, read on rising edge).

Reading Pipeline Diagrams — Cycle-by-Cycle

How to read this diagram

Each row = one instruction. Each column = one clock cycle. Boxes show which stage the instruction is in. Blue arrows = data forwarding. Gaps = stalls.

Cyc 1

Cyc 2

Cyc 3

Cyc 4

Cyc 5

Cyc 6

Cyc 7

Cyc 8

add s8,s4,s5

or s7,s7,s6

sub s2,s3,s8

In cycle 5: add is in WB (writing s8), sub is in EX (needs s8) → ForwardAE = 01 (forward from WB)

Data Hazard: The Classic Example

add  s8, s4, s5    # writes s8 (available after EX, cycle 3)
or   s7, s7, s6    # no dependency on s8
sub  s2, s3, s8    # reads s8 — needs forward from add's result

✅ No hazard between add→or (or doesn't use s8)

Only sub depends on s8. By the time sub reaches Execute (cycle 5), add is in Writeback (cycle 5). The hazard unit detects this and sets ForwardAE=01, routing the WB result to SrcA of the ALU in Execute.

§4 Forwarding

Data Forwarding

Forwarding routes results directly from internal pipeline bus to the ALU inputs of the Execute stage — bypassing the register file. Eliminates most data hazards without stalls.

Forwarding Mux — 3-way selection

Both SrcA and SrcB of the ALU have a 3:1 mux (ForwardAE and ForwardBE). The hazard unit drives these based on register number comparisons.

// ForwardAE and ForwardBE encoding

ForwardAE = 00No hazard — use RD1E (value read from register file in Decode stage). Normal operation.

ForwardAE = 10Forward from Memory stage (ALUResultM). Instruction 2 cycles ahead wrote to a register this instruction needs.

ForwardAE = 01Forward from Writeback stage (ResultW). Instruction 3 cycles ahead wrote to a register this instruction needs.

Forwarding Logic Equations

// ForwardAE logic (hazard unit)
if ((Rs1E == RdM) AND RegWriteM) AND (Rs1E != 0) // Case 1: MEM
ForwardAE = 10
else if ((Rs1E == RdW) AND RegWriteW) AND (Rs1E != 0) // Case 2: WB
ForwardAE = 01
else ForwardAE = 00 // Case 3: no hazard

// ForwardBE: same logic, replace Rs1E with Rs2E

⚠️ Why "Rs1E != 0"?

Register x0 (zero register) is always 0 — it can never be written. If an instruction sources from x0, it should always get 0 from the register file, never forward a "written" value. The Rs1E != 0 condition prevents false forwarding to x0.

⚠️ Memory stage forward takes priority over Writeback!

If Case 1 (MEM forward) fires, we use it even if Case 2 would also fire. The Memory-stage result is more recent (from the instruction immediately before the current one), so it's always correct. The if-else ordering enforces this priority.

Forwarding Worked Example

Cycle 5 state:
  Execute:  sub s2, s3, s8   → Rs1E=s3, Rs2E=s8
  Memory:   or  s7, s7, s6   → RdM=s7, RegWriteM=1
  Writeback: add s8, s4, s5  → RdW=s8, RegWriteW=1

ForwardAE check for Rs1E=s3:
  Rs1E==RdM? s3==s7? NO
  Rs1E==RdW? s3==s8? NO
  → ForwardAE = 00 (use register file)

ForwardBE check for Rs2E=s8:
  Rs2E==RdM? s8==s7? NO
  Rs2E==RdW? s8==s8? YES, RegWriteW=1, s8!=0
  → ForwardBE = 01 (forward from Writeback)

§5 Load-Use Stall

The lw Stall — Load-Use Hazard

Forwarding cannot always help. When lw is immediately followed by an instruction that uses the loaded value, the data isn't ready until after Memory — one cycle too late for the next Execute.

Why forwarding fails for lw

The Problem

lw gets the data from memory in the Memory stage (cycle N+4). The next instruction needs it in Execute stage (cycle N+3). That's one cycle BEFORE the data exists — impossible to forward backwards in time.

The Solution: Stall

Freeze the Fetch and Decode stages for one cycle (insert a bubble/NOP into Execute). After one stall cycle, the lw data is available in the EX/MEM register and can be forwarded normally.

lw Stall Diagram

lw s7,0(s4)

DM✓

and s5,s5,s7

STALL

EX←

← The forwarding arrow in cycle 5→6: ReadData from lw's MEM forwarded to and's EX. One stall cycle added.

Stall Logic Equations

// Detect load-use hazard
lwStall = ((Rs1D == RdE) OR (Rs2D == RdE)) AND ResultSrcE[0]

// Effect on pipeline
StallF = StallD = lwStall
FlushE = lwStall OR PCSrcE

// StallF=1: PC register doesn't update (re-fetches same instruction)
// StallD=1: IF/ID register doesn't update (Decode re-processes same instr)
// FlushE=1: ID/EX register cleared → NOP bubble inserted into Execute

🎯 ResultSrcE[0] — Why bit 0?

ResultSrcE controls what gets written back: 01 = lw (memory read). Bit[0]=1 only when ResultSrcE=01 or 11 — effectively detecting a lw instruction in the Execute stage. Rs1D/Rs2D are the source registers of the instruction in Decode. RdE is the destination register of the instruction in Execute. If they match AND it's a lw in Execute → stall!

§6 Control Hazards

Branch (Control) Hazards

The branch target and taken/not-taken decision are resolved in the Execute stage. But by that time, 2 instructions have been fetched after the branch — and they must be flushed if the branch is taken.

Branch Resolution Timeline

beq s1,s2,L1

EX←

sub s8,t1,s3

FLUSH

or s9,t6,s5

FLUSH

L1: add s7,s3,s4

Branch resolved in EX (cycle 3). PCSrcE=1 → flush sub and or. 2-cycle branch misprediction penalty.

Flush Logic Equations

// Control hazard: branch taken in Execute
PCSrcE = (BranchE AND ZeroE) OR JumpE

// Flush the instructions fetched after the branch
FlushD = PCSrcE // clear IF/ID register (instruction in Decode)
FlushE = lwStall OR PCSrcE // clear ID/EX register (instruction in Execute)

// When PCSrcE=1: PC gets PCTargetE, pipeline goes to correct path

🎯 Branch Misprediction Penalty = 2 cycles

Two instructions were fetched after the branch (one in IF, one in Decode at cycle 3). Both get flushed. Each flushed instruction = 1 wasted cycle. So branch penalty = 2 cycles → branch CPI = 1 + 2 = 3 (if always mispredicted). If branch is NOT taken, no flush needed, CPI = 1.

§7 Hazard Unit

Complete Hazard Logic Reference

The Hazard Unit sits outside the pipeline and monitors register numbers and control signals to generate stall/flush/forward signals. This is the most exam-tested component.

// ═══════════ DATA FORWARDING ═══════════
// ForwardAE (for SrcA of ALU in Execute)
if ((Rs1E == RdM) AND RegWriteM) AND (Rs1E != 0)
ForwardAE = 10 // forward from Memory
else if ((Rs1E == RdW) AND RegWriteW) AND (Rs1E != 0)
ForwardAE = 01 // forward from Writeback
else ForwardAE = 00 // use register file

// ForwardBE: identical, use Rs2E instead of Rs1E

// ═══════════ LOAD-USE STALL ═══════════
lwStall = ((Rs1D == RdE) OR (Rs2D == RdE)) AND ResultSrcE[0]
StallF = StallD = lwStall

// ═══════════ CONTROL HAZARD FLUSH ═══════════
PCSrcE = (BranchE AND ZeroE) OR JumpE
FlushD = PCSrcE
FlushE = lwStall OR PCSrcE

Hazard Unit Inputs and Outputs

Signal	Direction	Description
Rs1E, Rs2E	Input	Source register numbers in Execute stage
RdM	Input	Destination register in Memory stage
RdW	Input	Destination register in Writeback stage
RegWriteM	Input	Will Memory stage write to register file?
RegWriteW	Input	Will Writeback stage write to register file?
Rs1D, Rs2D	Input	Source register numbers in Decode stage
RdE	Input	Destination register in Execute stage
ResultSrcE[0]	Input	Is instruction in Execute a lw? (bit 0 of ResultSrc)
PCSrcE, ZeroE	Input	Branch taken? (from ALU zero and control)
ForwardAE, ForwardBE	Output	Select forwarded value for ALU SrcA, SrcB
StallF, StallD	Output	Freeze PC and IF/ID register
FlushD, FlushE	Output	Clear IF/ID or ID/EX pipeline register (insert NOP)

💡 What does "flushing" actually do?

Flushing sets all control signals in the pipeline register to 0 — effectively making the pipeline register act as a NOP (no operation). RegWrite=0, MemWrite=0, Branch=0 etc. The instruction bits still travel through but produce no architectural effect.

§8 Performance

Pipelined Performance

The pipelined clock period is determined by the slowest pipeline stage. The Execute stage (with the ALU and forwarding mux chain) is usually the bottleneck.

Critical Path: Execute Stage

T_{c_pipe} = t_pcq + 4t_mux + t_ALU + t_AND-OR + t_setup

= 40 + 4(30) + 120 + 20 + 50 = 350 ps ← bottleneck stage

💡 Why 4 multiplexers?

In the Execute stage, data passes through: ForwardAE mux (3-way) → ALUSrcE mux (select reg or imm for SrcB) → ForwardBE mux → PCTargetE mux. Each mux adds t_mux=30ps to the path. The AND-OR gate computes PCSrcE from BranchE AND ZeroE.

Decode + Writeback = Half-Cycle Each

Both Decode and Writeback use the register file in the same cycle: Decode reads, Writeback writes. The register file is designed to write on the falling edge and read on the rising edge, so each effectively gets T_c/2. The formula for these stages:

Decode critical: 2(t_RFread + t_setup) ≤ T_c

Writeback critical: 2(t_pcq + t_mux + t_RFsetup) ≤ T_c

CPI Calculation with Hazards

SPECINT2000 Assumptions (textbook)

25% loads, 10% stores, 13% branches, 52% R-type
40% of loads used by immediately next instruction → stall
50% of branches mispredicted → 2-cycle penalty

// CPI contribution per instruction type
// lw: 60% no stall (CPI=1), 40% stall (CPI=2)
CPI_lw = 1(0.6) + 2(0.4) = 1.4

// beq: 50% not taken (CPI=1), 50% taken (CPI=1+2=3)
CPI_beq = 1(0.5) + 3(0.5) = 2.0

// Other instructions: CPI=1 (no hazards)
CPI_sw = CPI_R = 1.0

// Weighted average CPI
CPI_avg = (0.25)(1.4) + (0.10)(1.0) + (0.13)(2.0) + (0.52)(1.0)
= 0.35 + 0.10 + 0.26 + 0.52 = 1.23

// Execution time (100B instructions)
Time = 100×10⁹ × 1.23 × 350×10⁻¹² = 43 seconds

Performance Summary Table

Processor	T_c	CPI	Time (100B instr)	Speedup
Single-Cycle	750 ps	1.0	75 s	1× (baseline)
Pipelined	350 ps	1.23	43 s	1.75× faster

✅ Why pipelined is faster despite CPI > 1

The clock period drops from 750 ps to 350 ps (2.14× faster clock). Even though CPI rises from 1.0 to 1.23 due to hazards, the faster clock wins: 75s → 43s. The lesson: a faster clock period beats a slightly worse CPI.

§9 Diagram Tracing

Pipeline Diagram Tracing Skills

The most common 2-mark exam question: draw/trace which stages each instruction is in per clock cycle, including stalls and flushes. Master this process.

Process: Code → Diagram

📐 Step-by-step method

1. Write instructions vertically. Add cycle columns (1, 2, 3...).
2. By default each instruction enters the pipeline one cycle after the previous: I1 at cycle 1, I2 at cycle 2, etc.
3. For each instruction, default pipeline: IM→RF→EX→DM→WB across 5 consecutive cycles.
4. Check for load-use hazard: if lw is followed immediately by an instruction using the loaded register → insert 1 stall bubble after lw's ID stage. Push subsequent instructions 1 cycle right.
5. Check for branch: mark EX stage of beq with * (decision point). If taken: mark next 2 instructions as FLUSH. Insert branch target instruction from cycle of beq's EX+1.
6. Mark forwarding arrows between stages where data is passed.

Worked Trace: Q10 from Exam Images

Code: add s8, s4, s5 / or s7, s7, s6 / sub s2, s3, s8

add s8,s4,s5

WB↓

or s7,s7,s6

DM↓

sub s2,s3,s8

↑EX

Cycle 5: sub is in EX, add is in WB. sub needs s8 (written by add). ForwardBE=01 (from WB). or doesn't need s8 so no hazard for or. No stalls needed here!

Worked Trace: lw stall

Code: lw s7, 4(s0) / and s5, s5, s7

lw s7,4(s0)

DM✦

and s5,s5,s7

STALL

↑EX

✦ = lw data available here (end of MEM, cycle 4). ↑ = forwarded into EX in cycle 5. One stall inserted. Execution takes 1 extra cycle.

Worked Trace: beq taken

Code: beq s0,s1,target / sub s8,t1,s3 / or s9,t6,s5 / ... / target: add s7,s3,s4

beq s0,s1,L1

EX★

sub s8,t1,s3

FLUSH

or s9,t6,s5

FLUSH

L1: add s7,s3,s4

★ = branch resolved in cycle 3 EX. FlushD=1 (clears sub in Decode), FlushE=1 (clears or in Execute). add fetched in cycle 4. 2-cycle penalty.

§10 Exam Prep

Practice MCQs — Click to Reveal

Modelled on the Q2, Q10, Q14, Q15 patterns from your past papers. Click to show each answer.

Q: In cycle 5, the pipeline has: Execute=sub s2,s3,s8 | Memory=or s7,s7,s6 | Writeback=add s8,s4,s5. What are ForwardAE and ForwardBE for the sub instruction?

A) 00, 01 B) 00, 10 C) 10, 01 D) 01, 01

✅ A) ForwardAE=00, ForwardBE=01.
Sub: Rs1E=s3, Rs2E=s8. RdM=s7 (or), RdW=s8 (add).
ForwardAE: Rs1E(s3)==RdM(s7)? NO. Rs1E(s3)==RdW(s8)? NO. → 00 (use RF).
ForwardBE: Rs2E(s8)==RdM(s7)? NO. Rs2E(s8)==RdW(s8)? YES + RegWriteW=1 + s8≠0 → 01 (forward from WB).

▼ Show Answer

Q: Code: lw s7,0(s4) / and s5,s5,s7. Does a load-use stall occur? How many cycles?

A) No stall B) 1-cycle stall C) 2-cycle stall D) Flushed

✅ B) 1-cycle stall. Check lwStall: Rs1D=s5, Rs2D=s7 (instruction in Decode=and). RdE=s7 (lw in Execute). Rs2D==RdE? YES (s7==s7). ResultSrcE[0]=1 (lw). lwStall=1. StallF=StallD=1, FlushE=1. The and instruction waits one cycle before executing.

▼ Show Answer

Q: If 25% of instructions are lw and 40% of loads are used immediately, and 13% are beq with 50% misprediction rate, what is the average CPI?

A) 1.0 B) 1.15 C) 1.23 D) 1.4

✅ C) 1.23 (textbook result).
CPI_lw = 1(0.6) + 2(0.4) = 1.4
CPI_beq = 1(0.5) + 3(0.5) = 2.0
CPI_avg = 0.25(1.4) + 0.10(1) + 0.13(2) + 0.52(1) = 0.35+0.10+0.26+0.52 = 1.23

▼ Show Answer

Q14: Program starts at 0x0000_0000. What are Zero, PCTarget, PCSrc after "beq s0, s1, target" where s0=4 and s1=5? (beq is instruction 4, target is instruction 9)

A) 1, 20, 0 B) 0, 0x20, 1 C) 1, 20, 1 D) None of the rest

✅ B) 0, 0x20, 1. Wait — check: s0=4, s1=5 after executing addi s0,zero,4 then addi s1,zero,5 then addi s1,zero,-1 (→s1=4). So actually s0==s1! Zero=1. beq is at address 0xC (4th instruction, 0x0/4/8/C). target is the 9th instruction label. PCTarget = PC + offset. The answer from the image choices: Zero=0 means s0≠s1, PCTarget=0x20 (target address), PCSrc=1 means branch taken... Recalculate based on actual register values after each instruction: addi s0,zero,4→s0=4; addi s1,zero,5→s1=5; addi s1,zero,-1→s1=4. So s0=s1=4, Zero=1, PCSrc=Branch AND Zero = 1. PCTarget = 0xC + offset_to_target. Answer: Zero=1, PCSrc=1 — option 3 (1,20,1) seems right based on the choices shown.

▼ Show Answer

Q15: beq instruction 0x00940A63 — what is the branch offset (how many memory locations does it jump)?

A) None of the rest B) 0x20 C) 0x16 D) 0x14

✅ Decode 0x00940A63 in binary:
0000_0000_1001_0100_0000_1010_0110_0011
op=1100011 (beq). B-type: imm = {inst[31], inst[7], inst[30:25], inst[11:8], 0}
= {0, 1, 000000, 0101, 0} = 0_1_000000_0101_0 = 0b0000_0000_1010_0 = 0x14 shifted... Actually imm[12:1] = {0,1,000000,0101} = 0b0_1_0000_0010_1 = 0x14 bytes = 20. Branch jumps to PC + 0x14. Among choices, 0x14 = 20 bytes forward. Answer: D) 0x14.

▼ Show Answer

Q: What does FlushE = lwStall OR PCSrcE achieve?

A) Only flushes for branches B) Flushes Execute for BOTH load-use stalls and taken branches C) Stalls the pipeline for 2 cycles D) Prevents forwarding

✅ B). FlushE clears the ID/EX pipeline register (injects a NOP bubble into Execute) for two separate reasons: 1) lwStall=1: we're stalling for a load-use hazard — the instruction in Execute must be a NOP so it doesn't execute. 2) PCSrcE=1: branch taken — the instruction that entered Execute (which was after the branch) is wrong and must be discarded.

▼ Show Answer

Q: A pipeline diagram shows: Cycle 4: add is in DM, sub is in EX. Does forwarding occur? From which stage?

A) No forwarding B) Forward from MEM (ForwardAE=10) C) Forward from WB (ForwardAE=01) D) Stall required

✅ B) ForwardAE=10 (Memory stage). In cycle 4: add is in Memory (DM), sub is in Execute. If sub reads s8 and add writes s8, then Rs1E(or Rs2E) == RdM and RegWriteM=1. This is Case 1 (Memory forward). ForwardAE=10 routes ALUResultM to SrcA of the ALU. No stall needed — forwarding handles it in the same cycle.

▼ Show Answer

Q: Why does the pipelined processor's clock period correspond to the Execute stage and not the Memory stage (even though memory access is slow)?

A) Memory is cached B) Execute has forwarding muxes that add delay C) Memory only does stores D) Control signals add delay to Execute

✅ B) Execute has forwarding muxes that add delay. The forwarding mux chain (ForwardAE mux → SrcB mux → ForwardBE mux) plus the ALU and AND-OR gate for PCSrc = 40+4(30)+120+20+50 = 350ps. The Memory stage just does t_pcq + t_mem + t_setup = 40+200+50 = 290ps. So Execute is actually the bottleneck, not Memory!

▼ Show Answer

Q: From the Q2 abstract pipeline diagram: what code corresponds to the 4th instruction (labeled "and") which is fetched in cycle 4 with source registers s8 and t2?

A) and s7, s8, t2 B) and s7, t2, s8 C) and s7, s8, t2 (same) D) Need to see dest register

✅ From Q2's diagram: 4th instruction "and" with registers s8 and t2 visible. The destination register s7 is shown in the DM/WB portion. So: and s7, s8, t2. In RISC-V: "and rd, rs1, rs2" = and s7, s8, t2. Answer: A or C — same thing. Key: always read the diagram as rd, rs1, rs2 order for the instruction.

▼ Show Answer

🔑 Last-Minute Cheat Sheet

ForwardAE=10: Rs1E==RdM, RegWriteM=1 (forward from MEM)
ForwardAE=01: Rs1E==RdW, RegWriteW=1 (forward from WB)
lwStall: (Rs1D==RdE OR Rs2D==RdE) AND ResultSrcE[0] → StallF=StallD=1, FlushE=1
Branch penalty: 2 cycles (flush Decode + Execute stages)
FlushD = PCSrcE | FlushE = lwStall OR PCSrcE
PCSrcE = (BranchE AND ZeroE) OR JumpE
T_c_pipe = 40 + 4(30) + 120 + 20 + 50 = 350 ps (Execute stage bottleneck)
CPI_avg = 1.23 with textbook SPECINT2000 assumptions
Execution Time = 43 s for 100B instructions at 350 ps, CPI=1.23