Ch 7: Single-Cycle Processor
RISC-V Microarchitecture — Datapath, Control Signals, Performance Analysis & Exam Practice
Microarchitecture Overview
Microarchitecture = how to implement an ISA in hardware. Same ISA (RISC-V) can have multiple microarchitectures with different performance/cost trade-offs.
Single-Cycle
Each instruction executes in exactly one clock cycle. Simple but slow — clock period limited by the longest instruction (lw).
Multicycle (not in syllabus)
Instruction broken into shorter steps. Each step takes one cycle. Faster clock, but more cycles per instruction.
Pipelined
Multiple instructions execute simultaneously in different stages. Best throughput, but hazards must be handled. (Page 2)
Performance Equation — The Most Important Formula
Execution Time = N × CPI × Tc, where N is the instruction count, CPI is cycles per instruction, and Tc is the clock period (seconds per cycle).
Single-Cycle CPI = 1
Every instruction, regardless of complexity, uses exactly 1 clock cycle. But Tc is long.
IPC = 1/CPI
Instructions per Cycle. Single-cycle IPC = 1. Pipelining aims to keep IPC near 1 with a faster clock.
Critical Path → Tc
Clock period is set by the longest combinational path through the datapath. For single-cycle, that's the lw instruction.
RISC-V Instructions We Implement
| Type | Instructions | Key Characteristic |
|---|---|---|
| R-type | add, sub, and, or, slt | Both operands from register file |
| I-type (Memory) | lw | ALU computes address; reads memory |
| S-type (Memory) | sw | ALU computes address; writes memory |
| B-type (Branch) | beq | Compares registers; may change PC |
| I-type (ALU) | addi | One operand is sign-extended immediate |
| J-type | jal | Jump and link; saves PC+4 to rd |
Building the Datapath
The datapath is built incrementally, starting with lw (the most complex instruction). All other instructions reuse this hardware with different control signals.
Key Datapath Components
Program Counter (PC)
32-bit register storing address of current instruction. Updated every cycle: PC ← PCNext. A mux selects between PC+4 (sequential) and PCTarget (taken branch or jal target).
Instruction Memory
Read-only during execution. Addressed by PC. Outputs 32-bit instruction word. In single-cycle, separate from data memory.
Register File
32 × 32-bit registers. Two read ports (A1→RD1, A2→RD2) and one write port (A3, WD3, WE3). Reads combinationally; writes on clock edge.
Sign Extend / ImmExt
Takes the immediate field from the instruction and sign-extends to 32 bits. ImmSrc[1:0] selects which format (I, S, B, J-type).
ALU
Performs arithmetic/logic. Controlled by ALUControl[2:0]. Outputs: ALUResult (32-bit) and Zero flag (used for branches).
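A minimal Python sketch of the ALU described above, assuming the ALUControl encodings from the ALU decoder table in this chapter (000 ADD, 001 SUB, 010 AND, 011 OR, 101 SLT). The names `alu`, `to_signed` and `MASK32` are illustrative, not part of the course material.

```python
MASK32 = 0xFFFFFFFF

def to_signed(x):
    """Interpret a 32-bit pattern as a two's-complement signed integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def alu(a, b, alu_control):
    """Return (ALUResult, Zero) for two 32-bit operands."""
    a &= MASK32
    b &= MASK32
    if alu_control == 0b000:    # ADD
        result = a + b
    elif alu_control == 0b001:  # SUB
        result = a - b
    elif alu_control == 0b010:  # AND
        result = a & b
    elif alu_control == 0b011:  # OR
        result = a | b
    elif alu_control == 0b101:  # SLT: signed less-than
        result = int(to_signed(a) < to_signed(b))
    else:
        raise ValueError(f"unsupported ALUControl {alu_control:03b}")
    result &= MASK32            # keep only the low 32 bits
    return result, int(result == 0)

print(alu(7, 7, 0b001))           # (0, 1): equal operands set Zero, as beq needs
print(alu(0xFFFFFFFF, 1, 0b101))  # (1, 0): SLT treats 0xFFFFFFFF as -1
```

The Zero output is computed for every operation, but only beq looks at it.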
Data Memory
For lw: reads data at address given by ALUResult. For sw: writes RD2 to that address. WE (write enable) controlled by MemWrite.
Critical Muxes & Their Control Signals
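// Control signal → MUX → Effect
ALUSrc: 0 → ALU second operand = RD2 (register); 1 → second operand = ImmExt (immediate)
ResultSrc: 00 → Result = ALUResult; 01 → Result = ReadData (data memory); 10 → Result = PC+4
PCSrc: 0 → PCNext = PC+4; 1 → PCNext = PCTarget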
lw Datapath — The Critical Path
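lw rd, imm(rs1). Data flows through ALL 6 elements:
1. Fetch (instruction memory)
2. rs1 → RD1 (register file read)
3. imm sign ext (ImmExt)
4. rs1 + imm = addr (ALU)
5. Read addr (data memory)
6. rd ← data (register file write)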
This is why lw sets the critical path. It uses EVERY component: PC, InstMem, RegFile read, ImmExt, ALU, DataMem, RegFile write.
In a single clock cycle, the processor must both fetch the instruction (from instruction memory) and read or write data (from data memory). With one shared memory these accesses would conflict, because the instruction fetch and lw's data read happen in the same cycle. This is why the single-cycle design uses separate instruction and data memories.
How Each Instruction Uses the Datapath
Same hardware, different control signals. Understanding this is key for the "which datapath is active?" and "what instruction is this?" exam questions.
lw rd, imm(rs1) — Load Word
Active signals: RegWrite=1, ALUSrc=1, MemWrite=0, ResultSrc=01, PCSrc=0, ALUControl=Add
PC→InstrMem→decode rs1,rd,imm → RegFile reads rs1 → ImmExt extends imm → ALU adds rs1+imm (address) → DataMem reads at address → result written to rd → PC=PC+4
sw rs2, imm(rs1) — Store Word
Active signals: RegWrite=0, ALUSrc=1, MemWrite=1, ResultSrc=XX, PCSrc=0, ALUControl=Add
RegFile reads rs1 AND rs2 → ImmExt extends imm (S-type, ImmSrc=01) → ALU adds rs1+imm (address) → DataMem writes RD2 (rs2's value) to address → No write-back to RegFile → PC=PC+4
add/sub/and/or rd, rs1, rs2 — R-type
Active signals: RegWrite=1, ALUSrc=0, MemWrite=0, ResultSrc=00, PCSrc=0
RegFile reads rs1 AND rs2 → ALU operates on RD1 and RD2 directly (ALUSrc=0, no immediate) → ALUResult written to rd → DataMem NOT accessed → PC=PC+4
beq rs1, rs2, offset — Branch Equal
Active signals: RegWrite=0, ALUSrc=0, MemWrite=0, ALUControl=Sub, Branch=1
RegFile reads rs1 and rs2 → ALU subtracts → Zero flag = 1 if rs1==rs2 → PCSrc = Branch AND Zero → If taken: PC=PCTarget=PC+ImmExt, else PC=PC+4. Two adders: one for PC+4, one for branch target.
addi rd, rs1, imm — Add Immediate
Active signals: RegWrite=1, ALUSrc=1, MemWrite=0, ResultSrc=00, PCSrc=0, ALUControl=Add
Same as R-type ADD but ALUSrc=1 (second operand is ImmExt, not a register). ImmSrc=00 (I-type format). Result written to rd.
jal rd, offset — Jump and Link
Active signals: RegWrite=1, Jump=1, ResultSrc=10, MemWrite=0
PCTarget = PC + ImmExt (J-type). PC is set to PCTarget unconditionally. rd ← PC+4 (return address). ResultSrc=10 selects PCPlus4 for the write-back value.
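PCSrc = (Branch AND Zero) OR Jump, not just Zero! For beq (Jump=0) this reduces to Branch AND Zero. If Branch=0 and Jump=0 (neither a branch nor a jump), PCSrc is always 0 even if Zero happens to be 1. For jal, Jump=1 forces PCSrc=1. In hardware this is an AND gate feeding an OR gate.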
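The walkthroughs above can be sketched as one Python function in which only the control signals differ between instructions. This is a simplified model, not the course's reference design: the name `single_cycle_step`, the dict-based data memory, and the `ALUSub` flag (standing in for ALUControl, with only ADD and SUB modeled) are assumptions for illustration. PCSrc is computed as (Branch AND Zero) OR Jump so that jal always takes PCTarget.

```python
MASK32 = 0xFFFFFFFF

def single_cycle_step(pc, regs, dmem, rs1, rs2, rd, imm_ext, ctrl):
    """Run one already-decoded instruction; return the next PC.

    regs: list of 32 ints. dmem: dict mapping byte address -> 32-bit word.
    ctrl: dict with RegWrite, ALUSrc, MemWrite, ResultSrc, Branch, Jump, ALUSub.
    Don't-care (X) signals can be passed as 0.
    """
    src_a = regs[rs1]
    write_data = regs[rs2]
    # ALUSrc mux: 0 selects RD2, 1 selects ImmExt
    src_b = imm_ext if ctrl["ALUSrc"] else write_data
    alu_result = (src_a - src_b if ctrl["ALUSub"] else src_a + src_b) & MASK32
    zero = alu_result == 0
    # Data memory: read at ALUResult, write RD2 there only when MemWrite=1
    read_data = dmem.get(alu_result, 0)
    if ctrl["MemWrite"]:
        dmem[alu_result] = write_data
    # ResultSrc mux: 00 ALUResult, 01 ReadData, 10 PC+4
    pc_plus4 = (pc + 4) & MASK32
    result = {0b00: alu_result, 0b01: read_data, 0b10: pc_plus4}[ctrl["ResultSrc"]]
    if ctrl["RegWrite"] and rd != 0:   # x0 is hardwired to zero in RISC-V
        regs[rd] = result
    # PCSrc logic and the PC mux
    pc_target = (pc + imm_ext) & MASK32
    pc_src = (ctrl["Branch"] and zero) or ctrl["Jump"]
    return pc_target if pc_src else pc_plus4

# lw x5, 8(x6) with x6 = 0x100 and the word 42 stored at address 0x108
regs = [0] * 32
regs[6] = 0x100
dmem = {0x108: 42}
lw = dict(RegWrite=1, ALUSrc=1, MemWrite=0, ResultSrc=0b01, Branch=0, Jump=0, ALUSub=0)
next_pc = single_cycle_step(0x1000, regs, dmem, rs1=6, rs2=0, rd=5, imm_ext=8, ctrl=lw)
print(hex(next_pc), regs[5])  # 0x1004 42
```

Swapping in the beq or jal control values changes which mux inputs are selected without touching the function body, which is the point of the shared datapath.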
The Control Unit
The control unit takes the opcode (and funct3, funct7) and generates all the control signals. It has two parts: Main Decoder and ALU Decoder.
Main Decoder — Opcode → Control Signals
The main decoder looks at bits [6:0] of the instruction (the opcode field).
| op[6:0] | Instruction | RegWrite | ImmSrc | ALUSrc | MemWrite | ResultSrc | Branch | ALUOp | Jump |
|---|---|---|---|---|---|---|---|---|---|
| 0000011 | lw | 1 | 00 | 1 | 0 | 01 | 0 | 00 | 0 |
| 0100011 | sw | 0 | 01 | 1 | 1 | XX | 0 | 00 | 0 |
| 0110011 | R-type | 1 | XX | 0 | 0 | 00 | 0 | 10 | 0 |
| 1100011 | beq | 0 | 10 | 0 | 0 | XX | 1 | 01 | 0 |
| 0010011 | I-type ALU (addi) | 1 | 00 | 1 | 0 | 00 | 0 | 10 | 0 |
| 1101111 | jal | 1 | 11 | X | 0 | 10 | 0 | XX | 1 |
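The main decoder table can be written directly as a lookup keyed by the opcode. The sketch below is one possible Python rendering; `MAIN_DECODER`, `FIELDS`, `main_decode` and the use of `None` for don't-care (X) entries are illustrative choices.

```python
X = None  # don't-care entry in the table

FIELDS = ("name", "RegWrite", "ImmSrc", "ALUSrc", "MemWrite",
          "ResultSrc", "Branch", "ALUOp", "Jump")

MAIN_DECODER = {
    0b0000011: ("lw",         1, 0b00, 1, 0, 0b01, 0, 0b00, 0),
    0b0100011: ("sw",         0, 0b01, 1, 1, X,    0, 0b00, 0),
    0b0110011: ("R-type",     1, X,    0, 0, 0b00, 0, 0b10, 0),
    0b1100011: ("beq",        0, 0b10, 0, 0, X,    1, 0b01, 0),
    0b0010011: ("I-type ALU", 1, 0b00, 1, 0, 0b00, 0, 0b10, 0),
    0b1101111: ("jal",        1, 0b11, X, 0, 0b10, 0, X,    1),
}

def main_decode(instr):
    """Return the control signals for a 32-bit instruction word."""
    opcode = instr & 0x7F  # op[6:0]
    row = MAIN_DECODER.get(opcode)
    if row is None:
        raise ValueError(f"opcode {opcode:07b} is not implemented")
    return dict(zip(FIELDS, row))

signals = main_decode(0x00940A63)  # low 7 bits are 1100011
print(signals["name"], signals["Branch"], signals["ALUOp"])  # beq 1 1
```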
ALU Decoder — ALUOp + funct3 + funct7 → ALUControl
| ALUOp | Instruction | funct3 | funct7[5] | ALUControl[2:0] | Operation |
|---|---|---|---|---|---|
| 00 | lw / sw | — | — | 000 | ADD |
| 01 | beq | — | — | 001 | SUB (check Zero) |
| 10 | R-type / I-type | 000 | 0 (ignored for addi) | 000 | ADD (add, addi) |
| 10 | R-type | 000 | 1 (with op[5]=1) | 001 | SUB (sub) |
| 10 | R-type / I-type | 111 | — | 010 | AND (and, andi) |
| 10 | R-type / I-type | 110 | — | 011 | OR (or, ori) |
| 10 | R-type / I-type | 010 | — | 101 | SLT (slt, slti) |
ALUOp is an intermediate signal from the main decoder to the ALU decoder. It avoids looking at funct3/funct7 when they don't matter (e.g. lw always adds, regardless of funct fields). Think of it as: main decoder says "what kind of ALU operation?" and ALU decoder says "specifically which one?"
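A short Python sketch of the ALU decoder as a function of ALUOp and the funct fields. The function name `alu_decode` is hypothetical. The `op_5` argument is an assumption following the usual textbook design, where SUB is selected only when both op[5] and funct7[5] are 1, so that an addi whose immediate happens to set instruction bit 30 still adds.

```python
def alu_decode(alu_op, funct3=0, funct7_5=0, op_5=1):
    """Map ALUOp (+ funct3, funct7[5], op[5]) to ALUControl[2:0]."""
    if alu_op == 0b00:          # lw / sw: always add to form the address
        return 0b000
    if alu_op == 0b01:          # beq: subtract and check Zero
        return 0b001
    if alu_op == 0b10:          # R-type or I-type ALU: look at funct fields
        if funct3 == 0b000:
            # sub only for R-type (op[5]=1) with funct7[5]=1, else add/addi
            return 0b001 if (op_5 and funct7_5) else 0b000
        if funct3 == 0b111:
            return 0b010        # and, andi
        if funct3 == 0b110:
            return 0b011        # or, ori
        if funct3 == 0b010:
            return 0b101        # slt, slti
    raise ValueError("combination not covered by the ALU decoder table")

print(format(alu_decode(0b10, funct3=0b110), "03b"))  # 011 (OR)
```

Note that for ALUOp 00 and 01 the funct arguments are never inspected, which mirrors the "don't matter" entries in the table.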
ImmSrc Encoding
| ImmSrc[1:0] | Type | Immediate Bits | Used by |
|---|---|---|---|
| 00 | I-type | inst[31:20] | lw, addi, jalr |
| 01 | S-type | inst[31:25], inst[11:7] | sw |
| 10 | B-type | inst[31], [7], [30:25], [11:8] | beq |
| 11 | J-type | inst[31], [19:12], [20], [30:21] | jal |
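The bit shuffling in the table is easier to check in code. The following Python sketch rebuilds the immediate for each ImmSrc value; the helper names `field`, `sign_extend` and `imm_ext` are illustrative, and the result is returned as a signed Python integer rather than a 32-bit pattern. The B-type and J-type cases append the implicit 0 as bit 0.

```python
def field(instr, hi, lo):
    """Extract instr[hi:lo] as an unsigned integer."""
    return (instr >> lo) & ((1 << (hi - lo + 1)) - 1)

def sign_extend(value, width):
    """Interpret the low `width` bits of value as two's complement."""
    sign = 1 << (width - 1)
    return (value & (sign - 1)) - (value & sign)

def imm_ext(instr, imm_src):
    """Return the sign-extended immediate selected by ImmSrc[1:0]."""
    if imm_src == 0b00:   # I-type: inst[31:20]
        return sign_extend(field(instr, 31, 20), 12)
    if imm_src == 0b01:   # S-type: inst[31:25], inst[11:7]
        return sign_extend((field(instr, 31, 25) << 5) | field(instr, 11, 7), 12)
    if imm_src == 0b10:   # B-type: inst[31], [7], [30:25], [11:8], then 0
        imm = ((field(instr, 31, 31) << 12) | (field(instr, 7, 7) << 11)
               | (field(instr, 30, 25) << 5) | (field(instr, 11, 8) << 1))
        return sign_extend(imm, 13)
    if imm_src == 0b11:   # J-type: inst[31], [19:12], [20], [30:21], then 0
        imm = ((field(instr, 31, 31) << 20) | (field(instr, 19, 12) << 12)
               | (field(instr, 20, 20) << 11) | (field(instr, 30, 21) << 1))
        return sign_extend(imm, 21)
    raise ValueError("ImmSrc must be a 2-bit value")

print(imm_ext(0x00940A63, 0b10))  # 20 (B-type offset in bytes)
print(imm_ext(0xFFC4A303, 0b00))  # -4 (I-type, inst[31:20] = 0xFFC)
```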
Reading Control Signal Tables — Exam Skill
The exam gives you signals and asks what instruction it is, or gives you code and asks which signals are active. Master this lookup table.
Instruction → signals:
1. Identify the instruction type from the mnemonic.
2. Look up its opcode in the main decoder table.
3. If R-type/I-type ALU, also look at funct3 and funct7[5] for ALUControl.
4. List all the active signals: RegWrite, ImmSrc, ALUSrc, MemWrite, ResultSrc, Branch/Jump, ALUControl.
Signals → instruction:
1. Check RegWrite, MemWrite to narrow down type.
2. Check ALUSrc: 0=R-type or beq, 1=lw, sw or addi (I/S-type).
3. Check ResultSrc: 01=lw, 10=jal, 00=ALU result.
4. Check Branch and Jump for beq/jal.
5. Use ALUControl + ImmSrc to pin down the exact instruction.
Worked Example: Identify Instruction from Signals
Given: RegWrite=1, ALUSrc=1, MemWrite=0, ResultSrc=01, Branch=0, Jump=0
Step 1: RegWrite=1 → writes to register file → not sw or beq.
Step 2: ALUSrc=1 → second ALU input is immediate → not R-type.
Step 3: ResultSrc=01 → result is ReadData from memory → this is lw!
Step 4: MemWrite=0 → confirms read (not write). Answer: lw
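The same elimination process can be sketched as a small Python function. It assumes only the six instruction classes from the main decoder table exist; the name `identify` and the order of the checks are illustrative, and don't-care signals can be passed as `None`.

```python
def identify(RegWrite, ALUSrc, MemWrite, ResultSrc, Branch, Jump):
    """Name the instruction class that produces these control signals."""
    if Jump:
        return "jal"
    if Branch:
        return "beq"
    if MemWrite:
        return "sw"
    if RegWrite and ResultSrc == 0b01:   # write-back value comes from memory
        return "lw"
    if RegWrite and ResultSrc == 0b00:   # write-back value is ALUResult
        return "I-type ALU" if ALUSrc else "R-type"
    raise ValueError("signals do not match any implemented instruction")

# The worked example above: RegWrite=1, ALUSrc=1, MemWrite=0, ResultSrc=01
print(identify(1, 1, 0, 0b01, 0, 0))  # lw
```

To tell add from sub, and, or, slt within the R-type class, ALUControl is still needed, as step 5 of the checklist says.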
Worked Example: Signals for "or s4, s5, s6"
or rd, rs1, rs2 — R-type with funct3=110
RegWrite=1 (writes rd), ImmSrc=XX (don't care, no immediate used), ALUSrc=0 (both operands from RF), MemWrite=0 (no memory), ResultSrc=00 (ALUResult to RF), Branch=0, Jump=0, ALUOp=10 → ALUControl=011 (OR, since funct3=110).
Worked Example: Signals for "beq s0, s1, target"
beq rs1, rs2, offset — B-type
RegWrite=0, ImmSrc=10 (B-type), ALUSrc=0 (both from RF), MemWrite=0, ResultSrc=XX, Branch=1, Jump=0, ALUControl=001 (SUB). Zero = 1 if s0==s1. PCSrc = Branch AND Zero = 1 AND Zero.
Single-Cycle Performance Analysis
The clock period must accommodate the slowest instruction. With single-cycle, CPI=1 always, but Tc is expensive.
Critical Path Formula
T_c = t_pcq + 2*t_mem + t_RFread + t_ALU + t_mux + t_RFsetup
lw accesses memory TWICE: once to fetch the instruction (instruction memory) and once to read data (data memory). Both happen in the same clock cycle, and both are on the critical path for lw.
Standard Component Delays (from textbook)
| Element | Parameter | Delay (ps) |
|---|---|---|
| Register (PC) clock-to-Q | tpcq_PC | 40 ps |
| Register setup time | tsetup | 50 ps |
| Multiplexer | tmux | 30 ps |
| AND-OR gate | tAND-OR | 20 ps |
| ALU | tALU | 120 ps |
| Decoder (Control Unit) | tdec | 25 ps |
| Extend unit | text | 35 ps |
| Memory read | tmem | 200 ps |
| Register file read | tRFread | 100 ps |
| Register file setup | tRFsetup | 60 ps |
Worked Calculation
// Calculate single-cycle clock period
T_c = t_pcq + 2*t_mem + t_RFread + t_ALU + t_mux + t_RFsetup
= 40 + 2(200) + 100 + 120 + 30 + 60
= 40 + 400 + 100 + 120 + 30 + 60
= 750 ps
For 100 billion instructions:
Execution Time = N × CPI × T_c
= (100 × 10^9) × 1 × (750 × 10^-12 s)
= 75 seconds
Single-cycle has CPI=1 which is optimal. But Tc=750ps is the problem. Pipelined reduces Tc to ~350ps while keeping average CPI near 1, achieving ~1.7× speedup.
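The worked calculation can be reproduced with a few lines of Python, which is handy when an exam changes one of the delays. The dictionary keys mirror the parameter names in the delay table; `DELAYS_PS`, `single_cycle_tc` and `execution_time_s` are names chosen here for illustration.

```python
DELAYS_PS = {
    "tpcq_PC": 40, "tsetup": 50, "tmux": 30, "tAND_OR": 20, "tALU": 120,
    "tdec": 25, "text": 35, "tmem": 200, "tRFread": 100, "tRFsetup": 60,
}

def single_cycle_tc(d):
    """Clock period in ps along the lw critical path."""
    return (d["tpcq_PC"] + 2 * d["tmem"] + d["tRFread"]
            + d["tALU"] + d["tmux"] + d["tRFsetup"])

def execution_time_s(n_instructions, cpi, tc_ps):
    """Execution Time = N x CPI x Tc, with Tc given in picoseconds."""
    return n_instructions * cpi * tc_ps * 1e-12

tc = single_cycle_tc(DELAYS_PS)
print(tc)                                          # 750
print(round(execution_time_s(100e9, 1, tc), 3))    # 75.0
```

Because t_mem appears twice, halving the memory delay to 100 ps in this model would cut Tc by 200 ps, more than any other single component could.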
Performance Comparison (textbook example)
| Processor | Tc | CPI | Execution Time (100B instr) | Speedup vs Single-Cycle |
|---|---|---|---|---|
| Single-Cycle | 750 ps | 1.0 | 75 s | 1× (baseline) |
| Multicycle | ~375 ps | 4.12 | 155 s | 0.5× (SLOWER!) |
| Pipelined | 350 ps | 1.23 | 43 s | 1.7× faster |
Practice MCQs
Based on the pattern of exam questions you've seen.
The B-type immediate is sign-extended and is already a byte offset. The instruction encodes bits [12:1] of the offset; bit 0 is not stored because it is always 0 (branch targets are at least 2-byte aligned). The extend unit appends that 0 when it rebuilds the immediate, so ImmExt is the byte offset itself and is added to PC directly, with no extra shift.
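From machine code 0x00940A63: op=1100011 (beq), funct3=000, rs1=01000 (x8), rs2=01001 (x9). Extract the imm fields → imm = {inst[31], inst[7], inst[30:25], inst[11:8], 0} = {0, 0, 000000, 1010, 0} = 20, so the branch target is PC+20. Decode carefully: the B-type immediate bits are not stored in order.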
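A quick way to check a hand decode of a B-type word is to split the fields in Python. This is only a checking aid; the function name `decode_b_type` is hypothetical, and it assumes the word really is a B-type instruction.

```python
def decode_b_type(instr):
    """Split a 32-bit B-type word into (opcode, funct3, rs1, rs2, imm)."""
    opcode = instr & 0x7F            # inst[6:0]
    funct3 = (instr >> 12) & 0x7     # inst[14:12]
    rs1 = (instr >> 15) & 0x1F       # inst[19:15]
    rs2 = (instr >> 20) & 0x1F       # inst[24:20]
    imm = ((((instr >> 31) & 0x1) << 12)     # imm[12]   = inst[31]
           | (((instr >> 7) & 0x1) << 11)    # imm[11]   = inst[7]
           | (((instr >> 25) & 0x3F) << 5)   # imm[10:5] = inst[30:25]
           | (((instr >> 8) & 0xF) << 1))    # imm[4:1]  = inst[11:8], imm[0] = 0
    if imm & (1 << 12):              # sign-extend from bit 12
        imm -= 1 << 13
    return opcode, funct3, rs1, rs2, imm

opcode, funct3, rs1, rs2, imm = decode_b_type(0x00940A63)
print(format(opcode, "07b"), funct3, rs1, rs2, imm)  # 1100011 0 8 9 20
```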
