RISC-V Vector Assembly
Complete Study Notes
Everything you need for your midterm – from SIMD fundamentals to RVV programming, memory modes, masking, and datapath behaviour.
Why Do We Need Vector Processing?
Modern workloads operate on enormous datasets. A traditional scalar processor grinds through them one element at a time; vector processing attacks them in bulk.
Multimedia & Graphics
Rendering every pixel of a 4K frame, applying colour transformations – each pixel is independent data. Process 8 at once instead of 8 iterations.
Machine Learning & AI
Neural network layers multiply huge matrices. Without vector units, a transformer inference would take seconds instead of milliseconds.
Scientific Computing
Simulating fluid dynamics, weather models, molecular dynamics – all involve applying the same arithmetic over millions of data points.
Signal Processing
FFTs, audio filters, radar processing – huge arrays of samples must be transformed in real time.
Scalar Approach – z[i] = x[i] + y[i]
10 elements → 10 loop iterations, 10 loads from x, 10 loads from y, 10 adds, 10 stores = 40+ memory/arithmetic operations.
Vector Approach – vfadd.vv
10 elements → 1 instruction (or 2 iterations if hardware fits 8 at once). Same result, fraction of the overhead.
Data-Level Parallelism & SIMD Landscape
Parallelism comes in three flavours. SIMD / Vector is the Data-Level kind: same operation, many data items, same cycle.
| Parallelism Type | Example Mechanism | Managed By | Granularity |
|---|---|---|---|
| Instruction-Level (ILP) | Pipelining, Out-of-Order | Hardware | Single instructions |
| Thread-Level (TLP) | Multicore, Hyperthreading | OS + Hardware | Threads / processes |
| Data-Level (DLP) | SIMD / Vector Units | CPU ISA | Many data elements |
SIMD Extensions Across Architectures
Intel / AMD (x86)
- MMX – 64-bit, early multimedia
- SSE – 128-bit, streaming SIMD
- AVX / AVX2 – 256-bit vectors
- AVX-512 – 512-bit on modern CPUs
ARM (Mobile / Server)
- NEON – 128-bit, Cortex-A media
- SVE / SVE2 – scalable lengths, Neoverse HPC
Historical
- Cray-1 / Cray-2 – 1970s–80s pioneering vector supercomputers with deeply pipelined vector units and large vector registers
- NEC SX Series – high-performance vector supercomputers
Open / Specialized
- RISC-V Vector (RVV) – open standard, flexible, scalable
- AltiVec – PowerPC SIMD, Apple G4/G5, IBM Power
- Cell BE – PlayStation 3 processor
Limitations of Traditional SIMD
Classic SIMD like x86 AVX works well but has hard-coded constraints that hurt portability and programmer experience.
Fixed Vector Width
AVX-512 registers are always 512 bits wide (16 × 32-bit elements). If your data is 300 32-bit elements, 18 full vector iterations cover 288 of them; the remaining 12 (the "tail") must be handled separately. The hardware cannot adapt.
Loop Tail Handling
When N is not a multiple of the vector width, you need extra scalar code (or more complex SIMD code) just for the last few elements. Easy to get wrong, always annoying.
Recompile for New Widths
A binary built for AVX (256-bit) still runs on an AVX-512 CPU, but it cannot use the wider registers without being rebuilt. You need separate builds, or runtime detection, for each target. Maintenance nightmare.
Portability Problems
AVX-512 intrinsics run only on CPUs that support AVX-512. Code becomes CPU-specific and breaks on older or different hardware.
RISC-V Vector Extension (RVV) uses a variable-length model: the same binary runs correctly on hardware with 128-bit registers all the way up to 512-bit or beyond. The hardware tells software how many elements it can process, and the software adapts automatically at runtime.
RVV vs Traditional SIMD – Side by Side
| Feature | Traditional SIMD (AVX etc.) | RISC-V Vector (RVV) |
|---|---|---|
| Vector width | Fixed at compile time | Variable – hardware decides |
| Portability | Limited (arch-specific) | High – one binary, all VLEN |
| Tail handling | Manual extra code | Automatic via vsetvli |
| Scalability | Limited | Excellent |
| ISA complexity | Many separate extensions | Single unified extension |
RVV Programming Model
RVV adds 32 vector registers (v0–v31) to the standard RISC-V register file, plus a set of control registers and a flexible configuration system.
Vector Register File Structure (VLEN = 256 bits, SEW = 32, so 8 elements per register)
v0 – v31: Vector Registers
32 vector registers, each VLEN bits wide. They hold multiple data elements (elements = VLEN / SEW). Unlike scalar registers, they are wide enough to hold entire chunks of your array.
v0: The Mask Register
v0 is special: it doubles as the mask register. Each bit of v0 controls whether the corresponding element lane is active (1) or inactive (0). Used for predicated execution.
The Five Key RVV Terms
These five concepts define how vector length is configured. Understand all five and how they relate; they are the most exam-tested area.
VLEN – Vector Length (bits)
The physical width of each vector register in bits. This is fixed by the chip designer when the silicon is made; it could be 128, 256, or 512 bits. Software cannot change it, but software can query it.
SEW – Selected Element Width
How many bits each individual element occupies: 8, 16, 32, or 64 bits. You choose this per operation. e.g., e32 means each element is a 32-bit integer.
LMUL – Length Multiplier
Groups multiple physical registers to form one logical "super-register". Valid values: 1, 2, 4, 8 (group more registers → process more elements) or 1/2, 1/4 (use a fraction of one register → more vector registers available but fewer elements). Default is LMUL = 1.
AVL – Application Vector Length
How many elements your program wants to process. This is what you pass to vsetvli as the requested count. The hardware will give you as many as it can (up to VLMAX).
VL – Vector Length (elements)
The number of elements the hardware actually processes in the current iteration. Set automatically by vsetvli as min(AVL, VLMAX). Also stored in the vl CSR.
Hardware: VLEN = 256 bits. Program uses: SEW = 32 bits, LMUL = 1.
→ VLMAX = 1 × 256 / 32 = 8 elements per iteration.
Application wants: AVL = 5 elements.
→ VL = min(5, 8) = 5. Hardware processes 5 elements; elements 5–7 are tail (inactive).
LMUL Deep Dive
LMUL = 1 (default)
v0 is one logical register. With VLEN=256, SEW=32 → 8 elements. You have 32 logical vector registers.
LMUL = 2
Two physical registers (e.g. v0+v1) act as one. With VLEN=256, SEW=32 → 16 elements per operation. But now only 16 logical registers are available.
LMUL = 1/2
Use half a register. Fewer elements per op (e.g. 4 instead of 8), but 64 "logical" narrower registers. Useful for mixing element widths.
The vsetvli Instruction
This is the most important RVV instruction. It configures the hardware for the upcoming vector operation and handles tail elements automatically.
Three Variants
| Variant | Syntax | When to Use |
|---|---|---|
| vsetvli | vsetvli rd, rs1, vtypei | AVL in a register, vtype as immediate (most common) |
| vsetivli | vsetivli rd, uimm, vtypei | AVL as a small unsigned immediate (uimm ≤ 31) |
| vsetvl | vsetvl rd, rs1, rs2 | Both AVL and vtype come from registers (fully dynamic) |
Anatomy of a vsetvli Instruction
rs1 = what you ask for (AVL). rd = what you get (VL). The hardware sets rd to min(AVL, VLMAX). Always use rd (t0 in examples) to know the actual batch size processed; don't assume you got all of AVL!
SEW Encoding Reference
| Encoding | SEW | Element Type |
|---|---|---|
| e8 | 8 bits | Byte / uint8 / int8 |
| e16 | 16 bits | Half / int16 |
| e32 | 32 bits | Word / int32 / float32 |
| e64 | 64 bits | Doubleword / int64 / float64 |
LMUL Encoding Reference
| Encoding | LMUL | Effect |
|---|---|---|
| m1 | 1 | 1 register per group (default) |
| m2 | 2 | 2 registers per group → 2× elements |
| m4 | 4 | 4 registers per group → 4× elements |
| m8 | 8 | 8 registers per group → 8× elements |
| mf2 | 1/2 | Half register → 1/2 the elements |
| mf4 | 1/4 | Quarter register → 1/4 the elements |
Tail & Mask Policies
Two orthogonal policies control what happens to vector elements that aren't actively computed: tail elements (beyond VL) and masked-off elements (where mask bit = 0).
Tail Elements – Beyond VL
Say VL = 5 and VLMAX = 8. Elements 0–4 are active. Elements 5–7 are the tail. The tail policy says what happens to those three slots in the destination register.
Use ta, ma (tail agnostic, mask agnostic) whenever you don't care about the tail/masked elements. This gives hardware freedom to optimise (e.g. skip zeroing). Use tu, mu (undisturbed) only when you explicitly need to preserve prior values; it forces the hardware to do extra work.
Masking ā Selective Lane Execution
Masks let you process only some elements in a vector register, based on a condition. The mask is stored in v0, one bit per element.
Policy Summary Table
| Policy Code | Name | Behaviour | Cost |
|---|---|---|---|
| ta | Tail Agnostic | Tail elements: hardware does whatever it wants | Faster |
| tu | Tail Undisturbed | Tail elements: preserved from destination register | Slower |
| ma | Mask Agnostic | Masked-off elements: hardware does whatever | Faster |
| mu | Mask Undisturbed | Masked-off elements: preserved from destination | Slower |
Vector Addition ā Complete Walkthrough
We implement z[i] = x[i] + y[i] for n=6 elements using RVV with VLMAX=4, watching the microarchitecture execute each step.
The C Function
```c
// Add two int32 arrays element-wise: z[i] = x[i] + y[i]
void vvadd_int32(size_t n, const int* x, const int* y, int* z) {
    for (size_t i = 0; i < n; i++) {
        z[i] = x[i] + y[i];
    }
}
```
Register Mapping
| Register | Contents |
|---|---|
| a0 | n – total number of elements |
| a1 | Pointer to array x (advances each iteration) |
| a2 | Pointer to array y (advances each iteration) |
| a3 | Pointer to array z (output, advances each iteration) |
| t0 | VL – actual elements processed this iteration (set by vsetvli) |
| v0, v1 | Vector regs for loaded chunks of x and y |
| v2 | Vector result (x chunk + y chunk) |
Input Data
```asm
x: .word 10, 20, 30, 40, 50, 60   # 6 × 32-bit integers, 24 bytes
y: .word 5, 15, 25, 35, 45, 55    # 6 × 32-bit integers, 24 bytes
z: .word 0, 0, 0, 0, 0, 0         # 6 × 32-bit output slots
n: .word 6                        # element count
```
The Full Assembly Routine
```asm
vvadd_int32:
    vsetvli  t0, a0, e32, m1, ta, ma  # t0 = min(a0, VLMAX); configure hw for 32-bit
    vle32.v  v0, (a1)                 # Load t0 elements from x into v0
    sub      a0, a0, t0               # a0 -= t0 (how many elements remain)
    slli     t0, t0, 2                # t0 = t0 * 4 (convert elements to bytes)
    add      a1, a1, t0               # Advance x pointer by byte offset
    vle32.v  v1, (a2)                 # Load VL elements from y into v1
    add      a2, a2, t0               # Advance y pointer
    vadd.vv  v2, v0, v1               # Element-wise add: v2[i] = v0[i] + v1[i]
    vse32.v  v2, (a3)                 # Store result chunk to z
    add      a3, a3, t0               # Advance z pointer
    bnez     a0, vvadd_int32          # Loop if elements remain
    ret                               # Done!
```
After vsetvli, t0 holds the number of elements processed. But memory pointers advance in bytes. Each int32 is 4 bytes = 2² bytes. Shifting left by 2 multiplies by 4: t0 × 4 = byte offset. This happens after decrementing a0 (which uses t0 as element count), so the order matters!
Iteration 1 – Step by Step
vsetvli t0, a0, e32, m1, ta, ma
a0 = 6 (AVL). VLMAX = 1 × 128 / 32 = 4 (assume VLEN=128). → t0 = min(6, 4) = 4. Hardware is now configured for 4 elements.
vle32.v v0, (a1)
Load 4 × 32-bit integers from memory at a1 (= &x[0]). → v0 = [10, 20, 30, 40]
sub a0, a0, t0 → a0 = 6 − 4 = 2
Decrement remaining count. 2 elements left after this batch.
slli t0, t0, 2 → t0 = 4 × 4 = 16 bytes
Convert element count to byte offset. 4 elements × 4 bytes each = 16 bytes.
add a1, a1, t0 → advance x pointer
a1 now points to x[4] (skipped 16 bytes = 4 elements).
vle32.v v1, (a2) → v1 = [5, 15, 25, 35]
Load 4 elements from y, then advance a2 by 16 bytes.
vadd.vv v2, v0, v1 → element-wise add
v2 = [10+5, 20+15, 30+25, 40+35] = [15, 35, 55, 75]
vse32.v v2, (a3) → store results
Write [15, 35, 55, 75] to z[0..3]. Advance a3 by 16 bytes.
bnez a0, vvadd_int32
a0 = 2 ≠ 0 → loop back!
Iteration 2 – Handling the "Tail"
On re-entry, a0 = 2. vsetvli is called again with AVL=2. Since VLMAX=4, VL = min(2,4) = 2. Hardware automatically processes only 2 elements; no extra code, no bounds checks.
Final Output
| Index | x[i] | y[i] | z[i] = x[i]+y[i] | Computed in |
|---|---|---|---|---|
| 0 | 10 | 5 | 15 | Iteration 1 |
| 1 | 20 | 15 | 35 | Iteration 1 |
| 2 | 30 | 25 | 55 | Iteration 1 |
| 3 | 40 | 35 | 75 | Iteration 1 |
| 4 | 50 | 45 | 95 | Iteration 2 |
| 5 | 60 | 55 | 115 | Iteration 2 |
Scalar: 6 loop iterations Ć (2 loads + 1 add + 1 store) = 24 operations.
RVV with VLMAX=4: 2 iterations × ~8 instructions ≈ 16 instructions, but the heavy lifting (loads/add/store) operates on 4 elements simultaneously. With VLMAX=8, this entire computation fits in 1 iteration!
Vector Memory Addressing Modes
RVV supports five addressing modes for loads and stores. Each has a different memory access pattern, performance profile, and use case.
Unit-Stride
Elements are packed sequentially in memory. The default and fastest mode.
vle32.v v1, (a0) # load
vse32.v v1, (a0) # store
Use case: Array processing, most loops
Strided
Elements spaced by a fixed byte stride. Stride stored in a register. Stride=0 broadcasts one value to all lanes.
vlse32.v v1, (a0), t0
vsse32.v v1, (a0), t0
Use case: Column of row-major matrix
Indexed (Gather/Scatter)
Each element loaded from an arbitrary offset stored in an index vector. Maximum flexibility, minimum cache friendliness.
vluxei32.v v1, (a0), vidx # gather
vsuxei32.v v1, (a0), vidx # scatter
Use case: Sparse matrices, permutations, hash lookups
Whole-Register
Transfers exactly N × VLEN bits. Bypasses VL and masking entirely. Always loads/stores the full register width.
vl4re32.v v4, (a0) # load 4 regs
vs1r.v v1, (a0) # store
Use case: Context save/restore (OS task switch)
Fault-Only-First
Like unit-stride, but if a memory fault occurs past element 0, no exception is raised; the hardware instead shrinks vl to the number of elements that loaded successfully. (A fault on element 0 still traps.) Check vl afterwards to see how many elements arrived valid.
```asm
vle32ff.v v1, (a0)   # fault-only-first load
csrr      t0, vl     # vl silently reduced if fault occurs mid-load
```
Use case: Null-terminated strings, unknown-length data
Indexed: Ordered vs Unordered
vluxei32.v / vsuxei32.v – Unordered
Hardware may reorder memory accesses for better performance. Faster, but only correct when accesses have no side-effects and no ordering dependency.
vloxei32.v / vsoxei32.v – Ordered
Accesses occur in element order (0, 1, 2, …). Required when memory-mapped I/O has side-effects or when ordering matters for correctness. Potentially slower.
Masking on Memory Instructions
All load/store modes (except whole-register) support the optional mask register v0. Append v0.t as a final operand to enable masking:
```asm
# Masked store: only write where v0[i] = 1
vse32.v v1, (a0), v0.t   # Memory positions where mask=0 are untouched!

# Masked load: only load where v0[i] = 1
vle32.v v1, (a0), v0.t   # Inactive lanes: follow mask policy (mu/ma)
```
| Use Case | Technique | Benefit |
|---|---|---|
| Boundary condition | Masked store | Don't write past end of array |
| Conditional update | Masked store | Only store where predicate is true |
| Sparse writes | Masked scatter | Write to selected positions only |
| Conditional load | Masked load | Load only valid elements |
Strided: Loading a Matrix Column
```asm
# 4×4 matrix stored row-major: M[0][0] M[0][1] M[0][2] M[0][3] | M[1][0] ...
# Row width = 4 floats × 4 bytes = 16 bytes
li       t0, 16                  # stride = 16 bytes
vsetvli  t1, a3, e32, m1, ta, ma
vlse32.v v1, (a0), t0            # v1 = [M[0][0], M[1][0], M[2][0], M[3][0]]
                                 # whole column in one instruction!
```
Midterm Exam Preparation
These questions target exactly the kind of datapath/microarchitecture MCQ questions common in computer architecture exams.

Q: In VLMAX = 1 × 256 / 32 = 8, what does each number represent?
- 1 = LMUL (register group size)
- 256 = VLEN (register width in bits)
- 32 = SEW (element size in bits)
So 8 elements of 32 bits fit in one 256-bit vector register.

Q: Why is slli t0, t0, 2 executed AFTER sub a0, a0, t0?
A: sub a0, a0, t0 needs t0 to be in elements (to decrement the element count). After the sub, t0 is converted to bytes (×4) for pointer arithmetic. If the slli happened first, a0 would be decremented by bytes, not elements: catastrophically wrong.

Q: What does vsetvli return in its destination register (rd)?
A: min(AVL, VLMAX). This is the amount you actually processed; use it to advance pointers and decrement your counter.

Q: With AVL = 15 and VLMAX = 16, what is VL?
A: VL = min(15, 16) = 15. All 15 elements are processed in one iteration.

Q: What is the difference between tail undisturbed and tail agnostic?
- Tail Undisturbed (tu): destination bits beyond VL are guaranteed to keep their old value.
- Tail Agnostic (ta): hardware can write 1s, keep old values, or anything; the program must not rely on them. ta is faster (hardware has freedom).

Q: How do you enable masking on a vector instruction?
A: Append v0.t as the final operand; only lanes whose bit in v0 is 1 are active.

Q: What is the difference between vluxei32.v and vloxei32.v?
- vlux (unordered): hardware may reorder accesses → faster.
- vlox (ordered): accesses happen in element order → needed when memory side-effects or ordering correctness matters (e.g. memory-mapped I/O).

Q: Trace VL for n = 6 with VLMAX = 4.
- Iteration 1: VL = min(6,4) = 4. Process elements 0–3. a0 = 6−4 = 2.
- Iteration 2: VL = min(2,4) = 2. Process elements 4–5. a0 = 2−2 = 0. Loop exits.

Q: What is the trade-off of using LMUL = 8?
A: You only have 32/8 = 4 logical vector registers instead of 32. Fewer registers means more register pressure and potential spilling.

Q: What stride loads a column of an N-column row-major matrix of 32-bit elements?
A: To jump from M[r][j] to M[r+1][j], you skip one entire row = N elements × 4 bytes/element. Example: 8-column matrix → stride = 8 × 4 = 32 bytes. You'd use li t0, 32; vlse32.v v1, (a0), t0.

Based on your course focus, pay special attention to: VLMAX calculation (LMUL × VLEN / SEW), VL = min(AVL, VLMAX), why slli is needed and when it executes relative to sub, what rd holds after vsetvli, tail vs mask policies, and the difference between unit-stride, strided, and indexed modes. These are the datapath-level details most likely to appear as MCQs.
Quick Reference – Common Instructions
| Instruction | Operation | Notes |
|---|---|---|
| vsetvli rd, rs1, eN, mM, tp, mp | Configure VL and vtype | Returns actual VL in rd |
| vle32.v vd, (rs1) | Unit-stride load 32-bit | Loads VL elements |
| vse32.v vs3, (rs1) | Unit-stride store 32-bit | Stores VL elements |
| vlse32.v vd, (rs1), rs2 | Strided load | rs2 = byte stride |
| vsse32.v vs3, (rs1), rs2 | Strided store | rs2 = byte stride |
| vluxei32.v vd, (rs1), vs2 | Indexed unordered load | vs2 holds byte offsets |
| vsuxei32.v vs3, (rs1), vs2 | Indexed unordered store | vs2 holds byte offsets |
| vle32ff.v vd, (rs1) | Fault-only-first load | VL reduced on fault |
| vl1re32.v vd, (rs1) | Whole-register load (1 reg) | Bypasses VL/mask |
| vadd.vv vd, vs2, vs1 | Vector + Vector | vd[i] = vs2[i] + vs1[i] |
| vadd.vx vd, vs2, rs1 | Vector + Scalar | vd[i] = vs2[i] + rs1 |
| vfadd.vv vd, vs2, vs1 | FP vector + vector | Floating-point add |
