RISC-V Vector Assembly
Complete Study Notes
Everything you need for your midterm – from SIMD fundamentals to RVV programming, memory modes, masking, and datapath behaviour.
Why Do We Need Vector Processing?
Modern workloads operate on enormous datasets. A traditional scalar processor grinds through them one element at a time; vector processing attacks them in bulk.
Multimedia & Graphics
Rendering every pixel of a 4K frame, applying colour transformations – each pixel is independent data. Process 8 at once instead of 8 iterations.
Machine Learning & AI
Neural network layers multiply huge matrices. Without vector units, a transformer inference would take seconds instead of milliseconds.
Scientific Computing
Simulating fluid dynamics, weather models, molecular dynamics – all involve applying the same arithmetic over millions of data points.
Signal Processing
FFTs, audio filters, radar processing – huge arrays of samples must be transformed in real time.
Scalar Approach – z[i] = x[i] + y[i]
10 elements → 10 loop iterations, 10 loads from x, 10 loads from y, 10 adds, 10 stores = 40+ memory/arithmetic operations.
Vector Approach – vfadd.vv
10 elements → 1 instruction (or 2 iterations if hardware fits 8 at once). Same result, fraction of the overhead.
Data-Level Parallelism & SIMD Landscape
Parallelism comes in three flavours. SIMD / Vector is the Data-Level kind: same operation, many data items, same cycle.
| Parallelism Type | Example Mechanism | Managed By | Granularity |
|---|---|---|---|
| Instruction-Level (ILP) | Pipelining, Out-of-Order | Hardware | Single instructions |
| Thread-Level (TLP) | Multicore, Hyperthreading | OS + Hardware | Threads / processes |
| Data-Level (DLP) | SIMD / Vector Units | CPU ISA | Many data elements |
SIMD Extensions Across Architectures
Intel / AMD (x86)
- MMX – 64-bit, early multimedia
- SSE – 128-bit, streaming SIMD
- AVX / AVX2 – 256-bit vectors
- AVX-512 – 512-bit on modern CPUs
ARM (Mobile / Server)
- NEON – 128-bit, Cortex-A media
- SVE / SVE2 – scalable lengths, Neoverse HPC
Historical
- Cray-1 / Cray-2 – 1970s–80s pioneering vector supercomputers with deeply pipelined vector units and large vector registers
- NEC SX Series – high-performance vector supercomputers
Open / Specialized
- RISC-V Vector (RVV) – open standard, flexible, scalable
- AltiVec – PowerPC SIMD, Apple G4/G5, IBM Power
- Cell BE – PlayStation 3 processor
Limitations of Traditional SIMD
Classic SIMD like x86 AVX works well but has hard-coded constraints that hurt portability and programmer experience.
Fixed Vector Width
AVX-512 registers are always 512 bits wide (16 × 32-bit elements). If your data is 300 32-bit elements, 18 full vector iterations cover 288 of them; the remaining 12 (the "tail") must be handled separately. The hardware cannot adapt.
Loop Tail Handling
When N is not a multiple of the vector width, you need extra scalar code (or more complex SIMD code) just for the last few elements. Easy to get wrong, always annoying.
Recompile for New Widths
A binary built for AVX (256-bit) still runs on an AVX-512 CPU, but it cannot use the wider registers without being rebuilt. You need separate builds, or runtime detection, for each target. Maintenance nightmare.
Portability Problems
AVX-512 intrinsics run only on CPUs that support AVX-512. Code becomes CPU-specific and breaks on older or different hardware.
RISC-V Vector Extension (RVV) uses a variable-length model: the same binary runs correctly on hardware with 128-bit registers all the way up to 512-bit or beyond. The hardware tells software how many elements it can process, and the software adapts automatically at runtime.
RVV vs Traditional SIMD – Side by Side
| Feature | Traditional SIMD (AVX etc.) | RISC-V Vector (RVV) |
|---|---|---|
| Vector width | Fixed at compile time | Variable – hardware decides |
| Portability | Limited (arch-specific) | High – one binary, all VLEN |
| Tail handling | Manual extra code | Automatic via vsetvli |
| Scalability | Limited | Excellent |
| ISA complexity | Many separate extensions | Single unified extension |
RVV Programming Model
RVV adds 32 vector registers (v0–v31) to the standard RISC-V register file, plus a set of control registers and a flexible configuration system.
Vector Register File Structure (VLEN = 256 bits, SEW = 32, so 8 elements per register)
v0 – v31: Vector Registers
32 vector registers, each VLEN bits wide. They hold multiple data elements (elements = VLEN / SEW). Unlike scalar registers, they are wide enough to hold entire chunks of your array.
v0: The Mask Register
v0 is special: it doubles as the mask register. Each bit of v0 controls whether the corresponding element lane is active (1) or inactive (0). Used for predicated execution.
The Five Key RVV Terms
These five concepts define how vector length is configured. Understand all five and how they relate; they are the most exam-tested area.
VLEN – Vector Length (bits)
The physical width of each vector register in bits. This is fixed by the chip designer when the silicon is made; it could be 128, 256, or 512 bits. Software cannot change it, but software can query it.
SEW – Selected Element Width
How many bits each individual element occupies: 8, 16, 32, or 64 bits. You choose this per operation. e.g., e32 means each element is a 32-bit integer.
LMUL – Length Multiplier
Groups multiple physical registers to form one logical "super-register". Valid values: 1, 2, 4, 8 (group more registers → process more elements) or 1/2, 1/4 (use a fraction of one register → more vector registers available but fewer elements). Default is LMUL = 1.
AVL – Application Vector Length
How many elements your program wants to process. This is what you pass to vsetvli as the requested count. The hardware will give you as many as it can (up to VLMAX).
VL – Vector Length (elements)
The number of elements the hardware actually processes in the current iteration. Set automatically by vsetvli as min(AVL, VLMAX). Also stored in the vl CSR.
Hardware: VLEN = 256 bits. Program uses: SEW = 32 bits, LMUL = 1.
→ VLMAX = 1 × 256 / 32 = 8 elements per iteration.
Application wants: AVL = 5 elements.
→ VL = min(5, 8) = 5. Hardware processes 5 elements; elements 5–7 are tail (inactive).
LMUL Deep Dive
LMUL = 1 (default)
v0 is one logical register. With VLEN=256, SEW=32 → 8 elements. You have 32 logical vector registers.
LMUL = 2
Two physical registers (e.g. v0+v1) act as one. With VLEN=256, SEW=32 → 16 elements per operation. But now only 16 logical registers are available.
LMUL = 1/2
Use half a register. Fewer elements per op (e.g. 4 instead of 8), but 64 "logical" narrower registers. Useful for mixing element widths.
The vsetvli Instruction
This is the most important RVV instruction. It configures the hardware for the upcoming vector operation and handles tail elements automatically.
Three Variants
| Variant | Syntax | When to Use |
|---|---|---|
| vsetvli | vsetvli rd, rs1, vtypei | AVL in a register, vtype as immediate (most common) |
| vsetivli | vsetivli rd, uimm, vtypei | AVL as a small unsigned immediate (uimm ≤ 31) |
| vsetvl | vsetvl rd, rs1, rs2 | Both AVL and vtype come from registers (fully dynamic) |
Anatomy of a vsetvli Instruction
rs1 = what you ask for (AVL). rd = what you get (VL). The hardware sets rd to min(AVL, VLMAX). Always use rd (t0 in examples) to know the actual batch size processed; don't assume you got all of AVL!
SEW Encoding Reference
| Encoding | SEW | Element Type |
|---|---|---|
| e8 | 8 bits | Byte / uint8 / int8 |
| e16 | 16 bits | Half / int16 |
| e32 | 32 bits | Word / int32 / float32 |
| e64 | 64 bits | Doubleword / int64 / float64 |
LMUL Encoding Reference
| Encoding | LMUL | Effect |
|---|---|---|
| m1 | 1 | 1 register per group (default) |
| m2 | 2 | 2 registers per group → 2× elements |
| m4 | 4 | 4 registers per group → 4× elements |
| m8 | 8 | 8 registers per group → 8× elements |
| mf2 | 1/2 | Half register → 1/2 the elements |
| mf4 | 1/4 | Quarter register → 1/4 the elements |
Tail & Mask Policies
Two orthogonal policies control what happens to vector elements that aren't actively computed: tail elements (beyond VL) and masked-off elements (where mask bit = 0).
Tail Elements – Beyond VL
Say VL = 5 and VLMAX = 8. Elements 0–4 are active. Elements 5–7 are the tail. The tail policy says what happens to those three slots in the destination register.
Use ta, ma (tail agnostic, mask agnostic) whenever you don't care about the tail/masked elements. This gives hardware freedom to optimise (e.g. skip zeroing). Use tu, mu (undisturbed) only when you explicitly need to preserve prior values; it forces the hardware to do extra work.
Masking ā Selective Lane Execution
Masks let you process only some elements in a vector register, based on a condition. The mask is stored in v0, one bit per element.
Policy Summary Table
| Policy Code | Name | Behaviour | Cost |
|---|---|---|---|
| ta | Tail Agnostic | Tail elements: hardware does whatever it wants | Faster |
| tu | Tail Undisturbed | Tail elements: preserved from destination register | Slower |
| ma | Mask Agnostic | Masked-off elements: hardware does whatever | Faster |
| mu | Mask Undisturbed | Masked-off elements: preserved from destination | Slower |
Vector Addition ā Complete Walkthrough
We implement z[i] = x[i] + y[i] for n=6 elements using RVV with VLMAX=4, watching the microarchitecture execute each step.
The C Function
```c
// Add two int32 arrays element-wise: z[i] = x[i] + y[i]
void vvadd_int32(size_t n, const int* x, const int* y, int* z) {
    for (size_t i = 0; i < n; i++) {
        z[i] = x[i] + y[i];
    }
}
```
Register Mapping
| Register | Contents |
|---|---|
| a0 | n – total number of elements |
| a1 | Pointer to array x (advances each iteration) |
| a2 | Pointer to array y (advances each iteration) |
| a3 | Pointer to array z (output, advances each iteration) |
| t0 | VL – actual elements processed this iteration (set by vsetvli) |
| v0, v1 | Vector regs for loaded chunks of x and y |
| v2 | Vector result (x chunk + y chunk) |
Input Data
```asm
x: .word 10, 20, 30, 40, 50, 60   # 6 × 32-bit integers, 24 bytes
y: .word 5, 15, 25, 35, 45, 55    # 6 × 32-bit integers, 24 bytes
z: .word 0, 0, 0, 0, 0, 0         # 6 × 32-bit output slots
n: .word 6                        # element count
```
The Full Assembly Routine
```asm
vvadd_int32:
    vsetvli  t0, a0, e32, m1, ta, ma  # t0 = min(a0, VLMAX); configure hw for 32-bit
    vle32.v  v0, (a1)                 # Load t0 elements from x into v0
    sub      a0, a0, t0               # a0 -= t0 (how many elements remain)
    slli     t0, t0, 2                # t0 = t0 * 4 (convert elements to bytes)
    add      a1, a1, t0               # Advance x pointer by byte offset
    vle32.v  v1, (a2)                 # Load VL elements from y into v1
    add      a2, a2, t0               # Advance y pointer
    vadd.vv  v2, v0, v1               # Element-wise add: v2[i] = v0[i] + v1[i]
    vse32.v  v2, (a3)                 # Store result chunk to z
    add      a3, a3, t0               # Advance z pointer
    bnez     a0, vvadd_int32          # Loop if elements remain
    ret                               # Done!
```
After vsetvli, t0 holds the number of elements processed. But memory pointers advance in bytes. Each int32 is 4 bytes = 2² bytes. Shifting left by 2 multiplies by 4: t0 × 4 = byte offset. This happens after decrementing a0 (which uses t0 as element count), so the order matters!
Iteration 1 – Step by Step
vsetvli t0, a0, e32, m1, ta, ma
a0 = 6 (AVL). VLMAX = 1 × 128 / 32 = 4 (assume VLEN=128). → t0 = min(6, 4) = 4. Hardware is now configured for 4 elements.
vle32.v v0, (a1)
Load 4 × 32-bit integers from memory at a1 (= &x[0]). → v0 = [10, 20, 30, 40]
sub a0, a0, t0 → a0 = 6 − 4 = 2
Decrement remaining count. 2 elements left after this batch.
slli t0, t0, 2 → t0 = 4 × 4 = 16 bytes
Convert element count to byte offset. 4 elements × 4 bytes each = 16 bytes.
add a1, a1, t0 → advance x pointer
a1 now points to x[4] (skipped 16 bytes = 4 elements).
vle32.v v1, (a2) → v1 = [5, 15, 25, 35]
Load 4 elements from y, then advance a2 by 16 bytes.
vadd.vv v2, v0, v1 → element-wise add
v2 = [10+5, 20+15, 30+25, 40+35] = [15, 35, 55, 75]
vse32.v v2, (a3) → store results
Write [15, 35, 55, 75] to z[0..3]. Advance a3 by 16 bytes.
bnez a0, vvadd_int32
a0 = 2 ≠ 0 → loop back!
Iteration 2 – Handling the "Tail"
On re-entry, a0 = 2. vsetvli is called again with AVL=2. Since VLMAX=4, VL = min(2,4) = 2. Hardware automatically processes only 2 elements; no extra code, no bounds checks.
Final Output
| Index | x[i] | y[i] | z[i] = x[i]+y[i] | Computed in |
|---|---|---|---|---|
| 0 | 10 | 5 | 15 | Iteration 1 |
| 1 | 20 | 15 | 35 | Iteration 1 |
| 2 | 30 | 25 | 55 | Iteration 1 |
| 3 | 40 | 35 | 75 | Iteration 1 |
| 4 | 50 | 45 | 95 | Iteration 2 |
| 5 | 60 | 55 | 115 | Iteration 2 |
Scalar: 6 loop iterations Ć (2 loads + 1 add + 1 store) = 24 operations.
RVV with VLMAX=4: 2 iterations × ~8 instructions ≈ 16 instructions, but the heavy lifting (loads/add/store) operates on 4 elements simultaneously. With VLMAX=8, this entire computation fits in 1 iteration!
Vector Memory Addressing Modes
RVV supports five addressing modes for loads and stores. Each has a different memory access pattern, performance profile, and use case.
Unit-Stride
Elements are packed sequentially in memory. The default and fastest mode.
vle32.v v1, (a0) # load
vse32.v v1, (a0) # store
Use case: Array processing, most loops
Strided
Elements spaced by a fixed byte stride. Stride stored in a register. Stride=0 broadcasts one value to all lanes.
vlse32.v v1, (a0), t0
vsse32.v v1, (a0), t0
Use case: Column of row-major matrix
Indexed (Gather/Scatter)
Each element loaded from an arbitrary offset stored in an index vector. Maximum flexibility, minimum cache friendliness.
vluxei32.v v1, (a0), vidx # gather
vsuxei32.v v1, (a0), vidx # scatter
Use case: Sparse matrices, permutations, hash lookups
Whole-Register
Transfers exactly N × VLEN bits. Bypasses VL and masking entirely. Always loads/stores the full register width.
vl4re32.v v4, (a0) # load 4 regs
vs1r.v v1, (a0) # store
Use case: Context save/restore (OS task switch)
Fault-Only-First
Like unit-stride, but if a memory fault occurs past element 0, no exception is raised; the hardware instead shrinks vl to the number of elements that loaded successfully. (A fault on element 0 still traps.) Check vl afterwards to see how many elements arrived valid.
```asm
vle32ff.v v1, (a0)   # fault-only-first load
csrr      t0, vl     # vl silently reduced if fault occurs mid-load
```
Use case: Null-terminated strings, unknown-length data
Indexed: Ordered vs Unordered
vluxei32.v / vsuxei32.v – Unordered
Hardware may reorder memory accesses for better performance. Faster, but only correct when accesses have no side-effects and no ordering dependency.
vloxei32.v / vsoxei32.v – Ordered
Accesses occur in element order (0, 1, 2, …). Required when memory-mapped I/O has side-effects or when ordering matters for correctness. Potentially slower.
Masking on Memory Instructions
All load/store modes (except whole-register) support the optional mask register v0. Append v0.t as a final operand to enable masking:
```asm
# Masked store: only write where v0[i] = 1
vse32.v v1, (a0), v0.t   # Memory positions where mask=0 are untouched!

# Masked load: only load where v0[i] = 1
vle32.v v1, (a0), v0.t   # Inactive lanes: follow mask policy (mu/ma)
```
| Use Case | Technique | Benefit |
|---|---|---|
| Boundary condition | Masked store | Don't write past end of array |
| Conditional update | Masked store | Only store where predicate is true |
| Sparse writes | Masked scatter | Write to selected positions only |
| Conditional load | Masked load | Load only valid elements |
Strided: Loading a Matrix Column
```asm
# 4×4 matrix stored row-major: M[0][0] M[0][1] M[0][2] M[0][3] | M[1][0] ...
# Row width = 4 floats × 4 bytes = 16 bytes
li       t0, 16                  # stride = 16 bytes
vsetvli  t1, a3, e32, m1, ta, ma
vlse32.v v1, (a0), t0            # v1 = [M[0][0], M[1][0], M[2][0], M[3][0]]
                                 # whole column in one instruction!
```
Midterm Exam Preparation
These questions target exactly the kind of datapath/microarchitecture MCQ questions common in computer architecture exams.

Q: In VLMAX = 1 × 256 / 32 = 8, what does each number represent?
- 1 = LMUL (register group size)
- 256 = VLEN (register width in bits)
- 32 = SEW (element size in bits)
So 8 elements of 32 bits fit in one 256-bit vector register.

Q: Why is slli t0, t0, 2 executed AFTER sub a0, a0, t0?
A: sub a0, a0, t0 needs t0 to be in elements (to decrement the element count). After the sub, t0 is converted to bytes (×4) for pointer arithmetic. If the slli happened first, a0 would be decremented by bytes, not elements: catastrophically wrong.

Q: What does vsetvli return in its destination register (rd)?
A: min(AVL, VLMAX). This is the amount you actually processed; use it to advance pointers and decrement your counter.

Q: With AVL = 15 and VLMAX = 16, what is VL?
A: VL = min(15, 16) = 15. All 15 elements are processed in one iteration.

Q: What is the difference between tail undisturbed and tail agnostic?
- Tail Undisturbed (tu): destination bits beyond VL are guaranteed to keep their old value.
- Tail Agnostic (ta): hardware can write 1s, keep old values, or anything; the program must not rely on them. ta is faster (hardware has freedom).

Q: How do you enable masking on a vector instruction?
A: Append v0.t as the final operand; only lanes whose bit in v0 is 1 are active.

Q: What is the difference between vluxei32.v and vloxei32.v?
- vlux (unordered): hardware may reorder accesses → faster.
- vlox (ordered): accesses happen in element order → needed when memory side-effects or ordering correctness matters (e.g. memory-mapped I/O).

Q: Trace VL for n = 6 with VLMAX = 4.
- Iteration 1: VL = min(6,4) = 4. Process elements 0–3. a0 = 6−4 = 2.
- Iteration 2: VL = min(2,4) = 2. Process elements 4–5. a0 = 2−2 = 0. Loop exits.

Q: What is the trade-off of using LMUL = 8?
A: You only have 32/8 = 4 logical vector registers instead of 32. Fewer registers means more register pressure and potential spilling.

Q: What stride loads a column of an N-column row-major matrix of 32-bit elements?
A: To jump from M[r][j] to M[r+1][j], you skip one entire row = N elements × 4 bytes/element. Example: 8-column matrix → stride = 8 × 4 = 32 bytes. You'd use li t0, 32; vlse32.v v1, (a0), t0.

Based on your course focus, pay special attention to: VLMAX calculation (LMUL × VLEN / SEW), VL = min(AVL, VLMAX), why slli is needed and when it executes relative to sub, what rd holds after vsetvli, tail vs mask policies, and the difference between unit-stride, strided, and indexed modes. These are the datapath-level details most likely to appear as MCQs.
Quick Reference – Common Instructions
| Instruction | Operation | Notes |
|---|---|---|
| vsetvli rd, rs1, eN, mM, tp, mp | Configure VL and vtype | Returns actual VL in rd |
| vle32.v vd, (rs1) | Unit-stride load 32-bit | Loads VL elements |
| vse32.v vs3, (rs1) | Unit-stride store 32-bit | Stores VL elements |
| vlse32.v vd, (rs1), rs2 | Strided load | rs2 = byte stride |
| vsse32.v vs3, (rs1), rs2 | Strided store | rs2 = byte stride |
| vluxei32.v vd, (rs1), vs2 | Indexed unordered load | vs2 holds byte offsets |
| vsuxei32.v vs3, (rs1), vs2 | Indexed unordered store | vs2 holds byte offsets |
| vle32ff.v vd, (rs1) | Fault-only-first load | VL reduced on fault |
| vl1re32.v vd, (rs1) | Whole-register load (1 reg) | Bypasses VL/mask |
| vadd.vv vd, vs2, vs1 | Vector + Vector | vd[i] = vs2[i] + vs1[i] |
| vadd.vx vd, vs2, rs1 | Vector + Scalar | vd[i] = vs2[i] + rs1 |
| vfadd.vv vd, vs2, vs1 | FP vector + vector | Floating-point add |
