RISC-V Vector Assembly — PainlessProgramming.com

RISC-V Vector Assembly
Complete Study Notes

Everything you need for your midterm — from SIMD fundamentals to RVV programming, memory modes, masking, and datapath behaviour.

Computer Architecture · Data-Level Parallelism · RVV 1.0 · Assembly Language · Exam Ready ✓

Why Do We Need Vector Processing?

Modern workloads operate on enormous datasets. A traditional scalar processor grinds through them one element at a time — vector processing attacks them in bulk.

🎮 Multimedia & Graphics

Rendering every pixel of a 4K frame, applying colour transformations — each pixel is independent data. Process 8 at once instead of 8 iterations.

🤖 Machine Learning & AI

Neural network layers multiply huge matrices. Without vector units, a transformer inference would take seconds instead of milliseconds.

🔬 Scientific Computing

Simulating fluid dynamics, weather models, molecular dynamics — all involve applying the same arithmetic over millions of data points.

📡 Signal Processing

FFTs, audio filters, radar processing — huge arrays of samples must be transformed in real time.

āŒ Scalar Approach — z[i] = x[i] + y[i]

10 elements → 10 loop iterations, 10 loads from x, 10 loads from y, 10 adds, 10 stores = 40+ memory/arithmetic operations.

✅ Vector Approach — vfadd.vv

10 elements → 1 instruction (or 2 iterations if hardware fits 8 at once). Same result, fraction of the overhead.


Data-Level Parallelism & SIMD Landscape

Parallelism comes in three flavours. SIMD / Vector is the Data-Level kind — same operation, many data items, same cycle.

Parallelism Type | Example Mechanism | Managed By | Granularity
Instruction-Level (ILP) | Pipelining, Out-of-Order | Hardware | Single instructions
Thread-Level (TLP) | Multicore, Hyperthreading | OS + Hardware | Threads / processes
Data-Level (DLP) | SIMD / Vector Units | CPU ISA | Many data elements

SIMD Extensions Across Architectures

šŸ–„ļø Intel / AMD (x86)

  • MMX — 64-bit, early multimedia
  • SSE — 128-bit, streaming SIMD
  • AVX / AVX2 — 256-bit vectors
  • AVX-512 — 512-bit on modern CPUs

📱 ARM (Mobile / Server)

  • NEON — 128-bit, Cortex-A media
  • SVE / SVE2 — Scalable lengths, Neoverse HPC

šŸ›ļø Historical

  • Cray-1 / Cray-2 — 1970s–80s pioneering vector supercomputers with massive pipeline registers
  • NEC SX Series — High-perf vector supers

🔓 Open / Specialized

  • RISC-V Vector (RVV) — Open-source, flexible, scalable
  • AltiVec — PowerPC SIMD, Apple G4/G5, IBM Power
  • Cell BE — PlayStation 3 processor

Limitations of Traditional SIMD

Classic SIMD like x86 AVX works well but has hard-coded constraints that hurt portability and programmer experience.

🔒 Fixed Vector Width

AVX-512 always uses 512 bits. If your data is 300 elements, you handle 256 elements with AVX, then process the rest (the "tail") separately. The hardware cannot adapt.

šŸ“ Loop Tail Handling

When N is not a multiple of the vector width, you need extra scalar code (or more complex SIMD code) just for the last few elements. Easy to get wrong, always annoying.

🔄 Recompile for New Widths

A binary built for AVX (256-bit) cannot simply run on AVX-512. You need separate builds — or runtime detection — for each target. Maintenance nightmare.

🚫 Portability Problems

AVX-512 intrinsics run only on CPUs that support AVX-512. Code becomes CPU-specific and breaks on older or different hardware.

💡 This Is Exactly What RVV Fixes

RISC-V Vector Extension (RVV) uses a variable-length model — the same binary runs correctly on hardware with 128-bit registers all the way up to 512-bit or beyond. The hardware tells software how many elements it can process, and the software adapts automatically at runtime.

RVV vs Traditional SIMD — Side by Side

Feature | Traditional SIMD (AVX etc.) | RISC-V Vector (RVV)
Vector width | Fixed at compile time | Variable — hardware decides
Portability | Limited (arch-specific) | High — one binary, all VLEN
Tail handling | Manual extra code | Automatic via vsetvli
Scalability | Limited | Excellent
ISA complexity | Many separate extensions | Single unified extension

RVV Programming Model

RVV adds 32 vector registers (v0–v31) to the standard RISC-V register file, plus a set of control registers and a flexible configuration system.

Vector Register File Structure (VLEN = 256 bits, SEW = 32, so 8 elements per register)

v0 (one vector register): [e0, e1, e2, e3, e4, e5, e6, e7]
With VL=5 (tail elements 5, 6, 7 are inactive): [e0, e1, e2, e3, e4, tail, tail, tail]

v0 – v31: Vector Registers

32 vector registers, each VLEN bits wide. They hold multiple data elements (elements = VLEN / SEW). Unlike scalar registers, they are wide enough to hold entire chunks of your array.

v0: The Mask Register

v0 is special — it doubles as the mask register. Each bit of v0 controls whether the corresponding element lane is active (1) or inactive (0). Used for predicated execution.


The Five Key RVV Terms

These five concepts define how vector length is configured. Understand all five and how they relate — they are the most exam-tested area.

VLEN — Vector Length (bits)

The physical width of each vector register in bits. This is fixed by the chip designer when the silicon is made — it could be 128, 256, 512 bits. Software cannot change it, but software can query it.

SEW — Selected Element Width

How many bits each individual element occupies: 8, 16, 32, or 64 bits. You choose this per operation. e.g., e32 means each element is a 32-bit integer.

LMUL — Length Multiplier

Groups multiple physical registers to form one logical "super-register". Valid values: 1, 2, 4, 8 (group more registers → process more elements) or 1/2, 1/4, 1/8 (use only a fraction of each register → fewer elements per op, mainly useful when mixing element widths). Default is LMUL=1.

AVL — Application Vector Length

How many elements your program wants to process. This is what you pass to vsetvli as the requested count. The hardware will give you as many as it can (up to VLMAX).

VL — Vector Length (elements)

The number of elements the hardware actually processes in the current iteration. Set automatically by vsetvli as min(AVL, VLMAX), and also stored in the vl CSR.

VLMAX = LMUL × VLEN / SEW — maximum elements one hardware iteration can process
VL = min(AVL, VLMAX) — actual elements processed this iteration, returned by vsetvli

📝 Worked Example

Hardware: VLEN = 256 bits. Program uses: SEW = 32 bits, LMUL = 1.
→ VLMAX = 1 × 256 / 32 = 8 elements per iteration.
Application wants: AVL = 5 elements.
→ VL = min(5, 8) = 5. Hardware processes 5 elements; elements 5–7 are tail (inactive).

LMUL Deep Dive

LMUL = 1 (default)

v0 is one logical register. With VLEN=256, SEW=32 → 8 elements. You have 32 logical vector registers.

LMUL = 2

Two physical registers (e.g. v0+v1) act as one. With VLEN=256, SEW=32 → 16 elements per operation. But now only 16 logical registers available.

LMUL = 1/2

Use only the low half of each register. Fewer elements per op (e.g. 4 instead of 8). You still have 32 registers; the point of fractional LMUL is to keep all of them usable when mixing element widths (e.g. 16-bit and 32-bit data in the same loop).


The vsetvli Instruction

This is the most important RVV instruction. It configures the hardware for the upcoming vector operation and handles tail elements automatically.

Three Variants

Variant | Syntax | When to Use
vsetvli | vsetvli rd, rs1, vtypei | AVL in a register, vtype as immediate (most common)
vsetivli | vsetivli rd, uimm, vtypei | AVL as a small unsigned immediate (uimm ≤ 31)
vsetvl | vsetvl rd, rs1, rs2 | Both AVL and vtype come from registers (fully dynamic)

Anatomy of a vsetvli Instruction

vsetvli   t0, a0, e32, m1, ta, ma

vsetvli — set vector length (vtype as immediate)
t0 (rd) — receives the actual VL set by hardware
a0 (rs1) — AVL, the number of elements you want
e32 — SEW = 32 bits per element
m1 — LMUL = 1 (single-register group)
ta — tail-agnostic policy
ma — mask-agnostic policy
āš ļø Critical: rd vs rs1 vs VL

rs1 = what you ask for (AVL). rd = what you get (VL). The hardware sets rd to min(AVL, VLMAX). Always use rd (t0 in examples) to know the actual batch size processed — don't assume you got all of AVL!

SEW Encoding Reference

Encoding | SEW | Element Type
e8 | 8 bits | Byte / uint8 / int8
e16 | 16 bits | Halfword / int16
e32 | 32 bits | Word / int32 / float32
e64 | 64 bits | Doubleword / int64 / float64

LMUL Encoding Reference

Encoding | LMUL | Effect
m1 | 1 | 1 register per group (default)
m2 | 2 | 2 registers per group → 2× elements
m4 | 4 | 4 registers per group → 4× elements
m8 | 8 | 8 registers per group → 8× elements
mf2 | 1/2 | Half a register → 1/2× elements
mf4 | 1/4 | Quarter register → 1/4× elements
mf8 | 1/8 | Eighth of a register → 1/8× elements

Tail & Mask Policies

Two orthogonal policies control what happens to vector elements that aren't actively computed: tail elements (beyond VL) and masked-off elements (where mask bit = 0).

Tail Elements — Beyond VL

Say VL = 5 and VLMAX = 8. Elements 0–4 are active. Elements 5–7 are the tail. The tail policy says what happens to those three slots in the destination register.

Register before operation: [10, 20, 30, 40, 50, 60, 70, 80]
With VL=5, after operation (active = indices 0–4, tail = 5–7):

Tail Undisturbed (tu): [a, b, c, d, e, 60, 70, 80] — tail keeps its old values, guaranteed.
Tail Agnostic (ta): [a, b, c, d, e, ?, ?, ?] — hardware can write all 1s, keep old values, or anything. Program must not rely on tail values.
✅ Best Practice

Use ta, ma (tail agnostic, mask agnostic) whenever you don't care about the tail/masked elements. This gives hardware freedom to optimise (e.g. skip zeroing). Use tu, mu (undisturbed) only when you explicitly need to preserve prior values — it forces the hardware to do extra work.

Masking — Selective Lane Execution

Masks let you process only some elements in a vector register, based on a condition. The mask is stored in v0, one bit per element.

Mask in v0 (1 = active, 0 = inactive), VL=8: [1, 0, 1, 0, 1, 1, 0, 1]
Source v1: [a, b, c, d, e, f, g, h]
Destination previously held: [10, 20, 30, 40, 50, 60, 70, 80]

Result with Mask-Undisturbed (mu): [a, 20, c, 40, e, f, 70, h] — inactive positions keep the destination's original value
Result with Mask-Agnostic (ma): [a, ?, c, ?, e, f, ?, h] — inactive positions are "don't care"

Policy Summary Table

Policy Code | Name | Behaviour | Cost
ta | Tail Agnostic | Tail elements: hardware does whatever it wants | Faster
tu | Tail Undisturbed | Tail elements: preserved from destination register | Slower
ma | Mask Agnostic | Masked-off elements: hardware does whatever | Faster
mu | Mask Undisturbed | Masked-off elements: preserved from destination | Slower

Vector Addition — Complete Walkthrough

We implement z[i] = x[i] + y[i] for n=6 elements using RVV with VLMAX=4, watching the microarchitecture execute each step.

The C Function

C
#include <stddef.h>   // size_t

// Add two int32 arrays element-wise: z[i] = x[i] + y[i]
void vvadd_int32(size_t n, const int* x, const int* y, int* z) {
    for (size_t i = 0; i < n; i++) {
        z[i] = x[i] + y[i];
    }
}

Register Mapping

Register | Contents
a0 | n — total number of elements
a1 | Pointer to array x (advances each iteration)
a2 | Pointer to array y (advances each iteration)
a3 | Pointer to array z (output, advances each iteration)
t0 | VL — actual elements processed this iteration (set by vsetvli)
v0, v1 | Vector regs for loaded chunks of x and y
v2 | Vector result (x chunk + y chunk)

Input Data

RISC-V Assembly — .data
x:  .word  10, 20, 30, 40, 50, 60   # 6 × 32-bit integers, 24 bytes
y:  .word   5, 15, 25, 35, 45, 55   # 6 × 32-bit integers, 24 bytes
z:  .word   0,  0,  0,  0,  0,  0   # 6 × 32-bit output slots
n:  .word   6                       # element count

The Full Assembly Routine

RISC-V Assembly — vvadd_int32
vvadd_int32:
    vsetvli  t0, a0, e32, m1, ta, ma    # t0 = min(a0, VLMAX); configure hw for 32-bit
    vle32.v  v0, (a1)                   # Load t0 elements from x into v0
    sub      a0, a0, t0                 # a0 -= t0  (how many elements remain)
    slli     t0, t0, 2                  # t0 = t0 * 4  (convert elements → bytes)
    add      a1, a1, t0                 # Advance x pointer by byte offset
    vle32.v  v1, (a2)                   # Load VL elements from y (t0 is now in bytes)
    add      a2, a2, t0                 # Advance y pointer
    vadd.vv  v2, v0, v1                 # Element-wise add: v2[i] = v0[i] + v1[i]
    vse32.v  v2, (a3)                   # Store result chunk to z
    add      a3, a3, t0                 # Advance z pointer
    bnez     a0, vvadd_int32            # Loop if elements remain
    ret                                 # Done!
āš ļø Why slli t0, t0, 2 ?

After vsetvli, t0 holds the number of elements processed. But memory pointers advance in bytes. Each int32 is 4 bytes = 2² bytes. Shifting left by 2 multiplies by 4: t0 Ɨ 4 = byte offset. This happens after decrementing a0 (which uses t0 as element count), so the order matters!

Iteration 1 — Step by Step

Step 1 — vsetvli t0, a0, e32, m1, ta, ma
a0 = 6 (AVL). VLMAX = 1 × 128 / 32 = 4 (assume VLEN=128). → t0 = min(6, 4) = 4. Hardware is now configured for 4 elements.

Step 2 — vle32.v v0, (a1)
Load 4 × 32-bit integers from memory at a1 (= &x[0]). → v0 = [10, 20, 30, 40]

Step 3 — sub a0, a0, t0 → a0 = 6 − 4 = 2
Decrement remaining count. 2 elements left after this batch.

Step 4 — slli t0, t0, 2 → t0 = 4 × 4 = 16 bytes
Convert element count to byte offset. 4 elements × 4 bytes each = 16 bytes.

Step 5 — add a1, a1, t0 — advance x pointer
a1 now points to x[4] (skipped 16 bytes = 4 elements).

Step 6 — vle32.v v1, (a2) → v1 = [5, 15, 25, 35]
Load 4 elements from y, then advance a2 by 16 bytes.

Step 7 — vadd.vv v2, v0, v1 — element-wise add
v2 = [10+5, 20+15, 30+25, 40+35] = [15, 35, 55, 75]

v0: [10, 20, 30, 40]
v1: [ 5, 15, 25, 35]
↓ vadd.vv
v2: [15, 35, 55, 75]

Step 8 — vse32.v v2, (a3) — store results
Write [15, 35, 55, 75] to z[0..3]. Advance a3 by 16 bytes.

Step 9 — bnez a0, vvadd_int32
a0 = 2 ≠ 0 → loop back!

Iteration 2 — Handling the "Tail"

🎯 This is the Key RVV Trick

On re-entry, a0 = 2. vsetvli is called again with AVL=2. Since VLMAX=4, VL = min(2,4) = 2. Hardware automatically processes only 2 elements — no extra code, no bounds checks.

Iteration 2 — vsetvli sets t0 = 2 (only 2 remaining)

v0: [50, 60, —, —]   (loads only x[4], x[5]; tail positions are inactive)
v1: [45, 55, —, —]
↓ vadd.vv (only 2 active lanes)
v2: [95, 115, —, —]

z[4..5] = [95, 115] — stored. a0 = 0 → loop exits.

Final Output

Index | x[i] | y[i] | z[i] = x[i]+y[i] | Computed in
0 | 10 | 5 | 15 | Iteration 1
1 | 20 | 15 | 35 | Iteration 1
2 | 30 | 25 | 55 | Iteration 1
3 | 40 | 35 | 75 | Iteration 1
4 | 50 | 45 | 95 | Iteration 2
5 | 60 | 55 | 115 | Iteration 2
✅ Performance Insight

Scalar: 6 loop iterations × (2 loads + 1 add + 1 store) = 24 operations.
RVV with VLMAX=4: 2 iterations × ~8 instructions ≈ 16 instructions, but the heavy lifting (loads/add/store) operates on 4 elements simultaneously. With VLMAX=8, this entire computation fits in 1 iteration!


Vector Memory Addressing Modes

RVV supports five addressing modes for loads and stores. Each has a different memory access pattern, performance profile, and use case.

Mode 1

Unit-Stride

Elements are packed sequentially in memory. The default and fastest mode.

Speed: Fastest ●●●●●
vle32.v v1, (a0) # load
vse32.v v1, (a0) # store

Use case: Array processing, most loops

Mode 2

Strided

Elements spaced by a fixed byte stride. Stride stored in a register. Stride=0 broadcasts one value to all lanes.

Speed: Medium ●●●○○
li t0, 16
vlse32.v v1, (a0), t0
vsse32.v v1, (a0), t0

Use case: Column of row-major matrix

Mode 3

Indexed (Gather/Scatter)

Each element loaded from an arbitrary offset stored in an index vector. Maximum flexibility, minimum cache friendliness.

Speed: Slowest ●○○○○
vluxei32.v v1, (a0), v2   # gather  (v2 holds byte offsets)
vsuxei32.v v1, (a0), v2   # scatter

Use case: Sparse matrices, permutations, hash lookups

Special A

Whole-Register

Transfers exactly N × VLEN bits. Bypasses VL and masking entirely. Always loads/stores the full register width.

Speed: Fast ā—ā—ā—ā—ā—‹
vl1re32.v v1, (a0) # load 1 reg
vl4re32.v v4, (a0) # load 4 regs
vs1r.v v1, (a0) # store

Use case: Context save/restore (OS task switch)

Special B

Fault-Only-First

Like unit-stride but silently reduces VL if a memory fault occurs mid-load. No exception raised — check returned VL to see how many elements arrived valid.

Speed: Fast (normal) ●●●●○
vle32ff.v v1, (a0)
# vl CSR silently reduced
# if fault occurs mid-load

Use case: Null-terminated strings, unknown-length data

Indexed: Ordered vs Unordered

vluxei32.v / vsuxei32.v — Unordered

Hardware may reorder memory accesses for better performance. Faster, but only correct when accesses have no side-effects and no ordering dependency.

vloxei32.v / vsoxei32.v — Ordered

Accesses occur in element order (0, 1, 2, …). Required when memory-mapped I/O has side-effects or when ordering matters for correctness. Potentially slower.

Masking on Memory Instructions

All load/store modes (except whole-register) support the optional mask register v0. Append the v0.t operand to the instruction to enable masking:

RISC-V Assembly
# Masked store — only write where v0[i] = 1
vse32.v  v1, (a0), v0.t
# Memory positions where mask=0 are untouched!

# Masked load — only load where v0[i] = 1
vle32.v  v1, (a0), v0.t
# Inactive lanes: follow mask policy (mu/ma)
Use Case | Technique | Benefit
Boundary condition | Masked store | Don't write past end of array
Conditional update | Masked store | Only store where predicate is true
Sparse writes | Masked scatter | Write to selected positions only
Conditional load | Masked load | Load only valid elements

Strided: Loading a Matrix Column

RISC-V Assembly — Load column 0 of 4×4 float matrix
# 4×4 matrix stored row-major: M[0][0] M[0][1] M[0][2] M[0][3] | M[1][0] ...
# Row width = 4 floats = 4 × 4 = 16 bytes
li        t0, 16             # stride = 16 bytes
vsetvli   t1, a3, e32, m1, ta, ma
vlse32.v  v1, (a0), t0
# v1 = [M[0][0], M[1][0], M[2][0], M[3][0]] — whole column in one instruction!

Midterm Exam Preparation

Each question is followed by its answer. These target exactly the kind of datapath/microarchitecture MCQ questions common in computer architecture exams.

Q: VLMAX = 1 × 256 / 32. What does this equal, and what does each term represent?
VLMAX = 8 elements.
• 1 = LMUL (register group size)
• 256 = VLEN (register width in bits)
• 32 = SEW (element size in bits)
So 8 elements of 32 bits fit in one 256-bit vector register.
Q: Why is slli t0, t0, 2 executed AFTER sub a0, a0, t0?
Because sub a0, a0, t0 needs t0 to be in elements (to decrement the element count). After the sub, t0 is converted to bytes (Ɨ4) for pointer arithmetic. If the slli happened first, a0 would be decremented by bytes, not elements — catastrophically wrong.
Q: What does vsetvli return in its destination register (rd)?
rd receives VL — the actual number of elements the hardware will process this iteration. It equals min(AVL, VLMAX). This is the amount you actually processed — use it to advance pointers and decrement your counter.
Q: VLEN=256, SEW=32, LMUL=2, AVL=15. What is VLMAX? What is VL?
VLMAX = LMUL × VLEN / SEW = 2 × 256 / 32 = 16
VL = min(15, 16) = 15
All 15 elements are processed in one iteration.
Q: What is the difference between "tail agnostic" and "tail undisturbed"?
Tail elements are those beyond VL in the destination register.
• Tail Undisturbed (tu): destination bits beyond VL are guaranteed to keep their old value.
• Tail Agnostic (ta): hardware can write 1s, keep old values, or anything — program must not rely on them. ta is faster (hardware has freedom).
Q: Which vector register is reserved for masking in RVV?
v0 is the dedicated mask register. Each bit corresponds to one vector lane. Bit=1 means that element is active; bit=0 means inactive (masked off). Masked instructions use the suffix v0.t.
Q: What is the key difference between vluxei32.v and vloxei32.v?
Both are indexed (gather) loads using offsets from a vector index register.
• vlux (unordered): hardware may reorder accesses — faster.
• vlox (ordered): accesses happen in element order — needed when memory side-effects or ordering correctness matters (e.g. memory-mapped I/O).
Q: For n=6, VLMAX=4: how many loop iterations does the RVV routine execute?
2 iterations.
• Iteration 1: VL = min(6, 4) = 4. Process elements 0–3. a0 = 6 − 4 = 2.
• Iteration 2: VL = min(2, 4) = 2. Process elements 4–5. a0 = 2 − 2 = 0. Loop exits.
Q: What does LMUL=8 mean, and what is the trade-off?
LMUL=8 groups 8 physical registers into one logical register, allowing 8× more elements per operation (e.g. 64 instead of 8 at VLEN=256, SEW=32).
Trade-off: You only have 32/8 = 4 logical vector registers instead of 32. Fewer registers means more register pressure and potential spilling.
Q: When would you use Fault-Only-First load (vle32ff.v)?
When processing data of unknown length, like a null-terminated string. You load a chunk — if a page fault occurs mid-vector (string ended on a page boundary), the HW silently reduces VL to only the successfully loaded elements instead of raising an exception. You then read VL from the CSR to know how many elements are valid.
Q: What is the stride value to load column j of a row-major matrix where each row has N float32 elements?
stride = N × 4 bytes.
To jump from M[r][j] to M[r+1][j], you skip one entire row = N elements × 4 bytes/element. Example: 8-column matrix → stride = 8 × 4 = 32 bytes. You'd use li t0, 32; vlse32.v v1, (a0), t0.
Q: In the vvadd loop, why is vsetvli called at the TOP of the loop every iteration (not once before)?
Because AVL (a0) changes each iteration. On the last iteration, there may be fewer elements remaining than VLMAX. vsetvli must be called again so VL is correctly reduced to match the remaining elements — avoiding out-of-bounds memory access without any extra conditional code.
🎯 Hot Topics for Microarchitecture MCQs

Based on your course focus, pay special attention to: VLMAX calculation (LMUL × VLEN / SEW), VL = min(AVL, VLMAX), why slli is needed and when it executes relative to sub, what rd holds after vsetvli, tail vs mask policies, and the difference between unit-stride, strided, and indexed modes. These are the datapath-level details most likely to appear as MCQs.

Quick Reference — Common Instructions

Instruction | Operation | Notes
vsetvli rd, rs1, eN, mM, tp, mp | Configure VL and vtype | Returns actual VL in rd
vle32.v vd, (rs1) | Unit-stride load 32-bit | Loads VL elements
vse32.v vs3, (rs1) | Unit-stride store 32-bit | Stores VL elements
vlse32.v vd, (rs1), rs2 | Strided load | rs2 = byte stride
vsse32.v vs3, (rs1), rs2 | Strided store | rs2 = byte stride
vluxei32.v vd, (rs1), vs2 | Indexed unordered load | vs2 holds byte offsets
vsuxei32.v vs3, (rs1), vs2 | Indexed unordered store | vs2 holds byte offsets
vle32ff.v vd, (rs1) | Fault-only-first load | VL reduced on fault
vl1re32.v vd, (rs1) | Whole-register load (1 reg) | Bypasses VL/mask
vadd.vv vd, vs2, vs1 | Vector + Vector | vd[i] = vs2[i] + vs1[i]
vadd.vx vd, vs2, rs1 | Vector + Scalar | vd[i] = vs2[i] + rs1
vfadd.vv vd, vs2, vs1 | FP vector + vector | Floating-point add

PainlessProgramming.com — RISC-V Vector Assembly Notes

Based on the RISC-V Vector Extension v1.0 specification & course materials by Salman Zaffar, Computer Architecture 2026.

Good luck on your midterm! 🚀
