Vector ALU (VALU)¶
- Vector ALU (VALU)
- Notation
- Architecture Overview
- Parameters and Data Types
- Instruction Reference
- VALU_ARITH — Elementwise arithmetic
- VALU_LOGIC — Bitwise and shift
- VALU_REDUCE — Horizontal reductions
- VALU_LUT — Programmable two-bank 256-entry lookup table
- VALU_CVT — Type conversion
- VALU_BCAST — Scalar broadcast
- VALU_FP — FP32 arithmetic
- VALU_FP_FMA — Fused multiply-add (S-format, 2 ticks)
- BF16 and BF8 Encoding
- Timing Summary
- Backend Integration
- Activation Functions via Primitives
- Implementation Notes
The Vector Arithmetic Logic Unit (VALU) is a K-lane, multi-width coprocessor running
alongside the Systolic Array.
It handles all post-GEMM work: elementwise arithmetic, bitwise ops, horizontal reductions,
programmable two-bank LUT lookups (vlut/vsetlut), type conversions (INT/FP32/BF16/BF8), scalar broadcasts, and
FP32 fused multiply-add. All output is written back to the shared
MultiWidthRegisterBlock.
Notation¶
| Symbol | Meaning | Default (test) | Default (top) |
|---|---|---|---|
N (N(bits)) |
Base lane width in bits (VX lane = N bits) | 8 | 8 |
L |
Number of base VX registers (must be div-by-4) | 32 | 32 |
K |
SIMD lane count per register | 8 | 64 |
N2 |
2×N — VE lane width |
16 | 16 |
N4 |
4×N — VR lane width |
32 | 32 |
For the full encoding rules, see ISA.
Architecture Overview¶
graph TD
RF["MultiWidthRegisterBlock\nL×K×N(bits) bytes\n(VX · VE · VR aliased views)"]
DEC["InstrDecoder\n32-bit word → DecodedMicroOp\n(combinational)"]
CTRL["NCoreVALUBundle\nop · dtype · regCls\nsaturate · round · rs3_idx · imm"]
VALU["VALU(K, N)\nK lanes × 3 widths\nFP32 · vlut/vsetlut"]
OVX["out_vx\nVec(K, UInt(N.W))"]
OVE["out_ve\nVec(K, UInt(2N.W))"]
OVR["out_vr\nVec(K, UInt(4N.W))"]
RF -->|"in_a_vx / in_b_vx"| VALU
RF -->|"in_a_ve / in_b_ve"| VALU
RF -->|"in_a_vr / in_b_vr / in_c_vr"| VALU
DEC --> CTRL --> VALU
VALU --> OVX -->|"VX write port 0"| RF
VALU --> OVE -->|"VE write port 0"| RF
VALU --> OVR -->|"VR write port 0"| RF
Key properties:
- K lanes processed in parallel; all three widths (N, 2N, 4N bits) are available simultaneously.
- 1-cycle latency for all ops except
vfma(2 cycles) and rounding CVT ops (1–2 cycles). - Three output ports registered with 1-cycle latency:
out_vx—Vec(K, UInt(N.W))— INT8/BF8 results; VX write bankout_ve—Vec(K, UInt(2N.W))— INT16/BF16 results; VE write bankout_vr—Vec(K, UInt(4N.W))— INT32/FP32 results; VR write bank (also receives MMALU accumulator directly)- Instruction decode is handled externally by
InstrDecoderbefore the VALU sees the control bundle.
Parameters and Data Types¶
| Lane type | Width | Register class | Implemented |
|---|---|---|---|
| INT8 / BF8 | N bits | VX | ✓ |
| INT16 | 2N bits | VE | ✓ |
| INT32 | 4N bits | VR | ✓ |
| FP32 | 4N bits | VR | ✓ (Tier-2 subset) |
| BF16 | 2N bits | VE | ✓ (truncation/padding) |
| BF8 E4M3 | N bits | VX | ✓ |
| BF8 E5M2 | N bits | VX | ✓ |
| FP16 | 2N bits | VE | — (reserved) |
VecDType encoding (in NCoreVALUBundle.dtype)¶
| Code | Type | Notes |
|---|---|---|
S8C4 |
INT8 × K | Primary dtype for arithmetic/logic/vlut |
S16C2 |
INT16 × K | VE register class |
S32C1 |
INT32 × K | VR register class |
FP32C1 |
FP32 × K | VR; Tier-2 FP32 subset |
BF16C2 |
BF16 × K | VE; top-16-bits of FP32 |
BF8E4M3 |
BF8 E4M3 × K | VX; selected for CVT ops |
BF8E5M2 |
BF8 E5M2 × K | VX; selected by funct7[6]=1 in CVT |
Instruction Reference¶
VALU_ARITH — Elementwise arithmetic¶
Opcode family 0x10. Supports all three widths via regCls (from funct7[1:0]).
Saturation (funct7[4]) clamps narrow results.
| Op | funct3 | VX | VE | VR | Saturate effect |
|---|---|---|---|---|---|
add |
000 | ✓ | ✓ | ✓ | clamp to lane width min/max |
sub |
001 | ✓ | ✓ | ✓ | clamp |
mul |
010 | ✓ | ✓ | ✓ | narrow sat on out_vx/ve; full product on out_vr |
neg |
011 | ✓ | ✓ | ✓ | clamp (e.g. −(−128) → 127 for INT8) |
abs |
100 | ✓ | ✓ | ✓ | clamp |
max |
101 | ✓ | ✓ | ✓ | — |
min |
110 | ✓ | ✓ | ✓ | — |
rsub |
111 | ✓ | ✓ | ✓ | clamp (rs2 − rs1) |
VALU_LOGIC — Bitwise and shift¶
Opcode family 0x11. Operates on raw bit patterns; ignores sat and round.
Shift amount = low log2(lane_width) bits of the corresponding in_b lane.
| Op | funct3 | Operation |
|---|---|---|
sll |
000 | logical left shift |
srl |
001 | logical right shift |
sra |
010 | arithmetic right shift (sign-extending) |
rol |
011 | rotate left |
xor |
100 | bitwise XOR |
not |
101 | bitwise NOT (in_b unused) |
or |
110 | bitwise OR |
and |
111 | bitwise AND |
VALU_REDUCE — Horizontal reductions¶
Opcode family 0x12. Reduces all K lanes of in_a_vx to a scalar and broadcasts to
every lane of out_vr. The combinational tree has no additional latency for small K.
| Op | funct3 | Result | Broadcast to |
|---|---|---|---|
sum |
000 | Σ lane[i] (sign-extended) |
out_vr all K lanes |
rmax |
001 | max(lane[i]) |
out_vr all K lanes |
rmin |
010 | min(lane[i]) |
out_vr all K lanes |
rand |
011 | AND(lane[i]) |
out_vr all K lanes |
ror |
100 | OR(lane[i]) |
out_vr all K lanes |
rxor |
101 | XOR(lane[i]) |
out_vr all K lanes |
VALU_LUT — Programmable two-bank 256-entry lookup table¶
Opcode family 0x13. VX lanes only. Two independently writable 256-byte banks (A and B).
The LUT is not a fixed ROM: entries are written at runtime via vsetlut before being
queried per-lane with vlut.
vlut — per-lane lookup (R-type, 1 tick)¶
rd[i] = lut_bank[in_a_vx[i].asUInt]
rs2 is unused. The raw unsigned value of each in_a_vx lane is the LUT index (0–255).
Output goes to out_vx. Bank A or B is selected by funct3[0]; this bit is also
propagated as round[0] in the DecodedMicroOp bundle.
| funct3 | Mnemonic | Bank | Notes |
|---|---|---|---|
| 000 | vlut |
A | round[0]=0 in decoded bundle |
| 001 | vlut |
B | round[0]=1 in decoded bundle |
vsetlut — write LUT segment (I-type, no register-file write)¶
Writes one K×4-byte segment from VR[rs1] into the selected bank.
imm = segment index s; fills entries [s×K×4 .. (s+1)×K×4 − 1].
funct3[0] selects bank A (0) or B (1), propagated the same way as vlut.
| funct3 | Mnemonic | Bank | Notes |
|---|---|---|---|
| 100 | vsetlut |
A | I-type; imm=segment; rs1=VR source |
| 101 | vsetlut |
B | I-type; imm=segment; rs1=VR source |
Segment capacity:
| K | Bytes per VR | vsetlut calls to fill 256-entry bank |
|---|---|---|
| 8 | 32 | 8 |
| 64 | 256 | 1 |
Reserved funct3 values
funct3 010, 011, 110, 111 are reserved; the decoder asserts illegal.
Qfmt table utility
The Qfmt object (src/main/scala/alu/vec/vec.scala) provides Scala-side precomputed
tables (lutExp, lutRecip, lutTanh, lutErf) for use in tests and as a source to
load into LUT banks via vsetlut. These are no longer synthesised as hardware ROMs.
VALU_CVT — Type conversion¶
Opcode family 0x14. funct3 = destination format code; funct7[2:0] = source format code.
Decoder asserts illegal if src == dst.
| Mnemonic | Src | Dst | Input port | Output port | Ticks |
|---|---|---|---|---|---|
vcvt_s8_s32 |
s32 | s8 | in_a_vr |
out_vx |
1 |
vcvt_s32_s8 |
s8 | s32 | in_a_vx |
out_vr |
1 |
vcvt_s32_f32 |
s32 | f32 | in_a_vr |
out_vr |
1 |
vcvt_f32_s32 |
f32 | s32 | in_a_vr |
out_vr |
1–2 |
vcvt_s8_f32 |
s8 | f32 | in_a_vx |
out_vr |
1 |
vcvt_f32_s8 |
f32 | s8 | in_a_vr |
out_vx |
1–2 |
vcvt_f32_bf16 |
bf16 | f32 | in_a_ve |
out_vr |
1 |
vcvt_bf16_f32 |
f32 | bf16 | in_a_vr |
out_ve |
1 |
vcvt_f32_bf8 |
bf8 | f32 | in_a_vx |
out_vr |
1 |
vcvt_bf8_f32 |
f32 | bf8 | in_a_vr |
out_vx |
1 |
vcvt_s16_s32 |
s32 | s16 | in_a_vr |
out_ve |
1 |
vcvt_s32_s16 |
s16 | s32 | in_a_ve |
out_vr |
1 |
VALU_BCAST — Scalar broadcast¶
Opcode family 0x15. Splats a scalar to all K output lanes.
Primary use: loading quantization scale/zp into a register before vfma.
| funct3 | Format | Mnemonic | Operation |
|---|---|---|---|
| 000 | R | bcast.reg |
rd[i] = rs1[0] for all K lanes; width from regCls |
| 001 | I | bcast.imm |
rd[i] = sext(imm[11:0]) for all K lanes |
VALU_FP — FP32 arithmetic¶
Opcode family 0x16. Always operates on VR (K lanes of 32-bit FP). Width and dtype are implicit.
Tier-2 FP32 constraints¶
| Property | Behaviour |
|---|---|
| Rounding | RNE default; funct7[3:2] selects RTZ/floor/ceil |
| NaN inputs | Treated as ±0; output never NaN |
| ±Inf inputs | Treated as ±0; output saturates to max finite normal |
| Subnormals | Flushed to zero on input and output |
| Overflow | Saturates to ±0x7F7FFFFF (max finite normal) |
| funct3 | Mnemonic | Operation | Ticks |
|---|---|---|---|
| 000 | fadd |
rd[i] = rs1[i] + rs2[i] |
1 |
| 001 | fsub |
rd[i] = rs1[i] − rs2[i] |
1 |
| 010 | fmul |
rd[i] = rs1[i] × rs2[i] |
1 |
| 011 | fneg |
rd[i] = −rs1[i] (sign flip) |
1 |
| 100 | fabs |
rd[i] = |rs1[i]| |
1 |
| 101 | fmax |
rd[i] = max(rs1[i], rs2[i]) |
1 |
| 110 | fmin |
rd[i] = min(rs1[i], rs2[i]) |
1 |
VALU_FP_FMA — Fused multiply-add (S-format, 2 ticks)¶
Opcode family 0x17. S-format: rs3 at bits [31:27], round mode at bits [26:25].
Result always in VR.
| funct3 | Mnemonic | Operation |
|---|---|---|
| 000 | fma |
rd[i] = (rs1[i] × rs2[i]) + rs3[i] |
| 001 | fms |
rd[i] = (rs1[i] × rs2[i]) − rs3[i] |
| 010 | nfma |
rd[i] = −(rs1[i] × rs2[i]) + rs3[i] |
| 011 | nfms |
rd[i] = −(rs1[i] × rs2[i]) − rs3[i] |
BF16 and BF8 Encoding¶
BF16 (Brain Float 16)¶
BF16 is the top 16 bits of an IEEE FP32 word — same sign and exponent, truncated mantissa.
FP32: S EEEEEEEE MMMMMMM MMMMMMMM MMMMMMMM (32 bits)
BF16: S EEEEEEEE MMMMMMM (16 bits — top half)
vcvt_bf16_f32: adds 16 zero bits in the low half → lossless exponent, truncated mantissa.vcvt_f32_bf16: removes the low 16 bits (RNE: adds 0x8000 before truncating).
BF8 — two variants¶
| Format | S | Exp | Man | Bias | Max value |
|---|---|---|---|---|---|
| E4M3 | 1 | 4 | 3 | 7 | ≈ 448 |
| E5M2 | 1 | 5 | 2 | 15 | ≈ 57 344 |
Selected by funct7[6] in CVT instructions: 0 = E4M3 (activations), 1 = E5M2 (weights/gradients).
Timing Summary¶
Backend Integration¶
sequenceDiagram
participant FE as Frontend / Test
participant DEC as InstrDecoder
participant RF as MultiWidthRegisterBlock
participant VU as VALU
participant MMA as MMALU
FE->>DEC: 32-bit instruction word
DEC-->>FE: DecodedMicroOp (combinational)
DEC->>RF: rd/rs1/rs2/rs3 addresses
RF-->>VU: in_a_vx/ve/vr · in_b_vx/ve/vr · in_c_vr
DEC-->>VU: NCoreVALUBundle (op · regCls · dtype · sat · round · imm)
Note over VU: cycle 1: combinational compute
VU-->>VU: out_vx/ve/vr registers latch
Note over VU: cycle 2: outputs valid
VU->>RF: write VX/VE/VR bank 0 (VALU result)
MMA->>RF: write VR bank 1 (INT32 accumulator — no truncation)
Write-back timing (2-cycle hold)
Because VALU has a 1-cycle output register, the backend holds the decoded op active for 2 clock cycles: cycle 1 latches the result, cycle 2 fires the write-back. A production frontend can pipeline this with a 1-cycle stall or forwarding network.
MMALU → VR direct path
MMALU's 4N-bit accumulator (Vec(K, SInt(4N))) is wired directly to VR write port 1
in NCoreBackend. No INT8 truncation occurs. This is the path that enables
INT32 quantization: the full accumulator is available in VR for subsequent vcvt_f32_s32
and vfma operations.
Activation Functions via Primitives¶
See Quantization Pipeline for the full worked example including register allocation.
Softmax (K lanes, SQ1.6 input)¶
flowchart LR
X["VX: x[K] (SQ1.6)"]
X --> A["vrmax → VR\n(broadcast max)\n1 tick"]
A --> B
X --> B["vsub.sat → VX\nx' = x - max\n1 tick"]
B --> C["vlut (exp, bank A) → VX\ne = exp(x') UQ0.8\n1 tick"]
C --> D["vsum → VR\nΣ = sum(e)\n1 tick"]
D --> E["vlut (recip, bank B) → VX\nr = 1/Σ (SQ1.6)\n1 tick"]
C --> F["vmul → VX\np = e × r\n1 tick"]
E --> F
F --> G["softmax(x)\n6 ticks total"]
GELU approximation (K lanes, SQ1.6 input)¶
flowchart LR
X["VX: x[K] (SQ1.6)"]
X --> A["vsra by 1\n≈ x/√2\n1 tick → VX"]
A --> B["vlut (erf, bank A) → VX\nerf(x/√2) SQ1.6\n1 tick"]
B --> C["vadd 64\n1+erf(·)\n1 tick → VX"]
X --> D["vmul → VR\nx·(1+erf)\n1 tick"]
C --> D
D --> E["vsra by 7\n÷128\n1 tick → VX"]
E --> F["GELU(x) approx\n5 ticks total"]
Implementation Notes¶
- Programmable LUT banks: the VALU holds two 256-byte banks written at runtime via
vsetlut. TheQfmtobject (lutExp/lutRecip/lutTanh/lutErf) provides Scala-side table data for loading into banks before simulation; no static hardware ROM is synthesised. - UQ0.8 sign when using exp LUT bank:
out_vxisUInt(N.W), so values 0–255 are unsigned. The UQ0.8 value 255 (exp(0)≈1.0) is fully representable. Signed reinterpretation is only needed if the caller treatsout_vxasSInt. regClsfield: the register-class selector insideNCoreVALUBundleis namedregCls(notwidth) to avoid a Chisel plugin naming conflict withchisel3.Width. Everyfunct7[1:0]decodes toregClsin hardware.- VecOp enum width:
VecOpvalues go up to0x45(= 69), requiring 7-bit width. If you add new entries, verify the maximum still fits in 7 bits. - Per-lane shift amount:
vsll/vsra/etc. use the lowlog2(lane_width)bits of the correspondingin_blane, enabling heterogeneous per-lane shifts within a single instruction. - FP32
fadd32normalization: leading-1 detection usesPriorityEncoder(Reverse(raw(24,0))). The returned value is the position of the highest set bit in the reversed vector, which equals24 − position_of_highest_bit_in_raw. The exponent adjustment is(24 − lzFromTop) − 23.