Vector ALU (VALU)¶

The Vector Arithmetic Logic Unit (VALU) is a K-lane, multi-width coprocessor running alongside the Systolic Array. It handles all post-GEMM work: elementwise arithmetic, bitwise ops, horizontal reductions, programmable two-bank LUT lookups (vlut/vsetlut), type conversions (INT/FP32/BF16/BF8), scalar broadcasts, and FP32 fused multiply-add. All output is written back to the shared MultiWidthRegisterBlock.

Notation¶

Symbol	Meaning	Default (test)	Default (top)
`N` (N(bits))	Base lane width in bits (VX lane = N bits)	8	8
`L`	Number of base VX registers (must be div-by-4)	32	32
`K`	SIMD lane count per register	8	64
`N2`	`2×N` — VE lane width	16	16
`N4`	`4×N` — VR lane width	32	32

For the full encoding rules, see ISA.

Architecture Overview¶

graph TD
    RF["MultiWidthRegisterBlock\nL×K×N(bits) bytes\n(VX · VE · VR aliased views)"]

    DEC["InstrDecoder\n32-bit word → DecodedMicroOp\n(combinational)"]

    CTRL["NCoreVALUBundle\nop · dtype · regCls\nsaturate · round · rs3_idx · imm"]

    VALU["VALU(K, N)\nK lanes × 3 widths\nFP32 · vlut/vsetlut"]

    OVX["out_vx\nVec(K, UInt(N.W))"]
    OVE["out_ve\nVec(K, UInt(2N.W))"]
    OVR["out_vr\nVec(K, UInt(4N.W))"]

    RF -->|"in_a_vx / in_b_vx"| VALU
    RF -->|"in_a_ve / in_b_ve"| VALU
    RF -->|"in_a_vr / in_b_vr / in_c_vr"| VALU
    DEC --> CTRL --> VALU
    VALU --> OVX -->|"VX write port 0"| RF
    VALU --> OVE -->|"VE write port 0"| RF
    VALU --> OVR -->|"VR write port 0"| RF

Key properties:

K lanes processed in parallel; all three widths (N, 2N, 4N bits) are available simultaneously.
1-cycle latency for all ops except vfma (2 cycles) and rounding CVT ops (1–2 cycles).
Three output ports registered with 1-cycle latency:
out_vx — Vec(K, UInt(N.W)) — INT8/BF8 results; VX write bank
out_ve — Vec(K, UInt(2N.W)) — INT16/BF16 results; VE write bank
out_vr — Vec(K, UInt(4N.W)) — INT32/FP32 results; VR write bank (also receives MMALU accumulator directly)
Instruction decode is handled externally by InstrDecoder before the VALU sees the control bundle.

Parameters and Data Types¶

Lane type	Width	Register class	Implemented
INT8 / BF8	N bits	VX	✓
INT16	2N bits	VE	✓
INT32	4N bits	VR	✓
FP32	4N bits	VR	✓ (Tier-2 subset)
BF16	2N bits	VE	✓ (truncation/padding)
BF8 E4M3	N bits	VX	✓
BF8 E5M2	N bits	VX	✓
FP16	2N bits	VE	— (reserved)

VecDType encoding (in `NCoreVALUBundle.dtype`)¶

Code	Type	Notes
`S8C4`	INT8 × K	Primary dtype for arithmetic/logic/vlut
`S16C2`	INT16 × K	VE register class
`S32C1`	INT32 × K	VR register class
`FP32C1`	FP32 × K	VR; Tier-2 FP32 subset
`BF16C2`	BF16 × K	VE; top-16-bits of FP32
`BF8E4M3`	BF8 E4M3 × K	VX; selected for CVT ops
`BF8E5M2`	BF8 E5M2 × K	VX; selected by `funct7[6]=1` in CVT

Instruction Reference¶

VALU_ARITH — Elementwise arithmetic¶

Opcode family 0x10. Supports all three widths via regCls (from funct7[1:0]). Saturation (funct7[4]) clamps narrow results.

Op	funct3	VX	VE	VR	Saturate effect
`add`	000	✓	✓	✓	clamp to lane width min/max
`sub`	001	✓	✓	✓	clamp
`mul`	010	✓	✓	✓	narrow sat on `out_vx/ve`; full product on `out_vr`
`neg`	011	✓	✓	✓	clamp (e.g. −(−128) → 127 for INT8)
`abs`	100	✓	✓	✓	clamp
`max`	101	✓	✓	✓	—
`min`	110	✓	✓	✓	—
`rsub`	111	✓	✓	✓	clamp (rs2 − rs1)

VALU_LOGIC — Bitwise and shift¶

Opcode family 0x11. Operates on raw bit patterns; ignores sat and round. Shift amount = low log2(lane_width) bits of the corresponding in_b lane.

Op	funct3	Operation
`sll`	000	logical left shift
`srl`	001	logical right shift
`sra`	010	arithmetic right shift (sign-extending)
`rol`	011	rotate left
`xor`	100	bitwise XOR
`not`	101	bitwise NOT (`in_b` unused)
`or`	110	bitwise OR
`and`	111	bitwise AND

VALU_REDUCE — Horizontal reductions¶

Opcode family 0x12. Reduces all K lanes of in_a_vx to a scalar and broadcasts to every lane of out_vr. The combinational tree has no additional latency for small K.

Op	funct3	Result	Broadcast to
`sum`	000	`Σ lane[i]` (sign-extended)	`out_vr` all K lanes
`rmax`	001	`max(lane[i])`	`out_vr` all K lanes
`rmin`	010	`min(lane[i])`	`out_vr` all K lanes
`rand`	011	`AND(lane[i])`	`out_vr` all K lanes
`ror`	100	`OR(lane[i])`	`out_vr` all K lanes
`rxor`	101	`XOR(lane[i])`	`out_vr` all K lanes

VALU_LUT — Programmable two-bank 256-entry lookup table¶

Opcode family 0x13. VX lanes only. Two independently writable 256-byte banks (A and B). The LUT is not a fixed ROM: entries are written at runtime via vsetlut before being queried per-lane with vlut.

`vlut` — per-lane lookup (R-type, 1 tick)¶

rd[i] = lut_bank[in_a_vx[i].asUInt]

rs2 is unused. The raw unsigned value of each in_a_vx lane is the LUT index (0–255). Output goes to out_vx. Bank A or B is selected by funct3[0]; this bit is also propagated as round[0] in the DecodedMicroOp bundle.

funct3	Mnemonic	Bank	Notes
000	`vlut`	A	`round[0]=0` in decoded bundle
001	`vlut`	B	`round[0]=1` in decoded bundle

`vsetlut` — write LUT segment (I-type, no register-file write)¶

Writes one K×4-byte segment from VR[rs1] into the selected bank. imm = segment index s; fills entries [s×K×4 .. (s+1)×K×4 − 1]. funct3[0] selects bank A (0) or B (1), propagated the same way as vlut.

funct3	Mnemonic	Bank	Notes
100	`vsetlut`	A	I-type; `imm`=segment; rs1=VR source
101	`vsetlut`	B	I-type; `imm`=segment; rs1=VR source

Segment capacity:

K	Bytes per VR	`vsetlut` calls to fill 256-entry bank
8	32	8
64	256	1

Reserved funct3 values

funct3 010, 011, 110, 111 are reserved; the decoder asserts illegal.

Qfmt table utility

The Qfmt object (src/main/scala/alu/vec/vec.scala) provides Scala-side precomputed tables (lutExp, lutRecip, lutTanh, lutErf) for use in tests and as a source to load into LUT banks via vsetlut. These are no longer synthesised as hardware ROMs.

VALU_CVT — Type conversion¶

Opcode family 0x14. funct3 = destination format code; funct7[2:0] = source format code. Decoder asserts illegal if src == dst.

Mnemonic	Src	Dst	Input port	Output port	Ticks
`vcvt_s8_s32`	s32	s8	`in_a_vr`	`out_vx`	1
`vcvt_s32_s8`	s8	s32	`in_a_vx`	`out_vr`	1
`vcvt_s32_f32`	s32	f32	`in_a_vr`	`out_vr`	1
`vcvt_f32_s32`	f32	s32	`in_a_vr`	`out_vr`	1–2
`vcvt_s8_f32`	s8	f32	`in_a_vx`	`out_vr`	1
`vcvt_f32_s8`	f32	s8	`in_a_vr`	`out_vx`	1–2
`vcvt_f32_bf16`	bf16	f32	`in_a_ve`	`out_vr`	1
`vcvt_bf16_f32`	f32	bf16	`in_a_vr`	`out_ve`	1
`vcvt_f32_bf8`	bf8	f32	`in_a_vx`	`out_vr`	1
`vcvt_bf8_f32`	f32	bf8	`in_a_vr`	`out_vx`	1
`vcvt_s16_s32`	s32	s16	`in_a_vr`	`out_ve`	1
`vcvt_s32_s16`	s16	s32	`in_a_ve`	`out_vr`	1

VALU_BCAST — Scalar broadcast¶

Opcode family 0x15. Splats a scalar to all K output lanes. Primary use: loading quantization scale/zp into a register before vfma.

funct3	Format	Mnemonic	Operation
000	R	`bcast.reg`	`rd[i] = rs1[0]` for all K lanes; width from `regCls`
001	I	`bcast.imm`	`rd[i] = sext(imm[11:0])` for all K lanes

VALU_FP — FP32 arithmetic¶

Opcode family 0x16. Always operates on VR (K lanes of 32-bit FP). Width and dtype are implicit.

Tier-2 FP32 constraints¶

Property	Behaviour
Rounding	RNE default; `funct7[3:2]` selects RTZ/floor/ceil
NaN inputs	Treated as ±0; output never NaN
±Inf inputs	Treated as ±0; output saturates to max finite normal
Subnormals	Flushed to zero on input and output
Overflow	Saturates to `±0x7F7FFFFF` (max finite normal)

funct3	Mnemonic	Operation	Ticks
000	`fadd`	`rd[i] = rs1[i] + rs2[i]`	1
001	`fsub`	`rd[i] = rs1[i] − rs2[i]`	1
010	`fmul`	`rd[i] = rs1[i] × rs2[i]`	1
011	`fneg`	`rd[i] = −rs1[i]` (sign flip)	1
100	`fabs`	`rd[i] = \|rs1[i]\|`	1
101	`fmax`	`rd[i] = max(rs1[i], rs2[i])`	1
110	`fmin`	`rd[i] = min(rs1[i], rs2[i])`	1

VALU_FP_FMA — Fused multiply-add (S-format, 2 ticks)¶

Opcode family 0x17. S-format: rs3 at bits [31:27], round mode at bits [26:25]. Result always in VR.

\[\text{fma}: \quad rd_i = rs1_i \times rs2_i + rs3_i\]

funct3	Mnemonic	Operation
000	`fma`	`rd[i] = (rs1[i] × rs2[i]) + rs3[i]`
001	`fms`	`rd[i] = (rs1[i] × rs2[i]) − rs3[i]`
010	`nfma`	`rd[i] = −(rs1[i] × rs2[i]) + rs3[i]`
011	`nfms`	`rd[i] = −(rs1[i] × rs2[i]) − rs3[i]`

BF16 and BF8 Encoding¶

BF16 (Brain Float 16)¶

BF16 is the top 16 bits of an IEEE FP32 word — same sign and exponent, truncated mantissa.

FP32:  S EEEEEEEE MMMMMMM MMMMMMMM MMMMMMMM  (32 bits)
BF16:  S EEEEEEEE MMMMMMM                    (16 bits — top half)

vcvt_bf16_f32: adds 16 zero bits in the low half → lossless exponent, truncated mantissa.
vcvt_f32_bf16: removes the low 16 bits (RNE: adds 0x8000 before truncating).

BF8 — two variants¶

Format	S	Exp	Man	Bias	Max value
E4M3	1	4	3	7	≈ 448
E5M2	1	5	2	15	≈ 57 344

Selected by funct7[6] in CVT instructions: 0 = E4M3 (activations), 1 = E5M2 (weights/gradients).

Timing Summary¶

Backend Integration¶

sequenceDiagram
    participant FE as Frontend / Test
    participant DEC as InstrDecoder
    participant RF as MultiWidthRegisterBlock
    participant VU as VALU
    participant MMA as MMALU

    FE->>DEC: 32-bit instruction word
    DEC-->>FE: DecodedMicroOp (combinational)

    DEC->>RF: rd/rs1/rs2/rs3 addresses
    RF-->>VU: in_a_vx/ve/vr · in_b_vx/ve/vr · in_c_vr

    DEC-->>VU: NCoreVALUBundle (op · regCls · dtype · sat · round · imm)

    Note over VU: cycle 1: combinational compute
    VU-->>VU: out_vx/ve/vr registers latch
    Note over VU: cycle 2: outputs valid

    VU->>RF: write VX/VE/VR bank 0 (VALU result)
    MMA->>RF: write VR bank 1 (INT32 accumulator — no truncation)

Write-back timing (2-cycle hold)

Because VALU has a 1-cycle output register, the backend holds the decoded op active for 2 clock cycles: cycle 1 latches the result, cycle 2 fires the write-back. A production frontend can pipeline this with a 1-cycle stall or forwarding network.

MMALU → VR direct path

MMALU's 4N-bit accumulator (Vec(K, SInt(4N))) is wired directly to VR write port 1 in NCoreBackend. No INT8 truncation occurs. This is the path that enables INT32 quantization: the full accumulator is available in VR for subsequent vcvt_f32_s32 and vfma operations.

Activation Functions via Primitives¶

See Quantization Pipeline for the full worked example including register allocation.

Softmax (K lanes, SQ1.6 input)¶

\[\text{softmax}(x_i) = \frac{e^{x_i - \max_j x_j}}{\sum_j e^{x_j - \max_j x_j}}\]

flowchart LR
    X["VX: x[K] (SQ1.6)"]
    X --> A["vrmax → VR\n(broadcast max)\n1 tick"]
    A --> B
    X --> B["vsub.sat → VX\nx' = x - max\n1 tick"]
    B --> C["vlut (exp, bank A) → VX\ne = exp(x') UQ0.8\n1 tick"]
    C --> D["vsum → VR\nΣ = sum(e)\n1 tick"]
    D --> E["vlut (recip, bank B) → VX\nr = 1/Σ (SQ1.6)\n1 tick"]
    C --> F["vmul → VX\np = e × r\n1 tick"]
    E --> F
    F --> G["softmax(x)\n6 ticks total"]

GELU approximation (K lanes, SQ1.6 input)¶

\[\text{GELU}(x) \approx 0.5 \cdot x \cdot \bigl(1 + \text{erf}(x/\sqrt{2})\bigr)\]

flowchart LR
    X["VX: x[K] (SQ1.6)"]
    X --> A["vsra by 1\n≈ x/√2\n1 tick → VX"]
    A --> B["vlut (erf, bank A) → VX\nerf(x/√2) SQ1.6\n1 tick"]
    B --> C["vadd 64\n1+erf(·)\n1 tick → VX"]
    X --> D["vmul → VR\nx·(1+erf)\n1 tick"]
    C --> D
    D --> E["vsra by 7\n÷128\n1 tick → VX"]
    E --> F["GELU(x) approx\n5 ticks total"]

Implementation Notes¶

Programmable LUT banks: the VALU holds two 256-byte banks written at runtime via vsetlut. The Qfmt object (lutExp/lutRecip/lutTanh/lutErf) provides Scala-side table data for loading into banks before simulation; no static hardware ROM is synthesised.
UQ0.8 sign when using exp LUT bank: out_vx is UInt(N.W), so values 0–255 are unsigned. The UQ0.8 value 255 (exp(0)≈1.0) is fully representable. Signed reinterpretation is only needed if the caller treats out_vx as SInt.
regCls field: the register-class selector inside NCoreVALUBundle is named regCls (not width) to avoid a Chisel plugin naming conflict with chisel3.Width. Every funct7[1:0] decodes to regCls in hardware.
VecOp enum width: VecOp values go up to 0x45 (= 69), requiring 7-bit width. If you add new entries, verify the maximum still fits in 7 bits.
Per-lane shift amount: vsll/vsra/etc. use the low log2(lane_width) bits of the corresponding in_b lane, enabling heterogeneous per-lane shifts within a single instruction.
FP32 fadd32 normalization: leading-1 detection uses PriorityEncoder(Reverse(raw(24,0))). The returned value is the position of the highest set bit in the reversed vector, which equals 24 − position_of_highest_bit_in_raw. The exponent adjustment is (24 − lzFromTop) − 23.