Instruction Set Architecture
This NPU targets edge SoC integration, so the ISA is designed to be compact and tile-friendly. We tile GEMM+activation work across multiple pipeline stages inside the core, inspired by OpenTPU and systolic tiling literature. The FP32/BF16/BF8 conversion and VALU families enable post-GEMM quantization entirely in hardware without round-tripping through host memory.
Notation
These three symbols appear throughout all ISA, VALU, register-file, and backend code. Confusing them is the primary source of errors.
| Symbol | Meaning | Test default | Top default |
|---|---|---|---|
| N (spoken: N(bits)) | Base lane width in bits. Matches MMALU nbits. Always written N(bits) in prose. | 8 | 8 |
| L | Number of base VX registers. Must be divisible by 4 for VE/VR aliasing. | 32 | 32 |
| K | SIMD lane count per register. Equals MMALU array-side n at the backend boundary. | 8 | 64 |
Physical register file total = L × K × N/8 bytes = 256 B (test) / 2 KiB (top).
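The size formula above can be checked with a few lines of Scala. This is a minimal sketch; `RegFileSize` and its parameter names are hypothetical and do not come from the NPU sources.

```scala
// Hypothetical helper for illustration; not part of the NPU sources.
object RegFileSize {
  /** Total bytes = L registers x K lanes x (N / 8) bytes per lane. */
  def bytes(l: Int, k: Int, nBits: Int): Int = {
    require(l % 4 == 0, "L must be divisible by 4 for VE/VR aliasing")
    require(nBits % 8 == 0, "this sketch assumes N(bits) is a multiple of 8")
    l * k * (nBits / 8)
  }
}
```

Calling `RegFileSize.bytes(32, 8, 8)` models the test configuration and `RegFileSize.bytes(32, 64, 8)` the top configuration.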
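Instruction Word Layout
All instructions are 32-bit words. Three formats are defined. Bit 31 is the MSB; fields are shown MSB → LSB (bit 31 on the left), with bit positions taken from the field definitions below.
R-type — register-register operations
| [31:25] | [24:20] | [19:15] | [14:12] | [11:7] | [6:0] |
|---|---|---|---|---|---|
| funct7 | rs2 | rs1 | funct3 | rd | opcode |

I-type — register + immediate (e.g. bcast.imm, ld)
| [31:20] | [19:15] | [14:12] | [11:7] | [6:0] |
|---|---|---|---|---|
| imm[11:0] | rs1 | funct3 | rd | opcode |

The 12-bit immediate is sign-extended to the lane width.
S-type — three-source FMA (VALU_FP_FMA only)
| [31:27] | [26:25] | [24:20] | [19:15] | [14:12] | [11:7] | [6:0] |
|---|---|---|---|---|---|---|
| rs3 | rnd | rs2 | rs1 | funct3 | rd | opcode |

rnd is the rounding mode for this instruction only. rs3 is the addend for FMA.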
Field definitions
| Field | Bits | Description |
|---|---|---|
| opcode | [6:0] | Functional family (7 bits) |
| rd | [11:7] | Destination register index (5 bits; VE uses [3:0]; VR uses [2:0]) |
| funct3 | [14:12] | Sub-operation within the family (3 bits) |
| rs1 | [19:15] | Source register 1 (5 bits) |
| rs2 | [24:20] | Source register 2 (5 bits) |
| funct7 | [31:25] | Attribute field: width/round/sat/dtype (7 bits) |
| imm[11:0] | [31:20] | Signed 12-bit immediate (I-type) |
| rnd | [26:25] | Rounding mode for FMA (S-type) |
| rs3 | [31:27] | Third source register for FMA (S-type) |
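The field positions in the table translate directly into shift-and-OR packing. The sketch below is an illustration only: `EncodeSketch` and its method names are hypothetical, and the real `NpuAssembler` may be structured differently.

```scala
// Hypothetical encoder sketch built only from the field table above.
object EncodeSketch {
  // Range-check an unsigned field, then move it to its bit position.
  private def field(value: Int, width: Int, shift: Int): Int = {
    require(value >= 0 && value < (1 << width), s"value $value does not fit in $width bits")
    value << shift
  }

  // Fields shared by all three formats: rs1, funct3, rd, opcode.
  private def common(opcode: Int, rd: Int, funct3: Int, rs1: Int): Int =
    field(rs1, 5, 15) | field(funct3, 3, 12) | field(rd, 5, 7) | field(opcode, 7, 0)

  def rType(opcode: Int, rd: Int, funct3: Int, rs1: Int, rs2: Int, funct7: Int): Int =
    field(funct7, 7, 25) | field(rs2, 5, 20) | common(opcode, rd, funct3, rs1)

  def iType(opcode: Int, rd: Int, funct3: Int, rs1: Int, imm: Int): Int = {
    require(imm >= -2048 && imm <= 2047, "imm must fit in a signed 12-bit field")
    ((imm & 0xFFF) << 20) | common(opcode, rd, funct3, rs1)
  }

  def sType(opcode: Int, rd: Int, funct3: Int, rs1: Int, rs2: Int, rnd: Int, rs3: Int): Int =
    field(rs3, 5, 27) | field(rnd, 2, 25) | field(rs2, 5, 20) | common(opcode, rd, funct3, rs1)
}
```

For example, `EncodeSketch.rType(0x10, rd = 0, funct3 = 0, rs1 = 1, rs2 = 2, funct7 = 0)` packs a VALU_ARITH add with all attribute bits cleared.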
funct7 Attribute Field (R-type)
| Sub-field | Bits in funct7 | Values |
|---|---|---|
| width | [1:0] | 00=VX (N bits) · 01=VE (2N bits) · 10=VR (4N bits) · 11=reserved |
| round | [3:2] | 00=RNE · 01=RTZ · 10=floor · 11=ceil |
| sat | [4] | 0=wrap · 1=saturate (arithmetic ops only) |
| dtype | [6:5] | 00=INT · 01=FP · 10=BF · 11=reserved |
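As a sketch of how the four sub-fields combine into one 7-bit value, the helper below packs them in the order given by the table. `Funct7Attr` and its constant names are hypothetical; the encodings are the ones listed above.

```scala
// Hypothetical packing helper for the R-type funct7 attribute field.
object Funct7Attr {
  val VX = 0; val VE = 1; val VR = 2                  // width codes, funct7[1:0]
  val RNE = 0; val RTZ = 1; val Floor = 2; val Ceil = 3 // round codes, funct7[3:2]
  val INT = 0; val FP = 1; val BF = 2                 // dtype codes, funct7[6:5]

  def pack(width: Int, round: Int, sat: Boolean, dtype: Int): Int = {
    require(width >= 0 && width <= 2, "width code 11 is reserved")
    require(round >= 0 && round <= 3, "round is a 2-bit field")
    require(dtype >= 0 && dtype <= 2, "dtype code 11 is reserved")
    (dtype << 5) | ((if (sat) 1 else 0) << 4) | (round << 2) | width
  }
}
```

A saturating integer operation on VR with RNE rounding would then use `Funct7Attr.pack(Funct7Attr.VR, Funct7Attr.RNE, sat = true, Funct7Attr.INT)`.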
funct7 for VALU_CVT (different layout)
CVT repurposes funct7 to carry the source format, saturation flag, rounding mode, and BF8 variant:
| Sub-field | Bits | Meaning |
|---|---|---|
| src fmt | [2:0] | Source format code (see FmtCode table below) |
| sat | [3] | Saturate on output narrowing |
| round | [5:4] | Rounding mode |
| BF8 var | [6] | 0=E4M3 · 1=E5M2 |
CVT Format Codes
Used in both funct3 (destination) and funct7[2:0] (source):
| Code | Format | Width | Register class |
|---|---|---|---|
| 000 | s8 | N bits | VX |
| 001 | s16 | 2N bits | VE |
| 010 | s32 | 4N bits | VR |
| 011 | f32 | 4N bits | VR |
| 100 | bf16 | 2N bits | VE |
| 101 | bf8 (variant from funct7[6]) | N bits | VX |
CVT naming convention
Mnemonics follow vcvt_<dst>_<src>.
vcvt_s8_f32 = FP32 input → INT8 output (narrow, goes to VX).
vcvt_f32_s8 = INT8 input → FP32 output (wide, goes to VR).
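Because the destination comes first in the mnemonic, it is easy to swap the two codes by accident. The sketch below derives funct3 and the CVT funct7 value from a `vcvt_<dst>_<src>` mnemonic using the format-code and funct7 tables above. `CvtSketch` and its parameter names are hypothetical.

```scala
// Hypothetical helper: mnemonic -> (funct3, funct7) for VALU_CVT.
object CvtSketch {
  // Format codes from the CVT Format Codes table.
  val fmtCode: Map[String, Int] =
    Map("s8" -> 0, "s16" -> 1, "s32" -> 2, "f32" -> 3, "bf16" -> 4, "bf8" -> 5)

  def fields(mnemonic: String, round: Int = 0, sat: Boolean = false,
             e5m2: Boolean = false): (Int, Int) = {
    val parts = mnemonic.split("_")
    require(parts.length == 3 && parts(0) == "vcvt", "expected vcvt_<dst>_<src>")
    require(round >= 0 && round <= 3, "round is a 2-bit field")
    val dst = fmtCode(parts(1)) // destination format goes in funct3
    val src = fmtCode(parts(2)) // source format goes in funct7[2:0]
    require(dst != src, "src == dst is illegal")
    val funct7 =
      ((if (e5m2) 1 else 0) << 6) | (round << 4) | ((if (sat) 1 else 0) << 3) | src
    (dst, funct7)
  }
}
```

`CvtSketch.fields("vcvt_s8_f32", sat = true)` yields destination code 000 (s8) with source code 011 (f32) and the saturate bit set.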
Opcode Family Map
opcode selects one of 13 functional families. Reserved codes (0x04–0x0F, 0x19–0x7F)
are detected by the decoder as illegal instructions.
```mermaid
graph TD
    ISA["7-bit Opcode Space<br/>13 active families"]
    ISA --> MEM["Memory & Control<br/>NOP 0x00 · LD 0x01 · ST 0x02"]
    ISA --> MMA["Matrix Multiply-Accumulate<br/>MMA 0x03"]
    ISA --> INT["Integer Vector<br/>ARITH 0x10 · LOGIC 0x11<br/>REDUCE 0x12 · LUT 0x13 (vlut/vsetlut)"]
    ISA --> CVT["Type Conversion<br/>CVT 0x14"]
    ISA --> BCAST["Broadcast<br/>BCAST 0x15"]
    ISA --> FP["Floating-Point<br/>FP 0x16 · FP_FMA 0x17"]
    ISA --> MOV["Move (proposed)<br/>MOV 0x18 ⚠"]
```
| Family | Opcode | Format | funct3 sub-ops |
|---|---|---|---|
| NOP | 0x00 | — | (none) |
| LD | 0x01 | I | 000=byte · 001=half · 010=word · 011=VX · 100=VE · 101=VR |
| ST | 0x02 | I | same as LD |
| MMA | 0x03 | R | 000=mma · 001=mma.last · 010=mma.reset |
| VALU_ARITH | 0x10 | R | 000=add · 001=sub · 010=mul · 011=neg · 100=abs · 101=max · 110=min · 111=rsub |
| VALU_LOGIC | 0x11 | R | 000=sll · 001=srl · 010=sra · 011=rol · 100=xor · 101=not · 110=or · 111=and |
| VALU_REDUCE | 0x12 | R | 000=sum · 001=rmax · 010=rmin · 011=rand · 100=ror · 101=rxor |
| VALU_LUT | 0x13 | R / I | 000=vlut.A (R) · 001=vlut.B (R) · 100=vsetlut.A (I) · 101=vsetlut.B (I) · 010/011/110/111=reserved |
| VALU_CVT | 0x14 | R | funct3=dst fmt (see CVT table) |
| VALU_BCAST | 0x15 | R / I | 000=bcast.reg (R) · 001=bcast.imm (I) |
| VALU_FP | 0x16 | R | 000=fadd · 001=fsub · 010=fmul · 011=fneg · 100=fabs · 101=fmax · 110=fmin |
| VALU_FP_FMA | 0x17 | S | 000=fma · 001=fms · 010=nfma · 011=nfms |
| VALU_MOV ⚠ | 0x18 | R / I | 000=mov (R) · 001=movi (I) · 010=movh (I) · agent-added, unverified |
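The opcode map can be expressed as a lookup that treats every unlisted code as illegal. This is a software sketch of the rule stated above, not the decoder's actual implementation; `OpcodeMap` is a hypothetical name.

```scala
// Hypothetical software model of the opcode family lookup.
object OpcodeMap {
  val families: Map[Int, String] = Map(
    0x00 -> "NOP", 0x01 -> "LD", 0x02 -> "ST", 0x03 -> "MMA",
    0x10 -> "VALU_ARITH", 0x11 -> "VALU_LOGIC", 0x12 -> "VALU_REDUCE",
    0x13 -> "VALU_LUT", 0x14 -> "VALU_CVT", 0x15 -> "VALU_BCAST",
    0x16 -> "VALU_FP", 0x17 -> "VALU_FP_FMA",
    0x18 -> "VALU_MOV" // proposed extension, unverified
  )

  /** The opcode lives in bits [6:0] of the instruction word. */
  def family(instr: Int): Option[String] = families.get(instr & 0x7F)

  /** Any code without a family entry is an illegal instruction. */
  def isLegal(instr: Int): Boolean = family(instr).isDefined
}
```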
Family Reference
VALU_ARITH — Elementwise arithmetic on VX / VE / VR
Width selected by funct7[1:0]. Saturation controlled by funct7[4].
Applies to all integer lane widths (N, 2N, 4N bits).
| funct3 | Mnemonic | Operation |
|---|---|---|
| 000 | add | rd[i] = rs1[i] + rs2[i] |
| 001 | sub | rd[i] = rs1[i] − rs2[i] |
| 010 | mul | rd[i] = rs1[i] × rs2[i] (narrow sat on out_vx; full product on out_vr) |
| 011 | neg | rd[i] = −rs1[i] (rs2 unused) |
| 100 | abs | rd[i] = \|rs1[i]\| (rs2 unused) |
| 101 | max | rd[i] = max(rs1[i], rs2[i]) |
| 110 | min | rd[i] = min(rs1[i], rs2[i]) |
| 111 | rsub | rd[i] = rs2[i] − rs1[i] (reverse subtract; useful after bcast) |
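The difference between the wrap and saturate settings of funct7[4] is easiest to see on a lane that overflows. The following is a behavioural sketch for signed lanes, assuming two's-complement wraparound when sat=0; `ArithModel` is a hypothetical name and the hardware may differ in detail.

```scala
// Hypothetical reference model of a lane-wise add with wrap or saturate.
object ArithModel {
  /** Clamp v into the signed range of a `bits`-wide lane. */
  def clampSigned(v: Long, bits: Int): Long = {
    val hi = (1L << (bits - 1)) - 1
    val lo = -(1L << (bits - 1))
    math.max(lo, math.min(hi, v))
  }

  /** Keep the low `bits` bits of v and sign-extend them (two's-complement wrap). */
  def wrapSigned(v: Long, bits: Int): Long = {
    val shift = 64 - bits
    (v << shift) >> shift
  }

  def add(rs1: Seq[Long], rs2: Seq[Long], bits: Int, sat: Boolean): Seq[Long] =
    rs1.zip(rs2).map { case (a, b) =>
      if (sat) clampSigned(a + b, bits) else wrapSigned(a + b, bits)
    }
}
```

With 8-bit lanes, adding 100 + 100 gives 127 when saturating and −56 when wrapping.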
VALU_LOGIC — Bitwise and shift on VX / VE / VR
Operates on raw bit patterns. sat and round are ignored.
Shift amount = rs2[i][ log2(lane_width)−1 : 0 ] (low bits of each lane of rs2).
| funct3 | Mnemonic | Operation | RV parallel |
|---|---|---|---|
| 000 | sll | logical left shift | sll |
| 001 | srl | logical right shift | srl |
| 010 | sra | arithmetic right shift (sign-extending) | sra |
| 011 | rol | rotate left by 1 (heterogeneous per-lane) | — |
| 100 | xor | bitwise XOR | xor |
| 101 | not | bitwise NOT (rs2 unused) | — |
| 110 | or | bitwise OR | or |
| 111 | and | bitwise AND | and |
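The shift-amount rule means only the low log2(lane_width) bits of each rs2 lane matter, so a lane value of 10 shifts an 8-bit lane by 2. A small sketch for 8-bit lanes, with the hypothetical name `ShiftModel`:

```scala
// Hypothetical model of per-lane shift-amount extraction for VALU_LOGIC.
object ShiftModel {
  /** Low log2(laneWidth) bits of the rs2 lane; laneWidth must be a power of two. */
  def shamt(rs2Lane: Int, laneWidth: Int): Int = {
    require(laneWidth > 0 && (laneWidth & (laneWidth - 1)) == 0,
      "lane width must be a power of two")
    rs2Lane & (laneWidth - 1)
  }

  // 8-bit lanes (N = 8). Bytes are widened to Int before shifting.
  def sra8(a: Byte, rs2Lane: Byte): Byte = (a >> shamt(rs2Lane, 8)).toByte
  def srl8(a: Byte, rs2Lane: Byte): Byte = ((a & 0xFF) >>> shamt(rs2Lane, 8)).toByte
}
```

The same input lane gives different results for sra and srl once the sign bit is set, which is the case the table distinguishes.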
VALU_REDUCE — Horizontal reduction, broadcast result
Result is broadcast to all K lanes of out_vr.
Operates on VX lanes (sign-extended to 4N bits for the accumulation tree).
| funct3 | Mnemonic | Operation |
|---|---|---|
| 000 | sum | Σ rs1[i] → broadcast to all K lanes |
| 001 | rmax | max(rs1[i]) → broadcast |
| 010 | rmin | min(rs1[i]) → broadcast |
| 011 | rand | AND(rs1[i]) → broadcast |
| 100 | ror | OR(rs1[i]) → broadcast |
| 101 | rxor | XOR(rs1[i]) → broadcast |
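A behavioural sketch of the reduce-then-broadcast pattern, assuming N = 8 so that VX lanes are signed bytes and the 4N-bit result fits a Scala Int. `ReduceModel` is a hypothetical name.

```scala
// Hypothetical reference model: reduce K signed 8-bit lanes, broadcast the scalar.
object ReduceModel {
  private def broadcast(scalar: Int, k: Int): Seq[Int] = Seq.fill(k)(scalar)

  // Each lane is sign-extended to Int (4N = 32 bits) before reduction.
  def vsum(vx: Seq[Byte]): Seq[Int] = broadcast(vx.map(_.toInt).sum, vx.length)
  def vrmax(vx: Seq[Byte]): Seq[Int] = broadcast(vx.map(_.toInt).max, vx.length)
  def vrxor(vx: Seq[Byte]): Seq[Int] = broadcast(vx.map(_.toInt).reduce(_ ^ _), vx.length)
}
```

Because the sum is carried at 4N bits, two lanes of 127 add to 254 without saturating, which is why the result lands in out_vr rather than out_vx.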
VALU_LUT — Programmable two-bank 256-entry lookup table (VX only)
The LUT family provides two independently programmable 256-byte banks (bank A and bank B).
Each bank holds 256 one-byte entries. The LUT is not a fixed ROM: entries must be
written before use with vsetlut, then queried per-lane with vlut.
vlut — per-lane lookup (R-type, 1 tick)
rd[i] = lut_bank[in_a_vx[i].asUInt]
rs2 is unused. The raw 8-bit unsigned value of each in_a_vx lane is the LUT index.
Output goes to out_vx. Bank selected by funct3[0] (propagated as round[0] in the
decoded bundle).
| funct3 | Mnemonic | Bank | Format |
|---|---|---|---|
| 000 | vlut | A | R-type |
| 001 | vlut | B | R-type |
vsetlut — write LUT segment (I-type, no RF write)
Writes one segment of K×4 bytes from VR[rs1] into the selected bank.
imm = segment index; segment s covers LUT entries [s×K×4 .. (s+1)×K×4 − 1].
At K=8: one VR holds 32 bytes → 8 vsetlut calls fill a full 256-byte bank.
At K=64: one VR holds 256 bytes → 1 vsetlut call fills a full 256-byte bank.
No register-file write occurs; the operation is a side-effect on VALU-internal state only.
| funct3 | Mnemonic | Bank | Format |
|---|---|---|---|
| 100 | vsetlut | A | I-type |
| 101 | vsetlut | B | I-type |
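The segment arithmetic and the unsigned indexing used by vlut can be sketched as follows. This assumes N = 8 (so one VR lane is 4 bytes); `LutModel` and its members are hypothetical names for illustration.

```scala
// Hypothetical software model of LUT segment addressing and per-lane lookup.
object LutModel {
  val LutEntries = 256

  /** Bytes written by one vsetlut: K lanes x 4 bytes per VR lane (N = 8). */
  def segmentBytes(k: Int): Int = k * 4

  /** Number of vsetlut calls needed to fill one bank. */
  def segmentsPerBank(k: Int): Int = LutEntries / segmentBytes(k)

  /** LUT entries covered by segment s: [s*K*4, (s+1)*K*4 - 1]. */
  def entryRange(segment: Int, k: Int): Range =
    (segment * segmentBytes(k)) until ((segment + 1) * segmentBytes(k))

  /** vlut: each lane's raw byte, read as unsigned, indexes the bank. */
  def vlut(bank: IndexedSeq[Byte], vx: Seq[Byte]): Seq[Byte] = {
    require(bank.length == LutEntries, "a bank holds 256 one-byte entries")
    vx.map(lane => bank(lane & 0xFF))
  }
}
```

Note the `& 0xFF` in `vlut`: a lane holding −1 addresses entry 255, not a negative index.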
Reserved funct3 values
funct3 010, 011, 110, 111 are reserved and flagged as illegal by the decoder.
Assembler helpers
NpuAssembler.vlut(rd, rs1, bank) and NpuAssembler.vsetlut(rs1, segment, bank)
produce the correct 32-bit encodings. bank=0 selects bank A; bank=1 selects bank B.
VALU_CVT — Type conversion
funct3 = destination format code. funct7[2:0] = source format code.
Illegal if src == dst. Width of output register class is determined by the destination format.
| Mnemonic | src fmt | dst fmt | Input reg | Output reg | Ticks |
|---|---|---|---|---|---|
| vcvt_s8_s32 | s32 (VR) | s8 (VX) | VR | VX | 1 |
| vcvt_s32_s8 | s8 (VX) | s32 (VR) | VX | VR | 1 |
| vcvt_f32_s32 | s32 (VR) | f32 (VR) | VR | VR | 1 |
| vcvt_s32_f32 | f32 (VR) | s32 (VR) | VR | VR | 1–2 |
| vcvt_f32_s8 | s8 (VX) | f32 (VR) | VX | VR | 1 |
| vcvt_s8_f32 | f32 (VR) | s8 (VX) | VR | VX | 1–2 |
| vcvt_f32_bf16 | bf16 (VE) | f32 (VR) | VE | VR | 1 |
| vcvt_bf16_f32 | f32 (VR) | bf16 (VE) | VR | VE | 1 |
| vcvt_f32_bf8 | bf8 (VX) | f32 (VR) | VX | VR | 1 |
| vcvt_bf8_f32 | f32 (VR) | bf8 (VX) | VR | VX | 1 |
| vcvt_s16_s32 | s32 (VR) | s16 (VE) | VR | VE | 1 |
| vcvt_s32_s16 | s16 (VE) | s32 (VR) | VE | VR | 1 |
VALU_BCAST — Scalar broadcast to all K lanes
| funct3 | Format | Mnemonic | Operation |
|---|---|---|---|
| 000 | R | bcast.reg | rd[i] = rs1[0] for all i; width from funct7[1:0] |
| 001 | I | bcast.imm | rd[i] = sext(imm[11:0]) for all i; width=VX always |
bcast.reg broadcasts lane 0 of rs1 to all K output lanes. Used to splat a scale or zero-point constant prior to vfma.
VALU_FP — FP32 arithmetic (VR only)
Operands and result are always in VR (K lanes of 32-bit FP). Width implicit (always VR), dtype implicit (always FP). Tier-2 FP32 constraints: RNE rounding; NaN/Inf inputs treated as zero; overflow saturates to max finite normal; subnormals flushed to zero.
| funct3 | Mnemonic | Operation |
|---|---|---|
| 000 | fadd | rd[i] = rs1[i] + rs2[i] |
| 001 | fsub | rd[i] = rs1[i] − rs2[i] |
| 010 | fmul | rd[i] = rs1[i] × rs2[i] |
| 011 | fneg | rd[i] = −rs1[i] (sign-bit flip; no FP computation) |
| 100 | fabs | rd[i] = \|rs1[i]\| (sign-bit clear) |
| 101 | fmax | rd[i] = max(rs1[i], rs2[i]) |
| 110 | fmin | rd[i] = min(rs1[i], rs2[i]) |
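The Tier-2 FP32 constraints listed for this family can be approximated in software on top of IEEE 754 single precision, which already rounds to nearest-even. The sketch below is one reading of those constraints: it assumes subnormals are flushed on both inputs and results, and that flushed values become positive zero. Neither detail is stated in this document, and `FpTier2` is a hypothetical name.

```scala
// Hypothetical per-lane model of the Tier-2 FP32 constraints.
object FpTier2 {
  private val MinNormal: Float = java.lang.Float.MIN_NORMAL

  // Assumption: magnitudes below the smallest normal become +0.
  private def flush(x: Float): Float = if (math.abs(x) < MinNormal) 0.0f else x

  // NaN/Inf inputs are treated as zero; subnormal inputs are flushed.
  private def cleanInput(x: Float): Float =
    if (x.isNaN || x.isInfinity) 0.0f else flush(x)

  // Overflow saturates to the largest finite value of the same sign.
  private def cleanResult(r: Float): Float =
    if (r.isInfinity) (if (r > 0) Float.MaxValue else -Float.MaxValue)
    else flush(r)

  def fadd(a: Float, b: Float): Float = cleanResult(cleanInput(a) + cleanInput(b))
  def fmul(a: Float, b: Float): Float = cleanResult(cleanInput(a) * cleanInput(b))
}
```

A model like this is mainly useful as a golden reference in tests, where a plain `a + b` on the host would propagate NaN and Inf instead of zeroing them.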
VALU_FP_FMA — Fused multiply-add (VR, S-format, 2 ticks)
rd = rs1 × rs2 + rs3 (and variants). S-format encodes rs3 at bits [31:27] and round mode at [26:25].
| funct3 | Mnemonic | Operation |
|---|---|---|
| 000 | fma | rd[i] = (rs1[i] × rs2[i]) + rs3[i] |
| 001 | fms | rd[i] = (rs1[i] × rs2[i]) − rs3[i] |
| 010 | nfma | rd[i] = −(rs1[i] × rs2[i]) + rs3[i] |
| 011 | nfms | rd[i] = −(rs1[i] × rs2[i]) − rs3[i] |
VALU_MOV — Register copy and immediate load
Agent-added family — not in original design
VALU_MOV (opcode 0x18, funct3 000/001/010) was added by the agent during
the implementation phase. It has no test coverage and was not part of the original
hardware design spec. Treat it as a proposed extension pending review.
For loading constants into registers, prefer bcast.imm (opcode 0x15, funct3 001)
which is tested and verified.
| funct3 | Format | Mnemonic | Operation |
|---|---|---|---|
| 000 | R | mov | rd = rs1, width from funct7[1:0] |
| 001 | I | movi | rd[0] = sext(imm); other lanes unchanged |
| 010 | I | movh | rd[0][2N-1:N] = imm[N-1:0]; low N bits unchanged (useful to build 2N-bit constants) |
MMA — Matrix Multiply-Accumulate
rd = destination VR base index; rs1 = A operand VX base; rs2 = B operand VX base.
The keep signal (from funct7[4], i.e. the sat bit) controls accumulation.
| funct3 | Mnemonic | Operation |
|---|---|---|
| 000 | mma | Start accumulate. keep high to add; low to reset PE. |
| 001 | mma.last | Assert clct; collect final diagonal result into VR. |
| 010 | mma.reset | Clear all PE accumulators. |
MMALU output goes directly to VR — no INT8 truncation. The 4N-bit accumulator is preserved intact.
Instruction Timing
1-tick ops (all VALU except FMA and rounding CVT)
VALU instructions in the ARITH, LOGIC, REDUCE, LUT, BCAST, FP, and MOV families, plus CVT ops that need no rounding, have a 1-tick output register stage. The result appears on out_vx / out_ve / out_vr one clock edge after issue.
2-tick: vfma and rounding CVT ops
vfma performs a multiply then an add, requiring two clock edges.
CVT ops that require rounding logic (e.g. vcvt_s32_f32, vcvt_s8_f32) can also take 2 ticks; the VALU_CVT table lists them as 1–2.
Reduction ops (1-tick, broadcast to all K lanes)
vsum and vrmax reduce all K input lanes combinationally, then broadcast the scalar result to every lane of out_vr.
MMA: 3K−2 tick pipeline
For a K×K systolic array, input vectors are consumed for the first K ticks; the first output column appears at tick 2K−1 and the last at tick 3K−2.
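To make the MMA pipeline figures concrete for the two documented configurations, the tick formulas can be evaluated directly. `MmaTiming` is a hypothetical helper, not part of the NPU sources.

```scala
// Hypothetical helper evaluating the MMA pipeline formulas for a K x K array.
object MmaTiming {
  /** Ticks during which input vectors are consumed. */
  def inputTicks(k: Int): Int = k

  /** Tick at which the first output column appears. */
  def firstOutputTick(k: Int): Int = 2 * k - 1

  /** Tick at which the last output column appears (total pipeline length). */
  def lastOutputTick(k: Int): Int = 3 * k - 2
}
```

For the test configuration (K=8) the first column appears at tick 15 and the last at tick 22; for the top configuration (K=64) the corresponding ticks are 127 and 190.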
Activation Function Software Sequences
Activation functions are not hardware opcodes; they are composed from VALU primitives. See Quantization Pipeline for the full worked example.
| Function | Instruction sequence | Total ticks |
|---|---|---|
| ReLU | vmax.VX 0 (vs. zero bcast) | 2 |
| Clamp | vmin → vmax | 2 |
| Tanh | vlut bank A (pre-loaded tanh table) | 1 |
| GELU (approx) | vsra → vlut (erf table, bank A) → vadd → vmul → vsra | 5 |
| Softmax (K lanes) | vrmax → vsub → vlut (exp, bank A) → vsum → vlut (recip, bank B) → vmul | 6 |
| Quantize (post-MMA) | mma.last → vcvt_f32_s32 → vfma → vcvt_s8_f32 | 3K+5 |
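The two simplest rows of the table, ReLU and Clamp, can be modelled lane by lane to show how they decompose into broadcast, max, and min primitives. This is a behavioural sketch only; `ActivationSketch` and its helper names are hypothetical, and lanes are modelled as plain Ints rather than N-bit values.

```scala
// Hypothetical lane-wise model of ReLU and Clamp built from VALU-style primitives.
object ActivationSketch {
  def bcastImm(imm: Int, k: Int): Seq[Int] = Seq.fill(k)(imm)
  def vmax(a: Seq[Int], b: Seq[Int]): Seq[Int] = a.zip(b).map { case (x, y) => math.max(x, y) }
  def vmin(a: Seq[Int], b: Seq[Int]): Seq[Int] = a.zip(b).map { case (x, y) => math.min(x, y) }

  /** ReLU: broadcast zero, then take the lane-wise max (2 steps). */
  def relu(x: Seq[Int]): Seq[Int] = vmax(x, bcastImm(0, x.length))

  /** Clamp: vmin against the upper bound, then vmax against the lower bound. */
  def clamp(x: Seq[Int], lo: Int, hi: Int): Seq[Int] =
    vmax(vmin(x, bcastImm(hi, x.length)), bcastImm(lo, x.length))
}
```

The 2-tick Clamp figure in the table covers only the vmin and vmax steps, so this model assumes the bound vectors were broadcast beforehand.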
Softmax flow
```mermaid
flowchart LR
    X["x[K] in VX\n(SQ1.6)"]
    X --> A["vrmax\nmax=max(x)\n1 tick\n→ out_vr"]
    X --> B["vsub.sat\nx'=x-max\n1 tick\n→ out_vx"]
    A --> B
    B --> C["vlut (exp, bank A)\ne=exp(x')\n1 tick\n→ out_vx"]
    C --> D["vsum\nΣ=sum(e)\n1 tick\n→ out_vr"]
    D --> E["vlut (recip, bank B)\nr=1/Σ\n1 tick\n→ out_vx"]
    C --> F["vmul\np=e×r\n1 tick\n→ out_vx"]
    E --> F
    F --> G["softmax(x)\n6 ticks total"]
```
GELU flow
```mermaid
flowchart LR
    X["x[K] in VX\n(SQ1.6)"]
    X --> A["vsra by 1\n≈ x/√2\n1 tick"]
    A --> B["vlut (erf, bank A)\nerf(x/√2)\n1 tick → out_vx"]
    B --> C["vadd 64\n1+erf(·)\n(bcast.imm 64)\n1 tick"]
    X --> D["vmul\nx·(1+erf)\n1 tick → out_vr"]
    C --> D
    D --> E["vsra by 7\n÷128 ≈ ×0.5/scale²\n1 tick"]
    E --> F["GELU(x)\n5 ticks total"]
```
Assembler
src/main/scala/isa/NpuAssembler.scala provides a Scala-side assembler.
All methods return a Scala Int (the 32-bit bit pattern).
```scala
import chisel3._           // provides the .U literal conversion used in the poke below
import isa.NpuAssembler._

// Arithmetic
val i1 = vadd(rd=0, rs1=1, rs2=2, width=VX, sat=false)  // VX add
val i2 = vmul(rd=4, rs1=4, rs2=5, width=VR, sat=true)   // VR mul saturated

// FP32
val i3 = vfma(rd=2, rs1=2, rs2=0, rs3=1)                // VR fused multiply-add

// Conversion
val i4 = vcvt_f32_s32(rd=2, rs1=2)                      // INT32 → FP32
val i5 = vcvt_s8_f32(rd=31, rs1=2, sat=true)            // FP32 → INT8 saturated

// Broadcast
val i6 = vbcast(rd=0, rs1=0, width=VR)                  // splat VR[0] lane 0

// Programmable LUT
val i_set = vsetlut(rs1=4, segment=0, bank=0)           // write VR[4] → LUT bank A seg 0
val i_lut = vlut(rd=2, rs1=1, bank=0)                   // rd[i] = lut_A[VX[1][i]]

// MMA
val i7 = mma(rd=2, rs1=0, rs2=8, keep=true)
val i8 = mmaLast(rd=2, rs1=0, rs2=8)

// Poke in simulation (convert Scala Int → Chisel UInt safely).
// dut is the device under test inside the surrounding test body.
dut.io.instr.poke((i1.toLong & 0xFFFFFFFFL).U)
```
Negative Scala ints
Instruction words with bit 31 set are negative in Scala (e.g. large funct7 values). Always
convert as (instr.toLong & 0xFFFFFFFFL).U before poking in tests.
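The masking step can be checked without any hardware library. The sketch below shows what `toLong & 0xFFFFFFFFL` does to a word with bit 31 set; `UnsignedWord` is a hypothetical name used only for this illustration.

```scala
// Hypothetical helper showing the Int -> unsigned 32-bit conversion in isolation.
object UnsignedWord {
  /** Reinterpret the 32-bit pattern of `instr` as a non-negative Long. */
  def toUnsigned(instr: Int): Long = instr.toLong & 0xFFFFFFFFL
}
```

Without the mask, `instr.toLong` sign-extends, so a word such as 0xFE000000 would become a negative Long and could not be turned into an unsigned literal.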