
Instruction Set Architecture

This NPU targets edge SoC integration, so the ISA is designed to be compact and tile-friendly. We tile GEMM+activation work across multiple pipeline stages inside the core, inspired by OpenTPU and systolic tiling literature. The FP32/BF16/BF8 conversion and VALU families enable post-GEMM quantization entirely in hardware without round-tripping through host memory.


Notation

These three symbols appear throughout all ISA, VALU, register-file, and backend code. Confusing them is the primary source of errors.

| Symbol | Meaning | Test default | Top default |
|---|---|---|---|
| N (spoken: N(bits)) | Base lane width in bits. Matches MMALU nbits. Always written N(bits) in prose. | 8 | 8 |
| L | Number of base VX registers. Must be divisible by 4 for VE/VR aliasing. | 32 | 32 |
| K | SIMD lane count per register. Equals MMALU array-side n at the backend boundary. | 8 | 64 |

Physical register file total = L × K × N/8 bytes = 256 B (test) / 2 KiB (top).
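
The size formula above can be checked with a few lines of Scala. This is a minimal sketch; `RegFileSize` and its parameter names are hypothetical and not part of the project sources.

```scala
// Hypothetical helper for illustration only.
object RegFileSize {
  /** Total register-file size in bytes: L registers x K lanes x N bits / 8. */
  def bytes(l: Int, k: Int, nBits: Int): Int = {
    require(l % 4 == 0, "L must be divisible by 4 for VE/VR aliasing")
    l * k * nBits / 8
  }
}
```

With the defaults from the table, `RegFileSize.bytes(32, 8, 8)` gives 256 (test) and `RegFileSize.bytes(32, 64, 8)` gives 2048 (top).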


Instruction Word Layout

All instructions are 32-bit words. Three formats are defined.

R-type — register-register operations

Bit 31 is the MSB. Fields are listed MSB → LSB (bit 31 on the left):

funct7 [31:25] | rs2 [24:20] | rs1 [19:15] | funct3 [14:12] | rd [11:7] | opcode [6:0]

I-type — register + immediate (e.g. bcast.imm, ld)

imm[11:0] [31:20] | rs1 [19:15] | funct3 [14:12] | rd [11:7] | opcode [6:0]

The 12-bit immediate is sign-extended to the lane width.

S-type — three-source FMA (VALU_FP_FMA only)

rs3 [31:27] | rnd [26:25] | rs2 [24:20] | rs1 [19:15] | funct3 [14:12] | rd [11:7] | opcode [6:0]

rnd is the rounding mode for this instruction only. rs3 is the addend for FMA.

Field definitions

Field Bits Description
opcode [6:0] Functional family (7 bits)
rd [11:7] Destination register index (5 bits; VE uses [3:0]; VR uses [2:0])
funct3 [14:12] Sub-operation within the family (3 bits)
rs1 [19:15] Source register 1 (5 bits)
rs2 [24:20] Source register 2 (5 bits)
funct7 [31:25] Attribute field: width/round/sat/dtype (7 bits)
imm[11:0] [31:20] Signed 12-bit immediate (I-type)
rnd [26:25] Rounding mode for FMA (S-type)
rs3 [31:27] Third source register for FMA (S-type)
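
The bit positions in the field table can be turned into a small encoder. The sketch below is written from that table alone; `EncodeSketch` and its method names are hypothetical, and it is not the project's NpuAssembler, which may differ in signatures and checks.

```scala
// Hypothetical encoder built from the field-definition table above.
object EncodeSketch {
  /** Mask `value` to `bits` bits and place it at bit offset `lo`. */
  private def field(value: Int, bits: Int, lo: Int): Int =
    (value & ((1 << bits) - 1)) << lo

  /** Fields shared by all three formats: rs1, funct3, rd, opcode. */
  private def common(opcode: Int, rd: Int, funct3: Int, rs1: Int): Int =
    field(rs1, 5, 15) | field(funct3, 3, 12) | field(rd, 5, 7) | field(opcode, 7, 0)

  /** R-type: funct7 [31:25], rs2 [24:20]. */
  def rType(opcode: Int, rd: Int, funct3: Int, rs1: Int, rs2: Int, funct7: Int): Int =
    field(funct7, 7, 25) | field(rs2, 5, 20) | common(opcode, rd, funct3, rs1)

  /** I-type: signed 12-bit immediate in [31:20]. */
  def iType(opcode: Int, rd: Int, funct3: Int, rs1: Int, imm: Int): Int =
    field(imm, 12, 20) | common(opcode, rd, funct3, rs1)

  /** S-type: rs3 [31:27], rnd [26:25], rs2 [24:20]. */
  def sType(opcode: Int, rd: Int, funct3: Int, rs1: Int, rs2: Int, rs3: Int, rnd: Int): Int =
    field(rs3, 5, 27) | field(rnd, 2, 25) | field(rs2, 5, 20) | common(opcode, rd, funct3, rs1)
}
```

For example, `EncodeSketch.rType(0x10, rd = 0, funct3 = 0, rs1 = 1, rs2 = 2, funct7 = 0)` yields 0x00208010. A negative immediate sets bit 31, so the resulting Scala Int is negative.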

funct7 Attribute Field (R-type)

Sub-field Bits in funct7 Values
width [1:0] 00=VX (N bits) · 01=VE (2N bits) · 10=VR (4N bits) · 11=reserved
round [3:2] 00=RNE · 01=RTZ · 10=floor · 11=ceil
sat [4] 0=wrap · 1=saturate (arithmetic ops only)
dtype [6:5] 00=INT · 01=FP · 10=BF · 11=reserved
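
As an illustration of the sub-field layout above, the following sketch packs the four attributes into a 7-bit funct7 value. The object and constant names are hypothetical; only the bit positions and codes come from the table.

```scala
// Hypothetical packer for the R-type funct7 attribute field.
object Funct7Attr {
  // width codes (funct7[1:0])
  val VX = 0; val VE = 1; val VR = 2
  // round codes (funct7[3:2])
  val RNE = 0; val RTZ = 1; val Floor = 2; val Ceil = 3
  // dtype codes (funct7[6:5])
  val INT = 0; val FP = 1; val BF = 2

  def pack(width: Int, round: Int, sat: Boolean, dtype: Int): Int = {
    require(width >= 0 && width < 3, "width code 11 is reserved")
    require(dtype >= 0 && dtype < 3, "dtype code 11 is reserved")
    val satBit = if (sat) 1 else 0
    (dtype << 5) | (satBit << 4) | ((round & 3) << 2) | width
  }
}
```

A saturating integer operation on VR with RNE rounding packs to `Funct7Attr.pack(Funct7Attr.VR, Funct7Attr.RNE, sat = true, Funct7Attr.INT)`, which is 0x12.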

funct7 for VALU_CVT (different layout)

CVT repurposes funct7 to carry source format and BF8 variant:

Sub-field Bits Meaning
src fmt [2:0] Source format code (see FmtCode table below)
sat [3] Saturate on output narrowing
round [5:4] Rounding mode
BF8 var [6] 0=E4M3 · 1=E5M2

CVT Format Codes

Used in both funct3 (destination) and funct7[2:0] (source):

Code Format Width Register class
000 s8 N bits VX
001 s16 2N bits VE
010 s32 4N bits VR
011 f32 4N bits VR
100 bf16 2N bits VE
101 bf8 (variant from funct7[6]) N bits VX

CVT naming convention

Mnemonics follow vcvt_<dst>_<src>: the destination format comes first. vcvt_s8_f32 = FP32 input → INT8 output (narrow, goes to VX). vcvt_f32_s8 = INT8 input → FP32 output (wide, goes to VR).
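
Because the destination comes first in the mnemonic, it is easy to swap the two fields by accident. The sketch below derives funct3 and the CVT funct7 from a mnemonic using the format-code and CVT funct7 tables above. `CvtSketch` and its parameters are hypothetical names, not the assembler's API.

```scala
// Hypothetical helper: mnemonic -> (funct3, funct7) for VALU_CVT.
object CvtSketch {
  // Format codes from the CVT Format Codes table.
  val fmtCode: Map[String, Int] =
    Map("s8" -> 0, "s16" -> 1, "s32" -> 2, "f32" -> 3, "bf16" -> 4, "bf8" -> 5)

  /** Split vcvt_<dst>_<src> into (dst, src). */
  def parse(mnemonic: String): (String, String) = mnemonic.split("_") match {
    case Array("vcvt", dst, src) => (dst, src)
    case _ => throw new IllegalArgumentException(s"not a vcvt mnemonic: $mnemonic")
  }

  /** funct3 = dst code; funct7 = BF8 var [6] | round [5:4] | sat [3] | src fmt [2:0]. */
  def fields(mnemonic: String, round: Int = 0, sat: Boolean = false,
             e5m2: Boolean = false): (Int, Int) = {
    val (dst, src) = parse(mnemonic)
    require(dst != src, "src == dst is illegal")
    val varBit = if (e5m2) 1 else 0
    val satBit = if (sat) 1 else 0
    val funct7 = (varBit << 6) | ((round & 3) << 4) | (satBit << 3) | fmtCode(src)
    (fmtCode(dst), funct7)
  }
}
```

`CvtSketch.fields("vcvt_s8_f32", sat = true)` returns `(0, 0x0B)`: destination s8 in funct3, source f32 plus the sat bit in funct7.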


Opcode Family Map

opcode selects one of 13 functional families. Reserved codes (0x04–0x0F, 0x19–0x7F) are detected by the decoder as illegal instructions.

graph TD
    ISA["7-bit Opcode Space<br/>13 active families"]

    ISA --> MEM["Memory & Control<br/>NOP 0x00 · LD 0x01 · ST 0x02"]
    ISA --> MMA["Matrix Multiply-Accumulate<br/>MMA 0x03"]
    ISA --> INT["Integer Vector<br/>ARITH 0x10 · LOGIC 0x11<br/>REDUCE 0x12 · LUT 0x13 (vlut/vsetlut)"]
    ISA --> CVT["Type Conversion<br/>CVT 0x14"]
    ISA --> BCAST["Broadcast<br/>BCAST 0x15"]
    ISA --> FP["Floating-Point<br/>FP 0x16 · FP_FMA 0x17"]
    ISA --> MOV["Move (proposed)<br/>MOV 0x18 ⚠"]

| Family | Opcode | Format | funct3 sub-ops |
|---|---|---|---|
| NOP | 0x00 | | (none) |
| LD | 0x01 | I | 000=byte · 001=half · 010=word · 011=VX · 100=VE · 101=VR |
| ST | 0x02 | I | same as LD |
| MMA | 0x03 | R | 000=mma · 001=mma.last · 010=mma.reset |
| VALU_ARITH | 0x10 | R | 000=add · 001=sub · 010=mul · 011=neg · 100=abs · 101=max · 110=min · 111=rsub |
| VALU_LOGIC | 0x11 | R | 000=sll · 001=srl · 010=sra · 011=rol · 100=xor · 101=not · 110=or · 111=and |
| VALU_REDUCE | 0x12 | R | 000=sum · 001=rmax · 010=rmin · 011=rand · 100=ror · 101=rxor |
| VALU_LUT | 0x13 | R / I | 000=vlut.A (R) · 001=vlut.B (R) · 100=vsetlut.A (I) · 101=vsetlut.B (I) · 010/011/110/111=reserved |
| VALU_CVT | 0x14 | R | funct3=dst fmt (see CVT table) |
| VALU_BCAST | 0x15 | R / I | 000=bcast.reg (R) · 001=bcast.imm (I) |
| VALU_FP | 0x16 | R | 000=fadd · 001=fsub · 010=fmul · 011=fneg · 100=fabs · 101=fmax · 110=fmin |
| VALU_FP_FMA | 0x17 | S | 000=fma · 001=fms · 010=nfma · 011=nfms |
| VALU_MOV | 0x18 | R / I | 000=mov (R) · 001=movi (I) · 010=movh (I) · agent-added, unverified |
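
The opcode map above translates directly into a legality check. This is a software sketch of what the decoder is described as doing, not the decoder itself; `OpcodeMap` is a hypothetical name.

```scala
// Hypothetical model of the opcode family map.
object OpcodeMap {
  val families: Map[Int, String] = Map(
    0x00 -> "NOP", 0x01 -> "LD", 0x02 -> "ST", 0x03 -> "MMA",
    0x10 -> "VALU_ARITH", 0x11 -> "VALU_LOGIC", 0x12 -> "VALU_REDUCE",
    0x13 -> "VALU_LUT", 0x14 -> "VALU_CVT", 0x15 -> "VALU_BCAST",
    0x16 -> "VALU_FP", 0x17 -> "VALU_FP_FMA",
    0x18 -> "VALU_MOV" // proposed extension, unverified
  )

  /** True when the low 7 bits of the instruction word name an active family. */
  def isLegal(instr: Int): Boolean = families.contains(instr & 0x7F)
}
```

Any word whose opcode falls in 0x04–0x0F or 0x19–0x7F is rejected, which matches the reserved ranges listed above.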

Family Reference

VALU_ARITH — Elementwise arithmetic on VX / VE / VR

Width selected by funct7[1:0]. Saturation controlled by funct7[4]. Applies to all integer lane widths (N, 2N, 4N bits).

funct3 Mnemonic Operation
000 add rd[i] = rs1[i] + rs2[i]
001 sub rd[i] = rs1[i] − rs2[i]
010 mul rd[i] = rs1[i] × rs2[i] (narrow sat on out_vx; full product on out_vr)
011 neg rd[i] = −rs1[i] (rs2 unused)
100 abs rd[i] = |rs1[i]| (rs2 unused)
101 max rd[i] = max(rs1[i], rs2[i])
110 min rd[i] = min(rs1[i], rs2[i])
111 rsub rd[i] = rs2[i] − rs1[i] (reverse subtract; useful after bcast)
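
The difference between wrap (sat=0) and saturate (sat=1) is easiest to see on a lane that overflows. The sketch below is a plain-Scala reference model, assuming signed two's-complement lanes; `ArithModel` is a hypothetical name and the hardware may handle corner cases differently.

```scala
// Hypothetical reference model for VALU_ARITH add with wrap or saturate.
object ArithModel {
  /** Keep the low `bits` bits and sign-extend (two's-complement wrap). */
  def wrap(v: Long, bits: Int): Long = {
    val shift = 64 - bits
    (v << shift) >> shift
  }

  /** Clamp to the signed range of a `bits`-wide lane. */
  def saturate(v: Long, bits: Int): Long = {
    val hi = (1L << (bits - 1)) - 1
    val lo = -(1L << (bits - 1))
    math.max(lo, math.min(hi, v))
  }

  /** Elementwise rd[i] = rs1[i] + rs2[i] on lanes of width `bits`. */
  def add(rs1: Seq[Long], rs2: Seq[Long], bits: Int, sat: Boolean): Seq[Long] =
    rs1.zip(rs2).map { case (a, b) =>
      if (sat) saturate(a + b, bits) else wrap(a + b, bits)
    }
}
```

On 8-bit lanes, 100 + 100 wraps to -56 but saturates to 127; -100 + -100 wraps to 56 but saturates to -128.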

VALU_LOGIC — Bitwise and shift on VX / VE / VR

Operates on raw bit patterns. sat and round are ignored. Shift amount = rs2[i][ log2(lane_width)−1 : 0 ] (low bits of each lane of rs2).

| funct3 | Mnemonic | Operation | RV parallel |
|---|---|---|---|
| 000 | sll | logical left shift | sll |
| 001 | srl | logical right shift | srl |
| 010 | sra | arithmetic right shift (sign-extending) | sra |
| 011 | rol | rotate left by 1 (heterogeneous per-lane) | |
| 100 | xor | bitwise XOR | xor |
| 101 | not | bitwise NOT (rs2 unused) | |
| 110 | or | bitwise OR | or |
| 111 | and | bitwise AND | and |
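
The shift-amount rule above (only the low log2(lane_width) bits of each rs2 lane are used) can be sketched as a mask. `ShiftAmount` is a hypothetical helper and assumes the lane width is a power of two.

```scala
// Hypothetical helper: extract the effective shift amount from an rs2 lane.
object ShiftAmount {
  def of(rs2Lane: Int, laneWidth: Int): Int = {
    require(laneWidth > 0 && (laneWidth & (laneWidth - 1)) == 0,
      "lane width must be a power of two")
    rs2Lane & (laneWidth - 1) // low log2(laneWidth) bits
  }
}
```

On an 8-bit VX lane an rs2 value of 9 therefore shifts by 1, not by 9.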

VALU_REDUCE — Horizontal reduction, broadcast result

Result is broadcast to all K lanes of out_vr. Operates on VX lanes (sign-extended to 4N bits for the accumulation tree).

funct3 Mnemonic Operation
000 sum Σ rs1[i] → broadcast to all K lanes
001 rmax max(rs1[i]) → broadcast
010 rmin min(rs1[i]) → broadcast
011 rand AND(rs1[i]) → broadcast
100 ror OR(rs1[i]) → broadcast
101 rxor XOR(rs1[i]) → broadcast
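
A reference model makes the sign extension and broadcast explicit. The sketch below assumes N = 8, so a VX lane fits a Scala Byte and the 4N-bit accumulation fits an Int; `ReduceModel` is a hypothetical name.

```scala
// Hypothetical reference model for sum and rmax, assuming N = 8.
object ReduceModel {
  /** Sum all lanes after sign extension, then broadcast to every output lane. */
  def sum(vx: Seq[Byte]): Seq[Int] = {
    val total = vx.map(_.toInt).sum // Byte.toInt sign-extends
    Seq.fill(vx.length)(total)
  }

  /** Maximum over all lanes, broadcast to every output lane. */
  def rmax(vx: Seq[Byte]): Seq[Int] =
    Seq.fill(vx.length)(vx.map(_.toInt).max)
}
```

Because the accumulation is 32 bits wide in this model, `ReduceModel.sum(Seq[Byte](127, 127, -1, 1))` gives 254 in every lane rather than overflowing an 8-bit result.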

VALU_LUT — Programmable two-bank 256-entry lookup table (VX only)

The LUT family provides two independently programmable 256-byte banks (bank A and bank B). Each bank holds 256 one-byte entries. The LUT is not a fixed ROM: entries must be written before use with vsetlut, then queried per-lane with vlut.

vlut — per-lane lookup (R-type, 1 tick)

rd[i] = lut_bank[in_a_vx[i].asUInt]

rs2 is unused. The raw 8-bit unsigned value of each in_a_vx lane is the LUT index. Output goes to out_vx. Bank selected by funct3[0] (propagated as round[0] in the decoded bundle).

funct3 Mnemonic Bank Format
000 vlut A R-type
001 vlut B R-type

vsetlut — write LUT segment (I-type, no RF write)

Writes one segment of K×4 bytes from VR[rs1] into the selected bank. imm = segment index; segment s covers LUT entries [s×K×4 .. (s+1)×K×4 − 1].

At K=8: one VR holds 32 bytes → 8 vsetlut calls fill a full 256-byte bank. At K=64: one VR holds 256 bytes → 1 vsetlut call fills a full 256-byte bank.

No register-file write occurs; the operation is a side-effect on VALU-internal state only.

funct3 Mnemonic Bank Format
100 vsetlut A I-type
101 vsetlut B I-type
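
The segment arithmetic and the unsigned indexing can be modelled in software as follows. This is a sketch under stated assumptions: `LutBank` is a hypothetical class, and the mapping of VR byte i to LUT entry s×K×4 + i (byte order within the VR) is assumed rather than taken from the hardware.

```scala
// Hypothetical software model of one 256-entry LUT bank.
final class LutBank(k: Int) {
  private val entries = new Array[Byte](256)
  val segmentBytes: Int = k * 4               // one VR holds K x 4 bytes
  val segmentsPerBank: Int = 256 / segmentBytes

  /** vsetlut: write one VR worth of bytes into segment `segment`. */
  def vsetlut(segment: Int, vrBytes: Seq[Byte]): Unit = {
    require(vrBytes.length == segmentBytes, "VR must hold K x 4 bytes")
    require(segment >= 0 && segment < segmentsPerBank, "segment out of range")
    vrBytes.zipWithIndex.foreach { case (b, i) =>
      entries(segment * segmentBytes + i) = b
    }
  }

  /** vlut: each lane's raw 8-bit value, read as unsigned, indexes the bank. */
  def vlut(vx: Seq[Byte]): Seq[Byte] = vx.map(b => entries(b & 0xFF))
}
```

With K=8 the model reports 8 segments per bank and with K=64 a single segment, matching the counts above. A lane holding -1 reads entry 255.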

Reserved funct3 values

funct3 010, 011, 110, 111 are reserved and flagged as illegal by the decoder.

Assembler helpers

NpuAssembler.vlut(rd, rs1, bank) and NpuAssembler.vsetlut(rs1, segment, bank) produce the correct 32-bit encodings. bank=0 selects bank A; bank=1 selects bank B.

VALU_CVT — Type conversion

funct3 = destination format code. funct7[2:0] = source format code. Illegal if src == dst. Width of output register class is determined by the destination format.

| Mnemonic | src fmt | dst fmt | Input reg | Output reg | Ticks |
|---|---|---|---|---|---|
| vcvt_s8_s32 | s32 (VR) | s8 (VX) | VR | VX | 1 |
| vcvt_s32_s8 | s8 (VX) | s32 (VR) | VX | VR | 1 |
| vcvt_f32_s32 | s32 (VR) | f32 (VR) | VR | VR | 1 |
| vcvt_s32_f32 | f32 (VR) | s32 (VR) | VR | VR | 1–2 |
| vcvt_f32_s8 | s8 (VX) | f32 (VR) | VX | VR | 1 |
| vcvt_s8_f32 | f32 (VR) | s8 (VX) | VR | VX | 1–2 |
| vcvt_f32_bf16 | bf16 (VE) | f32 (VR) | VE | VR | 1 |
| vcvt_bf16_f32 | f32 (VR) | bf16 (VE) | VR | VE | 1 |
| vcvt_f32_bf8 | bf8 (VX) | f32 (VR) | VX | VR | 1 |
| vcvt_bf8_f32 | f32 (VR) | bf8 (VX) | VR | VX | 1 |
| vcvt_s16_s32 | s32 (VR) | s16 (VE) | VR | VE | 1 |
| vcvt_s32_s16 | s16 (VE) | s32 (VR) | VE | VR | 1 |

VALU_BCAST — Scalar broadcast to all K lanes

funct3 Format Mnemonic Operation
000 R bcast.reg rd[i] = rs1[0] for all i; width from funct7[1:0]
001 I bcast.imm rd[i] = sext(imm[11:0]) for all i; width=VX always

bcast.reg broadcasts lane 0 of rs1 to all K output lanes. Used to splat a scale or zero-point constant prior to vfma.

VALU_FP — FP32 arithmetic (VR only)

Operands and result are always in VR (K lanes of 32-bit FP). Width implicit (always VR), dtype implicit (always FP). Tier-2 FP32 constraints: RNE rounding; NaN/Inf inputs treated as zero; overflow saturates to max finite normal; subnormals flushed to zero.

funct3 Mnemonic Operation
000 fadd rd[i] = rs1[i] + rs2[i]
001 fsub rd[i] = rs1[i] − rs2[i]
010 fmul rd[i] = rs1[i] × rs2[i]
011 fneg rd[i] = −rs1[i] (sign-bit flip; no FP computation)
100 fabs rd[i] = |rs1[i]| (sign-bit clear)
101 fmax rd[i] = max(rs1[i], rs2[i])
110 fmin rd[i] = min(rs1[i], rs2[i])
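
The Tier-2 constraints above can be approximated in software for test vectors. The sketch below is a hypothetical model (`Fp32Tier2` is not a project name). It assumes that subnormals are flushed on both inputs and results and that flushed values become +0; neither detail is stated in the text.

```scala
// Hypothetical software model of the Tier-2 FP32 constraints.
object Fp32Tier2 {
  private val MinNormal = java.lang.Float.MIN_NORMAL

  /** NaN/Inf inputs are treated as zero; subnormal inputs are flushed to zero. */
  def sanitizeInput(x: Float): Float =
    if (x.isNaN || x.isInfinity) 0.0f
    else if (math.abs(x) < MinNormal) 0.0f
    else x

  /** Overflow saturates to the largest finite value; subnormal results flush to zero. */
  def sanitizeResult(x: Float): Float =
    if (x.isInfinity) math.signum(x) * Float.MaxValue
    else if (math.abs(x) < MinNormal) 0.0f
    else x

  def fadd(a: Float, b: Float): Float =
    sanitizeResult(sanitizeInput(a) + sanitizeInput(b))

  def fmul(a: Float, b: Float): Float =
    sanitizeResult(sanitizeInput(a) * sanitizeInput(b))
}
```

Under this model `Fp32Tier2.fadd(Float.NaN, 1.0f)` returns 1.0f, and adding `Float.MaxValue` to itself returns `Float.MaxValue` instead of infinity.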

VALU_FP_FMA — Fused multiply-add (VR, S-format, 2 ticks)

rd = rs1 × rs2 + rs3 (and variants). S-format encodes rs3 at bits [31:27] and round mode at [26:25].

funct3 Mnemonic Operation
000 fma rd[i] = (rs1[i] × rs2[i]) + rs3[i]
001 fms rd[i] = (rs1[i] × rs2[i]) − rs3[i]
010 nfma rd[i] = −(rs1[i] × rs2[i]) + rs3[i]
011 nfms rd[i] = −(rs1[i] × rs2[i]) − rs3[i]

VALU_MOV — Register copy and immediate load

Agent-added family — not in original design

VALU_MOV (opcode 0x18, funct3 000/001/010) was added by the agent during the implementation phase. It has no test coverage and was not part of the original hardware design spec. Treat it as a proposed extension pending review. For loading constants into registers, prefer bcast.imm (opcode 0x15, funct3 001) which is tested and verified.

funct3 Format Mnemonic Operation
000 R mov rd = rs1, width from funct7[1:0]
001 I movi rd[0] = sext(imm); other lanes unchanged
010 I movh rd[0][2N-1:N] = imm[N-1:0]; low N bits unchanged (useful to build 2N-bit constants)

MMA — Matrix Multiply-Accumulate

rd = destination VR base index; rs1 = A operand VX base; rs2 = B operand VX base. The keep signal (from funct7[4], i.e. the sat bit) controls accumulation.

funct3 Mnemonic Operation
000 mma Start accumulate. keep high to add; low to reset PE.
001 mma.last Assert clct; collect final diagonal result into VR.
010 mma.reset Clear all PE accumulators.

MMALU output goes directly to VR — no INT8 truncation. The 4N-bit accumulator is preserved intact.


Instruction Timing

1-tick ops (all VALU except FMA and rounding CVT)

VALU instructions in the ARITH, LOGIC, REDUCE, LUT, BCAST, FP, and MOV families, plus CVT ops that need no rounding logic, have a 1-tick output register stage. The result appears on out_vx / out_ve / out_vr one clock edge after issue.

2-tick: vfma and rounding CVT ops

vfma performs a multiply then add, requiring two clock edges. CVT ops that require rounding logic (e.g. vcvt_s32_f32, vcvt_s8_f32) also take 2 ticks.

Reduction ops (1-tick, broadcast to all K lanes)

vsum and vrmax reduce all K input lanes combinationally, then broadcast the scalar result to every lane of out_vr.

MMA: 3K−2 tick pipeline

For a K×K systolic array, input vectors are consumed during the first K ticks; the first output column appears at tick 2K−1 and the last at tick 3K−2.
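
The tick formulas above are simple enough to evaluate for both configurations. `MmaTiming` is a hypothetical helper that only restates them.

```scala
// Hypothetical helper: MMA pipeline milestones for a K x K array.
object MmaTiming {
  def lastInputTick(k: Int): Int = k            // inputs consumed in ticks 1..K
  def firstOutputTick(k: Int): Int = 2 * k - 1  // first output column
  def lastOutputTick(k: Int): Int = 3 * k - 2   // K columns, one per tick
}
```

For K=8 the outputs span ticks 15 to 22; for K=64 they span ticks 127 to 190.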


Activation Function Software Sequences

Activation functions are not hardware opcodes; they are composed from VALU primitives. See Quantization Pipeline for the full worked example.

| Function | Instruction sequence | Total ticks |
|---|---|---|
| ReLU | vmax.VX 0 (vs. zero bcast) | 2 |
| Clamp | vmin → vmax | 2 |
| Tanh | vlut bank A (pre-loaded tanh table) | 1 |
| GELU (approx) | vsra → vlut (erf table, bank A) → vadd → vmul → vsra | 5 |
| Softmax (K lanes) | vrmax → vsub → vlut (exp, bank A) → vsum → vlut (recip, bank B) → vmul | 6 |
| Quantize (post-MMA) | mma.last → vcvt_f32_s32 → vfma → vcvt_s8_f32 | 3K+5 |
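
The VALU-only totals in the table can be reproduced by summing per-instruction ticks. The sketch below assumes strictly back-to-back issue with no overlap and uses the 1-tick and 2-tick figures from the Instruction Timing section; `TickTally` is a hypothetical name, and constant-setup instructions such as bcast.imm are not counted.

```scala
// Hypothetical tally of ticks for a software sequence.
object TickTally {
  val ticks: Map[String, Int] = Map(
    "vadd" -> 1, "vsub" -> 1, "vmul" -> 1, "vmin" -> 1, "vmax" -> 1,
    "vsra" -> 1, "vlut" -> 1, "vsum" -> 1, "vrmax" -> 1,
    "vfma" -> 2
  )

  /** Sum of per-instruction ticks, assuming no overlap between instructions. */
  def total(sequence: Seq[String]): Int = sequence.map(ticks).sum
}
```

`TickTally.total(Seq("vrmax", "vsub", "vlut", "vsum", "vlut", "vmul"))` gives 6 for softmax, and the five-instruction GELU sequence gives 5.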

Softmax flow

\[\text{softmax}(x_i) = \frac{e^{x_i - \max_j x_j}}{\sum_j e^{x_j - \max_j x_j}}\]
flowchart LR
    X["x[K] in VX\n(SQ1.6)"]
    X --> A["vrmax\nmax=max(x)\n1 tick\n→ out_vr"]
    X --> B["vsub.sat\nx'=x-max\n1 tick\n→ out_vx"]
    A --> B
    B --> C["vlut (exp, bank A)\ne=exp(x')\n1 tick\n→ out_vx"]
    C --> D["vsum\nΣ=sum(e)\n1 tick\n→ out_vr"]
    D --> E["vlut (recip, bank B)\nr=1/Σ\n1 tick\n→ out_vx"]
    C --> F["vmul\np=e×r\n1 tick\n→ out_vx"]
    E --> F
    F --> G["softmax(x)\n6 ticks total"]
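
As a cross-check for the LUT-based flow above, a floating-point reference of the same max-subtracted softmax is useful in tests. The sketch is hypothetical (`SoftmaxRef` is not a project name) and assumes SQ1.6 means a signed 8-bit value with 6 fractional bits, so a raw value of 64 represents 1.0.

```scala
// Hypothetical floating-point reference for the softmax flow.
object SoftmaxRef {
  /** Convert a raw SQ1.6 lane to a real value (assumes 6 fractional bits). */
  def fromSq16(raw: Byte): Double = raw / 64.0

  /** softmax(x_i) = exp(x_i - max) / sum_j exp(x_j - max). */
  def softmax(x: Seq[Double]): Seq[Double] = {
    val m = x.max                        // vrmax
    val e = x.map(v => math.exp(v - m))  // vsub, then exp lookup
    val s = e.sum                        // vsum
    e.map(_ / s)                         // reciprocal lookup, then vmul
  }
}
```

Subtracting the maximum keeps every exponent argument at or below zero, which is what lets the exp table cover a bounded input range.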

GELU flow

\[\text{GELU}(x) \approx 0.5 \cdot x \cdot \bigl(1 + \text{erf}(x/\sqrt{2})\bigr)\]
flowchart LR
    X["x[K] in VX\n(SQ1.6)"]
    X --> A["vsra by 1\n≈ x/√2\n1 tick"]
    A --> B["vlut (erf, bank A)\nerf(x/√2)\n1 tick → out_vx"]
    B --> C["vadd 64\n1+erf(·)\n(bcast.imm 64)\n1 tick"]
    X --> D["vmul\nx·(1+erf)\n1 tick → out_vr"]
    C --> D
    D --> E["vsra by 7\n÷128 ≈ ×0.5/scale\n1 tick"]
    E --> F["GELU(x)\n5 ticks total"]

Assembler

src/main/scala/isa/NpuAssembler.scala provides a Scala-side assembler. All methods return a Scala Int (the 32-bit bit pattern).

import isa.NpuAssembler._

// Arithmetic
val i1 = vadd(rd=0, rs1=1, rs2=2, width=VX, sat=false)  // VX add
val i2 = vmul(rd=4, rs1=4, rs2=5, width=VR, sat=true)   // VR mul saturated

// FP32
val i3 = vfma(rd=2, rs1=2, rs2=0, rs3=1)                // VR fused multiply-add

// Conversion
val i4 = vcvt_f32_s32(rd=2, rs1=2)                      // INT32 → FP32
val i5 = vcvt_s8_f32(rd=31, rs1=2, sat=true)            // FP32 → INT8 saturated

// Broadcast
val i6 = vbcast(rd=0, rs1=0, width=VR)                  // splat VR[0] lane 0

// Programmable LUT
val i_set = vsetlut(rs1=4, segment=0, bank=0)           // write VR[4] → LUT bank A seg 0
val i_lut = vlut(rd=2, rs1=1, bank=0)                   // rd[i] = lut_A[VX[1][i]]

// MMA
val i7 = mma(rd=2, rs1=0, rs2=8, keep=true)
val i8 = mmaLast(rd=2, rs1=0, rs2=8)

// Poke in simulation (convert Scala Int → Chisel UInt safely):
dut.io.instr.poke((i1.toLong & 0xFFFFFFFFL).U)

Negative Scala ints

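Instruction words with bit 31 set are negative as a Scala Int. Bit 31 is funct7[6] in R-type, imm[11] in I-type, and rs3[4] in S-type, so this happens for funct7 ≥ 0x40, negative immediates, and rs3 ≥ 16. Always convert as (instr.toLong & 0xFFFFFFFFL).U before poking in tests.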
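
The masking step can be checked without a simulator. The sketch below performs the same `toLong & 0xFFFFFFFFL` conversion and returns a BigInt instead of a Chisel UInt, so it runs in plain Scala; `Unsigned32` is a hypothetical helper.

```scala
// Hypothetical helper: reinterpret a 32-bit instruction word as unsigned.
object Unsigned32 {
  def toUnsigned(instr: Int): BigInt = BigInt(instr.toLong & 0xFFFFFFFFL)
}
```

An instruction word of 0x80000000 is `Int.MinValue` as a Scala Int, and `Unsigned32.toUnsigned` maps it to 2147483648, the non-negative value the poke expects.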