Register Files
The NPU register file is implemented as a multi-width aliased block (MultiWidthRegisterBlock)
that presents three views over a single physical byte array:
VX (base), VE (paired), and VR (quad).
This design lets GEMM, vector arithmetic, and FP32 post-processing share the same storage
without data copying.
Notation
| Symbol | Meaning | Default (test) | Default (top) |
|---|---|---|---|
| N (N(bits)) | Base lane width in bits | 8 | 8 |
| L | Number of VX registers (must be divisible by 4) | 32 | 32 |
| K | SIMD lane count per register | 8 | 64 |
Physical storage = L × K × (N/8) bytes = 256 B at test defaults / 2 KiB at top.
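The storage formula can be checked with a few lines of Python. This is only an arithmetic sketch: the function name `storage_bytes` is hypothetical, and the restriction to byte-multiple lane widths is an assumption of the sketch, not a documented constraint of the hardware.

```python
def storage_bytes(L: int, K: int, N: int) -> int:
    """Physical storage of the register file in bytes: L x K x (N / 8)."""
    if L % 4 != 0:
        raise ValueError("L must be divisible by 4")
    if N % 8 != 0:
        # Assumption of this sketch: lanes are a whole number of bytes wide.
        raise ValueError("this sketch assumes N is a multiple of 8")
    return L * K * (N // 8)


test_bytes = storage_bytes(L=32, K=8, N=8)   # test defaults
top_bytes = storage_bytes(L=32, K=64, N=8)   # top-level defaults
print(test_bytes, top_bytes)                 # 256 2048
```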
Register Classes
Three views share the same physical bytes:
| Class | Count | Lane width | Per-reg bits | Address bits | Alias |
|---|---|---|---|---|---|
| VX[0..L-1] | 32 | N bits | K × N | 5 | native row |
| VE[0..L/2-1] | 16 | 2N bits | K × 2N | 4 | VE[i] = VX[2i] ∥ VX[2i+1] |
| VR[0..L/4-1] | 8 | 4N bits | K × 4N | 3 | VR[i] = VX[4i..4i+3] |
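The table entries follow from L, K, and N. The sketch below derives them and lists the VX rows behind each aliased register. The helper names are hypothetical, and computing the address width as the number of bits needed to index `count` registers is an assumption that happens to match the 5/4/3 values shown for L = 32.

```python
# Number of VX rows grouped into one register of each view (from the table above).
VIEW_GROUP = {"VX": 1, "VE": 2, "VR": 4}


def register_classes(L: int = 32, K: int = 8, N: int = 8) -> dict:
    """Derive count, lane width, register width and address width per view."""
    if L % 4 != 0:
        raise ValueError("L must be divisible by 4")
    table = {}
    for name, group in VIEW_GROUP.items():
        count = L // group
        table[name] = {
            "count": count,
            "lane_bits": group * N,
            "reg_bits": K * group * N,
            # Assumed: bits needed to index `count` registers.
            "addr_bits": max(1, (count - 1).bit_length()),
        }
    return table


def vx_rows(view: str, i: int) -> list:
    """VX row indices aliased by register i of the given view."""
    group = VIEW_GROUP[view]
    return [group * i + k for k in range(group)]


classes = register_classes()
print(classes["VR"])
print(vx_rows("VE", 14), vx_rows("VR", 7))
```

With the test defaults, `vx_rows("VR", 7)` yields rows 28 to 31, the last four rows of the array.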
Physical byte layout (N=8, K=8, L=32)
Physical bytes: 32 rows × 8 lanes × 1 byte = 256 B

Row  0 : VX[0]  lane[0..7] ─┐
Row  1 : VX[1]  lane[0..7] ─┴─ VE[0]  lane[0..7] ─┐
Row  2 : VX[2]  lane[0..7] ─┐                     ├─ VR[0] lane[0..7]
Row  3 : VX[3]  lane[0..7] ─┴─ VE[1]  lane[0..7] ─┘
Row  4 : VX[4]  lane[0..7] ─┐
Row  5 : VX[5]  lane[0..7] ─┴─ VE[2]  lane[0..7] ─┐
Row  6 : VX[6]  lane[0..7] ─┐                     ├─ VR[1] lane[0..7]
Row  7 : VX[7]  lane[0..7] ─┴─ VE[3]  lane[0..7] ─┘
...
Row 28 : VX[28] lane[0..7] ─┐
Row 29 : VX[29] lane[0..7] ─┴─ VE[14] lane[0..7] ─┐
Row 30 : VX[30] lane[0..7] ─┐                     ├─ VR[7] lane[0..7]
Row 31 : VX[31] lane[0..7] ─┴─ VE[15] lane[0..7] ─┘
Lane packing within a VE or VR word (little-endian):

VE[i] lane j = { VX[2i+1][lane j][N-1:0], VX[2i][lane j][N-1:0] }
  bits [2N-1:N] (hi N bits) = VX[2i+1][lane j]
  bits [N-1:0]  (lo N bits) = VX[2i][lane j]

VR[i] lane j = { VX[4i+3][lane j], VX[4i+2][lane j], VX[4i+1][lane j], VX[4i+0][lane j] }
  bits [4N-1:3N] = VX[4i+3][lane j]
  bits [3N-1:2N] = VX[4i+2][lane j]
  bits [2N-1:N]  = VX[4i+1][lane j]
  bits [N-1:0]   = VX[4i+0][lane j]
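The packing rule is plain little-endian concatenation of N-bit slices, which the following sketch spells out for a single lane. `pack_lane` and `unpack_lane` are hypothetical helper names used only for illustration.

```python
def pack_lane(parts, N=8):
    """Concatenate N-bit values into one word, parts[0] in the low bits."""
    mask = (1 << N) - 1
    word = 0
    for k, part in enumerate(parts):
        word |= (part & mask) << (k * N)
    return word


def unpack_lane(word, count, N=8):
    """Split a word back into `count` N-bit slices, low slice first."""
    mask = (1 << N) - 1
    return [(word >> (k * N)) & mask for k in range(count)]


# VE[i] lane j from VX[2i] = 0x34 (low) and VX[2i+1] = 0x12 (high).
ve_lane = pack_lane([0x34, 0x12])
# VR[i] lane j from VX[4i+0] .. VX[4i+3].
vr_lane = pack_lane([0x78, 0x56, 0x34, 0x12])
print(hex(ve_lane), hex(vr_lane))   # 0x1234 0x12345678
```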
Aliasing consequences
- Writing VR[i] atomically updates VX[4i], VX[4i+1], VX[4i+2], VX[4i+3] and thus VE[2i], VE[2i+1].
- Writing VE[i] atomically updates VX[2i] and VX[2i+1].
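- Reading VX[j] after a write to VR[⌊j/4⌋] returns, in each lane, the N-bit slice of the written VR lane at bits [(j mod 4 + 1)·N − 1 : (j mod 4)·N] (one byte per lane at N = 8).
- Conflict resolution: when several writes hit the same physical row in the same cycle, the last assignment in the write-priority chain wins. Software is responsible for avoiding such same-cycle write-after-write conflicts.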
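These consequences can be reproduced with a small software model of the aliased storage. It is a behavioural sketch, not the Chisel implementation: the class name `AliasedRegFile`, the row-major byte order inside the array, and the one-write-at-a-time interface (no ports, no clock) are all assumptions made for illustration.

```python
class AliasedRegFile:
    """Software model: VX/VE/VR views over a single byte array."""

    def __init__(self, L=32, K=8, N=8):
        assert L % 4 == 0 and N % 8 == 0
        self.L, self.K, self.N = L, K, N
        self.B = N // 8                      # bytes per base lane
        self.mem = bytearray(L * K * self.B)

    def _slot(self, row, lane):
        # Assumed layout: row-major, lanes contiguous within a row.
        start = (row * self.K + lane) * self.B
        return slice(start, start + self.B)

    def read(self, group, i):
        """Read register i of a view; group = 1 (VX), 2 (VE) or 4 (VR)."""
        lanes = []
        for lane in range(self.K):
            word = 0
            for k in range(group):
                raw = self.mem[self._slot(group * i + k, lane)]
                word |= int.from_bytes(raw, "little") << (k * self.N)
            lanes.append(word)
        return lanes

    def write(self, group, i, lanes):
        """Write register i of a view; every aliased VX row is updated."""
        mask = (1 << self.N) - 1
        for lane, word in enumerate(lanes):
            for k in range(group):
                part = (word >> (k * self.N)) & mask
                self.mem[self._slot(group * i + k, lane)] = part.to_bytes(self.B, "little")


rf = AliasedRegFile()
rf.write(4, 1, [0x12345678] * 8)     # write VR[1]
print([hex(v) for v in rf.read(1, 4)[:2]])   # VX[4] holds the low byte
print([hex(v) for v in rf.read(2, 3)[:2]])   # VE[3] holds the high half
```

A later `rf.write(2, 2, ...)` to VE[2] changes only the low half of VR[1], since both names refer to the same rows.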
MultiWidthRegisterBlock
Source: src/main/scala/sram/multiWidthRegister.scala
Package: sram.mwreg
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| L | Int | 32 | Number of VX rows (must be divisible by 4) |
| K | Int | 8 | SIMD lane count |
| N | Int | 8 | Base lane width in bits (N(bits)) |
| vx_rd | Int | 4 | Number of VX async read ports |
| vx_wr | Int | 2 | Number of VX write ports |
| ve_rd | Int | 2 | Number of VE async read ports |
| ve_wr | Int | 1 | Number of VE write ports |
| vr_rd | Int | 2 | Number of VR async read ports |
| vr_wr | Int | 2 | Number of VR write ports |
I/O ports
All reads are asynchronous (combinational read, registered in the caller). All writes are synchronous (registered on clock edge).
| Port | Direction | Width | Description |
|---|---|---|---|
| vx_r_addr(p) | Input | 5 bits | VX read address for port p |
| vx_r_data(p) | Output | K × N bits | VX read data for port p |
| vx_w_addr(p) | Input | 5 bits | VX write address for port p |
| vx_w_data(p) | Input | K × N bits | VX write data for port p |
| vx_w_en(p) | Input | Bool | VX write enable for port p |
| ve_r_addr(p) | Input | 4 bits | VE read address |
| ve_r_data(p) | Output | K × 2N bits | VE read data |
| ve_w_addr(p) | Input | 4 bits | VE write address |
| ve_w_data(p) | Input | K × 2N bits | VE write data |
| ve_w_en(p) | Input | Bool | VE write enable |
| vr_r_addr(p) | Input | 3 bits | VR read address |
| vr_r_data(p) | Output | K × 4N bits | VR read data |
| vr_w_addr(p) | Input | 3 bits | VR write address |
| vr_w_data(p) | Input | K × 4N bits | VR write data |
| vr_w_en(p) | Input | Bool | VR write enable |
| ext_r_addr | Input | 5 bits | External read (VX width); test-harness use |
| ext_r_data | Output | K × N bits | External read data |
| ext_w_addr | Input | 5 bits | External write address |
| ext_w_data | Input | K × N bits | External write data |
| ext_w_en | Input | Bool | External write enable |
Write priority (per physical row)
When multiple write ports target the same row in the same cycle, priority is:
VR (highest) > VE > VX > ext (lowest)
This priority is implemented as overwrite chaining in combinational logic:
the levels are applied from lowest to highest priority, and each successive level
simply overwrites the previous assignment to that row's wr_data wire, so the last
(highest-priority) assignment wins.
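Overwrite chaining for one physical row can be modelled as a loop that applies the levels from lowest to highest priority. This is a sketch only: the function name and the `requests` dictionary are hypothetical, each level is assumed to have already been reduced to an (enable, row data) pair for the row in question, and ordering between ports of the same class is not modelled.

```python
# Lowest priority first, so later entries overwrite earlier ones.
PRIORITY_LOW_TO_HIGH = ("ext", "VX", "VE", "VR")


def resolve_row_write(requests):
    """Return (write_enable, write_data) for one physical row.

    `requests` maps a port class to (enabled, row_data) for this row.
    """
    wr_en, wr_data = False, None
    for level in PRIORITY_LOW_TO_HIGH:
        enabled, data = requests.get(level, (False, None))
        if enabled:
            # A higher-priority level overwrites the earlier assignment.
            wr_en, wr_data = True, data
    return wr_en, wr_data


same_cycle = {
    "ext": (True, "ext data"),
    "VX": (True, "vx data"),
    "VR": (True, "vr slice"),
}
print(resolve_row_write(same_cycle))   # (True, 'vr slice')
```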
ext_r_addr must be driven
ext_r_addr is an input port that must always be driven from the backend, even when
the external read port is not in use. Default it to 0.U. Leaving it undriven causes
firtool to report an "uninitialized sink" elaboration error.
Backend Port Assignment
NCoreBackend instantiates MultiWidthRegisterBlock with 4 VX read ports,
2 VX write ports, 2 VE read/1 VE write port, and 2 VR read/2 VR write ports.
Read ports
| RF port | Index | Connected to | Purpose |
|---|---|---|---|
| vx_r_addr(0) | 0 | io.mma_a_addr | MMALU operand A |
| vx_r_addr(1) | 1 | io.vx_a_addr | VALU in_a_vx |
| vx_r_addr(2) | 2 | io.vx_b_addr | VALU in_b_vx |
| vx_r_addr(3) | 3 | io.mma_b_addr / io.ext_rd_addr | MMALU B or external read |
| ve_r_addr(0) | 0 | io.ve_a_addr | VALU in_a_ve |
| ve_r_addr(1) | 1 | io.ve_b_addr | VALU in_b_ve |
| vr_r_addr(0) | 0 | io.vr_a_addr | VALU in_a_vr (and out_vr read-back) |
| vr_r_addr(1) | 1 | io.vr_b_addr | VALU in_b_vr + in_c_vr |
| ext_r_addr | - | io.ext_rd_addr | Test-harness read (VX lanes) |
Write ports
| RF port | Index | Connected to | Purpose |
|---|---|---|---|
| vx_w_en(0) | 0 | VALU narrow result | INT8/BF8 output |
| vx_w_en(1) | 1 | io.ext_wr_en | Test-harness write |
| ve_w_en(0) | 0 | VALU VE result | INT16/BF16 output |
| vr_w_en(0) | 0 | VALU VR result | INT32/FP32 VALU output |
| vr_w_en(1) | 1 | MMALU accumulator (direct) | INT32 from systolic array, no truncation |
MMALU direct VR write
The MMALU's Vec(K, SInt((4*N).W)) accumulator output is wired directly into VR write
port 1. There is no INT8 truncation: the full INT32 precision is preserved in VR
for subsequent vcvt_f32_s32 and vfma instructions.
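To illustrate why the full-width write matters, the sketch below encodes a signed accumulator as a 4N-bit two's-complement VR lane and recovers it unchanged. The helper names are hypothetical, and the sample value (64 products of -128 × 127) is just an illustrative INT8 dot product, not a figure taken from the design.

```python
def to_vr_lane(acc: int, N: int = 8) -> int:
    """Encode a signed accumulator as a 4N-bit two's-complement word."""
    bits = 4 * N
    if not -(1 << (bits - 1)) <= acc < (1 << (bits - 1)):
        raise OverflowError("accumulator does not fit in 4N bits")
    return acc & ((1 << bits) - 1)


def from_vr_lane(word: int, N: int = 8) -> int:
    """Decode a 4N-bit two's-complement word back to a signed integer."""
    bits = 4 * N
    return word - (1 << bits) if word & (1 << (bits - 1)) else word


acc = 64 * (-128) * 127            # illustrative INT8 dot product: -1040384
word = to_vr_lane(acc)
print(hex(word))                   # 0xfff02000
print(from_vr_lane(word) == acc)   # True: nothing is lost in VR
print(word & 0xFF)                 # 0: the low N bits alone would lose the value
```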
Legacy RegisterBlock¶
src/main/scala/sram/register.scala contains the original flat RegisterBlock
(multi-bank, single width). It is still used by standalone MMALU and SA tests
(MMALUSpec, RegisterSpec, etc.) but is not used by NCoreBackend.
RegisterBlock w_addr quirk
RegisterBlock.io.w_addr is declared as Vec(rd_banks, ...) instead of
Vec(wr_banks, ...), a pre-existing inconsistency in how the port is sized. When using
this module in test harnesses, all rd_banks entries of w_addr must be explicitly driven
(even unused write address slots) to avoid firtool "uninitialized sink" errors.
See SimpleBackend.scala for the workaround pattern.
Implemented vs Planned
| Lane type | VX | VE | VR | Notes |
|---|---|---|---|---|
| INT8 (S8) | ✓ | — | — | Primary VALU dtype |
| INT16 (S16) | — | ✓ | — | ARITH, LOGIC ops |
| INT32 (S32) | — | — | ✓ | MMALU accumulator; ARITH, CVT |
| FP32 | — | — | ✓ | Tier-2 IEEE754 subset |
| BF16 | — | ✓ | — | CVT only (truncation/padding) |
| BF8 E4M3 | ✓ | — | — | CVT only |
| BF8 E5M2 | ✓ | — | — | CVT only (funct7[6]=1) |
| UINT8 (U8) | — | — | — | Planned |
| FP16 | — | — | — | Planned |