Register Files¶

Register Files

The NPU register file is implemented as a multi-width aliased block (MultiWidthRegisterBlock) that presents three views over a single physical byte array: VX (base), VE (paired), and VR (quad). This design lets GEMM, vector arithmetic, and FP32 post-processing share the same storage without data copying.

Notation¶

Symbol	Meaning	Default (test)	Default (top)
`N` (N(bits))	Base lane width in bits	8	8
`L`	Number of VX registers (must be divisible by 4)	32	32
`K`	SIMD lane count per register	8	64

Physical storage = L × K × (N/8) bytes = 256 B at test defaults / 2 KiB at top.

Register Classes¶

Three views share the same physical bytes:

Class	Count	Lane width	Per-reg bits	Address bits	Alias
VX[0..L-1]	32	N bits	K × N	5	native row
VE[0..L/2-1]	16	2N bits	K × 2N	4	VE[i] = VX[2i] ∥ VX[2i+1]
VR[0..L/4-1]	8	4N bits	K × 4N	3	VR[i] = VX[4i..4i+3]

Physical byte layout (N=8, K=8, L=32)¶

Physical bytes: 32 rows × 8 lanes × 1 byte = 256 B

Row  0  : VX[0]  lane[0..7]   ─┐
Row  1  : VX[1]  lane[0..7]   ─┤ VE[0]  lane[0..7] ─┐
Row  2  : VX[2]  lane[0..7]   ─┤                      ┤ VR[0]  lane[0..7]
Row  3  : VX[3]  lane[0..7]   ─┤ VE[1]  lane[0..7] ─┘
          ...                   ┘
Row  4  : VX[4]  lane[0..7]   ─┐
Row  5  : VX[5]  lane[0..7]   ─┤ VE[2]  lane[0..7] ─┐
Row  6  : VX[6]  lane[0..7]   ─┤                      ┤ VR[1]  lane[0..7]
Row  7  : VX[7]  lane[0..7]   ─┤ VE[3]  lane[0..7] ─┘
          ...

Row 28  : VX[28] lane[0..7]   ─┐
Row 29  : VX[29] lane[0..7]   ─┤ VE[14] lane[0..7] ─┐
Row 30  : VX[30] lane[0..7]   ─┤                      ┤ VR[7]  lane[0..7]
Row 31  : VX[31] lane[0..7]   ─┤ VE[15] lane[0..7] ─┘

Lane packing within a VE or VR word (little-endian):

VE[i] lane j = { VX[2i+1][lane j][N-1:0], VX[2i][lane j][N-1:0] }
                  ──── hi N bits ────────   ──── lo N bits ────────

VR[i] lane j = { VX[4i+3][lane j], VX[4i+2][lane j],
                  VX[4i+1][lane j], VX[4i+0][lane j] }
                  ─── bits [4N-1:3N] ───  ─── [3N-1:2N] ───
                  ─── [2N-1:N] ─────────  ─── [N-1:0] ───

Aliasing consequences¶

Writing VR[i] atomically updates VX[4i], VX[4i+1], VX[4i+2], VX[4i+3] and thus VE[2i], VE[2i+1].
Writing VE[i] atomically updates VX[2i] and VX[2i+1].
Reading VX[j] after a VR write to VR[j/4] returns the byte that was written.
Conflict resolution: last writer wins per physical row. Software is responsible for avoiding write-after-write conflicts within the same cycle.

MultiWidthRegisterBlock¶

Source: src/main/scala/sram/multiWidthRegister.scala
Package: sram.mwreg

Parameters¶

Parameter	Type	Default	Description
`L`	Int	32	Number of VX rows (must be divisible by 4)
`K`	Int	8	SIMD lane count
`N`	Int	8	Base lane width in bits (N(bits))
`vx_rd`	Int	4	Number of VX async read ports
`vx_wr`	Int	2	Number of VX write ports
`ve_rd`	Int	2	Number of VE async read ports
`ve_wr`	Int	1	Number of VE write ports
`vr_rd`	Int	2	Number of VR async read ports
`vr_wr`	Int	2	Number of VR write ports

I/O ports¶

All reads are asynchronous (combinational read, registered in the caller). All writes are synchronous (registered on clock edge).

Port	Direction	Width	Description
`vx_r_addr(p)`	Input	5 bits	VX read address for port p
`vx_r_data(p)`	Output	K × N bits	VX read data for port p
`vx_w_addr(p)`	Input	5 bits	VX write address for port p
`vx_w_data(p)`	Input	K × N bits	VX write data for port p
`vx_w_en(p)`	Input	Bool	VX write enable for port p
`ve_r_addr(p)`	Input	4 bits	VE read address
`ve_r_data(p)`	Output	K × 2N bits	VE read data
`ve_w_addr(p)`	Input	4 bits	VE write address
`ve_w_data(p)`	Input	K × 2N bits	VE write data
`ve_w_en(p)`	Input	Bool	VE write enable
`vr_r_addr(p)`	Input	3 bits	VR read address
`vr_r_data(p)`	Output	K × 4N bits	VR read data
`vr_w_addr(p)`	Input	3 bits	VR write address
`vr_w_data(p)`	Input	K × 4N bits	VR write data
`vr_w_en(p)`	Input	Bool	VR write enable
`ext_r_addr`	Input	5 bits	External read (VX width); test-harness use
`ext_r_data`	Output	K × N bits	External read data
`ext_w_addr`	Input	5 bits	External write address
`ext_w_data`	Input	K × N bits	External write data
`ext_w_en`	Input	Bool	External write enable

Write priority (per physical row)¶

When multiple write ports target the same row in the same cycle, priority is:

VR (highest) > VE > VX > ext (lowest)

The last-priority rule is implemented as overwrite chaining in combinational logic: each successive priority level simply overwrites the previous assignment for that row's wr_data wire.

ext_r_addr must be driven

ext_r_addr is an input port that must always be driven from the backend, even when the external read port is not in use. Default it to 0.U. Leaving it undriven causes firtool to report an "uninitialized sink" elaboration error.

Backend Port Assignment¶

NCoreBackend instantiates MultiWidthRegisterBlock with 4 VX read ports, 2 VX write ports, 2 VE read/1 VE write port, and 2 VR read/2 VR write ports.

Read ports¶

RF port	Index	Connected to	Purpose
`vx_r_addr(0)`	0	`io.mma_a_addr`	MMALU operand A
`vx_r_addr(1)`	1	`io.vx_a_addr`	VALU `in_a_vx`
`vx_r_addr(2)`	2	`io.vx_b_addr`	VALU `in_b_vx`
`vx_r_addr(3)`	3	`io.mma_b_addr` / `io.ext_rd_addr`	MMALU B or external read
`ve_r_addr(0)`	0	`io.ve_a_addr`	VALU `in_a_ve`
`ve_r_addr(1)`	1	`io.ve_b_addr`	VALU `in_b_ve`
`vr_r_addr(0)`	0	`io.vr_a_addr`	VALU `in_a_vr` (and `out_vr` read-back)
`vr_r_addr(1)`	1	`io.vr_b_addr`	VALU `in_b_vr` + `in_c_vr`
`ext_r_addr`	—	`io.ext_rd_addr`	Test-harness read (VX lanes)

Write ports¶

RF port	Index	Connected to	Purpose
`vx_w_en(0)`	0	VALU narrow result	INT8/BF8 output
`vx_w_en(1)`	1	`io.ext_wr_en`	Test-harness write
`ve_w_en(0)`	0	VALU VE result	INT16/BF16 output
`vr_w_en(0)`	0	VALU VR result	INT32/FP32 VALU output
`vr_w_en(1)`	1	MMALU accumulator (direct)	INT32 from systolic array — no truncation

MMALU direct VR write

The MMALU's Vec(K, SInt(4N.W)) accumulator output is wired directly into VR write port 1. There is no INT8 truncation — the full INT32 precision is preserved in VR for subsequent vcvt_f32_s32 and vfma instructions.

Legacy RegisterBlock¶

src/main/scala/sram/register.scala contains the original flat RegisterBlock (multi-bank, single width). It is still used by standalone MMALU and SA tests (MMALUSpec, RegisterSpec, etc.) but is not used by NCoreBackend.

RegisterBlock w_addr quirk

RegisterBlock.io.w_addr is declared as Vec(rd_banks, ...) instead of Vec(wr_banks, ...) — a pre-existing naming inconsistency. When using this module in test harnesses, all rd_banks entries of w_addr must be explicitly driven (even unused write address slots) to avoid firtool "uninitialized sink" errors. See SimpleBackend.scala for the workaround pattern.

Implemented vs Planned¶

Lane type	VX	VE	VR	Notes
INT8 (S8)	✓	—	—	Primary VALU dtype
INT16 (S16)	—	✓	—	ARITH, LOGIC ops
INT32 (S32)	—	—	✓	MMALU accumulator; ARITH, CVT
FP32	—	—	✓	Tier-2 IEEE754 subset
BF16	—	✓	—	CVT only (truncation/padding)
BF8 E4M3	✓	—	—	CVT only
BF8 E5M2	✓	—	—	CVT only (funct7[6]=1)
UINT8 (U8)	—	—	—	Planned
FP16	—	—	—	Planned