Neural Core (NCoreBackend)¶

Neural Core (NCoreBackend)

The Neural Core (NCoreBackend) is the central execution unit of the NPU. It integrates an instruction decoder, a multi-width register file, the systolic-array matrix engine (MMALU), and the vector ALU (VALU) into a single pipelined backend.

The design philosophy mirrors a lightweight super-scalar processor: while the systolic array is busy computing a matrix multiplication over many clock cycles, the VALU and load/store units can overlap with it to maximise throughput.

Components¶

graph TB
    FE["Frontend / Test Harness\n32-bit instruction words"]

    FE --> DEC

    subgraph NCoreBackend
        DEC["InstrDecoder\ncombinational\n32-bit word → DecodedMicroOp"]

        RF["MultiWidthRegisterBlock\nVX · VE · VR\naliased over L×K×N bytes"]

        MMA["MMALU\nK×K systolic array\n(n=K, nbits=N, accum=4N)"]

        VALU["VALU(K, N)\nK lanes × 3 widths\nFP32 · BF16 · BF8 · LUT"]

        DEC -->|"NCoreVALUBundle\n(op · regCls · dtype · sat · round · imm)"| VALU
        DEC -->|"NCoreMMALUCtrlBundle\n(keep · last · reset)"| MMA
        DEC -->|"rd · rs1 · rs2 · rs3"| RF

        RF -->|"VX/VE/VR read ports"| VALU
        RF -->|"VX read ports 0,3\n(in_a, in_b)"| MMA

        VALU -->|"out_vx → VX write port 0"| RF
        VALU -->|"out_ve → VE write port 0"| RF
        VALU -->|"out_vr → VR write port 0"| RF
        MMA  -->|"out (INT32, no truncation)\n→ VR write port 1"| RF
    end

InstrDecoder (`src/main/scala/isa/instrDecoder.scala`)¶

Purely combinational; one pipeline stage.
Input: 32-bit instruction word.
Output: DecodedMicroOp bundle (family, op, regCls, rd/rs1/rs2/rs3, imm, mma control).
Asserts io.illegal for reserved opcodes, reserved funct7 width bits, or CVT src == dst.
The decoded bundle arrives at VALU and MMALU in the same clock cycle as the instruction word.

MultiWidthRegisterBlock (`src/main/scala/sram/multiWidthRegister.scala`)¶

Physical storage: L × K × (N/8) bytes (256 B at default parameters).
Three aliased views: VX (K × N), VE (K × 2N), VR (K × 4N).
Async reads; synchronous writes.
See Registers for the full port table and aliasing rules.

MMALU (`src/main/scala/alu/mma/mma.scala`)¶

Systolic array with K×K processing elements.
Parameters: n = K (array side), nbits = N, accum_nbits = 4N.
Latency: 3K − 2 clock cycles from first input row to last output column.
Output: Vec(K, SInt(4N.W)) — written directly to VR write port 1 without truncation.
See Systolic Array for detailed timing.

VALU (`src/main/scala/alu/vec/vec.scala`)¶

K lanes of N(bits) each; supports VX (N), VE (2N), and VR (4N) width classes.
Includes IEEE754 Tier-2 FP32 helpers (fadd/fmul/fma), BF16 truncation, BF8 E4M3/E5M2 encoding.
1-tick output register for all ops except vfma (2 ticks).
See VectorALU for the full instruction reference.

Execution Pipeline¶

sequenceDiagram
    participant FE as Frontend
    participant DEC as InstrDecoder
    participant RF as Register File
    participant VU as VALU
    participant MMA as MMALU

    note over FE,MMA: Cycle 0 — fetch/issue

    FE->>DEC: 32-bit instr word
    DEC-->>RF: rd/rs1/rs2 (async read)
    RF-->>VU: in_a/b_vx/ve/vr (combinational)
    DEC-->>VU: NCoreVALUBundle
    DEC-->>MMA: NCoreMMALUCtrlBundle

    note over FE,MMA: Cycle 1 — compute + latch
    VU-->>VU: out_vx/ve/vr latch (RegNext)
    MMA-->>MMA: PE accumulate

    note over FE,MMA: Cycle 2 — write-back (VALU)
    VU->>RF: out_vx → VX write 0
    VU->>RF: out_ve → VE write 0
    VU->>RF: out_vr → VR write 0

    note over FE,MMA: Cycle 3K−2 — MMA finalise
    MMA->>RF: out (INT32) → VR write 1

VALU write-back requires 2-cycle hold

The VALU output register adds one cycle of latency. The backend (or a future frontend) must hold the decoded vector op active for 2 clock cycles to fire the write-back when out_vx/ve/vr are valid.

MMALU and VALU can overlap

The MMALU pipeline (3K−2 cycles) is independent of the VALU pipeline (1–2 cycles). A frontend scheduler can issue vector instructions (CVT, BCAST, FP) during the systolic array's drain phase to hide most of the quantization overhead.

Parameter Constraints¶

Constraint	Reason
`K == mmalu.n`	MMALU array side must equal VALU lane count. Enforced by `require` in `NCoreBackend`.
`L % 4 == 0`	VR aliasing needs VX rows in groups of 4. Enforced by `require` in `MultiWidthRegisterBlock`.
`N == mmalu.nbits`	MMALU input lane width must match VALU base lane width.
`4N == mmalu.accum_nbits`	MMALU accumulator width must match VR lane width.

Source Files¶

File	Description
`src/main/scala/backend/SimpleBackend.scala`	`NCoreBackend` module
`src/main/scala/isa/instrDecoder.scala`	`InstrDecoder` combinational module
`src/main/scala/isa/instrFormat.scala`	Bit-position constants, enums
`src/main/scala/isa/instSetArch.scala`	Opcode family and funct3 definitions
`src/main/scala/isa/NpuAssembler.scala`	Scala-side assembler helpers
`src/main/scala/sram/multiWidthRegister.scala`	`MultiWidthRegisterBlock`
`src/main/scala/alu/vec/vec.scala`	`VALU` module + `Qfmt` LUT tables
`src/main/scala/alu/vec/fp.scala`	`IEEE754` FP32/BF16/BF8 helpers + `FpRef` reference
`src/main/scala/alu/mma/mma.scala`	`MMALU` systolic engine

Test Coverage¶

Spec	What it covers
`InstrDecoderSpec`	All 13 opcode families: funct3, regCls, sat, round, rd/rs1/rs2, illegal detection
`MultiWidthRegisterSpec`	VX write/read, VX→VE alias, VR→VX alias, external port
`VALUArith/Logic/MinMax/Reduce/Lut/CastSpec`	VALU functional correctness (K=8)
`VALUFP32Spec`	FP32 add/mul/fma bit-accurate vs `java.lang.Float`
`VALUCvtSpec`	All CVT pairs, BF16 round-trip, BF8 E4M3 encoding
`VALUActivationSpec`	Softmax and GELU as primitive sequences
`NCoreBackendQuantSpec`	End-to-end: MMA → vcvt → vfma → vcvt quantization pipeline