Quantization Pipeline¶
This page shows how to implement post-GEMM INT8 quantization using the NPU instruction set — from raw INT8 inputs, through a matrix multiply, scale-and-shift in FP32, and back to INT8 for the next layer.
The key insight is that the NPU's MMALU accumulates in INT32 (full precision), and the
VALU's vcvt / vfma family converts between INT32, FP32, and INT8 entirely on-chip.
No host-side dequantization round-trip is needed.
Background: Uniform Affine Quantization¶
Linear quantization maps a floating-point value \(x\) to an integer \(q\) as:
And dequantizes as:
For INT8, \(q \in [-128, 127]\).
After a matrix multiplication \(Y = W \cdot X\) (weights × activations), the accumulator output is an INT32 sum. To produce the output activations for the next layer, we need to:
- Dequantize the INT32 accumulator: \(y_{\text{fp32}} = \text{acc} \times S_W \times S_X\)
- Apply bias/scale: \(y_{\text{fp32}} = y_{\text{fp32}} \times S_{\text{out}} + \text{zp}_{\text{out}}\)
- Requantize to INT8: \(q_{\text{out}} = \text{clip}(\text{round}(y_{\text{fp32}}), -128, 127)\)
Steps 2 and 3 collapse into a single vfma + vcvt_f32_s8 on the NPU.
Register Allocation¶
Using the default parameters (K=8 lanes, N=8, L=32):
| Register | Class | Content |
|---|---|---|
| VX[0..K-1] | VX | Quantized INT8 input activations X[0..K-1] |
| VX[K..2K-1] | VX | Quantized INT8 weights W[0..K-1] |
| VR[0] | VR | Combined scale: S_W × S_X × S_out⁻¹ (FP32, broadcast) |
| VR[1] | VR | Output zero-point: zp_out (FP32, broadcast) |
| VR[2] | VR | MMALU INT32 accumulator output → FP32 intermediate |
| VX[31] | VX | Final INT8 quantized output |
Instruction Sequence¶
Setup (run once per layer, outside the inner loop)¶
The hardware does not yet have a constant-load instruction for 32-bit values. The recommended approach in current software (and tests) is:
- Write the FP32 constant bytes into a VX register via the external RF write port (used by a loader/DMA or the test harness).
- Issue
bcast.vrto broadcast VX lane 0 (reinterpreted as FP32) across all K VR lanes.
# Pre-load scale FP32 bits into VX[0] lane 0 via ext_wr port (DMA / loader)
ext_write VX[0], float_bits(scale) # K copies of the same 4-byte FP32 word
# Splat VX[0] lane 0 across VR[0] (all K lanes)
bcast.vr VR[0], VX[0] # funct7: width=VR
# Pre-load zp FP32 bits into VX[1]
ext_write VX[1], float_bits(zp)
bcast.vr VR[1], VX[1]
Using NpuAssembler in Scala tests (mirrors NCoreBackendQuantSpec):
import isa.NpuAssembler._
import java.lang.Float.floatToRawIntBits
val scale = 0.0078125f // example: 1/128
val zp = 0.0f
val scaleBits = floatToRawIntBits(scale)
val zpBits = floatToRawIntBits(zp)
// Write FP32 word into VX[0] lane 0 via ext write port (all K lanes get the same value)
extWrite(dut, addr=0, Array.fill(K)((scaleBits & 0xFF).toByte.toInt))
// NOTE: for a 32-bit FP value in a VR register the backend writes 4 consecutive
// VX rows; using the VR ext write port is cleaner if available.
// Broadcast VX[0] lane 0 → VR[0], all K lanes
val setup = Seq(
vbcast(rd=0, rs1=0, width=VR), // scale → VR[0]
vbcast(rd=1, rs1=1, width=VR), // zp → VR[1]
)
Inner loop (one K-lane dot product)¶
# 1. Matrix Multiply-Accumulate: VR[2] = Σ VX[0..K-1] × VX[K..2K-1] (INT32)
mma VR[2], VX[0], VX[K], keep=true
mma.last VR[2], VX[0], VX[K]
# 2. INT32 → FP32
vcvt_s32_f32 VR[2], VR[2] # in: VR[2] INT32, out: VR[2] FP32
# 3. FP32 FMA: VR[2] = VR[2] * scale + zp
vfma VR[2], VR[2], VR[0], VR[1]
# 4. FP32 → INT8 saturated
vcvt_f32_s8 VX[31], VR[2]
Using NpuAssembler:
val innerLoop = Seq(
mma (rd=2, rs1=0, rs2=8, keep=true),
mmaLast(rd=2, rs1=0, rs2=8),
vcvt_s32_f32(rd=2, rs1=2),
vfma (rd=2, rs1=2, rs2=0, rs3=1),
vcvt_f32_s8(rd=31, rs1=2, sat=true),
)
// poke as: (instr.toLong & 0xFFFFFFFFL).U
Timing¶
gantt
title Quantization pipeline (K=4 example, 3K-2 = 10 MMA cycles)
dateFormat X
axisFormat %s
section Setup (once, via ext_wr + bcast)
ext write + bcast scale VR0 : 0, 3
ext write + bcast zp VR1 : 3, 6
section Inner loop
mma (streaming) : 6, 16
mma.last : 16, 17
vcvt_s32_f32 : 17, 18
vfma (2 ticks) : 18, 20
vcvt_f32_s8 : 20, 21
| Phase | Instructions | Ticks |
|---|---|---|
| Setup (once) | 2× ext_write (DMA/loader) + 2× bcast.vr | 4 |
| MMA pipeline | mma × (K-1) + mma.last | 3K−2 |
| INT32 → FP32 | vcvt_s32_f32 | 1 |
| Scale + shift | vfma | 2 |
| FP32 → INT8 | vcvt_f32_s8 | 1 |
| Per-tile total | (excluding setup) | 3K |
For K=64 (top-level configuration): 192 clock cycles per K×K tile plus 2 bcast.vr setup cycles amortised across many tiles (the ext_write/DMA takes place before the pipeline starts).
Timing Diagram¶
Complete Worked Example (K=8)¶
The following Scala test from NCoreBackendQuantSpec demonstrates the full pipeline:
// 1. Load quantized INT8 inputs and weights into VX via ext write port
extWrite(dut, addr=0, inputActivations) // → VX[0]
extWrite(dut, addr=8, weights) // → VX[8..15] (K rows)
// 2. Write FP32 scale/zp via ext write port, then broadcast into VR
extWrite(dut, addr=0, Array.fill(K)(scaleBits & 0xFF)) // VX[0] ← scale bytes
extWrite(dut, addr=1, Array.fill(K)(zpBits & 0xFF)) // VX[1] ← zp bytes
issue(dut, vbcast(rd=0, rs1=0, width=VR)) // → VR[0]: scale × K lanes
issue(dut, vbcast(rd=1, rs1=1, width=VR)) // → VR[1]: zp × K lanes
// 3. Run MMA (K rows, keep=true for all but last)
for (row <- 0 until K-1) {
dut.io.mma_a_addr.poke(row.U)
dut.io.mma_b_addr.poke((row + K).U)
issue(dut, mma(rd=2, rs1=row, rs2=row+K, keep=true))
}
issue(dut, mmaLast(rd=2, rs1=K-1, rs2=2*K-1))
// → VR[2] now holds INT32 dot product
// 4. Convert and quantize
issue(dut, vcvt_s32_f32(rd=2, rs1=2)) // INT32 → FP32
issue(dut, vfma(rd=2, rs1=2, rs2=0, rs3=1)) // ×scale + zp
issue(dut, vcvt_f32_s8(rd=31, rs1=2, sat=true)) // FP32 → INT8 saturated
// 5. Read result from VX[31]
dut.io.ext_rd_addr.poke(31.U)
val result = Array.tabulate(K)(i => dut.io.ext_rd_data(i).peek().litValue.toByte.toInt)
Numerical example¶
For K=1 scalar lane, inputs a=10, w=5, scale=0.01, zp=0:
| Step | Computation | Result |
|---|---|---|
| INT8 × INT8 | 10 × 5 | 50 (INT32) |
| INT32 → FP32 | float(50) |
50.0f |
| vfma | 50.0 × 0.01 + 0.0 |
0.5f |
| FP32 → INT8 | round(0.5) saturated |
1 (INT8) |
Pipelining with Future Tiles¶
Because VALU instructions execute independently of the MMALU pipeline, a future out-of-order front-end can overlap the post-quantization steps of tile N with the MMA drain of tile N+1:
gantt
title Pipelined quantization (two K-tile batches)
dateFormat X
axisFormat %s
section Tile 0
MMA streaming : 0, 10
mma.last : 10, 11
vcvt + vfma + cvt : 11, 15
section Tile 1 (overlapped)
MMA streaming : 11, 21
mma.last : 21, 22
vcvt + vfma + cvt : 22, 26
The VALU operations for tile 0 (cycles 11–14) overlap the MMA streaming for tile 1 (cycles 11–20), hiding the 4-cycle quantization overhead.
Related Pages¶
- ISA Reference — full instruction encoding and opcode families
- VectorALU — VALU instruction reference (CVT, FP, FMA)
- Registers — VX/VE/VR aliasing and port assignment
- Systolic Array — MMALU timing (3K−2 cycle pipeline)
- Neural Core — backend architecture and parameter constraints