Systolic Array¶
Systolic arrays are often the fundamental execution unit in modern NPU implementations. They are efficient in matrix multiplication, which is widely adopted in neural networks. However, it is not flexible to execute high performance vector or scalar operations. In general, a systolic array will have latency of \(3N-2\) (from the first element of input matrices to the final element of the output matrix). To maximize the gain from a systolic array architecture, most implementation will have around \(64\times 64\) grid of PEs. Larger size of the grid will suffer more from the memory wall. Hence hardware engineers need to battle with this multi-constrained problem and try to find a sweet point for the design.
Improving IPC(Instruction per Cycle) of systolic arrays is quite straight forward. We will handle the pre-fetch and pipeline the stages for sure. We also need to design an appropriate control for the systolic array to make it exeuting instructions without bubbles in the array.
Functional Model¶
The K×K (one feed window) result is
This is the M=K corner case of a more general capability — the array natively performs ?×K reductions for arbitrary M ≥ K. See M×K Streaming Reduction below.
What is bubbles in a systolic array?¶
We need to reshape the input matrix into a stair shaped vectors like below. And there will be some areas (marked in gray) where does not have any useful data in it. This is what we call bubbles here.
Those bubbles will drastically harm the performance if not handled properly. Some of you may notice that we can actually tile the previous input matrix with it. Yes that is correct, but we need to figure out the timing first.
Timing of systolic array: with an \(4\times 4\) array example¶
We will discuss the timing with an \(4\times 4\) systolic array. The latency from an \(4\times 4\) systolic array is \(3\times 4 - 2 = 10\). We have already demonstrated the system status at the first tick in the previous section. And here is the system status at the 4-th tick:
At the end of the 4-th cycle, the first element of the systolic array has completed its computation. From this cycle, we can gradually write back the result to scratch pad memory (SPM / L1-D) or L2 memory.
The next cycle (the 5-th cycle), the first element will be idle. In another word, this element can accept data for the next input and compute.
The systolic will not receiver any input for the current matrix multiplication from this cycle.
This is the what the final cycle will look like.
General Timing for \(N\times N\) array¶
Micro Operation Timing in Pseudo Assembly¶
Taking \(4\times 4\) SA as an example, an now a GEMM \(Y=B^TA+C\) can be generated as below:
.code
main:
# prepare the registers for matrix A, bank0 - bank3 (vb0-vb3), 4x8bits per bank
# prepare the registers for matrix B, bank4 - bank7 (vb4-vb7), 4x8bits per bank
# prepare the registers for matrix C, bank8 - bank11 (vb8-vb11), 4x8bits per bank
# prepare the registers for matrix Y, bank12 - bank15 (vb12-vb15), 4x8bits per bank
mma vb0, vb4, vb8, vb12
mma vb1, vb5, vb9, vb13
mma vb2, vb6, vb10, vb14
mma.last vb3, vb7, vb11, vb15
nop
And a tiled GEMM can be generated as below:
.code
main:
# first round mma will have no C available
mma vb0, vb4, 0, vb12
mma vb1, vb5, 0, vb13
mma vb2, vb6, 0, vb14
mma.last vb3, vb7, 0, vb15
# second round mma will have the previous output as C
# pipelined register will enque the vector for control
mma vb8, vb4, vb12, vb12
mma vb9, vb5, vb13, vb13
mma vb10, vb6, vb14, vb14
mma.last vb11, vb7, vb15, vb15
nop
M×K Streaming Reduction¶
The K-burst protocol shown above (one mma.last per K data rows) is one valid
way to use the array. It is not the architectural limit. The K×K
processing element grid is fundamentally a ?×K reduction engine: when
ctrl.keep is held high continuously for M cycles, every cycle of valid
(a, b) data is summed into PE.res for arbitrary M ≥ K. Nothing in the PE,
DataFeeder, or ControlUnit forces a K-cycle granularity — only the explicit
keep=false cycle that the K-burst protocol pokes between sub-passes.
Generalized formula¶
For an M-cycle continuous feed with ctrl.keep and ctrl.busy held high:
so the K×K array delivers \(C = B^{\mathsf T} A\) where \(A, B \in \mathbb{Z}^{M\times K}\). The K×K functional model is recovered by setting M = K.
Staircased output frames¶
Crucially, the DataCollector continues to emit a valid K×K matrix at every K-cycle boundary during the feed, not only at the end. Define frame f (\(f = 1, 2, \ldots, \lceil M/K \rceil\)) as the K consecutive output cycles at
(with \(T = i_{\text{tick}} + 1\), the cycle after step()). The contents are
cumulative partial sums:
The last frame (\(f = \lceil M/K \rceil\)) carries the full M-cycle accumulation; earlier frames carry running partial sums that you can either read for free or ignore.
Protocol comparison¶
The difference between the K-burst protocol and the M×K streaming protocol is
exactly one keep=false cycle:
sequenceDiagram
participant H as Test harness / frontend
participant R0 as reg(0) (control latch)
participant PE as PE(0,0)
participant R as PE(0,0).res
Note over H,R: K-burst protocol (M = K, explicit reset between bursts)
H->>R0: cycles 0..K-1 ctrl.keep = true
H->>R0: cycle K-1 ctrl.keep = FALSE (boundary marker)
H->>R0: cycles K..2K-1 ctrl.keep = true
R0->>PE: keep wave + 1-cycle latch
PE->>R: cycles 1..K res += a·b (1st K-sum reached)
PE->>R: cycle K+1 res := a·b (RESET, 1st sum lost)
PE->>R: cycles K+2..2K res += a·b (2nd K-sum reached)
Note over H,R: ?×K streaming protocol (M arbitrary, no reset inside feed)
H->>R0: cycles 0..M-1 ctrl.keep = true (held)
PE->>R: cycles 1..M res += a·b (full M-sum reached)
Note over R: every K cycles, the collector emits the running cumulative sum
When to use which protocol¶
| Use case | Protocol | Notes |
|---|---|---|
| Two independent K×K matmuls back-to-back | K-burst (keep=false at i_tick = K-1) |
Each result lives in its own frame; the existing MMALUSpec "in stream" pattern |
| One M×K reduction (tiled GEMM inner-k loop) | Continuous keep=true |
Final frame at the last K output cycles; intermediate frames available for free |
| Both partial K-sum AND full M-sum needed | Continuous keep=true |
All \(\lceil M/K \rceil\) frames observable |
Timing diagram — streaming case (K = 4, M = 8)¶
io.clct is busy delayed by 2K−2, so it spans the full streaming feed
plus the drain. Two staircased K×K result frames appear within that window:
Frame 1 lives at output cycles 7..10 (= \(2K{-}1 \,..\, 3K{-}2\)) and equals \(\sum_{m=0}^{K-1} A_{m,j}\cdot B_{m,i}\). Frame 2 lives at output cycles 11..14 (= \(3K{-}1 \,..\, 4K{-}2\)) and equals the full \(\sum_{m=0}^{2K-1} A_{m,j}\cdot B_{m,i}\).
Tested¶
End-to-end verified for K = 4 in
src/test/scala/alu/mma/MMALUStreamReduceSpec.scala:
| Test | M | What it asserts |
|---|---|---|
| 1 | 2K | Both K-partial frame and full 2K-sum frame appear during one continuous feed |
| 2 | 3K | All three staircased frames (K, 2K, 3K sums) |
| 3 | 5K | All five frames — stresses cnt mod K wrap-around |
| 4 | 2K + injected keep=false at i_tick = K-1 |
Frame 2 collapses to the second-K-only sum (not the 2K-sum) — proves the inserted reset is the sole cause of K-granularity |
Run with tool/test-specific-spec.sh alu.mma.MMALUStreamReduceSpec.
Gaps for software dispatch¶
Two known asymmetries between the direct-ctrl MMALU interface (used by the
tests) and the current ISA path; both must be resolved before a frontend can
issue mma/mma.last to drive a real M×K reduction:
MMA_LASTdecode (src/main/scala/isa/instrDecoder.scala:254-260) setsmma_last := truebut leavesmma_keep := false. Issued literally, the last data cycle wouldres := a·b(REPLACE) rather than accumulate — wiping the M-sum at the very end. Fix options:- Have
MMA_LASTalso honourfunct7[4]as a keep bit (a literalmma.last keep=truebecomes well-defined), or - Treat
mma.lastas a zero-data boundary marker that follows the lastmma keep=true.
- Have
- MMA register-file addressing
(
src/main/scala/backend/SimpleBackend.scala:90-92) —mma_a_addr/mma_b_addr/mma_out_addrare separate top-level inputs, not yet driven fromdec.rs1/dec.rs2/dec.rd. A frontend has to sequence them in lockstep with eachmmainstruction.
Both are small, isolated fixes; this section documents the architectural capability so a future PR can rely on it.
Data types¶
Implemented items are checked below
- [ ] s16
- [x] s8
- [ ] s4pack2
- [ ] s2pack4
Matrix Multiplication will only have identical data types for both inputs and outputs
Summary¶
Systolic array can support pipelining, but unlike traditional DSP architectures, a systolic pipeline is more gradual and continuous. So the control unit should be designed carefully to make use of this advantage.
Waveform — K-burst protocol¶
For a \(4\times4\) systolic array driving three back-to-back K-burst MMAs (the
"in stream" protocol from MMALUSpec). Note the keep=0 pulse at every
K-cycle boundary that resets the PE accumulator between independent K×K
results.
Compare against the streaming-protocol waveform in the
M×K Streaming Reduction section above — that one
omits the keep=0 pulse and accumulates across the whole feed instead.