[WIP] Memory

Memory design is always a major concern for NPU architectures, because SIMD and SIMT designs face a high memory wall between the processing units and the memory devices. Our implementation targets an integrated SoC solution, so we assume that we share the L2 caches (which speeds up DMA access to DDR) and have a large scratchpad memory (SPM) to hold intermediate results.

We borrow some design ideas from the NVIDIA PTX ISA. Because we are designing a slave device for the CPU, the architecture will be Harvard-ish.

Memory Address Layout

| Name | Address | Internal Access | External Access | Comment |
|---|---|---|---|---|
| .reg | TBD | R/W | R/W | Per Processing Core |
| .sreg | TBD | RO (R/W only by CU) | RO | All Cores |
| .code | TBD | R/W | R/W | Per Processing Core |
| .code.kernel | TBD | RO | R/W (only by bios?) | Kernel Library Code |
| .data | TBD | R/W | R/W | Per Processing Core |
| .sdata | TBD | R/W | R/W | Global |
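
The access rules in the layout above can be written down as a small lookup table. The following is only a sketch in C of how a simulator or driver might encode them: the type and function names (`mem_space_t`, `find_space`, `internal_can_write`, `external_can_write`) are hypothetical, the base addresses are left out because they are still TBD, and the `cu_writable` flag is an assumed way of modeling the "R/W only by CU" exception for `.sreg`.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Access permissions, mirroring the R/W and RO entries of the layout. */
typedef enum { ACC_RO, ACC_RW } access_t;

/* One entry per memory space. All names here are hypothetical. */
typedef struct {
    const char *name;        /* space name, e.g. ".reg" */
    access_t    internal;    /* access from the processing cores */
    access_t    external;    /* access from outside the NPU (host side) */
    bool        cu_writable; /* assumption: models "R/W only by CU" */
} mem_space_t;

/* Base addresses are TBD in the design, so they are deliberately omitted. */
static const mem_space_t MEM_SPACES[] = {
    { ".reg",         ACC_RW, ACC_RW, false },
    { ".sreg",        ACC_RO, ACC_RO, true  },
    { ".code",        ACC_RW, ACC_RW, false },
    { ".code.kernel", ACC_RO, ACC_RW, false }, /* external writer still open */
    { ".data",        ACC_RW, ACC_RW, false },
    { ".sdata",       ACC_RW, ACC_RW, false },
};

#define MEM_SPACE_COUNT (sizeof(MEM_SPACES) / sizeof(MEM_SPACES[0]))

/* Look up a space by name; returns NULL if the name is unknown. */
const mem_space_t *find_space(const char *name) {
    for (size_t i = 0; i < MEM_SPACE_COUNT; ++i) {
        if (strcmp(MEM_SPACES[i].name, name) == 0) {
            return &MEM_SPACES[i];
        }
    }
    return NULL;
}

/* May an agent inside the NPU write this space? is_cu marks the CU. */
bool internal_can_write(const mem_space_t *s, bool is_cu) {
    if (s->internal == ACC_RW) {
        return true;
    }
    return is_cu && s->cu_writable;
}

/* May an agent outside the NPU write this space? */
bool external_can_write(const mem_space_t *s) {
    return s->external == ACC_RW;
}
```

Keeping the permissions in one table means that a change to the layout (for example, once the question of who may write `.code.kernel` is settled) only touches a single row.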

Registers

| Name | Address | Bitwidth | Comment |
|---|---|---|---|
| Registers | ==== | ==== | ==== |
| .reg.ax | TBD | 32 | Vector register AX per Processing Core |
| .reg.bx | TBD | 32 | Vector register BX per Processing Core |
| .reg.cx | TBD | 32 | Vector register CX per Processing Core |
| .reg.dx | TBD | 32 | Vector register DX per Processing Core |
| Shared Registers | ==== | ==== | ==== |
| .sreg.fl | TBD | 32 | Flag register for CU |
| .sreg.pc | TBD | 32 | Program pointer |
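
As an illustration, the register set above could be modeled in C roughly as follows. This is a sketch under assumptions, not the actual register map: the core count `NPU_NUM_CORES` is a made-up placeholder, the type and function names are hypothetical, and no addresses are assigned because they are still TBD. Only the register names and the 32-bit widths come from the listing above.

```c
#include <stdbool.h>
#include <stdint.h>

#define NPU_NUM_CORES 4u /* hypothetical; the core count is not specified */

/* Per-core registers .reg.ax through .reg.dx, each 32 bits wide. */
typedef enum { REG_AX, REG_BX, REG_CX, REG_DX, REG_COUNT } reg_id_t;

typedef struct {
    uint32_t r[REG_COUNT];
} core_regs_t;

/* Shared registers: .sreg.fl (flags for the CU) and .sreg.pc. */
typedef struct {
    uint32_t fl;
    uint32_t pc;
} shared_regs_t;

/* One private register bank per core plus a single shared bank. */
typedef struct {
    core_regs_t   reg[NPU_NUM_CORES];
    shared_regs_t sreg;
} npu_regs_t;

/* Write a per-core register; returns false for an invalid core or id. */
bool reg_write(npu_regs_t *n, unsigned core, reg_id_t id, uint32_t value) {
    if (core >= NPU_NUM_CORES || (unsigned)id >= (unsigned)REG_COUNT) {
        return false;
    }
    n->reg[core].r[id] = value;
    return true;
}

/* Read a per-core register into *out; returns false on invalid input. */
bool reg_read(const npu_regs_t *n, unsigned core, reg_id_t id, uint32_t *out) {
    if (core >= NPU_NUM_CORES || (unsigned)id >= (unsigned)REG_COUNT) {
        return false;
    }
    *out = n->reg[core].r[id];
    return true;
}
```

A write to one core's bank leaves the other cores untouched, which is the "per Processing Core" property from the comments; the shared bank exists once for all cores.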

Memory Barriers on SW & HW

Here are some good references on this topic:

1. Breaking Down Barriers: An Intro to GPU Synchronization
2. An Intro to GPU Architecture and Programming Models (also helpful for parallel computing in general)