[WIP] Memory¶

Memory design is always a big issue for NPU architectures as SIMD and SIMT always have a high memory wall between the processing unit and the memory devices. Our implementation is seeking a integrated SoC solution so we assume that the we are sharing L2 caches (which boosts DMA DDR access) and having a large scratch pad memory (SPM) to save the intermediate results.

We are referring some of the designs in NVIDIA PTX ISA. And as we are designing a slave device for the CPU so the architecture will be Harvard-ish.

Memory Address Layout¶

Name	Address	Internal Access	External Access	Comment
.reg	TBD	R/W	R/W	Per Processing Core
.sreg	TBD	RO (R/W only by CU)	RO	All Cores
.code	TBD	R/W	R/W	Per Processing Core
.code.kernel	TBD	RO	R/W (only by bios?)	Kernel Library Code
.data	TBD	R/W	R/W	Per Processing Core
.sdata	TBD	R/W	R/W	Global

Registers¶

Name	Address	Bitwidth	Comment
Registers	====	====	====
.reg.ax	TBD	32	Vector register AX per Processing Core
.reg.bx	TBD	32	Vector register BX per Processing Core
.reg.cx	TBD	32	Vector register CX per Processing Core
.reg.dx	TBD	32	Vector register DX per Processing Core
Shared Registers	====	====	====
.sreg.fl	TBD	32	flag register for CU
.sreg.pc	TBD	32	program pointer

Memory Barriers on SW & HW¶

Here are some good references to this topic: 1. Breaking Down Barriers: An Intro to GPU Synchronization 2. This is also helpful for parallel computing: An Intro to GPU Architecture and Programming Models