[WIP] Memory
Memory design is a central concern for NPU architectures, because SIMD and SIMT designs face a high memory wall between the processing units and the memory devices. Our implementation targets an integrated SoC solution, so we assume that we share the L2 caches (which speeds up DMA access to DDR) and have a large scratchpad memory (SPM) to hold intermediate results.
We borrow some of the designs from the NVIDIA PTX ISA. Because we are designing a slave device for the CPU, the architecture will be Harvard-ish.
Memory Address Layout
| Name | Address | Internal Access | External Access | Comment |
|---|---|---|---|---|
| .reg | TBD | R/W | R/W | Per Processing Core |
| .sreg | TBD | RO (R/W only by CU) | RO | All Cores |
| .code | TBD | R/W | R/W | Per Processing Core |
| .code.kernel | TBD | RO | R/W (only by BIOS?) | Kernel Library Code |
| .data | TBD | R/W | R/W | Per Processing Core |
| .sdata | TBD | R/W | R/W | Global |
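
The access rules in the table can be expressed as a small lookup, which may help when writing a simulator or a test bench. This is only a sketch under stated assumptions: the names `MemorySpace`, `SPACES`, and `can_access` are hypothetical, addresses are left out because they are still TBD, and the "only by BIOS?" restriction on `.code.kernel` is not modeled because it is still an open question.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemorySpace:
    name: str
    internal: str  # access from inside the NPU: "RO" or "R/W"
    external: str  # access from the host side: "RO" or "R/W"
    scope: str     # the "Comment" column of the table


# Permissions copied from the table above; addresses are TBD and omitted.
SPACES = {
    s.name: s
    for s in [
        MemorySpace(".reg", "R/W", "R/W", "Per Processing Core"),
        MemorySpace(".sreg", "RO", "RO", "All Cores"),  # CU may write internally
        MemorySpace(".code", "R/W", "R/W", "Per Processing Core"),
        MemorySpace(".code.kernel", "RO", "R/W", "Kernel Library Code"),
        MemorySpace(".data", "R/W", "R/W", "Per Processing Core"),
        MemorySpace(".sdata", "R/W", "R/W", "Global"),
    ]
}


def can_access(space: str, side: str, op: str, is_cu: bool = False) -> bool:
    """Return True if `op` ("read" or "write") is allowed on `space`
    from `side` ("internal" or "external")."""
    if side not in ("internal", "external"):
        raise ValueError(f"unknown side: {side}")
    if op not in ("read", "write"):
        raise ValueError(f"unknown op: {op}")
    entry = SPACES[space]
    if op == "read":
        return True  # every space in the table is at least readable
    # Assumption: "R/W only by CU" applies to internal writes to .sreg.
    if space == ".sreg" and side == "internal" and is_cu:
        return True
    mode = entry.internal if side == "internal" else entry.external
    return mode == "R/W"
```

For example, `can_access(".sreg", "internal", "write")` is rejected for an ordinary core, while the same call with `is_cu=True` is allowed.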
Registers
| Name | Address | Bitwidth | Comment |
|---|---|---|---|
| Registers | ==== | ==== | ==== |
| .reg.ax | TBD | 32 | Vector register AX per Processing Core |
| .reg.bx | TBD | 32 | Vector register BX per Processing Core |
| .reg.cx | TBD | 32 | Vector register CX per Processing Core |
| .reg.dx | TBD | 32 | Vector register DX per Processing Core |
| Shared Registers | ==== | ==== | ==== |
| .sreg.fl | TBD | 32 | Flag register for the CU |
| .sreg.pc | TBD | 32 | Program pointer |
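
To make the register table concrete, here is a minimal software model of the two register groups. It is a sketch, not the hardware design: the class names `CoreRegisters` and `SharedRegisters` and the `from_cu` flag are hypothetical, each register is treated as a plain 32-bit value (the table gives a bit width of 32), and the CU-only write rule for shared registers is taken from the memory address layout table.

```python
MASK32 = 0xFFFFFFFF  # registers are 32 bits wide per the table


class CoreRegisters:
    """Per-core registers .reg.ax, .reg.bx, .reg.cx, .reg.dx."""

    NAMES = ("ax", "bx", "cx", "dx")

    def __init__(self):
        self._regs = {name: 0 for name in self.NAMES}

    def read(self, name):
        return self._regs[name]

    def write(self, name, value):
        if name not in self._regs:
            raise KeyError(f"unknown register .reg.{name}")
        # Truncate to 32 bits, as a fixed-width hardware register would.
        self._regs[name] = value & MASK32


class SharedRegisters:
    """Shared registers .sreg.fl and .sreg.pc: readable by all cores,
    writable only by the CU (assumption taken from the layout table)."""

    NAMES = ("fl", "pc")

    def __init__(self):
        self._regs = {name: 0 for name in self.NAMES}

    def read(self, name):
        return self._regs[name]

    def write(self, name, value, from_cu=False):
        if name not in self._regs:
            raise KeyError(f"unknown register .sreg.{name}")
        if not from_cu:
            raise PermissionError(f".sreg.{name} is read-only outside the CU")
        self._regs[name] = value & MASK32
```

A write such as `CoreRegisters().write("ax", 0x100000001)` stores `1`, since the upper bits do not fit in 32 bits.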
Memory Barriers on SW & HW
Here are some good references on this topic:

1. Breaking Down Barriers: An Intro to GPU Synchronization
2. Also helpful for parallel computing: An Intro to GPU Architecture and Programming Models
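
As a software-only analogy for what a barrier provides, the sketch below uses Python's `threading.Barrier`: each "core" writes its own slot of a shared buffer, waits at the barrier, and only then reads the other slots. It is not the NPU's hardware mechanism; `NUM_CORES`, `scratch`, `core`, and `run` are hypothetical names, and the list stands in for a shared scratchpad region.

```python
import threading

NUM_CORES = 4
scratch = [0] * NUM_CORES  # stand-in for a shared scratchpad region
totals = [0] * NUM_CORES   # what each core observes after the barrier
barrier = threading.Barrier(NUM_CORES)


def core(idx):
    # Phase 1: each core writes only its own slot.
    scratch[idx] = idx + 1
    # No core continues until every core has finished phase 1.
    barrier.wait()
    # Phase 2: every core can now safely read all slots.
    totals[idx] = sum(scratch)


def run():
    threads = [threading.Thread(target=core, args=(i,)) for i in range(NUM_CORES)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return totals
```

Without the `barrier.wait()` call, a core could sum the buffer before the other cores have written their slots, which is the kind of hazard the references above discuss.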