## Presenting for OSCAR 2023: Pipelining an Open-Source Last-Level Cache

Kevin Jiang, Joseph Zuckerman, and Luca P. Carloni







#### **Motivation**

- SoCs are increasingly heterogeneous and complex
- On-chip shared memory can reduce memory access time and simplify programming in SoCs
- SoCs utilize cache hierarchies to enforce memory coherency of on-chip shared memory across the entire system

The performance of the cache hierarchy is crucial to reducing memory access time in an SoC

| I/O                              | Core                       | Core                       |
|----------------------------------|----------------------------|----------------------------|
| Natural<br>language<br>processor | Core                       | Core                       |
| Matrix op.<br>accelerator        | Graph<br>accelerator       | Comp Vision<br>accelerator |
| Radio<br>accelerator             | Signal Proc<br>accelerator | I/O                        |

## ESP: An Open-Source Platform for SoC design

www.esp.cs.columbia.edu

- ESP combines a flexible architecture with automated IP integration and a large variety of accelerator design flows to provide a platform for rapid SoC design and prototyping
- ESP also provides a cache hierarchy for implementing on-chip shared memory



#### **ESP** Architecture

- RISC-V Processors
- Many-Accelerator
- Distributed Memory
- Multi-Plane NoC

The ESP architecture implements a distributed system, which is scalable, modular and heterogeneous, giving processors and accelerators similar weight in the SoC



## ESP Methodology

**Accelerator Flow** 

- Simplified design
- Automated integration

**SoC Flow**

- Mix & match floorplanning GUI
- Rapid FPGA prototyping




[Figure: ESP SoC Generator GUI. A general SoC configuration panel (virtexup, ETH FPnew, no JTAG, Ethernet at 192.168.1.2, SGMII, no SVGA, with synchronizers) sits beside a 2x2 NoC tile configuration with mem, cpu, empty, and I/O tiles. Summary: 1 CPU, 1 memory controller, 1 I/O tile, 0 accelerators, 1 clock region, 0 CLKBUF.]

#### ESP cache hierarchy

- Consists of private L2 caches and a LLC (Last Level Cache)
- Extended MESI directory-based cache coherence protocol
- LLC maintains coherence between L2 caches
- Processor cores have L2 caches provided by ESP
- Accelerators typically perform DMA to DRAM, but can also interface with LLC and optionally have L2 cache
- In ESP, accelerators can operate under 4 coherence modes:
  - Non-coherent DMA: No L2 cache, DMA to DRAM only
  - LLC-coherent DMA: No L2 cache, DMA to LLC with coherence enforced by software flush
  - Coherent DMA: No L2 cache, DMA to LLC with coherence enforced by protocol
  - Fully Coherent: Accelerator has L2 cache just like processor cores
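
The four modes can be summarized as a small lookup structure. This is a minimal Python sketch for illustration only: the names `ModeTraits`, `CoherenceMode`, and `modes_with_llc_dma` are hypothetical and are not part of ESP, and fields the slide does not state are left as `None`.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


@dataclass(frozen=True)
class ModeTraits:
    has_l2: bool                # accelerator tile has a private L2 cache
    dma_target: Optional[str]   # "DRAM" or "LLC"; None when the slide names no DMA target
    enforced_by: Optional[str]  # how coherence is enforced; None when the slide does not say


class CoherenceMode(Enum):
    # Values transcribe the bullet list above; nothing else is assumed.
    NON_COHERENT_DMA = ModeTraits(False, "DRAM", None)
    LLC_COHERENT_DMA = ModeTraits(False, "LLC", "software flush")
    COHERENT_DMA = ModeTraits(False, "LLC", "protocol")
    FULLY_COHERENT = ModeTraits(True, None, None)


def modes_with_llc_dma():
    """Return the modes whose DMA traffic is served by the LLC."""
    return [m for m in CoherenceMode if m.value.dma_target == "LLC"]


if __name__ == "__main__":
    for mode in modes_with_llc_dma():
        print(mode.name, "->", mode.value.enforced_by)
```

The two modes returned by `modes_with_llc_dma()` are the ones used later in the performance assessment, since they are the modes in which accelerator DMA goes through the LLC.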

#### ESP cache coherence protocol

- Extended MESI directory-based cache coherence protocol
- Cache line states: Modified, Exclusive, Shared, Invalid, (Valid)

|     | REQUESTS |     |     |     |     | DMA REQUESTS |     | RESPONSES |     |
|-----|----------|-----|-----|-----|-----|--------------|-----|-----------|-----|
| State | GetS | GetM | PutS | PutM | Evict | Read | Write | Inv-Ack | Data |
| I | read mem,<br>Excl. Data to req,<br>owner = req / E | read mem,<br>Data to req,<br>owner = req / M | Put-Ack to req | Put-Ack to req | | read mem,<br>Data to req<br>/ V | [read mem],<br>write LLC<br>/ V | | |
| V | Excl. Data to req,<br>owner = req / E | Data to req,<br>owner = req / M | Put-Ack to req | Put-Ack to req | [write mem]<br>/ I | Data to req | write LLC | | |
| S | Data to req,<br>sharers += req | Data to req,<br>Inval. to sharers,<br>owner = req,<br>clear sharers / M | Put-Ack to req,<br>sharers -= req<br>/ V (if last sharer) | Put-Ack to req,<br>sharers -= req<br>/ V (if last sharer) | [write mem],<br>Inval. to sharers,<br>clear sharers / I | | | | |
| E | Fwd-GetS to owner,<br>sharers += req + owner,<br>clear owner / S<sup>D</sup> | Fwd-GetM to owner,<br>owner = req / M | Put-Ack to req,<br>if req is owner:<br>- clear owner / V | write LLC,<br>Put-Ack to req,<br>if req is owner:<br>- clear owner / V | Fwd-GetM to owner,<br>clear owner / EI<sup>D</sup> | | | | |
| M | Fwd-GetS to owner,<br>sharers += req + owner,<br>clear owner / S<sup>D</sup> | Fwd-GetM to owner,<br>owner = req | Put-Ack to req | write LLC,<br>Put-Ack to req,<br>if req is owner:<br>- clear owner / V | Fwd-GetM to owner,<br>clear owner / MI<sup>D</sup> | | | | |
| S<sup>D</sup> | stall | stall | Put-Ack to req,<br>sharers -= req | Put-Ack to req,<br>sharers -= req | stall | | | | write LLC<br>/ V (if no sharers)<br>/ S (otherwise) |
| EI<sup>D</sup> | stall | stall | Put-Ack to req,<br>sharers -= req | Put-Ack to req,<br>sharers -= req | | | | [write mem]<br>/ I | write mem<br>/ I |
| MI<sup>D</sup> | stall | stall | Put-Ack to req,<br>sharers -= req | Put-Ack to req,<br>sharers -= req | | | | | write mem<br>/ I |

Table I: Directory controller's extended MESI protocol.
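
To make the table easier to read, the GetS and GetM columns can be written out as a tiny behavioral model. This is a hedged Python sketch, not the ESP RTL: `DirEntry` and `handle_request` are hypothetical names, only two of the nine columns are modeled, and the table does not say whether a requester that is already a sharer is skipped when invalidations are sent, so this sketch sends one to every recorded sharer.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DirEntry:
    state: str = "I"                 # I, V, S, E, M, or transient SD / EID / MID
    owner: Optional[str] = None
    sharers: set = field(default_factory=set)


def handle_request(entry, req_type, req):
    """Apply one GetS or GetM from requester `req`; return the actions taken."""
    s = entry.state
    if s in ("SD", "EID", "MID"):    # transient states stall both request types
        return ["stall"]
    if req_type not in ("GetS", "GetM"):
        raise ValueError("only GetS and GetM are modeled in this sketch")

    actions = []
    if s in ("I", "V"):
        if s == "I":
            actions.append("read mem")           # line is not yet in the LLC
        kind = "Excl. Data" if req_type == "GetS" else "Data"
        actions.append(f"{kind} to {req}")
        entry.owner = req
        entry.state = "E" if req_type == "GetS" else "M"
    elif s == "S":
        actions.append(f"Data to {req}")
        if req_type == "GetS":
            entry.sharers.add(req)               # state stays S
        else:
            # Assumption: every recorded sharer is invalidated.
            actions += [f"Inval. to {x}" for x in sorted(entry.sharers)]
            entry.sharers.clear()
            entry.owner, entry.state = req, "M"
    else:                                        # E or M: forward to the owner
        if req_type == "GetS":
            actions.append(f"Fwd-GetS to {entry.owner}")
            entry.sharers |= {req, entry.owner}
            entry.owner, entry.state = None, "SD"
        else:
            actions.append(f"Fwd-GetM to {entry.owner}")
            entry.owner, entry.state = req, "M"
    return actions


if __name__ == "__main__":
    line = DirEntry()
    print(handle_request(line, "GetS", "L2_0"), line.state)
    print(handle_request(line, "GetS", "L2_1"), line.state)
```

Running the example walks one line from I to E on the first GetS, then to the transient S<sup>D</sup> state on a second GetS from another cache, where further GetS and GetM requests stall until the owner's data arrives.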

#### ESP cache hierarchy example



- 4x4 tile ESP system
- Processor tiles have off-the-shelf L1 cache
- ESP provides L2 caches for Processor tiles and optionally Accelerator tiles
- Memory tiles contain LLC
- Cache hierarchy is connected via a multi-plane NoC (Network-on-Chip)

## Improving LLC throughput

- LLC is the main synchronization point for the SoC
  - All L2 caches must interface with LLC
- Throughput of the LLC can limit performance of SoC when density of requests to the LLC is high
- The current LLC implementation utilizes a multi-cycle data path, only handling one request per multi-cycle iteration

We implement a pipelined data path to increase the throughput of the LLC

#### LLC Microarchitecture without pipelining

6-stage multi-cycle datapath controlled by FSM unit
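
A back-of-the-envelope comparison shows why the multi-cycle datapath limits throughput. The sketch below is an idealized Python model under loudly stated assumptions: every stage takes exactly one cycle, requests arrive back to back, nothing stalls, and the pipelined design keeps the same six stages. None of these assumptions is confirmed by the slides, and the function names are hypothetical.

```python
def multicycle_cycles(n_requests, stages=6):
    """One request occupies the whole datapath for `stages` cycles."""
    return n_requests * stages


def pipelined_cycles(n_requests, stages=6):
    """Ideal pipeline: fill once, then retire one request per cycle."""
    if n_requests == 0:
        return 0
    return stages + n_requests - 1


if __name__ == "__main__":
    for n in (1, 4, 1000):
        print(n, multicycle_cycles(n), pipelined_cycles(n))
```

Under these assumptions 1000 requests take 6000 cycles on the multi-cycle datapath and 1005 cycles in the ideal pipeline. Real gains are smaller because memory accesses, hazards, and protocol stalls interrupt the flow.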



Distribute control logic across all stages

Implement valid-ready protocol pipeline registers
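
The valid-ready handshake between pipeline registers can be illustrated with a cycle-level behavioral model. This is a Python sketch for intuition only, not the SystemVerilog used in ESP: `PipeReg` and `step` are hypothetical names, and the sketch assumes `ready` propagates backward combinationally, so a register accepts new data when it is empty or when its downstream neighbor accepts in the same cycle.

```python
class PipeReg:
    """One pipeline register holding a valid bit and a payload."""

    def __init__(self):
        self.valid = False
        self.data = None


def step(regs, in_valid, in_data, out_ready):
    """Advance the chain by one clock cycle.

    Returns (in_ready, out_valid, out_data) as seen during this cycle.
    """
    n = len(regs)
    ready = [False] * n
    downstream_ready = out_ready
    # A stage is ready if it is empty or if the stage after it is ready.
    for i in reversed(range(n)):
        ready[i] = (not regs[i].valid) or downstream_ready
        downstream_ready = ready[i]

    out_valid, out_data = regs[-1].valid, regs[-1].data

    # Update back to front so each stage reads its upstream neighbor's old value.
    for i in reversed(range(n)):
        if ready[i]:
            if i == 0:
                regs[i].valid, regs[i].data = in_valid, in_data
            else:
                regs[i].valid, regs[i].data = regs[i - 1].valid, regs[i - 1].data
    return ready[0], out_valid, out_data


if __name__ == "__main__":
    regs = [PipeReg() for _ in range(3)]
    for item in [1, 2, 3, None, None, None]:
        print(step(regs, item is not None, item, True))
```

With the consumer always ready, items leave in order after a three-cycle latency. If the consumer deasserts `out_ready`, the registers fill up and `in_ready` drops, which is the backpressure behavior the handshake is meant to provide.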



#### Elimination of read-after-write hazards and read/write collisions



Prevent out-of-order completion of requests
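
The slides do not spell out the hazard mechanism, so the following Python sketch shows one common approach purely as an assumption: an interlock that holds a request at the pipeline entry while an older in-flight request targets the same cache set. The function `issue_cycles`, the fixed six-cycle latency, and the use of the set index as the conflict key are all hypothetical choices for illustration, not the ESP design.

```python
from collections import deque


def issue_cycles(set_indices, depth=6):
    """Return the cycle at which each request enters the pipeline.

    `set_indices` lists the cache set touched by each request, in arrival
    order. Requests issue in order, at most one per cycle, and a request
    waits while any in-flight request uses the same set.
    """
    in_flight = deque()          # (set_index, retire_cycle), oldest first
    cycle = 0
    issued = []
    for set_idx in set_indices:
        while True:
            # Drop requests that have left the pipeline by this cycle.
            while in_flight and in_flight[0][1] <= cycle:
                in_flight.popleft()
            if all(s != set_idx for s, _ in in_flight):
                break
            cycle += 1           # stall: an older request owns this set
        issued.append(cycle)
        in_flight.append((set_idx, cycle + depth))
        cycle += 1
    return issued


if __name__ == "__main__":
    print(issue_cycles([3, 5, 3]))
    print(issue_cycles([1, 2, 3]))
```

Because requests issue in order and every request has the same latency in this model, they also complete in order. A younger request to a conflicting set never reads state that an older request has yet to write.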



#### Increase pipeline utilization of DMA requests



[Figure: pipeline timing diagram. Cycle 6, original LLC: cache line 1.]

#### LLC Pipelined Microarchitecture



## Verification: RTL simulation for LLC

#### SystemVerilog testbenches for RTL simulation



#### Verification: RTL simulation for full ESP system

Single-core and multi-core simulation of basic "Hello World" program

[Figure: NoC tile configuration for the multi-core simulation, 2x2 layout: (0,0) mem, (0,1) cpu, (1,0) cpu, (1,1) I/O.]

## Verification: FPGA testing

Implementing SoC on FPGA and running applications

- Small applications:
  - Single-core and multi-core "Hello World"
  - Multi-core shared memory and lock program
- Booting Linux with Ethernet enabled on Single-core SoC
  - Ethernet uses coherent DMA to the LLC



Picture shows 2-core SoC. Red: CPU, Green: Memory Tile with LLC, Yellow: I/O Tile

#### Performance Assessment: Method

Monitor the memory access time of 3 accelerator workloads and compare the times with and without LLC pipelining:

- FFT (Fast Fourier Transform)
- Matrix Multiplication (GEMM)
- 2D Convolution (CONV2D)

ESP Performance Monitors API allows monitoring of memory access time only

#### Performance Assessment: FPGA Implementation

Implement two SoCs, one SoC with pipelined LLC, one SoC with original LLC

Workload variables:

- 5 different sizes from XS to XL
- 2 Coherence Modes
  - LLC-Coherent DMA: Accelerators access LLC after software flush of L2 caches for coherence
  - Coherent DMA: Accelerators access LLC with coherence enforced by hardware coherence protocol

| NoC tile configuration | Column 0     | Column 1                            |
|------------------------|--------------|-------------------------------------|
| Row 0                  | mem          | cpu                                 |
| Row 1                  | FFT_STRATUS  | I/O                                 |
| Row 2                  | GEMM_STRATUS | CONV2D_STRATUS (impl.: basic_dma64) |

#### Performance Assessment: Speedup Results

- Speedup of memory access times on SoC with pipelined LLC compared to SoC with original LLC
- Speedup reaches about 50% for some workloads and typically ranges from 10% to 25%



[Figure: speedup by workload and coherence mode.]

#### Conclusion

- We implemented pipelining in the ESP Last Level Cache, resolved pipelining hazards, optimized DMA performance, and enabled concurrent processing of LLC requests.
- We achieved a significant speedup in accelerator DMA memory access times, up to ~50% for some workloads and consistently in the range of 10% to 25%, while maintaining the modularity and scalability of ESP.

Verification is still in progress (booting Linux on multi-core configurations), but we plan to release a version of ESP with the improved cache hierarchy later this year!

# Thank you for listening!

www.esp.cs.columbia.edu

