

# **RAMP:** Resource-Aware Mapping for CGRAs

Shail Dave, Mahesh Balasubramanian, Aviral Shrivastava

Compiler Microarchitecture Lab, Arizona State University



### **Coarse Grained Reconfigurable Arrays (CGRAs)**

- Array of Processing Elements (PEs); each PE has an ALU-like functional unit that works on an operation every cycle.
- Power efficiency of several tens of GOps/sec per watt!
  - ADRES [HiPEAC '08] $\rightarrow$ HyCUBE [DAC '17]



### **Performance Impact of Ad-Hoc Routing Strategies**



#### Architectural Configuration

### Mapping Loops on CGRAs



- Performance Critically Depends on the Obtained Mapping
- Mapping Problem = Routing Problem
  - Routing is needed when dependent operations are scheduled far apart in time, or when operations cannot be mapped due to resource constraints.
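As a minimal sketch of the first condition above, the snippet below flags data dependencies whose consumer is scheduled more than one cycle after its producer. The edge list, the schedule, and the one-cycle direct-consumption rule are illustrative assumptions, not taken from the poster.

```python
def dependencies_needing_routing(edges, schedule):
    """Return {(src, dst): gap} for edges that need routing.

    Assumption (hypothetical model): a value produced at cycle t can be
    consumed directly only at cycle t + 1; any extra cycles in between
    ("gap") must be covered by some routing resource.
    """
    needing = {}
    for src, dst in edges:
        gap = schedule[dst] - schedule[src] - 1
        if gap > 0:
            needing[(src, dst)] = gap
    return needing


# Hypothetical data dependency graph and schedule (cycle per operation).
edges = [("a", "b"), ("a", "e"), ("b", "d")]
schedule = {"a": 0, "b": 1, "d": 2, "e": 3}

print(dependencies_needing_routing(edges, schedule))  # {('a', 'e'): 2}
```

Here only `a -> e` needs routing, because `e` executes three cycles after `a`, which leaves two cycles during which the value must be held somewhere.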

## **Challenges with Code Generation Heuristics Employing Ad-hoc Routing Strategies**

### **RAMP: Resource-Aware Mapping Technique**



- Partition Mapping Problem into **3 Sub-Problems**
- Systematically and Flexibly Explore Resources to Achieve Mapping, Adapting to the Application Needs
  - E.g., we can choose to first map the DDG with routing via registers. Then, for any unmapped data dependency, explore different routing options, per failure analysis.
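The example above can be sketched as a loop that tries one routing strategy first and re-attempts only the dependencies that remain unmapped. This is a simplified illustration, not RAMP's actual algorithm: the fixed strategy order stands in for the failure analysis, and `map_with_fallbacks`, the strategy names, and the toy feasibility rules are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Edge = Tuple[str, str]
Strategy = Tuple[str, Callable[[Edge], bool]]


def map_with_fallbacks(edges: List[Edge], strategies: List[Strategy]):
    """Try each strategy, in order, on the edges that are still unmapped.

    Returns (routed, unmapped): routed maps each edge to the name of the
    strategy that handled it; unmapped lists edges no strategy could route.
    """
    routed: Dict[Edge, str] = {}
    unmapped = list(edges)
    for name, try_route in strategies:
        still_unmapped = []
        for edge in unmapped:
            if try_route(edge):
                routed[edge] = name
            else:
                still_unmapped.append(edge)
        unmapped = still_unmapped
        if not unmapped:
            break
    return routed, unmapped


# Toy feasibility rules (assumptions): the number of cycles a value must be
# held decides which resource can route it.
gaps = {("a", "b"): 0, ("a", "e"): 2, ("e", "a"): 5}
strategies = [
    ("registers", lambda e: gaps[e] <= 1),
    ("PEs", lambda e: gaps[e] <= 3),
    ("memory", lambda e: True),
]

routed, unmapped = map_with_fallbacks(list(gaps), strategies)
print(routed)
print(unmapped)
```

With these toy rules, `a -> b` is routed via registers, `a -> e` via PEs, and `e -> a` via memory, leaving nothing unmapped.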







- **Routing Data Dependency via PEs** 
  - EMS [H. Park et al., PACT '08]
  - **EPIMap** [M. Hamzeh et al., DAC '12]

Since $a_r$ is not rescheduled, $a \rightarrow e$ cannot be routed.


### **Routing via Registers**

- **REGIMap** [M. Hamzeh et al., DAC '13]
- **GraphMinor** [L. Chen et al., TRETS '14]
- **Cannot efficiently utilize distributed RFs** to route $e^{i-2} \rightarrow a^i$

#### **Routing via Memory**

- **MEMMap** [S. Yin et al., TVLSI '16]
- **Statically determines the dependencies routed via memory** ≠ spilling data to memory when required (unavailability of registers)
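To make the register pressure of a loop-carried dependency such as $e^{i-2} \rightarrow a^i$ concrete, the sketch below estimates how many registers one such value occupies. It assumes the usual modulo-scheduling accounting, where a value that stays live for `lifetime` cycles needs `ceil(lifetime / II)` registers. The initiation interval `ii`, the schedule times, and the function name are illustrative assumptions, not figures from the poster.

```python
def registers_for_dependency(t_producer, t_consumer, distance, ii):
    """Registers needed to hold one loop-carried value in a modulo schedule.

    t_producer, t_consumer: schedule cycles within one iteration.
    distance: iteration distance of the dependency (2 for e^{i-2} -> a^i).
    ii: initiation interval, i.e. cycles between successive iterations.
    """
    # The consumer runs `distance` iterations later, i.e. distance * ii cycles.
    lifetime = t_consumer + distance * ii - t_producer
    if lifetime <= 0:
        raise ValueError("consumer must execute after the producer")
    # A new instance is produced every ii cycles, so ceil(lifetime / ii)
    # instances are live at the same time.
    return (lifetime + ii - 1) // ii


# Hypothetical numbers: e at cycle 3, a at cycle 0, II = 4, distance 2.
print(registers_for_dependency(t_producer=3, t_consumer=0, distance=2, ii=4))  # 2
```

In this example the value is live for 5 cycles with a new instance produced every 4 cycles, so two instances overlap and two registers are needed.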



### **Experimental Setup**

- 8 MiBench benchmarks (top performance-critical loops)
- RAMP modeled using **CCF Compilation & Simulation Framework**
  - Available at: https://github.com/cmlasu/ccf (LLVM 4.0 and gem5 as foundation)
  - CGRA modeled as a separate core coupled with ARM Cortex-like core
- Evaluation over **12 architectural** configurations
  - PEs connected in a 2D torus, perform fixed-point computations
- CGRA accesses 4 kB data memory and 4 kB instruction memory
- Configurations vary in terms of array size, PE functionality, registers etc.
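For readers unfamiliar with the 2D torus interconnect mentioned above, the following sketch lists the PEs directly reachable from a given PE. It assumes a four-neighbor torus with wrap-around links in both dimensions; the 4x4 array size and the function name are hypothetical and do not describe a specific configuration from the evaluation.

```python
def torus_neighbors(row, col, rows, cols):
    """Return the (row, col) coordinates of the four torus neighbors of a PE.

    Assumption: each PE links to its north, south, west, and east neighbor,
    and links wrap around at the array edges.
    """
    return [
        ((row - 1) % rows, col),  # north (wraps to the last row)
        ((row + 1) % rows, col),  # south
        (row, (col - 1) % cols),  # west (wraps to the last column)
        (row, (col + 1) % cols),  # east
    ]


# Corner PE of a hypothetical 4x4 array: wrap-around gives it four neighbors.
print(torus_neighbors(0, 0, rows=4, cols=4))  # [(3, 0), (1, 0), (0, 3), (0, 1)]
```

Because of the wrap-around links, corner and edge PEs have the same number of neighbors as interior PEs.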

*[Figure: axis "Architecture Configuration" (ticks 4, 8, 12), annotated "Increase in Resources"]*



### Acknowledgements

This work was partially supported by funding from NSF grants CNS 1525855 and CCF 172346 - NSF/Intel joint research center for Computer Assisted Programming for Heterogeneous Architectures (CAPA).