

# dMazeRunner: Executing Perfectly Nested Loops on Programmable Dataflow Accelerators



Shail Dave<sup>1</sup>, Youngbin Kim<sup>2</sup>, Sasikanth Avancha<sup>3</sup>, Kyoungwoo Lee<sup>2</sup>, Aviral Shrivastava<sup>1</sup> 1. Arizona State University 2. Yonsei University 3. Parallel Computing Lab, Intel Labs

<u>╶</u>┣·{<sub>┙</sub><mark>┣</mark>·{<sub>┙</sub><mark>┣</mark>·{<sub>┙</sub><mark>┣</mark>·{<sub>┙</sub></sub>┣·{<sub>┙</sub></sub>

<u>└</u>┣-{<u></u>┣-{<u></u>┣-{</u><u>┣</u>-{<u></u>┣-{</u>

╧╔╃╼┇╧╔╋╼╏╧╔╋╼╏╧╔╋╼╏╧

<u>└</u>┇╋╶┥<u>╪</u><u></u>┓╋╴┥<u>╪</u><u></u>┓╋╴┥<u>╪</u><u></u>┓╋╴┥</u>

│<sup>┷</sup><mark>│</mark><mark><mark>┝</mark>╶∮<sup>┷</sup>│<mark>┝</mark>╶∮<sup>┷</sup>│<mark>┝</mark>╶∮<sup>┷</sup>│<mark>┝</mark>╶∮<sup>┷</sup>│</mark>

Scratch-Pad Memory

DRAM (Off-Chip)

╔╶┇┇┏╶┇┇┏╺┇┇┏╶┇┇



**SpatioTemporal** 

Execution

of Loops

### **Weight Stationary Dataflow Output Stationary Dataflow** #pragma unroll space #pragma unroll space for fy S=1:3 for oy S=1:3 for fx S=1:3for ox S=1:3 0[oy][ox]+= O[OY][OX] +=W[fy][fx]× W[fy][fx]× I[oy+fy-1][ox+fx-1]; I[oy+fy-1][ox+fx-1];

|   | Fx_Spatial=3 |       |    |  |  |
|---|--------------|-------|----|--|--|
|   | Fy_Spatial=3 | P1 P2 | P3 |  |  |
| 9 |              | P4 P5 | P6 |  |  |
|   |              | P7 P8 | P9 |  |  |
|   |              |       |    |  |  |

## **Analytical Model of Dataflow Execution**

- (1,3)

**→(3,3)** 

Ox\_Spatial=3

(1,2)

(2,1) (2,2) (2,3)

**→(3,2)** 

O(1,1)

(3,1)

Spatial=3

- **Description of** Architectural **App Loop-Nest Specification** Dataflow **Execution** Model Execution Energy Cycles (pJ)
- arbitrary perfectly nested loops.

Each PE

computes 1

output from 9

All PEs compute partial sum for

same 1 output

- - Eyeriss chip: ~11% difference in

# **Programmable Dataflow Accelerators**

- Massive array of Processing Elements (PEs); each PE has ALU-like functional unit to perform operation every cycle (simple, programmable).
- PE's Private + shared memory sustain data reuse.
- Efficiently accelerate ML and media kernels.
- Architecture Variations
- > Systolic arrays: TPU (Google), TensorCore (nVIDIA)
- $\succ$  Spatially programmable architecture: Eyeriss (MIT), SCNN (nVIDIA), AI core (IBM), CSA (Intel)
- Coarse-grained reconfig array: HyCUBE(NUS), DPU(Wave)

# **Current Focus in System Stack**



| nique Reuse Factors for Data Operands |   |        |   |  |  |  |
|---------------------------------------|---|--------|---|--|--|--|
| Loop Order                            | I | W      | 0 |  |  |  |
| [ n]                                  |   | NI = 2 |   |  |  |  |

# **5** schedules

W[m][c][fx][fy]; . . . . . . . .

### I[][][] \* W[][][];

O[][][] +=

for fy L3 = 1:Fy SPATIAL

### **}**} Vast "Execution Method" Space

I[n][c][ox+fx-1][oy+fy-1]\*

### Problem of exploring "execution methods" becomes problem of exploring all the possibilities of tiling and ordering in 28-dimensional loop.

Acknowledgements This work was partially supported by funding from NSF grant CCF 1723476 – NSF/Intel joint research center for Computer Assisted **Programming for Heterogeneous Architectures (CAPA) and from the** grants NRF-2015M3C4A7065522 and 2014-3-00035 funded by MSIT.

resnet18 v1 -auto-optimize