

#### VLSI Signal Processing

#### Lecture 5 Systolic Array Architecture

VSP Lecture5 - Systolic Array (cwliu@twins.ee.nctu.edu.tw)




### Techniques for VLSI Systems

- Algorithm Strength Reduction
  - Fast algorithm
    - Using a polyphase filter bank to realize a long (many-tap) linear-phase filter
    - Using FFT instead of DFT
    - Fast convolution algorithm
  - Tradeoff between performance and complexity
    - Fast traceback Viterbi algorithm (for convolutional codes and Turbo codes)
    - Detecting errors first, then running the Berlekamp-Massey (BM) algorithm, for Reed-Solomon codes
    - Using CORDIC machine instead of complex multiplier
- Memory management
  - Memory bank, Register File
  - Local buffer (cache, or FIFO...)
  - Replace a multi-port memory with multiple single-port memory banks
- Power Management
  - Resource allocation
  - Use a finite-state machine (FSM) for timing and flow control, driving enable/disable signals so that the same PE can be time-shared
  - Clock gating, Data gating
  - Power-aware, Energy-aware design
- Low-power circuit design technology









#### Convolutional Code





### Mapping Algorithms onto Array Structures



- Localized operations, intensive computations, and matrix operations are features of many DSP algorithms.
- Derive a maximal concurrency by using both pipelining and parallel processing
  - How much concurrency is inherent in the algorithm?
  - How is the array processor design dependent on the algorithm?
  - How is the algorithm best implemented in the array processor?
- Dependence graph (DG)
  - By tracing the associated space-time index space and using proper arcs to display the dependencies
  - It exhibits the full dependencies incurred in the execution of a specific algorithm
- Interconnection network
- Systolic Array
  - Modularity, regularity, local interconnection







# History and Motivation

- Introduced by H. T. Kung and C. E. Leiserson, 1978
- Designs for matrix computations
- Illustrated by snapshots of operation
- Motivations
  - Improve performance of special-purpose systems (e.g. maximize processing per memory access)
  - Reduce design and implementation costs





### What is a Systolic Architecture

- A network of processing elements (PEs) that computes and rhythmically passes data through it
- Multiple PEs to maximize processing per memory access





Example
















- A new class of pipelined array architectures
- Benefits
  - Simple and regular design (cost-effective)
  - Concurrency and communication
  - Modular and expandable
- Drawbacks
  - Not all algorithms can be implemented using a systolic architecture
  - Cost in hardware and area
  - Cost in latency





# Systolic Fundamentals

- Systolic architectures are designed by applying linear mapping techniques to a regular dependence graph (DG)
- Regular dependence graph: if an edge in a certain direction is present at any node in the DG, an edge in the same direction is present at all nodes in the DG
- The DG is a space representation: no time instance is assigned to any computation
- Systolic architectures have a space-time representation where each node is mapped to a certain processing element (PE) and is scheduled at a particular time instance.
- Systolic design methodology maps an N-dimensional DG to a lower dimensional systolic architecture
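The regularity property can be checked mechanically. The following is a minimal sketch, assuming a small rectangular index space (4 by 3, a hypothetical size) and the three FIR edge directions (1, 0), (0, 1), and (1, -1) that appear in the edge-mapping example of this lecture. The helper names `build_dg` and `is_regular` are illustrative only.

```python
from itertools import product

def build_dg(n_rows, n_cols, edge_dirs):
    """Return (nodes, edges) of a rectangular DG with the given edge directions."""
    nodes = set(product(range(n_rows), range(n_cols)))
    edges = set()
    for u in nodes:
        for e in edge_dirs:
            v = (u[0] + e[0], u[1] + e[1])
            if v in nodes:              # drop edges that leave the index space
                edges.add((u, v))
    return nodes, edges

def is_regular(nodes, edges):
    """True if every direction that occurs somewhere occurs at every node
    whose target node still lies inside the index space."""
    dirs = {(v[0] - u[0], v[1] - u[1]) for u, v in edges}
    for u in nodes:
        for e in dirs:
            v = (u[0] + e[0], u[1] + e[1])
            if v in nodes and (u, v) not in edges:
                return False
    return True

fir_dirs = [(1, 0), (0, 1), (1, -1)]    # weight, input, and result edges
nodes, edges = build_dg(4, 3, fir_dirs)
regular = is_regular(nodes, edges)

# Removing a single interior edge breaks regularity.
damaged = set(edges)
damaged.discard(((1, 1), (2, 1)))
damaged_is_regular = is_regular(nodes, damaged)
```

Boundary nodes are treated leniently here: an edge that would leave the index space is simply not required.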







### **Regular Dependence Graph**

Space representation for FIR filter








# Definitions

- Projection vector  $\mathbf{d} = \begin{pmatrix} d_1 \\ d_2 \end{pmatrix}$  (also called iteration vector)
  - Two nodes that are displaced by d or multiples of d are executed by the same processor
- Scheduling vector  $\mathbf{s}^T = (s_1, s_2)$ 
  - Any node with index  ${\bf I}$  would be executed at time  ${\bf s}^{\sf T}{\bf I}$
- Processor space vector  $\mathbf{p}^T = (p_1, p_2)$ 
  - Any node with index  $\mathbf{I}^{\mathsf{T}}=(i,j)$  would be executed by processor

$$\mathbf{p}^T \mathbf{I} = (p_1, p_2) \begin{bmatrix} i \\ j \end{bmatrix}$$
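As a small numeric illustration of these definitions, the sketch below maps one DG node and its neighbor along the projection vector to a processor and a time step. The vectors `d`, `p`, and `s` are assumed values chosen for illustration, not ones prescribed by the lecture.

```python
def dot(u, v):
    """Inner product of two equal-length tuples."""
    return sum(a * b for a, b in zip(u, v))

# Assumed illustrative vectors.
d = (1, 0)   # projection vector
p = (0, 1)   # processor space vector, chosen so that p.d = 0
s = (1, 0)   # scheduling vector, chosen so that s.d != 0

node = (2, 3)                                   # index I = (i, j)
shifted = (node[0] + d[0], node[1] + d[1])      # the node displaced by d

pe_node, pe_shifted = dot(p, node), dot(p, shifted)   # processor indices
t_node, t_shifted = dot(s, node), dot(s, shifted)     # execution times
```

Both nodes land on processor 3, at time steps 2 and 3 respectively, which is what the definition of the projection vector requires.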








# Systolic Design Methodology

- Many systolic architectures can be designed for a given algorithm by selecting different projection, processor space, and scheduling vectors.
- Feasibility constraints
  - If points $\mathbf{I}_A$ and $\mathbf{I}_B$ differ by the projection vector, $\mathbf{d} = \mathbf{I}_A - \mathbf{I}_B$, i.e. they lie along the direction of the projection vector, they must be executed by the same processor. That is, $\mathbf{p}^T \mathbf{I}_A = \mathbf{p}^T \mathbf{I}_B$, or $\mathbf{p}^T \mathbf{d} = 0$
  - If points $\mathbf{I}_A$ and $\mathbf{I}_B$ are mapped to the same processor, i.e. $\mathbf{I}_A - \mathbf{I}_B = \mathbf{d}$, they cannot be executed at the same time. That is, $\mathbf{s}^T \mathbf{I}_A \neq \mathbf{s}^T \mathbf{I}_B$, or $\mathbf{s}^T \mathbf{d} \neq 0$
  - If an edge $\mathbf{e}$ exists in the DG, then an edge $\mathbf{p}^T \mathbf{e}$ is introduced in the systolic array with $\mathbf{s}^T \mathbf{e}$ delays
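The feasibility constraints translate directly into two inner-product tests. The sketch below is one possible way to check a candidate design. The function names `is_feasible` and `map_edge` and the sample vectors are assumptions made for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def is_feasible(p, s, d):
    """p.d == 0: nodes along d share a processor.
    s.d != 0: those nodes never run in the same time step."""
    return dot(p, d) == 0 and dot(s, d) != 0

def map_edge(p, s, e):
    """A DG edge e becomes an array edge p.e carrying s.e delays."""
    return dot(p, e), dot(s, e)

d = (1, 0)                                               # assumed projection vector
ok = is_feasible(p=(0, 1), s=(1, 0), d=d)                # both constraints hold
wrong_processor = is_feasible(p=(1, 1), s=(1, 0), d=d)   # p.d = 1, violated
same_time = is_feasible(p=(0, 1), s=(0, 1), d=d)         # s.d = 0, violated
array_edge = map_edge(p=(0, 1), s=(1, 0), e=(1, -1))     # (direction, delays)
```

A negative first component in `array_edge` only means the data moves toward lower processor indices.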









# Array Architecture Design

- Step 1: mapping algorithm to DG
  - Based on the space-time indices in the recursive algorithm
  - Shift-Invariance (Homogeneity) of DG
  - Localization of DG: broadcast vs. transmitted data
- Step 2: mapping DG to SFG
  - Processor assignment: a projection method may be applied (projection vector **d**)
  - Scheduling: a permissible linear schedule may be applied (schedule vector s)
    - Preserve the inter-dependence
    - Nodes on an equitemporal hyperplane should not be projected to the same PE
- Step 3: mapping an SFG onto an array processor
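Step 2 can be sketched as a projection of every DG node onto a (PE, time) slot, with a check that no two nodes claim the same slot. The index-space size and the vectors below are hypothetical, and `project` is an illustrative helper rather than a standard routine.

```python
from itertools import product

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project(nodes, p, s):
    """Map each DG node to the slot (p.I, s.I); reject conflicting schedules."""
    slots = {}
    for node in nodes:
        slot = (dot(p, node), dot(s, node))
        if slot in slots:
            raise ValueError(f"nodes {slots[slot]} and {node} share slot {slot}")
        slots[slot] = node
    return slots

nodes = list(product(range(4), range(3)))     # i = 0..3, j = 0..2 (assumed size)
slots = project(nodes, p=(0, 1), s=(1, 0))    # projection along d = (1, 0)
num_pes = len({pe for pe, _ in slots})        # processors in the 1D array
last_step = max(t for _, t in slots)          # final time step of the schedule

# A schedule with s.d = 0 puts nodes of one PE on the same equitemporal hyperplane.
try:
    project(nodes, p=(0, 1), s=(0, 1))
    conflict_detected = False
except ValueError:
    conflict_detected = True
```

With these assumed values the 12-node DG collapses onto 3 PEs, each busy in time steps 0 to 3.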









### Space-Time Representation

- The space representation (the DG) can be transformed to a space-time representation by interpreting one of the spatial dimensions as a temporal dimension
- 2D DG:

$$\begin{bmatrix} i' \\ j' \\ t' \end{bmatrix} = T \begin{bmatrix} i \\ j \\ t \end{bmatrix} = \begin{bmatrix} \mathbf{0}^T & 1 \\ \mathbf{p}^T & 0 \\ \mathbf{s}^T & 0 \end{bmatrix} \begin{bmatrix} i \\ j \\ t \end{bmatrix}$$

where $\mathbf{I} = (i, j)^T$ and $\mathbf{0}^T = (0, 0)$, so that

$$i'=t, \quad j'=\mathbf{p}^T \mathbf{I}, \quad t'=\mathbf{s}^T \mathbf{I}$$

- $j'$ is the processor axis (the 2D DG is mapped to a 1D systolic array)
- $t'$ is the scheduling time instance
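As a worked instance of this transformation, assume $\mathbf{p}^T = (0, 1)$ and $\mathbf{s}^T = (1, 0)$ (illustrative values only) and take the DG node $(i, j) = (2, 3)$ with $t = 0$:

$$\begin{bmatrix} i' \\ j' \\ t' \end{bmatrix} = \begin{bmatrix} 0 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \end{bmatrix} \begin{bmatrix} 2 \\ 3 \\ 0 \end{bmatrix} = \begin{bmatrix} 0 \\ 3 \\ 2 \end{bmatrix}$$

Under these assumptions the node is executed by processor $j' = 3$ at time $t' = 2$.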








### Selection of Scheduling Vector

- Linear scheduling

$$S_x = \mathbf{s}^T \mathbf{I}_x = (s_1 \;\; s_2)\,(i_x, j_x)^T, \qquad S_y = \mathbf{s}^T \mathbf{I}_y = (s_1 \;\; s_2)\,(i_y, j_y)^T$$

- Affine scheduling (a linear transformation followed by a translation)

$$S_x = \mathbf{s}^T \mathbf{I}_x + \gamma_x = (s_1 \;\; s_2)\,(i_x, j_x)^T + \gamma_x, \qquad S_y = \mathbf{s}^T \mathbf{I}_y + \gamma_y = (s_1 \;\; s_2)\,(i_y, j_y)^T + \gamma_y$$

- For a dependence relation $X \rightarrow Y$, with $\mathbf{I}_x = (i_x, j_x)^T$ and $\mathbf{I}_y = (i_y, j_y)^T$, we have $S_y \ge S_x + T_x$, where $S_x$ and $S_y$ are the scheduling times of nodes X and Y, respectively, and $T_x$ is the computation time of node X.
- Each edge of a DG leads to an inequality for selection of scheduling vectors
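The inequality for a single dependence can be evaluated numerically. The sketch below tests a few candidate scheduling vectors against one edge. The node indices, the computation time of 1, and the helper names are assumptions chosen for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def schedule_time(s, index, gamma=0):
    """Affine schedule S = s.I + gamma (gamma = 0 gives a linear schedule)."""
    return dot(s, index) + gamma

def respects_dependence(s, I_x, I_y, T_x, gamma_x=0, gamma_y=0):
    """Check S_y >= S_x + T_x for the dependence X -> Y."""
    return schedule_time(s, I_y, gamma_y) >= schedule_time(s, I_x, gamma_x) + T_x

# Assumed dependence: node X at (0, 1) feeds node Y at (1, 0), i.e. edge (1, -1).
I_x, I_y, T_x = (0, 1), (1, 0), 1

valid = respects_dependence((1, 0), I_x, I_y, T_x)      # S_y = 1, S_x + T_x = 1
too_early = respects_dependence((0, 1), I_x, I_y, T_x)  # S_y = 0, S_x + T_x = 2
tie = respects_dependence((1, 1), I_x, I_y, T_x)        # S_y = 1, S_x + T_x = 2
```

Collecting one such test per DG edge yields the full set of inequalities that a scheduling vector must satisfy.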



#### Regular Iteration Algorithm (RIA)

- Standard input RIA form
  - The indices of the inputs are the same in all equations
- Standard output RIA form
  - The indices of the outputs are the same in all equations
- For FIR filtering, the input/output relationship is

$$\begin{aligned} w(i+1,j) &= w(i,j) \\ x(i,j+1) &= x(i,j) \\ y(i+1,j-1) &= y(i,j) + w(i+1,j-1)\,x(i+1,j-1) \end{aligned}$$

- We can express it in standard output RIA form as

$$\begin{aligned} w(i,j) &= w(i-1,j) \\ x(i,j) &= x(i,j-1) \\ y(i,j) &= y(i-1,j+1) + w(i,j)\,x(i,j) \end{aligned}$$

- The FIR filtering problem cannot be expressed in standard input RIA form
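The standard output RIA form can be evaluated directly and compared with a plain FIR sum. The sketch below assumes boundary conditions that the slides do not spell out: weights enter at $i = 0$, samples enter at $j = 0$, partial sums with no predecessor start at 0, and output $y(n)$ is read at node $(n, 0)$. The function names are illustrative.

```python
def fir_ria(weights, samples):
    """Evaluate the FIR filter node by node in standard output RIA form."""
    K, N = len(weights), len(samples)
    w, x, y = {}, {}, {}
    for i in range(N):
        for j in range(K):
            # w(i,j) = w(i-1,j): weights propagate along i (assumed fed at i = 0)
            w[i, j] = w[i - 1, j] if i > 0 else weights[j]
            # x(i,j) = x(i,j-1): samples propagate along j (assumed fed at j = 0)
            x[i, j] = x[i, j - 1] if j > 0 else samples[i]
            # y(i,j) = y(i-1,j+1) + w(i,j) x(i,j); a missing predecessor counts as 0
            y[i, j] = y.get((i - 1, j + 1), 0) + w[i, j] * x[i, j]
    return [y[n, 0] for n in range(N)]

def fir_direct(weights, samples):
    """Reference: y(n) = sum_k w_k x(n-k), with x(n) = 0 for n < 0."""
    return [sum(weights[k] * samples[n - k]
                for k in range(len(weights)) if n - k >= 0)
            for n in range(len(samples))]

weights = [1, 2, 3]
samples = [1, 0, 2, -1]
ria_out = fir_ria(weights, samples)
ref_out = fir_direct(weights, samples)
```

Each partial sum travels along the direction (1, -1), so the nodes contributing to $y(n)$ are exactly those with $i + j = n$.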



### Selection of $\mathbf{s}^{\mathsf{T}}$

- Capture all the fundamental edges in the reduced dependence graph (RDG), which is constructed from the regular iteration algorithm (RIA)
- Construct the scheduling inequalities according to $S_y \ge S_x + T_x$: if there is an edge $X \rightarrow Y$, then $\mathbf{s}^T \mathbf{I}_y + \gamma_y \ge \mathbf{s}^T \mathbf{I}_x + \gamma_x + T_x$
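As a hedged worked instance, apply this rule to the FIR recursion in standard output RIA form, assuming $T_{\text{mult-add}} = 1$ for the $y$ update and $T_{\text{com}} = 0$ for propagating $w$ and $x$ (assumed values, the same ones the matrix multiplication example uses). Each edge of the RDG then contributes one inequality:

$$\begin{aligned} w \rightarrow w,\ \mathbf{e} = (1, 0)^T &: \quad s_1 \ge 0 \\ x \rightarrow x,\ \mathbf{e} = (0, 1)^T &: \quad s_2 \ge 0 \\ y \rightarrow y,\ \mathbf{e} = (1, -1)^T &: \quad s_1 - s_2 \ge 1 \\ w \rightarrow y,\ \mathbf{e} = (0, 0)^T &: \quad \gamma_y - \gamma_w \ge 0 \\ x \rightarrow y,\ \mathbf{e} = (0, 0)^T &: \quad \gamma_y - \gamma_x \ge 0 \end{aligned}$$

With $\gamma_w = \gamma_x = \gamma_y = 0$, a choice such as $\mathbf{s}^T = (9, 1)$ satisfies all of them, since $9 \ge 0$, $1 \ge 0$, and $9 - 1 = 8 \ge 1$.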












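Taking $\mathbf{s}^T = (9, 1)$ and $\mathbf{d}^T = (1, -1)$, so that $\mathbf{s}^T\mathbf{d} = 8 \neq 0$, and $\mathbf{p}^T = (1, 1)$, so that $\mathbf{p}^T\mathbf{d} = 0$, we get a hardware utilization efficiency of HUE $= 1/|\mathbf{s}^T\mathbf{d}| = 1/8$. The edge mapping is as follows: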

| Edge $\mathbf{e}$ | $\mathbf{p}^T\mathbf{e}$ | $\mathbf{s}^T\mathbf{e}$ |
|-------------------|-----|-----|
| weight (1, 0)     | 1   | 9   |
| input (0, 1)      | 1   | 1   |
| result (1, -1)    | 0   | 8   |
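The entries of this table follow from two inner products per edge, which the short sketch below recomputes. It uses the vectors given in the example and the usual definition HUE $= 1/|\mathbf{s}^T\mathbf{d}|$. The dictionary keys are illustrative labels only.

```python
from fractions import Fraction

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

s, d, p = (9, 1), (1, -1), (1, 1)      # vectors from the example

edges = {"weight": (1, 0), "input": (0, 1), "result": (1, -1)}

# For every DG edge e: (array edge p.e, number of delays s.e).
edge_map = {name: (dot(p, e), dot(s, e)) for name, e in edges.items()}

feasible = dot(p, d) == 0 and dot(s, d) != 0
hue = Fraction(1, abs(dot(s, d)))      # each PE is busy once every |s.d| cycles
```

The result edge maps to $\mathbf{p}^T\mathbf{e} = 0$, meaning each partial sum stays inside its PE and is updated once every 8 time steps.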



Systolic architecture for the example







#### Matrix-Matrix Multiplication

$$\begin{pmatrix} c_{11} & c_{12} \\ c_{21} & c_{22} \end{pmatrix} = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} \begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{pmatrix}$$





$c_{11} = a_{11}b_{11} + a_{12}b_{21}$

$c_{12} = a_{11}b_{12} + a_{12}b_{22}$

$c_{21} = a_{21}b_{11} + a_{22}b_{21}$

$c_{22} = a_{21}b_{12} + a_{22}b_{22}$
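These sums can be written as a recursion over a 3D index space $(i, j, k)$. The sketch below assumes the localized form in which $a$ propagates along $j$, $b$ propagates along $i$, and $c$ accumulates along $k$, matching the edge directions (0, 1, 0), (1, 0, 0), and (0, 0, 1) tabulated later in this lecture. The boundary conditions and the name `matmul_ria` are assumptions.

```python
def matmul_ria(A, B):
    """Square matrix product evaluated node by node over the index space (i, j, k)."""
    n = len(A)
    a, b, c = {}, {}, {}
    for i in range(n):
        for j in range(n):
            for k in range(n):
                # a(i,j,k) = a(i,j-1,k): row data travels along j (fed at j = 0)
                a[i, j, k] = a[i, j - 1, k] if j > 0 else A[i][k]
                # b(i,j,k) = b(i-1,j,k): column data travels along i (fed at i = 0)
                b[i, j, k] = b[i - 1, j, k] if i > 0 else B[k][j]
                # c(i,j,k) = c(i,j,k-1) + a(i,j,k) b(i,j,k), starting from 0
                c[i, j, k] = c.get((i, j, k - 1), 0) + a[i, j, k] * b[i, j, k]
    return [[c[i, j, n - 1] for j in range(n)] for i in range(n)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = matmul_ria(A, B)
```

Every node performs one multiply-add and uses only values from its nearest neighbors, which is the property a systolic mapping relies on.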








### Example

- Applying the scheduling inequality with $T_{\text{mult-add}} = 1$ and $T_{\text{com}} = 0$, we get $s_2 \ge 0$, $s_1 \ge 0$, $s_3 \ge 1$, $\gamma_c - \gamma_a \ge 0$, and $\gamma_c - \gamma_b \ge 0$. Take $\gamma_a = \gamma_b = \gamma_c = 0$ for linear scheduling.
- Solution 1: $\mathbf{s}^T = (1,1,1)$, $\mathbf{d}^T = (0,0,1)$, $\mathbf{p}_1 = (1,0,0)$, $\mathbf{p}_2 = (0,1,0)$, $\mathbf{P}^T = (\mathbf{p}_1 \;\; \mathbf{p}_2)^T$
- Solution 2: $\mathbf{s}^T = (1,1,1)$, $\mathbf{d}^T = (1,1,-1)$, $\mathbf{p}_1 = (1,0,1)$, $\mathbf{p}_2 = (0,1,1)$, $\mathbf{P}^T = (\mathbf{p}_1 \;\; \mathbf{p}_2)^T$






### Example



| Edge $\mathbf{e}$ | Sol. 1: $\mathbf{P}^T\mathbf{e}$ | Sol. 1: $\mathbf{s}^T\mathbf{e}$ | Sol. 2: $\mathbf{P}^T\mathbf{e}$ | Sol. 2: $\mathbf{s}^T\mathbf{e}$ |
|-------------------|--------|---|--------|---|
| a (0, 1, 0)       | (0, 1) | 1 | (0, 1) | 1 |
| b (1, 0, 0)       | (1, 0) | 1 | (1, 0) | 1 |
| c (0, 0, 1)       | (0, 0) | 1 | (1, 1) | 1 |
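Both solutions can be checked with a schedule-level simulation of the 2 by 2 product. The sketch below assigns every DG node $(i, j, k)$ to PE $\mathbf{P}^T\mathbf{I}$ at time $\mathbf{s}^T\mathbf{I}$ and verifies that no PE is asked to do two multiply-adds in one time step. It models only the schedule, not the movement of data between PEs, and `simulate` is a hypothetical helper.

```python
from itertools import product

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def simulate(A, B, P, s):
    """Return (C, number of PEs used, last time step) for the given mapping.
    Raises ValueError if two nodes need the same PE in the same time step."""
    n = len(A)
    busy = set()
    C = [[0] * n for _ in range(n)]
    for node in sorted(product(range(n), repeat=3), key=lambda I: dot(s, I)):
        i, j, k = node
        slot = (tuple(dot(p, node) for p in P), dot(s, node))
        if slot in busy:
            raise ValueError(f"PE/time conflict at {slot}")
        busy.add(slot)
        C[i][j] += A[i][k] * B[k][j]      # one multiply-add per DG node
    return C, len({pe for pe, _ in busy}), max(t for _, t in busy)

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
sol1 = simulate(A, B, P=((1, 0, 0), (0, 1, 0)), s=(1, 1, 1))
sol2 = simulate(A, B, P=((1, 0, 1), (0, 1, 1)), s=(1, 1, 1))
```

Under these assumptions Solution 1 uses 4 PEs and keeps each $c_{ij}$ in place, while Solution 2 spreads the same 8 multiply-adds over 7 PEs because its results move along (1, 1).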






