

5008: Computer Architecture

Appendix A - Pipelining

CA Lecture03 - pipelining (cwliu@twins.ee.nctu.edu.tw)






# Pipeline Review

- A pipeline is like an assembly line whose stages are hooked together.
- Pipelining, in general, is not visible to the programmer (vs ILP)
- Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload
- Pipeline rate is limited by the slowest pipeline stage
- Multiple tasks operate simultaneously using different resources
- Potential speedup = number of pipe stages, if the stages are perfectly balanced
- Unbalanced lengths of pipe stages reduce speedup
- Time to "fill" the pipeline and time to "drain" it reduce speedup
- Stall for dependences
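
The fill/drain and balance points above can be checked with a small timing model. This is a minimal sketch, not part of the original slides: the function names and the stage times are assumptions chosen for illustration, and stalls are ignored.

```python
def pipelined_time(n_tasks, stage_times):
    """Total time for n_tasks when the clock is set by the slowest stage."""
    cycle = max(stage_times)
    # The first task needs one cycle per stage to fill the pipe;
    # every later task completes one cycle after the previous one.
    return (len(stage_times) + n_tasks - 1) * cycle


def unpipelined_time(n_tasks, stage_times):
    """Total time when each task runs through all stages before the next starts."""
    return n_tasks * sum(stage_times)


def speedup(n_tasks, stage_times):
    return unpipelined_time(n_tasks, stage_times) / pipelined_time(n_tasks, stage_times)


if __name__ == "__main__":
    print(speedup(1, [1, 1, 1, 1, 1]))      # a single task gains nothing
    print(speedup(1000, [1, 1, 1, 1, 1]))   # approaches the number of stages
    print(speedup(1000, [1, 1, 2, 1, 1]))   # the slow stage limits the rate
```

With many tasks the balanced case approaches a speedup of 5, while the unbalanced case stays below 3 because every stage is clocked at the slowest stage's time.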





## Outline

- MIPS: an ISA example for pipelining
- 5 stage pipelining
- Structural and Data Hazards
- Forwarding
- Branch Schemes
- Exceptions and Interrupts
- Conclusion









# A "Typical" RISC ISA

- 32-bit fixed format instruction (3 formats)
- 32 32-bit GPRs (R0 contains zero; DP values take a register pair)
- 3-address, reg-reg arithmetic instruction
- Single address mode for load/store: base + displacement
  - no indirection
- Simple branch conditions
- Delayed branch

See: SPARC, MIPS, HP PA-RISC, DEC Alpha, IBM PowerPC, CDC 6600, CDC 7600, Cray-1, Cray-2, Cray-3









#### Register-Register

| Bits  | 31-26 | 25-21 | 20-16 | 15-11 | 10-6 | 5-0 |
|-------|-------|-------|-------|-------|------|-----|
| Field | Op    | Rs1   | Rs2   | Rd    |      | Opx |

#### Register-Immediate

| Bits  | 31-26 | 25-21 | 20-16 | 15-0      |
|-------|-------|-------|-------|-----------|
| Field | Op    | Rs1   | Rd    | immediate |

#### Branch

| Bits  | 31-26 | 25-21 | 20-16   | 15-0      |
|-------|-------|-------|---------|-----------|
| Field | Op    | Rs1   | Rs2/Opx | immediate |

#### Jump / Call








- Datapath: Storage, FU, interconnect sufficient to perform the desired functions
  - Inputs are Control Points
  - Outputs are signals
- Controller: State machine to orchestrate operation on the data path
  - Based on desired function and signals





# Approaching an ISA

- Instruction Set Architecture
  - Defines set of operations, instruction format, hardware supported data types, named storage, addressing modes, sequencing
- Meaning of each instruction is described by RTL on architected registers and memory
- Given technology constraints, assemble an adequate datapath
  - Architected storage mapped to actual storage
  - Function units to do all the required operations
  - Possible additional storage (e.g. MAR, MBR, ...)
  - Interconnect to move information among regs and FUs
- Map each instruction to sequence of RTLs
- Collate sequences into symbolic controller state transition diagram (STD)
- Lower symbolic STD to control points
- Implement controller





# Outline

- MIPS: an ISA example for pipelining (read Appendix B)
- 5 stage pipelining
- Structural and Data Hazards
- Forwarding
- Branch Schemes
- Exceptions and Interrupts
- Conclusion







#### The Five Steps of the Load Instruction





- Every instruction can be implemented in at most 5 clock cycles
- Ifetch: Instruction Fetch
  - Fetch the instruction from the Instruction Memory
- Reg/Dec: Registers Fetch and Instruction Decode
- Exec: Execute and calculate the memory address
- Mem: Read the data from the Data Memory
- Wr: Write the data back to the register file

Branch requires ? cycles, Store requires ? cycles, others require ? cycles






#### The Four Steps of R-type Instruction

does not access data memory...



- Ifetch: Instruction Fetch
  - Fetch the instruction from the Instruction Memory
  - Update PC
- Reg/Dec: Registers Fetch and Instruction Decode
- Exec:
  - ALU operates on the two register operands
- Wr: Write the ALU output back to the register file





# Important Observation

- Each functional unit can only be used once per instruction
- Each functional unit must be used at the same step for all instructions:
  - Load uses Register File's Write Port during its 5th step
  - R-type uses Register File's Write Port during its 4th step

| Step       | 1      | 2       | 3    | 4   | 5  |
|------------|--------|---------|------|-----|----|
| **Load**   | Ifetch | Reg/Dec | Exec | Mem | Wr |
| **R-type** | Ifetch | Reg/Dec | Exec | Wr  |    |

This difference in write steps is what causes the problem.

#### Structural hazard !!





### Pipelining the R-type and Load Instruction



- We have a pipeline conflict, i.e. a structural hazard:
  - Two instructions try to write to the register file at the same time!
  - Only one write port






### Sol 2: Delay R-type's Write by One Cycle

- Now R-type instructions also use Reg File's write port at Step 5
- Mem step for R-type inst. is a NOOP: nothing is being done.
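
The write-port conflict, and the effect of delaying the R-type write, can be reproduced with a short sketch. It assumes one instruction is issued per cycle and that each instruction is described only by its list of steps; the function and constant names are illustrative, not from the slides.

```python
LOAD = ["Ifetch", "Reg/Dec", "Exec", "Mem", "Wr"]
RTYPE_4_STEP = ["Ifetch", "Reg/Dec", "Exec", "Wr"]
RTYPE_5_STEP = ["Ifetch", "Reg/Dec", "Exec", "Mem (NOOP)", "Wr"]


def write_port_conflicts(program):
    """Return {cycle: [issue cycles]} for cycles where more than one
    instruction wants the single register-file write port."""
    users = {}
    for issue_cycle, steps in enumerate(program, start=1):
        for offset, step in enumerate(steps):
            if step == "Wr":
                users.setdefault(issue_cycle + offset, []).append(issue_cycle)
    return {cycle: who for cycle, who in users.items() if len(who) > 1}


if __name__ == "__main__":
    # Load issued in cycle 1, R-type in cycle 2: both reach Wr in cycle 5.
    print(write_port_conflicts([LOAD, RTYPE_4_STEP]))
    # With the R-type write delayed to step 5 the conflict disappears.
    print(write_port_conflicts([LOAD, RTYPE_5_STEP]))
```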









Store: Ifetch, Reg/Dec, Exec, Mem, Wr (the Wr step is kept in order to keep our pipeline uniform)

- Ifetch: Instruction Fetch
  - Fetch the instruction from the Instruction Memory
  - Update PC
- Reg/Dec: Registers Fetch and Instruction Decode
- Exec: Calculate the memory address
- Mem: Write the data into the Data Memory





## The Three Steps of Beq



- Ifetch: Instruction Fetch
  - Fetch the instruction from the Instruction Memory
- Reg/Dec: Registers Fetch and Instruction Decode
- Exec:
  - Compare the two register operands
  - Select the correct branch target address
  - Latch it into the PC





## Designing a Pipelined Processor

- Examine the datapath and control diagram
  - Starting with single- or multi-cycle datapath?
  - Single- or multi-cycle control?
- Partition datapath into steps
- Insert pipeline registers between successive steps
- Associate resources with steps
- Ensure that flows do not conflict, or figure out how to resolve the conflicts
- Assert control in appropriate stage











Inst. Set Processor Controller





# Data Stationary Control

- Pass control signals along just like the data
  - Main control generates control signals during ID





#### Use of "Data Stationary Control"

- The Main Control generates the control signals during Reg/Dec
  - Control signals for Exec (ExtOp, ALUSrc, ...) are used 1 cycle later
  - Control signals for Mem (MemWr, Branch) are used 2 cycles later
  - Control signals for Wr (MemtoReg, RegWr) are used 3 cycles later
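
The timing above can be illustrated with a sketch in which the control bundle generated in Reg/Dec simply shifts through the pipeline registers together with its instruction. The control table is hypothetical: the signal names follow the slide where it names them, but `RegWr` and every signal value here are assumptions for illustration.

```python
# Hypothetical control table (values are illustrative assumptions).
CONTROL = {
    "lw":  {"EX": {"ExtOp": 1, "ALUSrc": 1}, "MEM": {"MemWr": 0, "Branch": 0},
            "WB": {"MemtoReg": 1, "RegWr": 1}},
    "sw":  {"EX": {"ExtOp": 1, "ALUSrc": 1}, "MEM": {"MemWr": 1, "Branch": 0},
            "WB": {"MemtoReg": 0, "RegWr": 0}},
    "add": {"EX": {"ExtOp": 0, "ALUSrc": 0}, "MEM": {"MemWr": 0, "Branch": 0},
            "WB": {"MemtoReg": 0, "RegWr": 1}},
}


def run(opcodes):
    """Return (cycle, opcode, stage, signals) for every use of a control group.
    Cycle 1 is the cycle in which the first instruction is in Reg/Dec."""
    id_ex = ex_mem = mem_wb = None
    log = []
    for cycle, op in enumerate(list(opcodes) + [None] * 3, start=1):
        # Each stage consumes the control group that travelled with its instruction.
        for stage, reg in (("EX", id_ex), ("MEM", ex_mem), ("WB", mem_wb)):
            if reg is not None:
                log.append((cycle, reg["op"], stage, reg[stage]))
        # Clock edge: bundles move one pipeline register further down.
        mem_wb, ex_mem = ex_mem, id_ex
        id_ex = {"op": op, **CONTROL[op]} if op is not None else None
    return log


if __name__ == "__main__":
    for entry in run(["lw", "add"]):
        print(entry)
```

For `lw`, decoded in cycle 1, the log shows its Exec signals used in cycle 2, its Mem signals in cycle 3, and its Wr signals in cycle 4.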














## Pipelining is not quite that easy!

- Limits to pipelining: Hazards prevent next instruction from executing during its designated clock cycle
  - <u>Structural hazards</u>: HW cannot support this combination of instructions (single person to fold and put clothes away)
  - <u>Data hazards</u>: Instruction depends on result of prior instruction still in the pipeline (missing sock)
  - <u>Control hazards</u>: Caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps).





Detection is easy in this case! (right half highlight means read, left half write)



## Structural Hazards Limit Performance



- Why allow a structural hazard? The primary reason is to reduce the cost of the unit
- Example: if there are 1.3 memory accesses per instruction and only one memory access per cycle, then
  - average CPI ≥ 1.3
  - otherwise the resource would be more than 100% utilized
- Solution 1: Use separate instruction and data memories
- Solution 2: Allow memory to read and write more than one word per cycle
- Solution 3: Stall





## Speed Up Equations for Pipelining

$$\text{Speedup} = \frac{\text{Average instruction time}_{\text{unpipelined}}}{\text{Average instruction time}_{\text{pipelined}}} = \frac{CPI_{\text{unpipelined}}}{CPI_{\text{pipelined}}} \times \frac{\text{Cycle Time}_{\text{unpipelined}}}{\text{Cycle Time}_{\text{pipelined}}}$$

$$CPI_{\text{pipelined}} = \text{Ideal CPI} + \text{Average stall cycles per instruction}$$

For balanced pipelining:

$$\text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{Ideal CPI} + \text{Pipeline stall CPI}} \times \frac{\text{Cycle Time}_{\text{unpipelined}}}{\text{Cycle Time}_{\text{pipelined}}}$$

For a simple RISC pipeline, Ideal CPI = 1:

$$\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Cycle Time}_{\text{unpipelined}}}{\text{Cycle Time}_{\text{pipelined}}}$$








#### Example: One or Two Memory Ports?

- Machine A: Dual ported memory ("Harvard Architecture")
- Machine B: Single ported memory, but its pipelined implementation has a 1.05 times faster clock rate
- Ideal CPI = 1 for both
- Loads are 40% of instructions executed

$$\text{SpeedUp}_A = \frac{\text{Pipeline Depth}}{1 + 0} \times \frac{clock_{unpipe}}{clock_{pipe}} = \text{Pipeline Depth}$$

$$\text{SpeedUp}_B = \frac{\text{Pipeline Depth}}{1 + 0.4 \times 1} \times \frac{clock_{unpipe}}{clock_{unpipe} / 1.05} = \frac{\text{Pipeline Depth}}{1.4} \times 1.05 = 0.75 \times \text{Pipeline Depth}$$

$$\frac{\text{SpeedUp}_A}{\text{SpeedUp}_B} = \frac{\text{Pipeline Depth}}{0.75 \times \text{Pipeline Depth}} = 1.33$$

- Machine A is 1.33 times faster
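
The arithmetic of this example can be replayed with the speedup equation for a balanced pipeline. This is a sketch: the function name and the choice of a depth of 5 are assumptions (the depth cancels in the ratio), and `cycle_ratio` stands for cycle time unpipelined divided by cycle time pipelined.

```python
def pipeline_speedup(depth, stall_cpi, cycle_ratio=1.0, ideal_cpi=1.0):
    """Speedup = ideal_cpi * depth / (ideal_cpi + stall_cpi) * cycle_ratio."""
    return ideal_cpi * depth / (ideal_cpi + stall_cpi) * cycle_ratio


DEPTH = 5  # assumed; any depth gives the same ratio

# Machine A: dual-ported memory, so no structural stalls.
speedup_a = pipeline_speedup(DEPTH, stall_cpi=0.0)
# Machine B: loads (40% of instructions) each cost 1 stall cycle, clock is 1.05x faster.
speedup_b = pipeline_speedup(DEPTH, stall_cpi=0.4 * 1, cycle_ratio=1.05)

if __name__ == "__main__":
    print(round(speedup_b / DEPTH, 2))      # fraction of the depth B achieves
    print(round(speedup_a / speedup_b, 2))  # how much faster A is
```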





#### One Memory Port Structural Hazards








# Handling Stalls

- How to stall?
  - Stall the instructions in IF and ID: do not change the PC or the IF/ID register
    - => the stages re-execute the same instructions
  - What to move into EX: insert a NOP by setting the EX, MEM, and WB control fields of the ID/EX pipeline register to 0
    - as the control signals propagate, all control signals to EX, MEM, and WB are de-asserted and no registers or memories are written
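
The stall mechanism described above can be sketched as a single clock step. The state layout (a dict with `pc`, `if_id`, `id_ex_control`) and the signal names are assumptions made for illustration only.

```python
def clock(state, fetched_instr, decoded_control, stall):
    """Advance the PC, the IF/ID register and the ID/EX control fields by one clock."""
    if stall:
        return {
            "pc": state["pc"],        # PC is not changed
            "if_id": state["if_id"],  # IF/ID is not changed, so ID re-executes
            # All control fields moving into EX are zeroed: a NOP (bubble).
            "id_ex_control": {name: 0 for name in decoded_control},
        }
    return {
        "pc": state["pc"] + 4,
        "if_id": fetched_instr,
        "id_ex_control": dict(decoded_control),
    }


if __name__ == "__main__":
    state = {"pc": 8, "if_id": "sub r4,r1,r5", "id_ex_control": {}}
    control = {"RegWr": 1, "MemWr": 0, "MemtoReg": 0}
    print(clock(state, "and r6,r1,r7", control, stall=True))
    print(clock(state, "and r6,r1,r7", control, stall=False))
```

Because every control field of the bubble is 0, nothing is written to registers or memory as it moves through EX, MEM, and WB.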








- Data hazards are due to the overlapped execution of instructions.

Example: r1 cannot be read by other instructions before it is written by the add.

| Instruction | Operands                    |
|-------------|-----------------------------|
| add         | <mark>r1</mark>, r2, r3     |
| sub         | r4, <mark>r1</mark>, r3     |
| and         | r6, <mark>r1</mark>, r7     |
| or          | r8, <mark>r1</mark>, r9     |
| xor         | r10, <mark>r1</mark>, r11   |







## RAW Hazards on R1

Dependencies backwards in time are hazards.





# Types of Data Hazards

Three types: (inst. i1 followed by inst. i2)

- RAW (read after write): true dependence
  - i2 tries to read operand before i1 writes it
- WAR (write after read): anti-dependence
  - i2 tries to write operand before i1 reads it
  - Gets wrong operand, e.g., auto-increment addr.
  - Can't happen in MIPS 5-stage pipeline because:
    - All instructions take 5 stages, and reads are always in stage 2, and writes are always in stage 5
- WAW (write after write): output dependence
  - i2 tries to write operand before i1 writes it
  - Leaves wrong result (i1's not i2's); occurs only in pipelines that write in more than one stage
  - Can't happen in MIPS 5-stage pipeline because:
    - All instructions take 5 stages, and writes are always in stage 5
  - Out-of-order execution may suffer from this data dependence



### WAR Data Hazard

 Write After Read (WAR) Instr<sub>J</sub> writes operand <u>before</u> Instr<sub>I</sub> reads it

```
I: sub r4,r1,r3
J: add r1,r2,r3
K: mul r6,r1,r7
```

- Called an "anti-dependence" by compiler writers. This results from reuse of the name "r1".
- Can't happen in MIPS 5 stage pipeline <u>because</u>:
  - All instructions take 5 stages, and
  - Reads are always in stage 2, and
  - Writes are always in stage 5
- Can happen between a shorter (Int) pipeline and a longer (FP) pipeline
- WAR hazards can happen if instructions execute out of order or access data late












#### WAW Data Hazard

Write After Write (WAW)

Instr<sub>J</sub> writes operand <u>before</u> Instr<sub>I</sub> writes it.

I: sub r1,r4,r3

J: add r1,r2,r3

K: mul r6,r1,r7

- Called an "output dependence" by compiler writers. This also results from the reuse of the name "r1".
- Can't happen in 5 stage pipeline <u>because</u>:
  - All instructions take 5 stages, and
  - Writes are always in stage 5
- Will see WAR and WAW in more complicated pipes
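
The three dependence types can be recognized mechanically from register names. The sketch below is an illustration, not the slides' method: the `Instr` tuple and function name are assumptions, and it reports dependences only. Whether a dependence becomes a hazard depends on the pipeline, as the bullets above explain.

```python
from typing import NamedTuple, Tuple


class Instr(NamedTuple):
    op: str
    dest: str
    srcs: Tuple[str, ...]


def dependences(i1, i2):
    """Name the register dependences of the later instruction i2 on the earlier i1."""
    found = set()
    if i1.dest in i2.srcs:
        found.add("RAW")  # i2 reads what i1 writes
    if i2.dest in i1.srcs:
        found.add("WAR")  # i2 writes what i1 reads (anti-dependence)
    if i1.dest == i2.dest:
        found.add("WAW")  # both write the same register (output dependence)
    return found


if __name__ == "__main__":
    # WAR example from the slides: I reads r1, J writes r1, K reads J's r1.
    i = Instr("sub", "r4", ("r1", "r3"))
    j = Instr("add", "r1", ("r2", "r3"))
    k = Instr("mul", "r6", ("r1", "r7"))
    print(dependences(i, j), dependences(j, k))
    # WAW example from the slides: I and J both write r1.
    print(dependences(Instr("sub", "r1", ("r4", "r3")), j))
```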














### Data Forwarding to Avoid Data Hazard

- With data forwarding (also called bypassing or short-circuiting), data is transferred back to earlier pipeline stages before it is written into the register file.
  - Instr i: add r1,r2,r3 (result ready after EX stage)
  - Instr j: sub r4,r1,r5 (operand r1 needed in EX stage)
- This either eliminates or reduces the penalty of RAW hazards.
- To support data forwarding, additional hardware is required.
  - Multiplexors to allow data to be transferred back
  - Control logic for the multiplexors
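
A sketch of the multiplexor control logic is given below. It follows the commonly taught forwarding conditions for a 5-stage pipeline rather than anything stated on the slides: the field names `reg_write` and `rd`, the return labels, and the rule that register 0 is never forwarded are assumptions of this sketch.

```python
def forward_select(src_reg, ex_mem, mem_wb):
    """Choose where an ALU operand for register number src_reg comes from.

    ex_mem and mem_wb describe the instructions one and two stages ahead:
    dicts with 'reg_write' (bool) and 'rd' (destination register number).
    """
    # The most recent producer wins, so EX/MEM is checked first.
    if ex_mem["reg_write"] and ex_mem["rd"] != 0 and ex_mem["rd"] == src_reg:
        return "EX/MEM"
    if mem_wb["reg_write"] and mem_wb["rd"] != 0 and mem_wb["rd"] == src_reg:
        return "MEM/WB"
    return "REGFILE"


if __name__ == "__main__":
    # add r1,r2,r3 is one stage ahead of sub r4,r1,r5.
    add_in_ex_mem = {"reg_write": True, "rd": 1}
    nothing = {"reg_write": False, "rd": 0}
    print(forward_select(1, add_in_ex_mem, nothing))  # operand r1 of sub
    print(forward_select(5, add_in_ex_mem, nothing))  # operand r5 of sub
```

Checking EX/MEM before MEM/WB matters when two in-flight instructions write the same register: the younger result is the one the consumer must see.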







### Data Hazard Solution







#### HW Change for Forwarding





#### Data Hazard Even with Forwarding









#### Software Scheduling to Avoid Load Hazards





Compiler optimizes for performance. Hardware checks for safety.




#### Compiler Avoiding Load Stalls



Compilers reduce the number of load stalls, but do not completely eliminate them.
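
To make the effect of scheduling concrete, the sketch below counts load stalls in two orderings of the same work. Everything here is an assumption for illustration: the instruction encoding as `(op, dest, srcs)` tuples, the rule that a load followed immediately by a consumer costs exactly one stall (forwarding assumed), and the example program computing `a = b + c; d = e - f`, which is a classic textbook case rather than one recovered from the slides.

```python
def load_use_stalls(program):
    """Count one stall for each load whose result is used by the very next instruction."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        op, dest, _ = prev
        if op == "lw" and dest in cur[2]:
            stalls += 1
    return stalls


# Straightforward order: each arithmetic instruction directly follows a load it needs.
SLOW = [
    ("lw", "rb", ()), ("lw", "rc", ()), ("add", "ra", ("rb", "rc")), ("sw", None, ("ra",)),
    ("lw", "re", ()), ("lw", "rf", ()), ("sub", "rd", ("re", "rf")), ("sw", None, ("rd",)),
]

# Scheduled order: independent instructions are moved in between load and use.
FAST = [
    ("lw", "rb", ()), ("lw", "rc", ()), ("lw", "re", ()), ("add", "ra", ("rb", "rc")),
    ("lw", "rf", ()), ("sw", None, ("ra",)), ("sub", "rd", ("re", "rf")), ("sw", None, ("rd",)),
]

if __name__ == "__main__":
    print(load_use_stalls(SLOW), load_use_stalls(FAST))
```

The hardware interlock is still needed: the scheduled version happens to have no load stalls, but the compiler cannot always find an independent instruction to move.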








# Control/Branch Hazards

- Control hazards, which occur due to instructions changing the PC, can result in a large performance loss.
- A branch is either
  - Taken: $PC \leftarrow PC + 4 + Imm$ (the branch target address)
  - Not Taken:  $PC \leftarrow PC + 4$
- The simplest solution is to stall the pipeline as soon as a branch instruction is detected.
  - Detect the branch in the ID stage
  - Don't know if the branch is taken until the EX stage
  - If the branch is taken, we need to repeat the IF and ID stages
  - The PC is not updated until the end of the MEM stage, after determining whether the branch is taken and the new PC value



### **Branch Stall Impact**

- If CPI = 1, 30% branch, Stall 3 cycles => new CPI = 1.9 !!
- Two part solution:
  - Determine branch taken or not sooner, AND
  - Compute taken branch address earlier
- MIPS branch tests if register = 0 or  $\neq$  0
- MIPS Solution:
  - Move Zero test to ID/RF stage
  - Adder to calculate new PC in ID/RF stage
  - 1 clock cycle penalty for branch versus 3
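
The CPI figures on this slide follow from adding the average branch stall cycles to the base CPI. A minimal sketch (the function name is an assumption):

```python
def cpi_with_branches(branch_fraction, penalty_cycles, base_cpi=1.0):
    """CPI = base CPI + branch frequency x branch penalty."""
    return base_cpi + branch_fraction * penalty_cycles


if __name__ == "__main__":
    print(round(cpi_with_branches(0.30, 3), 2))  # 3-cycle stall per branch
    print(round(cpi_with_branches(0.30, 1), 2))  # after moving the test and adder to ID/RF
```

With the zero test and the target adder moved to the ID/RF stage, the same 30% branch mix costs 0.3 cycles per instruction instead of 0.9.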












### Four Branch Hazard Alternatives

- #1: Stall until branch direction is clear
- #2: Predict Branch Not Taken
  - Execute successor instructions in sequence
  - "Squash" instructions in pipeline if branch actually taken
  - Advantage of late pipeline state update
  - 47% MIPS branches not taken on average
  - PC+4 already calculated, so use it to get next instruction

- #3: Predict Branch Taken
  - 53% MIPS branches taken on average
  - But haven't calculated branch target address in MIPS
    - MIPS still incurs 1 cycle branch penalty
    - Other machines: branch target known before outcome






#### Four Branch Hazard Alternatives

#4: <u>Delayed Branch</u> -- make the stall cycle useful

- Define branch to take place AFTER a following instruction

```
branch instruction
sequential successor_1   <- branch delay of length n:
sequential successor_2      these instructions are executed!
...
sequential successor_n
branch target if taken
```

- 1 slot delay allows proper decision and branch target address in 5 stage pipeline
- MIPS uses this







### Stall -- Control Hazard Solution

- Stall: wait until decision is clear
  - It's possible to move up decision to 2nd stage by adding hardware to check registers as being read







#### Predict-- Control Hazard Solution

- Predict: guess one direction then back up if wrong
  - Predict not taken, for example





Impact: 1 clock cycle per branch instruction if right, 2 if wrong



### Predict-Not-Taken Example

| Instruction                | 1  | 2  | 3  | 4   | 5   | 6   | 7   | 8   | 9  |
|----------------------------|----|----|----|-----|-----|-----|-----|-----|----|
| Untaken branch instruction | IF | ID | EX | MEM | WB  |     |     |     |    |
| Instruction i + 1          |    | IF | ID | EX  | MEM | WB  |     |     |    |
| Instruction i + 2          |    |    | IF | ID  | EX  | MEM | WB  |     |    |
| Instruction i + 3          |    |    |    | IF  | ID  | EX  | MEM | WB  |    |
| Instruction i + 4          |    |    |    |     | IF  | ID  | EX  | MEM | WB |

| Instruction              | 1  | 2  | 3    | 4    | 5    | 6    | 7   | 8   | 9  |
|--------------------------|----|----|------|------|------|------|-----|-----|----|
| Taken branch instruction | IF | ID | EX   | MEM  | WB   |      |     |     |    |
| Instruction i + 1        |    | IF | idle | idle | idle | idle |     |     |    |
| Branch target            |    |    | IF   | ID   | EX   | MEM  | WB  |     |    |
| Branch target + 1        |    |    |      | IF   | ID   | EX   | MEM | WB  |    |
| Branch target + 2        |    |    |      |      | IF   | ID   | EX  | MEM | WB |

In the taken case, the idle slots of instruction i + 1 are indeed a stall.



1 clock cycle per branch instruction if right, 2 if wrong



### Delayed Branch-- Control Hazard Solution

- Redefine branch behavior (takes place after next instruction): "delayed branch"
- Impact: 1 clock cycle per branch instruction if the compiler can find an instruction to put in the "slot"



# Delayed Branch

- Delayed branch  $\rightarrow$  make the stall cycle useful
  - Add delay slots = branch penalty = length of branch delay
    - 1 slot for 5-stage DLX/MIPS
  - Instructions in the delay slot are executed whether or not the branch is taken
  - See if the compiler can schedule something useful in these slots
    - When the slots cannot be scheduled, they are filled with the no-op instruction (indeed, stall!!)
  - Hope that filled slots actually help advance the computation





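- A (from before the branch) is the best choice: it fills the delay slot and reduces instruction count (IC)
- In B (from the target), the sub instruction may need to be copied, increasing IC
- In B and C (from the fall-through), it must be okay to execute the moved instruction when the branch goes the other way (B: not taken, C: taken)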



| Scheduling Strategy | Requirements | Improve Performance When? |
|---------------------|--------------|---------------------------|
| From before         | Branch must not depend on the rescheduled instructions | Always |
| From target         | Must be OK to execute rescheduled instructions if branch is not taken. May need to duplicate instructions | When branch is taken. May enlarge program if instructions are duplicated |
| From fall through   | Must be OK to execute instructions if branch is taken | When branch is not taken |





# Delayed Branch Summary



- Compiler effectiveness for single branch delay slot:
  - Fills about 60% of branch delay slots
  - About 80% of instructions executed in branch delay slots useful in computation
  - About 50% (60% x 80%) of slots usefully filled
- As processors go to deeper pipelines and multiple issue, the branch delay grows and more than one delay slot is needed
  - Delayed branching (the static way) has lost popularity compared to more expensive but more flexible dynamic approaches
  - Growth in available transistors has made dynamic approaches relatively cheaper





### **Evaluating Branch Alternatives**



Pipeline speedup =  $\frac{\text{Pipeline depth}}{1 + \text{Branch frequency} \times \text{Branch penalty}}$ 

- Assume 4% unconditional branch, 6% conditional branch-untaken, 10% conditional branch-taken

| Scheme            | Branch penalty | CPI  | Speedup v. unpipelined | Speedup v. stall |
|-------------------|----------------|------|------------------------|------------------|
| Stall pipeline    | 3              | 1.60 | 3.1                    | 1.0              |
| Predict taken     | 1              | 1.20 | 4.2                    | 1.33             |
| Predict not taken | 1              | 1.14 | 4.4                    | 1.40             |
| Delayed branch    | 0.5            | 1.10 | 4.5                    | 1.45             |
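
The table can be recomputed from the stated branch mix. This sketch makes two assumptions that the slide does not spell out: a pipeline depth of 5, and that under predict-not-taken only unconditional and taken branches pay the penalty. Both reproduce the tabulated numbers.

```python
DEPTH = 5  # assumed pipeline depth
FREQ = {"unconditional": 0.04, "untaken": 0.06, "taken": 0.10}

# scheme -> (branch penalty in cycles, branch kinds that pay it)
SCHEMES = {
    "Stall pipeline": (3, ("unconditional", "untaken", "taken")),
    "Predict taken": (1, ("unconditional", "untaken", "taken")),
    "Predict not taken": (1, ("unconditional", "taken")),
    "Delayed branch": (0.5, ("unconditional", "untaken", "taken")),
}


def evaluate(scheme):
    """Return (CPI, speedup vs unpipelined) for an ideal CPI of 1."""
    penalty, penalized = SCHEMES[scheme]
    cpi = 1 + penalty * sum(FREQ[kind] for kind in penalized)
    return cpi, DEPTH / cpi


if __name__ == "__main__":
    stall_cpi, _ = evaluate("Stall pipeline")
    for name in SCHEMES:
        cpi, speedup = evaluate(name)
        print(f"{name:18s} CPI={cpi:.2f}  vs unpipelined={speedup:.1f}  vs stall={stall_cpi / cpi:.2f}")
```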





# Deeper Pipeline Example

- For a deeper pipeline, e.g. MIPS R4K, it takes at least 3 pipeline stages before the branch-target address is known and an additional cycle before the branch condition is evaluated, assuming no stalls on the registers in the conditional comparisons
  - Assuming an ideal CPI of 1,

Speedup =  $\frac{\text{Pipeline depth}}{1 + \text{Pipeline stall cycles from branches}}$ 

Pipeline stall cycles from branches = Branch frequency  $\times$  Branch penalty







### **Branch Penalties**



| Branch scheme     | Penalty unconditional | Penalty untaken | Penalty taken |
|-------------------|-----------------------|-----------------|---------------|
| Flush pipeline    | 2                     | 3               | 3             |
| Predict taken     | 2                     | 3               | 2             |
| Predict not taken | 2                     | 0               | 3             |

#### Try to find the CPI penalties for 3 branch schemes (Fig. A.16)
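
One way to work this exercise is to weight each penalty by how often that kind of branch occurs. The sketch below assumes the same branch mix as in "Evaluating Branch Alternatives" (4% unconditional, 6% untaken, 10% taken); if the intended frequencies differ, only the `FREQ` table needs changing.

```python
FREQ = {"unconditional": 0.04, "untaken": 0.06, "taken": 0.10}  # assumed mix

PENALTY = {
    "Flush pipeline": {"unconditional": 2, "untaken": 3, "taken": 3},
    "Predict taken": {"unconditional": 2, "untaken": 3, "taken": 2},
    "Predict not taken": {"unconditional": 2, "untaken": 0, "taken": 3},
}


def cpi_penalty(scheme):
    """Average stall cycles per instruction added by branches under a scheme."""
    return sum(FREQ[kind] * PENALTY[scheme][kind] for kind in FREQ)


if __name__ == "__main__":
    for scheme in PENALTY:
        print(f"{scheme:18s} adds {cpi_penalty(scheme):.2f} cycles per instruction")
```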








# Problems with Pipelining



- Exception: an unusual event happens to an instruction during its execution
  - Examples: divide by zero, undefined opcode
- Interrupt: Hardware signal to switch the processor to a new instruction stream
  - Example: a sound card interrupts when it needs more audio output samples (an audio "click" happens if it is left waiting)
- Problem: it must appear that the exception or interrupt occurs between 2 instructions ($I_i$ and $I_{i+1}$)
  - The effect of all instructions up to and including $I_i$ is totally complete
  - No effect of any instruction after $I_i$ can take place
- The interrupt (exception) handler either aborts the program or restarts at instruction $I_{i+1}$












# Exceptions in MIPS

| Pipeline stage | Problem exceptions occurring |
|----------------|------------------------------|
| IF             | Page fault on instruction fetch; misaligned memory access; memory protection violation |
| ID             | Undefined or illegal opcode |
| EX             | Arithmetic exception |
| MEM            | Page fault on data fetch; misaligned memory access; memory protection violation |
| WB             | None |

Note: Multiple exceptions may occur in the same clock cycle in a pipelined architecture
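
When several exceptions are raised in the same cycle, keeping the appearance of sequential execution suggests servicing the one that belongs to the earliest instruction in program order. The sketch below assumes an in-order 5-stage pipeline with one instruction per stage, so the instruction deepest in the pipeline is the oldest; the function name and data layout are illustrative assumptions.

```python
STAGE_ORDER = ["IF", "ID", "EX", "MEM", "WB"]


def first_in_program_order(raised):
    """raised: list of (stage, cause) pairs seen in one clock cycle.

    Returns the pair belonging to the oldest instruction, or None if
    nothing was raised. Deeper stage means older instruction.
    """
    if not raised:
        return None
    return max(raised, key=lambda event: STAGE_ORDER.index(event[0]))


if __name__ == "__main__":
    same_cycle = [
        ("EX", "arithmetic exception"),
        ("MEM", "page fault on data fetch"),
    ]
    print(first_in_program_order(same_cycle))
```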






Key observation: architected state changes only in the memory and register write stages.



- Hazards limit performance on computers:
  - Structural: need more HW resources
  - Data (RAW, WAR, WAW): need forwarding, compiler scheduling
  - Control: delayed branch, prediction
- Exceptions, Interrupts add complexity

