

#### **COMPUTER ORGANIZATION AND DESIGN**

The Hardware/Software Interface



## **Chapter 1**

Computer Abstractions and Technology

- Progress in computer technology
  - Underpinned by Moore's Law
- Makes novel applications feasible
  - Computers in automobiles
  - Cell phones
  - Human genome project
  - World Wide Web
  - Search Engines
- Computers are pervasive



## **Classes of Computers**

- Personal computers
  - Delivery of good performance to single users at low cost, a general purpose computer
  - Subject to cost/performance tradeoff
- Servers
  - Network based and much larger computers for high capacity, performance, and reliability
  - Widest range in cost and capability
    - Low-end servers for file storage vs. High-end servers (or supercomputers) for scientific and engineering calculations
- Embedded computers
  - Hidden as components of a system
  - Stringent energy/performance/cost constraints and lower tolerance for failure



## The PostPC Era





## The PostPC Era

- Personal Mobile Device (PMD)
  - Battery operated
  - Connects to the Internet
  - Hundreds of US dollars
  - Smart phones, tablets, electronic glasses
- Cloud computing
  - Warehouse Scale Computers (WSC)
    - Amazon, google, ....
  - Software as a Service (SaaS)
    - Portion of software run on a PMD and a portion run in the Cloud



## What You Will Learn

- You can answer the following questions:
  - How high-level C/Java programs are translated into the machine language
    - And how the hardware executes them
  - The hardware/software interface
  - What determines the performance of program
    - And how it can be improved
  - How hardware designers improve performance
  - What is parallel processing



# **Understanding Performance**

- Algorithm
  - Determine number of operations executed
- Programming language, compiler, architecture
  - Determine number of machine instructions executed per algorithm/operation
- Processor and memory system
  - Determine how fast instructions are executed
- I/O system (including OS)
  - Determine how fast I/O operations are executed



## **Eight Great Ideas in CA**

- Design for *Moore's Law* (1965)
- Use *abstraction* to simplify design
- Make the common case fast
- Performance via parallelism
- Performance *via pipelining*
- Performance via prediction
- *Hierarchy* of memories
- Dependability via redundancy





### **Hierarchy Layers in Programs**

- Application software
  - Written in high-level language
- System software
  - Compiler:
    - Translate HLL code to machine code
  - Operating System: service code
    - Handling basic input/output
    - Managing memory and storage
    - Scheduling tasks & sharing resources
- Hardware
  - Processor, memory, I/O controllers





# Levels of Program Code

- High-level language
  - Level of abstraction closer to problem domain
  - Provide for productivity and portability
- Assembly language
  - Textual representation of instructions
- Machine language
  - Encoded instructions and data in binary digits (bits)





Chapter 1 — Computer Abstractions and Technology — 10

# **Components of a Computer**

#### **The BIG Picture**





- Same components for all kinds of computer
  - Inputting/outputting data, processing data, and storing data

#### Input/output includes

- User-interface devices
  - Display, keyboard, mouse
- Storage devices
  - Hard disk, CD/DVD, flash
- Network adapters
  - For communicating with other computers







### Touchscreen

- PostPC device
- Supersedes keyboard and mouse
- Resistive and Capacitive types
  - Most tablets, smart phones use capacitive
  - Capacitive allows multiple touches simultaneously









Chapter 1 — Computer Abstractions and Technology — 14

### **Inside the Processor**

### Apple A5





Chapter 1 — Computer Abstractions and Technology — 15

# Inside the Processor (CPU)

- Datapath: performs arithmetic operations on data
- Control: tells datapath, memory, and I/O devices what to do according to the wishes of the instruction
- Memory: data memory and instruction memory
  - DRAM (dynamic random access memory)
- Cache memory
  - Small fast SRAM memory for immediate access to data



## **Abstractions**

#### The BIG Picture

- Abstraction helps us deal with complexity
  - Hide lower-level detail
- Instruction set architecture (ISA)
  - The hardware/software interface
- Application binary interface
  - The ISA plus system software interface
- Implementation
  - The details underlying and interface



## A Safe Place for Data

- Volatile main memory
  - Loses instructions and data when power off
- Non-volatile secondary memory
  - Magnetic disk
  - Flash memory
  - Optical disk (CDROM, DVD)









### **Networks (Communicating with Others)**

- Communication, resource sharing, nonlocal access
- Local area network (LAN): Ethernet
- Wide area network (WAN): the Internet
- Wireless network: WiFi, Bluetooth





# **Technology Trends**

- Electronics
   technology
   continues to evolve
  - Increased capacity and performance
  - Reduced cost



DRAM capacity

| Year | Technology                 | Relative performance/cost |  |
|------|----------------------------|---------------------------|--|
| 1951 | Vacuum tube                | 1                         |  |
| 1965 | Transistor                 | 35                        |  |
| 1975 | Integrated circuit (IC)    | 900                       |  |
| 1995 | Very large scale IC (VLSI) | 2,400,000                 |  |
| 2013 | Ultra large scale IC       | 250,000,000,000           |  |



# **Chip Manufacturing Process**



Yield: proportion of working dies per wafer



## Intel Core i7 Wafer



- 300mm wafer, 280 chips, 32nm technology
- Each chip is 20.7 x 10.5 mm<sup>2</sup>



## Cost of a Chip Includes ...

#### Die cost

- affected by wafer cost, number of dies per wafer, and die yield (#good dies/#total dies)
- goes roughly with the cube of the die area
- Testing cost
- Packaging cost
  - depends on pins, heat dissipation, ...







## **Three Equations for IC Cost**

$$Cost per die = \frac{Cost per wafer}{Dies per wafer \times Yield} Exactly derived eq.$$

$$Dies per wafer \approx Wafer area/Die area Approximation eq.$$

$$Yield = \frac{1}{(1+(Defects per area \times Die area/2))^2} Statistical eq.$$

- Nonlinear relation to area and defect rate
  - Wafer cost and area are fixed
  - Defect rate determined by manufacturing process
  - Die area determined by architecture and circuit design



## **Defining Performance**

#### Which airplane has the best performance?





### **Response Time and Throughput**

- Response time (aka execution time)
  - How long it takes to do a task
- Throughput (aka bandwidth)
  - Total work done per unit time
- PMDs are more focused on response time, while servers are more focused on throughput.
- Example: How are response time and throughput affected by
  - Replacing the processor with a faster version?
  - Adding more processors?
- We'll focus on response time for now...



## **Relative Performance**

- Define Performance = 1/Execution Time
- "X is n time faster than Y"

 $Performance_{x}/Performance_{y}$ 

= Execution time<sub>Y</sub>/Execution time<sub>X</sub> = n

- **Example**: time taken to run a program
  - 10s on A, 15s on B
  - Execution Time<sub>B</sub> / Execution Time<sub>A</sub>
     = 15s / 10s = 1.5
  - So A is 1.5 times faster than B



# **Measuring Execution Time**

#### Elapsed time

- Total response time, including all aspects
  - Processing, I/O, OS overhead, idle time, …
- Determines system performance
- CPU time
  - Time spent by CPU for processing a given job
    - Discounts I/O time, other jobs' shares
  - Comprises user CPU time and system CPU time
    - User CPU time: the CPU time spent in a program itself
    - System CPU time: the CPU time spent in OS performing tasks on behalf of the program



### **Execution Time and CPU Clocking**

 Operation of digital hardware governed by a constant-rate clock



Clock period: duration of a clock cycle

- e.g., 250ps = 0.25ns = 250×10<sup>-12</sup>s
- Clock frequency (rate): cycles per second

e.g., 4.0GHz = 4000MHz = 4.0×10<sup>9</sup>Hz





CPU Time = CPU Clock Cycles × Clock Cycle Time

CPU Clock Cycles

- Performance improved by
  - Reducing number of clock cycles (or cycle count)
  - Increasing clock rate
  - Hardware designer must often trade off clock rate against cycle count



## **CPU Time Example**

- Computer A: 2GHz clock, 10s CPU time
- Designing Computer B
  - Aim for 6s CPU time
  - Can do faster clock, but causes 1.2 × clock cycles
- How fast must Computer B clock be?

$$Clock Rate_{B} = \frac{Clock Cycles_{B}}{CPU Time_{B}} = \frac{1.2 \times Clock Cycles_{A}}{6s}$$

$$Clock Cycles_{A} = CPU Time_{A} \times Clock Rate_{A}$$

$$= 10s \times 2GHz = 20 \times 10^{9}$$

$$Clock Rate_{B} = \frac{1.2 \times 20 \times 10^{9}}{6s} = \frac{24 \times 10^{9}}{6s} = 4GHz$$



# Instruction Count and CPI

Clock Cycles = Instruction Count × Cycles per Instruction

CPU Time = Instruction Count × CPI × Clock Cycle Time

Instruction Count  $\times$  CPI

**Clock Rate** 

- Instruction Count for a program
  - Determined by program, ISA, and compiler
- Average cycles per instruction
  - Determined by CPU hardware
  - If different instructions have different CPI
    - Average CPI affected by instruction mix



# **CPI Example**

- Computer A: Cycle Time = 250ps, CPI = 2.0
- Computer B: Cycle Time = 500ps, CPI = 1.2
- Same ISA
- Which is faster, and by how much?

 $\begin{array}{l} \mathsf{CPUTime}_{\mathsf{A}} = \mathsf{Instruction} \, \mathsf{Count} \times \mathsf{CPI}_{\mathsf{A}} \times \mathsf{Cycle} \, \mathsf{Time}_{\mathsf{A}} \\ = \mathsf{I} \times 2.0 \times 250 \, \mathsf{ps} = \mathsf{I} \times 500 \, \mathsf{ps} & \quad \mathsf{A} \, \mathsf{is} \, \mathsf{faster...} \\ \mathsf{CPUTime}_{\mathsf{B}} = \mathsf{Instruction} \, \mathsf{Count} \times \mathsf{CPI}_{\mathsf{B}} \times \mathsf{Cycle} \, \mathsf{Time}_{\mathsf{B}} \\ = \mathsf{I} \times 1.2 \times 500 \, \mathsf{ps} = \mathsf{I} \times 600 \, \mathsf{ps} \\ \mathsf{CPUTime}_{\mathsf{B}} \\ \mathsf{CPUTime}_{\mathsf{A}} \end{array} = \frac{\mathsf{I} \times 600 \, \mathsf{ps}}{\mathsf{I} \times 500 \, \mathsf{ps}} = 1.2 \, \mathsf{curve} \, \mathsf{Instruction} \, \mathsf{Ins$ 



## **CPI in More Detail**

- If different instruction classes take different numbers of cycles
  - Average CPI affected by instruction mix

Clock Cycles = 
$$\sum_{i=1}^{n} (CPI_i \times Instruction Count_i)$$

Weighted average CPI

$$CPI = \frac{Clock Cycles}{Instruction Count} = \sum_{i=1}^{n} \left( CPI_i \times \frac{Instruction Count_i}{Instruction Count} \right)$$
Relative frequency



## **CPI Example**

 Alternative compiled code sequences using instructions in classes A, B, C

| Class            | А | В | С |
|------------------|---|---|---|
| CPI for class    | 1 | 2 | 3 |
| IC in sequence 1 | 2 | 1 | 2 |
| IC in sequence 2 | 4 | 1 | 1 |

- Sequence 1: IC = 5
  - Clock Cycles
     = 2×1 + 1×2 + 2×3
     = 10
  - Avg. CPI = 10/5 = 2.0

- Sequence 2: IC = 6
  - Clock Cycles
     = 4×1 + 1×2 + 1×3
     = 9





- Performance depends on
  - Algorithm: affects IC, possibly CPI
  - Programming language: affects IC, CPI
  - Compiler: affects IC, CPI
  - Instruction set architecture: affects IC, CPI, T<sub>c</sub>



# **Clock Rate and Power Trends**



#### In CMOS IC technology



# **Reducing Power**

- Suppose a new CPU has
  - 85% of capacitive load of old CPU
  - 15% voltage and 15% frequency reduction

$$\frac{P_{\text{new}}}{P_{\text{old}}} = \frac{C_{\text{old}} \times 0.85 \times (V_{\text{old}} \times 0.85)^2 \times F_{\text{old}} \times 0.85}{C_{\text{old}} \times V_{\text{old}}^2 \times F_{\text{old}}} = 0.85^4 = 0.52$$

- The power wall
  - We can't reduce voltage further
  - We can't remove more heat
- How else can we improve performance?



### **Uniprocessor Performance**





## **Multiprocessors**

- Multicore microprocessors
  - More than one processor per chip
- Requires explicitly parallel programming
  - Compare with instruction level parallelism
    - Hardware executes multiple instructions at once
    - Hidden from the programmer
  - Hard to do
    - Programming for performance
    - Load balancing
    - Optimizing communication and synchronization



## **SPEC CPU Benchmark**

- Benchmark: Programs used to measure performance
  - Supposedly typical of actual workload
- Standard Performance Evaluation Corp (SPEC)
  - Develops benchmarks for CPU, I/O, Web, …
- SPEC CPU2006
  - Elapsed time to execute a selection of programs
    - Negligible I/O, so focuses on CPU performance
  - Normalize relative to reference machine
  - Summarize as geometric mean of performance ratios
    - CINT2006 (integer) and CFP2006 (floating-point)





### Remark



- SPECRatio is just a ratio rather than an absolute execution time
- Note that when comparing 2 computers as a ratio, execution times on the reference computer drop out, so choice of reference computer is irrelevant



### CINT2006 for Intel Core i7 920

| Description                          | Name       | Instruction<br>Count x 10 <sup>9</sup> | СРІ  | Clock cycle time<br>(seconds x 10 <sup>-9</sup> ) | Execution<br>Time<br>(seconds) | Reference<br>Time<br>(seconds) | SPECratio |
|--------------------------------------|------------|----------------------------------------|------|---------------------------------------------------|--------------------------------|--------------------------------|-----------|
| Interpreted string processing        | perl       | 2252                                   | 0.60 | 0.376                                             | 508                            | 9770                           | 19.2      |
| Block-sorting<br>compression         | bzip2      | 2390                                   | 0.70 | 0.376                                             | 629                            | 9650                           | 15.4      |
| GNU C compiler                       | gcc        | 794                                    | 1.20 | 0.376                                             | 358                            | 8050                           | 22.5      |
| Combinatorial optimization           | mcf        | 221                                    | 2.66 | 0.376                                             | 221                            | 9120                           | 41.2      |
| Go game (AI)                         | go         | 1274                                   | 1.10 | 0.376                                             | 527                            | 10490                          | 19.9      |
| Search gene sequence                 | hmmer      | 2616                                   | 0.60 | 0.376                                             | 590                            | 9330                           | 15.8      |
| Chess game (AI)                      | sjeng      | 1948                                   | 0.80 | 0.376                                             | 586                            | 12100                          | 20.7      |
| Quantum computer<br>simulation       | libquantum | 659                                    | 0.44 | 0.376                                             | 109                            | 20720                          | 190.0     |
| Video compression                    | h264avc    | 3793                                   | 0.50 | 0.376                                             | 713                            | 22130                          | 31.0      |
| Discrete event<br>simulation library | omnetpp    | 367                                    | 2.10 | 0.376                                             | 290                            | 6250                           | 21.5      |
| Games/path finding                   | astar      | 1250                                   | 1.00 | 0.376                                             | 470                            | 7020                           | 14.9      |
| XML parsing                          | xalancbmk  | 1045                                   | 0.70 | 0.376                                             | 275                            | 6900                           | 25.1      |
| Geometric mean                       | -          | _                                      | _    | _                                                 | _                              | _                              | 25.7      |



### **SPEC Power Benchmark**

- Power consumption of server at different workload levels
  - Performance: ssj\_ops
  - Power: Watts (Joules/sec)

Overall ssj\_ops per Watt = 
$$\left(\sum_{i=0}^{10} ssj_ops_i\right) / \left(\sum_{i=0}^{10} power_i\right)$$

server side Java operations per second per watt



### SPECpower\_ssj2008 for Xeon X5650

| Target Load %                      | Performance<br>(ssj_ops) | Average Power<br>(Watts) |  |
|------------------------------------|--------------------------|--------------------------|--|
| 100%                               | 865,618                  | 258                      |  |
| 90%                                | 786,688                  | 242                      |  |
| 80%                                | 698,051                  | 224                      |  |
| 70%                                | 607,826                  | 204                      |  |
| 60%                                | 521,391                  | 185                      |  |
| 50%                                | 436,757                  | 170                      |  |
| 40%                                | 345,919                  | 157                      |  |
| 30%                                | 262,071                  | 146                      |  |
| 20%                                | 176,061                  | 135                      |  |
| 10%                                | 86,784                   | 121                      |  |
| 0%                                 | 0                        | 80                       |  |
| Overall Sum                        | 4,787,166                | 1,922                    |  |
| $\Sigma$ ssj_ops/ $\Sigma$ power = |                          | 2,490                    |  |





# 由台北到高雄

- 不能enhance的部份為在市區的時間: 0.5 + 0.5 = 1小時
- 可以enhance的部份為在高速公路上的4小時
   => 佔總時間的 4/(4+1) = 0.8 = F
- 現在改用飛機,可以enhance的部份縮短為1小時
   => S = 4/1 = 4



• When S -> ∞, speedup -> 5



## **Amdahl's Law**

Speedup due to enhancement E:

Speedup(E) =  $\frac{\text{Execution Time without E}}{\text{Execution Time with E}} = \frac{\text{Performance w/ E}}{\text{Performance w/o E}}$ 

 Suppose that enhancement E accelerates a fraction F of the task by a factor S and the remainder of the task is unaffected then,

Execution Time(w/E) = 
$$((1-F) + \frac{F}{S}) \times \text{Execution Time(w/oE)}$$

Speedup(w/ E) = 
$$\frac{1}{(1-F) + \frac{F}{S}} \quad s \stackrel{\approx}{\to} \infty \quad \frac{1}{1-F}$$



## Pitfall: Amdahl's Law

 Improving an aspect of a computer and expecting a proportional improvement in overall performance



- Example: multiply accounts for 80s/100s
  - How much improvement in multiply performance to get 5x overall?

$$20 = \frac{80}{n} + 20$$
 • Can't be done!

Corollary: make the common case fast



# Fallacy: Low Power at Idle

- Look back at i7 power benchmark
  - At 100% load: 258W
  - At 50% load: 170W (66%)
  - At 10% load: 121W (47%)
- Google data center
  - Mostly operates at 10% 50% load
  - At 100% load less than 1% of the time
- Consider designing processors to make power proportional to load



### **Pitfall: MIPS as a Performance Metric**

#### MIPS: Millions of Instructions Per Second



- Doesn't account for
  - Differences in ISAs between computers
  - Differences in complexity between instructions
- CPI varies between programs on a given CPU



# **Concluding Remarks**

- Cost/performance is improving
  - Due to underlying technology development
- Hierarchical layers of abstraction
  - In both hardware and software
- Instruction set architecture
  - The hardware/software interface
- Execution time: the best performance measure
- Power is a limiting factor
  - Use parallelism to improve performance

