#### Chapter 1

#### **Fundamentals of Quantitative Design and Analysis**

#### **Current Trends in Architecture**

- Cannot continue to leverage instruction-level parallelism (ILP)
  - Single processor performance improvement ended in 2003
- New models for performance:
  - Data-level parallelism (DLP)
  - Thread-level parallelism (TLP)
  - Request-level parallelism (RLP)
- These require explicit restructuring of the application



- Productivity-based managed/interpreted programming languages



- Personal Mobile Devices (PMD)
  - e.g. smartphones, tablet computers
  - Emphasis on energy efficiency and real-time
- Desktop Computing
  - Emphasis on price-performance
- Servers
  - Emphasis on availability, scalability, throughput
- Clusters / Warehouse Scale Computers
  - Used for "Software as a Service (SaaS)"
  - Emphasis on availability and price-performance
  - Sub-class: Supercomputers, emphasis: floating-point performance and fast internal networks
- Embedded Computers
  - Emphasis: price





#### Flynn's Taxonomy

- Single instruction stream, single data stream (SISD)
- Single instruction stream, multiple data streams (SIMD)
  - Vector architectures
  - Multimedia extensions
  - Graphics processing units
- Multiple instruction streams, single data stream (MISD)
  - No commercial implementation
- Multiple instruction streams, multiple data streams (MIMD)
  - Tightly-coupled MIMD
  - Loosely-coupled MIMD
- Compare with...
  - CUDA's SIMT
  - Modern NUMA server with multiple multicore processors and accelerators

#### **Trends in Technology**

- Integrated circuit technology
  - Transistor density: 35%/year
  - Die size: 10-20%/year
  - Integration overall: 40-55%/year
    - Laws: Moore, Dennard
- DRAM capacity: 25-40%/year (slowing)
- Flash capacity: 50-60%/year
  - 15-20X cheaper/bit than DRAM
- Magnetic disk technology: 40%/year
  - 15-25X cheaper/bit than Flash
  - 300-500X cheaper/bit than DRAM

#### Defining Computer Architecture

- "Old" view of computer architecture:
  - Instruction Set Architecture (ISA) design
  - i.e. decisions regarding:
    - registers, memory addressing, addressing modes, instruction operands, available operations, control flow instructions, instruction encoding

- "Real" computer architecture:
  - Specific requirements of the target machine
  - Design to maximize performance within constraints: cost, power, and availability
  - Includes ISA, microarchitecture, hardware

#### Math Sidebar: Compound Interest

- Suppose performance improves 50% per year
- How long does it take for performance to quadruple (factor of 4)?
- Does it take 8 years?
  - $8 \times 0.5 = 4$ (no: this ignores compounding)
- After 1 year: perf x (1 + 0.5)
- After 2 years: perf x (1 + 0.5)(1 + 0.5) = perf x (1 + 0.5)<sup>2</sup>
- After k years: perf x (1 + 0.5)<sup>k</sup>
- Answer:
  - $(1+0.5)^{k} = 4 \Rightarrow k = \ln 4 / \ln 1.5 \approx 3.42$ years
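
The compounding argument above can be checked numerically. This is a minimal sketch; the function names `years_to_reach` and `growth_after` are illustrative and not part of the slides.

```python
import math

def years_to_reach(factor: float, annual_rate: float) -> float:
    """Years of compound growth at annual_rate needed to grow by `factor`."""
    # Solve (1 + annual_rate)^k = factor for k.
    return math.log(factor) / math.log(1.0 + annual_rate)

def growth_after(years: float, annual_rate: float) -> float:
    """Total growth factor after `years` of compounding at annual_rate."""
    return (1.0 + annual_rate) ** years

# Quadrupling at 50% per year takes about 3.42 years, not 8.
print(f"{years_to_reach(4.0, 0.5):.2f} years to quadruple")

# After 8 years at 50% per year, performance has grown far more than 4x.
print(f"{growth_after(8, 0.5):.1f}x after 8 years")
```

The second print shows why the linear guess fails: eight years of compounding at 50% gives roughly 25.6x.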

#### **Instruction Set: Debate Won?**

- Common instruction sets
  - CISC
    - x86 (but not the micro-code)
  - RISC
    - MIPS, HP PA, IBM Power, Sun SPARC, ARM
  - VLIW
    - Itanium, some GPUs (internally)
  - Vector
    - Cray, NEC, ... (mostly gone)
- Hybrids?
  - Intel Xeon Phi
    - x86 CISC
    - RISC-like microcode?
    - 512-bit vector floating-point

#### **Bandwidth and Latency**

- Bandwidth or throughput
  - Total work done in a given time
  - 10,000-25,000X improvement for processors
  - 300-1200X improvement for memory and disks
  - Units
    - flop/s, B/s, b/s
- Latency or response time
  - Time between start and completion of an event
  - 30-80X improvement for processors
  - 6-8X improvement for memory and disks
  - Units
    - CPU, memory: nanoseconds
    - Network: microseconds
    - Disk: milliseconds




#### **Dynamic Energy and Power**

#### Dynamic energy

- Transistor switch from 0 -> 1 or 1 -> 0
- 1/2 x Capacitive load x Voltage<sup>2</sup>

#### Dynamic power

- 1/2 x Capacitive load x Voltage<sup>2</sup> x Frequency switched
- Reducing clock rate (frequency) reduces power, not energy
  - To reduce energy, lower the frequency of under-utilized or idle units
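
The two formulas above can be compared directly. The sketch below uses made-up values for capacitive load, voltage, and frequency (all three constants are assumptions for illustration only) and looks at ratios, which do not depend on those values.

```python
def dynamic_energy(cap_load: float, voltage: float) -> float:
    """Energy of one 0->1 or 1->0 transition: 1/2 x C x V^2."""
    return 0.5 * cap_load * voltage ** 2

def dynamic_power(cap_load: float, voltage: float, freq_switched: float) -> float:
    """Dynamic power: 1/2 x C x V^2 x frequency switched."""
    return dynamic_energy(cap_load, voltage) * freq_switched

# Hypothetical operating point (illustrative numbers, not from the slides).
C_LOAD = 1e-9   # farads
VOLTAGE = 1.0   # volts
FREQ = 2e9      # hertz

base_energy = dynamic_energy(C_LOAD, VOLTAGE)
base_power = dynamic_power(C_LOAD, VOLTAGE, FREQ)

# Halving the clock rate halves power but leaves energy per transition unchanged.
half_clock_power_ratio = dynamic_power(C_LOAD, VOLTAGE, FREQ / 2) / base_power
half_clock_energy_ratio = dynamic_energy(C_LOAD, VOLTAGE) / base_energy

# Scaling voltage and frequency together by 0.85 reduces both.
scaled_energy_ratio = dynamic_energy(C_LOAD, 0.85 * VOLTAGE) / base_energy
scaled_power_ratio = dynamic_power(C_LOAD, 0.85 * VOLTAGE, 0.85 * FREQ) / base_power

print(f"half clock: power x{half_clock_power_ratio:.2f}, energy x{half_clock_energy_ratio:.2f}")
print(f"0.85 V and f: power x{scaled_power_ratio:.3f}, energy x{scaled_energy_ratio:.3f}")
```

Because energy depends on voltage squared while power also carries the frequency term, lowering voltage is what reduces energy per transition; lowering frequency alone only spreads the same energy over more time.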

#### **Transistors and Wires**

#### Feature size

- Minimum size of transistor or wire in x or y dimension
- 10 microns in 1971 to 0.032 microns in 2011
- Transistor performance scales linearly
  - Wire delay does not improve with feature size!
- Integration density scales quadratically

#### Laws of silicon chip manufacturing

- Moore
- Dennard



#### **Power and Energy**

- Problem: Get power in, get power out
- Thermal Design Power (TDP)
  - Characterizes sustained power consumption
  - Used as target for power supply and cooling system
  - Lower than peak power, higher than average power consumption
- Clock rate can be reduced dynamically to limit power consumption
- Energy per task is often a better measurement

#### **Reducing Power**

- Techniques for reducing power:
  - Do nothing well
  - Idle state power
    - C-states, P-states
      - Cost of switching between them
  - Dynamic Voltage-Frequency Scaling
    - Implementations: silicon, OS-level, user-level
  - Low power state for DRAM, disks, interconnect
  - Overclocking, turning off cores
    - Race to halt
    - Number of power planes in a single chip





#### **Thoughts on Scaling Limits**

- Feature size
  - Silicon mesh size (quantum effects)
  - Lithography limits (wavelength)
  - Wire cross-talk
- Frequency
  - Dynamic power dissipation
- Voltage
  - Reliability of switching when moving from 5V down to 0.7V
  - Near-threshold circuits
- Core count
  - On-chip interconnect wiring and messaging

#### **Dependability**

- Commercial offerings (PaaS, SaaS, ...)
  - Service Level Agreements (SLAs) or Service Level Objectives (SLOs)

- Service accomplishment vs. interruption
  - Transitions: failures and restorations
- Module reliability
  - Mean time to failure (MTTF)
    - 1/MTTF = failure rate, commonly reported as Failures In Time (FIT): failures per billion hours of operation
  - Mean time to repair (MTTR)
  - Mean time between failures (MTBF) = MTTF + MTTR
  - Availability = MTTF / MTBF = MTTF / (MTTF+MTTR)
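
The reliability definitions above translate directly into a few lines of arithmetic. The sketch below uses a hypothetical module with an MTTF of 1,000,000 hours and an MTTR of 24 hours; both figures and all function names are assumptions chosen for illustration.

```python
HOURS_PER_BILLION = 1e9

def fit(mttf_hours: float) -> float:
    """Failure rate (1/MTTF) expressed as failures per billion hours."""
    return HOURS_PER_BILLION / mttf_hours

def mtbf(mttf_hours: float, mttr_hours: float) -> float:
    """Mean time between failures = MTTF + MTTR."""
    return mttf_hours + mttr_hours

def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Availability = MTTF / (MTTF + MTTR)."""
    return mttf_hours / mtbf(mttf_hours, mttr_hours)

# Hypothetical module: fails once per million hours, takes a day to repair.
MTTF = 1_000_000.0
MTTR = 24.0

print(f"FIT          = {fit(MTTF):.0f}")
print(f"MTBF         = {mtbf(MTTF, MTTR):.0f} hours")
print(f"Availability = {availability(MTTF, MTTR):.6f}")
```

Shortening MTTR raises availability just as lengthening MTTF does, which is why repair time shows up in service-level targets.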

#### **Trends in Cost**

- Cost driven down by learning curve
  - How much we've learned about the manufacturing process
  - Yield varies at various price-points
    - High-end vs. low-end parts:
      - IBM Cell and PS3
      - Intel Xeon Phi and Tianhe-2's Xeon Phi
- DRAM: price closely tracks cost
  - Standards, competition, patents
- Microprocessors: price depends on volume
  - 10% less for each doubling of volume
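
The volume rule of thumb above ("10% less for each doubling of volume") can be written as a small function. This is a sketch under that one assumption; the base price and volumes are hypothetical, and `price_at_volume` is an illustrative name.

```python
import math

def price_at_volume(base_price: float, base_volume: float, volume: float,
                    drop_per_doubling: float = 0.10) -> float:
    """Price after scaling volume, losing drop_per_doubling per doubling."""
    doublings = math.log2(volume / base_volume)
    return base_price * (1.0 - drop_per_doubling) ** doublings

# Hypothetical part: 100 units of currency at a volume of 1,000 parts.
for volume in (1_000, 2_000, 4_000, 8_000):
    print(volume, round(price_at_volume(100.0, 1_000, volume), 2))
```

Two doublings take the hypothetical price from 100 to about 81, since the 10% reductions compound rather than add.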

#### **Measuring Performance**

- Typical performance metrics:
  - Response time
  - Throughput
- Speedup of X relative to Y
  - Execution time<sub>Y</sub> / Execution time<sub>X</sub>
  - Geometric average is best suited for combining relative values
    - Geometric average = $\sqrt[n]{\prod_{i=1}^{n} a_i} = \sqrt[n]{a_1 \times a_2 \times \dots \times a_n}$
- Execution time
  - Wall clock time: includes all system overheads
  - CPU time: only computation time
- Benchmarks
  - Kernels (e.g. matrix multiply)
  - Toy programs (e.g. sorting)
  - Synthetic benchmarks (e.g. Dhrystone)
  - Benchmark suites (e.g. SPEC06fp, TPC-C)
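
The speedup and geometric average definitions above can be sketched as follows. The execution times are invented for illustration, and the function names are not from the slides.

```python
import math

def speedup(time_y: float, time_x: float) -> float:
    """Speedup of X relative to Y: execution time of Y / execution time of X."""
    return time_y / time_x

def geometric_mean(values) -> float:
    """n-th root of the product of n positive values."""
    values = list(values)
    return math.prod(values) ** (1.0 / len(values))

# Hypothetical execution times (seconds) on three benchmarks.
times_y = [10.0, 20.0, 40.0]
times_x = [5.0, 20.0, 10.0]

speedups = [speedup(y, x) for y, x in zip(times_y, times_x)]
print("per-benchmark speedups:", speedups)
print("geometric mean speedup:", round(geometric_mean(speedups), 3))

# The geometric mean of the ratios equals the ratio of the geometric means,
# so the summary does not depend on which machine is taken as the reference.
ratio_of_means = geometric_mean(times_y) / geometric_mean(times_x)
print("ratio of geometric means:", round(ratio_of_means, 3))
```

That reference-independence is the reason the geometric average is preferred for relative values; an arithmetic mean of ratios does not have this property.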

#### **Optimize CPI or IPC?**

- 1980s
  - RISC era
  - Minimize CPI
- 1990s
  - Superscalar RISC
  - Maximize IPC
- 2000s+
  - x86 ISA
  - Optimize both: x86 ISA and microcode