CS252 Graduate Computer Architecture Lecture 12 Branch Prediction Possible Projects

CS252 Graduate Computer Architecture, Lecture 12: Branch Prediction, Possible Projects. October 8th, 2003. Prof. John Kubiatowicz. http://www.cs.berkeley.edu/~kubitron/courses/cs252-F03

CS252 Projects • Two People from this class • Projects can overlap with other classes • Exceptions to the two person requirement need to be OK’d • Amount of work: 3 Solid Weeks of work • Spread over the remainder of the term • Should be a miniature research project • State of the art (can’t redo something that others have done) • Should be publishable work • Must have solid methodology! • Elements: • Base architecture to measure against • Simulation or other analysis against some application set • Several variations on a theme

CS252 Projects • DynaCOMP related (or Introspective Computing) • OceanStore related • Smart Dust/NEST • ROC Related Projects • BRASS project related • Benchmarking Related (Yelick)

DynaCOMP: Introspective Computing • Biological analogs for computer systems: • Continuous adaptation • Insensitivity to design flaws • Both hardware and software • Necessary if we can never be sure that all components are working properly… • Examples: • ISTORE -- applies introspective computing to disk storage • DynaComp -- applies introspective computing at chip level • Compiler always running and part of execution! [Figure: cycle labeled Monitor, Compute, Adapt]

DynaCOMP Vision Statement • Modern microprocessors gather profile information in hardware in order to generate predictions: Branches, dependencies, and values. • Processors such as the Pentium-II employ a primitive form of “compilation” to translate x86 operations into internal RISC-like micro-ops. • So, why not do all of this in software? Make use of a combination of explicit monitoring, dynamic compilation technology, and genetic algorithms to: • Simplify hardware, possibly using large on-chip multiprocessors built from simple processors. • Improve performance through feedback-driven optimization. Continuous: Execution, Monitoring, Analysis, Recompilation • Generate design complexity automatically so that designers are not required to. Use of explicit proof verification techniques to verify that code generation is correct. • This is aptly called Introspective Computing • Related idea: use of continuous observation to reduce power on buses!

The Thermodynamic Analogy • Large Systems have a variety of latent order • Connections between elements • Mathematical structure (erasure coding, etc) • Distributions peaked about some desired behavior • Permits “Stability through Statistics” • Exploit the behavior of aggregates (redundancy) • Subject to Entropy • Servers/components fail, attacks happen, system changes • Requires continuous repair • Apply energy (i.e. through servers) to reduce entropy • Introspection restores distributions

ThermoSpective Comp • Many Redundant Components (Fault Tolerance) • Continuous Repair (Entropy Reduction) • What about NanoComputing Domain? • How will you build reliable systems from unreliable components? [Figure: loop labeled Adapt, Monitor]

OceanStore Vision

Ubiquitous Devices → Ubiquitous Storage • Consumers of data move, change from one device to another, work in cafes, cars, airplanes, the office, etc. • Properties REQUIRED for Endeavour storage substrate: • Strong Security: data must be encrypted whenever in the infrastructure; resistance to monitoring • Coherence: too much data for naïve users to keep coherent “by hand” • Automatic replica management and optimization: huge quantities of data cannot be managed manually • Simple and automatic recovery from disasters: probability of failure increases with size of system • Utility model: world-scale system requires cooperation across administrative boundaries

Utility-based Infrastructure • Service provided by confederation of companies • Monthly fee paid to one service provider • Companies buy and sell capacity from each other [Figure: provider domains labeled Canadian OceanStore, Sprint, AT&T, IBM, Pac Bell, IBM]

Preliminary Smart Dust Mote Brett Warneke, Bryan Atwood, Kristofer Pister Berkeley Sensor and Actuator Center Dept. of Electrical Engineering and Computer Sciences University of California, Berkeley

Smart Dust 1-2mm

COTS Dust GOAL: • Get our feet wet RESULT: • Cheap, easy, off-the-shelf RF systems • Fantastic interest in cheap, easy, RF: • Industry • Berkeley Wireless Research Center • Center for the Built Environment (IUCRC) • PC Enabled Toys (Intel) • Endeavour Project (UCB) • Optical proof of concept

Smart Dust/Micro Server Projects • David Culler and Kris Pister collaborating • What is the proper operating system for devices of this nature? • Linux or Windows is not appropriate! • State machine execution model is much simpler! • Assume that little device is backed by servers in net. • Questions of hardware/software tradeoffs • What is the high-level organization of zillions of dust motes in the infrastructure??? • What type of computational/communication ability provides the right tradeoff between functionality and power consumption???

A glimpse into the future? • System-on-a-chip enables computer, memory, redundant network interfaces without significantly increasing size of disk • ISTORE HW in 5-7 years: • 2006 brick: System On a Chip integrated with MicroDrive • 9GB disk, 50 MB/sec from disk • connected via crossbar switch • From brick to “domino” • If low power, 10,000 nodes fit into one rack! • O(10,000) scale is our ultimate design point

ROC vision: Storage System of the Future • Availability, Maintainability, and Evolutionary growth key challenges for storage systems • Maintenance Cost ~ >10X Purchase Cost per year • Even 2X purchase cost for 1/2 maintenance cost wins • AME improvement enables even larger systems • ISTORE has cost-performance advantages • Better space, power/cooling costs ($@colocation site) • More MIPS, cheaper MIPS, no bus bottlenecks • Compression reduces network $, encryption protects • Single interconnect, supports evolution of technology • Match to future software storage services • Future storage service software targets clusters

Is Maintenance the Key? • Rule of Thumb: Maintenance 10X to 100X HW • so over 5 year product life, ~ 95% of cost is maintenance • VAX crashes ‘85, ‘93 [Murp95]; extrap. to ‘01 • Sys. Man.: N crashes/problem, SysAdmin action • Actions: set params bad, bad config, bad app install • HW/OS 70% in ‘85 to 28% in ‘93. In ‘01, 10%?

Availability benchmark methodology • Goal: quantify variation in QoS metrics as events occur that affect system availability • Leverage existing performance benchmarks • to generate fair workloads • to measure & trace quality of service metrics • Use fault injection to compromise system • hardware faults (disk, memory, network, power) • software faults (corrupt input, driver error returns) • maintenance events (repairs, SW/HW upgrades) • Examine single-fault and multi-fault workloads • the availability analogues of performance micro- and macro-benchmarks

Quantum Architecture: Use of “Spin” for QuBits • Spin ½ particle (proton/electron); representation: |0> or |1> [Figure: particle shown in two orientations with North and South poles labeled] • Quantum effect gives “1” and “0”: • Either spin is “UP” or “DOWN”, nothing in between • Superposition: mix of “1” and “0”: • Written as: $|\psi\rangle = C_0|0\rangle + C_1|1\rangle$ • An n-bit register can have $2^n$ values simultaneously! $|\psi\rangle = C_{000}|000\rangle + C_{001}|001\rangle + C_{010}|010\rangle + C_{011}|011\rangle + C_{100}|100\rangle + C_{101}|101\rangle + C_{110}|110\rangle + C_{111}|111\rangle$

Skinner-Kane Si based computer • Silicon substrate • Phosphorus ion spin + donor electron spin = qubit • A-gate • Hyperfine interaction • Electron-ion spin swap • S-gate • Electron shuttling • Global magnetic field • 0 <> 1 qubit flip • Single-electron transistors • Qubit readout

Interesting Ubiquitous Component: The Entropy Exchange Unit • Possibilities for cooling: • Spin-polarized photons, spin-polarized electrons, spin-polarized nucleons • Simple thermal cooling of some sort • Two material domains: • One material in contact with environment • Analysis of properties of such a system [Figure: unit labeled Cooling, with “Garbage In” (#!$**#) and “Zeros Out” (000000)]

Swap cell: electron-ion spin swap • A lot of steps for two qubits! [Figure: sequence of swap-cell states showing electrons e1- and e2- and two P ions]

Swap Cell Control Complexity • What a mess! Long pulse sequence… [Figure: control signals plotted against time for the electron-ion spin swap, annotated “Electrons are too close”]

Single-electron transistors (SETs) Y. Takahashi et al. • Electrons move one-by-one through tunnel junction onto quantum dot and out other side • Work well at low temperatures • Low drive current (~5nA) and voltage swing (~40mV)

Swap control circuit: ACK! • S-gate pulse cascade • A-gate pulse repeats 24 times • On-off A-gate pulse ratio (2:254) • Can this even be built with SETs?

In SIMD we trust? • Large control circuit/small swap cell ratio = SIMD • Like clock distribution network • Clock skew at 11.3GHz? • Error correction?

Brass Vision Statement • The emergence of high capacity reconfigurable devices is igniting a revolution in general-purpose processing. It is now becoming possible to tailor and dedicate functional units and interconnect to take advantage of application dependent dataflow. Early research in this area of reconfigurable computing has shown encouraging results in a number of spot areas including cryptography, signal processing, and searching --- achieving 10-100x computational density and reduced latency over more conventional processor solutions. • BRASS: Microprocessor & FPGA on single chip: • use some of millions of transistors to customize HW dynamically to application

Architecture Target • Integrated RISC core + memory system + reconfigurable array. • Combined RAM/Logic structure. • Rapid reconfiguration with many contexts. • Large local data memories and buffers. • These capabilities enable: • hardware virtualization • on-the-fly specialization [Figure labels: 128 LUTs, 2Mbit]

SCORE: Stream-oriented computation model Goal: Provide view of reconfigurable hardware which exposes strengths while abstracting physical resources. • Computations are expressed as data-flow graphs. • Graphs are broken up into compute pages. • Compute pages are linked together in a data-flow manner with streams. • A run-time manager allocates and schedules pages for computations and memory.

Ok. Back to Branch Prediction

Review: Problem: “Fetch” unit • Instruction fetch decoupled from execution • Often issue logic (+ rename) included with Fetch [Figure: Instruction Fetch with Branch Prediction sends a Stream of Instructions To Execute to the Out-Of-Order Execution Unit, which returns Correctness Feedback On Branch Results]

Branches must be resolved quickly for loop overlap! • In our loop-unrolling example, we relied on the fact that branches were under control of “fast” integer unit in order to get overlap!
    Loop: LD    F0, 0(R1)
          MULTD F4, F0, F2
          SD    F4, 0(R1)
          SUBI  R1, R1, #8
          BNEZ  R1, Loop
• What happens if branch depends on result of MULTD?? • We completely lose all of our advantages! • Need to be able to “predict” branch outcome. • If we were to predict that branch was taken, this would be right most of the time. • Problem much worse for superscalar machines!

Review: Predicated Execution • Avoid branch prediction by turning branches into conditionally executed instructions: if (x) then A = B op C else NOP • If false, then neither store result nor cause exception • Expanded ISAs of Alpha, MIPS, PowerPC, and SPARC have conditional move; PA-RISC can annul any following instr. • IA-64: 64 1-bit condition fields, selectable so that any instruction can be conditionally executed • This transformation is called “if-conversion” • Drawbacks to conditional instructions • Still takes a clock even if “annulled” • Stall if condition evaluated late • Complex conditions reduce effectiveness; condition becomes known late in pipeline [Figure: instruction A = B op C guarded by condition x]

Dynamic Branch Prediction Problem • Incoming stream of addresses • Fast outgoing stream of predictions • Correction information returned from pipeline [Figure: Branch Predictor holding History Information; inputs are Incoming Branches { Address } and Corrections { Address, Value }; output is Prediction { Address, Value }]

Review: Branch Target Buffer • Branch Target Buffer (BTB): address of branch is used as index to get prediction AND branch target address (if taken) • Note: must check for branch match now, since can’t use wrong branch address (Figure 4.22, p. 273) • Return instruction addresses predicted with stack • Remember branch folding (Crisp processor)? [Figure: PC of instruction in FETCH is compared (=?) against stored Branch PC entries; a match supplies the Predicted PC and a taken/untaken prediction]
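The tagged lookup on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the organization of any particular machine: the class name `BranchTargetBuffer`, the 16-entry direct-mapped layout, and the 4-byte instruction alignment are all assumptions made for the example.

```python
class BranchTargetBuffer:
    """Direct-mapped BTB sketch: each entry stores (tag, predicted target PC)."""

    def __init__(self, num_entries=16):
        # num_entries is an assumed size chosen for illustration only.
        self.num_entries = num_entries
        self.entries = [None] * num_entries

    def _index_and_tag(self, pc):
        word = pc >> 2  # assumption: instructions are 4-byte aligned
        return word % self.num_entries, word // self.num_entries

    def lookup(self, pc):
        """Return the predicted target, or None when no matching branch is stored."""
        index, tag = self._index_and_tag(pc)
        entry = self.entries[index]
        if entry is not None and entry[0] == tag:  # the "=?" branch match
            return entry[1]
        return None

    def update(self, pc, taken, target):
        """Store taken branches; evict the entry when the same branch is not taken."""
        index, tag = self._index_and_tag(pc)
        if taken:
            self.entries[index] = (tag, target)
        elif self.entries[index] is not None and self.entries[index][0] == tag:
            self.entries[index] = None


btb = BranchTargetBuffer()
btb.update(0x1000, taken=True, target=0x2000)
hit = btb.lookup(0x1000)   # same branch: the stored target is returned
miss = btb.lookup(0x1040)  # maps to the same entry but the tag differs
```

The tag comparison is what prevents a different instruction that happens to share the index from being redirected to the wrong address.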

Branch (Pattern?) History Table • BHT is a table of “Predictors” • Usually 2-bit, saturating counters • Indexed by PC address of Branch – without tags • In Fetch stage of branch: • BTB identifies branch • Predictor from BHT used to make prediction • When branch completes • Update corresponding Predictor [Figure: Branch PC indexes a table of entries Predictor 0 through Predictor 7]
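As a rough sketch of the table described here, the Python below indexes eight 2-bit saturating counters with low-order PC bits and no tags. The class name, the initial counter value (weakly not taken), and the 4-byte alignment are assumptions for illustration.

```python
class BranchHistoryTable:
    """Untagged table of 2-bit saturating counters indexed by low PC bits."""

    def __init__(self, num_entries=8):
        self.num_entries = num_entries
        # Counter values 0 and 1 predict not taken; 2 and 3 predict taken.
        # Starting at 1 (weakly not taken) is an assumption.
        self.counters = [1] * num_entries

    def _index(self, pc):
        return (pc >> 2) % self.num_entries  # assumption: 4-byte instructions

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


bht = BranchHistoryTable()
for _ in range(3):
    bht.update(0x100, taken=True)   # train one branch as taken

trained = bht.predict(0x100)    # the trained branch
aliased = bht.predict(0x120)    # different branch, same table index
untouched = bht.predict(0x104)  # different index, still at its initial value
```

Because there are no tags, the branch at `0x120` inherits the prediction trained by the branch at `0x100`. That aliasing is the cost of dropping the tag check.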

Review: Dynamic Branch Prediction (Jim Smith, 1981) • “Predictor”: 2-bit scheme where prediction changes only after two mispredictions in a row • Red: stop, not taken • Green: go, taken • Adds hysteresis to decision making process [Figure: four-state diagram with two Predict Taken states and two Predict Not Taken states; transitions labeled T and NT]
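To see what the hysteresis buys, the sketch below runs a 1-bit and a 2-bit saturating counter over the same loop branch. The function name, the initial state (strongly taken), and the loop shape (taken three times, then not taken, repeated ten times) are assumptions chosen for the example. The saturating counter is one common form of the 2-bit scheme and may differ in detail from the state diagram on the slide.

```python
def count_mispredictions(outcomes, bits):
    """Run one n-bit saturating counter over outcomes; return the miss count."""
    max_state = (1 << bits) - 1
    threshold = 1 << (bits - 1)  # states at or above this predict taken
    state = max_state            # assumed initial state: strongly taken
    misses = 0
    for taken in outcomes:
        predicted_taken = state >= threshold
        if predicted_taken != taken:
            misses += 1
        # Move toward the observed outcome, saturating at both ends.
        state = min(max_state, state + 1) if taken else max(0, state - 1)
    return misses


# A loop-closing branch: taken three times, then falls through; loop run 10 times.
outcomes = ([True] * 3 + [False]) * 10

one_bit = count_mispredictions(outcomes, bits=1)
two_bit = count_mispredictions(outcomes, bits=2)
```

With these inputs the 1-bit counter misses twice per loop execution after the first (once at the exit and once on re-entry), for 19 misses in total. The 2-bit counter misses only at each exit, for 10.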

Correlating Branches • Hypothesis: recent branches are correlated; that is, behavior of recently executed branches affects prediction of current branch • Two possibilities; current branch depends on: • Last m most recently executed branches anywhere in program. Produces a “GA” (for “global adaptive”) in the Yeh and Patt classification (e.g. GAg) • Last m most recent outcomes of same branch. Produces a “PA” (for “per-address adaptive”) in same classification (e.g. PAg) • Idea: record m most recently executed branches as taken or not taken, and use that pattern to select the proper branch history table entry • A single history table shared by all branches (appends a “g” at end), indexed by history value. • Address is used along with history to select table entry (appends a “p” at end of classification) • If only portion of address used, often appends an “s” to indicate “set-indexed” tables (e.g. GAs)
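A small Python sketch of the two-level idea, covering the GAg and PAg cases: one or more history registers select an entry in a single shared table of 2-bit counters. The class name, the 4-bit history length, the 16 per-address history registers, and the initial counter value are assumptions for illustration.

```python
class TwoLevelPredictor:
    """Two-level adaptive predictor sketch covering the GAg and PAg cases."""

    def __init__(self, history_bits=4, per_address=False, num_histories=16):
        self.history_mask = (1 << history_bits) - 1
        # GAg keeps one global history register; PAg keeps one per address slot.
        self.num_histories = num_histories if per_address else 1
        self.histories = [0] * self.num_histories
        # One shared ("g") table of 2-bit counters, indexed by history value.
        self.pattern_table = [1] * (1 << history_bits)

    def _slot(self, pc):
        return (pc >> 2) % self.num_histories  # assumption: 4-byte instructions

    def predict(self, pc):
        return self.pattern_table[self.histories[self._slot(pc)]] >= 2

    def update(self, pc, taken):
        slot = self._slot(pc)
        history = self.histories[slot]
        counter = self.pattern_table[history]
        self.pattern_table[history] = min(3, counter + 1) if taken else max(0, counter - 1)
        # Shift the newest outcome into the history register.
        self.histories[slot] = ((history << 1) | int(taken)) & self.history_mask


def run(predictor, pc, outcomes):
    """Return a list of booleans: True where the prediction matched the outcome."""
    correct = []
    for taken in outcomes:
        correct.append(predictor.predict(pc) == taken)
        predictor.update(pc, taken)
    return correct


alternating = [i % 2 == 0 for i in range(100)]  # T, NT, T, NT, ...
pag_correct = run(TwoLevelPredictor(per_address=True), 0x40, alternating)
gag_correct = run(TwoLevelPredictor(per_address=False), 0x40, alternating)
```

A strictly alternating branch defeats a lone 2-bit counter, but once the history pattern has been seen a few times, the pattern table predicts it correctly. With only one branch in the trace, GAg and PAg behave identically; they diverge when several branches interleave.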

Discussion of Yeh and Patt classification • GAg: Global History Register, Global History Table • PAg: Per-Address History Register, Global History Table • PAp: Per-Address History Register, Per-Address History Table [Figure: GAg uses GBHR and GPHT; PAg uses PABHR and GPHT; PAp uses PABHR and PAPHT]

Other Global Variants: Try to Avoid Aliasing • GAs: Global History Register, Per-Address (Set Associative) History Table • Gshare: Global History Register, Global History Table with simple attempt at anti-aliasing [Figure: GAs selects a PAPHT entry using GBHR and Address; GShare combines GBHR and Address to index a GPHT]
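In gshare, the "simple attempt at anti-aliasing" is to XOR the global history with branch address bits to form the table index. A minimal Python sketch follows; the class name, the 4-bit index width used in the demo, and the initial counter value are assumptions.

```python
class GsharePredictor:
    """gshare sketch: global history XOR branch address indexes one counter table."""

    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        self.history = 0                           # global branch history register
        self.counters = [1] * (1 << index_bits)    # 2-bit saturating counters

    def index(self, pc):
        # XOR spreads branches that share a history pattern across the table.
        return ((pc >> 2) ^ self.history) & self.mask

    def predict(self, pc):
        return self.counters[self.index(pc)] >= 2

    def update(self, pc, taken):
        i = self.index(pc)
        self.counters[i] = min(3, self.counters[i] + 1) if taken else max(0, self.counters[i] - 1)
        self.history = ((self.history << 1) | int(taken)) & self.mask


gshare = GsharePredictor(index_bits=4)
gshare.history = 0b1010          # pretend four branches have already executed
index_a = gshare.index(0x100)    # two different branches, same global history
index_b = gshare.index(0x104)
gag_index = gshare.history       # GAg would use the history alone for both
```

Under GAg both branches would land on entry `0b1010`. With the XOR they land on different entries, so they do not overwrite each other's counters for this history pattern. Collisions still occur for other address and history combinations.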

What are Important Metrics? • Clearly, Hit Rate matters • Even 1% can be important when above 90% hit rate • Speed: Does this affect cycle time? • Space: Clearly Total Space matters! • Papers which do not try to normalize across different options are playing fast and loose with data • Try to get best performance for the cost
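A back-of-the-envelope Python calculation shows why one percentage point matters above 90%. The base CPI of 1.0, the 20% branch frequency, and the 10-cycle misprediction penalty are assumed numbers for illustration, not figures from the lecture.

```python
def cpi_with_branches(base_cpi, branch_fraction, hit_rate, penalty_cycles):
    """CPI after adding the average misprediction stall per instruction."""
    return base_cpi + branch_fraction * (1.0 - hit_rate) * penalty_cycles


# Assumed machine parameters (hypothetical).
BASE_CPI = 1.0
BRANCH_FRACTION = 0.20
PENALTY_CYCLES = 10

cpi_90 = cpi_with_branches(BASE_CPI, BRANCH_FRACTION, 0.90, PENALTY_CYCLES)
cpi_91 = cpi_with_branches(BASE_CPI, BRANCH_FRACTION, 0.91, PENALTY_CYCLES)

# Going from 90% to 91% removes one tenth of all mispredictions.
miss_reduction = (0.10 - 0.09) / 0.10
```

Under these assumptions the CPI drops from about 1.20 to about 1.18. The relative payoff of each further point grows as the remaining misprediction rate shrinks, and as the penalty and issue width increase.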

Accuracy of Different Schemes (Figure 4.21, p. 272) [Figure: Frequency of Mispredictions, axis from 0% to 18%, comparing a 4096-entry 2-bit BHT, an unlimited-entry 2-bit BHT, and a 1024-entry (2,2) BHT]

Discussion of Papers • A Comparative Analysis of Schemes for Correlated Branch Prediction • Cliff Young, Nicolas Gloy and Michael D. Smith • An Analysis of Correlation and Predictability: What Makes Two-Level Branch Predictors Work? • Marius Evers, Sanjay J. Patel, Robert S. Chappell, and Yale N. Patt

Summary #1: Dynamic Branch Prediction • Prediction becoming important part of scalar execution. • Prediction is exploiting “information compressibility” in execution • Branch History Table: 2 bits for loop accuracy • Correlation: Recently executed branches correlated with next branch. • Either different branches (GA) • Or different executions of same branches (PA). • Branch Target Buffer: include branch address & prediction • Predicated Execution can reduce number of branches, number of mispredicted branches

Summary #2 • Prediction, prediction, prediction! • Over next couple of lectures, we will explore prediction of everything! Branches, Dependencies, Data • The high prediction accuracies will cause us to ask: • Is the deterministic Von Neumann model the right one???
