
CS 258 Parallel Computer Architecture Lecture 1 Introduction to Parallel Architecture





Presentation Transcript


  1. CS 258 Parallel Computer Architecture, Lecture 1: Introduction to Parallel Architecture January 23, 2008 Prof. John D. Kubiatowicz

  2. Computer Architecture Is … “the attributes of a [computing] system as seen by the programmer, i.e., the conceptual structure and functional behavior, as distinct from the organization of the data flows and controls, the logic design, and the physical implementation.” — Amdahl, Blaauw, and Brooks, 1964

  3. The Instruction Set: a Critical Interface (sits between software and hardware) • Properties of a good abstraction • Lasts through many generations (portability) • Used in many different ways (generality) • Provides convenient functionality to higher levels • Permits an efficient implementation at lower levels • Changes very slowly! (although the rate of change is increasing) • Is there a solid interface for multiprocessors? • No standard hardware interface

  4. What is Parallel Architecture? • A parallel computer is a collection of processing elements that cooperate to solve large problems • Some broad issues: • Models of computation: PRAM? BSP? Sequential Consistency? • Resource Allocation: • how large a collection? • how powerful are the elements? • how much memory? • Data access, Communication and Synchronization • how do the elements cooperate and communicate? • how are data transmitted between processors? • what are the abstractions and primitives for cooperation? • Performance and Scalability • how does it all translate into performance? • how does it scale?

  5. Topologies of Parallel Machines • Symmetric Multiprocessor • Multiple processors in one box with shared-memory communication over a bus • Current multicore chips are like this • Every processor runs a copy of the OS • Non-uniform shared memory with separate I/O through a host • Multiple processors • Each with local memory • General scalable network • Extremely light “OS” on each node provides simple services • Scheduling/synchronization • Network-accessible host for I/O • Cluster • Many independent machines connected with a general network • Communication through messages

  6. Conventional Wisdom (CW) in Computer Architecture 1. Old CW: Power is free, but transistors are expensive • New CW is the “Power wall”: Power is expensive, but transistors are “free” • Can put more transistors on a chip than we have the power to turn on 2. Old CW: Only concern is dynamic power • New CW: For desktops and servers, static power due to leakage is 40% of total power 3. Old CW: Monolithic uniprocessors are reliable internally, with errors occurring only at the pins • New CW: As chips drop below 65 nm feature sizes, they will have high soft and hard error rates

  7. Conventional Wisdom (CW) in Computer Architecture 4. Old CW: By building upon prior successes, continue raising the level of abstraction and size of HW designs • New CW: Wire delay, noise, cross coupling, reliability, clock jitter, design validation, … stretch the development time and cost of large designs at ≤65 nm 5. Old CW: Researchers demonstrate new architectures by building chips • New CW: Cost of 65 nm masks, cost of ECAD, and design time for GHz clocks ⇒ researchers can no longer build believable chips 6. Old CW: Performance improves latency & bandwidth • New CW: BW improves > (latency improvement)²

  8. Conventional Wisdom (CW) in Computer Architecture 7. Old CW: Multiplies are slow, but loads and stores are fast • New CW is the “Memory wall”: Loads and stores are slow, but multiplies are fast • 200 clocks to DRAM, but even FP multiplies take only 4 clocks 8. Old CW: We can reveal more ILP via compilers and architecture innovation • Branch prediction, OOO execution, speculation, VLIW, … • New CW is the “ILP wall”: Diminishing returns on finding more ILP 9. Old CW: 2X CPU performance every 18 months • New CW: Power Wall + Memory Wall + ILP Wall = Brick Wall
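The memory-wall numbers above can be made concrete with a back-of-the-envelope CPI calculation. The DRAM and multiply latencies come from the slide; the 30% memory-instruction mix and 5% miss rate are assumed values chosen only to illustrate the effect:

```python
# Sketch: how a 200-cycle DRAM access dominates execution time.
# Latencies from the slide; instruction mix and miss rate are assumptions.
DRAM_LATENCY = 200    # cycles per cache miss (from the slide)
FP_MUL_LATENCY = 4    # cycles per FP multiply (from the slide)

base_cpi = 1.0        # ideal cycles per instruction
mem_frac = 0.30       # assumed fraction of instructions that are loads/stores
miss_rate = 0.05      # assumed fraction of those that miss all the way to DRAM

avg_cpi = base_cpi + mem_frac * miss_rate * DRAM_LATENCY
print(avg_cpi)  # 4.0 -> memory stalls quadruple the effective CPI
```

Even at a 5% miss rate, memory stalls cost three times as much as all other work combined, while a 4-cycle multiply barely registers; that asymmetry is the "memory wall".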

  9. Uniprocessor Performance (SPECint) From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, Sept. 15, 2006 (graph annotation: 3X) ⇒ Sea change in chip design: multiple “cores” or processors per chip • VAX: 25%/year, 1978 to 1986 • RISC + x86: 52%/year, 1986 to 2002 • RISC + x86: ??%/year, 2002 to present

  10. Sea Change in Chip Design • Intel 4004 (1971): 4-bit processor, 2312 transistors, 0.4 MHz, 10 micron PMOS, 11 mm² chip • RISC II (1983): 32-bit, 5-stage pipeline, 40,760 transistors, 3 MHz, 3 micron NMOS, 60 mm² chip • A 125 mm² chip in 0.065 micron CMOS = 2312 copies of RISC II + FPU + I-cache + D-cache • RISC II shrinks to ≈ 0.02 mm² at 65 nm • Caches via DRAM or 1-transistor SRAM (www.t-ram.com) or 3D chip stacking • Proximity Communication via capacitive coupling at > 1 TB/s? (Ivan Sutherland @ Sun / Berkeley) • Processor is the new transistor?

  11. ManyCore Chips: The future is here • Intel 80-core multicore chip (Feb 2007) • 80 simple cores • Two floating-point engines per core • Mesh-like “network-on-a-chip” • 100 million transistors • 65 nm feature size

  Frequency  Voltage  Power  Bandwidth        Performance
  3.16 GHz   0.95 V    62 W  1.62 Terabits/s  1.01 Teraflops
  5.1 GHz    1.2 V    175 W  2.61 Terabits/s  1.63 Teraflops
  5.7 GHz    1.35 V   265 W  2.92 Terabits/s  1.81 Teraflops

  • “ManyCore” refers to many processors per chip • 64? 128? Hard to say an exact boundary • How to program these? • Use 2 CPUs for video/audio • Use 1 for word processor, 1 for browser • 76 for virus checking??? • Something new is clearly needed here…

  12. Conventional Wisdom (CW) in Computer Architecture 10. Old CW: Increasing clock frequency is the primary method of performance improvement • New CW: Processor parallelism is the primary method of performance improvement 11. Old CW: Don’t bother parallelizing your app, just wait and run it on a much faster sequential computer • New CW: Very long wait for a faster sequential CPU • 2X uniprocessor performance takes 5 years? • End of the La-Z-Boy Programming Era 12. Old CW: Less than linear scaling ⇒ failure • New CW: Given the switch to parallel hardware, even sublinear speedups are beneficial 13. New CW: New Moore’s Law is 2X processors (“cores”) per chip every technology generation, but same clock rate • “This shift toward increasing parallelism is not a triumphant stride forward based on breakthroughs …; instead, this … is actually a retreat from even greater challenges that thwart efficient silicon implementation of traditional solutions.” The Parallel Computing Landscape: A Berkeley View, Dec 2006

  13. Déjà vu all over again? • Multiprocessors imminent in the 1970s, ’80s, ’90s, … • “… today’s processors … are nearing an impasse as technologies approach the speed of light …” David Mitchell, The Transputer: The Time Is Now (1989) • Transputer was premature ⇒ Custom multiprocessors strove to lead uniprocessors ⇒ Procrastination rewarded: 2X sequential perf. / 1.5 years • “We are dedicating all of our future product development to multicore designs. … This is a sea change in computing.” Paul Otellini, President, Intel (2004) • Difference is that all microprocessor companies have switched to multiprocessors (AMD, Intel, IBM, Sun; all new Apples have 2 CPUs) ⇒ Procrastination penalized: 2X sequential perf. / 5 yrs ⇒ Biggest programming challenge: going from 1 to 2 CPUs

  14. CS258: Information Instructor: Prof. John D. Kubiatowicz Office: 673 Soda Hall Phone: 643-6817 Email: kubitron@cs.berkeley.edu Office Hours: Wed 1:00–2:00 or by appt. Class: Mon, Wed 2:30–4:00pm, 310 Soda Hall Web page: http://www.cs/~kubitron/courses/cs258-S08/ Lectures available online before noon on the day of lecture Email: cs258@kubi.cs.berkeley.edu Signup link on web page (as soon as it is up)

  15. Computer Architecture Topics (252+) • Input/Output and Storage: disks, WORM, tape, RAID, emerging technologies, interleaving, bus protocols • Memory Hierarchy: DRAM, L2 cache, L1 cache — coherence, bandwidth, latency • Network communication with other processors • VLSI; addressing, protection, exception handling • Instruction Set Architecture • Pipelining and Instruction-Level Parallelism: pipelining, hazard resolution, superscalar, reordering, prediction, speculation, vector, dynamic compilation

  16. Computer Architecture Topics (258) • Multiprocessors: shared memory, message passing, data parallelism; transactional memory; checkpoint/restart • Networks and Interconnections: network interfaces; processor–memory–switch; topologies, routing, bandwidth, latency, reliability • Programming models/communication styles • Reliability/fault tolerance • Everything in the previous slide, but more so!

  17. What will you get out of CS258? • In-depth understanding of the design and engineering of modern parallel computers • technology forces • programming models • fundamental architectural issues • naming, replication, communication, synchronization • basic design techniques • cache coherence, protocols, networks, pipelining, … • methods of evaluation • from moderate to very large scale • across the hardware/software boundary • Study of REAL parallel processors • Research papers, white papers • Natural consequences?? • Massive Parallelism ⇒ Reconfigurable computing? • Message Passing Machines ⇒ NOW ⇒ Peer-to-peer systems?

  18. Will it be worthwhile? (Platform pyramid: SuperServers → Departmental Servers → Workstations → Personal Computers) • Absolutely! • Now, more than ever, industry is trying to figure out how to build these new multicore chips… • The fundamental issues and solutions translate across a wide spectrum of systems • Crisp solutions in the context of parallel machines • Pioneered at the thin end of the platform pyramid on the most demanding applications • migrate downward with time • Understand implications for software • Network-attached storage, MEMS, etc.?

  19. Why Study Parallel Architecture? • Role of a computer architect: • To design and engineer the various levels of a computer system to maximize performance and programmability within the limits of technology and cost. • Parallelism: • Provides an alternative to a faster clock for performance • Applies at all levels of system design • Is a fascinating perspective from which to view architecture • Is increasingly central in information processing • How is instruction-level parallelism related to coarse-grained parallelism??

  20. Is Parallel Computing Inevitable? This was certainly not clear just a few years ago Today, however: YES! • Industry is desperate for solutions! • Application demands: Our insatiable need for computing cycles • Technology Trends: Easier to build • Architecture Trends: Better abstractions • Current trends: • Today’s microprocessors are multiprocessors and/or have multiprocessor support • Servers and workstations becoming MP: Sun, SGI, DEC, COMPAQ!...

  21. Why Me? The Alewife Multiprocessor • Cache-coherent shared memory • Partially in software! • User-level message passing • Rapid context switching • Asynchronous network • One node per board

  22. Lecture style • 1-Minute Review • 20-Minute Lecture/Discussion • 5-Minute Administrative Matters • 25-Minute Lecture/Discussion • 5-Minute Break (water, stretch) • 25-Minute Lecture/Discussion • Instructor will come to class early & stay after to answer questions (Attention-vs.-time graph: attention falls after ~20 min.; the break and “In Conclusion, …” restore it)

  23. Course Methodology • Study existing designs through research papers • We will read about a number of real multiprocessors • We will discuss network router designs • We will discuss cache coherence protocols • Etc… • High-level goal: • Understand past solutions so that… • We can make proposals for ManyCore designs • This is a critical point in parallel architecture design • Industry is really looking for suggestions • Perhaps you can make them listen to your ideas???

  24. Research Paper Reading • As graduate students, you are now researchers. • Most information of importance to you will be in research papers. • Ability to rapidly scan and understand research papers is key to your success. • So: you will read lots of papers in this course! • Quick 1-paragraph summaries will be due in class • Students will take turns discussing papers • Papers will be scanned and posted on the web page.

  25. Textbook: Parallel Computer Architecture: A Hardware/Software Approach, by David Culler & Jaswinder Pal Singh — two leaders in the field. Covers a range of topics; we will not necessarily cover them in order.

  26. How will grading work? No TA This Term! Rough Breakdown: • 20% Paper Summaries/Presentations • 30% One Midterm • 40% Research Project (work in pairs) • meet 3 times with me to see progress • give oral presentation • give poster session • written report like conference paper • 6 weeks work full time for 2 people • Opportunity to do “research in the small” to help make transition from good student to research colleague • 10% Class Participation

  27. Application Trends (cycle: More Performance ⇒ New Applications ⇒ …) • Application demand for performance fuels advances in hardware, which enables new applications, which… • Cycle drives exponential increase in microprocessor performance • Drives parallel architecture harder • most demanding applications • Programmers willing to work really hard to improve high-end applications • Need incremental scalability: • Need a range of system performance with progressively increasing cost

  28. Metrics of Performance (by level of the system) • Application: answers per month, operations per second • Programming language / compiler: millions of instructions per second (MIPS), millions of (FP) operations per second (MFLOP/s) • ISA • Datapath / control: megabytes per second • Function units: cycles per second (clock rate) • Transistors, wires, pins • And what about: programmability, reliability, energy?
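The MIPS and MFLOP/s metrics above are simple ratios of event counts to time; a minimal sketch with hypothetical counts:

```python
# Deriving MIPS and MFLOP/s from raw counts (all numbers hypothetical).
instructions = 2_000_000_000   # instructions retired
fp_ops = 400_000_000           # floating-point operations performed
seconds = 2.5                  # wall-clock execution time

mips = instructions / seconds / 1e6    # millions of instructions per second
mflops = fp_ops / seconds / 1e6        # millions of FP operations per second
print(mips, mflops)  # 800.0 160.0
```

Note that these raw rates say nothing about programmability, reliability, or energy, which is exactly the slide's closing question.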

  29. Speedup • Speedup(p processors) = Time(1 processor) / Time(p processors) • Common mistake: • Compare parallel program on 1 processor to parallel program on p processors • Wrong! • Should compare the best uniprocessor program on 1 processor to the parallel program on p processors • Why? Keeps you honest • It is easy to parallelize overhead.
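The "common mistake" shows up clearly in a toy calculation; the timings below are hypothetical:

```python
# Honest vs. inflated speedup (hypothetical timings, in seconds).
t_best_sequential = 100.0  # tuned uniprocessor program on 1 CPU
t_parallel_1cpu = 120.0    # parallel program run on 1 CPU (carries overhead)
t_parallel_8cpu = 20.0     # parallel program on 8 CPUs

honest = t_best_sequential / t_parallel_8cpu   # correct baseline
inflated = t_parallel_1cpu / t_parallel_8cpu   # wrong baseline
print(honest, inflated)  # 5.0 6.0
```

The wrong baseline reports 6X instead of 5X because the parallelization overhead itself got "sped up" — that is what "it is easy to parallelize overhead" means.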

  30. Amdahl's Law Speedup due to enhancement E: Speedup(E) = ExTime(without E) / ExTime(with E) = Performance(with E) / Performance(without E) Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected. Then: Speedup(E) = 1 / ((1 − F) + F/S)
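Amdahl's Law can be sketched as a one-line function; the F and S values below are illustrative, not from the slide:

```python
def amdahl_speedup(F, S):
    """Overall speedup when fraction F of execution is accelerated by S
    and the remaining (1 - F) is unaffected."""
    return 1.0 / ((1.0 - F) + F / S)

# Speeding up 80% of a task by 10x yields well under 10x overall:
print(round(amdahl_speedup(0.8, 10), 2))   # 3.57
# Even an essentially infinite S is capped by the untouched 20%:
print(round(amdahl_speedup(0.8, 1e9), 2))  # 5.0
```

The second call shows the limiting case: as S grows without bound, speedup approaches 1 / (1 − F), here 5X.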

  31. Amdahl’s Law for parallel programs? Best you could ever hope to do (parallel fraction F on p processors, no overhead): Speedup(p) = 1 / ((1 − F) + F/p) ≤ 1 / (1 − F) Worse: overhead may kill your performance!
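Applying Amdahl's Law with p processors and an overhead term shows why overhead "kills" performance; the parallel fraction and the overhead model below are simplifying assumptions for illustration:

```python
def parallel_speedup(F, p, overhead=0.0):
    """Amdahl's law for p processors.  `overhead` models the parallel
    run's extra cost as a fixed fraction of the original serial runtime
    (a simplifying assumption, not the slide's formulation)."""
    return 1.0 / ((1.0 - F) + F / p + overhead)

# 95% parallel code on 64 processors, no overhead:
print(round(parallel_speedup(0.95, 64), 1))        # 15.4
# The same code with overhead worth 10% of the serial runtime:
print(round(parallel_speedup(0.95, 64, 0.10), 1))  # 6.1
```

A modest 10% overhead more than halves the achievable speedup, which is the slide's warning.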

  32. Where is Parallel Arch Going? Old view: Divergent architectures, no predictable pattern of growth. (Figure: Systolic Arrays, SIMD, Message Passing, Dataflow, Shared Memory — each with its own application software and system software stack) • Uncertainty of direction paralyzed parallel software development!

  33. Today • Extension of “computer architecture” to support communication and cooperation • Instruction Set Architecture plus Communication Architecture • Defines • Critical abstractions, boundaries, and primitives (interfaces) • Organizational structures that implement interfaces (hw or sw) • Compilers, libraries and OS are important bridges today • Still – not enough standardization! • Nothing close to the stable x86 instruction set!

  34. P.S. Parallel Revolution May Fail • John Hennessy, President, Stanford University, 1/07: “… when we start talking about parallelism and ease of use of truly parallel computers, we're talking about a problem that's as hard as any that computer science has faced. … I would be panicked if I were in industry.” “A Conversation with Hennessy & Patterson,” ACM Queue Magazine, 4:10, 1/07. • 100% failure rate of parallel computer companies • Convex, Encore, Inmos (Transputer), MasPar, NCUBE, Kendall Square Research, Sequent, (Silicon Graphics), Thinking Machines, … • What if IT goes from a growth industry to a replacement industry? • If SW can’t effectively use 32, 64, … cores per chip ⇒ SW no faster on new computer ⇒ only buy when the computer wears out

  35. Can programmers handle parallelism? • Historically: Humans not as good at parallel programming as they would like to think! • Need good model to think of machine • Architects pushed on instruction-level parallelism really hard, because it is “transparent” • Can compiler extract parallelism? • Sometimes • How do programmers manage parallelism?? • Language to express parallelism? • How to schedule varying number of processors? • Is communication Explicit (message-passing) or Implicit (shared memory)? • Are there any ordering constraints on communication?
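The explicit-vs.-implicit communication question above can be sketched in a few lines of Python. This is a toy analogy — threads stand in for processors — not how real message-passing or shared-memory hardware works:

```python
import threading
import queue

# Explicit communication (message passing): the producer sends a value
# through a queue; the two sides share no data structure directly.
q = queue.Queue()
received = []

def producer():
    q.put(sum(range(100)))      # send a message

def consumer():
    received.append(q.get())    # block until the message arrives

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
print(received[0])  # 4950

# Implicit communication (shared memory): both workers update the same
# location; the lock supplies the ordering constraint the slide asks about.
total = {"value": 0}
lock = threading.Lock()

def add(n):
    with lock:
        total["value"] += n

workers = [threading.Thread(target=add, args=(i,)) for i in range(10)]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(total["value"])  # 45
```

In the first half the ordering is implied by the send/receive pair; in the second half the programmer must impose ordering explicitly with the lock — the trade-off the slide's last two bullets raise.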

  36. Is it obvious that more cores ⇒ more performance? • AMBER molecular dynamics simulation program • Starting point was vector code for the Cray-1 • 145 MFLOPs on the Cray90; 406 for the final version on a 128-processor Paragon; 891 on a 128-processor Cray T3D

  37. Granularity: • Is communication fine- or coarse-grained? • Small messages vs. big messages • Is parallelism fine- or coarse-grained? • Small tasks (frequent synchronization) vs. big tasks • If hardware handles fine-grained parallelism, then it is easier to get incremental scalability • Fine-grained communication and parallelism are harder than coarse-grained: • Harder to build with low overhead • Custom communication architectures often needed • Ultimate coarse-grained communication: • GIMPS (Great Internet Mersenne Prime Search) • Communication once a month

  38. Current Commercial Computing targets • Relies on parallelism for high end • Computational power determines scale of business that can be handled • Databases, online-transaction processing, decision support, data mining, data warehousing ... • Google, Yahoo, …. • TPC benchmarks (TPC-C order entry, TPC-D decision support) • Explicit scaling criteria provided • Size of enterprise scales with size of system • Problem size not fixed as p increases. • Throughput is performance measure (transactions per minute or tpm)

  39. Scientific Computing Demand

  40. Engineering Computing Demand • Large parallel machines a mainstay in many industries • Petroleum (reservoir analysis) • Automotive (crash simulation, drag analysis, combustion efficiency), • Aeronautics (airflow analysis, engine efficiency, structural mechanics, electromagnetism), • Computer-aided design • Pharmaceuticals (molecular modeling) • Visualization • in all of the above • entertainment (films like Toy Story) • architecture (walk-throughs and rendering) • Financial modeling (yield and derivative analysis) • etc.

  41. Can anyone afford high-end MPPs??? • ASCI (Accelerated Strategic Computing Initiative) ASCI White: built by IBM • 12.3 TeraOps, 8192 processors (RS/6000) • 6 TB of RAM, 160 TB disk • 2 basketball courts in size • Program it??? Message passing

  42. Need new class of applications • Handheld devices with ManyCore processors! • Great potential, right? • Human-interface applications very important: “The laptop/handheld is the computer” • ’07: HP laptops outnumber desktops • 1B+ cell phones/yr, increasing in function • Otellini demoed a “Universal Communicator” (combination cell phone, PC, and video device) • Apple iPhone • User wants increasing performance, weeks or months of battery power

  43. Applications: Speech and Image Processing • Also CAD, Databases, . . .

  44. Compelling Laptop/Handheld Apps • Meeting Diarist • Laptops/handhelds at a meeting coordinate to create a speaker-identified, partially transcribed text diary of the meeting • Teleconference speaker identifier, speech helper • L/Hs used for teleconference, identifies who is speaking, “closed caption” hint of what is being said

  45. Compelling Laptop/Handheld Apps • Health Coach • Since laptop/handheld always with you, Record images of all meals, weigh plate before and after, analyze calories consumed so far • “What if I order a pizza for my next meal? A salad?” • Since laptop/handheld always with you, record amount of exercise so far, show how body would look if maintain this exercise and diet pattern next 3 months • “What would I look like if I regularly ran less? Further?” • Face Recognizer/Name Whisperer • Laptop/handheld scans faces, matches image database, whispers name in ear (relies on Content Based Image Retrieval)

  46. Content-Based Image Retrieval (Kurt Keutzer) See “Porting MapReduce to a GPU,” Thursday 11:30 • Built around key characteristics of personal databases • Very large number of pictures (>5K) • Non-labeled images • Many pictures of few people • Complex pictures including people, events, places, and objects (Pipeline: query by example over an image database of 1000s of images → similarity metric → candidate results → relevance feedback → final result)

  47. What About….? • Many other applications might make sense • If Only….. • Parallel programming is already really hard • Who is going to write these apps??? • Domain Expert is not necessarily qualified to write complex parallel programs!

  48. Need a Fresh Approach to Parallelism • Berkeley researchers from many backgrounds meeting since Feb. 2005 to discuss parallelism • Krste Asanovic, Ras Bodik, Jim Demmel, Kurt Keutzer, John Kubiatowicz, Edward Lee, George Necula, Dave Patterson, Koushik Sen, John Shalf, John Wawrzynek, Kathy Yelick, … • Circuit design, computer architecture, massively parallel computing, computer-aided design, embedded hardware and software, programming languages, compilers, scientific programming, and numerical analysis • Tried to learn from successes in high performance computing (LBNL) and parallel embedded (BWRC) • Led to “Berkeley View” Tech. Report and new Parallel Computing Laboratory (“Par Lab”) • Goal: Productive, Efficient, Correct Programming of 100+ cores, and scale as cores double every 2 years (!)

  49. Why Target 100+ Cores? • 5-year research program, aimed 8+ years out • Multicore: 2X / 2 yrs ⇒ ≈ 64 cores in 8 years • Manycore: 8X to 16X multicore • Automatic parallelization, thread-level speculation

  50. 4 Themes of View 2.0/ Par Lab • Applications • Compelling apps drive top-down research agenda • Identify Common Computational Patterns • Breaking through disciplinary boundaries • Developing Parallel Software with Productivity, Efficiency, and Correctness • 2 Layers + Coordination & Composition Language + Autotuning • OS and Architecture • Composable primitives, not packaged solutions • Deconstruction, Fast barrier synchronization, Partitions
