High-Performance Clusters part 1: Performance

Presentation Transcript

  1. High-Performance Clusters part 1: Performance • David E. Culler, Computer Science Division, U.C. Berkeley • PODC/SPAA Tutorial, Sunday, June 28, 1998

  2. Clusters have Arrived • … the SPAA / PODC testbed going forward SPAA/PODC Clusters

  3. Berkeley NOW • http://now.cs.berkeley.edu/

  4. NOW’s Commercial Version • 240 processors, Active Messages, Myrinet, ...

  5. Berkeley Massive Storage Cluster • serving Fine Art at www.thinker.org/imagebase/ • or try

  6. Commercial Scene

  7. What’s a Cluster? • Collection of independent computer systems working together as if a single system. • Coupled through a scalable, high bandwidth, low latency interconnect.

  8. Outline for Part 1 • Why Clusters NOW? • What is the Key Challenge? • How is it overcome? • How much performance? • Where is it going?

  9. Why Clusters? • Capacity • Availability • Scalability • Cost-effectiveness

  10. Traditional Availability Clusters • VAX Clusters => IBM Sysplex => Wolf Pack [Diagram labels: Clients; Server A; Server B; Interconnect; Disk array A; Disk array B]

  11. Why HP Clusters NOW? • Time to market => performance • Technology • Internet services [Figure labels: Node Performance in Large System; Engineering Lag Time]

  12. Technology Breakthrough • Killer micro => Killer switch • single chip building block for scalable networks • high bandwidth • low latency • very reliable

  13. Opportunity: Rethink System Design • Remote memory and processor are closer than local disks! • Networking stacks? • Virtual memory? • File system design? • It all looks like parallel programming • Huge demand for scalable, available, dedicated Internet servers • big I/O, big compute

  14. Example: Traditional File System • Server resources at a premium • Client resources poorly utilized • Expensive • Complex • Non-Scalable • Single point of failure [Diagram labels: Clients, each with a Local Private File Cache; Fast Channel (HIPPI); Server with Global Shared File Cache, marked as the Bottleneck; RAID Disk Storage]

  15. Truly Distributed File System • VM: page to remote memory • G = Node Comm BW / Disk BW [Diagram labels: eight processors (P), each with a File Cache; Scalable Low-Latency Communication Network; Cluster Caching; Local Cache; Network RAID striping]

  16. Fast Communication Challenge • Fast processors and fast networks • The time is spent in crossing between them [Diagram labels: Killer Platform; Comm. Software and Network Interface Hardware on each node; Killer Switch; time scales ns, µs, ms]

  17. Opening: Intelligent Network Interfaces • Dedicated processing power and storage embedded in the Network Interface • An I/O card today • Tomorrow on chip? [Diagram labels: Sun Ultra 170 (P, $, M); I/O bus (S-Bus) 50 MB/s; Myricom NIC; Myricom Net 160 MB/s]

  18. Our Attack: Active Messages • Request / Reply small active messages (RPC) • Bulk-Transfer (store & get) • Highly optimized communication layer on a range of HW [Diagram labels: Request handler; Reply handler]
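
The request/reply pattern on this slide can be sketched in a few lines of Python. This is only an illustration of the idea that each message names a handler to run on arrival. The `Node` class, its `register`, `send`, and `poll` methods, and the handler names are hypothetical and are not the Berkeley AM interface; the "network" is just a dictionary of in-process queues.

```python
from collections import deque


class Node:
    """One endpoint with a table of named message handlers (hypothetical API)."""

    def __init__(self, node_id, network):
        self.node_id = node_id
        self.network = network      # shared dict: node id -> Node
        self.handlers = {}          # handler name -> callable
        self.inbox = deque()        # pending (src, handler, args) messages
        self.memory = {}            # local data the handlers operate on
        network[node_id] = self

    def register(self, name, fn):
        self.handlers[name] = fn

    def send(self, dest, handler, *args):
        # A message carries the name of the handler to run at the destination.
        self.network[dest].inbox.append((self.node_id, handler, args))

    def poll(self):
        # Run the named handler for every pending message; return how many ran.
        handled = 0
        while self.inbox:
            src, handler, args = self.inbox.popleft()
            self.handlers[handler](self, src, *args)
            handled += 1
        return handled


def read_request(node, src, key):
    # Request handler: look up the value and send a small reply to the requester.
    node.send(src, "read_reply", key, node.memory.get(key))


def read_reply(node, src, key, value):
    # Reply handler: deposit the returned value on the requesting node.
    node.memory[("reply", key)] = value


def demo_remote_read():
    network = {}
    a, b = Node(0, network), Node(1, network)
    for n in (a, b):
        n.register("read_request", read_request)
        n.register("read_reply", read_reply)
    b.memory["x"] = 42
    a.send(1, "read_request", "x")   # request travels a -> b
    b.poll()                         # b runs the request handler, which replies
    a.poll()                         # a runs the reply handler
    return a.memory[("reply", "x")]
```

Calling `demo_remote_read()` performs one remote read as a request followed by a reply; a real implementation would replace the queues with a network interface and keep the handlers short so they never block.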

  19. NOW System Architecture [Diagram, top to bottom: Parallel Apps and Large Seq. Apps; Sockets, Split-C, MPI, HPF, vSM; Global Layer UNIX (Process Migration, Distributed Files, Network RAM, Resource Management); four UNIX Workstations, each with Comm. SW and Net Inter. HW; Fast Commercial Switch (Myrinet)]

  20. Cluster Communication Performance

  21. LogP • L: Latency in sending a (small) message between modules • o: overhead felt by the processor on sending or receiving msg • g: gap between successive sends or receives (1/rate) • P: Processors • Limited Volume (L/g to a proc) • Round Trip time: 2 x (2o + L) [Diagram labels: P processors, each with memory M; o (overhead); g (gap); L (latency); Interconnection Network]
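
The LogP parameters above translate directly into small cost formulas. The sketch below encodes the round-trip time from the slide, 2 x (2o + L), together with two standard consequences of the model: the one-way time o + L + o and the time for a burst of n small messages. The class name, method names, and the numeric values used in the usage note are illustrative assumptions, not measurements from the tutorial.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogP:
    L: float  # latency of a small message through the network
    o: float  # processor overhead per send or per receive
    g: float  # minimum gap between successive sends or receives (1/rate)
    P: int    # number of processors

    def one_way(self):
        # Send overhead, network latency, then receive overhead.
        return self.o + self.L + self.o

    def round_trip(self):
        # Request and reply each cost 2o + L, as on the slide.
        return 2 * (2 * self.o + self.L)

    def capacity(self):
        # Limited volume: roughly L/g messages can be in flight to one processor.
        return self.L / self.g

    def burst(self, n):
        # Time until the last of n back-to-back small messages has been received.
        # Successive sends are spaced by whichever of o and g is larger.
        return (n - 1) * max(self.o, self.g) + self.one_way()
```

For example, with the made-up values `LogP(L=5.0, o=3.0, g=6.0, P=32)` (in microseconds), `round_trip()` evaluates 2 x (2 x 3 + 5) and `burst(10)` evaluates 9 x 6 + 11.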

  22. LogP Comparison • Direct, user-level network access • Generic AM, FM (UIUC), PM (RWC), U-Net (Cornell), … [Figure labels: Latency; 1/BW]

  23. MPI over AM: ping-pong bandwidth

  24. MPI over AM: start-up

  25. Cluster Application Performance: NAS Parallel Benchmarks

  26. NPB2: NOW vs SP2

  27. NPB2: NOW vs SGI Origin

  28. Where the Time Goes: LU

  29. Where the Time Goes: SP

  30. LU Working Set • 4-processor • traditional curve for small caches • Sharp knee >256KB (1 MB total)

  31. LU Working Set (CPS scaling) • Knee at global cache > 1MB • machine experiences drop in miss rate at specific size

  32. Application Sensitivity to Communication Performance

  33. Adjusting L, o, and g (and G) in situ • Martin, et al., ISCA 97 • o: stall Ultra on msg write (send side) and on msg read (receive side) • L: defer marking msg as valid until Rx + L • g: delay Lanai after msg injection (after fragment for bulk transfers) [Diagram labels: two Host Workstations, each with AM lib and Lanai; Myrinet]

  34. Calibration

  35. Split-C Applications

     Program      Description            Input                      P=16   P=32 Msg Interval (us)  Msg Type
     Radix        Integer radix sort     16M 32-bit keys            13.7    7.8               6.1  msg
     EM3D(write)  Electro-magnetic       80K Nodes, 40% rmt         88.6   38.0               8.0  write
     EM3D(read)   Electro-magnetic       80K Nodes, 40% rmt        230.0  114.0              13.8  read
     Sample       Integer sample sort    32M 32-bit keys            24.7   13.2              13.0  msg
     Barnes       Hierarchical N-Body    1 Million Bodies           77.9   43.2              52.8  cached read
     P-Ray        Ray Tracer             1 Million pixel image      23.5   17.9             156.2  cached read
     MurPHI       Protocol Verification  SCI protocol, 2 proc       67.7   35.3             183.5  Bulk
     Connect      Connected Comp         4M nodes, 2-D mesh, 30%     2.3    1.2             212.6  BSP
     NOW-sort     Disk-to-Disk Sort      32M 100-byte records      127.2   56.9             817.4  I/O
     Radb         Bulk version Radix     16M 32-bit keys             7.0    3.7             852.7  Bulk

  36. Sensitivity to Overhead

  37. Comparative Impact

  38. Sensitivity to bulk BW (1/G)

  39. Cluster Communication Performance • Overhead, Overhead, Overhead • hypersensitive due to increased serialization • Sensitivity to gap reflects bursty communication • Surprisingly latency tolerant • Plenty of room for overhead improvement - How sensitive are distributed systems?
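
One way to see why overhead dominates is a simple serial model, sketched below under stated assumptions: every message pays any added overhead twice (once at the sender, once at the receiver), and none of that added time overlaps with computation. This model, the function names, and the idea of estimating a message count as run time divided by average message interval are assumptions made for illustration; the slides themselves present measured curves, not this formula.

```python
def messages_per_processor(run_time_s, msg_interval_us):
    """Estimate message count from run time (s) and average message interval (us)."""
    return run_time_s * 1e6 / msg_interval_us


def predicted_run_time(run_time_s, n_messages, added_overhead_us):
    """Serial model: each message pays the added overhead on send and on receive."""
    return run_time_s + 2 * n_messages * added_overhead_us / 1e6


def slowdown(run_time_s, msg_interval_us, added_overhead_us):
    """Predicted run time relative to the unmodified run time."""
    n = messages_per_processor(run_time_s, msg_interval_us)
    return predicted_run_time(run_time_s, n, added_overhead_us) / run_time_s
```

Under this model the slowdown depends only on the ratio of added overhead to message interval: `slowdown(t, interval, d)` works out to 1 + 2d/interval. A program that sends a message every few microseconds is therefore hit far harder by a few microseconds of extra overhead than one that communicates in infrequent bulk transfers.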

  40. Extrapolating to Low Overhead

  41. Direct Memory Messaging • Send region and receive region for each end of communication channel • Write through send region into remote rcv region
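
The send-region / receive-region arrangement can be mimicked in a single process to make the data flow concrete. In this sketch each end of a channel owns a receive region, and its send region is simply a window onto the other end's receive region, so a write "through" the send region lands in remote memory. The class and function names are hypothetical, and byte arrays stand in for memory that real hardware would map across the interconnect.

```python
class ReceiveRegion:
    """Memory owned by one end of the channel; the remote end writes into it."""

    def __init__(self, size):
        self.buf = bytearray(size)

    def read(self, offset, length):
        return bytes(self.buf[offset:offset + length])


class SendRegion:
    """Write-only window mapped onto the other end's receive region."""

    def __init__(self, remote):
        self.remote = remote

    def write(self, offset, data):
        # Reject writes that would fall outside the mapped region.
        if offset < 0 or offset + len(data) > len(self.remote.buf):
            raise ValueError("write outside mapped region")
        # The store goes straight into the remote receive region.
        self.remote.buf[offset:offset + len(data)] = data


def make_channel(size):
    """Return ((send_a, rcv_a), (send_b, rcv_b)) for the two ends of a channel."""
    rcv_a, rcv_b = ReceiveRegion(size), ReceiveRegion(size)
    return (SendRegion(rcv_b), rcv_a), (SendRegion(rcv_a), rcv_b)
```

With `(send_a, rcv_a), (send_b, rcv_b) = make_channel(16)`, a call to `send_a.write(0, b"ping")` makes the bytes visible through `rcv_b.read(0, 4)` with no explicit receive call, which is the property the slide describes.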

  42. Direct Memory Interconnects • DEC Memory Channels • 3 us end-to-end • ~ 1us o, L • SCI • SGI • Shrimp (Princeton) [Diagram labels: AlphaServer SMP (Alpha P, $, Mem); B/A; PCI (33 MHz); Bus Interface; Link Interface with PCT, rx ctr, tx ctr, rcv dma; MEMORY CHANNEL interconnect; 100 MB/s]

  43. Scalability, Availability, and Performance • Scale disk, memory, proc independently • Random node serves query, all search • On (hw or sw) failure, lose random cols of index • On overload, lose random rows [Diagram labels: Inktomi; FE nodes; processors (P) on Myrinet; 100 Million Document Index]
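
The "random node serves query, all search" design can be sketched as a scatter-gather over a partitioned index. The sketch assumes that a "column" of the index is the subset of documents held by one node, so losing a node removes those documents from the results while the query still completes. The partitioning rule (document id modulo node count), the function names, and the sample documents are all illustrative assumptions, not details of the Inktomi system.

```python
import random


def build_partitions(docs, n_nodes):
    """Spread documents over nodes; each node indexes only its own documents."""
    parts = [dict() for _ in range(n_nodes)]
    for doc_id, text in docs.items():
        parts[doc_id % n_nodes][doc_id] = set(text.lower().split())
    return parts


def search(parts, term, down=frozenset(), rng=random):
    """Pick a random live node to coordinate, then search every live partition."""
    live = [i for i in range(len(parts)) if i not in down]
    if not live:
        raise RuntimeError("no live nodes")
    coordinator = rng.choice(live)
    hits = []
    for i in live:
        # Every live node searches its own slice of the index.
        hits.extend(d for d, words in parts[i].items() if term in words)
    return coordinator, sorted(hits)


DOCS = {
    0: "cluster network",
    1: "cluster disk",
    2: "network switch",
    3: "cluster switch",
}
```

With `parts = build_partitions(DOCS, 2)`, `search(parts, "cluster")` consults both nodes, while `search(parts, "cluster", down={1})` still answers but omits the documents stored on node 1. Availability degrades result quality instead of failing the query.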

  44. Summary • Performance => Generality (see Part 2) • From Technology “Shift” to Technology “Trend” • Cluster communication becoming cheap • gigabit ethernet • System Area Networks becoming commodity • Myricom OEM, Tandem/Compaq ServerNet, SGI, HAL, Sun • Improvements in interconnect BW • gigabyte per second and beyond • Bus connections improving • PCI, ePCI, Pentium II cluster slot, … • Operating system out of the way • VIA

  45. Advice • Clusters are cheap, easy to build, flexible, powerful, general purpose and fun • Everybody doing SPAA or PODC should have one to try out their ideas • Can use Berkeley NOW through NPACI • www.npaci.edu