
Evolution of High Performance Cluster Architectures


Presentation Transcript


  1. Evolution of High Performance Cluster Architectures David E. Culler culler@cs.berkeley.edu http://millennium.berkeley.edu/ NPACI 2001 All Hands Meeting

  2. Much has changed since “NOW”
  [Photos: inktomi.berkeley.edu; NOW — 110 UltraSPARC + Myrinet; NOW1 — SS + ATM/Myrinet; NOW0 — HP + Medusa FDDI]

  3. Millennium Cluster Editions

  4. The Basic Argument
  • performance cost of engineering lag
  • miss the 2x per 18 months
  • => rapid assembly of leading-edge HW and SW building blocks
  • => availability through fault masking, not inherent reliability
  • emergence of the “killer switch”
  • opportunities for innovation
  • move data between machines as fast as within a machine
  • protected user-level communication
  • large-scale management
  • fault isolation
  • novel applications

  5. Clusters Took Off
  • scalable internet services
  • only way to match growth rate
  • changing supercomputer market
  • web hosting

  6. Engineering the Building Block
  • argument came full circle in ~98
  • wide array of 3U, 2U, 1U rack-mounted servers
  • thermals and mechanicals
  • processing per square foot
  • 110 V AC routing a mixed blessing
  • component OS & drivers
  • became the early entry to the market

  7. Emergence of the Killer Switch
  • ATM, Fibre Channel, FDDI “died”
  • ServerNet bumps along
  • IBM, SGI do the proprietary thing
  • little Myrinet just keeps going
  • quite nice at this stage
  • SAN standards shootout
  • NGIO + FutureIO => Infiniband
  • specs entire stack from phy to API
  • nod to IPv6
  • big, complex, deeply integrated, DBC
  • Gigabit Ethernet steamroller...
  • limited by TCP/IP stack, NIC, and cost

  8. Opportunities for Innovation

  9. Unexpected Breakthrough: Layer-7 Switches
  [Diagram: a layer-7 switch in front of the cluster’s switch network]
  • fell out of modern switch design
  • process packets in chunks
  • vast # of simultaneous connections
  • many line-speed packet filters per port
  • can be made redundant
  • => multi-gigabit cluster “front end”
  • virtualize IP address of services (see the sketch below)
  • move service within cluster
  • replicate it, distribute it
  • high-level transforms
  • fail-over
  • load management
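To make the front-end idea concrete, here is a minimal sketch (my illustration, not from the talk) of what a layer-7 switch does logically: inspect the request beyond the TCP/IP headers, map one virtual service address onto replicated back ends inside the cluster, and spread load across them. The service paths, backend addresses, and round-robin policy are assumptions for the example; a real switch does this in hardware, at line speed, for a vast number of simultaneous connections.

# Sketch of content-based (layer-7) routing behind one virtual front-end
# address; paths, addresses, and the round-robin policy are illustrative.
import itertools
import socketserver

BACKENDS = {
    "/search": itertools.cycle([("10.0.0.11", 8080), ("10.0.0.12", 8080)]),
    "/index":  itertools.cycle([("10.0.0.21", 8080)]),
}

class Layer7Frontend(socketserver.StreamRequestHandler):
    def handle(self):
        # e.g. "GET /search?q=now HTTP/1.0"
        parts = self.rfile.readline().decode(errors="replace").split()
        if len(parts) < 2:
            return
        path = parts[1].split("?")[0]
        prefix = "/" + path.lstrip("/").split("/")[0]
        pool = BACKENDS.get(prefix)
        if pool is None:
            self.wfile.write(b"HTTP/1.0 404 Not Found\r\n\r\n")
            return
        host, port = next(pool)          # pick the next replica of the service
        # A real layer-7 switch splices the connection at line speed; this
        # sketch just reports the routing decision it made.
        self.wfile.write(
            f"HTTP/1.0 200 OK\r\n\r\nrouted {path} -> {host}:{port}\n".encode())

if __name__ == "__main__":
    socketserver.ThreadingTCPServer(("", 8000), Layer7Frontend).serve_forever()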

  10. e-Science: any useful app should be a service

  11. Protected User-Level Messaging
  • Virtual Interface Architecture (VIA) emerged (see the conceptual sketch below)
  • primitive & complex relative to academic prototypes
  • industrial compromise
  • went dormant
  • Incorporated in Infiniband
  • big one to watch
  • Potential breakthrough
  • user-level TCP, UDP with IP NIC
  • storage over IP
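The crux of VIA-style protected user-level messaging is that the kernel is out of the per-message path: the application posts send and receive descriptors into queues that the protection-checking NIC services directly from pre-registered memory. The sketch below is a conceptual model of that queue discipline, not VIA's actual API; the class and field names are mine.

# Conceptual model of user-level send/receive queues; the "NIC" is simulated.
from collections import deque
from dataclasses import dataclass

@dataclass
class Descriptor:
    buffer: bytearray        # pre-registered (pinned) memory region
    length: int
    done: bool = False       # completion flag written back by the "NIC"

class VirtualInterface:
    """One protected endpoint: a send queue and a receive queue."""
    def __init__(self):
        self.send_queue = deque()
        self.recv_queue = deque()

    def post_recv(self, desc):
        self.recv_queue.append(desc)     # hand a buffer to the NIC in advance

    def post_send(self, desc):
        self.send_queue.append(desc)     # "ring the doorbell"; no syscall

def nic_process(src, dst):
    """Stand-in for the NIC: move one message from src's send queue into a
    buffer dst posted ahead of time, then mark both descriptors complete."""
    if src.send_queue and dst.recv_queue:
        s, r = src.send_queue.popleft(), dst.recv_queue.popleft()
        r.buffer[:s.length] = s.buffer[:s.length]
        r.length = s.length
        s.done = r.done = True

# Usage: both sides pre-post buffers, then communicate without the kernel.
a, b = VirtualInterface(), VirtualInterface()
rx = Descriptor(bytearray(64), 64)
tx = Descriptor(bytearray(b"hello cluster"), 13)
b.post_recv(rx)
a.post_send(tx)
nic_process(a, b)
print(rx.done, bytes(rx.buffer[:rx.length]))   # True b'hello cluster'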

  12. Management
  • workstation -> PC transition a step back
  • boot image distribution, OS distribution
  • network troubleshoot and service
  • multicast proved a powerful tool (see the sketch below)
  • emerging health monitoring and control
  • HW level
  • service level
  • OS level still a problem
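Since multicast is singled out as the powerful tool here, the following is a hedged sketch of the heartbeat pattern behind Ganglia-style health monitoring: every node periodically multicasts a small health record, and any machine that joins the group can aggregate them. The group address, port, and JSON message format are illustrative, not any tool's real protocol.

# Multicast heartbeat sketch: run with no flags on each node to announce,
# or with --monitor anywhere on the cluster to collect the heartbeats.
import json, os, socket, struct, sys, time

GROUP, PORT = "239.2.11.71", 8649   # illustrative multicast group and port

def announce():
    """Run on every node: periodically multicast a small health record."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
    while True:
        record = {"host": socket.gethostname(), "load": os.getloadavg()[0]}
        sock.sendto(json.dumps(record).encode(), (GROUP, PORT))
        time.sleep(5)

def monitor():
    """Join the multicast group and print heartbeats as they arrive."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", PORT))
    mreq = struct.pack("4sl", socket.inet_aton(GROUP), socket.INADDR_ANY)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    while True:
        data, addr = sock.recvfrom(4096)
        print(addr[0], json.loads(data))

if __name__ == "__main__":
    monitor() if "--monitor" in sys.argv else announce()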

  13. Rootstock
  [Diagram: a UC Berkeley Internet Rootstock server feeding local Rootstock servers at each site]

  14. Ganglia and REXEC
  [Diagram: rexecd daemons on Nodes A–D and vexecd policy daemons (Policy A, Policy B) share a cluster IP multicast channel; the user runs “rexec –n 2 –r 3 indexer” and the “minimum $” policy picks Nodes A and B (selection sketched below)]
  Also: bWatch; BPROC: Beowulf Distributed Process Space; VA Linux Systems: VACM, VA Cluster Manager
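As a rough illustration of the “minimum $” selection in the figure, here is a sketch of how a vexecd-style policy might choose nodes. I am reading -n as the number of nodes requested and -r as the maximum rate the user will pay per node; that reading, and the node names and prices, are assumptions for the example.

# Pick the n cheapest nodes whose announced cost rate fits the user's limit.
def select_nodes(offers, n, max_rate):
    """offers: {node_name: cost_rate}; returns the n cheapest affordable nodes."""
    affordable = [(rate, node) for node, rate in offers.items() if rate <= max_rate]
    affordable.sort()
    if len(affordable) < n:
        raise RuntimeError("not enough nodes within the requested rate")
    return [node for _, node in affordable[:n]]

# Usage mirroring "rexec -n 2 -r 3 indexer": two nodes, at most 3 credits each.
offers = {"nodeA": 1.0, "nodeB": 2.5, "nodeC": 4.0, "nodeD": 2.0}
print(select_nodes(offers, n=2, max_rate=3))   # -> ['nodeA', 'nodeD']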

  15. Network Storage
  • state of practice still NFS + local copies
  • local disk replica management lacking
  • NFS doesn’t scale
  • major source of naive user frustration
  • limited structured parallel access
  • SAN movement only changing the device interface
  • Need cluster content distribution, caching, parallel access and network striping (see the striping sketch below)
  • see: GPFS, CFS, PVFS, HPSS, GFS, PPFS, CXFS, HAMFS, Petal, NASD...
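To show what network striping buys beyond a smarter device interface, here is a sketch in the spirit of parallel cluster file systems such as PVFS or GPFS (not their actual on-disk formats): a file is split into fixed-size stripe units laid out round-robin across storage nodes, so clients can read different stripes from different servers in parallel. The stripe size and server count are arbitrary choices for the example.

STRIPE_SIZE = 64 * 1024   # 64 KB stripe unit (illustrative)

def stripe_layout(file_size, num_servers, stripe_size=STRIPE_SIZE):
    """Return a (server_index, offset_within_server) pair per stripe unit."""
    layout = []
    for unit in range((file_size + stripe_size - 1) // stripe_size):
        server = unit % num_servers                 # round-robin placement
        local_offset = (unit // num_servers) * stripe_size
        layout.append((server, local_offset))
    return layout

# Usage: a 1 MB file over 4 storage nodes -> 16 stripe units, 4 per server.
print(stripe_layout(1 << 20, 4)[:6])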

  16. Distributed Persistent Data Structures: an Alternative Clustered Service
  [Diagram: services linked against a DDS library present a distributed hash table API; the library spreads operations over a redundant, low-latency, high-throughput System Area Network to storage “bricks”, each a single-node durable hash table (sketched below)]
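A minimal sketch of the distributed persistent data structure idea pictured above: a hash-table API whose put/get operations are hashed to storage “bricks” and written redundantly, so a failed brick can be masked. Plain in-memory dicts stand in for the single-node durable hash tables, and the hashing and replication scheme is my own simplification, not the DDS design itself.

# Hash each key to a home brick and a replica; read from the first live copy.
import hashlib

class DistributedHashTable:
    def __init__(self, bricks, replicas=2):
        self.bricks = bricks          # list of dict-like single-node stores
        self.replicas = replicas      # write each key to this many bricks

    def _homes(self, key):
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        n = len(self.bricks)
        return [(h + i) % n for i in range(self.replicas)]

    def put(self, key, value):
        for i in self._homes(key):            # redundant writes mask brick failure
            self.bricks[i][key] = value

    def get(self, key):
        for i in self._homes(key):            # fall through to the replica
            if key in self.bricks[i]:
                return self.bricks[i][key]
        raise KeyError(key)

# Usage: six bricks behind one hash-table API, as in the slide's picture.
dds = DistributedHashTable([dict() for _ in range(6)])
dds.put("doc:42", b"crawl data")
print(dds.get("doc:42"))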

  17. Scalable Throughput

  18. “Performance Available” Storage
  [Diagram: static parallel aggregation with fixed aggregator-to-data assignments vs. adaptive parallel aggregation drawing work from a distributed queue (contrast sketched below)]
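Reading the figure as a contrast between static and adaptive parallel aggregation, the sketch below shows the difference in scheduling terms: static aggregation fixes the partition-to-aggregator assignment up front, while adaptive aggregation lets each aggregator pull the next partition from a shared queue, so whatever performance happens to be available gets used. A thread-safe in-process queue stands in for the distributed queue, and the partition model is mine.

import queue
import threading

def static_plan(partitions, workers):
    """Fixed assignment: partition i always goes to worker i % workers."""
    return {w: [p for i, p in enumerate(partitions) if i % workers == w]
            for w in range(workers)}

def adaptive_run(partitions, workers, process):
    """Each aggregator pulls the next partition as it finishes the last."""
    work = queue.Queue()
    for p in partitions:
        work.put(p)
    done = {w: [] for w in range(workers)}

    def aggregator(w):
        while True:
            try:
                p = work.get_nowait()     # pull at this node's own pace
            except queue.Empty:
                return
            process(p)
            done[w].append(p)

    threads = [threading.Thread(target=aggregator, args=(w,)) for w in range(workers)]
    for t in threads: t.start()
    for t in threads: t.join()
    return done

# Usage: 12 partitions, 4 aggregators.
print(static_plan(list(range(12)), 4))
print(adaptive_run(list(range(12)), 4, process=lambda p: None))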

  19. Application Software
  • very little movement towards harnessing architectural potential
  • application as service (see the sketch below)
  • process stream of requests (not shell or batch)
  • grow & shrink on demand
  • replication for availability
  • data and functionality
  • tremendous internal bandwidth
  • outer-level optimizations, not algorithmic
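As one way to picture “application as service” that grows and shrinks on demand, the sketch below keeps a request stream in front of a pool of replicated workers that can be expanded or contracted while requests keep flowing. Threads stand in for cluster nodes, and the resize mechanism is deliberately naive; the class and method names are illustrative.

import queue
import threading
import time

class ElasticService:
    def __init__(self):
        self.requests = queue.Queue()
        self.workers = []

    def _worker(self):
        while True:
            req = self.requests.get()
            if req is None:          # shutdown token used when shrinking
                return
            print("handled", req)

    def grow(self, n=1):
        for _ in range(n):
            t = threading.Thread(target=self._worker, daemon=True)
            t.start()
            self.workers.append(t)

    def shrink(self, n=1):
        for _ in range(min(n, len(self.workers))):
            self.requests.put(None)   # any one worker picks up the token
            self.workers.pop()

    def submit(self, req):
        self.requests.put(req)

# Usage: start with two replicas, add a third as the request stream grows.
svc = ElasticService()
svc.grow(2)
for i in range(5):
    svc.submit(f"request-{i}")
svc.grow(1)
time.sleep(0.2)   # give the daemon workers a moment to drain the queue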

  20. Time is NOW
  • finish the system area network
  • tackle the cluster I/O problem
  • come together around management tools
  • get serious about application services
