UCB Millennium and the Vineyard Cluster Architecture
Phil Buonadonna
University of California, Berkeley
http://www.millennium.berkeley.edu
Millennium Project
• Hierarchical “Cluster of Clusters”
[Figure: cluster hierarchy joined by Gigabit Ethernet (GbE). Labels: ½ TB, DLIB, Ninja, PIII-X 64x4, PIII 32x2, PII, PIII, and five PII 8x2 clusters (Astro, Math, Physics, Bio, CE).]
UC Berkeley Millennium
Millennium Agenda
• Investigate recent PC technologies in clusters
  - NT/Linux
  - VI Architecture / GbE / Distributed I/O
• Harvest the lessons learned from NOW
  - Robust, flexible remote execution
  - Distributed resource management
• Investigate clusters that span administrative units
  - Turn-key cluster deployment
  - Sense of ownership
• Investigate the “Computational Economy” approach
  - Resource management with a natural sense of ownership
  - Enough heterogeneous interests to be worthwhile
• Form basis for Sci. Computing, Internet Services, etc.
Vineyard Cluster Architecture
• Distributed resource utilization and management in a “Vineyard” of clusters.
[Figure: layered software stack. Labels: Applications / Services; Mgmt / Monitoring; PBS, I/O, MPI, VEXEC, TOOLS; REXEC; VIA / GM, GbE; Multicast; NT / Linux (2.2.x); Stride Scheduler; Rootstock Distribution.]
Outline
• Millennium Project
• Vineyard Cluster SW Architecture
• Important Component Technologies
  - Rootstock cluster SW distribution facility
  - REXEC: Robust Linux Remote Execution
  - Economic-based resource allocation
  - CAN communication over VIA
  - IO Rivers
• Directions and Discussion
Rootstock
• Disseminate easy-to-build PC cluster system software
• Variety of cluster designs
  - well-engineered high-performance clusters
  - low-cost casual workgroup clusters
  - server farms
  - scalable internet servers
• Root Cluster Server (CS)
  - Provides cluster software stock
• Second-level customized distribution within each cluster from its own CS node
Rootstock Cluster
• Collection of nodes with IP connectivity
  - can be dedicated subnet, w/ or w/o NAT, or any collection
  - run nfsd (within cluster), httpd, ssl
• One node designated as Cluster Root
  - serves as the root of administrative operations and mgmt.
  - may be same or different from other nodes
  - may participate in normal cluster operation or not
  => is trusted by other nodes and has storage for dialtone
• May have designated front-end nodes or not
• May have dedicated cluster-area network (e.g., Myrinet) or not.
Rootstock Mechanics
1. Cluster Stock
  - Rootstock build pages
  - Full current Linux
  - all fixes and pckgs
  - SSL, SSH
  - Cluster drivers
  - Cluster system layers
  - rexec, mpe, pbs
  - Optional SW ($)
  - Cluster kernel mods
2. Make the CS “graft”
  - specify IP address
  - pckg removes
  - dhcp, dns, nis, ...
  - sanity check and build
  - resolv.conf, /etc/hosts, ...
  - constructs cluster build (lease)
  - download CS build floppy
3. CS power-on build
  - xfer and localize DT
  - add local admin scripts
  - node build floppy
4. Node power-on build
  - local stock from CS
5. Cluster Update button (future)
  - 2nd dialtone, CF engine, rolling update
[Figure: Cluster System Distribution Center holding the cluster stock (build, os, drvrs, mill SW, os mods), connected over the IP network to each cluster's CS, which serves its nodes over the CAN. Other labels: Cluster leased builds.]
Computational Economy
• Market-based approach to resource allocation
• Optimizes for user value
[Figure: layered diagram. Labels: Apps (Value); Economic F.E.; Access Modules; API (twice); TimeShare; BatchQueue; Resource Managers; Resources.]
REXEC Remote Execution
• Secure, decentralized remote execution environment
• Features
  - Decouples resource discovery and selection
  - Multiple allocation policies (VEXECs)
  - Decentralized control
    - Each client rexec is the root for a distributed task.
  - Dynamic discovery and configuration
    - Resource announcements on a cluster multicast channel
  - All soft state
    - Simple, well-defined failure and cleanup models
    - “They all fall down”
  - Secure
  - Translates pricing mechanism to resource allocation
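The "all soft state" and "resource announcements" bullets can be illustrated with a small sketch. It is not the REXEC implementation: the class name, the time-to-live parameter, and the shape of an announcement (a dictionary with a load figure) are all assumptions made for illustration. The idea shown is only that a listener keeps each node's last announcement and forgets it unless it is refreshed, so a crashed node disappears without any explicit cleanup message.

```python
class SoftStateDirectory:
    """Hypothetical table of node announcements that expire unless refreshed."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._entries = {}  # node name -> (announced state, time last heard)

    def announce(self, node, state, now):
        """Record or refresh an announcement heard on the multicast channel."""
        self._entries[node] = (state, now)

    def live_nodes(self, now):
        """Drop stale entries, then return the state of every remaining node."""
        self._entries = {
            node: (state, heard)
            for node, (state, heard) in self._entries.items()
            if now - heard <= self.ttl
        }
        return {node: state for node, (state, _) in self._entries.items()}


# Node A announces once and then goes silent; node B announces later.
directory = SoftStateDirectory(ttl_seconds=5)
directory.announce("A", {"load": 0.1}, now=0)
directory.announce("B", {"load": 0.7}, now=3)
print(sorted(directory.live_nodes(now=4)))  # both entries are still fresh
print(sorted(directory.live_nodes(now=7)))  # A has aged out, B remains
```

Passing the clock in as `now` keeps the sketch deterministic; a real daemon would read a monotonic clock instead.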
REXEC / VEXEC
• Components
  - rexecd, rexec & vexecd
[Figure: Node A, Node B, Node C, and Node D each run rexecd and share the cluster IP multicast channel with vexecd (Policy A) and vexecd (Policy B). Callouts around the rexec client: “minimum $”; “Node A”; “run indexer on Nodes AB at 3 credits/min”.]
% rexec -n 2 -r 3 indexer
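The example command on this slide can be read as "run indexer on 2 nodes at a rate of 3 credits/min", with a vexecd policy choosing the nodes. The sketch below follows that reading, which is inferred from the figure callouts rather than from REXEC documentation. The parser, the price table, and the `min_cost_policy` function are hypothetical stand-ins and are not the actual rexec or vexecd interfaces.

```python
import argparse


def parse_rexec(argv):
    """Parse a command line shaped like the slide's `rexec -n 2 -r 3 indexer`."""
    parser = argparse.ArgumentParser(prog="rexec")
    parser.add_argument("-n", dest="nodes", type=int, default=1,
                        help="number of nodes requested (assumed meaning)")
    parser.add_argument("-r", dest="rate", type=int, default=0,
                        help="bid rate in credits/min (assumed meaning)")
    parser.add_argument("command", help="program to run remotely")
    parser.add_argument("args", nargs="*", help="arguments for the program")
    return parser.parse_args(argv)


def min_cost_policy(prices, n):
    """Hypothetical 'minimum $' policy: pick the n cheapest nodes.

    `prices` maps node name -> current price in credits/min; ties are
    broken by node name so the result is deterministic.
    """
    if n > len(prices):
        raise ValueError("not enough nodes announced")
    return sorted(prices, key=lambda node: (prices[node], node))[:n]


request = parse_rexec(["-n", "2", "-r", "3", "indexer"])
announced_prices = {"A": 1, "B": 2, "C": 5, "D": 4}  # made-up figures
chosen = min_cost_policy(announced_prices, request.nodes)
print(f"run {request.command} on nodes {chosen} at {request.rate} credits/min")
```

Keeping selection in a separate policy function mirrors the slide's point that discovery and selection are decoupled: a different vexecd could swap in another policy without changing the client.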
Interactive Pricing Mechanism
• Most work on “economic mechanisms” focuses on single item or batch case
  - hold auctions (e.g., second-price sealed bid)
  - integrated into Vineyard PBS
• Interactive case needs to be very simple
  - Bidder i gets $b_i / \sum_k b_k$ of CPU at rate $b_i$
  - enforced by stride scheduler
• Running cluster mirror usage experiment
  - two identical clusters for one user community with $ accounts
  - one free and uncontrolled
  - one for bid and controlled
  - which is more desirable to use
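The proportional-share rule and its enforcement by a stride scheduler can be sketched in a few lines. This is a textbook-style basic stride scheduler, not the Millennium kernel code. Treating each bid directly as a ticket count is an assumption, and the names `cpu_shares`, `stride_schedule`, and `STRIDE1` are illustrative.

```python
from fractions import Fraction

STRIDE1 = 1 << 20  # large constant from which integer strides are derived


def cpu_shares(bids):
    """Return each bidder's CPU fraction b_i / sum_k b_k."""
    total = sum(bids.values())
    if total <= 0:
        raise ValueError("at least one positive bid is required")
    return {name: Fraction(bid, total) for name, bid in bids.items()}


def stride_schedule(bids, quanta):
    """Simulate `quanta` time slices of a basic stride scheduler.

    Each bidder's ticket count is its bid (bids must be positive integers).
    Every quantum goes to the client with the smallest pass value, and that
    client's pass then advances by its stride.
    """
    stride = {name: STRIDE1 // bid for name, bid in bids.items()}
    passes = dict(stride)  # each client starts one stride in
    counts = {name: 0 for name in bids}
    for _ in range(quanta):
        # Break ties on the name so the simulation is deterministic.
        winner = min(passes, key=lambda name: (passes[name], name))
        counts[winner] += 1
        passes[winner] += stride[winner]
    return counts


bids = {"alice": 3, "bob": 1}  # credits/min, made-up figures
print(cpu_shares(bids))
print(stride_schedule(bids, quanta=400))
```

With bids of 3 and 1, the rule gives shares of 3/4 and 1/4, and the simulated quanta split in roughly the same 3:1 ratio. A larger bid means a smaller stride, so that client's pass value grows more slowly and it is selected more often.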
Communication / VIA
• Multiple physical layers
  - Fast Ethernet
  - Gigabit Ethernet (inter- & intra-cluster net)
  - Myrinet w/ Lanai7 (intra-cluster net)
• Transports
  - IP, IP Multicast
  - VI Architecture / GM
• Explore integrated IPC and distributed I/O
AM Architecture
• Components
  - Endpoints
  - Virtual Networks
  - Bundles
• Operations
  - Request / Reply
  - Short, Med, Long
  - Create, Map, Free
  - Poll, Wait
• Credit-based flow control
[Figure: Proc A, Proc B, Proc C.]
AM-VIA Architecture
• VI Queue (VIQ)
  - Logical channel for AM message type
  - VI & independent Send/Receive Queues
  - Independent request credit scheme (counter n)
• MAP Object
  - Container for 3 VIQs
  - Short, Medium, Long
  - Single Registered Memory Region
[Figure: MAP Object.]
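The per-VIQ "request credit scheme (counter n)" can be illustrated with a minimal credit counter. The sketch assumes the usual request/reply form of credit-based flow control: a request consumes one credit and the matching reply returns it. The slides do not spell out the AM-VIA details, so the class and method names here are hypothetical.

```python
from collections import deque


class CreditedChannel:
    """Hypothetical sender side of one VIQ with n request credits."""

    def __init__(self, credits):
        self.credits = credits      # counter n: requests we may still issue
        self.in_flight = deque()    # requests awaiting their reply

    def send_request(self, message):
        """Issue a request if a credit is available; report success."""
        if self.credits == 0:
            return False            # caller must poll or wait for a reply
        self.credits -= 1
        self.in_flight.append(message)
        return True

    def receive_reply(self):
        """A reply arrived: retire the oldest request and regain its credit."""
        if not self.in_flight:
            raise RuntimeError("reply without an outstanding request")
        self.credits += 1
        return self.in_flight.popleft()


channel = CreditedChannel(credits=2)
print(channel.send_request("req-1"))  # accepted
print(channel.send_request("req-2"))  # accepted, credits now exhausted
print(channel.send_request("req-3"))  # refused until a reply returns a credit
channel.receive_reply()
print(channel.send_request("req-3"))  # accepted again
```

Because the receiver only has to provision buffers for n outstanding requests per channel, the sender can never overrun it. Giving each VIQ its own counter keeps the short, medium, and long message classes from starving one another.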
AM-VIA Integration
• Endpoints: Collection of MAP objects
  - Virtual network emulated by point-to-point connections
• Bundle: Pair of VI Completion Queues
  - Send/Receive
[Figure: Proc A, Proc B, Proc C.]