This presentation by Tim Smith of CERN/IT explores the evolution and future of the management fabric for CERN's computing infrastructure. Covering the fabric challenges associated with Large Hadron Collider (LHC) computing, it examines the role of the DataGRID project, management solutions, and anticipated changes in technology. Key areas include high throughput computing, architecture considerations, and the complex interplay between hardware and software systems that supports physics research at CERN.
Fabric Management for CERN Experiments: Past, Present, and Future. Tim Smith, CERN/IT
Contents • The Fabric of CERN today • The new challenges of LHC computing • What has this got to do with the GRID • Fabric Management solutions of tomorrow? • The DataGRID Project
Fabric Elements • Functionalities: batch and interactive services, disk servers, tape servers + devices, stage servers, home directory servers, application servers, backup service • Infrastructure: job scheduler, authentication, authorisation, monitoring, alarms, console managers, networks
Fabric Technology at CERN [chart: multiplicity scale from 1 to 10,000 systems vs year, 1989–2005] Mainframes (IBM, Cray) give way to RISC workstations, scalable systems (SP2, CS2), SMPs (SGI, DEC, HP, SUN) and, finally, PC farms.
Architecture Considerations • Physics applications have ideal data parallelism • mass of independent problems • No message passing • throughput rather than performance • resilience rather than ultimate reliability • Can build hierarchies of mass market components • High Throughput Computing
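The throughput-oriented model above can be made concrete with a minimal sketch: events are independent, so they can simply be farmed out to worker processes with no message passing between them. The file names and the process_event function below are illustrative assumptions, not CERN software.

```python
# Sketch: throughput-oriented processing of independent events.
# process_event() and the event file names are hypothetical.
from multiprocessing import Pool


def process_event(event_file: str) -> str:
    # Each event is handled independently; no communication with
    # other workers is needed (ideal data parallelism).
    return f"processed {event_file}"


if __name__ == "__main__":
    event_files = [f"run042_evt{i:05d}.raw" for i in range(1000)]

    # Throughput, not single-job performance: keep every CPU busy
    # with its own stream of independent events.
    with Pool(processes=8) as pool:
        for result in pool.imap_unordered(process_event, event_files):
            pass  # e.g. append to an event summary dataset
```

Resilience here means a failed worker costs only the events it was processing, not the whole run.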
Component Architecture [diagram: a high-capacity backbone switch connects application servers (CPUs behind 100/1000baseT switches), disk servers behind a 1000baseT switch, and tape servers]
Analysis Chain: Farms [diagram: the detector delivers raw data; the event filter (selection & reconstruction) and event reconstruction produce processed data and event summary data; analysis objects (extracted by physics topic) feed batch and interactive physics analysis, complemented by event simulation]
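Read as code, the chain is a set of independent stages; the sketch below traces raw data through reconstruction to analysis objects. All function names and data shapes are illustrative placeholders, not part of any experiment's framework.

```python
# Sketch of the analysis chain as a staged pipeline; all names and
# data shapes are invented for illustration.

def event_filter(raw_event):
    """Selection at the event filter: keep only triggered events."""
    return raw_event["trigger"]

def reconstruct(raw_event):
    """Raw data -> event summary data (ESD)."""
    return {"event_id": raw_event["id"], "tracks": raw_event["hits"] // 3}

def extract_analysis_objects(esd):
    """ESD -> analysis objects, grouped by physics topic."""
    return {"topic": "dimuon", "ntracks": esd["tracks"]}

def physics_analysis(objects):
    """Batch or interactive analysis over the extracted objects."""
    return sum(o["ntracks"] for o in objects)

raw_data = [{"id": i, "hits": 30 + i, "trigger": i % 3 == 0}
            for i in range(9)]
esd = [reconstruct(e) for e in raw_data if event_filter(e)]
objects = [extract_analysis_objects(x) for x in esd]
print("total tracks in selected events:", physics_analysis(objects))
```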
Multiplication! [chart: #CPUs from 0 to 1200, Jul-97 to Jan-00] Stacked growth of experiment and shared farms: alice, atlas, ccf, cms, eff, ion, l3c, lhcb, lxbatch, lxplus, mta, na45, na48, na49, nomad, pcsf, tapes, tomog.
PC Farms
Shared Facilities
LHC Computing Challenge • The scale will be different • CPU: 10k SI95 → 1M SI95 • Disk: 30 TB → 3 PB • Tape: 600 TB → 9 PB • The model will be different • There are compelling reasons why some of the farms and some of the capacity will not be located at CERN
[charts: estimated disk storage and CPU capacity at CERN, LHC vs non-LHC, growing roughly along Moore's Law] Bad news, tapes: less than a factor 2 cost reduction in 8 years, and a significant fraction of total cost. Bad news, I/O: in 1996, 4 GB disks at 10 MB/s gave 2500 MB/s per TB; in 2000, 50 GB disks at 20 MB/s give only 400 MB/s per TB. Current capacity: ~10K SI95 from 1200 processors.
Regional Centres: a Multi-Tier Model [diagram: CERN (Tier 0) linked at 2.5 Gbps to Tier 1 regional centres such as IN2P3, RAL and FNAL; 622 Mbps and 155 Mbps links connect down to Tier 2 centres (universities and labs), then to departments and desktops] MONARC http://cern.ch/MONARC
More realistically: a Grid Topology [diagram: the same centres (CERN Tier 0; Tier 1 centres IN2P3, RAL, FNAL; Tier 2 universities and labs; departments and desktops) interconnected as a grid rather than a strict hierarchy, over 155 Mbps to 2.5 Gbps links] DataGRID http://cern.ch/grid
Can we build LHC farms? • Positive predictions • CPU and disk price/performance trends suggest that the raw processing and disk storage capacities will be affordable, and • raw data rates and volumes look manageable • perhaps not today for ALICE • Space, power and cooling issues? • So probably yes… but can we manage them? • Understand costs: 1 PC is cheap, but managing 10,000 is not! • Building and managing coherent systems from such large numbers of boxes will be a challenge. 1999: CDR @ 45 MB/s for NA48! 2000: CDR @ 90 MB/s for ALICE!
Management Tasks I • Supporting adaptability • Configuration Management • Machine / Service hierarchy • Automated registration / insertion / removal • Dynamic reassignment • Automatic Software Installation and Management (OS and applications) • Version management • Application dependencies • Controlled (re)deployment
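A minimal sketch of the configuration-management idea, assuming hypothetical classes rather than the eventual WP4 tools: machines register into a machine/service hierarchy and can be dynamically reassigned between services.

```python
# Sketch: machine / service hierarchy with automated registration
# and dynamic reassignment. The class is a hypothetical illustration,
# not the actual WP4 configuration tools.
from collections import defaultdict


class FabricConfig:
    def __init__(self):
        self.services = defaultdict(set)   # service name -> set of machines
        self.assignment = {}               # machine -> service name

    def register(self, machine: str, service: str) -> None:
        """Automated registration / insertion of a new node."""
        self.services[service].add(machine)
        self.assignment[machine] = service

    def remove(self, machine: str) -> None:
        """Automated removal, e.g. on hardware retirement."""
        service = self.assignment.pop(machine, None)
        if service:
            self.services[service].discard(machine)

    def reassign(self, machine: str, new_service: str) -> None:
        """Dynamic reassignment between services (e.g. lxplus -> lxbatch)."""
        self.remove(machine)
        self.register(machine, new_service)


config = FabricConfig()
config.register("pc001", "lxbatch")
config.reassign("pc001", "lxplus")
```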
Management Tasks II • Controlling Quality of Service • System Monitoring • Orientation to the service NOT the machine • Uniform access to diverse fabric elements • Integrated with configuration (change) management • Problem Management • Identification of root causes (faults + performance) • Correlate network / system / application data • Highly automated • Adaptive - Integrated with configuration management
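One reading of "orientation to the service, not the machine" is that per-node measurements are rolled up per service before any alarm is raised. The sketch below illustrates that aggregation; the metric names and thresholds are assumptions, not an existing monitoring interface.

```python
# Sketch: service-oriented monitoring. Per-machine samples are rolled
# up per service, and any alarm is raised on the service as a whole.
# Thresholds and metric names are hypothetical.

def service_health(samples, max_load=0.9, min_alive_fraction=0.8):
    """samples: list of (machine, alive, load) tuples for one service."""
    alive = [s for s in samples if s[1]]
    if not samples or len(alive) / len(samples) < min_alive_fraction:
        return "ALARM: service capacity degraded"
    mean_load = sum(s[2] for s in alive) / len(alive)
    if mean_load > max_load:
        return "WARNING: service saturated"
    return "OK"

lxbatch_samples = [("pc001", True, 0.95), ("pc002", True, 0.85),
                   ("pc003", False, 0.0)]
print("lxbatch:", service_health(lxbatch_samples))
```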
Relevance to the GRID? • Scalable solutions needed in absence of GRID! • For the GRID to work it must be presented with information and opportunities • Coordinated and efficiently run centres • Presentable as a guaranteed quality resource • 'GRID'ification: the interfaces
Mgmt Tasks: A GRID centre • GRID enable • Support external requests: services • Publication • Coordinated + 'map'able • Security: Authentication / Authorisation • Policies: Allocation / Priorities / Estimation / Cost • Scheduling • Reservation • Change Management • Guarantees • Resource availability / QoS
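As a hedged illustration of the "publication" item, a centre could advertise a coordinated view of its resources, policies and quality-of-service guarantees in a machine-readable form for Grid schedulers. The schema below is invented for illustration and is not a defined GRID interface, though the capacity figures echo those quoted earlier in the talk.

```python
# Sketch: publishing a centre's resources and QoS guarantees to the Grid.
# The schema is illustrative only; it is not a defined GRID interface.
import json

centre_advert = {
    "site": "CERN-Tier0",
    "services": {
        "batch":   {"cpus": 1200, "queue_policy": "fair-share"},
        "storage": {"disk_tb": 30, "tape_tb": 600},
    },
    "qos": {
        "availability": 0.99,           # guaranteed resource availability
        "reservation_supported": True,  # advance reservation
    },
    "policies": {
        "authentication": "grid-certificates",
        "allocation": "per-experiment quotas",
    },
}

print(json.dumps(centre_advert, indent=2))
```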
Existing Solutions? • The world outside is moving fast!! • Dissimilar problems • Virtual super computers (~200 nodes) • MPI, latency, interconnect topology and bandwidth • Roadrunner, LosLobos, Cplant, Beowulf • Similar problems • ISPs / ASPs (~200 nodes) • Clustering: high availability / mission critical • The DataGRID: Fabric Management WP4
WP4 Partners • CERN (CH) Tim Smith • ZIB (D) Alexander Reinefeld • KIP (D) Volker Lindenstruth • NIKHEF (NL) Kors Bos • INFN (I) Michele Michelotto • RAL (UK) Andrew Sansum • IN2P3 (Fr) Denis Linglin
Concluding Remarks • Years of experience in exploiting inexpensive mass market components • But we need to marry these with inexpensive, highly scalable management tools • Build components back together as a resource for the GRID