Explore the planning, installation, and configuration of the LCG Fabric at CERN, including fault tolerance, automation, and operational control. Learn about hardware selection, coupling components through software, and future infrastructure redesign.
Planning the LCG Fabric at CERN
openlab TCO Workshop
November 11th 2003
Tony.Cass@CERN.ch
Fabric Area Overview
• Installation, Configuration + monitoring, Fault tolerance
• Automation, Operation, Control
• Infrastructure: Electricity, Cooling, Space
• Batch system (LSF, CPU server)
• Storage system (AFS, CASTOR, disk server)
• Network
• Benchmarks, R&D, Architecture
• GRID services !?
• Prototype, Testbeds
• Purchase, Hardware selection, Resource planning
• Coupling of components through hardware and software
Agenda • Building Fabric • Batch Subsystem • Storage subsystem • Installation and Configuration • Monitoring and control • Hardware Purchase
Building Fabric I
• B513 was constructed in the early 1970s and the machine room infrastructure has evolved slowly over time.
• Like the eye, the result is often not ideal…
Current Machine Room Layout
• Problem: Normabarres run one way, services run the other…
[Figure: current machine room floor plan, with four areas labelled "Services"]
Building Fabric I (continued)
• With the preparations for LHC we have the opportunity to remodel the infrastructure.
Future Machine Room Layout
• 9m double rows of racks for critical servers
• Aligned normabarres
• 18m double rows of racks: 12 shelf units or 36 19” racks
    • 528 box PCs: 105kW
    • 1440 1U PCs: 288kW
    • 324 disk servers: 120kW(?)
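As a rough check on the figures quoted for the planned rows, the average power per box they imply can be worked out directly. The sketch below assumes each count and load pair describes one alternative way of filling an 18m double row; that reading, and all the names used, are assumptions rather than something the slides state. The disk server load is marked as uncertain in the original.

```python
# Count and load (kW) as quoted on the slide. Reading them as alternative
# ways to fill one 18m double row is an assumption.
row_options = {
    "box PCs": (528, 105.0),
    "1U PCs": (1440, 288.0),
    "disk servers": (324, 120.0),  # marked "(?)" on the slide
}


def watts_per_box(count, total_kw):
    """Average electrical load per box, in watts."""
    return total_kw * 1000.0 / count


per_box = {name: watts_per_box(n, kw) for name, (n, kw) in row_options.items()}

for name, watts in per_box.items():
    print(f"{name}: about {watts:.0f} W per box")
```

On these figures the box PCs and the 1U PCs both come out at roughly 200 W each, and the disk servers at roughly 370 W.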
Building Fabric I (continued)
• Arrange services in clear groupings associated with power and network connections.
• Clarity for general operations plus ease of service restart should there be any power failure.
• Isolate critical infrastructure such as networking, mail and home directory services.
• Clear monitoring of planned power distribution system.
• Just “good housekeeping”, but we expect to reap the benefits during LHC operation.
Building Fabric II
• Beyond good housekeeping, though, there are building fabric issues that are closely tied to recurrent equipment purchase.
• Raw power: We can support a maximum equipment load of 2.5MW. Does the recurrent additional cost of blade systems avoid investment in additional power capacity?
• Power efficiency: Early PCs had power factors of ~0.7 and generated high levels of 3rd harmonics. Fortunately, we now see power factors of 0.95 or better, avoiding the need to install filters in the PDUs. Will this continue?
• Many sites need to install 1U or 2U rack mounted systems for space reasons. This is not a concern for us at present but may become so eventually.
    • There is a link here to the previous point: the small power supplies for 1U systems often have poor power factors.
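The power factor point can be made concrete with the standard relation: apparent power = real power / power factor. The sketch below is illustrative only. It treats the 2.5MW limit as real power, which the slides do not specify, and the function and variable names are hypothetical.

```python
def apparent_power_kva(real_power_kw, power_factor):
    """Apparent power (kVA) that must be supplied for a given real power (kW)."""
    if not 0.0 < power_factor <= 1.0:
        raise ValueError("power factor must be in (0, 1]")
    return real_power_kw / power_factor


REAL_LOAD_KW = 2500.0  # assumption: the 2.5MW limit refers to real power

early_pc_kva = apparent_power_kva(REAL_LOAD_KW, 0.70)   # early PCs
modern_pc_kva = apparent_power_kva(REAL_LOAD_KW, 0.95)  # current supplies

print(f"PF 0.70: {early_pc_kva:.0f} kVA")
print(f"PF 0.95: {modern_pc_kva:.0f} kVA")
```

For the same useful load, the distribution system has to carry noticeably more apparent power at a power factor of 0.7 than at 0.95. The sketch does not model harmonic content, which is a separate concern.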
Fabric Architecture
Physical and logical coupling, by increasing level of complexity:
• CPU, disk → PC, storage tray, NAS server, SAN element
    • Hardware: motherboard, backplane, bus, integrating devices (memory, power supply, controller, ..)
    • Software: operating system, driver
• PC, storage tray, NAS server, SAN element → Cluster
    • Hardware: network (Ethernet, fibre channel, Myrinet, …); hubs, switches, routers
    • Software: batch system, load balancing, control software, Hierarchical Storage Systems
• Cluster → World wide cluster
    • Hardware: wide area network
    • Software: Grid middleware
Batch Subsystem
• Looking purely at batch system issues, TCO is reduced as the efficiency of node usage increases. What are the dependencies?
    • The load characteristics
    • The batch scheduler
    • Chip technology
    • Processors/box
    • The operating system
    • Others?
Batch Subsystem: the load characteristics
• Not much we in IT can do here!
Batch Subsystem: the batch scheduler
• LSF is pretty good here, fortunately.
Batch Subsystem: chip technology
• Take hyperthreading, for example. Tests have shown that, for HEP codes at least, hyperthreading wastes 20% of the system performance running two tasks on a dual processor machine. There are no clear benefits to running with hyperthreading enabled when running three tasks. What is the outlook here?
Batch Subsystem: processors/box
• At present, a single 100baseT NIC would support the I/O load of a quad processor CPU server. Quad processor boxes would halve the cost of networking infrastructure, but they come at a hefty price premium (XEON MP vs XEON DP, heftier chassis, …). What is the outlook here?
    • And total system memory becomes an issue.
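The networking argument is simple counting: if one NIC per box is enough, the number of switch ports scales with the number of boxes, not the number of processors. A minimal sketch, assuming one port per box and a hypothetical farm of 1000 processors (the farm size and all names are illustrative, not from the slides):

```python
import math


def switch_ports_needed(total_processors, processors_per_box, nics_per_box=1):
    """Switch ports required when every box carries `nics_per_box` NICs."""
    boxes = math.ceil(total_processors / processors_per_box)
    return boxes * nics_per_box


FARM_PROCESSORS = 1000  # hypothetical farm size, for illustration only

dual_ports = switch_ports_needed(FARM_PROCESSORS, 2)  # dual processor boxes
quad_ports = switch_ports_needed(FARM_PROCESSORS, 4)  # quad processor boxes

print(f"dual: {dual_ports} ports, quad: {quad_ports} ports")
```

The port count halves when moving from dual to quad boxes. Whether that saving outweighs the price premium of the quad systems is the open question raised above, and it is not modelled here.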
Batch Subsystem: the operating system
• Linux is getting better, but things such as processor affinity would be nice.
    • Relationship to hyperthreading…
Storage subsystem
• Simple building blocks:
    • Processors: “desktop+” node == CPU server
    • CPU server + larger case + 6*2 disks == Disk server
    • CPU server + Fiber Channel Interface + tape drive == Tape server
Storage subsystem: Disk Storage
• TCO: Maximise available online capacity within fixed budget (material & personnel).
• IDE based disk servers are much cheaper than high end SAN servers. But are we spending too much time on maintenance?
    • Yes, at present, but we need to analyse carefully the reasons for the current load.
    • Complexities of Linux drivers seem under control, but numbers have exploded. And are some problems related to a batch of hardware?
• Where is the optimum? Switching to fibre channel disks would reduce capacity by a factor of ~5.
• Naively, buy, say, 10% extra systems to cover failures. Sadly, this is not as simple as for CPU servers; active data on the servers must be reloaded elsewhere.
• Always have duplicate data? => purchase 2x required space. Still cheaper than SAN? How does this relate to …
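The capacity trade-off above can be sketched with the ratios quoted on the slide: 10% extra systems, 2x space for duplicated data, and a factor of ~5 less capacity with fibre channel disks. The sketch below compares usable capacity for the same hardware budget only. It ignores personnel costs, and the 100 TB starting figure and all names are hypothetical.

```python
def usable_capacity(ide_raw_tb):
    """Usable capacity (TB) for one hardware budget under three strategies.

    `ide_raw_tb` is what the budget buys as plain IDE disk servers.
    """
    return {
        "IDE, 10% spare systems": ide_raw_tb / 1.10,
        "IDE, all data duplicated": ide_raw_tb / 2.0,
        "fibre channel disks": ide_raw_tb / 5.0,
    }


options = usable_capacity(100.0)  # hypothetical: budget buys 100 TB of IDE

for strategy, tb in options.items():
    print(f"{strategy}: {tb:.1f} TB usable")
```

On these ratios alone, fully duplicated IDE storage still offers 2.5 times the usable capacity of fibre channel disks for the same hardware spend. The maintenance effort raised above is what this simple comparison leaves out.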
Storage System: Tapes
• The first TCO question is “Do we need them?”
• Disk storage costs are dropping… But
    • Disk servers need system administrators; idle tapes sitting in a tape silo don’t.
    • With a disk-only solution, we need storage for at least twice the total data volume to ensure no data loss.
    • Server lifetime is 3-5 years; data must be copied periodically.
        • Also an issue for tape, but the lifetime of a disk server is probably still less than the lifetime of a given tape media format.
• Assumption today is that tape storage will be required.
Storage System: Tapes
• Tape robotics is easy.
    • Bigger means better cost/slot.
• Tape drives: High end vs LTO
    • TCO issue: LTO drives are cheaper than high end IBM and STK drives, but are they reliable enough for our use?
        • c.f. the IDE disk server area.
• The real problem, though, is tape media.
    • The vast majority of the data is accessed rarely but must be stored for a long period. Strong pressure to select a solution that minimises an overall cost dominated by tape media.
Storage System: Managed Storage
• Should CERN build or buy software systems?
• How to measure the value of a software system?
• Initial cost:
    • Build: Staff time to create required functionality
    • Buy: Initial purchase cost of system as delivered plus staff time to install and configure for CERN.
• Ongoing cost:
    • Build: Staff time to maintain system and add extra functionality
    • Buy: License/maintenance cost plus staff time to track releases.
        • Extra functionality that we consider useful may or may not arrive.
• Choice:
    • Batch system: Buy LSF.
    • Managed storage system: Build CASTOR.
• Use this model as we move on to consider system management software.
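The build-or-buy model above has the same shape for both options: an initial cost plus a yearly ongoing cost, each made up of staff time and, for a purchase, money paid to the vendor. A minimal sketch of that comparison follows. Every number in it is invented for illustration and the class and field names are hypothetical; it does not reproduce any actual CERN costing.

```python
from dataclasses import dataclass


@dataclass
class Option:
    initial_staff_years: float  # staff time to create, or to install and configure
    initial_purchase: float     # purchase cost as delivered (0 for a build)
    yearly_staff_years: float   # staff time to maintain, or to track releases
    yearly_licence: float       # licence/maintenance fee (0 for a build)

    def total_cost(self, years, staff_year_cost):
        """Initial cost plus ongoing cost over `years` of operation."""
        initial = self.initial_purchase + self.initial_staff_years * staff_year_cost
        ongoing = years * (
            self.yearly_licence + self.yearly_staff_years * staff_year_cost
        )
        return initial + ongoing


# Illustrative values only, in arbitrary cost units.
build = Option(initial_staff_years=4, initial_purchase=0,
               yearly_staff_years=1, yearly_licence=0)
buy = Option(initial_staff_years=0.5, initial_purchase=200,
             yearly_staff_years=0.25, yearly_licence=50)

build_cost = build.total_cost(years=5, staff_year_cost=100)
buy_cost = buy.total_cost(years=5, staff_year_cost=100)
print(f"build: {build_cost}, buy: {buy_cost}")
```

The model only captures cost. The value of extra functionality that may or may not arrive from a vendor, noted above, would have to be weighed separately.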
Installation and Configuration
• Reproducibility and guaranteed homogeneity of system configuration are a clear way to minimise ongoing system management costs. A management framework is required that can cope with the numbers of systems we expect.
• We faced the same issues as we moved from mainframes to RISC systems. Vendor solutions offered then were linked to hardware, so we developed our own solution.
• Is a vendor framework acceptable if we have a homogeneous park of Linux systems?
• Being honest, why have we built our own again?
Installation and Configuration • Installation and configuration is only part of the overall computer centre management:
ELFms architecture
• Fault Mgmt System
• Monitoring System
• Node Configuration System
• Installation System
Installation and Configuration (continued)
• Systems provided by vendors cannot (yet) be integrated into such an overall framework.
    • And there is still a tendency to differentiate products on the basis of management software, not raw hardware performance.
    • This is a problem for us as we cannot ensure we always buy brand X rack mounted servers or blade systems.
• In short, life is not so different from the RISC system era.
Monitoring and Control
• Assuming that there are clear interfaces, why not integrate a commercial monitoring package into our overall architecture?
• Two reasons:
    • No commercial package meets (met) our requirements in terms of, say, long term data storage and access for analysis.
        • This could be considered self serving: we produce requirements that justify a build rather than buy decision.
    • Experience has shown, repeatedly, that monitoring frameworks require effort to install and maintain, but don’t deliver the sensors we require.
        • Vendors haven’t heard of LSF, let alone AFS.
        • A good reason!
Hardware Management System • A specific example of the integration problem. Workflows must interface to local procedures for, e.g., LAN address allocation. Can we integrate a vendor solution? Do complete solutions exist?
Console Management • Done poorly now:
Console Management
• We will do better:
• TCO issue: Do the benefits of a single console management system outweigh the costs of developing our own? How do we integrate vendor supplied racks of preinstalled systems?
Hardware Purchase
• The issue at hand: How do we work within our purchasing procedures to purchase equipment that minimises our total cost of ownership?
• At present, we eliminate vast areas of the multi-dimensional space by assuming we will rely on ELFms for system management and CASTOR for data management. Simplified[!!!] view:
    • CPU: White box vs 1U vs blades; install or ready packaged
    • Disk: IDE vs SAN; level of vendor integration
• HELP!
    • Can we benefit from management software that comes with ready built racks of equipment in a multi-vendor environment?