Further expansion of HPE’s largest warm-water-cooled Apollo 8000 system: „Prometheus” at Cyfronet
Patryk Lasoń, Marek Magryś
ACC Cyfronet AGH-UST
• established in 1973
• part of AGH University of Science and Technology in Krakow, Poland
• provides free computing resources for scientific institutions
• centre of competence in HPC and Grid Computing
• IT service management expertise (ITIL, ISO 20k)
• member of PIONIER consortium
• operator of Krakow MAN
• home for supercomputers
PL-Grid infrastructure
• Polish national IT infrastructure supporting e-Science
• based upon the resources of the most powerful academic compute centres
• compatible and interoperable with the European Grid
• offering grid and cloud computing paradigms
• coordinated by Cyfronet
Benefits for users
• unified infrastructure from 5 separate compute centres
• unified access to software, compute and storage resources
• non-trivial quality of service
Challenges
• unified monitoring, accounting, security
• creating an environment of cooperation rather than competition
Federation – the key to success
PLGrid Core project
• Competence Centre in the Field of Distributed Computing Grid Infrastructures
• Duration: 01.01.2014 – 30.11.2015
• Project Coordinator: Academic Computer Centre CYFRONET AGH
The main objective of the project is to support the development of ACC Cyfronet AGH as a specialized competence centre in the field of distributed computing infrastructures, with particular emphasis on grid technologies, cloud computing and infrastructures supporting computations on big data.
ZEUS 374 TFLOPS #269 on Top500
New building 5 MW, UPS + diesel
Prometheus – Phase 1
• Installed in Q2 2015
• HP Apollo 8000
• 13 m2, 15 racks (3 CDU, 12 compute)
• 1.65 PFLOPS
• 1728 nodes, Intel Haswell E5-2680v3
• 41472 cores, 13824 per island
• 216 TB DDR4 RAM
• N+1/N+N redundancy
Prometheus – Phase 2
• Installed in Q4 2015
• 4th island
• 432 regular nodes (2 CPUs, 128 GB RAM)
• 72 nodes with GPGPUs (2x NVIDIA Tesla K40 XL)
• 2.4 PFLOPS total performance (Rpeak)
• 2140 TFLOPS in CPUs
• 256 TFLOPS in GPUs
• 2232 nodes, 53568 CPU cores, 279 TB RAM
• <850 kW power (including cooling)
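The headline numbers above are internally consistent, and can be checked with simple arithmetic. A quick sanity check, assuming 24 cores per node (2x 12-core E5-2680v3), 128 GB RAM per node, and 16 double-precision flops/cycle/core at 2.5 GHz for the CPU Rpeak (the per-node figures are assumptions inferred from the slides, not stated as totals there):

```python
# Sanity check of the Prometheus Phase 1+2 totals from the slides.
# Assumptions: 24 cores/node, 128 GB RAM/node, 2.5 GHz base clock,
# 16 DP flops/cycle/core (AVX2 FMA) for the Haswell E5-2680v3.
nodes = 1728 + 432 + 72           # Phase 1 + Phase 2 regular + GPU nodes
cores = nodes * 24
ram_tb = nodes * 128 / 1024       # 128 GB per node -> TB

cpu_rpeak_tflops = cores * 2.5e9 * 16 / 1e12

print(nodes)                      # 2232
print(cores)                      # 53568
print(round(ram_tb))              # 279
print(round(cpu_rpeak_tflops))    # 2143, matching the ~2140 TFLOPS on the slide
```

The small gap between 2143 and the quoted 2140 TFLOPS is rounding in the slide, not a discrepancy in node counts.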
Prometheus storage
• Diskless compute nodes
• Separate procurement for storage
• Lustre on top of DDN hardware
• Two filesystems:
  • Scratch: 120 GB/s, 5 PB usable space
  • Archive: 60 GB/s, 5 PB usable space
• HSM-ready
• NFS for home directories and software
Prometheus: IB fabric
• Core IB switches at the top of the fabric
• Service island: service nodes and storage nodes
• 3 compute islands of 576 CPU nodes each
• 1 compute island of 432 CPU nodes + 72 GPU nodes
Why liquid cooling?
• Water: up to 1000x more efficient heat exchange than air
• Less energy needed to move the coolant
• Hardware (CPUs, DIMMs) can handle ~80 °C
• Challenge: cool 100% of HW with liquid
  • network switches
  • PSUs
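The first two bullets can be illustrated with Q = ṁ·c_p·ΔT and textbook material properties (water: c_p ≈ 4186 J/(kg·K), ρ ≈ 998 kg/m³; air: c_p ≈ 1005 J/(kg·K), ρ ≈ 1.2 kg/m³). The 10 K coolant temperature rise below is an assumption for illustration; the 850 kW heat load is taken from the Phase 2 slide:

```python
# Volumetric coolant flow needed to remove a heat load Q at a temperature
# rise dT: from Q = m_dot * c_p * dT, the volume flow is Q / (c_p * dT * rho).
# The 10 K rise is an assumed illustrative value.
Q = 850e3            # heat load in W (Phase 2 power envelope)
dT = 10.0            # coolant temperature rise in K (assumption)

def flow_m3_per_s(cp, rho):
    """Volumetric flow rate (m^3/s) needed to carry Q at temperature rise dT."""
    return Q / (cp * dT) / rho

water = flow_m3_per_s(4186.0, 998.0)   # ~0.02 m^3/s
air   = flow_m3_per_s(1005.0, 1.2)     # ~70 m^3/s

print(round(air / water))              # ~3464: air needs thousands of times
                                       # the volume flow, hence far more fan energy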
What about MTBF?
• The less movement the better
  • pumps
  • fans
  • HDDs
• Example
  • pump MTBF: 50 000 h
  • fan MTBF: 50 000 h
  • 2300-node system MTBF: ~5 h
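The system-level figure follows from adding failure rates: with exponentially distributed failure times, N independent components of MTBF m give a system MTBF of m/N. A minimal sketch, assuming roughly 4 moving parts per node (one pump plus fans; the per-node count is an assumption needed to reproduce the slide's "~5 h", not stated on the slide):

```python
# System MTBF for N independent components with exponential failure times:
# failure rates add, so MTBF_system = MTBF_component / N.
component_mtbf_h = 50_000          # pump/fan MTBF from the slide
nodes = 2300
moving_parts_per_node = 4          # assumption: 1 pump + 3 fans per node

n = nodes * moving_parts_per_node
system_mtbf_h = component_mtbf_h / n
print(round(system_mtbf_h, 1))     # 5.4 -- consistent with the slide's "~5 h"
```

This is why removing pumps and fans from the nodes (as warm-water cooling allows) matters so much at this scale: the expected time between moving-part failures somewhere in the machine is measured in hours, not years.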
Why Apollo 8000?
• Most energy efficient
• The only solution with 100% warm-water cooling
• Highest density
• Lowest TCO
Even more Apollo
• Focuses also on the ‘1’ in PUE!
• Power distribution
• Fewer fans
• Detailed monitoring
  • ‘energy to solution’
• Dry node maintenance
• Fewer cables
• Prefabricated piping
• Simplified management
System software
• CentOS 7
• Boot to RAM over IB, image distribution with HTTP
  • Whole machine boots up in 10 minutes with just 1 boot server
• Hostname/IP generator based on MAC collector
  • Data automatically collected from APM and iLO
• Graphical monitoring of power, temperature and network traffic
  • SNMP data source
  • GUI allows easy problem location
  • Now synced with SLURM
• Spectacular iLO LED blinking system developed for the official launch
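A minimal sketch of what a hostname/IP generator driven by a MAC collector could look like. The naming scheme, subnet, and function names below are hypothetical illustrations; the slides only say that MAC data is collected automatically from APM and iLO:

```python
# Hypothetical sketch: turn a list of collected node MACs into stable
# hostname/IP assignments. Prefix, numbering and subnet are assumptions.
import ipaddress

def assign(macs, prefix="p", network="10.1.0.0/16", first_host=10):
    """Deterministically assign hostnames p0001, p0002, ... and sequential
    IPs to MAC addresses, in sorted MAC order so reruns are stable."""
    net = ipaddress.ip_network(network)
    base = int(net.network_address)
    table = {}
    for i, mac in enumerate(sorted(macs), start=1):
        host = f"{prefix}{i:04d}"
        ip = str(ipaddress.ip_address(base + first_host + i - 1))
        table[mac] = (host, ip)
    return table

nodes = assign(["94:18:82:aa:00:02", "94:18:82:aa:00:01"])
print(nodes["94:18:82:aa:00:01"])  # ('p0001', '10.1.0.10')
```

The point of sorting by MAC is determinism: the same collector output always yields the same hostname/IP map, which is what lets a single boot server serve correct per-node configuration.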
Real application performance
• Prometheus vs. Zeus: in theory a 4x difference core to core
• Storage system (scratch) 10x faster
• More time to focus on the most popular codes
  • COSMOS++ – 4.4x
  • Quantum Espresso – 5.6x
  • ADF – 6x
  • Widely used QC code with the name derived from a famous mathematician – 2x
Future plans
• Continue to move users from the previous system
• Add a few large-memory nodes
• Further improvements of the monitoring tools
  • Detailed energy and temperature monitoring
  • Energy-aware scheduling
• Collect the annual energy and PUE