
Further expansion of HPE’s largest warm-water-cooled Apollo 8000 system: „Prometheus” at Cyfronet

Patryk Lasoń, Marek Magryś
ACC Cyfronet AGH-UST – established in 1973, part of AGH University of Science and Technology in Krakow, Poland


Presentation Transcript


  1. Further expansion of HPE’s largest warm-water-cooled Apollo 8000 system: „Prometheus” at Cyfronet
  Patryk Lasoń, Marek Magryś

  2. ACC Cyfronet AGH-UST
  • established in 1973
  • part of AGH University of Science and Technology in Krakow, Poland
  • provides free computing resources for scientific institutions
  • centre of competence in HPC and Grid Computing
  • IT service management expertise (ITIL, ISO 20k)
  • member of the PIONIER consortium
  • operator of the Krakow MAN
  • home for supercomputers

  3. PL-Grid infrastructure
  • Polish national IT infrastructure supporting e-Science
  • based upon the resources of the most powerful academic computing centres
  • compatible and interoperable with the European Grid
  • offering grid and cloud computing paradigms
  • coordinated by Cyfronet
  Benefits for users:
  • unified infrastructure from 5 separate compute centres
  • unified access to software, compute and storage resources
  • non-trivial quality of service
  Challenges:
  • unified monitoring, accounting, security
  • create an environment of cooperation rather than competition
  Federation – the key to success

  4. PLGrid Core project
  • Competence Centre in the Field of Distributed Computing Grid Infrastructures
  • Duration: 01.01.2014 – 30.11.2015
  • Project Coordinator: Academic Computer Centre CYFRONET AGH
  The main objective of the project is to support the development of ACC Cyfronet AGH as a specialized competence centre in the field of distributed computing infrastructures, with particular emphasis on grid technologies, cloud computing and infrastructures supporting computations on big data.

  5. Zeus – 374 TFLOPS, #269 on Top500

  6. Zeus usage

  7. New building 5 MW, UPS + diesel

  8. Prometheus – 2.4 PFLOPS, #49 on Top500

  9. Prometheus – Phase 1
  • Installed in Q2 2015
  • HP Apollo 8000
  • 13 m², 15 racks (3 CDU, 12 compute)
  • 1.65 PFLOPS
  • 1728 nodes, Intel Haswell E5-2680v3
  • 41472 cores, 13824 per island
  • 216 TB DDR4 RAM
  • N+1/N+N redundancy

  10. Prometheus – Phase 2
  • Installed in Q4 2015
  • 4th island
  • 432 regular nodes (2 CPUs, 128 GB RAM)
  • 72 nodes with GPGPUs (2x NVIDIA Tesla K40 XL)
  • 2.4 PFLOPS total performance (Rpeak)
  • 2140 TFLOPS in CPUs
  • 256 TFLOPS in GPUs
  • 2232 nodes, 53568 CPU cores, 279 TB RAM
  • <850 kW power (including cooling)
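
Not from the slides, but a quick back-of-the-envelope check of the quoted Rpeak figures. A minimal sketch, assuming the E5-2680v3's 2.5 GHz base clock and 16 double-precision FLOPs per cycle per core (Haswell AVX2 FMA):

```python
# Sanity check of the Rpeak numbers quoted on slides 9 and 10.
# Assumptions (not stated on the slides): 2.5 GHz base clock and
# 16 DP FLOPs/cycle/core (two AVX2 FMA units on Haswell).

CORES_PER_NODE = 2 * 12       # dual-socket E5-2680v3, 12 cores per socket
GHZ = 2.5
FLOPS_PER_CYCLE = 16

def cpu_rpeak_tflops(nodes: int) -> float:
    """Theoretical peak in TFLOPS for a given number of CPU nodes."""
    return nodes * CORES_PER_NODE * GHZ * FLOPS_PER_CYCLE / 1000.0

print(cpu_rpeak_tflops(1728))        # Phase 1: ~1659 TFLOPS (slide: 1.65 PFLOPS)
print(cpu_rpeak_tflops(2232))        # Phase 2 CPUs: ~2143 TFLOPS (slide: 2140 TFLOPS)
print(cpu_rpeak_tflops(2232) + 256)  # CPUs + GPUs: ~2.4 PFLOPS total
```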

  11. Prometheus storage
  • Diskless compute nodes
  • Separate procurement for storage
  • Lustre on top of DDN hardware
  • Two filesystems:
    • Scratch: 120 GB/s, 5 PB usable space
    • Archive: 60 GB/s, 5 PB usable space
  • HSM-ready
  • NFS for home directories and software

  12. Prometheus: IB fabric
  • Core IB switches at the top of the fabric
  • Service island: service nodes and storage nodes
  • Four compute islands: 3 × 576 CPU nodes, plus one island with 432 CPU nodes and 72 GPU nodes

  13. Why liquid cooling?
  • Water: up to 1000x more efficient heat exchange than air
  • Less energy needed to move the coolant
  • Hardware (CPUs, DIMMs) can handle ~80°C
  • Challenge: cool 100% of the HW with liquid
    • network switches
    • PSUs
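
To put the efficiency claim in context, a minimal sketch (illustrative textbook properties, not slide data) of the heat Q = ρ·V̇·c_p·ΔT carried by equal volume flows of water and air at the same temperature rise:

```python
# Heat carried by a coolant stream: Q = rho * V_dot * c_p * dT.
# Property values are standard textbook figures at ~20 C, not slide data;
# the exact water-vs-air multiplier depends on what is being compared.

RHO_WATER, CP_WATER = 998.0, 4186.0   # kg/m^3, J/(kg*K)
RHO_AIR,   CP_AIR   = 1.2,   1005.0   # kg/m^3, J/(kg*K)

def heat_kw(rho: float, cp: float, vdot: float, dt: float) -> float:
    """Heat removed (kW) by volume flow vdot (m^3/s) with temperature rise dt (K)."""
    return rho * vdot * cp * dt / 1000.0

vdot, dt = 0.001, 10.0                 # 1 l/s of coolant, assumed 10 K rise
water = heat_kw(RHO_WATER, CP_WATER, vdot, dt)   # ~41.8 kW
air   = heat_kw(RHO_AIR,   CP_AIR,   vdot, dt)   # ~0.012 kW
print(f"water carries ~{water / air:.0f}x more heat per unit volume flow")
```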

  14. What about MTBF?
  • The less movement the better:
    • pumps
    • fans
    • HDDs
  • Example:
    • pump MTBF: 50 000 h
    • fan MTBF: 50 000 h
    • 2300 node system MTBF: ~5 h
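
The ~5 h figure follows from the standard series-system model: with independent, exponentially distributed failures, component failure rates add, so the system MTBF is the component MTBF divided by the component count. A minimal sketch (the per-node count of moving parts is an assumption, not stated on the slide):

```python
# Series-system MTBF: failure rates add, so MTBF_system = MTBF_component / N
# (assuming independent components with exponential failure times).

COMPONENT_MTBF_H = 50_000   # pump/fan MTBF from the slide
NODES = 2300
MOVING_PARTS_PER_NODE = 4   # assumed pumps+fans per node; not on the slide

mtbf_system_h = COMPONENT_MTBF_H / (NODES * MOVING_PARTS_PER_NODE)
print(f"system MTBF: ~{mtbf_system_h:.1f} h")   # ~5.4 h, matching the quoted ~5 h
```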

  15. Why Apollo 8000?
  • Most energy efficient
  • The only solution with 100% warm water cooling
  • Highest density
  • Lowest TCO

  16. Even more Apollo
  • Focuses also on the ‘1’ in PUE!
  • Power distribution
  • Fewer fans
  • Detailed monitoring: ‘energy to solution’
  • Dry node maintenance
  • Fewer cables
  • Prefabricated piping
  • Simplified management
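
For reference, PUE = total facility power / IT power, so the best achievable value is 1.0; the point above is that Apollo 8000 also trims the IT term itself (fans, power distribution), not just the cooling overhead. A minimal illustration with invented numbers (the slides quote no power breakdown):

```python
# PUE = total facility power / IT equipment power (standard definition).
# Numbers below are invented for illustration only.

def pue(it_kw: float, cooling_kw: float, losses_kw: float) -> float:
    return (it_kw + cooling_kw + losses_kw) / it_kw

print(f"PUE = {pue(it_kw=800.0, cooling_kw=40.0, losses_kw=16.0):.2f}")  # 1.07
```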

  17. Secondary loop

  18. System software
  • CentOS 7
  • Boot to RAM over IB, image distribution with HTTP
  • Whole machine boots up in 10 minutes with just 1 boot server
  • Hostname/IP generator based on MAC collector (see the sketch below)
  • Data automatically collected from APM and iLO
  • Graphical monitoring of power, temperature and network traffic
  • SNMP data source
  • GUI allows easy problem location
  • Now synced with SLURM
  • Spectacular iLO LED blinking system developed for the official launch
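
A minimal sketch of what the MAC-based hostname/IP generator could look like (hypothetical: the slides describe the idea, but the naming scheme, subnet and dnsmasq-style output below are invented for illustration):

```python
# Hypothetical MAC-collector-driven hostname/IP generator.
# The slides only say hostnames/IPs are generated from MACs collected
# via APM and iLO; scheme, subnet and output format are assumptions.

import ipaddress

def assign_nodes(macs, prefix="node", subnet="10.1.0.0/16", skip=9):
    """Map collected MACs to sequential hostnames and IPs."""
    hosts = ipaddress.ip_network(subnet).hosts()
    for _ in range(skip):                  # reserve the first addresses
        next(hosts)
    for i, mac in enumerate(sorted(m.lower() for m in macs), start=1):
        yield mac, f"{prefix}{i:04d}", str(next(hosts))

collected = ["EC:B1:D7:00:00:2A", "ec:b1:d7:00:00:1f"]   # from the MAC collector
for mac, host, ip in assign_nodes(collected):
    print(f"dhcp-host={mac},{host},{ip}")  # one dnsmasq-style line per node
```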

  19. HPL: power usage

  20. HPL: water temperature

  21. Real application performance
  • Prometheus vs. Zeus: in theory a 4x difference core to core
  • Storage system (scratch) 10x faster
  • More time to focus on the most popular codes:
    • COSMOS++ – 4.4x
    • Quantum Espresso – 5.6x
    • ADF – 6x
    • Widely used QC code with the name derived from a famous mathematician – 2x

  22. Future plans
  • Continue to move users from the previous system
  • Add a few large-memory nodes
  • Further improvements of the monitoring tools
  • Detailed energy and temperature monitoring
  • Energy-aware scheduling
  • Collect the annual energy and PUE figures
