Towards energy efficient HPC: HP Apollo 8000 at Cyfronet, Part I. Patryk Lasoń, Marek Magryś
ACC Cyfronet AGH-UST • established in 1973 • part of AGH University of Science and Technology in Krakow, PL • provides free computing resources for scientific institutions • centre of competence in HPC and Grid Computing • IT service management expertise (ITIL, ISO 20k) • member of PIONIER • operator of the Krakow MAN • home of Zeus
PL-Grid Consortium • Consortium created in January 2007 • a response to the requirements of Polish scientists • driven by ongoing Grid activities in Europe (EGEE, EGI_DS) • Aim: a significant extension of the computing resources provided to the scientific community (start of the PL-Grid Programme) • Development based on: • projects funded by the European Regional Development Fund as part of the Innovative Economy Programme • close international collaboration (EGI, …) • previous projects (FP5, FP6, FP7, EDA…) • National network infrastructure available: Pionier National Project • computing resources: Top500 list • Polish scientific communities: ~75% of highly rated Polish publications come from 5 communities. PL-Grid Consortium members: 5 Polish High Performance Computing centres, representing these communities, coordinated by ACC Cyfronet AGH
PL-Grid infrastructure • Polish national IT infrastructure supporting e-Science • based upon the resources of the most powerful academic computing centres • compatible and interoperable with the European Grid • offering both grid and cloud computing paradigms • coordinated by Cyfronet • Benefits for users: • one infrastructure instead of 5 separate compute centres • unified access to software, compute and storage resources • non-trivial quality of service • Challenges: • unified monitoring, accounting, security • creating an environment of cooperation rather than competition • Federation – the key to success
PLGrid Core project – Competence Centre in the Field of Distributed Computing Grid Infrastructures • Budget: 104 949 901.16 PLN total, including 89 207 415.99 PLN of EC funding • Duration: 01.01.2014 – 30.11.2015 • Project Coordinator: Academic Computer Centre CYFRONET AGH. The main objective of the project is to support the development of ACC Cyfronet AGH as a specialized competence centre in the field of distributed computing infrastructures, with particular emphasis on grid technologies, cloud computing and infrastructures supporting computations on big data.
PLGrid Core project – services • Basic infrastructure services • Uniform access to distributed data • PaaS Cloud for scientists • Applications maintenance environment of the MapReduce type • End-user services • Technologies and environments implementing the Open Science paradigm • Computing environment for interactive processing of scientific data • Platform for development and execution of large-scale applications organized in a workflow • Automatic selection of scientific literature • Environment supporting data farming mass computations
HPC at Cyfronet • systems deployed 2007–2013: Baribal, Panda, Zeus, Mars, Zeus vSMP, Platon U3, Zeus FPGA, Zeus GPU
Zeus: 374 TFLOPS • #176 on the Top500 list, #1 in Poland
Zeus • over 1300 servers • HP BL2x220c blades • HP BL685c fat nodes (64 cores, 256 GB) • HP BL490c vSMP nodes (up to 768 cores, 6 TB) • HP SL390s GPGPU (2x, 8x) nodes • Infiniband QDR (Mellanox + Qlogic) • >3 PB of disk storage (Lustre + GPFS) • Scientific Linux 6, Torque/Moab
Zeus - statistics • 2400 registered users • >2000 jobs running simultaneously • >22000 jobs per day • 96 000 000 computing hours in 2013 • jobs lasting from minutes to weeks • jobs from 1 core to 4000 cores
Cooling • hot aisle / cold aisle layout: racks draw 20°C air from the cold aisle and exhaust ~40°C air into the hot aisles on either side
Why upgrade? • Jobs growing • Users hate queuing • New users, new requirements • Technology moving forward • Power bill staying the same
Requirements • Petascale system • Lowest TCO • Energy efficient • Dense • Good MTBF • Hardware: • core count • memory size • network topology • storage
Direct Liquid Cooling! • Up to 1000x more efficient heat exchange than air • Less energy needed to move the coolant • Hardware can handle warm coolant: • CPUs ~70°C • memory ~80°C • Hard to cool 100% of the HW with liquid: • network switches • PSUs
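A rough back-of-the-envelope check of why liquid cooling needs so little pumping energy (the figures below are textbook approximations, not from the slides: water at ~4.18 J/(g·K), air at ~1.005 J/(g·K) and ~0.0012 g/cm³):

```python
# Volumetric heat capacity: how much heat a unit volume of coolant
# carries per degree of temperature rise.
water = 4.18 * 1.0      # J/(g*K) * density in g/cm^3
air = 1.005 * 0.0012    # J/(g*K) * density in g/cm^3

ratio = water / air
print(round(ratio))  # ~3466: water moves thousands of times more heat per volume
```

So per unit volume moved, water carries a few thousand times more heat than air; the slide's "1000x" figure is a conservative statement of the same effect for practical heat exchange.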
MTBF • The less movement the better: • fewer pumps • fewer fans • fewer HDDs • Example: • pump MTBF: 50 000 hrs • fan MTBF: 50 000 hrs • 1800-node system MTBF: ~7 hrs
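The slide's system-level number follows from the standard series-system model: with n independent components, each with exponential lifetime and individual MTBF m, the expected time to the first failure anywhere is m / n. A minimal sketch, assuming roughly four moving parts per node (e.g. 2 pumps + 2 fans, a count not stated on the slide):

```python
# Series-system MTBF under exponential failures:
# failure rates add, so system MTBF = component MTBF / component count.
def system_mtbf(component_mtbf_hrs: float, parts_per_node: int, nodes: int) -> float:
    return component_mtbf_hrs / (parts_per_node * nodes)

mtbf = system_mtbf(50_000, 4, 1800)
print(round(mtbf, 1))  # ~6.9 hrs, matching the slide's "7 hrs"
```

This is why removing pumps and fans from the nodes matters far more at scale than any single component's rating suggests.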
The topology • Service island: core IB switches, service nodes, storage nodes • 3 computing islands of 576 computing nodes each
It should count • Max job size ~10k cores • Fastest CPUs, but compatible with old codes • Two sockets are enough • CPUs, not accelerators • Newest memory • and more than before • Fast interconnect • still Infiniband • but no need for a full CBB fat tree
The hard part • Public institution, public tender • Strict requirements: • 1.65 PFLOPS, max. 1728 servers • 128 GB DDR4 per node • warm water cooling, no pumps inside nodes • Infiniband topology • compute + cooling, dry-cooler only • Criteria: price, power, space
And the winner is… • HP Apollo 8000 • Most energy efficient • The only solution with 100% warm water cooling • Least floor space needed • Lowest TCO
Even more Apollo • Focuses also on the ‘1’ in PUE, i.e. the IT power itself! • Power distribution • Fewer fans • Detailed monitoring • ‘energy to solution’ • Safer maintenance • Fewer cables • Prefabricated piping • Simplified management
System configuration • 1.65 PFLOPS (within the first 30 of the current Top500) • 1728 nodes, Intel Haswell E5-2680v3 • 41472 cores, 13824 per island • 216 TB DDR4 RAM • PUE ~1.05, 680 kW total power • 15 racks, 12.99 m2 • System ready for non-disruptive upgrade • Scientific Linux 6 or 7
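The headline figures above are mutually consistent. Peak double-precision performance is cores × clock × FLOPs/cycle; the E5-2680 v3 is a 12-core part, and assuming its 2.5 GHz base clock with AVX2 + FMA (16 DP FLOPs/cycle/core, a standard Haswell figure not stated on the slide), the arithmetic reproduces both the core count and the 1.65 PFLOPS. The PUE line likewise pins down the IT load:

```python
# Peak DP performance: nodes x sockets x cores x clock x FLOPs/cycle.
nodes, sockets, cores_per_cpu = 1728, 2, 12
clock_ghz, flops_per_cycle = 2.5, 16

total_cores = nodes * sockets * cores_per_cpu
peak_pflops = total_cores * clock_ghz * flops_per_cycle / 1e6  # GFLOPS -> PFLOPS
print(total_cores, round(peak_pflops, 2))  # 41472, 1.66

# PUE = total facility power / IT power, so PUE ~1.05 at 680 kW total
# implies roughly 648 kW of IT load and only ~32 kW of cooling overhead.
it_power_kw = 680 / 1.05
print(round(it_power_kw))  # ~648
```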
Prometheus • Created humans • Gave fire to the people • Accelerated innovation • Defeated Zeus
Deployment plan • Contract signed on 20.10.2014 • Installation of the primary loop started on 12.11.2014 • First delivery (service island) expected on 24.11.2014 • Apollo piping should arrive before Christmas • Main delivery in January • Installation and acceptance in February • In production from Q2 2015
Future plans • Benchmarking and Top500 submission • Evaluation of Scientific Linux 7 • Moving users from the previous system • Tuning of applications • Energy-aware scheduling • First experience presented at HP-CAST 24
More information • www.cyfronet.krakow.pl/en • www.plgrid.pl/en