Towards energy efficient HPC: HP Apollo 8000 at Cyfronet, Part I. Patryk Lasoń, Marek Magryś
ACC Cyfronet AGH-UST • established in 1973 • part of AGH University of Science and Technology in Krakow, PL • provides free computing resources for scientific institutions • centre of competence in HPC and Grid Computing • IT service management expertise (ITIL, ISO 20k) • member of PIONIER • operator of the Krakow MAN • home of Zeus
PL-Grid Consortium • Consortium created in January 2007 • a response to the requirements of Polish scientists • driven by ongoing Grid activities in Europe (EGEE, EGI_DS) • Aim: a significant extension of the computing resources provided to the scientific community (start of the PL-Grid Programme) • Development based on: • projects funded by the European Regional Development Fund as part of the Innovative Economy Programme • close international collaboration (EGI, …) • previous projects (FP5, FP6, FP7, EDA…) • National network infrastructure available: Pionier National Project • computing resources: Top500 list • Polish scientific communities: ~75% of highly rated Polish publications come from 5 communities. PL-Grid Consortium members: 5 Polish High Performance Computing centres, representing these communities, coordinated by ACC Cyfronet AGH
PL-Grid infrastructure • Polish national IT infrastructure supporting e-Science • based upon the resources of the most powerful academic computing centres • compatible and interoperable with the European Grid • offering both grid and cloud computing paradigms • coordinated by Cyfronet • Benefits for users: • one infrastructure instead of 5 separate compute centres • unified access to software, compute and storage resources • non-trivial quality of service • Challenges: • unified monitoring, accounting, security • creating an environment of cooperation rather than competition • Federation – the key to success
PLGrid Core project – Competence Centre in the Field of Distributed Computing Grid Infrastructures • Budget: 104 949 901.16 PLN total, including 89 207 415.99 PLN of EC funding • Duration: 01.01.2014 – 30.11.2015 • Project Coordinator: Academic Computer Centre CYFRONET AGH. The main objective of the project is to support the development of ACC Cyfronet AGH as a specialized competence centre in the field of distributed computing infrastructures, with particular emphasis on grid technologies, cloud computing and infrastructures supporting computations on big data.
PLGrid Core project – services • Basic infrastructure services • Uniform access to distributed data • PaaS Cloud for scientists • Applications maintenance environment of the MapReduce type • End-user services • Technologies and environments implementing the Open Science paradigm • Computing environment for interactive processing of scientific data • Platform for development and execution of large-scale applications organized in a workflow • Automatic selection of scientific literature • Environment supporting data farming mass computations
HPC at Cyfronet • systems deployed 2007–2013: Baribal, Panda, Zeus, Mars, Zeus vSMP, Platon U3, Zeus FPGA, Zeus GPU
Zeus: 374 TFLOPS • #176 on the Top500 list, #1 in Poland
Zeus • over 1300 servers • HP BL2x220c blades • HP BL685c fat nodes (64 cores, 256 GB) • HP BL490c vSMP nodes (up to 768 cores, 6 TB) • HP SL390s GPGPU (2x, 8x) nodes • Infiniband QDR (Mellanox + Qlogic) • >3 PB of disk storage (Lustre + GPFS) • Scientific Linux 6, Torque/Moab
Zeus - statistics • 2400 registered users • >2000 jobs running simultaneously • >22000 jobs per day • 96 000 000 computing hours in 2013 • jobs lasting from minutes to weeks • jobs from 1 core to 4000 cores
Cooling • hot aisle / cold aisle layout: racks draw 20°C air from the cold aisle and exhaust ~40°C air into the hot aisles on either side
Why upgrade? • Jobs growing • Users hate queuing • New users, new requirements • Technology moving forward • Power bill staying the same
Requirements • Petascale system • Lowest TCO • Energy efficient • Dense • Good MTBF • Hardware: • core count • memory size • network topology • storage
Direct Liquid Cooling! • Up to 1000x more efficient heat exchange than air • Less energy needed to move the coolant • Hardware can handle warm coolant: • CPUs ~70°C • memory ~80°C • Hard to cool 100% of the HW with liquid: • network switches • PSUs
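A rough back-of-the-envelope check of why liquid cooling needs so little pumping energy (the figures below are textbook approximations, not from the slides: water at ~4.18 J/(g·K), air at ~1.005 J/(g·K) and ~0.0012 g/cm³):

```python
# Volumetric heat capacity: how much heat a unit volume of coolant
# carries per degree of temperature rise.
water = 4.18 * 1.0      # J/(g*K) * density in g/cm^3
air = 1.005 * 0.0012    # J/(g*K) * density in g/cm^3

ratio = water / air
print(round(ratio))  # ~3466: water moves thousands of times more heat per volume
```

So per unit volume moved, water carries a few thousand times more heat than air; the slide's "1000x" figure is a conservative statement of the same effect for practical heat exchange.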
MTBF • The less movement the better: • fewer pumps • fewer fans • fewer HDDs • Example: • pump MTBF: 50 000 hrs • fan MTBF: 50 000 hrs • 1800-node system MTBF: ~7 hrs
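The slide's system-level number follows from the standard series-system model: with n independent components, each with exponential lifetime and individual MTBF m, the expected time to the first failure anywhere is m / n. A minimal sketch, assuming roughly four moving parts per node (e.g. 2 pumps + 2 fans, a count not stated on the slide):

```python
# Series-system MTBF under exponential failures:
# failure rates add, so system MTBF = component MTBF / component count.
def system_mtbf(component_mtbf_hrs: float, parts_per_node: int, nodes: int) -> float:
    return component_mtbf_hrs / (parts_per_node * nodes)

mtbf = system_mtbf(50_000, 4, 1800)
print(round(mtbf, 1))  # ~6.9 hrs, matching the slide's "7 hrs"
```

This is why removing pumps and fans from the nodes matters far more at scale than any single component's rating suggests.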
The topology • Service island: core IB switches, service nodes, storage nodes • 3 computing islands of 576 computing nodes each
It should count • Max job size ~10k cores • Fastest CPUs, but compatible with old codes • Two sockets are enough • CPUs, not accelerators • Newest memory • and more than before • Fast interconnect • still Infiniband • but no need for a full CBB fat tree
The hard part • Public institution, public tender • Strict requirements: • 1.65 PFLOPS, max. 1728 servers • 128 GB DDR4 per node • warm water cooling, no pumps inside nodes • Infiniband topology • compute + cooling, dry-cooler only • Criteria: price, power, space
And the winner is… • HP Apollo 8000 • Most energy efficient • The only solution with 100% warm water cooling • Least floor space needed • Lowest TCO
Even more Apollo • Focuses also on the ‘1’ in PUE, i.e. the IT power itself! • Power distribution • Fewer fans • Detailed monitoring • ‘energy to solution’ • Safer maintenance • Fewer cables • Prefabricated piping • Simplified management
System configuration • 1.65 PFLOPS (within the first 30 of the current Top500) • 1728 nodes, Intel Haswell E5-2680v3 • 41472 cores, 13824 per island • 216 TB DDR4 RAM • PUE ~1.05, 680 kW total power • 15 racks, 12.99 m2 • System ready for non-disruptive upgrade • Scientific Linux 6 or 7
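The headline figures above are mutually consistent. Peak double-precision performance is cores × clock × FLOPs/cycle; the E5-2680 v3 is a 12-core part, and assuming its 2.5 GHz base clock with AVX2 + FMA (16 DP FLOPs/cycle/core, a standard Haswell figure not stated on the slide), the arithmetic reproduces both the core count and the 1.65 PFLOPS. The PUE line likewise pins down the IT load:

```python
# Peak DP performance: nodes x sockets x cores x clock x FLOPs/cycle.
nodes, sockets, cores_per_cpu = 1728, 2, 12
clock_ghz, flops_per_cycle = 2.5, 16

total_cores = nodes * sockets * cores_per_cpu
peak_pflops = total_cores * clock_ghz * flops_per_cycle / 1e6  # GFLOPS -> PFLOPS
print(total_cores, round(peak_pflops, 2))  # 41472, 1.66

# PUE = total facility power / IT power, so PUE ~1.05 at 680 kW total
# implies roughly 648 kW of IT load and only ~32 kW of cooling overhead.
it_power_kw = 680 / 1.05
print(round(it_power_kw))  # ~648
```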
Prometheus • Created humans • Gave fire to the people • Accelerated innovation • Defeated Zeus
Deployment plan • Contract signed on 20.10.2014 • Installation of the primary loop started on 12.11.2014 • First delivery (service island) expected on 24.11.2014 • Apollo piping should arrive before Christmas • Main delivery in January • Installation and acceptance in February • In production from Q2 2015
Future plans • Benchmarking and Top500 submission • Evaluation of Scientific Linux 7 • Moving users from the previous system • Tuning of applications • Energy-aware scheduling • First experience presented at HP-CAST 24
More information • www.cyfronet.krakow.pl/en • www.plgrid.pl/en