HPGC 2006 Workshop on High-Performance Grid Computing at IPDPS 2006, Rhodes Island, Greece, April 25–29, 2006. Major HPC Grid Projects: From Grid Testbeds to Sustainable High-Performance Grid Infrastructures. Wolfgang Gentzsch, D-Grid, RENCI, GGF GFSG, e-IRG, wgentzsch@d-grid.de. Thanks to: Eric Aubanel, Virendra Bhavsar, Michael Frumkin, Rob F. Van der Wijngaart, and Intel
Focus • … on HPC capabilities of grids • … on sustainable grid infrastructures • … on six selected major HPC grid projects: UK e-Science, US TeraGrid, NAREGI (Japan), EGEE and DEISA (Europe), D-Grid (Germany) • … with apologies for not mentioning your favorite grid project, but…
UK e-Science Grid: started in early 2001, $400 million, application independent. (Map of sites: Edinburgh, Glasgow, Newcastle, Belfast, Manchester, Daresbury Laboratory (DL), Cambridge, Oxford, Hinxton, RAL, Cardiff, London, Southampton.)
NGS Overview: User view • Resources • 4 core clusters • UK's national HPC services • A range of partner contributions • Access • Supports UK academic researchers • Lightweight peer review for limited "free" resources • Central help desk • www.grid-support.ac.uk
NGS Overview: Organisational view • Management • GOSC Board • Strategic direction • Technical Board • Technical coordination and policy • Grid Operations Support Centre • Manages the NGS • Operates the UK CA and over 30 RAs • Operates the central helpdesk • Policies and procedures • Manages and monitors partners
NGS use: over 320 users. (Charts: users by discipline (particle physics + astronomy, large facilities, engineering + physical sciences, biology, environmental sciences, humanities, medicine, sociology), users by institution, CPU time by user, and files stored.)
NGS Development • Core node refresh • Expand partnership • HPC • Campus grids • Data centres • Digital repositories • Experimental facilities • Baseline services • Aim to map user requirements onto standard solutions • Support convergence/interoperability • Move further towards project (VO) support • Support collaborative projects • Mixed economy • Core resources • Shared resources • Project/contract-specific resources
The Architecture of Gateway Services (diagram, courtesy Jay Boisseau): the user's desktop and a grid portal server sit on top of the TeraGrid gateway services (proxy certificate server/vault, user metadata catalog, application workflow, application deployment, application events, resource broker, application/resource catalogs, replica management), which build on core grid services (security, notification service, resource allocation, grid orchestration, data management service, accounting service, policy, reservations and scheduling, administration and monitoring), layered over the Web Services Resource Framework / Web Services Notification and the physical resource layer.
TeraGrid use (usage charts showing user counts of 1,600 users and 600 users).
Delivering User Priorities in 2005: results of in-depth discussions with 16 TeraGrid user teams during the first annual user survey (August 2004). Capabilities, rated by overall score (depth of need) and partners in need (breadth of need): remote file read/write, high-performance file transfer, coupled applications and co-scheduling, grid portal toolkits, grid workflow tools, batch metascheduling, global file system, client-side computing tools, batch-scheduled parameter sweep tools, and advanced reservations. Capability types: data, grid computing, science gateways.
National Research Grid Infrastructure (NAREGI), 2003–2007 • Petascale grid infrastructure R&D for future deployment • $45 million (US) + $16 million x 5 (2003–2007) = $125 million total • PL: Ken Miura (Fujitsu, now NII); Sekiguchi (AIST), Matsuoka (Titech), Shimojo (Osaka-U), Aoyagi (Kyushu-U), … • Participation by multiple (>= 3) vendors: Fujitsu, NEC, Hitachi, NTT, etc. • Not an academic project: ~100 FTEs • Follows and contributes to GGF standardization, esp. OGSA • (Diagram: focused "grand challenge" grid application areas (nanotech grid apps on the "NanoGrid" at IMS, ~10 TF; biotech grid apps, BioGrid/RIKEN; other apps at other institutes), connected via SuperSINET to the national research grid middleware R&D, the national AAA infrastructure, and a grid R&D infrastructure growing from 15 TF to 100 TF; participants include NII, IMS, AIST, Titech, Osaka-U, U-Kyushu, Fujitsu, NEC, and Hitachi.)
NAREGI Software Stack (beta version, 2006): grid-enabled nano-applications (WP6); grid PSE, grid visualization, and grid programming (WP2: GridRPC, GridMPI); grid workflow based on WFML (UNICORE + WF) (WP3); distributed information service (CIM) and data services (WP4); super scheduler and GridVM (WP1); packaging based on WSRF (GT4 + Fujitsu WP1) together with GT4 and other services; grid security and high-performance grid networking (WP5); all running over SuperSINET, which connects NII, research organizations such as IMS, and the major university computing centers providing the computing resources and virtual organizations.
GridMPI • MPI applications run on the grid environment • Metropolitan-area, high-bandwidth environment: 10 Gbps, 500 miles (less than 10 ms one-way latency) • Parallel computation • Larger than a metropolitan area • MPI-IO • (Diagram: a single, monolithic MPI application spanning computing resource sites A and B over a wide-area network.)
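To make the latency point concrete, here is a minimal ping-pong sketch in standard MPI (written with mpi4py for brevity; it illustrates the measurement idea and is not GridMPI-specific code). Run across two sites, half the averaged round-trip time approximates the one-way latency that dominates tightly coupled communication.

```python
# Minimal MPI ping-pong latency probe (illustrative sketch, not GridMPI-specific).
# Run with: mpiexec -n 2 python pingpong.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
reps = 1000
msg = b"x"  # tiny payload so timing is dominated by latency, not bandwidth

comm.Barrier()
start = MPI.Wtime()
for _ in range(reps):
    if rank == 0:
        comm.send(msg, dest=1, tag=0)   # ping
        comm.recv(source=1, tag=1)      # pong
    elif rank == 1:
        comm.recv(source=0, tag=0)
        comm.send(msg, dest=0, tag=1)
elapsed = MPI.Wtime() - start

if rank == 0:
    # Half of the average round-trip time approximates the one-way latency.
    print(f"approx. one-way latency: {elapsed / reps / 2 * 1e6:.1f} microseconds")
```

Inside a cluster this typically reports a few microseconds; across a wide-area link it jumps to milliseconds, which is exactly the regime GridMPI targets.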
EGEE Infrastructure Scale: > 180 sites in 39 countries, ~20,000 CPUs, > 5 PB storage, > 10,000 concurrent jobs per day, > 60 virtual organisations. (Map: countries participating in EGEE.)
The EGEE project • Objectives • Large-scale, production-quality infrastructure for e-Science • leveraging national and regional grid activities worldwide • consistent, robust and secure • improving and maintaining the middleware • attracting new resources and users from industry as well as science • EGEE • 1st April 2004 – 31 March 2006 • 71 leading institutions in 27 countries, federated in regional Grids • EGEE-II • Proposed start 1 April 2006 (for 2 years) • Expanded consortium • > 90 partners in 32 countries (also non-European partners) • Related projects, incl. • BalticGrid • SEE-GRID • EUMedGrid • EUChinaGrid • EELA
Applications on EGEE • More than 20 applications from 7 domains • High energy physics: 4 LHC experiments (ALICE, ATLAS, CMS, LHCb); BaBar, CDF, DØ, ZEUS • Biomedicine: bioinformatics (Drug Discovery, GPS@, Xmipp_MLrefine, etc.); medical imaging (GATE, CDSS, gPTM3D, SiMRI 3D, etc.) • Earth sciences: Earth observation, solid Earth physics, hydrology, climate • Computational chemistry • Astronomy: MAGIC, Planck • Geophysics: EGEODE • Financial simulation: E-GRID • Another 8 applications from 4 domains are in the evaluation stage
Steps for "Grid-enabling" applications II • Tools to easily access grid resources through high-level grid middleware (gLite) • VO management (VOMS etc.) • Workload management • Data management • Information and monitoring • An application can either interface directly to gLite or use higher-level services such as portals, application-specific workflow systems, etc.
EGEE Performance Measurements • Information about resources (static & dynamic) • Computing: machine properties (CPUs, memory architecture, ..), platform properties (OS, compiler, other software, …), load • Data: storage location, access properties, load • Network: bandwidth, load • Information about applications • Static: computing and data requirements to reduce search space • Dynamic: changes in computing and data requirements (might need re-scheduling) Plus • Information about Grid services (static & dynamic) • Which services available • Status • Capabilities
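As a toy illustration of how the static part of this information can prune the search space before the dynamic data (load) is consulted, here is a small matchmaking sketch; the class and attribute names are invented for this example and are not the GLUE schema or any gLite API.

```python
# Toy matchmaking sketch: static requirements prune the candidate set first,
# then dynamic information (load) ranks the survivors. Attribute names are
# illustrative only, not the schema used by the EGEE information system.
from dataclasses import dataclass

@dataclass
class ComputeResource:
    name: str
    cpus: int            # static: machine property
    os: str              # static: platform property
    software: set        # static: installed software
    load: float          # dynamic: fraction of CPUs busy

@dataclass
class AppRequirements:
    min_cpus: int
    os: str
    needs: set           # required software packages

def candidates(resources, req):
    """Static filtering: discard resources that can never run the job."""
    return [r for r in resources
            if r.cpus >= req.min_cpus and r.os == req.os and req.needs <= r.software]

def rank(resources):
    """Dynamic ranking: prefer the least-loaded surviving resource."""
    return sorted(resources, key=lambda r: r.load)

if __name__ == "__main__":
    pool = [
        ComputeResource("site-a", 128, "linux", {"gcc", "blast"}, 0.7),
        ComputeResource("site-b", 512, "linux", {"gcc"}, 0.2),
        ComputeResource("site-c", 64, "aix", {"xlf"}, 0.1),
    ]
    req = AppRequirements(min_cpus=100, os="linux", needs={"gcc"})
    print([r.name for r in rank(candidates(pool, req))])  # ['site-b', 'site-a']
```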
Permanent Grid Infrastructure Sustainability: Beyond EGEE-II • Need to prepare for permanent Grid infrastructure • Maintain Europe’s leading position in global science Grids • Ensure a reliable and adaptive support for all sciences • Independent of project funding cycles • Modelled on success of GÉANT • Infrastructure managed centrally in collaboration with national bodies
e-Infrastructures Reflection Group (e-IRG). Mission: … to support, on the political, advisory and monitoring level, the creation of a policy and administrative framework for the easy and cost-effective shared use of electronic resources in Europe (focusing on grid computing, data storage, and networking resources) across technological, administrative and national domains.
DEISA Perspectives: Towards cooperative extreme computing in Europe. Victor Alessandrini, IDRIS - CNRS, va@idris.fr
The DEISA Supercomputing Environment (21,900 processors and 145 Tf in 2006, more than 190 Tf in 2007) • IBM AIX super-cluster • FZJ Jülich, 1,312 processors, 8.9 teraflops peak • RZG Garching, 748 processors, 3.8 teraflops peak • IDRIS, 1,024 processors, 6.7 teraflops peak • CINECA, 512 processors, 2.6 teraflops peak • CSC, 512 processors, 2.6 teraflops peak • ECMWF, 2 systems of 2,276 processors each, 33 teraflops peak • HPCx, 1,600 processors, 12 teraflops peak • BSC, IBM PowerPC Linux system (MareNostrum), 4,864 processors, 40 teraflops peak • SARA, SGI Altix Linux system, 1,024 processors, 7 teraflops peak • LRZ, Linux cluster (2.7 teraflops) moving to an SGI Altix system (5,120 processors and 33 teraflops peak in 2006, 70 teraflops peak in 2007) • HLRS, NEC SX8 vector system, 646 processors, 12.7 teraflops peak. V. Alessandrini, IDRIS-CNRS
DEISA objectives • To enable Europe's terascale science by the integration of Europe's most powerful supercomputing systems • Enabling scientific discovery across a broad spectrum of science and technology is the only criterion for success • DEISA is a European supercomputing service built on top of existing national services • Integration of national facilities and services, together with innovative operational models • Main focus is HPC and extreme computing applications that cannot be supported by the isolated national services • The service-provisioning model is the transnational extension of national HPC centres: operations; user support and applications enabling; network deployment and operation; middleware services. V. Alessandrini, IDRIS-CNRS
About HPC • Dealing with large complex systems requires exceptional computational resources; for algorithmic reasons, resource needs grow much faster than system size and complexity • Dealing with huge datasets involving large files; typical datasets are several PBytes • Little usage of commercial or public-domain packages; most applications are corporate codes incorporating specialized know-how, so specialized user support is important • Codes are fine-tuned and targeted for a relatively small number of well-identified computing platforms, and are extremely sensitive to the production environment • The main requirement for high performance is bandwidth (processor to memory, processor to processor, node to node, system to system). V. Alessandrini, IDRIS-CNRS
HPC and Grid Computing • Problem: the speed of light is not fast enough • Finite signal propagation speed boosts message-passing latencies in a WAN from a few microseconds to tens of milliseconds (if A is in Paris and B in Helsinki) • If A and B are two halves of a tightly coupled complex system, communications are frequent and the increased latencies will kill performance • Grid computing works best for embarrassingly parallel applications, or coupled software modules with limited communication. Example: A is an ocean code and B an atmospheric code; there is no bulk interaction • Large, tightly coupled parallel applications should be run on a single platform; this is why we still need high-end supercomputers • DEISA implements this requirement by rerouting jobs and balancing the computational workload at a European scale. V. Alessandrini, IDRIS-CNRS
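A back-of-the-envelope check of the Paris–Helsinki example (the distance and the effective signal speed in fibre are rough assumed values):

```python
# Rough one-way propagation latency estimate (assumed numbers, for illustration).
distance_km = 1900            # approximate great-circle distance Paris -> Helsinki
signal_speed_km_s = 200_000   # light in optical fibre, roughly 2/3 of c

one_way_ms = distance_km / signal_speed_km_s * 1000
print(f"propagation alone: ~{one_way_ms:.0f} ms one way")   # ~10 ms
# Compare with ~1-10 microseconds between nodes of a tightly coupled cluster:
# three to four orders of magnitude, before routing and protocol overhead.
```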
Applications for Grids • Single-CPU jobs: job mix, many users, many serial applications; suitable for grid (e.g. in universities and research centers) • Array jobs: 100s/1000s of jobs, one user, one serial application, varying input parameters; suitable for grid (e.g. parameter studies in optimization, CAE, genomics, finance; see the sketch below) • Massively parallel jobs, loosely coupled: one job, one user, one parallel application, no/low communication, scalable; fine-tune for grid (time-explicit algorithms, film rendering, pattern recognition) • Parallel jobs, tightly coupled: one job, one user, one parallel application, high interprocess communication; not suitable for distribution over the grid, but for a parallel system in the grid (time-implicit algorithms, direct solvers, large linear-algebra equation systems)
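The array-job class is the easiest to grid-enable because the runs are independent. The sketch below shows the pattern locally, with Python's process pool standing in for a grid scheduler or resource broker (purely an illustrative assumption; on a real grid each run would be submitted as a separate job).

```python
# Parameter-sweep sketch: many independent runs of one application with varying
# inputs. A local process pool stands in for the grid scheduler; on a real grid
# each run would be a separate job handled by a broker or resource manager.
from concurrent.futures import ProcessPoolExecutor

def simulate(params):
    """Placeholder for the serial application (e.g. one CAE or genomics run)."""
    x, y = params
    return x, y, x ** 2 + y ** 2   # trivial stand-in objective

if __name__ == "__main__":
    sweep = [(x, y) for x in range(10) for y in range(10)]   # 100 independent runs
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(simulate, sweep))
    best = min(results, key=lambda r: r[2])
    print(f"best parameters: x={best[0]}, y={best[1]}, objective={best[2]}")
```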
German D-Grid Project: part of the 100 million euro e-Science initiative in Germany. Objectives of the e-Science initiative • Building one grid infrastructure in Germany • Combine existing German grid activities • Development of e-science services for the research community • Science Service Grid: "Services for Scientists" • Important: sustainability • Production grid infrastructure after the funding period • Integration of new grid communities (2nd generation) • Evaluation of new business models for grid services
e-Science Projects (diagram): D-Grid community projects (Astro-Grid, C3-Grid, HEP-Grid, IN-Grid, MediGrid, Textgrid) and knowledge-management projects (ONTOVERSE, WIKINGER, WIN-EM, Im Wissensnetz, eSciDoc, . . .), built on top of the Integration Project, which provides the generic grid middleware and grid services, together with VIOLA.
DGI: D-Grid middleware infrastructure (diagram). User layer: application development and user access (GAT API, GridSphere plug-in, UNICORE user clients). High-level grid services: scheduling and workflow management, monitoring, LCG/gLite data management. Basic grid services: accounting and billing, user/VO management, Globus 4.0.1, security. Resources in D-Grid: distributed compute resources, network infrastructure, distributed data archives, data/software.
Key Characteristics of D-Grid • Generic grid infrastructure for German research communities • Focus on sciences and scientists, not industry • Strong influence of international projects: EGEE, DEISA, CrossGrid, CoreGRID, GridLab, GridCoord, UniGrids, NextGRID, … • Application-driven (80% of funding), not infrastructure-driven • Focus on implementation, not research • Phases 1 & 2: 50 million euro, 100 research organizations
Conclusion: moving towards Sustainable Grid Infrastructures OR Why Grids are here to stay!
Reason #1: Benefits • Resource utilization: increase from 20% to 80+% • Productivity: more work done in shorter time • Agility: flexible actions and reactions • On demand: get resources when you need them • Easy access: transparent, remote, secure • Sharing: enable collaboration over the network • Failover: migrate/restart applications automatically • Resource virtualization: access compute services, not servers • Heterogeneity: platforms, OSs, devices, software • Virtual organizations: build & dismantle on the fly
Reason #2: Standards. The Global Grid Forum • Community-driven set of working groups developing standards and best practices for distributed computing efforts • Three primary functions: community, standards, and operations • Standards areas: Infrastructure, Data, Compute, Architecture, Applications, Management, Security, and Liaison • Community areas: Research Applications, Industry Applications, Grid Operations, Technology Innovations, and Major Grid Projects • A Community Advisory Board represents the different communities and provides input and feedback to GGF
Reason #3: Industry. EGA, the Enterprise Grid Alliance • Industry-driven consortium to implement standards in industry products and make them interoperable • Founding members: EMC, Fujitsu Siemens Computers, HP, NEC, Network Appliance, Oracle and Sun, plus 20+ associate members • May 11, 2005: Enterprise Grid Reference Model v1.0 • February 2006: GGF and EGA signed a letter of intent to merge; a joint team is planning the transition, expected to be complete this summer
Reason #4: OGSA, One Open Grid Services Architecture. OGSA integrates grid technologies with Web Services (OGSA => WS-RF) and defines the key components of the grid. "OGSA enables the integration of services and resources across distributed, heterogeneous, dynamic, virtual organizations – whether within a single enterprise or extending to external resource-sharing and service-provider relationships."
Reason #5: Quasi-Standard Tools. Example: the Globus Toolkit, which provides four major functions for building grids: (1) a secure environment (GSI), (2) resource discovery (MDS), (3) job submission (GRAM), and (4) data transfer (GridFTP). Courtesy Gridwise Technologies
Reason #6: . . . and UNICORE • Seamless, secure, intuitive access to distributed resources & data • Available as open source • Features: intuitive GUI with single sign-on, X.509 certificates for AA, workflow engine for multi-site, multi-step workflows, job monitoring, application support, secure data transfer, resource management, and more • In production. Courtesy: Achim Streit, FZJ
UNICORE and Globus interoperability (diagram): UNICORE clients (portal, command line) connect through a gateway and service registry to WS-RF-based services (workflow engine, network job supervisor (NJS), file transfer, user management (AAA, UUDB), monitoring, resource management, application support, IDB, TSI), forming a WS-Resource-based resource management framework for dynamic resource information and resource negotiation; interoperability with Globus 2.4 is provided through GRAM and GridFTP clients talking to the Globus MDS, GRAM gatekeeper and job manager, GridFTP server, and the local RMS, with the job Uspace for staging. Courtesy: Achim Streit, FZJ
Reason #7: Projects, Initiatives, Testbeds, and Companies
• Testbeds: CO Grid, Compute-against-Cancer, D-Grid, DeskGrid, DOE Science Grid, EGEE, EuroGrid, European DataGrid, FightAIDS@home, Folding@home, GRIP, NASA IPG, NC BioGrid, NC Startup Grid, NC Statewide Grid, NEESgrid, NextGrid, Nimrod, Ninf, NRC-BioGrid, OpenMolGrid, OptIPuter, Progress, SETI@home, TeraGrid, UniGrids, Virginia Grid, WestGrid, White Rose Grid, . . .
• Companies: Altair, Avaki, Axceleon, Cassatt, Datasynapse, Egenera, Entropia, eXludus, GridFrastructure, GridIron, GridSystems, Gridwise, GridXpert, HP Utility Data Center, IBM Grid Toolbox, Kontiki, Metalogic, Noemix, Oracle 10g, Parabon, Platform, Popular Power, Powerllel/Aspeed, Proxima, Softricity, Sun N1, TurboWorx, United Devices, Univa, . . .
• Projects/Initiatives: ActiveGrid, BIRN, Condor-G, DEISA, Dame, EGA, EnterTheGrid, GGF, Globus, Globus Alliance, GridBus, GridLab, GridPortal, GRIDtoday, GriPhyN, I-WAY, Knowledge Grid, Legion, MyGrid, NMI, OGCE, OGSA, OMII, PPDG, Semantic Grid, TheGridReport, UK eScience, Unicore, . . .
Reason #8: FP6 Grid Technologies Projects (Call 5 start: summer 2006; EU funding: 124 M€, supporting the NESSI ETP and the Grid community). Themes: grid services and business models; trust and security; platforms and user environments; data, knowledge, semantics, and mining. Projects (by instrument: specific support action, integrated project, network of excellence, specific targeted research project) include Degree, Grid@Asia, Nessi-Grid, Challengers, GridCoord; SIMDAT (industrial simulations), BeinGrid (business experiments), BREIN (agents & semantics), NextGRID (service architecture), Akogrimo (mobile services), XtreemOS (Linux-based grid operating system); CoreGRID (six virtual laboratories); and GridTrust, InteliGrid, OntoGrid, GridEcon, GridComp, Gredia, A-Ware, UniGrids, Grid4all, HPC4U, KnowArc, ArguGrid, Sorma, g-Eclipse, Chemomentum, DataMiningGrid, Edutain@Grid, K-WF Grid, QosCosGrid, Provenance, and AssessGrid.
Reason #9: Enterprise Grids. Enterprise grid reference architecture (diagram): an access layer (SunRay thin clients, browser access via GEP, workstation access); a compute layer on an optional Gbit Ethernet control network (servers, blades and visualization systems, Linux racks, workstations, and a grid manager, interconnected with Myrinet and Sun Fire Link); and a data layer on a Gbit Ethernet data network offering NAS/NFS options: simple NFS (V240/V880 NFS servers), highly available NFS, and scalable QFS/NFS (V880 QFS/NFS servers behind an FC switch).
1000s of Enterprise Grids in Industry • Life Sciences • Startup and cost efficient • Custom research or limited use applications • Multi-day application runs (BLAST) • Exponential Combinations • Limited administrative staff • Complementary techniques • Electronic Design • Time to Market • Fastest platforms, largest Grids • License Management • Well established application suite • Large legacy investment • Platform Ownership issues • Financial Services • Market simulations • Time IS Money • Proprietary applications • Multiple Platforms • Multiple scenario execution • Need instant results & analysis tools • High Performance Computing • Parallel Reservoir Simulations • Geophysical Ray Tracing • Custom in-house codes • Large scale, multi-platform execution
Reason #10: Grid Service Providers. Example: BT • Pre-grid IT asset usage: 10-15% • Inside the data center, within the firewall • Virtual use of own IT assets • The grid virtualiser engine inside the firewall opens up under-used ICT assets and improves TCO, ROI, and application performance • BUT an intra-enterprise grid is self-limiting: the pool of virtualised assets is restricted by the firewall and does not support inter-enterprise usage • BT is focussing on a managed grid solution • Post-grid IT asset usage: 70-75% • (Diagram: enterprise LANs/WANs with virtualised assets and a grid engine.) Courtesy: Piet Bel, BT
BT's Virtual Private Grid (VPG) (diagram): the LANs and WANs of multiple enterprises, each with virtualised IT assets and a local grid engine, interconnected through a grid engine in the BT network. Courtesy: Piet Bel, BT