Global Data Grids: The Need for Infrastructure
Paul Avery, University of Florida
http://www.phys.ufl.edu/~avery/ | avery@phys.ufl.edu
Extending the Grid Reach in Europe, Brussels, Mar. 23, 2001
http://www.phys.ufl.edu/~avery/griphyn/talks/avery_brussels_23mar01.ppt


Presentation Transcript


  1. Global Data Grids: The Need for Infrastructure
  Paul Avery, University of Florida
  http://www.phys.ufl.edu/~avery/
  avery@phys.ufl.edu
  Extending the Grid Reach in Europe, Brussels, Mar. 23, 2001
  http://www.phys.ufl.edu/~avery/griphyn/talks/avery_brussels_23mar01.ppt

  2. Global Data Grid Challenge
  “Global scientific communities, served by networks with bandwidths varying by orders of magnitude, need to perform computationally demanding analyses of geographically distributed datasets that will grow by at least 3 orders of magnitude over the next decade, from the 100 Terabyte to the 100 Petabyte scale.”

  3. Data Intensive Science: 2000-2015
  • Scientific discovery increasingly driven by IT
  • Computationally intensive analyses
  • Massive data collections
  • Rapid access to large subsets
  • Data distributed across networks of varying capability
  • Dominant factor: data growth (1 Petabyte = 1000 TB)
    • 2000: ~0.5 Petabyte
    • 2005: ~10 Petabytes
    • 2010: ~100 Petabytes
    • 2015: ~1000 Petabytes?
  How to collect, manage, access and interpret this quantity of data?
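  The projection above implies remarkably steady exponential growth. A minimal Python sketch (illustrative only; it is not part of the talk, and the figures are simply copied from the bullets above) makes the implied annual growth rate explicit:

```python
# A minimal sketch (not from the talk): the compound annual growth rate
# implied by the slide's projections. Figures are copied from the bullets above.
volumes_pb = {2000: 0.5, 2005: 10, 2010: 100, 2015: 1000}

years = sorted(volumes_pb)
for y0, y1 in zip(years, years[1:]):
    ratio = volumes_pb[y1] / volumes_pb[y0]
    cagr = ratio ** (1 / (y1 - y0)) - 1
    print(f"{y0}-{y1}: x{ratio:g}, ~{cagr:.0%} per year")

# Overall: 0.5 PB -> 1000 PB is a factor of 2000 over 15 years,
# i.e. roughly 65% compound growth every single year.
```

  Running it shows roughly 60-80% compound growth per year in every five-year interval, which is the scale of growth the Grid infrastructure has to absorb.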

  4. Data Intensive Disciplines
  • High energy & nuclear physics
  • Gravity wave searches (e.g., LIGO, GEO, VIRGO)
  • Astronomical sky surveys (e.g., Sloan Sky Survey)
  • Global “Virtual” Observatory
  • Earth Observing System
  • Climate modeling
  • Geophysics

  5. Data Intensive Biology and Medicine
  • Radiology data
  • X-ray sources (APS crystallography data)
  • Molecular genomics (e.g., Human Genome)
  • Proteomics (protein structure, activities, …)
  • Simulations of biological molecules in situ
  • Human Brain Project
  • Global Virtual Population Laboratory (disease outbreaks)
  • Telemedicine
  • Etc.
  Commercial applications not far behind

  6. The Large Hadron Collider at CERN
  [Figure: the “Compact” Muon Solenoid (CMS) detector at the LHC, with a standard man shown for scale]

  7. LHC Computing Challenges
  • Complexity of LHC environment and resulting data
  • Scale: Petabytes of data per year (100 PB by 2010)
  • Global distribution of people and resources
  CMS Experiment: 1800 physicists, 150 institutes, 32 countries

  8. Global LHC Data Grid Hierarchy
  [Figure: hierarchy diagram of Tier 0 (CERN) feeding Tier 1, Tier 2, Tier 3 and Tier 4 sites]
  • Tier 0: CERN
  • Tier 1: National Lab
  • Tier 2: Regional Center at University
  • Tier 3: University workgroup
  • Tier 4: Workstation
  GriPhyN:
  • R&D
  • Tier2 centers
  • Unify all IT resources

  9. Global LHC Data Grid Hierarchy
  [Figure: data-flow diagram from the experiment down through the tiers]
  • Experiment → Online System: ~PBytes/sec (bunch crossing every 25 nsec; 100 triggers per second; each event is ~1 MByte)
  • Online System → Tier 0+1 (CERN Computer Center, >20 TIPS, HPSS): ~100 MBytes/sec
  • Tier 0 → Tier 1 (national centers: USA, Italy, UK, France; each with HPSS): 2.5-10 Gb/sec
  • Tier 1 → Tier 2 (regional centers): 2.5-10 Gb/sec
  • Tier 2 → Tier 3 (institutes, ~0.25 TIPS, physics data cache): ~622 Mbits/sec
  • Tier 3 → Tier 4 (workstations, other portals): 100-1000 Mbits/sec
  Physicists work on analysis “channels”; each institute has ~10 physicists working on one or more channels
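  To make the tier model on the two slides above a bit more concrete, here is a small Python sketch. It is not from the talk: the class and variable names are illustrative, and the bandwidths are the nominal figures quoted on the slide. It simply checks two of the quoted rates (the ~100 MBytes/sec out of the online system, and how long a ~1 PB sample would take to ship to a Tier 1 over a dedicated 2.5 Gb/s link):

```python
# A minimal sketch (not from the talk) of the tiered model above: each tier
# gets a name and a nominal inbound bandwidth, and two of the quoted rates
# are checked. Class and variable names are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Tier:
    name: str
    bandwidth_gbps: float  # nominal network bandwidth into this tier

tiers = [
    Tier("Tier 1 (national center)", 2.5),    # 2.5-10 Gb/s on the slide
    Tier("Tier 2 (regional center)", 2.5),    # 2.5-10 Gb/s
    Tier("Tier 3 (institute)",       0.622),  # ~622 Mb/s
    Tier("Tier 4 (workstation)",     0.1),    # 100-1000 Mb/s
]

# Trigger rate x event size gives the rate out of the online system:
triggers_per_s, event_mbytes = 100, 1.0
print(f"Online output: ~{triggers_per_s * event_mbytes:.0f} MBytes/sec")

# Time to replicate ~1 PB to a Tier 1 over an ideal, dedicated 2.5 Gb/s link:
seconds = (1e15 * 8) / (tiers[0].bandwidth_gbps * 1e9)
print(f"1 PB over {tiers[0].bandwidth_gbps} Gb/s: ~{seconds / 86400:.0f} days")
```

  Even under ideal assumptions the petabyte transfer takes on the order of a month of continuous streaming, which illustrates why data are staged down the hierarchy and cached regionally rather than fetched from CERN on demand.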

  10. Global Virtual Observatory
  [Figure: components feeding the Global Virtual Observatory]
  • Image data and standards
  • Source catalogs
  • Specialized data: spectroscopy, time series, polarization
  • Information archives: derived & legacy data (NED, Simbad, ADS, etc.)
  • Discovery tools: visualization, statistics
  Multi-wavelength astronomy, multiple surveys

  11. GVO: The New Astronomy
  • Large, globally distributed database engines
    • Integrated catalog and image databases
    • Multi-Petabyte data size
    • Gbyte/s aggregate I/O speed per site
  • High speed (>10 Gbits/s) backbones
    • Cross-connecting, correlating the major archives
  • Scalable computing environment
    • 100s-1000s of CPUs for statistical analysis and discovery
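  As a back-of-envelope illustration of these requirements (not from the talk; the 2 PB archive size and 10-site count below are assumptions standing in for “multi-Petabyte” and “the major archives”), the following sketch estimates how long a full statistical pass over such an archive takes at the quoted per-site I/O rate:

```python
# Back-of-envelope sketch: time for one full pass over a multi-Petabyte
# archive at ~GByte/s aggregate I/O per site. Archive size and site count
# are illustrative assumptions, not figures from the slide.
archive_pb = 2.0          # assumed "multi-Petabyte" archive size
io_gbytes_per_s = 1.0     # ~GByte/s aggregate I/O per site (quoted on the slide)
n_sites = 10              # assumed number of cross-connected archive sites

seconds_one_site = archive_pb * 1e6 / io_gbytes_per_s   # 1 PB = 1e6 GBytes
print(f"Full scan at one site: ~{seconds_one_site / 86400:.0f} days")

# Correlating across sites in parallel cuts the wall-clock time proportionally,
# provided the >10 Gb/s backbones can carry the cross-site traffic:
print(f"Scan split across {n_sites} sites: ~{seconds_one_site / n_sites / 3600:.0f} hours")
```

  A single site needs weeks for one pass; only by spreading the scan across cross-connected archives over fast backbones does a full-archive statistical analysis become an interactive-scale task.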

  12. Infrastructure for Global Grids

  13. Grid Infrastructure
  • Grid computing sometimes compared to electric grid
    • You plug in to get resource (CPU, storage, …)
    • You don’t care where resource is located
  • This analogy might have an unfortunate downside
    • You might need different sockets!

  14. Role of Grid Infrastructure
  • Provide essential common Grid infrastructure
    • Cannot afford to develop separate infrastructures
  • Meet needs of high-end scientific collaborations
    • Already international and even global in scope
    • Need to share heterogeneous resources among members
    • Experiments drive future requirements
  • Be broadly applicable outside science
    • Government agencies: national, regional (EU), UN
    • Non-governmental organizations (NGOs)
    • Corporations, business networks (e.g., supplier networks)
    • Other “virtual organizations”
  • Be scalable to the global level
    • But EU + US is a good starting point

  15. A Path to Common Grid Infrastructure
  • Make a concrete plan
  • Have clear focus on infrastructure and standards
  • Be driven by high-performance applications
  • Leverage resources & act coherently
  • Build large-scale Grid testbeds
  • Collaborate with industry

  16. Building Infrastructure from Data Grids
  • 3 Data Grid projects recently funded
  • Particle Physics Data Grid (US, DOE)
    • Data Grid applications for HENP
    • Funded 2000, 2001
    • http://www.ppdg.net/
  • GriPhyN (US, NSF)
    • Petascale Virtual-Data Grids
    • Funded 9/2000 – 9/2005
    • http://www.griphyn.org/
  • European Data Grid (EU)
    • Data Grid technologies, EU deployment
    • Funded 1/2001 – 1/2004
    • http://www.eu-datagrid.org/
  • HEP in common
  • Focus: infrastructure development & deployment
  • International scope

  17. Background on Data Grid Projects
  • They support several disciplines
    • GriPhyN: CS, HEP (LHC), gravity waves, digital astronomy
    • PPDG: CS, HEP (LHC + current expts), Nuc. Phys., networking
    • DataGrid: CS, HEP, earth sensing, biology, networking
  • They are already joint projects
    • Each serving needs of multiple constituencies
    • Each driven by high-performance scientific applications
    • Each has international components
    • Their management structures are interconnected
  • Each project developing and deploying infrastructure
    • US$23M (additional proposals for US$35M)
  What if they join forces?

  18. A Common Infrastructure Opportunity
  • GriPhyN + PPDG + EU-DataGrid + national efforts
    • France, Italy, UK, Japan
  • Have agreed to collaborate, develop joint infrastructure
    • Initial meeting March 4 in Amsterdam to discuss issues
    • Future meetings in June, July
  • Preparing management document
    • Joint management, technical boards + steering committee
    • Coordination of people, resources
    • An expectation that this will lead to real work
  • Collaborative projects
    • Grid middleware
    • Integration into applications
    • Grid testbed: iVDGL
    • Network testbed (Foster): T3 = Transatlantic Terabit Testbed

  19. iVDGL
  • International Virtual-Data Grid Laboratory
    • A place to conduct Data Grid tests at scale
    • A concrete manifestation of world-wide Grid activity
    • A continuing activity that will drive Grid awareness
    • A basis for further funding
  • Scale of effort
    • For national, international scale Data Grid tests, operations
    • Computationally and data intensive computing
    • Fast networks
  • Who
    • Initially US-UK-EU
    • Other world regions later
    • Discussions w/ Russia, Japan, China, Pakistan, India, South America

  20. iVDGL Parameters
  • Local control of resources vitally important
    • Experiments, politics demand it
    • US, UK, France, Italy, Japan, ...
  • Grid exercises
    • Must serve clear purposes
    • Will require configuration changes → not trivial
    • “Easy”, intra-experiment tests first (10-20%, national, transatlantic)
    • “Harder” wide-scale tests later (50-100% of all resources)
  • Strong interest from other disciplines
    • Our CS colleagues (wide scale tests)
    • Other HEP + NP experiments
    • Virtual Observatory (VO) community in Europe/US
    • Gravity wave community in Europe/US/(Japan?)
    • Bioinformatics

  21. Revisiting the Infrastructure Path
  • Make a concrete plan
    • GriPhyN + PPDG + EU DataGrid + national projects
  • Have clear focus on infrastructure and standards
    • Already agreed
    • COGS (Consortium for Open Grid Software) to drive standards?
  • Be driven by high-performance applications
    • Applications are manifestly high-perf: LHC, GVO, LIGO/GEO/Virgo, …
    • Identify challenges today to create tomorrow’s Grids

  22. Revisiting the Infrastructure Path (cont)
  • Leverage resources & act coherently
    • Well-funded experiments depend on Data Grid infrastructure
    • Collab. with national laboratories: FNAL, BNL, RAL, Lyon, KEK, …
    • Collab. with other Data Grid projects: US, UK, France, Italy, Japan
    • Leverage new resources: DTF, CAL-IT2, …
    • Work through Global Grid Forum
  • Build and maintain large-scale Grid testbeds
    • iVDGL
    • T3
  • Collaboration with industry → next slide
  • EC investment in this opportunity
    • Leverage and extend existing projects, worldwide expertise
    • Invest in testbeds
    • Work with national projects (US/NSF, UK/PPARC, …): part of the same infrastructure

  23. Collaboration with Industry
  • Industry efforts are similar, but only in spirit
    • ASP, P2P, home PCs, …
  • IT industry mostly has not invested in Grid R&D
    • We have different motives, objectives, timescales
  • Still many areas of common interest
    • Clusters, storage, I/O
    • Low cost cluster management
    • High-speed, distributed databases
    • Local and wide-area networks, end-to-end performance
    • Resource sharing, fault-tolerance, …
  • Fruitful collaboration requires clear objectives
  • EC could play important role in enabling collaborations

  24. Status of Data Grid Projects
  • GriPhyN
    • US$12M funded by NSF/ITR 2000 program (5 year R&D)
    • 2001 supplemental funds requested for initial deployments
    • Submitting 5-year proposal ($15M) to NSF
    • Intend to fully develop production Data Grids
  • Particle Physics Data Grid
    • Funded in 1999, 2000 by DOE ($1.2M per year)
    • Submitting 3-year proposal ($12M) to DOE Office of Science
  • EU DataGrid
    • 10M Euros funded by EU (3 years, 2001 – 2004)
    • Submitting proposal in April for additional funds
  • Other projects?

  25. Grid References
  • Grid Book: www.mkp.com/grids
  • Globus: www.globus.org
  • Global Grid Forum: www.gridforum.org
  • PPDG: www.ppdg.net
  • EU DataGrid: www.eu-datagrid.org/
  • GriPhyN: www.griphyn.org

  26. Summary
  • Grids will qualitatively and quantitatively change the nature of collaborations and approaches to computing
  • Global Data Grids provide the challenges needed to build tomorrow’s Grids
  • We have a major opportunity to create common infrastructure
  • Many challenges during the coming transition
    • New Grid projects will provide rich experience and lessons
    • Difficult to predict the situation even 3-5 years ahead
