Data and Knowledge Grids

Data and Knowledge Grids. Chaitan Baru, Co-Director, Data and Knowledge Systems, SDSC. Introduction: SDSC is the leading-edge site of NPACI and one of the nodes in the TeraGrid.

Presentation Transcript


  1. Data and Knowledge Grids. Chaitan Baru, Co-Director, Data and Knowledge Systems, SDSC

  2. Introduction
     • SDSC is the leading-edge site of NPACI
     • SDSC is one of the nodes in the TeraGrid
     • SDSC, via NPACI thrust areas, works with a number of applications: Earth System Science, Neuroscience, Molecular Biology, Digital Sky, …
     • SDSC works on a number of non-NPACI (including industry) projects
     • The DAKS program receives 80% of its funding from non-NPACI sources
     • The SDSC DAKS Program co-leads the data activities in Cal-(IT)2 via the SDSC/Cal-(IT)2 Data and Knowledge Engineering Lab

  3. Introduction
     • The SDSC Data and Knowledge Systems (DAKS) program is unique in the nation. It supports:
        • Computer Science R&D
        • Applications-driven research
        • Development of robust software systems
        • Production data and visualization systems
     • Involved in Grid-based computing…
        • (very) High-speed networking; fewer, high-performance nodes; “big”, possibly complex, data
     • …also, Internet-based computing
        • Web clients, Web databases and mediation, Web services, e.g. the Information Integration Testbed (I2T) Project
     • Web-based grid computing

  4. DAKS Technology Layers
     Layers, in the order listed on the slide:
     • Applications: Ecoinformatics, environmental science…
     • Visualization
     • Data Mining, Simulation Modeling, Analysis, Data Fusion
     • Knowledge-Based Integration
     • Advanced Query Processing
     • Grid Storage (Curated Database)
     • Filesystems, Database Systems
     • High speed networking
     • Networked Storage (SAN)
     • Storage hardware
     Sensornets (real-time data, video streams):
     • ROADNet
     • ActiveCampus
     • Monitoring Health of Civil Infrastructure

  5. Information Integration Testbed (NSF Digital Government/ITR grants)
     Clients
     • “Parameterized” views
     • Resource discovery
     • Service discovery
     XML-based Mediator
     • Mediation of geospatial information
     • Accuracy, resolution issues
     Diagram labels: XML queries, XML, UDDI, WSDL, SOAP, Java Servlets, Sociology Workbench, Stats Server, XML Metadata files, Oracle DBMS
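The testbed diagram shows clients and servers exchanging XML over SOAP. As a rough illustration of what such a message looks like, the sketch below wraps one call in a SOAP 1.1 envelope and parses it back. The service namespace `urn:example:stats-server`, the operation name, and the parameters are assumptions made up for this example; the slides do not describe the actual I2T interfaces.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SVC_NS = "urn:example:stats-server"  # hypothetical service namespace (assumption)


def build_request(operation: str, params: dict) -> str:
    """Wrap one operation call in a SOAP 1.1 envelope and return it as text."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{SVC_NS}}}{operation}")
    for name, value in params.items():
        ET.SubElement(call, f"{{{SVC_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="unicode")


def parse_request(xml_text: str) -> tuple:
    """Recover (operation, params) from an envelope built by build_request."""
    envelope = ET.fromstring(xml_text)
    call = envelope.find(f"{{{SOAP_NS}}}Body")[0]  # first child of Body is the call

    def local(tag: str) -> str:
        # Drop the "{namespace}" prefix ElementTree puts on qualified tags.
        return tag.split("}", 1)[1]

    return local(call.tag), {local(child.tag): child.text for child in call}
```

A server-side handler would call `parse_request` on the incoming text and dispatch on the operation name; in a real deployment the message shape would come from the service's WSDL rather than being hand-built.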

  6. Community Grid Projects
     • GriPhyN: Grid Physics Network (NSF ITR)
     • NVO: National Virtual Observatory (NSF ITR)
     • BIRN: Biomedical Informatics Research Network (NCRR/NIH)
     • GEON: GEOsciences Network

  7. GriPhyN: The LIGO Project
     • Use of COTS DBMS
     Diagram labels: 1000 channels of data, every 2-3 seconds; store raw data and basic “products”; filtering; request for data by channels/time (GB-TB); request for “full sweep” of data (10’s-100’s TB); recalibrate data; data analysis; result

  8. Digital Sky Projects: National Virtual Observatory (NVO)
     Diagram labels: digital images; image analysis; sky catalogs; load into DBMS; Catalog A; Catalog B; correlate across catalogs; data mining; result

  9. BIRN
     • Integrating data from different brain mapping research sites
        • UCSD, UCLA, Caltech, Duke, Mass General, Harvard
        • Mouse and human brain
     • BIRN Data/Knowledge Grid
        • High-speed networking
        • Access to distributed data
        • Semantic mediation
        • Intra-species and inter-species queries
        • Visualization and analysis tools

  10. Example of BIRN Federation
     Example query: Are there changes in axon diameter, and/or number, in the optic nerve of EAE animals, before the development of gross structural changes?
     Diagram labels: Integrated View; Integrated View Definition; Mediator; four Wrappers; sources: Web (CaBP, Expasy), Electron microscopy, Histology, MRI
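The mediator and wrapper boxes on the federation slide follow a familiar pattern: each wrapper translates one source into a shared record shape, and the mediator fans a query out and merges the answers into an integrated view. The sketch below is a minimal illustration of that pattern only. The class names, field names, and sample values are all made up for the example and are not BIRN data or BIRN software.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]


class Wrapper:
    """Adapts one source's native rows to a common record shape."""

    def __init__(self, name: str, rows: List[dict],
                 translate: Callable[[dict], Record]):
        self.name = name
        self._rows = rows
        self._translate = translate  # maps a native row to a common record

    def query(self, subject: str) -> List[Record]:
        """Return translated records for one subject, tagged with the source."""
        out = []
        for row in self._rows:
            rec = self._translate(row)
            if rec["subject"] == subject:
                rec["source"] = self.name
                out.append(rec)
        return out


class Mediator:
    """Fans a query out to every wrapper and merges the results."""

    def __init__(self, wrappers: List[Wrapper]):
        self._wrappers = list(wrappers)

    def integrated_view(self, subject: str) -> List[Record]:
        results: List[Record] = []
        for wrapper in self._wrappers:
            results.extend(wrapper.query(subject))
        return results


# Hypothetical sources with different native schemas (toy values).
microscopy = Wrapper(
    "electron_microscopy",
    [{"animal": "m1", "axon_diam_um": 0.8}, {"animal": "m2", "axon_diam_um": 1.1}],
    lambda r: {"subject": r["animal"], "axon_diameter_um": r["axon_diam_um"]},
)
histology = Wrapper(
    "histology",
    [{"id": "m1", "lesion": False}],
    lambda r: {"subject": r["id"], "lesion": r["lesion"]},
)
mediator = Mediator([microscopy, histology])
```

Calling `mediator.integrated_view("m1")` returns one record from each source, so a question that spans microscopy and histology can be answered from a single merged result.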

  11. BIRN Layered Architecture
     Layers:
     • Presentation/Visualization/Application Layer
     • Data Integration Layer (Mediator)
     • Computational Grid
     • Virtual Data Grid (SRB)
     • Network Layer
     Annotations on the diagram:
     • allows exploration and manipulation of images and volumes
     • allows query access to descriptive and computed information from multiple sources
     • provides file and collection-level access to any data from any source

  12. GEON
     • An outcome of the Geoinformatics community workshops
     • GEON Geoscience Research Themes
        • Earth's Surface: The Critical Interface Among Humans, Water, the Atmosphere, and Tectonics
        • Biodiversity: Geoscience and Evolution
        • Exploring the 4D Architecture of Continents
     • GEON Information Technology Research
        • GEON “Deep” Data Modeling and Semantic Mediation of 4D data sets
        • 4D Visualization and Augmented Reality
        • Data grids and distributed computing

  13. GEON Participants
     Geosciences: R. Arrowsmith, Arizona State University; N. Christensen, University of Wisconsin; M. Crawford, Bryn Mawr; C. Duffy, Pennsylvania State University; C. Flessa, University of Arizona; A. Gary, University of Utah; B. Huber, Smithsonian Institution; R. Keller, University of Texas El Paso; A. Levander, Rice University; M. Liu, University of Missouri; C. Marshall, Harvard University; D. McLaughlin, Massachusetts Institute of Technology; C. Meertens, UNAVCO; J. Oldow, University of Idaho; D. Seber, Cornell University; A.K. Sinha, Virginia Tech; W. Snyder, Boise State University; H. Staudigel, Scripps Institution of Oceanography; H. Wang, University of Wisconsin
     Information Technology: M. Bailey, San Diego Supercomputer Center; C. Baru, San Diego Supercomputer Center; B. Ludaescher, San Diego Supercomputer Center; P. Papadopoulos, San Diego Supercomputer Center; Y. Papakonstantinou, University of California San Diego; T. Smith, University of California Santa Barbara
     Education and Outreach: M. Marlino, Digital Library for Earth System Education (DLESE)

  14. GEON Partners
     • Government: USGS, NASA, NOAA, NGDC, State Geologists Association
     • Academia: IRIS, Cal-(IT)2
     • Industry: ESRI, Oracle, Sun, Panoram

  15. GEON Information Integration Example. Biodiversity: The Paleobiology Database (Charles Marshall, Harvard)
     Diagram labels: Where, When, Who; Selection Criteria; Biological Attributes; “Locality” <name>; Species A; Species B; Paleoenvironment; Synonymy; Tectonic Setting; Museum holdings; Paleogeography; Phylogeny; Paleolatitude; Mineralogy; International Timescale #1; International Timescale #2; Sequence Stratigraphy; Body Mass; Geochemistry; Lithology

  16. Complex Multiple-World Integration Scenarios
     • Current database integration work addresses only:
        • Structural/Schema Conflicts
           • common semistructured data model (XML)
           • schema transformations/integration (XML queries & transforms)
        • Limited Query Capabilities
           • capability-based rewriting (e.g., TSIMMIS)
     • These scenarios are “one-world” (e.g. electronic parts catalogs) or simple multiple-world (e.g. “home buyer”)
     • Problem: Semantic mediation in complex multiple worlds
        • complex, disjoint, seemingly unrelated data
        • “hidden semantics” in complex, indirect relationships
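Capability-based rewriting deals with sources that can only answer part of a query. A minimal sketch of the idea, assuming equality conditions only: the mediator pushes down the conditions a source supports and applies the rest itself. This is a simplified illustration, not the TSIMMIS algorithm; the `Source` class and the parts-catalog rows are invented for the example.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]


class Source:
    """A source that can only filter on a fixed set of fields."""

    def __init__(self, rows: List[Row], supported_fields: List[str]):
        self.rows = rows
        self.supported = set(supported_fields)

    def select(self, conditions: Row) -> List[Row]:
        unsupported = set(conditions) - self.supported
        if unsupported:
            raise ValueError(f"source cannot filter on {sorted(unsupported)}")
        return [r for r in self.rows
                if all(r[k] == v for k, v in conditions.items())]


def rewrite(conditions: Row, source: Source) -> Tuple[Row, Row]:
    """Split a query into a pushed-down part and a residual part."""
    pushed = {k: v for k, v in conditions.items() if k in source.supported}
    residual = {k: v for k, v in conditions.items() if k not in source.supported}
    return pushed, residual


def answer(conditions: Row, source: Source) -> List[Row]:
    """Run the supported part at the source, then filter the rest locally."""
    pushed, residual = rewrite(conditions, source)
    candidates = source.select(pushed)
    return [r for r in candidates
            if all(r[k] == v for k, v in residual.items())]


# Hypothetical parts catalog that can only be searched by part type.
catalog = Source(
    [
        {"type": "resistor", "maker": "Acme", "ohms": 100},
        {"type": "resistor", "maker": "Bolt", "ohms": 100},
        {"type": "capacitor", "maker": "Acme", "ohms": None},
    ],
    supported_fields=["type"],
)
```

Here `answer({"type": "resistor", "maker": "Acme"}, catalog)` sends only the `type` condition to the source and checks `maker` in the mediator, whereas calling `catalog.select` with both conditions would be rejected.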

  17. Augmented Reality Facility (ARF). Simulation of database information overlaid on ground reality (photograph of San Elijo Lagoon, San Diego County, CA)

  18. Scaling the “Network”
     • Technology: hardware, software
     • Disseminating “best practices”
     • Keeping technologies and technological skills up to date

  19. A Common Opportunity
     • Creating the Data Institute
        • Common distributed cyberinfrastructure for science communities
        • Much commonality in IT problems across domains
        • Support for training of scientists and data managers (“wetware”)
        • Training in DBMS, GIS, Web, Wireless, Taxonomic DB, Metadata
        • IT state-of-the-art moves quickly
        • Dedicated, funded center to develop/modify existing technology
        • Some requirements of science applications are not directly addressed by commercial technology
        • “Riding the market”
        • Leverage industry linkages and commercial technology

  20. A Common Opportunity
     • Creating the Data Institute
        • Information clearinghouse/digital library
        • Leverage what SDSC/Cal-(IT)2 is already doing
        • Long-term preservation/sustenance of data and software tools
        • Leverage SDSC’s work with the National Archives and Records Administration (NARA), Library of Congress (LoC), and California Digital Library (CDL)
        • National Ecological Data Archive
        • Create sustained community services
        • E.g. Science UDDI (Universal Description, Discovery, and Integration)

  21. GEON Discovery Center (Portal)
     • Thematic Views (GEON Interdisciplinary Themes): Earth's Surface, Biodiversity, 4D Continental Architecture; each with Virtual Collections
     • Disciplinary Views: Geophysics, Petrology, Tectonics, Geology, Paleontology,...
     • GEOSCIENCES Community and GEON Participants: Dataset Providers, Tool Providers, Collection Providers, Mediation Teams (View Providers)
     • Integrated Views
     • Services: Visualization, Digital Library, Collaboration
     • Knowledge-Based Integration / Semantic Mediation: domain maps, process maps
     • GEON Collections
     • GEON Data Grid Services: authentication, distributed data management, persistent archives
     • Storage / Networks / Computers: USGS, IRIS, ADEPT/ADL, NASA, UNAVCO, DLESE, HPSS, NGDC/NOAA, NSDL, SAN, Linux Clusters

  22. Structural vs. Model-Based Mediation
     • Structural mediation: Raw Data, XML Elements, XML Models
        • Structural Constraints (DTDs): Parent, Child, Sibling, ...
        • Example content model: A = (B*|C),D; B = ...
        • Integrated-DTD := XQuery(Src1-DTD,...)
        • No Domain Constraints
     • Model-based mediation: Raw Data, (XML) Objects, Conceptual Models
        • Classes, Relations, is-a, has-a, ...
        • Integrated-CM := CM-QL(Src1-CM,...)
        • DOMAIN MAP (diagram nodes C1, C2, C3 linked by relation R)
        • Logical Domain Constraints, shown as IF ... THEN ... rules (rule contents not recoverable from the transcript)
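To make the model-based side of this slide more concrete, the sketch below treats a domain map as a small graph of is-a links and uses it to decide which sources are relevant to a concept-level query. It is an illustration under assumptions: the `DomainMap` class, the concept names, and the lab annotations are invented here, and the slides do not define CM-QL or the real domain maps.

```python
from typing import Dict, List, Set


class DomainMap:
    """A tiny concept graph holding direct is-a links."""

    def __init__(self) -> None:
        self._parents: Dict[str, Set[str]] = {}

    def add_is_a(self, child: str, parent: str) -> None:
        self._parents.setdefault(child, set()).add(parent)
        self._parents.setdefault(parent, set())

    def is_a(self, concept: str, ancestor: str) -> bool:
        """True if concept equals ancestor or reaches it through is-a links."""
        seen: Set[str] = set()
        stack = [concept]
        while stack:
            current = stack.pop()
            if current == ancestor:
                return True
            if current in seen:
                continue
            seen.add(current)
            stack.extend(self._parents.get(current, ()))
        return False


def sources_for(concept: str, domain_map: DomainMap,
                source_concepts: Dict[str, str]) -> List[str]:
    """Names of sources whose annotated concept falls under the query concept."""
    return sorted(name for name, annotated in source_concepts.items()
                  if domain_map.is_a(annotated, concept))


# Hypothetical domain map and source annotations.
dm = DomainMap()
dm.add_is_a("purkinje_cell", "neuron")
dm.add_is_a("neuron", "cell")
dm.add_is_a("astrocyte", "glia")
dm.add_is_a("glia", "cell")

annotations = {"lab_A": "purkinje_cell", "lab_B": "neuron", "lab_C": "astrocyte"}
```

With these annotations, a query about `neuron` selects `lab_A` and `lab_B` even though `lab_A` never uses the word "neuron" in its own schema. A purely structural mediator has no such link to follow, which is the contrast the slide draws.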

  23. SDSC Storage Resource Broker & Metadata Catalog
     • Storage systems behind SRB:
        • Databases: DB2, Oracle, Sybase
        • Archives: HPSS, ADSM, UniTree, DMF
        • File Systems: Unix, NT, Mac OSX
     • Other diagram labels: Application; C, C++, Linux I/O; Unix Shell; Java, NT Browsers; Prolog Predicate; Web; Third-party copy; Remote Proxies; SRB; MCAT; HRM; DataCutter; Dublin Core; Resource, User; User Defined; Application Meta-data
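One way to read this diagram is that a metadata catalog sits between applications and many kinds of storage, mapping the names applications use to copies held on particular systems. The sketch below illustrates that idea only. It is not the SRB or MCAT API; the `Catalog` class, the logical name, the resource names, and the paths are all hypothetical.

```python
from typing import Dict, List, Set, Tuple


class Catalog:
    """Maps a logical name to the physical replicas registered for it."""

    def __init__(self) -> None:
        self._replicas: Dict[str, List[Tuple[str, str]]] = {}

    def register(self, logical_name: str, resource: str, physical_path: str) -> None:
        """Record that a copy of logical_name lives at physical_path on resource."""
        self._replicas.setdefault(logical_name, []).append((resource, physical_path))

    def resolve(self, logical_name: str, available: Set[str]) -> Tuple[str, str]:
        """Return the first registered replica held on an available resource."""
        for resource, path in self._replicas.get(logical_name, []):
            if resource in available:
                return resource, path
        raise LookupError(f"no available replica for {logical_name!r}")


# Hypothetical registrations: one archive copy and one file-system copy.
catalog = Catalog()
catalog.register("/demo/brain-scan-001", "archive", "/hpss/demo/scan001.dat")
catalog.register("/demo/brain-scan-001", "unix-fs", "/data/demo/scan001.dat")
```

An application asks for `/demo/brain-scan-001` without knowing where it is stored; if the archive is offline, `resolve` falls through to the file-system copy.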

  24. TeraGrid: 13.6 TF, 6.8 TB memory, 79 TB internal disk, 576 network disk
     • ANL: 1 TF, .25 TB memory, 25 TB disk
     • Caltech: 0.5 TF, .4 TB memory, 86 TB disk
     • NCSA: 6+2 TF, 4 TB memory, 240 TB disk
     • SDSC: 4.1 TF, 2 TB memory, 225 TB SAN
     • External networks: vBNS, Abilene, MREN, Calren, ESnet, HSCC, NTON, Starlight
     • Link types: OC-3, OC-12, OC-12 ATM, OC-48, GbE, Myrinet
     • DTF core switch/routers (Chicago & LA): Cisco 65xx Catalyst Switch (256 Gb/s Crossbar), Juniper M160
     • Systems and storage shown: Extreme Blk Diamond, 574p IA-32 Chiba City, 256p HP X-Class, 128p HP V2500, 128p Origin, 92p IA-32, 1024p IA-32, 320p IA-64, 1176p IBM SP 1.7 TFLOPs Blue Horizon, Sun Server, 15xxp Origin, 2 x Sun E10K, HR Display & VR Facilities, HPSS (300 TB), UniTree

  25. SDSC “node” configured to be best site for data-oriented computing in the world
     • Argonne: 1 TF, 0.25 TB memory, 25 TB disk
     • Caltech: 0.5 TF, 0.4 TB memory, 86 TB disk
     • NCSA: 8 TF, 4 TB memory, 240 TB disk
     • SDSC: 4.1 TFLOP, 2 TB memory, ~25 TB internal disk, ~225 TB network disk
     • TeraGrid Backbone (40 Gbps); external networks: vBNS, Abilene, Calren, ESnet
     • SDSC systems shown: Blue Horizon IBM SP 1.7 TFLOPs, 2 x Sun E10K, Sun, HPSS 300 TB, Myrinet Clos Spine
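The per-site figures on this slide can be checked against the aggregate quoted on the previous TeraGrid slide. The short sketch below sums peak compute and memory across the four sites; the dictionary layout and key names are just a convenient encoding of the slide's numbers.

```python
# Per-site figures as listed on the slide (TF = teraflops, TB = terabytes).
SITES = {
    "Argonne": {"tflops": 1.0, "memory_tb": 0.25},
    "Caltech": {"tflops": 0.5, "memory_tb": 0.4},
    "NCSA": {"tflops": 8.0, "memory_tb": 4.0},
    "SDSC": {"tflops": 4.1, "memory_tb": 2.0},
}


def total(key: str) -> float:
    """Sum one figure across all sites, rounded to two decimals."""
    return round(sum(site[key] for site in SITES.values()), 2)


total_tflops = total("tflops")        # 1.0 + 0.5 + 8.0 + 4.1
total_memory_tb = total("memory_tb")  # 0.25 + 0.4 + 4.0 + 2.0
```

The compute total comes to 13.6 TF, which matches the headline figure. The memory figures sum to 6.65 TB, slightly under the 6.8 TB quoted; the slides do not say where the difference comes from.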