
Data Management Services Reagan W. Moore San Diego Supercomputer Center


Presentation Transcript


  1. Data Management Services
     Reagan W. Moore
     San Diego Supercomputer Center
     9500 Gilman Drive, La Jolla, CA 92093-0505
     Phone: 858 534-5073
     FAX: 858 534-5152
     E-mail: moore@sdsc.edu
     http://www.npaci.edu/DICE/

  2. Staff
     Reagan Moore, Chaitan Baru, Sheau Yen Chen, Charles Cowart, Amarnath Gupta, George Kremenek, Bertram Ludäscher, Richard Marciano, Arcot Rajasekar, Abe Singer, Michael Wan, Ilya Zaslavsky, Bing Zhu
     Students - GSRA
     Martin Kuhl, Liying Sui, Yang Yu, Valter Crescenzi
     Students - Undergrad Interns
     Peter Shin, Roman Olshanowsky, Shabbar Tambawala, Pratik Mukhopadhyay
     +/- NN
     Data Intensive Computing Environment Group

  3. Topics
     • Data management systems
     • Examples of large-scale data management
     • Characterization of data, information, and knowledge for digital libraries

  4. Evolution of Data Management
     Collection - managed data
     • Use database to organize attributes about data objects
     • Separate information management from data storage
     • Support APIs for information discovery, data access
     [Diagram labels: Database A, Storage, Storage Resource Broker]
     Integration accomplished through a data handling system which characterizes the storage systems

  5. Evolution of Data Management
     Distributed Data Collection
     • Same name space
     • Same schema
     • Separate administration domains
     • Heterogeneous database instances
     [Diagram labels: Database A, Database B, Storage Resource Broker]
     Integration requires the ability to characterize both the schemas and the table structures of each information repository

  6. Data Grids
     Data Grid - linking multiple data collections
     • Separate name spaces
     • Separate schema
     • Separate administration domains
     • Heterogeneous database instances
     [Diagram labels: Database A, Data grid, Database B]
     The data grid is itself a collection that provides mechanisms to hide latency and manage semantics

  7. Astronomy Sky Survey Data Grid
     [Layered architecture diagram; labels in extraction order]
     1. Portals and Workbenches
     2. Knowledge & Resource Management; Bulk Data Analysis; Metadata View; Data View; Catalog Analysis
     3. Standard APIs and Protocols; Concept space
     4. Grid Security; Caching; Replication; Backup; Scheduling; Information Discovery; Metadata delivery; Data Discovery; Data Delivery
     5. Standard Metadata format, Data model, Wire format
     6. Catalog Mediator; Data mediator; Catalog/Image Specific Access; Compute Resources; Catalogs; Data Archives; Derived Collections
     7.

  8. Federated Digital Libraries
     Virtual Data Grid - linking multiple data collections
     • Ability to execute processes to recreate derived data
     [Diagram labels: Database A, Services, Virtual Data Grid, Database B, Services]
     The virtual data grid integrates data grid and digital library technology to manage processes

  9. [NSDL architecture diagram; labels as extracted]
     • Portals & Clients (three instances)
     • NSDL Services; Other NSDL Services; NSDL Collections; Referenced Items & Collections
     • Core Services: annotation; metadata normalizing
     • CI Services: query transform; topic-map registry; personalization; discussion; visualization...
     • Core Collection-Building Services: metadata harvesting; persistent storage
     • Other labels: User Interfaces; NSDL Usage Enhancement; Delivery; Presentation; Aggregation - Channels; Information about collections; Core NSDL Bus; Meta-data delivery; Data delivery; Query; Global Ids; Security; Network; Metadata & data access-based services; Virtual Collections & Mediators; Collection Building

  10. Persistent Archive
      • Describe archived data as collections
      • Describe processes used to create collections
      • Manage evolution of technology
      [Diagram labels: Database A (today), Virtual Data Grid, Database A (tomorrow)]
      The persistent archive is itself a virtual data grid that provides mechanisms to manage relationships over time

  11. ERA Concept model

  12. Data Management Systems
      • Distributed data collections
        • Single name space
        • Distributed data storage systems
      • Data Grid - integration of multiple data collections
        • Each collection has a separate name space
        • Infrastructure that interconnects the collections can use its own name space, containers, replication
      • Virtual Data Grids - federation of digital libraries
        • In addition, support interoperability between services for manipulation, presentation, discovery of digital objects
      • Persistent archive
        • In addition, manage evolution of technology components

  13. Distributed Environment Hurdles
      • Access to data distributed across multiple administration domains
      • Access to local name spaces
      • Persistence / consistency of distributed digital objects
      • Latency hiding mechanisms

  14. Distributed Data Collection
      • Logical organization of distributed digital objects into a collection
      • Access through federated servers
      • Collection-owned data, which implies that the server at each storage repository runs under a collection user-ID
      • Collection attributes define a global name space
      • Self-consistent attribute update on all data accesses
      • Support for multiple access APIs
      • Extensible support for access to any type of storage system (archive, file system, database)
      • Extensible collection attributes

  15. Logical Collections
      • Separate the organization of digital objects into a collection from their physical storage location
      • Metadata catalog to manage attributes about the digital objects
      • Data handling system to manage interaction with remote storage systems
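
The separation described on this slide can be sketched as a small catalog that maps a logical name to attributes and to one or more physical copies. This is a minimal illustration, not the SRB or MCAT implementation; the class names, fields, resource names, and paths are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class PhysicalCopy:
    """Where one copy of a digital object lives (hypothetical fields)."""
    resource: str  # name of a storage system, e.g. an archive or file system
    path: str      # location inside that storage system


@dataclass
class CatalogEntry:
    """What the catalog keeps about one digital object."""
    attributes: dict = field(default_factory=dict)
    copies: list = field(default_factory=list)


class MetadataCatalog:
    """Toy catalog: logical name -> attributes and physical copies."""

    def __init__(self):
        self._entries = {}

    def register(self, logical_name, copy, **attributes):
        # Adding a copy never changes the logical name seen by users.
        entry = self._entries.setdefault(logical_name, CatalogEntry())
        entry.copies.append(copy)
        entry.attributes.update(attributes)

    def locate(self, logical_name):
        # Resolve a logical name to its physical copies.
        return list(self._entries[logical_name].copies)

    def find(self, **criteria):
        # Discover objects by attribute value rather than by storage location.
        return sorted(
            name for name, entry in self._entries.items()
            if all(entry.attributes.get(k) == v for k, v in criteria.items())
        )


catalog = MetadataCatalog()
catalog.register("/sky/2mass/img001", PhysicalCopy("archive-a", "/tape/0001"), survey="2MASS")
catalog.register("/sky/2mass/img001", PhysicalCopy("disk-b", "/cache/img001"))
```

Moving or copying an object only changes the `copies` list, so queries by logical name or by attribute keep working.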

  16. Interoperability across Data and Information Repositories
      • Define a representation for storage that is independent of the implementation of the storage system
        • Unix file system semantics - Open/Close/Read/Write/Seek
      • Define a representation of a collection that is independent of the choice of database
        • XML DTD defining schema, table structures
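
One way to picture an implementation-independent storage representation is an abstract interface exposing only the five Unix-style operations, with one driver per storage system. The sketch below is an assumption for illustration: `StorageDriver`, `InMemoryDriver`, and `copy_object` are hypothetical names, and the in-memory driver merely stands in for a real archive, file system, or database driver.

```python
import io
from abc import ABC, abstractmethod


class StorageDriver(ABC):
    """Hypothetical uniform interface: every storage system offers the same five calls."""

    @abstractmethod
    def open(self, path, mode): ...

    @abstractmethod
    def read(self, handle, size): ...

    @abstractmethod
    def write(self, handle, data): ...

    @abstractmethod
    def seek(self, handle, offset): ...

    @abstractmethod
    def close(self, handle): ...


class InMemoryDriver(StorageDriver):
    """Stand-in backend that keeps objects in a dictionary."""

    def __init__(self):
        self._objects = {}  # path -> bytes
        self._handles = {}  # handle -> (path, buffer)
        self._next = 0

    def open(self, path, mode):
        # "w" starts an empty object; any other mode loads the stored bytes.
        initial = b"" if mode == "w" else self._objects[path]
        self._next += 1
        self._handles[self._next] = (path, io.BytesIO(initial))
        return self._next

    def read(self, handle, size):
        return self._handles[handle][1].read(size)

    def write(self, handle, data):
        return self._handles[handle][1].write(data)

    def seek(self, handle, offset):
        return self._handles[handle][1].seek(offset)

    def close(self, handle):
        # Persist the buffer contents when the handle is released.
        path, buf = self._handles.pop(handle)
        self._objects[path] = buf.getvalue()


def copy_object(src, src_path, dst, dst_path, chunk=4):
    """Client code written against the interface only, not a specific backend."""
    rh, wh = src.open(src_path, "r"), dst.open(dst_path, "w")
    while True:
        block = src.read(rh, chunk)
        if not block:
            break
        dst.write(wh, block)
    src.close(rh)
    dst.close(wh)
```

Because `copy_object` uses nothing beyond the five calls, the same function would work between any two backends that implement the interface.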

  17. Defining Collection Attributes
      • Composing schema - define sets of attributes that are needed for each collection function
        • SRB metadata - Unix file system semantics
        • Provenance metadata - Dublin Core
        • Resource metadata - User access control lists
        • Discipline metadata - User defined attributes
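
Composing a schema from per-function attribute sets might look like the following sketch. The attribute names are illustrative assumptions (the provenance names are a few Dublin Core elements), and `compose_schema` is a hypothetical helper, not part of any SRB API.

```python
# Attribute names below are illustrative assumptions, one set per collection function.
SRB_METADATA = {"name", "size", "owner", "modified"}             # file-system style
PROVENANCE_METADATA = {"title", "creator", "date", "publisher"}  # Dublin Core elements
RESOURCE_METADATA = {"access_control_list"}


def compose_schema(*attribute_sets):
    """Merge (set_name, attributes) pairs into one schema, rejecting name clashes."""
    schema = {}
    for set_name, attrs in attribute_sets:
        for attr in attrs:
            if attr in schema:
                raise ValueError(
                    f"{attr!r} is defined by both {schema[attr]} and {set_name}"
                )
            schema[attr] = set_name  # remember which set contributed the attribute
    return schema


schema = compose_schema(
    ("srb", SRB_METADATA),
    ("provenance", PROVENANCE_METADATA),
    ("resource", RESOURCE_METADATA),
)
```

A discipline-specific set can be appended as one more `(set_name, attributes)` pair without touching the others.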

  18. SDSC Storage Resource Broker & Meta-data Catalog
      [Architecture diagram; labels as extracted]
      • Application: C, C++, Linux I/O; Unix Shell; Java, NT Browsers; Prolog Predicate; Web
      • SRB; MCAT: Dublin Core, Resource, User, User Defined, Application Meta-data
      • Databases: DB2, Oracle, Postgres
      • Archives: HPSS, ADSM, UniTree, DMF
      • File Systems: Unix, NT, Mac OSX
      • Other labels: Third-party copy, Remote Proxies, HRM, DataCutter

  19. Latency Management
      • Data streaming
        • Overlap I/O access time with data movement
      • Data caching
        • Create a local copy to minimize I/O access time
      • Data replication
        • Choose between multiple sources for data access
      • Data aggregation
        • Use containers to hold multiple small data sets
      • I/O aggregation
        • Use remote proxies to do remote filtering/data subsetting
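
Three of the mechanisms above (caching, replication, and data aggregation) can be sketched in a few lines. This is a toy model under stated assumptions: the latency estimates, the `fetch` callable, and all names are hypothetical, and a real system would also handle cache invalidation and failures.

```python
class ReplicaReader:
    """Toy reader: pick the cheapest replica, then serve repeats from a local cache."""

    def __init__(self, replicas, fetch):
        # replicas: {logical_name: [(site, estimated_latency_seconds), ...]}
        # fetch: callable(site, logical_name) -> bytes, supplied by the caller
        self._replicas = replicas
        self._fetch = fetch
        self._cache = {}

    def read(self, logical_name):
        if logical_name in self._cache:  # data caching: reuse the local copy
            return self._cache[logical_name]
        # data replication: choose the source with the lowest estimated latency
        site, _ = min(self._replicas[logical_name], key=lambda r: r[1])
        data = self._fetch(site, logical_name)
        self._cache[logical_name] = data
        return data


def pack_container(objects):
    """Data aggregation: concatenate small objects, recording (offset, length) per name."""
    index, parts, offset = {}, [], 0
    for name, payload in objects.items():
        index[name] = (offset, len(payload))
        parts.append(payload)
        offset += len(payload)
    return b"".join(parts), index


def unpack(container, index, name):
    """Recover one object from a container using its index entry."""
    offset, length = index[name]
    return container[offset:offset + length]
```

Packing many small objects into one container means the storage system pays the access latency once per container rather than once per object.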

  20. Minimizing Latency in I/O Pipes
      [Diagram labels: Source, Network, Destination; Data Aggregation, Remote Proxies, Staging, Streaming, Caching, Replication]

  21. Knowledge Management
      • Must manage semantic relationships between the multiple name spaces
        • Data Grid
      • Must manage procedural relationships between digital library services
        • Federated digital library
      • Must manage structural relationships between different versions of software systems
        • Persistent archive

  22. Differentiating between Data, Information, and Knowledge
      • Data
        • Digital object
        • Objects are streams of bits
      • Information
        • Any tagged data, which is treated as an attribute
        • Attributes may be tagged data within the digital object, or tagged data that is associated with the digital object
      • Knowledge
        • Relationships between attributes
        • Relationships can be procedural/temporal, structural/spatial, logical/semantic, functional
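
The three levels can be made concrete with a small data model: bits for data, tagged attributes for information, and typed relationships between attributes for knowledge. The sketch below is only an illustration of the distinction; the class names and the example attribute names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DigitalObject:
    bits: bytes                                     # data: a stream of bits
    attributes: dict = field(default_factory=dict)  # information: tagged data


@dataclass(frozen=True)
class Relationship:
    kind: str    # "semantic", "procedural", "structural", or "functional"
    source: str  # attribute name
    target: str  # attribute name


# knowledge: relationships between attributes (example entries are made up)
knowledge = [
    Relationship("semantic", "creator", "author"),  # two tags for one concept
    Relationship("procedural", "raw_image", "calibrated_image"),
]


def related(attribute, kind):
    """Return the attributes linked to `attribute` by relationships of one kind."""
    return [r.target for r in knowledge if r.source == attribute and r.kind == kind]


obj = DigitalObject(bits=b"\x00\x01", attributes={"creator": "R. Moore"})
```

A query such as `related("creator", "semantic")` operates on the relationships alone and never needs to read the object's bits.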

  23. Types of Knowledge Relationships
      • Logical / semantic
        • Digital Library cross-walks
      • Temporal / procedural
        • Workflow systems
      • Spatial / structural
        • GIS systems
      • Functional / algorithmic
        • Scientific feature analysis

  24. Knowledge Based Digital Libraries
      [Diagram; labels as extracted]
      • Column headings: Ingest Services, Management, Access Services
      • Knowledge: Relationships Between Concepts; Knowledge Repository for Rules; Knowledge or Topic-Based Query / Browse; XTM DTD; Rules - KQL; (Model-based Access)
      • Information: Attributes, Semantics; Information Repository; Attribute-based Query; XML DTD; SDLIP; (Data Handling System - SRB)
      • Data: Fields, Containers, Folders; Storage (Replicas, Persistent IDs); Grids; Feature-based Query; MCAT/HDF

  25. Information Management Projects
      • Digital Libraries
        • NSF Digital Library Initiative, Phase II - UCSB, Stanford
        • Digital Embryo digital library - GMU
        • NPACI Digital Sky - Caltech 2MASS sky survey
        • CDL - AMICO
        • NSF NSDL - UCAR / DLESE
      • Grid Environments
        • NASA Information Power Grid - NASA Ames
        • DOE Data Visualization Corridor - LLNL
        • DOE Particle Physics Data Grid - Stanford, Caltech
        • NSF Grid Physics Network - U Fl
      • Persistent Archives
        • NARA Persistent Archive
        • NHPRC - Scalable archives

  26. Further Information http://www.npaci.edu/DICE
