1 / 59

Integration of Data Grids, Digital Libraries, and Persistent Archives

This paper explores the integration of data grids, digital libraries, and persistent archives to support data sharing, discovery, and preservation. It discusses the concepts behind data management, production data grid examples, and the challenges faced in this integration. The paper also highlights the common requirements for data management and presents data management concepts and mechanisms.

staciet
Télécharger la présentation

Integration of Data Grids, Digital Libraries, and Persistent Archives

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Integration of Data Grids, Digital Libraries, and Persistent Archives(Storage Resource Broker - SRB) Arcot Rajasekar Michael Wan Reagan W. Moore (sekar, mwan, moore)@sdsc.edu

  2. SDSC SRB Team • Reagan Moore • Michael Wan • Arcot Rajasekar • Wayne Schroeder • Arun Jagatheesan • Charlie Cowart • Lucas Gilbert • George Kremenek • Sheau-Yen Chen • Bing Zhu • Roman Olschanowsky (BIRN) • Vicky Rowley (BIRN) • Marcio Faerman (SCEC) • Antoine De Torcy (IN2P3) • Students & emeritus • Erik Vandekieft • Reena Mathew • Xi (Cynthia) Sheng • Allen Ding • Grace Lin • Qiao Xin • Daniel Moore • Ethan Chen • Jon Weinburg

  3. Topics • Concepts behind data management • Production data grid examples • Integration of data grids with digital libraries and persistent archives

  4. Data Grid • Support data sharing between institutions • Discover relevant data without knowing the file name • Access data without knowing the storage location or storage access protocol • Retrieve data using your preferred API • Organize distributed data in a collection hierarchy • Manage latency in wide-area-networks • Manage PetaBytes of data and hundreds of millions of files

  5. Digital Library • Provide curation services • Organization, description, and management of data • Support schema extension • Provide access services • Discovery, browsing, presentation, and manipulation of data • Federate semantics across collections • Digital library crosswalks

  6. Persistent Archive • Support archival processes • Appraisal, accession, arrangement, description, preservation, and access • Manage technology evolution while preserving integrity and authenticity of data • Minimize risk of data loss • Preserve collections for hundreds of years • Data replication

  7. Challenges • Each community assigns different meanings to terms used to describe their requirements • Data grid community • Persistent Archive is the infrastructure that manages storage technology evolution while preserving a collection • Archivist community • Persistent Archive is the collection that is being preserved in some choice of infrastructure • Together they define a preservation environment

  8. Challenges • Preservation community traditionally views technology evolution as the problem rather than the solution • Preservation requires the ability to manipulate old formats • Digital library community attempts to assert exact meaning for semantics. • Metadata Encoding and Transmission Standard is one approach towards the creation of a metadata framework with the ability to support extension schema • Data grid community has not chosen standards for distributed data management • Computer science is just starting to understand how to characterize and manage data, information, and knowledge

  9. To Make Progress • Develop simplest possible description for describing data, information, and knowledge management • Identify common infrastructure components • Apply in production settings • Iterate, based on new expectations for data management

  10. Common Requirements forData Management • Distributed data sources • Management across administrative domains • Heterogeneity • Multiple types of storage repositories • Scalability • Support for billions of digital entities, PetaBytes of data • Preservation • Management of technology evolution

  11. SRB Collections at SDSC

  12. Data Management Concepts(Elements) • Collection • The organization of digital entities to simplify management and access. • Context • The information that describes the digital entities in a collection. • Content • The digital entities in a collection

  13. Types of Context Metadata • Descriptive • Provenance information, discovery attributes • Administrative • Location, ownership, size, time stamps • Structural • Data model, internal components • Behavioral • Display and manipulation operations • Authenticity • Audit trails, checksums, access controls

  14. Metadata Standards • METS - Metadata Encoding Transmission Standard • Defines standard structure and schema extension • OAIS - Open Archival Information System • Preservation packages for submission, archiving, distribution • OAI - Open Archives Initiative • Metadata retrieval based on Dublin Core provenance attributes

  15. Data Management Concepts(Mechanisms) • Curation • The process of creating the context • Closure • Assertion that the collection has global properties, including completeness and homogeneity under specified operations • Consistency • Assertion that the context represents the content

  16. Information Technologies • Data collecting • Sensor systems, object ring buffers and portals • Data organization • Collections, manage data context • Data sharing • Data grids, manage heterogeneity • Data publication • Digital libraries, support discovery • Data preservation • Persistent archives, manage technology evolution • Data analysis • Processing pipelines, manage knowledge extraction

  17. Assertion • Data Grids provide the underlying abstractions required to support • Digital libraries • Curation processes • Distributed collections • Discovery and presentation services • Persistent archives • Management of technology evolution • Preservation of authenticity • The management of data requires the use of information (semantic labels). • The management of information requires the use of knowledge (relationships).

  18. Data Grid Terms • Data • Bits - zeros and ones • Digital Entity • The bits that form an image of reality (file, object, image, data, metadata, string of bits, structured sets of string of bits) • Information • Semantic labels applied to data • Metadata • Semantic label and the associated data (attribute name and attribute value) • Knowledge • Relationships between semantic labels applied to data • Relationships used to assert the application of a semantic label

  19. Data Grid Components • Federated client-server architecture • Servers can talk to each other independently of the client • Infrastructure independent naming • Logical names for users, resources, files, applications • Collective ownership of data • Collection-owned data, with infrastructure independent access control lists • Context management • Record state information in a metadata catalog from data grid services such as replication • Abstractions for dealing with heterogeneity

  20. Data Grid Abstractions • Logical name space for files • Global persistent identifier • Storage repository virtualization • Standard operations supported on storage systems • Information repository virtualization • Standard operations to manage collections in databases • Access virtualization • Standard interface to support alternate APIs • Latency management mechanisms • Aggregation, parallel I/O, replication, caching • Security interoperability • GSSAPI, inter-realm authentication, collection-based authorization

  21. Storage Repository Virtualization User Application Database File System Archive

  22. Storage Repository Virtualization Remote operations Unix file system Latency management Procedures Transformations Third party transfer Filtering Queries User Application Common set of operations for interacting with every type of storage repository Database File System Archive

  23. Mappings on ResourceName Space • Define logical resource name • List of physical resources • Replication • Write to logical resource completes when all physical resources have a copy • Load balancing • Write to a logical resource completes when copy exist on next physical resource in the list • Fault tolerance • Write to a logical resource completes when copies exist on “k” of “n” physical resources

  24. Containers • Archivists store hardcopy in “cardboard boxes” • A container is the digital equivalent, the aggregation of digital files into a single file, with an associated “packing list” • Containers are used to minimize access latency, keep similar digital entities together

  25. Data Stored at SDSC • HPSS archive • Stores 1 Petabyte of data • Stores 17 million files • Storage Resource Broker data grid • Stores 114 Terabytes of data • Stores 31 million files • Containers are used to aggregate files before loading into HPSS

  26. C, C++, Libraries Unix Shell Databases DB2, Oracle, Postgres Archives HPSS, ADSM, UniTree, DMF File Systems Unix, NT, Mac OSX SDSC Storage Resource Broker & Meta-data Catalog Application Linux I/O OAI WSDL Access APIs DLL / Python Java, NT Browsers GridFTP Consistency Management / Authorization-Authentication SRB Server Logical Name Space Latency Management Data Transport Metadata Transport Catalog Abstraction Storage Abstraction Databases DB2, Oracle, Sybase, SQLServer Drivers HRM

  27. Production Data Grid • SDSC Storage Resource Broker • Federated client-server system, managing • Over 100 TBs of data at SDSC • Over 25 million files • Manages data collections stored in • Archives (HPSS, UniTree, ADSM, DMF) • Hierarchical Resource Managers • Tapes, tape robots • File systems (Unix, Linux, Mac OS X, Windows) • FTP sites • Databases (Oracle, DB2, Postgres, SQLserver, Sybase, Informix) • Virtual Object Ring Buffers

  28. Data Virtualization User Application Database At U Md File System at U Texas Archive at SDSC

  29. Data Virtualization Logical name space Location independent identifier Persistent identifier Collection owned data Access controls Audit trails Checksums Descriptive metadata Inter-realm authentication Single sign-on system User Application Common naming convention and set of attributes for describing digital entities Database At U Md File System at U Texas Archive at SDSC

  30. Logical Name Space • Persistent, location-independent identifiers for digital entities • Organized as collection hierarchy • Attributes mapped to logical name space • Attributed managed in a database • Types of administrative metadata • Physical location of file • Owner, size, creation time, update time • Access controls

  31. File Identifiers • Logical file name • Infrastructure independent • Used to organize files into a collection hierarchy • Globally unique identifier • GUID for asserting equivalence across collections • Descriptive metadata • Support discovery • Physical file name • Location of file

  32. Information Repository Virtualization User Application • Operations used to manage • administrative, descriptive, • user-defined metadata • Import from XML file • Export to XML file • Bulk load • Bulk unload • Schema extension • Access controls • Dynamic SQL generation Common operations for managing a catalog in a database Choice of database for Metadata Catalog

  33. Map from API to remote operations Unix file system Latency management Procedures Transformations Third party transfer Filtering Queries C, C++, Libraries Unix Shell Access Virtualization Application Linux I/O OAI WSDL DLL / Python Java, NT Browsers GridFTP Common operations performed on all storage repositories

  34. Technology Evolution • All components of the “Persistent Archive” will evolve • Hardware systems • Software systems • Protocols • Access methods • Encoding syntax for digital entities • Create drivers for each new storage repository protocol • Migrate data to each new storage system • Manage evolution of the encoding syntax through either transformative migration or emulation

  35. Are Repeated Media Migrations Feasible? • At SDSC, cartridge capacity has increased from 200 Mbytes to 200 Gbytes for same cartridge cost • Only migrate to new technology when the cost per Gigabyte is a factor of two lower • Then the media cost is fixed when sum over all migrations (1 + 1/2 + 1/4 + 1/8 + 1/16 + 1/32 + …) = 2 • SDSC migrates to new media to reduce cost • All tape are stored in robots to minimize labor costs

  36. Transformative Migration versus Emulation versus Digital Ontology • Transformative Migration • Transform the encoding format to a new standard • Can combine encoding format transformation with media migration • Emulation • Create a transportable parser for the original encoding format • Migrate emulator forward in time • Example - Multivalent Browser (written in Java) for parsing pdf, laTex, … • Digital ontology • Characterize the structures and relationships present within the digital entity • Migrate the characterization forward in time

  37. Persistent Archives • When migrate from an old technology to a new technology, both versions are available. • Virtualization mechanisms used for federation across space can be used to manage migration over time • Persistent archives can be built on data grid infrastructure

  38. Automation of Archival Processes

  39. Data Grid Core Capabilities

  40. Information Repository Abstraction

  41. Distributed Resilient Architecture

  42. Virtual Data Grid

  43. Federated SRB server model Peer-to-peer Brokering Read Application Parallel Data Access Logical Name Or Attribute Condition 1 6 5/6 SRB server SRB server 3 4 5 SRB agent SRB agent 2 Server(s) Spawning R1 MCAT 1.Logical-to-Physical mapping 2.Identification of Replicas 3.Access & Audit Control R2 Data Access

  44. Latency Management -Bulk Operations • Bulk register • Create a logical name for a file • Load context (metadata) • Bulk load • Create a copy of the file on a data grid storage repository • Bulk unload • Provide containers to hold small files and pointers to each file location • Bulk delete • Requests for bulk operations for access control, …

  45. SRB Latency Management Remote Proxies, Staging Data Aggregation Containers Prefetch Network Destination Network Destination Source Caching Client-initiated I/O Streaming Parallel I/O Replication Server-initiated I/O

  46. Southern California Earthquake Center • Build community digital library • Manage simulation and observational data • Anelastic wave propagation output • 10 TBs, 1.5 million files • Provide web-based interface • Support standard services on digital library • Manage data distributed across multiple sites • USC, SDSC, UCSB, SDSU, SIO • Provide standard metadata • Community based descriptive metadata • Administrative metadata • Application specific metadata

  47. SCEC Digital Library Technologies • Portals • Knowledge interface to the library, presenting a coherent view of the services • Knowledge Management Systems • Organize relationships between SCEC concepts and semantic labels • Process management systems • Data processing pipelines to create derived data products • Web services • Uniform capabilities provided across SCEC collections • Data grid • Management of collections of distributed data • Computational grid • Access to distributed compute resources • Persistent archive • Management of technology evolution

  48. Metadata Organization (Domain View versus Run View) Provenance Simulation Model Program Computer System Velocity Model Fault Model Domain ... Spatial Temporal Physical Numerical Run Output Domain List Formatting

More Related