Integration of Data Grids, Digital Libraries, and Persistent Archives

Integration of Data Grids, Digital Libraries, and Persistent Archives(Storage Resource Broker - SRB) Arcot Rajasekar Michael Wan Reagan W. Moore (sekar, mwan, moore)@sdsc.edu

SDSC SRB Team • Reagan Moore • Michael Wan • Arcot Rajasekar • Wayne Schroeder • Arun Jagatheesan • Charlie Cowart • Lucas Gilbert • George Kremenek • Sheau-Yen Chen • Bing Zhu • Roman Olschanowsky (BIRN) • Vicky Rowley (BIRN) • Marcio Faerman (SCEC) • Antoine De Torcy (IN2P3) • Students & emeritus • Erik Vandekieft • Reena Mathew • Xi (Cynthia) Sheng • Allen Ding • Grace Lin • Qiao Xin • Daniel Moore • Ethan Chen • Jon Weinburg

Topics • Concepts behind data management • Production data grid examples • Integration of data grids with digital libraries and persistent archives

Data Grid • Support data sharing between institutions • Discover relevant data without knowing the file name • Access data without knowing the storage location or storage access protocol • Retrieve data using your preferred API • Organize distributed data in a collection hierarchy • Manage latency in wide-area-networks • Manage PetaBytes of data and hundreds of millions of files

Digital Library • Provide curation services • Organization, description, and management of data • Support schema extension • Provide access services • Discovery, browsing, presentation, and manipulation of data • Federate semantics across collections • Digital library crosswalks

Persistent Archive • Support archival processes • Appraisal, accession, arrangement, description, preservation, and access • Manage technology evolution while preserving integrity and authenticity of data • Minimize risk of data loss • Preserve collections for hundreds of years • Data replication

Challenges • Each community assigns different meanings to terms used to describe their requirements • Data grid community • Persistent Archive is the infrastructure that manages storage technology evolution while preserving a collection • Archivist community • Persistent Archive is the collection that is being preserved in some choice of infrastructure • Together they define a preservation environment

Challenges • Preservation community traditionally views technology evolution as the problem rather than the solution • Preservation requires the ability to manipulate old formats • Digital library community attempts to assert exact meaning for semantics. • Metadata Encoding and Transmission Standard is one approach towards the creation of a metadata framework with the ability to support extension schema • Data grid community has not chosen standards for distributed data management • Computer science is just starting to understand how to characterize and manage data, information, and knowledge

To Make Progress • Develop simplest possible description for describing data, information, and knowledge management • Identify common infrastructure components • Apply in production settings • Iterate, based on new expectations for data management

Common Requirements forData Management • Distributed data sources • Management across administrative domains • Heterogeneity • Multiple types of storage repositories • Scalability • Support for billions of digital entities, PetaBytes of data • Preservation • Management of technology evolution

SRB Collections at SDSC

Data Management Concepts(Elements) • Collection • The organization of digital entities to simplify management and access. • Context • The information that describes the digital entities in a collection. • Content • The digital entities in a collection

Types of Context Metadata • Descriptive • Provenance information, discovery attributes • Administrative • Location, ownership, size, time stamps • Structural • Data model, internal components • Behavioral • Display and manipulation operations • Authenticity • Audit trails, checksums, access controls

Metadata Standards • METS - Metadata Encoding Transmission Standard • Defines standard structure and schema extension • OAIS - Open Archival Information System • Preservation packages for submission, archiving, distribution • OAI - Open Archives Initiative • Metadata retrieval based on Dublin Core provenance attributes

Data Management Concepts(Mechanisms) • Curation • The process of creating the context • Closure • Assertion that the collection has global properties, including completeness and homogeneity under specified operations • Consistency • Assertion that the context represents the content

Information Technologies • Data collecting • Sensor systems, object ring buffers and portals • Data organization • Collections, manage data context • Data sharing • Data grids, manage heterogeneity • Data publication • Digital libraries, support discovery • Data preservation • Persistent archives, manage technology evolution • Data analysis • Processing pipelines, manage knowledge extraction

Assertion • Data Grids provide the underlying abstractions required to support • Digital libraries • Curation processes • Distributed collections • Discovery and presentation services • Persistent archives • Management of technology evolution • Preservation of authenticity • The management of data requires the use of information (semantic labels). • The management of information requires the use of knowledge (relationships).

Data Grid Terms • Data • Bits - zeros and ones • Digital Entity • The bits that form an image of reality (file, object, image, data, metadata, string of bits, structured sets of string of bits) • Information • Semantic labels applied to data • Metadata • Semantic label and the associated data (attribute name and attribute value) • Knowledge • Relationships between semantic labels applied to data • Relationships used to assert the application of a semantic label

Data Grid Components • Federated client-server architecture • Servers can talk to each other independently of the client • Infrastructure independent naming • Logical names for users, resources, files, applications • Collective ownership of data • Collection-owned data, with infrastructure independent access control lists • Context management • Record state information in a metadata catalog from data grid services such as replication • Abstractions for dealing with heterogeneity

Data Grid Abstractions • Logical name space for files • Global persistent identifier • Storage repository virtualization • Standard operations supported on storage systems • Information repository virtualization • Standard operations to manage collections in databases • Access virtualization • Standard interface to support alternate APIs • Latency management mechanisms • Aggregation, parallel I/O, replication, caching • Security interoperability • GSSAPI, inter-realm authentication, collection-based authorization

Storage Repository Virtualization User Application Database File System Archive

Storage Repository Virtualization Remote operations Unix file system Latency management Procedures Transformations Third party transfer Filtering Queries User Application Common set of operations for interacting with every type of storage repository Database File System Archive

Mappings on ResourceName Space • Define logical resource name • List of physical resources • Replication • Write to logical resource completes when all physical resources have a copy • Load balancing • Write to a logical resource completes when copy exist on next physical resource in the list • Fault tolerance • Write to a logical resource completes when copies exist on “k” of “n” physical resources

Containers • Archivists store hardcopy in “cardboard boxes” • A container is the digital equivalent, the aggregation of digital files into a single file, with an associated “packing list” • Containers are used to minimize access latency, keep similar digital entities together

Data Stored at SDSC • HPSS archive • Stores 1 Petabyte of data • Stores 17 million files • Storage Resource Broker data grid • Stores 114 Terabytes of data • Stores 31 million files • Containers are used to aggregate files before loading into HPSS

C, C++, Libraries Unix Shell Databases DB2, Oracle, Postgres Archives HPSS, ADSM, UniTree, DMF File Systems Unix, NT, Mac OSX SDSC Storage Resource Broker & Meta-data Catalog Application Linux I/O OAI WSDL Access APIs DLL / Python Java, NT Browsers GridFTP Consistency Management / Authorization-Authentication SRB Server Logical Name Space Latency Management Data Transport Metadata Transport Catalog Abstraction Storage Abstraction Databases DB2, Oracle, Sybase, SQLServer Drivers HRM

Production Data Grid • SDSC Storage Resource Broker • Federated client-server system, managing • Over 100 TBs of data at SDSC • Over 25 million files • Manages data collections stored in • Archives (HPSS, UniTree, ADSM, DMF) • Hierarchical Resource Managers • Tapes, tape robots • File systems (Unix, Linux, Mac OS X, Windows) • FTP sites • Databases (Oracle, DB2, Postgres, SQLserver, Sybase, Informix) • Virtual Object Ring Buffers

Data Virtualization User Application Database At U Md File System at U Texas Archive at SDSC

Data Virtualization Logical name space Location independent identifier Persistent identifier Collection owned data Access controls Audit trails Checksums Descriptive metadata Inter-realm authentication Single sign-on system User Application Common naming convention and set of attributes for describing digital entities Database At U Md File System at U Texas Archive at SDSC

Logical Name Space • Persistent, location-independent identifiers for digital entities • Organized as collection hierarchy • Attributes mapped to logical name space • Attributed managed in a database • Types of administrative metadata • Physical location of file • Owner, size, creation time, update time • Access controls

File Identifiers • Logical file name • Infrastructure independent • Used to organize files into a collection hierarchy • Globally unique identifier • GUID for asserting equivalence across collections • Descriptive metadata • Support discovery • Physical file name • Location of file

Information Repository Virtualization User Application • Operations used to manage • administrative, descriptive, • user-defined metadata • Import from XML file • Export to XML file • Bulk load • Bulk unload • Schema extension • Access controls • Dynamic SQL generation Common operations for managing a catalog in a database Choice of database for Metadata Catalog

Map from API to remote operations Unix file system Latency management Procedures Transformations Third party transfer Filtering Queries C, C++, Libraries Unix Shell Access Virtualization Application Linux I/O OAI WSDL DLL / Python Java, NT Browsers GridFTP Common operations performed on all storage repositories

Technology Evolution • All components of the “Persistent Archive” will evolve • Hardware systems • Software systems • Protocols • Access methods • Encoding syntax for digital entities • Create drivers for each new storage repository protocol • Migrate data to each new storage system • Manage evolution of the encoding syntax through either transformative migration or emulation

Are Repeated Media Migrations Feasible? • At SDSC, cartridge capacity has increased from 200 Mbytes to 200 Gbytes for same cartridge cost • Only migrate to new technology when the cost per Gigabyte is a factor of two lower • Then the media cost is fixed when sum over all migrations (1 + 1/2 + 1/4 + 1/8 + 1/16 + 1/32 + …) = 2 • SDSC migrates to new media to reduce cost • All tape are stored in robots to minimize labor costs

Transformative Migration versus Emulation versus Digital Ontology • Transformative Migration • Transform the encoding format to a new standard • Can combine encoding format transformation with media migration • Emulation • Create a transportable parser for the original encoding format • Migrate emulator forward in time • Example - Multivalent Browser (written in Java) for parsing pdf, laTex, … • Digital ontology • Characterize the structures and relationships present within the digital entity • Migrate the characterization forward in time

Persistent Archives • When migrate from an old technology to a new technology, both versions are available. • Virtualization mechanisms used for federation across space can be used to manage migration over time • Persistent archives can be built on data grid infrastructure

Automation of Archival Processes

Data Grid Core Capabilities

Information Repository Abstraction

Distributed Resilient Architecture

Virtual Data Grid

Federated SRB server model Peer-to-peer Brokering Read Application Parallel Data Access Logical Name Or Attribute Condition 1 6 5/6 SRB server SRB server 3 4 5 SRB agent SRB agent 2 Server(s) Spawning R1 MCAT 1.Logical-to-Physical mapping 2.Identification of Replicas 3.Access & Audit Control R2 Data Access

Latency Management -Bulk Operations • Bulk register • Create a logical name for a file • Load context (metadata) • Bulk load • Create a copy of the file on a data grid storage repository • Bulk unload • Provide containers to hold small files and pointers to each file location • Bulk delete • Requests for bulk operations for access control, …

SRB Latency Management Remote Proxies, Staging Data Aggregation Containers Prefetch Network Destination Network Destination Source Caching Client-initiated I/O Streaming Parallel I/O Replication Server-initiated I/O

Southern California Earthquake Center • Build community digital library • Manage simulation and observational data • Anelastic wave propagation output • 10 TBs, 1.5 million files • Provide web-based interface • Support standard services on digital library • Manage data distributed across multiple sites • USC, SDSC, UCSB, SDSU, SIO • Provide standard metadata • Community based descriptive metadata • Administrative metadata • Application specific metadata

SCEC Digital Library Technologies • Portals • Knowledge interface to the library, presenting a coherent view of the services • Knowledge Management Systems • Organize relationships between SCEC concepts and semantic labels • Process management systems • Data processing pipelines to create derived data products • Web services • Uniform capabilities provided across SCEC collections • Data grid • Management of collections of distributed data • Computational grid • Access to distributed compute resources • Persistent archive • Management of technology evolution

Metadata Organization (Domain View versus Run View) Provenance Simulation Model Program Computer System Velocity Model Fault Model Domain ... Spatial Temporal Physical Numerical Run Output Domain List Formatting

Integration of Data Grids, Digital Libraries, and Persistent Archives

Integration of Data Grids, Digital Libraries, and Persistent Archives

Presentation Transcript

Digital Libraries, Data Grids, and Persistent Archives Reagan W. Moore San Diego Supercomputer Center moore@sdsc.edu htt

Information Management and Distributed Data

Data Management Services Reagan W. Moore San Diego Supercomputer Center

Global Resource Naming

Knowledge-Based Persistent Archives Reagan W. Moore San Diego Supercomputer Center

Arun Jagatheesan Reagan Moore San Diego Supercomputer Center (SDSC)

Storage Resource Broker Tutorial

Persistent Management of Distributed Data Reagan W. Moore General Atomics, Inc.

Reagan Moore Richard Marciano Arcot Rajasekar Michael Wan Wayne Schroeder Antoine de Torcy

Persistent Archive Research Group GGF5 Reagan W. Moore San Diego Supercomputer Center

Arcot Rajasekar , Reagan Moore, Bertram Ludäscher, Ilya Zaslavsky ZASLAVSK@SDSC.EDU

Rule-Based Distributed Data Management

Policy-Based Distributed Data Management Systems

Rule-Based Preservation Systems

Data Grid Federation

Rule-Based Data Management Systems

Arcot Rajasekar Michael Wan Reagan W. Moore (sekar, mwan, moore)@sdsc

Data Grid Interactions with Firewalls Michael Wan Reagan Moore {mwan,moore}@sdsc

Data Grids, Digital Libraries, and Persistent Archives

Rule-Based Distributed Data Management iRODS 1.0 - Jan 23, 2008 irods.sdsc

Michael Moore

Information Management and Distributed Data