
Data Grids, Digital Libraries, and Persistent Archives


Presentation Transcript


  1. Data Grids, Digital Libraries, and Persistent Archives
  ESIP Federation Meeting
  Arcot Rajasekar, Michael Wan, Reagan Moore
  (sekar, mwan, moore)@sdsc.edu

  2. SDSC SRB Team • Arun Jagatheesan • George Kremenek • Sheau-Yen Chen • Arcot Rajasekar • Reagan Moore • Michael Wan • Roman Olschanowsky • Bing Zhu • Charlie Cowart • Wayne Schroeder • Vicky Rowley (BIRN) • Lucas Gilbert • Marcio Faerman (SCEC) • Antoine De Torcy (IN2P3) • Students & emeritus • Erik Vandekieft • Reena Mathew • Xi (Cynthia) Sheng • Allen Ding • Grace Lin • Qiao Xin • Daniel Moore • Ethan Chen • Jon Weinburg

  3. Topics
  • Concepts behind data management
  • Production data grid examples
  • Integration of data grids with digital libraries and persistent archives
  • Data grid demonstration based on the Storage Resource Broker (SRB)

  4. Data Management Concepts (Elements)
  • Collection: the organization of digital entities to simplify management and access
  • Context: the information that describes the digital entities in a collection
  • Content: the digital entities in a collection

  5. Types of Context Metadata
  • Descriptive: provenance information, discovery attributes
  • Administrative: location, ownership, size, time stamps
  • Structural: data model, internal components
  • Behavioral: display and manipulation operations
  • Authenticity: audit trails, checksums, access controls
  (A sample record is sketched below.)
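As a concrete illustration of these five categories, here is a small Python sketch of a context record for one digital entity. The layout and every attribute value are invented for the example and do not reflect the actual MCAT schema.

```python
# A hypothetical context-metadata record for one digital entity,
# grouped by the five metadata types listed above.
context = {
    "descriptive": {      # provenance and discovery attributes
        "title": "2MASS image tile (example)",
        "creator": "example-survey",
    },
    "administrative": {   # location, ownership, size, time stamps
        "location": "srb://example-zone/home/collection/tile001.fits",
        "owner": "curator",
        "size_bytes": 2_097_152,
        "created": "2003-06-01T12:00:00Z",
    },
    "structural": {       # data model, internal components
        "format": "FITS",
        "components": ["header", "image-array"],
    },
    "behavioral": {       # display and manipulation operations
        "operations": ["render-preview", "cutout"],
    },
    "authenticity": {     # audit trails, checksums, access controls
        "checksum_md5": "d41d8cd98f00b204e9800998ecf8427e",
        "acl": {"curator": "own", "public": "read"},
    },
}

for category, attrs in context.items():
    print(category, "->", ", ".join(attrs))
```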

  6. Metadata Standards
  • METS (Metadata Encoding and Transmission Standard): defines a standard structure and schema extensions
  • OAIS (Open Archival Information System): preservation packages for submission, archiving, and distribution
  • OAI (Open Archives Initiative): metadata retrieval based on Dublin Core provenance attributes

  7. Data Management Concepts (Mechanisms)
  • Curation: the process of creating the context
  • Closure: assertion that the collection has global properties, including completeness and homogeneity under specified operations
  • Consistency: assertion that the context represents the content

  8. Information Technologies
  • Data collecting: sensor systems, object ring buffers, and portals
  • Data organization: collections, which manage data context
  • Data sharing: data grids, which manage heterogeneity
  • Data publication: digital libraries, which support discovery
  • Data preservation: persistent archives, which manage technology evolution
  • Data analysis: processing pipelines, which manage knowledge extraction

  9. Data Management Challenges
  • Distributed data sources: management across administrative domains
  • Heterogeneity: multiple types of storage repositories
  • Scalability: support for billions of digital entities and petabytes of data
  • Preservation: management of technology evolution

  10. Data Grids
  • Distributed data sources: inter-realm authentication and authorization
  • Heterogeneity: storage repository abstraction
  • Scalability: differentiation between context and content management
  • Preservation: support for automated processing (migration, archival processes)

  11. Assertion
  Data grids provide the underlying abstractions required to support:
  • Digital libraries: curation processes, distributed collections, discovery and presentation services
  • Persistent archives: management of technology evolution, preservation of authenticity

  12. SRB Collections at SDSC

  13. Common Infrastructure
  • Digital libraries and persistent archives can be built on data grids
  • Common capabilities are needed for each environment
  • Multiple examples of production systems across scientific disciplines and federal agencies

  14. Data Grid Components
  • Federated client-server architecture: servers can talk to each other independently of the client
  • Infrastructure-independent naming: logical names for users, resources, files, and applications
  • Collective ownership of data: collection-owned data with infrastructure-independent access control lists (a sketch follows this list)
  • Context management: state information from data grid services, such as replication, is recorded in a metadata catalog
  • Abstractions for dealing with heterogeneity
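To make "collection-owned data" concrete: files belong to the collection, and the grid enforces its own access control lists rather than deferring to the storage system's permissions. A minimal sketch of such a check, with invented names and a toy permission model:

```python
# Hypothetical sketch: the data grid, not the storage system, owns the
# files and enforces access control lists recorded in its catalog.
ACLS = {
    # logical path -> {user: permission}
    "/home/survey/tile001.fits": {"curator": "write", "public": "read"},
}

def check_access(user: str, logical_path: str, action: str) -> bool:
    """Return True if the grid's ACL permits `action` for `user`."""
    acl = ACLS.get(logical_path, {})
    granted = acl.get(user, acl.get("public", "none"))
    order = {"none": 0, "read": 1, "write": 2}
    return order.get(granted, 0) >= order[action]

assert check_access("public", "/home/survey/tile001.fits", "read")
assert not check_access("public", "/home/survey/tile001.fits", "write")
```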

  15. Data Grid Abstractions
  • Logical name space for files: a global persistent identifier
  • Storage repository abstraction: standard operations supported on storage systems (sketched after this list)
  • Information repository abstraction: standard operations to manage collections in databases
  • Access abstraction: a standard interface that supports alternate APIs
  • Latency management mechanisms: aggregation, parallel I/O, replication, caching
  • Security interoperability: GSSAPI, inter-realm authentication, collection-based authorization
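The storage repository abstraction is the key to heterogeneity: every backend implements the same small set of operations behind a driver. A sketch of the idea in Python; the interface and class names are invented and are not the SRB driver API:

```python
from abc import ABC, abstractmethod

class StorageRepository(ABC):
    """Standard operations every storage backend must support."""
    @abstractmethod
    def read(self, path: str, offset: int, length: int) -> bytes: ...
    @abstractmethod
    def write(self, path: str, data: bytes) -> None: ...

class PosixFileSystem(StorageRepository):
    """Driver for an ordinary file system."""
    def read(self, path, offset, length):
        with open(path, "rb") as f:
            f.seek(offset)
            return f.read(length)
    def write(self, path, data):
        with open(path, "wb") as f:
            f.write(data)

# An archive driver (e.g., for HPSS or UniTree) would implement the
# same two methods over the archive's native protocol; clients that
# program against StorageRepository never notice the difference.
```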

  16. SDSC Storage Resource Broker & Metadata Catalog
  [Architecture diagram: applications reach the SRB server through access APIs (C/C++ libraries, Unix shell, Linux I/O, DLL/Python, Java, NT browsers, OAI, WSDL, GridFTP). The server layer provides consistency management and authorization/authentication, the logical name space, latency management, and data and metadata transport. A catalog abstraction maps to databases (DB2, Oracle, Sybase, SQLServer), and a storage abstraction with drivers (including HRM) maps to archives (HPSS, ADSM, UniTree, DMF), file systems (Unix, NT, Mac OS X), and databases (DB2, Oracle, Postgres).]

  17. Production Data Grid
  The SDSC Storage Resource Broker is a federated client-server system managing:
  • Over 90 TB of data at SDSC, in over 16 million files
  • Data collections stored in archives (HPSS, UniTree, ADSM, DMF), hierarchical resource managers (tapes, tape robots), file systems (Unix, Linux, Mac OS X, Windows), FTP sites, databases (Oracle, DB2, Postgres, SQLServer, Sybase, Informix), and virtual object ring buffers

  18. Data Access: Federated SRB Server Model (Peer-to-Peer Brokering)
  [Diagram: a read application presents a logical name or attribute condition to an SRB agent; the MCAT performs logical-to-physical mapping, identification of replicas, and access and audit control; peer SRB servers broker the request, spawning servers as needed, and the data is returned from resources R1 and R2 via parallel data access.]

  19. Logical Name Space
  • Global, location-independent identifiers for digital entities
  • Organized as a collection hierarchy
  • Attributes mapped onto the logical name space and managed in a database
  • Types of administrative metadata: physical location of the file; owner, size, creation time, update time; access controls (a minimal record is sketched below)
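A catalog entry ties a logical name to its replicas and administrative attributes. A hypothetical record, not the MCAT schema:

```python
# Hypothetical catalog entry: one logical name, its replicas, and
# the administrative metadata listed above.
catalog = {
    "/home/survey/tile001.fits": {
        "replicas": [
            {"resource": "sdsc-hpss", "path": "/hpss/survey/t001"},
            {"resource": "sdsc-disk", "path": "/cache/survey/t001"},
        ],
        "owner": "curator",
        "size_bytes": 2_097_152,
        "created": "2003-06-01T12:00:00Z",
        "updated": "2003-06-02T08:30:00Z",
    }
}

def resolve(logical_name: str) -> list[dict]:
    """Logical-to-physical mapping: return all replica locations."""
    return catalog[logical_name]["replicas"]

print(resolve("/home/survey/tile001.fits"))
```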

  20. File Identifiers
  • Logical file name: infrastructure independent; used to organize files into a collection hierarchy
  • Globally unique identifier: a GUID for asserting equivalence across collections
  • Descriptive metadata: supports discovery
  • Physical file name: the location of the file

  21. Mappings on the Name Space
  Define a logical resource name as a list of physical resources; different write-completion rules give different behaviors (sketched below):
  • Replication: a write to the logical resource completes when all physical resources have a copy
  • Load balancing: a write to the logical resource completes when a copy exists on the next physical resource in the list
  • Fault tolerance: a write to the logical resource completes when copies exist on "k" of "n" physical resources
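The three mappings differ only in when a write to the logical resource is declared complete. A toy sketch, assuming in-memory dictionaries as stand-ins for physical resources:

```python
def write_logical(data: bytes, stores: list, policy: str, k: int = 1) -> int:
    """Write to a logical resource (an ordered list of physical stores).

    replication:     complete only when every store holds a copy
    load_balancing:  complete after one copy on the next store in line
    fault_tolerance: complete once k of the n stores hold a copy
    """
    written = 0
    for store in stores:
        if store is None:          # simulate an unavailable resource
            continue
        store["copy"] = data
        written += 1
        if policy == "load_balancing" and written == 1:
            break                  # a real system rotates the start index
        if policy == "fault_tolerance" and written >= k:
            break
    if policy == "replication" and written < len(stores):
        raise IOError("replication requires a copy on every resource")
    return written

stores = [{}, None, {}]            # middle resource is down
print(write_logical(b"img", stores, "fault_tolerance", k=2))   # -> 2
```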

  22. Grid Bricks
  Integrate the data management system, data processing system, and data storage system into a modular unit:
  • Commodity disk systems (1 TB)
  • Memory (1 GB)
  • CPU (1.7 GHz)
  • Network connection (Gig-E)
  • Linux operating system
  Data grid technology manages the name spaces:
  • User names (authentication, authorization)
  • File names
  • Collection hierarchy

  23. Data Grid Brick
  Hardware components:
  • Intel Celeron 1.7 GHz CPU
  • SuperMicro P4SGA PCI local-bus ATX mainboard
  • 1 GB memory (266 MHz DDR DRAM)
  • 3Ware Escalade 7500-12 port PCI-bus IDE RAID 10
  • Western Digital Caviar 200-GB IDE disk drives
  • 3Com Etherlink 3C996B-T PCI-bus 1000Base-T
  • Redstone RMC-4F2-7 4U ten-bay ATX chassis
  • Linux operating system
  Cost is $2,200 per TB plus tax; a Gig-E network switch adds $500 per brick, for an effective cost of about $2,700 per TB.

  24. Grid Bricks at SDSC
  • Used to implement "picking" environments for 10-TB collections: web-based access, plus web services (WSDL/SOAP) for data subsetting
  • Implemented 15 TB of storage: astronomy sky surveys, the NARA prototype persistent archive, NSDL web crawls
  • Linux security patches must still be applied to each Grid Brick individually

  25. Grid Brick Logical Names
  • Logical name space for files: a common collection hierarchy across modules
  • Collection-owned data: the data grid manages access control lists for files and manages authentication
  • Logical resource name: aggregates modules under a single logical name and supports load leveling across modules (a sketch follows)
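Load leveling across aggregated bricks can be as simple as placing each new file on the least-full module. A toy illustration with invented brick names:

```python
# Hypothetical: several Grid Bricks aggregated under one logical
# resource; each write is leveled onto the emptiest brick.
bricks = {"brick1": 0, "brick2": 0, "brick3": 0}   # bytes used

def place(filename: str, size: int) -> str:
    """Pick the least-loaded brick for the new file."""
    target = min(bricks, key=bricks.get)
    bricks[target] += size
    return target

for i in range(6):
    print(f"file{i}", "->", place(f"file{i}", size=10**9))
```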

  26. Latency Management: Bulk Operations
  • Bulk register: create a logical name for a file and load its context (metadata)
  • Bulk load: create a copy of the file on a data grid storage repository
  • Bulk unload: provide containers that hold small files, plus pointers to each file's location (sketched after this list)
  • Bulk operations are also requested for delete, access control, ...
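Containers amortize per-file latency: many small files are packed into one large object, and an index records each member's offset so a single ranged read retrieves it later. A minimal sketch of that bookkeeping (this is not the SRB container format):

```python
import io

def pack(files: dict[str, bytes]) -> tuple[bytes, dict]:
    """Pack small files into one container; the index records offsets."""
    buf, index, offset = io.BytesIO(), {}, 0
    for name, data in files.items():
        index[name] = (offset, len(data))   # pointer to the member
        buf.write(data)
        offset += len(data)
    return buf.getvalue(), index

def unpack_one(container: bytes, index: dict, name: str) -> bytes:
    """Retrieve a single member with one ranged read."""
    offset, length = index[name]
    return container[offset:offset + length]

container, index = pack({"a.txt": b"alpha", "b.txt": b"bravo"})
print(unpack_one(container, index, "b.txt"))   # b'bravo'
```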

  27. SRB Latency Management
  [Diagram: latency-management mechanisms between source and network destination — remote proxies and staging, data aggregation in containers, prefetch, caching with client-initiated I/O, streaming, parallel I/O, and replication with server-initiated I/O.]

  28. Latency Management Example: Digital Sky Project
  • 2MASS (2 Micron All Sky Survey): Bruce Berriman, John Good, and Wen-Piao Lee (IPAC, Caltech)
  • NVO (National Virtual Observatory): Tom Prince (Caltech), Roy Williams (CACR, Caltech), John Good (IPAC, Caltech)
  • SDSC SRB: Arcot Rajasekar, Mike Wan, George Kremenek, Reagan Moore

  29. Digital Sky Data Ingestion
  [Diagram: input tapes from the telescopes are loaded at IPAC/Caltech, with the star catalog in Informix on a Sun server; the data flows through the SRB on a Sun E10K at SDSC into an 800 GB data cache and a 10 TB HPSS archive.]

  30. Digital Sky: 2MASS
  http://www.ipac.caltech.edu/2mass
  • The input data was originally written to DLT tapes in the order seen by the telescope
  • 10 TB of data in 5 million files
  • Ingestion took nearly 1.5 years and required manual loading of tapes
  • The SRB aggregated the images into 147,000 containers (roughly 34 images per container on average)

  31. Digital Sky Web-Based Data Retrieval
  [Diagram: web clients at IPAC/Caltech and JPL query Sun and SGI web servers, which retrieve images through the SRB on a Sun E10K at SDSC, backed by an Informix catalog, an 800 GB cache, and a 10 TB HPSS archive; on average 3,000 images are retrieved per day.]

  32. Remote Proxies
  • Task: extract an image cutout from the Digital Palomar Sky Survey; each image is 1 GB
  • Shipping the image to a server to extract the cutout took 2-4 minutes (5-10 MB/sec)
  • A remote proxy instead performs the cutout directly on the storage repository, extracting it with partial file reads; cutouts return in 1-2 seconds
  • Remote proxies are a mechanism for aggregating I/O commands (a partial-read sketch follows)
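The speedup comes from moving the operation to the data: the proxy seeks to just the bytes the cutout needs instead of shipping the whole image. A sketch of the partial-read idea; the byte offsets are arbitrary, and a real cutout would follow the image format's indexing:

```python
import os, tempfile

def read_region(path: str, offset: int, length: int) -> bytes:
    """Read only the requested byte range, not the whole file."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)

# Demo: a 1 MB stand-in for the 1 GB image; only 16 bytes move.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(os.urandom(1 << 20))
    path = f.name
cutout = read_region(path, offset=512_000, length=16)
print(len(cutout))   # 16
os.unlink(path)
```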

  33. Real-Time Data Example: RoadNet Project
  • Manage interactions with a virtual object ring buffer (VORB)
  • Demonstrate federation of object ring buffers (ORBs)
  • Demonstrate integration of archives, VORBs, and file systems
  • Support queries on objects in VORBs

  34. Federated VORB Operation
  [Diagram: a client requests sensor data by logical name with stream characteristics. The VORB catalog (VCAT) performs logical-to-physical mapping (the physical sensors are identified), identification of replicas (ORB1 and ORB2 are identified as sources of the required data), and access and audit control. ORB1, at Nome, is checked and found to be down, so ORB2 is contacted automatically; ORB2 is up, and the sensor data is retrieved (from Boston), formatted, and transferred to the client through the VORB servers.]

  35. Access Abstraction Example: Data Assimilation Office
  HSI has implemented a metadata schema in SRB/MCAT:
  • Origin: host, path, owner, uid, gid, perm_mask, [times]
  • Ingestion: date, user, user_email, comment
  • Generation: creator (name, uid, user, gid), host (name, arch, OS name & flags), compiler (name, version, flags), library, code (name, version), accounting data
  • Data description: title, version, discipline, project, language, measurements, keywords, sensor, source, production status, temporal/spatial coverage, location, resolution, quality
  Fully compatible with GCMD (a sample record is sketched below).
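Written out as a record, the schema's four blocks might look like the following; all values are invented for illustration, and the authoritative schema lives in SRB/MCAT:

```python
# Hypothetical DAO-style record following the four blocks above.
dao_record = {
    "origin": {
        "host": "example-host.gov", "path": "/data/run42.hdf",
        "owner": "dao", "uid": 1001, "gid": 100, "perm_mask": "0644",
    },
    "ingestion": {
        "date": "2003-05-10", "user": "operator",
        "user_email": "operator@example.gov", "comment": "nightly load",
    },
    "generation": {
        "creator": {"name": "assim-pipeline", "uid": 1001},
        "host": {"name": "build-host", "arch": "sparc", "os": "Solaris"},
        "compiler": {"name": "f90", "version": "6.2", "flags": "-O2"},
        "code": {"name": "assim", "version": "3.1"},
    },
    "data_description": {
        "title": "Assimilated wind field", "discipline": "atmospheres",
        "keywords": ["wind", "reanalysis"], "resolution": "1x1 deg",
    },
}
print(sorted(dao_record))
```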

  36. Data Management System: Software Architecture

  37. DODS Access Environment Integration

  38. Data Grid Federation
  • Data grids provide the ability to name, organize, and manage data on distributed storage resources
  • Federation provides a way to name, organize, and manage data on multiple data grids

  39. Peer-to-Peer Federated Systems
  Consistency constraints in federations: when a digital entity is cross-registered from one collection into another,
  • Who manages the access control lists?
  • Who maintains consistency between context and content?
  • How can federation systems be characterized?

  40. SRB Zones
  • Each SRB zone uses a metadata catalog (MCAT) to manage the context associated with digital content
  • Context includes administrative, descriptive, and authenticity attributes, plus users, resources, and applications

  41. SRB Peer-to-Peer Federation
  Mechanisms to impose consistency and access constraints on:
  • Resources: controls on which zones may use a resource
  • User names (user-name / domain / SRB-zone): users may be registered into another domain but retain their home zone, similar to Shibboleth (a name-parsing sketch follows this list)
  • Data files: controls on who specifies replication of data
  • MCAT metadata: controls on who manages updates to metadata
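Federation turns user identity into a user-name/domain/zone triple. A small parser for such identifiers, purely illustrative; the "/" separator follows the slide's own notation, not necessarily SRB's actual syntax:

```python
from typing import NamedTuple

class GridUser(NamedTuple):
    name: str
    domain: str
    zone: str    # the user's home zone is retained across domains

def parse_user(qualified: str) -> GridUser:
    """Split a 'user/domain/zone' identifier (illustrative syntax)."""
    name, domain, zone = qualified.split("/")
    return GridUser(name, domain, zone)

u = parse_user("rajasekar/sdsc/zoneA")
print(u.zone)   # a remote zone still sees home zone 'zoneA'
```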

  42. Peer-to-Peer Federation
  • Occasional interchange: for specified users
  • Replicated catalogs: entire state information is replicated
  • Resource interaction: data replication
  • Replicated data zones: no user interactions between zones
  • Master-slave zones: slaves replicate data from the master zone
  • Snowflake zones: a hierarchy of data replication zones
  • User/data replica zones: user access from a remote zone to the home zone
  • Nomadic zones ("SRB in a box"): synchronize a local zone to a parent zone
  • Free-floating ("myZone"): synchronize without a parent zone
  • Archival ("backup zone"): synchronize to an archive
  SRB version 3.0.1 was released December 19, 2003.

  43. Principal peer-to-peer federation approaches (1,536 possible combinations)

  44. Comparison of Peer-to-Peer Federation Approaches
  [Table: the federation approaches (occasional interchange, replicated catalog, resource interaction, replicated data, master-slave, snowflake, user and data replica, nomadic, free-floating, archival) compared along several dimensions — user-ID sharing (one shared, partial, complete), resource sharing (none, partial, complete), metadata synchronization (none, partial, complete), access controls (system set, system controlled), replication (system managed), and zone organization (hierarchical, super-administrator zone control, connection from any zone).]

  45. Knowledge-Based Data Grid Roadmap
  [Diagram: a three-layer roadmap with ingest services on one side and access services on the other. The knowledge layer manages relationships between concepts in a knowledge repository for rules (XTM DTD, rules/KQL), supporting knowledge- or topic-based query and browse (model-based access). The information layer manages attributes and semantics in an information repository (XML DTD, MCAT/HDF), supporting attribute-based query via SDLIP. The data layer (the data handling system) manages fields, containers, and folders on storage grids with replicas and persistent IDs, supporting feature-based query.]

  46. Data Grid Demonstration
  • Use a web browser to access a collection housed at SDSC
  • Retrieve an image
  • Browse through a collection
  • Search for a file
  • Examine grid federation

  47. For More Information
  Reagan W. Moore
  San Diego Supercomputer Center
  moore@sdsc.edu
  http://www.npaci.edu/DICE
  http://www.npaci.edu/DICE/SRB
  http://www.npaci.edu/dice/srb/mySRB/mySRB.html
