
Knowledge-Based Persistent Archives Reagan W. Moore San Diego Supercomputer Center


Presentation Transcript


  1. Knowledge-Based Persistent Archives Reagan W. Moore San Diego Supercomputer Center 9500 Gilman Drive, La Jolla, CA 92093-0505 Phone: 858 534-5073 FAX: 858 534-5152 E-mail: moore@sdsc.edu

  2. Staff Reagan Moore Chaitan Baru Sheau Yen Chen Charles Cowart Amarnath Gupta George Kremenek Bertram Ludäscher Richard Marciano Arcot Rajasekar Abe Singer Michael Wan Ilya Zaslavsky Bing Zhu Students - GSRA Martin Kuhl Liying Sui Yang Yu Valter Crescenzi Students - Undergrad Interns Peter Shin Roman Olshanowsky Shabbar Tambawala Pratik Mukhopadhyay +/- NN Data Intensive Computing Environment

  3. Research and Development Activities - FY00 • Demonstration of scalable systems • Expansion of persistent archive framework • Knowledge-based persistent archives • Demonstration of archivable forms for new types of data • Web, GIS, compound documents, collections • Knowledge and anomaly processing • Tightness of fit of XML DTDs • Self-validating archives as a preservation strategy

  4. Topics • Persistent archive functionality • Characterization of • Data / Information / Knowledge • Integration of Digital Library, Grid environments, and Persistent Archives

  5. Persistent Archive • Manage digital objects for the “life of the republic” • Maintain the ability to discover and access digital objects while the supporting hardware and software systems evolve

  6. Fundamental Concept for a Persistent Archive • Persistence requires migration over time onto new technology • While the migration occurs, a persistent archive must be able to interoperate with both the old technology and the new technology. • A persistent archive is an interoperability system.

  7. Implicit Concepts for Persistent Archive • Infrastructure independence • Data set access • Authentication • Collection management • Presentation • Non-proprietary formatting • Information models • XML - Information markup language • GML - Graphics markup language • Support for ingestion, management, access • Accessioning workbench, archive, access workbench

  8. Standard Information Markup Language • XML representation of metadata attributes • Standardization of DTDs - MOA II DTD for text • Standardization of markup language • XML-based representation of collection structure • Attributes defining the physical layout of a schema into relational tables (foreign keys, attribute data types, …) • XML databases & XML organized data collections • Commercial systems: Excelon, TAMINO, Oracle8i • XML-based Topic Maps • Represent relationships between collection domain concepts and collection attributes

  9. E-mail Collection • Test of the scalability of the technology • Archived a one-million record E-mail collection (1999) • Ingestion • Tagged E-mail using XML syntax (6 required, 13 optional, 1000 user-defined tags) • Created description of the collection • Aggregated E-mail into containers, stored in an archive • Retrieved collection description, created database, and optimized for query • Total time was 27 hours (used 10 Mbit/sec Ethernet)
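The ingestion steps on this slide (tag each message in XML, then aggregate the tagged records into containers) might be sketched as follows. This is only an illustration using the Python standard library: the header names, the element names, and the container size are assumptions, since the slide does not list the actual required tags or the container format. The last line works out the average rate implied by one million records in 27 hours.

```python
import xml.etree.ElementTree as ET
from email import message_from_string

# Hypothetical subset of headers to tag; the slide does not name the 6 required tags.
TAGGED_HEADERS = ["From", "To", "Date", "Subject"]


def tag_email(raw: str) -> ET.Element:
    """Wrap selected headers and the body of one message in XML elements."""
    msg = message_from_string(raw)
    record = ET.Element("email")
    for header in TAGGED_HEADERS:
        child = ET.SubElement(record, header.lower())
        child.text = msg.get(header, "")
    body = ET.SubElement(record, "body")
    body.text = msg.get_payload()
    return record


def aggregate(records, container_size):
    """Group tagged records into fixed-size containers (plain lists here)."""
    return [records[i:i + container_size]
            for i in range(0, len(records), container_size)]


raw = (
    "From: a@example.org\n"
    "To: b@example.org\n"
    "Date: Mon, 1 Mar 1999 10:00:00 -0800\n"
    "Subject: test\n"
    "\n"
    "Hello\n"
)
record = tag_email(raw)
xml_text = ET.tostring(record, encoding="unicode")
containers = aggregate([record] * 5, container_size=2)

# Average rate implied by the slide: 1,000,000 records in 27 hours.
records_per_second = 1_000_000 / (27 * 3600)
```

At roughly ten records per second on average, the reported 27 hours covers the whole pipeline, including retrieval and database creation, not just tagging.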

  10. What Types of Interoperability are Needed? • Data management (digital objects) • Ability to work with multiple types of storage systems, across separate administration domains • Information management (attributes) • Ability to define a collection independent of database choice • Ability to migrate collection onto new databases • Knowledge management (relationships) • Ability to manage relationships • Ability to map domain concepts to collection attributes

  11. Simplest Definitions • Data • Digital object • Objects are streams of bits • Information • Any tagged data, which is treated as an attribute. • Attributes may be tagged data within the digital object, or tagged data that is associated with the digital object • Knowledge • Relationships between attributes • Relationships can be procedural/temporal, structural/spatial, logical/semantic, functional
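One way to read these three definitions is as three small record types. The sketch below is an interpretation, not the project's data model: the class and field names are hypothetical, and the sample object and attribute values are made up for illustration. Only the four relationship kinds come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DigitalObject:
    """Data: an uninterpreted stream of bits."""
    object_id: str
    bits: bytes


@dataclass(frozen=True)
class Attribute:
    """Information: a tagged value, inside or associated with a digital object."""
    object_id: str
    tag: str
    value: str


@dataclass(frozen=True)
class Relationship:
    """Knowledge: a typed relationship between two attributes (named by tag)."""
    kind: str
    source_tag: str
    target_tag: str


# The four relationship kinds named on the slide.
KINDS = {"procedural/temporal", "structural/spatial", "logical/semantic", "functional"}

obj = DigitalObject("doc-1", b"Sen. Smith (CA) introduced ...")
info = [Attribute("doc-1", "senator", "Smith"), Attribute("doc-1", "state", "CA")]
knowledge = [Relationship("logical/semantic", "senator", "state")]
```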

  12. Types of Knowledge Relationships • Logical / semantic • Digital Library cross-walks • Temporal / procedural • Workflow systems • Spatial / structural • GIS systems • Functional / algorithmic • Scientific feature analysis

  13. ANATOM

  14. Data Archive Ingest Services Management Access Services Ingestion platform Data repositories Access platform Interoperability Standards Interoperability Protocols

  15. Collection Based Persistent Archive Ingest Services Management Access Services Information Repository Attribute- based Query Attributes Semantics SDLIP Information XML DTD (Data Handling System - SRB / FTP / HTTP) Data Fields Containers Folders Storage (Replicas, Persistent IDs) Grids Feature-based Query MCAT/HDF

  16. Knowledge Based Persistent Archive Ingest Services Management Access Services Relationships Between Concepts Knowledge Repository for Rules Knowledge or Topic-Based Query / Browse Knowledge XTM DTD • Rules - KQL (Topic Maps / Buckets / Model-based Access) Information Repository Attribute- based Query Attributes Semantics SDLIP Information XML DTD (Data Handling System - SRB / FTP / HTTP) Data Fields Containers Folders Storage (Replicas, Persistent IDs) Grids Feature-based Query MCAT/HDF

  17. Ingestion Processes for Collection Creation Accession Template Closure Concept/Attribute Attribute Inverse Indexing Information Generation Knowledge Generation Attribute Selection Attribute Tagging Occurrence Tagging View Management Data Organization Collection

  18. Examples of Implied Knowledge: Senate Legislative Activities • Structural knowledge • Pertinent information embedded in document headers • Procedural knowledge • Naming convention • Senator represented by last name • Senator represented by last name and state • Senator represented by last name, first name, and state • Collection knowledge • Referenced senators include senators no longer in the senate
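The naming convention above is implied knowledge that an ingestion procedure has to make explicit. As a rough sketch, the three reference forms could be normalized with one pattern. The surface syntax assumed here ("Smith", "Smith (CA)", "Smith, John (CA)") is hypothetical; the slide names the three forms but not how they appear in the documents.

```python
import re

# Assumed surface forms (hypothetical):
#   "Smith"             last name only
#   "Smith (CA)"        last name and state
#   "Smith, John (CA)"  last name, first name, and state
SENATOR = re.compile(
    r"^(?P<last>[^,()]+?)"
    r"(?:,\s*(?P<first>[^()]+?))?"
    r"(?:\s*\((?P<state>[A-Z]{2})\))?$"
)


def parse_senator(reference: str) -> dict:
    """Return the name parts present in one senator reference."""
    match = SENATOR.match(reference.strip())
    if match is None:
        raise ValueError(f"unrecognized senator reference: {reference!r}")
    return {key: value for key, value in match.groupdict().items() if value is not None}


parsed = [parse_senator(s) for s in ("Smith", "Smith (CA)", "Smith, John (CA)")]
```

Resolving a bare last name to one person still needs the collection knowledge from the slide, for example a roster that includes senators no longer in the Senate.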

  19. Knowledge Generation • Accessioning Template • Defines the concepts under which the data objects will be tagged and organized • Attribute selection • Define the attributes that represent the information content associated with the domain concepts • Tag attributes using minimal constraint language, such as XML or XMLSchema • Evaluate closure of mined attributes compared to expected attributes • Refine concept map
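The closure evaluation in this step can be read as a set comparison between the attributes the accessioning template expects and the attributes actually mined from the collection. The sketch below assumes that reading; the function name and the sample attribute names are hypothetical.

```python
def closure_report(expected, mined):
    """Compare expected attributes (from the template) with mined attributes."""
    expected, mined = set(expected), set(mined)
    return {
        "missing": sorted(expected - mined),      # expected but never found
        "unexpected": sorted(mined - expected),   # found but not in the template
        "closed": expected == mined,
    }


report = closure_report(
    expected=["senator", "state", "bill_number"],
    mined=["senator", "state", "committee"],
)
```

A non-empty "unexpected" list is a cue to refine the concept map; a non-empty "missing" list suggests the tagging rules did not fire.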

  20. Information Generation • Create occurrence index • (Occurrence, attribute, value) • This is needed to be able to recreate original form of digital object • Analyze completeness of information • Inverse index of attribute values • Identifies unexpected values - consistency • Analyze closure of collection • Are additional attributes needed to represent inverse index value ranges?
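A minimal sketch of the two indexes described here, assuming each digital object has already been tagged as an ordered list of (attribute, value) pairs. The occurrence is modeled as (object id, position), which is an assumption; the slide only gives the triple (occurrence, attribute, value). The sample documents and values are made up.

```python
from collections import defaultdict


def build_occurrence_index(tagged_objects):
    """Flatten tagged objects into (occurrence, attribute, value) triples."""
    index = []
    for object_id, pairs in tagged_objects.items():
        for position, (attribute, value) in enumerate(pairs):
            index.append(((object_id, position), attribute, value))
    return index


def build_inverse_index(occurrence_index):
    """Map attribute -> value -> list of occurrences holding that value."""
    inverse = defaultdict(lambda: defaultdict(list))
    for occurrence, attribute, value in occurrence_index:
        inverse[attribute][value].append(occurrence)
    return inverse


def reconstruct(occurrence_index, object_id):
    """Recover the original ordered (attribute, value) pairs of one object."""
    rows = [(occ[1], attr, val) for occ, attr, val in occurrence_index
            if occ[0] == object_id]
    return [(attr, val) for _, attr, val in sorted(rows)]


tagged = {
    "doc1": [("senator", "Smith"), ("state", "CA")],
    "doc2": [("senator", "Jones"), ("state", "C A")],  # malformed state value
}
occurrences = build_occurrence_index(tagged)
inverse = build_inverse_index(occurrences)
state_values = sorted(inverse["state"])
```

Listing the distinct values per attribute, as in `state_values`, is what exposes unexpected entries such as the malformed state above, while `reconstruct` shows why the occurrence position must be kept to recreate the original form.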

  21. Data Organization • Archive preferred views of collection • Original data • XML tagged representation • Minimal representation of consolidated information • ‘Noise-free’ version based upon occurrence tags • Object-relational database version • Archive occurrence tagged view • Archive ingestion procedures that transform collection from the original digital objects to the preferred views

  22. Information Management Projects • Digital Libraries • NSF Digital Library Initiative, Phase II - UCSB, Stanford • Digital Embryo digital library - GMU • NPACI Digital Sky - Caltech 2MASS sky survey • CDL - AMICO • NSF NSDL - UCAR / DLESE • Grid Environments • NASA Information Power Grid - NASA Ames • DOE Data Visualization Corridor - LLNL • DOE Particle Physics Data Grid - Stanford, Caltech • NSF Grid Physics Network - U Fl • Persistent Archives • NARA Persistent Archive • NHPRC - Scalable archives

  23. ERA Concept model

  24. File SID DBLobj SID Obj SID SRB Unix DB2 Oracle ADSM HPSS Data Handling System SDSC Storage Resource Broker & Meta-data Catalog Application Resource Third-party copy User Remote Proxies MCAT Dublin Core DataCutter Application Meta-data

  25. 1. NVO Portals and Workbenches NVO Data Grid 2. Knowledge & Resource Management Bulk Data Analysis Metadata View Data View Catalog Analysis 3. Concept space Standard APIs and Protocols 4.Grid Security Caching Replication Backup Scheduling Information Discovery Metadata delivery Data Discovery Data Delivery 5. Standard Metadata format, Data model, Wire format 6. Catalog Mediator Data mediator Catalog/Image Specific Access Compute Resources Catalogs Data Archives Derived Collections 7. Data model, schema, services, and mapping to NVO concept space published into (2) when a collection joins the federation

  26. Persistent Archive Framework • Persistent archive functionality - Accessioning platform • Data management - Archive Markup Language (AML), Container management • Collection management - Validation of collection, collection characterization • Knowledge management - Workflow staging, procedure management for ingestion process, anomaly detection, characterization of inherent implied knowledge • Scale - collections of millions to billions of objects

  27. Globus Data Grid Architecture • Application - Discipline-specific data grid application • User - Coherency control, replica selection, task management, virtual data catalog, virtual data code catalog, … • Collective - Replica catalog, replica management, co-allocation, certificate authorities, metadata catalogs • Resource - Access to data, access to computers, access to network performance data, … • Connect - Communication, service discovery (DNS), authentication, authorization, delegation • Fabric - Storage systems, clusters, networks, network caches, …

  28. Persistent Archive Framework • Persistent archive functionality - Repository • Data management - Storage system (robot, media, caching software), media migration, disaster recovery (archive namespace to container mapping) • Collection management - Container to object mapping, object metadata storage • Knowledge management - Transaction logging, AML migration on access or on media migration • Scale - thousands of collections, billions of objects, petabytes of data

  29. Globus Protocols, Services, and Interfaces Occur at Each Level Applications Languages/Frameworks User Service APIs and SDKs User Service Protocols User Services Collective Service APIs and SDKs Collective Service Protocols Collective Services Resource APIs and SDKs Resource Service Protocols Resource Services Connectivity APIs Connectivity Protocols Local Access APIs and Protocols Fabric Layer

  30. Persistent Archive Framework • Persistent archive functionality - Access platform • Data management - Data caching, container caching, disk cache management • Information management - Collection instantiation, access query, browsing support • Knowledge management - Order processing and workflow tracking, product authentication, usage characterization, presentation management • Scale - Millions of accesses per day

  31. Globus Layered Grid Architecture (By Analogy to Internet Architecture) • Application • User - “Specialized services”: user- or appln-specific distributed services • Collective - “Managing multiple resources”: ubiquitous infrastructure services • Resource - “Sharing single resources”: negotiating access, controlling use • Connectivity - “Talking to things”: communication (Internet protocols) & security • Fabric - “Controlling things locally”: access to, & control of, resources • Internet Protocol Architecture layers shown alongside: Application, Transport, Internet, Link

  32. Persistent Archive Framework • Persistent archive functionality - ARC • Data management - Finding aid storage • Collection management - Catalog of collections, access query, browse, disaster backup mechanisms, collection descriptors • Knowledge management - Characterization of finding aid efficiency, presentation management, concept spaces spanning collections • Scale - thousands of collections

  33. referenced items & collections referenced items & collections Referenced Items & Collections Portals & Clients Portals & Clients Portals & Clients NSDL Services NSDL Services Other NSDL Services NSDL Collections NSDL Collections NSDL Collections Core Services: annotation CI Services query transform CI Services topic-map registry Core Services: metadata normalizing CI Services personalization Core Collection- Building Services metadata harvesting CI Services discussion Core Collection- Building Services persistent storage CI Services visualization... User Interfaces Usage Enhancement Delivery Presentation Aggregation - Channels Information about collections Core NSDL Bus Meta-data delivery Data delivery Query Global Ids Security Network Metadata & data access-based services Virtual Collections & Mediators Collection Building

  34. Cross Cutting Issues • Global namespace • Metadata used by data handling system to locate containers • Metadata used to characterize objects in containers • Metadata used to characterize collections • Metadata used to locate collections • Consistency of metadata while updating
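The layers of metadata listed here can be pictured as a two-level lookup: a logical name resolves to a container, and the container resolves to a physical location. The sketch below is an assumption about how such a namespace could be organized, not a description of SRB or MCAT; every class, method, storage-system label, and path in it is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContainerLocation:
    storage_system: str
    path: str


class GlobalNamespace:
    """Two-level mapping: (collection, object) -> container -> physical location."""

    def __init__(self):
        self._objects = {}      # (collection, name) -> (container_id, offset, length)
        self._containers = {}   # container_id -> ContainerLocation

    def register_container(self, container_id, location):
        self._containers[container_id] = location

    def register_object(self, collection, name, container_id, offset, length):
        if container_id not in self._containers:
            raise KeyError(f"unknown container: {container_id}")
        self._objects[(collection, name)] = (container_id, offset, length)

    def resolve(self, collection, name):
        """Return where the bits of a logically named object currently live."""
        container_id, offset, length = self._objects[(collection, name)]
        return self._containers[container_id], offset, length

    def migrate_container(self, container_id, new_location):
        """Media migration: only the container-level entry changes."""
        if container_id not in self._containers:
            raise KeyError(f"unknown container: {container_id}")
        self._containers[container_id] = new_location


ns = GlobalNamespace()
ns.register_container("c-001", ContainerLocation("old-tape", "/archive/c-001"))
ns.register_object("email-1999", "msg-42", "c-001", offset=4096, length=512)
before = ns.resolve("email-1999", "msg-42")
ns.migrate_container("c-001", ContainerLocation("new-tape", "/vault/c-001"))
after = ns.resolve("email-1999", "msg-42")
```

Keeping the object-to-container mapping separate from the container-to-storage mapping means a media migration touches one entry per container instead of one per object. Keeping the two maps consistent during such updates is the last issue the slide raises.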

  35. Cross Cutting Issues • Knowledge management • Workflow systems to monitor state of system, monitor transactions, monitor updates to system architecture, monitor consistency of global namespace • Data distribution • Caching of data between accessioning platform, archive, and access platform • Consistency during updates

  36. Cross Cutting Issues • Security • Authentication across platforms • Authorization across platforms for updates • Consistency of architecture • Audit trails for updates • Validation of integrity of system • State management for system components

  37. Research Challenges - 2000 • Infrastructure independence • Progress on archivable form creation • Digital paper • Finding aids for a million collections • Concept spaces that support identification of collection • Product authentication • Tracking all updates, movements, media migrations, collection instantiations • Choice of Archival Markup Language • Tracking of E-commerce implementations • Knowledge management systems • Workflow, ingestion processing steps, system evolution procedures, finding aid concept spaces

  38. Further Information http://www.npaci.edu/DICE
