1 / 50

Approaches to the Integration of Distributed and Heterogeneous Data Resources

Approaches to the Integration of Distributed and Heterogeneous Data Resources . Ahmet Sayar Indiana University Computer Science Department. Motivation. Integrating data from multiple data sources Distributed query and transactions of data.

salena
Télécharger la présentation

Approaches to the Integration of Distributed and Heterogeneous Data Resources

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Approaches to the Integration of Distributed and Heterogeneous Data Resources Ahmet Sayar Indiana University Computer Science Department

  2. Motivation • Integrating data from multiple data sources • Distributed query and transactions of data. • Definitions and adoptions of data, metadata and their storages. • Accessing the data seamlessly. • Transparency, support for heterogeneity, extensibility and scalability.

  3. Outline • Data Integration Approaches • Application Specific Solutions • Application-Integration Framework • ASIS (Application Specific Information System) • Database Federation • Ogsa-DAI (Ogsa-Data Access and Integration) • CompareASIS with Ogsa-DAI • Digital Libraries • SRB (Storage Resource Broker) • Sompel’s Digital Library Approach • CompareASIS with SRB and Sompel’s DL

  4. Application Specific Solutions • The most common means of data integration • Expensive -in terms of time and skills • Developing and using requires deep system knowledge • Better results for special-purpose applications • Fragile • Changes to the underlying sources may easily break the application • Hard to extend • A new data source requires new code to be written

  5. Outline • Data Integration Approaches • Application Specific Solutions • Application-Integration Framework • ASIS • Database Federation • Ogsa-DAI • CompareASIS with Ogsa-DAI • Digital Libraries • SRB • Sompel’s DL • CompareASIS with SRB and Sompel’s DL

  6. Application-Integration Framework • It can also be called component-based framework • Such as CORBA or Filters with common interfaces • Not necessarily address data integration issues • Based on common data model (such as CML and GML) • With adaptors, if the source change the adaptor may have to change, but application may never see it. • Adding a new source is easy • a new adaptor may need to be written. • The adaptor may already be exist online. • No need to detailed system knowledge • Ex. ASIS - OGC GIS Application Integration Framework

  7. ASIS (1) • Enables inter-service communication through well-defined service interfaces, message formats and capabilities metadata. • Data model is ASL (Application Specific Lang.) • Metadata model is capability document • Data and metadata have common predefined schema • Components are Filter Services • Web Services, comon service interfaces defined in WSDL • Information/data services enabling distributed access, querying and transformation through their predictable input/output interfaces. • Chainable, located, and capable of updating their metadata manually or dynamically

  8. ASIS (2) • Data and data storage model • Any data can be integrated into the system after transforming to ASL. • Heterogeneity is handled at the end-Filters with adaptors. • ASL is community-accepted application specific language • GML (Geographic Markup Lang.) in GIS applications • CML (Chemistry Markup Lang.) in Chemistry applications • Filter’s common service interfaces • getCapabilities, getData, getFeatureInfo. • Requests to Filter’s interfaces • getCapabilitiesReq, getDataReq, getFeatureInfoReq • Expected return types are defined in Filters’ capability metadata

  9. ASIS (3) • Metadata and Metadata storage model: • Data integration is done through Filters’ capability metadata • Metadata is stored in local Filter’s file system as a flat file. • Capability: • Inspired from OGC WMS capability specification. • Look like Dublin Core format. • Capability like structure is also used in Gannon’s approach (XPOLA), for Grid services’ security issues. • Describes dynamic Web/Grid resources. • Updated manually or dynamically. • Consists of descriptor, service and provider metadata • Inter-service communication is achieved without a third-party. Enables chain of Filters.

  10. State Boundary Earth Fault ASIS (4)Data Access and Filter Chaining • Each Filter is capable of acting as both a server and a client • Capability integration is done through “getCapability” service interface • Requests for common service interfaces are created in accordance with predefined XML schema F3 F1 State Boundary F2 F4 Earth Fault Fault

  11. Outline • Data Integration Approaches • Application Specific Solutions • Application-Integration Framework • ASIS • Database Federation • Ogsa-DAI • Compare ASIS with Ogsa-DAI • Digital Libraries • SRB • Sompel’s DL • CompareASIS with SRB and Sompel’s DL

  12. Database Federation • Middleware consisting of database management system • Uniform access to number of heterogeneous data sources • Provides query language used to combine, contrast, analyze and manipulate the data • Data integration is done through Database integration. • Combine data from multiple sources in a single SQL statement – query recreation. • Ex. Ogsa-DAI (Open Grid Service Architecture – Data Access and Integration)

  13. Ogsa-DAI (1) • Provides common Java API for accessing and integrating data resources –such relational and XML databases, and files- in Grid environment • Specifically designed for OGSA architecture • SQL queries on relational resources and XPath statements on XML collections • Provides data pipelining (similar to Filter chaining) via an XML document called “perform” document. • Allows developers to easily add or extend functionality within Ogsa-DAI, “activity” document.

  14. Ogsa-DAI (2) • Data and storage model : • Any data stored in XML or relational databases, files • No common data model • Data is provided through GDS (Grid Data Services) • Uses Ogsa-DQP (Distributed Query Processor) to coordinate to access to multiple data services • The enactment engine is the core of Ogsa-DAI. Orchestrate running of the perform document • Information in perform document includes: • The list of activities and their XML schemas and implementation classes. • The list of role mappers and details • The info about data resource

  15. Ogsa-DAI (3) • Metadata storage model: • Metadata is kept in Catalog Service (MCS) • MCS enables attribute-based querying • Metadata is for the datasets, data can be anything (binary, text ..) • Data integration is done through XML based activity file mixing activities (in SQL queries) and metadata • Simple data access scenario • A client contacts a DAISGR first to locate the GDSFs. • Accesses suitable GDSFs directly to find out more about their properties and the data resources they represent. • Asks GDSF to instantiate a GDS • Accesses resource by sending the GDS the GDS-Perform doc.

  16. Ogsa-DAI (4) • Metadata model: • No common schema for metadata like capability • Defines Metadata for the datasets • No schema in XML • Stored in Database tables as attributes • Defines Metadata for the Database system to enable querying and defining activities • Schema in XML (mcsActivity.xsd schema file) • Kept as XML file in the file system (mcsActivity.xml)

  17. ASIS vs. Ogsa-DAI • Ogsa-DAI does not define metadata and data in XML schema. Metadata is mixed with Database schema. ASIS has predefined data and metadata models. • Ogsa-DAI uses any data, and they have predefined Database schema to enable querying and accessing data. • ASIS’s data integration is on demand and based on capability federation. Instead, Ogsa-DAI’s data integration is coded in XML struc perform and activity documents. • Ogsa-DAI has central (MCS), ASIS has distributed metadata approach. • Both system are based on Web Services. • Ogsa-DAI uses GridFTP, and ASIS uses NaradaBrokering for the performance issues in data transfers.

  18. Outline • Data Integration Approaches • Application Specific Solutions • Application-Integration Framework • ASIS • Database Federation • Ogsa-DAI • Compare ASIS with Ogsa-DAI • Digital Libraries • SRB • Sompel’s DL • Compare ASIS with SRB and Sompel’s DL

  19. Digital Libraries • Main focus is publishing and discovering of the digital objects. • Digital Objects : file, URL, SQL command string and any string of bits. • Collects data from multiple different data sources. • It is little bit different from the other data integration approaches • Data curation services – such as publishing and removing data from the data sources. • Ex. SRB (Storage Resource Broker) and Sompel’s Digital Library Approach

  20. SRB (1) • A federated client server system • Each server managing/brokering a set of resources • An implementation architecture for • Data grids • Digital Libraries. • Storage resources include digital libraries, MSS, UniTree and file systems • SRB consists of three components • MCAT services, • SRB servers to access to storage repositories and • SRB clients • Mediates access to distributed heterogeneous resources • Uses MCAT (Metadata Catalog Service) to facilitate brokering and attribute based querying. • Integrates data and metadata

  21. SRB (2) • Data and storage model: • Uniform storage interface • Resource-specific drivers to map from defined storage to interface • Storage resources are registered within SRB as physical resources • Logical resources (LSR) enable replication. • LSR = one or more than one physical resource • Client API refers to LSR. Collections are created by LSR • Metadata storage model (MCAT): • Serves both a core-metadata and domain-dependent metadata • Core-metadata is a standardized schema like Dublin Core • Stores metadata about data, collections, users, resources, methods • Attribute based access and querying, updating metadata catalog • Implemented as a relational database. Oracle, DB2 or Sybase • Abstraction and Replica information for data • “Global user” name space and authentication • Authorization through ACL and tickets

  22. SRB (3) • Metadata and Metadata Exchange Model: • MAPS (Metadata Attribute Presentation Structure) • Independent of the internal representation of the attributes inside the catalog. • Provides a uniform interface specification that can be used between user applications and the MCAT catalog and vice verse. • Structures which form the MAPS: • MAPS_Query_Struct, • MAPS_Result_Struct, • MAPS_Update_Struct and • MAPS_Definition_Struct • Mapping from MAPS to other models and exchange format. Dublin Core format is under implementation.

  23. SRB (4) • Simple data access scenario: • SRB server spawns SRB agent to authenticate the user/Application by comparing it with information stored in MCAT. • Find the location in MCAT. • Check user request against permissions stored in MCAT. • SRB agent contacts user with the result of his request. • SRB agent communicates with the user through a port specific to this client session. • SRB server chaining scenario (integrated SRBs): • First 3 steps from simple data access case. • SRB agent contacts remote SRB agent via remote SRB server. • The second SRB agent returns the pointer to the data item to the first SRB agent which passes it on to the user. • The SRB client interact with the data item directly. The federated SRB scheme -SRB server acts as a client to another.

  24. ASIS vs. SRB • SRB doesn’t define metadata in XML structure (as ASIS does) • SRB uses any data but ASIS uses ASL • SRB keeps the metadata in Catalogue Services (MCAT). ASIS uses XML structured capability metadata • SRB has central metadata handling approach, ASIS has distributed metadata handling approach • ASIS’s data integration is based on metadata federation, SRB’s data integration is based on SRB server federation. • Instead of Filters, SRB uses SRB server and agents for accessing data resources.

  25. Sompel’s DL (1) • Scholarly communication as a network-based workflow • Instead of Filters and ASL in ASIS, Sompel defines “repositories” and “digital objects”, respectively. • Repository is a networked system that provides services pertaining to a collection of Digital Objects • Repositories have common service interfaces. • “Obtain”, “Harvest” and “Put”. • Two classes of participants. • Data providers (DP) and Service providers (SP) • SP collect metadata from DPs (via 3 service interface); normalize and cluster it to deal with duplicates. • DP offer some type of search mechanism for their own repositories.

  26. Sompel’s DL (2) • Data and storage model: • Data is the abstraction of the Digital Objects • Digital Objects = Digital data + key metadata. • Serialization of Digital Objects = Surrogates • Surrogates • Information for the value chains and service • information used at repository service interfaces. • In the XML/RDF format • Composed of “dataStream” and/or “Entity” tag elements. • Chained object is defined by keymetadataID or “providerInfo”. • Different storage types: book repositories, teaching object repositories, dataset repositories etc. • Repositories are active nodes. Repositories enable the use and re-use of materials in many contexts.

  27. Sompel’s DL (3) • Metadata model: • Surrogates are essentially metadata records for objects • Based on Dublin Core format with domain specific extensions. • Dublin core has 15 standard entities to define resources. • For more details see http://doublincore.org • Chaining for integrating data: • Application/User doesn’t need to use workflow engine or script to create or run the chain. (As in ASIS) • Chain (they call “value chain”) is hidden in the surrogates. • Surrogates are updated through the common interfaces (“put” “obtain” and “harvest”) of the resources. • Chain is defined in the “Entity” element in the surrogate document with the “Lineage” sub element. • Sample chaining scenario: • A paper might have references to some papers and these papers might be references to some other papers…. • Value chain does not stop. • Papers have different metadata (value added) through value chain

  28. ASIS vs. Sompel’s Approach • Instead of Filters and ASL in ASIS, Sompel defines “repositories” and “digital objects” respectively • DP correspond to End-Filters, and SP correspond to Filters in ASIS • ASIS do not have publishing or putting service interfaces • “Obtain” corresponds to “getData” in ASIS • “Harvest” corresponds to “getCapabilities” in ASIS • Both have distributed metadata approaches for data integration • ASIS – direct communication between Filters by using “GetCapabilities” interface • Sompe’s DL – direct communication between repositories and services by using “Harvest” interface • Sompel’s DL uses Dublin Core for the representation of the resources – ASIS uses its own schema. • ASIS uses ASL for the representation of the data - Sompel’s approach doesn’t have common data model.

  29. Summary • Application-Integration Framework (ASIS) • Easy to add new sources • Using online Filters providing required adaptors • peer-to-peer chain of Filters • no central metadata catalog server – Distributed capability exchange and aggregation • SOA • Re-usable components (Filters) for different applications in predefined domain • Implications of Filter services • Scalable and Fault-tolerant • Load-balancing and caching • Dynamically updating capability metadata

  30. THANKS !

  31. APPENDIX

  32. Capability in Grid Services Security • XPOLA • The infrastructure is built on a peer-to-peer chain-of-trust model. No central admins • WS-Security compliant • Extensible – PKI and SAML based • Dynamic and reusable (manually or automatically generated) • Composed of two sectors. • Policy document (SAML, lifetime info, binding info etc.) • Provider’s signature • Existing grid security solutions to fine-grained authorization were not addressing general Web/Grid services in compliant with Web Services security specs. • With central admins, other approaches don’t address dynamic services

  33. Sample Capabilities File (too simplified) – GIS Domain • <?xml version='1.0' encoding="UTF-8" standalone="no" ?> <!DOCTYPE WMT_MS_Capabilities SYSTEM "http://toro.ucs.indiana.edu:8086/xml/capabilities.dtd"> <Capabilities version="1.1.1" updateSequence="0">      <Service>           <Name>CGL_Mapping</Name>           <Title>CGL_Mapping WMS</Title>           <OnlineResource xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple“ xlink:href="http://toro.ucs.indiana.edu:8086/WMSServices.wsdl" />      <ContactInformation> ….. </ContactInformation> </Service>      <Capability>           <Request>               <GetCapabilities>                    <Format>WMS_XML</Format>                       <DCPType><HTTP><Get>                         <OnlineResource xmlns:xlink="http://w3.org/1999/xlink" xlink:type="simple“ xlink:href="http://toro.ucs.indiana.edu:8086/WMSServices.wsdl" />                       </Get></HTTP></DCPType>                </GetCapabilities>                <GetMap>                     <Format>image/GIF</Format>                     <Format>image/PNG</Format>                       <DCPType><HTTP><Get>                         <OnlineResource xmlns:xlink="http://w3.org/1999/xlink" xlink:type="simple“ xlink:href="http://toro.ucs.indiana.edu:8086/WMSServices.wsdl" />                       </Get></HTTP></DCPType>               </GetMap>           </Request>           <Layer>                <Name>California:Faults</Name>                <Title>California:Faults</Title>                <SRS>EPSG:4326</SRS>                <LatLonBoundingBox minx="-180" miny="-82" maxx="180" maxy="82"  / >            </Layer>      </Capability> </Capabilities>

  34. Dublin Core • Challenge of resource description and discovery • Language for making a particular class of statements about resources • There 2 namespaces – Dublin Core element set (dc)and Dublin Core qualifiers (dcq ex. dcq:iso8601). • Some of Dublin core metadata element set • Title (dc:title), subject, description, creator, publisher, type, format, source, language, rights • Using DC in RDF, specifications for DC in RDF (work in progress) • Resource has(verb) property(dc:creator) X(dc:Ahmet)

  35. Sample Dublin Core http://www.ils.unc.edu/mrc/jcdl2006/slides/kunze.pdf

  36. Open Archive InitiativeOAI

  37. OAI • Deals with e-print server world • Need to develop services that permitted searching across papers housed at multiple repositories • Repositories also needed capabilities to automatically identify and copy papers that had been deposited in them. • Definition of an interface to permit e-print servers to expose the metadata for the papers that it held. • Service providers with similar metadata standards need to harvest this metadata • Service providers act as a federation of repositories, by indexing documents, so that multiple collections cen be searched as though they form a single collection

  38. OAI-PMH • For the variety of the communities engaged in publishing content on the Web • Any networked server can emplly the protocol to enable service providers to collect its metadata • HTTP-based request-response transaction • Service Providers • Harvest metadata from Data Providers using the OAI protocol and use the returned metadata as a basis for building value-added services. • Data Providers (repositories) • Adopt OAI technical as a means of exposing metadata about their content.

  39. Comments on OAI • OAI-PMH is ultimately only as useful as the metadata it transports. • The tendency of implementers to almost exclusively apply the lowest common denominator of unqualified dublin core makes it difficult to implement more advanced search interface features. • Content providers should prefer more expressive metadata schema like MARC or qualified DC and find ways to augment human-generated descriptive metadata.

  40. Sompel’s Digital Library Approach

  41. Sompel’s ApproachHierarchy steps http://msc.mellon.org/Meetings/Interop/lagoze_data_model.pdf

  42. Sompel’s DLData Model msc.mellon.org/Meetings/Interop/lagoze_data_model.pdf

  43. Ogsa-DAI

  44. Ogsa-DAI Figure http://www.globus.org/grid_software/data/dai.php

  45. Perform Document http://www.ogsadai.org.uk/documentation/ogsadai-wsi-2.2/doc/interaction/Perform.html

  46. MCS • MCS present a design of Metadata Catalog Service that provides mechanism for storing and accessing descriptive metadata attributes • Requirements: Store domain-independent attributes, user-defined attributes, query with a set of attributes, query with a logical name, authentication, authorization and auditing • Allows users to discover data sets based on the value of descriptive attributes, rather then requiring to know specific names or physical locations of data items

  47. MCAT vs. MCS • MCAT can be used just with SRB • MCS can be used just in OGSA architecture • MCAT stores both physical and logical addresses • MCS stores logical metadata attributes and handles that can be resolved by a data location or data access services. • They can both be extended for serving application-specific metadata, but they don’t have generalized way for doing that.

  48. SRB

  49. SRB

  50. CLIENT • Example interaction with SRB using Scommands: • Sinit • Start interaction with SRB • Spwd • Display current position within SRB repository • Smeta -i –I “UDSMD0=‘author’” –I “UDSMD1=‘bob’” myfile • Add metadata describing the author the file • Smeta -i –I “UDSMD0=‘author’” –I “UDSMD1=‘arthur’” • Search for files with author metadata set as arthur • Sget myFile • Copy myFile from SRB to local storage • Sreplicate –S anotherResource myFile • Create a replica of myFile on anotherResource • Srm myFile • Remove myFile (and all replicas) from SRB • Sexit • End interaction with SRB

More Related