270 likes | 416 Vues
Integrating Multiple Data Sources using a Standardized XML Dictionary. Ramon Lawrence University of Manitoba umlawren@cs.umanitoba.ca Supervisor: Dr. Ken Barker. TR Labs - Winnipeg. Outline. Introduction, Motivation, and Background Integration architecture components
E N D
Integrating Multiple Data Sources using a Standardized XML Dictionary Ramon Lawrence University of Manitoba umlawren@cs.umanitoba.ca Supervisor: Dr. Ken Barker TRLabs - Winnipeg
Outline • Introduction, Motivation, and Background • Integration architecture components • Integration architecture • Example integration • Applications to the WWW • Future work and conclusions • Demonstration of Unity
Introduction • Integration of data is required when accessing multiple databases within an organization or on the WWW. • Our focus is automatically combining database schema using schema integration. • Schema integration requires knowledge of data semantics and use of metadata.
Motivation • Organizations have several database systems which must interoperate. • Users often access multiple Web databases whose knowledge must be integrated and presented in a useful form. • Data warehouses and OLAP systems require data semantics to be understood and data to be cleansed and summarized.
Background • Schema integration involves combining diverse database schema into an integrated view by resolving conflicts. • Schema conflicts include naming, structural, and semantic conflicts. • Schema integration is required for database interoperability, but it is currently a manual process.
GTS GTS GTS GTS LDBS LDBS LDBS LDBS MDBS Architecture Global Transactions • Global Transaction Manager (GTM) • processes global transactions • insures information in all LDBSs is consistent • submits subtransactions to the GTSs for each LDBS GTM subtransactions • Global Transaction Servers (GTSs) • one for each LDBS • converts subtransactions from the GTM into a form usable by the LDBS and vice versa • Local Database Systems (LDBSs) • databases combined into MDBS • unchanged as still process local transactions Local Transactions
Previous Work • Research systems: • integrating systems by logical rules (Sheth) • defining global dictionaries (Castano) • Carnot Project using the Cyc knowledge base • Industrial systems and standards: • Metadata Interchange Specification (MDIS) • XML, BizTalk, E-commerce portals
Architecture Objective • The objective of our architecture is to provide a system for automatically integrating diverse relational schemas into a multidatabase • Desirable properties: • individual mappings - information sources integrated one-at-a-time and independently • global view constructed for query transparency • handles schema conflicts - including semantic, structural, and naming conflicts • automated global integration - global view constructed efficiently and automatically
The Idea • The major idea is that schema conflicts can be resolved if we: • eliminate all naming conflicts • define a language capable of determining schema equivalence and performing transformations • With these two properties, schema conflicts can be resolved automatically at the global level
Architecture Components: The Global Dictionary • A global dictionary (GD) provides standardized terms to capture data semantics. • Hierarchy of terms related by IS-A or Has-A links • Contains base set of common database concepts, but new concepts can be added • A GD term is a single, unambiguous semantic definition. • Several GD entries for a single English word are required if the word has multiple definitions.
Architecture Components:Using the Global Dictionary • GD terms are used to build semantic names to describe the semantics of schema elements. • Semantic names have the form: • semantic name = “[“CT [[;CT] | [,CT]] “]” CN • CT = context term, CN = concept name • each CT and CN is a single term from the GD • Semantic names are included in specifications describing a data source.
Architecture Components:X-Specs • Database metadata and semantic names are combined into specifications called X-Specs: • stored and transmitted using XML • contains information on a relational schema • organized into database, table, and field levels • stores semantic names to describe and integrate schema elements
Architecture Components:Integrating X-Specs • Each database to be integrated is described using a X-Spec. • Identical concepts in different databases are identified by similar semantic names. • Concepts with identical (or hierarchially related) semantic names are combined regardless of their physical representation in the individual databases.
Integration Architecture • Our integration architecture consists of two separate phases: • capture process: X-Specs are constructed for each data source independently • integration process: X-Specs are combined using the integration algorithm which matches semantic names using the global dictionary
Integration Architecture:The Capture Process • Capture process involves: • automatically extracting the schema information and metadata using a specification editor • assigning semantic names to each schema element (tables and fields) to capture their semantics
Integration Architecture:The Capture Process Relational Schema Automatic Extraction X-Spec Specification Editor DBA Lookup of terms Global Dictionary
Integration Architecture:The Integration Process • Integration process involves: • automatically identifying identical concepts by matching semantic names • constructing a global view of database concepts consisting of a hierarchy of concept terms • resolving structural differences during query generation and submission (e.g. a concept may be represented as a table in one database and a field (attribute) in another)
Integration Architecture:The Integration Process …………. Client Client Integration Site Subtransactions X-Spec X-Spec …….. RDBMS RDBMS
Integration Architecture Benefits • The benefits of the two phase architecture are: • Dynamic integration: schemas integrated as needed • X-Specs are constructed only once and independent of each other • Automatic conflict resolution by integrating based on semantic name rather than physical structure • Users are isolated from system names and organization by querying through a global view using semantic names for concepts
Integration Example • Two claims databases to be integrated: • ABC Company: Claims_tb(claim_id, claimant, net_amount, paid_amount) • XYZ Company: T_claims(id, customer, claim_amt), T_payments(cid, pid, amount) • First step is to construct X-Specs for each database.
Integration Example:Integrated View • Global view after integration: • [Claim] • Id • Net amount • [Customer] • name • [Payment] • id • amount
Integration Example:Discussion • Important points: • system and field names are not presented to the user who queries based on semantic names • database structure is not shown to the user • different physical representations for the same concept are combined (e.g. payment (attribute) in ABC with payment table in XYZ database) • hierarchially related concepts (customer vs. claimant) are combined based on their IS-A relationship in the global dictionary
Applications to the WWW • Integrating diverse data sources is involved in constructing a data warehouse and other operational systems. • The WWW is a diverse organizations of databases which users access. • Automatically integrating web data sources by a browser or portal reduces query complexity and integration of results for the user.
Conclusions • Automatic integration of database schema is possible by using a global dictionary of terms and constructing semantic names for schema elements. • Integration of data sources has applications to the WWW and construction of data warehouses.
Future Work • The integration architecture is evolving with standards on XML and captures metadata information in XML documents. • The system is being tested on sample problems, and a query mechanism is work-in-progress. • We are refining a prototype of the system called Unity.