450 likes | 456 Vues
Automatic Integration of Relational Database Systems. Ramon Lawrence University of Manitoba umlawren@cs.umanitoba.ca. Outline. Introduction, Motivation, and Background Our integration approach The integration architecture Standard dictionary, X-Specs, query processor Example integration
E N D
Automatic Integration of Relational Database Systems Ramon Lawrence University of Manitoba umlawren@cs.umanitoba.ca
Outline • Introduction, Motivation, and Background • Our integration approach • The integration architecture • Standard dictionary, X-Specs, query processor • Example integration • Northwind, Southstorm databases • Querying the integrated databases • Generating SQL queries from semantic queries • Unity implementation • Contributions, Conclusions, and Future Work
Database Terminology • Database system - is a database and a system to manage the data. • Schema - is a description of the data organization and format in a database. • Schema integration - is the process of combining local schemas into a global, integrated view by resolving conflicts present between the schemas. • Data integration - is the process of combining data at the entity-level. It requires resolving representational conflicts and determining equivalent keys. • Multidatabase system (MDBS) - is a collection of autonomous, local databases participating in a global database system to share data.
What is Integration? • Two levels of integration: • Schema integration -the description of the data • Data integration -the individual data instances • Integration problems include: • Different data models and conflicts within a model • Incompatible concept representations • Different user or view perspectives • Naming conflicts (homonym, synonym) • Integration handles the different mechanisms for storing data (structural conflicts), for referencing data (naming conflicts), and for attributing meaning to the data (semantic conflicts).
Why is Integration Required? • There are many integration environments: • Operational systems within an organization • System integration during company merger • Data warehouses, Intranets, and the WWW • Users require information from many data sources which often do not work together. • Companies require a global view of their entire operations which may be present in numerous operational databases for different departments and distributed geographically. • E-commerce demands integration of web databases with production systems.
What is the Current Solution? • Manual Integration Algorithms: • Allow designer to detect and resolve conflicts • Manipulate information using semantic models • Knowledge bases/Artificial Intelligence: • Cyc knowledge base and Carnot project • Global Dictionaries and Lexical Semantics: • Wordnet, Clio, Summary schemas model • Concept hierarchies (Castano)
What is the Current Solution? (2) • SQL and multidatabase query languages: • SQL, MSQL, IDL, DIRECT, SchemaSQL • Requires user to understand DB structure & semantics • Wrapper and mediator systems: • Information Manifold, TSIMMIS, Infomaster • Use query languages or description logics • Focus on query rewriting and reformulation • Industrial standards: • XML, BizTalk, E-commerce portals • Apply to limited domains/industries • Require standard structures and database changes
Previous Work Summary • Current techniques for database integration have some of these problems: • Require integrator to understand all databases • Integration process is manual • Do not hide system complexity from the user • Force changes on the existing database systems • Construct global view manually • Suffer from query imprecision (query containment)
Our Approach • Our approach combines standardization and query mapping algorithms. • The major idea is that schema conflicts can be resolved if we: • Eliminate all naming conflicts • Define a language capable of determining schema equivalence and performing transformations • Naming conflicts are eliminated by accepting a standard term dictionary. • Not a knowledge base or set of mediated views • Leverages semantic information in English words
Integration Architecture Integrated Context View X-Spec Editor Standard Dictionary Integration Algorithm Query Processor and ODBC Manager Client Client • Architecture Components: • 1) Integrated Context View • user’s view of integration • 2) X-Spec Editor • stores schema & metadata • uses XML • 3) Standard Dictionary • terms to express semantics • 4) Integration Algorithm • combines X-Specs into integrated context view • 5) Query Processor • accepts query on view • determines data source mappings and joins • executes queries and formats results Multidatabase Layer Subtransactions X-Spec X-Spec Database Database Local Transactions
Architecture Components • The architecture consists of four components: • A standard dictionary (SD) to capture data semantics • SD terms are used to build semantic names describing semantics of schema elements. • X-Specs for storing data semantics • Database metadata and semantic names stored using XML • Integration Algorithm • Matches concepts in different databases by semantic names. • Produces an integrated view of all database concepts. • Query Processor • Allows the user to formulate queries on the view. • Translates from semantic names in integrated view to SQL queries and integrates and formats results. • Involves determining correct field and table mappings and discovery of join conditions and join paths
Integration Processes • The integration architecture consists of three separate processes: • Capture process: independently extracts database schema information and metadata into a XML document called a X-Spec. • Integration process: combines X-Specs into a structurally-neutral hierarchy of database concepts called an integrated context view. • Query process: allows the user to formulate queries on the integrated view that are mapped by the query processor to structural queries (SQL) and the results are integrated and formatted.
Integration Architecture:The Capture Process Relational Schema Automatic Extraction X-Spec Specification Editor DBA Lookup of terms Standard Dictionary • Capture process involves: • Automatically extracting the schema information and metadata using a specification editor • Assigning semantic names to each schema element (tables and fields) to capture their semantics
Architecture Components: The Standard Dictionary • A standard dictionary (SD) provides standardized terms to capture data semantics. • Hierarchy of terms related by IS-A or Has-A links • Contains base set of common database concepts, but new concepts can be added • A SD term is a single, unambiguous semantic definition. • Several SD entries for a single English word are required if the word has multiple definitions. • The top-level dictionary terms are those proposed by Sowa.
Architecture Components: Dictionary vs. Knowledge Base • The standard dictionary differs from a knowledge base such as Cyc because: • Not intended to be a general English dictionary or contain knowledge facts about the world • Dictionary is evolved as new terms are required • Not all English words are used • Dictionary provides the systems with no “knowledge” • Since no facts are stored, system cannot deduce new facts • Dictionary terms are just semantic place holders, integrators determine the semantics of the database not the system • Simplified organization • Dictionary is organized as a tree for efficiency and simplicity in determining related concepts • Re-use of terms • Terms are re-used in semantic names
Architecture Components:Using the Standard Dictionary • SD terms are used to build semantic names describing semantics of schema elements. • Semantic names have the form: • semantic name := [CT_Type] | [CT_Type] CN • CT_Type := CT | CT {; CT} | CT {,CT} • CT := context term, CN := concept name • each CT and CN is a single term from the SD • Semantic names are included in specifications describing a database.
Integration Example (3) Page 19
What is a semantic name? • A semantic name is a universal, semantic identifier in a domain. • Similar to a field name in the Universal Relation. • Semantics are guaranteed unique by construction. • System has mechanism for comparing semantics across domains even though it does not understand them. (Exploiting semantics in English words.) • Important definitions: • context - a semantic name is a context if it maps to a table • concept-a semantic name is a concept if it maps to a field • context closure - of semantic name Si denoted Si* is the set of semantic names produced by taking ordered subsets of the terms of Si = {T1, T2 , … TN} starting with T1.
Architecture Components:X-Specs • Database metadata and semantic names are combined into specifications called X-Specs: • Stored and transmitted using XML • Contains information on a relational schema • Organized into database, table, and field levels • Stores semantic names to describe and integrate schema elements
Southstorm X-Spec <?xml version="1.0" ?> <Schema name = "Southstorm_xspec.xml” xmlns="urn:schemas-microsoft-com:xml-data" xmlns:dt="urn:schemas-microsoft-com:datatypes"> <ElementType name="[Order]" sys_name = "Orders_tb" sys_type="Table"> <element type = "[Order] Id" sys_name = "Order_num" sys_type = "Field"/> <element type = "[Order] Total Amount" sys_name = "Order_total" sys_type = "Field"/> <element type = "[Order;Customer] Name" sys_name = "Cust_name" sys_type = "Field"/> <element type = "[Order;Customer;Address] Address Line 1" sys_name="Cust_address" sys_type="Field"/> <element type = "[Order;Customer;Address] City" sys_name = "Cust_city" sys_type = "Field"/> <element type = "[Order;Customer;Address] Postal Code" sys_name="Cust_pc" sys_type="Field"/> <element type = "[Order;Customer;Address] Country" sys_name="Cust_country" sys_type="Field"/> <element type = "[Order;Product] Id" sys_name = "Item1_id" sys_type = "Field"/> <element type = "[Order;Product] Quantity" sys_name = "Item1_quantity" sys_type = "Field"/> <element type = "[Order;Product] Price" sys_name = "Item1_price" sys_type = "Field"/> <element type = "[Order;Product] Id" sys_name = "Item2_id" sys_type = "Field"/> <element type = "[Order;Product] Quantity" sys_name = "Item2_quantity" sys_type = "Field"/> <element type = "[Order;Product] Price" sys_name = "Item2_price" sys_type = "Field"/> </ElementType> </Schema> Page 23
Architecture Components:Integrating X-Specs • Each database to be integrated is described using a X-Spec. • Identical concepts in different databases are identified by similar semantic names. • Concepts with identical (or hierarchially related) semantic names are combined regardless of their physical representation in the individual databases.
Integration Architecture:The Integration Process • Integration process involves: • Automatically identifying identical concepts by matching semantic names • Constructing a global view of database concepts consisting of a hierarchy of concept terms • Resolving structural differences during query generation and submission (e.g. a concept may be represented as a table in one database and a field (attribute) in another)
Integration Product:The Integrated Context View • The product of the integration is a structurally-neutral hierarchy of concepts called an integrated context view. • Define a context view (CV) as follows: • If a semantic name Si is in CV, then for any Sj in Si*, Sj is also in CV. • For each semantic name Si in CV, there exists a set of zero or more mappings Mi that associate a schema element Ej with Si. • A semantic name Si can only occur once in the CV. • A context view (CV) is a valid Universal Relation. • Each field is assigned a semantic name which uniquely identifies its semantic connotation.
Architecture Components: The Query Processor • The query processor: • Allows the user to formulate queries on the view. • Translates from semantic names in the context view to structural queries (SQL) on databases. • Involves determining correct field and table mappings and discovery of join conditions and join paths • Retrieves query results and formats them for display to the user. • Client-side query processing: • Perform joins between databases using common keys. • Data value formatting and transformation
The Query Processor:Determining field/table mappings • For each database (D) in the context view • For each semantic name (S) in query • If S has only one semantic name mapping in D Then • Add field mapping to query and its parent table • Else If S has multiple mappings but all in one table Then • Add each field mapping to query and the parent table • Else S has multiple mappings in more than one table Then • If any field mapping has a table already in query take that one • Else take field mapping with best semantic name match • Else take first mapping found • End If • Next • Next
The Query Processor:Constructing Join Graphs • Given a set of fields (F) and tables (T) to access, joins are applied to connect the tables. • A join graphis an undirected graph where: • Each node Ni is a table in the database. • There is a link from node Ni to node Nj if there is a join between the two tables. • A join path is a sequence of joins connecting two nodes in the graph. • A join tree is a set of joins connecting two or more nodes. • A join matrix M stores the shortest join paths between any two nodes (tables).
The Query Processor:Join Discovery Results • Join discovery in a database with a connected, acyclic join graph and a join matrix M: • There exists only one join tree for any set of tables . • The joins required to connect a table set T is found by taking any Ti of Tand unioning the join paths in M[Ni,N1], M[Ni,N2], ... M[Ni,Nn] where N1,N2,..Nn are the nodes corresponding to the set of tables T. • For a cyclic join graph: • There may exist more than one join tree for a set of tables and each tree may have different semantics. • Can allow the user to uniquely determine join tree by graphically displaying join conditions to the user as they browse the context view.
Advanced Query Processing • Advanced query processor features include: • global keys and joins - a mechanism for specifying when a field stores a global key such as a social security number. • result normalization - a procedure for normalizing query results returned from each individual database. (e.g. Southstorm) • data integration - transforming data representational conflicts at the global level. • For example, “M” and “F” may represent “Male” and “Female” in one database, and another may represent these concepts using “0” and “1”.
Northwind & Southstorm Query Examples • Example 1: Retrieve all order ids ([Order] Id) and customers ([Customer] Name): • SS: SELECT Order_num, Cust_name FROM Orders_tb • NW: SELECT OrderID, CompanyName FROM Orders, Customers WHERE Orders.CustomerID = Customers.CustomerID • Example 2: Retrieve all ordered products ([Order;Product] Id) and their order ids. • SS: SELECT Order_num, Item1_id, Item2_id FROM Orders_tb • NW: SELECT OrderID, ProductID FROM OrderDetails • Note: In NW, selects from two different order id mappings. In SS, result normalization is required.
Integration Example:Discussion • Important points: • System table and field names are not presented to the user who queries based on semantic names. • Database structure is not shown to the user. • Field and table mappings are automatically determined based on X-Spec information. • Join conditions are inserted as needed when available to join tables. • Different physical representations for the same concept are combined. • Hierarchically related concepts are combined based on their IS-A relationship in the standard dictionary.
Unity Overview • Unity is a software package that implements the integration architecture with a GUI. • Developed using Microsoft Visual C++ 6 and Microsoft Foundation Classes (MFC). • Unity allows the user to: • Construct and modify standard dictionaries • Build X-Specs to describe data sources • Integrate X-Specs into an integrated view • Transparently query integrated systems using ODBC and automatically generate SQL transactions • Unity is available for demonstration and distribution.
Architecture Discussion • The architecture automatically integrates relational schemas into a multidatabase. • Desirable properties: • Individual mappings - information sources integrated one-at-a-time and independently • Integrated view constructed for query transparency - user queries system by semantics instead of structure • Handles schema conflicts - including semantic, structural, and naming conflicts • Automated integration - integrated view constructed efficiently and automatically • No wrapper or mediator software is required • Transparent querying - users issue semantic queries which are translated to SQL by the query processor
Contributions • Architecture contributions: • Has an unique application of a standard dictionary which is not a knowledge base • Separates the capture and integration processes • Allows transparent querying without structure • Provides algorithms for dynamically extracting database data (creating relevant views) • Algorithms for mediation of global level conflicts (global keys, normalization, etc.) • Arguably simpler method for capturing data semantics than using description logic • An implementation, Unity, which demonstrates the practical benefits of the architecture
Conclusions • Automatic database integration is possible by using a standard term dictionary and defining semantic names for schema elements. • Integration of data sources has applications to the WWW and construction of data warehouses. • Users are able to transparently query integrated systems by concept instead of structure.
Future Work • The integration architecture is evolving with standards on XML and captures metadata information in XML documents. • We are constantly refining Unity. • Develop an integration component for a web browser • The query processor is being extended to resolve more complex queries and conflicts. • Test the system in large industrial projects. • Allow distributed updates and global updates on all databases.
References • Publications: • Unity - A Database Integration Tool, R. Lawrence and K. Barker, TRLabs Emerging Technology Bulletin, January 2000. • Multidatabase Querying by Context, R. Lawrence and K. Barker, DataSem2000, pages 127-136, Oct. 2000. • Integrating Relational Database Schemas using a Standardized Dictionary, To appear in SAC’2001 - ACM Symposium on Applied Computing, March, 2001. • Sponsors: • NSERC, TRLabs • Further Information: • http://www.cs.umanitoba.ca/~umlawren/