Lessons from the TSIMMIS Project


  1. Lessons from the TSIMMIS Project • Yannis Papakonstantinou • Department of Computer Science & Engineering, University of California, San Diego

  2. Overview • TSIMMIS’ goals, technical challenges, and solutions • Insufficiencies of the TSIMMIS framework • Going forward

  3. Information Resides on Heterogeneous Information Sources Personal database Ticker Tape WWW Dialog • different interfaces • different data representations • redundant and conflicting information

  4. Goal: System Providing Integrated View of Heterogeneous Data Integration System • collects and combines information • provides integrated view, uniform user interface Personal database Ticker Tape WWW Dialog

  5. The Wrapper and Mediator Architecture Client Common Data Model portfolios for each company Mediator stock market prices business reports Wrapper Wrapper Ticker Tape Dialog

  6. The Data Warehousing Approach to Integration Client Stored Integrated View Mediator Wrapper Wrapper Ticker Tape Dialog

  7. The Lazy Integration Approach Query Decomposition, Translation and Result Fusion Client IBM portfolio Mediator IBM price IBM related reports (in common model) Wrapper Wrapper IBM related reports Ticker Tape Dialog

  8. Wrappers & Mediators from High-Level Specifications Client Mediator Specification Interpreter Mediator Mediator Specification Wrapper Generator Wrapper Wrapper Wrapper Specification Source Source

  9. Challenge: Sources Without a Well-Structured Schema • Characteristics: semistructured, irregular, deeply nested, cross-referenced, incomplete schema knowledge, autonomous, dynamic • Examples: HTML pages, SGML documents, genome data, chemical structures, bibliographic information, results of the integration process

  10. Challenge: Different and Limited Source Capabilities • Client → Mediator (U = A + B): retrieve IBM data • Mediator → Wrapper (A): retrieve IBM data • Mediator → Wrapper (B): retrieve IBM data

  11. Mediator has to Adapt to Query Capabilities of Sources • Client → Mediator (U = A + B): retrieve IBM data • Mediator → Wrapper (B): retrieve IBM data • Mediator → Wrapper (A): retrieve everything, because (A) does not allow selection

  12. Part B • Semistructured Data Representation • Mediator Generation • Wrapper Generation • Capabilities-Based Rewriting

  13. Representation of Semistructured Information using OEM • each object is a triple <object-id, label, value> • <http://www/~doe, faculty, {&f1,&l1,&r1}>: semantic object-id, label, set value • <&f1, first_name, “John”> <&l1, last_name, “Doe”> <&r1, rank, “professor”>: structural object-id, label, atomic value

  14. Graph Representation of OEM Data <http://www/~doe, faculty, {&f1,&l1,&r1}> <&f1, first_name, “John”> <&l1, last_name, “Doe”> <&r1, rank, “professor”> http://www/~doe faculty first_name “John” last_name “Doe” rank “professor”
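The OEM triples on slides 13 and 14 can be mirrored in a few lines of Python. This is only an illustrative sketch, not the TSIMMIS implementation: the names `OEMObject`, `store`, `edges`, and `subobject_values` are hypothetical, and a set value is assumed to hold the object-ids of its subobjects.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class OEMObject:
    oid: str                      # object-id, e.g. "&f1" or a URL
    label: str                    # e.g. "faculty"
    value: Union[str, frozenset]  # atomic value, or a set of child object-ids


# The four objects shown on the slide.
objects = [
    OEMObject("http://www/~doe", "faculty", frozenset({"&f1", "&l1", "&r1"})),
    OEMObject("&f1", "first_name", "John"),
    OEMObject("&l1", "last_name", "Doe"),
    OEMObject("&r1", "rank", "professor"),
]
store = {o.oid: o for o in objects}


def edges(store):
    """Yield (parent_oid, child_label, child_oid) for every set-valued object."""
    for obj in store.values():
        if isinstance(obj.value, frozenset):
            for child_oid in sorted(obj.value):
                yield obj.oid, store[child_oid].label, child_oid


def subobject_values(store, oid):
    """Map each child label of a set-valued object to the child's value."""
    return {store[c].label: store[c].value for c in store[oid].value}
```

Calling `list(edges(store))` gives the labeled edges of the graph view, and `subobject_values(store, "http://www/~doe")` collects the labeled leaves under the faculty object.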

  15. OEM Structures Represent Arbitrary Labeled Graphs http://www/~smith faculty name “Mary Smith” project “Air DB” paper author name “John Doe” author name “Mary Smith” title “Thin Air DB” http://www/~doe faculty first_name “John” last_name “Doe” rank “professor”

  16. Overview • Semistructured Data Representation • Mediator Generation • Example of mediator specification • Language expressiveness • Implementation and performance • Wrapper Generation • Capabilities-Based Rewriting

  17. Merge Information Relating to a Faculty faculty name “John Doe” rank “professor” birthday “April 1” papers ... s1 s2 faculty name “John Doe” rank “professor” papers ... person name “John Doe” birthday “April 1”

  18. Mediator Specification Example faculty name “John Doe” rank “professor” birthday “April 1” papers ... <N faculty {<L V>}> :- <faculty {<name N> <L V>}>@s1 <N faculty {<L V>}> :- <person {<name N> <L V>}>@s2 s1 s2 faculty name “John Doe” rank “professor” papers ... person name “John Doe” birthday “April 1”

  19. Mediator Specification Example: Semantics of Rule Bodies faculty name “John Doe” rank “professor” birthday “April 1” papers ... <N faculty {<L V>}> :- <faculty {<name N> <L V>}>@s1 <N faculty {<L V>}> :- <person {<name N> <L V>}>@s2 s1 s2 faculty name “John Doe” rank “professor” papers ... person name “John Doe” birthday “April 1”

  20. Mediator Specification Example: Semantics of Rule Heads “John Doe” faculty name “John Doe” rank “professor” birthday “April 1” papers ... <N faculty {<L V>}> :- <faculty {<name N> <L V>}>@s1 <N faculty {<L V>}> :- <person {<name N> <L V>}>@s2 s1 s2 faculty name “John Doe” rank “professor” papers ... person name “John Doe” birthday “April 1”

  21. Incrementally Add to Semantically Identified Object “John Doe” faculty name “John Doe” rank “professor” birthday “April 1” papers ... <N faculty {<L V>}> :- <faculty {<name N> <L V>}>@s1 <N faculty {<L V>}> :- <person {<name N> <L V>}>@s2 s1 s2 faculty name “John Doe” rank “professor” papers ... person name “John Doe” birthday “April 1”

  22. Irregularities & Incomplete Schema Knowledge “John Doe” faculty name “John Doe” rank “professor” birthday “April 1” papers faculty name “Mary Smith” project “Air DB” “Mary Smith” <N faculty {<L V>}> :- <faculty {<name N> <L V>}>@s1 s1 faculty name “John Doe” rank “professor” papers faculty name “Mary Smith” project “Air DB” s2 person name “John Doe” birthday “April 1”

  23. Second Rule Attaches More Subobjects to View Objects “John Doe” faculty name “John Doe” rank “professor” birthday “April 1” papers ... <N faculty {<L V>}> :- <faculty {<name N> <L V>}>@s1 <N faculty {<L V>}> :- <person {<name N> <L V>}>@s2 s1 s2 faculty name “John Doe” rank “professor” papers ... person name “John Doe” birthday “April 1”
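The effect of the two rules on slides 18 to 23 can be imitated with a small Python sketch. It is a simplification under stated assumptions, not the MSL interpreter: sources are plain lists of (label, subobjects) pairs, the helper `apply_rule` is hypothetical, and the elided papers value is kept as the placeholder "...".

```python
from collections import defaultdict

# Each source is a list of (label, [(sublabel, value), ...]) top-level objects.
s1 = [
    ("faculty", [("name", "John Doe"), ("rank", "professor"), ("papers", "...")]),
    ("faculty", [("name", "Mary Smith"), ("project", "Air DB")]),
]
s2 = [("person", [("name", "John Doe"), ("birthday", "April 1")])]


def apply_rule(view, source, body_label):
    """Mimic <N faculty {<L V>}> :- <body_label {<name N> <L V>}>@source."""
    for label, subobjects in source:
        if label != body_label:
            continue
        for sub_label, name in subobjects:
            if sub_label != "name":
                continue
            # N is bound to this name; every <L V> subobject of the source
            # object is attached to the view object identified by N.
            for pair in subobjects:
                view[name].add(pair)


view = defaultdict(set)           # semantic object-id N -> set of (L, V)
apply_rule(view, s1, "faculty")   # first rule
apply_rule(view, s2, "person")    # second rule adds to existing view objects
```

Because the view object is keyed by the name N, the second rule adds the birthday to the same "John Doe" object that the first rule created, while "Mary Smith" keeps only what s1 knows about her. No schema for `project` or `papers` has to be declared in advance.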

  24. Language Expressiveness • Information fusion problems solved by MSL • Irregularities • Incomplete knowledge of source structure • Transformation of cross-referenced structures • Inconsistent and redundant data • Use of arbitrary matching criteria • Theoretical analysis of expressiveness • Consider the relational representation of OEM graphs. Then MSL is equivalent to “SQL + special form of transitive closure”

  25. Inconsistent and Redundant Information “John Doe” faculty name “John Doe” rank “associate” rank “assistant” <N faculty {<L V>}> :- <faculty {<name N> <L V>}>@s1 <N faculty {<L V>}> :- <person {<name N> <L V>}>@s2 AND NOT <faculty {<name N> <L V1>}>@s1 s1 s2 faculty name “John Doe” rank “associate” person name “John Doe” rank “assistant”
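One reading of the AND NOT condition on slide 25 is that s2 contributes a subobject <L V> only when s1 has no subobject with the same label L for the same name N, so s1 wins on conflicts. The sketch below follows that reading. The function `fuse_preferring_s1` is hypothetical, and the birthday entry in s2 is an added illustration (it is not on the slide) showing that non-conflicting data still passes through.

```python
from collections import defaultdict

s1 = [("faculty", [("name", "John Doe"), ("rank", "associate")])]
# The birthday subobject is a hypothetical addition, not part of the slide.
s2 = [("person", [("name", "John Doe"), ("rank", "assistant"), ("birthday", "April 1")])]


def names_of(subobjects):
    """All values bound to N by the pattern <name N>."""
    return [value for label, value in subobjects if label == "name"]


def fuse_preferring_s1(s1, s2):
    view = defaultdict(set)
    s1_labels = defaultdict(set)  # name -> labels that s1 provides for it
    for label, subs in s1:
        if label == "faculty":
            for n in names_of(subs):
                for l, v in subs:
                    view[n].add((l, v))
                    s1_labels[n].add(l)
    for label, subs in s2:
        if label == "person":
            for n in names_of(subs):
                for l, v in subs:
                    # AND NOT <faculty {<name N> <L V1>}>@s1
                    if l not in s1_labels[n]:
                        view[n].add((l, v))
    return view


fused = fuse_preferring_s1(s1, s2)
```

With this data the fused "John Doe" object keeps rank "associate" from s1 and drops the conflicting "assistant" from s2.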

  26. Overview • Semistructured Data Representation • Mediator Generation • Example of mediator specification • Language expressiveness • Implementation and performance • Wrapper Generation • Capabilities-Based Rewriting

  27. Mediator Specification Interpreter Architecture Result Query Mediator Specification Query Rewriter logical datamerge program Cost-Based Optimizer plan Datamerge Engine Queries to Wrappers Results

  28. Query Rewriting When Known Origins of Information • <N faculty {<salary S>}> :- <faculty {<name N> <salary S>}>@s1 <N faculty {<rank R>}> :- <person {<name N> <rank R>}>@s2 • <well-paid {<name N> <salary X>}> :- <N faculty {<salary X> <rank assistant>}> AND X>65000

  29. Query Rewriter Pushes Conditions to Sources • <N faculty {<salary S>}> :- <faculty {<name N> <salary S>}>@s1 <N faculty {<rank R>}> :- <person {<name N> <rank R>}>@s2 • <well-paid {<name N> <salary X>}> :- <N faculty {<salary X> <rank assistant>}> AND X>65000 • logical datamerge program <well-paid {<name N> <salary X>}> :- (<faculty {<name N> <salary X>}> AND X>65000)@s1 AND <person {<name N> <rank assistant>}>@s2

  30. Passing Bindings & Local Join Plans • Passing Bindings: s2 answers <name N> :- <person {<rank assistant>}>, then each binding of N is passed to s1 in <salary X> :- <faculty {<name $N> <salary X>}> AND X>65000 • Local Join: s1 answers <a {<s X> <n N>}> :- <faculty {<name N> <salary X>}> AND X>65000, s2 answers <name N> :- <person {<rank assistant>}>, and the two results are joined on N
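The two plans on slide 30 can be compared with a toy simulation. Everything here is assumed for illustration: the in-memory tables `FACULTY` and `PERSON`, their contents, and the function names stand in for the wrappers of s1 and s2.

```python
# Hypothetical source contents.
FACULTY = [  # held by s1
    {"name": "Ann", "salary": 70000},
    {"name": "Bob", "salary": 60000},
    {"name": "Cy", "salary": 90000},
]
PERSON = [  # held by s2
    {"name": "Ann", "rank": "assistant"},
    {"name": "Bob", "rank": "assistant"},
    {"name": "Cy", "rank": "professor"},
]


def s1_salary_of(name, min_salary):
    """<salary X> :- <faculty {<name $N> <salary X>}> AND X > min_salary"""
    return [f["salary"] for f in FACULTY
            if f["name"] == name and f["salary"] > min_salary]


def s1_well_paid(min_salary):
    """<a {<s X> <n N>}> :- <faculty {<name N> <salary X>}> AND X > min_salary"""
    return [(f["name"], f["salary"]) for f in FACULTY if f["salary"] > min_salary]


def s2_names_with_rank(rank):
    """<name N> :- <person {<rank rank>}>"""
    return [p["name"] for p in PERSON if p["rank"] == rank]


def passing_bindings(min_salary=65000):
    # One parameterized query to s1 per name returned by s2.
    result = []
    for name in s2_names_with_rank("assistant"):
        for salary in s1_salary_of(name, min_salary):
            result.append((name, salary))
    return result


def local_join(min_salary=65000):
    # One query per source; the mediator joins the two results on N.
    assistants = set(s2_names_with_rank("assistant"))
    return [(n, s) for n, s in s1_well_paid(min_salary) if n in assistants]
```

Both plans return the same answer. Passing bindings sends many small queries to s1, while the local join sends one query to each source and does the matching in the mediator, which is the kind of trade-off a cost-based optimizer would weigh.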

  31. Query Decomposition When Unknown Origins of Information <N faculty {<L V>}> :- <faculty {<name N> <L V>}>@s1 <N faculty {<L V>}> :- <person {<name N> <L V>}>@s2 <X faculty {<S Y>}> :- <X faculty {<birthday “1/20”> <S Y>}>

  32. Plan Considers All Possible Sources of birthday <N faculty {<L V>}> :- <faculty {<name N> <L V>}>@s1 <N faculty {<L V>}> :- <person {<name N> <L V>}>@s2 <X faculty {<S Y>}> :- <X faculty {<birthday “1/20”> <S Y>}> s1 s2 birthday name name birthday

  33. Overview • Semistructured-Data Representation • Mediator Generation • Wrapper Generation • Capabilities-Based Rewriting

  34. Query Translation in Wrappers • Wrapper components: Query Translator, Result Translator • SELECT * FROM person → find -all • SELECT * FROM person WHERE name=“Smith” → find -n Smith • translated commands are sent to the Source

  35. Rapid Query Translation Using Templates and Actions • Wrapper components: Template Interpreter, Result Translator • SELECT * FROM person {emit “find -all”} • SELECT * FROM person WHERE name=$N {emit “find -n $N”} • SELECT * FROM person → find -all • SELECT * FROM person WHERE name=“Smith” → find -n Smith • translated commands are sent to the Source
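The template-and-action idea on slide 35 can be approximated with regular expressions. This is a minimal sketch, not the TSIMMIS template interpreter: the `TEMPLATES` table and `translate` function are hypothetical, queries are assumed to use straight double quotes, and each `$N`-style placeholder is modeled as a named regex group.

```python
import re

# (template pattern, action) pairs; $N in the action is filled from group N.
TEMPLATES = [
    (re.compile(r'^SELECT \* FROM person$'), "find -all"),
    (re.compile(r'^SELECT \* FROM person WHERE name="(?P<N>[^"]+)"$'), "find -n $N"),
]


def translate(query):
    """Return the native command for a supported query, or None."""
    normalized = " ".join(query.split())  # collapse runs of whitespace
    for pattern, action in TEMPLATES:
        m = pattern.match(normalized)
        if m:
            command = action
            for var, value in m.groupdict().items():
                command = command.replace("$" + var, value)
            return command
    return None  # query is not supported by this source
```

For example, `translate('SELECT * FROM person WHERE name="Smith"')` yields the `find -n Smith` command from the slide, and a query that matches no template returns `None`, which is how the wrapper would signal an unsupported query in this sketch.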

  36. Description of Infinite Sets of Supported Queries • uses recursive nonterminals • Example: job description contains word w1 and word w2 and ... • SELECT subset(person) FROM person WHERE \CJob • \CJob: job LIKE $W AND \CJob • \CJob: TRUE
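A recursive nonterminal such as \CJob describes infinitely many queries with two productions. The recognizer below is a sketch of that idea only; the function names are hypothetical, and writing each $W binding as a single-quoted string is an assumption about the concrete query syntax.

```python
import re

PREFIX = "SELECT subset(person) FROM person WHERE "
# One application of the production  \CJob: job LIKE $W AND \CJob
CONJUNCT = re.compile(r"job LIKE '([^']*)' AND ")


def match_cjob(text):
    """Return the words bound to $W if text derives from CJob, else None."""
    if text == "TRUE":              # production  \CJob: TRUE
        return []
    m = CONJUNCT.match(text)
    if m is None:
        return None
    rest = match_cjob(text[m.end():])   # recurse on the remaining \CJob
    return None if rest is None else [m.group(1)] + rest


def supported(query):
    """Return the list of job keywords if the source supports the query."""
    if not query.startswith(PREFIX):
        return None
    return match_cjob(query[len(PREFIX):])
```

Any number of `job LIKE ... AND` conjuncts is accepted as long as the chain ends in `TRUE`, so one short description covers an unbounded family of queries.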

  37. Overview • Semistructured-Data Representation • Mediator Generation • Wrapper Generation • Capabilities-Based Rewriting

  38. Capabilities-Based Rewriter in Mediator Architecture Query logical datamerge program Mediator Specification Query Rewriter Capabilities-Based Rewriter supported plans Cost-Based Optimizer optimal plan Datamerge Engine Wrapper Supported Queries Description Wrapper Supported Queries Description

  39. Capabilities-Based Rewriter Finds Supported Plans • Query: SELECT * FROM A WHERE salary>65000 • Supported Queries: SELECT * FROM A

  40. Capabilities-Based Rewriter Finds Most-Selective Supported Plans • Query: SELECT * FROM B WHERE salary>65000 • Supported Queries: SELECT * FROM B WHERE salary>65000, SELECT * FROM B
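Slides 39 and 40 suggest a simple selection rule: among the queries a source supports, send the one that evaluates as many of the requested conditions as possible, and apply the rest in the mediator. The sketch below assumes exactly that rule. It models each supported query as the set of conditions it can evaluate and uses the number of pushed conditions as a stand-in for selectivity, which is a simplification of what the actual rewriter does; the `plan` function is hypothetical.

```python
def plan(conditions, supported_sets):
    """Pick the source query to send and the conditions left for the mediator.

    conditions:     frozenset of condition strings the query asks for.
    supported_sets: list of frozensets, each a combination of conditions the
                    source can evaluate in one query.
    Returns (pushed, residual), or None if no supported query is usable.
    """
    # A supported query is usable only if it applies no condition beyond
    # those the original query asks for.
    candidates = [s for s in supported_sets if s <= conditions]
    if not candidates:
        return None
    pushed = max(candidates, key=len)   # most conditions ~ most selective
    return pushed, conditions - pushed


query = frozenset({"salary>65000"})
source_a = [frozenset()]                                # only SELECT * FROM A
source_b = [frozenset(), frozenset({"salary>65000"})]   # with or without the selection

plan_a = plan(query, source_a)
plan_b = plan(query, source_b)
```

For source A the selection stays in the mediator as a residual filter, while for source B the whole condition is pushed down and nothing is left to filter.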

  41. Capabilities-Based Rewriter Architecture Query Query Capabilities Description Component SubQuery Discovery Component SubQueries Plan Construction Plans (not fully optimized) Plan Refinement Algebraically optimal plans

  42. What TSIMMIS Achieved • system for integration of heterogeneous sources • challenges and solutions • semistructured data & incomplete schema knowledge • appropriate specification language and query processing algorithms • limited and different query capabilities • query translation algorithm • capabilities-based query rewriting algorithm

  43. Overview • TSIMMIS’ goals, technical challenges, and solutions • Insufficiencies of the TSIMMIS framework • Going forward

  44. Insufficiencies of the TSIMMIS framework • OEM was really unstructured data • some loose and partial schematic info may pay off tremendously • too “databasy” user/mediator/source interaction

  45. Overview • TSIMMIS’ goals, technical challenges, and solutions • Insufficiencies of the TSIMMIS framework • Going forward

  46. Web emerges as a Distributed DB and XML as its Data Model XMAS Query Language Also export: 1. Schemas & Metadata (XML-Data, RDF,…) 2. Description of supported queries XML View Document(s) XML View Document(s) XML View Document(s) Data Source Wrapper Native XML Database Legacy Source

  47. Definition of Integrated Views Integrated XML View View Definition in XMAS Mediator XML View Document(s) XML View Document(s) XML View Document(s) Data Source Data Source Data Source

  48. Non-Materialized Views in the MIX mediator system Blended Browsing & Querying (BBQ) GUI Application XMAS query XML document Integrated View DTD DOM for Virtual XML Doc’s View Definition in XMAS MIX Mediator DTD Inference Query Processor Source DTD XML Source XML Source

  49. Application XML Document Fragments Blended Browsing & Querying (BBQ) GUI DOM (VXD) Client API XMAS Query View DTD MIX Mediator XMAS Mediator View Definition Resolution Unfolded Query DTD Inference Simplification Translation to Algebra Optimization DTD Execution XMAS Query XML Document Fragments XML Source 1 RDB2XML Wrapper XML Source 2 RDB