Probabilistic Information Integration Maurice van Keulen, Ander de Keijzer

Presentation Transcript


  1. Probabilistic Information Integration
     Maurice van Keulen, Ander de Keijzer
     Dagstuhl Seminar 08421 - Probabilistic Information Integration

  2. Automatic data integration
     Why does data integration take so long, and why is it not automatic?
     • The schema mismatch problem
     • The data conversion/mapping problem
     • The overlapping data problem (entity resolution / record linkage / data cleaning)
     • The proverbial 90% of the cases are straightforward and can be done with little development effort
     • The proverbial 10% of the cases are hard and take most of the development time
     Let's simply not solve those 10% right away!
     Let's go for an initial integration that can readily be used.
     "Good is good enough" for many applications.
     Let it improve over time during use.

  3. How to deal with the remaining 10%
     • Conflict between sources ≠ inconsistency; conflict = (independent) observations
     • Data conflicts and partial/ambiguous matchings are symptoms of semantic uncertainty
     Our approach to data integration:
     • Define a few rules to resolve only the proverbial 90% of the cases
     • Store the initial integration result as uncertain data
     • Start using the integrated data (time-to-market 10x earlier)
     • Queries will return uncertain answers
     • But the integrated data can already be used meaningfully
     • Feedback on use gradually improves the integration (e.g., feedback on query answers)

  4. Data integration process
     [Diagram legend: (semi-)automatic; user interaction]
     • Step 1: Data integration with external source
     • Step 2: Query, which returns an uncertain query answer
     • Step 3: Feedback (statement about query answer)
     • Database states shown in the diagram: certain, uncertain, less uncertain
     Allow early meaningful use of integrated data
     Solve remaining semantic uncertainty during use

  5. What we built
     [Architecture diagram labels, in order: Demo GUI; Probabilistic Integration Functionality; IMPrECISE; Probabilistic XML Database; αML; XML DBMS; MonetDB/XQuery]
     Focus of talk
     • Differences / correspondences between probabilistic XML and relational DBs
     • Probabilistic integration algorithm
     • What would defy my purpose?
     • What is quality? (metrics)
     • When is it good enough? (experiments)

  6. Data representation
     Probabilistic XML tree represents all possible worlds in one tree
     [Legend: probabilistic node (choice point); possibility; XML node with tag name 'tag']
     Possible worlds
     • Movie list with 1 movie (King Kong/1933): probability 8%
     • Movie list with 1 movie (King Kong/1976): probability 32%
     • Movie list with 2 movies (King Kong/1933 and King Kong/1976): probability 60%
     Can express uncertainty about existence, dependent and independent choice
     [Tree diagram: under 'movies', a choice point with two possibilities. Possibility .4: one movie with tl 'King Kong' and a choice point on yr (.2: 1933, .8: 1976). Possibility .6: two movies, King Kong/1933 and King Kong/1976.]
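
The possible-worlds reading of this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the IMPrECISE implementation: the `Elem` and `Choice` classes and the `expand` functions are hypothetical names introduced here, and the tree is transcribed from the probabilities on the slide (.4/.6 at the top choice point, .2/.8 for the year).

```python
from dataclasses import dataclass
from itertools import product
from math import prod


@dataclass
class Elem:
    """Ordinary XML element (hypothetical helper type)."""
    tag: str
    children: list  # Elem, Choice, or str (text)


@dataclass
class Choice:
    """Choice point: list of (probability, children) possibilities."""
    alternatives: list


def expand_seq(children):
    """Combine the independent expansions of sibling nodes."""
    results = []
    for combo in product(*(expand(c) for c in children)):
        p = prod(q for q, _ in combo)
        nodes = [n for _, ns in combo for n in ns]
        results.append((p, nodes))
    return results


def expand(node):
    """Return [(probability, nodes)] for every way to resolve the choices below node."""
    if isinstance(node, str):
        return [(1.0, [node])]
    if isinstance(node, Elem):
        return [(p, [Elem(node.tag, kids)]) for p, kids in expand_seq(node.children)]
    # Choice point: pick one possibility, then resolve choices nested inside it.
    return [(p_alt * p, kids)
            for p_alt, alt in node.alternatives
            for p, kids in expand_seq(alt)]


def movie(title, year):
    return Elem("movie", [Elem("tl", [title]), Elem("yr", [year])])


def years(node):
    """Collect the text of all yr elements in a choice-free tree."""
    if isinstance(node, str):
        return []
    if node.tag == "yr":
        return list(node.children)
    return [y for c in node.children for y in years(c)]


# Tree transcribed from the slide.
one_movie = Elem("movie", [
    Elem("tl", ["King Kong"]),
    Choice([(0.2, [Elem("yr", ["1933"])]), (0.8, [Elem("yr", ["1976"])])]),
])
tree = Elem("movies", [Choice([
    (0.4, [one_movie]),
    (0.6, [movie("King Kong", "1933"), movie("King Kong", "1976")]),
])])

WORLDS = [(round(p, 4), years(nodes[0])) for p, nodes in expand(tree)]
for p, ys in WORLDS:
    print(f"{p:.2f}  years={ys}")
```

Enumerating worlds like this is exponential in the number of choice points, which is presumably why the talk stresses working directly on the compact representation; the sketch is meant only to make the semantics concrete.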

  7. Differences / correspondences: XML vs. relational
     What to say about our probabilistic XML DBMS
     Representation:
     • Choice point (▼) = variable / x-tuple
     • Alternative (O) = possible variable assignment / alternative
     • Dependencies expressed in ancestor/descendant relationships = event expression / lineage formula
     Querying:
     • In XPath/XQuery vs. SQL
     • Semantics of querying according to possible world theory
     • Scalable implementation by working directly on the compact/succinct representation

  8. Motivating example
     Scenario of the demo at ICDE, April 2008:
     • Portal with daily recommendation of movies on TV
     • Source 1: TV guide (e.g., www.tvguide.com)
     • Enrich with information from Source 2: IMDB
     • Combined 18 'attributes', of which 6 overlap
     • Entity resolution problem with movies and actors

  9. Uncertainty concerning entity resolution
     [Figure: two source records, Movie (Title: King Kong, Year 1933, Rating: 8.0) and Movie (Title: King Kong, Year 1976, Rating: 5.5), and the possible integration results]
     • Same movie; for conflicting fields, both are correct (Title: King Kong; Year 1976 and Year 1933; Rating: 8.0; 5.5)
     • Different movies (two separate movie records)
     • Same movie; for conflicting fields, one is correct (one Year and one Rating is chosen)
     [Figure annotation on one of the cases: Schema may exclude this possibility]

  10. Integration functionality
      Integration algorithm = XML tree merge (in recursive descent fashion)
      • Similarity matching (in Christoph's words)
      • Repair-key
      • Select worlds that satisfy background knowledge
      • Rules / constraints
      • Thresholds
      Strict separation of concerns
      • Integration mechanism: enumeration of possibilities + XML tree merge
      • Integration intelligence: background knowledge + similarity matching

  11. Result
      • A compact/succinct representation of all possible merged XML trees
      Why in this way?
      • The result need not be perfect
      • An integration result of 'good enough quality' suffices
      • Semantic issues in data integration are not an obstacle
      Knowledge needed for meaningful use
      • Schema info (e.g., movies have 1 year child)
      • Some thresholds (e.g., less than 50% match on titles means not the same movie title)
      • A few domain-specific rules (e.g., for (possibly) the same movie, if actors agree on role and the role is unique in the movie, then decide same actor regardless of difference in name)
      • Automatic fallback: edit distance similarity (should be something better)
      I use a bad similarity matcher on purpose
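
The title threshold and the edit-distance fallback mentioned on this slide can be illustrated as follows. This is a sketch under assumptions: the slide does not say how edit distance is turned into a match percentage, so the normalization by the longer string's length is a guess, and the names `edit_distance`, `similarity`, `may_be_same_title`, and `TITLE_THRESHOLD` are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free when equal)
            ))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1.0 for identical strings, 0.0 for nothing in common."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


# From the slide's example: less than a 50% match means "not the same title".
TITLE_THRESHOLD = 0.5


def may_be_same_title(a: str, b: str, threshold: float = TITLE_THRESHOLD) -> bool:
    """False means 'certainly different'; True only means the match stays possible."""
    return similarity(a, b) >= threshold


print(may_be_same_title("King Kong", "King Kong"))
print(may_be_same_title("King Kong", "Casablanca"))
```

Note that passing the threshold does not decide that two titles match; in the approach described here it only keeps the possibility alive as an uncertain alternative.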

  12. What would defy my purpose?
      Purpose is to significantly reduce the software development effort for obtaining an integration of sources that is good enough (reduce time-to-market)
      • What is good enough?
        • Useful metrics for data quality
        • Threshold on metric when good enough
      • I do not reduce anything if
        • Need to manually define and fine-tune many rules
        • Need to fine-tune thresholds for sufficient accuracy
      • Feedback should be able to effectively improve quality

  13. Metrics for integration result
      • Metrics for uncertainty
        • # possible worlds
        • Uncertainty density = average number of alternatives per choice point
      • Metrics for probability assignment
        • Answer decisiveness: two 50/50 alternatives are less decisive than 90/10
      [Figure: three example trees with Dens: .25, .17, .22 and Dec: .83, .89, .72]

  14. How to measure answer quality?
      Year of movie "King Kong"? //yr[../tl="King Kong"]
      • Possible answers: (1933): 40% × 20% = 8%; (1976): 40% × 80% = 32%; (1933, 1976): 60%
      • Ranking by probability:
        • 1976 at 92% (better: 1x 1976 at 92%, 0x 1976 at 8%)
        • 1933 at 68% (better: 1x 1933 at 68%, 0x 1933 at 32%)
      • Suggests IR-like precision and recall, but
        • Query answers are possibly not distinct
        • A correct answer with high probability is better than one with low probability (and vice versa for incorrect answers)
      • Approach
        • An answer only exists for as much as its probability
        • Expected value of precision and recall
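
The ranking on this slide follows from summing, for each answer, the probabilities of the worlds in which it occurs. A minimal sketch, assuming the three worlds listed on the slide; the function name `answer_probabilities` is hypothetical, and counting an answer at most once per world is a simplification that ignores the "not distinct" issue raised above.

```python
from collections import defaultdict

# (probability, answers to //yr[../tl="King Kong"]) for each possible world.
WORLDS = [
    (0.08, ["1933"]),
    (0.32, ["1976"]),
    (0.60, ["1933", "1976"]),
]


def answer_probabilities(worlds):
    """Marginal probability that each answer appears in the query result."""
    totals = defaultdict(float)
    for p, answers in worlds:
        for a in set(answers):  # count an answer once per world
            totals[a] += p
    return dict(totals)


probs = answer_probabilities(WORLDS)
ranking = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
for answer, p in ranking:
    print(f"{answer} at {p:.0%}")
```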

  15. Answer quality measurement
      Year of movie "King Kong"? //yr[../tl="King Kong"]
      • Possible answers: (1933): 40% × 20% = 8%; (1976): 40% × 80% = 32%; (1933, 1976): 60%
      • Ranking by probability: 1976 at 92%; 1933 at 68%
      • Suppose 1976 is a correct answer and 1933 is not
      • EXP(Precision) = EXP(correct) / EXP(all answers) = 0.92 / 1.6 = 57.5%, where EXP(all answers) = 0.92 + 0.68 = 1.6
      • EXP(Recall) = EXP(correct) / |Human| = 0.92 / 1 = 92%
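
The arithmetic on this slide can be checked with a short sketch. The function name `expected_precision_recall` is hypothetical, and reading |Human| as the size of the human-judged set of correct answers is an assumption consistent with the slide's numbers.

```python
def expected_precision_recall(answer_probs, correct):
    """Expected precision and recall when each answer 'exists' only as much as its probability.

    answer_probs: mapping from answer to its probability of appearing.
    correct: set of answers a human judge considers correct (assumed reading of |Human|).
    """
    exp_correct = sum(p for a, p in answer_probs.items() if a in correct)
    exp_all = sum(answer_probs.values())
    precision = exp_correct / exp_all if exp_all else 0.0
    recall = exp_correct / len(correct) if correct else 0.0
    return precision, recall


precision, recall = expected_precision_recall(
    {"1976": 0.92, "1933": 0.68},  # ranking from the slide
    {"1976"},                      # 1976 is correct, 1933 is not
)
print(f"EXP(Precision) = {precision:.1%}, EXP(Recall) = {recall:.0%}")
```

With these definitions a correct answer at high probability raises both measures, while an incorrect answer at high probability lowers precision, which matches the requirement stated on the previous slide.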

  16. Too many rules?
      Data: a few "Today's picks" from the TV guide, enriched with an IMDB source of 243000 movies; 18 attributes in total, 6 overlapping
      Queries: 43 XPath queries
      • Isn't the development effort simply shifted to rule definition and threshold tuning?
      • Rules: DTD info + 1 'rough' rule per entity suffices
      • Thresholds: quality is insensitive to 'safe' thresholds
      Advice to developer: Don't worry about perfecting the rules and thresholds. Strive for an initial integration result that can be queried, with about 90% of the entities resolved. For the 10% hard cases, just make sure that you don't miss the one correct match (user feedback cannot invent matches).

  17. User feedback
      • User feedback = statement about query answer
      • Usually, user feedback can be naturally embedded in user interaction
      • Example:
        • Contacts application on your mobile phone, integrated/synchronized with the company phone list, the PC at home, and other people's phones (community), with the aim of automatically picking up changes
        • Phone application ranks possible phone numbers according to likelihood for dialing
        • Phone application can automatically give feedback
          • Dialed number gave error 'invalid number'
          • Both "End call" and "Wrong number" buttons
        • No significant additional interaction needed
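
One way to picture how a statement about a query answer makes the data less uncertain is to drop the possible worlds that contradict the statement and renormalize the rest. The slides do not spell out the actual feedback mechanism, so this conditioning step is an assumption, and `apply_feedback` is a hypothetical name; the worlds reuse the King Kong example from earlier slides.

```python
# (probability, answers to the "year of King Kong" query) per possible world.
WORLDS = [
    (0.08, ["1933"]),
    (0.32, ["1976"]),
    (0.60, ["1933", "1976"]),
]


def apply_feedback(worlds, answer, is_correct):
    """Keep only worlds consistent with the feedback, then renormalize.

    Positive feedback (is_correct=True) keeps worlds containing the answer;
    negative feedback keeps worlds that do not contain it.
    """
    kept = [(p, ans) for p, ans in worlds if (answer in ans) == is_correct]
    total = sum(p for p, _ in kept)
    if total == 0:
        raise ValueError("feedback contradicts every possible world")
    return [(p / total, ans) for p, ans in kept]


after_positive = apply_feedback(WORLDS, "1976", True)   # "1976 is correct"
after_negative = apply_feedback(WORLDS, "1933", False)  # "1933 is wrong"
print(after_positive)
print(after_negative)
```

The error raised when no world survives reflects the remark that user feedback cannot invent matches: feedback can only select among possibilities that the initial integration kept.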

  18. UF effective?
      Data: integration result @ margin 4, threshold 0.8
      Queries: 43 XPath queries
      Feedback: several series of 40 consecutive feedbacks, each feedback randomly chosen from the possible ones
      • Is user feedback (UF) effective enough to quickly improve integration quality?
      [Plots: precision and recall under negative feedback, positive feedback, and mixed feedback]

  19. Conclusions
      • Many correspondences between probabilistic XML and relational databases
      • A simple model for uncertainty in data with well-understood semantics suffices: possible world model with discrete choices
      • Seems appropriate for schema and data integration for many applications (e.g., portals): early meaningful use of integrated data, which improves during use with feedback
      • My worries:
        • First proposals for some quality metrics
        • Few rules and safe thresholds suffice
        • Mixed targeted user feedback is effective in quickly improving integration quality

  20. Opportunities
      • Put a probabilistic relational DBMS underneath
      • Techniques for deriving (imperfect) (conditional) functional dependencies may be used to automate rule definition
      • Since rules need not be perfect nor handle all cases, tool support for non-expert users becomes possible?
      • User feedback may also be used to learn new rules
        • Work is needed to handle wrong user feedback
        • Answer explanation may help in targeting user feedback
      • Recent works on probabilistic schema matching/mapping
      • More distant future:
        • Autonomous applications that rely only on their own data and metadata for automatic data exchange/integration: a "community of co-operating applications"
        • We need a way to let applications automatically learn how to disambiguate things
