Semex: A Platform for Personal Information Management and Integration

Xin (Luna) Dong, University of Washington. June 24, 2005.


Presentation Transcript


  1. Semex: A Platform for Personal Information Management and Integration. Xin (Luna) Dong, University of Washington. June 24, 2005

  2. Is Your Personal Information a Mine or a Mess? Intranet Internet

  3. Is Your Personal Information a Mine or a Mess? Intranet Internet

  4. Questions Hard to Answer • Where are my SEMEX papers and presentation slides (maybe in an attachment)?

  5. Index Data from Different Sources (e.g., Google, MSN desktop search) Intranet Internet

  6. Questions Hard to Answer • Where are my SEMEX papers and presentation slides (maybe in an attachment)? • Who is working on SEMEX? • What are the emails sent by my PKU alumni? • What are the phone numbers and emails of my coauthors?

  7. Organize Data in a Semantically Meaningful Way Intranet Internet

  8. Questions Hard to Answer • Where are my SEMEX papers and presentation slides (maybe in an attachment)? • Who is working on SEMEX? • What are the emails sent by my PKU alumni? • What are the phone numbers and emails of my coauthors? • Which of the SIGMOD’05 authors do I know?

  9. Integrate Organizational and Public Data with Personal Data Intranet Internet

  10. AttachedTo Recipient ConfHomePage CourseGradeIn ExperimentOf PublishedIn Sender Cites ComeFrom EarlyVersion ArticleAbout PresentationFor FrequentEmailer CoAuthor AddressOf OriginatedFrom BudgetOf HomePage

  11. SEMEX (SEMantic EXplorer) – I. Provide a Logical View of Data. [Diagram labels: Homepage; Web Page; Person; Cached; Organizer, Participants; Document; Author; Event; Sender, Recipients; Softcopy; Paper; Presentation; Message; Cites; Mail & calendar; HTML; Files; Presentations; Papers]

  12. SEMEX (SEMantic EXplorer) – II. On-the-fly Data Integration. [Diagram labels: Homepage; Web Page; Person; Cached; Organizer, Participants; Document; Author; Event; Sender, Recipients; Softcopy; Paper; Presentation; Message; Cites]

  13. How to Find Alon’s Papers on My Desktop?

  14. How to Find Alon’s Papers on My Desktop? – Google Search Results Search Alon Halevy Send me the semex demo slides again?

  15. How to Find Alon’s Papers on My Desktop? – Google Search Results Search Alon Halevy Ignore previous request, I found them

  16. How to Find Alon’s Papers on My Desktop? – Google Search Results

  17. Semex Goal • Build a Personal Information Management (PIM) system prototype that provides a logical view of personal information • Build the logical view automatically • Extract object instances and associations • Remove instance duplications • Leverage the logical view for on-the-fly data integration • Exploit the logical view for information search and browsing to improve people’s productivity • Be resilient to the evolution of the logical view

  18. An Ideal PIM is a Magic Wand

  19. An Ideal PIM is a Magic Wand

  20. Outline • Problem definition and project goals • Technical issues: • System architecture and instance extraction [CIDR’05] • Reference reconciliation [Sigmod’05] • On-the-fly data integration • Association search and browsing • Domain model personalization and evolution [WebDB’05] Interleaved with Semex demo [Best demo in Sigmod’05] • Overarching PIM Themes

  21. System Architecture. [Diagram labels. Modules: Data Analysis Module, Domain Management Module, Data Collection Module. Components: Searcher, Browser, Analyzer, Domain Model, Domain Manager, Association DB, Index, Indexer, Reference Reconciliator, Extractors, Integrator. Data: Objects, Associations. Sources: Word, PPT, PDF, LaTeX, Email, Webpage, Excel, DB]

  22. Outline • Problem definition and project goals • Technical issues: • System architecture and instance extraction [CIDR’05] • Reference reconciliation [Sigmod’05] • On-the-fly data integration • Association search and browsing • Domain model personalization and evolution [WebDB’05] Interleaved with Semex demo [Best demo in Sigmod’05] • Overarching PIM Themes

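  23. Reference Reconciliation in Semex. References to Xin (Luna) Dong shown on the slide (grouped under Names and Emails): Lab-#dong, dong xin, dong xin luna, [name in non-Latin script, garbled in extraction], xinluna dong, luna x. dong, dongxin, xin dong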

  24. Semex Without Reference Reconciliation Search results for luna 23 persons luna dong SenderOfEmails(3043) RecipientOfEmails(2445) MentionedIn(94)

  25. Semex Without Reference Reconciliation Search results for luna 23 persons Xin (Luna) Dong AuthorOfArticles(49) MentionedIn(20)

  26. Semex Without Reference Reconciliation A Platform for Personal Information Management and Integration

  27. Semex Without Reference Reconciliation 9 Persons: dong xin xin dong

  28. Semex NEEDS Reference Reconciliation

  29. Reference Reconciliation • A very active area of research in Databases, Data Mining and AI. (Surveyed in [Cohen, et al. 2003]) • Traditional approaches assume matching tuples from a single table • Based on pair-wise comparisons • Harder in our context

  30. Challenges
     • Article: a1=(“Bounds on the Sample Complexity of Bayesian Learning”, “703-746”, {p1,p2,p3}, c1); a2=(“Bounds on the sample complexity of bayesian learning”, “703-746”, {p4,p5,p6}, c2)
     • Venue: c1=(“Computational learning theory”, “1992”, “Austin, Texas”); c2=(“COLT”, “1992”, null)
     • Person: p1=(“David Haussler”, null); p2=(“Michael Kearns”, null); p3=(“Robert Schapire”, null); p4=(“Haussler, D.”, null); p5=(“Kearns, M. J.”, null); p6=(“Schapire, R.”, null)

  31. Challenges
     • Article: a1=(“Bounds on the Sample Complexity of Bayesian Learning”, “703-746”, {p1,p2,p3}, c1); a2=(“Bounds on the sample complexity of bayesian learning”, “703-746”, {p4,p5,p6}, c2)
     • Venue: c1=(“Computational learning theory”, “1992”, “Austin, Texas”); c2=(“COLT”, “1992”, null)
     • Person: p1=(“David Haussler”, null); p2=(“Michael Kearns”, null); p3=(“Robert Schapire”, null); p4=(“Haussler, D.”, null); p5=(“Kearns, M. J.”, null); p6=(“Schapire, R.”, null); p7=(“Robert Schapire”, “schapire@research.att.com”); p8=(null, “mkearns@cis.uppen.edu”); p9=(“mike”, “mkearns@cis.uppen.edu”)
     • Callouts: 1. Multiple Classes; 2. Limited Information; 3. Multi-value Attributes

  32. Intuition • Complex information spaces can be considered as networks of instances and associations between the instances • Key: exploit the network, specifically, the clues hidden in the associations

  33. I. Exploiting Richer Evidence
     • Cross-attribute similarity – Name & email: p5=(“Stonebraker, M.”, null); p8=(null, “stonebraker@csail.mit.edu”)
     • Context Information I – Contact list: p5=(“Stonebraker, M.”, null, {p4, p6}); p8=(null, “stonebraker@csail.mit.edu”, {p7}); p6=p7
     • Context Information II – Authored articles: p2=(“Michael Stonebraker”, null); p5=(“Stonebraker, M.”, null); p2 and p5 authored the same article
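
The cross-attribute evidence on slide 33 (a name in one reference, an email in another) can be illustrated with a minimal sketch. The heuristic below, `name_matches_email`, is a hypothetical stand-in and not the similarity function Semex uses: it only checks whether a full name token appears in the local part of the email address.

```python
import re

def name_tokens(name):
    """Lowercase alphabetic tokens of a name: 'Stonebraker, M.' -> ['stonebraker', 'm']."""
    return re.findall(r"[a-z]+", name.lower())

def name_matches_email(name, email):
    """Hypothetical heuristic: some name token of at least 3 letters
    occurs in the email's local part (a bare initial is not enough)."""
    local = email.split("@", 1)[0].lower()
    return any(len(tok) >= 3 and tok in local for tok in name_tokens(name))

# References from the slide: p5 has only a name, p8 has only an email.
p5 = ("Stonebraker, M.", None)
p8 = (None, "stonebraker@csail.mit.edu")

print(name_matches_email(p5[0], p8[1]))
```

An attribute-wise comparison would find nothing to compare between p5 and p8, since they share no populated attribute; comparing across attributes is what makes the pair a merge candidate.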

  34. Considering Only Attribute-wise Similarities Cannot Merge Persons Well. Person references: 24076. Real-world persons (gold standard): 1750. [Chart values: 3159, 1409]

  35. Considering Richer Evidence Improves the Recall. Person references: 24076. Real-world persons: 1750. [Chart values: 1409, 346]

  36. II. Propagate Information between Reconciliation Decisions
     • Article: a1=(“Distributed Query Processing”, “169-180”, {p1,p2,p3}, c1); a2=(“Distributed query processing”, “169-180”, {p4,p5,p6}, c2)
     • Venue: c1=(“ACM Conference on Management of Data”, “1978”, “Austin, Texas”); c2=(“ACM SIGMOD”, “1978”, null)
     • Person: p1=(“Robert S. Epstein”, null); p2=(“Michael Stonebraker”, null); p3=(“Eugene Wong”, null); p4=(“Epstein, R.S.”, null); p5=(“Stonebraker, M.”, null); p6=(“Wong, E.”, null)
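
A minimal sketch of the propagation idea on slide 36: once one pair of references is reconciled (the two articles), the decision raises the similarity of dependent pairs (their authors and venues), which may then be reconciled in turn. The scores, the `boost` value, and the `threshold` are made-up illustrative numbers, and the queue-based loop is an assumption about how such propagation could be organized, not the published algorithm.

```python
from collections import deque

def propagate(base_score, depends_on, boost=0.3, threshold=0.8):
    """base_score: {pair: similarity}. depends_on: {pair: [pairs whose
    similarity rises once `pair` is merged]}. Returns the set of merged pairs."""
    score = dict(base_score)
    merged = set()
    queue = deque(p for p, s in score.items() if s >= threshold)
    while queue:
        pair = queue.popleft()
        if pair in merged:
            continue
        merged.add(pair)
        for other in depends_on.get(pair, []):
            if other in merged:
                continue
            score[other] = min(1.0, score.get(other, 0.0) + boost)
            if score[other] >= threshold:
                queue.append(other)
    return merged

# Illustrative scores only: the articles match strongly, the rest do not.
base = {("a1", "a2"): 0.95, ("p1", "p4"): 0.6, ("p2", "p5"): 0.6, ("c1", "c2"): 0.55}
deps = {("a1", "a2"): [("p1", "p4"), ("p2", "p5"), ("c1", "c2")]}

print(sorted(propagate(base, deps)))
```

Without the dependency map (`propagate(base, {})`) only the article pair is merged; with it, the author and venue pairs cross the threshold as well.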

  37. Propagating Information between Reconciliation Decisions Further Improves Recall. Person references: 24076. Real-world persons: 1750

  38. III. Reference Enrichment
     • p2=(“Michael Stonebraker”, null, {p1,p3}); p8=(null, “stonebraker@csail.mit.edu”, {p7}); p9=(“mike”, “stonebraker@csail.mit.edu”, null)
     • p8-9=(“mike”, “stonebraker@csail.mit.edu”, {p7})
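
Reference enrichment as shown on slide 38 can be sketched as pooling the attribute values of two merged references, so the combined reference carries more evidence for later comparisons. The dictionary-of-sets representation and the `shares_email` and `enrich` helpers are assumptions made for this illustration.

```python
def shares_email(ref_a, ref_b):
    """True if the two references have at least one email in common."""
    return bool(ref_a["email"] & ref_b["email"])

def enrich(ref_a, ref_b):
    """Merge two references by taking the union of each attribute's values."""
    keys = ref_a.keys() | ref_b.keys()
    return {k: ref_a.get(k, set()) | ref_b.get(k, set()) for k in keys}

# p8 and p9 from the slide; null values are modeled as empty sets.
p8 = {"name": set(), "email": {"stonebraker@csail.mit.edu"}, "contacts": {"p7"}}
p9 = {"name": {"mike"}, "email": {"stonebraker@csail.mit.edu"}, "contacts": set()}

if shares_email(p8, p9):
    p8_9 = enrich(p8, p9)
    print(p8_9)
```

The merged `p8_9` has a name, an email, and a contact list, matching the combined reference on the slide, whereas p8 and p9 each lacked one of these.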

  39. Reference Enrichment Improves Recall More than Information Propagation. Person references: 24076. Real-world persons: 1750

  40. Applying Both Information Propagation and Reference Enrichment Gets the Highest Recall. Person references: 24076. Real-world persons: 1750. [Chart values: 1409, 346, 125]

  41. Outline • Problem definition and project goals • Technical issues: • System architecture and instance extraction [CIDR’05] • Reference reconciliation [Sigmod’05] • On-the-fly data integration • Association search and browsing • Domain model personalization and evolution [WebDB’05] Interleaved with Semex demo [Best demo in Sigmod’05] • Overarching PIM Themes

  42. Homepage Web Page Person Cached Organizer, Participants Document Author Event Sender, Recipients Softcopy Softcopy Paper Presentation Message Cites Importing External Data Sources

  43. Intuition: Explore associations in schema mapping
     • Traditional approaches proceed in two steps
     • Step 1. Schema matching (surveyed in [Rahm&Bernstein, 2001]): generate term matching candidates. E.g., “paperTitle” in table Author matches “title” in table Article
     • Step 2. Query discovery [Miller et al., 2000]: take term matching as input, generate mapping expressions (typically queries). E.g., SELECT Article.title AS paperTitle, Person.name AS author FROM Article, Person WHERE Article.author = Person.id

  44. Intuition: Explore associations in schema mapping
     • Traditional approaches proceed in two steps
     • Step 1. Schema matching (surveyed in [Rahm&Bernstein, 2001]): generate term matching candidates. E.g., “paperTitle” in table Author matches “title” in table Article
     • Step 2. Query discovery [Miller et al., 2000]: take term matching as input, generate mapping expressions (typically queries). E.g., SELECT Article.title AS paperTitle, Person.name AS author FROM Article, Person WHERE Article.author = Person.id
     • User’s input is needed to fill in the gap between Step 1 output and Step 2 input
     • Our approach: check association violations to filter inappropriate matching candidates

  45. Integration Example. [Diagram labels. Domain model classes: Person(name, email), Book(title, year), Article(title, page), Conference(name, year). Associations: authoredBy, publishedIn. External source: Webpage-item(title, author, conf, year)]

  46. Integration Example. [Diagram labels, shown twice on the slide with accept/reject marks lost in extraction. Domain model classes: Person(name, email), Book(title, year), Article(title, page), Conference(name, year). Associations: authoredBy, publishedIn. External source: Webpage-item(title, author, conf, year)]
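
A minimal sketch of filtering matching candidates by association violations, using the classes from the integration example. Several things here are assumptions: the endpoints of each association (the transcript does not say which classes `authoredBy` and `publishedIn` connect), the candidate lists, and the rule itself, which rejects a mapping when the classes it touches are not connected to one another through associations.

```python
from itertools import product

# Assumed association endpoints (not stated in the transcript).
ASSOCIATIONS = {("Article", "Person"), ("Book", "Person"), ("Article", "Conference")}

# Hypothetical term-matching candidates for Webpage-item(title, author, conf, year).
CANDIDATES = {
    "title": ["Article.title", "Book.title"],
    "author": ["Person.name"],
    "conf": ["Conference.name"],
    "year": ["Conference.year", "Book.year"],
}

def connected(classes):
    """True if the classes form one connected group using only associations among them."""
    classes = set(classes)
    if not classes:
        return True
    seen, stack = set(), [next(iter(classes))]
    while stack:
        cls = stack.pop()
        if cls in seen:
            continue
        seen.add(cls)
        for a, b in ASSOCIATIONS:
            if a == cls and b in classes:
                stack.append(b)
            elif b == cls and a in classes:
                stack.append(a)
    return seen == classes

def consistent_mappings(candidates):
    """Yield every combination of candidates whose classes are connected."""
    columns = list(candidates)
    for choice in product(*(candidates[c] for c in columns)):
        classes = {target.split(".")[0] for target in choice}
        if connected(classes):
            yield dict(zip(columns, choice))

survivors = list(consistent_mappings(CANDIDATES))
for mapping in survivors:
    print(mapping)
```

Under these assumptions, every combination that maps `title` to `Book.title` is rejected, because Book has no association path to Conference without going through Article. The check prunes candidates; it does not by itself select a single mapping.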

  47. Outline • Problem definition and project goals • Technical issues: • System architecture and instance extraction [CIDR’05] • Reference reconciliation [Sigmod’05] • On-the-fly data integration • Association search and browsing • Domain model personalization and evolution [WebDB’05] Interleaved with Semex demo [Best demo in Sigmod’05] • Overarching PIM Themes

  48. Explore the association network – 1. Find the relationship between two instances
     • Example: How did I know this person?
     • Solution: Lineage. Find an association chain between two object instances
     • Shortest chain?
     • “Earliest” chain OR “Latest” chain
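
Finding the shortest association chain between two instances, the first option raised on slide 48, can be sketched as a breadth-first search over the association network. The graph below, with its instance names and association labels, is hypothetical data invented for the example.

```python
from collections import deque

def association_chain(graph, start, goal):
    """Shortest chain of (source, association, target) steps from start to goal,
    or None if goal is unreachable. graph: {node: [(association, neighbor), ...]}."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            chain = []
            while parents[node] is not None:
                prev, assoc = parents[node]
                chain.append((prev, assoc, node))
                node = prev
            return chain[::-1]
        for assoc, nxt in graph.get(node, []):
            if nxt not in parents:
                parents[nxt] = (node, assoc)
                queue.append(nxt)
    return None

# Hypothetical association network.
graph = {
    "me": [("RecipientOf", "email_42"), ("AuthorOf", "paper_7")],
    "email_42": [("Sender", "alice")],
    "paper_7": [("Cites", "paper_9")],
    "paper_9": [("Author", "alice")],
}

print(association_chain(graph, "me", "alice"))
```

The search returns the two-step chain through the email rather than the three-step chain through the cited paper. An "earliest" or "latest" chain would need timestamps on the associations, which this sketch does not model.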

  49. Explore the association network – 2. Find all instances related to a given keyword
     • Example: Who is working on “Schema Matching”?
     • Desired results: a list of papers on schema matching; a list of emails on schema matching; a list of persons working on schema matching; a list of conferences for schema-matching papers; a list of institutes that conduct schema-matching research
     • Naive approach: index object instances on attribute values
     • Our approach: index objects on the attributes of associated objects
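
The contrast on slide 49 between the naive index and indexing objects on the attributes of associated objects can be sketched with a small inverted index. The object texts and associations below are hypothetical, and the one-hop expansion is an assumption about how far the indexing reaches.

```python
from collections import defaultdict

def build_index(texts, associations, use_associations=True):
    """Inverted index word -> object ids. With use_associations, each object
    is also indexed under the words of its directly associated objects."""
    index = defaultdict(set)
    for obj, text in texts.items():
        for word in text.lower().split():
            index[word].add(obj)
    if use_associations:
        for a, b in associations:
            for word in texts[a].lower().split():
                index[word].add(b)
            for word in texts[b].lower().split():
                index[word].add(a)
    return index

def search(index, query):
    """Objects indexed under every word of the query."""
    hits = [index.get(word, set()) for word in query.lower().split()]
    return set.intersection(*hits) if hits else set()

# Hypothetical objects (attribute text) and associations between them.
texts = {"paper1": "a study of schema matching", "person1": "alice", "conf1": "vldb"}
associations = [("paper1", "person1"), ("paper1", "conf1")]

naive = build_index(texts, associations, use_associations=False)
assoc = build_index(texts, associations)

print(search(naive, "schema matching"))
print(search(assoc, "schema matching"))
```

The naive index returns only the paper, since neither the person nor the conference mentions the keyword in its own attributes. The association-aware index also returns the author and the venue.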

  50. Explore the association network – 3. Rank returned instances in a keyword search
     • Example: What are important papers on “schema matching”?
     • Naive approach: rank by TF/IDF metric
     • Our approach: rank by a significance score (PageRank measure), a relevance score (TF/IDF metric), and a usage score (last visit time and modification time)
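
A minimal sketch of ranking by the three scores named on slide 50. The slide does not say how the scores are combined, so the weighted sum and its weights are assumptions, as are the tiny citation graph and the fixed relevance and usage values. The PageRank function is a plain power iteration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank. links: {node: [nodes it points to]}.
    Nodes without outgoing links spread their rank evenly over all nodes."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            targets = links.get(n) or list(nodes)
            share = damping * rank[n] / len(targets)
            for t in targets:
                new[t] += share
        rank = new
    return rank

def combined_score(significance, relevance, usage, weights=(0.4, 0.4, 0.2)):
    """Hypothetical combination: a weighted sum of the three scores."""
    return weights[0] * significance + weights[1] * relevance + weights[2] * usage

# Hypothetical citation graph: p1 and p2 both cite p3.
citations = {"p1": ["p3"], "p2": ["p3"], "p3": []}
significance = pagerank(citations)
relevance = {"p1": 0.5, "p2": 0.5, "p3": 0.5}  # e.g. TF/IDF, equal here
usage = {"p1": 0.1, "p2": 0.1, "p3": 0.1}      # e.g. recency, equal here

ranked = sorted(
    significance,
    key=lambda p: combined_score(significance[p], relevance[p], usage[p]),
    reverse=True,
)
print(ranked[0])
```

With relevance and usage held equal, the most cited paper ranks first, which a ranking based on TF/IDF alone could not distinguish.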
