Corpus-based Schema Matching

Presentation Transcript


  1. Corpus-based Schema Matching Jayant Madhavan

  2. Schema Matching
     • Schema matching: discovering correspondences between similar elements
     • Eventually: SQL expressions that can populate one database from the other
     • Example tables (the diagram labels the two sides Inventory Database A and Inventory Database B):
       • BooksAndMusic: Title, Author, Publisher, ItemID, ItemType, ListPrice, Categories, Keywords
       • Books: Title, ISBN, Price, OurPrice, Edition
       • Authors: ISBN, FirstName, LastName
       • Discounts: ItemID, DiscountPrice
       • BookGenres: ISBN, Genre
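
To make the "eventually" goal on this slide concrete, here is a minimal sketch of a mapping written as a SQL expression that populates one table from another, run through Python's built-in `sqlite3` module. The slide only lists table and column names, so the correspondences chosen here (Title to Title, ListPrice to Price, ItemID standing in for ISBN), the `ItemType = 'Book'` filter, and the sample rows are all assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Cut-down versions of two tables named on the slide (column subset is an assumption).
conn.executescript("""
    CREATE TABLE BooksAndMusic (Title TEXT, Author TEXT, ItemID TEXT,
                                ItemType TEXT, ListPrice REAL);
    CREATE TABLE Books (Title TEXT, ISBN TEXT, Price REAL);
""")

# Made-up sample rows; the slides contain no data.
conn.executemany(
    "INSERT INTO BooksAndMusic VALUES (?, ?, ?, ?, ?)",
    [
        ("Example Book", "A. Writer", "B-001", "Book", 20.0),
        ("Example Album", "A. Singer", "C-001", "CD", 12.0),
    ],
)

# Hypothetical mapping: once element correspondences are known, they can be
# turned into a query that fills the target table from the source table.
MAPPING_SQL = """
    INSERT INTO Books (Title, ISBN, Price)
    SELECT Title, ItemID, ListPrice
    FROM BooksAndMusic
    WHERE ItemType = 'Book'
"""
conn.execute(MAPPING_SQL)

books = conn.execute("SELECT Title, ISBN, Price FROM Books").fetchall()
print(books)
```

Schema matching supplies the column correspondences; writing the final query (filters, joins, value conversions) is the later step the slide alludes to.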

  3. Heterogeneity and Data Sharing
     • Data integration
     • Mappings provide the glue between independent data sources
     • Schema matching is important to any application with multiple data sources
     • Diagram labels: Query, Central Mediator (Books+Music), Mappings, Data Sources (All Books, CD World, Amazon)
     • Schema labels in the diagram: "Book, Music, Store, …"; "Books, Pubs, Authors, …"; "Products, Discounts, …"

  4. Typical Approaches
     • Multiple sources of evidence in the schemas, each shown with an example and the complication noted on the slide:
       • Schema element names, e.g. BooksAndCDs/Categories ~ BookCategories/Category (abbreviations, synonyms, …)
       • Descriptions and documentation, e.g. ItemID: unique identifier for a book or a CD (incomplete, absent, …)
       • Data types, e.g. DateTime ≈ Integer (inconsistent, absent, …)
       • Schema structure, e.g. all books have similar attributes (overlapping schemas, …)
       • Data instances, e.g. all addresses have similar formats (different values, scales, …)
     • Combine multiple techniques to exploit all available evidence [Do, Rahm; VLDB 2002], [Doan, et al.; WWW 2002]…

  5. Matching Techniques
     • Input: schemas S and T, with elements s and t
     • Step 1, build models: element models Ms and Mt (Name, Instances, Type, …)
     • Step 2, compare models
     • Step 3, combine results: similarity matrix over s1 … sm and t1 … tn
     • Step 4, generate matches: mapping between s and t
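
The four steps on this slide can be sketched in a few lines of Python. This is only an illustration of the build, compare, combine, generate flow, not the matcher used in the talk: the three similarity functions, their weights, the threshold, and the sample elements are all assumptions.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher


@dataclass
class ElementModel:
    """Step 1: a model of one schema element (name, type, instances)."""
    name: str
    dtype: str
    instances: set = field(default_factory=set)


# Step 2: individual comparisons, one per kind of evidence.
def name_sim(a, b):
    return SequenceMatcher(None, a.name.lower(), b.name.lower()).ratio()


def type_sim(a, b):
    return 1.0 if a.dtype == b.dtype else 0.0


def instance_sim(a, b):
    # Jaccard overlap of data instances; 0 when either side has no tuples.
    if not a.instances or not b.instances:
        return 0.0
    return len(a.instances & b.instances) / len(a.instances | b.instances)


# Assumed weights for combining the evidence (they sum to 1).
MATCHERS = [(name_sim, 0.5), (type_sim, 0.2), (instance_sim, 0.3)]


def similarity_matrix(source, target):
    """Step 3: combine matcher scores into one matrix (rows = source elements)."""
    return [[sum(w * m(s, t) for m, w in MATCHERS) for t in target] for s in source]


def generate_matches(source, target, threshold=0.5):
    """Step 4: keep the best-scoring target per source element, if good enough."""
    matrix = similarity_matrix(source, target)
    matches = {}
    for s, row in zip(source, matrix):
        best = max(range(len(target)), key=lambda j: row[j])
        if row[best] >= threshold:
            matches[s.name] = target[best].name
    return matches


source = [
    ElementModel("ListPrice", "float", {"9.99", "20.00"}),
    ElementModel("Title", "string", {"Dune"}),
]
target = [
    ElementModel("Price", "float", {"9.99", "15.00"}),
    ElementModel("Title", "string", {"Dune", "Emma"}),
    ElementModel("ISBN", "string"),
]
matches = generate_matches(source, target)
print(matches)
```

A weighted sum is the simplest way to combine results; the systems cited on the previous slide use more elaborate combination strategies.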

  6. Insufficient evidence
     • Diagram elements: Product, Music (no tuples), MusicCD, CD

  7. Obtaining more evidence
     • Diagram labels: "Product, CD"; "Music, MusicCD"; Corpus-based Augment; Corpus; MusicCD; CD

  8. Corpus-based Schema Matching
     • Can we use known schemas and mappings to match as yet unseen schemas?
     • Augment information about elements in schemas being matched
     • Learn schema design patterns and constraints from known schemas to improve matches

  9. Multiple representations for concepts
     • Learn alternate names, data instances, names of related elements, data types, …
     • Element names shown in the diagram: CDs, CD, Music, Album, AlbumName, Name, TrackName, DiscountPrice, DiscountedPrice, SalePrice, OurPrice, Discounted, DiscPrice, Artist, AuthorArtist, Name, LastName, Author, ID, CDID, ProdCode, ISBN, RecordLabel, Label, Company, RecordingCompany, Artists, CD2Artist, AuthorArtists, ArtistID

  10. Schema Design Patterns
      • Relations between elements
      • Tables and likely columns

  11. Corpus of known schemas and mappings
      • Input: schemas S, with elements s
      • Build initial models: element models Ms (Name, Instances, Type, …)
      • Search the corpus for similar elements (e, f)
      • Build augmented models: augmented models M’s (Name, Instances, Type, …)
      • Typical schema matcher
      • Learn schema design patterns: concepts/clusters, domain constraints
      • Generate matches: mapping
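
The augmentation path in this architecture (search similar elements, then build augmented models) can be illustrated with a small Python sketch. It is a deliberately simplified stand-in: the talk describes learned models, whereas this sketch uses plain string similarity, and the corpus contents, the mapping list, and the threshold are hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical corpus: element name -> data instances seen for that element.
CORPUS_ELEMENTS = {
    "MusicCD": {"Kind of Blue", "Abbey Road"},
    "CD": {"Abbey Road", "Thriller"},
    "Publisher": {"Penguin"},
}

# Hypothetical known mappings between corpus elements.
CORPUS_MAPPINGS = [("MusicCD", "CD")]


def name_sim(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def augment(name, instances, threshold=0.6):
    """Return (names, instances) for an element, enriched from the corpus.

    Corpus elements whose names resemble `name` are pulled in, together with
    the elements they are mapped to in the corpus.
    """
    names, values = {name}, set(instances)
    for corpus_name in CORPUS_ELEMENTS:
        if name_sim(name, corpus_name) < threshold:
            continue
        similar = {corpus_name}
        for a, b in CORPUS_MAPPINGS:
            if a == corpus_name:
                similar.add(b)
            if b == corpus_name:
                similar.add(a)
        for other in similar:
            names.add(other)
            values |= CORPUS_ELEMENTS[other]
    return names, values


# "Music" has no tuples of its own, as in the earlier slide's example.
names, values = augment("Music", set())
print(sorted(names))
print(sorted(values))
```

After augmentation, the element "Music" carries the alternate names and instances of "MusicCD" and "CD", so an ordinary matcher run on the augmented model has more evidence to compare against an element such as "CD".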

  12. Contents of the Corpus
      • In order to augment
        • Learn model ensemble for each element
          • names, data instances, types, structure, …
        • Train using the schemas and mappings
          • Element and elements it maps to are positive examples
      • In order to learn domain constraints
        • Cluster elements in the corpus into concepts
        • Estimate schema statistics
          • Likely tables-columns and element co-occurrence
        • Learn importance of individual constraints
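
As one way to picture the "estimate schema statistics" bullet, the sketch below counts how often a column concept appears in tables of a given concept across a corpus. It assumes the clustering step has already assigned every table and column to a concept; the corpus contents and the function name `column_likelihoods` are made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical corpus: one dict per schema, mapping a table concept to the
# column concepts found in that table.
CORPUS = [
    {"Book": ["Title", "ISBN", "Price"]},
    {"Book": ["Title", "ISBN", "Author"]},
    {"Book": ["Title", "Price"], "CD": ["Title", "Artist"]},
]


def column_likelihoods(corpus):
    """Estimate, per table concept, the fraction of its tables that contain
    each column concept."""
    table_counts = Counter()
    column_counts = defaultdict(Counter)
    for schema in corpus:
        for table, columns in schema.items():
            table_counts[table] += 1
            for column in set(columns):  # count each column once per table
                column_counts[table][column] += 1
    return {
        table: {col: n / table_counts[table] for col, n in cols.items()}
        for table, cols in column_counts.items()
    }


stats = column_likelihoods(CORPUS)
for table, cols in stats.items():
    print(table, cols)
```

Statistics like these could serve as soft domain constraints, for example preferring a candidate match that places an ISBN-like column in a Book-like table over one that places it in a CD-like table.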

  13. Experimental Results
      • Four domains
        • Automatically extracted web forms
        • Manually created relational schemas
      • Techniques
        • Direct: Glue [WWW’2004]
        • Corpus-based Augment
        • Corpus-based Pivot [IIW’2004]

  14. Improved Matching Performance
      • 16-19 schemas and 6 mappings in the corpus
      • 22-54 schema pairs being tested

  15. Difficult Match Tasks
      • More significant improvements for difficult tasks
      • Improvements are smaller for easy tasks

  16. Related Work
      • Using past matching experience
        • [Doan, et al., SIGMOD’2001; Do & Rahm, VLDB’2002]
        • We are trying to match unseen schemas.
      • Using web forms to construct mediated schema
        • [He & Chang, SIGMOD’2003]
        • Clustering of elements is an intermediate step in our corpus.
      • Using a Domain Ontology
        • [Xu & Embley, DASFAA’2003]
        • Our corpus structures are automatically generated.

  17. Conclusions
      • Schema matching is hard with insufficient evidence
      • Corpus-based Schema Matching
        • Augment the evidence about elements in unseen schemas
        • Learn schema design patterns to select matches
        • Improves matching, especially for difficult tasks
      • Future Work
        • Large schemas and complex mappings
        • User feedback to curate the corpus
        • Corpus as a tool for other data management tasks [Halevy & Madhavan, IJCAI’2003]
      • http://www.cs.washington.edu/homes/jayant
