
Mapping Maintenance for Data Integration Systems

Robert McCann, University of Illinois. Joint work with Bedoor AlShebli, Quoc Le, Hoa Nguyen, Long Vu, and AnHai Doan. VLDB 2005.


Presentation Transcript


  1. Mapping Maintenance for Data Integration Systems. Robert McCann, University of Illinois. Joint work with Bedoor AlShebli, Quoc Le, Hoa Nguyen, Long Vu, and AnHai Doan. VLDB 2005.

  2. Data Integration Systems
     [Figure: the user query "Find homes under $300K" is posed over a mediated schema, which is mapped to source schemas 1, 2, and 3; each source schema sits on a wrapper over one source (homeseekers.com, yahoo.com, windermere.com).]

  3. Mapping Maintenance is a Key Bottleneck
     • Constructing mappings has proven difficult (see first speaker)…
     • …but maintenance often quickly dominates cost
       • E.g., Integrated Genome Database Project [Stein, 03]
         • 12 genomic databases, each remodeled data twice per year
         • System broke every two weeks, abandoned after 1 year
       • E.g., Integration Project at Illinois
         • Integrated 400 DB researcher homepages
         • 2 system administrators, stopped after 3 months
     Reducing maintenance costs is now crucial!

  4. Problem Definition
     Mediated schema: cost | city | numbeds | numbaths
     Source schema (homeseekers.com, via wrapper): price | location | beds | baths
     Wrapper output:
       $185,000 | “Urbana, IL” | 2 | 2
       $270,000 | “Seattle, WA” | 3 | 2
     5 weeks later (source has changed), wrapper output:
       $180,000 | 61801 | 2 | 2
       $260,000 | 98195 | 3 | 2
     The slide marks the mapping to the mediated schema with a "?" after the change.

  5. Example 1: Change Source Schema or Data
     Mediated schema: cost | city | numbeds | numbaths
     Source schema (homeseekers.com, via wrapper): price | location | beds | baths
     Original wrapper output:
       $185,000 | “Urbana, IL” | 2 | 2
       $270,000 | “Seattle, WA” | 3 | 2
     • Update tuples
       $180,000 | “Urbana, IL” | 2 | 2
       $260,000 | “Seattle, WA” | 3 | 2
     • Change units of price
       185 | “Urbana, IL” | 2 | 2
       270 | “Seattle, WA” | 3 | 2

  6. Example 2: Change Presentation Format
     Mediated schema: cost | city | numbeds | numbaths
     Source schema (homeseekers.com, via wrapper): price | location | beds | baths
     Original wrapper output:
       $185,000 | “Urbana, IL” | 2 | 2
       $270,000 | “Seattle, WA” | 3 | 2
     • Display location as zipcode
       $185,000 | 61801 | 2 | 2
       $270,000 | 98195 | 3 | 2
     • Rearrange page layout
       $185,000 | “Century 21” | 2 | 2
       $270,000 | “RE/MAX” | 3 | 2
     Page snippets shown on the slide:
       $185,000 Urbana, IL 2bed/2bath Century 21
       $185,000 61801 2bed/2bath Century 21
       $185,000 - Urbana, IL 2bed/2bath Century 21

  7. The MAVERIC Approach
     Suppose the administrator wants to maintain mappings for 1 year.
     1. For a short initial period (e.g., 5 weeks)
        • Administrator manually verifies each mapping
        • MAVERIC probes the source to learn data characteristics
     2. For the remaining time (e.g., 47 weeks)
        • MAVERIC probes the source to observe new data instances
        • MAVERIC outputs an alarm if characteristics differ
        • If an alarm is raised, the administrator repairs the mappings

  8. Example
     Source schema (homeseekers.com, via wrapper): price | location | beds | baths
     • Training phase: wrapper output from homeseekers.com on week 1 and week 5
       $185,000 | “Urbana, IL” | 2 | 2
       $270,000 | “Seattle, WA” | 3 | 2
       $132,000 | “Salem, OR” | 2 | 1
       $365,000 | “Atlanta, GA” | 4 | 2
     • Learned data characteristics
       • If beds < baths, output alarm
       • If average price < 100,000, output alarm
       • If layout of attributes changes, output alarm
     • Verification phase: wrapper output from homeseekers.com on week 6
       132 | “Century 21” | 1 | 2
       365 | “RE/MAX” | 2 | 4
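
The first two alarm rules on slide 8 can be sketched as simple checks over one batch of wrapper output. This is a minimal illustration and not MAVERIC's actual sensors: the function name `check_rules`, the tuple layout, and the `min_avg_price` parameter are assumptions made for the example. The layout rule is left out because it needs the page HTML.

```python
from statistics import mean

# Assumed tuple layout: (price, location, beds, baths), as in the slide's tables.
def check_rules(rows, min_avg_price=100_000):
    """Return the list of alarm messages raised by one batch of wrapper output."""
    alarms = []
    # Rule 1: a home with fewer beds than baths is suspicious.
    if any(beds < baths for _, _, beds, baths in rows):
        alarms.append("beds < baths")
    # Rule 2: the average price should not drop below the learned bound.
    if mean(price for price, _, _, _ in rows) < min_avg_price:
        alarms.append(f"average price < {min_avg_price:,}")
    return alarms

# Week 5 (training) and week 6 (verification) values from the slide.
week5 = [(132_000, "Salem, OR", 2, 1), (365_000, "Atlanta, GA", 4, 2)]
week6 = [(132, "Century 21", 1, 2), (365, "RE/MAX", 2, 4)]

print(check_rules(week5))
print(check_rules(week6))
```

On the week 6 data both rules fire: the price units have changed and the beds and baths columns are swapped.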

  9. Contributions
     • Develop core MAVERIC system
       • An ensemble of sensors that exploit multiple characteristics of data
       • A combiner that leverages the most effective sensors
     • Significantly improve core system
       • Generate synthetic data to improve training
       • Leverage external data to improve training
       • Employ filters to reduce false alarms
     • Extensive evaluation over 114 sources in 6 domains
       • Core MAVERIC outperforms related work, improving F-1 by 4-19%
       • Enhancements further improve F-1 by 2-13%

  10. Training the Core MAVERIC System
     [Figure: wrapper output from homeseekers.com on week 1 ($185,000, “Urbana, IL”, 2, 2; $270,000, “Seattle, WA”, 3, 2) and week 5 ($132,000, “Salem, OR”, 2, 1; $365,000, “Atlanta, GA”, 4, 2) feeds sensors s1 … sm, which feed the combiner.]
     • Sensors learn internal profiles of data characteristics
       • E.g., avg value of price
       • E.g., layout of attributes in HTML pages: price location beds / baths
     • Combiner learns weight for each sensor
       • Employ Winnow to learn weights
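
Slide 10 names Winnow as the weight learner but does not say how sensor outputs are fed to it. The sketch below shows the textbook Winnow update (multiplicative promotion and demotion) under the assumption that each sensor output is binarized into a vote (1 = "looks broken"). The function name, the toy examples, and the choices alpha = 2 and threshold = n/2 are illustrative assumptions, not details from the talk.

```python
def train_winnow(examples, n_sensors, alpha=2.0, epochs=10):
    """Learn one weight per sensor with the classic Winnow update.

    examples: list of (votes, label); votes is a tuple of 0/1 sensor votes,
    label is True when the mapping was actually broken.
    """
    weights = [1.0] * n_sensors
    theta = n_sensors / 2  # a common threshold choice for Winnow
    for _ in range(epochs):
        for votes, label in examples:
            predicted = sum(w for w, v in zip(weights, votes) if v) >= theta
            if predicted == label:
                continue  # Winnow only updates on mistakes
            # Promote active sensors on a missed alarm, demote on a false alarm.
            factor = alpha if label else 1 / alpha
            weights = [w * factor if v else w for w, v in zip(weights, votes)]
    return weights, theta

# Toy data: sensor 0 tracks the label, sensors 1 and 2 are unreliable.
examples = [
    ((1, 0, 0), True),
    ((1, 1, 0), True),
    ((0, 1, 0), False),
    ((0, 0, 0), False),
    ((0, 1, 1), False),
    ((1, 0, 1), True),
]
weights, theta = train_winnow(examples, n_sensors=3)
print(weights, theta)
```

After training, the reliable sensor carries more weight than the unreliable ones, which is the sense in which a combiner can favor the most effective sensors.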

  11. Verifying with the Core MAVERIC System
     [Figure: wrapper output from homeseekers.com on week 6 (132, “Century 21”, 1, 2; 365, “RE/MAX”, 2, 4) feeds sensors s1 … sm, which output score1 … scorem to the combiner. Annotations on the sensors: new avg price; layout of attributes has changed.]
     • Sensors leverage internal profiles to output sensor scores
     • Combiner combines scores based upon weights
       • Alarm if combined score ≥ θ
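
The verification rule "alarm if combined score ≥ θ" can be sketched as follows. The slide does not give the combination formula, so a normalized weighted average is assumed here. The function names, weights, scores, and the value of θ are all made up for illustration.

```python
def combined_score(scores, weights):
    """Weighted average of sensor scores (assumed combination rule)."""
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

def should_alarm(scores, weights, theta):
    """Raise an alarm when the combined score reaches the threshold theta."""
    return combined_score(scores, weights) >= theta

# Illustrative weights: the first sensor is trusted most.
weights = [2.0, 0.5, 0.5]

# Scores in [0, 1]; higher means "data looks less like the learned profile".
print(should_alarm([0.9, 0.2, 0.1], weights, theta=0.5))
print(should_alarm([0.1, 0.8, 0.3], weights, theta=0.5))
```

With these numbers the first batch raises an alarm because the trusted sensor scores high, while the second does not even though a low-weight sensor scores high.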

  12. Improving Training via Perturbation
     • Idea: expand training data by generating synthetic data
     • Simulate natural source changes during training
       • Source data changes, e.g., insert and delete tuples
       • Presentation format changes, e.g., $29.99 becomes 29.99 USD
     • Perturber: apply change, reapply wrapper, test results
     [Figure: the wrapper over source S at t1 … tn yields query results at t1 … tn (the original results); the perturber produces perturbed results; both form the training data for S, which trains sensors s1 … sm and the combiner.]
     System “practices ahead of time”

  13. Example: Reformatting Price
     Source: homeseekers.com. Source schema: price | location | beds | baths
     • Original HTML: $185,000 Urbana, IL 3bed/2bath…
       • Original results (via wrapper): $185,000 | “Urbana, IL” | 3 | 2
     • Perturbed HTML: 185,000 USD Urbana, IL 3bed/2bath…
       • Perturbed results (via wrapper): 185,000 USD | “Urbana, IL” | 3 | 2
     • Original and perturbed results are compared (“? =”), giving an original and a perturbed training example
     [Figure: both examples go into the training data for sensors s1 … sm and the combiner.]
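
The perturber's three steps (apply change, reapply wrapper, test results) can be sketched for the price-reformatting example. This is only an illustration under stated assumptions: the regex-based wrapper, the function names, and the shortened page snippet are hypothetical stand-ins, since the talk does not show the real wrapper or perturber.

```python
import re

def perturb_price_format(html):
    """Apply change: rewrite prices such as '$185,000' as '185,000 USD'."""
    return re.sub(r"\$(\d[\d,]*(?:\.\d+)?)", r"\1 USD", html)

# Hypothetical wrapper: price is whatever precedes "City, ST", then beds/baths.
WRAPPER_RE = re.compile(r"^(.*?)\s+(\w+, [A-Z]{2})\s+(\d)bed/(\d)bath")

def wrap(html):
    """Reapply wrapper: extract (price, location, beds, baths) or None."""
    m = WRAPPER_RE.match(html)
    if m is None:
        return None
    price, location, beds, baths = m.groups()
    return (price, location, int(beds), int(baths))

original_html = "$185,000 Urbana, IL 3bed/2bath"
perturbed_html = perturb_price_format(original_html)

original_result = wrap(original_html)
perturbed_result = wrap(perturbed_html)

# Test results: does the wrapper output still equal the original output?
unchanged = original_result == perturbed_result
print(perturbed_html, perturbed_result, unchanged)
```

Here the wrapper still extracts a tuple from the perturbed page, but the price value differs from the original, so the pair can serve as an extra training example of a format change.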

  14. Additional Improvements
     • Improve training by borrowing data from other sources
       [Figure: two sources S and S’, each with a wrapper and a source schema mapped to the same mediated schema. Attribute labels shown: description, cost, category, price, comments, amount. Example values shown: “This…”, house, 185,000 USD, $185,000.]
     • Reduce false alarms via filtering
       • Potentially corrupt attribute: price = 185,000 USD
       • Web Search Engines: “price is 185,000 USD”, “costs 185,000 USD”
       • Monetary Recognizers: $185,000, $185000.00
       • Other Sources: amount = 210 K
       • Result: price is valid
     (see paper for details)
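
A monetary recognizer used as a false-alarm filter might look like the sketch below. The regular expression, the function names, and the rule "drop the alarm when every value is recognized as money" are assumptions for illustration. The slide only lists example formats and defers details to the paper.

```python
import re

# A number with optional thousands separators and optional cents.
NUMBER = r"(?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d{2})?"
# Accept "$185,000", "$185000.00", or "185,000 USD".
MONEY_RE = re.compile(r"\$\s?" + NUMBER + r"|" + NUMBER + r"\s?USD")

def is_monetary(value):
    """Return True when the whole string looks like a monetary amount."""
    return MONEY_RE.fullmatch(value.strip()) is not None

def keep_alarm(price_values):
    """Filter step: keep an alarm on 'price' only if some value is not money."""
    return not all(is_monetary(v) for v in price_values)

print(keep_alarm(["185,000 USD", "$270,000"]))  # recognizer vouches for values
print(keep_alarm(["Century 21", "RE/MAX"]))     # values are not prices
```

A filter of this kind would suppress an alarm caused only by a new price format while leaving alarms on genuinely wrong values in place.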

  15. Empirical Evaluation
     • Test verification ability over 114 sources in 6 domains

  16. Core MAVERIC Outperforms Prior Work
     • Compare with recent system [Lerman et al, Journal of AI Research 03]
     Achieve F-1 from 82-93%, an improvement of 4-19% in all domains
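
For readers unfamiliar with the metric, F-1 is the harmonic mean of precision and recall. The sketch below computes it from alarm counts, assuming a true positive is an alarm raised when the mapping was really broken. The counts are made up and are not results from the talk.

```python
def f1_score(tp, fp, fn):
    """F-1 from true positives, false positives (false alarms), and misses."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Illustrative counts only: 8 correct alarms, 2 false alarms, 2 missed changes.
print(round(f1_score(tp=8, fp=2, fn=2), 2))
```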

  17. Enhancements Boost Performance
     • Progressively enhanced versions of MAVERIC
     [Figure: F-1 (axis from 0.6 to 1) per domain (Flights, Books, Researchers, Real Estate, Inventory, Courses) for four versions: Sensor Ensemble; Sensor Ensemble + Perturbation; Sensor Ensemble + Perturbation + Multi-Src Train; Sensor Ensemble + Perturbation + Multi-Src Train + Filtering.]
     Each enhancement improved F-1 in at least 4 domains

  18. Reasons for Mistakes
     • Unrecognized instance formats
       • E.g., trained over TIME with format 2:00 pm, source changed format to 1400, output false alarm
       • E.g., trained over DAYS with format M-W-F, source changed format to Mon Wed Fri, output false alarm
       • Train with additional perturbations? Leverage more sources?
     • Attributes with similar values
       • E.g., trained with ORDER-DATE before SHIP-DATE, source reversed order, missed alarm on reversed values (ORDER-DATE = 7/13/2004, SHIP-DATE = 7/4/2004)
       • Include additional domain constraints?

  19. Related Work
     • Schema matching
       • [Dhamankar et al, 04], [He & Chang, 03], [Kang & Naughton, 03], [Rahm & Bernstein, 01], [Doan, 01]
       • Quantify semantics to compute matching scores
     • Activity monitoring
       • [Shavlik & Shavlik, 04], [Lazarevic et al, 03], [Stolfo et al, 01], [Fawcett & Provost, 99], [Allan et al, 98]
       • Profile normal behavior to detect notable events (e.g., intrusions)
     • Mapping and wrapper maintenance
       • Wrapper verification: [Lerman et al, 03], [Kushmerick, 00]
       • Mapping and wrapper repair: [Velegrakis et al, 03], [Meng et al, 03], [Chidlovskii, 01]

  20. Conclusion & Future Work
     • Developed MAVERIC to reduce maintenance costs
       • An ensemble of sensors that exploit multiple characteristics of data
     • Significantly improved core system
       • Perturbation, multi-source training, and filtering
     • Extensively evaluated over 114 sources in 6 domains
       • Core outperformed related work, improving F-1 by 4-19%
       • Enhancements further improved F-1 by 2-13%
     • Future work
       • Further improve and evaluate MAVERIC
       • Develop a solution for repairing broken mappings
