1 / 34

Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure

Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure. David W. Embley, Cui Tao, Stephen W. Liddle Brigham Young University. Funded by NSF. Leverage this …. … to do this. Information Exchange. Source. Target. Information Extraction. Schema

arnon
Télécharger la présentation

Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Automatically Extracting Ontologically Specified Data from HTML Tableswith Unknown Structure David W. Embley, Cui Tao, Stephen W. Liddle Brigham Young University Funded by NSF

  2. Leverage this … … to do this Information Exchange Source Target Information Extraction Schema Matching

  3. Information Extraction

  4. Extracting Pertinent Information from Documents

  5. Year Price 1..* 1..* 1..* has has Make 1..* Mileage 0..1 0..1 0..1 0..1 has has Car 0..1 0..1 0..* is for PhoneNr has has 1..* Model 0..1 1..* 1..* has Feature 1..* Extension A Conceptual-Modeling Solution

  6. Car-Ads Ontology Car [->object]; Car [0..1] has Year [1..*]; Car [0..1] has Make [1..*]; Car [0...1] has Model [1..*]; Car [0..1] has Mileage [1..*]; Car [0..*] has Feature [1..*]; Car [0..1] has Price [1..*]; PhoneNr [1..*] is for Car [0..*]; PhoneNr [0..1] has Extension [1..*]; Year matches [4] constant {extract “\d{2}”; context "([^\$\d]|^)[4-9]\d[^\d]"; substitute "^" -> "19"; }, … … End;

  7. Car Feature 0001 Auto 0001 AC 0002 Black 0002 4 door 0002 tinted windows 0002 Auto 0002 pb 0002 ps 0002 cruise 0002 am/fm 0002 cassette stereo 0002 a/c 0003 Auto 0003 jade green 0003 gold Car Year Make Model Mileage Price PhoneNr 0001 1989 Subaru SW $1900 (336)835-8597 0002 1998 Elantra (336)526-5444 0003 1994 HONDA ACCORD EX 100K (336)526-1081 Recognition and Extraction

  8. Schema Matching for HTML Tables with Unknown Structure

  9. Table-Schema Matching(Basic Idea) • Many Tables on the Web • Ontology-Based Extraction • Works well for unstructured or semistructured data • What about structured data – tables? • Method • Form attribute-value pairs • Do extraction • Infer mappings from extraction patterns

  10. Problem: Different Schemas Target Database Schema {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature} Different Source Table Schemas • {Run #, Yr, Make, Model, Tran, Color, Dr} • {Make, Model, Year, Colour, Price, Auto, Air Cond., AM/FM, CD} • {Vehicle, Distance, Price, Mileage} • {Year, Make, Model, Trim, Invoice/Retail, Engine, Fuel Economy}

  11. Problem: Attribute is Value

  12. ? ? Problem: Attribute-Value is Value

  13. Problem: Value is not Value

  14. `` `` `` Problem: Implied Values

  15. Problem: Missing Attributes

  16. Problem: Compound Attributes

  17. Problem: Factored Values

  18. Problem: Split Values

  19. Problem: Merged Values

  20. Problem: Values not of Interest

  21. Table extending over several pages Single-Column Table (formatted as list) Problem: Information Behind Links

  22. Solution • Form attribute-value pairs (adjust if necessary) • Do extraction • Infer mappings from extraction patterns

  23. ACURA ACURA Legend Unnest: μ(Model, Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*μ(Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*Table Solution: Remove Internal Factoring Discover Nesting: Make, (Model, (Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*)*

  24. Auto Air Cond. AM/FM CD Auto AM/FM Auto Air Cond. AM/FM CD AM/FM Air Cond. AM/FM Auto Air Cond. AM/FM βCDTable βAutoβAir CondβAM/FM Yes, Yes, Yes, Yes, Solution: Replace Boolean Values ACURA ACURA Legend

  25. Auto Air Cond. AM/FM CD Auto AM/FM Auto Air Cond. AM/FM CD AM/FM Air Cond. AM/FM Auto Air Cond. AM/FM Solution: Form Attribute-Value Pairs ACURA ACURA Legend <Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>, <Auto, Auto>, <Air Cond., Air Cond.>, <AM/FM, AM/FM>, <CD, >

  26. Auto Air Cond. AM/FM CD Auto AM/FM Auto Air Cond. AM/FM CD AM/FM Air Cond. AM/FM Auto Air Cond. AM/FM Solution: Adjust Attribute-Value Pairs ACURA ACURA Legend <Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>, <Auto>, <Air Cond>, <AM/FM>

  27. Auto Air Cond. AM/FM CD Auto AM/FM Auto Air Cond. AM/FM CD AM/FM Air Cond. AM/FM Auto Air Cond. AM/FM Solution: Do Extraction ACURA ACURA Legend

  28. Auto Air Cond. AM/FM CD Auto AM/FM Auto Air Cond. AM/FM CD AM/FM Air Cond. AM/FM Auto Air Cond. AM/FM πMakeμ(Model, Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*μ(Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*Table Each row is a car. πYearTable πModelμ(Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*Table Note: Mappings produce sets for attributes. Joining to form records is trivial because we have OIDs for table rows (e.g. for each Car). Solution: Infer Mappings ACURA ACURA Legend {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}

  29. Auto Air Cond. AM/FM CD Auto AM/FM Auto Air Cond. AM/FM CD AM/FM Air Cond. AM/FM Auto Air Cond. AM/FM πModelμ(Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*Table Solution: Do Extraction ACURA ACURA Legend {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}

  30. Auto Air Cond. AM/FM CD Auto AM/FM Auto Air Cond. AM/FM CD AM/FM Air Cond. AM/FM Auto Air Cond. AM/FM Solution: Do Extraction ACURA ACURA Legend πPriceTable {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}

  31. Auto Air Cond. AM/FM CD Auto AM/FM Auto Air Cond. AM/FM CD AM/FM Air Cond. AM/FM Auto Air Cond. AM/FM Solution: Do Extraction ACURA ACURA Legend ρColour←Feature πColourTable U ρAuto←Feature πAuto βAutoTable UρAir Cond.←Feature πAir Cond. βAir Cond.Table UρAM/FM←Feature πAM/FM βAM/FMTable UρCD←FeatureπCDβCDTable Yes, Yes, Yes, Yes, {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}

  32. Experiment • Tables from 60 sites • 10 “training” tables • 50 test tables • 357 mappings (from all 60 sites) • 172 direct mappings (same attribute and meaning) • 185 indirect mappings (29 attribute synonyms, 5 “Yes/No” columns, 68 unions over columns for Feature, 19 factored values, and 89 columns of merged values that needed to be split)

  33. Results • 10 “training” tables • 100% of the 57 mappings (no false mappings) • 94.6% of the values in linked pages (5.4% false declarations) • 50 test tables • 94.7% of the 300 mappings (no false mappings) • On the bases of sampling 3,000 values in linked pages, we obtained 97% recall and 86% precision • 16 missed mappings • 4 partial (not all unions included) • 6 non-U.S. car-ads (unrecognized makes and models) • 2 U.S. unrecognized makes and models • 3 prices (missing $ or found MSRP instead) • 1 mileage (mileages less than 1,000)

  34. Conclusions • Summary • Transformed schema-matching problem to extraction • Inferred semantic mappings • Discovered source-to-target mapping rules • Evidence of Success • Tables (mappings): 95% (Recall); 100% (Precision) • Linked Text (value extraction): ~97% (Recall); ~86% (Precision) • Future Work • Discover and exploit structure in linked text • Broaden table understanding • Integrate with current extraction tools www.deg.byu.edu

More Related