340 likes | 493 Vues
Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure. David W. Embley, Cui Tao, Stephen W. Liddle Brigham Young University. Funded by NSF. Leverage this …. … to do this. Information Exchange. Source. Target. Information Extraction. Schema
E N D
Automatically Extracting Ontologically Specified Data from HTML Tableswith Unknown Structure David W. Embley, Cui Tao, Stephen W. Liddle Brigham Young University Funded by NSF
Leverage this … … to do this Information Exchange Source Target Information Extraction Schema Matching
Year Price 1..* 1..* 1..* has has Make 1..* Mileage 0..1 0..1 0..1 0..1 has has Car 0..1 0..1 0..* is for PhoneNr has has 1..* Model 0..1 1..* 1..* has Feature 1..* Extension A Conceptual-Modeling Solution
Car-Ads Ontology Car [->object]; Car [0..1] has Year [1..*]; Car [0..1] has Make [1..*]; Car [0...1] has Model [1..*]; Car [0..1] has Mileage [1..*]; Car [0..*] has Feature [1..*]; Car [0..1] has Price [1..*]; PhoneNr [1..*] is for Car [0..*]; PhoneNr [0..1] has Extension [1..*]; Year matches [4] constant {extract “\d{2}”; context "([^\$\d]|^)[4-9]\d[^\d]"; substitute "^" -> "19"; }, … … End;
Car Feature 0001 Auto 0001 AC 0002 Black 0002 4 door 0002 tinted windows 0002 Auto 0002 pb 0002 ps 0002 cruise 0002 am/fm 0002 cassette stereo 0002 a/c 0003 Auto 0003 jade green 0003 gold Car Year Make Model Mileage Price PhoneNr 0001 1989 Subaru SW $1900 (336)835-8597 0002 1998 Elantra (336)526-5444 0003 1994 HONDA ACCORD EX 100K (336)526-1081 Recognition and Extraction
Table-Schema Matching(Basic Idea) • Many Tables on the Web • Ontology-Based Extraction • Works well for unstructured or semistructured data • What about structured data – tables? • Method • Form attribute-value pairs • Do extraction • Infer mappings from extraction patterns
Problem: Different Schemas Target Database Schema {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature} Different Source Table Schemas • {Run #, Yr, Make, Model, Tran, Color, Dr} • {Make, Model, Year, Colour, Price, Auto, Air Cond., AM/FM, CD} • {Vehicle, Distance, Price, Mileage} • {Year, Make, Model, Trim, Invoice/Retail, Engine, Fuel Economy}
? ? Problem: Attribute-Value is Value
`` `` `` Problem: Implied Values
Table extending over several pages Single-Column Table (formatted as list) Problem: Information Behind Links
Solution • Form attribute-value pairs (adjust if necessary) • Do extraction • Infer mappings from extraction patterns
ACURA ACURA Legend Unnest: μ(Model, Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*μ(Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*Table Solution: Remove Internal Factoring Discover Nesting: Make, (Model, (Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*)*
Auto Air Cond. AM/FM CD Auto AM/FM Auto Air Cond. AM/FM CD AM/FM Air Cond. AM/FM Auto Air Cond. AM/FM βCDTable βAutoβAir CondβAM/FM Yes, Yes, Yes, Yes, Solution: Replace Boolean Values ACURA ACURA Legend
Auto Air Cond. AM/FM CD Auto AM/FM Auto Air Cond. AM/FM CD AM/FM Air Cond. AM/FM Auto Air Cond. AM/FM Solution: Form Attribute-Value Pairs ACURA ACURA Legend <Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>, <Auto, Auto>, <Air Cond., Air Cond.>, <AM/FM, AM/FM>, <CD, >
Auto Air Cond. AM/FM CD Auto AM/FM Auto Air Cond. AM/FM CD AM/FM Air Cond. AM/FM Auto Air Cond. AM/FM Solution: Adjust Attribute-Value Pairs ACURA ACURA Legend <Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>, <Auto>, <Air Cond>, <AM/FM>
Auto Air Cond. AM/FM CD Auto AM/FM Auto Air Cond. AM/FM CD AM/FM Air Cond. AM/FM Auto Air Cond. AM/FM Solution: Do Extraction ACURA ACURA Legend
Auto Air Cond. AM/FM CD Auto AM/FM Auto Air Cond. AM/FM CD AM/FM Air Cond. AM/FM Auto Air Cond. AM/FM πMakeμ(Model, Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*μ(Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*Table Each row is a car. πYearTable πModelμ(Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*Table Note: Mappings produce sets for attributes. Joining to form records is trivial because we have OIDs for table rows (e.g. for each Car). Solution: Infer Mappings ACURA ACURA Legend {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}
Auto Air Cond. AM/FM CD Auto AM/FM Auto Air Cond. AM/FM CD AM/FM Air Cond. AM/FM Auto Air Cond. AM/FM πModelμ(Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*Table Solution: Do Extraction ACURA ACURA Legend {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}
Auto Air Cond. AM/FM CD Auto AM/FM Auto Air Cond. AM/FM CD AM/FM Air Cond. AM/FM Auto Air Cond. AM/FM Solution: Do Extraction ACURA ACURA Legend πPriceTable {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}
Auto Air Cond. AM/FM CD Auto AM/FM Auto Air Cond. AM/FM CD AM/FM Air Cond. AM/FM Auto Air Cond. AM/FM Solution: Do Extraction ACURA ACURA Legend ρColour←Feature πColourTable U ρAuto←Feature πAuto βAutoTable UρAir Cond.←Feature πAir Cond. βAir Cond.Table UρAM/FM←Feature πAM/FM βAM/FMTable UρCD←FeatureπCDβCDTable Yes, Yes, Yes, Yes, {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}
Experiment • Tables from 60 sites • 10 “training” tables • 50 test tables • 357 mappings (from all 60 sites) • 172 direct mappings (same attribute and meaning) • 185 indirect mappings (29 attribute synonyms, 5 “Yes/No” columns, 68 unions over columns for Feature, 19 factored values, and 89 columns of merged values that needed to be split)
Results • 10 “training” tables • 100% of the 57 mappings (no false mappings) • 94.6% of the values in linked pages (5.4% false declarations) • 50 test tables • 94.7% of the 300 mappings (no false mappings) • On the bases of sampling 3,000 values in linked pages, we obtained 97% recall and 86% precision • 16 missed mappings • 4 partial (not all unions included) • 6 non-U.S. car-ads (unrecognized makes and models) • 2 U.S. unrecognized makes and models • 3 prices (missing $ or found MSRP instead) • 1 mileage (mileages less than 1,000)
Conclusions • Summary • Transformed schema-matching problem to extraction • Inferred semantic mappings • Discovered source-to-target mapping rules • Evidence of Success • Tables (mappings): 95% (Recall); 100% (Precision) • Linked Text (value extraction): ~97% (Recall); ~86% (Precision) • Future Work • Discover and exploit structure in linked text • Broaden table understanding • Integrate with current extraction tools www.deg.byu.edu