Schema Matching and Data Extraction over HTML Tables
Cui Tao
Data Extraction Research Group
Department of Computer Science
Brigham Young University
Supported by NSF
Introduction
• Many tables on the Web
• Ontology-based extraction:
  • Works well for unstructured or semi-structured data
  • What about structured data (tables)?
• How to integrate data stored in different tables?
• Detect the table of interest
• Form attribute-value pairs (adjust if necessary)
• Do extraction
• Infer mappings from extraction patterns
Problem: Detecting the Table of Interest
Problem: Different Schemas
• Different source table schemas
  • {Run #, Yr, Make, Model, Tran, Color, Dr}
  • {Make, Model, Year, Colour, Price, Auto, Air Cond., AM/FM, CD}
  • {Vehicle, Distance, Price, Mileage}
  • {Year, Make, Model, Trim, Invoice/Retail, Engine, Fuel Economy}
• Target database schema
  • {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {Car, Feature}
Problem: Attribute-Value is Value
Problem: Information Behind Links
• Table extending over several pages
• List
Solution
• Detect the table of interest
• Form attribute-value pairs (adjust if necessary)
• Do extraction
• Infer mappings from extraction patterns
Solution: Detect the Table of Interest
• Top-level tables
  • Table size: at least 3 rows and 3 columns
  • Grid layout: same number of values in each row
  • Attributes
  • Value density: (number of ontology-extracted values) / (total number of values in the table)
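The size, grid-layout, and value-density checks above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the `min_density` cutoff, the convention that the first row holds the attributes, and the `toy_recognizer` standing in for ontology-based recognition are all hypothetical, and the "Attributes" criterion is not modeled. The sample data is made up.

```python
import re
from typing import Callable, List

Table = List[List[str]]  # table[0] is assumed to be the attribute (header) row


def value_density(data_rows: Table, is_ontology_value: Callable[[str], bool]) -> float:
    """Share of non-empty cells accepted by the (hypothetical) recognizer."""
    values = [cell for row in data_rows for cell in row if cell.strip()]
    if not values:
        return 0.0
    return sum(1 for cell in values if is_ontology_value(cell)) / len(values)


def is_table_of_interest(
    table: Table,
    is_ontology_value: Callable[[str], bool],
    min_density: float = 0.5,  # assumed cutoff; the slides give no threshold
) -> bool:
    # Table size: at least 3 rows and 3 columns.
    if len(table) < 3 or len(table[0]) < 3:
        return False
    # Grid layout: every row holds the same number of values.
    if any(len(row) != len(table[0]) for row in table):
        return False
    # Value density is computed over the data rows only.
    return value_density(table[1:], is_ontology_value) >= min_density


def toy_recognizer(cell: str) -> bool:
    """Stand-in for ontology recognition: four-digit years and two car makes."""
    return bool(re.fullmatch(r"(19|20)\d{2}", cell)) or cell in {"Honda", "Ford"}


cars = [
    ["Year", "Make", "Model"],
    ["1995", "Honda", "Civic EX"],
    ["2001", "Ford", "Focus"],
]
ragged = [["Year", "Make", "Model"], ["1995", "Honda"], ["2001", "Ford", "Focus"]]

print(is_table_of_interest(cars, toy_recognizer))
print(is_table_of_interest(ragged, toy_recognizer))
```

With the toy recognizer, four of the six data cells in `cars` are recognized, so the table passes the assumed cutoff, while `ragged` fails the grid-layout check.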
Solution: Detect the Table of Interest
• Linked-page tables
  • Table size: at least 2 rows and 2 columns
  • Attributes
  • Attribute-value-pair pattern
• Page-spanning tables
Solution: Remove Factoring
[Figure: Year column with factoring removed, one value per row: 2001, 2001, 2001, 2000, 2000, 2000, 2000, 2000, 2000, 1999, 1999]
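One way to picture removing factoring is a fill-down pass over the factored columns. The sketch below assumes that a factored value appears once and that the cells it covers are left empty in the parsed table; that representation, the function name `remove_factoring`, and the sample rows are illustrative assumptions rather than details taken from the slides.

```python
from typing import Dict, Iterable, List


def remove_factoring(rows: List[List[str]], factored_cols: Iterable[int]) -> List[List[str]]:
    """Copy each factored value down into the empty cells below it."""
    cols = list(factored_cols)
    last_seen: Dict[int, str] = {}
    result = []
    for row in rows:
        new_row = list(row)
        for col in cols:
            if new_row[col].strip():
                last_seen[col] = new_row[col]   # a new factored value starts here
            elif col in last_seen:
                new_row[col] = last_seen[col]   # reuse the value from the rows above
        result.append(new_row)
    return result


factored = [
    ["2001", "Honda"],
    ["", "Ford"],
    ["2000", "Audi"],
    ["", "BMW"],
]
unfactored = remove_factoring(factored, factored_cols=[0])
print([row[0] for row in unfactored])
```

After the pass, every row carries its own Year value, so each row can be treated as a complete record.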
Solution: Form Attribute-Value Pairs
<Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>, <Auto, Auto>, <Air Cond., Air Cond.>, <AM/FM, AM/FM>
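Forming the pairs amounts to zipping the attribute row with each data row. A minimal sketch, assuming the table has already been parsed into lists of strings; the helper names `form_pairs` and `render` are hypothetical:

```python
from typing import List, Tuple

Pair = Tuple[str, str]


def form_pairs(attributes: List[str], row: List[str]) -> List[Pair]:
    """Pair each attribute with the value in the same column of one row."""
    if len(attributes) != len(row):
        raise ValueError("row does not match the attribute row (not a grid)")
    return list(zip(attributes, row))


def render(pairs: List[Pair]) -> str:
    """Write pairs in the <Attribute, Value> notation used on the slide."""
    return ", ".join(f"<{attr}, {value}>" for attr, value in pairs)


pairs = form_pairs(["Make", "Model", "Year"], ["Honda", "Civic EX", "1995"])
print(render(pairs))
```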
Solution: Adjust Attribute-Value Pairs
<Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>, <Auto, Auto>, <Air Cond., Air Cond.>, <AM/FM, AM/FM>
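The slide does not show what the raw cells looked like before adjustment. The sketch below assumes, purely for illustration, that feature columns such as Auto or CD hold yes/no flags: a "yes" is replaced by the attribute name (giving pairs like <Auto, Auto>), and a "no" drops the pair. The flag vocabularies and the function name `adjust_pairs` are assumptions.

```python
from typing import List, Tuple

Pair = Tuple[str, str]

YES_FLAGS = {"yes", "y"}  # assumed spellings of a positive flag
NO_FLAGS = {"no", "n"}    # assumed spellings of a negative flag


def adjust_pairs(pairs: List[Pair]) -> List[Pair]:
    """Rewrite flag-valued pairs so that the attribute itself becomes the value."""
    adjusted = []
    for attribute, value in pairs:
        flag = value.strip().lower()
        if flag in YES_FLAGS:
            adjusted.append((attribute, attribute))  # e.g. <Auto, Yes> -> <Auto, Auto>
        elif flag in NO_FLAGS:
            continue                                 # feature absent: drop the pair
        else:
            adjusted.append((attribute, value))      # ordinary value: keep as is
    return adjusted


raw = [("Make", "Honda"), ("Price", "$6300"), ("Auto", "Yes"), ("CD", "No")]
print(adjust_pairs(raw))
```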
Solution: Add Information Hidden Behind Links
• Unstructured and semi-structured: concatenate
• Single attribute-value pairs: pair them together
  <Price, $7,988>, <Mileage, 63,168 miles>, <Body Type, Car>, <Body Style, 4 DR Sedan>, <Transmission, Automatic>, <Engine, 3.0 L V-6>, <Doors, 4>, <Fuel Type, Gas>, <Stock Number, 22764>, <VIN, 1FAFP52U2WA139879>
• List: mark the beginning and the end
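The three cases above might be combined into a single record per table row along these lines. This is only a sketch: the idea of building one text record, the function `add_linked_information`, and the `[list]` / `[/list]` marker strings are hypothetical choices made for illustration, since the slides do not say how the list boundaries are marked.

```python
from typing import List, Optional, Tuple

Pair = Tuple[str, str]

LIST_BEGIN = "[list]"   # hypothetical begin-of-list marker
LIST_END = "[/list]"    # hypothetical end-of-list marker


def render(pairs: List[Pair]) -> str:
    return " ".join(f"<{attr}, {value}>" for attr, value in pairs)


def add_linked_information(
    row_pairs: List[Pair],
    linked_text: Optional[str] = None,
    linked_pairs: Optional[List[Pair]] = None,
    linked_list: Optional[List[str]] = None,
) -> str:
    """Build one record from a table row plus whatever its linked page holds."""
    parts = [render(row_pairs)]
    if linked_text:                      # unstructured or semi-structured: concatenate
        parts.append(linked_text.strip())
    if linked_pairs:                     # attribute-value pairs: pair them together
        parts.append(render(linked_pairs))
    if linked_list:                      # list: mark the beginning and the end
        parts.append(" ".join([LIST_BEGIN, *linked_list, LIST_END]))
    return " ".join(parts)


record = add_linked_information(
    [("Make", "Honda")],
    linked_pairs=[("Price", "$7,988"), ("Doors", "4")],
)
print(record)
```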
Solution: Inferred Mapping Creation
• Target database schema: {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {Car, Feature}
• Each row is a car.
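Inferring mappings from extraction patterns can be pictured as counting, for each source attribute, which target attribute the extractor assigned to its values. The sketch below uses a simple majority vote with an assumed `min_share` cutoff; the voting rule, the function name `infer_mappings`, and the sample extraction results are illustrative assumptions, not the method described in the talk.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def infer_mappings(
    extractions: Iterable[Tuple[str, str]],
    min_share: float = 0.5,  # assumed cutoff for accepting a mapping
) -> Dict[str, str]:
    """Map each source attribute to the target attribute most of its values landed in.

    `extractions` holds one (source_attribute, target_attribute) entry per
    value the extractor recognized.
    """
    votes: Dict[str, Counter] = {}
    for source, target in extractions:
        votes.setdefault(source, Counter())[target] += 1

    mappings = {}
    for source, counts in votes.items():
        target, hits = counts.most_common(1)[0]
        if hits / sum(counts.values()) >= min_share:
            mappings[source] = target
    return mappings


# Made-up extraction results for a table with columns Yr and Colour.
observed = [("Yr", "Year")] * 3 + [("Colour", "Feature")] * 2 + [("Colour", "Make")]
print(infer_mappings(observed))
```

Here all three Yr values were extracted as Year, and two of three Colour values as Feature, so both source attributes receive a mapping under the assumed cutoff.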
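Experimental Results: Table Location
Car advertisement application domain
• Top table location: Precision 100%, Recall 87%
  • Training set: 7 tables, 100% (7) recognized
  • Testing set: 53 tables, 87% (46) recognized
• Linked page location: Precision 86%, Recall 92%
[Figure: flow from Top Table Location through Structured and Linked Page Location to Linked Pages; other counts shown: 15, 2, 28, 12, 13]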
Experimental Results: Mapping
Car advertisement application domain
• 46 recognized tables in the testing set
• Total 319 mappings
• Precision: 95.8%, Recall: 92.8%
• Top-level tables: 77% of the 296 correct mappings
• Linked tables: 19.6%
• Both: 3.4%
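As a quick arithmetic check on the car-advertisement figures, the reported recall values follow from the counts on the slides: 46 of 53 test tables recognized, and 296 correct mappings out of 319. The sketch below also converts the percentage breakdown of the 296 correct mappings into approximate counts; those counts are derived by rounding and are not stated in the source.

```python
def percent(part: float, whole: float, digits: int = 1) -> float:
    """Return part/whole as a percentage rounded to `digits` places."""
    return round(100 * part / whole, digits)


# Table location: 46 of the 53 test tables were recognized.
table_recall = percent(46, 53, digits=0)

# Mapping: 296 correct mappings out of 319 in total.
mapping_recall = percent(296, 319)

# Approximate counts implied by the breakdown of the 296 correct mappings.
top_level = round(0.77 * 296)
linked = round(0.196 * 296)
both = round(0.034 * 296)

print(table_recall, mapping_recall)
print(top_level, linked, both, top_level + linked + both)
```

The three derived counts add up to 296, which suggests the percentage breakdown on the slide is internally consistent.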
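Experimental Results: Table Location
Cell-phone sales application domain
• Top table location: Precision 100%, Recall 92%
  • Training set: 5 tables, 100% (5) recognized
  • Testing set: 12 tables, 92% (11) recognized
[Figure: flow from Top Table Location to Linked Pages; other values shown: 3, 100% (5)]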
Experimental Results: Mapping
Cell-phone sales application domain
• 11 recognized tables in the testing set
• Total 97 mappings
• Precision: 90.1%, Recall: 85.4%
• Top-level tables: 85.4% of the 88 correct mappings
• Linked tables: 50.5%
• Both: 35.9%
Contribution
• Provides an approach to extract information automatically from HTML tables
• Suggests a different way to solve the problem of schema matching