Scheme Matching and Data Extraction over HTML Tables from Heterogeneous Sources
Cui Tao
March, 2002
Funded by NSF
Introduction
• Many Web sites present their information in tables.
• Ontology-based extraction:
  • Works for unstructured or semi-structured data
  • Does not work for structured data (tables)
• Scope: only tables that present information, not tables used for layout.
Problems: Different Schemas
• Different source table schemas
  • {Run #, Yr, Make, Model, Tran, Color, Dr}
  • {Make, Model, Year, Colour, Price, Auto, Air Cond., AM/FM, CD}
  • {Vehicle, Distance, Price, Mileage}
  • {Year, Make, Model, Trim, Invoice/Retail, Engine, Fuel Economy}
• Target database schema
  • {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}
Problems
• Attribute-Value Pairs?
[Figure: example table with three labels reading "Switch"]
Problems
• Attribute/Value Combinations
[Figure: example table; column labels include Year/sty, Cyl., # Dr, Tran, Color]
Problems • Attribute/Value Split
Problems
• Information in the linked pages
  • Tables
  • Lists
  • Unstructured data
  • …
• Header information
Methods
• Table Understanding
  • Recognize Attributes and Values
  • Form Attribute-Value Pairs
  • Adjust Attribute-Value Pairs
  • Form Records
• Inferred Mapping Creation
  • Pre Data Extraction
  • Infer General Mapping
• Data Extraction

Methods: Table Understanding
• Table Recognition
  • <TABLE>, </TABLE>
  • <TR>: Row; <TD>: Data Entry; <TH>: Header
  • COLSPAN, ROWSPAN (attributes of <TD> and <TH>)
• Attribute/Value Determination
  • <TH>
  • First row, first column
  • Different font style
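The table recognition and attribute determination steps can be illustrated with a minimal sketch using Python's standard `html.parser`. This is not the implementation from the talk; the names `TableParser`, `parse_table`, and `header_row` are hypothetical. It assumes well-formed markup with explicit closing tags and a single, non-nested table, it ignores ROWSPAN, and it applies only the <TH> and first-row heuristics (not the first-column or font-style cues).

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect table cells as rows of (tag, text) tuples."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # row being built, or None outside <tr>
        self._cell = None   # [tag, text_parts, colspan] while inside a cell

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            colspan = int(dict(attrs).get("colspan") or 1)
            self._cell = [tag, [], colspan]

    def handle_data(self, data):
        if self._cell is not None:
            self._cell[1].append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            kind, parts, colspan = self._cell
            text = " ".join("".join(parts).split())
            # Repeat a spanning cell so each row has one entry per column.
            self._row.extend([(kind, text)] * colspan)
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_table(html):
    parser = TableParser()
    parser.feed(html)
    return parser.rows


def header_row(rows):
    """Guess the attribute row: the first all-<th> row, else the first row."""
    for row in rows:
        if row and all(kind == "th" for kind, _ in row):
            return [text for _, text in row]
    return [text for _, text in rows[0]] if rows else []
```

Real pages often omit closing tags or nest layout tables, so a production version would need a more tolerant parser than this sketch.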
Methods: Form Attribute-Value Pairs
• AOT (Attribute On Top) tables and AOL (Attribute On Left) tables
[Figure: example tables labeled AOT/AOL, ATL, AOT/AOL, MA]
Run# 1; Year SEE; Make AVAILABLE; Model TRUCKS; Tran ----; Color ----; Dr ----
Run# 2; Year 93; Make Mercury; Model Sable; Tran A; Color Green; Dr 4
Run# 3; Year 94; Make Chevrolet; Model Camaro; Tran A; Color Red; Dr 2
:
Run# 1; Year SEE; Make AVAILABLE; Model TRUCKS; Tran ----; Color ----; Dr ----
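Pairing for AOT and AOL tables can be sketched as follows, assuming the table has already been reduced to a list of rows of cell strings. The function names `form_pairs_aot` and `form_pairs_aol` are hypothetical, and the sketch assumes a non-empty, rectangular table.

```python
def form_pairs_aot(rows):
    """AOT: the first row holds attributes; every later row is one record."""
    attributes, *value_rows = rows
    return [list(zip(attributes, values)) for values in value_rows]


def form_pairs_aol(rows):
    """AOL: the first column holds attributes; transpose, then treat as AOT."""
    return form_pairs_aot([list(column) for column in zip(*rows)])


aot_table = [
    ["Run#", "Year", "Make", "Model", "Tran", "Color", "Dr"],
    ["2", "93", "Mercury", "Sable", "A", "Green", "4"],
    ["3", "94", "Chevrolet", "Camaro", "A", "Red", "2"],
]
for record in form_pairs_aot(aot_table):
    print("; ".join(f"{attr} {value}" for attr, value in record))
```

Treating AOL as a transposed AOT table keeps a single pairing routine for both layouts.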
Methods: Form Attribute-Value Pairs
• AOT (Attribute On Top) tables and AOL (Attribute On Left) tables
• ATL (Attribute On both Top and Left) tables
• MA (Multiple Set Attribute) tables
The City Fuel Economy of 2001 Honda Civic DX
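For ATL tables, each value is described by both a top attribute and a left attribute, which is presumably how a cell comes to be read as a phrase like the one above. A minimal sketch, assuming the top-left corner cell is a placeholder and using made-up labels and values (`Left 1`, `Top 1`, `v11`, and the function name `form_pairs_atl` are all hypothetical):

```python
def form_pairs_atl(rows):
    """ATL: pair every value with its (left attribute, top attribute)."""
    top_attributes = rows[0][1:]          # skip the corner cell
    pairs = []
    for row in rows[1:]:
        left_attribute, values = row[0], row[1:]
        for top_attribute, value in zip(top_attributes, values):
            pairs.append(((left_attribute, top_attribute), value))
    return pairs


atl_table = [
    ["",       "Top 1", "Top 2"],
    ["Left 1", "v11",   "v12"],
    ["Left 2", "v21",   "v22"],
]
for (left, top), value in form_pairs_atl(atl_table):
    print(f"The {top} of {left}: {value}")
```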
Methods: Adjust Attribute-Value Pairs
• CD: Yes -> “CD”
• Auto: No -> “ ”
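The adjustment rule shown in these two examples can be sketched as below. The yes/no vocabularies and the name `adjust_value` are assumptions; the slide gives only the `CD: Yes` and `Auto: No` cases.

```python
# Assumed spellings of affirmative and negative switch values.
YES_VALUES = {"yes", "y"}
NO_VALUES = {"no", "n"}


def adjust_value(attribute, value):
    """Rewrite switch-style values so the value itself carries the meaning.

    "Yes" becomes the attribute name, "No" becomes an empty string,
    and any other value is returned unchanged.
    """
    normalized = value.strip().lower()
    if normalized in YES_VALUES:
        return attribute
    if normalized in NO_VALUES:
        return ""
    return value


print(repr(adjust_value("CD", "Yes")))      # 'CD'
print(repr(adjust_value("Auto", "No")))     # ''
print(repr(adjust_value("Color", "Green"))) # 'Green'
```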
Methods: Form Records
• Each record: Attr 1 Value 1; Attr 2 Value 2; . . .; Attr n Value n
• Some records also carry detailed information in the linked page(s)

Methods: Pre Data Extraction
Methods
• Infer General Mapping
• Extract Data
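The slides do not spell out how the general mapping is inferred. One plausible reading, offered only as an assumption, is that pre-extraction reports which target attribute was recognized in each source column, and the general mapping keeps the dominant target per source attribute. The function name `infer_general_mapping`, the `min_share` threshold, and the sample observations are all hypothetical.

```python
from collections import Counter, defaultdict


def infer_general_mapping(observations, min_share=0.5):
    """Map each source attribute to its most frequently recognized target.

    observations: iterable of (source_attribute, target_attribute) hits,
    one per value matched during pre-extraction (assumed input format).
    A mapping is kept only if its share of the hits exceeds min_share.
    """
    votes = defaultdict(Counter)
    for source_attribute, target_attribute in observations:
        votes[source_attribute][target_attribute] += 1

    mapping = {}
    for source_attribute, counter in votes.items():
        target_attribute, count = counter.most_common(1)[0]
        if count / sum(counter.values()) > min_share:
            mapping[source_attribute] = target_attribute
    return mapping


# Made-up hits for illustration only.
hits = [("Yr", "Year"), ("Yr", "Year"), ("Yr", "Price"), ("Colour", "Color")]
print(infer_general_mapping(hits))
```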
Experiment
• Tables of car advertisements from 20 sites
  • 10 training tables: used to develop the ontology
  • 10 testing tables: used to measure recall and precision ratios
• Measured at three stages: before table processing, after table processing but before training, and after training
Results
• Mapping ratios:
  • Before table processing: hard to find record boundaries
  • After table processing, before training: 336/490 = 68.57%
  • After table processing, after training: 480/490 = 97.96%
• Precision and Recall
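The reported mapping ratios can be rechecked with a few lines of arithmetic. The helper name `mapping_ratio` is hypothetical; the counts are the ones given above.

```python
def mapping_ratio(mapped, total):
    """Return the share of correctly mapped items as a percentage."""
    return round(100 * mapped / total, 2)


print(mapping_ratio(336, 490))  # before training
print(mapping_ratio(480, 490))  # after training
```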
Conclusion and Future Work
• Tests so far cover only AOT tables.
• In the experiment, the mapping ratio after table processing rose from 68.57% before training to 97.96% after training.
• Next step: table understanding and inferred mapping.