
Scheme Matching and Data Extraction over HTML Tables from Heterogeneous Sources

Cui Tao, March 2002. Funded by NSF.


Presentation Transcript


  1. Scheme Matching and Data Extraction over HTML Tables from Heterogeneous Sources. Cui Tao, March 2002. Funded by NSF.

  2. Introduction
     • Many Web sites present their information in tables.
     • Ontology-Based Extraction:
       • Works for unstructured or semi-structured data
       • Does not work for structured data such as tables
     • Only tables that present information are considered, not tables used for layout.

  3. Problems: Different Schemas
     • Different Source Table Schemas
       • {Run #, Yr, Make, Model, Tran, Color, Dr}
       • {Make, Model, Year, Colour, Price, Auto, Air Cond., AM/FM, CD}
       • {Vehicle, Distance, Price, Mileage}
       • {Year, Make, Model, Trim, Invoice/Retail, Engine, Fuel Economy}
     • Target Database Schema
       • {Car, Year, Make, Model, Mileage, Price, PhoneNr}
       • {PhoneNr, Extension}
       • {Car, Feature}
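
A minimal sketch of why these schemas are a problem: matching attribute names verbatim against the target finds only a few of the correspondences. The schemas are copied from the slide; the helper `exact_name_matches` is hypothetical and is not the matching method of the presentation.

```python
# Source table schemas and target attributes, copied from the slide.
SOURCE_SCHEMAS = [
    ["Run #", "Yr", "Make", "Model", "Tran", "Color", "Dr"],
    ["Make", "Model", "Year", "Colour", "Price", "Auto", "Air Cond.", "AM/FM", "CD"],
    ["Vehicle", "Distance", "Price", "Mileage"],
    ["Year", "Make", "Model", "Trim", "Invoice/Retail", "Engine", "Fuel Economy"],
]
TARGET_ATTRIBUTES = {
    "Car", "Year", "Make", "Model", "Mileage", "Price",
    "PhoneNr", "Extension", "Feature",
}


def exact_name_matches(source_schema, target_attributes):
    """Return the source attributes whose names appear verbatim in the target.

    Hypothetical helper, used only to illustrate the mismatch.
    """
    return [attr for attr in source_schema if attr in target_attributes]


for schema in SOURCE_SCHEMAS:
    matched = exact_name_matches(schema, TARGET_ATTRIBUTES)
    unmatched = [attr for attr in schema if attr not in matched]
    print(f"matched={matched} unmatched={unmatched}")
```

In the first schema, for example, `Yr` and `Color` stay unmatched even though the target has `Year` and could store the color as a `Feature`.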

  4. Problems • Attribute-Value Pairs? (figure omitted; its callouts read "Switch")

  5. Problems • Attribute/Value Combinations (figure text: Year/sty Cyl. # Dr Tran Color)

  6. Problems • Attribute/Value Split

  7. Problems
     • Information in the linked pages
       • Tables
       • Lists
       • Unstructured data
       • …
     • Header information

  8. Methods
     Outline:
     • Table Understanding
     • Recognize Attributes and Values
     • Form Attribute-Value Pairs
     • Adjust Attribute-Value Pairs
     • Form Records
     • Inferred Mapping Creation
     • Pre Data Extraction
     • Infer General Mapping
     • Data Extraction
     Table Understanding:
     • Table Recognition
       • <TABLE>, </TABLE>
       • <TR>: Row; <TD>: Data Entry; <TH>: Header
       • COLSPAN, ROWSPAN (attributes of <TD> and <TH>)
     • Attribute/Value Determination
       • <TH>
       • First row, first column
       • Different font style
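
The table-recognition step can be sketched with Python's standard `html.parser`. This is an illustration under assumptions, not the presenter's implementation: the class `TableParser` and the function `attribute_row` are hypothetical names, only the `<TH>` and first-row cues are used, and COLSPAN, ROWSPAN and font style are ignored.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect table rows as lists of (tag, text) cells."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # row being built, or None outside <tr>
        self._cell = None   # [tag, text parts] being built, or None outside a cell

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lower case.
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = [tag, []]

    def handle_data(self, data):
        if self._cell is not None:
            self._cell[1].append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append((self._cell[0], "".join(self._cell[1]).strip()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def attribute_row(rows):
    """Guess the attribute names: a row made only of <th> cells, else the first row."""
    for row in rows:
        if row and all(tag == "th" for tag, _ in row):
            return [text for _, text in row]
    return [text for _, text in rows[0]] if rows else []


html = """
<TABLE>
  <TR><TH>Year</TH><TH>Make</TH><TH>Model</TH></TR>
  <TR><TD>93</TD><TD>Mercury</TD><TD>Sable</TD></TR>
</TABLE>
"""
parser = TableParser()
parser.feed(html)
print(attribute_row(parser.rows))  # ['Year', 'Make', 'Model']
```

Falling back to the first row mirrors the "first row, first column" cue; a fuller version would also have to test the first column for attribute-on-left tables.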

  9. Methods
     • Form Attribute-Value Pairs
       • AOT (Attribute On Top) Tables and AOL (Attribute On Left) Tables
     (figure labels: AOT/AOL, ATL, AOT/AOL, MA)

  10. Example of attribute-value pairs formed from an AOT table, one record per line:
     Run# 1; Year SEE; Make AVAILABLE; Model TRUCKS; Tran ----; Color ----; Dr ----
     Run# 2; Year 93; Make Mercury; Model Sable; Tran A; Color Green; Dr 4
     Run# 3; Year 94; Make Chevrolet; Model Camaro; Tran A; Color Red; Dr 2
     :
     Run# 1; Year SEE; Make AVAILABLE; Model TRUCKS; Tran ----; Color ----; Dr ----
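
A minimal sketch of how such pairs could be formed, using two rows from the slide. The function names are hypothetical, and treating an AOL table as a transposed AOT table is an assumption of this sketch rather than something the slides state.

```python
def pairs_from_aot(header, data_rows):
    """Pair each value with the attribute at the top of its column."""
    return [list(zip(header, row)) for row in data_rows]


def pairs_from_aol(rows):
    """Handle an attribute-on-left table by transposing it into AOT form.

    Assumption of this sketch: each row starts with its attribute name.
    """
    columns = [list(column) for column in zip(*rows)]
    return pairs_from_aot(columns[0], columns[1:])


# Header and two data rows taken from the slide.
header = ["Run#", "Year", "Make", "Model", "Tran", "Color", "Dr"]
data_rows = [
    ["2", "93", "Mercury", "Sable", "A", "Green", "4"],
    ["3", "94", "Chevrolet", "Camaro", "A", "Red", "2"],
]

for pairs in pairs_from_aot(header, data_rows):
    print(pairs)
```

Each inner list holds the attribute-value pairs of one table row, in column order.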

  11. Methods
     • Form Attribute-Value Pairs
       • AOT (Attribute On Top) Tables and AOL (Attribute On Left) Tables
       • ATL (Attribute On both Top and Left) Tables
       • MA (Multiple Set Attribute) Tables

  12. The City Fuel Economy of 2001 Honda Civic DX
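
The caption above suggests that in an ATL table a value is described by combining its top attribute with its left attribute. A sketch under that assumption follows; `pairs_from_atl` is a hypothetical name, the second row and column labels are invented for illustration, and the cell values are placeholders, not real fuel-economy figures.

```python
def pairs_from_atl(top, left, cells):
    """Label each cell with its (left attribute, top attribute) combination."""
    pairs = []
    for left_attr, row in zip(left, cells):
        for top_attr, value in zip(top, row):
            pairs.append(((left_attr, top_attr), value))
    return pairs


# Only the first label of each axis comes from the slide caption;
# the other labels and all cell values are placeholders.
top = ["City Fuel Economy", "Highway Fuel Economy"]
left = ["2001 Honda Civic DX", "Car B"]
cells = [
    ["v11", "v12"],
    ["v21", "v22"],
]

for (left_attr, top_attr), value in pairs_from_atl(top, left, cells):
    print(f"{top_attr} of {left_attr}: {value}")
```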

  13. Methods
     • Form Attribute-Value Pairs
       • AOT (Attribute On Top) Tables and AOL (Attribute On Left) Tables
       • ATL (Attribute On both Top and Left) Tables
       • MA (Multiple Set Attribute) Tables
     • Adjust Attribute-Value Pairs
       • CD: Yes -> "CD"; Auto: No -> " "
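
The adjustment in the example (CD: Yes becomes "CD", Auto: No becomes blank) can be sketched as below. The word lists and the name `adjust_value` are assumptions of this sketch, and the blank result is returned as an empty string.

```python
# Assumed vocabularies; the slide only shows "Yes" and "No".
AFFIRMATIVE = {"yes", "y"}
NEGATIVE = {"no", "n"}


def adjust_value(attribute, value):
    """Rewrite a yes/no value so the text itself says which feature is present."""
    normalized = value.strip().lower()
    if normalized in AFFIRMATIVE:
        return attribute   # "CD: Yes" -> "CD"
    if normalized in NEGATIVE:
        return ""          # "Auto: No" -> blank
    return value           # any other value is left unchanged


print(repr(adjust_value("CD", "Yes")))       # 'CD'
print(repr(adjust_value("Auto", "No")))      # ''
print(repr(adjust_value("Color", "Green")))  # 'Green'
```

After this step a record mentions a feature only when the car has it, which matters when the extractor looks for feature keywords in the record text.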

  14. Methods
     • Form Records
       • Each record has the form: Attr 1 Value 1; Attr 2 Value 2; . . .; Attr n Value n
       • Some records are followed by the detailed information in the linked page(s)
     • Pre Data Extraction
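
Record formation as described on this slide could look like the sketch below. The name `form_record` and the way linked-page text is appended on a new line are assumptions; the sample pairs come from slide 10 and the linked text is a placeholder.

```python
def form_record(pairs, linked_text=None):
    """Join attribute-value pairs into one record, optionally adding linked-page text."""
    record = "; ".join(f"{attribute} {value}" for attribute, value in pairs)
    if linked_text:
        record += "\n" + linked_text
    return record


pairs = [("Year", "93"), ("Make", "Mercury"), ("Model", "Sable")]
print(form_record(pairs))
print(form_record(pairs, "Detailed information in the linked page(s)"))
```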

  15. Methods
     • Infer General Mapping
     • Extract Data

  16. Experiment
     • Tables of car advertisements from 20 sites
       • 10 training tables
         • Used to develop the ontology
       • 10 testing tables
         • Used to measure recall and precision ratios
         • Before table processing, before training, and after training

  17. Results
     • Mapping ratios:
       • Before table processing: hard to find record boundaries
       • After table processing and before training: 336/490 = 68.57%
       • After table processing and after training: 480/490 = 97.96%
     • Precision and Recall
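
The two percentages follow directly from the counts on the slide. A quick check, with `mapping_ratio` as a hypothetical helper name (the slide does not say what the 490 items are, so the sketch treats them as plain counts):

```python
def mapping_ratio(mapped, total):
    """Return mapped/total as a percentage rounded to two decimals."""
    return round(100 * mapped / total, 2)


before_training = mapping_ratio(336, 490)
after_training = mapping_ratio(480, 490)
print(before_training)  # 68.57
print(after_training)   # 97.96
```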

  18. Conclusion and Future Work
     • Tests so far cover only AOT tables.
     • Experimental results are encouraging: after table processing, the mapping ratio rose from 68.57% before training to 97.96% after training.
     • Next step: table understanding and inferred mapping.
