
Towards Domain-Independent Information Extraction from Web Tables

Towards Domain-Independent Information Extraction from Web Tables. Table Extraction Using Spatial Reasoning in the CSS2 Visual Box Model. Wolfgang Gatterbauer, Paul Bohunsky, Marcus Herzog, Bernhard Krupl, and Bernhard Pollak. Wolfgang Gatterbauer and Paul Bohunsky.


Presentation Transcript


  1. Towards Domain-Independent Information Extraction from Web Tables • Table Extraction Using Spatial Reasoning in the CSS2 Visual Box Model • Wolfgang Gatterbauer, Paul Bohunsky, Marcus Herzog, Bernhard Krupl, and Bernhard Pollak • Wolfgang Gatterbauer and Paul Bohunsky • Database and Artificial Intelligence Group, Vienna University of Technology, Austria • Presented by Aaron Stewart, BYU CS 652

  2. Contributions • Classify visually structured data • Non-tree IE formalism • Argue to defer semantic interpretation of output • Ground truthing method • Web table test set • Visual results

  3. Introduction Source: Gatterbauer et al. 2007

  4. Visually Structured Data on the Web • Tables • Lists • Aligned Graphs

  5. Visually Structured Data on the Web Source: Gatterbauer et al. 2007

  6. Formal Setup • DOM Tree Representation • Visual Box Representation • Visualized Element Nodes (VENs) • DOM nodes with bounding boxes • Visualized Words • Text words with bounding boxes
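
The two visual-box notions on this slide can be pictured as small records: a DOM node or a text word, each paired with its rendered bounding box. The sketch below is only an illustration. The class and field names (`Box`, `VisualizedElementNode`, `VisualizedWord`, `parent`) are assumptions and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in page pixels, origin at the top-left."""
    left: int
    top: int
    right: int
    bottom: int

    @property
    def width(self) -> int:
        return self.right - self.left

    @property
    def height(self) -> int:
        return self.bottom - self.top


@dataclass(frozen=True)
class VisualizedElementNode:
    """A DOM element node together with its rendered bounding box (a VEN)."""
    tag: str
    box: Box


@dataclass(frozen=True)
class VisualizedWord:
    """A single text word with its own bounding box.

    The back-reference to the enclosing VEN is a hypothetical convenience.
    """
    text: str
    box: Box
    parent: VisualizedElementNode


# Example: one table cell containing one word.
cell = VisualizedElementNode("TD", Box(10, 20, 110, 40))
word = VisualizedWord("Price", Box(14, 24, 50, 36), cell)
```

Keeping words and element nodes as separate records mirrors the slide's distinction between the two kinds of boxes.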

  7. Formal Setup Source: Gatterbauer et al. 2007

  8. Information Extraction • Visualized Element Nodes Table extraction (VENTex) • Steps: • Table location • Table recognition • Table interpretation

  9. Information Extraction Source: Gatterbauer et al. 2007

  10. Table Extraction Source: Gatterbauer et al. 2007

  11. Table Extraction • Gather 8 HTML node attributes • For text, add link • Only accept TH, TD, DIV HTML nodes • Tables must form frames • Remove duplicate bounding boxes
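
Two of the filtering rules above (accept only TH, TD, and DIV nodes; remove duplicate bounding boxes) can be sketched as a single pass over the rendered nodes. This is a minimal sketch under assumptions: nodes are represented as `(tag, box)` tuples with `box = (left, top, right, bottom)`, and when several nodes share one box the first one encountered is kept. The slides do not say which duplicate survives.

```python
ACCEPTED_TAGS = {"TH", "TD", "DIV"}


def candidate_cells(nodes):
    """Filter rendered nodes down to candidate table cells.

    nodes: iterable of (tag, (left, top, right, bottom)) tuples.
    Returns a list of (TAG, box) with unaccepted tags dropped and
    repeated bounding boxes removed (first occurrence wins, by assumption).
    """
    seen_boxes = set()
    kept = []
    for tag, box in nodes:
        tag = tag.upper()
        if tag not in ACCEPTED_TAGS:
            continue  # e.g. SPAN, P, TABLE are ignored
        if box in seen_boxes:
            continue  # another node already occupies exactly this box
        seen_boxes.add(box)
        kept.append((tag, box))
    return kept


rendered = [
    ("td", (0, 0, 10, 10)),
    ("div", (0, 0, 10, 10)),   # same box as the TD above
    ("span", (0, 0, 5, 5)),    # tag not accepted
    ("TH", (10, 0, 20, 10)),
]
cells = candidate_cells(rendered)
```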

  12. Table Extraction • Adjacency: 3 pixels • LOCATEFRAMES algorithm • No overlapping cells • Minimum 3 rows, 2 columns • Remove empty rows/columns (spacers)
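
The geometric constraints on this slide can be written as small predicates. The sketch below is an assumption-laden illustration, not the authors' code. Boxes are `(left, top, right, bottom)` tuples, adjacency is read as "edges within 3 pixels of each other", and a recognized table is a rectangular list of rows of strings. The function names are hypothetical.

```python
ADJACENCY_TOLERANCE_PX = 3
MIN_ROWS, MIN_COLS = 3, 2


def overlaps(a, b):
    """True if two boxes share interior area (touching edges do not count)."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def right_neighbor(a, b, tol=ADJACENCY_TOLERANCE_PX):
    """True if b's left edge lies within tol pixels of a's right edge
    and the two boxes share some vertical extent."""
    return abs(b[0] - a[2]) <= tol and a[1] < b[3] and b[1] < a[3]


def drop_spacers(grid):
    """Remove rows and columns whose cells are all empty or whitespace."""
    rows = [row for row in grid if any(cell.strip() for cell in row)]
    if not rows:
        return []
    keep = [j for j in range(len(rows[0])) if any(row[j].strip() for row in rows)]
    return [[row[j] for j in keep] for row in rows]


def large_enough(grid):
    """Apply the minimum size of 3 rows and 2 columns."""
    return len(grid) >= MIN_ROWS and len(grid[0]) >= MIN_COLS


a = (0, 0, 100, 20)
b = (102, 0, 200, 20)   # 2 px gap: adjacent
c = (110, 0, 200, 20)   # 10 px gap: not adjacent

raw = [
    ["Name", "", "Price"],
    ["", "", ""],
    ["A", "", "1"],
    ["B", " ", "2"],
]
cleaned = drop_spacers(raw)
```

Whether the size check runs before or after spacer removal is not stated on the slide, so the two functions are deliberately kept independent.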

  13. LOCATEFRAMES Algorithm (earlier paper) • Visual table model • Expansion algorithm

  14. Visual Table Model Source: Gatterbauer et al. 2007

  15. Double Topographical Grid??? • Two origins • Upper left corner • Lower right corner • Sorted lists of pixel positions • The numbers are indices • But pixels remain in regular coordinates
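
One possible reading of the bullets above: the distinct pixel positions of box corners are collected into sorted lists, and each box is then addressed by its indices in those lists rather than by raw pixels, once for its upper-left corner and once for its lower-right corner. The sketch below follows that reading only. The function name, the return shape, and the choice to count all indices from the left and top are assumptions, and the paper's actual structure may differ.

```python
def build_double_grid(boxes):
    """Index boxes by corner position.

    boxes: list of (left, top, right, bottom) in pixels.
    Returns (axes, placed), where axes holds the four sorted pixel lists
    and placed maps each box to ((col_ul, row_ul), (col_lr, row_lr)).
    """
    # Sorted lists of distinct pixel positions: the values stay in pixels.
    axes = {
        "left": sorted({b[0] for b in boxes}),
        "top": sorted({b[1] for b in boxes}),
        "right": sorted({b[2] for b in boxes}),
        "bottom": sorted({b[3] for b in boxes}),
    }
    # Lookup tables from pixel position to index in the sorted list.
    index = {name: {px: i for i, px in enumerate(vals)} for name, vals in axes.items()}

    placed = {}
    for b in boxes:
        upper_left = (index["left"][b[0]], index["top"][b[1]])
        lower_right = (index["right"][b[2]], index["bottom"][b[3]])
        placed[b] = (upper_left, lower_right)
    return axes, placed


# Four cells laid out 2 x 2 with 2 px gutters.
boxes = [
    (0, 0, 50, 20), (52, 0, 100, 20),
    (0, 22, 50, 40), (52, 22, 100, 40),
]
axes, placed = build_double_grid(boxes)
```

In this example the gutters disappear in index space: the four cells occupy grid positions (0,0), (1,0), (0,1), and (1,1), while `axes` still records the original pixel coordinates.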

  16. Neighbor Relations Source: Gatterbauer et al. 2007

  17. Neighbor Relations • Expand to include neighbors 1,2,3,4 • within or equal • Not bigger • Not outside • Not stepped

  18. Expansion Algorithm Source: Gatterbauer et al. 2007

  19. Basic Algorithm • http://www.dbai.tuwien.ac.at/staff/gatter/work/AAAI_2006_Presentation_Table_Extraction_Spatial_Reasoning.pdf

  20. Table Interpretation • Argument • Few details about the method actually used • Take data as it comes • Pass it on to a later semantic processing stage

  21. Table Interpretation Source: Gatterbauer et al. 2007

  22. Performance • Load + render: O(n) • Double topographical grid: O(n sqrt(n)) • About 5 seconds per page

  23. Web Table Ground Truthing • Tool to copy web pages • (not easy!) • http://www.dbai.tuwien.ac.at/user/pollak/webpagedump • Students selected and submitted pages • 493 web tables • 269 web pages • 63 students • http://www.dbai.tuwien.ac.at/staff/gatter/ventex/

  24. Experimental Results Source: Gatterbauer et al. 2007

  25. Future Work • Table extraction • Table interpretation • Nested substructures • Other visually structured data • Information integration Source: Gatterbauer et al. 2007

  26. My Conclusions • Useful table-building algorithm • For electronic data only • Requires strict alignment • Could be expanded • Other electronic formats (PDF, even ASCII text) • Probabilistic model for jitter
