This document presents the methodology and challenges of table understanding within the DIADEM framework, focusing on key processes for recognizing, analyzing, and interpreting tabular structures. It defines tables based on specific criteria and explores the extraction of structured data, the identification of layout versus data tables, and the rules for classifying "sane" tables. The implementation of machine learning techniques for refining recognition accuracy is also discussed, along with evaluation metrics like precision and recall, emphasizing the importance of visual and semantic information in enhancing extraction processes in both HTML and CSS formatted tables.
Table Understanding in DIADEM • Giorgio Orsi (1,2) and Ben Watson (2) • (1) Institute for the Future of Computing, University of Oxford • (2) Department of Computer Science, University of Oxford • DIADEM 1.0
Table Understanding • Process that • locates (or recognizes), • analyses and • interprets • a tabular structure with the goal of • classifying it (layout vs data tables), • extracting data, • translating it, or • other.
What is a Table? • Penn et al. ’01 • a 2D assembly of cells, where • each cell is short in length and • contains no complex structures, and • there is semantic and syntactic coherence within the rows and columns.
What Information do we Have? • HTML • CSS Boxes • Domain
<table border="1">
  <tbody>
    <tr> <th colspan="2">NAME</th> <th rowspan="2">D.O.B.</th> </tr>
    <tr> <th>FIRST NAME</th> <th>SURNAME</th> </tr>
    <tr> <td>Sue</td> <td>Adams</td> <td>12th June 1980</td> </tr>
    <tr> <td>Jim</td> <td>Wright</td> <td>19th May 2000</td> </tr>
  </tbody>
</table>
[Diagram: domain model] ox:person with ox:firstName (xsd:string), ox:surname (xsd:string) and ox:dob (xsd:date)
Why Table Understanding in DIADEM • recognize and extract data in tabular format • layout tables • data tables • understand forms and result-pages • labelling • segmentation • let us focus first on HTML tables (e.g., <table>)
Leaf Tables • Goal: determine whether a table contains any inner table • an inner table indicates a layout table • recursive check • if T1 contains T2 (e.g., there is a <table> element in the subtree rooted in T1), then T1 is a layout table.
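The nesting check above can be sketched in a few lines. This is only an illustration, not the DIADEM implementation: it assumes the markup is well-formed enough for Python's `xml.etree.ElementTree`, and the function name `is_layout_by_nesting` is hypothetical.

```python
import xml.etree.ElementTree as ET

def is_layout_by_nesting(table: ET.Element) -> bool:
    """Return True if a <table> element occurs anywhere in the subtree of `table`."""
    # Element.iter() yields the element itself first, so exclude it explicitly.
    return any(el is not table and el.tag.lower() == "table" for el in table.iter())

# Hypothetical example: an outer table wrapping an inner (leaf) table.
outer = ET.fromstring(
    "<table><tr><td>"
    "<table><tr><td>a</td><td>b</td></tr></table>"
    "</td></tr></table>"
)
inner = outer.find(".//table")

print(is_layout_by_nesting(outer))  # the outer table contains a table
print(is_layout_by_nesting(inner))  # the inner table is a leaf
```

Real-world HTML is rarely well-formed XML, so a tolerant HTML parser would be needed in practice.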
Row and Column count • Goal: identify “sane” tables • at least two coherent adjacent cells (TD, DIV, TH) • e.g., two data cells, two header cells, or one header cell and one data cell • allow 1D tables (i.e., vectors) • allow empty tables
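One possible reading of the adjacency requirement is sketched below. It is an assumption-laden illustration: only `<td>` and `<th>` cells inside `<tr>` rows are considered (DIV-based cells and the "allow empty tables" case are left out), the input is assumed to be a well-formed leaf table, and `has_adjacent_cells` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

CELL_TAGS = {"td", "th"}

def cell_rows(table: ET.Element):
    """Rows of the table as lists of cell elements (assumes a leaf table)."""
    return [[c for c in tr if c.tag in CELL_TAGS] for tr in table.iter("tr")]

def has_adjacent_cells(table: ET.Element) -> bool:
    """True if two cells are adjacent horizontally or vertically."""
    rows = cell_rows(table)
    horizontal = any(len(r) >= 2 for r in rows)            # row vector or wider
    vertical = any(a and b for a, b in zip(rows, rows[1:]))  # column vector or taller
    return horizontal or vertical

row_vector = ET.fromstring("<table><tr><th>key</th><td>value</td></tr></table>")
col_vector = ET.fromstring("<table><tr><td>a</td></tr><tr><td>b</td></tr></table>")
single = ET.fromstring("<table><tr><td>only</td></tr></table>")
```

Both 1D cases (a single row or a single column) pass, while a table with one lone cell does not.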
Longest String • Goal: identify “sane” cells • find the longest string w in every cell; T is a data table if |w|<δ • layout tables are likely to contain a large amount of text • ignore text nodes associated with <SELECT>, <FORM> and <TABLE> elements • in their subtree • in their siblings
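A minimal sketch of the longest-string rule follows. The slides do not give a value for δ, so the default `delta=80` is purely a placeholder; the sketch skips text inside `<select>`, `<form>` and `<table>` subtrees only (the sibling case is not handled), assumes a well-formed leaf table, and uses hypothetical function names.

```python
import xml.etree.ElementTree as ET

IGNORED = {"select", "form", "table"}

def text_nodes(el: ET.Element):
    """Yield stripped text nodes below `el`, skipping subtrees rooted at IGNORED tags."""
    if el.text and el.text.strip():
        yield el.text.strip()
    for child in el:
        if child.tag.lower() not in IGNORED:
            yield from text_nodes(child)
        # The tail follows the child but belongs to `el`, so it is always kept.
        if child.tail and child.tail.strip():
            yield child.tail.strip()

def longest_string(table: ET.Element) -> int:
    """Length of the longest text node found in any cell of the table."""
    cells = [c for tr in table.iter("tr") for c in tr if c.tag in ("td", "th")]
    return max((len(t) for c in cells for t in text_nodes(c)), default=0)

def passes_longest_string(table: ET.Element, delta: int = 80) -> bool:
    """delta is an assumed threshold; the source does not state one."""
    return longest_string(table) < delta

table = ET.fromstring(
    "<table><tr><td>Sue</td><td>Adams<select><option>"
    + "x" * 200
    + "</option></select></td></tr></table>"
)
```

Here the 200-character option text is ignored, so the longest string is "Adams".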
Empty Cell • Goal: identify “sane” cells • find empty cells; T is a data table if it contains no empty cells • layout tables are likely to contain empty cells
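The empty-cell rule could look roughly like this. The definition of "empty" is an assumption (no child elements and only whitespace text), as are the function names; the input is again assumed to be a well-formed leaf table.

```python
import xml.etree.ElementTree as ET

def empty_cells(table: ET.Element):
    """Cells with no child elements and no non-whitespace text."""
    cells = [c for tr in table.iter("tr") for c in tr if c.tag in ("td", "th")]
    return [c for c in cells if len(c) == 0 and not "".join(c.itertext()).strip()]

def passes_empty_cell_rule(table: ET.Element) -> bool:
    """True if the table has no empty cells."""
    return not empty_cells(table)

full = ET.fromstring("<table><tr><td>Sue</td><td>Adams</td></tr></table>")
gappy = ET.fromstring("<table><tr><td>Sue</td><td> </td></tr></table>")
```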
TH Check • Goal: identify “sane” tables • find <TH> elements in a table • layout tables are not likely to contain <TH> elements
Picture • Goal: identify “sane” cells • check the size of pictures in a cell • T is a data table if p-area<δ • layout tables are likely to contain large pictures • e.g., ads and logos
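As a rough illustration of the picture rule, the sketch below reads picture sizes from the `width` and `height` attributes of `<img>` elements. That is an assumption made for brevity: a real system would more likely use rendered box sizes. The threshold `delta=10_000` (square pixels) and the function names are also hypothetical.

```python
import xml.etree.ElementTree as ET

def max_picture_area(table: ET.Element) -> int:
    """Largest width*height over <img> elements that declare numeric sizes."""
    areas = []
    for img in table.iter("img"):
        try:
            areas.append(int(img.get("width", "")) * int(img.get("height", "")))
        except ValueError:
            continue  # size missing or not a plain integer; skipped in this sketch
    return max(areas, default=0)

def passes_picture_rule(table: ET.Element, delta: int = 10_000) -> bool:
    """delta is an assumed area threshold; the source does not state one."""
    return max_picture_area(table) < delta

icons = ET.fromstring(
    '<table><tr><td><img width="16" height="16"/></td><td>Sue</td></tr></table>'
)
banner = ET.fromstring(
    '<table><tr><td><img width="728" height="90"/></td><td>menu</td></tr></table>'
)
```

A small icon passes the rule, while a banner-sized image (as in an ad or logo) fails it.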
Combining Rules • Identify the combination of rules that maximizes the recognition accuracy • cut-off estimation • best-guess estimation • if T passes all the rules, it is a data table • cut-off calculation • cut-off = performance of each rule • if T passes all the rules, it is a data table • machine learning • decision trees (white-box model)
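The "all rules in AND" combination, with the option of switching individual rules off, can be sketched as follows. The feature names, thresholds and rule set are illustrative assumptions rather than the values used in DIADEM, and the decision-tree variant is not shown.

```python
from typing import Callable, Dict, Iterable

Features = Dict[str, float]
Rule = Callable[[Features], bool]

# Hypothetical rule set over precomputed features; thresholds are placeholders.
RULES: Dict[str, Rule] = {
    "leaf": lambda f: f["nested_tables"] == 0,
    "table_size": lambda f: f["adjacent_cells"] >= 2,
    "longest_string": lambda f: f["longest_string"] < 80,
    "empty_cell": lambda f: f["empty_cells"] == 0,
    "picture": lambda f: f["max_picture_area"] < 10_000,
}

def is_data_table(features: Features, disabled: Iterable[str] = ()) -> bool:
    """AND of all rules that are not listed in `disabled`."""
    off = set(disabled)
    return all(rule(features) for name, rule in RULES.items() if name not in off)

# Hypothetical table: sane in every respect except for one empty cell.
example = {
    "nested_tables": 0,
    "adjacent_cells": 6,
    "longest_string": 14,
    "empty_cells": 1,
    "max_picture_area": 0,
}

print(is_data_table(example))                           # rejected by the empty-cell rule
print(is_data_table(example, disabled={"empty_cell"}))  # accepted once that rule is off
```

Disabling rules one at a time in this way corresponds to running the classifier with different rule subsets and comparing the resulting accuracy.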
Evaluation: Cut-Off Estimation • First run: all rules in AND • Second run: no empty cell • Third run: no empty cell, no table size • Fourth run: no empty cell, no table size, no picture rule
Evaluation: Cut-Off Computation • First run: all rules in AND • Second run: no empty cell, no table size
Evaluation: Decision Tree • Facts: • 65% training • 35% 10-fold validation • precision: 0.807 • recall: 0.836 • F-measure: 0.821 • Comparison: • F-Measure 0.740 (Gatterbauer)
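As a quick consistency check, the reported F-measure is the harmonic mean of the reported precision and recall. The helper name `f_measure` is only illustrative.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Values reported on the slide for the decision tree.
f1 = f_measure(0.807, 0.836)
print(round(f1, 3))  # 0.821
```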
Discussion • Most of the errors are caused by missing information or a bad combination of rules. • use visual and semantic information • combine the heuristics in an “organic” way • PDF-inspired extraction • guided by the HTML and CSS structure. • use a reference model as in form and result-page analysis