Table Understanding in DIADEM

Giorgio Orsi (1,2) and Ben Watson (2)
(1) Institute for the Future of Computing, University of Oxford
(2) Department of Computer Science, University of Oxford
DIADEM 1.0

Table Understanding
• A process that
  • locates (or recognizes),
  • analyses, and
  • interprets
• a tabular structure, with the goal of
  • classifying it (layout vs. data table),
  • extracting data,
  • translating it, or
  • other tasks.

What is a Table?
• Penn et al. '01:
  • a 2D assembly of cells, where
  • each cell is short in length and
  • contains no complex structures, and
  • there is semantic and syntactic coherence within the rows and columns.

What Information do we Have?
• HTML
• CSS Boxes

    <table border="1">
      <tbody>
        <tr> <th colspan="2">NAME</th> <th rowspan="2">D.O.B.</th> </tr>
        <tr> <th>FIRST NAME</th> <th>SURNAME</th> </tr>
        <tr> <td>Sue</td> <td>Adams</td> <td>12th June 1980</td> </tr>
        <tr> <td>Jim</td> <td>Wright</td> <td>19th May 2000</td> </tr>
      </tbody>
    </table>

• Domain
  • ox:person
    • ox:firstName (xsd:string)
    • ox:surname (xsd:string)
    • ox:dob (xsd:date)

Why Table Understanding in DIADEM
• recognize and extract data in tabular format
  • layout tables
  • data tables
• understand forms and result pages
  • labelling
  • segmentation
• let us focus first on HTML tables (e.g., <table>)

Leaf Tables
• Goal: determine whether a table contains any inner table (recursive check)
• if T1 contains T2 (e.g., there is a <table> element in the subtree rooted in T1), then T1 is a layout table.
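The recursive check can be sketched in a few lines. This is a minimal sketch, not DIADEM's implementation: it assumes well-formed, lowercase XHTML so that Python's standard `xml.etree.ElementTree` can parse it, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def contains_inner_table(table: ET.Element) -> bool:
    """Return True if a <table> element occurs anywhere below `table`."""
    # iter() yields the element itself first, so exclude it explicitly.
    return any(node is not table and node.tag == "table" for node in table.iter())


def is_leaf_table(table: ET.Element) -> bool:
    """A leaf table has no inner table; a non-leaf table is treated as layout."""
    return not contains_inner_table(table)


nested = ET.fromstring(
    "<table><tr><td><table><tr><td>x</td></tr></table></td></tr></table>"
)
flat = ET.fromstring("<table><tr><td>Sue</td><td>Adams</td></tr></table>")

print(is_leaf_table(nested))  # False: the outer table wraps another table
print(is_leaf_table(flat))    # True
```

Real-world HTML is rarely well-formed, so a tolerant HTML parser would be needed in practice.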

Row and Column count
• Goal: identify “sane” tables
• at least two coherent adjacent cells (TD, DIV, TH)
  • e.g., two data cells, two header cells, or one header cell and one data cell
• allow 1D tables (i.e., vectors)
• allow empty tables
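One possible reading of the adjacency test is sketched below. The assumptions are mine, not the slides': only `<tr>`/`<td>`/`<th>` markup is handled (DIV-based tables are ignored), two cells count as adjacent if they share a row or sit in consecutive non-empty rows, and the table is a leaf table. All names are hypothetical.

```python
import xml.etree.ElementTree as ET

CELL_TAGS = {"td", "th"}


def row_cell_counts(table: ET.Element) -> list:
    """Number of direct <td>/<th> children for each <tr> in the table."""
    return [sum(1 for cell in row if cell.tag in CELL_TAGS) for row in table.iter("tr")]


def has_two_adjacent_cells(table: ET.Element) -> bool:
    """True if two cells are adjacent horizontally or vertically."""
    counts = [n for n in row_cell_counts(table) if n > 0]
    horizontal = any(n >= 2 for n in counts)  # two cells side by side in one row
    vertical = len(counts) >= 2               # two cells stacked in a column
    return horizontal or vertical


row_vector = ET.fromstring("<table><tr><th>Name</th><td>Sue</td></tr></table>")
col_vector = ET.fromstring("<table><tr><td>a</td></tr><tr><td>b</td></tr></table>")
single = ET.fromstring("<table><tr><td>a</td></tr></table>")

for t in (row_vector, col_vector, single):
    print(has_two_adjacent_cells(t))
```

Both 1D cases (a single row or a single column) pass under this reading, while a one-cell table does not.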

Longest String
• Goal: identify “sane” cells
• find the longest string w in every cell; T is a data table if |w| < δ
• layout tables are likely to contain a large amount of text
• ignore text nodes associated with <SELECT>, <FORM> and <TABLE>
  • in their subtree
  • in their siblings
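A rough sketch of the longest-string test follows. It assumes well-formed lowercase XHTML and a leaf table, skips only the subtrees of the ignored elements (sibling text is kept, which is a simplification), and uses an arbitrary placeholder value for δ. The names and the default threshold are hypothetical.

```python
import xml.etree.ElementTree as ET

CELL_TAGS = {"td", "th"}
IGNORED = {"select", "form", "table"}


def text_nodes(elem: ET.Element):
    """Yield stripped text nodes below `elem`, skipping ignored subtrees."""
    if elem.text and elem.text.strip():
        yield elem.text.strip()
    for child in elem:
        if child.tag not in IGNORED:
            yield from text_nodes(child)
        # Text that follows a child (its tail) belongs to `elem`, so keep it.
        if child.tail and child.tail.strip():
            yield child.tail.strip()


def longest_string_length(table: ET.Element) -> int:
    """Length |w| of the longest text node found in any cell of the table."""
    lengths = [
        len(text)
        for cell in table.iter()
        if cell.tag in CELL_TAGS
        for text in text_nodes(cell)
    ]
    return max(lengths, default=0)


def passes_longest_string(table: ET.Element, delta: int = 40) -> bool:
    """T is treated as a data table if |w| < delta (delta is a placeholder)."""
    return longest_string_length(table) < delta


data = ET.fromstring("<table><tr><td>Sue</td><td>12th June 1980</td></tr></table>")
print(longest_string_length(data))  # 14
```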

Empty Cell
• Goal: identify “sane” cells
• find empty cells; T is a data table if it contains no empty cells
• layout tables are likely to contain empty cells
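The slides do not define "empty", so the sketch below makes an assumption: a cell is empty when it has no non-whitespace text and no child elements (so a cell holding only an image is not empty). Function names are hypothetical and the input is assumed to be well-formed lowercase XHTML.

```python
import xml.etree.ElementTree as ET

CELL_TAGS = {"td", "th"}


def is_empty_cell(cell: ET.Element) -> bool:
    """Empty = no non-whitespace text anywhere inside and no child elements."""
    has_text = any(text.strip() for text in cell.itertext())
    has_children = len(cell) > 0
    return not has_text and not has_children


def has_no_empty_cells(table: ET.Element) -> bool:
    """The rule passes (data table) only if no cell is empty."""
    cells = [c for c in table.iter() if c.tag in CELL_TAGS]
    return not any(is_empty_cell(c) for c in cells)


spacer = ET.fromstring("<table><tr><td> </td><td>content</td></tr></table>")
data = ET.fromstring("<table><tr><td>Sue</td><td>Adams</td></tr></table>")

print(has_no_empty_cells(spacer))  # False: the first cell is whitespace only
print(has_no_empty_cells(data))    # True
```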

TH Check
• Goal: identify “sane” tables
• find <TH> elements in a table
• layout tables are not likely to contain <TH> elements
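As a minimal sketch, the TH check reduces to a single lookup. It assumes well-formed lowercase XHTML, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def has_header_cells(table: ET.Element) -> bool:
    """True if at least one <th> element occurs inside the table."""
    return next(table.iter("th"), None) is not None


with_headers = ET.fromstring(
    "<table><tr><th>FIRST NAME</th><th>SURNAME</th></tr>"
    "<tr><td>Sue</td><td>Adams</td></tr></table>"
)
without_headers = ET.fromstring("<table><tr><td>menu</td><td>body</td></tr></table>")

print(has_header_cells(with_headers))     # True
print(has_header_cells(without_headers))  # False
```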

Largest Cell

Picture
• Goal: identify “sane” cells
• check the size of the pictures in a cell
• T is a data table if p-area < δ
• layout tables are likely to contain large pictures
  • e.g., ads and logos
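The picture test could look roughly like this. The sketch reads the area from the `width` and `height` attributes of `<img>` elements, which is an assumption made for illustration (the rendered CSS boxes would be the more reliable source), and the threshold value is an arbitrary placeholder for δ.

```python
import xml.etree.ElementTree as ET


def picture_area(img: ET.Element) -> int:
    """Area in pixels from width/height attributes; 0 if missing or non-numeric."""
    try:
        return int(img.get("width", "0")) * int(img.get("height", "0"))
    except ValueError:
        return 0


def passes_picture_rule(table: ET.Element, delta: int = 10_000) -> bool:
    """T is treated as a data table if every picture area is below delta."""
    return all(picture_area(img) < delta for img in table.iter("img"))


icon = ET.fromstring(
    "<table><tr><td><img src='flag.png' width='16' height='16'/></td></tr></table>"
)
banner = ET.fromstring(
    "<table><tr><td><img src='ad.png' width='728' height='90'/></td></tr></table>"
)

print(passes_picture_rule(icon))    # True: 256 px
print(passes_picture_rule(banner))  # False: 65520 px
```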

Table Size

Combining Rules
• Identify the combination of rules that maximizes the recognition accuracy
• cut-off estimation
  • best-guess estimation
  • if T passes all the rules → data table
• cut-off calculation
  • cut-off = performance of each rule
  • if T passes all the rules → data table
• machine learning
  • decision trees → white-box model
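The "passes all the rules" combination, with the option of dropping individual rules, can be sketched as a conjunction over per-rule outcomes. The rule names and the sample outcomes below are hypothetical and only illustrate the mechanism; they are not results from the evaluation.

```python
from typing import Dict, Iterable


def is_data_table(outcomes: Dict[str, bool], dropped: Iterable[str] = ()) -> bool:
    """AND together the outcomes of all rules that have not been dropped."""
    dropped = set(dropped)
    return all(passed for rule, passed in outcomes.items() if rule not in dropped)


# Hypothetical outcomes for one table: it fails only the empty-cell rule.
outcomes = {
    "leaf_table": True,
    "row_column_count": True,
    "longest_string": True,
    "empty_cell": False,
    "th_check": True,
    "largest_cell": True,
    "picture": True,
    "table_size": True,
}

print(is_data_table(outcomes))                          # False
print(is_data_table(outcomes, dropped=["empty_cell"]))  # True
```

Dropping a rule corresponds to a run in which that rule is left out of the conjunction.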

Evaluation: Cut-Off Estimation
• First run: all rules in AND
• Second run: no empty cell
• Third run: no empty cell, no table size
• Fourth run: no empty cell, no table size, no picture rule

Evaluation: Cut-Off Computation
• First run: all rules in AND
• Second run: no empty cell, no table size

Evaluation: Decision Tree
• Facts:
  • 65% training
  • 35% 10-fold validation
  • precision: 0.807
  • recall: 0.836
  • F-measure: 0.821
• Comparison:
  • F-measure: 0.740 (Gatterbauer)
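As a quick consistency check, and assuming the F-measure here is the balanced F1 score (the harmonic mean of precision and recall), the reported figures agree with each other:

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


print(round(f_measure(0.807, 0.836), 3))  # 0.821
```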

Discussion
• Most of the errors are caused by missing information or a bad combination of rules.
  • use visual and semantic information
  • combine the heuristics in an “organic” way
• PDF-inspired extraction
  • guided by the HTML and CSS structure.
• use a reference model, as in form and result-page analysis

Thank you!
