Table Extraction Using MaxEnt
Presentation Transcript
Table Extraction Using MaxEnt Zonghui Lian
Introduction • Table extraction • Table format
Problem • HTML tables • Tags help us understand them • What about plain-text tables?
An Example • title title title • separator • header header header header • datarow datarow datarow • datarow datarow datarow
MaxEnt • How to define features • How to learn model weights
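The two questions on this slide can be illustrated with a minimal sketch of a maximum entropy (multinomial logistic) classifier. It is not the presenter's implementation. The function names, the toy feature `digit_pct`, the `bias` feature, the learning rate and the four training examples are all assumptions made for illustration.

```python
import math
from collections import defaultdict

def predict_proba(weights, features, labels):
    """P(label | line) = exp(sum_f w[label, f] * value_f) / Z."""
    scores = {y: sum(weights[(y, f)] * v for f, v in features.items())
              for y in labels}
    top = max(scores.values())          # subtract the max for numerical stability
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}

def train(data, labels, epochs=200, lr=0.5):
    """Batch gradient ascent on the conditional log-likelihood.

    The gradient for each (label, feature) weight is the observed
    feature count minus the count expected under the current model.
    """
    weights = defaultdict(float)
    for _ in range(epochs):
        grad = defaultdict(float)
        for features, gold in data:
            probs = predict_proba(weights, features, labels)
            for f, v in features.items():
                grad[(gold, f)] += v                 # observed
                for y in labels:
                    grad[(y, f)] -= probs[y] * v     # expected
        for key, g in grad.items():
            weights[key] += lr * g / len(data)
    return weights

# Invented toy data: each line is reduced to a feature dict plus a gold label.
LABELS = ["DATAROW", "TABLEHEADER"]
toy = [
    ({"digit_pct": 0.6, "bias": 1.0}, "DATAROW"),
    ({"digit_pct": 0.5, "bias": 1.0}, "DATAROW"),
    ({"digit_pct": 0.0, "bias": 1.0}, "TABLEHEADER"),
    ({"digit_pct": 0.1, "bias": 1.0}, "TABLEHEADER"),
]
w = train(toy, LABELS)
probs = predict_proba(w, {"digit_pct": 0.7, "bias": 1.0}, LABELS)
```

Here `probs` maps each label to its probability for an unseen digit-heavy line. A real system would typically add regularization and use an optimizer such as L-BFGS rather than plain gradient ascent.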
Data Set • CS Dept., University of Massachusetts Amherst (FedStats.gov) • Training data: 9321 • Test data: 1200 • Format
Features • White space • Large gaps / small gaps • Four-space indents • Space percentage • Text features • Digit percentage • Month and year
Features • Special characters -, +, =, :, |, .
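A rough sketch of how the white-space, text and special-character features listed above might be computed for a single line of text. The slides do not give the exact definitions, so the gap threshold (`large_gap=4`), the month and year patterns, the feature names and the function `line_features` are all assumptions.

```python
import re

# Crude patterns (assumptions): month names or abbreviations, and 4-digit years.
# The month pattern also matches words such as "market" or "maybe".
MONTHS = re.compile(
    r"\b(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[a-z]*\b", re.I)
YEAR = re.compile(r"\b(19|20)\d{2}\b")
SPECIAL = "-+=:|."   # the special characters listed on the slide

def line_features(line, large_gap=4):
    """Map one line of plain text to a dict of numeric features."""
    n = max(len(line), 1)                       # avoid division by zero
    # A "gap" here is a run of 2+ spaces between non-space characters.
    gaps = re.findall(r"(?<=\S) {2,}(?=\S)", line)
    feats = {
        "space_pct": line.count(" ") / n,
        "digit_pct": sum(c.isdigit() for c in line) / n,
        "large_gaps": sum(len(g) >= large_gap for g in gaps),
        "small_gaps": sum(len(g) < large_gap for g in gaps),
        "four_space_indent": float(line.startswith("    ")),
        "has_month": float(bool(MONTHS.search(line))),
        "has_year": float(bool(YEAR.search(line))),
    }
    for ch in SPECIAL:
        feats[f"pct_{ch}"] = line.count(ch) / n  # share of the line made of ch
    return feats

row = line_features("    Corn      1999   120")   # invented data row
sep = line_features("-----")                      # invented separator line
```

In this sketch a separator line stands out through `pct_-`, while data rows tend to have a higher `digit_pct` and more gaps than title or footnote lines.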
Error Analysis • Most errors happened when recognizing … [TABLEFOOTNOTE: 0.2719665271966527, DATAROW: 0.12552301255230125, TABLEHEADER: 0.11715481171548117] • TABLEFOOTNOTE -> NONTABLE, DATAROW • DATAROW -> SECTIONDATAROW • TABLEHEADER -> SUPERHEADER • Examples: TABLEFOOTNOTE 1 Includes Hawaii. • TABLEFOOTNOTE 2 Includes processing total for dual usage crops.
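The slide does not say how its per-label figures were computed. One plausible reading, sketched here as an assumption, is a per-label error rate (misclassified lines of a label divided by all lines with that gold label) together with a count of which wrong labels were predicted. The function name and the label sequences below are invented for illustration and are not the presenter's data.

```python
from collections import Counter, defaultdict

def error_analysis(gold, pred):
    """Return per-label error rates and a gold -> predicted confusion count."""
    totals = Counter(gold)
    confusions = defaultdict(Counter)
    for g, p in zip(gold, pred):
        if g != p:
            confusions[g][p] += 1            # gold label g was predicted as p
    rates = {g: sum(confusions[g].values()) / totals[g] for g in totals}
    return rates, confusions

# Invented example sequences (not results from the presentation).
gold = ["TABLEFOOTNOTE", "TABLEFOOTNOTE", "DATAROW", "DATAROW",
        "DATAROW", "DATAROW", "TABLEHEADER"]
pred = ["NONTABLE", "TABLEFOOTNOTE", "DATAROW", "SECTIONDATAROW",
        "DATAROW", "DATAROW", "TABLEHEADER"]
rates, confusions = error_analysis(gold, pred)
```

Sorting `rates` in descending order gives a ranking of the labels that are hardest to recognize, and `confusions[label].most_common()` lists what each one is mistaken for.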
Future Work • Improve performance • Features, for example: alphabetic characters, previous label, next label • Data set size
Future Work • Identify columns • Add tags • Use a table understanding algorithm