Table Extraction Using Conditional Random Fields

D. Pinto, A. McCallum, X. Wei and W. Bruce Croft, SIGIR 2003. Presented by Vitor R. Carvalho, March 15th, 2004.


Presentation Transcript


  1. Table Extraction Using Conditional Random Fields • D. Pinto, A. McCallum, X. Wei and W. Bruce Croft, SIGIR 2003 • Presented by Vitor R. Carvalho, March 15th, 2004

  2. Warm up
     • Why table extraction?
     • Applications: question answering, data mining and IR
     • Tables: “textual tokens laid out in tabular form”
     • Tables: “databases designed for human eyes”
     • Related work:
       • Pyreddy and Croft, 1997: purely layout-based approach; a Character Alignment Graph (CAG) is used to identify the whole table
       • Ng et al., 1999: machine learning to identify row and column positions; no extraction of content
       • Hurst, 2000: combination of layout and language perspectives; text is broken into blocks using spatial and linguistic evidence
       • Pinto et al., 2002: heuristic method, based on the CAG, to extract table cells for a QA system

  3. Objectives
     • In this paper:
       • Only text tables are studied, not HTML tables
       • Table extraction can be broken down into 6 subproblems:
         • Locate the table (*)
         • Identify the row positions and types (*)
         • Identify the column positions and types
         • Segment tables into cells
         • Tag cells as data or headers
         • Associate data cells with their corresponding headers
       • Only the tasks marked (*) are addressed in the paper
       • CRFs are compared to Maximum Entropy models and to HMMs

  4. Example • From www.FedStats.com, July 2001

  5. 12 Line Labels
     • Non-extraction labels: { NONTABLE, BLANKLINE, SEPARATOR }
     • Header labels: { TITLE, SUPERHEADER, TABLEHEADER, SUBHEADER, SECTIONHEADER }
     • Data row labels: { DATAROW, SECTIONDATAROW }
     • Caption labels: { TABLEFOOTNOTE, TABLECAPTION }
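
The label inventory on this slide can be written down as a small lookup structure. The following is a minimal Python sketch; the names `LineLabel`, `LABEL_GROUPS` and `group_of` are illustrative and do not come from the paper.

```python
from enum import Enum


class LineLabel(Enum):
    """The 12 line labels listed on the slide."""
    NONTABLE = "NONTABLE"
    BLANKLINE = "BLANKLINE"
    SEPARATOR = "SEPARATOR"
    TITLE = "TITLE"
    SUPERHEADER = "SUPERHEADER"
    TABLEHEADER = "TABLEHEADER"
    SUBHEADER = "SUBHEADER"
    SECTIONHEADER = "SECTIONHEADER"
    DATAROW = "DATAROW"
    SECTIONDATAROW = "SECTIONDATAROW"
    TABLEFOOTNOTE = "TABLEFOOTNOTE"
    TABLECAPTION = "TABLECAPTION"


# Grouping as given on the slide; the group keys are hypothetical names.
LABEL_GROUPS = {
    "non_extraction": frozenset(
        {LineLabel.NONTABLE, LineLabel.BLANKLINE, LineLabel.SEPARATOR}
    ),
    "header": frozenset(
        {
            LineLabel.TITLE,
            LineLabel.SUPERHEADER,
            LineLabel.TABLEHEADER,
            LineLabel.SUBHEADER,
            LineLabel.SECTIONHEADER,
        }
    ),
    "data_row": frozenset({LineLabel.DATAROW, LineLabel.SECTIONDATAROW}),
    "caption": frozenset({LineLabel.TABLEFOOTNOTE, LineLabel.TABLECAPTION}),
}


def group_of(label: LineLabel) -> str:
    """Return the name of the group a label belongs to."""
    for name, members in LABEL_GROUPS.items():
        if label in members:
            return name
    raise ValueError(f"label not assigned to any group: {label}")
```

For example, `group_of(LineLabel.SUBHEADER)` returns `"header"`.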

  6. Feature Set
     • White space features
       • Presence of: 4 consecutive white spaces, 4-space indent, 2 consecutive white spaces between non-space characters, a complete white space line, single-space indent, etc.
       • Percentage of: white space from the first non-white-space character on
     • Text features
       • Presence of: 3 cells on a line, etc.
       • Percentage of: digits (0-9) on a line, alphabetic characters (a-z) on a line, header features (year strings, month abbreviations, etc.) on a line
     • Separator features
       • Presence of: 4 consecutive periods
       • Percentage of: separator characters (-, +, !, =, :, *) on a line
     • Conjunctions of features
       • Conjunctions: current & previous line, current & next line, next & next-next line
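
To make the feature set concrete, here is a hedged Python sketch of a per-line feature extractor covering a subset of the features above. Several details are assumptions rather than the paper's definitions: a "cell" is taken to be a chunk of text separated by two or more spaces, percentages are computed over all characters of the line, header features such as year strings are omitted, and a conjunction is the pairing of any two active binary features on the two lines involved. All function and feature names are hypothetical.

```python
import re

SEPARATOR_CHARS = set("-+!=:*")


def fraction(count: int, total: int) -> float:
    """Safe ratio that returns 0.0 for empty input."""
    return count / total if total else 0.0


def line_features(line: str) -> dict:
    """Binary and percentage features for a single line of text."""
    text = line.rstrip("\n")
    body = text.lstrip()  # from the first non-white-space character on
    # Assumption: cells are chunks separated by two or more spaces.
    cells = [c for c in re.split(r" {2,}", text.strip()) if c]
    return {
        "blank_line": text.strip() == "",
        "four_spaces": "    " in text,
        "four_space_indent": text.startswith("    "),
        "single_space_indent": text.startswith(" ") and not text.startswith("  "),
        "gap_between_text": re.search(r"\S {2,}\S", text) is not None,
        "three_cells": len(cells) >= 3,
        "four_periods": "...." in text,
        "pct_white_space": fraction(sum(ch.isspace() for ch in body), len(body)),
        "pct_digits": fraction(sum(ch in "0123456789" for ch in text), len(text)),
        "pct_alpha": fraction(
            sum(ch.isascii() and ch.isalpha() for ch in text), len(text)
        ),
        "pct_separators": fraction(
            sum(ch in SEPARATOR_CHARS for ch in text), len(text)
        ),
    }


def conjunctions(first: dict, second: dict, tag: str) -> dict:
    """Pair every active binary feature of one line with those of another."""
    return {
        f"{tag}:{a}&{b}": True
        for a, va in first.items() if va is True
        for b, vb in second.items() if vb is True
    }


def document_features(lines: list) -> list:
    """Per-line features plus the three conjunction types from the slide."""
    base = [line_features(line) for line in lines]
    result = []
    for i, feats in enumerate(base):
        combined = dict(feats)
        if i > 0:
            combined.update(conjunctions(feats, base[i - 1], "cur&prev"))
        if i + 1 < len(base):
            combined.update(conjunctions(feats, base[i + 1], "cur&next"))
        if i + 2 < len(base):
            combined.update(conjunctions(base[i + 1], base[i + 2], "next&nextnext"))
        result.append(combined)
    return result
```

Calling `document_features` on the lines of a document yields one feature dictionary per line, which could then be handed to a sequence labeler.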

  7. Task 1: Table Line Location
     • A table line is any line whose label is not NONTABLE, BLANKLINE or SEPARATOR
     • F-measure = (2 × Precision × Recall) / (Precision + Recall)
     • Both CRFs used a Gaussian prior and were trained using L-BFGS
     • Training set: 52 documents; development set: 6 documents; test set: 62 documents
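
The table-line definition and the F-measure formula can be combined into a short scoring routine. This is a minimal Python sketch, assuming precision and recall are counted per line over gold and predicted label sequences of equal length; the function names and the toy labels are illustrative, not results from the paper.

```python
NON_TABLE_LABELS = {"NONTABLE", "BLANKLINE", "SEPARATOR"}


def is_table_line(label: str) -> bool:
    """A table line is any line whose label is not a non-extraction label."""
    return label not in NON_TABLE_LABELS


def table_line_scores(gold: list, predicted: list) -> tuple:
    """Return (precision, recall, F-measure) for table line location."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    gold_flags = [is_table_line(label) for label in gold]
    pred_flags = [is_table_line(label) for label in predicted]
    true_pos = sum(g and p for g, p in zip(gold_flags, pred_flags))
    n_pred = sum(pred_flags)
    n_gold = sum(gold_flags)
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Toy example with made-up labels for a six-line document.
gold = ["NONTABLE", "TITLE", "TABLEHEADER", "DATAROW", "DATAROW", "BLANKLINE"]
pred = ["NONTABLE", "NONTABLE", "TABLEHEADER", "DATAROW", "SECTIONDATAROW", "TABLEFOOTNOTE"]
precision, recall, f_measure = table_line_scores(gold, pred)
```

Note that a line predicted as SECTIONDATAROW where the gold label is DATAROW still counts as a correctly located table line under this definition, since only the table/non-table distinction matters for Task 1.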

  8. Task 2: Line Identification • How many of these lines were actually table lines?

  9. Task 2: Line Identification

  10. Additional Results • Pinto et al. heuristic method • 4 labels: CAPTIONS, HEADERS, DATA, NON-TABLE

  11. Conclusions
      • The table extraction problem has complex linguistic and formatting characteristics. To attack it, a combination of textual and spatial features was used.
      • CRFs handle arbitrary, overlapping features very well, and offer the combined benefits of conditionally trained probability models and Markov finite-state context models.
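
For reference, the standard linear-chain CRF formulation behind the last point can be written as follows. The notation is an assumption, not taken from the slides: $\mathbf{x}$ is the sequence of lines in a document, $\mathbf{y}$ the sequence of line labels, $f_k$ the feature functions with weights $\lambda_k$, and $\sigma^2$ the variance of the Gaussian prior mentioned on slide 7.

```latex
% Conditional probability of a label sequence given the observed lines
p_\Lambda(\mathbf{y} \mid \mathbf{x})
  = \frac{1}{Z(\mathbf{x})}
    \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, \mathbf{x}, t) \Big),
\qquad
Z(\mathbf{x})
  = \sum_{\mathbf{y}'}
    \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, \mathbf{x}, t) \Big)

% Training objective over N labeled documents with a Gaussian prior on the weights
\mathcal{L}(\Lambda)
  = \sum_{i=1}^{N} \log p_\Lambda\big(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)}\big)
    - \sum_{k} \frac{\lambda_k^2}{2\sigma^2}
```

Because each $f_k$ may inspect the whole observation sequence $\mathbf{x}$, overlapping features such as the line conjunctions on slide 6 fit into the model directly, while the dependence on $y_{t-1}$ supplies the Markov context. An optimizer such as L-BFGS can be used to maximize $\mathcal{L}(\Lambda)$.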
