1 / 16

Wrapper Induction for List Extraction from OCRed Documents

This study explores Wrapper Induction methods for extracting lists from OCR data, improving data entry efficiency and accuracy. Techniques include Weakly-supervised and Semi-supervised Wrapper Induction using Hand-Labeled OCR data. Preliminary results show promising F-measure results on small datasets. Future work focuses on enhancing rules for list extraction.

otylia
Télécharger la présentation

Wrapper Induction for List Extraction from OCRed Documents

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. ListReader:Wrapper Induction for Lists in OCRed Documents Thomas Packer BYU CS DEG 2012.03.17

  2. We Love Data (In Digital Form)

  3. Lots of Paper Documents in the World

  4. Lots of Text in Paper Documents

  5. Lots of Lists in TextLots of Data in Lists

  6. Manual Data Entry

  7. Wrappers: Individualized Extraction Rules

  8. Semi-Automatic Wrapper Induction

  9. Weakly-supervisedWrapper Induction Semi-supervised Wrapper Induction

  10. Data: Image

  11. Data: OCR

  12. Data: Hand-Labeled OCR

  13. Induced Regex Wrappers Single-Specific Character Classes (i)(\. )(Lydia)( )(Lewis)(\*, )(b\. ) … [a-z] [A-Z] [0-9] [ \t\r\n] <each punct.> Single-General ([a-z]{1,5})([ \t\r\n\.]{1,6})([a-zA-Z]{3,9})([ \t\r\n]{1,5}) ([a-zA-Z]{3,9})([ \t\r\n\*,]{1,7})([ \t\r\n\.a-z]{1,7}) … Multi-General ([iv]{1,3})([ \.]{2,2})([ACHJLacdehilmnrstuvy]{4,6})([ ]{1,1}) ([Leisw]{5,6})([ \*,]{2,3})([ \.abp]{3,5}) …

  14. Preliminary Results(Field Label F-measure, Small Dataset)

  15. Conclusions and Future Work

  16. All done. Suggestions?

More Related