
Using a Hidden-Markov Model in Semi-Automatic Indexing of Historical Handwritten Records


Presentation Transcript


  1. Using a Hidden-Markov Model in Semi-Automatic Indexing of Historical Handwritten Records Thomas Packer, Oliver Nina, Ilya Raykhel Computer Science Brigham Young University

  2. The Challenge: Indexing Handwriting • Millions of historical documents. • Many hours of manual indexing. • Years to complete using hundreds of thousands of volunteers. • Previous transcriptions not fully leveraged.

  3. Family Search Indexing Tool

  4. A Solution: On-Line Machine Learning • Holistic handwritten word recognition using a Hidden Markov Model (HMM), based on Lavrenko et al. (2004). • The HMM selects the word sequence that maximizes the joint probability of two component models: • a word-feature probability model, which predicts a word from its visual features, and • a word-transition probability model, which predicts a word from its neighboring word.
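Decoding such an HMM means finding the label sequence that maximizes the product of observation and transition probabilities, typically via the Viterbi algorithm. A minimal sketch in log space, with a toy two-word vocabulary and probabilities invented purely for illustration (the paper's actual vocabularies and estimates are not shown in the slides):

```python
import math

def viterbi(obs_logp, trans_logp, init_logp):
    """Most probable label sequence for one row of fields.

    obs_logp:   list of dicts, one per position: label -> log P(features | label)
    trans_logp: dict (prev, cur) -> log P(cur | prev)
    init_logp:  dict label -> log P(label) at the first position
    """
    labels = list(init_logp)
    # best[t][lab] = (best score of a path ending in lab at position t, backpointer)
    best = [{lab: (init_logp[lab] + obs_logp[0][lab], None) for lab in labels}]
    for t in range(1, len(obs_logp)):
        col = {}
        for cur in labels:
            score, prev = max(
                (best[t - 1][p][0] + trans_logp[(p, cur)] + obs_logp[t][cur], p)
                for p in labels)
            col[cur] = (score, prev)
        best.append(col)
    lab = max(best[-1], key=lambda l: best[-1][l][0])
    path = [lab]
    for t in range(len(best) - 1, 0, -1):
        lab = best[t][lab][1]
        path.append(lab)
    return path[::-1]

# Toy example with an invented two-word vocabulary.
lg = math.log
init_logp = {"Head": lg(0.5), "Wife": lg(0.5)}
obs_logp = [{"Head": lg(0.9), "Wife": lg(0.1)},   # first word looks like "Head"
            {"Head": lg(0.5), "Wife": lg(0.5)}]   # second word is visually ambiguous
trans_logp = {("Head", "Head"): lg(0.2), ("Head", "Wife"): lg(0.8),
              ("Wife", "Head"): lg(0.6), ("Wife", "Wife"): lg(0.4)}
# The transition model resolves the ambiguous second word to "Wife".
```

Working in log probabilities avoids numeric underflow as the products of small probabilities grow long.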

  5. The Process (pipeline diagram): Census Images + Transcriptions → Labeled Examples (Word Rectangles) → Feature Vectors → Training Examples → Learner → Model → Classifier; Test Examples → Classifier → Results

  6. Census Images • 3 US Census images • Same census taker (consistent handwriting) • Preprocessing: Kittler's minimum-error algorithm to threshold (binarize) the images
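Kittler's method picks the threshold that minimizes a two-Gaussian classification-error criterion computed from the image histogram. A minimal sketch of the Kittler–Illingworth criterion, written as a generic implementation since the slides do not give the authors' exact parameters:

```python
import numpy as np

def kittler_threshold(image, levels=256):
    """Kittler-Illingworth minimum-error threshold for a grayscale image."""
    hist, _ = np.histogram(image, bins=levels, range=(0, levels))
    p = hist / hist.sum()                      # normalized histogram
    g = np.arange(levels, dtype=float)         # gray levels
    best_t, best_j = 0, np.inf
    for t in range(1, levels - 1):
        p1, p2 = p[:t].sum(), p[t:].sum()      # class proportions below/above t
        if p1 <= 0 or p2 <= 0:
            continue
        m1 = (g[:t] * p[:t]).sum() / p1        # class means
        m2 = (g[t:] * p[t:]).sum() / p2
        v1 = (((g[:t] - m1) ** 2) * p[:t]).sum() / p1   # class variances
        v2 = (((g[t:] - m2) ** 2) * p[t:]).sum() / p2
        if v1 <= 0 or v2 <= 0:
            continue
        # Minimum-error criterion J(t); smaller is better.
        j = 1 + p1 * np.log(v1) + p2 * np.log(v2) \
              - 2 * (p1 * np.log(p1) + p2 * np.log(p2))
        if j < best_j:
            best_j, best_t = j, t
    return best_t
```

On a census scan the two Gaussians correspond roughly to ink and paper, so the chosen threshold separates handwriting from background.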

  7. Extracted Fields • Manually copied bounding rectangles • 3 columns (vocabulary size in parentheses): Relationship to Head (14), Sex (2), Marital Status (4) • 123 rows total • N-fold cross validation with N = 24 (5 rows held out for testing per fold)
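The N-fold split can be sketched as a simple index partition; with 123 rows and N = 24, round-robin assignment holds out 5 rows per fold (6 for a few folds, since 123 is not a multiple of 24). The function name and the round-robin scheme are illustrative assumptions, not the paper's exact protocol:

```python
def n_fold_splits(n_rows, n_folds):
    """Yield (train, test) index lists for N-fold cross validation."""
    indices = list(range(n_rows))
    folds = [indices[i::n_folds] for i in range(n_folds)]  # round-robin folds
    for test in folds:
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test
```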

  8. Examples to Feature Vectors 25 Numeric Features Extracted: • Scalar Features: • height (h) • width  (w) • aspect ratio (w / h) • area  (w * h) • Profile Features: • projection profile • upper/lower word profile • 7 lowest scalar values from DFT
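A sketch of extracting these 25 features from a binary word image: the 4 scalars plus the 7 lowest-frequency DFT magnitudes of each of the 3 profiles (4 + 3 × 7 = 25). How the profiles are defined for columns with no ink is an assumption here:

```python
import numpy as np

def word_features(img):
    """25 features from a binary word image (1 = ink, 0 = background)."""
    h, w = img.shape
    feats = [h, w, w / h, w * h]                 # height, width, aspect ratio, area
    projection = img.sum(axis=0)                 # ink pixels per column
    has_ink = img.any(axis=0)
    # Upper/lower word profiles: topmost and bottommost ink row per column.
    upper = np.where(has_ink, np.argmax(img, axis=0), h)
    lower = np.where(has_ink, h - 1 - np.argmax(img[::-1], axis=0), 0)
    for profile in (projection, upper, lower):
        mags = np.abs(np.fft.rfft(profile))
        feats.extend(mags[:7])                   # 7 lowest-frequency DFT magnitudes
    return np.array(feats, dtype=float)
```

Keeping only the lowest DFT coefficients summarizes each profile's coarse shape while discarding pixel-level noise.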

  9. HMM and Transition Probability Model • Probability model: Hidden Markov Model • State transition probabilities between neighboring words
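Transition probabilities can be estimated by counting neighboring-word pairs in the existing transcriptions. The add-alpha smoothing below is an illustrative assumption; the slides do not say how unseen transitions are handled:

```python
from collections import Counter

def transition_probs(sequences, vocab, alpha=1.0):
    """Estimate P(cur | prev) from labeled word sequences, add-alpha smoothed."""
    pair_counts, prev_counts = Counter(), Counter()
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            pair_counts[(prev, cur)] += 1
            prev_counts[prev] += 1
    v = len(vocab)
    return {(p, c): (pair_counts[(p, c)] + alpha) / (prev_counts[p] + alpha * v)
            for p in vocab for c in vocab}
```

Smoothing matters here because the training data is tiny (a few census pages), so many legal transitions will never have been observed.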

  10. Observation Probability Model • Multivariate normal distribution over the word feature vectors
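A sketch of the observation model: fit a multivariate normal to each word class's feature vectors and score new vectors by log density. The diagonal-covariance restriction and the variance floor are assumptions added here for numerical stability with few training examples per class:

```python
import numpy as np

def fit_gaussian(X):
    """Per-class mean and (diagonal) variance from feature vectors X (n x d)."""
    mu = X.mean(axis=0)
    var = X.var(axis=0) + 1e-6     # variance floor avoids a singular covariance
    return mu, var

def log_density(x, mu, var):
    """Log of a multivariate normal density with diagonal covariance."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
```

These per-class log densities supply the observation term that the HMM combines with the transition model when decoding a row.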

  11. Accuracies with and without HMM

  12. Accuracies for Separate Columns with and without HMM

  13. Accuracies of HMM for Varying Numbers of Training Examples

  14. Accuracies of “Relationship to Head” for Varying Numbers of Examples

  15. Conclusions and Future Work • 10% correction rate for the chosen columns after one page of training. • Future work: • Measure indexing time. • Update models in real time. • Columns with larger vocabularies. • More image preprocessing. • More visual features. • More dependencies among words (in different rows). • More training data.

  16. Questions?
