Explore the process of extracting names from genealogical texts using Natural Language Processing and layout clues. Discover how tools like Stanford Named Entity Recognizer and Apache UIMA Framework aid in this task, along with challenges and future work.
Extracting Names Using Layout Clues in Genealogical Books • Aaron Stewart • David W. Embley • March 20, 2010
Finding Names • Name recognition in genealogical texts • Focus: Lists, Directories
Finding Names Which side was easier? It’s easy for us to spot names… But how does a computer do it?
Finding Names • Natural Language Processing • Stanford Named Entity Recognizer • ? • Apache UIMA Framework • MEMM (maximum-entropy Markov model) • CRF (conditional random field)
BYU OntoES Ontology Extraction System • Dictionary • Regular Expressions
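As an illustration of how a dictionary combined with regular expressions can act as a baseline name extractor, here is a minimal sketch. It is not OntoES itself: the `GIVEN_NAMES` set, the `NAME_PATTERN` expression, and the `extract_names` function are all hypothetical names invented for this example, and a real dictionary would be far larger.

```python
import re

# Hypothetical given-name dictionary (assumption); a real lexicon would be much larger.
GIVEN_NAMES = {"John", "Mary", "William", "Sarah", "Aaron", "David"}

# A capitalized word, an optional middle initial, then a capitalized surname.
NAME_PATTERN = re.compile(r"\b([A-Z][a-z]+)(?: [A-Z]\.)? ([A-Z][a-z]+)\b")

def extract_names(text):
    """Return (start, end, matched_text) for candidates whose first word is a known given name."""
    results = []
    for match in NAME_PATTERN.finditer(text):
        if match.group(1) in GIVEN_NAMES:
            results.append((match.start(), match.end(), match.group(0)))
    return results

sample = "John Smith married Mary A. Jones in Salem."
names = [name for _, _, name in extract_names(sample)]
print(names)  # ['John Smith', 'Mary A. Jones']
```

A baseline like this misses any name whose given name is not in the dictionary, which is the gap the layout-based patterns later in the talk try to close.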
Ancestry.com Data • Word text • Word bounding boxes • Genres: • Genealogical Books • City Directories • Yearbooks • Newspapers
Margin Finder – Future Work • [Figure: sample page with detected margins; key: Left, Center, Right]
Margin Finder – Future Work • ABBYY FineReader handles: paragraphs, newspaper columns • But has trouble with: hanging indents, outline indentation (possibly)
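One way a margin finder could recover hanging indents from word bounding boxes is to cluster the left edge of the first word on each line and number the clusters from the left. The sketch below is only an assumption about how such a component might work, since the slides list the margin finder as future work; `assign_indent_levels` and its `tolerance` parameter are hypothetical.

```python
def assign_indent_levels(left_edges, tolerance=5):
    """Map each line's left x-coordinate to an indentation level (1 = leftmost margin).

    left_edges: x-coordinate of the first word's bounding box on each line.
    tolerance:  a new margin starts when the gap to the last margin exceeds this (assumed units: pixels).
    """
    margins = []  # representative x-coordinate of each margin, left to right
    for x in sorted(set(left_edges)):
        if not margins or x - margins[-1] > tolerance:
            margins.append(x)

    levels = []
    for x in left_edges:
        # Assign each line to the nearest margin.
        nearest = min(range(len(margins)), key=lambda i: abs(x - margins[i]))
        levels.append(nearest + 1)
    return levels

# Lines alternating between a main entry and its hanging indent.
print(assign_indent_levels([100, 130, 101, 129, 100]))  # [1, 2, 1, 2, 1]
```

The resulting level numbers correspond to the LEVEL 1 and LEVEL 2 markers used in the pattern-finding steps. Centered and right-aligned text would need the center or right edge of each line instead of the left edge.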
Pattern Finding • Apply baseline name extractor (OntoES) • Apply margin finder and insert markers • Find left and right context for each name • Apply common contexts to extract more names
Pattern Finding 1. Apply baseline name extractor (OntoES)
Pattern Finding 2. Apply margin finder and insert markers [Figure: page lines tagged with indentation markers LEVEL 1 and LEVEL 2]
Pattern Finding 3. Find left and right context for each name
Pattern Finding 4. Apply common context patterns to extract more names
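The four steps can be sketched on a token sequence in which margin markers appear as ordinary tokens. This is a simplified illustration under stated assumptions, not the authors' implementation: contexts are a single token on each side, the marker is written `LEVEL1`, and `find_contexts`, `apply_context`, and `max_len` are hypothetical names.

```python
from collections import Counter

def find_contexts(tokens, name_spans):
    """Count the (left_token, right_token) pair around each known name span (end is exclusive)."""
    contexts = Counter()
    for start, end in name_spans:
        left = tokens[start - 1] if start > 0 else "<START>"
        right = tokens[end] if end < len(tokens) else "<END>"
        contexts[(left, right)] += 1
    return contexts

def apply_context(tokens, context, max_len=3):
    """Return spans of 1..max_len tokens that sit between the context's left and right tokens."""
    left, right = context
    spans = []
    for i, token in enumerate(tokens):
        if token != left:
            continue
        # The candidate starts at i + 1; look for the closing token within max_len tokens.
        for j in range(i + 2, min(i + 2 + max_len, len(tokens))):
            if tokens[j] == right:
                spans.append((i + 1, j))
                break
    return spans

tokens = ["LEVEL1", "John", "Smith", ",", "farmer",
          "LEVEL1", "Ezra", "Pratt", ",", "clerk",
          "LEVEL1", "Mary", "Jones", ",", "teacher"]
seed_spans = [(1, 3), (11, 13)]            # names found by the baseline extractor (step 1)
contexts = find_contexts(tokens, seed_spans)
best_context = contexts.most_common(1)[0][0]
found = apply_context(tokens, best_context)
print(best_context)                         # ('LEVEL1', ',')
print([" ".join(tokens[s:e]) for s, e in found])
# ['John Smith', 'Ezra Pratt', 'Mary Jones']
```

Here the baseline missed "Ezra Pratt", but the context shared by the two seed names (a level-1 margin on the left, a comma on the right) recovers it. A context this short can also match non-names, which is why knowing when to apply a pattern matters.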
Pattern Finding – Sample Results • Baseline results: Precision 40%, Recall 31.25%, F1 35.09% • Results of most salient pattern: Precision 51.52%, Recall 53.12%, F1 52.31% • Not all results are this good!
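For reference, the three metrics are related as in the sketch below. The slides do not give raw counts; the counts used here (32 gold names, 10 correct out of 25 extracted for the baseline, 17 correct out of 33 for the pattern) are assumptions chosen only because they reproduce the reported percentages up to rounding.

```python
def precision_recall_f1(true_positives, extracted, gold):
    """Compute precision, recall, and F1 from counts of correct, extracted, and gold names."""
    precision = true_positives / extracted if extracted else 0.0
    recall = true_positives / gold if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall
    return precision, recall, f1

# Hypothetical counts (assumption), consistent with the reported figures.
baseline = precision_recall_f1(true_positives=10, extracted=25, gold=32)
pattern = precision_recall_f1(true_positives=17, extracted=33, gold=32)

for label, (p, r, f) in [("baseline", baseline), ("pattern", pattern)]:
    print(f"{label}: P={p:.4f} R={r:.4f} F1={f:.4f}")
```

Because F1 is the harmonic mean, it sits closer to the lower of precision and recall, so the baseline's weak recall pulls its F1 down to about 35%.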
Challenges • Evaluation • More aligned data • Annotation tool • Other books • Centered and right-aligned text • Knowing when to apply patterns