Recognizing Records from the Extracted Cells of Microfilm Tables

Kenneth M. Tubbs and David W. Embley, Brigham Young University. Supported by NSF.


Presentation Transcript


  1. Recognizing Records from the Extracted Cells of Microfilm Tables. Kenneth M. Tubbs, David W. Embley, Brigham Young University. Supported by NSF.

  2. Motivation

  3. Motivation • Millions want microfilm information • 1880 census on-line, end of October • 3 million hits per hour on familysearch.org • Acquiring information from microfilm • Expensive and time consuming • 2.5 million rolls, 20,000 extractors, 100 hours per year: requires 104 years • Finding a way to automate: big win!

  4. Difficulties • Different layouts and styles • Different types of data • Sometimes ambiguous • Type-written labels (OCR) • Hand-written data (?)

  5. Objective: Identify Records • Given a zoned image of a microfilm table, exploit: • Ontological as well as geometric constraints • Layout of handwritten values • Layout of empty cells • Output field coordinates, labeled with respect to the ontology and organized into records

  6. Method • Input: XML input file (preprocessed microfilm image) and genealogical ontology • Algorithm: Generate Confidence, Enforce Constraints, Verify Results • Output: SQL insert statements

  7. “Training” Set • 25 Tables from 5 different microfilm rolls • Used to: • Identify relationships between table cells • Create genealogical ontology • Define features to extract • Generate rules (constraints)

  8. Input: Microfilm Table

  9. Input: Microfilm Table

  10. Input: Microfilm Table • Input Features • Coordinates of each cell • Printed text for label cells • Cell empty or not

  11. Input: Microfilm Table
      <index source="0444770/0444770_2.gif" ontology="ontology.xml">
        <cell rect="7,131,62,261" printed_text="Dwelling-houses number in the order of visitation." empty="0" />
        <cell rect="61,132,118,260" printed_text="Families number in order of visitation." empty="0" />
        <cell rect="119,132,436,261" printed_text="The Name of every Person whose usual place of abode on the first day of June, 1840, was in this family." empty="0" />
        <cell rect="62,260,120,295" printed_text="2" empty="0" />
        <cell rect="118,260,436,298" printed_text="3" empty="0" />
        <cell rect="7,458,62,497" printed_text="" empty="1" />
        . . .
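
The input file on slide 11 could be read along the following lines. This is a minimal sketch, not the authors' code: it assumes the element is named `cell` with attributes `rect`, `printed_text`, and `empty`, and that `rect` is ordered left, top, right, bottom. The `Cell` class and the `parse_cells` helper are hypothetical names introduced for illustration.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List


@dataclass
class Cell:
    left: int
    top: int
    right: int
    bottom: int
    printed_text: str  # OCR text for label cells, "" otherwise
    empty: bool        # True when the zoning step found no content


def parse_cells(xml_text: str) -> List[Cell]:
    """Turn every <cell> element into a Cell record (assumed schema)."""
    root = ET.fromstring(xml_text)
    cells = []
    for node in root.iter("cell"):
        # Assumed coordinate order: left, top, right, bottom.
        left, top, right, bottom = (int(v) for v in node.get("rect").split(","))
        cells.append(Cell(left, top, right, bottom,
                          node.get("printed_text", ""),
                          node.get("empty") == "1"))
    return cells


# Three of the cells shown on slide 11, wrapped in a closed <index> element.
SAMPLE = """<index source="0444770/0444770_2.gif" ontology="ontology.xml">
  <cell rect="7,131,62,261" printed_text="Dwelling-houses number in the order of visitation." empty="0" />
  <cell rect="62,260,120,295" printed_text="2" empty="0" />
  <cell rect="7,458,62,497" printed_text="" empty="1" />
</index>"""

cells = parse_cells(SAMPLE)
label_cells = [c for c in cells if c.printed_text]
```

Cells with non-empty `printed_text` are the candidates for label cells; the rest are value cells, empty or not.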

  12. Genealogical Ontology

  13. Genealogical Ontology

  14. Genealogical Ontology
      <Ontology>
        <ObjectSet id="0" name="Person" syn="" lex="0"/>
        <ObjectSet id="1" name="Family" syn="families" lex="0"/>
        <ObjectSet id="2" name="Event" syn="" lex="0"/>
        <ObjectSet id="3" name="Age" syn="age birthday" lex="1"/>
        <ObjectSet id="4" name="Relationship" syn="relationship relation" lex="1"/>
        <ObjectSet id="5" name="Full Name" syn="full name whom who" lex="1"/>
        <ObjectSet id="6" name="First Name" syn="first given christian" lex="1"/>
        <ObjectSet id="7" name="Middle Name(s)" syn="middle initial" lex="1"/>
        <ObjectSet id="8" name="Last Name" syn="last surname" lex="1"/>
        <ObjectSet id="9" name="Title(s)" syn="title" lex="1"/>
        . . .

  15. Generate Confidence Matrices • Relationships between pairs of cells • Confidence values between 0 and 1

  16. Relationships • Label cell describes value cells • Value cells in same row or column • Label cells form a multi-level label • Label cells correspond to object sets • Value factoring and nested values

  17. Label Cell and Value Cell • A continuous path between a label cell and a value cell • Confidence = 1 if a path exists, 0 if no path exists

  18. Label Cell and Value Cell • Preferences for label-value orientations

  19. Label Cell and Value Cell • Compare the height or width of each label cell with each value cell • Confidence ranges from 0 (not similar) to 1 (similar)

  20. Value Cell and Value Cell (Same Row) • A continuous, horizontal path exists between a pair of value cells • Confidence = 1 if a path exists, 0 if no path exists

  21. Value Cell and Value Cell (Same Column) • A continuous, vertical path exists between a pair of value cells • Confidence = 1 if a path exists, 0 if no path exists
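
One way to read the "continuous path" test on slides 17 to 21 is as reachability through a chain of cells stacked directly on top of one another. The sketch below takes that reading; the slides do not define the path test, so the adjacency rule, the pixel tolerance `tol`, and the function names are all assumptions, as is the left, top, right, bottom coordinate order.

```python
from collections import deque
from typing import List, Tuple

Rect = Tuple[int, int, int, int]  # (left, top, right, bottom), an assumed order


def stacked(upper: Rect, lower: Rect, tol: int = 3) -> bool:
    """True if `lower` sits directly beneath `upper` (assumed adjacency rule)."""
    edges_touch = abs(upper[3] - lower[1]) <= tol
    x_overlap = min(upper[2], lower[2]) - max(upper[0], lower[0])
    # Require more than `tol` pixels of overlap so neighbouring columns
    # that merely share a border are not treated as stacked.
    return edges_touch and x_overlap > tol


def vertical_path_confidence(cells: List[Rect], start: int, goal: int,
                             tol: int = 3) -> float:
    """Return 1.0 if a downward chain of stacked cells links start to goal."""
    seen = {start}
    queue = deque([start])
    while queue:
        i = queue.popleft()
        if i == goal:
            return 1.0
        for j, cell in enumerate(cells):
            if j not in seen and stacked(cells[i], cell, tol):
                seen.add(j)
                queue.append(j)
    return 0.0


cells = [
    (61, 132, 118, 260),   # label cell from slide 11
    (62, 260, 120, 295),   # cell directly below it, from slide 11
    (62, 295, 120, 330),   # hypothetical value cell one row further down
    (119, 132, 436, 261),  # neighbouring label cell from slide 11
]
same_column = vertical_path_confidence(cells, 0, 2)
other_column = vertical_path_confidence(cells, 3, 2)
```

A horizontal test for the same-row case would follow the same pattern with the roles of the x and y coordinates swapped.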

  22. Value Cell and Value Cell (Geometrically Similar) • Compare height and width • Confidence ranges from 0 (not similar) to 1 (similar)
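
The height and width comparison can be turned into a value between 0 and 1 in many ways. The slides do not give the formula, so the ratio-based score below is only an assumed stand-in, and `size_similarity` is a hypothetical name.

```python
from typing import Tuple

Rect = Tuple[int, int, int, int]  # (left, top, right, bottom), an assumed order


def size_similarity(a: Rect, b: Rect) -> float:
    """Score 1.0 for identical dimensions, approaching 0.0 as they diverge."""
    wa, ha = a[2] - a[0], a[3] - a[1]
    wb, hb = b[2] - b[0], b[3] - b[1]
    if min(wa, ha, wb, hb) <= 0:
        return 0.0  # degenerate rectangle: treat as not similar
    # Assumed formula: product of the width ratio and the height ratio.
    return (min(wa, wb) / max(wa, wb)) * (min(ha, hb) / max(ha, hb))


# Two value cells whose coordinates appear in the SQL statements on slide 37.
alike = size_similarity((335, 114, 521, 172), (335, 173, 521, 231))
# A tall label cell and a short cell, both from slide 11.
unlike = size_similarity((7, 131, 62, 261), (62, 260, 120, 295))
```

The two slide-37 cells have identical width and height, so they score 1.0, while the tall label cell and the short cell score well below 0.5.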

  23. Multi-level Labels Generate Confidence • Distance between the midpoints • A line through the midpoints • Share a common border

  24. Match Label Cells to Object Sets • Location of matched words • Order of matched words • Example object sets: Full Name, Location, Day, Family
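
A label-to-object-set match based on the location of matched words might look like the sketch below. The synonym lists are copied from slide 14, but the scoring rule (earlier matches score higher) is an assumption, since the slides name the features without giving a formula. The order-of-words feature is not modeled here, and `match_label` is a hypothetical name.

```python
import re
from typing import Dict, List

# Synonym lists as given in the ontology excerpt on slide 14.
OBJECT_SETS: Dict[str, List[str]] = {
    "Family": ["families"],
    "Age": ["age", "birthday"],
    "Relationship": ["relationship", "relation"],
    "Full Name": ["full", "name", "whom", "who"],
    "First Name": ["first", "given", "christian"],
    "Last Name": ["last", "surname"],
}


def match_label(label: str, object_sets: Dict[str, List[str]] = OBJECT_SETS) -> Dict[str, float]:
    """Score each object set by where its first synonym occurs in the label."""
    words = re.findall(r"[a-z]+", label.lower())
    scores = {}
    for name, synonyms in object_sets.items():
        positions = [i for i, w in enumerate(words) if w in synonyms]
        # Assumed rule: a match near the start of the label counts for more.
        scores[name] = 1.0 - positions[0] / len(words) if positions else 0.0
    return scores


label = ("The Name of every Person whose usual place of abode "
         "on the first day of June, 1840, was in this family.")
scores = match_label(label)
best = max(scores, key=scores.get)
```

For the label from slide 11, "name" appears near the start and "first" appears only in "first day of June", so Full Name outranks First Name under this rule.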

  25. Enforce Constraints • Rules for geometric and ontological constraints • Examples: • Same-type value cells have the same dimensions. • A family can't have 100 members. • Iterate over the rules, seeking convergence
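
Iterating over rules until the confidence values stop changing can be sketched as a fixed-point loop. Everything below is illustrative: the rule, the threshold values, and the names `enforce` and `same_dimensions_rule` are assumptions, not the rule set used in the presentation.

```python
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]
Confidences = Dict[Pair, float]
Rule = Callable[[Confidences], Confidences]

# Hypothetical (width, height) of three value cells.
SIZES = {"v1": (186, 58), "v2": (186, 58), "v3": (55, 130)}


def same_dimensions_rule(conf: Confidences) -> Confidences:
    """Cap the 'same type' confidence of pairs whose dimensions differ."""
    return {
        (a, b): (c if SIZES[a] == SIZES[b] else min(c, 0.1))
        for (a, b), c in conf.items()
    }


def enforce(conf: Confidences, rules: List[Rule],
            tol: float = 1e-6, max_iters: int = 100) -> Tuple[Confidences, int]:
    """Apply all rules repeatedly until no value moves by more than tol."""
    current = dict(conf)
    for iteration in range(1, max_iters + 1):
        updated = dict(current)
        for rule in rules:
            updated = rule(updated)
        change = max((abs(updated[k] - current[k]) for k in current), default=0.0)
        current = updated
        if change < tol:
            return current, iteration
    return current, max_iters


initial = {("v1", "v2"): 0.9, ("v1", "v3"): 0.8}
result, iterations = enforce(initial, [same_dimensions_rule])
```

The `max_iters` cap is a safety net for rule sets that oscillate rather than converge.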

  26. Similar Value Cells

  27. Similar Value Cells • Lower Confidence

  28. Similar Value Cells

  29. Combine Aggregations

  30. Multi-level Labels

  31. Factoring: Check Cardinality Constraints • Observed cardinality in microfilm table • Expected cardinality in genealogy ontology

  32. Observed Cardinality • [First Name] per [Family] = 45 / 9 = 4.67 . . .

  33. Expected Cardinality • [First Name] per [Family] = 4.8 * 1 * 1 = 4.8
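
The cardinality check compares a ratio counted in the table with a ratio derived from the ontology. The sketch below takes the expected value as the product of per-relationship averages, as on slide 33. The observed counts are hypothetical, and the agreement score is an assumption, since the slides do not say how the two numbers are compared.

```python
from math import prod
from typing import Sequence


def observed_cardinality(child_count: int, parent_count: int) -> float:
    """Average number of child values per parent value seen in the table."""
    return child_count / parent_count


def expected_cardinality(path_averages: Sequence[float]) -> float:
    """Product of average participations along the ontology path."""
    return prod(path_averages)


def agreement(observed: float, expected: float) -> float:
    """Assumed score in (0, 1]: 1.0 when the two cardinalities coincide."""
    return min(observed, expected) / max(observed, expected)


expected = expected_cardinality([4.8, 1, 1])       # factors from slide 33
plausible = observed_cardinality(30, 6)            # hypothetical counts
implausible = observed_cardinality(60, 2)          # hypothetical counts

good = agreement(plausible, expected)
bad = agreement(implausible, expected)
```

A factoring whose observed cardinality is far from the expected one, such as 30 first names per family, would receive a low score and could be rejected in favor of an alternative.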

  34. Ontological Similarity • Increase confidence of label-to-object-set mappings

  35. Same Microfilm Roll • Average confidence values across tables

  36. Verify Results

  37. SQL Statements • Insert value cell coordinates into the database, one row per value cell of an object set such as Full Name: • INSERT INTO Person ("Full Name") VALUES ('335,114,521,172') • INSERT INTO Person ("Full Name") VALUES ('335,173,521,231')
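
The output step can be tried out with Python's built-in `sqlite3` module. This is a sketch under assumptions: the presentation does not name the database or its schema, so the single-column `Person` table and the helper `store_full_name_cells` are hypothetical. The column name is quoted because it contains a space.

```python
import sqlite3
from typing import Iterable


def store_full_name_cells(rects: Iterable[str]) -> sqlite3.Connection:
    """Insert value-cell coordinates into an in-memory Person table."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: one text column holding "left,top,right,bottom".
    conn.execute('CREATE TABLE Person ("Full Name" TEXT)')
    conn.executemany(
        'INSERT INTO Person ("Full Name") VALUES (?)',
        [(rect,) for rect in rects],
    )
    conn.commit()
    return conn


# The two coordinate strings shown on slide 37.
conn = store_full_name_cells(["335,114,521,172", "335,173,521,231"])
rows = conn.execute('SELECT "Full Name" FROM Person ORDER BY rowid').fetchall()
```

Parameterized statements avoid building SQL text by hand, which matters once the stored values come from OCR output instead of fixed coordinates.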

  38. “Training” Set Results

  39. Ambiguous Factoring

  40. Experiments • 75 tables from 15 different microfilm rolls • Precision, recall, and accuracy • Populated SQL fields • Each relationship
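
Precision and recall over populated fields can be computed from two sets of (cell, field) assignments. The sketch below uses made-up assignments purely to show the arithmetic. It leaves accuracy out because the slides do not define how it was measured, and `precision_recall` is a hypothetical name.

```python
from typing import Hashable, Iterable, Tuple


def precision_recall(predicted: Iterable[Hashable],
                     actual: Iterable[Hashable]) -> Tuple[float, float]:
    """Precision = correct / predicted, recall = correct / actual."""
    predicted, actual = set(predicted), set(actual)
    correct = len(predicted & actual)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical (cell id, object set) assignments for one table.
predicted = [(1, "Full Name"), (2, "Age"), (3, "Age"), (4, "Relationship")]
actual = [(1, "Full Name"), (2, "Age"), (3, "Relationship"),
          (4, "Relationship"), (5, "Age")]

precision, recall = precision_recall(predicted, actual)
```

The same function applies to each relationship type separately by passing sets of cell pairs instead of (cell, field) assignments.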

  41. Test Set Results

  42. Factoring over Several Tables Improved Results

  43. Some Long Label Names Caused Confusion • Example label: "State here the particular Religion or Religious Denomination, to which each person belongs. [Members of Protestant Denominations are requested not to describe themselves by the vague term 'Protestant,' but to enter the name of the Particular Church, Denomination, or Body, to which they belong.]"

  44. Ambiguous Columns Caused Confusion • Example: Full Name

  45. Conclusions • Identified records in microfilm tables • Geometric and ontological properties • Evidence matrices & corroboration rules • Accuracy: ~92% http://www.rdhd.byu.edu http://www.fht.byu.edu
