Automatically Identifying Records from the Extracted Data Fields of Genealogical Microfilm Kenneth Tubbs
Input: Table Zones
• The coordinates of each table cell.
• The printed text, in ASCII, for each cell, if any.
• Whether or not the cell is empty.
Algorithm
[Diagram: Table Zones, Identify Structure, Record Patterns, Match Attributes, Genealogical Ontology, Check Constraints]
Identify Structure
1. Identify Table Primitives
2. Aggregate Table Primitives
3. Sort Candidates
Identify Structure
1. Identify Table Primitives
Column: [[table_label width] [table_value width]+] {below}
[Figure: example table with a "Name" label]
Identify Structure
1. Identify Table Primitives
Row: [[table_label height] [table_value height]+] {left}
[Figure: example table with a "Name" label]
Identify Structure
1. Identify Table Primitives
[Figure: a table marked with a Row Primitive and a Column Primitive, distinguishing Printed Text from Hand-written Text]
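The column pattern can be illustrated with a minimal Python sketch. It assumes that a table_label is a cell holding printed text, that a table_value is a cell without printed text, and that y grows downward as in image coordinates. The `Cell` type, the `column_primitives` function, and the `tol` tolerance are hypothetical names introduced for illustration, not part of the presented system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    x: float        # left edge of the table zone
    y: float        # top edge (assumed to grow downward)
    width: float
    height: float
    printed: bool   # assumption: printed text marks a label cell


def column_primitives(cells, tol=1.0):
    """Find groups matching: one label, then 1+ same-width value cells below."""
    found = []
    for label in (c for c in cells if c.printed):
        # Value cells aligned with the label and of the same width.
        below = sorted(
            (c for c in cells
             if not c.printed
             and abs(c.x - label.x) <= tol
             and abs(c.width - label.width) <= tol
             and c.y >= label.y + label.height - tol),
            key=lambda c: c.y,
        )
        values = []
        bottom = label.y + label.height
        for c in below:
            if abs(c.y - bottom) > tol:   # gap: the run of values ends here
                break
            values.append(c)
            bottom = c.y + c.height
        if values:                        # the pattern needs at least one value
            found.append((label, values))
    return found


cells = [
    Cell(0, 0, 10, 2, True), Cell(0, 2, 10, 2, False), Cell(0, 4, 10, 2, False),
    Cell(10, 0, 5, 2, True), Cell(10, 2, 5, 2, False),
    Cell(15, 0, 5, 2, True),              # label with no values: not a primitive
]
primitives = column_primitives(cells)
```

A row primitive could be detected the same way by comparing heights and scanning horizontally instead of vertically.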
Identify Structure
1. Identify Table Primitives
• Probabilistic rules are associated with each primitive type.
• Examples:
  • Column primitives should be factored left to right. (.9)
  • Row primitives factor the column primitives below them. (.7)
Identify Structure
2. Aggregate Table Primitives
[Figure: table primitives labeled A through L]
Identify Structure
2. Aggregate Table Primitives
Candidate aggregations of primitives G through L:
• [G H I J K L]
• [G] [H I J K L]
• [K] [G H I J L]
• [G] [H I J [K][L]]
• Others
Identify Structure
3. Sort Candidates
The candidates are evaluated based on:
• The confidence of the table primitive matches.
• The probability that the rules used are correct.
Identify Structure
3. Sort Candidates
Candidates in sorted order:
• [G] [H I J K L]
• [G H I J K L]
• [G] [H I J [K][L]]
• [K] [G H I J L]
• Others
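One way to picture the sorting step is the sketch below. The slides do not say how the primitive confidences and rule probabilities are combined, so multiplying them is an assumption, and every number in `candidates` is hypothetical, chosen only so that the resulting order matches the one on the slide.

```python
from math import prod


def candidate_score(primitive_confidences, rule_probabilities):
    # Assumption: treat the factors as independent and multiply them.
    return prod(primitive_confidences) * prod(rule_probabilities)


# Hypothetical inputs: (primitive match confidences, probabilities of rules used).
candidates = {
    "[G] [H I J K L]": ([0.9, 0.9], [0.9]),
    "[G H I J K L]": ([0.8], [0.9]),
    "[G] [H I J [K][L]]": ([0.9, 0.8], [0.9, 0.7]),
    "[K] [G H I J L]": ([0.6, 0.7], [0.7]),
}

# Highest score first.
ranked = sorted(
    candidates,
    key=lambda name: candidate_score(*candidates[name]),
    reverse=True,
)
```

With these made-up numbers, deeper factorings accumulate more rule probabilities below 1 and so tend to rank lower unless their primitive matches are strong.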
Match Attributes
1. Identify Possible Mappings
2. Sort Candidates
Match Attributes
1. Identify Possible Mappings
Mapping types:
• Identical Matches
• Synonym Matches
• Composite Matches
• Human-Aided Matches
[Figure: example mappings between Printed Text and the Genealogical Ontology; labels shown: Name, Name, Sex, Gender, Female, Age, "Female, Age"]
Match Attributes
2. Sort Candidates
The candidates are evaluated based on:
• The type of the match.
• The confidence of the match.
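The mapping types and the sort can be sketched roughly as follows. The precedence of match types (identical, then synonym, then composite, then human-aided), the comma-splitting rule for composite labels, the synonym table, and the confidence values are all assumptions made for illustration. `match_label` and `TYPE_RANK` are hypothetical names.

```python
TYPE_RANK = {"identical": 0, "synonym": 1, "composite": 2, "human-aided": 3}


def match_label(label, ontology, synonyms):
    """Classify a printed label against ontology attribute names."""
    by_lower = {a.lower(): a for a in ontology}
    key = label.strip().lower()
    if key in by_lower:
        return ("identical", [by_lower[key]])
    if key in synonyms:
        return ("synonym", [synonyms[key]])
    parts = [p.strip().lower() for p in label.split(",")]
    if len(parts) > 1:
        targets = []
        for part in parts:
            target = by_lower.get(part) or synonyms.get(part)
            if target is None:
                break                     # one part is unknown: not composite
            targets.append(target)
        else:
            return ("composite", targets)
    return ("human-aided", [])            # left for a person to resolve


ontology = ["Name", "Gender", "Age"]
synonyms = {"sex": "Gender"}              # hypothetical synonym table

# (label, hypothetical confidence) pairs, sorted by match type, then confidence.
labels = [("Occupation", 0.5), ("Name, Age", 0.8), ("Sex", 0.9), ("Name", 1.0)]
sorted_matches = sorted(
    ((lab, match_label(lab, ontology, synonyms)[0], conf) for lab, conf in labels),
    key=lambda m: (TYPE_RANK[m[1]], -m[2]),
)
```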
Check Constraints
• Identify the individual records.
• Evaluate the records with the Genealogical Ontology.
Check Constraints
Table(Address, Age) = 4.1
[Figure: graph over the table attributes Address, Name, Age, and Gender, with edge weights 1, 1, 1, 4.1, 3.9, and 4.2]
Check Constraints
Ontology(Address, Age) = 1.5 * 4.3 * .9 = 5.805
[Figure: ontology graph over Address, Family, Person, Name, Age, and Gender, with edge weights 5, 1.1, 10, 1.1, .9, 1.1, 1.5, 1.3, 4.3, and 1.3]
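The Ontology(Address, Age) value reads as a product of edge weights along a path through the ontology. The sketch below reproduces that arithmetic. Which edge carries which weight is not recoverable from the slide, so the assignment in `edge_weights` and the path Address, Family, Person, Age are assumptions. Only the three factors 1.5, 4.3, and .9 come from the slide.

```python
from math import prod

# Hypothetical assignment of the three factors to edges along the path.
edge_weights = {
    ("Address", "Family"): 1.5,
    ("Family", "Person"): 4.3,
    ("Person", "Age"): 0.9,
}


def path_score(path, weights):
    """Multiply the weights of consecutive edges along the path."""
    return prod(weights[(a, b)] for a, b in zip(path, path[1:]))


ontology_address_age = path_score(
    ["Address", "Family", "Person", "Age"], edge_weights
)
# 1.5 * 4.3 = 6.45, and 6.45 * 0.9 = 5.805, up to floating-point rounding.
```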
Check Constraints
Constraint_Score = 1 2 (1/(2n)) * Σ |Ontology(i, j) - Table(i, j)|^2
• The variables "i" and "j" are attributes.
• The sum is over all combinations of "i" and "j".
• The variable "n" is the number of attributes.
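The leading terms of the formula are not legible here, so the sketch below computes only the discrepancy term, (1/(2n)) times the sum of |Ontology(i, j) - Table(i, j)|^2. It assumes that "all combinations" means unordered attribute pairs. The function name `discrepancy` and the dictionary layout are hypothetical. The two values used are the Address and Age figures from the preceding slides.

```python
from itertools import combinations


def discrepancy(ontology, table, attributes):
    """(1 / (2n)) * sum over attribute pairs of |Ontology - Table| squared."""
    n = len(attributes)
    total = sum(
        abs(ontology[frozenset(pair)] - table[frozenset(pair)]) ** 2
        for pair in combinations(attributes, 2)   # assumption: unordered pairs
    )
    return total / (2 * n)


attributes = ["Address", "Age"]
ontology = {frozenset(("Address", "Age")): 5.805}   # Ontology(Address, Age)
table = {frozenset(("Address", "Age")): 4.1}        # Table(Address, Age)

value = discrepancy(ontology, table, attributes)
# (5.805 - 4.1)^2 = 2.907025, divided by 2 * 2 gives about 0.7268.
```

A smaller discrepancy means the pairwise values observed in the table agree more closely with those expected from the ontology.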
Check Constraints
• The algorithm sorts the candidates by their constraint score.
• The algorithm creates rules to prevent the factoring of attributes that receive low constraint scores.
Algorithm
[Diagram: Table Zones, Identify Structure, Record Patterns, Match Attributes, Genealogical Ontology, Check Constraints]
Final Remarks
The algorithm produces:
1. Record Patterns
   • Attributes for each record
   • Geometry for each record
2. Attribute mappings from the table to the ontology.
Final Remarks
• Given extracted values for the information written by hand, the process can extract the records into an XML file.
• Individuals can then query the XML files and index back into the original microfilm images.
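As a rough illustration of the XML step, the sketch below writes records with their cell geometry using Python's standard `xml.etree.ElementTree` module, then queries them back. The element and attribute names (`records`, `record`, `field`, `image`), the image identifier, and the sample record are all hypothetical, since the slides do not describe the actual XML layout.

```python
import xml.etree.ElementTree as ET


def records_to_xml(records, image_id):
    """Serialize records; each field keeps the geometry of its source cell."""
    root = ET.Element("records", {"image": image_id})
    for record in records:
        record_el = ET.SubElement(root, "record")
        for attribute, (value, (x, y, w, h)) in record.items():
            field = ET.SubElement(record_el, "field", {
                "attribute": attribute,
                "x": str(x), "y": str(y),
                "width": str(w), "height": str(h),
            })
            field.text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical record: attribute -> (extracted value, (x, y, width, height)).
records = [{
    "Name": ("Mary Smith", (10, 40, 120, 18)),
    "Age": ("34", (130, 40, 30, 18)),
}]
xml_text = records_to_xml(records, "film-0001.tif")

# Query the XML and recover the cell position in the original image.
root = ET.fromstring(xml_text)
age_fields = [f for f in root.iter("field") if f.get("attribute") == "Age"]
ages = [f.text for f in age_fields]
age_position = (int(age_fields[0].get("x")), int(age_fields[0].get("y")))
```

Storing the geometry next to each value is what would let a query result point back to a region of the microfilm image.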