The way from PDF documents to XML files
A brief overview of the OCR process and the XML markup
Christiana Klingenberg & Donat Agosti
workflow (level 1)
• OCR (ABBYY FineReader): reading the PDF document, dividing the text into blocks, building training files, orthography check
• XML markup (GoldenGATE): FAT / LSID, treatments
• document processing
OCR – ABBYY FineReader
Considerations
• building training files for each typeface pattern (e.g. for each journal)
• marking the blocks in logical reading order
• recognizing special characters [[worker]], [[queen]], [[male]], [[soldier]]
• orthography check
• saving options
• problems
typeface pattern
• 1804. Carolum Reichard, Brunsviga.
• 1861. Journal of the Proceedings of the Linnean Society of London, Zoology
• 1921. Annales de la Société Entomologique de Belgique
• 2005. Proceedings of the California Academy of Sciences
marking the blocks
[figure: two example pages with numbered text blocks, 1 to 4 and 1 to 7]
marking the blocks in a logical order to get a readable XML document
blocks marked in a logical sequence, “clean” HTML:
Vespa. 263 emargina-ta. 50. V. nigra thorace maculata, abdomine fasciis quinque prima antice emarginata, Vespa emarginata. Ent. Syst. 2. 267. 51. * Habitat in Germania Dom Smidt. simplex 51. V. nigra clypeo thoracis margine antico ab-dominisque fasciis quinque simplicibus flavis. Ent. Syst. 2, 267. 52. * Habitat Kiliae. parietina. 52. V. nigra clypeo thoraceque maculatis, abdomi-ne fasciis supra quinque, subtus duabus flavis. Ent. Syst, 2. 268. 53. * Panz. Fn. Germ. 49. tab. 24. Habitat Kiliae.
whole text marked in one block, “dirty” HTML:
Vespa. 263 50. V. nigra thorace maculata, abdomine fasciis emargina-quinque prima antice emarginata, ta. Vespa emarginata. Ent. Syst. 2. 267. 51. * Habitat in Germania Dom Smidt. 51. V. nigra clypeo thoracis margine antico ab- simplex. dominisque fasciis quinque simplicibus flavis. Ent. Syst. 2. 267. 52. * Habitat Kiliae. 52. V. nigra clypeo thoraceque maculatis, abdomi- parietina. ne fasciis supra quinque, fubtus duabus flavis. Ent. Syst, 2. 268. 53. Panz. Fn. Germ. 49. tab. 24. Habitat Kiliae.
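Why block order matters can be illustrated with a minimal sketch. It does not use any ABBYY FineReader interface; the `Block` class and its `order` field are hypothetical stand-ins for the reading-order numbers an operator assigns to each marked block. The text fragments are taken from the sample page.

```python
from dataclasses import dataclass


@dataclass
class Block:
    order: int  # reading-order number assigned when marking the block (hypothetical field)
    text: str   # text recognized inside the block


def assemble(blocks):
    """Join the block texts in their assigned reading order, one block per line."""
    return "\n".join(b.text for b in sorted(blocks, key=lambda b: b.order))


# Blocks as they might arrive, not yet in reading order.
blocks = [
    Block(3, "Habitat in Germania Dom Smidt."),
    Block(1, "Vespa. 263"),
    Block(2, "50. V. nigra thorace maculata,"),
]

page_text = assemble(blocks)
print(page_text)
```

If the whole page is marked as a single block, there is nothing to reorder, and marginal names end up interleaved with the body text as in the “dirty” sample.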
special characters
[[worker]] [[soldier]] [[queen]] [[male]]
[[…]] = not recognizable
it is not possible to force the ABBYY pattern editor to re-read certain characters!
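Because the caste symbols are written as bracketed placeholders, they are easy to locate later in the OCR output. The following is a small sketch of how that could be done with a regular expression; the function name `caste_tokens` is an assumption made for illustration and is not part of ABBYY FineReader or GoldenGATE.

```python
import re
from collections import Counter

# Matches the four caste placeholders used in the OCR output.
CASTE_TOKEN = re.compile(r"\[\[(worker|queen|male|soldier)\]\]")


def caste_tokens(line):
    """Return the caste names found in a line, in order of appearance."""
    return CASTE_TOKEN.findall(line)


line = "(T) bicuspis Emery 1900:268 [[worker]] [[male]] Madagascar [15]"
found = caste_tokens(line)
counts = Counter(found)
print(found, dict(counts))
```

Plain bracketed numbers such as `[15]` are not matched, since the pattern requires doubled brackets around one of the four caste names.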
orthography check / problems
• additional dictionaries: “anty_species”, “anty_glossary”, (“anty_Chris”)
• Latin dictionary?
• geographic names dictionary?
• misspelled taxa (incl. species names beginning with CAPITALS)
• available training files for different typeface patterns for ABBYY (community)
• species dictionaries for different groups (e.g. plants, beetles, birds) (community) (could be used as lexicon in GoldenGATE)
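The problem of species names beginning with capitals can be sketched as a simple dictionary-driven correction. This is only an illustration under stated assumptions: `KNOWN_GENERA` is a hypothetical one-entry genus lexicon, and the rule (lowercase a capitalized word that directly follows a known genus) is not the method used by ABBYY FineReader or GoldenGATE.

```python
import re

# Hypothetical genus lexicon; a real one would come from a species dictionary.
KNOWN_GENERA = {"Vespa"}


def lowercase_epithets(text, genera=KNOWN_GENERA):
    """Lowercase a capitalized word that directly follows a known genus name."""
    genus_alt = "|".join(re.escape(g) for g in sorted(genera))
    pattern = re.compile(rf"\b({genus_alt}) ([A-Z][a-z]+)\b")
    return pattern.sub(lambda m: f"{m.group(1)} {m.group(2).lower()}", text)


fixed = lowercase_epithets("Habitat Vespa Emarginata. Ent. Syst. 2. 267.")
print(fixed)
```

Words that do not follow a genus from the lexicon are left untouched, so abbreviations such as "Ent. Syst." are not affected.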
saving options
(T) australis Forel = parallela
(T) bequaerti Forel = schultzei
(T) bicolor (Clark) * = turneri
(T) bidentata Brown n. sp. [[worker]] Philippines [13]
(T) bicuspis Emery 1900:268 [[worker]] [[male]] Madagascar [15]
boliviana Santschi = sinuata
(P) brevidentata Wheeler — cribrinodis
(T) brevinodis Santschi = cribrinodis
(?) brunnipes (Clark) * 1938:361 [[worker]] S Australia: Reevesby I. [16]
(T) cephalotes Viehmeyer = parallela
(T) ceylonensis Donisthorpe = parallela
cineracea Forel = punctata
GoldenGATE: XML markup
• FAT / attribute taxon names
• editing species names (beginning with lower case letters, if not recognized as a genus)
• marking additional taxa that were not recognized (without the author; the author is added during LSID referencing)
• edit annotations (improving the tool)
GoldenGATE: XML markup
• LSID referencing
• upload of new taxonomic names (quality control?)
• same taxon described by two authors? In case of doubt, which one?
Establishing “taxon format” rules in accordance with the ICZN for taxon upload: “Genus (SubGenus) species subspecies variety” (in most cases this requires editing the taxa beforehand, during the OCR process or in GoldenGATE)
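The “Genus (SubGenus) species subspecies variety” pattern can be sketched as a small formatter. The `TaxonName` class and its field names are assumptions made for illustration; they do not come from GoldenGATE or from any upload interface, and the sketch only normalizes capitalization and ordering.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaxonName:
    genus: str
    species: str
    subgenus: Optional[str] = None
    subspecies: Optional[str] = None
    variety: Optional[str] = None

    def formatted(self) -> str:
        """Build 'Genus (Subgenus) species subspecies variety', skipping empty ranks."""
        parts = [self.genus.capitalize()]
        if self.subgenus:
            parts.append(f"({self.subgenus.capitalize()})")
        parts.append(self.species.lower())
        if self.subspecies:
            parts.append(self.subspecies.lower())
        if self.variety:
            parts.append(self.variety.lower())
        return " ".join(parts)


simple = TaxonName(genus="VESPA", species="Emarginata").formatted()
full = TaxonName("genus", "species", "subgenus", "subspecies", "variety").formatted()
print(simple)
print(full)
```

Only the genus and subgenus are capitalized; all lower ranks are written in lower case, which also corrects species names that were capitalized in the source.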
GoldenGATE: treatment markup
• definitions of treatment options, especially: catalogue entry, synopsis, citation, reference group
• suggestions for simplifying the treatment markup: journal-specific analyzers?
• treatment markup during the “paginator” step and subSubSection markup afterwards?
GoldenGATE: TaxonX
• TaxonX validation: in GoldenGATE (no need for Oxygen or XMLSpy)
• TaxonX – MODS: what about books?
GoldenGATE: considerations
• new definitions of markup levels
• LSIDs, citations (DOIs)
• community: “markup server”, integrating specialists for special groups or markup levels
Error prevention:
• in case of doubt, consult the original PDF (taxa), especially when working with “dirty” HTML
expenditure of time
• OCR: average of 5.63 min / page; depends on the typeface pattern and on the availability of a training file for that pattern
• GoldenGATE: average of 8.18 min / page (tx1)
• average time also includes debugging and error search
• depends on number of taxa and treatments
• time will decrease as GoldenGATE is continually improved and helpful tools are developed
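The two averages can be combined into a rough per-document estimate. The sketch below simply adds the reported OCR and GoldenGATE figures and scales by page count; the function name and the 20-page example are assumptions for illustration, and the real time will vary with typeface, training files, and the number of taxa and treatments.

```python
OCR_MIN_PER_PAGE = 5.63     # reported OCR average
MARKUP_MIN_PER_PAGE = 8.18  # reported GoldenGATE average


def estimate_minutes(pages):
    """Rough total processing time in minutes for a document of the given length."""
    return pages * (OCR_MIN_PER_PAGE + MARKUP_MIN_PER_PAGE)


total = estimate_minutes(20)
print(f"{total:.1f} min, about {total / 60:.1f} h")
```

At these averages one page takes about 13.8 minutes in total, so a hypothetical 20-page article would need roughly 276 minutes, or a little over four and a half hours.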