
The way from pdf-documents to xml-files

A brief overview of the OCR process and the XML markup. Christiana Klingenberg & Donat Agosti. Workflow: OCR (ABBYY FineReader), covering reading the PDF document, dividing the text into blocks, building training files, and the orthography check.



Presentation Transcript


  1. The way from pdf-documents to xml-files: a brief overview of the OCR process and the XML markup. Christiana Klingenberg & Donat Agosti

  2. workflow

  3. OCR (ABBYY FineReader): reading the PDF document, dividing the text into blocks; building training files; orthography check. XML markup (GoldenGATE): workflow (level 1); FAT / LSID; treatments; document processing

  4. OCR – ABBYY FineReader: considerations
     • building training files for each typeface pattern (e.g. for each journal)
     • marking the blocks in logical reading order
     • recognizing special characters: [[worker]], [[queen]], [[male]], [[soldier]]
     • orthography check
     • saving options
     • problems

  5. typeface pattern
     • 1804. Carolum Reichard, Brunsviga.
     • 1861. Journal of the Proceedings of the Linnean Society of London, Zoology
     • 1921. Annales de la Société Entomologique de Belgique
     • 2005. Proceedings of the California Academy of Sciences

  6. marking the blocks
     [figure: scanned pages with numbered text blocks]
     marking the blocks in a logical order to get a readable XML document

  7. blocks marked in a logical sequence, "clean" html:
     Vespa. 263 emargina-ta. 50. V. nigra thorace maculata, abdomine fasciis quinque prima antice emarginata, Vespa emarginata. Ent. Syst. 2. 267. 51. * Habitat in Germania Dom Smidt. simplex 51. V. nigra clypeo thoracis margine antico ab-dominisque fasciis quinque simplicibus flavis. Ent. Syst. 2, 267. 52. * Habitat Kiliae. parietina. 52. V. nigra clypeo thoraceque maculatis, abdomi-ne fasciis supra quinque, subtus duabus flavis. Ent. Syst, 2. 268. 53. * Panz. Fn. Germ. 49. tab. 24. Habitat Kiliae.

     whole text marked in one block, "dirty" html:
     Vespa. 263 50. V. nigra thorace maculata, abdomine fasciis emargina-quinque prima antice emarginata, ta. Vespa emarginata. Ent. Syst. 2. 267. 51. * Habitat in Germania Dom Smidt. 51. V. nigra clypeo thoracis margine antico ab- simplex. dominisque fasciis quinque simplicibus flavis. Ent. Syst. 2. 267. 52. * Habitat Kiliae. 52. V. nigra clypeo thoraceque maculatis, abdomi- parietina. ne fasciis supra quinque, fubtus duabus flavis. Ent. Syst, 2. 268. 53. Panz. Fn. Germ. 49. tab. 24. Habitat Kiliae.

  8. special characters
     [[worker]], [[soldier]], [[queen]], [[male]]
     [[…]] = not recognizable
     it is not possible to force the ABBYY pattern editor to re-read certain characters!
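A minimal sketch of how the bracketed placeholders could be picked out of an OCR line in a later processing step. The function name `find_caste_markers` and the idea of treating the four names from the slide as a closed set are assumptions made for illustration; neither is part of ABBYY FineReader or GoldenGATE.

```python
import re

# Placeholder names as they appear on the slide.
# Treating them as a closed set is an assumption of this sketch.
KNOWN_MARKERS = {"worker", "queen", "male", "soldier"}

# Matches [[name]] but not single-bracket references such as [15].
MARKER_RE = re.compile(r"\[\[([^\[\]]+)\]\]")

def find_caste_markers(line):
    """Return (known, unknown) lists of [[...]] placeholder names in line."""
    known, unknown = [], []
    for name in MARKER_RE.findall(line):
        (known if name in KNOWN_MARKERS else unknown).append(name)
    return known, unknown

line = "(T) bicuspis Emery 1900:268 [[worker]] [[male]] Madagascar [15]"
print(find_caste_markers(line))
```

Anything returned in the second list would be a candidate for manual checking against the original PDF.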

  9. orthography check / problems
     • additional dictionaries: "anty_species", "anty_glossary", ("anty_Chris")
     • Latin dictionary?
     • geographic names dictionary?
     • misspelled taxa (incl. species names beginning with CAPITALS)
     • available training files for different type patterns for ABBYY (community)
     • species dictionaries for different groups (e.g. plants, beetles, birds, etc.) (community) (could be used as lexicon in GoldenGATE)

  10. saving options
      (T) australis Forel = parallela
      (T) bequaerti Forel = schultzei
      (T) bicolor (Clark) * = turneri
      (T) bidentata Brown n. sp. [[worker]] Philippines [13]
      (T) bicuspis Emery 1900:268 [[worker]] [[male]] Madagascar [15]
      boliviana Santschi = sinuata
      (P) brevidentata Wheeler — cribrinodis
      (T) brevinodis Santschi = cribrinodis
      (?) brunnipes (Clark) * 1938:361 [[worker]] S Australia: Reevesby I. [16]
      (T) cephalotes Viehmeyer = parallela
      (T) ceylonensis Donisthorpe = parallela
      cineracea Forel = punctata

  11. workflow

  12. GoldenGATE: XML markup
      • FAT / attribute taxon names
      • editing species names (beginning with lower-case letters, if not recognized as a genus)
      • marking of additional, unrecognized taxa (without the author; the author will be given during LSID referencing)
      • edit annotations (improving the tool)

  13. GoldenGATE: XML markup
      • LSID referencing
      • upload of new taxonomic names (quality control?)
      • same taxon described by two authors? In case of doubt, which one?
      Establishing "taxon format" rules in accordance with the ICZN for taxon upload: "Genus (SubGenus) species subspecies variety" (in most cases this requires prior editing of the taxa, during the OCR process or in GoldenGATE)
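The "taxon format" rule quoted above could be applied mechanically along these lines. This is only a sketch: `format_taxon` is a hypothetical helper, not a GoldenGATE function, and it assumes the name has already been split into its parts. The names `aus`, `xus`, `bus`, and `cus` in the second call are made-up placeholders.

```python
def format_taxon(genus, species, subgenus=None, subspecies=None, variety=None):
    """Build 'Genus (SubGenus) species subspecies variety' from its parts.

    Genus and subgenus get an initial capital; all epithets are lower-cased,
    which also repairs species names that the OCR returned with capitals.
    """
    parts = [genus.strip().capitalize()]
    if subgenus:
        parts.append("(" + subgenus.strip().capitalize() + ")")
    for epithet in (species, subspecies, variety):
        if epithet:
            parts.append(epithet.strip().lower())
    return " ".join(parts)

# Name taken from the OCR sample on slide 7, with a capitalized epithet.
print(format_taxon("Vespa", "Emarginata"))

# Hypothetical placeholder names, showing subgenus and subspecies handling.
print(format_taxon("aus", "Bus", subgenus="xus", subspecies="cus"))
```

Whether a capitalized epithet is an OCR error or a deliberate historical spelling still has to be decided by the editor; the sketch simply normalizes it.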

  14. GoldenGATE: treatment markup
      • definitions of treatment options, especially: catalogue entry, synopsis, citation, reference group
      • suggestions for simplifying the treatment markup: journal-specific analyzers?
      • treatment markup during the "paginator" step and subSubSection markup afterwards?

  15. GoldenGATE: TaxonX
      • TaxonX validation: in GoldenGATE (no need for Oxygen or XMLSpy)
      • TaxonX – MODS: what about books?

  16. GoldenGATE: considerations
      • new definitions of markup levels
      • LSIDs, citations (DOIs)
      • community: "markup server", integrating specialists for special groups or markup levels
      Error prevention:
      • in case of doubt consult the original PDF (taxa), especially when working with "dirty" html

  17. expenditure of time
      • OCR: average of 5.63 min/page; depends on the typeface pattern and on whether a training file is available for it
      • GoldenGATE: average of 8.18 min/page (tx1)
      • the average times also include debugging and error search
      • depends on the number of taxa and treatments
      • time will decrease as GoldenGATE is continually improved and helpful tools are developed
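As a rough worked example, the two per-page averages can be combined into an estimate for a whole document. The sketch assumes that OCR and markup times simply add up and scale linearly with the page count, which the slide itself qualifies (typeface, number of taxa and treatments). The 20-page document is a made-up input.

```python
OCR_MIN_PER_PAGE = 5.63         # average from the slide
GOLDENGATE_MIN_PER_PAGE = 8.18  # average from the slide (tx1)

def estimate_minutes(pages):
    """Estimated total processing time in minutes, assuming linear scaling."""
    return pages * (OCR_MIN_PER_PAGE + GOLDENGATE_MIN_PER_PAGE)

pages = 20  # hypothetical document length
total = estimate_minutes(pages)
print(f"{pages} pages: about {total:.1f} min ({total / 60:.1f} h)")
```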

  18. Time development GoldenGATE
