
Moving Targets: Integrating semistructured data

Moving Targets: Integrating semistructured data. Pepé Ciardelli & Marc Geoffroy, Botanic Garden and Botanical Museum Berlin-Dahlem, Dept. of Biodiversity Informatics. TDWG 2000, Bratislava.


Presentation Transcript


  1. Moving Targets: Integrating semistructured data
     Pepé Ciardelli & Marc Geoffroy
     Botanic Garden and Botanical Museum Berlin-Dahlem, Dept. of Biodiversity Informatics
     TDWG 2000, Bratislava

  2. The project
     • Euro+Med Plantbase - on-line database for the vascular plants of Europe and the Mediterranean region; http://ww2.bgbm.org/EuroPlusMed
     • 2722 pages in Microsoft Word – human generated
     • 11,755 accepted taxa
     • 18,119 synonyms
     • Distribution tables

  3. The data

  4. The actors
     • Senior taxonomist – his baby; knows what a <TAB> is
     • Junior taxonomist – technically sophisticated; proofs the data post-import
     • Programmer – taxonomically sophisticated
     • Programmer – taxonomically unsophisticated; naively believes every case can be caught with code

  5. “Moving targets”
     • Problems of notation, e.g. the month qualifier after a year:
       • Files 1-10: 1865 (Mar.-Jun.)
       • File 11: 1902 [Oct.]
     • Taxonomic rules can vary
     • One fine morning, species can be included in species
     • Collective species notation significantly different
     • Complex groups like Hieracium require extensive notation
     • Communication about taxonomic “moving targets” is generally good; about notation changes, not so much
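The notation drift on this slide can be absorbed by a tolerant pattern. The following is only a minimal sketch, not the project's actual code: the function name `parse_date` and the pattern are assumptions made for illustration, and only the two notations quoted on the slide are known to occur.

```python
import re

# Assumed pattern (not from the original project): a four-digit year,
# optionally followed by a month qualifier in round OR square brackets,
# covering both "1865 (Mar.-Jun.)" and "1902 [Oct.]".
# Note: it does not check that the opening and closing brackets match.
DATE_RE = re.compile(
    r"^(?P<year>\d{4})"
    r"(?:\s*[\(\[](?P<months>[^\)\]]+)[\)\]])?$"
)

def parse_date(text):
    """Return (year, months) or None when the notation is not recognised."""
    match = DATE_RE.match(text.strip())
    if match is None:
        return None  # leave it for a human to look at
    return int(match.group("year")), match.group("months")
```

Returning `None` rather than guessing keeps unrecognised notations visible, so they can be collected for manual review instead of being silently mis-parsed.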

  6. A typical dilemma

  7. Some more weirdos

  8. The moving targets challenge
     • Import 18 files, generated by hand, over time
     • Build error-tolerant software that constantly evolves and improves
     • Know when to say “enough!” – what is the best use of limited human resources?

  9. Added wrinkle
     • Taxonomist does not review his own work to confirm it’s been parsed correctly
     • Absolutely essential: a junior taxonomist intimately familiar not only with the content, but also with the senior taxonomist
     • Anticipate problems based on experience

  10. The re-import
     • For final revisions, database exported back into original Word format
     • Senior taxonomist’s fine eye for detail confirmed that initial imports were successful
     • Re-import presents opportunity to use most efficient workflow based on experience

  11. Hard-won lessons
     • Put data into XML format to catch “fatal errors” – i.e. typos that deviate from rudimentary markup
     • Identify records likely to cause errors, capture for manual check post-import
     • Run additional parsing software after the initial import
     • Key realization: really not so many exceptions after all
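The first two lessons could be sketched roughly as follows. This is an illustration under stated assumptions, not the project's implementation: the element names `checklist`, `taxon` and `name`, the helper `looks_odd`, and the heuristic it applies are all hypothetical. Only `xml.etree.ElementTree` from the Python standard library is used.

```python
import xml.etree.ElementTree as ET

def check_records(xml_text, is_suspect):
    """Split record names into (clean, suspect); raise on malformed markup."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        # "Fatal error": the markup itself is broken, e.g. an unclosed tag.
        raise ValueError(f"malformed markup: {err}") from err
    clean, suspect = [], []
    for record in root.iter("taxon"):  # "taxon" is a hypothetical element name
        name = record.findtext("name", default="")
        (suspect if is_suspect(name) else clean).append(name)
    return clean, suspect

def looks_odd(name):
    # Illustrative heuristic only: flag names containing "?" or brackets.
    return any(ch in name for ch in "?[]")
```

The XML parser catches structural typos before anything reaches the database, while the caller-supplied predicate produces the short list of records to check by hand after import.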

  12. The Taxonomic Web Editor
     • Built to edit checklists stored in a Berlin Model DB
     • What’s missing: knowing where to look for errors
     • Based on experience, programmers provide taxonomists with (blessedly short) lists of suspect taxa

  13. A.nemorosum in Word

  14. A.nemorosum in Web Editor

  15. “In”-reference parser
     • After import, all “in”-refs marked as “preliminary”
     • Parse what fits regular expression patterns
     • Parse the rest by hand
     • Results:
       • 14,303 “in”-references
       • Only 26 unresolved, all of which were typos, not unmatched patterns
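The workflow on this slide (everything starts as “preliminary”, patterns handle what they can, people handle the rest) might look like the sketch below. The reference shape `in <author>, <title> <volume>: <pages>. <year>` is an assumption chosen for illustration; the actual patterns used in the project are not given in the slides, and `parse_in_refs` is a hypothetical name.

```python
import re

# Assumed shape of an "in"-reference (hypothetical, for illustration):
#   "in <author>, <title> <volume>: <pages>. <year>"
IN_REF_RE = re.compile(
    r"^in\s+(?P<author>[^,]+),\s+(?P<title>.+?)\s+"
    r"(?P<volume>\d+):\s*(?P<pages>\d+(?:-\d+)?)\.\s+(?P<year>\d{4})$"
)

def parse_in_refs(refs):
    """Return (parsed, by_hand); unmatched references stay "preliminary"."""
    parsed, by_hand = [], []
    for ref in refs:
        match = IN_REF_RE.match(ref.strip())
        if match:
            parsed.append({**match.groupdict(), "status": "parsed"})
        else:
            # Not matched: keep the raw text and the preliminary flag
            # so that a person can resolve it later.
            by_hand.append({"raw": ref, "status": "preliminary"})
    return parsed, by_hand
```

Keeping the raw string for unmatched references means nothing is lost: a typo and a genuinely new pattern both end up in the same short manual queue.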

  16. “In”-reference parser drawback
     • Regular expressions are not everyone’s cup of tea

  17. Specific solution to a specific problem
     • Complex, somewhat quirky system of notation developed over decades
     • Close relationship between taxonomists and programmers
     • Limited human resources
     • Re-usability not a goal of the project
     • The right mix of automation and data massaging

  18. Acknowledgments
     • Mattfeld-Quadbeck Foundation
     • Association of Friends of the BGBM
     • Global Biodiversity Information Facility (GBIF)
