
Hildelies Balk, IMPACT Project Director, KB National Library of the Netherlands

Overview of the IMPACT Project. Hildelies Balk, IMPACT Project Director, KB National Library of the Netherlands. Twitter: @impactocr, #impactproject. Overview of this presentation: challenges in digitisation of historical printed text; IMPACT project and objectives; IMPACT achievements.


Presentation Transcript


  1. Overview of the IMPACT Project Hildelies Balk, IMPACT Project Director, KB National Library of the Netherlands Twitter: @impactocr, #impactproject

  2. Overview of this presentation • Challenges in digitisation of historical printed text • IMPACT project and objectives • IMPACT Achievements • IMPACT Centre of Competence • How can we work together with YOU

  3. KB Digital Library Programme • Goal: Offer everyone access to everything published in and about the Netherlands through the internet • 2013: 10% of the publications published in and about the Netherlands available in digital form (60 M pages by KB, 13 M pages by third parties) • Offer our full text collections in such a way that they can be immediately used by researchers • Example projects: • Historical Newspapers – http://kranten.kb.nl • Dutch Parliamentary Papers – http://www.statengeneraaldigitaal.nl/ • Early European Books (ProQuest), 18th and 19th century books (Google), other projects – http://www.kb.nl/hrd/digitalisering/index-en.html • Timeframe covered: 1618-1995

  4. So we offer this….

  5. With this message ….

  6. Damaged pages, bleed through, difficult layout, historic fonts … OCR problems

  7. Warping of paper (due to humidity)

  8. Bleed through & Shine through • Bad printing: blurred, broken, faded characters

  9. Gothic print types

  10. Annotations in the text

  11. Complicated layout

  12. Language Challenges: Spelling variants, orthographical variants, inflected forms…and more Historical variants of the Dutch word ‘wereld’ (world): werelt weerelt wereld weerelds wereldt werelden weereld werrelts waerelds weerlyt wereldts vveerelts waereld weerelden waerelden weerlt werlt werelds sweerels zwerlys swarels swerelts werelts swerrels weirelts tsweerelds werret vverelt werlts werrelt worreld werlden wareld weirelt weireld waerelt werreld werld vvereld weerelts werlde tswerels werreldts weereldt wereldje waereldje weurlt wald weëled
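
The variant problem on this slide can be sketched as a lookup from historical spellings to a modern lemma. This is only a minimal illustration, not the IMPACT lexicon itself: the names `HISTORICAL_LEXICON`, `to_modern_lemma` and `expand_query` are hypothetical, and the lexicon holds just a handful of the variants listed above.

```python
# Hypothetical mini-lexicon: a few of the variants from the slide,
# all mapped to the modern Dutch lemma "wereld" (world).
VARIANTS_OF_WERELD = [
    "werelt", "weerelt", "wereldt", "weereld",
    "waereld", "werlt", "vvereld", "weirelt",
]

HISTORICAL_LEXICON = {variant: "wereld" for variant in VARIANTS_OF_WERELD}


def to_modern_lemma(token, lexicon=HISTORICAL_LEXICON):
    """Return the modern lemma for a historical spelling, or the token unchanged."""
    return lexicon.get(token.lower(), token)


def expand_query(lemma, lexicon=HISTORICAL_LEXICON):
    """Return the lemma plus every known historical variant, sorted alphabetically."""
    return sorted({lemma} | {v for v, target in lexicon.items() if target == lemma})


print(to_modern_lemma("Weerelt"))
print(expand_query("wereld"))
```

With such a mapping, a search for the modern word can be expanded to all attested historical spellings, so a user does not need to know them in advance.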

  13. Institutional Challenge: lack of knowledge and expertise → inefficiency

  14. Answering the challenges – IMPACT IMPACT – Improving Access to Text (2008-2011) • Large-scale integrating research project • Consortium of 26 partners • Good mix of public and private partners • Users, researchers and industry work together to find solutions • Each established in a large international network • Coordinated by the National Library of the Netherlands (KB) • Co-funded by EU (FP7 ICT Work Programme) • From 2012: sustainable Centre of Competence with alternative resources

  15. IMPACT objectives: Significantly improve mass digitisation of historical printed text by: • Innovating OCR software and language technology → tools for each step in the digitisation workflow from scan to publication • Sharing expertise and building capacity across Europe • Ensuring that tools and services will be sustained after the end of the project

  16. IMPACT Achievements: summary • On market: Improved commercial OCR • Ready for real life testing: • Adaptive OCR engine • Tools for OCR correction with volunteer involvement • Computational lexica for nine languages • Digitisation Framework with evaluation tools and dataset • Knowledge bank with guidelines and learning resources • Service for print space recognition • For future development: • Novel Approaches to preprocessing, OCR and post correction • Tools for lexicon building • Added value: Unique network bringing together experts from different communities • Centre of Competence for digitisation to start 1 January 2012

  17. IMPACT Achievements: Examples

  18. Preprocessing: Novel Approaches to image enhancement (before/after example images) • Border removal and dewarping by NCSR and USAL

  19. OCR: Improved commercial engine on market: ABBYY FR10 • Historic European font: FRE10 recognition of historic fonts: • 25% more accurate than FRE9 • 38% more accurate than FR XIX

  20. OCR correction: two effective tools ready for implementation • Both make use of volunteer involvement • CONCERT by IBM: collaborative correction feeds back into Adaptive OCR → promising pilots by libraries • LMU post-correction tool based on language input → pilot to start soon

  21. Language: lexica for nine languages • Correction of Long S with IMPACT lexicon for historical Dutch
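
One way to picture lexicon-based long s correction: OCR engines often read the long s (ſ) as "f", so a token that is not in the lexicon can be retried with some of its "f" characters swapped for "s". The sketch below is an assumption about how such a check might look, not the IMPACT implementation; `LEXICON` is a hypothetical four-word stand-in for a real historical Dutch lexicon, and `correct_long_s` is an illustrative name. It handles lower-case tokens only.

```python
from itertools import product

# Hypothetical stand-in for a historical Dutch lexicon.
LEXICON = {"schip", "staat", "straf", "wijsheid"}


def correct_long_s(token, lexicon=LEXICON):
    """Try f -> s substitutions and return the unique lexicon match, if any.

    Tokens already in the lexicon, or without an "f", are returned unchanged.
    If no candidate or more than one candidate is found, the token is kept.
    """
    if token in lexicon or "f" not in token:
        return token
    positions = [i for i, ch in enumerate(token) if ch == "f"]
    matches = set()
    # Enumerate every combination of keeping "f" or replacing it with "s".
    for choice in product("fs", repeat=len(positions)):
        chars = list(token)
        for pos, ch in zip(positions, choice):
            chars[pos] = ch
        candidate = "".join(chars)
        if candidate in lexicon:
            matches.add(candidate)
    return matches.pop() if len(matches) == 1 else token


for word in ["fchip", "ftraf", "wijfheid", "fiets"]:
    print(word, "->", correct_long_s(word))
```

Trying all combinations rather than replacing every "f" matters for words such as "straf", where only the first "f" in the OCR output "ftraf" is a misread long s.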

  22. Post Processing: Print space recognition • Functional Extension Parser by UIBK • Recognition of the structure of book pages • Enrichment of OCR results with structural information

  23. Evaluation: IMPACT Framework • Modular and transparent method for evaluating specific workflows
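
The slide does not say which metrics the framework uses. As a generic illustration of what evaluating a workflow against Ground Truth can involve, the sketch below computes a character error rate from the Levenshtein edit distance. The function names are hypothetical and this is a common textbook measure, not a description of the IMPACT Framework itself.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


def character_error_rate(ocr_text, ground_truth):
    """Edit distance divided by the length of the ground-truth text."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ocr_text, ground_truth) / len(ground_truth)


print(character_error_rate("werelt", "wereld"))
```

Running the same measure over the output of two different workflows on the same ground-truth pages gives a simple, comparable score for each.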

  24. Evaluation: IMPACT Dataset • Over half a million representative pages of digitised historical texts (newspapers, books, pamphlets, typewritten material) from the collections of 11 European libraries, with unique IDs and metadata • Invaluable resource for future research in OCR and language technology.

  25. Centre of Competence in digitisation • New community: Bridges the gap between • content holders with digitisation programmes and • scientific communities in the area of pattern recognition, language technology, image processing • Mission: making Europe’s heritage accessible in digital form • Focus on practical solutions • Provides support in the implementation of the innovative IMPACT solutions for improving access to text • Provides tools and services for further advancement of the State of the Art in the field • Organises Conferences/workshops

  26. How to join • Three levels of membership: • Open (registration): access to forum and part of the content • Basic membership (fee): access to all facilities, reduced fee for conferences • Premium membership (fee): member of the Board, privileges such as free entry to conferences • Want to sign up? • Mail to impact@kb.nl for information on membership • Join us on LinkedIn now • Follow us on Twitter (@impactocr) • Access through www.impact-project.eu

  27. Houston: our ideas on working together • Low hanging fruit: • Sharing open source solutions • Evaluating them in our framework with Ground Truth • Building a good set of use cases for all available tools • Sharing case studies on digitisation problems • Addressing the big remaining challenge: • Getting the tools to work in real life environments • Bridging the gap between technical solutions and content holders' workflows

  28. Questions? • impact@kb.nl • www.impact-project.eu • Thank you!
