
Current Challenges in Digitization

Current Challenges in Digitization. Ivo IOSSIGER, President Director General. Digitization brings documents online. Books, magazines and newspapers are precious collections of knowledge and reference. Digitization allows users to search and access valuable content through the most recent online technologies.


Presentation Transcript


  1. Current Challenges in Digitization • Ivo IOSSIGER, President Director General

  2. Digitization Brings Documents Online • Books, magazines and newspapers are precious collections of knowledge and reference. • Digitization allows users to search and access valuable content through the most recent online technologies. • The new trend: "What is not online does not exist!" • Conversion path: paper content → digital images → digital files → digital content.

  3. The Problem • Content and knowledge are imprisoned between the pages of books. • Millions of documents are written in various languages. • How can a book be digitized while preserving the book or the unique copy?

  4. Challenges of Books • Handle the large variety of page formats, paper types, binding types, and soft or rigid covers. • Turn all pages of a book one by one and present them to a camera. • Accurately reproduce the appearance of every page: page layout, calligraphy, images.

  5. Challenges of Pages • Enhancing images of pages: remove typical artifacts (borders, split pages); remove print showing through from the opposite page (unbleed); correct page rotation (deskew); correct page curvature of bound documents; enhance contrast and clear the page background.
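
The last step on this slide, enhancing contrast and clearing the page background, can be illustrated with a linear contrast stretch. This is a minimal sketch and not the method of any product named in the presentation; the function name `stretch_contrast`, the thresholds `black_point` and `white_point`, and the sample pixel values are all assumptions chosen for illustration.

```python
def stretch_contrast(image, black_point=50, white_point=200):
    """Linearly map gray levels so black_point -> 0 and white_point -> 255.

    `image` is a list of rows of 8-bit gray values (0 = black, 255 = white).
    Values at or above white_point become pure white, which clears a
    grayish or yellowed page background. The thresholds are hypothetical
    defaults, not values taken from the presentation.
    """
    if not 0 <= black_point < white_point <= 255:
        raise ValueError("need 0 <= black_point < white_point <= 255")
    span = white_point - black_point
    result = []
    for row in image:
        new_row = []
        for value in row:
            # Clip to the chosen range, then rescale to the full 0..255 range.
            clipped = min(max(value, black_point), white_point)
            new_row.append((clipped - black_point) * 255 // span)
        result.append(new_row)
    return result


# Hypothetical 2x3 scan fragment: a dull background with two ink pixels.
page = [
    [190, 205, 210],
    [60, 40, 195],
]
cleaned = stretch_contrast(page)
```

Real page-improvement software would usually estimate the two thresholds from the image histogram instead of fixing them, and would apply deskew and curvature correction before this step.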

  6. Challenges of Text • Recognize text and make it accessible for full text search, copy and paste.
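
Once text has been recognized, full text search typically relies on an index from words to pages. The sketch below assumes the OCR output is already available as plain text per page; `build_index`, `search`, and the sample pages are hypothetical names and data, not part of any product mentioned in the slides.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each lowercased word to the set of page numbers containing it."""
    index = defaultdict(set)
    for page_number, text in pages.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(page_number)
    return index


def search(index, query):
    """Return the sorted page numbers that contain every word of the query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return []
    hits = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(hits)


# Hypothetical OCR output, keyed by page number.
ocr_pages = {
    1: "Digitization brings documents online",
    2: "Books and newspapers are collections of knowledge",
    3: "Online access to books",
}
index = build_index(ocr_pages)
matches = search(index, "online books")
```

Because the index is built from recognized text, any OCR mistake on a page makes the affected words unfindable, which is why recognition accuracy matters for search.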

  7. Challenges of Costs • Equipment and production tools: increasingly powerful and reliable; incur fixed costs. • Human labor: expensive and slow; incurs variable costs. • Logistics and workflow: sophisticated and diverse; incur variable costs. • How to reduce the cost per page?
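
The split into fixed and variable costs suggests a simple cost-per-page model: the fixed equipment cost is spread over the total volume, while labor and logistics are paid for every page. The following is a sketch under that assumption; the function name, its parameters, and every figure in the example are hypothetical and do not come from the presentation.

```python
def cost_per_page(fixed_cost, pages, labor_rate_per_hour, pages_per_hour,
                  logistics_cost_per_page):
    """Estimate the cost per page from fixed and variable components.

    fixed_cost is spread over the whole volume (`pages`); labor and
    logistics are variable costs incurred on every page.
    """
    if pages <= 0 or pages_per_hour <= 0:
        raise ValueError("pages and pages_per_hour must be positive")
    fixed_share = fixed_cost / pages
    labor_share = labor_rate_per_hour / pages_per_hour
    return fixed_share + labor_share + logistics_cost_per_page


# Hypothetical comparison over 5 million pages (all figures invented).
manual = cost_per_page(fixed_cost=20_000, pages=5_000_000,
                       labor_rate_per_hour=30, pages_per_hour=300,
                       logistics_cost_per_page=0.02)
automated = cost_per_page(fixed_cost=200_000, pages=5_000_000,
                          labor_rate_per_hour=30, pages_per_hour=2_000,
                          logistics_cost_per_page=0.02)
```

In this model a higher fixed cost pays off only when the volume is large enough and the throughput per labor hour rises, which matches the slide's closing question about reducing the cost per page.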

  8. The Solution Everybody is Looking for • How to digitize a large number of books as quickly and as cheaply as possible while preserving superior quality? • automatic solution without operator (unattended) • superior productivity beyond humans (faster) • ensure digitization of all pages (reliable) • produce high quality images (reliable) • Increase VOLUME, Keep QUALITY, Decrease PRICE

  9. The Solution of the Past • An operator turns pages all day long ... an endless task • forbidding and tedious task • performance limited by concentration and fatigue • irregular quality depending on the individual operator • conflict between performance and motivation

  10. The Solution of the Future

  11. Technology and Production Challenges • From Paper Content to Digital Content, the stages shown are: Image Scanning, Image Treatment, OCR, Indexing, Structuring, Providing Online. • 4DigitalBooks: Digitizing Line, Page Improver • ABBYY: Recognition Server • Digital Library: Text Search

  12. Challenges of File Formats • Time axis in years relative to "you are here" (0): -17, -16, -11, -9, -8, -4, -1, 0 • Image & Text: 1993 PDF (Acrobat 1); 2001 PDF hidden text (Acrobat 5); 2005 PDF/A; 2008 PDF is an open standard (Acrobat 9) • Image: 1992 JPEG; 2000 JPEG 2000; 1992 TIFF • Text: 1998 XML • Which formats will survive? The most popular and widespread!

  13. Challenges of Storage • Bitmap image, A4 at 300 dpi, uncompressed: grayscale (GS) 8.5 MB, RGB 25.5 MB, black and white (BW) 1.0 MB. The columns GS, RGB and BW give sizes in %, with uncompressed grayscale as 100.

      Format                  Ratio     GS %     RGB %    BW %
      Lossless quality (recommended for multiple edits)
      TIFF (.TIF)             1:1       100      300      12
      TIFF LZW (.TIF)         2:1       ~50      ~150     ~6
      TIFF CCITT G4 (.TIF)    100:1                       ~1
      Loss of quality (not recommended for multiple edits)
      JPEG (.JPG)             5-10:1    ~20-10   ~25-12
      JPEG 2000 (.JP2)        6-12:1    ~16-8    ~20-10

      Archive or reference files are NOT intended for multiple edits. Therefore all these formats are good for long-term preservation.
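
The uncompressed sizes on this slide can be checked from the page dimensions. The sketch below assumes an A4 page of 210 x 297 mm, 8 bits per pixel for grayscale, 24 for RGB and 1 for black and white; the function names are hypothetical. The results land close to the slide's 8.5 MB, 25.5 MB and 1.0 MB, with small differences that depend on rounding and on whether a megabyte is counted as 10^6 or 2^20 bytes.

```python
A4_MM = (210, 297)      # assumed A4 page size in millimetres
MM_PER_INCH = 25.4


def pixel_dimensions(size_mm, dpi):
    """Convert a page size in millimetres to whole pixels at a resolution."""
    return tuple(round(side / MM_PER_INCH * dpi) for side in size_mm)


def uncompressed_bytes(size_mm, dpi, bits_per_pixel):
    """Raw bitmap size in bytes, ignoring file headers and row padding."""
    width, height = pixel_dimensions(size_mm, dpi)
    return width * height * bits_per_pixel // 8


def compressed_bytes(raw_bytes, ratio):
    """Estimated size after compression at `ratio`:1 (e.g. 2 for TIFF LZW)."""
    return raw_bytes / ratio


for label, bits in (("GS", 8), ("RGB", 24), ("BW", 1)):
    raw = uncompressed_bytes(A4_MM, 300, bits)
    print(f"{label}: {raw / 1e6:.2f} MB ({raw / 2**20:.2f} MiB)")

# Example: grayscale page stored as TIFF LZW at the slide's 2:1 ratio.
gray_lzw = compressed_bytes(uncompressed_bytes(A4_MM, 300, 8), 2)
```

Multiplying the per-page figure by the number of pages in a collection gives a first estimate of the storage a digitization project needs for each format.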

  14. Challenges Joining Digital and Paper Born Content • Paper born: PDF Image (large size) is accurate to the original and supports read & print; PDF Text (small size), produced by OCR, is subject to OCR mistakes and supports search and copy & paste; PDF Image over Text (PDF hidden text) combines both. • Digital born: Word Text (small size) and PDF Text (small size) support read & print, search, and copy & paste.

  15. Challenges Selecting Source Material - Microfilm or Digital • Digital media: lifetime 20 years, NEEDS MIGRATION • Analog media (STORAGE): lifetime 500 years • Analog media (REFERENCE): lifetime 500 years, no quality loss • Devices shown: computer output microfilm (COM), microfilm printer, microfilm camera, microfilm scanner, book scanner • Other annotations: possible image enhancement and restoration; 5-10% quality loss

  16. Challenges to Bring Files Online
