From Web Documents to Old Books Works in Progress in Graphics Recognition

From Web Documents to Old Books: Works in Progress in Graphics Recognition
Mathieu Delalandre
Meeting of the Document Analysis Group, Computer Vision Center, Barcelona, Spain
Thursday 23rd November 2006

Plan • Short CV • Vector Graphics Indexing and Retrieval • Dropcap Image Retrieval

Short CV
Laboratories: SCSIT Nottingham, LITIS Rouen, L3i La Rochelle, CVC Barcelona
Personal Information
• Mathieu Delalandre, 32 years old
Academic Degrees
• 1995-1998 Lic.Sc in Electronics, Rouen University, France
• 1998-2001 M.Sc in Industrial Computing, Rouen University, France
Research Periods (Length | Position | Laboratory | Subject)
• 6 months | Master | LITIS | symbol recognition
• 3 ½ years | PhD | LITIS | drawing understanding
• 5 months | Post-doc | SCSIT | vector graphics indexing
• months | Post-doc | L3i | dropcap image retrieval
• 2 months | Contract | LITIS | performance evaluation
• 3 years | Post-doc | CVC | …

Vector Graphics Indexing and Retrieval
What are vector graphics?
Bitmap vs vector graphics: vector graphics are more accurate and lighter
Known vector graphics formats
• AI (Adobe Illustrator)
• SVG (Scalable Vector Graphics)
• WMF (Windows Metafile)
• EPS (Encapsulated PostScript)
• DXF (AutoCAD)
• ClipArt
Examples: EPS plane, WMF pen, clipart cheese
<rect x="400" y="100" width="400" height="200" fill="yellow" stroke="navy" stroke-width="10" />
Applications of vector graphics
• 1982 Computer Aided Design (DXF ‘1982’)
• 1985 Office software (PS ‘1985’, CGM ‘1987’, WMF ‘1993’)
• 1996 Web (PNG ‘1996’, SVG ‘2001’ ..)
Vector graphics are growing on the Web
• SVG 1.0
• SVG is widely used: structured documents [Mong’03], geographic maps [Chen’04], technical drawings [Kang’04]
• 2005 Powerful editors (Inkscape, Webdraw, …)
• Internet Explorer and Mozilla Firefox support SVG

Vector Graphics Indexing and Retrieval
Our key ideas
• Content adaptation: the indexing process must be adapted to the document content
• Structured index: we can improve results by structuring the index
Figure (content adaptation): Doc 1, Doc 2, Doc 3 → Features Extraction → Retrieval
Figure (structured index): graphics objects → indexed objects → index (Model 1, Model 2, Model 3)
Pattern frequency by level
• Level 1: 3.3 28.3
• Level 2: 6.6 10.0 3.3 13.2
• Level 3: 3.3 1.6 3.3 13.2 6.6 6.6
Ranked patterns: adjacency, line, inclusion, square, junction
System overview: it looks like a pattern recognition approach [Doer’98] [Tom’03]

Vector Graphics Indexing and Retrieval
Our approach
Before retrieval, we need to extract features. What are the difficulties?
• Parsing and break-up: parsing gives a set of objects (R1, R3). How to get R2? We need a break-up.
<rect x="400" y="100" width="400" height="200" fill="blue" />
<rect x="650" y="200" width="400" height="200" fill="yellow" />
Figure labels: R1, R2, R3; x11, x12, x21, x22; y11, y12, y21, y22
• Filtering, then junction detection: set of lines → set of broken lines. How to speed up the process? By sorting the bounding boxes.
• You see 5, you have 9: we need a clean-up.
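The parsing and break-up steps of this slide can be illustrated with a minimal sketch. It is not the author's implementation: the function names (`rect_to_segments`, `find_junctions`, `break_up`) are hypothetical, it assumes axis-aligned rectangles only, and it ignores collinear overlaps. Sorting the segments by the left side of their bounding boxes lets the inner loop stop early, which is one simple reading of "sorting the bounding box".

```python
import xml.etree.ElementTree as ET

# The two rectangles shown on the slide, wrapped in an <svg> root.
SVG = """<svg xmlns="http://www.w3.org/2000/svg">
  <rect x="400" y="100" width="400" height="200" fill="blue" />
  <rect x="650" y="200" width="400" height="200" fill="yellow" />
</svg>"""
NS = "{http://www.w3.org/2000/svg}"

def rect_to_segments(rect):
    """Parsing: turn one <rect> into its four border lines."""
    x, y = float(rect.get("x")), float(rect.get("y"))
    w, h = float(rect.get("width")), float(rect.get("height"))
    corners = [(x, y), (x + w, y), (x + w, y + h), (x, y + h)]
    return [(corners[i], corners[(i + 1) % 4]) for i in range(4)]

def bbox(seg):
    (x1, y1), (x2, y2) = seg
    return min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2)

def crossing(a, b):
    """Crossing of a horizontal and a vertical line whose boxes overlap."""
    a_horizontal = a[0][1] == a[1][1]
    b_horizontal = b[0][1] == b[1][1]
    if a_horizontal == b_horizontal:
        return None  # parallel lines: ignored in this sketch
    h, v = (a, b) if a_horizontal else (b, a)
    return (v[0][0], h[0][1])

def find_junctions(segments):
    """Junction detection with a sweep over boxes sorted by their left side."""
    segs = sorted(segments, key=lambda s: bbox(s)[0])
    found = set()
    for i, a in enumerate(segs):
        ax1, ay1, ax2, ay2 = bbox(a)
        for b in segs[i + 1:]:
            bx1, by1, bx2, by2 = bbox(b)
            if bx1 > ax2:
                break  # every later box starts further right
            if by1 > ay2 or by2 < ay1:
                continue  # no vertical overlap
            p = crossing(a, b)
            # A point that ends both lines is a rectangle corner, not a junction.
            if p is not None and not (p in a and p in b):
                found.add(p)
    return found

def break_up(segments, junctions):
    """Break-up: split every line at the junctions lying on it."""
    pieces = []
    for seg in segments:
        x1, y1, x2, y2 = bbox(seg)
        on_seg = [p for p in junctions if x1 <= p[0] <= x2 and y1 <= p[1] <= y2]
        points = sorted(set([seg[0], seg[1]] + on_seg))
        pieces += list(zip(points, points[1:]))
    return pieces

root = ET.fromstring(SVG)
segments = [s for r in root.iter(NS + "rect") for s in rect_to_segments(r)]
junctions = find_junctions(segments)
pieces = break_up(segments, junctions)
print(sorted(junctions))
print(len(segments), "lines ->", len(pieces), "broken lines")
```

On the two rectangles of the slide this finds two junctions, (650, 300) and (800, 200), and breaks the 8 parsed lines into 12 broken lines.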

Vector Graphics Indexing and Retrieval
Our approach (next)
(1) Line graph building, from the line gravity centers: while 2-connex → edge (polyline); if ≥ 3-connex → node (junction)
(2) Region detection: working on the graph takes time, so we use the vectorial data [Wen’01]: while ≠ starting vector, take the nearest vector (polygon)
(3) Adjacency and inclusion: adjacency = polygons with a common vector; inclusion = included bounding box
Result example
Time processing on the ‘Mikado’ database
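The graph building rule of this slide (2-connex points are merged into an edge, points that are 3-connex or more become nodes) can be sketched as follows. This is an illustration under assumptions, not the author's code: `build_graph` is a hypothetical name, line ends (1-connex points) are also treated as nodes so that open polylines terminate, and closed loops without any junction are skipped.

```python
from collections import defaultdict

def build_graph(segments):
    """Merge chains of 2-connex points into polyline edges between nodes."""
    adj = defaultdict(set)
    for p, q in segments:
        adj[p].add(q)
        adj[q].add(p)
    # Nodes: every point that is not exactly 2-connex.
    nodes = {p for p, nb in adj.items() if len(nb) != 2}
    junctions = {p for p in nodes if len(adj[p]) >= 3}
    edges, visited = [], set()
    for start in nodes:
        for first in adj[start]:
            if (start, first) in visited:
                continue  # this polyline was already walked from its other end
            visited.add((start, first))
            path, prev, cur = [start, first], start, first
            while cur not in nodes:  # cur is 2-connex: keep following the chain
                a, b = adj[cur]
                prev, cur = cur, (b if a == prev else a)
                path.append(cur)
            visited.add((cur, prev))
            edges.append(path)
    return junctions, edges

# Toy drawing: a two-line polyline ending in a junction with two branches.
segments = [((0, 0), (1, 0)), ((1, 0), (2, 0)),
            ((2, 0), (3, 1)), ((2, 0), (3, -1))]
junctions, edges = build_graph(segments)
print(junctions)
print(len(edges), "polyline edges")
```

The point (1, 0) is 2-connex and disappears inside a polyline, while (2, 0) is 3-connex and is kept as a junction.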

Vector Graphics Indexing and Retrieval
Performance evaluation
Should we work on the retrieval engine now? How would we evaluate the retrieval results afterwards? We must work on performance evaluation first.
How to get the ground truth? Producing ground truth from existing documents takes time, so we must produce synthetic documents.
Our key idea: synthetic document production with production rules
Producing true-to-life documents needs much knowledge and is hard to do with a computer. We can produce ‘crazy’ but well-formed documents, which is sufficient for performance evaluation purposes.
Figure: Doc 1 / GT 1, Doc 2 / GT 2, Doc 3 / GT 3 → Features Extraction → Retrieval
Figure (production rules, ‘crazy’ but well-formed drawing), labels: + and −; 1-connected: 1 or 0-n; 2-connected: 0-n

Vector Graphics Indexing and Retrieval
Low level primitives → graphical objects → vector graphics and ground truth (in progress)
Examples: rotate and scale; rotate and distort; scale and overlap
Domain rules
• must be connected
• must be adjacent
• must be included
• can include
• …
General rules
• object number
• document size
• object choice
  - probability distribution
  - rotation and scale range
  - position constraints
  - overlapped or not
• …
Noise rules
• to scale a line
• to break a line
• to move a line
• …
Production steps
I (while loop)
(1) Insert a new object while under the object number
(2) Move other objects if (1) cannot be done
(3) Exit if (1) and (2) cannot be done, then run (4) and (5)
II (‘cycle number’)
(4) Move objects according to the domain rules
(5) Delete the oldest alone objects
III
(6) Add noise to the low level primitives composing the objects
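As a rough illustration of stage I only, the sketch below inserts objects at random positions while the document is under its object number and exits when no free place is found. It is a simplification made for this example: objects are reduced to bounding boxes, step (2) (moving other objects) is replaced by a bounded number of retries, and the names `produce_document`, `object_sizes` and `max_tries` are hypothetical. The list of placed boxes plays the role of the ground truth.

```python
import random

def overlaps(a, b):
    """True when two (x, y, width, height) boxes intersect."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def produce_document(object_number, doc_size, object_sizes, max_tries=200, seed=0):
    """Place up to object_number non-overlapping boxes inside the document."""
    rng = random.Random(seed)
    width, height = doc_size
    placed = []
    while len(placed) < object_number:       # (1) insert while under the object number
        for _ in range(max_tries):
            w, h = rng.choice(object_sizes)  # object choice (uniform here)
            box = (rng.uniform(0, width - w), rng.uniform(0, height - h), w, h)
            if not any(overlaps(box, other) for other in placed):
                placed.append(box)
                break
        else:
            break                            # (3) exit: no free place was found
    return placed

ground_truth = produce_document(5, (100, 100), [(10, 10), (20, 5)])
for box in ground_truth:
    print(box)
```

Because every inserted box is recorded at insertion time, the ground truth comes for free, which is the point of producing synthetic documents.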

Vector Graphics Indexing and Retrieval
Works done
• Fast graph building from vector graphics
• Production of first synthetic documents
Works in progress …
• To produce more complex synthetic documents …
• To work on model selection …
• To work on index structuring …
About project
• 04/05 SCSIT Post-doc
• 02/06 IRCSET Application, A. Winstanley (NCG, Dublin University)
• 04/06 Eureka Meeting, eConnector, HP Lab
• 06/06 ANVAR Application, informal agreement
• 11/06 EPEIRES contract
• 2007 To visit A. Winstanley (NCG, Dublin University); to contact M. Fonseca (IST, Lisbon University)
• 2008 JM Ogier plans to set up a European project

Dropcap Image Retrieval
Old graphics: which part and kind of graphics in old books?
Old books of the XVth and XVIth centuries (CESR database)
Kinds of graphics: dropcap, figure, headline
Examples: Alciati (1511), Bartolomeo (1534), Laurens (1621)

Dropcap Image Retrieval
What are historians interested in with these images? Why?
(1) Wood plug tracking
Retrieve similar printings (Printing 1, Printing 2) to track the wood plugs (Plug 1, Plug 2, Plug 3) of the printing houses
Figure: wood plug (bottom view); printing house plugs linked by exchange and copy (Vascosan 1555, Marnef 1576; dates 1497-1507, 1511-1542, 1531-1548, 1555-1578)
(2) User-driven historical metadata acquisition
Figure: without retrieval (metadata file, metadata file); with retrieval (faster, fewer errors)
Real time process or not?
We cannot index all images for legal (property) reasons; a real time process allows queries with images provided by other digital libraries
Figure: query → DB → results

Dropcap Image Retrieval
What are the main difficulties?
• Image: noise, offset, complexity
• Database: accuracy, scalability (several hundred classes, several thousand images)
Which descriptor to use?
• To scalar [Loncaric’98]: Hough, Radon, Zernike, Hu, Fourier; scale and orientation invariant; fast; local (character, symbol, digit). Not adapted to our problem.
• To image [Gesu’99]: template matching, Hausdorff distance; not scale and orientation invariant; slow; global (scene). More adapted but too complex.
Descriptors range from fast and local to complex and global
Our key idea: to use a compressed representation of the image
Pipeline: Filtering → Compression → Centering and Comparison
Figure: Query → R1, R2, R3

Dropcap Image Retrieval
We have started to work with our images, but the file formats are very different
Pipeline: Filtering → Compression → Centering and Comparison

Dropcap Image Retrieval
Filtering (pipeline: Filtering → Compression → Centering and Comparison)
Our key idea: before working on the retrieval engine, historians need tools to improve the quality of their databases. We develop an engine (QUEID) working on image metadata to detect digitization problems and to secure the retrieval system.
Digitization problems [Lawrence’00]
• Several image providers
• Several digitization tools
• Long process
• Human supervised
• Complex post-processing platform
• …
QUEID engine
• Filtering mode: base, query and parameters → engine → accepted / rejected
• Diagnostic mode: expertise, query analysis (why?), charts, control
• Uses: software setting, image exchange, prototype software
Our database
• Files: 2038
• Size: 279.7 Mp
• Model: gray
• Format: TIFF
• Compression: uncompressed
• Resolutions: 250 to 350

Dropcap Image Retrieval
Compression (pipeline: Filtering → Compression → Centering and Comparison)
Our key idea: to use a Run Length Encoding (RLE) of the image
Which kind of RLE? Foreground, background, or both: encoding both seems more adapted
Compression results: 0.95, 0.88, 0.75 (figure labels: image, foreground, both, background)
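The three RLE variants named on this slide can be sketched on a single binary image row. This is an illustrative encoding chosen for the example, not necessarily the format used in the system: `rle_both` stores every run as (value, length), while `rle_single` keeps only the runs of one value as (start, length). Both function names are hypothetical.

```python
def rle_both(row):
    """Encode a binary row as (value, length) runs covering every pixel."""
    runs = []
    for pixel in row:
        if runs and runs[-1][0] == pixel:
            runs[-1][1] += 1          # same value: extend the current run
        else:
            runs.append([pixel, 1])   # value changed: start a new run
    return [(value, length) for value, length in runs]

def rle_single(row, value):
    """Keep only the runs of `value`, stored as (start, length)."""
    out, x = [], 0
    for v, length in rle_both(row):
        if v == value:
            out.append((x, length))
        x += length
    return out

row = [0, 0, 1, 1, 1, 0, 1, 0, 0, 0]   # 1 = foreground, 0 = background
print(rle_both(row))        # both
print(rle_single(row, 1))   # foreground only
print(rle_single(row, 0))   # background only
```

Encoding both values keeps the runs contiguous, so two lines can be compared run by run without rebuilding the missing runs.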

Dropcap Image Retrieval
Centering and Comparison (pipeline: Filtering → Compression → Centering and Comparison)
Centering: to solve the offset problems we must use a centering step before the comparison. We can do it in an easy way by comparing the foreground histograms.
Comparison: line (y) of image 1 (query image) against line (y+dy) of image 2 (image database), with a reference stack of positions x: while x2 ≤ x1, handle image 2; while x1 ≤ x2, handle image 1
Time results, Raster vs RLE (columns: Size k.run, Size k.pixel, Time s, Time s)
• Min: 1.1, 7.74, 176.67, 22.32
• Mean: 137.7, 15.5, 337.06, 41.68
• Max: 600.8, 87.8, 903.62, 137.06
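The run-by-run comparison of two lines outlined on this slide can be sketched as a two-pointer merge over the run ends x1 and x2. This is a minimal sketch under assumptions: lines are lists of (value, length) runs of equal total width, the distance is simply the number of differing pixels, and `rle_line_difference` is a hypothetical name.

```python
from itertools import accumulate

def rle_line_difference(runs1, runs2):
    """Count differing pixels between two lines given as (value, length) runs."""
    ends1 = list(accumulate(length for _, length in runs1))
    ends2 = list(accumulate(length for _, length in runs2))
    if ends1[-1] != ends2[-1]:
        raise ValueError("lines must have the same width")
    i = j = x = diff = 0
    while i < len(runs1) and j < len(runs2):
        x1, x2 = ends1[i], ends2[j]
        end = min(x1, x2)               # both current runs cover [x, end)
        if runs1[i][0] != runs2[j][0]:
            diff += end - x
        x = end
        if x1 <= x2:
            i += 1                      # run of line 1 is finished
        if x2 <= x1:
            j += 1                      # run of line 2 is finished
    return diff

line1 = [(0, 2), (1, 3), (0, 1), (1, 1), (0, 3)]   # 0011101000
line2 = [(0, 1), (1, 3), (0, 2), (1, 2), (0, 2)]   # 0111001100
print(rle_line_difference(line1, line2))
```

The loop visits each run once, so the cost depends on the number of runs rather than on the number of pixels, which is where the speed-up over a raster comparison comes from.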

Dropcap Image Retrieval
Mean query time of 40 s: how to reduce it further without using a lossy compression and losing accuracy?
Our key idea: to use a system approach with different levels of operators (from the fastest to the most accurate) to select the images to compare
Our first system
• Level 1: image sizes
• Level 2: black and white pixels
• Level 3: RLE comparison
How to process the distance curve? Using a basic clustering algorithm (‘elbow criterion’)
Figure: query passing through the 1st level and the 2nd level (axes: speed, depth)
Figure (distance curve, values 1 and 2): if value 1 − value 2 < 0, push x to the cluster; while value 1 − value 2 < 0, next
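A level-by-level selection of this kind can be sketched as a cascade in which each level ranks the remaining candidates by a cheap distance and keeps those before the elbow. The elbow rule used here (cut at the largest jump in the sorted distance curve) is one simple reading of an ‘elbow criterion’, not necessarily the rule of the slide. The toy database, the two distances and all names are made up for the example, and level 3 (RLE comparison) is left out.

```python
def select_by_elbow(distances):
    """Keep the candidates ranked before the largest jump of the distance curve."""
    ranked = sorted(distances.items(), key=lambda item: item[1])
    if len(ranked) < 2:
        return [name for name, _ in ranked]
    gaps = [ranked[k + 1][1] - ranked[k][1] for k in range(len(ranked) - 1)]
    cut = max(range(len(gaps)), key=gaps.__getitem__)
    return [name for name, _ in ranked[:cut + 1]]

def cascade(query, database, levels):
    """Apply the levels from fastest to most accurate, shrinking the selection."""
    candidates = list(database)
    for distance in levels:
        scores = {name: distance(query, database[name]) for name in candidates}
        candidates = select_by_elbow(scores)
    return candidates

def size_distance(a, b):        # level 1: image sizes
    return abs(a["size"][0] - b["size"][0]) + abs(a["size"][1] - b["size"][1])

def black_distance(a, b):       # level 2: number of black pixels
    return abs(a["black"] - b["black"])

# Toy metadata, invented for this example.
query = {"size": (100, 100), "black": 4000}
database = {
    "a": {"size": (100, 100), "black": 4100},
    "b": {"size": (102, 98), "black": 4050},
    "c": {"size": (101, 100), "black": 9000},
    "d": {"size": (300, 200), "black": 4000},
    "e": {"size": (280, 220), "black": 4200},
}
selected = cascade(query, database, [size_distance, black_distance])
print(selected)
```

Here level 1 discards the two images with very different sizes, and level 2 discards the image whose black pixel count is far from the query, so only two images would reach the expensive comparison.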

Dropcap Image Retrieval
Selection results (selection %): Min 4%, Mean 24%, Max 59%
From 4% to 59%: how to reduce the variability? Working on a better selection criterion seems ambiguous …
Our key idea: to add an intermediate operator between scalar and image data

Dropcap Image Retrieval
First results seem good, but how to get the ground truth and evaluate our system?
Our key idea: to use our engine to produce a benchmark database (Bench1, Bench2), and to produce controlled retrieval
Example of query result: query; same plug: 0.1947, 0.2517, 0.3485, 0.3616, 0.3819, 0.4064; next plug: 0.4109, 0.4209
Figure: base and labels → retrieval engine → display (HMI) → driven labelling

Dropcap Image Retrieval
Works done
• QUEID to filter and analyse the image database
• Speed-up of the comparison using two features: RLE compression and a system approach
Works in progress …
• To add operators to improve the system
• To extend our system to produce a benchmark database
About project
• 09/05 MADONNE Post-doc
• 06/06 1st CESR Technical Meeting
• 09/06 ANAGRAM Workshop (Fribourg)
• 10/06 2nd CESR Technical Meeting
• 10/06 NaviDoMass agreement
• 2007 GDR-JC Project (LMA, LI, CreSTIC, LITIS, CVC); to put the system online on the CESR website; old graphics working group (Glasgow, Tours …)

Bibliography
• J. Mong and D. Brailsford. Using SVG as the rendering model for structured and graphically complex web material. In Symposium on Document Engineering (DocEng), pages 88-91, 2003.
• Y. Chen, J. Gong, W. Jia, and Q. Zhang. XML-based spatial data interoperability on the internet. In Conference of International Society for Photogrammetry and Remote Sensing and Spatial Information Sciences (ISPRS), pages 167-201, 2004.
• J. Kang, B. Lho, J. Kim, and Y. Kim. XML-based vector graphics: Application for web-based design automation. In International Conference on Computing in Civil and Building Engineering (ICCCBE), pages 170-178, 2004.
• M. Weindorf. Structure based interpretation of unstructured vector maps. In Workshop on Graphics Recognition (GREC), volume 2390 of Lecture Notes in Computer Science (LNCS), pages 190-199, 2002.
• N. Journet, R. Mullot, J. Ramel, and V. Eglin. Ancient printed documents indexation: a new approach. In International Conference on Advances in Pattern Recognition (ICAPR), volume 3686 of Lecture Notes in Computer Science (LNCS), pages 513-522, 2005.
• V. D. Gesu and V. Starovoitov. Distance based function for image comparison. Pattern Recognition Letters (PRL), 20(2):207-214, 1999.
• S. Loncaric. A survey of shape analysis techniques. Pattern Recognition (PR), 31(8):983-1001, 1998.
• G. Lawrence et al. Risk management of digital information: A file format investigation. RLG DigiNews, 8(4), 2000.
