Graphics Recognition – from Re-engineering to Retrieval

Karl Tombre, Bart Lamiroy, LORIA, France

Document Analysis in the IR era • Information is at the core of industrial strategies • A lot of digital or digitized information, but often in very “poor” formats • The challenge: not necessarily re-engineering documents, but enriching poorly structured information, adding a (limited) amount of semantics, building indexes • Purposes: browsing, navigation, indexing • DAR methods and tools are useful, but must be adapted

Specific challenges of large-scale IR applications • Genericity: we cannot necessarily build a complete and exhaustive a priori model of contextual knowledge (ontology) • Adaptability: various input data – scanned paper, PDF, DXF, HTML, GIF… – various resolutions • Robustness: “back-office” applications • Efficiency: online searching in heterogeneous data • Scaling: methods have to scale to increasing number of symbols/features

DAR and IR • Media without (or with very little) contextual knowledge • Image-based indexing and retrieval, indexing of video sequences • Documents do explicitly convey information from one person to another person • Much more structure, syntax and semantics

DAR and IR – some examples • Indexing and/or searching scanned text without OCR • Similarities, signatures • Query or index on layout structure • Table spotting • Keyword spotting • …

What about Graphics Recognition? • Subfield of DAR, for graphics-rich documents • Numerous methods for various analysis and recognition problems • Raster-to-vector conversion • Text/graphics separation • Symbol recognition • Many specific technical areas: maps, architectural drawings, engineering drawings, diagrams and schematics, …

Graphics recognition methods • Text/graphics separation

Graphics recognition methods • Vectorization

Graphics recognition and IR applications • Usual text-based indexing and retrieval still useful • But need for access to other kinds of information: • Symbols • Text-drawing connections • Description-illustration connections

Some contributions • Syeda-Mahmood – maintenance drawings IEEE Trans. on PAMI 21(8):737-751, Aug. 1999

Some contributions • Arias et al., Najman et al. – use of information contained in legend / title block Proc. GREC’01, Kingston (Ontario, Canada), pp. 19-26, Sept. 2001

Some contributions • Samet & Soffer – symbols from legend IEEE Trans. on PAMI 18(8):783-798, Aug. 1996

Some contributions • Müller & Rigoll – graphical retrieval in database of engineering drawings Proc. ICDAR’99, Bangalore (India), pp. 697-700, Sept. 1999

Some contributions • Boose et al. (Boeing) – Generation of Layered Illustrated Parts Drawings Proc. GREC’03, Barcelona, pp. 139-144

Or even better… Wishful thinking? [Figure: symbol DB]

Symbol recognition • Before we move on: 1st contest on symbol recognition held last week; see the IAPR TC10 homepage for further details • Natural features for indexing and retrieval • Most methods work with known databases of reference symbols – what about interactive querying of arbitrary symbols? • From segmentation followed by recognition, to segmentation-free recognition, or segmenting while recognizing • Scalability • Efficiency / complexity • Discrimination power • Signatures

Image-based signatures • Compute invariant signatures on binary document image • F-signatures (ICDAR’01) • Radon transform: R-signatures [Tabbone & Wendling] • Ridgelets [Ramos Terrades & Valveny – GREC’03] – aka wavelet transform of Radon transform
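As a rough illustration of a Radon-based signature, the sketch below projects the foreground pixels of a binary image onto an axis for each angle (a discrete Radon projection) and keeps the sum of squared bin counts per angle. This is one common way to define an angular signature on the Radon transform; it is not claimed to be the exact R-signature of Tabbone & Wendling. The function name `r_signature`, the unit-width binning and the centroid centring are assumptions made for this example.

```python
import math

def r_signature(image, n_angles=180):
    """Angular signature of a binary image given as a list of rows of 0/1.

    For each angle theta in [0, pi), foreground pixels are projected onto
    rho = x*cos(theta) + y*sin(theta) and accumulated in unit-width bins;
    the signature value is the sum of the squared bin counts.
    """
    pixels = [(x, y) for y, row in enumerate(image)
              for x, v in enumerate(row) if v]
    if not pixels:
        return [0.0] * n_angles

    # Centre on the centroid so that the result ignores translation.
    cx = sum(x for x, _ in pixels) / len(pixels)
    cy = sum(y for _, y in pixels) / len(pixels)
    centred = [(x - cx, y - cy) for x, y in pixels]

    radius = max(math.hypot(dx, dy) for dx, dy in centred)
    n_bins = 2 * int(math.ceil(radius)) + 1

    signature = []
    for k in range(n_angles):
        theta = k * math.pi / n_angles
        c, s = math.cos(theta), math.sin(theta)
        bins = [0] * n_bins
        for dx, dy in centred:
            rho = dx * c + dy * s
            bins[int(math.floor(rho + n_bins / 2))] += 1
        signature.append(float(sum(b * b for b in bins)))
    return signature
```

Rotating the shape shifts such a signature cyclically along the angle axis, so comparing two signatures up to a circular shift gives a simple rotation-tolerant match; scale normalisation is left out of this sketch.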

R-signatures Detection of arrowheads [Girardeau & Tabbone] DEA degree thesis, INPL, Nancy, Jul. 2002

R-signatures Another example [Girardeau & Tabbone]

Ridgelets [Ramos Terrades & Valveny – GREC’03] Proc. GREC’03, Barcelona, pp. 202-211

Vector-based signatures [Dosch & Lladós – GREC’03] • Based on set of basic graphical features: • Parallelism • Overlap • Collinearity • T- and V-junctions • Quality factor associated with the various relations • Match signatures of reference symbols with signatures of buckets
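To make the idea of a vector-based signature concrete, here is a minimal sketch that counts pairwise relations (parallelism, collinearity, V- and T-junctions) between line segments. It is only an illustration of the kind of features listed above, not the method of Dosch & Lladós: overlap and quality factors are omitted, and the names `relation`, `vector_signature` and the tolerances `ANGLE_TOL` and `DIST_TOL` are assumptions. Segments are assumed to have non-zero length.

```python
import math
from collections import Counter
from itertools import combinations

ANGLE_TOL = math.radians(5)  # assumed tolerance on direction
DIST_TOL = 1.0               # assumed tolerance on distances

def _angle_diff(a, b):
    """Smallest angle between the (undirected) directions of two segments."""
    def angle(seg):
        (x1, y1), (x2, y2) = seg
        return math.atan2(y2 - y1, x2 - x1) % math.pi
    d = abs(angle(a) - angle(b))
    return min(d, math.pi - d)

def _dist_to_line(p, seg):
    """Distance from point p to the infinite line supporting seg."""
    (x1, y1), (x2, y2) = seg
    num = abs((x2 - x1) * (y1 - p[1]) - (x1 - p[0]) * (y2 - y1))
    return num / math.hypot(x2 - x1, y2 - y1)

def _on_interior(p, seg):
    """True if p lies on seg, away from both of its endpoints."""
    (x1, y1), (x2, y2) = seg
    if _dist_to_line(p, seg) > DIST_TOL:
        return False
    if math.dist(p, seg[0]) <= DIST_TOL or math.dist(p, seg[1]) <= DIST_TOL:
        return False
    t = ((p[0] - x1) * (x2 - x1) + (p[1] - y1) * (y2 - y1)) \
        / ((x2 - x1) ** 2 + (y2 - y1) ** 2)
    return 0.0 < t < 1.0

def relation(a, b):
    """Classify the relation between two segments, or return None."""
    if _angle_diff(a, b) <= ANGLE_TOL:
        on_same_line = all(_dist_to_line(p, a) <= DIST_TOL for p in b)
        return "collinear" if on_same_line else "parallel"
    if any(math.dist(p, q) <= DIST_TOL for p in a for q in b):
        return "v_junction"
    if any(_on_interior(p, b) for p in a) or any(_on_interior(p, a) for p in b):
        return "t_junction"
    return None

def vector_signature(segments):
    """Count each relation type over all pairs of segments."""
    counts = Counter(relation(a, b) for a, b in combinations(segments, 2))
    counts.pop(None, None)
    return dict(counts)
```

A rectangle made of four segments, for instance, yields two parallel pairs and four V-junctions; comparing such counts between a reference symbol and a region of the drawing gives a cheap first filter before any finer matching.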

Vector-based signatures Proc. GREC’03, Barcelona, pp. 159-169

Towards symbol spotting • Pre-compute – or compute on the spot – a set of basic signatures • Can be sufficient for symbol spotting and retrieval • Followed by classical symbol recognition if more discrimination is needed

Symbol spotting • [Jabari & Tabbone] : graph matching through probabilistic relaxation, with nodes=segments and edges=relations DEA degree thesis, INPL, Nancy, Jul. 2003
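For readers unfamiliar with probabilistic relaxation, the sketch below shows a classical relaxation labelling update (in the style of Rosenfeld, Hummel and Zucker), where each object holds a probability distribution over labels and is repeatedly reinforced by compatible neighbours. In a spotting setting the objects could be segments of the drawing and the labels segments of the model symbol, with compatibilities derived from the relations between segments. This is a generic illustration and not necessarily the variant used by Jabari & Tabbone; the function name `relax` and the data layout are assumptions.

```python
def relax(p, r, n_iter=50):
    """Relaxation labelling.

    p[i][a]       -- current probability that object i carries label a
    r[i][j][a][b] -- compatibility in [-1, 1] between "i has label a"
                     and "j has label b"
    Returns the probabilities after n_iter simultaneous updates.
    """
    n, n_labels = len(p), len(p[0])
    for _ in range(n_iter):
        new_p = []
        for i in range(n):
            # Average support that the other objects give to each label of i.
            q = [sum(r[i][j][a][b] * p[j][b]
                     for j in range(n) if j != i
                     for b in range(n_labels)) / max(n - 1, 1)
                 for a in range(n_labels)]
            # Reinforce supported labels, then renormalise.
            w = [p[i][a] * (1.0 + q[a]) for a in range(n_labels)]
            total = sum(w)
            new_p.append([x / total for x in w] if total > 0 else list(p[i]))
        p = new_p
    return p
```

With compatibilities that reward two related objects for taking different labels, a slight initial preference on one object is enough for the iteration to settle on a consistent one-to-one assignment.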

Symbol spotting • [Jabari & Tabbone] : another example

Combining Text and Graphics • Extracting Text/Graphics relationships within document • Using Text matching for inter-document relationships • Transitive inter-document Graphics matching • No need for complex graphics matching • Restricted to well known document types

Example: continuation of Wiring Diagrams (Boeing) • [Baum et al. – GREC’03] Proc. GREC’03, Barcelona, pp. 132-138

Scan2XML Example Proc. GREC’01, Kingston (Ontario, Canada), pp. 312-325

Indexing and Semantics • Signature + metric • Semantics = measured distance to signature • Applies only to homogeneous contexts • Pre-segmented images • Pre-determined image classes • Implicit application of domain knowledge • ... • Semantics = Syntax

Example • Signature type A, metric M • Signature value l • Semantics1 = (s1, m1), Semantics2 = (s2, m2) • M(l, s1) < m1? M(l, s2) < m2? • Semantics = measurement to reference value
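The test on this slide can be written down in a few lines. The sketch below assumes that each semantic label is a pair (reference signature s, threshold m) and that a signature value l receives the label when M(l, s) < m. The Euclidean metric, the function name `interpret` and the sample reference values are assumptions chosen only for illustration.

```python
import math

def euclidean(a, b):
    """Example metric M between two signature vectors."""
    return math.dist(a, b)

def interpret(value, references, metric=euclidean):
    """Return every label whose reference signature s satisfies
    metric(value, s) < m for its threshold m."""
    return [label for label, (s, m) in references.items()
            if metric(value, s) < m]

# Hypothetical reference values: label -> (reference signature, threshold).
references = {
    "semantics1": ((0.0, 0.0), 1.0),
    "semantics2": ((5.0, 5.0), 2.0),
}
```

Calling `interpret((0.5, 0.5), references)` keeps only the first label, while a value far from both references gets no label at all, which is the sense in which semantics reduces here to a measurement against a reference value.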

Heterogeneous Document Bases • Semantics do not have a unique syntax anymore • Syntax metrics may be context sensitive • Semantics = Syntax + Context • Context needs to be considered

Two different contexts from the automobile industry

Example • Context 1: signature type A, metric M • Context 2: signature type B, metric N • Signature value l • Semantics1 = (s1, m1) in context 1, (t1, n1) in context 2 • Semantics2 = (s2, m2) in context 1, (t2, n2) in context 2 • What if M(l, s1) < m1 and N(l, t2) < n2?

A step towards taking context into account (while consolidating existing approaches) • Component Algebra: • Image Analysis = Pipeline • Syntax + algorithm = semantics • Syntax and semantics need not be distinguished [Diagram: Data (syntax) → Algorithm → Data (semantics) → Algorithm → Data (semantics)]

Component Algebra • Components : Known and implemented document analysis algorithms, taking input data from one domain, and producing data into another domain. • Application Context : Set of all available Components. • Semantics : Data sets needed by or produced by Components.

Component Algebra is a Graph [Diagram: Data nodes connected to one another through Component nodes]
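One way to picture the component graph is to treat data domains as nodes and components as directed edges, and to look for a chain of components leading from the available data to the wanted data. The sketch below does this with a breadth-first search. It is a simplified reading of the formalism: every component is assumed to have exactly one input and one output domain, and the names `Component`, `find_pipeline` and the sample components are hypothetical.

```python
from collections import deque, namedtuple

# A component maps data from one domain (source) into another (target).
Component = namedtuple("Component", "name source target")

def find_pipeline(components, start, goal):
    """Breadth-first search over data domains.

    Returns the shortest list of component names turning `start` data
    into `goal` data, or None if the context offers no such chain.
    """
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        domain, path = queue.popleft()
        if domain == goal:
            return path
        for c in components:
            if c.source == domain and c.target not in seen:
                seen.add(c.target)
                queue.append((c.target, path + [c.name]))
    return None

# Hypothetical application context: the set of available components.
context = [
    Component("binarize", "raster image", "binary image"),
    Component("vectorize", "binary image", "vectors"),
    Component("spot_symbols", "vectors", "symbols"),
    Component("ocr", "binary image", "text"),
]
```

With this context, asking for "symbols" from a "raster image" returns the chain binarize, vectorize, spot_symbols; a different set of components would give a different path or none, which is how context shows up in the graph.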

Advantages • Each node is a semantic concept, semantic relationships are explicitly expressed. • Structure may support automatic reasoning and knowledge inference. • Context is embedded in components, different contexts give different paths in the graph. • Highly scalable and open architecture. • Bridge between signal-level document analysis and high-level document representation.

However... The formalism exists, the realization doesn't (yet) • What about parametrization? • How context-independent can you get? • What about “guessing” context appropriateness? • How to design fully interoperable components?

Conclusion • A lot of DA methods – and more specifically GR methods – can be of direct use in IR, indexing and browsing applications • Specific challenges • Scaling and efficiency • Heterogeneous sets of documents • Incomplete domain knowledge • Symbol spotting • On-the-fly symbol searching • Sketch of open framework for including document semantics when context can be heterogeneous
