Document Analysis: Structure Recognition

Prof. Rolf Ingold, University of Fribourg
Master course, spring semester 2008

Outline
• Objectives
• Examples of applications
• Physical and logical structures
• Methodologies for structure recognition
• Microstructures vs. macrostructures
• Role of models
• Interactive systems

Importance of document structures
• Document = Content + Structures
• Structures convey abstract, high-level information
• Structures are revealed by styles

Applications of document structure recognition
• Information extraction
  • form analysis (check readers, ...)
  • business applications: mail distribution, invoice processing, ...
  • analysis of museum & library notices
  • analysis of bibliographical references
• Document mining, content analysis
  • business reports
  • legal documents
  • scientific publications
• Intelligent indexing
  • laws
  • magazines & newspapers
• Document restyling
  • teaching material
• ...

Extended Processing Chain
[Figure: processing chain diagram; labels: Image, Preprocessing, Layout analysis, Segmentation, OFR, Blocks, Fonts, OCR, Simple Text, Logical labeling, Postanalysis, Struct. Document]

Physical document structures
[Figure: example tree; node labels: document, region, block, hr, frm]
• Reveal the publisher's view
• Composed of a hierarchy of physical entities
  • text blocks, text lines and tokens
  • graphical primitives
• Universal, i.e. independent of the document class
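The hierarchy of physical entities listed above can be pictured as a simple tree. The following is a minimal sketch, not the representation used in the course: the class name `PhysicalEntity`, the `bbox` convention (x, y, width, height) and all coordinates are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class PhysicalEntity:
    """One node of a physical structure tree (hypothetical representation)."""
    kind: str                          # e.g. "document", "region", "block", "line", "token"
    bbox: Tuple[int, int, int, int]    # assumed convention: (x, y, width, height)
    children: List["PhysicalEntity"] = field(default_factory=list)

    def count(self, kind: str) -> int:
        """Count the entities of a given kind in this subtree."""
        own = 1 if self.kind == kind else 0
        return own + sum(child.count(kind) for child in self.children)


# Made-up example: one page with two regions and three text blocks.
page = PhysicalEntity("document", (0, 0, 160, 240), [
    PhysicalEntity("region", (10, 10, 140, 60), [
        PhysicalEntity("block", (10, 10, 140, 25)),
        PhysicalEntity("block", (10, 40, 140, 30)),
    ]),
    PhysicalEntity("region", (10, 80, 140, 150), [
        PhysicalEntity("block", (10, 80, 140, 150)),
    ]),
])
```

Because the entity kinds are generic, the same tree type can describe any document class, which is what "universal" means on this slide.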

Illustration of physical document structure from A. Belaïd

Logical structures
[Figure: example tree; node labels: document, article, title, hdln, author, link, p]
• Reflect the author’s mind
• Independent of presentation
  • can be mapped on various physical structures
• Composed of application-dependent logical entities
• Specific to the application and document class

Illustration of logical document structure

Relation between logical and physical structures
• There is no 1:1 relation between physical and logical structures
• There are some correspondences between them, as shown below

Role of style sheets
[Figure: diagram relating Logical Structure and Physical Structure; labels: edit, formatting, Stylesheet, analysis, display, print]
• Document formatting is straightforward ...
• But document analysis is a non-trivial task that generally cannot be fully automated

Methodologies
• Document structural analysis can be
  • data-driven: the recognition task is based on image analysis
  • model-driven: the recognition task is led by the model
• Methods of structural document analysis can be classified into
  • syntactic approaches based on formal grammars
  • structural approaches based on graphs
  • rule-based approaches
  • expert systems (artificial intelligence)
  • machine learning

Syntactic Document Recognition [Ingold89]
• Fully model-driven
• Formal document description language
  • attributed grammar
  • translated into an analysis graph
• Top-down matching algorithm with backtracking
  • for macro-structure as well as micro-structure recognition
• Very generic approach
• Sensitive to noise (no error recovery)
• Theoretically exponential complexity

Document Description Language [Ingold89]
• Document class specific formal description composed of
  • composition rules (context-free grammar)
  • typographical rules (attributes)

    Act:DOC => ActNumber ActContent FootNotes Headings ;
    ActNumber:FRG => {Number $ Period} ;
    ActContent:PRT => ActTitle ActDate Otgan {Provis} Formul {Chapter} [Validity] ;
    ...
    Chapter:PRT => ChTitle ({Section} | {Article}) ;

    ChTitle.zone = Inherited
    ChTitle.alignment = (Allowed, Centered, 0pt, 0pt, Undefined) ;
    ChTitle.lineHeight = 11pt ;
    ChTitle.spaceBefore = (Allowed,[6pt, 60pt] ) ;
    ChTitle.interSpace = (Forbidden, [2pt, 3pt]) ;
    ChTitle.font = (Times, 11pt, Bold, Roman);
    Article.spaceBefore = <FST: (Forbidden, [6pt, 30pt]), NXT: (Allowed, [6pt, 30pt])> ;
    ...
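To make the role of the typographical rules more concrete, here is a minimal sketch of how interval constraints such as `ChTitle.spaceBefore = (Allowed,[6pt, 60pt])` might be checked against measured values. It is an illustration only: the names `Interval`, `CH_TITLE_RULES` and `violations` are hypothetical, only the numeric ranges are taken from the slide, and the `Allowed`/`Forbidden` flags are ignored because the slide does not define their meaning.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Interval:
    """Closed interval of admissible values, in points."""
    low: float
    high: float

    def contains(self, value: float) -> bool:
        return self.low <= value <= self.high


# Numeric ranges copied from the ChTitle rules; a fixed value such as
# lineHeight = 11pt is modelled as a degenerate interval (an assumption).
CH_TITLE_RULES: Dict[str, Interval] = {
    "lineHeight": Interval(11, 11),
    "spaceBefore": Interval(6, 60),
    "interSpace": Interval(2, 3),
}


def violations(measured: Dict[str, float], rules: Dict[str, Interval]) -> List[str]:
    """Return the names of the measured attributes that fall outside their interval."""
    return [name for name, interval in rules.items()
            if name in measured and not interval.contains(measured[name])]


candidate = {"lineHeight": 11, "spaceBefore": 4, "interSpace": 2.5}
print(violations(candidate, CH_TITLE_RULES))
```

A strict check of this kind rejects a candidate as soon as one measurement is slightly off, which is one way to picture the sensitivity to noise attributed to the syntactic approach.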

Analysis graph [Ingold89]
• Analysis graph for syntactic analysis where each node has two links
  • successor (in case of successful match)
  • alternative (in case of unsuccessful match)
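The two-link node structure and the top-down matching with backtracking can be sketched as follows. This is a toy illustration, not the algorithm of [Ingold89]: the class `Node`, the function `match` and the example graph are assumptions, and the graph accepts a hypothetical structure (a chapter title followed by one or more sections, or by one or more articles).

```python
from typing import List, Optional


class Node:
    """Analysis-graph node with a successor link and an alternative link."""

    def __init__(self, symbol: Optional[str],
                 successor: Optional["Node"] = None,
                 alternative: Optional["Node"] = None):
        self.symbol = symbol            # None marks the accepting node
        self.successor = successor      # followed after a successful match
        self.alternative = alternative  # followed after an unsuccessful match


def match(node: Optional[Node], tokens: List[str], pos: int = 0) -> bool:
    """Top-down matching; falls back to the alternative link when a path fails."""
    if node is None:
        return False
    if node.symbol is None:
        return pos == len(tokens)       # accept only if every token was consumed
    if pos < len(tokens) and tokens[pos] == node.symbol:
        if match(node.successor, tokens, pos + 1):
            return True
    # Either the symbol did not match or the successor path hit a dead end:
    # backtrack and try the alternative at the same input position.
    return match(node.alternative, tokens, pos)


# Toy graph: ChTitle, then one or more Section, or else one or more Article.
accept = Node(None)
section_loop = Node("Section", alternative=accept)
section_loop.successor = section_loop
article_loop = Node("Article", alternative=accept)
article_loop.successor = article_loop
first_article = Node("Article", successor=article_loop)
first_section = Node("Section", successor=section_loop, alternative=first_article)
chapter = Node("ChTitle", successor=first_section)

print(match(chapter, ["ChTitle", "Section", "Section"]))
print(match(chapter, ["ChTitle", "Section", "Article"]))
```

Since every failed path may trigger a retry through an alternative link, the number of explored paths can grow exponentially in the worst case, in line with the complexity remark on the previous slide.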

Fuzzy document structure recognition [Hu94]
• The previous approach has been adapted to be less sensitive to matching errors

Fuzzy document structure recognition [Hu94] (continued)
• Pattern matching uses fuzzy logic
• Parsing is expressed as a cost function to be optimized
• The solution consists of finding the shortest path in a graph
  • solved by linear programming
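The idea of turning fuzzy matching scores into path costs can be sketched with a small example. This is not the method of [Hu94]: the slide states that the shortest path is found by linear programming, whereas the sketch below substitutes Dijkstra's algorithm for brevity. The triangular membership function, the labels, the ideal font sizes and the transition table are all assumptions made for illustration.

```python
import heapq
from typing import List, Optional, Tuple


def triangular(x: float, ideal: float, tolerance: float) -> float:
    """Fuzzy membership in [0, 1]: 1 at the ideal value, 0 beyond the tolerance."""
    return max(0.0, 1.0 - abs(x - ideal) / tolerance)


# Hypothetical model: expected font size per label as (ideal, tolerance) in points.
LABELS = {"title": (14.0, 4.0), "paragraph": (10.0, 4.0)}
# Hypothetical composition rules: which label may follow which.
NEXT = {"start": ["title"], "title": ["title", "paragraph"], "paragraph": ["paragraph"]}


def best_labeling(sizes: List[float]) -> Optional[Tuple[float, List[str]]]:
    """Cheapest label sequence for the observed font sizes (Dijkstra on a trellis)."""
    heap = [(0.0, 0, "start", ())]        # (cost so far, next index, last label, path)
    settled = set()
    while heap:
        cost, i, last, path = heapq.heappop(heap)
        if i == len(sizes):
            return cost, list(path)       # first complete path popped is the cheapest
        if (i, last) in settled:
            continue
        settled.add((i, last))
        for label in NEXT[last]:
            ideal, tolerance = LABELS[label]
            step = 1.0 - triangular(sizes[i], ideal, tolerance)  # poor match = high cost
            heapq.heappush(heap, (cost + step, i + 1, label, path + (label,)))
    return None                           # no label sequence satisfies the rules


print(best_labeling([14, 13, 10]))
```

A block whose font size is slightly off only raises the cost of a path instead of rejecting it outright, which is the sense in which the fuzzy variant is less sensitive to matching errors.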

Graphein : Blackboard approach [Chenevoy92]

Model of Graphein [Chenevoy92]

Complex Layout Analysis [Azokly95]

Modeling of Scientific Journals [Azokly95]

Model for a Scientific Journal

    <volume name="article" width="160" height="240">
      <page name="first"> ... </page>
      <page name="even">
        <hsep name="hs1" bloc="4 3 LEFT RIGHT" type="BLANK"/>
        <layer name="principle">
          <vsep name="vs1" bloc="40 65 TOP hs1" type="BLANK"/>
          <vsep name="vs2" bloc="[50,60] 4 hs1 BOTTOM" type="BLANK"/>
          <region name="center" bloc="vs2 RIGHT hs1 BOTTOM" content="ANY"/>
          <region name="margin" bloc="LEFT vs2 hs1 BOTTOM" content="TEXT"/>
          ...
        </layer>
        <layer name="secondary">
          <hsep name="hs2" bloc="[10,220] 2 LEFT RIGHT" type="BLANK">
            <subst value="hs1"/>
          </hsep>
          <hsep name="hs3" bloc="[10,220] 2 LEFT RIGHT" type="BLANK">
            <subst value="BOTTOM"/>
          </hsep>
          <region name="figure" bloc="LEFT RIGHT hs2 hs3" content="FIGS"/>
        </layer>
      </page>
      ...
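Because the journal model is written in XML, it can be loaded with a standard parser. The sketch below uses Python's `xml.etree.ElementTree` on a reduced, well-formed excerpt of the model above. Reading the tokens of a `bloc` attribute that coincide with separator names as references to those separators is an assumption; the slide does not document the attribute syntax.

```python
import xml.etree.ElementTree as ET

# Reduced excerpt of the "even" page model, with the elided parts left out.
FRAGMENT = """
<page name="even">
  <hsep name="hs1" bloc="4 3 LEFT RIGHT" type="BLANK"/>
  <layer name="principle">
    <vsep name="vs2" bloc="[50,60] 4 hs1 BOTTOM" type="BLANK"/>
    <region name="center" bloc="vs2 RIGHT hs1 BOTTOM" content="ANY"/>
    <region name="margin" bloc="LEFT vs2 hs1 BOTTOM" content="TEXT"/>
  </layer>
</page>
"""

page = ET.fromstring(FRAGMENT)

# Names of all separators, in document order.
separators = [e.get("name") for e in page.iter() if e.tag in ("hsep", "vsep")]

# Expected content type of each region.
regions = {e.get("name"): e.get("content") for e in page.iter("region")}

# Assumed reading: bloc tokens that match a separator name refer to that separator.
references = {
    e.get("name"): [tok for tok in e.get("bloc").split() if tok in separators]
    for e in page.iter("region")
}

print(separators, regions, references)
```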

Use of Document Recognition Models
• There is no universal approach!
• Document recognition systems must be tuned
  • for specific applications
  • for specific document classes
• Contextual information is required
• Models provide information like
  • generic document structures (DTD or XML-schema)
  • geometrical and typographical attributes (style information)
  • semantic information (keywords, dictionaries, databases, ...)
  • statistical information

Content of document models
• Generic structure
  • Document Type Definition (DTD) or XML-schema
• Style information
  • Absolute or relative positioning
  • Typographical attributes & formatting rules
• Semantics (if available)
  • Linguistic information, keywords
  • Application-specific ontology
• Probabilistic information
  • Frequencies of items or sequences, co-occurrences

Trouble with document models
• Document models are hard to produce and to maintain
  • implicit models (hard-coded in the application)
    => hard to modify, adapt, extend
  • explicit models, written in a formal language
    => cumbersome to produce, need high expertise
  • abstract models, learned automatically
    => need a lot of training data (with ground truth!)
• Need for more flexible tools:
  • assisted environments with friendly user interfaces
  • recognition improving with use
  • models are learned incrementally

Pattern Based Document Understanding [Robaday 03]
[Figure: processing diagram; labels: document image, model, rules, extraction, configuration, classification, selector, patt., id]
• Configurations consist of
  • Set of vertices
    • Labeled (type)
    • Attributed (pos, typo, ...)
  • Edges between vertices
    • Labeled (neighborhood relation)
    • Attributed (geom, ...)
• Model consists of
  • Extraction rules
  • For each class
    • Attribute selector
    • List of patterns
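The ingredients listed above (labeled and attributed vertices and edges, plus an attribute selector and a list of patterns per class) can be sketched as plain data structures. This is an illustration under stated assumptions, not the system of [Robaday 03]: the names `Vertex`, `Edge`, `ClassModel` and `classify` are hypothetical, and nearest-pattern classification with Euclidean distance is an assumed decision rule that the slide does not specify.

```python
from dataclasses import dataclass, field
from math import dist
from typing import Dict, List, Tuple


@dataclass
class Vertex:
    type: str                   # label of the vertex
    attrs: Dict[str, float]     # attributes (position, typography, ...)


@dataclass
class Edge:
    source: int                 # index of the source vertex
    target: int                 # index of the target vertex
    relation: str               # label: neighborhood relation
    attrs: Dict[str, float] = field(default_factory=dict)


@dataclass
class ClassModel:
    selector: Tuple[str, ...]   # names of the attributes this class looks at
    patterns: List[Tuple[float, ...]] = field(default_factory=list)

    def features(self, v: Vertex) -> Tuple[float, ...]:
        return tuple(float(v.attrs[name]) for name in self.selector)

    def distance(self, v: Vertex) -> float:
        """Distance to the closest stored pattern (infinite if none is stored)."""
        f = self.features(v)
        return min((dist(f, p) for p in self.patterns), default=float("inf"))


def classify(v: Vertex, model: Dict[str, ClassModel]) -> str:
    """Assumed rule: pick the class whose nearest pattern is closest."""
    return min(model, key=lambda name: model[name].distance(v))


# Made-up configuration: two text blocks, one above the other.
vertices = [Vertex("text", {"y": 10, "font_size": 13}),
            Vertex("text", {"y": 40, "font_size": 10})]
edges = [Edge(0, 1, "above", {"gap": 12})]   # carried along but not used by classify

model = {"title": ClassModel(("font_size",), [(14.0,)]),
         "paragraph": ClassModel(("font_size",), [(10.0,)])}
print([classify(v, model) for v in vertices])
```

In such a design, appending the features of a manually corrected block to `patterns` would be one simple way to let the model grow incrementally with use.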

Evolution of 2-CREM performance
[Figure: improvement of correct labeling as a function of the number of clicks used to correct labels manually]

Conclusion
• Structure recognition of documents is still an open issue
• Solutions exist for specialized applications
• Generic approaches are not mature
  • models are hard to establish
  • training data is missing
• As an alternative
  • interactive systems
  • with incremental model adaptation
