Document Analysis: Structure Recognition

Prof. Rolf Ingold, University of Fribourg. Master course, spring semester 2008.

Presentation Transcript


  1. Document Analysis: Structure Recognition • Prof. Rolf Ingold, University of Fribourg • Master course, spring semester 2008

  2. Outline • Objectives • Examples of applications • Physical and logical structures • Methodologies for structure recognition • Microstructures vs. macrostructures • Role of models • Interactive Systems

  3. Importance of document structures • Document = Content + Structures • Structures convey abstract high-level information • Structures are revealed by styles

  4. Applications of document structure recognition • Information extraction • form analysis (check readers, ...) • business applications: mail distribution, invoice processing, ... • analysis of museum & library notices • analysis of bibliographical references • Document mining, content analysis • business reports • legal documents • scientific publications • Intelligent indexing • laws • magazines & newspapers • Document restyling • teaching material • ...

  5. Extended Processing Chain [Diagram labels: Preprocessing, Image, Layout analysis, Segmentation, OFR, Blocks, Fonts, OCR, Struct. Document, Logical labeling, Postanalysis, Simple Text]

  6. Physical document structures • Reveal the publisher's view • Composed of a hierarchy of physical entities • text blocks, text lines and tokens • graphical primitives • Universal, i.e. independent of the document class [Diagram: tree with nodes document, region, block, hr, frm]

  7. Illustration of physical document structure from A. Belaïd

  8. Logical structures • Reflect the author’s mind • Independent of presentation • can be mapped on various physical structures • Composed of application dependent logical entities • Specific to the application and document class [Diagram: tree with nodes document, article, title, hdln, p, author, link]

  9. Illustration of logical document structure

  10. Relation between logical and physical structures • There is no 1:1 relation between physical and logical structure • There are nevertheless some correspondences between the two (illustrated by a figure on the original slide)
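The absence of a 1:1 relation can be made concrete with a small sketch. It assumes a purely illustrative document in which a title and an author line share one physical block while a paragraph is split across two blocks by a column break; all entity names and the helper `index` are hypothetical and do not come from the slides.

```python
from collections import defaultdict

# (logical entity, physical entity) pairs; every name here is hypothetical.
links = [
    ("title", "block-1"),
    ("author", "block-1"),       # title and author share one physical block
    ("paragraph-1", "block-2"),
    ("paragraph-1", "block-3"),  # one paragraph split by a column break
]

def index(pairs):
    """Build both directions of the many-to-many correspondence."""
    to_physical = defaultdict(list)
    to_logical = defaultdict(list)
    for logical, physical in pairs:
        to_physical[logical].append(physical)
        to_logical[physical].append(logical)
    return dict(to_physical), dict(to_logical)

to_physical, to_logical = index(links)
```

Looking up `to_physical["paragraph-1"]` yields two blocks and `to_logical["block-1"]` yields two logical entities, which is why neither structure can be derived from the other by a simple relabeling.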

  11. Role of style sheets • Document formatting is straightforward ... • But document analysis is a non-trivial task that generally cannot be fully automated [Diagram labels: edit, display, formatting, print, Logical Structure, Physical Structure, Stylesheet, analysis]

  12. Methodologies • Document structural analysis can be • data-driven: the recognition task is based on image analysis • model-driven: the recognition task is led by the model • Methods of structural document analysis can be classified into • syntactic approaches based on formal grammars • structural approaches based on graphs • rule-based approaches • expert systems (artificial intelligence) • machine learning

  13. Syntactic Document Recognition [Ingold89] • Fully model-driven • Formal document description language • attributed grammar • translated into an analysis graph • Top-down matching algorithm with backtracking • for macro-structure as well as micro-structure recognition • Very generic approach • Sensitive to noise (no error recovery) • Theoretically exponential complexity

  14. Document Description Language [Ingold89] • Document class specific formal description composed of • composition rules (context-free grammar) • typographical rules (attributes)

      Act:DOC => ActNumber ActContent FootNotes Headings ;
      ActNumber:FRG => {Number $ Period} ;
      ActContent:PRT => ActTitle ActDate Otgan {Provis} Formul {Chapter} [Validity] ;
      ...
      Chapter:PRT => ChTitle ({Section} | {Article}) ;
      ChTitle.zone = Inherited
      ChTitle.alignment = (Allowed, Centered, 0pt, 0pt, Undefined) ;
      ChTitle.lineHeight = 11pt ;
      ChTitle.spaceBefore = (Allowed,[6pt, 60pt] ) ;
      ChTitle.interSpace = (Forbidden, [2pt, 3pt]) ;
      ChTitle.font = (Times, 11pt, Bold, Roman);
      Article.spaceBefore = <FST: (Forbidden, [6pt, 30pt]), NXT: (Allowed, [6pt, 30pt])> ;
      ...

  15. Analysis graph [Ingold89] • Analysis graph for syntactic analysis where each node has two links • successor (in case of successful match) • alternative (in case of unsuccessful match)
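The two-link node structure lends itself to a short sketch of top-down matching with backtracking. The code below is an illustration only, not the implementation of [Ingold89]: the names `Node`, `END` and `match` are hypothetical, tokens are reduced to plain labels, typographical attributes are ignored, and the braces in `Chapter:PRT => ChTitle ({Section} | {Article})` are assumed to mean "one or more".

```python
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass(eq=False)
class Node:
    label: str                            # terminal this node tries to match
    successor: Optional["Node"] = None    # followed after a successful match
    alternative: Optional["Node"] = None  # followed after an unsuccessful match

END = Node("<end>")  # sentinel: accept only if every token has been consumed

def match(node: Optional[Node], tokens: Sequence[str], pos: int = 0) -> bool:
    """Top-down matching with backtracking over the analysis graph."""
    if node is None:                      # no link left to follow: dead end
        return False
    if node is END:
        return pos == len(tokens)
    if pos < len(tokens) and tokens[pos] == node.label:
        if match(node.successor, tokens, pos + 1):
            return True
        # the match led to a dead end later on: backtrack to the alternative
    return match(node.alternative, tokens, pos)

# Graph for: Chapter => ChTitle ({Section} | {Article})
sec_more = Node("Section", alternative=END)
sec_more.successor = sec_more             # loop on further sections
art_more = Node("Article", alternative=END)
art_more.successor = art_more             # loop on further articles
art_first = Node("Article", successor=art_more)
sec_first = Node("Section", successor=sec_more, alternative=art_first)
chapter = Node("ChTitle", successor=sec_first)
```

With this graph, `match(chapter, ["ChTitle", "Section", "Section"])` and `match(chapter, ["ChTitle", "Article"])` succeed, while a chapter mixing sections and articles is rejected. A single mislabeled token makes the whole match fail, which reflects the sensitivity to noise noted for the syntactic approach.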

  16. Fuzzy document structure recognition [Hu94] • The previous approach has been adapted to be less sensitive to matching errors

  17. Fuzzy document structure recognition [Hu94] • Pattern matching uses fuzzy logic • Parsing is expressed as a cost function to be optimized • The solution consists of finding the shortest path in a graph • solved by linear programming
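The idea of parsing as a shortest-path problem can be sketched as follows. This is a simplified illustration, not the method of [Hu94]: it assumes each block comes with fuzzy membership degrees per candidate label, takes `1 - membership` as the matching cost, encodes the model as a table of allowed label transitions, and uses Dijkstra's algorithm in place of the linear programming formulation mentioned on the slide. The function name and the example data are hypothetical.

```python
import heapq

def best_labeling(scores, allowed, start="START"):
    """scores: one dict per block, mapping label -> membership in [0, 1].
    allowed: dict mapping a label to the set of labels that may follow it.
    Returns (total_cost, labels) of the cheapest admissible labeling,
    or None if the model admits no labeling at all."""
    # Dijkstra over states (index of the next block, last label assigned).
    heap = [(0.0, 0, start, ())]
    done = set()
    while heap:
        cost, i, last, path = heapq.heappop(heap)
        if i == len(scores):          # all blocks labeled: cheapest path found
            return cost, list(path)
        if (i, last) in done:
            continue
        done.add((i, last))
        for label, mu in scores[i].items():
            if label in allowed.get(last, ()):
                heapq.heappush(
                    heap, (cost + (1.0 - mu), i + 1, label, path + (label,))
                )
    return None

# Hypothetical memberships for three text blocks.
scores = [
    {"Title": 0.9, "Paragraph": 0.3},
    {"Title": 0.6, "Paragraph": 0.5},
    {"Title": 0.1, "Paragraph": 0.8},
]
allowed = {
    "START": {"Title"},
    "Title": {"Paragraph"},
    "Paragraph": {"Paragraph"},
}
result = best_labeling(scores, allowed)
```

In this example the second block looks slightly more like a title when taken alone, but the model forbids two consecutive titles, so the cheapest path labels it as a paragraph. An imperfect match raises the cost instead of aborting the parse.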

  18. Graphein: Blackboard approach [Chenevoy92]

  19. Model of Graphein [Chenevoy92]

  20. Complex Layout Analysis [Azokly95]

  21. Modeling of Scientific Journals [Azokly95]

  22. Model for a Scientific Journal

      <volume name="article" width="160" height="240">
        <page name="first"> ... </page>
        <page name="even">
          <hsep name="hs1" bloc="4 3 LEFT RIGHT" type="BLANK"/>
          <layer name="principle">
            <vsep name="vs1" bloc="40 65 TOP hs1" type="BLANK"/>
            <vsep name="vs2" bloc="[50,60] 4 hs1 BOTTOM" type="BLANK"/>
            <region name="center" bloc="vs2 RIGHT hs1 BOTTOM" content="ANY"/>
            <region name="margin" bloc="LEFT vs2 hs1 BOTTOM" content="TEXT"/>
            ...
          </layer>
          <layer name="secondary">
            <hsep name="hs2" bloc="[10,220] 2 LEFT RIGHT" type="BLANK">
              <subst value="hs1"/>
            </hsep>
            <hsep name="hs3" bloc="[10,220] 2 LEFT RIGHT" type="BLANK">
              <subst value="BOTTOM"/>
            </hsep>
            <region name="figure" bloc="LEFT RIGHT hs2 hs3" content="FIGS"/>
          </layer>
        </page>
        ...
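An explicit model of this kind can be read with any standard XML parser. The sketch below uses Python's `xml.etree.ElementTree` on a reduced, well-formed fragment of the model. The fragment drops the elided parts, the function name `regions_by_layer` is hypothetical, and the `bloc` attributes are left uninterpreted because the slide does not define their semantics.

```python
import xml.etree.ElementTree as ET

# Reduced fragment of the journal model; elided parts are simply left out.
MODEL = """
<volume name="article" width="160" height="240">
  <page name="even">
    <hsep name="hs1" bloc="4 3 LEFT RIGHT" type="BLANK"/>
    <layer name="principle">
      <vsep name="vs2" bloc="[50,60] 4 hs1 BOTTOM" type="BLANK"/>
      <region name="center" bloc="vs2 RIGHT hs1 BOTTOM" content="ANY"/>
      <region name="margin" bloc="LEFT vs2 hs1 BOTTOM" content="TEXT"/>
    </layer>
  </page>
</volume>
"""

def regions_by_layer(xml_text):
    """Map (page name, layer name) to the (name, content) of its regions."""
    root = ET.fromstring(xml_text)
    result = {}
    for page in root.iter("page"):
        for layer in page.iter("layer"):
            key = (page.get("name"), layer.get("name"))
            result[key] = [
                (region.get("name"), region.get("content"))
                for region in layer.iter("region")
            ]
    return result

regions = regions_by_layer(MODEL)
```

Here `regions[("even", "principle")]` lists the `center` and `margin` regions with their expected content types, which is the information a layout analyzer would try to confirm on the page image.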

  23. Use of Document Recognition Models • There is no universal approach! • Document recognition systems must be tuned • for specific applications • for specific document classes • Contextual information is required • Models provide information like • generic document structures (DTD or XML-schema) • geometrical and typographical attributes (style information) • semantic information (keywords, dictionaries, databases, ...) • statistical information

  24. Content of document models • Generic structure • Document Type Definition (DTD) or XML-schema • Style information • Absolute or relative positioning • Typographical attributes & formatting rules • Semantics (if available) • Linguistic information, keywords • Application specific ontology • Probabilistic information • Frequencies of items or sequences, co-occurrences

  25. Trouble with document models • Document models are hard to produce and to maintain • implicit models (hard coded in the application) • => hard to modify, adapt, extend • explicit models, written in a formal language • => cumbersome to produce, needs high expertise • abstract models, learned automatically • => needs a lot of training data (with ground-truth!) • Need for more flexible tools: • assisted environments with friendly user interfaces • recognition improving with use • models are learned incrementally

  26. Pattern Based Document Understanding [Robaday 03] • Configurations consist of • Set of vertices • Labeled (type) • Attributed (pos, typo, ...) • Edges between vertices • Labeled (neighborhood relation) • Attributed (geom, ...) • Model consists of • Extraction rules • For each class • Attribute selector • List of patterns [Diagram labels: document image, model, rules, extraction, configuration, classification, selector, id, patt.]
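The configuration and model described on this slide can be approximated by a small data structure. The sketch below is an assumption-laden illustration rather than the 2-CREM implementation: all class and function names are hypothetical, attributes are assumed numeric, classification is a plain nearest-pattern search over the attributes picked by each class selector, edges are stored but not used for classification, and a manual correction simply adds the corrected vertex as a new pattern.

```python
from dataclasses import dataclass, field

@dataclass
class Vertex:
    type: str    # label of the vertex, e.g. "text"
    attrs: dict  # numeric attributes (position, typography, ...)

@dataclass
class Edge:
    source: int    # index of the source vertex
    target: int    # index of the target vertex
    relation: str  # label: neighborhood relation
    attrs: dict = field(default_factory=dict)  # geometric attributes

@dataclass
class Configuration:
    vertices: list
    edges: list = field(default_factory=list)

@dataclass
class ClassModel:
    selector: tuple                               # attributes compared for this class
    patterns: list = field(default_factory=list)  # stored attribute values

def classify(vertex, model):
    """Return the class whose closest stored pattern is nearest, or None."""
    best = None
    for label, cm in model.items():
        for pattern in cm.patterns:
            d = sum(abs(vertex.attrs[a] - pattern[a]) for a in cm.selector)
            if best is None or d < best[0]:
                best = (d, label)
    return best[1] if best else None

def correct(vertex, label, model):
    """Incremental adaptation: store the corrected vertex as a new pattern."""
    cm = model[label]
    cm.patterns.append({a: vertex.attrs[a] for a in cm.selector})

# The model starts empty and is filled by two manual corrections.
model = {
    "title": ClassModel(("font_size",)),
    "body": ClassModel(("font_size",)),
}
before = classify(Vertex("text", {"font_size": 14}), model)
correct(Vertex("text", {"font_size": 14}), "title", model)
correct(Vertex("text", {"font_size": 10}), "body", model)
after = classify(Vertex("text", {"font_size": 13}), model)
```

Before any correction nothing can be labeled. After two corrections a 13pt block is labeled as a title. Distances are only directly comparable here because both classes use the same selector, so a fuller version would need to normalize them.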

  27. Evolution of 2-CREM performance • Improvement of correct labeling as a function of the number of clicks used to correct labels manually

  28. Conclusion • Structure recognition of documents is still an open issue • Solutions exist for specialized applications • Generic approaches are not mature • models are hard to establish • training data is missing • As an alternative • interactive systems • with incremental model adaptation
