Introducing CLiDE Pro: A chemical OCR tool Aniko T. Valko, Keymodule Ltd.
Chemical Structure Diagrams
• Chemical structure diagrams are a form of representation of chemical compounds.
• Information contained in a structure diagram can be divided into three areas:
  • Atom information: chemical elements, functional groups, generic elements, vertex label, charge, atomic weight, hybridization, etc.
  • Bond information: bond orders, bond styles, bond labels
  • Structural information: atom information, bond information, overall charge, structure label
What is chemical OCR for?
• Publication process: chemical structure diagrams are converted to images. All chemical information is lost!
• Manual reproduction: slow and prone to errors
• Chemical OCR: automatic extraction of chemical information from chemical structure depictions
  • 20-90 seconds per page
Example connection table (MDL MOL V2000, excerpt):
 29 31  0  0  0  0  0  0  0  0999 V2000
   -1.9417    2.3939    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.3542    1.6794    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -1.9417    0.9649    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -1.1167    0.9649    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.7042    1.6794    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -1.1167    2.3939    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -3.1792    1.6794    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -3.5917    0.9649    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -3.5917    2.3939    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -4.0042    1.6794    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.1208    1.6794    0.0000 S   0  0  0  0  0  0  0  0  0  0  0  0
    0.7042    1.0961    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   -0.0927    2.4763    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    0.7042    2.2628    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    1.5292    1.0961    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9417    0.3816    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
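To make the connection-table excerpt on this slide easier to read, here is a minimal Python sketch that pulls the atom count, bond count, coordinates, and element symbols out of such lines. The names `Atom`, `parse_counts`, and `parse_atom` are hypothetical and not part of CLiDE Pro. The sketch splits on whitespace, which works for the lines shown here; real V2000 files use fixed-width columns, so a robust reader would slice by column instead (whitespace splitting breaks once a count reaches three digits).

```python
from typing import NamedTuple


class Atom(NamedTuple):
    x: float
    y: float
    z: float
    symbol: str


def parse_counts(line: str) -> tuple[int, int]:
    """Return (atom_count, bond_count) from a whitespace-separated counts line."""
    fields = line.split()
    return int(fields[0]), int(fields[1])


def parse_atom(line: str) -> Atom:
    """Return coordinates and element symbol from one atom line."""
    fields = line.split()
    return Atom(float(fields[0]), float(fields[1]), float(fields[2]), fields[3])


# Lines copied from the excerpt on the slide.
counts_line = "29 31 0 0 0 0 0 0 0 0999 V2000"
atom_lines = [
    "-1.9417 2.3939 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0",
    "0.1208 1.6794 0.0000 S 0 0 0 0 0 0 0 0 0 0 0 0",
    "0.7042 1.0961 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0",
]

n_atoms, n_bonds = parse_counts(counts_line)
atoms = [parse_atom(line) for line in atom_lines]
```

Here `n_atoms` and `n_bonds` come from the first two fields of the counts line, and each `Atom` carries the 2D position that a depiction would use.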
CLiDE Pro
A chemical OCR software tool
The latest incarnation of software to emerge from the long-term CLiDE (Chemical Literature Data Extraction) project [1-3].
[1] P. Ibison, M. Jacquot, F. Kam, A. Neville, R.W. Simpson, C. Tonnelier, T. Venczel and A.P. Johnson. Chemical Literature Data Extraction: The CLiDE Project. J. Chem. Inf. Comput. Sci. 1993, 33(3), 338-344.
[2] P. Ibison, F. Kam, R.W. Simpson, C. Tonnelier, T. Venczel and A.P. Johnson. Chemical Structure Recognition and Generic Text in the CLiDE Project. In Proceedings on Online Information 92. 1992, London, England.
[3] A. Simon and A.P. Johnson. Recent Advances in the CLiDE Project: Logical Layout Analysis of Chemical Documents. J. Chem. Inf. Comput. Sci. 1997, 37(1), 109-116.
Features
• Converts chemical images into connection tables
• Loads PDF documents, as well as TIFF and BMP image files
• Exports chemical information into MDL MOL files
• Supports document-oriented processing as opposed to page-oriented processing
  • The whole document is loaded and processed at once rather than as individual pages.
• Handles various difficult drawing features
• Interprets generic structures
• Operates in interactive or batch mode
• Provides tools for structure and text editing
Three main problems involved in chemical OCR
1. Identification of chemical images within a document.
2. Compilation of chemical graphs of individual molecules from chemical images.
3. Interpretation of complex objects such as generic structures using the retrieved chemical graphs.
Problem 1: Identification of chemical images within a document
CLiDE Pro’s solutions to Problem 1: document image segmentation
• Identification of connected components
• Bottom-up layout analysis by building the tree structure of the page
[Figure: digitized image of a document page of a patent]
[Figure: segmented document highlighting recognized text blocks and graphic blocks]
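The "identification of connected components" step can be illustrated with a generic labeling routine. The following is only a sketch under stated assumptions, not CLiDE Pro's implementation: the page is a small binary grid (1 = black pixel), pixels are joined using 8-connectivity, and the function name `label_components` is hypothetical.

```python
from collections import deque


def label_components(image):
    """Group black pixels (value 1) into 8-connected components.

    `image` is a list of rows; each component is returned as a list of
    (row, col) pixels, in the order the components are met while scanning.
    """
    rows = len(image)
    cols = len(image[0]) if image else 0
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill starting from this unvisited black pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


page = [
    [1, 1, 0, 0, 0],
    [0, 1, 0, 0, 1],
    [0, 0, 0, 0, 1],
    [1, 0, 0, 0, 0],
]
components = label_components(page)
```

A bottom-up layout analysis would then merge neighbouring components into larger blocks (characters into words, words into text lines, and so on); that grouping is not shown here.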
Problem 2: Extraction of connection tables from chemical images
CLiDE Pro’s solutions to Problem 2
1. A chemical image
2. Classification of connected components into basic groups:
  • characters
  • lines
  • dashes
  • graphics
3. Construction of dashed bonds based on the Hough transform method [4]
4. Vectorization based on a polygon approximation method [5]
5. Construction of atom labels:
  • OCR
  • Grouping characters into atom labels
  • Recognition of superatoms
6. Construction of connection table:
  • Connecting lines to atoms
  • Joining lines to form implicit carbon atoms
[Figure: 3D molecular structure after exporting the constructed CT into an SDF file in 2D and converting the structure from 2D to 3D]
[4] R.O. Duda and P.E. Hart. Use of the Hough Transform to Detect Lines and Curves in Pictures. Graphics Image Process. 1972, 1.
[5] J. Sklansky and V. Gonzalez. Fast Polygonal Approximation of Digitized Curves. Pattern Recognit. 1980, 12, 327-331.
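The connection-table step (connecting lines to atoms, joining lines to form implicit carbon atoms) can be pictured with a small sketch. It assumes the earlier steps have already produced bond line segments and positioned atom labels; the function `build_connection_table`, the tolerance `tol`, and the input layout are all hypothetical and chosen only for illustration.

```python
import math


def build_connection_table(segments, labels, tol=0.2):
    """Turn bond segments and atom labels into (atoms, bonds).

    segments: list of ((x1, y1), (x2, y2)) line segments.
    labels:   list of (symbol, (x, y)) recognized atom labels.
    A segment end within `tol` of an existing atom is attached to it;
    otherwise a new implicit carbon is created at that point.
    """
    atoms = [(symbol, pos) for symbol, pos in labels]

    def atom_index(point):
        for i, (_, pos) in enumerate(atoms):
            if math.dist(point, pos) <= tol:
                return i
        atoms.append(("C", point))  # implicit carbon where lines meet or end
        return len(atoms) - 1

    bonds = [(atom_index(start), atom_index(end)) for start, end in segments]
    return atoms, bonds


# Two bond lines in a row, the second ending next to an "O" label.
segments = [((0.0, 0.0), (1.0, 0.0)), ((1.0, 0.0), (2.0, 0.0))]
labels = [("O", (2.1, 0.0))]
atoms, bonds = build_connection_table(segments, labels)
```

The shared endpoint at (1.0, 0.0) becomes a single carbon used by both bonds, which is the "joining lines" idea in its simplest form. Bond orders, stereo styles, and label-sized attachment tolerances are left out.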
Problem 3: Interpretation of generic structures
CLiDE Pro’s solutions to Problem 3
1. Generic text interpretation (GTI): R-groups, substitution values, labels
  • Currently, GTI is limited to text in which an ‘=’ sign separates the R-groups from the substituents.
  • However, combined assignments to R-groups are handled successfully.
2. Association of the generic text block with the structure by matching R-groups present in both the text and the structure
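As a rough illustration of '='-based generic text interpretation and of matching R-groups between text and structure, consider the sketch below. It is not CLiDE Pro's parser: the separators (';' between clauses, ',' within a clause) and the function names `parse_generic_text` and `associate` are assumptions made for this example.

```python
def parse_generic_text(text):
    """Map each R-group to its substituents, e.g. 'R1, R2 = H, Me; R3 = Cl'.

    Everything left of '=' is treated as one or more R-group names
    (a combined assignment); everything right of it as substituent values.
    """
    assignments = {}
    for clause in text.split(";"):
        if "=" not in clause:
            continue  # only '='-separated clauses are interpreted
        left, right = clause.split("=", 1)
        groups = [g.strip() for g in left.split(",") if g.strip()]
        values = [v.strip() for v in right.split(",") if v.strip()]
        for group in groups:
            assignments.setdefault(group, []).extend(values)
    return assignments


def associate(assignments, structure_labels):
    """Keep only the R-groups that also appear as labels in the structure."""
    return {g: v for g, v in assignments.items() if g in structure_labels}


parsed = parse_generic_text("R1, R2 = H, Me; R3 = Cl")
matched = associate(parsed, {"R1", "R3"})
```

In this toy input, `R1, R2 = H, Me` is a combined assignment, so both R-groups receive the same list of substituents.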
Alignment of Atom Labels
Two types of alignment of atom labels with more than one character:
• Horizontal atom labels
• Vertical atom labels
[Figure: examples]
Alignment of Atom Labels
[Figure: input image and the constructed molecule as interpreted in CLiDE Pro’s GUI]
Ambiguity in interpretation
• Horizontal lines representing dashes of a dashed wedged bond
• A horizontal line representing a negative charge
Contextual analysis
Ambiguity in interpretation
[Figure: input image and the constructed molecule as interpreted in CLiDE Pro’s GUI]
Ambiguity in interpretation
• A vertical line as part of a double bond
• Vertical lines representing iodine atoms
Contextual analysis
Ambiguity in interpretation
[Figure: input image and the constructed molecule as interpreted in CLiDE Pro’s GUI]
Ambiguity in interpretation
Circles represent:
• oxygen atoms
• aromatic rings
Contextual analysis
Ambiguity in interpretation
[Figure: input image and constructed molecule]
Crossing bonds in bridged molecule
• No extra carbon atom is generated at the point where bonds cross each other
• Functional groups are expanded in the exported structure
[Figure: input image and constructed molecule]
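One way to avoid generating a carbon atom where two bonds merely cross is to detect that the segments intersect in their interiors rather than at a shared endpoint. The sketch below shows a standard orientation-based test; it is a generic geometric check, not a description of how CLiDE Pro handles crossings, and the names `orient` and `bonds_cross` are hypothetical.

```python
def orient(p, q, r):
    """Signed area test: > 0 if r lies left of the line p->q, < 0 if right."""
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])


def bonds_cross(a, b):
    """True if segments a and b intersect at a point interior to both.

    Segments that only share an endpoint (bonds meeting at an atom)
    are not reported as crossing.
    """
    (p1, p2), (p3, p4) = a, b
    d1 = orient(p3, p4, p1)
    d2 = orient(p3, p4, p2)
    d3 = orient(p1, p2, p3)
    d4 = orient(p1, p2, p4)
    return d1 * d2 < 0 and d3 * d4 < 0


crossing = bonds_cross(((0, 0), (2, 2)), ((0, 2), (2, 0)))
meeting = bonds_cross(((0, 0), (1, 0)), ((1, 0), (2, 1)))
```

A graph builder could consult such a test before merging line ends, so that a crossing in a bridged ring system does not become an atom.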
A generic structure
[Figure: input image and constructed molecules for R = H and R = Me]
Bad image quality
• Isolated black spots (noise from scanning)
• Black spots touching one connected component (CC)
• Black spots merging two or more CCs
[Figure: input image and constructed molecule]
Bad image quality
[Figure: input image and constructed molecule]
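Of the three noise cases listed for bad image quality, only the first (isolated black spots) lends itself to a very simple filter. The sketch below drops connected components whose bounding box is tiny in both directions. The threshold `max_extent` and the function names are hypothetical, and a real system would need more care, since small genuine marks (dots, short dashes, charges) can look like specks. Spots touching or merging components are not addressed.

```python
def bounding_box(pixels):
    """Return (top, left, bottom, right) for a list of (row, col) pixels."""
    rows = [r for r, _ in pixels]
    cols = [c for _, c in pixels]
    return min(rows), min(cols), max(rows), max(cols)


def drop_isolated_specks(components, max_extent=2):
    """Discard components no larger than max_extent pixels in both directions."""
    kept = []
    for pixels in components:
        top, left, bottom, right = bounding_box(pixels)
        height = bottom - top + 1
        width = right - left + 1
        if height <= max_extent and width <= max_extent:
            continue  # treated as scanning noise in this sketch
        kept.append(pixels)
    return kept


components = [
    [(0, 0)],                          # single-pixel speck
    [(5, 5), (5, 6), (5, 7), (5, 8)],  # short horizontal stroke
    [(9, 9), (9, 10), (10, 9)],        # 2x2 blob
]
cleaned = drop_isolated_specks(components)
```

With the default threshold, only the four-pixel stroke survives, because it is wider than `max_extent` in one direction.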
Conclusions and Outlook
• CLiDE Pro, a chemical OCR tool
• 3 main problems in chemical OCR and CLiDE Pro’s solutions
• The quality of interpretation depends on the ability to deal with difficult situations such as
  - ambiguous drawing features
  - distortions resulting from bad image quality
• Goal: to extend CLiDE Pro to further chemical drawing features such as
  - Reaction schemes (partly implemented)
  - Improved generic text interpretation (dealing with tables of R-groups)
  - Frequency variation in Markush structures
  - Positional variation in Markush structures
  - Other difficult situations (e.g. missing bonds between ring atoms)
Palytoxin: a complex structure
[Figure: input image and constructed molecule]
Further Information
• CLiDE Pro is licensed with Keymodule Ltd. and SimBioSys Inc.
• http://www.keymodule.co.uk
• http://www.simbiosys.ca
• Live demo at Booth #817
Acknowledgments
• People who previously worked on CLiDE