
Aniko T. Valko Keymodule Ltd.

Recent developments in the CLiDE tool for extraction of chemical structure data from patents and other documents. Aniko T. Valko (Keymodule Ltd.), Peter Johnson, Vilmos A. Valko. Topics: About CLiDE (what is CLiDE for?); performance against a benchmark set of images.



Presentation Transcript


  1. Recent developments in the CLiDE tool for extraction of chemical structure data from patents and other documents
     Aniko T. Valko, Keymodule Ltd.; Peter Johnson; Vilmos A. Valko

  2. Summary
     • About CLiDE
        • What is CLiDE for?
     • Performance against a benchmark set of images
        • About the benchmark set
        • Performance of CLiDE
        • Enhancements made in CLiDE
        • Comparison with selected systems
     • Performance against selected patents
        • About patents
        • Performance of CLiDE
        • Comparison with selected systems
     • Conclusions and future work

  3. Part 1: About CLiDE

  4. What is CLiDE for?
     CLiDE is an Optical Chemical Structure Recognition (OCSR) software application, aimed at converting structure diagrams to computer-readable structures (i.e. connection tables).
     • Document formats: PDF, DOC, DOCX, HTML
     • Image formats: BMP, GIF, JPEG, PBM, PGM, PNG, PNM, PPM, TIFF, XBM, XPM
     • Structure formats: Molfile, RGfile, SDfile, CDX, CML, MRV XML

  5. Part 2: Performance against a benchmark set of images

  6. Benchmark set
     • Images of isolated structures, one structure per image
     • US Patent Office Complex Work Unit
     • Available on the OSRA web site
     • #images: 5735
     • Verification set: each image is associated with a Molfile meant to describe the correct connection table
     Example images: US07317070-20080108-C00008, US07314700-20080101-C00001, US07316739-20080108-C00281, US07323286-20080129-C00108, US07321045-20080122-C00150, US07314576-20080101-C00035, US07320972-20080122-C00016, US07314876-20080101-C00035, US07320974-20080122-C00070, US07314511-20080101-C00002

  7. Test environment
     Test runs on the benchmark set
     • Test run per image:
        • CLiDE was run on an image
        • CLiDE analysed the image and generated a connection table
        • The connection table extracted by CLiDE was compared (using canonical SMILES) to the corresponding connection table from the verification set (the so-called ‘ground truth’)
     • Performance measurements:
        • Accuracy rate: the percentage of images that were correctly processed by CLiDE
        • Runtime: the total runtime measured over all the test runs
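The accuracy measurement described on slide 7 can be sketched as follows. This is only an illustration, not CLiDE's test harness: it assumes the canonical SMILES strings have already been produced by some external toolkit, and the function name `accuracy_rate`, the image identifiers, and the SMILES values are all hypothetical.

```python
from typing import Dict, Optional


def accuracy_rate(extracted: Dict[str, Optional[str]],
                  ground_truth: Dict[str, str]) -> float:
    """Percentage of benchmark images whose extracted canonical SMILES
    is identical to the ground-truth canonical SMILES."""
    if not ground_truth:
        raise ValueError("ground truth set is empty")
    correct = 0
    for image_id, truth in ground_truth.items():
        # A missing or None entry means no structure was extracted.
        if extracted.get(image_id) == truth:
            correct += 1
    return 100.0 * correct / len(ground_truth)


# Hypothetical data: canonical SMILES keyed by image identifier.
ground_truth = {
    "img-001": "CCO",
    "img-002": "c1ccccc1",
    "img-003": "CC(=O)O",
    "img-004": "CN",
}
extracted = {
    "img-001": "CCO",       # match
    "img-002": "C1CCCCC1",  # aromatic ring read as saturated ring
    "img-003": "CC(=O)O",   # match
    "img-004": None,        # nothing extracted
}

rate = accuracy_rate(extracted, ground_truth)
print(f"{rate:.2f}%")
```

Comparing strings is only meaningful when both sides are canonicalised by the same toolkit with the same settings; that step is deliberately left outside the sketch.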

  8. Performance against benchmark set
     • Runtime: 44 min → 20 min
     • Accuracy rate: 57.62% → 87.98%
     • Optimization and improvements in CLiDE’s document segmentation method (see later)
     • Further improvements to chemical formula parsing
     • Better handling of aromatic rings
     • Auto correction of atom labels
     • Parsing chemical formulas
     • Avoidance of loss of characters in atom labels
     • Better handling of thick bonds

  9. Enhancements in CLiDE: Corrections in atom labels (59.30% → 81.81%)
     • Auto correction of OCR errors in atom labels
     • Avoidance of misinterpretation of ‘Cl’ labels as carbons

  10. Enhancements in CLiDE: Chemical formula parsing (82.75% → 84.60%)
     Two-step process:
     • Parsing the chemical formula into a sub connection table
     • Generating atom coordinates for the sub connection table
     Problem categories:
     • Super atoms in chemical formulas (―CO2Ph)
     • Left- and right-aligned chemical formulas (―CH2NH2 vs NH2CH2―)
     • Branching in chemical formulas (―OC(CH3)3)
     • Chemical formulas with multiple attachments (―OCH2CH2O―)
     Future work:
     • Variables in chemical formulas (―CO2R, ―NHZ, ―SiR3)
     Super Atom Database: over 1000 super atoms, e.g. Me, Ph, Boc, TBDMS
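To make the problem categories on slide 10 concrete, here is a toy sketch of the very first stage of formula parsing: splitting a condensed formula into tokens with a super atom lookup, then resolving branching to obtain atom counts. It is not CLiDE's parser and builds no connection table or coordinates. The names `tokenize` and `composition`, the tiny `SUPER_ATOMS` and `ELEMENTS` sets, and the longest-match rule are assumptions made for illustration.

```python
import re
from collections import Counter
from typing import Dict, List, Tuple

# Tiny illustrative lookups; the slide mentions a database of over 1000 super atoms.
SUPER_ATOMS = {"Me", "Ph", "Boc", "TBDMS"}
ELEMENTS = {"H", "C", "N", "O", "S", "F", "Cl", "Br", "I", "Si", "P"}
MAX_SYMBOL_LEN = max(len(s) for s in SUPER_ATOMS | ELEMENTS)

Token = Tuple[str, str, int]  # (kind, text, repeat count)


def _match_symbol(formula: str, start: int) -> Tuple[str, str]:
    """Longest match first, so 'Cl' wins over 'C' and 'Ph' over 'P'."""
    for length in range(MAX_SYMBOL_LEN, 0, -1):
        candidate = formula[start:start + length]
        if candidate in SUPER_ATOMS:
            return "superatom", candidate
        if candidate in ELEMENTS:
            return "element", candidate
    raise ValueError(f"unrecognised symbol at position {start} in {formula!r}")


def tokenize(formula: str) -> List[Token]:
    tokens: List[Token] = []
    i = 0
    while i < len(formula):
        if formula[i] == "(":
            tokens.append(("open", "(", 1))
            i += 1
            continue
        if formula[i] == ")":
            kind, text = "close", ")"
            i += 1
        else:
            kind, text = _match_symbol(formula, i)
            i += len(text)
        digits = re.match(r"\d+", formula[i:])  # optional repeat count
        count = int(digits.group()) if digits else 1
        if digits:
            i += digits.end()
        tokens.append((kind, text, count))
    return tokens


def composition(tokens: List[Token]) -> Dict[str, int]:
    """Resolve branches with a stack; super atoms are counted as single units."""
    stack = [Counter()]
    for kind, text, count in tokens:
        if kind == "open":
            stack.append(Counter())
        elif kind == "close":
            if len(stack) == 1:
                raise ValueError("unbalanced ')'")
            group = stack.pop()
            for symbol, n in group.items():
                stack[-1][symbol] += n * count
        else:
            stack[-1][text] += count
    if len(stack) != 1:
        raise ValueError("unbalanced '('")
    return dict(stack[0])


print(tokenize("CO2Ph"))
print(composition(tokenize("OC(CH3)3")))
```

A variable such as R or Z is not in either lookup, so `tokenize("CO2R")` raises `ValueError`, which mirrors the slide listing variables as future work. Alignment (―CH2NH2 vs NH2CH2―) and multiple attachment points are not addressed by this sketch.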

  11. Enhancements in CLiDE: Avoidance of loss of characters from atom labels (84.60% → 85.78%)

  12. Enhancements in CLiDE: Better handling of thick bonds (stereo indicators) (85.78% → 86.55%)

  13. Comparison with selected systems
     [Chart; recoverable values for CLiDE: accuracy rate 87.96% and 57.62%; runtime (hour:min) 00:44 and 00:20]

  14. Is the benchmark set correct? Verification set
     #Molfiles to be corrected: 117
     • Anomalies: 10
     • Stereo bonds: 22
     • Incorrect sub connection tables for chemical formulae (e.g. NC, H3CO2S, OCF3): 63
     • Errors in atom label: 14
     • Other kinds of error: 17
     Examples (image and Molfile): US07314693-20080101-C00370.TIF / US07314693-20080101-C00370.MOL; US07314872-20080101-C00024.TIF / US07314872-20080101-C00024.MOL; US07316472-20080108-C00239.TIF / US07316472-20080108-C00239.MOL

  15. Is the benchmark set correct? Input images
     #images to be excluded: 16
     • Incorrect chemical formula: 1
     • Incorrect or ambiguous stereo bond: 6
     • Disconnected atom: 1
     • Arrow with unknown meaning: 8
     Examples: US07314693-20080101-C00112.TIF, US07314874-20080101-C00551.TIF, USRE039991-20080101-C00187.TIF, US07320974-20080122-C00022.TIF, USRE039991-20080101-C00188.TIF

  16. Performance after corrections
     • #images: 5735
     • #corrected Molfiles: 117
     • #excluded images: 16
     Accuracy rate before → after corrections:
     • CLiDE 5.5.4: 87.96% → 90.11%
     • OSRA 1.4.0: 68.68% → 69.84%
     • Imago 2.0 beta: 61.28% → 61.91%
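The re-scoring step behind slide 16 (replace faulty ground-truth entries, drop unusable images, recompute the accuracy rate) might look like the sketch below. The function name `rescore` and all identifiers and SMILES strings are hypothetical, and the toy numbers are unrelated to the percentages reported on the slide.

```python
from typing import Dict, Optional, Set


def rescore(extracted: Dict[str, Optional[str]],
            ground_truth: Dict[str, str],
            corrected: Dict[str, str],
            excluded: Set[str]) -> float:
    """Accuracy rate (%) after overriding corrected ground-truth entries
    and removing excluded images from the benchmark set."""
    truth = {**ground_truth, **corrected}  # corrections take precedence
    kept = [image_id for image_id in truth if image_id not in excluded]
    if not kept:
        raise ValueError("no images left after exclusion")
    correct = sum(1 for image_id in kept
                  if extracted.get(image_id) == truth[image_id])
    return 100.0 * correct / len(kept)


# Hypothetical canonical SMILES keyed by image identifier.
ground_truth = {"a": "CCO", "b": "CCN", "c": "CCCl", "d": "c1ccccc1", "e": "CC"}
extracted = {"a": "CCO", "b": "CC#N", "c": "CCCl", "d": "C1CCCCC1", "e": "CCC"}

corrected = {"b": "CC#N"}  # ground-truth entry that was itself wrong
excluded = {"e"}           # image judged unusable

before = rescore(extracted, ground_truth, {}, set())
after = rescore(extracted, ground_truth, corrected, excluded)
print(f"{before:.2f}% -> {after:.2f}%")
```

Note that corrections and exclusions can move the rate in either direction for a given system; here image "d" stays wrong because the extraction itself is at fault.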

  17. Part 3: Performance against selected patents

  18. About patents
     Challenges:
     • Chemical structure diagrams have to be identified within the document page
     • Interpretation of Markush structures
     Markush structures were excluded from our tests

  19. Challenge: Document segmentation
     Underlined text: page 65 of US6410540 [figure; panels labelled 5.5.4 and 4.4.0]

  20. Challenge: Document segmentation
     Table: page 188 of WO2008019099 [figure; panels labelled 5.5.4 and 4.4.0]
     Performance measurements:
     • Accuracy rate
     • Runtime
     • #Garbage structures: the number of structures that were assigned to non-chemical structure diagrams

  21. Performance of CLiDE
     [Results for US6410540 and WO2008019099]

  22. Comparison with selected systems: comparison with OSRA
     [Results for US6410540 and WO2008019099]

  23. Part 4: Conclusions and future work

  24. Conclusions
     • There has been considerable progress in OCSR, but many problems nevertheless remain to be solved
     • The test sets showcased the diversity and the frequency of the problem types
     • Regarding performance:
        • CLiDE has greatly improved during the last few years
        • CLiDE compares well with the other OCSR systems available to us for testing
        • In favourable cases, OCSR as exemplified by CLiDE now approaches OCR in accuracy (90%)

  25. Future work
     Short-term goals:
     • Further improvements to structure recognition
     • Further improvements to document segmentation
     • Identification and exclusion of non-chemical structure diagrams
     • Filtering out garbage structures
     Long-term goals:
     • Contextual document analysis, aimed at linking structures to text data

  26. Flavours of CLiDE
     CLiDE is released in three variants, designed for individual user needs:
     • CLiDE Standard: designed for the individual chemist who wishes to convert selected images into editable structures for use in reports etc.
     • CLiDE Professional: GUI enterprise version to process whole documents with interactive editing
     • CLiDE Batch: unsupervised extraction for database creation etc.

  27. Further information www.keymodule.co.uk info@keymodule.co.uk

  28. Acknowledgment
     • Peter Johnson, Keymodule Ltd. and University of Leeds
     • Vilmos A. Valko, Keymodule Ltd.
     • Anthony P. Cook, University of Leeds
     • Reseller agents:
        • SimBioSys Inc. (North America)
        • NeoTrident Technology Ltd. (China)
        • Hulinks Inc. (Japan)
     • All users who gave us constructive feedback
