

1. In the name of God, the Most Gracious, the Most Merciful. Presenting Results and Training Data of an Expanded Evaluation Experiment of HMM-Based Arabic Omni Font-Written OCR. Mohamed Attia & Mohamed El-Mahallawy. RDI's Meeting Room; Oct. 2007

2. Overall Results
The omni quality of an OCR system is measured by its capability at:
• Assimilation: how well it recognizes pages (whose text content is not included in the training data) printed in fonts represented in the training data. Ultimate predefined goal: WERA (word error rate on assimilation) around 3%.
• Generalization: how well it recognizes pages printed in fonts not represented in the training data. Ultimate predefined goal: WERG (word error rate on generalization) around 3·WERA.
A sketch of the underlying WER computation is given below.
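The slides use WERA and WERG without spelling out the computation; the following is a minimal sketch, assuming WER is the standard word-level edit distance between the OCR output and a reference transcription, normalized by the reference length. The function name and this exact normalization are illustrative assumptions, not taken from the system itself.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance between the
    reference text and the OCR output, divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# WERA: wer() averaged over test pages printed in trained fonts.
# WERG: wer() averaged over pages printed in fonts withheld from training.
```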

3. Error Analysis of Assimilation Test Regarding Font Shape/Size (table of results by font shape and size not captured in the transcript)

4. Error Analysis of Assimilation Test Regarding the Most Frequent Recognition Mistakes. These are the most frequent 17 mistakes, which together contribute about 63.15% of WERA (table of mistakes not captured in the transcript).

5. Error Analysis of Generalization Test Regarding Font Shape/Size (table of results by font shape and size not captured in the transcript). Sample pages from 2 books of typical quality have also been tried: the WERG on book #1's sample pages (1,700 words) is 11.70%, and on book #2's samples (1,100 words) it is 7.25%.

6. Error Analysis of Generalization Test Regarding the Most Frequent Recognition Mistakes. These are the most frequent 19 mistakes, which together contribute about 55.90% of WERG (table of mistakes not captured in the transcript).

7. Training and Evaluation Data
• 9 distinct fonts, each over its significant range of writing sizes, are used for training/building the recognition models; 7 of them are MS-Windows fonts and 2 are Mac OS fonts.
• At each size of each font, 25 different pages are used for training and another 5 different pages for the assimilation test. This sums up to 9·6·25 = 1,350 pages ≈ 1,350·200 = 270,000 words ≈ 270,000·4 = 1,080,000 graphemes for training, and 9·6·5 = 270 pages ≈ 54,000 words ≈ 216,000 graphemes for assimilation testing.
• 3 further Mac OS fonts, over their full size range, are used for the generalization test.
• At each size of each of these 3 fonts, 5 pages are used for the generalization test. This sums up to 3·6·5 = 90 pages ≈ 18,000 words ≈ 72,000 graphemes for generalization testing (a sanity check of all these figures follows below).
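The products above imply 6 sizes per font, ~200 words per page, and ~4 graphemes per word; under those assumptions the figures can be reproduced mechanically:

```python
# Sanity check of the data-size arithmetic on this slide.
# Assumes 6 sizes per font, ~200 words/page, ~4 graphemes/word,
# as implied by the products written on the slide.
WORDS_PER_PAGE, GRAPHEMES_PER_WORD, SIZES_PER_FONT = 200, 4, 6

train_pages = 9 * SIZES_PER_FONT * 25   # 1,350
assim_pages = 9 * SIZES_PER_FONT * 5    # 270
gener_pages = 3 * SIZES_PER_FONT * 5    # 90

for name, pages in [("training", train_pages),
                    ("assimilation", assim_pages),
                    ("generalization", gener_pages)]:
    words = pages * WORDS_PER_PAGE
    graphemes = words * GRAPHEMES_PER_WORD
    print(f"{name}: {pages} pages ≈ {words:,} words ≈ {graphemes:,} graphemes")

# training: 1350 pages ≈ 270,000 words ≈ 1,080,000 graphemes
# assimilation: 270 pages ≈ 54,000 words ≈ 216,000 graphemes
# generalization: 90 pages ≈ 18,000 words ≈ 72,000 graphemes
```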

8. Effect of Language Model
• Our language model is constrained neither by a certain lexicon nor by a set of linguistic rules; i.e., it is an open-vocabulary language model.
• Our statistical language model (SLM) is an m-gram model built using a Bayes / Good-Turing / back-off methodology.
• The unit of our SLM is the grapheme, i.e., the ligature.
• The order of the SLM deployed in our system is 2, i.e., a bigram model (a simplified sketch follows this list).
• Our SLM is built from the NEMLAR raw text corpus, 550,000 words (≈ 2,200,000 graphemes) in size, distributed over the 13 most significant domains of modern and heritage standard Arabic.
• Deploying/neutralizing the SLM has the following effect on the realized WER of our system: (comparison table not captured in the transcript)
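A minimal sketch of a grapheme-level bigram model with back-off follows. Absolute discounting is used here as a simpler stand-in for the Good-Turing estimation named above, and all class and parameter names are illustrative rather than taken from the actual system.

```python
from collections import Counter

class BackoffBigramLM:
    """Grapheme-level bigram LM with back-off to unigrams.

    Absolute discounting stands in for the Good-Turing discounting
    named on the slide; the structure is the same: discount the
    counts of seen bigrams and redistribute the freed probability
    mass over unseen continuations via the unigram distribution.
    """

    def __init__(self, grapheme_sequences, discount=0.5):
        self.d = discount
        self.uni = Counter()   # grapheme -> count
        self.bi = Counter()    # (prev, cur) -> count
        for seq in grapheme_sequences:
            self.uni.update(seq)
            self.bi.update(zip(seq, seq[1:]))
        self.total = sum(self.uni.values())

    def prob(self, prev, cur):
        c_prev = self.uni[prev]
        if c_prev == 0:                       # unseen history: pure unigram
            return self.uni[cur] / self.total
        c_bi = self.bi[(prev, cur)]
        if c_bi > 0:                          # discounted bigram estimate
            return (c_bi - self.d) / c_prev
        # Freed mass for this history, spread over unseen continuations
        # in proportion to their unigram probabilities.
        seen = {w for (p, w) in self.bi if p == prev}
        freed = self.d * len(seen) / c_prev
        p_seen = sum(self.uni[w] for w in seen) / self.total
        return freed * (self.uni[cur] / self.total) / max(1.0 - p_seen, 1e-12)
```

In decoding, an order-2 SLM of this kind would rescore a recognition hypothesis by summing the log of prob(prev, cur) over successive grapheme pairs.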

9. Appreciating How Distinct the Fonts Used for Training, Assimilation Testing, and Generalization Testing Are (font samples follow in slides 14-25)

10. Can Our OCR System Statistically Build Concepts of Font Shapes? A Case Study
• Some fonts that are conceptually distinct from the ones comprising the training data are very challenging in generalization testing; i.e., WERG >> WERA.
• In our first trial of a generalization test, the recognition models were built from the 7 MS-Windows fonts and the testing data was composed of 3 Mac OS fonts. Under these conditions we got the poor result of WERG ≈ 35% ≈ 11·WERA (WERG >> WERA).
• After error analysis and some contemplation, we realized that Mac OS fonts are built with different concepts not covered by the 7 MS-Windows fonts, e.g., connected dots, overlapping tails of some graphemes, etc.
• After adding 2 Mac OS fonts to introduce those concepts into the training data, we achieved a dramatic enhancement: WERG = 10.32% ≈ 3.4·WERA.
So our OCR system can statistically build font-shape concepts.

11. Current Parameter Settings and Computational Capacity
Computational capacity of the current pilot system:
• Runtime (recognition phase): somewhat slow, but bearable.
• Offline (training phase): very slow! In the experiment reported here, building the codebook (sketched below) takes about 45 hours, and building the HMMs takes about 53 hours.
• As our pilot system is built from a hybrid of off-the-shelf tools (some voluntarily built), a professionally optimized software implementation of the system may save up to 50% of the training/recognition time.
• Another 25% may be saved by using more powerful contemporary hardware.
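The slides do not say how the codebook is built; in discrete-HMM recognizers it is typically a vector-quantization codebook over feature vectors, with HMM emissions being indices into it. The k-means sketch below illustrates that step under that assumption; the feature set, codebook size, and clustering algorithm of the actual pilot system are not specified, so everything here is illustrative.

```python
import numpy as np

def build_codebook(features: np.ndarray, k: int = 256,
                   iters: int = 20, seed: int = 0) -> np.ndarray:
    """Toy vector-quantization codebook via k-means.

    `features` is an (n_frames, dim) array of feature vectors extracted
    from page images; k = 256 is an illustrative codebook size, not the
    pilot system's. Discrete-HMM emissions are then indices into the
    returned codebook.
    """
    rng = np.random.default_rng(seed)
    # Initialize centroids from randomly chosen feature frames.
    idx = rng.choice(len(features), size=k, replace=False)
    centroids = features[idx].astype(float)
    for _ in range(iters):
        # Assign each frame to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of the frames assigned to it.
        for j in range(k):
            members = features[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids
```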

12. Conclusion
• Is our OCR system truly omni? Yes. It can both assimilate and generalize at a remarkable WER under tough training and testing data sets. In fact, the obtained WERs are the best reported in the published literature in this regard.
• Is there room for further enhancement? Yes, regarding both:
- reducing WERG, by building recognition models from more distinct fonts (especially Mac ones), sizes, and writing styles;
- reducing the training/recognition time, by a professionally optimized rebuild of the core system as well as more powerful hardware.

13. Thank you for your kind attention. To probe further, contact: m_Atteya@RDI-eg.com, Mahallaway@AAST.edu

14. Simplified Arabic (MS-Windows)

15. Mudir (MS-Windows)

16. Koufi (MS-Windows)

17. Traditional Arabic (MS-Windows)

18. Akhbar (MS-Windows)

19. Tahoma (MS-Windows)

20. Courier New (MS-Windows)

21. Baghdad (Mac)

22. Demashq (Mac)

23. Nadeem (Mac)

24. Naskh (Mac)

25. Giza (Mac)
