By Prof. Mohsen A. A. Rashwan; Cairo University, RDI & Dr. Mohamed Attia; RDI

Egyptian Ministry of Communications and Information Technology
Research and Development Centers of Excellence Initiative
Data Mining and Computer Modeling Center of Excellence
Arabic Text Mining Project Presentation
By Prof. Mohsen A. A. Rashwan; Cairo University, RDI & Dr. Mohamed Attia; RDI

Formation
• EMCIT launched the Centers-of-Excellence initiative in an attempt to establish slim, focused, responsive, and effective R&D bodies in vital, modern areas of advanced CIT, free of the bureaucracy of the bulkier conventional institutions.
• EMCIT has started with the Data Mining & Computer Modeling CoE; other centers for Mobile Computing, Micro-Electronics, …, are following.
• The Data Mining CoE is now up and running with five major projects: Arabic Text Mining, Basic DM Research, Tourism, e-Health, and Oil & Gas.
• The staff of the Text Mining project is a select group, 27 so far, of the brightest professors, graduate researchers, and engineers specialized in Computer Science, Computational Linguistics, and Classical Linguistics. They come from both academia and the private IT sector.

Need, Challenge, Edge, and Capability
• The strategic move towards CIT as a firm basis for a modernized economic infrastructure in Egypt makes it clear why Data Mining in general, and Text Mining in particular, have emerged as R&D priorities in Egypt.
• As mountains of Arabic text documents have accumulated over the years, the knowledge contained in these treasures is badly needed as the basis of sound decision making in virtually all kinds of vital activities.
• The novelty of the TM paradigm, together with the sophisticated specifics of the Arabic language, which is more than 1,600 years old and spoken natively by about 6% of the world's population, presents the non-trivial challenge of developing effective Arabic Text Mining tools & applications.
• In addition to the well-chosen human resources devoted to this task, we think we have an edge in this area as native specialists in Arabic NLP with good past experience in such projects; e.g. the Euro-Med. project of NEMLAR; www.NEMLAR.org

Arabic NLP infrastructure, Text Mining tools, and Applications

Phenomenon, Challenge, and Solution
• Phenomenon: Arabic is a highly derivative and inflective language with a tremendous vocabulary-generation capability. Billions of full-form words are possible!
• Challenge: This makes the various stochastic methodologies deployed in language-independent Text Mining tools perform more poorly on full-form Arabic text than on less inflective and derivative languages (e.g. English), because of the higher dimensionality and more diluted correlations.
• Solution: Our approach is to replace the surface target text with effective types of Text Factorization that both reduce the dimensionality and concentrate the correlations of the resulting sequences relative to the (original) surface text. Finding and deploying effective language factorization(s) with those two features strikingly helps whatever kind of statistical machine learning methodology is used for text mining applications on Arabic text (or similar languages).
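The dimensionality argument above can be illustrated with a toy count. The sketch below is only an illustration under stated assumptions: the romanized prefix, stem, and suffix lists are hypothetical, they are not the project's lexicon, and plain concatenation ignores real Arabic morphotactic constraints. It shows that the number of full-form words grows multiplicatively with the morpheme inventories, while the factorized vocabulary grows only additively.

```python
from itertools import product

# Hypothetical, romanized toy inventories (NOT the project's lexicon).
# The empty string stands for "no prefix" / "no suffix".
PREFIXES = ["", "wa", "al", "bi"]
STEMS = ["kitab", "qalam", "bayt"]
SUFFIXES = ["", "hu", "ha", "hum"]


def surface_forms(prefixes, stems, suffixes):
    """All full-form words obtained by naive concatenation."""
    return {p + s + x for p, s, x in product(prefixes, stems, suffixes)}


def factor_vocabulary(prefixes, stems, suffixes):
    """Distinct non-empty morphemes a factorized text is built from."""
    return {m for m in (*prefixes, *stems, *suffixes) if m}


full = surface_forms(PREFIXES, STEMS, SUFFIXES)
factors = factor_vocabulary(PREFIXES, STEMS, SUFFIXES)

# Surface vocabulary: 4 * 3 * 4 combinations; factor vocabulary: 3 + 3 + 3.
print(len(full), len(factors))  # 48 9
```

With realistic inventories the gap widens quickly, which is the sense in which a statistical model trained on factors sees far fewer, and far more frequently repeated, symbols than one trained on full-form words.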

Arabic Language Factorisation
• In our view, Arabic lexical factorization, Part-of-Speech tagging, and lexical semantic factorization are the kinds of text factorization of special relevance to text mining.
• A simple, regular, and comprehensive Arabic lexical model with a compact set of morphemes has been designed and proven to cover the lexical complexities of the Arabic language.
• An Arabic lexicon, lexical analyzer, and PoS tagger have been built according to this model and deployed in many applications, where they proved effective.
• A knowledge base that maps the Arabic lexicon to (tokenized) semantic fields has been built.
• Cont.

Arabic Language Factorisation
• Cont.’d
• The standard semantic relations (synonymy, antonymy, etc.) among our set of semantic fields, along with the lexical semantic analyzer based on them, are being perfected over the remainder of the TM project's lifetime.
• In fact, that lexical → semantic knowledge base maps minimally constrained lexical compounds (not final-form words) to semantic fields, which gives the best chance of a maximal hit ratio as well as the least ambiguous lexical semantic factorization of input Arabic text.
• In all the aforementioned types of Arabic text factorization, considerable ambiguity arises in different phases of analysis. Disambiguation is done through statistical methods working on stochastic models obtained by supervised training.
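As a rough illustration of statistical disambiguation with a supervised model, the sketch below picks the most probable tag sequence for an ambiguous sentence using tag-bigram counts from a tiny hand-tagged corpus. Everything in it is an assumption made for illustration: the training sequences, the candidate table, the rough transliterations, the add-one smoothing, and the exhaustive search are not taken from the project, whose actual models are not described in these slides.

```python
import math
from collections import Counter
from itertools import product

# Hypothetical hand-tagged training data: one tag sequence per sentence.
TAGGED = [
    ["VERB", "NOUN", "NOUN"],
    ["VERB", "NOUN", "PREP", "NOUN"],
    ["NOUN", "ADJ"],
    ["VERB", "NOUN", "ADJ"],
    ["NOUN", "PREP", "NOUN"],
]

# Hypothetical analyzer output: every tag an unvocalized token may carry.
CANDIDATES = {
    "ktb": ["VERB", "NOUN"],  # kataba "he wrote" or kutub "books"
    "alwld": ["NOUN"],        # "the boy"
    "aldrs": ["NOUN"],        # "the lesson"
}


def train_bigrams(tagged_sentences):
    """Count tag bigrams, with "<s>" marking the sentence start."""
    bigrams, contexts, tagset = Counter(), Counter(), set()
    for tags in tagged_sentences:
        prev = "<s>"
        for tag in tags:
            bigrams[(prev, tag)] += 1
            contexts[prev] += 1
            tagset.add(tag)
            prev = tag
    return bigrams, contexts, tagset


def log_prob(seq, bigrams, contexts, tagset):
    """Log-probability of a tag sequence under add-one smoothed bigrams."""
    total, prev = 0.0, "<s>"
    for tag in seq:
        p = (bigrams[(prev, tag)] + 1) / (contexts[prev] + len(tagset))
        total += math.log(p)
        prev = tag
    return total


def disambiguate(tokens, candidates, model):
    """Exhaustively score every candidate tag sequence and keep the best."""
    options = [candidates[t] for t in tokens]
    best = max(product(*options), key=lambda seq: log_prob(seq, *model))
    return list(best)


model = train_bigrams(TAGGED)
result = disambiguate(["ktb", "alwld", "aldrs"], CANDIDATES, model)
print(result)  # ['VERB', 'NOUN', 'NOUN']
```

Exhaustive search is used only because the example is tiny; on real text a dynamic-programming search such as Viterbi would be the usual choice.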

Thanks for your attention.
To probe further: Mohsen_Rashwan@RDI-eg.com & m_Atteya@RDI-eg.com
