
KING FAHD UNIVERSITY OF PETROLEUM AND MINERALS




Presentation Transcript


  1. KING FAHD UNIVERSITY OF PETROLEUM AND MINERALS An Ambiguity-Controlled Morphological Analyzer for Modern Standard Arabic By: Mohammed A. Attia Abbas Al-Julaih 214935 Natural Language Processing ICS 482-062

  2. OUTLINE • INTRODUCTION. • Sources of Legal Morphological Ambiguity in Arabic. • Development strategies of Arabic Morphology. • Existing Arabic Morphological Systems. • SYSTEM DESCRIPTION. • Finite State Technology. • Techniques followed in Limiting Ambiguities. • Disadvantages. • Evaluation. • Conclusion.

  3. Glossary • Diacritics: Marks for short vowels; they are usually omitted in written Arabic. • MWE: Multiword Expression. • Relaxation Rules: Combining words with clitics. • Lexc Language: The lexicon-description language of the Xerox finite-state tools.

  4. INTRODUCTION • Morphological Ambiguity in Arabic. • Two Morphological Analyzers: Xerox & Buckwalter. • Problems: classical entries.

  5. Sources of legal Morphological Ambiguity in Arabic 1. Orthographic alternation operations: 2. Some lemmas differ only in a doubled sound: 3. Change in pronunciation without explicit orthographic effect, due to missing diacritics:

  6. Sources of legal Morphological Ambiguity in Arabic 4. Some prefixes and suffixes can be homographic with each other: 5. Coincidental identity: 6. Clitics:

  7. Sources of legal Morphological Ambiguity in Arabic 7. Ordinary homographs: inflected words spelled alike, with or without the same pronunciation, but with different meanings:

  8. Development strategies for Arabic Morphology Two main strategies, depending on the level of analysis: • Stem-based morphologies: analyze Arabic at the stem level using regular concatenation. • Root-based morphologies: analyze Arabic words as composed of roots, patterns and concatenations. • Which is better?
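The contrast between the two strategies can be sketched in a few lines of Python. This is only an illustration, not the method of any system named in the slides: the Latin transliteration, the `C` slot notation and the function names `interdigitate` and `attach` are all assumptions made for the example.

```python
def interdigitate(root, pattern):
    """Root-based view: fill each 'C' slot of the pattern with the
    next root consonant, keeping the pattern's vowels in place."""
    consonants = iter(root)
    return "".join(next(consonants) if ch == "C" else ch for ch in pattern)


def attach(stem, prefix="", suffix=""):
    """Stem-based view: the stem is an atomic lexicon entry and
    affixes are joined to it by plain concatenation."""
    return prefix + stem + suffix


# Root-based: one root combined with several patterns yields several stems.
root = "ktb"
patterns = ("CaCaCa", "CaaCiC", "maCCuuC")
derived = {p: interdigitate(root, p) for p in patterns}

# Stem-based: the stem "kaatib" is listed directly and only concatenated.
definite = attach("kaatib", prefix="al")

print(derived)
print(definite)
```

A root-based lexicon is compact but can produce stems that are never used, while a stem-based lexicon lists only attested stems at the cost of more entries.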

  9. Existing Arabic Morphological Systems. Morphological Analyzers for Arabic: • Xerox Arabic Morphological analysis and Generation. • Buckwalter Arabic Morphological Analyzer. • Diinar. • Sakhr. • Morfix.

  10. Existing Arabic Morphological Systems. 1. Buckwalter Arabic Morphological Analyzer: • Advantages: a. Reconstruction of vowel marks & English glossary. b. Less ambiguous than Xerox Analyzer. • Disadvantages: a. All word forms are entered manually. b. System is not suited for generation. c. Underspecification in imperative forms. d. Underspecification in the passive morphology. e. No handling of MultiWord Expressions (MWE).

  11. Existing Arabic Morphological Systems. 2. Xerox Arabic Morphological analysis and Generation: • Adopts the root & pattern approach. • Includes 4930 roots & 400 patterns, generating 90,000 stems. • Advantages: a. Reconstruction of vowel marks & English glossary. b. It is rule-based with large coverage.

  12. Existing Arabic Morphological Systems. • Disadvantages: a. Lack of specifications for MWEs & improper spelling relaxation rules. b. Overgeneration on word derivation. c. Underspecification in POS classification. d. Increased rate of ambiguity.

  13. SYSTEM DESCRIPTION. • Rule-based. • It is built using finite state technology. • Suitable for both analysis and generation. • Contains 9741 lemmas & 2826 MWEs. • Efficiently handles compound names:

  14. SYSTEM DESCRIPTION. • Finite State Technology: • Used successfully in developing morphologies for many languages. • Lexical entries, with all possible affixes & clitics, are encoded in the lexc language. • We obtain a transducer that encodes a binary relation between two sets of strings: the lower language (surface forms) and the upper language (lexical forms):
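The idea of a transducer as a relation between an upper and a lower language can be sketched in Python as follows. This is a toy under stated assumptions: it enumerates a small finite relation instead of compiling a real finite-state network, it does not use lexc, and the sublexicons, tags such as `+noun` and `+dual`, and the transliterated forms are all invented for illustration.

```python
from itertools import product

# Each entry pairs an upper (lexical) string with a lower (surface) string.
# The three lists play the role of chained sublexicons: prefix, stem, suffix.
PREFIXES = [("", ""), ("al+", "al")]
STEMS = [("kitaab+noun", "kitaab"), ("kaatib+noun", "kaatib")]
SUFFIXES = [("", ""), ("+dual", "aan")]


def build_relation():
    """Concatenate one entry from each sublexicon on both sides at once,
    giving a set of (upper, lower) string pairs."""
    relation = set()
    for (pu, pl), (su, sl), (xu, xl) in product(PREFIXES, STEMS, SUFFIXES):
        relation.add((pu + su + xu, pl + sl + xl))
    return relation


RELATION = build_relation()


def analyze(surface):
    """Lower to upper: return every lexical form paired with the surface form."""
    return sorted(upper for upper, lower in RELATION if lower == surface)


def generate(lexical):
    """Upper to lower: return every surface form paired with the lexical form."""
    return sorted(lower for upper, lower in RELATION if upper == lexical)


print(analyze("alkitaabaan"))
print(generate("kaatib+noun"))
```

Because the same set of pairs is read in either direction, one resource serves both analysis and generation, which is the property the slide attributes to the finite-state approach.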

  15. SYSTEM DESCRIPTION. • Techniques Followed in Limiting Ambiguities: • Using the stem as the base form. • Excluding classical words. • Rules for combining words with clitics. • Specifying which verbs can have passive forms. • Specifying which verbs can have imperative forms.
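The last two techniques above amount to marking each verb in the lexicon with what it licenses and discarding analyses that violate those marks. A minimal Python sketch follows; the `VerbEntry` fields, the `filter_analyses` function and the flag values given to each verb are hypothetical and chosen only to illustrate the filtering, not taken from the system described.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VerbEntry:
    lemma: str
    allows_passive: bool      # hypothetical lexicon flag
    allows_imperative: bool   # hypothetical lexicon flag


# Illustrative entries; the flag values are assumptions for this example.
LEXICON = {
    "kataba": VerbEntry("kataba", allows_passive=True, allows_imperative=True),
    "naama": VerbEntry("naama", allows_passive=False, allows_imperative=True),
    "verb_x": VerbEntry("verb_x", allows_passive=True, allows_imperative=False),
}


def filter_analyses(candidates, lexicon):
    """Keep only (lemma, feature) candidates that the lexicon entry licenses."""
    kept = []
    for lemma, feature in candidates:
        entry = lexicon[lemma]
        if feature == "passive" and not entry.allows_passive:
            continue
        if feature == "imperative" and not entry.allows_imperative:
            continue
        kept.append((lemma, feature))
    return kept


# Candidates as an overgenerating analyzer might propose them.
candidates = [
    ("kataba", "active"), ("kataba", "passive"),
    ("naama", "active"), ("naama", "passive"),
    ("verb_x", "active"), ("verb_x", "imperative"),
]
print(filter_analyses(candidates, LEXICON))
```

In a finite-state system the same restriction would normally be built into the lexicon itself so that the unlicensed forms are never generated, instead of being filtered afterwards as here.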

  16. SYSTEM DESCRIPTION. • System Disadvantages: • Limited coverage. • Does not handle diacritized text. • No reconstruction of diacritics. • No English glossary.

  17. Evaluation • Ambiguity rate.
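The slide does not define how the ambiguity rate is computed. One common reading, assumed here, is the average number of analyses per analyzed word. The Python sketch below uses a hypothetical `analyze` callable and toy data; none of the numbers are results from the evaluation.

```python
def ambiguity_rate(tokens, analyze):
    """Average number of analyses per token that received at least one
    analysis. Assumes `analyze(token)` returns a list of analyses."""
    counts = [len(analyze(token)) for token in tokens]
    analyzed = [c for c in counts if c > 0]
    if not analyzed:
        return 0.0
    return sum(analyzed) / len(analyzed)


def coverage(tokens, analyze):
    """Share of tokens that received at least one analysis."""
    if not tokens:
        return 0.0
    return sum(1 for token in tokens if analyze(token)) / len(tokens)


# Toy analyzer backed by a dictionary; the contents are made up.
TOY_ANALYSES = {"a": ["reading1", "reading2"], "b": ["reading3"]}


def toy_analyze(token):
    return TOY_ANALYSES.get(token, [])


tokens = ["a", "b", "c"]
print(ambiguity_rate(tokens, toy_analyze))
print(coverage(tokens, toy_analyze))
```

Reporting coverage next to the ambiguity rate matters because an analyzer can lower its rate simply by failing to analyze more words.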

  18. Conclusion • Arabic & Ambiguity. • Classical entries. • Unused stems. • Word-clitic combination rules.

  19. Any Questions?
