
EXTENDING A PERSIAN MORPHOLOGICAL ANALYZER TO BLOGS

Karine Megerdoomian, University of Maryland, College Park, karinem@umiacs.umd.edu. University of Tehran, Second Research Workshop on the Persian Language and Computers.



  1. EXTENDING A PERSIAN MORPHOLOGICAL ANALYZER TO BLOGS • Karine Megerdoomian, University of Maryland, College Park, karinem@umiacs.umd.edu • University of Tehran, Second Research Workshop on the Persian Language and Computers (دانشگاه تهران، دومین کارگاه پژوهشی زبان فارسی و رایانه)

  2. Talk Outline • Persian Weblogs • Persian is the 4th largest blog language in the world (~75,000 sites) • Description of a finite-state morphological analyzer for Persian • System description • Language issues and implementation • Computational issues in weblogs

  3. Language of Blogs • Contain both formal and informal morphology • Morphology • Informal text is very different from formal text: formal مرا گرفته است vs. informal گرفتهتم • Features that do not exist in formal text: فروشندهه ؛ رفتش • Shortened verbal stems and inflection: formal می گویند vs. informal میگن

  4. Language of Blogs • Morphology • Colloquial pronunciation: غلطای املایی ؛ این سایتو ؛ دوستامونم ؛ دردناکه ؛ مثل منن ؛ ازشون ؛ خودتون ؛ نگاههایشان ؛ همسایهاشون • Spelling errors and non-standard punctuation and spacing • Emoticons and hyperlinks

  5. Language of Blogs • Lexicon • Wordforms follow pronunciation اوضاش ؛ برام ؛ نگامی کنم ؛ خونه ؛ تمبل ؛ همدیگه ؛ بش گفتم • Colloquial forms تو دانشگاه ؛ واسه استادام • New words لینکدونی ؛ دوستان کامنتگذار

  6. Language of Blogs • Lexicon • Loan words چتروم ؛ آنلاین ؛ دانلود کنین • Interjections آاااخ! ؛ والا ؛ وای ؛ اوووه! • More idiomatic expressions دمشگرم آقا

  7. Language of Blogs • Huge amount of variation • Need for flexible rules • Phonological rules to represent colloquial speech • Need to disambiguate (statistical component?) • Formal blog text is also different from traditional formal text

  8. Language of Blogs • Paired spelling variants found in BBC Persian and the blog خوابگرد (Khabgard): موافق‌اند / موافقند ؛ بیننده‌گان / بینندگان ؛ کتاب‌اش / کتابش ؛ کم‌تر / کمتر ؛ کافی‌ست / کافیست ؛ حتا / حتی

  9. Finite-State Transducers (FST) • Two-level network or transducer • Input = lower side of arc • Output = upper side of arc • Example: upper side b i r d +Noun +Pl, lower side b i r d s
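The bird/birds example can be traced with a toy transducer. This is only a minimal Python sketch, not XFST: the arc table, the state numbers, and the `analyze` function are illustrative assumptions, and the empty string stands in for an epsilon label.

```python
# Toy transducer for the slide's bird/birds example (illustrative, not XFST).
# Each arc is (lower_label, upper_label, next_state); "" stands for epsilon.
ARCS = {
    0: [("b", "b", 1)],
    1: [("i", "i", 2)],
    2: [("r", "r", 3)],
    3: [("d", "d", 4)],
    4: [("", "+Noun", 5)],   # tag appears only on the upper side
    5: [("s", "+Pl", 6)],    # surface "s" pairs with the tag "+Pl"
}
FINAL_STATES = {6}


def analyze(word):
    """Read `word` on the lower side and collect every upper-side string."""
    results = []

    def walk(state, pos, output):
        # Accept when the whole word is consumed in a final state.
        if state in FINAL_STATES and pos == len(word):
            results.append("".join(output))
        for lower, upper, nxt in ARCS.get(state, []):
            if lower == "":
                walk(nxt, pos, output + [upper])
            elif word.startswith(lower, pos):
                walk(nxt, pos + len(lower), output + [upper])

    walk(0, 0, [])
    return results


print(analyze("birds"))
```

Because this toy network only contains the plural path shown on the slide, `analyze("bird")` returns an empty list.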

  10. MA: System Description • Developed on Xerox Finite State Technology (XFST) [Karttunen & Beesley 1992] • Components: • Lexicon and morphology rules (lexc) • Phonological rules (regular expressions) • Compiled into an FST (finite-state transducer) • An FST for each part of speech is created separately, then all are composed into the final FST for morphological analysis

  11. MA: System Description • [Diagram: the phonology rules and the noun, verb, and adverb FSTs are combined by composition into the final FST for morphology, which maps an input string to an output string]
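What composition means can be shown on plain sets of string pairs. The sketch below is an assumption-laden illustration in Python: real finite-state toolkits compose networks rather than enumerated pairs, and the intermediate form `bird^s` with its `^` boundary symbol is hypothetical.

```python
def compose(upper_relation, lower_relation):
    """Relate a to c whenever upper_relation has (a, b) and lower_relation has (b, c)."""
    return {
        (a, c)
        for a, b in upper_relation
        for b2, c in lower_relation
        if b == b2
    }


# Hypothetical lexicon level: analysis string -> intermediate form.
LEXICON = {("bird+Noun+Pl", "bird^s"), ("bird+Noun+Sg", "bird")}
# Hypothetical phonology level: intermediate form -> surface form.
PHONOLOGY = {("bird^s", "birds"), ("bird", "bird")}

# The composed relation goes straight from analysis to surface form.
ANALYZER = compose(LEXICON, PHONOLOGY)
print(sorted(ANALYZER))
```

The intermediate strings disappear from the composed relation, which is why a single final FST can be used at run time.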

  12. MA: System Description • Coverage: formal Persian language • Full verbal conjugation • Nonverbal inflection مسافرین ؛ فقرا • Productive derivational morphology سرسامآور • ~20 phonological rules • Proper nouns of people, places, organizations

  13. Inflectional Morphology
      LEXICON Root
      ktab Noun ;
      LEXICON Noun
      +Pl:ha # ;     ! کتابها
      +Pl:_ha # ;    ! کتاب‌ها
      +Sg:0 # ;      ! کتاب
      +Pl:a # ;      ! کتابا
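How these lexc entries chain through continuation classes can be imitated in a few lines of Python. This is a sketch under stated assumptions, not the lexc compiler: the `LEXICONS` table and the `expand` function are hypothetical names, `#` ends a word as in lexc, and the lexc `0` (epsilon) is written as an empty string.

```python
# Each entry is (upper_side, lower_side, continuation_class).
LEXICONS = {
    "Root": [("ktab", "ktab", "Noun")],
    "Noun": [
        ("+Pl", "ha", "#"),
        ("+Pl", "_ha", "#"),
        ("+Sg", "", "#"),    # lexc "0" (epsilon) on the lower side
        ("+Pl", "a", "#"),   # colloquial plural
    ],
}


def expand(lexicon="Root", upper="", lower=""):
    """Yield every (analysis, surface) pair reachable from `lexicon`."""
    if lexicon == "#":
        yield (upper, lower)
        return
    for up, low, continuation in LEXICONS[lexicon]:
        yield from expand(continuation, upper + up, lower + low)


for analysis, surface in expand():
    print(analysis, "->", surface)
```

Adding the colloquial plural is then a one-line change to the `Noun` lexicon, which matches the way the slide lists the informal form next to the formal ones.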

  14. Complex Tokens • A single token can contain two different POS categories: بعقیدهشما ؛ اینکار ؛ بهترست ؛ دردفتر ؛ وگفت • Analyses:
      بعقیده : bh+Prep<eqydh+Noun+Sg
      دردفتر : dr+Prep<dftr+Noun+Sg
      کتابهایمان : ktab+Noun+Pl>av+Pron+Pers+Poss+1P+Pl
      برادرشه : bradr+Noun+Sg>av+Pron+Pers+Poss+1P+Pl>bvdn+Verb+Ind+Pres+3P+Sg
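An analysis string of this shape can be split back into its parts with a short Python helper. The sketch assumes, from the examples alone, that `<` and `>` separate an attached word from its host and that `+` separates a lemma from its tags; `split_analysis` is a hypothetical name.

```python
import re


def split_analysis(analysis):
    """Split an analysis into (lemma, [tags]) pairs, one per attached word."""
    segments = re.split(r"[<>]", analysis)
    parsed = []
    for segment in segments:
        lemma, *tags = segment.strip().split("+")
        parsed.append((lemma, tags))
    return parsed


print(split_analysis("bh+Prep<eqydh+Noun+Sg"))
```

Each returned pair carries its own part-of-speech tag, which is the property the slide highlights for complex tokens.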

  15. Verbal Morphology • Two different stems

  16. Verbal Morphology
      LEXICON PastStem
      tvanst Infl1 ;
      rft Infl1 ;
      xndyd Infl1 ;
      LEXICON PresentStem
      tvanst:tvan Infl2 ;
      rft:rv Infl2 ;
      xndyd:xnd Infl2 ;
      LEXICON PstStemBlog
      tvnst InflBlog1 ;
      LEXICON PrStemBlog
      tvanst:tvn Infl2 ;
      rft:r Infl2 ;
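The effect of the extra blog lexicons can be pictured as a lookup that offers more surface stems once informal forms are allowed. The Python sketch below is an illustration only: the `STEMS` table copies the stems from the slide, while the key names, `candidate_stems`, and the `allow_blog` switch are assumptions.

```python
# Surface stems per lexical entry, copied from the slide's lexicons.
# Keys ending in "_blog" hold the shortened colloquial stems.
STEMS = {
    "tvanst": {
        "past": {"tvanst"},
        "past_blog": {"tvnst"},
        "present": {"tvan"},
        "present_blog": {"tvn"},
    },
    "rft": {"past": {"rft"}, "present": {"rv"}, "present_blog": {"r"}},
    "xndyd": {"past": {"xndyd"}, "present": {"xnd"}},
}


def candidate_stems(entry, tense, allow_blog=False):
    """Return the surface stems accepted for `entry` in the given tense."""
    forms = set(STEMS[entry].get(tense, set()))
    if allow_blog:
        # Blogs mix registers, so the formal stems stay valid as well.
        forms |= STEMS[entry].get(tense + "_blog", set())
    return forms


print(candidate_stems("rft", "present", allow_blog=True))
```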

  17. Long Distance Dependencies • Some tenses of the verb can only be determined by taking into account the co-occurrence of the prefix with the person inflection or the auxiliary • This is a problem for linear approaches

  18. Long Distance Dependencies • Leads to very complex paths and continuation classes in lexc • Using filters greatly increases the size of the FST • Use flag diacritics for unification (@U.Feature.Value@) - Keeps the FST small - Can apply constraints between non-adjacent morphemes
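The unification check behind `@U.Feature.Value@` can be sketched in Python. This imitates only the U-type flag: a flag succeeds if its feature is still unset or already holds the same value, and flags never appear in the output. The feature name `Tense`, its values, and the morpheme strings in the usage line are hypothetical.

```python
import re

FLAG = re.compile(r"@U\.([^.@]+)\.([^.@]+)@")


def accept_path(symbols):
    """Return the path's string if all U-flags unify, otherwise None."""
    features = {}
    output = []
    for symbol in symbols:
        match = FLAG.fullmatch(symbol)
        if match is None:
            output.append(symbol)      # ordinary symbol, keep it
            continue
        feature, value = match.groups()
        # setdefault stores the value on first sight, else returns the old one.
        if features.setdefault(feature, value) != value:
            return None                # clash between distant morphemes
    return "".join(output)


# Prefix and ending agree on the (hypothetical) Tense value, so the path survives.
print(accept_path(["@U.Tense.Pres@", "my", "rv", "@U.Tense.Pres@", "m"]))
```

A prefix and an ending that set different values for the same feature make the path fail, so the constraint holds across the stem without multiplying continuation classes.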

  19. Phonology Rules • The form of affixes may change based on the ending character of the stem • Formal: کتابش ؛ چشمهایش/صدایش ؛ همسایهاش • Informal: کتابش ؛ چشماش/صداش ؛ همسایش • Callout on the slide: optional in informal blog text • Rules and sample inputs:
      define clitic1 [^NB -> 0 || Cons _ ] ;
      define clitic2 [^NB -> y || Vowel _ ] ;
      define clitic3 [^NB -> "\u200c" a || e _ ] ;
      ktab^NBš ؛ Sda^NBš ؛ hmsaye^NBš
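The three clitic rules on this slide can be imitated with regular-expression substitutions on the transliterated forms. This is a rough Python sketch, not the XFST rules themselves: the vowel set is an assumption, the rule after `e` is applied first so that it takes priority over the general vowel rule, and only the formal behaviour is modelled.

```python
import re

ZWNJ = "\u200c"          # zero-width non-joiner
VOWELS = "aiu"           # assumed vowel letters of the transliteration


def apply_clitic_rules(form):
    """Rewrite the ^NB boundary according to the preceding character."""
    # clitic3: after "e", insert a ZWNJ followed by "a".
    form = re.sub(r"(?<=e)\^NB", ZWNJ + "a", form)
    # clitic2: after another vowel, insert the glide "y".
    form = re.sub(rf"(?<=[{VOWELS}])\^NB", "y", form)
    # clitic1: after a consonant, the boundary is simply deleted.
    form = re.sub(rf"(?<=[^{VOWELS}e])\^NB", "", form)
    return form


for sample in ["ktab^NBš", "Sda^NBš", "hmsaye^NBš"]:
    print(sample, "->", apply_clitic_rules(sample))
```

Making the second and third substitutions optional would be one way to admit the informal spellings, though how the analyzer does this is not stated on the slide.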

  20. Evaluation • FST: 178,452 states; 928,982 arcs before optimization • Speed: 20.84 seconds of CPU time for a 10 MB file on a Sun SPARCstation • Coverage = 97.5%; Accuracy = 95% • Unanalyzed tokens: proper nouns + missing lexicon words • No weblog language rules included yet!

  21. Conclusion • Challenges in morphological analysis of Persian formal text: solutions in the XFST system • New issues and variance due to blog language • Need a robust system: - Lexicon updated with colloquial forms - Flexible morphological rules + derivational morphology rules - Transliteration component for loan words - Statistical approach to disambiguate and to deal with unknowns
