
Morpho-Syntactic Analysis and Language Modeling using Machine Learning Techniques






Presentation Transcript


  1. Morpho-Syntactic Analysis and Language Modeling using Machine Learning Techniques Guy De Pauw Walter Daelemans guy.depauw@ua.ac.be walter.daelemans@ua.ac.be CNTS – Language Technology Group http://www.cnts.ua.ac.be

  2. Morpho-Syntactic Analysis using Machine Learning Techniques • Why? • As an NLP tool proper (!) • Annotate new datasets (e.g. Mediargus) • Extra information source for language modeling • How? • Machine Learning techniques (memory-based learning + maximum entropy) • Shallow linguistic analysis

  3. Shallow linguistic analysis • For many NLP applications, a full analysis is often not necessary • e.g. morphological analysis of uitzonderingsgevallen: FULL: ((((uitzonder)[V],(ing)[N|V.])[N],(s)[N|N.N],(geval)[N])[N]),(en)[N-m] vs. SHALLOW: uitzonder@V + ing@N|V. + s@N|N.N + geval@N + en@N-m • Shallow analysis: fast + robust
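The shallow representation above is just a flat list of tagged morphemes rather than a bracketed tree, which is what makes it fast and robust to produce. A minimal sketch of reading that notation back into structure; the `parse_shallow` helper is hypothetical, not part of the authors' system:

```python
# Parse the flat shallow notation 'morpheme@TAG + morpheme@TAG + ...'
# into a list of (morpheme, tag) pairs. Illustrative helper only.
def parse_shallow(analysis: str):
    pairs = []
    for chunk in analysis.split(" + "):
        # partition on the first '@': tags themselves may contain '|' and '.'
        morpheme, _, tag = chunk.partition("@")
        pairs.append((morpheme, tag))
    return pairs

print(parse_shallow("uitzonder@V + ing@N|V. + s@N|N.N + geval@N + en@N-m"))
```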

  4. Shallow linguistic analysis

  5. Shallow linguistic analysis [ADVP nu] [SMAIN tref+t] [NP de niets+vermoed+end+e pool+reiziger] [NP vuilnis+belt+en] [PP tussen] [NP de ijs+berg+en] [SVP aan] .

  6. Morphological Analysis parelvissers Segmentation parel+viss+er+s Tagging parel@N+viss@V+er@N|V.+s@INFLm Alternation parel@N+vis@V+er@N|V.+s@INFLm

  7. Morphological Analysis parelvissers Segmentation parel+viss+er+s Tagging parel@N+viss@V+er@N|V.+s@INFLm Alternation parel@N+vis@V+er@N|V.+s@INFLm

  8. Morphological Segmentation

  9. Morphological Segmentation • Trained and evaluated on (adapted) morphological database of CELEX • Experimental Results (full word score): • FS (minimal boundaries + unigram): 86.7% • Morpheme Boundary Prediction: 89.2% • FS + Morpheme Prediction: 94.8%
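Morpheme boundary prediction, as scored above, can be cast as a per-position classification task: each character position becomes one instance, described by a window of surrounding letters and labeled for whether a boundary follows. The window size and feature layout below are illustrative assumptions, not the exact CELEX-trained setup:

```python
# Turn a word plus its known boundary positions into classifier instances.
# Each instance is ((left context, focus letter, right context), label),
# where label 1 means a morpheme boundary directly follows the focus letter.
def boundary_instances(word: str, boundaries: set, window: int = 3):
    pad = "_" * window
    padded = pad + word + pad
    instances = []
    for i in range(len(word)):
        left = padded[i:i + window]
        focus = padded[i + window]
        right = padded[i + window + 1:i + 2 * window + 1]
        label = 1 if i + 1 in boundaries else 0
        instances.append(((left, focus, right), label))
    return instances

# parel+viss+er+s: boundaries after character positions 5, 9 and 11
print(boundary_instances("parelvissers", {5, 9, 11})[4])
```

At training time these instances would feed a memory-based or maximum-entropy learner; at prediction time the labels are unknown and the classifier supplies them.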

  10. Morphological Analysis parelvissers Segmentation parel+viss+er+s Tagging parel@N+viss@V+er@N|V.+s@INFLm Alternation parel@N+vis@V+er@N|V.+s@INFLm 96%

  11. Morphological Analysis parelvissers Segmentation parel+viss+er+s Tagging parel@N+viss@V+er@N|V.+s@INFLm Alternation parel@N+vis@V+er@N|V.+s@INFLm

  12. Alternation • Map parel+viss+er+s to parel+vis+er+s • aan+lop+en to aan+loop+en • but also aan+ge+bracht to aan+ge+breng

  13. Alternation • Grapheme based alternation

  14. Alternation • Grapheme based alternation

  15. Alternation • Grapheme based alternation

  16. Alternation • Grapheme based alternation • 99.4% of morphemes correctly alternated • Including complex alternations like bracht->breng
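A rough sketch of what grapheme-based alternation does: map each segmented surface morpheme back to its canonical form. The two rule types shown here, final-consonant degemination (viss -> vis) and a small exception list for irregular stems (bracht -> breng), are deliberate simplifications of the learned classifier reported above, which handles many more alternation types:

```python
# Hypothetical exception list for irregular stem alternations.
IRREGULAR = {"bracht": "breng"}

def alternate(morpheme: str) -> str:
    """Map a surface morpheme to its canonical form (toy rule set)."""
    if morpheme in IRREGULAR:
        return IRREGULAR[morpheme]
    # degemination: drop one of a doubled final consonant (viss -> vis)
    if (len(morpheme) > 2 and morpheme[-1] == morpheme[-2]
            and morpheme[-1] not in "aeiou"):
        return morpheme[:-1]
    return morpheme

print([alternate(m) for m in "parel viss er s".split()])
```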

  17. Morphological Analysis • Use morphological analysis cascade to analyze all words in CGN and Mediargus (not in CELEX) e.g. F1: flowerpower-afstammelingen F2: flowerpower-@N+af@P+stamm@V+eling@N|V.+en@INFLm F3: flowerpower@N+af@P+stam@V+eling@N|V.+en@INFLm F4: m • Huge morphological database of ±2.7M words

  18. Shallow linguistic analysis

  19. Part-of-Speech Tagging • Trained and evaluated on CGN + STIL • Some Experimental Results • Contextual + orthographic features: 96.6% (unknown words: 82.5%) • + morphological information: 97.2% (unknown words: 86.9%) • Tags of morphemes • Lemma • Inflection tag

  20. Shallow linguistic analysis • 89.5% tagging accuracy • 87.4 F-score

  21. System for morpho-syntactic analysis • Morphological analysis: ±5 w/s • Tagging + Phrase Chunking: ±450 w/s • Used to annotate entire Mediargus corpus • Morphological analysis (±2B morphemes) • Part-of-speech tags • Phrase chunks ::demo:: http://www.cnts.ua.ac.be/flavor

  22. Language Modeling • Problem 1: input is not a sequence of words, but a sequence of morphemes • Problem 2: scoring hypotheses using shallow linguistic annotation

  23. Language Modeling • Problem 1: input is a sequence of morphemes Nu tref t de niets vermoed end e pool reiziger vuil nis belt en tussen de ijs berg en aan • Disambiguate between word and morpheme boundaries • Use the morphologically analyzed Mediargus corpus as training material • Approach: morpheme sequence tagging
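The morpheme-sequence-tagging view can be sketched as follows: each morpheme in the recognizer output receives a label saying whether it starts a new word, after which joining is trivial. The labels below are hand-assigned for illustration; in the actual system they come from a classifier trained on the analyzed Mediargus corpus:

```python
# Join a morpheme sequence into words, given per-morpheme
# word-boundary labels (True = this morpheme starts a new word).
def join_morphemes(morphemes, starts_word):
    words, current = [], []
    for m, b in zip(morphemes, starts_word):
        if b and current:
            words.append("".join(current))
            current = []
        current.append(m)
    if current:
        words.append("".join(current))
    return words

morphemes = ["nu", "tref", "t", "de", "niets", "vermoed", "end", "e"]
starts    = [True, True, False, True, True, False, False, False]
print(join_morphemes(morphemes, starts))
```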

  24. Language Modeling

  25. Language Modeling • Problem 1: input is a sequence of morphemes Nu tref t de niets vermoed end e pool reiziger vuil nis belt en tussen de ijs berg en aan [w nu ] [w tref t] [w de ] [w niets vermoed end e ] … • Word boundaries: 97.2% • Morpheme boundaries: 93.1% • F-score of 92.3%

  26. Language Modeling • (Big) remaining problem: • aanlopen -> aan+lop+en or aan+loop+en • gebracht -> ge+bracht or ge+breng • But not: aan+loop+en and ge+bracht • Information not available in CELEX • But: • Orthography: closest guess • True pronounced morphemes: quite workable • Decent accuracy on harder task: ?? • Regular expression + grapheme-to-phoneme conversion • Not yet integrated in recognizer

  27. Language Modeling • Turn morphemes into word forms (+ reverse alternation) • Re-analyze word form • Tag + shallow parse sequence of words ::demo:: www.cnts.ua.ac.be/flavor

  28. Language Modeling • Problem 2: scoring hypotheses • Option 1: n-gram models trained on annotated Mediargus corpus • Morpheme n-grams: de niets vermoed end <e> • Tagged-morpheme n-grams: Ewb B V A|BV. <INFLPWB> • Word n-grams • Part-of-speech tag n-grams • Shallow parsing tag n-grams • Combination: de@LID@NP <kan@WW@NP> or <kan@N1@NP> • Interpolate LM scores
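Interpolating the scores from the different n-gram models can be sketched as a weighted mixture of their probabilities. The weights below are illustrative; in practice they would be tuned on held-out data:

```python
import math

# Combine per-model log-probabilities for one hypothesis by linear
# interpolation: p = sum_i w_i * p_i, returned in log space.
def interpolated_logprob(logprobs, weights):
    assert abs(sum(weights) - 1.0) < 1e-9  # weights must form a distribution
    return math.log(sum(w * math.exp(lp) for w, lp in zip(weights, logprobs)))

# Toy example: a morpheme model and a tagged-morpheme model, mixed 50/50.
score = interpolated_logprob([math.log(0.01), math.log(0.04)], [0.5, 0.5])
print(round(score, 4))
```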

  29. Language Modeling • Problem 2: scoring hypotheses • Option 2: classifier “certainty” • Use maximum entropy classifiers, which can output proper probabilities • Quite informative for the WSJ LM task
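The reason maximum-entropy classifiers are attractive here is that they produce a proper probability distribution over classes (via a softmax over feature scores), so the probability of a hypothesized word can be read off directly. A toy sketch with made-up, unlearned scores:

```python
import math

# Turn raw per-class scores into a normalized probability
# distribution, as a maxent classifier does internally (softmax).
def maxent_probs(scores):
    z = sum(math.exp(s) for s in scores.values())
    return {c: math.exp(s) / z for c, s in scores.items()}

# Toy ambiguity from the slides: 'kan' as verb (WW) vs noun (N1).
probs = maxent_probs({"kan@WW": 1.2, "kan@N1": 0.3})
print(round(probs["kan@WW"], 3))
```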

  30. Language Modeling • Problem 2: scoring hypotheses • Option 3: maxent classifier as LM • Information source: surrounding context (words, morphemes, linguistic annotation) • To classify: word (or morpheme) • Very slow training time

  31. Language Modeling: circumstantial evidence • Wall Street Journal: n-gram rescoring (+ maxent classifier probabilities + POS 3-grams) • VP set: 8.11% -> 7.57% • NVP set: 8.08% -> 7.74% • Mediargus: perplexity • Word 3-gram: 148.42 • Morpheme 3-gram: 56.36 • Tagged-morpheme 3-gram: 53.17
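The perplexity figures above compare the word, morpheme and tagged-morpheme 3-gram models. Perplexity is the inverse geometric mean of the per-token probabilities the model assigns to a test corpus; lower is better. A minimal sketch with toy probabilities (note that morpheme- and word-level perplexities are over different token inventories, so they are only indirectly comparable):

```python
import math

# Perplexity of a model over a test sequence, given the probability
# the model assigned to each token: exp(-mean(log p)).
def perplexity(token_probs):
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

print(round(perplexity([0.1, 0.1, 0.1]), 2))
```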

  32. Limitations • Morpheme representation problematic for integration in recognizer • Efficiency as LM not yet properly evaluated for Dutch

  33. Available Tools & Data Tools: • All-in-one morpho-syntactic analyzer for Dutch • Morphological analyzer • Part-of-speech tagger • Phrase chunker • Word vs morpheme boundary detector for Dutch • Promising outlook for Dutch n-gram LM using extra annotation layers Data: • Adjusted version of CELEX (incl. segmented orthographic forms) • 2.7M-word database of morphologically analyzed words • Morphologically analyzed, tagged & shallow-parsed Mediargus corpus
