
Hindi Parts-of-Speech Tagging & Chunking


Presentation Transcript


  1. Hindi Parts-of-Speech Tagging & Chunking Baskaran S MSRI

  2. What's in? • Why POS tagging & chunking? • Approach • Challenges • Unseen tag sequences • Unknown words • Results • Future work • Conclusion

  3. Intro & Motivation

  4. POS • Parts-of-Speech • Dionysius Thrax (ca 100 BC) • 8 types – noun, verb, pronoun, preposition, adverb, conjunction, participle and article • "I get my thing in action. (Verb, that's what's happenin') To work, (Verb!) To play, (Verb!) To live, (Verb!) To love... (Verb!...)" – Schoolhouse Rock

  5. Tagging • Assigning the appropriate POS or lexical class marker to each word in a given text • Symbols, punctuation markers, etc. are also assigned specific tag(s)

  6. Why POS tagging? • Gives significant information about a word and its neighbours • Adjective near noun • Adverb near verb • Gives a clue to how a word is pronounced • OBject as noun • obJECT as verb • Speech synthesis, full parsing of sentences, IR, word sense disambiguation, etc.

  7. Chunking • Identifying simple phrases • Noun phrase, verb phrase, adjectival phrase… • Useful as a first step towards parsing • Named entity recognition

  8. POS tagging & Chunking

  9. Stochastic approaches • Availability of tagged corpora in large quantities • Most are based on HMMs • Weischedel ’93 • DeRose ’88 • Skut and Brants ’98 – extending HMMs to chunking • Zhou and Su ’00 • and lots more…

  10. HMM • Trained on an annotated corpus, which supplies the tag-sequence probabilities and the word-emission probabilities • Assumptions • The probability of a word depends only on its tag • The tag history is approximated by the most recent two tags
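Under those two assumptions the joint probability of a sentence and a tag sequence factorises into the two tables named above. The sketch below is illustrative only: the function and variable names, the <s> boundary markers and the plain maximum-likelihood estimates are assumptions, not the exact implementation behind these slides.

```python
from collections import defaultdict

def train(tagged_sentences):
    """Collect trigram tag-transition and word-emission counts from an
    annotated corpus given as lists of (word, tag) pairs."""
    trans = defaultdict(int)      # (t_{i-2}, t_{i-1}, t_i) -> count
    history = defaultdict(int)    # (t_{i-2}, t_{i-1})      -> count
    emit = defaultdict(int)       # (t_i, w_i)              -> count
    tag_count = defaultdict(int)  # t_i                     -> count
    for sent in tagged_sentences:
        tags = ['<s>', '<s>'] + [t for _, t in sent]
        for i, (word, tag) in enumerate(sent):
            trans[(tags[i], tags[i + 1], tag)] += 1
            history[(tags[i], tags[i + 1])] += 1
            emit[(tag, word)] += 1
            tag_count[tag] += 1
    return trans, history, emit, tag_count

def sequence_prob(words, tags, trans, history, emit, tag_count):
    """P(W, T) = prod_i P(w_i | t_i) * P(t_i | t_{i-1}, t_{i-2}):
    each word depends only on its tag, each tag only on the two
    preceding tags (no smoothing shown here)."""
    p, h = 1.0, ('<s>', '<s>')
    for word, tag in zip(words, tags):
        p *= trans[(h[0], h[1], tag)] / max(history[h], 1)
        p *= emit[(tag, word)] / max(tag_count[tag], 1)
        h = (h[1], tag)
    return p
```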

  11. Structural tags • A triple – POS tag, structural relation & chunk tag • Originally proposed by Skut & Brants ’98 • Seven relations • Enables embedded and overlapping chunks
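Purely as an illustration of the data structure (the seven-relation inventory of Skut & Brants ’98 is not reproduced here, and the relation name below is made up for the example):

```python
# One token of the chunked example on the next slide, carried through the
# tagger as a (POS tag, structural relation, chunk tag) triple.
# "chunk-begin" stands in for whichever of the seven relations applies.
token = "परीक्षा"
structural_tag = ("NN", "chunk-begin", "NP")
```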

  12. Structural relations • Example: परीक्षा में भी प्रथम श्रेणी प्राप्त की और विद्यालय में कुलपति द्वारा विशेष पुरस्कार भी उन्हीं को प्राप्त हुआ । • (Slide figure: the sentence bracketed into SSF, NP and VG chunks, with structural-relation codes such as Beg, End, 00, 09, 90 and 99 attached to the tokens.)

  13. Decoding • Viterbi is mostly used (also A* or stack decoding) • Aims at finding the best path (tag sequence) given the observation sequence • Possible tags are identified for each position, with associated probabilities • The best path is the one that maximizes the product of the transition and emission probabilities along it
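A minimal Viterbi sketch, written over a simplified bigram version of the model for brevity (the slides' model conditions on two previous tags, which only adds a dimension to the history); trans_prob and emit_prob are assumed to return smoothed probabilities from a training step like the one sketched earlier.

```python
import math

def viterbi(words, tagset, trans_prob, emit_prob):
    """Best tag sequence for `words` under a (simplified, bigram) HMM.
    trans_prob(prev_tag, tag) and emit_prob(tag, word) are assumed to
    return smoothed probabilities."""
    # V[i][t] = (log-probability of the best path ending in tag t at
    #            position i, back-pointer to the previous tag)
    V = [{t: (math.log(trans_prob('<s>', t) * emit_prob(t, words[0]) + 1e-300), None)
          for t in tagset}]
    for word in words[1:]:
        V.append({})
        for t in tagset:
            score, prev = max(
                (V[-2][p][0] + math.log(trans_prob(p, t) * emit_prob(t, word) + 1e-300), p)
                for p in tagset)
            V[-1][t] = (score, prev)
    # Follow the back-pointers from the best final tag.
    tag = max(V[-1], key=lambda t: V[-1][t][0])
    path = [tag]
    for col in reversed(V[1:]):
        tag = col[tag][1]
        path.append(tag)
    return list(reversed(path))
```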

  14.–16. Viterbi example (three animation steps): अब जीवन का एक अन्य रूप उनके सामने आया । • Candidate tags at each position are drawn from JJ, NLOC, NN, PREP, PRP, QFN, RB, VFM and SYM • The slides step through the trellis to the highest-scoring tag sequence

  17. Issues

  18. 1. Unseen tag sequences • Smoothing (Add-One, Good-Turing) and/or backoff (deleted interpolation) • Idea is to redistribute some fraction of the probability mass of seen occurrences to unseen ones • Good-Turing • Re-estimates the probability mass of lower-count N-grams from that of higher counts: c* = (c + 1) N_{c+1} / N_c, where N_c is the number of N-grams occurring c times
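A minimal sketch of that re-estimate (falling back to the raw count when a count-of-counts bucket is empty is one common choice, not necessarily the one used here):

```python
from collections import Counter

def good_turing_counts(ngram_counts):
    """Good-Turing: an N-gram seen c times gets the adjusted count
    c* = (c + 1) * N_{c+1} / N_c, where N_c is the number of distinct
    N-grams seen exactly c times.  Unseen events collectively receive
    probability mass N_1 / total."""
    N = Counter(ngram_counts.values())   # count-of-counts table N_c
    total = sum(ngram_counts.values())
    adjusted = {}
    for ngram, c in ngram_counts.items():
        if N[c + 1] > 0:
            adjusted[ngram] = (c + 1) * N[c + 1] / N[c]
        else:
            adjusted[ngram] = c           # no higher-count bucket to borrow from
    unseen_mass = N[1] / total if total else 0.0
    return adjusted, unseen_mass
```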

  19. 2. Unseen words • Insufficient corpus (even after 10 million words) • Not all of them are proper names • Treat them as rare words that occur once in the corpus – Baayen and Sproat ’96, Dermatas and Kokkinakis ’95 • Known Hindi corpus of 25 K words and unseen corpus of 6 K words • All words vs. hapax vs. unknown
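A minimal sketch of that treatment, assuming the unknown-word emission distribution is taken from the tags of hapax legomena (words seen exactly once) in the training data; the names are illustrative.

```python
from collections import Counter, defaultdict

def hapax_tag_distribution(tagged_corpus):
    """Tag distribution of words occurring exactly once in training;
    used as the emission distribution for out-of-vocabulary words,
    following the rare-word treatment cited on this slide."""
    word_counts = Counter(w for sent in tagged_corpus for w, _ in sent)
    hapax_tags = defaultdict(int)
    for sent in tagged_corpus:
        for word, tag in sent:
            if word_counts[word] == 1:
                hapax_tags[tag] += 1
    total = sum(hapax_tags.values()) or 1
    return {tag: c / total for tag, c in hapax_tags.items()}
```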

  20. Tag distribution analysis

  21. 3. Features • Can we use other features? • Capitalization • Word endings and hyphenation • Weischedel ’93 reports about a 66% reduction in error rate from word endings and hyphenation • Capitalization, though useful for proper nouns, is not very effective

  22. Contd… • String length • Prefix & suffix – fixed character width • Character encoding range • A complete analysis remains to be done • Expected to be very effective for morphologically rich languages • To be experimented with on Tamil
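A sketch of what such features could look like for a single word; the fixed width of 3, the feature names and the use of the Devanagari Unicode range are assumptions for illustration only.

```python
def word_shape_features(word, width=3):
    """Illustrative unknown-word features: fixed-width prefix and suffix,
    string length, and whether the characters fall in the Devanagari
    block (U+0900 to U+097F) of the encoding."""
    return {
        "prefix": word[:width],
        "suffix": word[-width:],
        "length": len(word),
        "devanagari": all('\u0900' <= ch <= '\u097F' for ch in word),
    }

# Example: word_shape_features("विद्यालय") returns its 3-character prefix
# and suffix, its length, and devanagari=True.
```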

  23. 4. Multi-part words • Example: In/ terms/ of/ United/ States/ of/ America/ • More problematic in Hindi: United/NNPC States/NNPC of/NNPC America/NNP, Central/NNC government/NN • NNPC – compound proper noun, NNP – proper noun, NNC – compound noun, NN – noun • How does the system identify the last word in a multi-part word? • 10% of errors in Hindi are due to this (6 K words tested)

  24. Results

  25. Evaluation metrics • Tag precision • Unseen word accuracy • % of unseen words that are correctly tagged • Estimates how well unseen words are handled • % reduction in error • Reduction in the error rate after applying a particular feature
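The three metrics written out as code, a sketch assuming per-token gold and predicted tags are available (the actual evaluation script is not part of the slides):

```python
def tag_precision(gold_tags, predicted_tags):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    return sum(g == p for g, p in zip(gold_tags, predicted_tags)) / len(gold_tags)

def unseen_word_accuracy(tokens, gold_tags, predicted_tags, training_vocab):
    """Tag precision restricted to tokens never seen in training."""
    pairs = [(g, p) for w, g, p in zip(tokens, gold_tags, predicted_tags)
             if w not in training_vocab]
    return sum(g == p for g, p in pairs) / len(pairs) if pairs else 0.0

def error_reduction(precision_before, precision_after):
    """% reduction in error after applying a particular feature."""
    err_before = 1.0 - precision_before
    return 100.0 * (err_before - (1.0 - precision_after)) / err_before
```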

  26. Results – Tagger • No structural tags → better smoothing • Unseen data – significantly more unknowns

  27. Results – Chunk tagger • Training ≈ 22 K, development data ≈ 8 K • 4-fold cross validation • Test data ≈ 5 K

  28. Results – Tagging error analysis • Significant issues with nouns/multi-part words • NNP → NN • NNC → NN • Also • VAUX → VFM and VFM → VAUX • NVB → NN and NN → NVB

  29. HMM performance (English) • > 96% reported accuracies • About 85% for unknown words • Advantage • Simple, and the most suitable choice when annotated data is available

  30. Conclusion

  31. Future work • Handling unseen words • Smoothing • Can we exploit other features? • Especially morphological ones • Multi-part words

  32. Summary • Statistical approaches now include linguistic features for higher accuracies • Improvement required • Tagging • Precision – 79.22% • Unknown words – 41.6% • Chunking • Precision – 60% • Recall – 62%
