Part-of-speech Tagging

cs224n final project, Spring 2008. Tim Lai.

Presentation Transcript


  1. Part-of-speech Tagging
     cs224n final project, Spring 2008
     Tim Lai

  2. POS Tagging – 3 general techniques
     1. Rule-based system
        • Relies on a hand-picked set of rules
        • Performance is not very good
     2. Stochastic methods
        • HMM with the Viterbi algorithm to determine the best tagging
        • Uses emission probabilities, i.e. P(word | tag),
          and transition probabilities, i.e. P(currentTag | prevTag)
        • Maximum Entropy models are also useful
     3. Hybrid of the two
        • Rule-based system to do POS tagging
        • Uses rule templates and learns useful rules during training
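As a rough illustration of the stochastic approach on this slide, the sketch below runs Viterbi decoding over a bigram HMM in log space. It is not the project's code: the function name `viterbi`, the `floor` fallback for unseen events, and all toy probabilities are assumptions made up for the example.

```python
import math

def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-12):
    """Return the most probable tag sequence for `words` under a bigram HMM.

    start_p[t]       : P(t at sentence start)
    trans_p[prev][t] : P(t | prev)   (transition probability)
    emit_p[t][w]     : P(w | t)      (emission probability)
    Missing or zero entries fall back to `floor`, a crude stand-in for smoothing.
    """
    if not words:
        return []

    def lp(table, key):
        return math.log(table.get(key, floor) or floor)

    # best[i][t] = log-probability of the best path ending in tag t at position i
    best = [{t: lp(start_p, t) + lp(emit_p.get(t, {}), words[0]) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev, score = max(
                ((p, best[i - 1][p] + lp(trans_p.get(p, {}), t)) for p in tags),
                key=lambda pair: pair[1],
            )
            best[i][t] = score + lp(emit_p.get(t, {}), words[i])
            back[i][t] = prev

    # Follow back-pointers from the best final tag.
    path = [max(best[-1], key=best[-1].get)]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]

# Toy model; every number here is invented for illustration only.
toy_tags = ["DT", "NN", "VB"]
toy_start = {"DT": 0.8, "NN": 0.1, "VB": 0.1}
toy_trans = {
    "DT": {"DT": 0.05, "NN": 0.9, "VB": 0.05},
    "NN": {"DT": 0.1, "NN": 0.3, "VB": 0.6},
    "VB": {"DT": 0.5, "NN": 0.4, "VB": 0.1},
}
toy_emit = {
    "DT": {"the": 0.9},
    "NN": {"dog": 0.5, "barks": 0.1},
    "VB": {"dog": 0.05, "barks": 0.5},
}

print(viterbi(["the", "dog", "barks"], toy_tags, toy_start, toy_trans, toy_emit))
```

Working in log space avoids numeric underflow on long sentences; a real tagger would replace the `floor` fallback with proper smoothing and an unknown-word model.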

  3. Simple HMM vs Max-Ent
     • HMM using bigrams for transition probabilities
     • Max-Ent using simple features such as the previous tag and the current word
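The "simple features" mentioned for the Max-Ent tagger might look something like the following sketch. The feature names, the `weights` layout, and both function names are hypothetical; the slides do not say how the features were encoded or how the model was trained.

```python
import math

def simple_features(words, i, prev_tag):
    """Features for position i: the current word and the previous tag."""
    return {f"word={words[i].lower()}": 1.0, f"prev_tag={prev_tag}": 1.0}

def tag_distribution(features, weights, tags):
    """P(tag | features) as a softmax over summed feature weights.

    `weights` maps (feature_name, tag) -> float; missing pairs count as 0.
    """
    scores = {
        t: sum(weights.get((name, t), 0.0) * value for name, value in features.items())
        for t in tags
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {t: math.exp(s - top) for t, s in scores.items()}
    total = sum(exps.values())
    return {t: e / total for t, e in exps.items()}

# Hypothetical weight: a determiner makes a following noun more likely.
feats = simple_features(["The", "dog"], 1, "DT")
dist = tag_distribution(feats, {("prev_tag=DT", "NN"): 2.0}, ["DT", "NN", "VB"])
print(max(dist, key=dist.get))
```

In a trained model the weights would be fit by maximizing conditional likelihood on tagged data; here a single weight is set by hand purely to show the mechanics.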

  4. Error Analysis
     • HMM and Max-Ent both perform well when tested on data from the same domain as the training data
     • Only 6.6% of words were ambiguous, making known words easy to tag
     • Accuracy drops when the test data comes from another domain
     • Most errors are caused by unknown words, or by the tagging of words near unknown words
     • In sentences without unknown words, accuracy is ~99%
     • The most common mistake is mis-tagging JJ (adjective) as NN (noun)
     • Both taggers need to be enhanced to deal with unknown words
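One way to reproduce this kind of breakdown is to split token accuracy by whether a sentence contains any word unseen in training. The sketch below assumes a hypothetical input format of `(word, gold_tag, predicted_tag)` triples; the sample data is invented and is not the project's data.

```python
def accuracy_by_unknowns(sentences, vocab):
    """Token accuracy for sentences with and without unknown words.

    sentences: list of sentences, each a list of (word, gold_tag, predicted_tag)
    vocab:     set of words seen in training
    Returns None for a group that contains no tokens.
    """
    counts = {"all_known": [0, 0], "has_unknown": [0, 0]}  # [correct, total]
    for sent in sentences:
        key = "all_known" if all(w in vocab for w, _, _ in sent) else "has_unknown"
        for _, gold, pred in sent:
            counts[key][0] += int(gold == pred)
            counts[key][1] += 1
    return {k: (c / n if n else None) for k, (c, n) in counts.items()}

# Made-up example: "zyx" is not in the training vocabulary.
sample = [
    [("the", "DT", "DT"), ("dog", "NN", "NN")],
    [("the", "DT", "DT"), ("zyx", "JJ", "NN")],
]
print(accuracy_by_unknowns(sample, {"the", "dog"}))
```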

  5. Enhancement ideas
     • For HMM:
       • Transition probabilities can be modeled using trigrams, taking more context into account when a word is unknown
     • For Max-Ent:
       • Word shapes, word features, and more context can help
     • Results:
       • HMM: switching from unigram to bigram helps a lot, but using trigrams doesn't help much
       • Max-Ent: hand-picked features did not help much; adding prefix and suffix features was most helpful
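Word-shape and affix features of the kind listed for Max-Ent can be sketched as below. The exact shape scheme and the maximum affix length of 3 are assumptions; the slides do not specify either.

```python
import re

def word_shape(word):
    """Map a word to a coarse shape: X = uppercase, x = lowercase, d = digit.

    Runs of the same symbol are collapsed, so 'Stanford' and 'Tim' both become 'Xx'.
    """
    shape = re.sub(r"[A-Z]", "X", word)
    shape = re.sub(r"[a-z]", "x", shape)
    shape = re.sub(r"[0-9]", "d", shape)
    return re.sub(r"(.)\1+", r"\1", shape)

def affix_features(word, max_len=3):
    """Shape plus prefix/suffix features up to max_len characters.

    Affixes never cover the whole word, so very short words get only a shape.
    """
    feats = {f"shape={word_shape(word)}": 1.0}
    for n in range(1, min(max_len, len(word) - 1) + 1):
        feats[f"prefix={word[:n].lower()}"] = 1.0
        feats[f"suffix={word[-n:].lower()}"] = 1.0
    return feats

print(word_shape("McCoy-42"))
print(sorted(affix_features("running")))
```

Features like `suffix=ing` still fire for words never seen in training, which is one plausible reason affixes help with unknown words.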

  6. Transformation-based tagging
     • One more idea to try: using rule templates to learn POS tagging rules
     • Sample rule templates:
       • Change tag A to tag B when the [preceding | following] word is tagged Z.
       • Change tag A to tag B when the tag Z appears within [N] positions of the current word.
     • Result:
       • Using a very restricted set of rule templates, accuracy went up 0.5%
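The first rule template above could be instantiated and scored roughly as follows. `NeighborRule` and `net_gain` are hypothetical names, and the sketch covers only applying and scoring a single rule, not the full training loop that repeatedly picks the best-scoring rule.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NeighborRule:
    """Change `from_tag` to `to_tag` when the neighboring word is tagged `context_tag`."""
    from_tag: str
    to_tag: str
    context_tag: str
    offset: int  # -1 = preceding word, +1 = following word

    def apply(self, tags):
        """Return a new tag list; the context is checked against the input tags."""
        out = list(tags)
        for i, tag in enumerate(tags):
            j = i + self.offset
            if tag == self.from_tag and 0 <= j < len(tags) and tags[j] == self.context_tag:
                out[i] = self.to_tag
        return out

def net_gain(rule, current, gold):
    """Correct tags after applying the rule minus correct tags before."""
    after = rule.apply(current)
    return sum(a == g for a, g in zip(after, gold)) - sum(c == g for c, g in zip(current, gold))

# Example instantiation: retag NN as VB after the tag TO ("to run").
rule = NeighborRule("NN", "VB", "TO", -1)
print(rule.apply(["PRP", "VBZ", "TO", "NN"]))
```

During training, a learner in this style would generate candidate rules from the templates, keep the one with the highest `net_gain` on the training data, apply it, and repeat until no rule helps.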

  7. Final results
     • HMM with bigrams and rule-based adjustments
     • Max-Ent with prefix/suffix and word-shape features, plus rule-based adjustments
     • Max-Ent performs better, with 97% accuracy achievable
