1 / 11

Part of Speech (POS) Tagging Lab

Part of Speech (POS) Tagging Lab. CSC 9010: Special Topics. Natural Language Processing. Paula Matuszek, Mary-Angela Papalaskari Spring, 2005. Examples taken from the Bird, Klein and Loper: NLTK Tutorial, Tagging, nltk.sourceforge.net/tutorial/tagging/index.html. Simple Taggers.

carol
Télécharger la présentation

Part of Speech (POS) Tagging Lab

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Part of Speech (POS) Tagging Lab CSC 9010: Special Topics. Natural Language Processing. Paula Matuszek, Mary-Angela Papalaskari Spring, 2005 Examples taken from the Bird, Klein and Loper: NLTK Tutorial, Tagging, nltk.sourceforge.net/tutorial/tagging/index.html CSC 9010: Special Topics, Natural Language Processing. Spring, 2005. Matuszek & Papalaskari

  2. Simple Taggers • Three simple taggers in NLTK • Default tagger • Regular expression tagger • Unigram tagger • All start with tokenized text. >>> from nltk.tokenizer import * >>> text_token = Token(TEXT="John saw 3 polar bears .") >>> WhitespaceTokenizer().tokenize(text_token) >>> print text_token <[<John>, <saw>, <3>, <polar>, <bears>, <.>]> CSC 9010: Special Topics, Natural Language Processing. Spring, 2005. Matuszek & Papalaskari

  3. Default Tagger • Assigns the same tag to every token. • We create an instance of the tagger and give it the desired tag. >>> from nltk.tagger import * >>> my_tagger = DefaultTagger('nn') >>> my_tagger.tag(text_token) >>> print text_token <[<John/nn>, <saw/nn>, <3/nn>, <polar/nn>, <bears/nn>, <./nn>]> • We’ve just labeled everything as a noun. • 20-30% accuracy (terrible), but useful as an adjunct to other taggers. CSC 9010: Special Topics, Natural Language Processing. Spring, 2005. Matuszek & Papalaskari

  4. Regular Expression Tagger • Takes a list of regular expressions and tags to assign when they match. • >>> NN_CD_tagger = RegexpTagger([(r'^[0-9]+(.[0-9]+)?$', 'cd'), (r'.*', 'nn')]) • >>> NN_CD_tagger.tag(text_token) • >>> print text_token <[<John/nn>, <saw/nn>, <3/cd>, <polar/nn>, <bears/nn>, <./nn>]> • This tags cardinal numbers as CD and everything else as nouns. • Still pretty poor, but may be a useful step in conjunction with other taggers. CSC 9010: Special Topics, Natural Language Processing. Spring, 2005. Matuszek & Papalaskari

  5. Unigram Tagger • Assign each word its most frequent tag • Must be trained to determine frequency. • Will assign “none” as a tag to any word not seen in the training set. • About 90% accurate. • Example training case (from Brown corpus) • The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at investigation/nn of/in Atlanta's/np$ recent/jj primary/nn election/nn produced/vbd ``/`` no/at evidence/nn ''/‘ that/cs any/dti irregularities/nns took/vbd place/nn ./. CSC 9010: Special Topics, Natural Language Processing. Spring, 2005. Matuszek & Papalaskari

  6. Train the Unigram Tagger >>> from nltk.tagger import * >>> from nltk.corpus import brown # Tokenize ten texts from the Brown Corpus >>> train_tokens = [ ] >>> for item in brown.items()[:10]: ... train_tokens.append(brown.read(item)) # Initialise and train a unigram tagger >>> mytagger = UnigramTagger(SUBTOKENS='WORDS') >>> for tok in train_tokens: mytagger.train(tok) CSC 9010: Special Topics, Natural Language Processing. Spring, 2005. Matuszek & Papalaskari

  7. And Then Tag New Text >>> text_token = Token(TEXT="John saw the book on the table") >>> WhitespaceTokenizer(SUBTOKENS='WORDS').tokenize(text_token) >>> mytagger.tag(text_token) >>> print text_token <[<John/np>, <saw/vbd>, <the/at>, <book/None>, <on/in>, <the/at>, <table/nn>]> CSC 9010: Special Topics, Natural Language Processing. Spring, 2005. Matuszek & Papalaskari

  8. Testing a Tagger • So how well does the tagger do? • Split up the inputs into training and testing sets >>> train_tokens = [ ] >>> for item in brown.items()[:10]:# texts 0-9 ... train_tokens.append(brown.read(item)) >>> unseen_tokens = [ ] >>> for item in brown.items()[10:12]:# texts 10-11 ... unseen_tokens.append(brown.read(item)) CSC 9010: Special Topics, Natural Language Processing. Spring, 2005. Matuszek & Papalaskari

  9. Train And Test >>> for tok in train_tokens: mytagger.train(tok) >>> acc = tagger_accuracy(mytagger, unseen_tokens) >>> print 'Accuracy = %4.1f%%' % (100 * acc) Accuracy = 64.6% CSC 9010: Special Topics, Natural Language Processing. Spring, 2005. Matuszek & Papalaskari

  10. More in NLTK • Error analysis • Higher order taggers • Bigram • Nth-order • Combining taggers • Brill tagger CSC 9010: Special Topics, Natural Language Processing. Spring, 2005. Matuszek & Papalaskari

  11. For Lab/Homework • Complete the tagger tutorial from the NLTK tutorial page. • Tutorial exercises 1, 3, 4, 5 and 10. • 8.2 (we will compare next time) • 8.9 (using the NLTK and any higher-order tagger) CSC 9010: Special Topics, Natural Language Processing. Spring, 2005. Matuszek & Papalaskari

More Related