Lemmatization and Tagging LELA 30922
Lemmatization • Basic form of annotation involving identification of underlying lemmas (lexemes) of the words in a text • Related to morphological processing • Lemmatization merely identifies lemma • Morphological processing would (also) try to interpret the inflection etc. • eg running (lemma = run) (analysis: lex=run, form=prespart)
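To make this concrete, below is a minimal sketch using NLTK's WordNetLemmatizer (one tool among many; note that a POS hint has to be supplied, since an isolated word form is often ambiguous):

```python
# Minimal lemmatization sketch with NLTK's WordNetLemmatizer.
# Assumes: pip install nltk, then nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Lemmatization merely identifies the lemma; it does not analyse the inflection.
print(lemmatizer.lemmatize("running", pos="v"))  # -> run
print(lemmatizer.lemmatize("books", pos="n"))    # -> book
```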
Lemmatization – how to? • Simplest solution would be to have a list of all possible word forms and associated lemma information • Somewhat inefficient (for English) and actually impossible for some other languages • And not necessary, since there are many regularly formed inflections in English • Of course, list of irregularities needed as well
Lemmatization – how to? • Computational morphology quite well established now: various methods • Brute force: try every possible segmentation of word and see which ones match known stems and affixes • Rule-based (simplistic method): Have list of known affixes, see which ones apply • Rule-based (more sophisticated): List of known affixes, and knowledge about allowable combinations, eg -ing can only attach to a verb stem
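A toy sketch of the simplistic rule-based method described above: consult a list of irregular forms first, then try a few suffix-stripping rules. The rules and word lists here are invented for illustration and are far from complete:

```python
# Toy rule-based lemmatizer: irregulars list plus suffix-stripping rules.
# A real system would also check the resulting stem against a stem lexicon.
IRREGULAR = {"ran": "run", "went": "go", "geese": "goose"}

# (suffix, replacement) pairs, tried in order, longest first
SUFFIX_RULES = [("nning", "n"), ("pping", "p"), ("ing", ""), ("ies", "y"),
                ("ed", ""), ("s", "")]

def lemmatize(word: str) -> str:
    if word in IRREGULAR:
        return IRREGULAR[word]
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[:-len(suffix)] + replacement
    return word

print(lemmatize("running"))  # -> run   (doubled-consonant rule)
print(lemmatize("healing"))  # -> heal
print(lemmatize("hoping"))   # -> hop   (wrong! should be hope: see next slide)
```

The deliberate failure on hoping shows why simple suffix-stripping is not enough, which is exactly the point of the next slide.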
Lemmatization – how to? • Problem well studied and understood, though that’s not to say it’s trivial • Morphological processes can be quite complex, cf running, falling, hopping, hoping, healing, … • Need to deal with derivation as well as inflection • Not just suffixes, other types of morphological process (prefix, ablaut, etc.) • Plenty of ambiguities • ambiguous morphemes, eg fitter, books • ambiguity between single morph and inflected form, eg flower
POS Tagging • POS = part of speech • Familiar (?) from school and/or language learning (noun, verb, adjective, etc.) • POS tagsets usually identify more fine-grained distinctions, eg proper noun, common noun, plural noun, etc • In fact POS tagsets often have ~60 different categories, even as many as 400!
POS Tagging • Assigning POS tags to individual words involves a degree of analysis • of the word form itself (cf lemmatization) • of the word in context • Individual words are often ambiguous (particularly in English, where a huge percentage of words is at least 2-ways ambiguous) • Disambiguation often depends on context
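For instance, NLTK's bundled English tagger resolves such ambiguities from context; a quick sketch (the exact tags shown are what the tagger typically produces, not guaranteed):

```python
# Context-dependent disambiguation with NLTK's bundled English POS tagger.
# Assumes: nltk.download('averaged_perceptron_tagger')
import nltk

print(nltk.pos_tag(["I", "run", "daily"]))   # 'run' typically tagged VBP (verb)
print(nltk.pos_tag(["the", "long", "run"]))  # 'run' typically tagged NN (noun)
```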
What is a tagger? • Lack of distinction between … • Software which allows you to create something you can then use to tag input text, e.g. “Brill’s tagger” • The result of running such software, e.g. a tagger for English (based on such-and-such a corpus) • Taggers (even rule-based ones) are almost invariably trained on a given corpus • “Tagging” usually understood to mean “POS tagging”, but you can have other types of tags (eg semantic tags)
Simple taggers • Default tagger has one tag per word, and assigns it on the basis of dictionary lookup • Tags may indicate ambiguity but not resolve it, e.g. NVB for noun-or-verb • Words may be assigned different tags with associated probabilities • Tagger will assign the most probable tag unless there is some way to identify when a less probable tag is in fact correct • Tag sequences may be defined, and assigned probabilities (including 0 for illegal sequences – negative rules)
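A sketch of such a simple tagger, assigning each word its most probable tag from a lookup table (the words and probabilities below are invented stand-ins for real dictionary or corpus data):

```python
# Simple "most probable tag" tagger based on per-word tag probabilities.
# Probabilities are invented for illustration.
TAG_PROBS = {
    "run":   {"VB": 0.75, "NN": 0.25},
    "the":   {"DT": 1.0},
    "books": {"NNS": 0.7, "VBZ": 0.3},
}
DEFAULT_TAG = "NN"  # fallback for unknown words

def tag(words):
    return [(w, max(TAG_PROBS[w], key=TAG_PROBS[w].get))
            if w in TAG_PROBS else (w, DEFAULT_TAG)
            for w in words]

print(tag(["the", "run", "books"]))
# [('the', 'DT'), ('run', 'VB'), ('books', 'NNS')]
```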
Rule-based taggers • Earliest type of tagging: two stages • Stage 1: look up word in lexicon to give list of potential tags • Stage 2: Apply rules which certify or disallow tag sequences • Rules originally handwritten; more recently Machine Learning methods can be used • “Transformation-based tagging” most common example
Transformation-based tagging • Eric Brill (1993) • Start from an initial tagging, and apply a series of transformations • Transformations are learned as well, from the training data • Captures the tagging data in far fewer parameters than statistical models • The transformations learned (often) have linguistic “reality”
Transformation-based tagging • Three stages: • Lexical look-up • Lexical rule application for unknown words • Contextual rule application to correct mis-tags
Transformation-based learning • Change tag a to b when: • Internal evidence (morphology) • Contextual evidence • One or more of the preceding/following words has a specific tag • One or more of the preceding/following words is a specific word • One or more of the preceding/following words has a certain form • Order of rules is important • Rules can change a correct tag into an incorrect tag, so another rule might correct that “mistake”
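As a sketch, here is how one such contextual transformation might be applied; the rule shown (change NN to VB after TO) is a famous example of a rule Brill's tagger learns, though the code itself is just an illustration:

```python
# Apply one Brill-style contextual transformation to an initial tagging:
# "change tag NN to VB when the preceding word is tagged TO".
def apply_rule(tagged, from_tag, to_tag, prev_tag):
    out = list(tagged)
    for i in range(1, len(out)):
        word, t = out[i]
        if t == from_tag and out[i - 1][1] == prev_tag:
            out[i] = (word, to_tag)
    return out

initial = [("to", "TO"), ("run", "NN")]       # initial (incorrect) tagging
print(apply_rule(initial, "NN", "VB", "TO"))  # [('to', 'TO'), ('run', 'VB')]
```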
Stochastic taggers • Nowadays, pretty much all taggers are statistics-based, and have been since the 1980s (or even earlier: some primitive algorithms were already published in the 60s and 70s)
How do they work? • Tagger must be “trained” • Many different techniques, but typically … • Small “training corpus” hand-tagged • Tagging rules learned automatically • Rules define most likely sequence of tags • Rules based on • Internal evidence (morphology) • External evidence (context) • Probabilities
What probabilities do we have to learn? • Individual word probabilities • P that a given tag is appropriate for a given word • Learned from corpus evidence • Problem of “sparse data” • Tag sequence probabilities • P that a given sequence of tags is appropriate • Again, learned from corpus evidence
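Both kinds of probability are in essence relative frequencies over the training corpus; a sketch of the estimation on a tiny invented corpus:

```python
# Estimating word and tag-sequence probabilities from a tagged corpus.
from collections import Counter

# Tiny invented training corpus of (word, tag) pairs
corpus = [("the", "DT"), ("run", "NN"), ("I", "PRP"), ("run", "VB"),
          ("we", "PRP"), ("run", "VB"), ("a", "DT"), ("run", "NN")]

tags = [t for _, t in corpus]
word_tag_counts = Counter(corpus)
word_counts = Counter(w for w, _ in corpus)
tag_counts = Counter(tags)
bigram_counts = Counter(zip(tags, tags[1:]))

# Individual word probability, P(tag | word)
p_vb_given_run = word_tag_counts[("run", "VB")] / word_counts["run"]  # 2/4 = 0.5

# Tag sequence probability, P(next tag | previous tag)
p_vb_after_prp = bigram_counts[("PRP", "VB")] / tag_counts["PRP"]     # 2/2 = 1.0

print(p_vb_given_run, p_vb_after_prp)
```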
Individual word probability • Simple calculation • suppose the word run occurs 4800 times in the training corpus: • 3600 times as a verb • 1200 times as a noun • P(verb|run) = 0.75 • P(noun|run) = 0.25
“Sparse data” • What if there is no evidence for a particular combination? • Could mean it is impossible, or just that it doesn’t happen to occur • Calculations involving products of probabilities don’t like 0s • “Smoothing”: add a tiny amount to all values, so there are no zeros • Probabilities are all slightly reduced, but none is 0
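A sketch of the simplest such scheme, add-one (Laplace) smoothing, applied to the run counts from the earlier slide (the JJ reading is invented, standing in for a tag unseen in training):

```python
# Add-one (Laplace) smoothing: every tag gets a small non-zero probability.
def smoothed_probs(counts, all_tags):
    total = sum(counts.values())
    return {t: (counts.get(t, 0) + 1) / (total + len(all_tags)) for t in all_tags}

# 'run' was never seen as an adjective in training, but P(JJ|run) is not 0
print(smoothed_probs({"VB": 3600, "NN": 1200}, ["VB", "NN", "JJ"]))
# {'VB': ~0.7497, 'NN': ~0.2500, 'JJ': ~0.0002}
```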
Tag sequence probability • Probability that a given tag sequence is appropriate for a given word sequence • Much too hard to calculate probabilities for all possible sequences • Subsequences are more practical • It turns out that good accuracy is gained just by looking at sequences of 2 or 3 tags (bigrams, trigrams)
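A sketch of scoring candidate tag sequences with bigram probabilities, the Markov assumption behind most stochastic taggers (the probability table is invented):

```python
# Score a tag sequence under a bigram model:
# P(t1..tn) is approximated as the product of P(t_i | t_{i-1}).
import math

P_BIGRAM = {("<s>", "DT"): 0.6, ("DT", "NN"): 0.5, ("DT", "VB"): 0.01,
            ("NN", "VB"): 0.3}  # invented probabilities; <s> = sentence start

def sequence_logprob(tag_seq):
    logp, prev = 0.0, "<s>"
    for t in tag_seq:
        logp += math.log(P_BIGRAM.get((prev, t), 1e-6))  # floor for unseen bigrams
        prev = t
    return logp

print(sequence_logprob(["DT", "NN"]))  # ~ -1.2  (plausible sequence)
print(sequence_logprob(["DT", "VB"]))  # ~ -5.1  (determiner + verb: unlikely)
```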
Tagging – final word • Tagging now quite well understood technology • Accuracy typically >97% • Hard to imagine how to get improvements of even as much as 1% • Many taggers now available for download • Sometimes not clear whether “tagger” means • Software enabling you to build a tagger given a corpus • An already built tagger for a given language • Because a given tagger (2nd sense) will have been trained on some corpus, it will be biased towards that (kind of) corpus • Question of goodness of match between original training corpus and material you want to use the tagger on