Ling 570 Day #3 Stemming, Probabilistic Automata, Markov Chains/Model
Last Class • FST as Translator • FR: ce bill met de le baume sur une blessure → EN: this bill puts balm on a sore wound
FST Application Examples • Case folding: • He said → he said • Tokenization: • “He ran.” → “ He ran . ” • POS tagging: • They can fish → PRO VERB NOUN
FST Application Examples • Pronunciation: • B AH T EH R → B AH DX EH R • Morphological generation: • Fox s → Foxes • Morphological analysis: • cats → cat s
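A minimal sketch of the simplest application above, case folding, written as a one-state transducer. The table layout `(state, input) -> (output, next state)` and the names `make_case_folding_fst` and `transduce` are assumptions made for illustration, not anything from the course materials; the alphabet is limited to printable ASCII.

```python
import string

def make_case_folding_fst():
    """Build a one-state transducer that maps each character to its lowercase form."""
    transitions = {}
    for ch in string.ascii_letters + string.digits + string.punctuation + " ":
        # (state, input symbol) -> (output symbol, next state)
        transitions[("q0", ch)] = (ch.lower(), "q0")
    return transitions

def transduce(transitions, text, start="q0", finals=frozenset({"q0"})):
    """Run a deterministic FST over text; return the output string, or None if rejected."""
    state, out = start, []
    for ch in text:
        if (state, ch) not in transitions:
            return None  # symbol outside the (hypothetical) alphabet
        symbol, state = transitions[(state, ch)]
        out.append(symbol)
    return "".join(out) if state in finals else None

print(transduce(make_case_folding_fst(), "He said"))  # he said
```

The other examples (tokenization, flapping in pronunciation, morphological generation) need more states and context-sensitive arcs, but the same table-driven loop would apply.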
Roadmap • Motivation: • Representing words • A little (mostly English) Morphology • Stemming
The Lexicon • Goal: Represent all the words in a language • Approach? • Enumerate all words? • Doable for English • Typical for ASR (Automatic Speech Recognition) • English is morphologically relatively impoverished • Other languages? • Wildly impractical • Turkish: 40,000 forms/verb; uygarlaştıramadıklarımızdanmışsınızcasına “(behaving) as if you are among those whom we could not civilize”
Morphological Parsing • Goal: Take a surface word form and generate a linguistic structure of component morphemes • A morpheme is the minimal meaning-bearing unit in a language. • Stem: the morpheme that forms the central meaning unit in a word • Affix: prefix, suffix, infix, circumfix • Prefix: e.g., possible → impossible • Suffix: e.g., walk → walking • Infix: e.g., hingi → humingi (Tagalog) • Circumfix: e.g., sagen → gesagt (German)
Surface Variation & Morphology • Searching (a la Bing) for documents about: • Televised sports • Many possible surface forms: • Televised, television, televise, … • Sports, sport, sporting, … • How can we match? • Convert surface forms to common base form • Stemming or morphological analysis
Two Perspectives • Stemming: • writing → write (or writ) • Beijing → Beije • Morphological Analysis: • writing → write+V+prog • cats → cat+N+pl • writes → write+V+3rdpers+Sg
Stemming • Simple type of morphological analysis • Supports matching using base form • e.g. television, televised, televising → televise • Most popular: Porter stemmer • Task: Given surface form, produce base form • Typically, removes suffixes • Model: • Rule cascade • No lexicon!
Stemming • Used in many NLP/IR applications • For building equivalence classes • e.g. Connect, Connected, Connecting, Connection, Connections: same class; suffixes irrelevant • Porter Stemmer: simple and efficient • Website: http://www.tartarus.org/~martin/PorterStemmer • On patas: ~/dropbox/12-13/570/porter
Porter Stemmer • Rule cascade: • Rule form: • (condition) PATT1 → PATT2 • E.g. stem contains vowel, ING → ε • ATIONAL → ATE • Rule partial order: • Step 1a: -s • Step 1b: -ed, -ing • Steps 2-4: derivational suffixes • Step 5: cleanup • Pros: Simple, fast, buildable for a variety of languages • Cons: Both overaggressive (conflates unrelated words) and underaggressive (misses related forms)
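The rule form (condition) PATT1 → PATT2 and the step ordering can be sketched as a small cascade. This is a toy, not the Porter stemmer: it covers only a handful of rules from Steps 1a, 1b, and 2, treats only a, e, i, o, u as vowels, and uses a rough stand-in for Porter's measure condition. The names `STEPS`, `apply_step`, and `toy_stem` are made up for this illustration.

```python
import re

def always(stem):
    return True

def contains_vowel(stem):
    # Simplification: 'y' is never treated as a vowel here.
    return re.search(r"[aeiou]", stem) is not None

def measure_positive(stem):
    # Rough stand-in for Porter's m > 0: some vowel is followed by a consonant.
    return re.search(r"[aeiou][^aeiou]", stem) is not None

# Each step is an ordered list of (condition, PATT1, PATT2), longest suffix first.
STEPS = [
    # Step 1a: plural -s
    [(always, "sses", "ss"), (always, "ies", "i"), (always, "ss", "ss"), (always, "s", "")],
    # Step 1b: -ing, -ed (only if the remaining stem contains a vowel)
    [(contains_vowel, "ing", ""), (contains_vowel, "ed", "")],
    # Step 2 (one rule only): ATIONAL -> ATE
    [(measure_positive, "ational", "ate")],
]

def apply_step(word, rules):
    """Apply at most one rule: the first suffix that matches decides the outcome."""
    for condition, patt1, patt2 in rules:
        if word.endswith(patt1):
            stem = word[: len(word) - len(patt1)]
            # If the matching rule's condition fails, the word is left unchanged.
            return stem + patt2 if condition(stem) else word
    return word

def toy_stem(word):
    word = word.lower()
    for rules in STEPS:
        word = apply_step(word, rules)
    return word

for w in ["walking", "sing", "caresses", "relational", "connected", "writing"]:
    print(w, "->", toy_stem(w))
```

Note that "sing" is left alone because the stem "s" contains no vowel, while "writing" becomes "writ": with no lexicon, the cascade has no way to know that the base form ends in "e".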
Evaluating Performance • Measures of stemming performance rely on metrics similar to those used in IR: • Precision: the proportion of selected items the system got right • precision = tp / (tp + fp) • # of correct answers / # of answers given • Recall: the proportion of target items the system selected • recall = tp / (tp + fn) • # of correct answers / # of possible correct answers • Rule of thumb: as precision increases, recall tends to drop, and vice versa • Metrics widely adopted in statistical NLP
Precision and Recall • Take a given stemming task • Suppose there are 100 words that could be stemmed • A stemmer gets 52 of these right (tp), leaving 48 missed (fn) • But it inadvertently stems 10 others (fp) • Precision = 52 / (52 + 10) = .84 • Recall = 52 / (52 + 48) = .52 • Note: easy to get precision of 1.0. Why?
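The arithmetic on this slide can be checked with a short sketch. The function names are illustrative; the counts are the ones given above, with fn derived as 100 - 52.

```python
def precision(tp, fp):
    """Proportion of the system's answers that were correct."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Proportion of the possible correct answers that the system found."""
    return tp / (tp + fn)

tp, fp = 52, 10
fn = 100 - tp  # stemmable words the stemmer did not get right

print(round(precision(tp, fp), 2))  # 0.84
print(round(recall(tp, fn), 2))     # 0.52
```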
PFA Definition • A Probabilistic Finite-State Automaton is a 6-tuple: • A set of states Q • An alphabet Σ • A set of transitions: δ ⊆ Q × Σ × Q • Initial state probabilities I: Q → R+ • Transition probabilities P: δ → R+ • Final state probabilities F: Q → R+
PFA Recap • Subject to constraints: • Computing sequence probabilities
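The two items on this slide can be written out under the standard PFA normalization. This is an assumption about what the slide intends, stated in the I, P, F notation of the definition; the path notation θ is introduced here for illustration.

```latex
% Constraints: initial probabilities sum to 1, and for every state the
% probability of stopping plus the probability of all outgoing transitions is 1.
\sum_{q \in Q} I(q) = 1
\qquad
\forall q \in Q:\quad F(q) + \sum_{a \in \Sigma,\; q' \in Q} P(q, a, q') = 1

% Probability of a path \theta = (q_0, x_1, q_1, \ldots, x_n, q_n):
\Pr(\theta) = I(q_0) \cdot \prod_{i=1}^{n} P(q_{i-1}, x_i, q_i) \cdot F(q_n)

% Probability of a sequence x = x_1 \ldots x_n: sum over all paths that emit x.
\Pr(x) = \sum_{\theta \in \Theta(x)} \Pr(\theta)
```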
PFA Example • I(q0)=1 • I(q1)=0 • F(q0)=0 • F(q1)=0.2 • P(q0,a,q1)=1; P(q1,b,q1)=0.8 • P(ab^n) = I(q0) * P(q0,a,q1) * P(q1,b,q1)^n * F(q1) = 0.8^n * 0.2
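A minimal sketch of this example in code, using the values of I, F, and P given on the slide. The dictionary encoding and the function names `check_constraints` and `string_probability` are assumptions for illustration, and the normalization check assumes the standard PFA constraints (initial probabilities sum to 1; for each state, final probability plus outgoing transition probabilities sum to 1).

```python
from collections import defaultdict

# Values taken from the slide.
I = {"q0": 1.0, "q1": 0.0}
F = {"q0": 0.0, "q1": 0.2}
P = {("q0", "a", "q1"): 1.0, ("q1", "b", "q1"): 0.8}

def check_constraints(I, F, P, tol=1e-9):
    """Check the assumed normalization constraints for every state."""
    if abs(sum(I.values()) - 1.0) > tol:
        return False
    outgoing = defaultdict(float)
    for (src, _symbol, _dst), p in P.items():
        outgoing[src] += p
    return all(abs(F[q] + outgoing[q] - 1.0) <= tol for q in I)

def string_probability(x, I, F, P):
    """Sum the probability of every path that emits x (forward algorithm)."""
    alpha = dict(I)  # alpha[q]: probability of having emitted the prefix and being in q
    for symbol in x:
        nxt = defaultdict(float)
        for (src, sym, dst), p in P.items():
            if sym == symbol:
                nxt[dst] += alpha.get(src, 0.0) * p
        alpha = nxt
    return sum(alpha[q] * F.get(q, 0.0) for q in alpha)

print(check_constraints(I, F, P))
for n in range(4):
    # Each value should agree with 0.8**n * 0.2 up to floating-point rounding.
    print(n, string_probability("a" + "b" * n, I, F, P))
```

Strings outside the form ab^n, such as "ba" or the empty string, get probability 0 because no path with nonzero probability emits them.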
Markov Chain • A Markov Chain is a special case of a PFA in which the sequence uniquely determines which states the automaton will go through. • Markov Chains cannot represent inherently ambiguous problems • Can assign probability to unambiguous sequences
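One way to picture "the sequence uniquely determines the states" is a first-order chain in which the state after reading a symbol is that symbol. The sketch below assumes this encoding; the alphabet and every probability in it are hypothetical values chosen only so that each row sums to 1.

```python
# Hypothetical first-order Markov chain over the alphabet {a, b}.
# The state after emitting a symbol is the symbol itself, so there is exactly
# one state path per sequence and no summation over paths is needed.
INIT = {"a": 0.6, "b": 0.4}
TRANS = {
    "a": {"a": 0.3, "b": 0.5},
    "b": {"a": 0.4, "b": 0.4},
}
STOP = {"a": 0.2, "b": 0.2}  # probability of ending the sequence in each state

def sequence_probability(seq):
    """Multiply along the single path: initial, then transitions, then stop."""
    if not seq:
        return 0.0  # this sketch gives the empty sequence no probability mass
    p = INIT[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= TRANS[prev][cur]
    return p * STOP[seq[-1]]

print(sequence_probability("ab"))   # INIT[a] * TRANS[a][b] * STOP[b]
print(sequence_probability("abb"))  # one more factor: TRANS[b][b]
```

Contrast this with a general PFA, where several state paths may emit the same sequence and their probabilities have to be summed.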