Ling 570, Day #3: Stemming, Probabilistic Automata, Markov Chains/Models
Last Class • FST as Translator • FR: ce bill met de le baume sur une blessure • EN: this bill puts balm on a sore wound
FST Application Examples • Case folding: • He said → he said • Tokenization: • “He ran.” → “ He ran . ” • POS tagging: • They can fish → PRO VERB NOUN
FST Application Examples • Pronunciation: • B AH T EH R → B AH DX EH R • Morphological generation: • Fox + s → Foxes • Morphological analysis: • cats → cat + s
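To make the FST idea concrete, here is a minimal sketch in Python (a toy, not the course's implementation): a deterministic transducer is just a transition table mapping (state, input symbol) to (output string, next state), applied here to the case-folding example above. All names are illustrative.

```python
import string

# Toy deterministic FST: transitions map (state, input_symbol)
# to (output_string, next_state).
def run_fst(transitions, start, finals, symbols):
    state, out = start, []
    for sym in symbols:
        if (state, sym) not in transitions:
            return None                      # reject: undefined transition
        piece, state = transitions[(state, sym)]
        out.append(piece)
    return "".join(out) if state in finals else None

# Case-folding transducer: a single state; each letter (or space) is read
# and its lowercase form is emitted.
CASEFOLD = {("q0", c): (c.lower(), "q0") for c in string.ascii_letters + " "}

print(run_fst(CASEFOLD, "q0", {"q0"}, "He said"))   # -> 'he said'
```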
Roadmap • Motivation: • Representing words • A little (mostly English) Morphology • Stemming
The Lexicon • Goal: Represent all the words in a language • Approach? • Enumerate all words? • Doable for English • Typical for ASR (Automatic Speech Recognition) • English is morphologically relatively impoverished • Other languages? • Wildly impractical • Turkish: 40,000 forms/verb; uygarlaştıramadıklarımızdanmışsınızcasına “(behaving) as if you are among those whom we could not civilize”
Morphological Parsing • Goal: Take a surface word form and generate a linguistic structure of component morphemes • A morpheme is the minimal meaning-bearing unit in a language. • Stem: the morpheme that forms the central meaning unit in a word • Affix: prefix, suffix, infix, circumfix • Prefix: e.g., possible → impossible • Suffix: e.g., walk → walking • Infix: e.g., hingi → humingi (Tagalog) • Circumfix: e.g., sagen → gesagt (German)
Surface Variation & Morphology • Searching (à la Bing) for documents about: • Televised sports • Many possible surface forms: • Televised, television, televise, … • Sports, sport, sporting, … • How can we match? • Convert surface forms to a common base form, as sketched below • Stemming or morphological analysis
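A sketch of the matching idea (the stemmer here is a crude, hypothetical stand-in for this demo only; the Porter stemmer below does this properly): reduce both query and document terms to base forms, then intersect.

```python
# Stand-in stemmer for this demo only: crude, ordered suffix stripping.
def stem(word):
    for suffix in ("ised", "ision", "ing", "s"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

doc_terms = {"television", "sporting"}
query_terms = {"televised", "sports"}

# Both sides stem to the same base forms, so the match succeeds.
shared = {stem(q) for q in query_terms} & {stem(d) for d in doc_terms}
print(shared)   # -> {'telev', 'sport'}
```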
Two Perspectives • Stemming: • writing → write (or writ) • Beijing → Beije • Morphological Analysis: • writing → write +V +prog • cats → cat +N +pl • writes → write +V +3rdpers +Sg
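The morphological-analysis view can be sketched with a toy suffix table (hypothetical rules; a real analyzer needs a lexicon and many more rules). Note that naive suffix stripping yields the stem 'writ', matching the '(or writ)' caveat above.

```python
# Toy analyzer: ordered (suffix, feature-string) rules.
SUFFIX_RULES = [("ing", "+V +prog"), ("s", "+N +pl")]

def analyze(word):
    for suffix, feats in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)] + " " + feats
    return word                      # no rule applies: return unanalyzed

print(analyze("writing"))   # -> 'writ +V +prog'
print(analyze("cats"))      # -> 'cat +N +pl'
```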
Stemming • Simple type of morphological analysis • Supports matching using base form • e.g., Television, televised, televising → televise • Most popular: Porter stemmer • Task: Given a surface form, produce the base form • Typically removes suffixes • Model: • Rule cascade • No lexicon!
Stemming • Used in many NLP/IR applications • For building equivalence classes: Connect, Connected, Connecting, Connection, Connections (same class; suffixes irrelevant) • Porter stemmer: simple and efficient • Website: http://www.tartarus.org/~martin/PorterStemmer • On patas: ~/dropbox/12-13/570/porter
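A quick demonstration of the equivalence class, using NLTK's off-the-shelf PorterStemmer (assuming nltk is installed; this is not the course's patas copy, but it implements the same algorithm):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["Connect", "Connected", "Connecting", "Connection", "Connections"]

# All five surface forms collapse into the single class 'connect'.
print({w: stemmer.stem(w.lower()) for w in words})
```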
Porter Stemmer • Rule cascade: • Rule form: • (condition) PATT1 → PATT2 • E.g., if the stem contains a vowel: ING → ε • ATIONAL → ATE • Rule partial order: • Step 1a: -s • Step 1b: -ed, -ing • Steps 2-4: derivational suffixes • Step 5: cleanup • Pros: Simple, fast, buildable for a variety of languages • Cons: Overaggressive and underaggressive
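A minimal sketch of the rule form, implementing just the two example rules above (not the full cascade):

```python
import re

def has_vowel(stem):
    return re.search(r"[aeiou]", stem) is not None

def apply_rules(word):
    # Rule form: (condition) PATT1 -> PATT2
    if word.endswith("ational"):                        # ATIONAL -> ATE
        return word[: -len("ational")] + "ate"
    if word.endswith("ing") and has_vowel(word[:-3]):   # (stem has vowel) ING -> eps
        return word[:-3]
    return word

print(apply_rules("relational"))  # -> 'relate'
print(apply_rules("motoring"))    # -> 'motor'
print(apply_rules("sing"))        # -> 'sing' (stem 's' has no vowel)
```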
Evaluating Performance • Measures of stemming performance rely on metrics used in IR: • Precision: the proportion of selected items the system got right • precision = tp / (tp + fp) • # of correct answers / # of answers given • Recall: the proportion of target items the system selected • recall = tp / (tp + fn) • # of correct answers / # of possible correct answers • Rule of thumb: as precision increases, recall drops, and vice versa • These metrics are widely adopted in statistical NLP
Precision and Recall • Take a given stemming task • Suppose there are 100 words that could be stemmed • A stemmer gets 52 of these right (tp); the 48 it misses are false negatives (fn) • But it inadvertently stems 10 others (fp) • Precision = 52 / (52 + 10) = .84 • Recall = 52 / (52 + 48) = .52 • Note: easy to get precision of 1.0. Why?
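Recomputing the worked example in Python:

```python
tp, fp, fn = 52, 10, 48          # right, inadvertently stemmed, missed

precision = tp / (tp + fp)       # 52/62  = 0.84 (approx.)
recall = tp / (tp + fn)          # 52/100 = 0.52
print(f"precision={precision:.2f}, recall={recall:.2f}")
```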
PFA Definition • A Probabilistic Finite-State Automaton is a 6-tuple: • A set of states Q • An alphabet Σ • A set of transitions: δ ⊆ Q × Σ × Q • Initial state probabilities: I: Q → ℝ⁺ • Transition probabilities: P: δ → ℝ⁺ • Final state probabilities: F: Q → ℝ⁺
PFA Recap • Subject to constraints: • Σ_q I(q) = 1 • For every state q: F(q) + Σ_{a,q′} P(q, a, q′) = 1 • Computing sequence probabilities: sum over accepting paths of I(q₀) times the product of the transition probabilities along the path, times F(qₙ), as in the example on the next slide
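These constraints can be checked mechanically; here is a short Python check against the example automaton defined on the next slide:

```python
# The example PFA from the next slide.
I = {"q0": 1.0, "q1": 0.0}                            # initial probabilities
F = {"q0": 0.0, "q1": 0.2}                            # final probabilities
P = {("q0", "a", "q1"): 1.0, ("q1", "b", "q1"): 0.8}  # transition probabilities

# Initial probabilities sum to 1.
assert abs(sum(I.values()) - 1.0) < 1e-9

# For each state: final probability plus outgoing transition mass equals 1.
for q in I:
    out = sum(p for (src, _, _), p in P.items() if src == q)
    assert abs(F[q] + out - 1.0) < 1e-9
```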
PFA Example • I(q0) = 1 • I(q1) = 0 • F(q0) = 0 • F(q1) = 0.2 • P(q0, a, q1) = 1; P(q1, b, q1) = 0.8 • P(abⁿ) = I(q0) · P(q0, a, q1) · P(q1, b, q1)ⁿ · F(q1) = 0.8ⁿ · 0.2
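The closed form above can be reproduced directly (a sketch of the slide's computation):

```python
def prob_ab_n(n):
    """P(ab^n) = I(q0) * P(q0,a,q1) * P(q1,b,q1)**n * F(q1)."""
    return 1.0 * 1.0 * 0.8 ** n * 0.2

for n in range(4):
    print(n, prob_ab_n(n))   # approx. 0.2, 0.16, 0.128, 0.1024
```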
Markov Chain • A Markov chain is a special case of a PFA in which the sequence uniquely determines which states the automaton will go through. • Markov chains cannot represent inherently ambiguous problems • But they can assign probabilities to unambiguous sequences
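A sketch of the determinism point, with a hypothetical two-state weather chain: because each symbol names its state, a sequence has exactly one path, so its probability is a single product.

```python
START = {"sun": 0.6, "rain": 0.4}                      # initial distribution
TRANS = {("sun", "sun"): 0.7, ("sun", "rain"): 0.3,
         ("rain", "sun"): 0.4, ("rain", "rain"): 0.6}

def sequence_prob(states):
    # One path per sequence: multiply initial and transition probabilities.
    p = START[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= TRANS[(prev, cur)]
    return p

print(sequence_prob(["sun", "sun", "rain"]))   # 0.6 * 0.7 * 0.3 = 0.126
```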