Morphology, Phonology & FSTs

Morphology, Phonology & FSTs. Shallow Processing Techniques for NLP, Ling570, October 12, 2011. Roadmap: Motivation (representing words); a little (mostly English) morphology; stemming; FSTs & morphology (stemming, morphological analysis); FSTs & phonology.

Presentation Transcript


  1. Morphology, Phonology & FSTs Shallow Processing Techniques for NLP Ling570 October 12, 2011

  2. Roadmap • Motivation: • Representing words • A little (mostly English) Morphology • Stemming • FSTs & Morphology • Stemming • Morphological analysis • FSTs & Phonology

  3. Words • Goal: Compact representation of all surface forms in a language

  7. Lexicon • Goal: Compact representation of all surface forms in a language • Enumeration: • Impractical for morphologically rich languages • Descriptively unsatisfying for most languages • Orthographic variation: • fly + er → flier • Morphological variation: • saw + s → saws; fish + s → fish; goose + s → geese • Phonological variation: • dog + s → dog + /z/; fox + s → fox + /IH Z/

  8. Morphological Parsing • Goal: Take a surface word form and generate a linguistic structure of component morphemes • A morpheme is the minimal meaning-bearing unit in a language. • Stem: the morpheme that forms the central meaning unit in a word • Affix: prefix, suffix, infix, circumfix • Prefix: e.g., possible → impossible • Suffix: e.g., walk → walking • Infix: e.g., hingi → humingi (Tagalog) • Circumfix: e.g., sagen → gesagt (German)

  12. Combining Morphemes • Inflection: Stem + grammatical morpheme → same class • E.g.: help + ed → helped • Derivation: Stem + grammatical morpheme → new class • E.g. walk + er → walker (N) • Compounding: multiple stems → new word • E.g. doghouse, catwalk, … • Clitics: stem + clitic • I + ’ll → I’ll; he + is → he’s

  16. Inflectional Morphology (Mostly English) • Relatively simple inflectional system • Nouns, verbs, some adjectives • Noun inflection: • Only plural, possessive • Non-English??? • Plural: mostly stem + ‘s’, ‘es’ after s, z, sh, ch, x • Possessive: singular and irregular plural: + ’s; regular plural and after s, z: + ’
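The two noun-inflection rules on this slide can be sketched in a few lines of Python. This is a minimal illustration, not part of the original deck: the function names `regular_plural` and `possessive` are hypothetical, irregular nouns (goose → geese) are deliberately not handled because they require a lexicon, and the y → ie spelling change is ignored here.

```python
import re

# Stems ending in s, z, sh, ch, or x take "-es" (the list given on the slide).
SIBILANT_ENDING = re.compile(r"(s|z|sh|ch|x)$")

def regular_plural(stem):
    """Regular plural only: stem + "s", or stem + "es" after s, z, sh, ch, x."""
    return stem + ("es" if SIBILANT_ENDING.search(stem) else "s")

def possessive(noun, is_regular_plural=False):
    """Singular and irregular plural nouns take 's; regular plurals take a bare apostrophe."""
    return noun + ("'" if is_regular_plural else "'s")

print(regular_plural("cat"), regular_plural("fox"), regular_plural("church"))
print(possessive("llama"), possessive("children"), possessive("llamas", is_regular_plural=True))
```

The caller has to say whether a noun is a regular plural, because that information lives in the morphological analysis rather than in the spelling alone.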

  19. Verb Inflectional Morphology • Classes: • Main (eat, hit), modal (can, should), primary (be, have) • Only main, primary inflected • Regular verbs: Forms predictable from stem, productive • Irregular verbs: Only about 250, but very frequent

  21. Derivational Morphology • Relatively complex, common in English • Nominalization: Verb or Adj + affix → Noun • Adjectives: Verb or Noun + affix → Adj

  26. Cliticization • Clitics: intermediate between affix and word • Like affixes: short, reduced • Like words: act as pronouns, articles, conjunctions, verbs • In English: • Presence is (mostly) unambiguous: marked by an apostrophe (’) • Meaning is often ambiguous: e.g. he’s (he is or he has) • More complex in other languages: e.g. Arabic • Proclitics can attach as prefixes: article, preposition, conjunction • No markers • Removal of such clitics is often referred to as light stemming

  31. Stemming • Simple type of morphological analysis • Commonly used in information retrieval (IR) • Supports matching using base form • e.g. television, televised, televising → televise • Typically improves retrieval of short documents – why? • Most popular: Porter stemmer (snowball.tartarus.org) • Task: Given surface form, produce base form • Typically, removes suffixes • Model: • Rule cascade • No lexicon!
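To make "matching using base form" concrete, here is a small sketch of a stemmed inverted index. Everything in it is an assumption for illustration: `toy_stem` is a hypothetical three-suffix stripper (not the Porter stemmer), and the documents are invented. Note that the shared key it produces is "televis", not the dictionary form "televise" shown on the slide; for retrieval, the key only needs to be the same for all variants.

```python
from collections import defaultdict

SUFFIXES = ("ing", "ion", "ed")  # toy suffix list, illustrative only

def toy_stem(word):
    """Lowercase the word and strip the first matching suffix; no lexicon is consulted."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so very short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def build_index(docs):
    """Map each stemmed term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[toy_stem(token)].add(doc_id)
    return index

docs = {
    "d1": "Television news",
    "d2": "the game was televised",
    "d3": "televising the debate",
}
index = build_index(docs)
# The query is stemmed the same way as the documents before lookup.
hits = index.get(toy_stem("televised"), set())
print(sorted(hits))
```

Without stemming, the query "televised" would match only d2; with the shared key it matches all three documents.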

  39. Porter Stemmer • Rule cascade: • Rule form: • (condition) PATT1 → PATT2 • E.g. stem contains vowel, ING → ε • ATIONAL → ATE • Rule partial order: • Step 1a: -s • Step 1b: -ed, -ing • Steps 2-4: derivational suffixes • Step 5: cleanup • Pros: Simple, fast, buildable for a variety of languages • Cons: Overaggressive and underaggressive • Limited in application
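The rule form "(condition) PATT1 → PATT2" and the step ordering can be sketched as a small cascade. This is a heavily reduced illustration, not the Porter algorithm itself: it includes only a handful of rules, the names `STEPS`, `apply_step`, and `toy_porter` are hypothetical, and the ATIONAL rule uses a "stem contains a vowel" check as a stand-in for the measure-based condition in the real stemmer.

```python
VOWELS = set("aeiou")

def contains_vowel(stem):
    return any(ch in VOWELS for ch in stem)

def always(stem):
    return True

# Each step is an ordered list of (condition on the remaining stem, PATT1, PATT2).
STEPS = [
    # Step 1a: plural -s
    [(always, "sses", "ss"), (always, "ies", "i"), (always, "ss", "ss"), (always, "s", "")],
    # Step 1b: -ed, -ing (only if the remaining stem contains a vowel)
    [(contains_vowel, "ed", ""), (contains_vowel, "ing", "")],
    # One derivational rule standing in for steps 2-4 (condition simplified, see lead-in)
    [(contains_vowel, "ational", "ate")],
]

def apply_step(word, rules):
    """Try the first rule whose PATT1 matches; apply it only if its condition holds."""
    for condition, patt1, patt2 in rules:
        if word.endswith(patt1):
            stem = word[: len(word) - len(patt1)]
            return stem + patt2 if condition(stem) else word
    return word

def toy_porter(word):
    """Run the word through every step in order; no lexicon is consulted."""
    word = word.lower()
    for rules in STEPS:
        word = apply_step(word, rules)
    return word

for w in ["walking", "sing", "relational", "caresses", "cats"]:
    print(w, "->", toy_porter(w))
```

The condition is what keeps "sing" intact: stripping ING would leave "s", which contains no vowel, so the rule does not fire.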

  42. FST Morphological Analysis • Focus on English morphology • FSA acceptor: • cats → yes; foxes → yes; childs → no • FST morphological analyzer: • fox + N + pl → fox^s# • FST for orthographic rules: • fox^s# → foxes#
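The acceptor behaviour (cats → yes; foxes → yes; childs → no) can be sketched as a tiny finite-state automaton over morpheme classes. The state names, the word lists, and the `segmentations` helper that splits off a plural ending are all assumptions made for this illustration; a real acceptor would work character by character and be compiled from a full lexicon.

```python
REG_NOUN = {"cat", "dog", "fox"}
IRREG_SG_NOUN = {"child", "goose"}
IRREG_PL_NOUN = {"children", "geese"}

# Transition table over morpheme classes: state -> {class: next state}.
TRANSITIONS = {
    "q0": {"reg-noun": "q1", "irreg-sg-noun": "q2", "irreg-pl-noun": "q2"},
    "q1": {"plural-s": "q2"},
}
ACCEPTING = {"q1", "q2"}

def classify(morpheme):
    """Return the morpheme class, or None if the morpheme is unknown."""
    if morpheme == "+s":
        return "plural-s"
    if morpheme in REG_NOUN:
        return "reg-noun"
    if morpheme in IRREG_SG_NOUN:
        return "irreg-sg-noun"
    if morpheme in IRREG_PL_NOUN:
        return "irreg-pl-noun"
    return None

def segmentations(word):
    """Yield candidate splits: the whole word, and stem plus plural suffix."""
    yield [word]
    if word.endswith("es") and word[:-2].endswith(("s", "z", "x", "sh", "ch")):
        yield [word[:-2], "+s"]
    elif word.endswith("s"):
        yield [word[:-1], "+s"]

def accepts(word):
    """True if some segmentation drives the automaton into an accepting state."""
    for morphemes in segmentations(word):
        state = "q0"
        for m in morphemes:
            state = TRANSITIONS.get(state, {}).get(classify(m))
            if state is None:
                break
        else:
            if state in ACCEPTING:
                return True
    return False

for w in ["cats", "foxes", "childs", "geese"]:
    print(w, accepts(w))
```

"childs" is rejected because the irregular singular "child" leads to a state with no outgoing plural-s transition.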

  45. Morphological Analysis Components • Lexicon: List of stems and affixes • E.g.: cat: N • -s: Pl • Morphotactics: Model of morpheme ordering • Association with classes, affix ordering • E.g. Pl follows N • Orthographic rules: Spelling rules • Changes when morphemes combine • E.g. y → ie in try + s
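The three components can be kept as three separate pieces of data or code, as in the sketch below. The table names, the `3sg` tag for the verb "try", and the `combine` and `spell` helpers are hypothetical choices for this example; only "cat: N", "-s: Pl", "Pl follows N", and the y → ie change come from the slide.

```python
import re

LEXICON = {"cat": "N", "try": "V"}            # stems -> class
AFFIXES = {"Pl": "s", "3sg": "s"}             # affix tag -> spelling (3sg is an assumed tag)
MORPHOTACTICS = {"N": {"Pl"}, "V": {"3sg"}}   # which affix tags may follow each class

def spell(intermediate):
    """Orthographic rule: y -> ie before ^s when y follows a consonant, then drop markers."""
    out = re.sub(r"(?<=[^aeiou])y\^s#", "ies#", intermediate)
    return out.replace("^", "").replace("#", "")

def combine(stem, affix_tag):
    """Look up the stem, check morpheme ordering, then apply the spelling rule."""
    stem_class = LEXICON.get(stem)
    if stem_class is None:
        raise KeyError(f"unknown stem: {stem}")
    if affix_tag not in MORPHOTACTICS[stem_class]:
        raise ValueError(f"{affix_tag} may not follow {stem_class}")
    return spell(f"{stem}^{AFFIXES[affix_tag]}#")

print(combine("cat", "Pl"))   # lexicon + morphotactics, no spelling change needed
print(combine("try", "3sg"))  # spelling rule fires on try^s#
```

Keeping the spelling rule separate from the lexicon means the same rule covers every stem ending in consonant + y without listing the inflected forms.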

  49. Example • Goal: foxes → fox + N + Pl • Surface: foxes • Orthographic rules • Intermediate: fox^s • Lexicon + morphotactics • Lexical: fox + N + Pl

  50. Multiple Levels • Generation and Analysis • Generation: fox + N + Pl → fox^s#; fox^s# → foxes# • Analysis: foxes# → fox^s#; fox^s# → fox + N + Pl
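The two-stage mapping in both directions can be sketched with ordinary functions standing in for the transducers. This is an illustration under stated assumptions: the stem list, the function names, and the compact notation "fox+N+Pl" (no spaces) are choices made here, and a real system would compose finite-state transducers rather than call regular expressions.

```python
import re

STEMS = {"fox": "N", "cat": "N"}  # toy lexicon, nouns only

def lexical_to_intermediate(lexical):
    """fox+N+Pl -> fox^s#  (lexicon + morphotactics)."""
    stem, *tags = lexical.split("+")
    if STEMS.get(stem) != "N" or tags not in (["N"], ["N", "Pl"]):
        raise ValueError(f"not generable: {lexical}")
    return stem + ("^s" if tags == ["N", "Pl"] else "") + "#"

def intermediate_to_surface(intermediate):
    """fox^s# -> foxes#  (insert e after s, z, x, ch, sh; then drop the ^ boundary)."""
    with_e = re.sub(r"(s|z|x|ch|sh)\^s#", r"\1es#", intermediate)
    return with_e.replace("^", "")

def surface_to_intermediate(surface):
    """foxes# -> candidate intermediate forms; the lexicon filters them later."""
    candidates = [surface]
    if surface.endswith("s#"):
        candidates.append(surface[:-2] + "^s#")
    if re.search(r"(s|z|x|ch|sh)es#$", surface):
        candidates.append(surface[:-3] + "^s#")
    return candidates

def intermediate_to_lexical(intermediate):
    """fox^s# -> fox+N+Pl, or None if the stem is not in the lexicon."""
    stem, boundary, suffix = intermediate.rstrip("#").partition("^")
    if stem not in STEMS:
        return None
    if not boundary:
        return f"{stem}+N"
    return f"{stem}+N+Pl" if suffix == "s" else None

def generate(lexical):
    return intermediate_to_surface(lexical_to_intermediate(lexical))

def analyze(surface):
    results = []
    for candidate in surface_to_intermediate(surface):
        lexical = intermediate_to_lexical(candidate)
        if lexical is not None:
            results.append(lexical)
    return results

print(generate("fox+N+Pl"))
print(analyze("foxes#"))
```

Analysis over-generates intermediate candidates (including "foxe^s#") and relies on the lexicon to discard the ones with unknown stems, which mirrors why the orthographic and lexical levels are kept separate.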
