1 / 47

CSC 9010 Natural Language Processing Lecture 3: Morphology, Finite State Transducers Paula Matuszek Mary-Angela Papalask

CSC 9010 Natural Language Processing Lecture 3: Morphology, Finite State Transducers Paula Matuszek Mary-Angela Papalaskari. Presentation slides adapted from: Marti Hearst (some following Dorr and Habash) http://www.sims.berkeley.edu/courses/is290-2/f04/ and

benjamin
Télécharger la présentation

CSC 9010 Natural Language Processing Lecture 3: Morphology, Finite State Transducers Paula Matuszek Mary-Angela Papalask

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. CSC 9010Natural Language ProcessingLecture 3: Morphology, Finite State TransducersPaula MatuszekMary-Angela Papalaskari Presentation slides adapted from: Marti Hearst (some following Dorr and Habash) http://www.sims.berkeley.edu/courses/is290-2/f04/ and Jim Martin: http://www.cs.colorado.edu/~martin/csci5832.html CSC 9010- NLP - 3: Morphology, Finite State Transducers

  2. Today • Elementary Morphology • Computational morphology • Finite State Transducers • Lexicon-only schemes • Rule-only schemes • Lab: Introduction to NLTK CSC 9010- NLP - 3: Morphology, Finite State Transducers

  3. Morphology • Morphology: • The study of the way words are built up from smaller meaning units. • Morphemes: • The smallest meaningful unit in the grammar of a language. • Contrasts: • Derivational vs. Inflectional • Regular vs. Irregular • Concatinative vs. Templatic (root-and-pattern) • A useful resource: • Glossary of linguistic terms by Eugene Loos • http://www.sil.org/linguistics/GlossaryOfLinguisticTerms/contents.htm CSC 9010- NLP - 3: Morphology, Finite State Transducers

  4. Examples (English) • “unladylike” • 3 morphemes, 4 syllables un- ‘not’ lady ‘(well behaved) female adult human’ -like ‘having the characteristics of’ • Can’t break any of these down further without distorting the meaning of the units • “technique” • 1 morpheme, 2 syllables • “dogs” • 2 morphemes, 1 syllable -s, a plural marker on nouns CSC 9010- NLP - 3: Morphology, Finite State Transducers

  5. Morpheme Definitions • Root • The portion of the word that: • is common to a set of derived or inflected forms, if any, when all affixes are removed • is not further analyzable into meaningful elements • carries the principal portion of meaning of the words • Stem • The root or roots of a word, together with any derivational affixes, to which inflectional affixes are added. • Affix • A bound morpheme that is joined before, after, or within a root or stem. • Clitic • a morpheme that functions syntactically like a word, but does not appear as an independent phonological word • Spanish: un beso, las aguas • English: Hal’s (genetive marker) • Proto-European: Kwe -que (Latin), te (Greek), and –ca (Sanskrit) CSC 9010- NLP - 3: Morphology, Finite State Transducers

  6. Inflectional vs. Derivational • Word Classes • Parts of speech: noun, verb, adjectives, etc. • Word class dictates how a word combines with morphemes to form new words • Inflection: • Variation in the form of a word, typically by means of an affix, that expresses a grammatical contrast. • Doesn’t change the word class • Usually produces a predictable, non-idiosyncratic change of meaning. • Derivation: • The formation of a new word or inflectable stem from another word or stem. CSC 9010- NLP - 3: Morphology, Finite State Transducers

  7. Inflectional Morphology • Adds: • tense, number, person, mood, aspect • Word class doesn’t change • Word serves new grammatical role • Examples • come is inflected for person and number: The pizza guy comes at noon. • las and rojas are inflected for agreement with manzanas in grammatical gender by -a and in number by –s las manzanas rojas (‘the red apples’) CSC 9010- NLP - 3: Morphology, Finite State Transducers

  8. Derivational Morphology • Nominalization (formation of nouns from other parts of speech, primarily verbs in English): • computerization • appointee • killer • fuzziness • Formation of adjectives (primarily from nouns) • computational • clueless • Embraceable • Diffulcult cases: • building  from which sense of “build”? CSC 9010- NLP - 3: Morphology, Finite State Transducers

  9. Concatinative Morphology • Morpheme+Morpheme+Morpheme+… • Stems: also called lemma, base form, root, lexeme • hope+ing  hoping hop  hopping • Affixes • Prefixes: Antidisestablishmentarianism • Suffixes: Antidisestablishmentarianism • Infixes: hingi (borrow) – humingi (borrower) in Tagalog • Circumfixes: sagen (say) – gesagt (said) in German • Agglutinative Languages • uygarlaştıramadıklarımızdanmışsınızcasına (Turkish) • uygar+laş+tır+ama+dık+lar+ımız+dan+mış+sınız+casına • Behaving as if you are among those whom we could not cause to become civilized CSC 9010- NLP - 3: Morphology, Finite State Transducers

  10. Templatic Morphology • Roots and Patterns • Example: Hebrew verbs • Root: • Consists of 3 consonants CCC • Carries basic meaning • Template: • Gives the ordering of consonants and vowels • Specifies semantic information about the verb • Active, passive, middle voice • Example: • lmd (to learn or study) • CaCaC -> lamad (he studied) • CiCeC -> limed (he taught) • CuCaC -> lumad (he was taught) CSC 9010- NLP - 3: Morphology, Finite State Transducers

  11. Nouns and Verbs (in English) • Nouns have simple inflectional morphology • cat • cat+s, cat+’s • Verbs have more complex morphology CSC 9010- NLP - 3: Morphology, Finite State Transducers

  12. Nouns and Verbs (in English) • Nouns • Have simple inflectional morphology • Cat/Cats • Mouse/Mice, Ox, Oxen, Goose, Geese • Verbs • More complex morphology • Walk/Walked • Go/Went, Fly/Flew CSC 9010- NLP - 3: Morphology, Finite State Transducers

  13. Regular (English) Verbs CSC 9010- NLP - 3: Morphology, Finite State Transducers

  14. Irregular (English) Verbs CSC 9010- NLP - 3: Morphology, Finite State Transducers

  15. “To love” in Spanish CSC 9010- NLP - 3: Morphology, Finite State Transducers

  16. Syntax and Morphology • Phrase-level agreement • Subject-Verb • John studies hard (STUDY+3SG) • Noun-Adjective • Las vacas hermosas • Sub-word phrasal structures • שבספרינו • ש+ב+ספר+ים+נו • That+in+book+PL+Poss:1PL • Which are in our books CSC 9010- NLP - 3: Morphology, Finite State Transducers

  17. Phonology and Morphology • Script Limitations • Spoken English has 14 vowels • heed hid hayed head had hoed hood who’d hide how’d taught Tut toyenough • English Alphabet has 5 • Use vowel combinatios: far fair fare • Consonantal doubling (hopping vs. hoping) CSC 9010- NLP - 3: Morphology, Finite State Transducers

  18. Computational Morphology • Approaches • Lexicon only • Rules only • Lexicon and Rules • Finite-state Automata • Finite-state Transducers • Systems • WordNet’s morphy • PCKimmo • Named after Kimmo Koskenniemi, much work done by Lauri Karttunen, Ron Kaplan, and Martin Kay • Accurate but complex • http://www.sil.org/pckimmo/ • Two-level morphology • Commercial version available from InXight Corp. • Background • Chapter 3 of Jurafsky and Martin • A short history of Two-Level Morphology • http://www.ling.helsinki.fi/~koskenni/esslli-2001-karttunen/ CSC 9010- NLP - 3: Morphology, Finite State Transducers

  19. Computational Morphology WORD STEM (+FEATURES)* • cats cat +N +PL • cat cat +N +SG • cities city +N +PL • geese goose +N +PL • ducks (duck +N +PL) or (duck +V +3SG) • merging merge +V +PRES-PART • caught (catch +V +PAST-PART) or (catch +V +PAST) CSC 9010- NLP - 3: Morphology, Finite State Transducers

  20. FSAs and the Lexicon • First we’ll capture the morphotactics • The rules governing the ordering of affixes in a language. • Then we’ll add in the actual words CSC 9010- NLP - 3: Morphology, Finite State Transducers

  21. Simple Rules CSC 9010- NLP - 3: Morphology, Finite State Transducers

  22. Adding the Words CSC 9010- NLP - 3: Morphology, Finite State Transducers

  23. Derivational Rules CSC 9010- NLP - 3: Morphology, Finite State Transducers

  24. Parsing/Generation vs. Recognition • Recognition is usually not quite what we need. • Usually if we find some string in the language we need to find the structure in it (parsing) • Or we have some structure and we want to produce a surface form (production/generation) • Example • From “cats” to “cat +N +PL”and back • Morphological analysis CSC 9010- NLP - 3: Morphology, Finite State Transducers

  25. Finite State Transducers • The simple story • Add another tape • Add extra symbols to the transitions • On one tape we read “cats”, on the other we write “cat +N +PL”, or the other way around. CSC 9010- NLP - 3: Morphology, Finite State Transducers

  26. FSTs CSC 9010- NLP - 3: Morphology, Finite State Transducers

  27. +N:ε +PL:s c:c a:a t:t Transitions • c:c means read a c on one tape and write a c on the other • +N:ε means read a +N symbol on one tape and write nothing on the other • +PL:s means read +PL and write an s CSC 9010- NLP - 3: Morphology, Finite State Transducers

  28. Ambiguity • Recall that in non-deterministic recognition multiple paths through a machine may lead to an accept state. • Didn’t matter which path was actually traversed • In FSTs the path to an accept state does matter since different paths represent different parses and different outputs will result CSC 9010- NLP - 3: Morphology, Finite State Transducers

  29. Ambiguity • What’s the right parse for • Unionizable • Union-ize-able • Un-ion-ize-able • Each represents a valid path through the derivational morphology machine. CSC 9010- NLP - 3: Morphology, Finite State Transducers

  30. Ambiguity • There are a number of ways to deal with this problem • Simply take the first output found • Find all the possible outputs (all paths) and return them all (without choosing) • Bias the search so that only one or a few likely paths are explored CSC 9010- NLP - 3: Morphology, Finite State Transducers

  31. The Gory Details • Of course, its not as easy as • “cat +N +PL” <-> “cats” • As we saw earlier there are geese, mice and oxen • But there are also a whole host of spelling/pronunciation changes that go along with inflectional changes • Cats vs Dogs Multi-tape machines CSC 9010- NLP - 3: Morphology, Finite State Transducers

  32. Multi-Level Tape Machines • We use one machine to transduce between the lexical and the intermediate level, and another to handle the spelling changes to the surface tape CSC 9010- NLP - 3: Morphology, Finite State Transducers

  33. Lexical to Intermediate Level CSC 9010- NLP - 3: Morphology, Finite State Transducers

  34. Intermediate to Surface • The add an “e” rule as in fox^s# <-> foxes CSC 9010- NLP - 3: Morphology, Finite State Transducers

  35. Foxes CSC 9010- NLP - 3: Morphology, Finite State Transducers

  36. Foxes CSC 9010- NLP - 3: Morphology, Finite State Transducers

  37. FST Review • FSTs allow us to take an input and deliver a structure based on it • Or… take a structure and create a surface form • Or take a structure and create another structure In many applications its convenient to decompose the problem into a set of cascaded transducers where • The output of one feeds into the input of the next. • We’ll see this scheme again for deeper semantic processing. CSC 9010- NLP - 3: Morphology, Finite State Transducers

  38. Overall Plan CSC 9010- NLP - 3: Morphology, Finite State Transducers

  39. Lexicon-only Morphology • The lexicon lists all surface level and lexical level pairs • No rules … • Analysis/Generation is easy • Very large for English • What about • Arabic or • Turkish or • Chinese? acclaim acclaim $N$ acclaim acclaim $V+0$ acclaimed acclaim $V+ed$ acclaimed acclaim $V+en$ acclaiming acclaim $V+ing$ acclaims acclaim $N+s$ acclaims acclaim $V+s$ acclamation acclamation $N$ acclamations acclamation $N+s$ acclimate acclimate $V+0$ acclimated acclimate $V+ed$ acclimated acclimate $V+en$ acclimates acclimate $V+s$ acclimating acclimate $V+ing$ CSC 9010- NLP - 3: Morphology, Finite State Transducers

  40. Stemming vs Morphology • Sometimes you just need to know the stem of a word and you don’t care about the structure. • In fact you may not even care if you get the right stem, as long as you get a consistent string. • This is stemming… it most often shows up in IR applications CSC 9010- NLP - 3: Morphology, Finite State Transducers

  41. Stemming in IR • Run a stemmer on the documents to be indexed • Run a stemmer on users queries • Match • This is basically a form of hashing • Example: Computerization • ization -> -ize computerize • ize -> εcomputer CSC 9010- NLP - 3: Morphology, Finite State Transducers

  42. Porter Stemmer Step 4: Derivational Morphology I: Multiple Suffixes (m>0) ATIONAL -> ATE relational -> relate (m>0) TIONAL -> TION conditional -> condition rational -> rational (m>0) ENCI -> ENCE valenci -> valence (m>0) ANCI -> ANCE hesitanci -> hesitance (m>0) IZER -> IZE digitizer -> digitize (m>0) ABLI -> ABLE conformabli -> conformable (m>0) ALLI -> AL radicalli -> radical (m>0) ENTLI -> ENT differentli -> different (m>0) ELI -> E vileli - > vile (m>0) OUSLI -> OUS analogousli -> analogous (m>0) IZATION -> IZE vietnamization -> vietnamize (m>0) ATION -> ATE predication -> predicate (m>0) ATOR -> ATE operator -> operate (m>0) ALISM -> AL feudalism -> feudal (m>0) IVENESS -> IVE decisiveness -> decisive (m>0) FULNESS -> FUL hopefulness -> hopeful (m>0) OUSNESS -> OUS callousness -> callous (m>0) ALITI -> AL formaliti -> formal (m>0) IVITI -> IVE sensitiviti -> sensitive (m>0) BILITI -> BLE sensibiliti -> sensible CSC 9010- NLP - 3: Morphology, Finite State Transducers

  43. Porter • No lexicon needed • Basically a set of staged sets of rewrite rules that strip suffixes • Handles both inflectional and derivational suffixes Doesn’t guarantee that the resulting stem is really a stem (see first bullet) Lack of guarantee doesn’t matter for IR CSC 9010- NLP - 3: Morphology, Finite State Transducers

  44. Porter Stemmer • Errors of Omission • European Europe • analysis analyzes • matrices matrix • noise noisy • explain explanation • Errors of Commission • organization organ • doing doe • generalization generic • numerical numerous • university universe CSC 9010- NLP - 3: Morphology, Finite State Transducers

  45. Soundex • You work as the Villanova telephone operator. Someone calls looking for: Dr Papalarsky or Dr Matuzka • ???????? What do you type as your query string? CSC 9010- NLP - 3: Morphology, Finite State Transducers

  46. Soundex • Keep the first letter • Drop non-initial occurrences of vowels, h, w and y • Replace the remaining letters with numbers according to group (e.g.. b, f, p, and v -> 1 • Replace strings of identical numbers with a single number (333 -> 3) • Drop any numbers beyond a third one CSC 9010- NLP - 3: Morphology, Finite State Transducers

  47. Soundex • Effect is to map (hash) all similar sounding transcriptions to the same code. • Structure your directory so that it can be accessed by code as well as by correct spelling • Used for census records, phone directories, author searches in libraries etc. CSC 9010- NLP - 3: Morphology, Finite State Transducers

More Related