Morphology: Words and their Parts

Morphology: Words and their Parts • CS 4705 • Slides adapted from Jurafsky, Martin, Hirschberg and Dorr.

English Morphology • Morphology is the study of the ways that words are built up from smaller meaningful units called morphemes • We can usefully divide morphemes into two classes • Stems: The core meaning bearing units • Affixes: Bits and pieces that adhere to stems to change their meanings and grammatical functions

Nouns and Verbs (English) • Nouns are simple (not really) • Markers for plural and possessive • Verbs are only slightly more complex • Markers appropriate to the tense of the verb

Regulars and Irregulars • Ok so it gets a little complicated by the fact that some words misbehave (refuse to follow the rules) • Mouse/mice, goose/geese, ox/oxen • Go/went, fly/flew • The terms regular and irregular will be used to refer to words that follow the rules and those that don’t.

Regular and Irregular Nouns and Verbs • Regulars… • Walk, walks, walking, walked, walked • Table, tables • Irregulars • Eat, eats, eating, ate, eaten • Catch, catches, catching, caught, caught • Cut, cuts, cutting, cut, cut • Goose, geese

Why care about morphology? • Spelling correction: referece • Morphology in machine translation • Spanish words quiero and quieres are both related to querer ‘want’ • Hyphenation algorithms: refer-ence • Part-of-speech analysis: google, googler • Text-to-speech: grapheme-to-phoneme conversion • hothouse (/T/ or /D/) • Allows us to guess at meaning • ‘Twas brillig and the slithy toves… • Muggles moogled migwiches

Concatenative Morphology • Morpheme+Morpheme+Morpheme+… • Stems: often called lemma, base form, root, lexeme • hope+ing --> hoping, hop+ing --> hopping • Affixes • Prefixes: anti-, dis- in Antidisestablishmentarianism • Suffixes: -ment, -arian, -ism in Antidisestablishmentarianism • Infixes: hingi (borrow) – humingi (borrower) in Tagalog • Circumfixes: sagen (say) – gesagt (said) in German

What useful information does morphology give us? • Different things in different languages • Spanish: hablo, hablaré / English: I speak, I will speak • English: book, books / Japanese: hon, hon • Languages differ in how they encode morphological information • Isolating languages (e.g. Cantonese) have no affixes: each word usually has 1 morpheme • Agglutinative languages (e.g. Finnish, Turkish) build words from prefixes and suffixes added to a stem (like beads on a string), with each feature realized by a single affix, e.g. Finnish

epäjärjestelmällistyttämättömyydellänsäkäänköhän ‘Wonder if he can also ... with his capability of not causing things to be unsystematic’ • Inflectional languages (e.g. English) merge different features into a single affix (e.g. ‘s’ in likes indicates both person and tense); and the same feature can be realized by different affixes • Polysynthetic languages (e.g. Inuit languages) express much of their syntax in their morphology, incorporating a verb’s arguments into the verb, e.g. Western Greenlandic Aliikusersuillammassuaanerartassagaluarpaalli. • aliiku-sersu-i-llammas-sua-a-nerar-ta-ssa-galuar-paal-li • entertainment-provide-SEMITRANS-one.good.at-COP-say.that-REP-FUT-sure.but-3.PL.SUBJ/3SG.OBJ-but • 'However, they will say that he is a great entertainer, but ...' • So… different languages may require very different morphological analyzers

What we want • Something to automatically do the following kinds of mappings: • Cats --> cat +N +PL • Cat --> cat +N +SG • Cities --> city +N +PL • Merging --> merge +V +Present-participle • Caught --> catch +V +past-participle
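The mappings above can be sketched as a small lookup-plus-rules analyzer. This is only an illustration under stated assumptions: the word lists, the `analyze` name, and the three suffix rules are hypothetical and cover just the five examples on the slide.

```python
# Hypothetical mini-lexicon covering only the slide's examples.
NOUNS = {"cat", "city"}
VERBS = {"merge"}
IRREGULAR = {"caught": "catch +V +past-participle"}


def analyze(word):
    """Map a surface word to 'stem +POS +feature', or None if unknown."""
    w = word.lower()
    if w in IRREGULAR:                      # irregular forms are simply listed
        return IRREGULAR[w]
    if w in NOUNS:                          # bare noun stem: singular
        return f"{w} +N +SG"
    if w.endswith("ies") and w[:-3] + "y" in NOUNS:
        return f"{w[:-3]}y +N +PL"          # cities: undo the y --> ie change
    if w.endswith("s") and w[:-1] in NOUNS:
        return f"{w[:-1]} +N +PL"           # cats: strip plural -s
    if w.endswith("ing"):
        # merging: try the bare stem, then the stem with a restored final e
        for stem in (w[:-3], w[:-3] + "e"):
            if stem in VERBS:
                return f"{stem} +V +Present-participle"
    return None
```

For example, `analyze("Cities")` yields `city +N +PL`. Every new spelling change needs another hand-written branch here, which is the motivation for the finite-state machinery introduced later.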

Morphology Can Help Define Word Classes • AKA morphological classes, parts-of-speech • Closed vs. open (function vs. content) class words • Pronoun, preposition, conjunction, determiner,… • Noun, verb, adverb, adjective,… • Identifying word classes is useful for almost any task in NLP, from translation to speech recognition to topic detection…very basic semantics

(English) Inflectional Morphology • Word stem + grammatical morpheme --> different forms of same word • Usually produces word of same class • Usually serves a syntactic or grammatical function (e.g. agreement) • like --> likes or liked • bird --> birds • Nominal morphology • Plural forms • s or es • Irregular forms (goose/geese)

Mass vs. count nouns (fish/fish(es), email or emails?) • Possessives (cat’s, cats’) • Verbal inflection • Main verbs (sleep, like, fear) relatively regular • -s, -ing, -ed • And productive: emailed, instant-messaged, faxed, homered • But some are not: • eat/ate/eaten, catch/caught/caught • Primary (be, have, do) and modal verbs (can, will, must) often irregular and not productive • Be: am/is/are/were/was/been/being • Irregular verbs few (~250) but frequently occurring

Derivational Morphology • Word stem + syntactic/grammatical morpheme --> new words • Usually produces word of different class • Incomplete process: derivational morphs cannot be applied to just any member of a class • Verbs --> nouns • -ize verbs --> -ation nouns • generalize, realize --> generalization, realization • synthesize but not *synthesization

Verbs, nouns --> adjectives • embrace, pity --> embraceable, pitiable • care, wit --> careless, witless • Adjective --> adverb • happy --> happily • Process selective in unpredictable ways • Less productive: nerveless/*evidence-less, malleable/*sleep-able, rar-ity/*rareness • Meanings of derived terms harder to predict by rule • clueless, careless, nerveless, sleepless

Compounding • Two base forms join to form a new word • Bedtime, Wienerschnitzel, Rotwein • Careful? Compound or derivation?

Morphotactics • What are the ‘rules’ for constructing a word in a given language? • Pseudo-intellectual vs. *intellectual-pseudo • Rational-ize vs *ize-rational • Cretin-ous vs. *cretin-ly vs. *cretin-acious

Semantics: In English, un- cannot attach to adjectives that already have a negative connotation: • Unhappy vs. *unsad • Unhealthy vs. *unsick • Unclean vs. *undirty • Phonology: In English, -er cannot attach to words of more than two syllables • great, greater • Happy, happier • Competent, *competenter • Elegant, *eleganter • Unruly, ?unrulier

Morphological Parsing • These regularities enable us to create software to parse words into their component parts

Morphology and FSAs • We’d like to use the machinery provided by FSAs to capture facts about morphology • I.e., accept strings that are in the language • And reject strings that are not • And do it in a way that doesn’t require us to in effect list all the words in the language

What do we need to build a morphological parser? • Lexicon: list of stems and affixes (w/ corresponding p.o.s.) • Morphotactics of the language: model of how and which morphemes can be affixed to a stem • Orthographic rules: spelling modifications that may occur when affixation occurs • in --> il in context of l (in- + legal --> illegal) • Most morphological phenomena can be described with regular expressions, so finite state techniques are often used to represent morphological processes
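The orthographic rule mentioned above (in --> il before l) can be written as a single regular-expression rewrite. This is a minimal sketch: the `attach_in` name and the use of `^` as a morpheme-boundary marker are assumptions made for illustration, and only the one context named on the slide is handled.

```python
import re


def attach_in(stem):
    """Attach the prefix in- to a stem, rewriting n --> l before a stem-initial l."""
    form = "in^" + stem                       # ^ marks the morpheme boundary (assumed notation)
    form = re.sub(r"^in\^(?=l)", "il^", form)  # the spelling rule: in --> il / __ l
    return form.replace("^", "")              # drop the boundary marker


print(attach_in("legal"))   # rule applies
print(attach_in("active"))  # rule does not apply; the form passes through unchanged
```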

Start Simple • Regular singular nouns are ok • Regular plural nouns have an -s on the end • Irregulars are ok as is

Simple Rules

Now Add in the Words
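A recognizer for this noun machine can be sketched with a transition table over morpheme classes. The state names, class names, and word lists below are assumptions chosen for illustration; they are not taken from the slide's figure.

```python
# Hypothetical lexicon: morpheme class --> members.
LEXICON = {
    "reg-noun": {"cat", "dog", "table"},
    "irreg-sg-noun": {"goose", "mouse", "ox"},
    "irreg-pl-noun": {"geese", "mice", "oxen"},
    "plural-s": {"s"},
}

# Morphotactics: (state, morpheme class) --> next state.
TRANSITIONS = {
    ("q0", "reg-noun"): "q1",
    ("q0", "irreg-sg-noun"): "q2",
    ("q0", "irreg-pl-noun"): "q2",
    ("q1", "plural-s"): "q2",
}
ACCEPTING = {"q1", "q2"}  # a bare regular noun (q1) is also a word


def accepts(word, state="q0"):
    """Return True if some path through the FSA consumes the whole word."""
    if word == "":
        return state in ACCEPTING
    for (src, cls), dst in TRANSITIONS.items():
        if src != state:
            continue
        for morph in LEXICON[cls]:
            # Try every morpheme of this class that is a prefix of the input.
            if word.startswith(morph) and accepts(word[len(morph):], dst):
                return True
    return False
```

With this table `accepts("cats")` and `accepts("geese")` succeed while `accepts("gooses")` fails, because no arc leaves the state reached after an irregular noun.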

Derivational morphology: adjective fragment • [FSA diagram, states q0 to q5: adj-root1 takes an optional un- prefix and the suffixes -er, -ly, -est; adj-root2 takes only -er, -est] • Adj-root1: clear, happi, real (clearly) • Adj-root2: big, red (*bigly)
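Because finite-state machines and regular expressions describe the same languages, the adjective fragment can also be sketched as one regular expression. The root lists come from the slide; treating a bare root as acceptable and letting un- attach only to adj-root1 are assumptions about the lost diagram. Spelling rules are ignored, so the machine works on stems such as happi and does not know about forms like bigger.

```python
import re

ADJ_ROOT1 = ["clear", "happi", "real"]  # roots that also allow -ly (and un-)
ADJ_ROOT2 = ["big", "red"]              # roots that allow only -er, -est

ADJ_PATTERN = re.compile(
    r"(?:un)?(?:%s)(?:er|ly|est)?|(?:%s)(?:er|est)?"
    % ("|".join(ADJ_ROOT1), "|".join(ADJ_ROOT2))
)


def accepts_adjective(word):
    """True if the whole word is generated by the adjective fragment."""
    return ADJ_PATTERN.fullmatch(word) is not None
```

Splitting the roots into two classes is what lets the machine accept clearly while rejecting *bigly.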

Parsing/Generation vs. Recognition • We can now run strings through these machines to recognize strings in the language • Accept words that are ok • Reject words that are not • But recognition is usually not quite what we need • Often if we find some string in the language we might like to find the structure in it (parsing) • Or we have some structure and we want to produce a surface form (production/generation) • Example • From “cats” to “cat +N +PL”

Finite State Transducers • The simple story • Add another tape • Add extra symbols to the transitions • On one tape we read “cats”, on the other we write “cat +N +PL”

Applications • The kind of parsing we’re talking about is normally called morphological analysis • It can either be • An important stand-alone component of an application (spelling correction, information retrieval) • Or simply a link in a chain of processing

FSTs • Kimmo Koskenniemi’s two-level morphology • Idea: word is a relationship between lexical level (its morphemes) and surface level (its orthography)

Transitions • [FST diagram with arcs c:c, a:a, t:t, +N:ε, +PL:s] • c:c means read a c on one tape and write a c on the other • +N:ε means read a +N symbol on one tape and write nothing on the other • +PL:s means read +PL and write an s
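The arcs described above can be sketched as a list of (state, lexical symbol, surface symbol, next state) tuples. The state numbering and the `generate` function are illustrative assumptions, and the +SG:ε arc is added here by analogy with the earlier "cat +N +SG" mapping rather than taken from this slide.

```python
EPS = ""  # epsilon: write nothing

# Each arc: (source state, lexical symbol read, surface symbol written, target state)
ARCS = [
    (0, "c", "c", 1),
    (1, "a", "a", 2),
    (2, "t", "t", 3),
    (3, "+N", EPS, 4),   # +N:ε  read +N, write nothing
    (4, "+PL", "s", 5),  # +PL:s read +PL, write s
    (4, "+SG", EPS, 5),  # assumed arc for the singular
]
FINAL = {5}


def generate(lexical_symbols):
    """Read lexical symbols on one tape and write the surface string on the other."""
    state, out = 0, []
    for sym in lexical_symbols:
        for src, lex, surf, dst in ARCS:
            if src == state and lex == sym:
                out.append(surf)
                state = dst
                break
        else:
            return None  # no arc for this symbol: reject
    return "".join(out) if state in FINAL else None
```

Calling `generate(["c", "a", "t", "+N", "+PL"])` walks the five arcs and writes `cats`.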

Typical Uses • Typically, we’ll read from one tape using the first symbol on the machine transitions (just as in a simple FSA). • And we’ll write to the second tape using the other symbols on the transitions. • In general, FSTs can be used for • Translators (Hello:Ciao) • Parser/generators (Hello:How may I help you?) • As well as Kimmo-style morphological parsing

Ambiguity • Recall that in non-deterministic recognition multiple paths through a machine may lead to an accept state. • Didn’t matter which path was actually traversed • In FSTs the path to an accept state does matter since different paths represent different parses and different outputs will result

Ambiguity • What’s the right parse (segmentation) for • Unionizable • Union-ize-able • Un-ion-ize-able • Each represents a valid path through the derivational morphology machine.

Ambiguity • There are a number of ways to deal with this problem • Simply take the first output found • Find all the possible outputs (all paths) and return them all (without choosing) • Bias the search so that only one or a few likely paths are explored
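The first two strategies can be illustrated by enumerating every segmentation of a word. The morpheme tables below are hypothetical and just large enough for the unionizable example; the surface string "iz" stands for -ize with its e dropped before a vowel, which is a simplification made here.

```python
# Hypothetical surface form --> morpheme tables.
PREFIXES = {"un": "un-"}
STEMS = {"union": "union", "ion": "ion"}
SUFFIXES = {"iz": "-ize", "able": "-able"}


def parses(word):
    """Return all segmentations of word as prefix? + stem + suffix*."""
    results = []

    def take_suffixes(rest, acc):
        if not rest:                      # whole word consumed: one complete parse
            results.append(acc)
            return
        for surf, morph in SUFFIXES.items():
            if rest.startswith(surf):
                take_suffixes(rest[len(surf):], acc + [morph])

    def take_stem(rest, acc):
        for surf, morph in STEMS.items():
            if rest.startswith(surf):
                take_suffixes(rest[len(surf):], acc + [morph])

    take_stem(word, [])                   # paths without a prefix
    for surf, morph in PREFIXES.items():  # paths with a prefix
        if word.startswith(surf):
            take_stem(word[len(surf):], [morph])
    return results
```

Here `parses("unionizable")` returns both paths; taking the first output corresponds to `parses(word)[0]`, returning them all corresponds to using the full list. Biasing the search would need extra information, such as path weights, that this sketch does not model.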

The Gory Details • Of course, it’s not as easy as • “cat +N +PL” <-> “cats” • As we saw earlier there are geese, mice and oxen • But there are also a whole host of spelling/pronunciation changes that go along with inflectional changes • Cats vs Dogs • Fox and Foxes

Multi-Tape Machines • To deal with this we can simply add more tapes and use the output of one tape machine as the input to the next • So to handle irregular spelling changes we’ll add intermediate tapes with intermediate symbols

Generativity • Nothing really privileged about the directions. • We can write from one and read from the other or vice-versa. • One way is generation, the other way is analysis
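That neither direction is privileged can be shown by swapping the two symbols on every arc: the same machine then runs as an analyzer instead of a generator. The sketch below assumes a small acyclic cat transducer with hypothetical state numbers, and a depth-first `transduce` helper that follows epsilon-input arcs without consuming input.

```python
EPS = ""

# (source state, input symbol, output symbol, target state); lexical side is the input here.
ARCS = [
    (0, "c", "c", 1),
    (1, "a", "a", 2),
    (2, "t", "t", 3),
    (3, "+N", EPS, 4),
    (4, "+PL", "s", 5),
    (4, "+SG", EPS, 5),
]
FINAL = {5}


def invert(arcs):
    """Swap input and output on every arc, turning generation into analysis."""
    return [(src, out, inp, dst) for src, inp, out, dst in arcs]


def transduce(arcs, symbols, state=0, out=()):
    """Return every output sequence the machine can write for the input symbols."""
    results = []
    if not symbols and state in FINAL:
        results.append([o for o in out if o != EPS])
    for src, inp, o, dst in arcs:
        if src != state:
            continue
        if inp == EPS:                       # epsilon input: move without reading
            results += transduce(arcs, symbols, dst, out + (o,))
        elif symbols and symbols[0] == inp:  # ordinary arc: consume one symbol
            results += transduce(arcs, symbols[1:], dst, out + (o,))
    return results
```

`transduce(ARCS, ["c", "a", "t", "+N", "+PL"])` generates the surface symbols of cats, while `transduce(invert(ARCS), list("cats"))` analyzes them back into the lexical symbols. The recursion assumes there are no epsilon cycles.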

Multi-Level Tape Machines • We use one machine to transduce between the lexical and the intermediate level, and another to handle the spelling changes to the surface tape

Lexical to Intermediate Level

Intermediate to Surface • The “add an e” rule, as in fox^s# <-> foxes#

Foxes

Note • A key feature of this machine is that it doesn’t do anything to inputs to which it doesn’t apply. • Meaning that they are written out unchanged to the output tape.
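The e-insertion rule and its pass-through behaviour can be sketched as a regular-expression rewrite over the intermediate tape. The function name is hypothetical, the set of triggering contexts (x, s, z, ch, sh) is an assumption about the full rule, and this sketch also strips the # word boundary that the slide keeps on the surface form.

```python
import re


def intermediate_to_surface(form):
    """Apply e-insertion, then remove the ^ and # boundary symbols."""
    # Insert e between a sibilant-final stem and the plural s: fox^s# --> fox^es#
    form = re.sub(r"(x|s|z|ch|sh)\^(?=s#)", r"\1^e", form)
    # Inputs the rule does not match reach this point unchanged.
    return form.replace("^", "").replace("#", "")


for intermediate in ["fox^s#", "cat^s#", "fox#"]:
    print(intermediate, "-->", intermediate_to_surface(intermediate))
```

Only fox^s# matches the rule's context; cat^s# and fox# are copied to the output with nothing but their boundary symbols removed.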

Overall Scheme • We now have one FST that has explicit information about the lexicon (actual words, their spelling, facts about word classes and regularity). • Lexical level to intermediate forms • We have a larger set of machines that capture orthographic/spelling rules. • Intermediate forms to surface forms

Overall Scheme

Cascades • This is a scheme that we’ll see again and again. • Overall processing is divided up into distinct rewrite steps • The output of one layer serves as the input to the next • The intermediate tapes may or may not wind up being useful in their own right
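A cascade can be sketched as ordinary function composition, where each stage rewrites the output of the previous one. All names, the irregular-plural table, and the ^ and # boundary notation handling are illustrative assumptions; real systems compose transducers rather than Python functions.

```python
import re
from functools import reduce

IRREGULAR_PLURALS = {"goose": "geese", "mouse": "mice", "ox": "oxen"}  # assumed lexicon


def lexical_to_intermediate(lexical):
    """'fox +N +PL' --> 'fox^s#'; irregular plurals are looked up directly."""
    stem, *tags = lexical.split()
    if tags == ["+N", "+PL"]:
        if stem in IRREGULAR_PLURALS:
            return IRREGULAR_PLURALS[stem] + "#"
        return stem + "^s#"
    return stem + "#"


def e_insertion(form):
    """Spelling rule on the intermediate tape: fox^s# --> fox^es#."""
    return re.sub(r"(x|s|z)\^(?=s#)", r"\1^e", form)


def strip_boundaries(form):
    """Remove the morpheme (^) and word (#) boundary symbols."""
    return form.replace("^", "").replace("#", "")


def cascade(*stages):
    """Feed the output of each stage into the next one."""
    return lambda x: reduce(lambda acc, stage: stage(acc), stages, x)


generate = cascade(lexical_to_intermediate, e_insertion, strip_boundaries)
```

With these stages `generate("fox +N +PL")` passes through fox^s# and fox^es# before reaching foxes, while `generate("goose +N +PL")` skips the spelling rule entirely.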

Porter Stemmer (1980) • Used for tasks in which you only care about the stem • IR, modeling given/new distinction, topic detection, document similarity • Lexicon-free morphological analysis • Cascades rewrite rules (e.g. misunderstanding --> misunderstand --> understand --> …) • Easily implemented as an FST with rules, e.g. • ATIONAL --> ATE • ING --> ε • Not perfect… • Doing --> doe • Policy --> police

Does stemming help? • IR, little • Topic detection, more

Summing Up • FSTs provide a useful tool for implementing a standard model of morphological analysis, Kimmo’s two-level morphology • But for many tasks (e.g. IR) much simpler approaches are still widely used, e.g. the rule-based Porter Stemmer • Next time: • Read Ch 4 • HW1 assigned; see web page: http://www.cs.columbia.edu/~kathy/NLP
