
Morphology

Morphology. Sudeshna Sarkar IIT Kharagpur. Morphology Morphology – analysis and synthesis - computational aspects. Morphology. Morphology is the field of linguistics that studies the internal structure of words


Presentation Transcript


  1. Morphology Sudeshna Sarkar IIT Kharagpur

  2. Morphology • Morphology – analysis and synthesis - computational aspects

  3. Morphology • Morphology is the field of linguistics that studies the internal structure of words • How words are built up from smaller meaningful units called morphemes (morph = shape, logos = word) • We can usefully divide morphemes into two classes • Stems: the core meaning-bearing units • Affixes: bits and pieces that adhere to stems to change their meanings and grammatical functions • Prefix: un-, anti-, etc. (a-, ati-, pra-, etc.) • Suffix: -ity, -ation, etc. (-taa, -ke, -ka, etc.) • Infix: inserted inside the stem • Tagalog: um + hingi → humingi • Circumfix: precedes and follows the stem • Turkish can have words with a lot of suffixes (agglutinative language) • Many Indian languages also have agglutinative suffixes

  4. Examples (English) • “unladylike” • 3 morphemes, 4 syllables un- ‘not’ lady ‘(well behaved) female adult human’ -like ‘having the characteristics of’ • Can’t break any of these down further without distorting the meaning of the units • “technique” • 1 morpheme, 2 syllables • “dogs” • 2 morphemes, 1 syllable -s, a plural marker on nouns Modified from Dorr and Habash (after Jurafsky and Martin)

  5. Examples (Bengali) • “chhelederTaakei” • 5 morphemes chhele ‘boy’ -der ‘plural genitive’ -Taa ‘classifier’ -ke ‘dative’ -i ‘emphasizer’ Can’t break any of these down further without distorting the meaning of the units • “atipraakrritake” ati- praakrrita -ke Modified from Dorr and Habash (after Jurafsky and Martin)

  6. Morpheme Definitions • Root • The portion of the word that: • is common to a set of derived or inflected forms, if any, when all affixes are removed • is not further analyzable into meaningful elements • carries the principal portion of meaning of the words • Stem • The root or roots of a word, together with any derivational affixes, to which inflectional affixes are added. • Affix • A bound morpheme that is joined before, after, or within a root or stem. • Clitic • A morpheme that functions syntactically like a word, but does not appear as an independent phonological word • Spanish: un beso, las aguas; English: Hal’s (genitive marker) Modified from Dorr and Habash (after Jurafsky and Martin)

  7. Inflectional & Derivational Morphology • We can also divide morphology up into two broad classes • Inflectional • Derivational

  8. Inflectional Morphology • Inflection: • Variation in the form of a word, typically by means of an affix, that expresses a grammatical contrast. • Doesn’t change the word class • Usually produces a predictable, nonidiosyncratic change of meaning. • Serves a grammatical/semantic purpose different from the original • After combination with an inflectional morpheme, the meaning and class of the actual stem usually do not change. • eat / eats • pencil / pencils • khelaa / khele / khelchhila • bai / baiTAke / baiyera

  9. Inflectional Morphology • Adds: • tense, number, person, mood, aspect • Word class doesn’t change • Word serves new grammatical role • Examples • come is inflected for person and number: The pizza guy comes at noon. • las and rojas are inflected for agreement with manzanas in grammatical gender by -a and in number by –s las manzanas rojas (‘the red apples’) Modified from Dorr and Habash (after Jurafsky and Martin)

  10. Derivational Morphology • Derivation: • The formation of a new word or inflectable stem from another word or stem. • After combination with a derivational morpheme, the meaning and the class of the actual stem usually change. • compute / computer • do / undo • friend / friendly • Uygar / uygarlaş • kapı / kapıcı • udaara (J) / udaarataa (N) • bhadra / abhadra • baayu / baayabiiya • Irregular changes may happen with derivational affixes.

  11. Derivational Morphology • Nominalization (formation of nouns from other parts of speech, primarily verbs in English): • computerization • appointee • killer • fuzziness • Formation of adjectives (primarily from nouns) • computational • clueless • Embraceable Modified from Dorr and Habash (after Jurafsky and Martin)

  12. Concatenative Morphology • Morpheme+Morpheme+Morpheme+… • Stems: also called lemma, base form, root, lexeme • hope+ing → hoping • hop+ing → hopping • Affixes • Prefixes: Antidisestablishmentarianism (anti-, dis-) • Suffixes: Antidisestablishmentarianism (-ment, -arian, -ism) • Infixes: hingi (borrow) – humingi (borrower) in Tagalog • Circumfixes: sagen (say) – gesagt (said) in German • Agglutinative Languages • uygarlaştıramadıklarımızdanmışsınızcasına • uygar+laş+tır+ama+dık+lar+ımız+dan+mış+sınız+casına • Behaving as if you are among those whom we could not cause to become civilized Modified from Dorr and Habash (after Jurafsky and Martin)

  13. Templatic Morphology • Roots and Patterns • Example: Hebrew verbs • Root: • Consists of 3 consonants CCC • Carries basic meaning • Template: • Gives the ordering of consonants and vowels • Specifies semantic information about the verb • Active, passive, middle voice • Example: • lmd (to learn or study) • CaCaC -> lamad (he studied) • CiCeC -> limed (he taught) • CuCaC -> lumad (he was taught) Modified from Dorr and Habash (after Jurafsky and Martin)
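The root-and-pattern idea on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `apply_template` and the convention that every `C` in the template is a consonant slot are assumptions made here, not part of the original slides.

```python
def apply_template(root: str, template: str) -> str:
    """Fill each 'C' slot of the template with the next root consonant."""
    if template.count("C") != len(root):
        raise ValueError("template slots and root consonants do not match")
    consonants = iter(root)
    # Vowels in the template are copied through unchanged.
    return "".join(next(consonants) if ch == "C" else ch for ch in template)


# The three Hebrew patterns from the slide, applied to the root l-m-d.
for template in ("CaCaC", "CiCeC", "CuCaC"):
    print(template, "->", apply_template("lmd", template))
```

The root contributes only the consonants and the template contributes the vowels and their order, which is why this kind of morphology does not fit a purely concatenative model.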

  14. Syntax and Morphology • Phrase-level agreement • Subject-Verb • John studies hard (STUDY+3SG) • Noun-Adjective • Las vacas hermosas • Sub-word phrasal structures • שבספרינו • ש+ב+ספר+ים+נו • That+in+book+PL+Poss:1PL • Which are in our books Modified from Dorr and Habash (after Jurafsky and Martin)

  15. Surface and Lexical Forms • The surface level of a word represents the actual spelling of that word. • geliyorum eats cats kitabım • The lexical level of a word represents a simple concatenation of morphemes making up that word. • gel +PROG +1SG • eat +AOR • cat +PLU • kitap +P1SG • Morphological processors try to find correspondences between lexical and surface forms of words. • Morphological recognition/ analysis – surface to lexical • Morphological generation/ synthesis – lexical to surface

  16. Morphology: Morphemes & Order • Handles what is an isolated form in written text • Grouping of phonemes into morphemes • the sequence deliverables ~ deliver, able and s (3 units) • Morpheme Combination • certain combinations/sequencings are possible, others are not: • deliver+able+s, but not able+deliver+s; noun+s, but not noun+ing • typically fixed (in any given language)

  17. Morphological Parsing • Morphological parsing is to find the lexical form of a word from its surface form. • cats -- cat +N +PLU • cat -- cat +N +SG • goose -- goose +N +SG or goose +V • geese -- goose +N +PLU • gooses -- goose +V +3SG • catch -- catch +V • caught -- catch +V +PAST or catch +V +PP • AsachhilAma -- AsA +PROG +PAST +1st (‘I was / we were coming’) • There can be more than one lexical level representation for a given word (ambiguity) • flies -- fly +V +3SG or fly +N +PLU • mAtAla kare
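A minimal sketch of such a parser, assuming a toy lexicon and only the regular -s suffix, shows how one surface form can receive several analyses. The names `STEMS`, `IRREGULAR`, `IRREGULAR_PLURAL_NOUNS` and `parse` are hypothetical, and no spelling rules are handled here.

```python
# Toy lexicon: stem -> possible word classes.
STEMS = {"cat": {"N"}, "goose": {"N", "V"}, "catch": {"V"}}

# Irregular surface forms are listed whole with their analyses.
IRREGULAR = {
    "geese": ["goose +N +PLU"],
    "caught": ["catch +V +PAST", "catch +V +PP"],
}

# Nouns with an irregular plural: the regular -s plural is blocked for them.
IRREGULAR_PLURAL_NOUNS = {"goose"}


def parse(surface: str) -> list:
    """Return every lexical-level analysis of a surface form."""
    analyses = list(IRREGULAR.get(surface, []))
    # Bare stems: singular noun and/or base verb.
    for cat in sorted(STEMS.get(surface, ())):
        analyses.append(f"{surface} +N +SG" if cat == "N" else f"{surface} +V")
    # Regular -s: plural on nouns, 3SG on verbs.
    if surface.endswith("s"):
        stem = surface[:-1]
        cats = STEMS.get(stem, set())
        if "N" in cats and stem not in IRREGULAR_PLURAL_NOUNS:
            analyses.append(f"{stem} +N +PLU")
        if "V" in cats:
            analyses.append(f"{stem} +V +3SG")
    return analyses


for word in ("cats", "goose", "geese", "gooses", "caught"):
    print(word, "--", " or ".join(parse(word)))
```

An empty list means the toy lexicon has no analysis for the word; a list with more than one entry is exactly the ambiguity the slide mentions.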

  18. The history of morphological analysis dates back to the ancient Indian linguist Pāṇini, who formulated the 3,959 rules of Sanskrit morphology in the text Aṣṭādhyāyī by using a Constituency Grammar.

  19. Formal definition of the problem • Surface form: the word (ws) as it occurs in the text, e.g. [sings]; ws ∈ L ⊆ Σ+ • Lexical form: the root word(s) (r1, r2, …) and other grammatical features (F), e.g. [sing,v,+sg,+3rd]; wl ∈ {Σ+,}+F+, wl ∈ Δ+

  20. Analysis & Synthesis • Morphological Analysis: maps a string from surface form to the corresponding lexical form. fMA: Σ+ → Δ+ • Morphological Synthesis: maps a string from lexical form to surface form. fMS: Δ+ → Σ+

  21. Relationship between MA & MS • fMS(fMA(ws)) = ws • fMA(fMS(wl)) = wl • fMS = fMA^-1, fMA = fMS^-1 • But is that really the case?

  22. Fly + s → flys → flies (y → i rule) • Duckling • Go-getter → get + er • Doer → do + er • Beer → ? • What knowledge do we need? How do we represent it? How do we compute with it?

  23. Knowledge needed • Knowledge of stems or roots • Duck is a possible root, not duckl • We need a dictionary (lexicon) • Only some endings go on some words • Do + er: ok • Be + er: not ok • In addition, spelling change rules that adjust the surface form • Get + er: double the t → getter • Fox + s: insert e → foxes • Fly + s: flys, insert e and change y to i → flies • Chase + ed: drop e → chased
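The three kinds of knowledge listed on this slide (stems, which endings go on which stems, and spelling changes) can be put together in a small sketch. Everything named here (`LEXICON`, `attach`) is hypothetical, and the spelling rules are deliberately crude: the consonant-doubling test, for instance, is only adequate for short stems like "get".

```python
import re

# Toy lexicon: each stem lists the endings it may take.
LEXICON = {
    "do": {"er"}, "get": {"er"}, "fox": {"s"},
    "fly": {"s"}, "chase": {"ed"}, "be": set(),
}

VOWELS = "aeiou"


def attach(stem: str, suffix: str) -> str:
    """Combine stem and suffix, applying toy English spelling rules."""
    if suffix not in LEXICON.get(stem, set()):
        raise ValueError(f"{stem}+{suffix} is not licensed by the lexicon")
    if suffix == "s" and re.search(r"(s|x|z|ch|sh)$", stem):
        return stem + "es"                  # e-insertion: fox+s -> foxes
    if suffix == "s" and re.search(r"[^aeiou]y$", stem):
        return stem[:-1] + "ies"            # y -> i plus e: fly+s -> flies
    if suffix[0] in VOWELS and stem.endswith("e"):
        return stem[:-1] + suffix           # e-deletion: chase+ed -> chased
    if suffix[0] in VOWELS and re.search(r"[^aeiou][aeiou][^aeiouwxy]$", stem):
        return stem + stem[-1] + suffix     # doubling: get+er -> getter
    return stem + suffix                    # plain concatenation: do+er -> doer


for stem, suffix in [("do", "er"), ("get", "er"), ("fox", "s"),
                     ("fly", "s"), ("chase", "ed")]:
    print(f"{stem} + {suffix} -> {attach(stem, suffix)}")
```

Calling `attach("be", "er")` raises `ValueError`, which mirrors the "Be + er: not ok" case: the combination is rejected by the lexicon before any spelling rule is tried.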

  24. Put all this in a big dictionary (lexicon) • Turkish: approx. 600 × 10^6 forms • Finnish: 10^7 • Hindi, Bengali, Telugu, Tamil? • Besides, novel forms can always be constructed • Anti-missile • Anti-anti-missile • Anti-anti-anti-missile • …….. • Compounding of words: Sanskrit, German

  25. Morphology: From Morphemes to Lemmas & Categories • Lemma: lexical unit, “pointer” to lexicon • typically is represented as the “base form”, or “dictionary headword” • possibly indexed when ambiguous/polysemous: • state1 (verb), state2 (state-of-the-art), state3 (government) • from one or more morphemes (“root”, “stem”, “root+derivation”, ...) • Categories: non-lexical • small number of possible values (< 100, often < 5-10)

  26. Morphology Level: The Mapping • Formally: A+ → 2^(L,C1,C2,...,Cn) • A is the alphabet of phonemes (A+ denotes any non-empty sequence of phonemes) • L is the set of possible lemmas, uniquely identified • Ci are morphological categories, such as: • grammatical number, gender, case • person, tense, negation, degree of comparison, voice, aspect, ... • tone, politeness, ... • part of speech (not quite a morphological category, but...) • A, L and Ci are obviously language-dependent

  27. Morphological Analysis (cont.) • Relatively simple for English. • But for many Indian languages, it may be more difficult. Examples Inflectional and Derivational Morphology. • Common tools: Finite-state transducers • A transducer maps a set/string of symbols to another set/string of symbols

  28. A simpler problem • Linear concatenation of morphemes with possible spelling changes at the boundary and a few irregular cases. • Quite practical assumptions • English, Hindi, Bengali, Telugu, Tamil, French, Turkish … • Exceptions: Semitic languages, Sanskrit

  29. Computational Morphology • Approaches • Lexicon only • Rules only • Lexicon and Rules • Finite-state Automata • Finite-state Transducers Modified from Dorr and Habash (after Jurafsky and Martin)

  30. Computational Morphology • Systems • WordNet’s morphy • PCKimmo • Named after Kimmo Koskenniemi, much work done by Lauri Karttunen, Ron Kaplan, and Martin Kay • Accurate but complex • http://www.sil.org/pckimmo/ • Two-level morphology • Commercial version available from InXight Corp. • Background • Chapter 3 of Jurafsky and Martin • A short history of Two-Level Morphology • http://www.ling.helsinki.fi/~koskenni/esslli-2001-karttunen/ Modified from Dorr and Habash (after Jurafsky and Martin)

  31. Finite State Machines • FSAs recognize exactly the regular languages • FSTs define exactly the regular relations (sets of string pairs) • FSTs are like FSAs, but each arc carries an input:output label pair • We can use FSTs to transduce between surface and lexical levels.

  32. Can FSAs help? [Figure: FSA with states Q0, Q1, Q2 and arcs labelled reg-noun, plural (-s), irreg-pl-noun, irreg-sg-noun]

  33. What’s this for? [Figure: FSA with states Q0, Q1, Q2, Q3 and arcs labelled un-, ε, adj-root, -er, -est, -ly] • As a regular expression: (un)? ADJ-ROOT (er | est | ly)?
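The regular expression on this slide can be tried out directly with Python's `re` module. The root list `ADJ_ROOTS` and the helper `is_adjective_form` are assumptions made for this sketch; the slide does not list any roots.

```python
import re

# Hypothetical adjective roots; chosen so that no spelling change is needed.
ADJ_ROOTS = ["clear", "cool", "real"]

# (un)? ADJ-ROOT (er | est | ly)?
PATTERN = re.compile(r"(un)?(%s)(er|est|ly)?" % "|".join(ADJ_ROOTS))


def is_adjective_form(word: str) -> bool:
    """True if the whole word is accepted by the adjective pattern."""
    return PATTERN.fullmatch(word) is not None


for word in ("unclear", "coolest", "really", "clearun", "unrealest"):
    print(word, is_adjective_form(word))
```

The pattern models the morphotactics of English adjectives, and it overgenerates: "unrealest" is accepted even though it is not an English word, because nothing in the pattern says which roots may combine with which affixes.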

  34. Morphotactics • The last two examples basically model some parts of the English morphotactics • But where is the information about regular and irregular roots? LEXICON • Can we include the lexicon in the FSA?

  35. The English Pluralization FSA [Figure: FSA with states Q0, Q1, Q2 and arcs labelled reg-noun, plural (-s), irreg-pl-noun, irreg-sg-noun]

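  36. After adding a mini-lexicon [Figure: the pluralization FSA over states Q0, Q1, Q2 with the word-class arcs expanded into letter-by-letter arcs spelling the example nouns (b-u-g, d-o-g, m-a-n, m-e-n) and the plural -s]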
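A small sketch of the pluralization FSA with a mini-lexicon follows. The figures themselves are not recoverable from the transcript, so the arc structure used here is an assumption: it follows the textbook version of this automaton in Jurafsky and Martin, works on whole morphemes rather than single letters, and uses hypothetical names (`TRANSITIONS`, `ACCEPTING`, `LEXICON`, `accepts`).

```python
# Arcs of the assumed pluralization FSA: (state, morpheme class) -> next state.
TRANSITIONS = {
    ("Q0", "reg-noun"): "Q1",
    ("Q1", "plural"): "Q2",
    ("Q0", "irreg-sg-noun"): "Q2",
    ("Q0", "irreg-pl-noun"): "Q2",
}
ACCEPTING = {"Q1", "Q2"}

# Mini-lexicon: morpheme -> class label used on the arcs.
LEXICON = {
    "dog": "reg-noun", "bug": "reg-noun",
    "man": "irreg-sg-noun", "men": "irreg-pl-noun",
    "s": "plural",
}


def accepts(morphemes) -> bool:
    """Run the FSA over a sequence of morphemes."""
    state = "Q0"
    for morpheme in morphemes:
        label = LEXICON.get(morpheme)
        state = TRANSITIONS.get((state, label))
        if state is None:          # no arc: reject
            return False
    return state in ACCEPTING


print(accepts(["dog", "s"]))   # regular plural
print(accepts(["man", "s"]))   # irregular noun takes no -s
```

The automaton only says yes or no. It recognizes that dog+s is well formed but produces no analysis, which is the limitation that motivates transducers later in the deck.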

  37. Elegance & Power • FSAs are elegant because • NFA ≡ DFA (every nondeterministic automaton has an equivalent deterministic one) • Closed under union, intersection, concatenation, complementation • Traversal is always linear in the input size • Well-known algorithms for minimization, determinization, compilation etc. • They are powerful because they can capture • Linear morphology • Irregularities

  38. But… FSAs are only language recognizers/generators. We need transducers to build Morphological Analyzers (fMA) & Morphological Synthesizers (fMS)

  39. Finite State Transducers [Figure: Surface form ↔ Finite State Machine ↔ Lexical form]

  40. Formal Definition • A 6-tuple {Σ,Δ,Q,δ,q0,F} • Σ is the (finite) set of input symbols • Δ is the (finite) set of output symbols • Q is the (finite) set of states • δ is the transition function Q × Σ → Q × Δ • q0 ∈ Q is the start state • F ⊆ Q is the set of accepting states
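The 6-tuple translates almost directly into a small data structure. The sketch below assumes a deterministic transducer, leaves Σ, Δ and Q implicit in the keys and values of `delta`, and lets an output be a multi-character string such as "+Pl". The class `FST` and the example machine `MAN_MEN` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FST:
    delta: dict    # (state, input symbol) -> (next state, output string)
    start: str     # q0
    finals: set    # F

    def transduce(self, symbols):
        """Return the output string, or None if the input is rejected."""
        state, out = self.start, []
        for sym in symbols:
            if (state, sym) not in self.delta:
                return None
            state, piece = self.delta[(state, sym)]
            out.append(piece)
        return "".join(out) if state in self.finals else None


# Surface -> lexical analysis of "man#" and "men#" ('#' marks end of word).
MAN_MEN = FST(
    delta={
        ("q0", "m"): ("q1", "m"),
        ("q1", "a"): ("q2", "a"), ("q2", "n"): ("q3", "n"),
        ("q3", "#"): ("qf", "+Sg"),
        ("q1", "e"): ("q4", "a"), ("q4", "n"): ("q5", "n"),
        ("q5", "#"): ("qf", "+Pl"),
    },
    start="q0",
    finals={"qf"},
)

print(MAN_MEN.transduce("man#"))
print(MAN_MEN.transduce("men#"))
```

The e:a arc is what maps the irregular surface vowel back to the vowel of the lemma, and the #:+Pl arc emits the number feature once the whole word has been read.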

  41. An example FST [Figure: FST over states Q0, Q1, Q2 with arc labels a:a, s:ε, g:g, b:b, u:u, s:s, d:d, o:o, g:g, a:a, m:m, n:n, n:n, e:a]

  42. The Lexicon FST [Figure: FST over states Q0, Q1, Q2, Q3, Q4 with arc labels a:a, s:+Pl, g:g, #:+Sg, b:b, u:u, s:s, d:d, o:o, g:g, #:+Sg, a:a, n:n, m:m, e:a, #:+Pl, n:n]

  43. Ways to look at FSTs • Recognizer of a pair of strings • Generator of a pair of strings • Translator from one regular language to another • Computer of a relation – regular relation

  44. Questions about FSTs • Suppose T1 and T2 are two FSTs. Which of the following are FSTs (i.e. regular relations)? • Union: T1 ∪ T2 • Composition: T1 ∘ T2 • Intersection: T1 ∩ T2 • If R is a regular relation, is R^-1 regular?

  45. Questions about FSTs • Suppose T1 and T2 are two FSTs. Which of the following are FSTs (i.e. regular relations)? • Union: T1 ∪ T2: YES • Composition: T1 ∘ T2: YES • Intersection: T1 ∩ T2: NOT in general • If R is a regular relation, is R^-1 regular? YES

  46. Invertibility • Given T = {Σ,Δ,Q,δ,q0,F} • Construct T^-1 = {Δ,Σ,Q,δ^-1,q0,F} such that if δ(q,x) → (q’,y) then δ^-1(q,y) → (q’,x), where x ∈ Σ and y ∈ Δ
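Inversion amounts to swapping the two labels on every arc. The sketch below stores a transducer as a list of arcs `(state, input, output, next_state)` and runs it nondeterministically, because the inverse of a deterministic transducer need not be deterministic. The names `invert`, `run` and `ARCS` are hypothetical, and ε inputs are not handled.

```python
def invert(arcs):
    """Swap input and output on every arc (q, x, y, q2)."""
    return [(q, y, x, q2) for (q, x, y, q2) in arcs]


def run(arcs, start, finals, symbols):
    """Return the set of all output sequences the machine allows for `symbols`."""
    configs = {(start, ())}
    for sym in symbols:
        configs = {
            (q2, out + (y,))
            for (state, out) in configs
            for (q, x, y, q2) in arcs
            if q == state and x == sym
        }
    return {out for (state, out) in configs if state in finals}


# Analyzer (surface -> lexical) for man / men; '#' marks end of word.
ARCS = [
    ("q0", "m", "m", "q1"),
    ("q1", "a", "a", "q2"), ("q2", "n", "n", "q3"), ("q3", "#", "+Sg", "qf"),
    ("q1", "e", "a", "q4"), ("q4", "n", "n", "q5"), ("q5", "#", "+Pl", "qf"),
]

analysis = run(ARCS, "q0", {"qf"}, ["m", "e", "n", "#"])
synthesis = run(invert(ARCS), "q0", {"qf"}, ["m", "a", "n", "+Pl"])
print(analysis)
print(synthesis)
```

In the inverted machine, state q1 has two arcs that read "a" (one writing "a", one writing "e"), so the generator has to keep both paths alive until the number feature disambiguates them.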

  47. Compositionality • T1 = {Σ, X, Q1, δ1, q1, F1} & T2 = {X, Δ, Q2, δ2, q2, F2} • Define T3 = {Σ, Δ, Q3, δ3, q3, F3} such that Q3 = Q1 × Q2, q3 = (q1, q2), and δ3((q,s), i) = ((q’,s’), o) if ∃c s.t. δ1(q, i) = (q’,c) and δ2(s, c) = (s’,o)
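The product construction on this slide can be sketched for deterministic, ε-free transducers whose transition functions are stored as dictionaries. The slide does not define F3; the sketch assumes the usual choice F3 = F1 × F2. The function names and the two toy machines are hypothetical.

```python
def compose(delta1, delta2):
    """Build delta3 over pair states: T1's output symbol feeds T2's input."""
    delta3 = {}
    for (q, i), (q_next, c) in delta1.items():
        for (s, c_in), (s_next, o) in delta2.items():
            if c == c_in:
                delta3[((q, s), i)] = ((q_next, s_next), o)
    return delta3


def transduce(delta, start, finals, symbols):
    """Run a deterministic transducer; return the output list or None."""
    state, out = start, []
    for sym in symbols:
        if (state, sym) not in delta:
            return None
        state, o = delta[(state, sym)]
        out.append(o)
    return out if state in finals else None


# T1 rewrites a -> b and copies x; T2 rewrites b -> c and copies x.
delta1 = {("p", "a"): ("p", "b"), ("p", "x"): ("p", "x")}
delta2 = {("r", "b"): ("r", "c"), ("r", "x"): ("r", "x")}

delta3 = compose(delta1, delta2)
finals3 = {("p", "r")}            # assumed F3 = F1 x F2
print(transduce(delta3, ("p", "r"), finals3, list("axa")))
```

Running the composed machine once gives the same result as running T1 and then T2, without ever materializing the intermediate string. That is the property that lets a cascade of rule transducers be collapsed into a single machine.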

  48. Modelling Orthographic Rules • Spelling changes at morpheme boundaries • bus+s → buses, watch+s → watches • fly+s → flies • make+ing → making • Rules • E-insertion takes place if the stem ends in s, z, ch, sh etc. • y maps to ie when the pluralization marker s is added

  49. Rewrite Rules • Chomsky and Halle (1968) • General form: a → b / λ __ ρ • E-insertion: ε → e / {x,s,z,ch,sh…}^ __ s# • Kaplan and Kay (1994) showed that FSTs can be compiled from general rewrite rules
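The E-insertion rule can be imitated with a regular-expression substitution over an intermediate form in which `^` marks the morpheme boundary and `#` the end of the word. This is a sketch using Python's `re` module rather than a compiled FST, and the names `E_INSERTION` and `to_surface` are hypothetical.

```python
import re

# ε → e / {x,s,z,ch,sh}^ __ s#
# Left context: one of x, s, z, ch, sh followed by the boundary '^'.
# Right context: the suffix 's' followed by the word end '#'.
E_INSERTION = re.compile(r"(x|s|z|ch|sh)\^(?=s#)")


def to_surface(intermediate: str) -> str:
    """Apply E-insertion, then erase the boundary symbols."""
    with_e = E_INSERTION.sub(r"\1^e", intermediate)
    return with_e.replace("^", "").replace("#", "")


for form in ("fox^s#", "bus^s#", "watch^s#", "cat^s#"):
    print(form, "->", to_surface(form))
```

The left context is written as a capture group rather than a lookbehind because Python lookbehinds must have a fixed width, and the alternatives here are one or two characters long.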

  50. Two-level Morphology (Koskenniemi, 1983) [Figure: lexical level → LEXICON FST → intermediate level → FST1 … FSTn (orthographic rules) → surface level]
