
Analogy in morphology: Only a beginning


Presentation Transcript


  1. Analogy in morphology: Only a beginning. John Goldsmith, The University of Chicago / CNRS-MoDyCo. Analogy in grammar: Form and acquisition, Max Planck Institute for Evolutionary Anthropology, Leipzig, September 2006

  2. Outline of talk • Word segmentation problem • Minimum Description Length (MDL) framework • Learning morphological structure: analogy takes us only so far

  3. [Figure: a signature, displayed as a finite state automaton]

  4. Part 1: The word segmentation problem

  5. Input: inprincipioerailverbo → [language-independent device] → Output: in principio era il verbo

  6. Word segmentation Work by Michael Brent and by Carl de Marcken in the mid-1990s at MIT. A lexicon 𝓛 is a pair of objects (L, pL): a set L ⊂ A*, and a probability distribution pL that is defined on A* for which L is the support of pL. We call L the words. • We insist that A ⊆ L: all individual letters are words. • We define a language as a subset of L*; its members are sentences. • Each sentence can be uniquely associated with an utterance (an element in A*) by a mapping F:

  7. F:L*A*

  8. F:L*A* S If F(S) = U then we say that S is a parse of U. U

  9. F:L*A* S U We pull back the measure from the space of letters to thespace of words.

  10. Different lexicons lead to different probabilities of the data. Given an utterance U, the probability of a string of letters is the probability assigned to its best parse.

  11. Class of models originally studied in the word segmentation problem Our data is a finite string (“corpus”) over a finite alphabet; we find the best parse for the string; the probability of the parse is the product of the probabilities of its words; the words are assigned a maximum likelihood probability of the simplest sort.
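To make this class of models concrete, the following minimal Python sketch computes the best parse of an unsegmented string by dynamic programming, scoring a parse as the sum of its words' negative log probabilities. It is an illustration only, not Brent's, de Marcken's, or Goldsmith's implementation, and the toy lexicon and its probabilities are invented and unnormalized.

```python
import math

def best_parse(utterance, lexicon):
    """Return the highest-probability segmentation of `utterance`.

    `lexicon` maps words to probabilities; every single letter is assumed
    to be a word, so a parse always exists.  The score of a parse is the
    sum of -log2 probabilities of its words (lower is better).
    """
    n = len(utterance)
    cost = [0.0] + [float("inf")] * n   # cost[i]: best score for utterance[:i]
    back = [0] * (n + 1)                # backpointer: where the last word starts
    for i in range(1, n + 1):
        for j in range(i):
            word = utterance[j:i]
            if word in lexicon:
                c = cost[j] - math.log2(lexicon[word])
                if c < cost[i]:
                    cost[i], back[i] = c, j
    words, i = [], n                    # read the best parse off the backpointers
    while i > 0:
        words.append(utterance[back[i]:i])
        i = back[i]
    return list(reversed(words)), cost[n]

# Invented toy lexicon: all single letters, plus a few multi-letter words.
lexicon = {c: 0.01 for c in "abcdefghijklmnopqrstuvwxyz"}
lexicon.update({"in": 0.05, "principio": 0.02, "era": 0.04,
                "il": 0.05, "verbo": 0.02})
print(best_parse("inprincipioerailverbo", lexicon))
# (['in', 'principio', 'era', 'il', 'verbo'], <total -log2 probability>)
```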

  12. Results • The Fulton County Grand Ju ry said Friday an investi gation of At l anta 's recent prim arye lectionproduc ed no e videnc e that any ir regul ar it i e s took place . • Thejury further s aid in term - end present ment sthatthe City Ex ecutiveCommit t e e ,which had over - all charg eofthee lection , d e serv e s the pra is e and than ksofthe City of At l antafortheman ner in whichthee lection was conduc ted. Chunks are too big Chunks are too small

  13. From Encarta: trained on the first 150 articles La funzione originari a dell'abbigliament o fu for s e quell a di pro t egger e il corpo dalle av vers i tà del c li ma . Ne i paesi cal di uomini e donn e indoss ano gonn ell in i mor bi di e drappeggi ati e per i z om i . In generale gli abit ant i delle zon ecal d e non port ano ma i più di due stra t i di vestit i. Al contr ari o, nei luog h i dove il c li ma è più rigid o sono diffus i abiti ader enti e a più stra ti . C omun e alle due tradizion i è tuttavi a l' abitudin e di ricor re re a mantell i per ri par arsi dagli e le ment i.

  14. 3 major categories of failures of MDL word-discovery • Many failures of word-discovery are correct discovery of morphemes (word-pieces): investi-gation, pro-t-egger-e. • Many (though fewer) failures of word-discovery are discovery of pairs of words that frequently appear together (for example, ofthe). • Many failures are too short to be likely words.

  15. As we add more linguistic sophistication to the class of models considered, MDL makes increasingly better predictions.

  16. Part 2: Minimum Description Length (MDL) Analysis. Jorma Rissanen (1989), Stochastic Complexity in Statistical Inquiry.

  17. Synthetic a priori • The mind’s construction of the world is its best understanding of what the senses provide it with. • The real world is the one which is most probable, given our observations. Bayesian, maximum a posteriori reasoning

  18. Bayes’ Rule D = Data H = Hypothesis pr(H|D) = pr(D|H) · pr(H) / pr(D)

  19. Bayes’ Rule D = Data H = Hypothesis Definition: define pr(A|B) = pr(A&B)/pr(B)

  20. Bayes’ Rule D = Data H = Hypothesis By that definition, pr(H|D) = pr(H&D)/pr(D), and likewise pr(D|H) = pr(H&D)/pr(H).

  21. Bayes’ Rule D = Data H = Hypothesis Hence pr(H|D) · pr(D) = pr(H&D) = pr(D|H) · pr(H).

  22. Bayes’ Rule D = Data H = Hypothesis Therefore pr(H|D) = pr(D|H) · pr(H) / pr(D).

  23. If reality is the most probable hypothesis, given the evidence... • we must find the hypothesis H for which pr(D|H) · pr(H) is a maximum (equivalently, pr(H|D), since pr(D) is fixed). D = Data H = Hypothesis How do we calculate the probability of our hypothesis about what reality is? (rationalism) How do we calculate the probability of our observations, given our understanding of reality? (empiricism)
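As a toy illustration of this maximum a posteriori criterion (all numbers invented), choosing the best hypothesis means weighing the prior pr(H) against the likelihood pr(D|H):

```python
# Choose the hypothesis H maximizing pr(D|H) * pr(H); priors and
# likelihoods below are invented for illustration.
hypotheses = {
    "H1": {"prior": 0.7, "likelihood": 0.01},
    "H2": {"prior": 0.2, "likelihood": 0.10},
    "H3": {"prior": 0.1, "likelihood": 0.05},
}
best = max(hypotheses,
           key=lambda h: hypotheses[h]["likelihood"] * hypotheses[h]["prior"])
print(best)  # H2: 0.2 * 0.10 = 0.020 beats 0.7 * 0.01 and 0.1 * 0.05
```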

  24. How do we calculate the probability of our hypothesis about what reality is? How do we calculate the probability of our observations, given our understanding of reality? Assign a (“prior”) probability to all hypotheses, based on their coherence. Measure the coherence; call it an evaluation metric. Insist that your grammars be probabilistic: they assign a probability to their generated output. Kraft’s inequality: if grammars have the “prefix property” (guaranteed local punctuation), then we can assign pr(G) = 2^(-length(G)).

  25. Usage of MDL If the description length of data D, given model M, is equal to the inverse log probability assigned to D by M plus the compressed length of M (that is, DL(D, M) = -log pr(D|M) + length(M)), then the process of word-learning is unambiguously one of increasing the probability of the data, using the length of M as a stopping criterion.
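A minimal sketch of that criterion in Python (illustrative only: the corpus is treated as a pre-segmented list of words, and the model length is a crude fixed cost per lexicon letter, with `bits_per_letter` an assumption of mine rather than anything from the talk):

```python
import math

def data_cost_bits(corpus_words, lexicon_probs):
    """-log2 pr(corpus | model): each word scored by its lexicon probability."""
    return sum(-math.log2(lexicon_probs[w]) for w in corpus_words)

def model_cost_bits(lexicon_probs, bits_per_letter=5):
    """Crude stand-in for the compressed length of the model:
    spell out every lexical entry letter by letter."""
    return bits_per_letter * sum(len(w) for w in lexicon_probs)

def description_length(corpus_words, lexicon_probs):
    """Total description length: data given model, plus the model itself."""
    return data_cost_bits(corpus_words, lexicon_probs) + model_cost_bits(lexicon_probs)

def improves(corpus_words, old_lexicon, new_lexicon):
    """MDL acceptance test: adopt a revised lexicon only if it lowers the
    total description length; the model term acts as the stopping criterion."""
    return (description_length(corpus_words, new_lexicon)
            < description_length(corpus_words, old_lexicon))
```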

  26. Essence of MDL: minimize the description length, -log pr(D|M) + length(M). 2. MDL

  27. Part 3: Learning morphology

  28. Corpus: jump, jumps, jumping laugh, laughed, laughing sing, sang, singing the, dog, dogs total: 62 letters Analysis: Stems: jump laugh sing sang dog (20 letters) Suffixes: s ing ed (6 letters) Unanalyzed: the (3 letters) total: 29 letters. Naïve MDL 3. Morphology

  29. 1st approximation: a morphology is a list of stems, a list of affixes (prefixes, suffixes), and a list of pointers indicating which combinations are permissible. Unlike in the word segmentation problem, we now have no obvious search heuristics. These are (for that reason) very important, and I will not talk about them. Model/heuristic 3. Morphology
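A minimal sketch of such a first-approximation morphology as a data structure (the field names and the tiny example are mine, not Linguistica's internals): stems, affixes, and signatures recording which combinations are permitted.

```python
# Stems, suffixes, and signatures: each signature pairs a set of stems
# with the set of suffixes those stems share (data invented for illustration).
morphology = {
    "stems":    ["jump", "laugh", "dog"],
    "suffixes": ["NULL", "s", "ing"],
    "signatures": {
        "NULL.s.ing": {"stems": ["jump", "laugh"], "suffixes": ["NULL", "s", "ing"]},
        "NULL.s":     {"stems": ["dog"],           "suffixes": ["NULL", "s"]},
    },
}

def generate(morph):
    """Enumerate the words licensed by the signatures."""
    for sig in morph["signatures"].values():
        for stem in sig["stems"]:
            for suffix in sig["suffixes"]:
                yield stem if suffix == "NULL" else stem + suffix

print(sorted(generate(morphology)))
# ['dog', 'dogs', 'jump', 'jumping', 'jumps', 'laugh', 'laughing', 'laughs']
```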

  30. Size of model 3. Morphology M[orphology] = { Stems T, Affixes F, Signatures S }: its length is the sum of the lengths of its stems, its affixes, and its signatures (the signatures’ extensivity). What is a signature, and what is its length?

  31. What is a signature? 3. Morphology

  32. What is the length (= information content) of a signature? A signature is an ordered pair of two sets of pointers: (i) a set of pointers to stems; and (ii) a set of pointers to affixes. The length of a pointer p is -log freq(p). So the total length of the signatures is a sum over signatures, and within each signature a sum over its stem pointers and its affix pointers, of -log freq(p).
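In code, that pointer accounting might look like the sketch below (the relative frequencies and the two toy signatures are invented; this illustrates the formula and is not Linguistica's source):

```python
import math

def pointer_length(item, freq):
    """Length in bits of a pointer to `item`: -log2 of its frequency."""
    return -math.log2(freq[item])

def signatures_length(signatures, stem_freq, affix_freq):
    """Sum over signatures; within each signature, sum over its stem
    pointers and its affix pointers."""
    total = 0.0
    for stems, affixes in signatures:
        total += sum(pointer_length(t, stem_freq) for t in stems)
        total += sum(pointer_length(f, affix_freq) for f in affixes)
    return total

# Invented relative frequencies and two toy signatures:
stem_freq  = {"jump": 0.5, "laugh": 0.3, "dog": 0.2}
affix_freq = {"NULL": 0.5, "s": 0.3, "ing": 0.2}
sigs = [({"jump", "laugh"}, {"NULL", "s", "ing"}),
        ({"dog"}, {"NULL", "s"})]
print(signatures_length(sigs, stem_freq, affix_freq))
```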

  33. Generation 1 Linguistica http://linguistica.uchicago.edu Initial pass: assumes that words are composed of 1 or 2 morphemes; finds all cases where signatures exist with at least 2 stems and 2 affixes: 3. Morphology
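The sketch below gives the flavor of such an initial pass in a few lines of Python; it is deliberately simplified (it is not the heuristic Linguistica actually uses) and the word list is invented: try every stem + suffix split, group stems by the set of suffixes they take, and keep only signatures with at least two stems and two affixes.

```python
from collections import defaultdict

def initial_signatures(words, max_suffix_len=5):
    """Simplified first pass: split every word into stem + suffix (NULL
    suffix included), record which suffixes each candidate stem takes,
    group stems by their suffix set (the signature), and keep signatures
    with at least 2 stems and 2 affixes."""
    wordset = set(words)
    suffixes_of = defaultdict(set)
    for w in wordset:
        for cut in range(max(1, len(w) - max_suffix_len), len(w) + 1):
            stem, suffix = w[:cut], w[cut:] or "NULL"
            suffixes_of[stem].add(suffix)
    by_signature = defaultdict(set)
    for stem, sufs in suffixes_of.items():
        by_signature[frozenset(sufs)].add(stem)
    return {sig: stems for sig, stems in by_signature.items()
            if len(sig) >= 2 and len(stems) >= 2}

words = ["jump", "jumps", "jumping", "laugh", "laughs", "laughing"]
print(initial_signatures(words))
# e.g. {frozenset({'NULL', 's', 'ing'}): {'jump', 'laugh'}}
```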

  34. Generation 1 3. Morphology Then it refines this initial approximation in a large number of ways, always trying to decrease the description length of the initial corpus.

  35. French roots 3. Morphology

  36. 3. Morphology 4. Detect allomorphy Signature: <e>ion . NULL composite concentrate corporate détente discriminate evacuate inflate opposite participate probate prosecute tense What is this? composite and composition: composite → composit → composit + ion. It infers that -ion deletes a stem-final ‘e’ before attaching.
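The inference can be illustrated with a small search (a sketch under my own simplifying assumptions, not Linguistica's procedure): look for stems t such that both t + 'e' and t + 'ion' are attested words.

```python
def find_e_deletion_pairs(words):
    """Find pairs like composite / composition: the stem 'composit' takes
    both 'e' and 'ion', evidence for an affix <e>ion that deletes a
    stem-final 'e' before attaching."""
    wordset = set(words)
    pairs = []
    for w in sorted(wordset):
        if w.endswith("ion"):
            stem = w[:-3]                  # strip 'ion'
            if stem + "e" in wordset:
                pairs.append((stem + "e", w))
    return pairs

print(find_e_deletion_pairs(["composite", "composition",
                             "concentrate", "concentration",
                             "tense", "tension"]))
# [('composite', 'composition'), ('concentrate', 'concentration'), ('tense', 'tension')]
```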

  37. 4. Morphology Swahili verb

  38. 4. Morphology Swahili verb Subject marker

  39. 4. Morphology Swahili verb Subject marker Tense marker

  40. 4. Morphology Swahili verb Subject marker Tense marker Object marker

  41. 4. Morphology Swahili verb Subject marker Object marker Tense marker Root

  42. 4. Morphology Swahili verb Subject marker Object marker Tense marker Root Voice (active/passive)

  43. 4. Morphology Swahili verb Subject marker Object marker Tense marker Root Voice (active/passive) Final vowel

  44. Signature: reduces false positives 4. Morphology

  45. Generalize the signature… 4. Morphology Sequential FSA: each state has a unique successor.
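A sequential FSA over the morpheme slots listed on the preceding slides can be sketched as below; the slot fillers and the greedy longest-match parse are my own illustrative simplifications, not a real fragment of Swahili grammar or of Linguistica's FSA machinery.

```python
# Sequential FSA: each state (slot) has a unique successor.
SLOTS = [
    ("subject", {"ni", "u", "a"}),
    ("tense",   {"na", "li", "ta"}),
    ("object",  {"m", "wa", ""}),   # the object marker may be absent
    ("root",    {"pend", "som"}),
    ("voice",   {"", "w"}),         # active vs. passive
    ("final",   {"a"}),
]

def accepts(form):
    """Greedy left-to-right pass through the slots; returns the
    slot-by-slot analysis if the form is exhausted, else None."""
    analysis, rest = [], form
    for slot, morphs in SLOTS:
        for m in sorted(morphs, key=len, reverse=True):  # longest match first
            if rest.startswith(m):
                analysis.append((slot, m))
                rest = rest[len(m):]
                break
        else:
            return None
    return analysis if rest == "" else None

print(accepts("ninampenda"))  # ni-na-m-pend-a, parsed slot by slot
```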

  46. Alignments 4. Morphology
