Acquisition of Morphology by Computer: Unsupervised learning

John Goldsmith, The University of Chicago

Presentation Transcript


  1. Acquisition of Morphology by Computer: Unsupervised learning John Goldsmith The University of Chicago

  2. The goal: To produce a morphological analysis of a corpus from an “unknown” language automatically, that is, with no knowledge of the structure of that language built in; and to produce both generalizations about the language and a correct analysis of each word in the corpus.

  3. Raw data → Linguistica → analyzed data

  4. Implemented in Linguistica, a program that runs under Windows that you can download at: humanities.uchicago.edu/faculty/goldsmith

  5. The goal is not to eliminate either linguists or linguistics; The goal is to understand what the goal of a linguistic analysis is so well that we can state it explicitly and algorithmically.

  6. Other work in this area • Derrick Higgins on Thursday; • Michael Brent 1993; • Zellig Harris: 1955 and 1967, follow-up: Hafer and Weiss 1974

  7. Global approach • Focus on devising a method for evaluating a hypothesis, given the data. • Finding explicit methods of discovery is important, but those methods play no role in evaluating the analysis for a given corpus. (Very similar in conception to Chomsky’s notion of an evaluation metric.)

  8. Framework for evaluation: Jorma Rissanen’s Minimum Description Length (“MDL”). Quite intricate; but we can get a very good feel for the general idea with a naïve version of MDL...

  9. Naive description length Count the total number of letters in the list of stems and affixes: the fewer, the better.

  10. Intuition: A word which is morphologically complex reveals that composite character by virtue of being composed of (one or more) strings of letters which have a relatively high frequency throughout the corpus.

  11. Naive description length (2): Lexicographers know what they are doing when they give the entry for the verb laugh as "laugh, ~s, ~ed, ~ing". They recognize that the tilde “~” lets them exploit the regularities of the language to save space and specification, and implicitly underscores the regularity of the pattern that the stem follows.

  12. Morphological analysis is not merely a matter of frequency. Not every word that ends in –ing is morphologically complex: string, sing, etc.

  13. Naive Minimum Description Length: Analyze the words of a corpus into stem + suffix with the requirement that every stem and every suffix must be used in at least 2 distinct words. Tally up the total number of letters in (a) each of the proposed stems, (b) each of the proposed suffixes, and (c) each of the unanalyzed words, and call that total the “naive description length”.

  14. Naive Minimum Description Length. Corpus: jump, jumps, jumping; laugh, laughed, laughing; sing, sang, singing; the, dog, dogs (total: 61 letters). Analysis: Stems: jump, laugh, sing, sang, dog (20 letters). Suffixes: s, ing, ed (6 letters). Unanalyzed: the (3 letters). Total: 29 letters. Notice that the description length goes UP if we analyze sing into s+ing.
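
The letter tally on this slide can be checked mechanically. The following is a minimal Python sketch, not code from Linguistica; the function names `letters` and `naive_description_length` are made up for illustration. It counts letters in the proposed stems, suffixes, and unanalyzed words, and then repeats the count with an extra stem "s", assuming that singing is still analyzed as sing+ing so that the stem "sing" cannot be dropped.

```python
def letters(items):
    """Total number of letters in a collection of strings."""
    return sum(len(item) for item in items)


def naive_description_length(stems, suffixes, unanalyzed):
    """Letters in the stem list + suffix list + list of unanalyzed words."""
    return letters(stems) + letters(suffixes) + letters(unanalyzed)


corpus = ["jump", "jumps", "jumping", "laugh", "laughed", "laughing",
          "sing", "sang", "singing", "the", "dog", "dogs"]

# The analysis given on the slide.
stems = ["jump", "laugh", "sing", "sang", "dog"]
suffixes = ["s", "ing", "ed"]
unanalyzed = ["the"]
baseline = naive_description_length(stems, suffixes, unanalyzed)  # 20 + 6 + 3

# Analyzing "sing" as s+ing adds a one-letter stem "s" to the stem list;
# "sing" stays because "singing" still needs it (an assumption of this sketch).
with_s_ing = naive_description_length(stems + ["s"], suffixes, unanalyzed)

print(baseline, with_s_ing)
```

The second figure is larger than the first, which is the point of the slide: the split s+ing buys nothing, because no other word reuses the stem "s".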

  15. Frequencies matter, but only in the overarching context of a total morphological analysis of all of the words of the language.

  16. Let’s look at how the work is done, step by step...

  17. Corpus Pick a large corpus from a language -- 5,000 to 1,000,000 words.

  18. Feed the corpus into the “bootstrapping” heuristic... (Diagram: Corpus → Bootstrap heuristic)

  19. Out of which comes a preliminary morphology, which need not be superb. (Diagram: Corpus → Bootstrap heuristic → Morphology)

  20. Feed it to the incremental heuristics... (Diagram: Corpus → Bootstrap heuristic → Morphology → incremental heuristics)

  21. Out comes a modified morphology. (Diagram: Morphology → incremental heuristics → modified morphology)

  22. Is the modification an improvement? Ask MDL! (Diagram: Morphology and modified morphology are compared)

  23. If it is an improvement, replace the morphology... (Diagram: the modified morphology takes the place of the old Morphology, which goes to Garbage)

  24. Send it back to the incremental heuristics again... (Diagram: modified morphology → incremental heuristics)

  25. Continue until there are no improvements to try. (Diagram: Morphology → incremental heuristics → modified morphology, repeated)
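
The loop described in the last several slides can be sketched as a simple hill-climbing procedure. This is a minimal Python illustration under stated assumptions, not Linguistica's actual control flow: `refine`, `heuristics`, and `description_length` are hypothetical names, each heuristic is assumed to be a function that takes a morphology and returns a modified one (or `None` if it has nothing to propose), and the morphology itself can be any object the scoring function understands.

```python
def refine(morphology, heuristics, description_length):
    """Keep applying incremental heuristics while the description length drops.

    morphology         -- the preliminary analysis from the bootstrap step
    heuristics         -- functions: morphology -> modified morphology or None
    description_length -- function: morphology -> number (smaller is better)
    """
    best = morphology
    best_dl = description_length(best)
    improved = True
    while improved:
        improved = False
        for propose in heuristics:
            candidate = propose(best)
            if candidate is None:
                continue
            candidate_dl = description_length(candidate)
            # "Ask MDL": accept the modification only if it is an improvement.
            if candidate_dl < best_dl:
                best, best_dl = candidate, candidate_dl
                improved = True
            # Otherwise the candidate is discarded and the old morphology kept.
    return best
```

The design choice worth noting is that the heuristics only propose; the decision to keep or discard a proposal rests entirely with the description-length measure, which matches the global approach stated earlier in the talk.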

  26. Bootstrapping...initial hypothesis = initial morphology of the corpus

  27. First: a set of candidate suffixes for the language, using some interesting statistics.

  28. 1. Observed frequency of a string (e.g., ing) 2. Predicted frequency of the same string if there were no morphemes in the language 3. The computed “stickiness” of that string 4. Weight the stickiness (3) by how often the string shows up in the corpus

  29. Rank all word-final sequences of letters (of length 1-4 letters); • This gives us an excellent first guess of the suffixes of the language. • See Handout for English, French, Spanish, and Latin.
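
The slides do not give the formula behind “stickiness”, so the following Python sketch is only one plausible reading of the four steps. It assumes that the predicted frequency is the product of the individual letter frequencies (letters combining independently, as if there were no morphemes), that stickiness is the log of observed over predicted frequency, and that the weight is the observed frequency itself. The function name `rank_suffix_candidates` and the requirement that at least one letter be left over as a stem are also assumptions.

```python
import math
from collections import Counter


def rank_suffix_candidates(words, max_len=4):
    """Rank word-final letter sequences of length 1..max_len by weighted stickiness."""
    letter_counts = Counter(ch for w in words for ch in w)
    total_letters = sum(letter_counts.values())

    # Count every word-final sequence, leaving at least one letter for a stem.
    ending_counts = Counter()
    for w in words:
        for k in range(1, min(max_len, len(w) - 1) + 1):
            ending_counts[w[-k:]] += 1

    scored = []
    for ending, count in ending_counts.items():
        observed = count / len(words)                       # step 1
        predicted = math.prod(letter_counts[ch] / total_letters
                              for ch in ending)             # step 2
        stickiness = math.log(observed / predicted)         # step 3
        scored.append((ending, observed * stickiness))      # step 4
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


words = ["jumping", "laughing", "singing", "walking",
         "jump", "laugh", "walk", "dog", "the", "cat"]
ranked = rank_suffix_candidates(words)
```

On a corpus this small the ranking is not meaningful; the sketch only shows how the four quantities fit together.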

  30. Given a candidate set of 100 suffixes... • It is not difficult to find the set of stems that gives us the largest number of analyses employing only those suffixes. • We use these to find the major signatures present in the corpus ...

  31. Discovery of signatures: The first 8 stems in the largest signature in a 500,000 word corpus of English. Set of suffixes that appears with all of these stems
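
To make the notion of a signature concrete, here is a small Python sketch that groups stems by the set of suffixes they occur with. It is an illustration under assumptions rather than the procedure used in Linguistica: the function name `find_signatures` is hypothetical, the empty suffix is written "NULL", every possible stem+suffix split is tried, and a stem is kept only if it appears with at least two suffixes.

```python
from collections import defaultdict


def find_signatures(words, suffixes):
    """Map each signature (a sorted tuple of suffixes) to the stems that take it."""
    stem_to_suffixes = defaultdict(set)
    for w in set(words):
        for suffix in suffixes:
            # "" stands for the null suffix; the stem must be non-empty.
            if w.endswith(suffix) and len(w) > len(suffix):
                stem = w[:len(w) - len(suffix)]
                stem_to_suffixes[stem].add(suffix or "NULL")

    signatures = defaultdict(list)
    for stem, found in stem_to_suffixes.items():
        if len(found) >= 2:  # a stem seen with one suffix reveals no pattern
            signatures[tuple(sorted(found))].append(stem)
    return {sig: sorted(stems) for sig, stems in signatures.items()}


corpus = ["jump", "jumps", "jumping", "laugh", "laughed", "laughing",
          "sing", "sang", "singing", "the", "dog", "dogs"]
signatures = find_signatures(corpus, ["", "s", "ing", "ed"])
```

On the toy corpus from the naive MDL slide, "jump" falls under the signature NULL.ing.s and "dog" under NULL.s, while the spurious stem "s" (from s+ing) is dropped because it occurs with only one suffix.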

  32. Minimum Description Length The real thing, this time: Rissanen 1989. Evaluate a morphology by: 1. How well the morphology extracts generalizations present in the data: how well it describes the data. 2. How concise the morphology is. The “naïve MDL” we just looked at only covered the second point, and only crudely.

  33. Measure how well the morphology fits the data: compute the inverse log frequency, log(1/p(w)), that the morphology predicts for each word w in the corpus, and sum over the corpus. This is a well-understood quantity in information theory, called the “optimal compressed length” of the corpus based on the probability distribution defined by the morphology.
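
The sum on this slide is easy to state in code. The Python sketch below is a minimal illustration; the names `compressed_length` and `unigram_model` are hypothetical, and the probability model is a plain word-frequency stand-in, not the distribution that a morphology would actually define over stems, suffixes, and signatures.

```python
import math
from collections import Counter


def compressed_length(tokens, prob):
    """Sum of log2(1/p(w)) over every word token: the length in bits."""
    return sum(-math.log2(prob[w]) for w in tokens)


def unigram_model(tokens):
    """Stand-in probability model: relative frequency of each word."""
    counts = Counter(tokens)
    return {w: c / len(tokens) for w, c in counts.items()}


tokens = ["the", "dog", "jumps", "the", "dogs", "jump"]
bits = compressed_length(tokens, unigram_model(tokens))
```

A model that assigns higher probability to the words that actually occur yields a smaller total, which is the sense in which a better-fitting morphology compresses the corpus better.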

  34. Conciseness Sum all the letters, plus all the structure inherent in the description, using information theory.

  35. Structure: number of letters + signatures, which we’ll get to shortly.

  36. Information contained in the Signature component: a list of pointers to signatures. <X> indicates the number of distinct elements in X.

  37. Results…

  38. French

  39. Spanish

  40. Latin

  41. Future directions: Develop it to work with languages of greater complexity; and use it as an aid in the task of learning syntax in the same unsupervised fashion.
