This document presents an unsupervised methodology for morphological analysis of large text corpora, based on the Minimum Description Length (MDL) principle. The technique processes input files ranging from 5,000 to 1,000,000 words and produces a partial morphological analysis, splitting words into stems and suffixes without relying on a dictionary or predefined morphological rules. The approaches surveyed include locating morpheme boundaries from the predictability of successive letters, identifying letter sequences that are likely to be morpheme-internal, and seeking the globally most concise analysis under an information-theoretic model.
Linguistica • INPUT: a text file, typically 5,000 to 1,000,000 words • OUTPUT: a partial morphological analysis of most of the words in the corpus • Unsupervised • No dictionary • No morphological rules • MDL framework (Rissanen 1989) Learning Morphology
The Problem • Determine the correct morphological split of individual words into stem and suffixes. • Establish accurate categories of stems based on the range of suffixes they accept.
Four Approaches • Identify morpheme boundaries (and hence morphemes) on the basis of the degree of predictability of the (n+1)th letter given the first n letters (Z. Harris 1955, 1967) • Identify bigrams and trigrams that have a high probability of being morpheme-internal • Discover patterns of phonological relationships between pairs of related words • Seek the analysis that is globally most concise (Goldsmith 2001)
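The first approach, predictability of the next letter, can be illustrated with a small sketch. It counts how many distinct letters follow each prefix of a word in a word list; a spike in that count suggests a morpheme boundary. This is a simplified letter-based reading of the idea, not Harris's original procedure, and the word list and function names are invented for illustration.

```python
from collections import defaultdict

def successor_counts(words):
    """Map each prefix to the set of symbols that follow it in the word list."""
    followers = defaultdict(set)
    for word in words:
        for i in range(len(word)):
            followers[word[:i]].add(word[i])
        followers[word].add("#")  # "#" marks end of word (an assumption of this sketch)
    return followers

def successor_variety(word, followers):
    """Number of distinct continuations after each prefix of length 1..len(word)."""
    return [len(followers[word[:i]]) for i in range(1, len(word) + 1)]

# Toy vocabulary, invented for illustration.
words = ["jump", "jumps", "jumped", "jumping",
         "walk", "walks", "walked", "walking"]
followers = successor_counts(words)
variety = successor_variety("jumping", followers)

# The prefix with the most possible continuations is the boundary candidate.
boundary = variety.index(max(variety)) + 1
print(variety, "jumping"[:boundary], "jumping"[boundary:])
```

In this toy vocabulary the count peaks after "jump", so the sketch proposes the split jump + ing. On real corpora the peaks are noisier, and picking a threshold becomes a separate problem.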
Minimum Description Length Model: 4 Components • A model of a set of data that assigns a probability distribution to the sample space from which the data is drawn. • The model can be used to assign a compressed length to the data using information-theoretic notions. • The model can itself be assigned a length. • The optimal analysis of the data is the one for which the sum of the length of the compressed data and the length of the model is the smallest. • In other words, we seek the most compact representation of the model and the data taken together.
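The selection criterion above (model length plus compressed data length) can be sketched on a toy corpus. The cost functions here are deliberate simplifications and are not the formulas used in Linguistica: every letter stored in the model is assumed to cost a flat log2(26) bits, signatures are ignored, and each token is encoded with the negative log probability of its pieces. All names are hypothetical.

```python
import math
from collections import Counter

BITS_PER_LETTER = math.log2(26)  # assumption: flat per-letter cost in the model

def model_length(morphemes):
    """Bits needed to spell out every morpheme in the model once."""
    return sum(len(m) for m in morphemes) * BITS_PER_LETTER

def data_length(analysed_tokens):
    """Compressed corpus length: sum of -log2 P(piece) over all tokens.

    Each token is a tuple of pieces; a piece's probability is its relative
    frequency among the pieces that occupy the same position.
    """
    n = len(analysed_tokens)
    counts = Counter((pos, m) for tok in analysed_tokens for pos, m in enumerate(tok))
    return sum(-math.log2(counts[(pos, m)] / n)
               for tok in analysed_tokens for pos, m in enumerate(tok))

def description_length(analysed_tokens):
    """Model length plus compressed data length, in bits."""
    morphemes = {m for tok in analysed_tokens for m in tok}
    return model_length(morphemes) + data_length(analysed_tokens)

stems = ["jump", "walk", "laugh"]
suffixes = ["", "s", "ed", "ing"]  # "" plays the role of the NULL suffix

whole = [(t + f,) for t in stems for f in suffixes]  # no analysis: 12 whole words
split = [(t, f) for t in stems for f in suffixes]    # stem + suffix analysis

print(round(description_length(whole), 1), round(description_length(split), 1))
```

Both analyses compress this corpus to the same number of bits, but the stem-plus-suffix model is far shorter to write down, so it has the smaller total and is the one the criterion selects.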
An Example Model • List of stems: the set of unanalysed words plus the material that precedes the final suffix of any analysed word • List of suffixes that occur with at least one stem • List of signatures • Each stem is associated with the list of suffixes observed with it; this list is the stem’s signature, and it is stored as a list of pointers
MDL Example • STEMS (9): cat, dog, hat, John, jump, laugh, sav, the, walk • AFFIXES (6): NULL, ed, ing, s, e, es
MDL Example: Signatures • S1: stems ptr(cat) ptr(dog) ptr(hat); suffixes ptr(NULL) ptr(s) • S2: stems ptr(sav); suffixes ptr(e) ptr(es) ptr(ing) • S3: stems ptr(jump) ptr(laugh) ptr(walk); suffixes ptr(NULL) ptr(ed) ptr(ing) ptr(s) • S4: stems ptr(John) ptr(the)
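As a rough illustration of how such signatures can be derived, the sketch below groups stems by the exact set of suffixes observed with them. The (stem, suffix) pairs are an assumed reading of the example, in particular that John and the occur only with NULL, and the function name is hypothetical.

```python
from collections import defaultdict

def build_signatures(analyses):
    """Group stems by the set of suffixes observed with them.

    `analyses` is an iterable of (stem, suffix) pairs; the result maps each
    signature (a sorted tuple of suffixes) to the sorted list of its stems.
    """
    suffixes_of = defaultdict(set)
    for stem, suffix in analyses:
        suffixes_of[stem].add(suffix)

    stems_of = defaultdict(list)
    for stem, sufs in suffixes_of.items():
        stems_of[tuple(sorted(sufs))].append(stem)
    return {sig: sorted(stems) for sig, stems in stems_of.items()}

# Assumed stem/suffix analyses matching the example lists.
analyses = (
    [(t, f) for t in ("cat", "dog", "hat") for f in ("NULL", "s")]
    + [("sav", f) for f in ("e", "es", "ing")]
    + [(t, f) for t in ("jump", "laugh", "walk") for f in ("NULL", "ed", "ing", "s")]
    + [("John", "NULL"), ("the", "NULL")]
)

signatures = build_signatures(analyses)
for suffix_set, stem_list in signatures.items():
    print(suffix_set, stem_list)
```

Using a sorted tuple as the key makes two stems share a signature only when their suffix sets match exactly, which is what lets the stem list and the suffix list each be stored once per signature.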
Notation • t: a stem • f: a suffix • s: a signature • T: set of stems in the corpus • F: set of suffixes in the corpus • S: set of signatures in the corpus • <T>, <F>, <S>: cardinalities of T, F, S • [t], [f]: frequency of t, f in the corpus • W: set of words in the corpus • [W]: length of the corpus • <W>: vocabulary size
A signature comprises two lists: • List of pointers to stems • List of pointers to suffixes • To specify a list of length N, we need L(N) bits, where L(N) ≈ log2(N) • A pointer to a stem t has length -log2 P(t) bits, where P(t) = [t]/[W]
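The two length formulas can be checked with a few lines of arithmetic. The counts below are made up for illustration, the logarithm is taken base 2 so that lengths come out in bits, and the function names are hypothetical.

```python
import math

def pointer_length(count, corpus_length):
    """Bits for a pointer to an item: -log2(P), with P = count / corpus_length."""
    return -math.log2(count / corpus_length)

def list_length(n):
    """Approximate bits needed to state that a list has n entries: log2(n)."""
    return math.log2(n)

corpus_length = 16                     # [W], hypothetical
stem_counts = {"the": 8, "jump": 1}    # [t], hypothetical

for stem, count in stem_counts.items():
    print(stem, pointer_length(count, corpus_length), "bits")
print("list of 8 entries:", list_length(8), "bits")
```

A stem that accounts for half the corpus gets a 1-bit pointer, while a stem seen once in 16 tokens needs 4 bits, so frequent stems are cheap to reference and rare ones are expensive.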