
Using eigenvectors of a bigram-induced matrix to represent and infer syntactic behavior


Presentation Transcript


  1. Using eigenvectors of a bigram-induced matrix to represent and infer syntactic behavior. Mikhail Belkin and John Goldsmith, The University of Chicago, July 2002

  2. Dual motivation • Unsupervised learning of syntactic behavior of words • Solving a problem in the unsupervised learning of morphology: disambiguating morphs

  3. Disambiguating morphs? • Automatic learning of morphology can provide us with a signature associated with a given stem: • Signature = alphabetized list of affixes associated with a given stem in a corpus.

  4. For example: Signature NULL.ed.ing.s: • aid, ask, call, claim, help, kick Signature NULL.ed.ing: • add, assist, attend, consider Signature NULL.s: • achievement, acre, action, administrator, affair

  5. The signature NULL.ed.ing is much more truly a subsignature of NULL.ed.ing.s than NULL.s is, because of the ambiguity of s (noun, verb).

  6. How can we determine whether a given morph (“ed”, “s”) represents more than 1 morpheme? • I don’t think that we can do this on the basis of morphological information.

  7. Goal: find a way of describing syntactic behavior that depends only on a corpus. • That is, a method that is language-independent but corpus-dependent – though the global structure induced from two corpora of the same language will be very similar.

  8. [Figure: French. Cluster labels: finite verbs, plural nouns, fem. sg. nouns.]

  9. With such a method… We can look at words formed with the “same” suffix, putting words into buckets based on the signature their stem is in: • Bucket 1 (NULL.ed.ing.s): aided, asked, called • Bucket 2 (NULL.ed.ing): added, assisted, attended. Q: do the average positions of the buckets form a tight cluster?

  10. If the average locations of the buckets of –ed words form a tight cluster, then –ed is not ambiguous. If the average locations of the buckets (from distinct signatures) do not form a tight cluster, the morpheme is not the same across signatures.

  11. Method • Not a clustering method; neither top-down nor bottom-up. • Two-step procedure: 1. Construct a nearest-neighbor graph. 2. Reduce the graph to two dimensions by means of eigenvector decomposition.

  12. Nearest neighbors Following a long list of researchers: • We begin by assuming that a word W’s distribution can be described by a vector L describing all of its left-hand neighbors and a vector R describing all of its right-hand neighbors.

  13. Let V be the size of the corpus’s vocabulary. Lw and Rw are vectors that live in R^V. If the vocabulary is ordered alphabetically, then Lw = (4, 0, 0, 0, …): the first entry is the number of occurrences of “a” before w, the second the number of occurrences of “abandoned” before w, the third the number of occurrences of “abatuna” before w, and so on.
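
A minimal sketch (ours, not the authors’ code) of building such left-context vectors from a tokenized corpus; the function name and dictionary representation are illustrative assumptions:

```python
import numpy as np

def left_vectors(tokens):
    """Map each word w to L_w: counts of each vocabulary word
    occurring immediately to the left of w in the corpus."""
    vocab = sorted(set(tokens))              # alphabetical order, as on the slide
    index = {word: i for i, word in enumerate(vocab)}
    L = {w: np.zeros(len(vocab)) for w in vocab}
    for left, w in zip(tokens, tokens[1:]):  # every bigram (left, w)
        L[w][index[left]] += 1
    return vocab, L
```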

  14. Similarity of syntactic behavior is modeled as closeness of L-vectors, where the “closeness” of two vectors is measured by the angle between them.
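
A sketch of that closeness measure; returning the raw angle rather than its cosine is our reading of the slide:

```python
import numpy as np

def angle(u, v):
    """Angle between two context-count vectors; smaller = more similar."""
    cos = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(cos, -1.0, 1.0))  # clip guards rounding error
```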

  15. Construct a (non-directed) graph: its vertices are the words W in V. For each word W: • Pick the K most similar words (K = 20 or 50), by angle between L-vectors. • Add an edge to the graph connecting W to each of those words.
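
A sketch of the graph construction under the definitions above; `knn_edges`, the edge-set representation, and the handling of all-zero vectors are our own assumptions:

```python
import numpy as np

def knn_edges(vocab, L, K=20):
    """Connect each word to its K most similar words (smallest angle
    between L-vectors); the resulting edge set is undirected."""
    X = np.stack([L[w] for w in vocab]).astype(float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X /= np.where(norms == 0, 1, norms)        # guard all-zero vectors (assumption)
    cos = X @ X.T                              # larger cosine = smaller angle
    np.fill_diagonal(cos, -np.inf)             # exclude self-matches
    edges = set()
    for i in range(len(vocab)):
        for j in np.argsort(cos[i])[-K:]:      # indices of the K largest cosines
            edges.add(frozenset((i, int(j))))  # undirected edge {i, j}
    return edges
```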

  16. Canonical matrix representation of a graph: M(i,j) = 1 iff there is an edge connecting wi and wj – that is, iff wi and wj are similar words as regards how they interact with the word immediately to the left.
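
A sketch of that canonical matrix, built from the edge set of the previous step (the representation is ours):

```python
import numpy as np

def adjacency(edges, n):
    """0/1 symmetric matrix: M[i, j] = 1 iff words i and j share an edge."""
    M = np.zeros((n, n), dtype=int)
    for e in edges:
        i, j = tuple(e)
        M[i, j] = M[j, i] = 1
    return M
```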

  17. Where is this matrix M? • It’s a point in a space of dimension V(V−1)/2. Not very helpful, really. • How can we optimally reduce it to a space of small dimension? • Find the eigenvectors of the normalized Laplacian of the graph. See Chung; Shi and Malik; Belkin and Niyogi (references in written version)

  18. A graph and its matrix M • The degree of a vertex (= word) is the number of edges adjacent (linked) to it. • Notice that this is not fixed across words. • The degree of vertex vi is the sum of the entries of the ith row.

  19. The Laplacian of the graph Let D be the V×V diagonal matrix s.t. diagonal entry D(i,i) = degree of vi. D – M is the Laplacian of the graph. Its rows sum to 0.

  20. Normalized Laplacian: • For each i, divide all entries in the ith row by √d(i). • For each i, divide all entries in the ith column by √d(i). • Result: diagonal elements are all 1. • Generally: the normalized Laplacian is D^(–1/2)(D – M)D^(–1/2), whose (i,j) entry is 1 if i = j, –1/√(d(i)d(j)) if wi and wj are joined by an edge, and 0 otherwise.
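
A sketch of that normalization, taking M from the adjacency step; the guard against isolated vertices is our assumption:

```python
import numpy as np

def normalized_laplacian(M):
    """D^(-1/2) (D - M) D^(-1/2): 1 on the diagonal,
    -1/sqrt(d(i) d(j)) where an edge exists, 0 elsewhere."""
    d = M.sum(axis=1).astype(float)               # vertex degrees
    inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1.0))  # avoid division by zero (assumption)
    Lap = np.diag(d) - M                          # the (unnormalized) Laplacian
    return inv_sqrt[:, None] * Lap * inv_sqrt[None, :]  # scale rows and columns
```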

  21. Eigenvector decomposition • The eigenvectors form a spectrum, ranked by the value of their eigenvalues. • Eigenvalues run from 0 to 2 (L is positive semi-definite). • The eigenvector with 0 eigenvalue (the “zeroth”) reflects each word’s frequency. • But the next smallest (the “first”) gives us a good representation of the words…

  22. …in the sense that the values associated with each word show how close the words are in the original graph. We can graph the first two eigenvectors of the Left (or Right) graph: each word is located at the coordinates corresponding to it in those eigenvectors; see the sketch and plots below.
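
A sketch of extracting those coordinates; using a dense symmetric eigensolver and taking eigenvectors 1 and 2 is our reading of the slides:

```python
import numpy as np

def embed_2d(Lap):
    """Coordinates from the eigenvectors with the smallest nonzero
    eigenvalues of the normalized Laplacian."""
    vals, vecs = np.linalg.eigh(Lap)   # symmetric solver; ascending eigenvalues
    return vecs[:, 1], vecs[:, 2]      # skip the "zeroth" (eigenvalue 0)

# word vocab[i] is plotted at (x[i], y[i]):
# x, y = embed_2d(normalized_laplacian(adjacency(edges, len(vocab))))
```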

  23. [Figure: Spanish (left). Cluster labels: masculine plurals, fem. plurals, feminine sg. nouns, masc. sg. nouns, past participles, finite verbs.]

  24. [Figure: German (left). Cluster labels: neuter sg. nouns; numbers, centuries; fem. sg. nouns; names of places.]

  25. [Figure: English (right). Cluster labels: nouns, modals, prepositions + of + “to”.]

  26. [Figure: English (left). Cluster labels: infinitives, past verbs, modals, the +.]

  27. Results of experiment • If we rescale the minimal box that includes all of the vocabulary to 1 by 1, then we find a small (< 0.10) average distance to the mean for unambiguous suffixes (e.g., -ed (English), -ait (French)) – and only for them.

  28. Measure • To repeat: we find the “virtual” location of the conflation of all of the stems of a given signature, plus the suffix in question, e.g., NULL.ed.ing_ed. • We do this for all signatures containing “ed”. • We compute the average distance to the mean.
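
A sketch of that measure under the stated rescaling; the input format is our assumption:

```python
import numpy as np

def avg_distance_to_mean(bucket_locations):
    """bucket_locations: one 2-D point per signature containing the suffix
    (the mean embedding of that bucket's words), with coordinates rescaled
    so the vocabulary's bounding box is 1 x 1."""
    pts = np.asarray(bucket_locations, dtype=float)
    return np.linalg.norm(pts - pts.mean(axis=0), axis=1).mean()
```

A value below 0.10 would then indicate an unambiguous suffix, per slide 27.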

  29. [Table: suffixes grouped into two columns, by average distance to the mean: ≤ 0.10 vs. > 0.10.]

  30. Conclusion • The technique appears to work appropriately for the task. • But we suspect that the actual use of the technique is much more interesting and open-ended than this (simple) application suggests.
