Language Independent Methods of Clustering Similar Contexts (with applications) Ted Pedersen University of Minnesota, Duluth http://www.d.umn.edu/~tpederse tpederse@d.umn.edu EACL-2006 Tutorial
Language Independent Methods • Do not utilize syntactic information • No parsers, part of speech taggers, etc. required • Do not utilize dictionaries or other manually created lexical resources • Based on lexical features selected from corpora • Assumption: word segmentation can be done by looking for white spaces between strings • No manually annotated data of any kind, methods are completely unsupervised in the strictest sense
Clustering Similar Contexts • A context is a short unit of text • often a phrase to a paragraph in length, although it can be longer • Input: N contexts • Output: K clusters • Each member of a cluster is a context that is more similar to the other members of its cluster than to the contexts found in other clusters
Applications • Headed contexts (contain target word) • Name Discrimination • Word Sense Discrimination • Headless contexts • Email Organization • Document Clustering • Paraphrase identification • Clustering Sets of Related Words
Tutorial Outline • Identifying lexical features • Measures of association & tests of significance • Context representations • First & second order • Dimensionality reduction • Singular Value Decomposition • Clustering • Partitional techniques • Cluster stopping • Cluster labeling • Hands On Exercises
General Info • Please fill out short survey • Break from 4:00-4:30pm • Finish at 6pm • Reception tonight at 7pm at Castle (?) • Slides and video from tutorial will be posted (I will send you email when that is ready) • Questions are welcome • Now, or via email to me or SenseClusters list. • Comments, observations, criticisms are all welcome • Knoppix CD, will give you Linux and SenseClusters when computer is booted from the CD.
SenseClusters • A package for clustering contexts • http://senseclusters.sourceforge.net • SenseClusters Live! (Knoppix CD) • Integrates with various other tools • Ngram Statistics Package • CLUTO • SVDPACKC
Many thanks… • Amruta Purandare (M.S., 2004) • Founding developer of SenseClusters (2002-2004) • Now PhD student in Intelligent Systems at the University of Pittsburgh http://www.cs.pitt.edu/~amruta/ • Anagha Kulkarni (M.S., 2006, expected) • Enhancing SenseClusters since Fall 2004! • http://www.d.umn.edu/~kulka020/ • National Science Foundation (USA) for supporting Amruta, Anagha and me via CAREER award #0092784
Background and Motivations
Headed and Headless Contexts • A headed context includes a target word • Our goal is to cluster the target words based on their surrounding contexts • Target word is center of context and our attention • A headless context has no target word • Our goal is to cluster the contexts based on their similarity to each other • The focus is on the context as a whole
Headed Contexts (input) • I can hear the ocean in that shell. • My operating system shell is bash. • The shells on the shore are lovely. • The shell command line is flexible. • The oyster shell is very hard and black.
Headed Contexts (output) • Cluster 1: • My operating system shell is bash. • The shell command line is flexible. • Cluster 2: • The shells on the shore are lovely. • The oyster shell is very hard and black. • I can hear the ocean in that shell.
Headless Contexts (input) • The new version of Linux is more stable and better support for cameras. • My Chevy Malibu has had some front end troubles. • Osborne made one of the first personal computers. • The brakes went out, and the car flew into the house. • With the price of gasoline, I think I’ll be taking the bus more often!
Headless Contexts (output) • Cluster 1: • The new version of Linux is more stable and better support for cameras. • Osborne made one of the first personal computers. • Cluster 2: • My Chevy Malibu has had some front end troubles. • The brakes went out, and the car flew into the house. • With the price of gasoline, I think I’ll be taking the bus more often!
Web Search as Application • Web search results are headed contexts • Search term is target word (found in snippets) • Web search results are often disorganized – two people sharing same name, two organizations sharing same abbreviation, etc. often have their pages “mixed up” • If you click on search results or follow links in pages found, you will encounter headless contexts too…
Name Discrimination
George Millers!
Email Foldering as Application • Email (public or private) is made up of headless contexts • Short, usually focused… • Cluster similar email messages together • Automatic email foldering • Take all messages from sent-mail file or inbox and organize into categories
Clustering News as Application • News articles are headless contexts • Entire article or first paragraph • Short, usually focused • Cluster similar articles together
What is it to be “similar”? • You shall know a word by the company it keeps • Firth, 1957 (Studies in Linguistic Analysis) • Meanings of words are (largely) determined by their distributional patterns (Distributional Hypothesis) • Harris, 1968 (Mathematical Structures of Language) • Words that occur in similar contexts will have similar meanings (Strong Contextual Hypothesis) • Miller and Charles, 1991 (Language and Cognitive Processes) • Various extensions… • Similar contexts will have similar meanings, etc. • Names that occur in similar contexts will refer to the same underlying person, etc.
General Methodology • Represent contexts to be clustered using first or second order feature vectors • Lexical features • Reduce dimensionality to make vectors more tractable and/or understandable • Singular value decomposition • Cluster the context vectors • Find the number of clusters • Label the clusters • Evaluate and/or use the contexts!
Identifying Lexical Features Measures of Association and Tests of Significance
What are features? • Features represent the (hopefully) salient characteristics of the contexts to be clustered • Eventually we will represent each context as a vector, where the dimensions of the vector are associated with features • Vectors/contexts that include many of the same features will be similar to each other
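As a rough illustration of the idea that contexts sharing many features are similar, the sketch below builds binary feature vectors and compares them with cosine similarity. The feature list, the example contexts, and the function names are all hypothetical choices made for this illustration; they are not taken from SenseClusters, and cosine is only one possible similarity measure.

```python
import math

def context_vector(context, features):
    """Binary vector: 1 if the feature word occurs in the context, else 0."""
    tokens = set(context.lower().split())  # whitespace tokenization, as assumed in the tutorial
    return [1 if f in tokens else 0 for f in features]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical feature set and contexts, chosen only to make the point visible.
features = ["operating", "system", "command", "bash", "shore", "oyster"]
v1 = context_vector("my operating system shell is bash", features)
v2 = context_vector("the shell command runs on my operating system", features)
v3 = context_vector("the oyster shell on the shore", features)

print(cosine(v1, v2))  # the two computing contexts share features
print(cosine(v1, v3))  # the computing and seashore contexts share none
```

Note that "shell" itself is deliberately left out of the feature list: it occurs in every context and so cannot help tell them apart.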
Where do features come from? • In unsupervised clustering, it is common for the feature selection data to be the same data that is to be clustered • This is not cheating, since data to be clustered does not have any labeled classes that can be used to assist feature selection • It may also be necessary, since we may need to cluster all available data, and not hold out some for a separate feature identification step • Email or news articles
Feature Selection • “Test” data – the contexts to be clustered • Assume that the feature selection data is the same as the test data, unless otherwise indicated • “Training” data – a separate corpus of held out feature selection data (that will not be clustered) • may need to use if you have a small number of contexts to cluster (e.g., web search results) • This sense of “training” due to Schütze (1998)
Lexical Features • Unigram – a single word that occurs more than a given number of times • Bigram – an ordered pair of words that occur together more often than expected by chance • Consecutive or may have intervening words • Co-occurrence – an unordered bigram • Target Co-occurrence – a co-occurrence where one of the words is the target word
Bigrams • fine wine (window size of 2) • baseball bat • house of representatives (window size of 3) • president of the republic (window size of 4) • apple orchard • Selected using a small window size (2-4 words), trying to capture a regular (localized) pattern between two words (collocation?)
Co-occurrences • tropics water • boat fish • law president • train travel • Usually selected using a larger window (7-10 words) of context, hoping to capture pairs of related words rather than collocations
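The difference between ordered bigrams in a small window and unordered co-occurrences in a larger window can be sketched as follows. This is a minimal illustration, not the Ngram Statistics Package implementation; the function name `window_pairs` and the sample sentence are assumptions. A window of 2 means adjacent words only, and a window of 3 allows one intervening word, matching the "house of representatives" example.

```python
from collections import Counter

def window_pairs(tokens, window, ordered=True):
    """Count word pairs that fall inside a window of `window` tokens.

    ordered=True  keeps (left, right) order, as for bigrams.
    ordered=False sorts each pair, as for co-occurrences.
    """
    counts = Counter()
    for i, left in enumerate(tokens):
        # The window covers the current token plus the next (window - 1) tokens.
        for right in tokens[i + 1 : i + window]:
            pair = (left, right) if ordered else tuple(sorted((left, right)))
            counts[pair] += 1
    return counts

tokens = "the house of representatives met in the house".split()

bigrams = window_pairs(tokens, window=3)                        # small window, ordered
cooccurrences = window_pairs(tokens, window=8, ordered=False)   # large window, unordered

print(bigrams[("house", "representatives")])
print(cooccurrences[("house", "the")])
```

Raw counts like these are only the starting point; a measure of association is still needed to decide which pairs occur together more often than expected by chance.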
Bigrams and Co-occurrences • Pairs of words tend to be much less ambiguous than unigrams • “bank” versus “river bank” and “bank card” • “dot” versus “dot com” and “dot product” • Trigrams and beyond occur much less frequently (Ngrams are very Zipfian) • Unigrams are noisy, but bountiful
“occur together more often than expected by chance…” • Observed frequencies for two words occurring together and alone are stored in a 2x2 matrix • Throw out bigrams that include one or two stop words • Expected values are calculated, based on the model of independence and observed values • How often would you expect these words to occur together, if they only occurred together by chance? • If two words occur together “significantly” more often than the expected value, then we conclude that the words do not occur together by chance.
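The observed 2x2 table and its expected values under independence can be sketched as below. The counts are made up for illustration, and the names `n11`, `n1p`, `np1`, `npp` (joint count, two marginal counts, total) are just one convenient notation assumed here. Under independence the expected count in each cell is (row total × column total) / grand total.

```python
def contingency_table(n11, n1p, np1, npp):
    """Observed 2x2 table from the joint count and the marginal counts.

    n11: bigrams where word1 is followed by word2
    n1p: bigrams with word1 in the first position
    np1: bigrams with word2 in the second position
    npp: total number of bigrams in the corpus
    """
    n12 = n1p - n11              # word1 followed by something other than word2
    n21 = np1 - n11              # word2 preceded by something other than word1
    n22 = npp - n1p - np1 + n11  # neither word in its position
    return [[n11, n12], [n21, n22]]

def expected_values(observed):
    """Expected cell counts if the two words were independent."""
    row = [sum(r) for r in observed]
    col = [sum(c) for c in zip(*observed)]
    total = sum(row)
    return [[row[i] * col[j] / total for j in range(2)] for i in range(2)]

# Hypothetical counts: the pair occurs 10 times in a corpus of 1000 bigrams.
observed = contingency_table(n11=10, n1p=20, np1=50, npp=1000)
expected = expected_values(observed)

print(observed)
print(expected)
```

With these counts the pair is observed 10 times where independence predicts only 1, which is the kind of gap the measures of association quantify.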
2x2 Contingency Table
Measures of Association
Interpreting the Scores… • G^2 and X^2 are asymptotically approximated by the chi-squared distribution… • This means…if you fix the marginal totals of a table, randomly generate internal cell values in the table, calculate the G^2 or X^2 scores for each resulting table, and plot the distribution of the scores, you *should* get …
Interpreting the Scores… • Scores above the critical value for a chosen level of significance can be considered grounds for rejecting the null hypothesis • H0: the words in the bigram are independent • 3.841 is the critical value of the chi-squared distribution with 1 degree of freedom at p = 0.05, i.e. 95% confidence that the null hypothesis should be rejected
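A minimal sketch of the two scores and the significance check follows, using the standard textbook definitions X^2 = Σ (observed − expected)^2 / expected and G^2 = 2 Σ observed × ln(observed / expected). The table values are hypothetical, and the function names are chosen for this illustration rather than taken from any package.

```python
import math

CRITICAL_VALUE_95 = 3.841  # chi-squared, 1 degree of freedom, p = 0.05

def expected_values(observed):
    """Expected cell counts under independence: row total * column total / grand total."""
    row = [sum(r) for r in observed]
    col = [sum(c) for c in zip(*observed)]
    total = sum(row)
    return [[row[i] * col[j] / total for j in range(2)] for i in range(2)]

def pearson_x2(observed):
    """Pearson's chi-squared statistic for a 2x2 table."""
    expected = expected_values(observed)
    return sum(
        (observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
        for i in range(2)
        for j in range(2)
    )

def log_likelihood_g2(observed):
    """Log-likelihood ratio G^2; cells with a zero count contribute nothing."""
    expected = expected_values(observed)
    total = 0.0
    for i in range(2):
        for j in range(2):
            if observed[i][j] > 0:
                total += observed[i][j] * math.log(observed[i][j] / expected[i][j])
    return 2 * total

# Hypothetical table: [[n11, n12], [n21, n22]]
observed = [[10, 10], [40, 940]]
x2 = pearson_x2(observed)
g2 = log_likelihood_g2(observed)

for name, score in (("X^2", x2), ("G^2", g2)):
    verdict = "reject H0" if score > CRITICAL_VALUE_95 else "cannot reject H0"
    print(f"{name} = {score:.3f}: {verdict}")
```

Both scores are zero when the observed counts equal the expected counts, and grow as the table departs from independence. The chi-squared approximation is less trustworthy when expected counts are very small, which is common for rare bigrams.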
Measures of Association • There are numerous measures of association that can be used to identify bigram and co-occurrence features • Many of these are supported in the Ngram Statistics Package (NSP) • http://www.d.umn.edu/~tpederse/nsp.html
Measures Supported in NSP • Log-likelihood Ratio (ll) • True Mutual Information (tmi) • Pearson’s Chi-squared Test (x2) • Pointwise Mutual Information (pmi) • Phi coefficient (phi) • T-test (tscore) • Fisher’s Exact Test (leftFisher, rightFisher) • Dice Coefficient (dice) • Odds Ratio (odds)
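Several of the simpler measures in this list can be computed directly from the four cells of the 2x2 table. The sketch below uses common textbook definitions of pointwise mutual information, the Dice coefficient, the odds ratio, and the phi coefficient; it is an assumption that these match NSP in every detail (log base, smoothing, or whether phi is squared may differ), and the function name and counts are hypothetical.

```python
import math

def association_scores(n11, n12, n21, n22):
    """Textbook association measures from a 2x2 table (assumes non-zero marginals)."""
    n1p, n2p = n11 + n12, n21 + n22  # row totals
    np1, np2 = n11 + n21, n12 + n22  # column totals
    npp = n1p + n2p                  # grand total
    m11 = n1p * np1 / npp            # expected joint count under independence

    # PMI: log of observed over expected joint count (base 2 here).
    pmi = math.log2(n11 / m11) if n11 > 0 else float("-inf")
    # Dice: joint count relative to the two marginal counts.
    dice = 2 * n11 / (n1p + np1)
    # Odds ratio: undefined when an off-diagonal cell is zero, reported as infinity here.
    odds = (n11 * n22) / (n12 * n21) if n12 * n21 > 0 else float("inf")
    # Phi: correlation between the two binary variables.
    phi = (n11 * n22 - n12 * n21) / math.sqrt(n1p * n2p * np1 * np2)
    return {"pmi": pmi, "dice": dice, "odds": odds, "phi": phi}

# Hypothetical counts for one word pair.
scores = association_scores(n11=10, n12=10, n21=40, n22=940)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

Unlike G^2 and X^2, these scores are not compared against a chi-squared critical value; they are typically used by ranking pairs and keeping those above a chosen cutoff.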