Language Independent Methods of Clustering Similar Contexts (with applications)

Presentation Transcript


  1. Language Independent Methods of Clustering Similar Contexts (with applications) Ted Pedersen University of Minnesota, Duluth tpederse@d.umn.edu http://www.d.umn.edu/~tpederse/SCTutorial.html IJCAI-2007 Tutorial

  2. Language Independent Methods • Do not utilize syntactic information • No parsers, part of speech taggers, etc. required • Do not utilize dictionaries or other manually created lexical resources • Based on lexical features selected from corpora • Assumption: word segmentation can be done by looking for white spaces between strings • No manually annotated data, methods are completely unsupervised in the strictest sense

  3. A Note on Tokenization • Default tokenization is white space separated strings • Can be redefined using regular expressions • e.g., character n-grams (4-grams) • any other valid regular expression
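Both behaviors can be sketched in a few lines of Python. This is an illustrative sketch only; the helper names are invented here, and SenseClusters' actual tokenizer is not this code:

```python
import re

def tokenize(text, pattern=r"\S+"):
    """Default tokenization: maximal runs of non-whitespace characters."""
    return re.findall(pattern, text)

def char_ngrams(text, n=4):
    """Overlapping character n-grams, with whitespace removed first."""
    s = re.sub(r"\s+", "", text)
    return [s[i:i + n] for i in range(len(s) - n + 1)]

tokens = tokenize("the cat sat")      # ['the', 'cat', 'sat']
grams = char_ngrams("the cat", n=4)   # ['thec', 'heca', 'ecat']
```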

  4. Clustering Similar Contexts • A context is a short unit of text • often a phrase to a paragraph in length, although it can be longer • Input: N contexts • Output: K clusters • where the contexts in each cluster are more similar to each other than to the contexts in other clusters

  5. Applications • Headed contexts (focus on target word) • Name Discrimination • Word Sense Discrimination • Headless contexts • Email Organization • Document Clustering • Paraphrase Identification • Clustering Sets of Related Words

  6. Tutorial Outline • Identifying Lexical Features • First Order Context Representation • native SC : context as vector of features • Second Order Context Representation • LSA : context as average of vectors of contexts • native SC : context as average of vectors of features • Dimensionality reduction • Clustering • Hands-On Experience

  7. SenseClusters • A free package for clustering contexts • http://senseclusters.sourceforge.net • SenseClusters Live! (Knoppix CD) • Perl components that integrate other tools • Ngram Statistics Package • CLUTO • SVDPACKC • PDL

  8. Many thanks… • Amruta Purandare (M.S., 2004) • Now PhD student in Intelligent Systems at the University of Pittsburgh • http://www.cs.pitt.edu/~amruta/ • Anagha Kulkarni (M.S., 2006) • Now PhD student at the Language Technologies Institute at Carnegie Mellon University • http://www.cs.cmu.edu/~anaghak/ • Ted, Amruta, and Anagha were supported by the National Science Foundation (USA) via CAREER award #0092784

  9. Background and Motivations

  10. Headed and Headless Contexts • A headed context includes a target word • Our goal is to cluster the occurrences of the target word based on their surrounding contexts • The focus is on the target word and making distinctions among word meanings • A headless context has no target word • Our goal is to cluster the contexts based on their similarity to each other • The focus is on the context as a whole and making topic level distinctions

  11. Headed Contexts (input) • I can hear the ocean in that shell. • My operating system shell is bash. • The shells on the shore are lovely. • The shell command line is flexible. • An oyster shell is very hard and black.

  12. Headed Contexts (output) • Cluster 1: • My operating system shell is bash. • The shell command line is flexible. • Cluster 2: • The shells on the shore are lovely. • An oyster shell is very hard and black. • I can hear the ocean in that shell.

  13. Headless Contexts (input) • The new version of Linux is more stable and has better support for cameras. • My Chevy Malibu has had some front-end troubles. • Osborne made one of the first personal computers. • The brakes went out, and the car flew into the house. • With the price of gasoline, I think I’ll be taking the bus more often!

  14. Headless Contexts (output) • Cluster 1: • The new version of Linux is more stable and has better support for cameras. • Osborne made one of the first personal computers. • Cluster 2: • My Chevy Malibu has had some front-end troubles. • The brakes went out, and the car flew into the house. • With the price of gasoline, I think I’ll be taking the bus more often!

  15. Web Search as Application • Snippets returned via Web search are headed contexts since they include the search term • Name ambiguity is a problem with Web search; results for different entities are mixed together • Group results into clusters where each cluster is associated with a unique underlying entity • Pages found by following search results can also be treated as headless contexts

  16. Name Discrimination

  17. George Millers!

  18.–21. [Slides 18–21 contained only images, which are not preserved in this transcript]

  22. Email Foldering as Application • Email (public or private) is made up of headless contexts • Short, usually focused… • Cluster similar email messages together • Automatic email foldering • Take all messages from a sent-mail file or inbox and organize them into categories

  23. Clustering News as Application • News articles are headless contexts • Entire article or first paragraph • Short, usually focused • Cluster similar articles together, can also be applied to blog entries and other shorter units of text

  24. What does it mean to be “similar”? • You shall know a word by the company it keeps • Firth, 1957 (Studies in Linguistic Analysis) • Meanings of words are (largely) determined by their distributional patterns (Distributional Hypothesis) • Harris, 1968 (Mathematical Structures of Language) • Words that occur in similar contexts will have similar meanings (Strong Contextual Hypothesis) • Miller and Charles, 1991 (Language and Cognitive Processes) • Various extensions… • Similar contexts will have similar meanings, etc. • Names that occur in similar contexts will refer to the same underlying person, etc.

  25. General Methodology • Represent contexts to be clustered using first or second order feature vectors • Lexical features • Reduce dimensionality to make vectors more tractable and/or understandable (optional) • Singular value decomposition • Cluster the context vectors • Find the number of clusters • Label the clusters • Evaluate and/or use the clusters!

  26. Identifying Lexical Features Measures of Association and Tests of Significance

  27. What are features? • Features are the salient characteristics of the contexts to be clustered • Each context is represented as a vector, where the dimensions are associated with features • Contexts that include many of the same features will be similar to each other

  28. Feature Selection Data • The contexts to cluster (evaluation/test data) • We may need to cluster all available data, and not hold out any for a separate feature identification step • A separate larger corpus (training data), esp. if we cluster a very small number of contexts • local training – corpus made up of headed contexts • global training – corpus made up of headless contexts • Feature selection data may be either the evaluation/test data, or a separate held-out set of training data

  29. Feature Selection Data • Test / Evaluation data : contexts to be clustered • Assume that the feature selection data is the test data, unless otherwise indicated • Training data – a separate corpus of held out feature selection data (that will not be clustered) • may need to use if you have a small number of contexts to cluster (e.g., web search results) • This sense of “training” due to Schütze (1998) • does not mean labeled • simply an extra quantity of text

  30. Lexical Features • Unigram • a single word that occurs more than X times in the feature selection data and is not in the stop list • Stop list • words that will not be used in features • usually non-content words like the, and, or, it … • may be compiled manually • may be derived automatically from a corpus of text • any word that occurs in a relatively large percentage (>10-20%) of contexts may be considered a stop word
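The last bullet can be turned into a small sketch. This is a hypothetical helper (the function name and the thresholds are illustrative, not SenseClusters code):

```python
def derive_stoplist(contexts, max_fraction=0.2):
    """Treat any word occurring in more than max_fraction of the
    contexts as a stop word (document frequency, not raw frequency)."""
    n = len(contexts)
    df = {}
    for ctx in contexts:
        for w in set(ctx.lower().split()):
            df[w] = df.get(w, 0) + 1
    return {w for w, c in df.items() if c / n > max_fraction}

contexts = ["the cat sat", "the dog ran", "the bird flew",
            "a cat and a dog", "the end"]
stoplist = derive_stoplist(contexts, max_fraction=0.5)  # {'the'}
```

Here "the" appears in 4 of 5 contexts (80%), so it crosses the threshold; content words like "cat" appear in at most 40% and survive.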

  31. Lexical Features • Bigram • an ordered pair of words that may be consecutive, or have intervening words that are ignored • the pair occurs together more than X times and/or more often than expected by chance in the feature selection data • neither word in the pair may be in the stop list • Co-occurrence • an unordered bigram • Target Co-occurrence • a co-occurrence where one of the words is the target

  32. Bigrams • Window Size of 2 • baseball bat, fine wine, apple orchard, bill clinton • Window Size of 3 • house of representatives, bottle of wine • Window Size of 4 • president of the republic, whispering in the wind • Selected using a small window size (2-4 words) • Objective is to capture a regular or localized pattern between two words (collocation?)
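Window-based bigram extraction can be sketched as follows. The function name is invented here; NSP's count.pl implements this (with many more options) in Perl:

```python
from collections import Counter

def window_bigrams(tokens, window=2):
    """Ordered pairs (w1, w2) with w2 at most (window - 1) positions
    after w1; window=2 yields only consecutive pairs."""
    pairs = Counter()
    for i, w1 in enumerate(tokens):
        for j in range(i + 1, min(i + window, len(tokens))):
            pairs[(w1, tokens[j])] += 1
    return pairs

pairs = window_bigrams("house of representatives".split(), window=3)
# ('house', 'representatives') is found despite the intervening 'of'
```

Because the pairs are ordered, ('house', 'representatives') and ('representatives', 'house') are distinct; an unordered variant of this count would give co-occurrences.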

  33. Co-occurrences • president law • the president signed a bill into law today • that law is unjust, said the president • the president feels that the law was properly applied • Usually selected using a larger window (7-10 words) of context, hoping to capture pairs of related words rather than collocations

  34. Bigrams and Co-occurrences • Pairs of words tend to be much less ambiguous than unigrams • “bank” versus “river bank” and “bank card” • “dot” versus “dot com” and “dot product” • Trigrams and longer n-grams occur much less frequently (n-gram frequencies are very Zipfian) • Unigrams occur more frequently, but are noisy

  35. “occur together more often than expected by chance…” • Observed frequencies for two words occurring together and alone are stored in a 2x2 matrix • Expected values are calculated, based on the model of independence and the observed values • How often would you expect these words to occur together, if they only occurred together by chance? • If two words occur “significantly” more often than the expected value, then we conclude they do not occur together merely by chance.
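Under the model of independence, each expected cell count is just (row total × column total) / table total. A sketch, with invented illustrative counts:

```python
def expected_counts(n11, n12, n21, n22):
    """Expected cell counts for a 2x2 contingency table under
    independence: e_ij = (row_i total * column_j total) / n."""
    n = n11 + n12 + n21 + n22
    r1, r2 = n11 + n12, n21 + n22          # row totals
    c1, c2 = n11 + n21, n12 + n22          # column totals
    return (r1 * c1 / n, r1 * c2 / n, r2 * c1 / n, r2 * c2 / n)

# Illustrative counts: the pair occurs together 10 times; word1 occurs
# 30 times and word2 occurs 40 times in 1000 bigram positions.
e11, e12, e21, e22 = expected_counts(10, 20, 30, 940)
# e11 = 30 * 40 / 1000 = 1.2, so 10 observed is well above chance
```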

  36.–38. 2x2 Contingency Table [the tables shown on these slides are not preserved in this transcript]

  39.–40. Measures of Association [the formulas shown on these slides are not preserved in this transcript]

  41. Interpreting the Scores… • The distributions of G^2 and X^2 are asymptotically approximated by the chi-squared distribution… • This means: if you fix the marginal totals of a table, randomly generate internal cell values in the table, calculate the G^2 or X^2 scores for each resulting table, and plot the distribution of the scores, you *should* get …

  42. [Slide contained only a figure, which is not preserved in this transcript]

  43. Interpreting the Scores… • Scores above the critical value for a chosen level of significance can be considered grounds for rejecting the null hypothesis • H0: the words in the bigram are independent • 3.84 (for 1 degree of freedom) is associated with 95% confidence that the null hypothesis should be rejected
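Both scores can be computed directly from the 2x2 table and compared against the critical value. A sketch with invented counts; NSP provides tested implementations of these measures:

```python
import math

def g2_x2(n11, n12, n21, n22):
    """Log-likelihood ratio G^2 and Pearson's X^2 for a 2x2 table."""
    n = n11 + n12 + n21 + n22
    obs = (n11, n12, n21, n22)
    rows = (n11 + n12, n21 + n22)
    cols = (n11 + n21, n12 + n22)
    exp = (rows[0] * cols[0] / n, rows[0] * cols[1] / n,
           rows[1] * cols[0] / n, rows[1] * cols[1] / n)
    # G^2 = 2 * sum(obs * ln(obs / exp)); zero cells contribute nothing
    g2 = 2 * sum(o * math.log(o / e) for o, e in zip(obs, exp) if o > 0)
    # X^2 = sum((obs - exp)^2 / exp)
    x2 = sum((o - e) ** 2 / e for o, e in zip(obs, exp))
    return g2, x2

g2, x2 = g2_x2(10, 20, 30, 940)
reject_h0 = x2 > 3.84  # 95% critical value at 1 degree of freedom
```

With these counts both scores far exceed 3.84, so the pair would be kept as a feature.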

  44. Measures of Association • There are numerous measures of association that can be used to identify bigram and co-occurrence features • Many of these are supported in the Ngram Statistics Package (NSP) • http://www.d.umn.edu/~tpederse/nsp.html • NSP is integrated into SenseClusters

  45. Measures Supported in NSP • Log-likelihood Ratio (ll) • True Mutual Information (tmi) • Pointwise Mutual Information (pmi) • Pearson’s Chi-squared Test (x2) • Phi coefficient (phi) • Fisher’s Exact Test (leftFisher) • T-test (tscore) • Dice Coefficient (dice) • Odds Ratio (odds)

  46. Summary • Identify lexical features based on frequency counts or measures of association – either in the data to be clustered or in a separate set of feature selection data • Language independent • Unigrams usually only selected by frequency • Remember, no labeled data from which to learn, so somewhat less effective as features than in supervised case • Bigrams and co-occurrences can also be selected by frequency, or better yet measures of association • Bigrams and co-occurrences need not be consecutive • Stop words should be eliminated • Frequency thresholds are helpful (e.g., unigram/bigram that occurs once may be too rare to be useful)

  47. References • Moore, 2004 (EMNLP) follow-up to Dunning and Pedersen on log-likelihood and exact tests http://acl.ldc.upenn.edu/acl2004/emnlp/pdf/Moore.pdf • Pedersen, Kayaalp, and Bruce, 1996 (AAAI) explanation of the exact conditional test, a stochastic simulation of exact tests http://www.d.umn.edu/~tpederse/Pubs/aaai96-cmpl.pdf • Pedersen, 1996 (SCSUG) explanation of exact tests for collocation identification, and comparison to log-likelihood http://arxiv.org/abs/cmp-lg/9608010 • Dunning, 1993 (Computational Linguistics) introduces log-likelihood ratio for collocation identification http://acl.ldc.upenn.edu/J/J93/J93-1003.pdf

  48. Context Representations First and Second Order Methods

  49. Once features selected… • We will have a set of unigrams, bigrams, co-occurrences or target co-occurrences that we believe are somehow interesting and useful • We also have the frequency counts and measure of association scores that were used in their selection • Convert contexts to be clustered into a vector representation based on these features

  50. Possible Representations • First Order Features • Native SenseClusters • each context represented by a vector of features • Second Order Co-Occurrence Features • Native SenseClusters • each word in a context replaced by a vector of co-occurring words, and these vectors averaged together • Latent Semantic Analysis • each feature in a context replaced by a vector of the contexts in which it occurs, and these vectors averaged together
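The second-order averaging step can be sketched as follows. The word vectors below are invented toy co-occurrence counts, purely for illustration:

```python
# Toy co-occurrence dimensions and word vectors (invented values,
# not derived from any real corpus).
features = ["shore", "bash", "command"]
word_vectors = {
    "shell":  [2.0, 1.0, 1.0],
    "ocean":  [3.0, 0.0, 0.0],
    "system": [0.0, 2.0, 3.0],
}

def second_order_vector(context, vectors, dim):
    """Replace each known word in the context by its co-occurrence
    vector, then average the vectors component-wise."""
    rows = [vectors[w] for w in context.lower().split() if w in vectors]
    if not rows:
        return [0.0] * dim
    return [sum(col) / len(rows) for col in zip(*rows)]

v = second_order_vector("my system shell", word_vectors, len(features))
# v is the average of the 'system' and 'shell' vectors: [1.0, 1.5, 2.0]
```

Each context becomes the centroid of its words' co-occurrence vectors, so two contexts can come out similar even if they share no words, as long as their words co-occur with the same things.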
