
Text Mining


Presentation Transcript


  1. Text Mining
  Data Mining - Volinsky - 2011 - Columbia University

  2. What is Text Mining?
  • There are many examples of text-based documents (all in 'electronic' format): e-mails, corporate Web pages, customer surveys, résumés, medical records, DNA sequences, technical papers, incident reports, news stories and more.
  • There is not enough time or patience to read them all, and some (e.g. DNA sequences) are hard to comprehend directly.
  • Can we extract the most vital kernels of information? We wish to find a way to gain knowledge (in summarised form) from all that text, without reading or examining it fully first!

  3. What is Text Mining?
  • Traditional data mining uses 'structured data' (an n x p matrix).
  • Free-form text is referred to as 'unstructured data'; successful categorisation of such data can be a difficult and time-consuming task.
  • Often, free-form text and structured data can be combined to derive valuable, actionable information (e.g. in typical surveys) - this is semi-structured data.

  4. Text Mining: Examples
  • Text mining is an exercise to gain knowledge from stores of language text.
  • Examples of text:
  • Web pages
  • Medical records
  • Customer surveys
  • Email filtering (spam)
  • DNA sequences
  • Incident reports
  • Drug interaction reports
  • News stories (e.g. predict stock movement)

  5. What is Text Mining
  • Data examples:
  • Web pages
  • Customer surveys

  6. Amazon.com

  7. Of Mice and Men: Concordance
  A concordance is an alphabetized list of the most frequently occurring words in a book, excluding common words such as "of" and "it." The font size of a word is proportional to the number of times it occurs in the book.

  8. Of Mice and Men: Text Stats

  9. Text Mining: Yahoo Buzz

  10. Text Mining: Google News

  11. Text Mining
  • Typically falls into one of two categories:
  • Analysis of text: I have a bunch of text I am interested in; tell me something about it. E.g. sentiment analysis, "buzz" searches.
  • Retrieval: there is a large corpus of text documents, and I want the one closest to a specified query. E.g. web search, library catalogs, legal and medical precedent studies.

  12. Text Mining: Analysis
  • Which words are most frequent?
  • Which words are most surprising?
  • Which words help define the document?
  • What are the interesting text phrases?

  13. Text Mining: Retrieval
  • Find the k objects in the corpus of documents which are most similar to my query.
  • Can be viewed as "interactive" data mining - the query is not specified a priori.
  • Main problems of text retrieval:
  • What does "similar" mean?
  • How do I know if I have the right documents?
  • How can I incorporate user feedback?

  14. Text Retrieval: Challenges
  • Calculating similarity is not obvious: what is the distance between two sentences or queries?
  • Evaluating retrieval is hard: what is the "right" answer? (There is no ground truth.)
  • Users can query things you have not seen before, e.g. misspelled, foreign, or new terms.
  • The goal (score function) is different from that in classification/regression: we are not looking to model all of the data, just to get the best results for a given user.
  • Words can hide semantic content:
  • Synonymy: a keyword T does not appear anywhere in the document, even though the document is closely related to T (e.g., data mining).
  • Polysemy: the same keyword may mean different things in different contexts (e.g., mining).

  15. Basic Measures for Text Retrieval
  • Precision: the percentage of retrieved documents that are in fact relevant to the query (i.e., "correct" responses).
  • Recall: the percentage of documents that are relevant to the query and were, in fact, retrieved.

  16. Precision vs. Recall
  • We've been here before!
  • Precision = TP/(TP+FP)
  • Recall = TP/(TP+FN)
  • Trade-off:
  • If the algorithm is 'picky': precision high, recall low.
  • If the algorithm is 'relaxed': precision low, recall high.
  • BUT: recall is often hard, if not impossible, to calculate, since it requires knowing the actual outcome for every document, not just the predicted outcome.
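
  As a quick illustration, here is a minimal Python sketch of the two formulas above; the TP/FP/FN counts are invented for illustration.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP/(TP+FP); Recall = TP/(TP+FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# A 'picky' retriever: returns few documents, most of them relevant.
print(precision_recall(tp=8, fp=2, fn=40))    # high precision, low recall
# A 'relaxed' retriever: returns many documents, catching most relevant ones.
print(precision_recall(tp=40, fp=160, fn=8))  # low precision, high recall
```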

  17. Precision-Recall Curves
  • If we have a labelled training set, we can calculate recall.
  • For any given number of returned documents, we can plot a point for precision vs. recall (similar to thresholds in ROC curves).
  • Different retrieval algorithms might have very different curves - it is hard to tell which is "best".

  18. Term / Document Matrix
  • The most common form of representation in text mining is the term-document matrix.
  • Term: typically a single word, but could be a word phrase like "data mining".
  • Document: a generic term meaning a collection of text to be retrieved.
  • Can be large: there are often 50k or more terms, and documents can number in the billions (the Web).
  • Entries can be binary, or use counts.

  19. Term-Document Matrix
  Example: 10 documents, 6 terms.
  • Each document is now just a vector of terms, sometimes boolean.
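
  A term-document matrix like this can be built in a few lines; here is a hedged sketch using scikit-learn, with a made-up mini-corpus standing in for the slide's 10 x 6 example.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus; the slide's example has 10 documents and 6 terms.
docs = [
    "database SQL index database",
    "database index query",
    "regression likelihood linear",
    "linear regression regression likelihood",
]

vec = CountVectorizer()        # pass binary=True for a boolean matrix instead
X = vec.fit_transform(docs)    # sparse document-term count matrix
print(vec.get_feature_names_out())
print(X.toarray())             # each row is one document's term vector
```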

  20. Term-Document Matrix
  • We have lost all semantic content.
  • Be careful constructing your term list!
  • Not all words are created equal!
  • Words that are the same should be treated the same!
  • Stop words
  • Stemming

  21. Stop Words
  • Many of the most frequently used words in English are worthless in retrieval and text mining - these words are called stop words.
  • the, of, and, to, ...
  • There are typically about 400 to 500 such words.
  • For an application, an additional domain-specific stop-word list may be constructed.
  • Why do we need to remove stop words?
  • To reduce indexing (or data) file size: stop words account for 20-30% of total word counts.
  • To improve efficiency: stop words are not useful for searching or text mining, and they always have a large number of hits.
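
  A minimal sketch of the filtering step, assuming a toy stop-word list (real lists run to several hundred entries):

```python
# Toy stop-word list for illustration; production lists have 400-500 entries.
STOP_WORDS = {"the", "of", "and", "to", "a", "in", "is", "it"}

def remove_stop_words(tokens):
    """Drop stop words before indexing, shrinking the index and cutting noise."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words("the goal of text mining is knowledge".split()))
# ['goal', 'text', 'mining', 'knowledge']
```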

  22. Stemming
  • Techniques used to find the root/stem of a word. E.g.:
  • user, users, used, using -> stem: use
  • engineering, engineered, engineer -> stem: engineer
  • Usefulness:
  • Improves effectiveness of retrieval and text mining by matching similar words.
  • Reduces indexing size: combining words with the same roots may reduce indexing size by as much as 40-50%.

  23. Basic Stemming Methods
  • Remove endings:
  • If a word ends with a consonant other than s, followed by an s, then delete the s.
  • If a word ends in es, drop the s.
  • If a word ends in ing, delete the ing unless the remaining word consists of only one letter or of "th".
  • If a word ends with ed, preceded by a consonant, delete the ed unless this leaves only a single letter.
  • ...
  • Transform words:
  • If a word ends with "ies" but not "eies" or "aies", then change "ies" to "y".
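
  These rules are easy to express directly; here is a hedged Python sketch covering just the rules listed above (real systems typically use the full Porter stemmer instead):

```python
def simple_stem(word):
    """Toy stemmer implementing only the suffix rules from the slide."""
    w = word.lower()
    if w.endswith("ies") and not w.endswith(("eies", "aies")):
        return w[:-3] + "y"                            # flies -> fly
    if w.endswith("ing") and len(w) - 3 > 1 and w[:-3] != "th":
        return w[:-3]                                  # engineering -> engineer
    if w.endswith("ed") and len(w) - 2 > 1 and w[-3] not in "aeiou":
        return w[:-2]                                  # engineered -> engineer
    if w.endswith("es"):
        return w[:-1]                                  # engines -> engine
    if w.endswith("s") and len(w) > 2 and w[-2] not in "aeious":
        return w[:-1]                                  # users -> user
    return w

print([simple_stem(w) for w in ["users", "engineered", "flies", "using"]])
# ['user', 'engineer', 'fly', 'us']  -- note 'using' -> 'us': the rules are crude
```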

  24. Feature Selection
  • The performance of text classification algorithms can be optimized by selecting only a subset of the discriminative terms, even after stemming and stop-word removal.
  • Greedy search:
  • Start from the full set and delete one term at a time.
  • Find the least important variable (the Gini index can be used for this in a classification problem).
  • Often performance does not degrade even with order-of-magnitude reductions.
  • Chakrabarti, Chapter 5: patent data, 9,600 patents in communication, electricity and electronics. Only 140 out of 20,000 terms were needed for classification!
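
  As a hedged sketch of the idea, using a chi-squared filter in place of the slide's Gini-based greedy search, and an invented four-document corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

texts = ["ball goal team win the match",
         "team goal referee match",
         "stock market price rally",
         "the market price fell on stock news"]
y = ["sport", "sport", "finance", "finance"]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(texts)
selector = SelectKBest(chi2, k=4).fit(X, y)  # keep the 4 most discriminative terms
print(vec.get_feature_names_out()[selector.get_support()])
```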

  25. Distances in TD Matrices
  • Given a term-document matrix representation, we can now define distances between documents (or terms!).
  • Elements of the matrix can be 0/1 or term frequencies (sometimes normalized).
  • We can use Euclidean or cosine distance.
  • Cosine distance is based on the angle between the two vectors. It is not intuitive, but has been proven to work well.
  • If two docs are the same, dc = 1; if they have nothing in common, dc = 0.

  26. We can calculate cosine and Euclidean distance for this matrix.
  • What would you want the distances to look like?
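
  A hedged sketch of both distance computations, with two illustrative document rows (the counts are stand-ins, not the slide's actual matrix):

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between two term vectors:
    1 = same direction, 0 = no terms in common (for count vectors)."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Two illustrative document rows from a term-document count matrix.
d1 = np.array([24.0, 21, 9, 0, 0, 3])
d2 = np.array([32.0, 10, 5, 0, 3, 0])
print(cosine_sim(d1, d2))        # near 1: similar mix of terms
print(np.linalg.norm(d1 - d2))   # Euclidean distance is sensitive to doc length
```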

  27. Document Distance
  • Pairwise distances between documents.
  • Image plots of cosine distance, Euclidean distance, and scaled Euclidean distance (R function: 'image').

  28. Weighting in TD Space
  • Not all terms are of equal importance: e.g. "David" is less important than "Beckham".
  • If a term occurs frequently in many documents, it has less discriminatory power.
  • One way to correct for this is inverse document frequency (IDF): IDF = log(N / Nj), where Nj = # of docs containing the term and N = total # of docs.
  • Term importance = Term Frequency (TF) x IDF.
  • A term is "important" if it has a high TF and/or a high IDF. TF x IDF is a common measure of term importance.
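
  A minimal sketch of TF x IDF weighting under the log(N/Nj) definition above (the count matrix is illustrative):

```python
import numpy as np

def tf_idf(td):
    """Weight a term-document count matrix by TF x IDF.

    td: (n_docs, n_terms) array of raw counts.
    IDF_j = log(N / N_j), N_j = number of docs containing term j.
    Assumes every term occurs in at least one document.
    """
    n_docs = td.shape[0]
    doc_freq = (td > 0).sum(axis=0)   # N_j for each term
    idf = np.log(n_docs / doc_freq)
    return td * idf                   # broadcast IDF across each column

td = np.array([[24.0, 21, 9,  0, 0,  3],
               [32.0, 10, 5,  0, 3,  0],
               [ 2.0,  0, 0, 18, 7, 16]])
print(np.round(tf_idf(td), 2))   # terms present in every doc get weight 0
```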

  29. TF IDF

  30. Queries
  • A query is a representation of the user's information needs, normally a list of words.
  • Once we have a TD matrix, queries can be represented as a vector in the same space: "Database Index" = (1,0,1,0,0,0).
  • The query can be a simple question in natural language.
  • Calculate the cosine distance between the query and the TF x IDF version of the TD matrix.
  • This returns a ranked vector of documents.
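
  Putting the pieces together, a hedged sketch of query ranking by cosine similarity (the matrix and query vector are the illustrative ones used above):

```python
import numpy as np

def rank_documents(query_vec, doc_matrix):
    """Return document indices sorted by cosine similarity to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    sims = doc_matrix @ q / np.linalg.norm(doc_matrix, axis=1)
    return np.argsort(-sims)          # best match first

docs = np.array([[24.0, 21, 9,  0, 0,  3],
                 [32.0, 10, 5,  0, 3,  0],
                 [ 2.0,  0, 0, 18, 7, 16]])
query = np.array([1.0, 0, 1, 0, 0, 0])   # "database index"
print(rank_documents(query, docs))       # [1 0 2]: doc 1 is the closest match
```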

  31. Latent Semantic Indexing
  • Criticism: queries can be posed in many ways, but still mean the same thing:
  • data mining and knowledge discovery
  • car and automobile
  • beet and beetroot
  • Semantically, these are the same, and documents with either term are relevant.
  • Using synonym lists or thesauri is a solution, but messy and difficult.
  • Latent Semantic Indexing (LSI) tries to extract hidden semantic structure in the documents.
  • Search what I meant, not what I said!

  32. LSI
  • Approximate the T-dimensional term space using principal components calculated from the TD matrix.
  • The first k PC directions provide the best set of k orthogonal basis vectors: they explain the most variance in the data.
  • The data is reduced to an N x k matrix, without much loss of information.
  • Each "direction" is a linear combination of the input terms, and the directions define a clustering of "topics" in the data.
  • What does this mean for our toy example?

  33. [Figure: the toy example's documents plotted on the first two principal components]

  34. LSI
  • LSI is typically done using the Singular Value Decomposition (SVD) to find the principal components:
  X = U S V^T
  where X is the 10 x 6 term-document matrix (term weighting by document), U is 10 x 6, S is the diagonal matrix of singular values, and the rows of V^T form a new orthogonal basis for the data (the PC directions).
  • For our example: S = (77.4, 69.5, 22.9, 13.5, 12.1, 4.8).
  • Fraction of the variance explained by PC1 and PC2 = (77.4^2 + 69.5^2) / (77.4^2 + 69.5^2 + 22.9^2 + 13.5^2 + 12.1^2 + 4.8^2) = 92.5%.
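
  A hedged sketch of the same computation with numpy (a small illustrative matrix stands in for the slide's 10 x 6 example, so the singular values will differ):

```python
import numpy as np

# Illustrative term-document matrix: rows = documents, columns = terms.
X = np.array([[24.0, 21, 9,  0,  0,  3],
              [32.0, 10, 5,  0,  3,  0],
              [ 2.0,  0, 0, 18,  7, 16],
              [ 0.0,  0, 1, 32, 12,  0]])

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
docs_2d = U[:, :k] * s[:k]   # each document as a point in 2-D LSI space
print(docs_2d)               # the two "database" docs land far from the other two
print(Vt[:k])                # PC directions: term loadings per pseudo-topic
print((s[:k] ** 2).sum() / (s ** 2).sum())   # fraction of variance explained
```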

  35. LSI
  • The top 2 PCs make new pseudo-terms with which to define documents.
  • We can also look at the first two principal components:
  • (0.74, 0.49, 0.27, 0.28, 0.18, 0.19) -> emphasizes the first two terms
  • (-0.28, -0.24, -0.12, 0.74, 0.37, 0.31) -> separates the two clusters
  • Note how distance from the origin shows the number of terms, and angle (from the origin) shows similarity as well.

  36. LSI
  • Here we show the same plot, but with two new documents: one with the term "SQL" 50 times, another with the term "Databases" 50 times.
  • Even though they have no terms in common, they are close in LSI space.

  37. Textual Analysis
  • Once we have the data in a nice matrix representation (TD, TD x IDF, or LSI), we can throw the data mining toolbox at it:
  • Classification of documents (if we have training data for the classes)
  • Clustering of documents (unsupervised)

  38. Automatic Document Classification
  • Motivation:
  • Automatic classification for the tremendous number of on-line text documents (Web pages, e-mails, etc.).
  • Customer comments: requests for info, complaints, inquiries.
  • A classification problem:
  • Training set: human experts generate a training data set.
  • Classification: the computer system discovers the classification rules.
  • Application: the discovered rules can be applied to classify new/unknown documents.
  • Techniques: linear/logistic regression, naïve Bayes. Trees are not so good here, due to the massive dimension and few interactions.

  39. Naïve Bayes Classifier for Text
  • The naïve Bayes classifier is a conditional independence model, also called "multivariate Bernoulli".
  • It assumes conditional independence given the class: p(x | ck) = ∏j p(xj | ck)
  • Note that we model each term xj as a discrete random variable.
  • In other words, the probability that a bunch of words comes from a given class equals the product of the individual probabilities of those words.

  40. Multinomial Classifier for Text
  • The multinomial classification model assumes that the data are generated by a p-sided die (a multinomial model):
  p(x | ck) = (Nx! / ∏j xj!) ∏j p(term j | ck)^xj
  • where Nx = number of terms (total count) in document x, xj = number of times term j occurs in the document, and ck = class k.
  • Based on training data, each class has its own multinomial probability across all words.

  41. Naïve Bayes vs. Multinomial
  • There are many extensions and adaptations of both; text mining classification models are usually a version of one of these.
  • Example: Web pages.
  • Classify webpages from CS departments into: student, faculty, course, project.
  • Train on ~5,000 hand-labeled web pages from Cornell, Washington, U. Texas, and Wisconsin.
  • Crawl and classify a new site (CMU).
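
  A hedged scikit-learn sketch contrasting the two models on an invented toy version of this task (the real experiment used thousands of labeled pages):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented mini training set standing in for the hand-labeled web pages.
pages = ["homework syllabus lecture exam grading",
         "office hours research publications teaching",
         "project repository source code demo",
         "lecture notes exam syllabus"]
labels = ["course", "faculty", "project", "course"]

# BernoulliNB models term presence/absence; MultinomialNB models term counts.
for nb in (BernoulliNB(), MultinomialNB()):
    model = make_pipeline(CountVectorizer(), nb)
    model.fit(pages, labels)
    print(type(nb).__name__, model.predict(["exam lecture slides"]))
```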

  42. NB vs. multinomial

  43. Highest-Probability Terms in Multinomial Distributions
  Classifying web pages at a university:

  44. Document Clustering
  • We can also do clustering, i.e. unsupervised learning of docs: automatically group related documents based on their content, requiring no training sets or predetermined taxonomies.
  • Major steps:
  • Preprocessing: remove stop words, stem, feature extraction, lexical analysis, ...
  • Hierarchical clustering: compute similarities, apply clustering algorithms, ...
  • Slicing: fan-out controls; flatten the tree to the desired number of levels.
  • Like all clustering examples, success is relative.
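
  A hedged sketch of the hierarchical-clustering and slicing steps, using scipy on a TF-IDF matrix (the four documents are invented):

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["sql database index query",
        "database index storage engine",
        "regression likelihood linear model",
        "linear regression model fitting"]

X = TfidfVectorizer(stop_words="english").fit_transform(docs).toarray()
Z = linkage(X, method="average", metric="cosine")  # hierarchical clustering
print(fcluster(Z, t=2, criterion="maxclust"))      # "slice" the tree into 2 clusters
```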

  45. Document Clustering
  • To cluster, we can use LSI.
  • Another model: Latent Dirichlet Allocation (LDA).
  • LDA is a generative probabilistic model of a corpus. Documents are represented as random mixtures over latent topics, where a topic is characterized by a distribution over words.
  • LDA has three concepts: words, topics, and documents.
  • Documents are a collection of words and have a probability distribution over topics.
  • Topics have a probability distribution over words.
  • It is a fully Bayesian model.

  46. LDA
  • Assume the data was generated by a generative process:
  • θ is a document's mixture of topics, drawn from a probability distribution.
  • z is a topic, which generates words from its own probability distribution.
  • w is a word, the only real observable (N = number of words in all documents).
  • The LDA equations are then specified in a fully Bayesian model, where α is the Dirichlet prior on the per-document topic distributions.
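
  A hedged sketch of fitting LDA with scikit-learn (which uses variational inference rather than the exact scheme in Blei et al.); the four-document corpus is invented:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["sql database index query database",
        "database storage index engine",
        "regression likelihood linear model",
        "linear model fitting likelihood"]

vec = CountVectorizer()
X = vec.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
theta = lda.fit_transform(X)    # per-document topic mixtures (the θ above)
print(theta.round(2))

terms = vec.get_feature_names_out()
for k, topic in enumerate(lda.components_):   # per-topic word weights
    print(f"topic {k}:", [terms[i] for i in topic.argsort()[::-1][:3]])
```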

  47. The model can be fit via advanced computational techniques; see Blei et al., 2003.

  48. LDA Output
  • The result can be an often-useful classification of documents into topics, and a distribution of each topic across words.

  49. Another Look at LDA
  • Model: topics, made up of words, are used to generate documents.

  50. Another Look at LDA
  • Reality: we observe documents and infer the topics.
