## Chapter 7: Text mining


Bing Liu

### Text mining

- Text mining refers to data mining using text documents as data.
- There are many special techniques for pre-processing text documents to make them suitable for mining.
- Most of these techniques come from the field of Information Retrieval.

### Information Retrieval (IR)

- Conceptually, information retrieval (IR) is the study of finding needed information, i.e., IR helps users find information that matches their information needs.
- Historically, information retrieval is about document retrieval, emphasizing the document as the basic unit.
- Technically, IR studies the acquisition, organization, storage, retrieval, and distribution of information.
- IR has become a center of focus in the Web era.

### Information

*(Diagram: the user translates information needs into queries and searches/selects over stored information; the system matches queries to the stored information; query result evaluation asks whether the information found matches the user's information needs.)*

### Text Processing

- Word (token) extraction
- Stop words
- Stemming
- Frequency counts

### Stop words

- Many of the most frequently used words in English are worthless in IR and text mining; these words are called stop words.
  - Examples: the, of, and, to, ...
  - There are typically about 400 to 500 such words.
  - For a given application, an additional domain-specific stop-word list may be constructed.
- Why do we need to remove stop words?
  - Reduce indexing (or data) file size
    - stop words account for 20-30% of total word counts.
  - Improve efficiency
    - stop words are not useful for searching or text mining
    - stop words always have a large number of hits

### Stemming

- Techniques used to find the root/stem of a word. E.g.:
  - user, users, used, using → stem: use
  - engineering, engineered, engineer → stem: engineer
- Usefulness:
  - improving the effectiveness of IR and text mining
    - matching similar words
  - reducing indexing size
    - combining words with the same root may reduce indexing size by as much as 40-50%.

### Basic stemming methods

- Remove endings:
  - if a word ends with a consonant other than s, followed by an s, then delete the s.
  - if a word ends in es, drop the s.
  - if a word ends in ing, delete the ing unless the remaining word consists of only one letter or of "th".
  - if a word ends with ed, preceded by a consonant, delete the ed unless this leaves only a single letter.
  - ...
- Transform words:
  - if a word ends with "ies" but not "eies" or "aies", then change "ies" to "y".

### Frequency counts

- Count the number of times a word occurs in a document.
- Count the number of documents in a collection that contain a word.
- Use occurrence frequencies to indicate the relative importance of a word in a document:
  - if a word appears often in a document, the document likely "deals with" subjects related to that word.

### Vector Space Representation

- A document is represented as a vector: (W1, W2, ..., Wn).
- Binary:
  - Wi = 1 if the corresponding term i (often a word) is in the document
  - Wi = 0 if term i is not in the document
- TF (Term Frequency):
  - Wi = tfi, where tfi is the number of times term i occurred in the document
- TF*IDF (Term Frequency * Inverse Document Frequency):
  - Wi = tfi * idfi = tfi * log(N/dfi), where dfi is the number of documents containing term i, and N is the total number of documents in the collection.

### Vector Space and Document Similarity

- Each indexing term is a dimension. An indexing term is normally a word.
- Each document is a vector:
  - Di = (ti1, ti2, ti3, ti4, ...
tin)
  - Dj = (tj1, tj2, tj3, tj4, ..., tjn)
- Document similarity is defined as the cosine of the angle between the two vectors:
  - Sim(Di, Dj) = Σk (tik · tjk) / (sqrt(Σk tik²) · sqrt(Σk tjk²))

### Query formats

- A query is a representation of the user's information needs.
  - Normally a list of words.
- Query as a simple question in natural language:
  - the system translates the question into executable queries.
- Query as a document:
  - "find similar documents like this one"
  - the system defines what the similarity is.

### An Example

- A document space is defined by three terms: hardware, software, users.
- A set of documents is defined as:
  - A1 = (1, 0, 0), A2 = (0, 1, 0), A3 = (0, 0, 1)
  - A4 = (1, 1, 0), A5 = (1, 0, 1), A6 = (0, 1, 1)
  - A7 = (1, 1, 1), A8 = (1, 0, 1), A9 = (0, 1, 1)
- If the query is "hardware and software", what documents should be retrieved?

### An Example (cont.)

- In Boolean query matching:
  - documents A4 and A7 will be retrieved ("AND")
  - retrieved: A1, A2, A4, A5, A6, A7, A8, A9 ("OR")
- In similarity matching (cosine):
  - q = (1, 1, 0)
  - S(q, A1) = 0.71, S(q, A2) = 0.71, S(q, A3) = 0
  - S(q, A4) = 1, S(q, A5) = 0.5, S(q, A6) = 0.5
  - S(q, A7) = 0.82, S(q, A8) = 0.5, S(q, A9) = 0.5
- Document retrieved set (with ranking) = {A4, A7, A1, A2, A5, A6, A8, A9}

### Relevance judgment for IR

- A measurement of the outcome of a search or retrieval.
- The judgment on what should or should not be retrieved.
- There is no simple answer to what is relevant and what is not: human users are needed.
  - difficult to define
  - subjective
  - depends on knowledge, needs, time, etc.
- Relevance is the central concept of information retrieval.

### Precision and Recall

- Given a query:
  - Are all retrieved documents relevant?
  - Have all the relevant documents been retrieved?
- Measures for system performance:
  - The first question is about the precision of the search.
  - The second is about the completeness (recall) of the search.
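The cosine-matching example above can be reproduced with a few lines of plain Python (a minimal sketch; the document vectors and the query are taken directly from the slides):

```python
import math

def cosine(u, v):
    """Cosine similarity between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Binary document vectors over the terms (hardware, software, users).
docs = {
    "A1": (1, 0, 0), "A2": (0, 1, 0), "A3": (0, 0, 1),
    "A4": (1, 1, 0), "A5": (1, 0, 1), "A6": (0, 1, 1),
    "A7": (1, 1, 1), "A8": (1, 0, 1), "A9": (0, 1, 1),
}
q = (1, 1, 0)  # query "hardware and software"

scores = {name: cosine(q, d) for name, d in docs.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
# ranked starts with A4 (score 1.0), then A7 (score ~0.82), as on the slides.
```

Running this reproduces the slide's numbers, e.g. S(q, A1) ≈ 0.71 and S(q, A7) ≈ 0.82.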
### Precision and Recall (cont.)

|               | Relevant | Not relevant |
|---------------|----------|--------------|
| Retrieved     | a        | b            |
| Not retrieved | c        | d            |

- P = a / (a + b)
- R = a / (a + c)
- Precision = (number of relevant documents retrieved) / (total number of documents retrieved)
- Recall = (number of relevant documents retrieved) / (number of all the relevant documents in the database)
- Precision measures how precise a search is.
  - The higher the precision, the fewer unwanted documents.
- Recall measures how complete a search is.
  - The higher the recall, the fewer missing documents.

### Relationship of R and P

- Theoretically, R and P do not depend on each other.
- Practically:
  - high recall is achieved at the expense of precision.
  - high precision is achieved at the expense of recall.
- When will P = 0? Only when none of the retrieved documents is relevant.
- When will P = 1? Only when every retrieved document is relevant.
- Depending on the application, you may want higher precision or higher recall.

### P-R diagram

*(Figure: precision-recall curves for System A, System B, and System C, with precision P on the vertical axis and recall R on the horizontal axis.)*

### Alternative measures

- Combining recall and precision, the F score:
  - F = 2PR / (R + P)
- Breakeven point: when P = R.
- These two measures are commonly used in text mining: classification and clustering.
- Accuracy is not normally used in the text domain because the set of relevant documents is almost always very small compared to the set of irrelevant documents.

### Web Search as a huge IR system

- A Web crawler (robot) crawls the Web to collect all the pages.
- Servers build a huge inverted-index database and other indexing databases.
- At query (search) time, search engines conduct different types of vector query matching.

### Different search engines

- The real differences among search engines are:
  - their indexing weight schemes
  - their query processing methods
  - their ranking algorithms
- None of these are published by any of the search engine firms.

### Vector Space Representation

- Each doc j is a vector, one component for each term (= word).
- Together they form a vector space:
  - terms are attributes
  - n docs live in this space
  - even with stop-word removal and stemming, we may have 10,000+ dimensions, or even 1,000,000+.

### Classification in Vector Space

*(Figure: training documents from the classes Government, Science, and Arts plotted as points in the vector space.)*

- Each training doc is a point (vector) labeled by its topic (= class).
- Hypothesis: docs of the same topic form a contiguous region of the space.
- Define surfaces to delineate topics in the space.

### Test doc = Government

*(Figure: a test document plotted among the Government, Science, and Arts regions; it falls in the Government region.)*

### Rocchio Classification Method

- Given the training documents, compute a prototype vector for each class.
- Given a test doc, assign it to the topic whose prototype (centroid) is nearest, using cosine similarity.

### Rocchio Classification

- Combine the document vectors into a prototype vector for each class cj:
  - cj = α · (1/|Cj|) Σ_{d∈Cj} d/||d|| − β · (1/|D−Cj|) Σ_{d∉Cj} d/||d||
- α and β are parameters that adjust the relative impact of relevant and irrelevant training examples. Normally, α = 16 and β = 4.

### Naïve Bayesian Classifier

- Given a set of training documents D, each document is considered an ordered list of words.
- wdi,k denotes the word in position k of document di, where each word is from the vocabulary V = {w1, w2, ..., w|V|}.
- Let C = {c1, c2, ..., c|C|} be the set of pre-defined classes.
- There are two naïve Bayesian models:
  - one based on the multi-variate Bernoulli model (a word occurs or does not occur in a document)
  - one based on the multinomial model (the number of word occurrences is considered)

### Naïve Bayesian Classifier (multinomial model)

- Word probability given a class, with Laplacian smoothing:
  - P(wt|cj) = (1 + Σi N(wt, di)·P(cj|di)) / (|V| + Σs Σi N(ws, di)·P(cj|di))   (1)
- Class prior probability:
  - P(cj) = Σi P(cj|di) / |D|   (2)
- N(wt, di) is the number of times the word wt occurs in document di; P(cj|di) ∈ {0, 1}, depending on the class label of di.
- Classification of a test document di:
  - P(cj|di) ∝ P(cj) · Πk P(wdi,k|cj)   (3)

### k Nearest Neighbor Classification

- To classify document d into class c:
  - define the k-neighborhood N as the k nearest neighbors of d
  - count the number n of documents in N that belong to c
  - estimate P(c|d) as n/k
- No training is needed (?). Classification time is linear in the training set size.

### Example

*(Figure: training documents from the Government, Science, and Arts classes plotted in the vector space.)*

### Example: k = 6 (6NN)

*(Figure: a test document among the Government, Science, and Arts training documents; P(Science | test doc) is estimated from its 6 nearest neighbors.)*

### Linear classifiers: Binary Classification

- Consider 2-class problems.
- Assume linear separability for now:
  - in 2 dimensions, the classes can be separated by a line
  - in higher dimensions, we need hyperplanes
- A separating hyperplane can be found by linear programming (or e.g. the perceptron):
  - the separator can be expressed as ax + by = c

### Linear programming / Perceptron

- Find a, b, c such that:
  - ax + by ≥ c for red points
  - ax + by < c for green points

### Linear Classifiers (cont.)

- Many common text classifiers are linear classifiers.
- Despite this similarity, there are large performance differences among them.
- For separable problems, there is an infinite number of separating hyperplanes. Which one do you choose?
- What to do for non-separable problems?

### Which hyperplane?

- In general, there are lots of possible solutions for a, b, c.
- The Support Vector Machine (SVM) finds an optimal solution.

### Support Vector Machine (SVM)

*(Figure: the support vectors lie on the margin boundaries; the margin around the separating hyperplane is maximized.)*

- SVMs maximize the margin around the separating hyperplane.
- The decision function is fully specified by a subset of the training samples, the support vectors.
- Finding it is a quadratic programming problem.
- SVMs are very good for text classification.

### Optimal hyperplane

- Let the training examples be (xi, yi), i = 1, 2, ..., n, where xi is an n-dimensional vector and yi is its class, -1 or 1.
- The class represented by the subset with yi = -1 and the class represented by the subset with yi = +1 are linearly separable if there exists (w, b) such that:
  - wTxi + b ≥ 0 for yi = +1
  - wTxi + b < 0 for yi = -1
- The margin of separation m is the separation between the hyperplane wTx + b = 0 and the closest data points (the support vectors).
- The goal of an SVM is to find the optimal hyperplane with the maximum margin of separation.

### A Geometrical Interpretation

*(Figure: the decision boundary between Class 1 and Class 2, with margin m.)*

- The decision boundary should be as far away from the data of both classes as possible.
- We maximize the margin m.

### SVM formulation: separable case

- Thus, support vector machines (SVMs) are linear functions of the form f(x) = wTx + b, where w is the weight vector and x is the input vector.
- To find the linear function:
  - Minimize: (1/2)·wTw
  - Subject to: yi(wTxi + b) ≥ 1, i = 1, ..., n
- This is a quadratic programming problem.

### Non-separable case: soft margin SVM

- To deal with cases where there may be no separating hyperplane, e.g. due to noisy labels among both positive and negative training examples, the soft margin SVM is used:
  - Minimize: (1/2)·wTw + C·Σi ξi
  - Subject to: yi(wTxi + b) ≥ 1 − ξi, with ξi ≥ 0, i = 1, ..., n
- C > 0 is a parameter that controls the amount of training error allowed.

### Illustration: non-separable case

*(Figure: support vectors in the non-separable case: (1) margin support vectors with ξi = 0, classified correctly; (2) non-margin support vectors with ξi < 1, classified correctly but inside the margin; (3) non-margin support vectors with ξi > 1, classified incorrectly.)*

### Extension to non-linear decision surfaces

*(Figure: a mapping Φ(·) takes points from the input space to the feature space.)*

- In general, complex real-world applications may not be expressible with linear functions.
- Key idea: transform xi into a higher dimensional space to "make life easier".
  - Input space: the space the xi are in.
  - Feature space: the space of Φ(xi) after transformation.

### Kernel Trick

- The mapping function Φ(·) is used to project the data into a higher dimensional feature space:
  - x = (x1, ..., xn) ↦ Φ(x) = (φ1(x), ..., φN(x))
- In the higher dimensional space, the data are more likely to be linearly separable.
- In an SVM, the projection can be done implicitly rather than explicitly, because the optimization does not actually need the explicit projection.
  - It only needs a way to compute inner products between pairs of training examples (e.g., x, z):
  - Kernel: K(x, z) = ⟨Φ(x), Φ(z)⟩
- If you know how to compute K, you do not need to know Φ.

### Comments on SVMs

- SVMs are seen as the best-performing method by many.
- The statistical significance of most results is not clear.
- Kernels are an elegant and efficient way to map data into a better representation.
- SVMs can be expensive to train (quadratic programming).
- For text classification, a linear kernel is common and often sufficient.

### Document clustering

- We can still use the normal clustering techniques, e.g., partitional and hierarchical methods.
- Documents can be represented using the vector space model.
- For the distance function, the cosine similarity measure is commonly used.

### Summary

- Text mining applies and adapts data mining techniques to the text domain.
- A significant amount of pre-processing is needed before mining, using information retrieval techniques.
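As a closing illustration, the multinomial naïve Bayesian classifier discussed earlier can be sketched in plain Python. This is a minimal sketch under the hard-label assumption P(cj|di) ∈ {0, 1}; the toy documents and class names below are invented for illustration, not taken from the slides:

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """Estimate the class priors and the Laplace-smoothed word probabilities."""
    classes = sorted(set(labels))
    vocab = sorted({w for d in docs for w in d})
    # Prior: fraction of training documents in each class (hard labels).
    prior = {c: labels.count(c) / len(docs) for c in classes}
    # P(wt|cj) with Laplace smoothing over the vocabulary.
    cond = {}
    for c in classes:
        counts = Counter(w for d, l in zip(docs, labels) if l == c for w in d)
        total = sum(counts.values())
        for w in vocab:
            cond[(w, c)] = (1 + counts[w]) / (len(vocab) + total)
    return classes, prior, cond

def classify(doc, classes, prior, cond):
    """Pick the class maximizing P(cj) * prod_k P(w_k|cj), in log space."""
    def log_score(c):
        return math.log(prior[c]) + sum(
            math.log(cond[(w, c)]) for w in doc if (w, c) in cond)
    return max(classes, key=log_score)

# Hypothetical toy training set.
docs = [["stock", "market", "trade"], ["market", "profit"],
        ["match", "team", "goal"], ["team", "win"]]
labels = ["finance", "finance", "sports", "sports"]
model = train_nb(docs, labels)
print(classify(["market", "trade"], *model))  # prints "finance"
```

Unseen words are simply skipped at classification time; a fuller treatment would also smooth over out-of-vocabulary words.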