
Supervised learning for text


Presentation Transcript


  1. Supervised learning for text

  2. Organizing knowledge • Systematic knowledge structures • Ontologies • The Dewey Decimal System, the Library of Congress catalog, the AMS Mathematics Subject Classification, and the US Patent subject classification • Web catalogs • Yahoo! & Dmoz • Problem: manual maintenance Chakrabarti & Ramakrishnan

  3. Topic Tagging • Finding similar documents • Guiding queries • Naïve Approach: • Syntactic similarity between documents • Better approach • Topic tagging Chakrabarti & Ramakrishnan

  4. Topic Tagging • Advantages • Increase vocabulary of classes • Hierarchical visualization and browsing aids • Applications • Email/bookmark organization • News tracking • Tracking authors of anonymous texts (e.g., via the Flesch-Kincaid index) • Classifying the purpose of hyperlinks Chakrabarti & Ramakrishnan

  5. Supervised learning • Learning to assign objects to classes given examples • Learner (classifier) [Figure: A typical supervised text learning scenario.] Chakrabarti & Ramakrishnan

  6. Difference with texts • ML classification techniques were designed for structured data • Text: many features and much noise • No fixed number of columns • No categorical attribute values • Data scarcity • Larger number of class labels • Hierarchical relationships between classes, less systematic than in structured data Chakrabarti & Ramakrishnan

  7. Techniques • Nearest Neighbor Classifier • Lazy learner: remember all training instances • Decision on a test document: distribution of labels on the training documents most similar to it • Assigns large weights to rare terms • Feature selection • Removes terms in the training documents which are statistically uncorrelated with the class labels • Bayesian classifier • Fit a generative term distribution Pr(d|c) to each class c of documents {d} • Testing: the distribution most likely to have generated a test document is used to label it Chakrabarti & Ramakrishnan

  8. Other Classifiers • Maximum entropy classifier: • Estimate a direct distribution Pr(c|d) from term space to the probability of various classes • Support vector machines: • Represent classes by numbers • Construct a direct function from term space to the class variable • Rule induction: • Induce rules for classification over diverse features • E.g.: information from ordinary terms, the structure of the HTML tag tree in which terms are embedded, link neighbors, citations Chakrabarti & Ramakrishnan

  9. Other Issues • Tokenization • E.g.: replacing monetary amounts by a special token • Evaluating text classifiers • Accuracy • Training speed and scalability • Simplicity, speed, and scalability for document modifications • Ease of diagnosis, interpretation of results, and of adding human judgment and feedback (subjective criteria) Chakrabarti & Ramakrishnan

  10. Benchmarks for accuracy • Reuters • 10700 labeled documents • 10% documents with multiple class labels • OHSUMED • 348566 abstracts from medical journals • 20NG • 18800 labeled USENET postings • 20 leaf classes, 5 root level classes • WebKB • 8300 documents in 7 academic categories. • Industry • 10000 home pages of companies from 105 industry sectors • Shallow hierarchies of sector names Chakrabarti & Ramakrishnan

  11. Measures of accuracy • Assumptions • Each document is associated with exactly one class, OR • Each document is associated with a subset of classes • Confusion matrix (M) • For more than 2 classes • M[i, j]: number of test documents belonging to class i which were assigned to class j • Perfect classifier: only the diagonal elements M[i, i] would be nonzero (a small sketch follows) Chakrabarti & Ramakrishnan
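
A minimal sketch (in Python) of building such a confusion matrix from true and predicted labels; the function name and the toy labels are illustrative, not from the slides.

```python
from collections import defaultdict

def confusion_matrix(true_labels, predicted_labels):
    """M[i][j] = number of test documents of true class i that were assigned class j."""
    m = defaultdict(lambda: defaultdict(int))
    for truth, pred in zip(true_labels, predicted_labels):
        m[truth][pred] += 1
    return m

# Toy check: a perfect classifier would leave only the diagonal entries nonzero.
truth = ["sports", "politics", "sports", "tech"]
pred  = ["sports", "politics", "tech",   "tech"]
m = confusion_matrix(truth, pred)
print(m["sports"]["sports"], m["sports"]["tech"])   # 1 correct, 1 off-diagonal error
```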

  12. Evaluating classifier accuracy • Two-way ensemble • To avoid searching over the power-set of class labels in the subset scenario • Create positive and negative classes for each document d (e.g., “Sports” and “Not sports”, the latter covering all remaining documents) • Recall and precision • Contingency matrix per (d, c) pair Chakrabarti & Ramakrishnan

  13. Evaluating classifier accuracy (contd.) • Micro-averaged contingency matrix • Micro-averaged precision and recall • Equal importance for each document • Macro-averaged precision and recall • Equal importance for each class (a sketch of both averages follows) Chakrabarti & Ramakrishnan
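
A sketch, assuming single-label predictions, of how per-class true-positive, false-positive, and false-negative counts yield micro- and macro-averaged precision and recall; function and variable names are mine, not the book's.

```python
from collections import Counter

def micro_macro_pr(true_labels, predicted_labels):
    """Micro- and macro-averaged precision/recall from per-class TP/FP/FN counts."""
    classes = sorted(set(true_labels) | set(predicted_labels))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1
            fn[truth] += 1
    # Micro-averaging pools the counts, so every document carries equal weight.
    micro_p = sum(tp.values()) / max(1, sum(tp.values()) + sum(fp.values()))
    micro_r = sum(tp.values()) / max(1, sum(tp.values()) + sum(fn.values()))
    # Macro-averaging averages per-class ratios, so every class carries equal weight.
    macro_p = sum(tp[c] / max(1, tp[c] + fp[c]) for c in classes) / len(classes)
    macro_r = sum(tp[c] / max(1, tp[c] + fn[c]) for c in classes) / len(classes)
    return micro_p, micro_r, macro_p, macro_r

def f1(p, r):
    """Harmonic mean of a precision/recall pair (see the next slide)."""
    return 2 * p * r / (p + r) if p + r else 0.0
```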

  14. Evaluating classifier accuracy (contd.) • Precision-recall tradeoff • Plot of precision vs. recall: a better classifier's curve lies higher • Harmonic mean F1 = 2PR/(P + R): discards classifiers that sacrifice one measure for the other Chakrabarti & Ramakrishnan

  15. Nearest Neighbor classifiers • Intuition • Similar documents are expected to be assigned the same class label • Vector space model + cosine similarity • Training: • Index each document and remember its class label • Testing: • Fetch the k most similar documents to the given document • Majority class wins • Alternative: weighted counts, i.e., counts of classes weighted by the corresponding similarity measure • Alternative: per-class offset b_c, tuned by testing the classifier on a portion of the training data held out for this purpose (a sketch follows) Chakrabarti & Ramakrishnan
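
A minimal sketch of the scheme above: unit-length TF-IDF vectors, cosine similarity, and a similarity-weighted vote over the k nearest training documents. The smoothed IDF and all helper names are my own choices, not prescribed by the slides.

```python
import math
from collections import Counter, defaultdict

def build_tfidf(train_docs):
    """train_docs: list of token lists. Returns (vectorize, training vectors)."""
    n = len(train_docs)
    df = Counter(t for doc in train_docs for t in set(doc))

    def vectorize(doc):
        tf = Counter(doc)
        v = {t: tf[t] * math.log((1 + n) / (1 + df.get(t, 0))) for t in tf}  # smoothed IDF
        norm = math.sqrt(sum(w * w for w in v.values())) or 1.0
        return {t: w / norm for t, w in v.items()}                           # unit length

    return vectorize, [vectorize(d) for d in train_docs]

def knn_classify(query_vec, train_vecs, train_labels, k=3):
    """Label a query by a similarity-weighted vote over its k most similar training docs."""
    def cosine(a, b):                       # plain dot product: vectors are unit length
        return sum(w * b.get(t, 0.0) for t, w in a.items())
    sims = sorted(((cosine(query_vec, v), lbl) for v, lbl in zip(train_vecs, train_labels)),
                  key=lambda p: p[0], reverse=True)[:k]
    votes = defaultdict(float)
    for sim, label in sims:
        votes[label] += sim                 # weighted counts instead of a plain majority
    return max(votes, key=votes.get)

# Hypothetical toy usage
vectorize, train_vecs = build_tfidf([["win", "goal", "team"], ["vote", "poll", "election"]])
print(knn_classify(vectorize(["team", "goal"]), train_vecs, ["sports", "politics"], k=1))
```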

  16. Nearest neighbor classification Chakrabarti & Ramakrishnan

  17. Pros • Easy availability and reuse of the inverted index • Collection updates are trivial • Accuracy comparable to the best known classifiers Chakrabarti & Ramakrishnan

  18. Cons • Iceberg category questions • Classification involves as many inverted index lookups as there are distinct terms in d_q • Scoring the (possibly large number of) candidate documents which overlap with d_q in at least one word • Sorting by overall similarity • Picking the best k documents • Space overhead and redundancy • Data stored at the level of individual documents • No distillation Chakrabarti & Ramakrishnan

  19. Workarounds • To reduce space requirements and speed up classification • Find clusters in the data • Store only a few statistical parameters per cluster • Compare with documents in only the most promising clusters • But again: • Ad-hoc choices for the number and size of clusters and parameters • k is corpus-sensitive Chakrabarti & Ramakrishnan

  20. TF-IDF • TF-IDF weighting is computed over the whole corpus • Interclass correlations and per-class term frequencies go unaccounted for • Terms which occur relatively frequently in some classes compared to others should have higher importance • Overall rarity in the corpus is not as important Chakrabarti & Ramakrishnan

  21. Feature selection • Data sparsity: • Term distributions could be estimated reliably if the training set were large enough • Not the case, however… • Vocabulary size ≫ number of documents • For Reuters, only about 10,300 documents are available • Over-fitting problem • A joint distribution may fit the training instances… • But may not fit unforeseen test data that well Chakrabarti & Ramakrishnan

  22. Marginals rather than joint • Marginal distribution of each term in each class • Empirical distributions may still not reflect actual distributions if data is sparse • Therefore, feature selection • Purposes: • Improve accuracy by avoiding overfitting • Maintain accuracy while discarding as many features as possible, saving a great deal of space for storing statistics • Heuristic, guided by linguistic and domain knowledge, or statistical Chakrabarti & Ramakrishnan

  23. Feature selection • Perfect feature selection • Goal-directed • Pick all possible subsets of features • For each subset, train and test a classifier • Retain the subset which resulted in the highest accuracy • COMPUTATIONALLY INFEASIBLE • Simple heuristics • Stop words like “a”, “an”, “the”, etc. • Empirically chosen thresholds (task- and corpus-sensitive) for ignoring “too frequent” or “too rare” terms; discard such terms (a small sketch follows this slide) • For larger and more complex data sets, especially topic hierarchies, frequent terms can be confused with stop words • Greedy inclusion (bottom-up) vs. top-down Chakrabarti & Ramakrishnan
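
A small sketch of the frequency-threshold heuristic; the cut-off values below are arbitrary placeholders and would have to be tuned per task and corpus, as the slide notes.

```python
from collections import Counter

def prune_vocabulary(docs, min_df=3, max_df_frac=0.5):
    """Drop terms that occur in too few or too many documents.
    min_df and max_df_frac are illustrative knobs, not recommended settings."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    return {t for t, c in df.items() if c >= min_df and c / n <= max_df_frac}
```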

  24. Greedy inclusion algorithm • Most commonly used in text • Algorithm: • Compute, for each term, a measure of discrimination amongst classes • Arrange the terms in decreasing order of this measure • Retain a number of the best terms or features for use by the classifier • Greedy because • The measure of discrimination of a term is computed independently of other terms • Over-inclusion: mild effects on accuracy Chakrabarti & Ramakrishnan

  25. Measure of discrimination • Depends on • The model of documents • The desired speed of training • Ease of updates to documents and class assignments • Observations • The feature sets needed for acceptable accuracy tend to have large overlap across the different measures Chakrabarti & Ramakrishnan

  26. The χ² test • Similar to the likelihood ratio test • Build a 2 x 2 contingency matrix per class-term pair • Under the independence hypothesis, expected cell counts are computed from the marginals • χ² = Σ (O − E)² / E aggregates the deviations of observed values O from expected values E • The larger the value of χ², the lower is our belief that the independence assumption is upheld by the observed data (a sketch follows) Chakrabarti & Ramakrishnan
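
A sketch of the per class-term statistic, using the closed form of χ² for a 2 x 2 contingency table and binary term occurrence; `docs` is assumed to be a list of token sets, and the function name is mine.

```python
def chi_squared(docs, labels, term, cls):
    """Chi-squared statistic for the 2 x 2 contingency table of one term and one class."""
    n = len(docs)
    a = sum(1 for d, l in zip(docs, labels) if term in d and l == cls)      # term present, in class
    b = sum(1 for d, l in zip(docs, labels) if term in d and l != cls)      # term present, other class
    c = sum(1 for d, l in zip(docs, labels) if term not in d and l == cls)  # term absent, in class
    d = n - a - b - c                                                       # term absent, other class
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom
```

Ranking terms by this value and sweeping the cut-off, as on the next slide, completes the selection procedure.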

  27. The χ² test (contd.) • Feature selection process • Sort terms in decreasing order of their χ² values • Train several classifiers with varying numbers of features • Stop at the point of maximum accuracy Chakrabarti & Ramakrishnan

  28. Mutual information • Useful when the multinomial document model is used • X and Y are discrete random variables taking values x, y • Mutual information (MI) between them is defined as MI(X, Y) = Σ_{x,y} Pr(x, y) log [ Pr(x, y) / (Pr(x) Pr(y)) ] • A measure of the extent of dependence between random variables • The extent to which the joint deviates from the product of the marginals • Weighted by the distribution mass at (x, y) Chakrabarti & Ramakrishnan

  29. Mutual Information • Advantages • To the extent MI(X, Y) is large, X and Y are dependent • Deviations from independence at rare values of (x, y) are played down • Interpretations • Reduction in the entropy of Y given X: MI(X, Y) = H(X) − H(X|Y) = H(Y) − H(Y|X) • KL distance between the no-independence hypothesis and the independence hypothesis • The KL distance gives the average number of bits wasted by encoding events from the ‘correct’ distribution using a code based on a not-quite-right distribution Chakrabarti & Ramakrishnan

  30. Feature selection with MI • Fix a term t and let x_t be an event associated with that term • E.g.: for the binary model, x_t ∈ {0, 1} • Pr(x_t) = the empirical fraction of documents in the training set in which the event x_t occurred • Pr(x_t, c) = the empirical fraction of training documents which are in class c and exhibit event x_t • Pr(c) = fraction of training documents belonging to class c • Formula: MI(X_t, C) = Σ_{x_t} Σ_c Pr(x_t, c) log [ Pr(x_t, c) / (Pr(x_t) Pr(c)) ] • Problem: document lengths are not normalized (a sketch follows) Chakrabarti & Ramakrishnan
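
A sketch of the MI computation for the binary term-occurrence event, estimated from the empirical fractions defined above; the function name is illustrative.

```python
import math
from collections import Counter

def term_class_mi(docs, labels, term):
    """MI between the binary term-occurrence event x_t and the class label C."""
    n = len(docs)
    joint = Counter((int(term in d), c) for d, c in zip(docs, labels))
    px, pc = Counter(), Counter()
    for (x, c), cnt in joint.items():
        px[x] += cnt
        pc[c] += cnt
    mi = 0.0
    for (x, c), cnt in joint.items():
        p_xc = cnt / n                       # empirical Pr(x_t, c)
        mi += p_xc * math.log(p_xc / ((px[x] / n) * (pc[c] / n)))
    return mi
```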

  31. Fisher's discrimination index • Useful when documents are scaled to constant length • Term occurrences are regarded as fractional real numbers • E.g.: two-class case • Let X and Y be the sets of length-normalized document vectors corresponding to the two classes • Let μ_X and μ_Y be the centroids of the two classes • Let the covariance matrices be Σ_X and Σ_Y, e.g. Σ_X = (1/|X|) Σ_{x∈X} (x − μ_X)(x − μ_X)ᵀ Chakrabarti & Ramakrishnan

  32. Fisher's discrimination index (contd.) • Goal: find a projection of the data sets X and Y onto a line such that • the two projected centroids are far apart compared to the spread of the point sets projected onto the same line • Find a column vector α such that the ratio of • the square of the difference in mean vectors projected onto it, (αᵀ(μ_X − μ_Y))², • to the average projected variance, (αᵀΣ_Xα + αᵀΣ_Yα)/2, • is maximized • This gives α ∝ (Σ_X + Σ_Y)⁻¹(μ_X − μ_Y) Chakrabarti & Ramakrishnan

  33. Fisher's discrimination index • Formula: α = S⁻¹(μ_X − μ_Y), where S = Σ_X + Σ_Y • Suppose X and Y, for both the training and test data, are generated from multivariate Gaussian distributions with a common covariance • Then this value of α induces the optimal (minimum-error) classifier by suitable thresholding on αᵀq for a test point q • Problems • Inverting S would be unacceptably slow for tens of thousands of dimensions • Linear transformations would destroy already existing sparsity Chakrabarti & Ramakrishnan

  34. Solution • Recall: • The goal was to eliminate terms from consideration • Not to arrive at linear projections involving multiple terms • Regard each term t as providing a candidate direction α_t which is parallel to the corresponding axis in the vector space model • Compute the Fisher index of t along that axis alone Chakrabarti & Ramakrishnan

  35. FI : Solution (contd.) • Formula, for the two-class case: FI(t) = (μ_{X,t} − μ_{Y,t})² / (σ²_{X,t} + σ²_{Y,t}), where μ_{X,t} and σ²_{X,t} are the mean and variance of term t's normalized weight over documents in class X • Can be generalized to a set {c} of more than two classes by summing squared pairwise mean differences over the per-class variances • Feature selection • Terms are sorted in decreasing order of FI(t) • Best ones chosen as features (a sketch follows) Chakrabarti & Ramakrishnan
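
A sketch of the two-class, per-term Fisher index, assuming each document is a length-normalized dict of term weights; the small smoothing constant in the denominator is my addition to avoid division by zero.

```python
def fisher_index(class_x_docs, class_y_docs, term):
    """Squared gap between class means of a term's weight, over summed class variances."""
    def mean_var(docs):
        vals = [d.get(term, 0.0) for d in docs]
        mu = sum(vals) / len(vals)
        return mu, sum((v - mu) ** 2 for v in vals) / len(vals)
    mu_x, var_x = mean_var(class_x_docs)
    mu_y, var_y = mean_var(class_y_docs)
    return (mu_x - mu_y) ** 2 / (var_x + var_y + 1e-12)   # small constant avoids 0/0
```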

  36. Validation • How to decide a cut-off rank? • Validation approach • A portion of the training documents is held out • The rest is used to do term ranking • The held-out set is used as a test set • Various cut-off ranks can be tested using the same held-out set • Leave-one-out cross-validation vs. partitioning the data into two • An aggregate accuracy is computed over all trials • Wrapper to search for the number of features which, taken in decreasing order of discriminative power, yields the highest accuracy Chakrabarti & Ramakrishnan

  37. Validation (contd.) • Simple search heuristic • Keep adding one feature at every step until the classifier's accuracy ceases to improve (a sketch follows) [Figure: A general illustration of wrapping for feature selection.] Chakrabarti & Ramakrishnan
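
A sketch of this heuristic, assuming a caller-supplied `train_and_evaluate(features)` that returns accuracy on a held-out set; both names are placeholders, not an API from the book.

```python
def wrap_features(ranked_terms, train_and_evaluate):
    """Greedy wrapper: add features in rank order while held-out accuracy keeps improving."""
    selected, best_acc = [], 0.0
    for term in ranked_terms:
        acc = train_and_evaluate(selected + [term])
        if acc <= best_acc:
            break                      # accuracy ceased to improve; stop adding features
        selected.append(term)
        best_acc = acc
    return selected
```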

  38. Validation (contd.) • For a naive Bayes-like classifier • Evaluation on many choices of feature sets can be done at once • For maximum entropy / support vector machines • Essentially involves training a classifier from scratch for each choice of the cut-off rank • Therefore inefficient Chakrabarti & Ramakrishnan

  39. Validation : observations • Bayesian classifiers cannot overfit much [Figure: Effect of feature selection on Bayesian classifiers.] Chakrabarti & Ramakrishnan

  40. Truncation algorithms • Start from the complete set of terms T • Keep selecting terms to drop • Till you end up with a feature subset F • Question: when should you stop truncation? • Two objectives • Minimize the size of the selected feature set F • Keep the distorted distribution Pr(C|F) as similar as possible to the original Pr(C|T) Chakrabarti & Ramakrishnan

  41. Truncation Algorithms: Example • Kullback-Leibler (KL) divergence • Measures the similarity or distance between two distributions • Markov blanket • Let X be a feature in T, and let M ⊆ T \ {X} • If the presence of M renders the presence of X unnecessary as a feature, M is a Markov blanket for X • Technically • M is called a Markov blanket for X if X is conditionally independent of (T ∪ {C}) \ (M ∪ {X}) given M • Eliminating a variable because it has a Markov blanket contained in the other existing features does not increase the KL distance between Pr(C|T) and Pr(C|F) Chakrabarti & Ramakrishnan

  42. Finding Markov Blankets • Exact Markov blankets are usually absent in practice • Finding approximate Markov blankets • Purpose: to cut down computational complexity • Restrict the search for Markov blankets M to those with at most k features • Given feature X, restrict the candidate members of M to those features which are most strongly correlated (using tests similar to the χ² or MI tests) with X • Example: for the Reuters dataset, over two-thirds of T could be discarded while increasing classification accuracy Chakrabarti & Ramakrishnan

  43. Feature Truncation algorithm • while the truncated Pr(C|F) is reasonably close to the original Pr(C|T) do • for each remaining feature X do • Identify a candidate Markov blanket M: for some tuned constant k, find the set M of k variables in F \ {X} that are most strongly correlated with X • Estimate how good a blanket M is: estimate the expected divergence between Pr(C | M, X) and Pr(C | M) • end for • Eliminate the feature having the best surviving Markov blanket • end while (a simplified sketch follows) Chakrabarti & Ramakrishnan
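
A simplified sketch of the loop above, assuming binary (0/1) term features and integer class labels. Plain correlation stands in for the χ²/MI-style choice of blanket members, the expected divergence is estimated from smoothed counts, and the stopping test is replaced by a fixed feature budget `n_keep`; it is illustrative only, not the book's implementation.

```python
import numpy as np
from collections import defaultdict

def blanket_score(x_col, m_cols, y, n_classes, eps=1e-9):
    """Expected divergence between Pr(C | M, X) and Pr(C | M), estimated from counts.
    A small score means M renders X nearly redundant (an approximate Markov blanket)."""
    counts_mx = defaultdict(lambda: np.full(n_classes, eps))
    counts_m = defaultdict(lambda: np.full(n_classes, eps))
    for m_vals, x, c in zip(map(tuple, m_cols), x_col, y):
        counts_mx[(m_vals, x)][c] += 1
        counts_m[m_vals][c] += 1
    n, score = len(y), 0.0
    for (m_vals, x), cnt in counts_mx.items():
        p_cmx = cnt / cnt.sum()                             # Pr(C | M=m, X=x)
        p_cm = counts_m[m_vals] / counts_m[m_vals].sum()    # Pr(C | M=m)
        weight = (cnt.sum() - n_classes * eps) / n          # empirical Pr(M=m, X=x)
        score += weight * float(np.sum(p_cmx * np.log(p_cmx / p_cm)))
    return score

def truncate_features(X, y, n_keep, k=2):
    """X: (n_docs, n_features) 0/1 matrix, y: integer class labels in 0..C-1.
    Repeatedly drop the feature made most redundant by its approximate Markov blanket,
    taken here to be its k most strongly correlated surviving features."""
    n_classes = int(y.max()) + 1
    corr = np.nan_to_num(np.abs(np.corrcoef(X, rowvar=False)))
    np.fill_diagonal(corr, -1.0)
    surviving = list(range(X.shape[1]))
    while len(surviving) > n_keep:
        scored = []
        for f in surviving:
            blanket = sorted((g for g in surviving if g != f),
                             key=lambda g: corr[f, g], reverse=True)[:k]
            scored.append((blanket_score(X[:, f], X[:, blanket], y, n_classes), f))
        surviving.remove(min(scored)[1])                    # drop the most redundant feature
    return surviving
```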

  44. General observations on feature selection • The issue of document length should be addressed properly • The choice of association measure does not make a dramatic difference • Greedy inclusion algorithms scale nearly linearly with the number of features • The Markov blanket technique takes time proportional to at least . • Advantage of the Markov blanket algorithm over greedy inclusion • A greedy algorithm may include features with high individual correlations even though one subsumes the other • Features individually uncorrelated with the class could be jointly more correlated with it • This rarely happens • The binary keep-or-discard view of feature selection may not be the only view to subscribe to • Suggestion: combine features into fewer, simpler ones • E.g.: project the document vectors onto a lower-dimensional space Chakrabarti & Ramakrishnan

  45. Bayesian Learner • A very practical text classifier • Assumptions • A document can belong to exactly one of a set of classes or topics • Each class c has an associated prior probability Pr(c) • There is a class-conditional document distribution Pr(d|c) for each class • Posterior probability • Obtained using Bayes rule: Pr(c|d) = Pr(c) Pr(d|c) / Σ_{c′} Pr(c′) Pr(d|c′) • The parameter set Θ consists of the parameters of all the class-conditional distributions Pr(d|c) Chakrabarti & Ramakrishnan

  46. Parameter Estimation for Bayesian Learner • The estimate of Θ is based on two sources of information: • Prior knowledge about the parameter set Θ before seeing any training documents • Terms in the training documents D • Bayes-optimal classifier • Takes the expectation of each parameter over the posterior Pr(Θ|D) • Computationally infeasible • Maximum likelihood estimate • Replace the sum above with the value of the summand Pr(c|d, Θ) at Θ = arg max_Θ Pr(D|Θ) • Works poorly Chakrabarti & Ramakrishnan

  47. Naïve Bayes Classifier • Naïve • assumption of independence between terms, • joint term distribution is the product of the marginals. • Widely used owing to • simplicity and speed of training, applying, and updating • Two kinds of widely used marginals for text • Binary model • Multinomial model Chakrabarti & Ramakrishnan

  48. Naïve Bayes Models • Binary model • Each parameter θ_{c,t} indicates the probability that a document in class c will mention term t at least once • Pr(d|c) = Π_{t∈d} θ_{c,t} · Π_{t∉d} (1 − θ_{c,t}) • Multinomial model • Each class has an associated die with |W| faces • Each parameter θ_{c,t} denotes the probability of face t turning up on tossing the die • Term t occurs n(d, t) times in document d • Document length is a random variable denoted L • Pr(d | c, L = ℓ_d) = ( ℓ_d! / Π_t n(d, t)! ) · Π_t θ_{c,t}^{n(d,t)} (a sketch of the multinomial model follows) Chakrabarti & Ramakrishnan
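
A minimal sketch of the multinomial model: one smoothed "die" over the vocabulary per class, with log-probabilities to avoid underflow (anticipating the next two slides). The class and method names are my own.

```python
import math
from collections import Counter, defaultdict

class MultinomialNaiveBayes:
    """Minimal multinomial naive Bayes text classifier."""

    def fit(self, docs, labels):                  # docs: lists of tokens
        self.vocab = {t for d in docs for t in d}
        by_class = defaultdict(list)
        for d, c in zip(docs, labels):
            by_class[c].append(d)
        n = len(docs)
        self.log_prior = {c: math.log(len(ds) / n) for c, ds in by_class.items()}
        self.log_theta = {}
        for c, ds in by_class.items():
            counts = Counter(t for d in ds for t in d)
            total = sum(counts.values())
            # Add-one (Laplace) smoothing keeps unseen terms from zeroing Pr(d|c).
            self.log_theta[c] = {t: math.log((counts[t] + 1) / (total + len(self.vocab)))
                                 for t in self.vocab}
        return self

    def predict(self, doc):
        # Sum logs instead of multiplying many tiny probabilities (see the next slide).
        def score(c):
            return self.log_prior[c] + sum(self.log_theta[c][t] for t in doc if t in self.vocab)
        return max(self.log_prior, key=score)

# Hypothetical toy usage
nb = MultinomialNaiveBayes().fit([["goal", "team", "win"], ["vote", "poll"]], ["sports", "politics"])
print(nb.predict(["team", "vote", "goal"]))       # -> "sports"
```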

  49. Analysis of Naïve Bayes Models • We multiply together a large number of small probabilities • Result: extremely tiny probabilities as answers • Solution: store all numbers as logarithms • The class which comes out at the top wins by a huge margin • Sanitizing scores using the likelihood ratio LR(d) = Pr(c|d) / (1 − Pr(c|d)) • Its logarithm is also called the logit function Chakrabarti & Ramakrishnan

  50. Parameter smoothing • What if a test document contains a term t that never occurred in any training document in class c? • Ans: θ_{c,t}, and hence Pr(d|c), will be zero • Even if many other terms clearly hint at a high likelihood of class c generating the document • Bayesian estimation • Estimating a probability from insufficient data • If you toss a coin n times and it always comes up heads, what is the probability that the (n + 1)th toss will also come up heads? • Posit a prior distribution on θ, called π(θ) • E.g.: the uniform distribution on [0, 1] • Resultant posterior distribution: π(θ|D) ∝ Pr(D|θ) π(θ) (a short worked version follows) Chakrabarti & Ramakrishnan
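
For the coin example, here is the standard worked version under a uniform prior (Laplace's law of succession), written out as a hedged aside in the slide's notation:

```latex
\[
\pi(\theta \mid D) \;\propto\; \Pr(D \mid \theta)\,\pi(\theta) \;=\; \theta^{n}\cdot 1
\;\;\Longrightarrow\;\;
\pi(\theta \mid D) = (n+1)\,\theta^{n},
\qquad
\Pr(\text{heads on toss } n{+}1 \mid D) = \int_0^1 \theta\,(n+1)\,\theta^{n}\,d\theta = \frac{n+1}{n+2}.
\]
```

The same add-one idea applied to term counts gives the Laplace-smoothed estimates used in the naive Bayes sketch above.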
