Download Presentation
## TEXT CLASSIFICATION

- - - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - - -

**TEXT CLASSIFICATION**Jianping Fan Dept of Computer Science UNC-Charlotte**Text CATEGORIZATION / CLASSIFICATION**• Given: • A description of an instance, xX, where X is the instance language or instance space. • E.g: how to represent text documents. • A fixed set of categories C ={c1, c2,…, cn} • Determine: • The category of x: c(x)C, where c(x) is a categorization function whose domain is X and whose range is C.**Sports**Business Education Science Categorization System … … Sports Business Education Text Classification • Pre-given categories and labeled document examples (Categories may form hierarchy) • Classify new documents • A standard classification (supervised learning ) problem**Arch.**Graphics Theory NLP AI A GRAPHICAL VIEW OF TEXT CLASSIFICATION**Text Classification Applications**• Applications: • Web pages • Recommending • Yahoo-like classification • Newsgroup Messages • Recommending • spam filtering • News articles • Personalized newspaper • Email messages • Routing • Prioritizing • Folderizing • spam filtering**Text Classification Applications**• Web pages organized into category hierarchies • Journal articles indexed by subject categories (e.g., the Library of Congress, MEDLINE, etc.) • Responses to Census Bureau occupations • Patents archived using International Patent Classification • Patient records coded using international insurance categories • E-mail message filtering • News events tracked and filtered by topics**Cost of Manual Text Categorization**• Yahoo! • 200 (?) people for manual labeling of Web pages • using a hierarchy of 500,000 categories • MEDLINE (National Library of Medicine) • $2 million/year for manual indexing of journal articles • using MEdical Subject Headings (18,000 categories) • Mayo Clinic • $1.4 million annually for coding patient-record events • using the International Classification of Diseases (ICD) for billing insurance companies • US Census Bureau decennial census (1990: 22 million responses) • 232 industry categories and 504 occupation categories • $15 million if fully done by hand**What is so special about text?**• No obvious relation between features • High dimensionality, (often larger vocabulary, V, than the number of features!) • Importance of speed**Where we need?**• Term extraction tools • Document representation • The need for dimensionality reduction • Classifier learning methods • Topics model & semantic representation • ……….**High dimensional space,**not as high as |V| Doc1 Doc2 Doc3 … LOVE 34 0 3 Document/Term count matrix SOUL 12 0 2 SOUL SVD RESEARCH 0 19 6 LOVE SCIENCE … 0 … 16 … 1 … RESEARCH SCIENCE Latent Semantic Analysis EACH WORD IS A SINGLE POINT IN A SEMANTIC SPACE**EXAMPLES OF TEXT Classification**• LABELS=BINARY • “spam” / “not spam” • LABELS=TOPICS • “finance” / “sports” / “asia” • LABELS=OPINION • “like” / “hate” / “neutral” • LABELS=AUTHOR • “Shakespeare” / “Marlowe” / “Ben Jonson” • The Federalist papers**Text Classification: Problem Definition**• Need to assign a boolean value {0,1} to each entry of the decision matrix • C = {c1,....., cm} is a set of pre-defined categories • D = {d1,..... dn} is a set of documents to be categorized • 1 for aij : dj belongs to ci • 0 for aij : dj does not belong to ci A Tutorial on Automated Text Categorisation, Fabrizio Sebastiani, Pisa (Italy)**Methods (1)**• Manual classification • Used by Yahoo!, Looksmart, about.com, ODP, Medline • very accurate when job is done by experts • consistent when the problem size and team is small • difficult and expensive to scale • Automatic document classification • Hand-coded rule-based systems • Reuters, CIA, Verity, … • Commercial systems have complex query languages (everything in IR query languages + accumulators)**Methods (2)**• Supervised learning of document-label assignment function: Autonomy, Kana, MSN, Verity, … • Naive Bayes (simple, common method) • k-Nearest Neighbors (simple, powerful) • Support-vector machines (new, more powerful) • … plus many other methods • No free lunch: requires hand-classified training data • But can be built (and refined) by amateurs**Bayesian Methods**• Learning and classification methods based on probability theory (see spelling / POS) • Bayes theorem plays a critical role • Build a generative model that approximates how data is produced • Uses prior probability of each category given no information about an item. • Categorization produces a posterior probability distribution over the possible categories given a description of an item.**Maximum likelihood Hypothesis**If all hypotheses are a priori equally likely,we only need to consider the P(D|h) term:**Naive Bayes Classifiers**Task: Classify a new instance based on a tuple of attribute values**Naïve Bayes Classifier: Assumptions**• P(cj) • Can be estimated from the frequency of classes in the training examples. • P(x1,x2,…,xn|cj) • Need very, very large number of training examples Conditional Independence Assumption: Assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities.**Flu**X1 X2 X3 X4 X5 runnynose sinus cough fever muscle-ache The Naïve Bayes Classifier • Conditional Independence Assumption: features are independent of each other given the class:**C**X1 X2 X3 X4 X5 X6 Learning the Model • Common practice: maximum likelihood • simply use the frequencies in the data**Using Naive Bayes Classifiers to Classify Text: Basic method**• Attributes are text positions, values are words. • Still too many possibilities • Assume that classification is independent of the positions of the words • Use same parameters for each position**Text Classification Algorithms: Learning**• From training corpus, extract Vocabulary • Calculate required P(cj)and P(xk | cj)terms • For each cj in Cdo • docsjsubset of documents for which the target class is cj • Textj single document containing all docsj • for each word xkin Vocabulary • nk number of occurrences ofxkin Textj**Text Classification Algorithms: Classifying**• positions all word positions in current document which contain tokens found in Vocabulary • Return cNB, where**Naïve Bayes Posterior Probabilities**• Classification results of naïve Bayes (the class with maximum posterior probability) are usually fairly accurate. • However, due to the inadequacy of the conditional independence assumption, the actual posterior-probability numerical estimates are not. • Output probabilities are generally very close to 0 or 1.**Feature selection via Mutual Information**• We might not want to use all words, but just reliable, good discriminators • In training set, choose k words which best discriminate the categories. • One way is in terms of Mutual Information: • For each word w and each category c**OTHER APPROACHES TO FEATURE SELECTION**• T-TEST • CHI SQUARE • TF/IDF (CFR. IR lectures) • Yang & Pedersen 1997: eliminating features leads to improved performance**NAÏVE BAYES NOT SO NAIVE**• Naïve Bayes: First and Second place in KDD-CUP 97 competition, among 16 (then) state of the art algorithms Robust to Irrelevant Features Irrelevant Features cancel each other without affecting results Instead Decision Trees & Nearest-Neighbor methods can heavily suffer from this. • Very good in Domains with many equally important features Decision Trees suffer from fragmentation in such cases – especially if little data • A good dependable baseline for text classification (but not the best)! • Optimal if the Independence Assumptions hold: • If assumed independence is correct, then it is the Bayes Optimal Classifier for problem • Very Fast: • Learning with one pass over the data; testing linear in the number of attributes, and document collection size • Low Storage requirements • Handles Missing Values**PANTEL AND LIN: SPAMCOP**• Uses a Naïve Bayes classifier • M is spam if P(Spam|M) > P(NonSpam|M) • Method • Tokenize message using Porter Stemmer • Estimate P(W|C) using m-estimate (a form of smoothing) • Remove words that do not satisfy certain conditions • Train: 160 spams, 466 non-spams • Test: 277 spams, 346 non-spams • Results: ERROR RATE of 4.33% • Worse results using trigrams**OTHER CLASSIFICATION METHODS**• K-NN • DECISION TREES • LOGISTIC REGRESSION • SUPPORT VECTOR MACHINES**Support Vector Machine**• SVM: A Large-Margin Classifier • Linear SVM • Kernel Trick • Fast implementation: SMO • SVM for Text Classification • Multi-class Classification • Multi-label Classification • Our Hierarchical Classification Tool**Class 2**Class 1 What is a Good Decision Boundary? • Consider a two-class, linearly separable classification problem • Many decision boundaries! • The Perceptron algorithm can be used to find such a boundary • Are all decision boundaries equally good?**Examples of Bad Decision Boundaries**Class 2 Class 2 Class 1 Class 1**Large-margin Decision Boundary**• The decision boundary should be as far away from the data of both classes as possible • We should maximize the margin, m Class 2 m Class 1**Finding the Decision Boundary**• Let {x1, ..., xn} be our data set and let yiÎ {1,-1} be the class label of xi • The decision boundary should classify all points correctly Þ • The decision boundary can be found by solving the following constrained optimization problem • The Lagrangian of this optimization problem is**The Dual Problem**• By setting the derivative of the Lagrangian to be zero, the optimization problem can be written in terms of ai (the dual problem) • This is a quadratic programming (QP) problem • A global maximum of ai can always be found • w can be recovered by If the number of training examples is large, SVM training will be very slow because the number of parameters Alpha is very large in the dual problem.**KTT Condition**• The QP problem is solved when for all i,**Characteristics of the Solution**• KTT condition indicates many of the ai are zero • w is a linear combination of a small number of data points • xi with non-zero ai are called support vectors (SV) • The decision boundary is determined only by the SV • Let tj (j=1, ..., s) be the indices of the s support vectors. We can write • For testing with a new data z • Compute and classify z as class 1 if the sum is positive, and class 2 otherwise.**A Geometrical Interpretation**Class 2 a10=0 a8=0.6 a7=0 a2=0 a5=0 a1=0.8 a4=0 a6=1.4 a9=0 a3=0 Class 1**Non-linearly Separable Problems**• We allow “error”xi in classification Class 2 Class 1**Soft Margin Hyperplane**• By minimizing åixi, xi can be obtained by xi are “slack variables” in optimization; xi=0 if there is no error for xi, and xi is an upper bound of the number of errors • We want to minimize C : tradeoff parameter between error and margin • The optimization problem becomes**The Optimization Problem**• The dual of the problem is • w is recovered as • This is very similar to the optimization problem in the linear separable case, except that there is an upper bound C on ai now • Once again, a QP solver can be used to find ai**Extension to Non-linear Decision Boundary**• So far, we only consider large-margin classifier with a linear decision boundary, how to generalize it to become nonlinear? • Key idea: transform xi to a higher dimensional space to “make life easier” • Input space: the space the point xi are located • Feature space: the space of f(xi) after transformation • Why transform? • Linear operation in the feature space is equivalent to non-linear operation in input space • Classification can become easier with a proper transformation. In the XOR problem, for example, adding a new feature of x1x2 make the problem linearly separable**f( )**f( ) f( ) f( ) f( ) f( ) f( ) f( ) f( ) f( ) f( ) f( ) f( ) f( ) f( ) f( ) f( ) f( ) Transforming the Data • Computation in the feature space can be costly because it is high dimensional • The feature space is typically infinite-dimensional! • The kernel trick comes to rescue f(.) Feature space Input space**The Kernel Trick**• Recall the SVM optimization problem • The data points only appear as inner product • As long as we can calculate the inner product in the feature space, we do not need the mapping explicitly • Many common geometric operations (angles, distances) can be expressed by inner products • Define the kernel function K by**An Example for f(.) and K(.,.)**• Suppose f(.) is given as follows • An inner product in the feature space is • So, if we define the kernel function as follows, there is no need to carry out f(.) explicitly • This use of kernel function to avoid carrying out f(.) explicitly is known as the kernel trick**Kernel Functions**• In practical use of SVM, only the kernel function (and not f(.)) is specified • Kernel function can be thought of as a similarity measure between the input objects • Not all similarity measure can be used as kernel function, however Mercer's condition states that any positive semi-definite kernel K(x, y), i.e. can be expressed as a dot product in a high dimensional space.**Examples of Kernel Functions**• Polynomial kernel with degree d • Radial basis function kernel with width s • Closely related to radial basis function neural networks • Sigmoid with parameter k and q • It does not satisfy the Mercer condition on all k and q**Modification Due to Kernel Function**• Change all inner products to kernel functions • For training, Original With kernel function