TEXT CLASSIFICATION -----SVM-based Approach
Jianping Fan, Dept of Computer Science, UNC-Charlotte
Presentation Transcript


  1. TEXT CLASSIFICATION-----SVM-based Approach Jianping Fan Dept of Computer Science UNC-Charlotte

  2. Text CATEGORIZATION / CLASSIFICATION • Given: • A description of an instance, x ∈ X, where X is the instance language or instance space. • E.g.: how to represent text documents. • A fixed set of categories C = {c1, c2, …, cn} • Determine: • The category of x: c(x) ∈ C, where c(x) is a categorization function whose domain is X and whose range is C.

  3. Text Classification • [Diagram: a categorization system routing documents into categories such as Sports, Business, Education, Science] • Pre-given categories and labeled document examples (categories may form a hierarchy) • Classify new documents • A standard classification (supervised learning) problem

  4. A GRAPHICAL VIEW OF TEXT CLASSIFICATION • [Diagram: documents plotted in a feature space, grouped into regions labeled Arch., Graphics, Theory, NLP, AI]

  5. Text Classification Applications • Applications: • Web pages • Recommending • Yahoo-like classification • Newsgroup Messages • Recommending • spam filtering • News articles • Personalized newspaper • Email messages • Routing • Prioritizing • Folderizing • spam filtering

  6. Text Classification Applications • Web pages organized into category hierarchies • Journal articles indexed by subject categories (e.g., the Library of Congress, MEDLINE, etc.) • Responses to Census Bureau occupations • Patents archived using International Patent Classification • Patient records coded using international insurance categories • E-mail message filtering • News events tracked and filtered by topics

  7. Cost of Manual Text Categorization • Yahoo! • 200 (?) people for manual labeling of Web pages • using a hierarchy of 500,000 categories • MEDLINE (National Library of Medicine) • $2 million/year for manual indexing of journal articles • using Medical Subject Headings (18,000 categories) • Mayo Clinic • $1.4 million annually for coding patient-record events • using the International Classification of Diseases (ICD) for billing insurance companies • US Census Bureau decennial census (1990: 22 million responses) • 232 industry categories and 504 occupation categories • $15 million if fully done by hand

  8. What is so special about text? • No obvious relation between features • High dimensionality (the vocabulary V is often larger than the number of training documents!) • Importance of speed

  9. What do we need? • Term extraction tools • Document representation • The need for dimensionality reduction • Classifier learning methods • Topic models & semantic representation • ……….

  10. Latent Semantic Analysis • Document/term count matrix:

              Doc1  Doc2  Doc3  …
    LOVE        34     0     3  …
    SOUL        12     0     2  …
    RESEARCH     0    19     6  …
    SCIENCE      0    16     1  …

  • Applying SVD to this matrix maps each word to a single point in a semantic space (high-dimensional, though not as high as |V|); see the sketch below.
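As a rough illustration of the SVD step (not from the original slides), here is a minimal NumPy sketch on the toy counts above; keeping two latent dimensions is an arbitrary choice for the example:

```python
import numpy as np

# Toy document/term count matrix from the slide (rows = terms, cols = docs).
terms = ["LOVE", "SOUL", "RESEARCH", "SCIENCE"]
X = np.array([[34, 0, 3],
              [12, 0, 2],
              [ 0, 19, 6],
              [ 0, 16, 1]], dtype=float)

# Latent Semantic Analysis: truncated SVD of the count matrix.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2                                 # number of latent dimensions to keep
word_vectors = U[:, :k] * s[:k]       # each word becomes a point in a k-dim semantic space

for term, vec in zip(terms, word_vectors):
    print(term, np.round(vec, 2))
```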

  11. EXAMPLES OF TEXT Classification • LABELS=BINARY • “spam” / “not spam” • LABELS=TOPICS • “finance” / “sports” / “asia” • LABELS=OPINION • “like” / “hate” / “neutral” • LABELS=AUTHOR • “Shakespeare” / “Marlowe” / “Ben Jonson” • The Federalist papers

  12. Text Classification: Problem Definition • Need to assign a boolean value {0,1} to each entry of the decision matrix (see the sketch below) • C = {c1, …, cm} is a set of pre-defined categories • D = {d1, …, dn} is a set of documents to be categorized • 1 for aij : dj belongs to ci • 0 for aij : dj does not belong to ci A Tutorial on Automated Text Categorisation, Fabrizio Sebastiani, Pisa (Italy)
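A tiny illustration of the decision matrix; the category and document names are made up for the example:

```python
import numpy as np

# Decision matrix: rows = categories c_i, columns = documents d_j,
# a_ij = 1 if d_j belongs to c_i, 0 otherwise (a document may get several labels).
categories = ["finance", "sports", "asia"]
documents  = ["d1", "d2", "d3", "d4"]

A = np.array([[1, 0, 0, 1],
              [0, 1, 0, 0],
              [0, 1, 1, 0]], dtype=int)

# Categories assigned to document d2 (column index 1):
print([c for c, a in zip(categories, A[:, 1]) if a == 1])   # ['sports', 'asia']
```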

  13. Methods (1) • Manual classification • Used by Yahoo!, Looksmart, about.com, ODP, Medline • very accurate when the job is done by experts • consistent when the problem size and team are small • difficult and expensive to scale • Automatic document classification • Hand-coded rule-based systems • Reuters, CIA, Verity, … • Commercial systems have complex query languages (everything in IR query languages + accumulators)

  14. Methods (2) • Supervised learning of document-label assignment function: Autonomy, Kana, MSN, Verity, … • Naive Bayes (simple, common method) • k-Nearest Neighbors (simple, powerful) • Support-vector machines (new, more powerful) • … plus many other methods • No free lunch: requires hand-classified training data • But can be built (and refined) by amateurs

  15. Support Vector Machine • SVM: A Large-Margin Classifier • Linear SVM • Kernel Trick • Fast implementation: SMO • SVM for Text Classification • Multi-class Classification • Multi-label Classification • Hierarchical Classification Tool

  16. What is a Good Decision Boundary? • [Figure: two linearly separable classes, Class 1 and Class 2, with several candidate separating lines] • Consider a two-class, linearly separable classification problem • Many decision boundaries! • The Perceptron algorithm can be used to find such a boundary (see the sketch below) • Are all decision boundaries equally good?
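Since the slide mentions the Perceptron, here is a minimal NumPy sketch of it; the data, learning rate, and epoch cap are illustrative, and it only finds a boundary cleanly when the data really are separable:

```python
import numpy as np

def perceptron(X, y, epochs=100, lr=1.0):
    """Find *some* separating hyperplane w.x + b = 0 (labels y in {+1, -1})."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:     # misclassified point
                w += lr * yi * xi          # nudge the boundary toward it
                b += lr * yi
                errors += 1
        if errors == 0:                    # converged: all points correct
            break
    return w, b

# Two linearly separable classes; many boundaries would do.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([1, 1, -1, -1])
w, b = perceptron(X, y)
print(w, b)
```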

  17. Examples of Bad Decision Boundaries • [Figure: two separating lines that pass very close to points of Class 1 or Class 2, leaving almost no margin]

  18. Large-margin Decision Boundary • The decision boundary should be as far away from the data of both classes as possible • We should maximize the margin, m • [Figure: separating hyperplane between Class 1 and Class 2 with margin m]

  19. Finding the Decision Boundary • Let {x1, ..., xn} be our data set and let yi ∈ {1, -1} be the class label of xi • The decision boundary should classify all points correctly ⇒ yi (w·xi + b) ≥ 1 for all i • The decision boundary can be found by solving the following constrained optimization problem: minimize ½||w||² subject to yi (w·xi + b) ≥ 1 for all i • The Lagrangian of this optimization problem is L = ½||w||² − Σi αi [yi (w·xi + b) − 1], with αi ≥ 0

  20. The Dual Problem • By setting the derivative of the Lagrangian to zero, the optimization problem can be rewritten in terms of the αi (the dual problem): maximize Σi αi − ½ Σi Σj αi αj yi yj (xi·xj) subject to αi ≥ 0 and Σi αi yi = 0 • This is a quadratic programming (QP) problem • A global maximum of the αi can always be found • w can be recovered by w = Σi αi yi xi • If the number of training examples is large, SVM training will be very slow because the number of dual parameters αi is very large.

  21. KKT Condition • The QP problem is solved when, for all i, αi ≥ 0, yi (w·xi + b) − 1 ≥ 0, and αi [yi (w·xi + b) − 1] = 0

  22. Characteristics of the Solution • The KKT condition indicates that many of the αi are zero • w is a linear combination of a small number of data points • xi with non-zero αi are called support vectors (SV) • The decision boundary is determined only by the SV • Let tj (j = 1, ..., s) be the indices of the s support vectors. We can write w = Σj αtj ytj xtj • For testing with a new data point z • Compute w·z + b = Σj αtj ytj (xtj·z) + b and classify z as class 1 if the sum is positive, and class 2 otherwise (see the sketch below)
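A hedged sketch of this property using scikit-learn's SVC (assuming scikit-learn is installed and using made-up points): `dual_coef_` stores yi·αi for the support vectors, so the decision value for a new point can be rebuilt from the support vectors alone:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
              [-2.0, -1.0], [-3.0, -2.0], [-1.0, -3.0]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # very large C ~ hard margin

# Only the support vectors have non-zero alpha; dual_coef_ holds y_i * alpha_i.
print("support vectors:\n", clf.support_vectors_)

# Decision value for a new point z, rebuilt from the support vectors alone:
z = np.array([1.0, 0.5])
f_manual = clf.dual_coef_ @ (clf.support_vectors_ @ z) + clf.intercept_
print(f_manual[0], clf.decision_function([z])[0])   # the two values should agree
```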

  23. A Geometrical Interpretation • [Figure: Class 1 and Class 2 points with the max-margin boundary; only the points on the margin have non-zero multipliers (α1 = 0.8, α6 = 1.4, α8 = 0.6), while all other αi = 0]

  24. Non-linearly Separable Problems • We allow "errors" ξi in classification • [Figure: Class 1 and Class 2 overlap; misclassified or margin-violating points incur slack ξi]

  25. Soft Margin Hyperplane • When Σi ξi is minimized, each ξi is obtained as ξi = max(0, 1 − yi (w·xi + b)) • ξi are "slack variables" in the optimization; ξi = 0 if there is no error for xi, and Σi ξi is an upper bound on the number of training errors • We want to minimize ½||w||² + C Σi ξi • C: tradeoff parameter between error and margin • The optimization problem becomes: minimize ½||w||² + C Σi ξi subject to yi (w·xi + b) ≥ 1 − ξi and ξi ≥ 0 for all i

  26. The Optimization Problem • The dual of the problem is: maximize Σi αi − ½ Σi Σj αi αj yi yj (xi·xj) subject to 0 ≤ αi ≤ C and Σi αi yi = 0 • w is recovered as w = Σi αi yi xi • This is very similar to the optimization problem in the linearly separable case, except that there is an upper bound C on αi now • Once again, a QP solver can be used to find the αi (see the sketch below for the effect of C)
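A small sketch of C as the error/margin tradeoff, assuming scikit-learn and using made-up overlapping data; larger C penalizes slack more heavily:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping (not linearly separable) classes.
X = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)),
               rng.normal(+1.0, 1.0, size=(50, 2))])
y = np.array([-1] * 50 + [1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Larger C -> fewer margin violations tolerated -> typically fewer support vectors.
    print(f"C={C:>6}: {clf.n_support_.sum()} support vectors, "
          f"train accuracy {clf.score(X, y):.2f}")
```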

  27. Extension to Non-linear Decision Boundary • So far we have only considered large-margin classifiers with a linear decision boundary; how can we generalize to a nonlinear one? • Key idea: transform xi to a higher-dimensional space to "make life easier" • Input space: the space where the points xi are located • Feature space: the space of φ(xi) after transformation • Why transform? • Linear operation in the feature space is equivalent to non-linear operation in input space • Classification can become easier with a proper transformation. In the XOR problem, for example, adding a new feature x1x2 makes the problem linearly separable

  28. Transforming the Data • [Figure: points in the input space mapped by φ(·) into the feature space] • Computation in the feature space can be costly because it is high dimensional • The feature space is typically infinite-dimensional! • The kernel trick comes to the rescue

  29. The Kernel Trick • Recall the SVM optimization problem • The data points only appear as inner products xi·xj • As long as we can calculate the inner product in the feature space, we do not need the mapping explicitly • Many common geometric operations (angles, distances) can be expressed by inner products • Define the kernel function K by K(xi, xj) = φ(xi)·φ(xj)

  30. An Example for φ(.) and K(.,.) • Suppose φ(.) is given as follows: φ([x1, x2]) = (1, √2 x1, √2 x2, x1², x2², √2 x1 x2) • An inner product in the feature space is φ([x1, x2])·φ([y1, y2]) = (1 + x1 y1 + x2 y2)² • So, if we define the kernel function as K(x, y) = (1 + x·y)², there is no need to carry out φ(.) explicitly • This use of a kernel function to avoid carrying out φ(.) explicitly is known as the kernel trick (verified numerically in the sketch below)
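The identity can be checked numerically; a minimal NumPy sketch of the φ(.) and K(.,.) above on arbitrary test points:

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel (2-D input)."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

def K(x, y):
    """Kernel trick: the same inner product without ever computing phi."""
    return (1.0 + x @ y) ** 2

x = np.array([0.3, -1.2])
y = np.array([2.0, 0.5])
print(phi(x) @ phi(y))   # inner product in the feature space
print(K(x, y))           # identical value, computed in the input space
```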

  31. Kernel Functions • In practical use of SVM, only the kernel function (and not φ(.)) is specified • The kernel function can be thought of as a similarity measure between the input objects • Not every similarity measure can be used as a kernel function, however: Mercer's condition states that any positive semi-definite kernel K(x, y), i.e., one whose kernel matrix is always positive semi-definite, can be expressed as a dot product in a high-dimensional space

  32. Examples of Kernel Functions • Polynomial kernel with degree d: K(x, y) = (x·y + 1)^d • Radial basis function (RBF) kernel with width σ: K(x, y) = exp(−||x − y||² / (2σ²)) • Closely related to radial basis function neural networks • Sigmoid with parameters κ and θ: K(x, y) = tanh(κ (x·y) + θ) • It does not satisfy the Mercer condition for all κ and θ (see the sketch below)
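A direct NumPy transcription of these three kernels, using the parameterizations written above; the parameter values and test points are arbitrary:

```python
import numpy as np

def polynomial_kernel(x, y, d=3):
    return (x @ y + 1.0) ** d                      # degree-d polynomial

def rbf_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))   # width sigma

def sigmoid_kernel(x, y, kappa=0.5, theta=-1.0):
    return np.tanh(kappa * (x @ y) + theta)        # not PSD for every kappa, theta

x = np.array([1.0, 2.0])
y = np.array([0.5, -1.0])
print(polynomial_kernel(x, y), rbf_kernel(x, y), sigmoid_kernel(x, y))
```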

  33. Modification Due to Kernel Function • Change all inner products to kernel functions • For training: • Original: maximize Σi αi − ½ Σi Σj αi αj yi yj (xi·xj) subject to 0 ≤ αi ≤ C, Σi αi yi = 0 • With kernel function: maximize Σi αi − ½ Σi Σj αi αj yi yj K(xi, xj) subject to 0 ≤ αi ≤ C, Σi αi yi = 0

  34. Modification Due to Kernel Function • For testing, the new data z is classified as class 1 if f ≥ 0, and as class 2 if f < 0 • Original: f = Σj αtj ytj (xtj·z) + b • With kernel function: f = Σj αtj ytj K(xtj, z) + b

  35. Why SVM Works? • The feature space is often very high dimensional. Why don't we have the curse of dimensionality? • A classifier in a high-dimensional space has many parameters and is hard to estimate • Vapnik argues that the fundamental problem is not the number of parameters to be estimated. Rather, the problem is the flexibility of a classifier • Typically, a classifier with many parameters is very flexible, but there are also exceptions • Let xi = 10^−i, where i ranges from 1 to n. The one-parameter classifier f(x) = sign(sin(αx)) can classify all xi correctly for every possible combination of class labels on the xi • This 1-parameter classifier is very flexible

  36. Why SVM Works? • Vapnik argues that the flexibility of a classifier should not be characterized by the number of parameters, but by the capacity of a classifier • This is formalized by the "VC-dimension" of a classifier • The addition of ½||w||² has the effect of restricting the VC-dimension of the classifier in the feature space • The SVM objective can also be justified by structural risk minimization: the empirical risk (training error), plus a term related to the generalization ability of the classifier, is minimized • Another view: the SVM loss function is analogous to ridge regression. The term ½||w||² "shrinks" the parameters towards zero to avoid overfitting

  37. Choosing the Kernel Function • Probably the trickiest part of using SVM • The kernel function is important because it creates the kernel matrix, which summarizes all the data • Many principles have been proposed (diffusion kernel, Fisher kernel, string kernel, …) • There is even research on estimating the kernel matrix from available information • In practice, a low-degree polynomial kernel or an RBF kernel with a reasonable width is a good initial try for most applications (see the sketch below) • For text classification, a linear kernel is often said to be the best choice, because the feature dimension is already high enough
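One common way to pick a kernel and its parameters is cross-validated grid search; a hedged sketch with scikit-learn on synthetic data, where the grid values are arbitrary starting points rather than recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
    {"kernel": ["rbf"],    "C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1.0]},
    {"kernel": ["poly"],   "C": [0.1, 1, 10], "degree": [2, 3]},
]

# 5-fold cross-validation over every kernel/parameter combination.
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```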

  38. Strengths and Weaknesses of SVM • Strengths • Training is relatively easy • No local optima, unlike in neural networks • It scales relatively well to high-dimensional data • The tradeoff between classifier complexity and error can be controlled explicitly • Non-traditional data like strings and trees can be used as input to SVM, instead of feature vectors • Performing logistic regression (fitting a sigmoid) on the SVM outputs for a set of data maps the SVM outputs to probabilities • Weaknesses • Need to choose a "good" kernel function.

  39. Summary: Steps for Classification • Prepare the pattern matrix • Select the kernel function to use • Select the parameters of the kernel function and the value of C • You can use the values suggested by the SVM software, or you can set apart a validation set to determine the values of the parameters • Execute the training algorithm and obtain the αi • Unseen data can be classified using the αi and the support vectors (see the sketch below)
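The steps above, written out as a short scikit-learn sketch with a held-out validation set; the synthetic data, RBF kernel choice, and candidate C values are arbitrary:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# 1. Prepare the pattern matrix.
X, y = make_classification(n_samples=400, n_features=30, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=1)

# 2-3. Select the kernel and its parameters / C on a validation set.
best = None
for C in (0.1, 1.0, 10.0):
    clf = SVC(kernel="rbf", gamma="scale", C=C).fit(X_train, y_train)  # 4. train -> alphas
    acc = clf.score(X_val, y_val)
    if best is None or acc > best[0]:
        best = (acc, clf)

# 5. Classify unseen data with the trained model (its alphas and support vectors).
print("validation accuracy:", best[0])
print("predictions:", best[1].predict(X_val[:5]))
```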

  40. Fast SVM Implementations • SMO: Sequential Minimal Optimization • SVM-Light • LibSVM • BSVM • ……

  41. SMO: Sequential Minimal Optimization • Key idea • Divide the large QP problem of SVM into a series of smallest-possible QP problems, which can be solved analytically; this avoids a time-consuming numerical QP solver in the inner loop (a kind of SQP method) • Space complexity: O(n) • Since the QP step is greatly simplified, the most time-consuming part of SMO is the evaluation of the decision function; therefore it is very fast for linear SVMs and sparse data

  42. SMO • At each step, SMO chooses 2 Lagrange multipliers to jointly optimize, finds the optimal values for these multipliers, and updates the SVM to reflect the new optimal values • Three components • An analytic method to solve for the two Lagrange multipliers • A heuristic for choosing which multipliers to optimize • A method for computing b at each step, so that the KKT conditions are fulfilled for both of the two examples

  43. Choosing Which Multipliers to Optimize • First multiplier • Iterate over the entire training set, and find an example that violates the KKT condition • Second multiplier • Maximize the size of the step taken during the joint optimization, i.e., maximize |E1 − E2|, where Ei is the error on the i-th example (see the sketch below)
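A tiny sketch of the second-choice heuristic in isolation; this is a simplification (real SMO caches the errors and handles bound multipliers separately), and the error values are made up:

```python
import numpy as np

def choose_second_multiplier(i, errors):
    """SMO second-choice heuristic: maximize |E_i - E_j| to take the biggest step."""
    E = np.asarray(errors, dtype=float)
    gaps = np.abs(E - E[i])
    gaps[i] = -np.inf           # a multiplier cannot be paired with itself
    return int(np.argmax(gaps))

errors = [0.3, -0.8, 0.1, 0.9, -0.2]          # cached errors E_k = f(x_k) - y_k
print(choose_second_multiplier(0, errors))    # index 1: |-0.8 - 0.3| is the largest gap
```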

  44. Text Categorization • Typical features • Term frequency • Inverse document frequency • TC is a typical multi-class, multi-label classification problem • SVM, with some additional heuristics, has been regarded as one of the best classification schemes for text data, based on many benchmark evaluations • TC is a high-dimensional, sparse problem • SMO is a very good choice in this case (see the sketch below)
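A hedged sketch of a typical text-categorization pipeline: TF-IDF features (term frequency × inverse document frequency), which are sparse and high-dimensional, fed to a linear SVM, with multiple labels handled one-vs-rest. The toy documents and label sets are made up, and scikit-learn is assumed:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

docs = [
    "the match ended with a late goal",           # sports
    "shares fell as the market opened",           # business
    "the club shares rose after the transfer",    # sports + business
    "students sat the national exam today",       # education
]
labels = [{"sports"}, {"business"}, {"sports", "business"}, {"education"}]

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(labels)                     # binary indicator matrix, one column per label

X = TfidfVectorizer().fit_transform(docs)         # sparse, high-dimensional TF-IDF matrix
clf = OneVsRestClassifier(LinearSVC()).fit(X, y)  # one linear SVM per label

print(mlb.inverse_transform(clf.predict(X)))
```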

  45. Multi-Class SVM Classification • 1-vs-rest • 1-vs-1 • MaxWin • DB2 • Error Correcting Output Coding • K-class SVM

  46. 1-vs-rest • For any class C, train a binary classifier to distinguish C from its complement (all other classes) • For an unseen sample, take the binary classifier with the highest confidence score for the final decision (see the sketch below)
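A manual sketch of the 1-vs-rest rule with scikit-learn binary SVMs on synthetic data; in practice scikit-learn can apply this scheme internally, this is just to make the rule explicit:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=150, centers=3, random_state=0)   # classes 0, 1, 2

# Train one binary classifier per class: class c vs. everything else.
classifiers = {c: SVC(kernel="linear").fit(X, (y == c).astype(int))
               for c in np.unique(y)}

def predict_ovr(z):
    # Pick the class whose "c vs rest" classifier is most confident.
    scores = {c: clf.decision_function([z])[0] for c, clf in classifiers.items()}
    return max(scores, key=scores.get)

print(predict_ovr(X[0]), "true:", y[0])
```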

  47. 1-vs-1 • Train C(N, 2) = N(N−1)/2 classifiers, each of which distinguishes one class from another • Pairwise: • MaxWin (C(N, 2) tests; see the sketch below) • Error-correcting output code • DAG: • Pachinko machine (N−1 tests)
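A manual sketch of pairwise training with MaxWin voting on synthetic data (scikit-learn's SVC also uses one-vs-one internally; the explicit loop here is only to show the voting rule):

```python
from itertools import combinations
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=150, centers=4, random_state=1)
classes = np.unique(y)

# Train one classifier per pair of classes: N(N-1)/2 in total.
pairwise = {}
for a, b in combinations(classes, 2):
    mask = (y == a) | (y == b)
    pairwise[(a, b)] = SVC(kernel="linear").fit(X[mask], y[mask])

def predict_maxwin(z):
    votes = {c: 0 for c in classes}
    for (a, b), clf in pairwise.items():          # C(N,2) tests
        votes[clf.predict([z])[0]] += 1
    return max(votes, key=votes.get)              # MaxWin: the class with most votes wins

print(predict_maxwin(X[0]), "true:", y[0])
```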

  48. Error Correcting Output Coding • Code matrix M (N × K) – N classes, K binary classifiers • Each classifier predicts one bit; the predicted bit string is compared to each class's codeword by Hamming distance • The class Ci with the minimum distance (error) wins (see the sketch below)
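A minimal sketch of ECOC decoding by Hamming distance; the code matrix and the predicted bits below are illustrative, not taken from the slides:

```python
import numpy as np

# Illustrative code matrix M (N=4 classes x K=6 binary classifiers).
M = np.array([[0, 0, 1, 1, 0, 1],
              [1, 0, 0, 1, 1, 0],
              [0, 1, 1, 0, 1, 0],
              [1, 1, 0, 0, 0, 1]])

def decode(bits, M):
    """Return the class whose codeword has minimum Hamming distance to `bits`."""
    distances = np.sum(M != np.asarray(bits), axis=1)
    return int(np.argmin(distances))

# Suppose the K classifiers output these bits for a test document (one bit is flipped):
predicted_bits = [1, 0, 0, 1, 1, 1]
print("predicted class:", decode(predicted_bits, M))   # class 1, despite the flipped bit
```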

  49. Intransitivity of DAG • [Figure: decision DAGs over classes C1, C2, C3, with internal pairwise nodes such as 1~2, 1~3, 2~3 and leaves 1, 2, 3] • For classes C1, C2, C3: if C1 beating C2 and C2 beating C3 imply C1 beats C3, we say the pairwise ordering is transitive • The pairwise decisions made along a DAG need not be transitive

  50. Divided-by-2 (DB2) • Hierarchically divide the data into two subsets until every subset consists of only one class.
