## Text Classification and Naïve Bayes


- An example of text classification
- Definition of a machine learning problem
- A refresher on probability
- The Naive Bayes classifier

### Different ways for classification

- Human labor (people assign categories to every incoming article)
- Hand-crafted rules for automatic classification
  - If the article contains: stock, Dow, share, Nasdaq, etc. → Business
  - If the article contains: set, breakpoint, player, Federer, etc. → Tennis
- Machine learning algorithms

### What is Machine Learning?

Definition: A computer program is said to learn from experience E when its performance P at a task T improves with experience E. (Tom Mitchell, *Machine Learning*, 1997)

Examples:

- Learning to recognize spoken words
- Learning to drive a vehicle
- Learning to play backgammon

### Components of a ML System (1)

- Experience (a set of examples that combines input and output for a task)
  - Text categorization: document + category
  - Speech recognition: spoken text + written text
- Experience is referred to as *training data*. When training data is available, we talk of *supervised learning*.
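The hand-crafted-rules approach above can be sketched as a simple keyword matcher. This is a hypothetical illustration: the function name, rule format, and test sentence are assumptions, not part of the slides; only the two keyword lists come from the slide.

```python
def rule_classify(article, rules):
    """Return the first category whose keyword set overlaps the article's words."""
    words = set(article.lower().split())
    for category, keywords in rules:
        if words & keywords:
            return category
    return "unknown"

# The two example rules from the slide (keywords lowercased for matching).
rules = [
    ("Business", {"stock", "dow", "share", "nasdaq"}),
    ("Tennis", {"set", "breakpoint", "player", "federer"}),
]

print(rule_classify("Nasdaq and Dow share prices rose", rules))  # Business
```

Rules like these can be precise, but they are brittle and expensive to maintain, which motivates the machine-learning approach that follows.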
- Performance metrics
  - Error or accuracy on the test data
  - The test data must not be present in the training data
  - When there are few training data, methods like leave-one-out or ten-fold cross-validation are used to measure error.

### Components of a ML System (2)

- Task: the type of knowledge to be learned (known as the *target function*, which maps between input and output)
- Representation of the target function:
  - Decision trees
  - Neural networks
  - Linear functions
- The learning algorithm:
  - C4.5 (learns decision trees)
  - Gradient descent (learns a neural network)
  - Linear programming (learns linear functions)

### Defining Text Classification

- $d \in X$: the document in the multi-dimensional space
- $C = \{c_1, c_2, \ldots, c_J\}$: a set of classes (categories, or labels)
- $D = \{\langle d, c \rangle\}$: the training set of labeled documents
- Target function: $\gamma : X \to C$
- Learning algorithm: $\Gamma(D) = \gamma$

Example of a labeled document: $\langle$"Beijing joins the World Trade Organization", China$\rangle$, i.e. $\gamma(d) = \text{China}$.

### Naïve Bayes Learning

- Learning algorithm: Naïve Bayes
- Target function: $c_{map} = \arg\max_{c \in C} P(c \mid d)$
- The generative process:
  - $P(c)$: the a priori probability of choosing a category
  - $P(d \mid c)$: the conditional probability of generating $d$, given the fixed $c$
  - $P(c \mid d)$: the a posteriori probability that $c$ generated $d$

### Visualizing probability

- $A$ is a random variable that denotes an uncertain event
  - Example: A = "I'll get an A+ in the final exam"
- P(A) is "the fraction of possible worlds where A is true"
- Picture the event space of all possible worlds as a region of area 1; the worlds in which A is true form a circle whose area is P(A), and the worlds outside it are those in which A is false. (Slide: Andrew W. Moore)

### Axioms and Theorems of Probability

- Axioms:
  - $0 \le P(A) \le 1$
  - $P(\text{True}) = 1$
  - $P(\text{False}) = 0$
  - $P(A \vee B) = P(A) + P(B) - P(A \wedge B)$
- Theorems:
  - $P(\neg A) = 1 - P(A)$
  - $P(A) = P(A \wedge B) + P(A \wedge \neg B)$

### Conditional Probability

- $P(A \mid B)$: the probability of A being true, given that we know that B is true
- Example: H = "I have a headache", F = "coming down with flu"
  - P(H) = 1/10, P(F) = 1/40, P(H|F) = 1/2
- Headaches are rare and flu even rarer, but if you have the flu, there is a 50-50 chance you'll have a headache. (Slide: Andrew W. Moore)
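The headache/flu numbers above can be checked with a few lines of arithmetic, applying the chain rule and the Bayes rule (variable names are my own):

```python
p_h = 1 / 10         # P(H): probability of a headache
p_f = 1 / 40         # P(F): probability of flu
p_h_given_f = 1 / 2  # P(H|F): probability of a headache, given flu

# Chain rule: P(H and F) = P(H|F) * P(F)
p_h_and_f = p_h_given_f * p_f
print(p_h_and_f)      # 0.0125  (= 1/80)

# Bayes rule: P(F|H) = P(H|F) * P(F) / P(H)
p_f_given_h = p_h_given_f * p_f / p_h
print(p_f_given_h)    # 0.125   (= 1/8)
```

So even though half of all flu sufferers have a headache, only one in eight headache sufferers has the flu, because flu itself is rare: this asymmetry between P(H|F) and P(F|H) is exactly what the Bayes rule quantifies.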
### Deriving the Bayes Rule

- Conditional probability: $P(A \mid B) = \dfrac{P(A \wedge B)}{P(B)}$
- Chain rule: $P(A \wedge B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A)$
- Bayes rule: $P(A \mid B) = \dfrac{P(B \mid A)\,P(A)}{P(B)}$

### Deriving the Naïve Bayes

Given two classes $c_1, c_2$ and the document $d$, the Bayes rule gives

$$P(c_i \mid d) = \frac{P(d \mid c_i)\,P(c_i)}{P(d)}$$

We are looking for the $c$ that maximizes the a-posteriori probability $P(c \mid d)$. The denominator $P(d)$ is the same in both cases, thus:

$$c_{map} = \arg\max_{c \in C} P(d \mid c)\,P(c)$$

### Estimating parameters for the target function

We are looking for the estimates $\hat{P}(c)$ and $\hat{P}(d \mid c)$.

$P(c)$ is the fraction of possible worlds where $c$ is true:

$$\hat{P}(c) = \frac{N_c}{N}$$

where $N$ is the number of all documents and $N_c$ is the number of documents in class $c$.

$d$ is a vector in the space where each dimension is a term, $d = \langle t_1, t_2, \ldots, t_{n_d} \rangle$. By using the chain rule we have:

$$P(d \mid c) = P(t_1 \mid c)\,P(t_2 \mid c, t_1) \cdots P(t_{n_d} \mid c, t_1, \ldots, t_{n_d-1})$$

### Naïve assumptions of independence

- All attribute values are independent of each other given the class (the conditional independence assumption).
- The conditional probabilities for a term are the same independent of its position in the document. We assume the document is a "bag of words".

Finally, we get the target function of Slide 8:

$$c_{map} = \arg\max_{c \in C} \hat{P}(c) \prod_{1 \le k \le n_d} \hat{P}(t_k \mid c)$$

### Again about estimation

For each term $t$, we need to estimate $P(t \mid c)$:

$$\hat{P}(t \mid c) = \frac{T_{ct}}{\sum_{t' \in V} T_{ct'}}$$

where $T_{ct}$ is the count of term $t$ in all documents of class $c$.

Because an estimate will be 0 if a term does not appear with a class in the training data, we need smoothing (Laplace smoothing):

$$\hat{P}(t \mid c) = \frac{T_{ct} + 1}{\sum_{t' \in V} (T_{ct'} + 1)} = \frac{T_{ct} + 1}{\left(\sum_{t' \in V} T_{ct'}\right) + |V|}$$

where $|V|$ is the number of terms in the vocabulary.

### Example 13.1 (Part 1)

- Two classes: "China", "not China"
- V = {Beijing, Chinese, Japan, Macao, Tokyo}
- N = 4

### Example 13.1 (Part 2)

- Estimation
- Classification

### Summary: Miscellaneous

- Naïve Bayes is linear in the time it takes to scan the data.
- When we have many terms, the product of probabilities will cause a floating-point underflow; therefore, sum log probabilities instead:

$$c_{map} = \arg\max_{c \in C} \left[ \log \hat{P}(c) + \sum_{1 \le k \le n_d} \log \hat{P}(t_k \mid c) \right]$$

- For a large training set, the vocabulary is large. It is better to select only a subset of terms; for that, "feature selection" is used (Section 13.5).
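The whole pipeline — prior estimation, Laplace-smoothed conditional probabilities, and log-space classification — can be sketched in a few dozen lines. A minimal sketch, with assumptions to note: the function names are mine, and the toy training set follows the textbook version of Example 13.1 (Manning et al., *Introduction to Information Retrieval*), whose vocabulary also includes the term "Shanghai" in addition to the five terms listed on the slide.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """Train multinomial Naive Bayes with Laplace (add-one) smoothing.
    docs: list of (token_list, class_label) pairs."""
    vocab = set()
    n_c = Counter()              # N_c: number of documents per class
    t_ct = defaultdict(Counter)  # T_ct: term counts per class
    for tokens, c in docs:
        n_c[c] += 1
        for t in tokens:
            vocab.add(t)
            t_ct[c][t] += 1
    n = len(docs)
    prior = {c: n_c[c] / n for c in n_c}  # P(c) = N_c / N
    # P(t|c) = (T_ct + 1) / (sum_t' T_ct' + |V|)
    cond = {c: {t: (t_ct[c][t] + 1) / (sum(t_ct[c].values()) + len(vocab))
                for t in vocab}
            for c in n_c}
    return prior, cond, vocab

def classify(tokens, prior, cond, vocab):
    """Return argmax_c of log P(c) + sum_k log P(t_k|c); logs avoid underflow."""
    def score(c):
        return math.log(prior[c]) + sum(
            math.log(cond[c][t]) for t in tokens if t in vocab)
    return max(prior, key=score)

# Toy training set (textbook version of Example 13.1).
train = [
    ("Chinese Beijing Chinese".split(), "China"),
    ("Chinese Chinese Shanghai".split(), "China"),
    ("Chinese Macao".split(), "China"),
    ("Tokyo Japan Chinese".split(), "not China"),
]
prior, cond, vocab = train_nb(train)
print(classify("Chinese Chinese Chinese Tokyo Japan".split(),
               prior, cond, vocab))  # China
```

The test document lands in "China" because the three occurrences of "Chinese" outweigh "Tokyo" and "Japan"; with the unsmoothed estimates, "Tokyo" and "Japan" would have zero probability under "China" and veto that class entirely, which is exactly what Laplace smoothing prevents.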