
Notes 9: Probability and Classification

  1. Notes 9: Probability and Classification ICS 171, Winter 2001

  2. Outline • Reasoning with Probabilities • Bayes’ rule • Classification • classifying objects into a discrete set of classes • applications • Probabilistic Classification: “first-order” Bayes classifier

  3. Derivation of Bayes’ Rule • From the definition of conditional probability we have • p(a|b) = p(a,b) / p(b) (1) • from the same definition we also have • p(b|a) = p(a,b) / p(a) (2) • So, p(a,b) = p(a|b) p(b) (from (1)) = p(b|a) p(a) (from (2)) • Equating the two right-hand sides and dividing by p(b), we get: p(a|b) = p(b|a) p(a) / p(b) (This is Bayes’ rule)

  4. Example of using Bayes’ Rule • Say we know p(d) = p(disease) = 0.001, p(s) = p(symptom) = 0.01 and p(s | d) = 0.9. If someone has the symptom, what is the probability they have the disease? • We need to find p(d|s) from the information above: p(d|s) = p(s|d) p(d) / p(s) = (0.9 x 0.001) / 0.01 = 0.09 • This is Bayes’ Rule: we can “update” p(d) to p(d|s) given new information about the symptom s => simple reasoning mechanism
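A minimal sketch of this update in Python, using the numbers from the slide:

```python
# Disease/symptom example from the slide.
p_d = 0.001          # p(disease)
p_s = 0.01           # p(symptom)
p_s_given_d = 0.9    # p(symptom | disease)

# Bayes' rule: p(d|s) = p(s|d) p(d) / p(s)
p_d_given_s = p_s_given_d * p_d / p_s
print(p_d_given_s)   # -> 0.09
```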

  5. Why is Bayes’ Rule useful? • In practice an agent must reason as follows: effects -> causes e.g., sensor signals -> road conditions e.g., symptoms -> diseases • But normally we build models in the “forward” causal direction: causes -> effects e.g., road conditions -> sensor signals e.g., diseases -> symptoms • Bayes’ rule allows us to work “backward”, using the output of the forward model to infer causes (inputs) • => very useful in applications involving diagnosis • can involve complex probabilistic reasoning

  6. Speech Recognition using Bayes’ rule • [Diagram: acoustic data feeds into word model 1, word model 2, ..., word model 5000] • Forward acoustic models for each word provide p(acoustic data | word j) • A grammar of the language provides p(word j) • Bayes’ rule is used to calculate p(word j | acoustic data) = p(acoustic data | word j) p(word j) / p(acoustic data) • Then pick the maximum over j of p(word j | acoustic data), i.e., recognition! • This is how real speech recognition systems work in practice => they combine grammar cues and audio cues using Bayes’ rule

  7. Classification • Classification is an important application for autonomous agents • We have a special random variable called the Class, C • C takes values in {c1, c2, ...... ck} • Problem is to decide what class an object is in given information about other variables, A, B, etc • e.g., C is road condition, given A = temperature, B = humidity, etc • e.g., C is health, given A = blood pressure, B = test result, etc • e.g., C is credit status, given A = bank balance, B = age, etc • e.g., C is a word, given A = acoustic information,... • Notation: • C is the class variable • A, B, etc (the measurements) are called the attributes

  8. Classification Algorithms or Mappings • [Diagram: attribute values a, b, d, ..., z (which are known, measured) feed into a Classifier box that outputs a predicted class value c; the true class is unknown to the classifier] • We want to learn a mapping or function which takes any combination of values (a, b, d, ..., z) and will produce a prediction c • Note: if we know the distribution P(C | A, B, D, ..., Z), then this is the true mapping (i.e., it contains all the information we need) • The problem is that we don’t know this function: we have to learn it from data!

  9. Applications of Classification • Problem and Fault Diagnosis • Microsoft support, Air Force, etc • Medical Diagnosis • classification of cancerous cells • Credit card and Loan approval • Most major banks • Speech recognition • Dell, Dragon Systems, AT&T, Gateway, Microsoft, etc • Optical Character/Handwriting Recognition • Post Offices, Banks, Gateway, Motorola, Microsoft, Xerox, etc • Astronomy • Caltech/NASA • Many, many more applications • one of the most successful applications of AI technology

  10. Probabilistic Classification • We can build classifiers using probability models. Let’s say we are trying to find the probability that C = c given values of the attributes A, B, D, ..., Z, i.e., given {a, b, d, ..., z}: p(c | a and b and d ... and z) = p(a and b and d ... and z | c) p(c) / p(a and b and d ... and z) (this equation is just Bayes’ rule) • Observation: we will calculate p(C = c1 | ...), p(C = c2 | ...), etc. in this manner and pick the maximum • So the value of p(a and b and d ... and z) in the denominator can be ignored (it is the same constant for all of the c values) • So all we need to compute is the 2 terms in the numerator: p(c | a and b and d ... and z) = p(a and b and d ... and z | c) p(c) / (some constant), and then pick the maximum of these values over C
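A small sketch of the “ignore the denominator, pick the maximum” idea, assuming hypothetical helper functions likelihood(attrs, c) and prior(c) that return p(a and b and ... z | c) and p(c):

```python
# likelihood(attrs, c) and prior(c) are hypothetical helpers for this sketch.
def classify(attrs, classes, likelihood, prior):
    # Score each class by the numerator of Bayes' rule only; the denominator
    # p(a and b and ... z) is the same for every class, so it cannot change
    # which class gets the highest score.
    scores = {c: likelihood(attrs, c) * prior(c) for c in classes}
    return max(scores, key=scores.get)
```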

  11. Probabilistic Classification in Practice • We need to calculate 2 quantities: • 1. p(c1), p(c2), ... This is easy, it’s just m numbers • 2. p(a and b and d ... and z | c) How many numbers are needed here? Let’s say each attribute can take k values and there are d attributes -> we need a table of k^d numbers per class value • e.g., k = 4, d = 10: we need about 1 million numbers (4^10)! • -> this is impractical in practice - why? it would be too difficult to accurately learn all these different numbers

  12. An Approximation: “First-Order” Classification • For each value of the class variable C we need p(a and b and d ... and z | c) • What happens if we assume that attributes are independent given the class? Independence means that p(a and b) = p(a) p(b), or equivalently p(a|b) = p(a), i.e., knowing b tells us nothing about a • Let’s say we assume that the attributes are conditionally independent given the class, i.e., we approximate p(a and b and d ... and z | c) by p(a|c) p(b|c) p(d|c) ... p(z|c), i.e., a product of single conditional probabilities • What is the advantage? The product approximation needs only k x d numbers per class compared to the full method which needs k^d per class (see the sketch below)
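A short sketch of the saving, and of what the product approximation looks like in code; p_attr_given_c is a hypothetical list of per-attribute tables introduced only for this illustration:

```python
import math

k, d = 4, 10
print(k ** d)   # full joint table: 1,048,576 numbers per class (~1 million)
print(k * d)    # first-order approximation: 40 numbers per class

# The approximation replaces p(a and b and ... z | c) by a product of single
# conditionals. p_attr_given_c is a hypothetical list of per-attribute tables,
# where p_attr_given_c[j][c][v] = p(attribute j takes value v | class c).
def approx_likelihood(values, c, p_attr_given_c):
    return math.prod(p_attr_given_c[j][c][v] for j, v in enumerate(values))
```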

  13. The “First-Order” Probabilistic Classifier • Let there be m values for the class variable C, namely c1, ..., ci, ..., cm • Let the observed attribute values be (A=a, B=b, ..., Z=z) = (a, b, ..., z) • 1. Calculate the product f(i) = p(a|ci) p(b|ci) ... p(z|ci) p(ci) for each i = 1 to m • 2. Choose the largest (among i) of the f(i): this is our classification decision, i.e., the class value with the maximum f(i). Why? According to our assumptions, and given the data, this value of ci is the most likely • 3. We can also calculate the class probabilities (“posterior probabilities”) for i = 1 to m (still assuming conditional independence): p(ci | a and b and ... z) = f(i) / (f(1) + f(2) + ... + f(m))
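A sketch of steps 2 and 3, assuming the f(i) values have already been computed:

```python
# Assumes f_values[i] already holds f(i) = p(a|ci) p(b|ci) ... p(z|ci) p(ci).
def decide(f_values):
    best = max(range(len(f_values)), key=lambda i: f_values[i])   # step 2: argmax
    total = sum(f_values)
    posteriors = [f / total for f in f_values]                    # step 3: normalize
    return best, posteriors
```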

  14. Learning a Classifier 1: Establishing a Training Data Set • A “Training Data Set” is a file of M records (or “samples”) • Each record is made up of: • 1. a d-tuple of d attribute values • (an attribute is just another name for a variable) • e.g., for a PC trouble-shooting problem there might be 3 variables: Manufacturer, Operating System, Symptom • a particular set of attribute values could be (Dell, Windows95, can’t print), for the 3 attributes • 2. One class label from a set of possible class labels • The class variable is a special variable • it is the variable whose value we want to predict • e.g., for the PC problem the class could be Problem_Cause taking values in {hardware, software configuration, printer driver, unknown}

  15. Learning a Classifier 1: An Example Training Data Set (a set of records from a trouble-reporting database)

  ID    Manufacturer  Operating System  Symptom       Problem Cause (CLASS)
  1678  Dell          Windows95         Can’t print   Driver
  7262  Compaq        Windows95         Can’t print   Driver
  1716  Dell          Windows95         Can’t print   Driver
  6353  Gateway       Linux             Can’t print   Driver
  5242  Dell          Windows95         Can’t print   Driver
  1425  Compaq        Windows95         No display    Hardware
  3435  Gateway       Linux             Can’t print   Hardware
  6953  Dell          Windows95         No display    Hardware
  9287  Compaq        Windows95         No display    Hardware
  6252  Compaq        Windows95         Can’t print   Hardware
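The same records as a small Python structure (a sketch; the tuple layout is my own choice, and the ID column is omitted), which the later counting sketches can reuse:

```python
# Training records from the table above: (Manufacturer, OS, Symptom, Problem Cause).
TRAINING_DATA = [
    ("Dell",    "Windows95", "Can't print", "Driver"),
    ("Compaq",  "Windows95", "Can't print", "Driver"),
    ("Dell",    "Windows95", "Can't print", "Driver"),
    ("Gateway", "Linux",     "Can't print", "Driver"),
    ("Dell",    "Windows95", "Can't print", "Driver"),
    ("Compaq",  "Windows95", "No display",  "Hardware"),
    ("Gateway", "Linux",     "Can't print", "Hardware"),
    ("Dell",    "Windows95", "No display",  "Hardware"),
    ("Compaq",  "Windows95", "No display",  "Hardware"),
    ("Compaq",  "Windows95", "Can't print", "Hardware"),
]
```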

  16. Homework Problem • For the table of attributes on the previous overhead: • how many probability values do we need to model the full joint distribution of these attributes if we make no independence assumptions • how many probability values do we need if we make the first-order conditional independence assumption?

  17. Learning a Classifier 2: Learning Probabilities from the Training Data Set • We need the functions f(i) for each class • From earlier, we have (by assumption) that f(i) = p(a|ci) p(b|ci) ... p(z|ci) p(ci) • Where do these probabilities come from? From the training data set: • 1. p(ci) = (number of records with class label ci) / (total number of records) • 2. p(A = a | ci) = (number of records with class label ci and attribute A = a) / (number of records with class label ci) • (for each i, where i ranges over the class values)
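A sketch of these two counting formulas, written against the TRAINING_DATA list from the earlier sketch (the last tuple element is the class label):

```python
# Counting estimates of p(ci) and p(A = a | ci); data is a list of tuples whose
# last element is the class label (e.g. the TRAINING_DATA sketch above).
def class_prior(data, label):
    return sum(1 for rec in data if rec[-1] == label) / len(data)

def conditional(data, attr_index, value, label):
    in_class = [rec for rec in data if rec[-1] == label]
    return sum(1 for rec in in_class if rec[attr_index] == value) / len(in_class)

# e.g. class_prior(TRAINING_DATA, "Driver")            -> 0.5
#      conditional(TRAINING_DATA, 0, "Dell", "Driver") -> 0.6
```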

  18. Example Notation The class takes 2 values: d = driver, h = hardware There are 3 attributes: m is the value for the manufacturer attribute o is the value for the operating system attribute s is the value for the symptom attribute From before, the f(i)’s are: (where now i=d, and i =h) f(Class = d) = p(m | d) p(o | d) p( s | d) p(Class = d) f(Class = h) = p(m | h) p(o | h) p( s | h) p(Class = h) Procedure for classification Given a vector (m, o, s) of attribute values, and a model in the form of tables for the above probabilities - Calculate both f(Class=d) and f(Class=h) as above - Choose the larger of the 2 as the classification decision

  19. Summary of Classification Learning • Phase 1: Learning the model • the p(ci) and p(attribute|ci) tables are built from a database • in this database the class labels are known • it is called the training data • this “training” occurs “offline” • Phase 2: Using the model for Classification • after the model is built in Phase 1..... • one gets a new record of attribute values, [a, b, d, ..... z] • the goal is to predict a class value, since the class value for this record is unknown • Calculate the f(i) functions for each possible class value • The maximum is the most likely given the model and the data and we choose that as the predicted value

  20. Worked Classification Example (continued) • How can we build the model (i.e., generate the entries for the conditional probability tables) using the database? • 1. Say we want to calculate p(driver): p(driver) = (number of records with “driver”) / (total number of records) = 5 / 10 = 0.5 • 2. Say we want to calculate p(Dell|driver): p(Dell|driver) = (number of records with “driver” and “Dell”) / (number of records with “driver”) = 3/5 = 0.6

  21. Example (continued): Tables of Probabilities Estimated from the Database

  Class Value = Hardware, h
    p(h) = 5/10 = 0.5
    Manufacturer attribute: p(Dell|h) = 1/5 = 0.2, p(Com|h) = 3/5 = 0.6, p(Gat|h) = 1/5 = 0.2
    OS attribute: p(W95|h) = 4/5 = 0.8, p(Lin|h) = 1/5 = 0.2
    Symptom attribute: p(CP|h) = 2/5 = 0.4, p(ND|h) = 3/5 = 0.6

  Class Value = Driver, d
    p(d) = 5/10 = 0.5
    Manufacturer attribute: p(Dell|d) = 3/5 = 0.6, p(Com|d) = 1/5 = 0.2, p(Gat|d) = 1/5 = 0.2
    OS attribute: p(W95|d) = 4/5 = 0.8, p(Lin|d) = 1/5 = 0.2
    Symptom attribute: p(CP|d) = 5/5 = 1, p(ND|d) = 0/5 = 0

  22. Example: Performing Classification • Say we get a new record whose class is unknown, e.g., (Manufacturer = Dell, OS = Linux, Symptom = Can’t print, Class = ?) • NOTE: There are no records exactly like this in the database ... but we can still use the probabilistic classifier model to predict the class! • f(d) = p(Dell | d) p(Linux | d) p(CP | d) p(d) = 0.6 x 0.2 x 1 x 0.5 = 0.06

  23. Classification Example (continued) • f(h) = p(Dell | h) p(Linux | h) p(CP | h) p(h) = 0.2 x 0.2 x 0.4 x 0.5 = 0.008 • So, max(f(d), f(h)) = max(0.06, 0.008) = 0.06 => d (driver) is the most likely class value • Finally we can also get the actual class probability for d: p(d | Dell, Linux, Can’t print) = f(d) / (f(d) + f(h)) = 0.06 / 0.068 = 0.88 • => we classify the record as “driver problem” with probability 0.88 • The classifier has “learned” how to classify this record even though it never saw it before! (this is known as “generalization”)
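The same arithmetic as a short Python sketch, using the table values estimated above:

```python
# Worked example: classify the new record (Dell, Linux, Can't print)
# using the probability tables from slide 21.
f_d = 0.6 * 0.2 * 1.0 * 0.5    # p(Dell|d) p(Linux|d) p(CP|d) p(d) = 0.06
f_h = 0.2 * 0.2 * 0.4 * 0.5    # p(Dell|h) p(Linux|h) p(CP|h) p(h) = 0.008

prediction = "Driver" if f_d > f_h else "Hardware"
posterior = f_d / (f_d + f_h)  # ~0.88
print(prediction, round(posterior, 2))
```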

  24. Procedure for Probabilistic Classification • 1. Build a “first-order” Probabilistic Classifier • get a training data set with attributes and a class variable • for each of the class values • calculate the p(ci) from the training data • for each of the attribute and class values • calculate the p(a|ci) numbers from the training data • Store all of these numbers in a table • 2. Use the Classifier for Prediction • get a new record with attribute values and no class label • calculate the functions f(i) for each class value • the predicted class is the one with the largest f value • (optional) calculate the “normalized f” = probabilities
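As a wrap-up, here is a compact end-to-end sketch of this procedure (build the tables from labeled records, then classify a new record), assuming records are tuples whose last element is the class label, as in the earlier TRAINING_DATA sketch:

```python
from collections import Counter, defaultdict

def train(data):
    """Step 1: estimate the p(ci) and p(a|ci) tables by counting."""
    n = len(data)
    class_counts = Counter(rec[-1] for rec in data)
    value_counts = defaultdict(Counter)        # (attribute index, class) -> value counts
    for rec in data:
        for j, value in enumerate(rec[:-1]):
            value_counts[(j, rec[-1])][value] += 1
    p_class = {c: count / n for c, count in class_counts.items()}
    p_attr = {(j, c): {v: cnt / class_counts[c] for v, cnt in counts.items()}
              for (j, c), counts in value_counts.items()}
    return p_class, p_attr

def predict(attrs, p_class, p_attr):
    """Step 2: compute f(i) for each class; return the argmax and its normalized probability."""
    f = {}
    for c, prior in p_class.items():
        score = prior
        for j, value in enumerate(attrs):
            score *= p_attr[(j, c)].get(value, 0.0)   # unseen attribute value -> probability 0
        f[c] = score
    best = max(f, key=f.get)
    total = sum(f.values())
    return best, (f[best] / total if total > 0 else None)
```

With the TRAINING_DATA list from the earlier sketch, calling train(TRAINING_DATA) and then predict(("Dell", "Linux", "Can't print"), p_class, p_attr) reproduces the “Driver”, ≈0.88 answer from the worked example.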

  25. Summary • Probabilities and random variables • joint probabilities • conditional probabilities • Bayes’ rule • Classification • classification is an important practical application of AI • classifies objects into a discrete set of classes • can learn the classifier from labeled data • Probabilistic Classification • “first-order” probabilistic classifier • approximates the full probabilistic classifier • a simple example of a working machine learning system
