AMCS/CS 340: Data Mining

Classification III
AMCS/CS 340: Data Mining
Xiangliang Zhang
King Abdullah University of Science and Technology

Classification Techniques
• Decision Tree based Methods
• Rule-based Methods
• Learning from Neighbors
• Bayesian Classification
• Neural Networks
• Ensemble Methods
• Support Vector Machines

Bayesian Classification
• A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities
• Foundation: based on Bayes' theorem
• Performance: a simple Bayesian classifier, the Naïve Bayesian classifier, has performance comparable to decision tree and selected neural network classifiers
• Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct, so prior knowledge can be combined with observed data
• Standard: even when Bayesian methods are computationally intractable, they provide a standard of optimal decision making against which other methods can be measured

Bayes Classifier
• A probabilistic framework for solving classification problems
• Conditional probability: P(C | A) = P(A, C) / P(A) and P(A | C) = P(A, C) / P(C)
• Bayes theorem: P(C | A) = P(A | C) P(C) / P(A)

Example of Bayes Theorem
• Given:
  • A doctor knows that meningitis causes stiff neck 50% of the time: P(S|M)
  • Prior probability of any patient having meningitis is 1/50,000: P(M)
  • Prior probability of any patient having stiff neck is 1/2: P(S)
• If a patient has stiff neck, what is the probability that he/she has meningitis, P(M|S)?
• Informally, this can be written as: posterior = likelihood x prior / evidence
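The question on this slide can be answered by plugging the three stated probabilities into Bayes' theorem. The sketch below does only that, with exact fractions; the variable names are illustrative and not part of the slides.

```python
from fractions import Fraction

# Values exactly as stated on the slide.
p_s_given_m = Fraction(1, 2)     # P(S|M): meningitis causes stiff neck 50% of the time
p_m = Fraction(1, 50000)         # P(M): prior probability of meningitis
p_s = Fraction(1, 2)             # P(S): prior probability of stiff neck

# Bayes' theorem: posterior = likelihood x prior / evidence.
p_m_given_s = p_s_given_m * p_m / p_s
print(p_m_given_s, float(p_m_given_s))
```

With these inputs the likelihood and the evidence cancel, so the posterior P(M|S) equals the prior 1/50,000.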

Bayesian Classifiers
• Consider each attribute and the class label as random variables
• Given a record with attributes (A1, A2, …, An)
  • The goal is to predict the class C, where C = c1, or c2, or …
  • Specifically, we want to find the value of C that maximizes P(C | A1, A2, …, An)
• Can we estimate P(C | A1, A2, …, An) directly from data?

Bayesian Classifiers
• Approach:
  • Compute the posterior probability P(C | A1, A2, …, An) for all values of C using Bayes theorem
  • Choose the value of C that maximizes P(C | A1, A2, …, An)
  • This is equivalent to choosing the value of C that maximizes P(A1, A2, …, An | C) P(C), because the evidence P(A1, A2, …, An) is the same for every class
• How to estimate the likelihood P(A1, A2, …, An | C)?

Naïve Bayes Classifier
• Assume independence among the attributes Ai when the class is given:
  • P(A1, A2, …, An | Cj) = P(A1 | Cj) P(A2 | Cj) … P(An | Cj)
  • This greatly reduces the computation cost: only the class distribution has to be counted
• P(Ai | Cj) can be estimated for all Ai and Cj
• A new point is classified to Cj if P(Cj) Π_i P(Ai | Cj) is maximal

How to Estimate Probabilities from Data?
• For the class: P(C) = Nc / N
  • e.g., P(No) = 7/10, P(Yes) = 3/10
• For discrete attributes: P(Ai | Ck) = |Aik| / Nck
  • where |Aik| is the number of instances that have attribute value Ai and belong to class Ck
  • Examples: P(Status=Married | No) = 4/7, P(Refund=Yes | Yes) = 0
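The counting estimates above can be sketched in a few lines. The training table for this slide is not part of the text, so the ten records below are hypothetical, chosen only so that the counts reproduce the figures quoted above (7/10, 3/10, 4/7 and 0); the function names are illustrative as well.

```python
from fractions import Fraction

# Hypothetical (refund, marital_status, class) records; NOT the slide's table,
# only a stand-in whose counts match the probabilities quoted on the slide.
records = [
    ("Yes", "Single", "No"), ("No", "Married", "No"), ("No", "Single", "No"),
    ("Yes", "Married", "No"), ("No", "Divorced", "Yes"), ("No", "Married", "No"),
    ("Yes", "Divorced", "No"), ("No", "Single", "Yes"), ("No", "Married", "No"),
    ("No", "Single", "Yes"),
]

def prior(c):
    """P(C) = Nc / N."""
    n_c = sum(1 for r in records if r[-1] == c)
    return Fraction(n_c, len(records))

def conditional(attr_index, value, c):
    """P(Ai | Ck) = |Aik| / Nck, estimated by counting."""
    in_class = [r for r in records if r[-1] == c]
    matches = sum(1 for r in in_class if r[attr_index] == value)
    return Fraction(matches, len(in_class))

def naive_bayes_score(refund, status, c):
    """P(Cj) times the product of the per-attribute conditionals."""
    return prior(c) * conditional(0, refund, c) * conditional(1, status, c)

def classify(refund, status):
    """Pick the class with the largest naive Bayes score."""
    return max(("No", "Yes"), key=lambda c: naive_bayes_score(refund, status, c))

print(prior("No"), conditional(1, "Married", "No"), conditional(0, "Yes", "Yes"))
print(classify("No", "Married"))
```

Note that a single zero conditional, such as P(Refund=Yes | Yes) = 0, forces the whole score for that class to zero.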

How to Estimate Probabilities from Data?
• For continuous attributes:
  • Probability density estimation:
    • Assume the attribute follows a normal distribution
    • Use the data to estimate the parameters of the distribution (e.g., mean μ and standard deviation σ)
    • Once the probability distribution is known, use it to estimate the conditional probability P(Ai | Cj)

How to Estimate Probabilities from Data?
• Normal distribution: P(Ai | Cj) = 1 / sqrt(2π σij²) · exp(−(Ai − μij)² / (2 σij²))
  • One distribution for each (Ai, Cj) pair
• e.g., for (Income, Class=No):
  • sample mean = 110
  • sample variance = 2975
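A minimal sketch of the normal-density estimate for the (Income, Class=No) pair follows. The seven income values are hypothetical, picked only so that their sample mean and sample variance equal the 110 and 2975 quoted above, and the query value 120 is likewise an assumption made for illustration.

```python
import math
import statistics

# Hypothetical incomes (in thousands) for Class=No; chosen to match the
# quoted sample mean (110) and sample variance (2975), not taken from the slides.
incomes_no = [125, 100, 70, 120, 60, 220, 75]

mean = statistics.mean(incomes_no)          # sample mean
variance = statistics.variance(incomes_no)  # sample variance (divides by n - 1)

def gaussian_density(x, mean, variance):
    """Normal density with the given mean and variance, evaluated at x."""
    return math.exp(-(x - mean) ** 2 / (2 * variance)) / math.sqrt(2 * math.pi * variance)

# Density used in place of P(Income=120 | No); 120 is an illustrative query value.
density = gaussian_density(120, mean, variance)
print(mean, variance, round(density, 4))
```

The result is a density value rather than a probability, which is acceptable here because only the relative size across classes matters when picking the maximum.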

Avoiding the 0-Probability Problem
• If one of the conditional probabilities is zero, then the entire expression becomes zero
• Probability estimation:
  • Original: P(Ai | C) = Nic / Nc
  • Laplace: P(Ai | C) = (Nic + 1) / (Nc + c)
  • m-estimate: P(Ai | C) = (Nic + m p) / (Nc + m)
  • c: number of distinct values being counted (c = 3 in the example below); p: prior probability; m: parameter
• E.g., suppose a dataset with 1000 tuples:
  • income = low (0)
  • income = medium (990)
  • income = high (10)
• Use the Laplacian correction (or Laplacian estimator): add 1 to each case, with c = 3
  • Prob(income = low) = 1/1003
  • Prob(income = medium) = 991/1003
  • Prob(income = high) = 11/1003
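The income example can be reproduced with a short sketch of the Laplacian correction. The m-estimate function is an assumption about what the parameters m and p on this slide refer to: it uses the common form (count + m·p) / (total + m), which reduces to the Laplacian correction when m = c and p = 1/c.

```python
from fractions import Fraction

def laplace_estimates(counts):
    """Add 1 to every count; the denominator grows by the number of cases c."""
    total = sum(counts.values())
    c = len(counts)
    return {value: Fraction(n + 1, total + c) for value, n in counts.items()}

def m_estimate(count, total, m, p):
    """Assumed m-estimate form: (count + m * p) / (total + m)."""
    return (count + m * p) / (total + m)

income_counts = {"low": 0, "medium": 990, "high": 10}
probs = laplace_estimates(income_counts)
for value, prob in probs.items():
    print(value, prob)
```

No estimate is zero after the correction, and the three estimates still sum to 1.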

Naïve Bayesian Classifier: Comments
• Advantages
  • Easy to implement
  • Good results obtained in most cases
  • Robust to isolated noise points
  • Robust to irrelevant attributes
  • Handles missing values by ignoring the instance during probability estimate calculations

Naïve Bayesian Classifier: Comments
• Disadvantages
  • The independence assumption may not hold for some attributes, which causes a loss of accuracy
  • In practice, dependencies exist among variables
    • e.g., hospital patient data: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
    • Dependencies among these cannot be modeled by the Naïve Bayesian classifier
• How to deal with these dependencies?
  • Bayesian Belief Networks (BBN)

Bayesian Belief Networks
• A Bayesian belief network allows conditional independence to be defined among subsets of the variables
• A graphical model of causal relationships (a directed acyclic graph)
  • Represents dependency among the variables
  • Gives a specification of the joint probability distribution
• Example graph with nodes X, Y, Z, P:
  • Nodes: random variables
  • Links: dependency
  • X and Y are the parents of Z, and Y is the parent of P
  • No dependency between Z and P
  • Has no loops or cycles

Bayesian Belief Network: An Example
• Network nodes: FamilyHistory (FH), Smoker (S), LungCancer (LC), Emphysema, PositiveXRay, Dyspnea
• The conditional probability table (CPT) for variable LungCancer:

         (FH, S)   (FH, ~S)   (~FH, S)   (~FH, ~S)
  LC       0.8       0.5        0.7         0.1
  ~LC      0.2       0.5        0.3         0.9

• The CPT shows the conditional probability for each possible combination of values of the node's parents, so each column sums to 1
  • P(LungCancer = YES | FH = YES, S = YES) = 0.8
  • P(LungCancer = NO | FH = NO, S = NO) = 0.9
• Derivation of the probability of a particular combination of values (x1, …, xn) of a test tuple from the CPTs:
  • P(x1, …, xn) = Π_i P(xi | Parents(Yi)), where Yi is the variable that takes the value xi
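As a rough illustration of how a joint probability is assembled from CPT entries, the sketch below covers only the FamilyHistory, Smoker and LungCancer nodes. It uses just the two LungCancer entries that the slide states explicitly (0.8 and 0.9, stored here as P(LC=yes) of 0.8 and 0.1). The priors for FamilyHistory and Smoker do not appear in the slides, so the values 0.1 and 0.3 are hypothetical.

```python
# Hypothetical priors for the two parent nodes (not given in the slides).
P_FH = 0.1   # assumed P(FamilyHistory = yes)
P_S = 0.3    # assumed P(Smoker = yes)

# P(LungCancer = yes | FH, S) for the two parent combinations quoted explicitly.
P_LC_YES = {(True, True): 0.8, (False, False): 0.1}

def joint(fh, s, lc):
    """P(FH, S, LC) = P(FH) * P(S) * P(LC | FH, S); FH and S have no parents."""
    p_parents = (P_FH if fh else 1 - P_FH) * (P_S if s else 1 - P_S)
    p_lc_yes = P_LC_YES[(fh, s)]
    return p_parents * (p_lc_yes if lc else 1 - p_lc_yes)

print(joint(True, True, True))      # family history, smoker, lung cancer
print(joint(False, False, False))   # none of the three
```

Each factor conditions a variable only on its parents, which is what keeps the number of table entries manageable compared with a full joint table.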

Bayesian Belief Network: An Example
• Network nodes: FamilyHistory (FH), Smoker (S), LungCancer (LC), Emphysema (E), PositiveXRay (PXR), Dyspnea (D)
• If the CPTs are known, a BBN can be used to:
  • Compute the joint probability of a tuple, e.g., P(FH=Y, S=Y, LC=Y, E=N, PXR=Y, D=N)
  • Take a node as an "output" that represents the class label attribute, e.g., PositiveXRay as the class attribute
  • Predict the class of a tuple, e.g., PXR = ? given FH=N, S=Y, LC=N:
    • compute P(PXR=Y | FH=N, S=Y, LC=N) = a
    • compute P(PXR=N | FH=N, S=Y, LC=N) = b
    • if a > b, predict PositiveXRay = Yes

Training Bayesian Networks from training data instances
• Several scenarios:
  • Network structure given and all variables observable: learn only the CPTs
  • Network structure known, some hidden variables: gradient descent (greedy hill-climbing) method, EM algorithm
  • Network structure unknown, all variables observable: search through the model space to reconstruct the network topology
  • Unknown structure, all hidden variables: no good algorithms are known for this purpose

Classification Techniques
• Decision Tree based Methods
• Rule-based Methods
• Learning from Neighbors
• Bayesian Classification
• Neural Networks
• Ensemble Methods
• Support Vector Machines

Artificial Neural Networks (ANN)
• Example: the output Y is 1 if at least two of the three inputs are equal to 1

A single-layer perceptron
• The model is an assembly of inter-connected nodes and weighted links
• The output node sums up its input values according to the weights of its links
• The sum at the output node is compared against some threshold t
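The perceptron model described above can be sketched as a weighted sum followed by a threshold test. The weights (0.3 each) and the threshold t = 0.4 are assumptions chosen for illustration; with them the unit reproduces the "at least two of three inputs are 1" function from the ANN example.

```python
from itertools import product

def perceptron(x, weights=(0.3, 0.3, 0.3), t=0.4):
    """Output 1 if the weighted sum of the inputs exceeds the threshold t."""
    weighted_sum = sum(w * xi for w, xi in zip(weights, x))
    return 1 if weighted_sum > t else 0

# Enumerate all eight binary input combinations.
for x in product((0, 1), repeat=3):
    print(x, perceptron(x))
```

One input alone contributes 0.3, which stays below the threshold, while any two inputs contribute 0.6, which exceeds it.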

Perceptron learning algorithm
• Learning the weights wi: the delta rule (a gradient descent learning rule)
  1. Initialize wi = 0 or a random value
  2. For each example X = [X1 X2 X3]:
    • Calculate the actual output Ŷ
    • Adapt the weights: wi ← wi + η (Y − Ŷ) Xi, where η is the learning rate
  3. Repeat step 2 until the error between Ŷ and Y is less than a given threshold, or the number of iterations reaches a given limit
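The learning loop above can be sketched as follows, trained on the "at least two of three inputs" function. The bias term (playing the role of a negative threshold), the learning rate of 1 and the epoch limit are assumptions of this sketch, not values from the slides.

```python
from itertools import product

def predict(x, w, bias):
    """Step activation on the weighted sum plus bias."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + bias > 0 else 0

def train_perceptron(examples, eta=1, max_epochs=200):
    """Delta-rule updates, one example at a time, until an epoch has no errors."""
    w = [0] * len(examples[0][0])   # step 1: initialize weights to 0
    bias = 0
    for _ in range(max_epochs):
        errors = 0
        for x, y in examples:       # step 2: visit each example
            err = y - predict(x, w, bias)
            if err != 0:
                errors += 1
                w = [wi + eta * err * xi for wi, xi in zip(w, x)]
                bias += eta * err
        if errors == 0:             # step 3: stop once every example is correct
            break
    return w, bias

examples = [(x, 1 if sum(x) >= 2 else 0) for x in product((0, 1), repeat=3)]
w, bias = train_perceptron(examples)
print(w, bias)
```

Because this target function is linearly separable, the loop stops with all eight examples classified correctly; for data that is not separable it would simply run until the epoch limit.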

General Structure of ANN
• Training an ANN means learning the weights of the neurons

Algorithm for learning ANN
• Initialize the weights (w0, w1, …, wk)
• Adjust the weights in such a way that the output of the ANN is consistent with the class labels of the training examples
  • Objective function (sum of squared errors): E = Σ_i (Yi − Ŷi)², where Ŷi is the network output for example i
  • Find the weights wi that minimize the above objective function
  • e.g., the back-propagation algorithm

Idea of BP Algorithm
• Weights into the output node: updated by the delta rule, as in a single-layer net using the sum of squared errors
• Weights into the hidden nodes: can they be updated by the delta rule directly? No, because the desired values for the hidden nodes are unknown
• Solution: propagate the errors back to the hidden nodes, then update the weights by the delta rule based on the errors computed at the hidden nodes
• BACKPROPAGATION (BP) learning
  • Werbos (1974)
  • Rumelhart, Hinton, and Williams (1986)

Illustration of BP Algorithm
• A 3-layer neural network with 2 inputs and 1 output
• Forward pass, layer by layer:
  • First-layer neuron: activation function f1(w1·x)
  • Second-layer neurons: activation functions f4(w4·y_layer0) and f5(w5·y_layer0)
  • Output: y = f6(w6·y_hidden)

Illustration of BP Algorithm
• The output signal y of the network is compared with the desired output value (the target), which is found in the training data set. The difference is called the error signal δ of the output-layer neuron.
• The error signal δ (computed in a single teaching step) is propagated back to all neurons
  • The weights wmn used to propagate the errors back are equal to those used when computing the output value. Only the direction of data flow is changed (signals are propagated from the output to the inputs, one layer after the other).
  • When the propagated errors come from several neurons, they are added
  • Errors for the other neurons are computed in the same way

Illustration of BP Algorithm
• Once the error signal of each neuron is known, the weights of each input connection are modified: w' = w + η δ (df(e)/de) x, where x is the signal on that connection and df(e)/de is the derivative of the neuron activation function
• The same update is applied to the weights from the input nodes to the hidden nodes and to the weights from the hidden nodes to the output node
• The coefficient η (learning rate) affects the speed of network learning
  • Start with a large value (fast), then gradually decrease it to a small value
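The forward pass, error propagation and weight update walked through in this illustration can be condensed into a short sketch. Several choices here are assumptions rather than the slides' setup: the network is a smaller 2-2-1 net without bias terms, the activation is a sigmoid, the initial weights are arbitrary, and the derivative df(e)/de is folded into each error term, which gives the standard gradient form of the update for a squared-error objective.

```python
import math

def sigmoid(e):
    return 1.0 / (1.0 + math.exp(-e))

def forward(x, w_hidden, w_out):
    """Forward pass: hidden-layer outputs, then the network output y."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    y = sigmoid(sum(w * h for w, h in zip(w_out, hidden)))
    return hidden, y

def backprop_step(x, target, w_hidden, w_out, eta=0.5):
    """One backpropagation update for a single training example."""
    hidden, y = forward(x, w_hidden, w_out)
    # Error term at the output; y * (1 - y) is the sigmoid derivative.
    delta_out = (target - y) * y * (1.0 - y)
    # Propagate the error back through the same weights used in the forward pass.
    delta_hidden = [w * delta_out * h * (1.0 - h) for w, h in zip(w_out, hidden)]
    # Delta-rule updates: weight += eta * error term * incoming signal.
    new_w_out = [w + eta * delta_out * h for w, h in zip(w_out, hidden)]
    new_w_hidden = [[w + eta * d * xi for w, xi in zip(row, x)]
                    for row, d in zip(w_hidden, delta_hidden)]
    return new_w_hidden, new_w_out

# Arbitrary illustrative example and starting weights.
x, target = [1.0, 0.5], 1.0
w_hidden = [[0.1, -0.2], [0.4, 0.3]]
w_out = [0.2, -0.1]

_, y_before = forward(x, w_hidden, w_out)
for _ in range(200):
    w_hidden, w_out = backprop_step(x, target, w_hidden, w_out)
_, y_after = forward(x, w_hidden, w_out)
print(y_before, y_after)
```

After repeated updates on this one example the output moves toward the target, which is the behaviour the weight-update rule is meant to produce.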

Backpropagation Using Gradient Descent
• Advantages
  • Relatively simple implementation
  • Standard method that generally works well
• Disadvantages
  • Slow and inefficient
  • Can get stuck in a local minimum instead of the global minimum, resulting in sub-optimal solutions

Alternatives To Gradient Descent
• Simulated Annealing
  • Advantages
    • Can reach the optimal solution (global minimum), provided the cooling schedule is slow enough
  • Disadvantages
    • May be slower than gradient descent
    • Much more complicated implementation
• Genetic Algorithms/Evolutionary Strategies
  • Advantages
    • Faster than simulated annealing
    • Less likely to get stuck in local minima
  • Disadvantages
    • Slower than gradient descent
    • Memory intensive for large nets

Neural Network as a Classifier
• Strengths
  • High tolerance to noisy data
  • Ability to classify untrained patterns
  • Well-suited for continuous-valued inputs and outputs
  • Successful on a wide array of real-world data
  • Algorithms are inherently parallel
• Weaknesses
  • Long training time
  • Requires a number of parameters that are typically best determined empirically, e.g., the network topology or "structure"
  • Poor interpretability: it is difficult to interpret the symbolic meaning behind the learned weights and the "hidden units" in the network

Learning Parameters
• Learning rate: a small value (0.1 to 0.5)
  • if too large, training will not converge or will be less accurate
  • if too small, training is slower with no accuracy improvement
• Momentum: speeds up the learning process, reduces oscillation and helps attain convergence, e.g., a momentum coefficient of 0.9 adds that fraction of the previous weight change to the new weight change
• Connectivity: typically fully connected between layers
• Number of hidden nodes: too many nodes make learning slower and could overfit (acceptable if a reasonable stopping criterion is used); too few can underfit
• Number of layers: usually 1 or 2 hidden layers
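The momentum rule can be sketched on a toy problem: each new weight change is the gradient step plus a fraction of the previous change. The one-dimensional quadratic error, the starting point and the step count are assumptions for illustration; the learning rate 0.1 and momentum 0.9 follow the ranges mentioned above.

```python
def minimize_with_momentum(grad, w0, eta=0.1, alpha=0.9, steps=300):
    """Gradient descent where each step adds alpha times the previous weight change."""
    w, prev_delta = w0, 0.0
    for _ in range(steps):
        delta = -eta * grad(w) + alpha * prev_delta
        w += delta
        prev_delta = delta
    return w

# Toy error E(w) = (w - 3)^2 with gradient 2 * (w - 3); its minimum is at w = 3.
w_final = minimize_with_momentum(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(w_final)
```

On this toy error the iterate overshoots and oscillates around the minimum at first, then settles as the oscillation decays.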

Learning modes
• Batch learning
  • Training is performed "epoch-by-epoch"
  • Errors are summed across all training patterns
  • During each epoch, the weights are updated once
  • Requires more memory capacity
• Online learning
  • Training is performed "example-by-example"
  • The error is provided directly to the backward pass
  • Requires more updates
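The difference between the two modes shows up in where the weight update sits relative to the loop over examples. The sketch below uses a single linear unit with one weight and a three-point toy data set (y = 2x); the data, learning rate and epoch count are assumptions chosen for illustration.

```python
def train_batch(data, eta, epochs):
    """Batch mode: sum the error terms over all examples, update once per epoch."""
    w, updates = 0.0, 0
    for _ in range(epochs):
        total = sum((y - w * x) * x for x, y in data)
        w += eta * total
        updates += 1
    return w, updates

def train_online(data, eta, epochs):
    """Online mode: update the weight immediately after every example."""
    w, updates = 0.0, 0
    for _ in range(epochs):
        for x, y in data:
            w += eta * (y - w * x) * x
            updates += 1
    return w, updates

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]   # toy data generated by y = 2x
w_batch, n_batch = train_batch(data, eta=0.05, epochs=20)
w_online, n_online = train_online(data, eta=0.05, epochs=20)
print(w_batch, n_batch)
print(w_online, n_online)
```

Both runs approach w = 2, but the online version performs one update per example, three times as many as the batch version on this data.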
