
Classification



  1. Classification

  2. Goal • The goal of classification is to assign a new document or set of words to one of a number of predefined groups, in such a way that it fits the elements of that group as well as possible • the groups are defined in advance • supervised learning • it produces an assignment rule

  3. Introduction to Classification Applications • Classification = to learn a function that classifies the data into a set of predefined classes. • predicts categorical class labels (i.e., discrete labels) • classifies data (constructs a model) based on the training set and on the values (class labels) in a classifying attribute; and then uses the model to classify new database entries. Example: A bank might want to learn a function that determines whether a customer should get a loan or not. Decision trees and Bayesian classifiers are examples of classification algorithms. This is called Credit Scoring. Other applications: Credit approval; Target marketing; Medical diagnosis; Outcome (e.g., Treatment) analysis. UMUC Data Mining Lecture 4

  4. Classification - a 2-Step Process • Model Construction (Description): describing a set of predetermined classes = Build the Model. • Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute • The set of tuples used for model construction = the training set • The model is represented by classification rules, decision trees, or mathematical formulae • Model Usage (Prediction): for classifying future or unknown objects, or for predicting missing values = Apply the Model. • It is important to estimate the accuracy of the model: • The known label of test sample is compared with the classified result from the model • Accuracy rate is the percentage of test set samples that are correctly classified by the model • Test set is chosen completely independent of the training set, otherwise over-fitting will occur UMUC Data Mining Lecture 4
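The two-step process above (build the model on a training set, then apply it and estimate accuracy on an independently chosen test set) can be sketched with scikit-learn. The tiny "credit scoring" table and the 25% hold-out split below are invented purely for illustration; this is a minimal sketch, not the lecture's own example.

```python
# Minimal sketch of the 2-step classification process (build, then apply).
# The toy loan data is invented for illustration.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Hypothetical "credit scoring" table: [income, debt], label = loan approved (1) or not (0)
X = [[50, 10], [20, 15], [80, 5], [30, 30], [90, 2], [25, 20], [60, 8], [15, 25]]
y = [1, 0, 1, 0, 1, 0, 1, 0]

# Step 1: Model construction, using only the training set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2: Model usage on data held out from training, so the accuracy
# estimate is not inflated by over-fitting to the training set
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```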

  5. When to use Classification Applications? • If you do not know the types of objects stored in your database, then you should begin with a Clustering algorithm, to find the various clusters (classes) of objects within the DB. This is Unsupervised Learning. • If you already know the classes of objects in your database, then you should apply Classification algorithms, to classify all remaining (or newly added) objects in the database using the known objects as a training set. This is Supervised Learning. • If you are still learning about the properties of known objects in the database, then this is Semi-Supervised Learning, which may involve Neural Network techniques. UMUC Data Mining Lecture 4

  6. Document classification • The document is assigned to one of the previously known classes (or to a group of them) • Set of words → Category • The mapping is done with statistical methods based on a training sample • Bayes • Decision tree • K nearest neighbours • SVM
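The slide lists Bayes, decision trees, k-NN and SVM as typical learners for the "set of words → category" mapping. Below is a minimal sketch of that mapping with a bag-of-words naive Bayes classifier in scikit-learn; the two categories and the example sentences are invented for illustration.

```python
# Sketch of document classification: bag-of-words features + naive Bayes.
# The training documents and category labels are invented.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = ["the loan was approved", "credit card payment due",
        "match ended in a draw", "the team won the game"]
labels = ["finance", "finance", "sports", "sports"]

# Map each document's set of words to a category using a trained model
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(docs, labels)
print(clf.predict(["the bank approved the credit"]))  # most likely: ['finance']
```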

  7. Issues in Classification - 1 • Data Preparation: • Data cleaning • Preprocess data in order to reduce noise and handle missing values • Relevance analysis (feature selection) • The “interestingness problem” • Remove the irrelevant or redundant attributes UMUC Data Mining Lecture 4

  8. Issues in Classification - 3 • Robustness: • Handling noise and missing values • Speed and scalability of model • time to construct the model • time to use the model • Scalability of implementation • ability to handle ever-growing databases • Interpretability: • understanding and insight provided by the model • Goodness of rules • decision tree size • compactness of classification rules • Predictive accuracy UMUC Data Mining Lecture 4

  9. Issues in Classification - 4 • Overfitting • Definition: If your classifier (machine learning model) fits noise (i.e., pays attention to parts of the data that are irrelevant), then it is overfitting. (Figure: a “good” fit compared with an over-fitted “bad” fit.) UMUC Data Mining Lecture 4

  10. Bayes

  11. Bayesian Methods • Learning and classification methods based on probability theory (see spelling / POS) • Bayes theorem plays a critical role • Build a generative model that approximates how data is produced • Uses prior probability of each category given no information about an item. • Categorization produces a posterior probability distribution over the possible categories given a description of an item.

  12. Bayesian Classifiers • Bayes Theorem: P(C|X) = P(X|C) P(C) / P(X) which states … posterior = (likelihood x prior) / evidence • P(C) = prior probability = probability that any given sample data is in class C, estimated before we have measured the sample data. • We wish to determine the posterior probability P(C|X) that estimates whether C is the correct class for a given set of sample data X. UMUC Data Mining Lecture 4

  13. Estimating Bayesian Classifiers • P(C|X) = P(X|C) P(C) / P(X) … • Estimate P(Cj) by counting the frequency of occurrence of each class Cj in the training data set.* • Estimate P(Xk) by counting the frequency of occurrence of each attribute value Xk in the data.* • Estimate P(Xk | Cj) by counting how often the attribute value Xk occurs in class Cj in the training data set.* • Calculate the desired end-result P(Cj | Xk) which is the classification = the probability that Cj is the correct class for a data item having attribute Xk. (*Estimating these probabilities can be computationally very expensive for very large data sets.) UMUC Data Mining Lecture 4

  14. Example of Bayes Classification • Show sample database • Show application of Bayes theorem: • Use sample database as the “set of priors” • Use Bayes results to classify new data UMUC Data Mining Lecture 4

  15. Example of Bayesian Classification : • Suppose that you have a database D that contains characteristics of a large number of different kinds of cars that are sorted according to each car’s manufacturer = the car’s classification C. • Suppose one of the attributes X in D is the car’s “color”. • Measure P(C) from the frequency of different manufacturers in D. • Measure P(X) from the frequency of different colors among the cars in D. (This estimate is made independent of manufacturer.) • Measure P(X|C) from frequency of cars with color X made by manufacturer C. • Okay, now you see a red car flying down the beltway. What is the car’s make (manufacturer)? You can estimate the likelihood that the car is from a given manufacturer C by calculating P(C|X) via Bayes Theorem: • P(C|X) = P(X|C) P(C) / P(X) (Class is “C” when P(C|X) is a maximum.) • With only one attribute, this is a trivial result, and not very informative. However, using a larger set of attributes (e.g., two-door, with sun roof) leads to a much better classification estimator : example of a Bayes Belief Network. UMUC Data Mining Lecture 4

  16. Sample Database for Bayes Classification Example • x = car color, C = class of car (manufacturer)

Car Database:
Tuple   x       C
1       red     honda
2       blue    honda
3       white   honda
4       red     chevy
5       blue    chevy
6       white   chevy
7       red     toyota
8       white   toyota
9       white   toyota
10      red     chevy
11      white   ford
12      white   ford
13      blue    ford
14      red     chevy
15      red     dodge

Some statistical results:
x1 = red     P(x1) = 6/15
x2 = white   P(x2) = 6/15
x3 = blue    P(x3) = 3/15
C1 = chevy   P(C1) = 5/15
C2 = honda   P(C2) = 3/15
C3 = toyota  P(C3) = 3/15
C4 = ford    P(C4) = 3/15
C5 = dodge   P(C5) = 1/15

UMUC Data Mining Lecture 4

  17. Application #1 of Bayes Theorem • Recall the theorem: P(C|X) = P(X|C) P(C) / P(X) • Example #1: We see a red car. What type of car is it? • From the last slide, we know P(C) and P(X). Calculate P(X|C) and then we can perform the classification: P(C | red) = P(red | C) * P(C) / P(red)

P(red | chevy)  = 3/5
P(red | honda)  = 1/3
P(red | toyota) = 1/3
P(red | ford)   = 0/3
P(red | dodge)  = 1/1

Therefore ...
P(chevy | red)  = 3/5 * 5/15 * 15/6 = 3/6 = 50%
P(honda | red)  = 1/3 * 3/15 * 15/6 = 1/6 = 17%
P(toyota | red) = 1/3 * 3/15 * 15/6 = 1/6 = 17%
P(ford | red)   = 0
P(dodge | red)  = 1/1 * 1/15 * 15/6 = 1/6 = 17%

UMUC Data Mining Lecture 4
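As a check on the arithmetic above, here is a small sketch that computes the same posteriors P(C | color) directly from the 15-row car table on slide 16 by counting; nothing beyond that table is assumed.

```python
# Sketch: Bayes classification by counting over the 15-car sample database (slide 16).
from collections import Counter

cars = [("red", "honda"), ("blue", "honda"), ("white", "honda"),
        ("red", "chevy"), ("blue", "chevy"), ("white", "chevy"),
        ("red", "toyota"), ("white", "toyota"), ("white", "toyota"),
        ("red", "chevy"), ("white", "ford"), ("white", "ford"),
        ("blue", "ford"), ("red", "chevy"), ("red", "dodge")]

def posterior(color):
    """P(C | color) = P(color | C) * P(C) / P(color), all estimated by counting."""
    n = len(cars)
    p_color = sum(1 for x, _ in cars if x == color) / n
    class_counts = Counter(c for _, c in cars)
    result = {}
    for c, n_c in class_counts.items():
        p_c = n_c / n                                            # prior P(C)
        p_color_given_c = sum(1 for x, cc in cars if cc == c and x == color) / n_c
        result[c] = p_color_given_c * p_c / p_color              # Bayes theorem
    return result

print(posterior("red"))   # chevy: 0.5, honda/toyota/dodge: ~0.17, ford: 0.0
```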

  18. Results from Bayes Example #1 • Therefore, the red car is most likely a Chevy (maybe a Camaro or Corvette?). • The red car is unlikely to be a Ford. • We choose the most probable class as the Classification of the new data item (red car): therefore, Classification = C1 (Chevy). UMUC Data Mining Lecture 4

  19. Application #2 of Bayes Theorem • Recall the theorem: P(C|X) = P(X|C) P(C) / P(X) • Example #2: We see a white car. What type of car is it? P(C | white) = P(white | C) * P(C) / P(white)

P(white | chevy)  = 1/5
P(white | honda)  = 1/3
P(white | toyota) = 2/3
P(white | ford)   = 2/3
P(white | dodge)  = 0/1

Therefore ...
P(chevy | white)  = 1/5 * 5/15 * 15/6 = 1/6 = 17%
P(honda | white)  = 1/3 * 3/15 * 15/6 = 1/6 = 17%
P(toyota | white) = 2/3 * 3/15 * 15/6 = 2/6 = 33%
P(ford | white)   = 2/3 * 3/15 * 15/6 = 2/6 = 33%
P(dodge | white)  = 0

UMUC Data Mining Lecture 4

  20. Results from Bayes Example #2 • Therefore, the white car is equally likely to be a Ford or a Toyota. • The white car is unlikely to be a Dodge. • If we choose the most probable class as the Classification, we have a tie. You can either pick one of the two classes randomly (if you must pick), or else weight each class 0.50 in the output classification (C3, C4), if a probabilistic classification is permitted. UMUC Data Mining Lecture 4

  21. Why Use Bayesian Classification? • Probabilistic Learning: Allows you to calculate explicit probabilities for a hypothesis -- “learn as you go”. This is among the most practical approaches to certain types of learning problems (e.g., e-mail Spam detection). • Incremental: Each training example can incrementally increase/decrease the probability that a hypothesis is correct. • Data-Driven: Prior knowledge can be combined with observed data. • Probabilistic Prediction: Allows you to predict multiple hypotheses, each weighted by their own probabilities. • The Standard: Bayesian methods provide a standard of optimal decision-making against which other methods can be compared. UMUC Data Mining Lecture 4

  22. Naïve Bayesian Classification • Naïve Bayesian Classification assumes that, within each class C, the attributes are independent of one another. • Naïve Bayes assumption: attribute independence P(x1,…,xk|C) = P(x1|C)·…·P(xk|C) (= a simple product of probabilities) • P(xi|C) is estimated as the relative frequency of samples in class C for which their attribute “i” has the value “xi”. • This assumes that there is no correlation in the attribute values x1,…,xk within a class (conditional attribute independence) UMUC Data Mining Lecture 4
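A minimal sketch of that product rule follows. It extends the car example with a hypothetical second attribute ("doors"); only the independence assumption itself comes from the slide, and the "doors" probability table is invented.

```python
# Sketch of the naive Bayes product: score(C) ∝ P(C) * Π_i P(x_i | C).
# The "doors" likelihood table below is invented for illustration.
import math

priors = {"chevy": 5/15, "honda": 3/15}
# P(attribute value | class), one entry per observed attribute value
likelihoods = {
    "color=red": {"chevy": 3/5, "honda": 1/3},   # from the car database
    "doors=two": {"chevy": 2/5, "honda": 2/3},   # hypothetical numbers
}

def naive_bayes_score(c, observed):
    # Work in log space: the product of many small probabilities would underflow
    return math.log(priors[c]) + sum(math.log(likelihoods[a][c]) for a in observed)

observed = ["color=red", "doors=two"]
scores = {c: naive_bayes_score(c, observed) for c in priors}
print(max(scores, key=scores.get))  # class with the highest posterior score
```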

  23. The Independence Hypothesis… • … makes the computation possible (tractable) • … yields optimal classifiers when satisfied • … but is seldom satisfied in practice, as attributes (variables) are often correlated. • Some approaches to overcome this limitation: • Bayesian networks, that combine Bayesian reasoning with causal relationships between attributes • Decision trees, that reason on one attribute at a time, considering most important attributes first UMUC Data Mining Lecture 4

  24. Decision Trees

  25. Decision Tree Based Classification • Advantages: • Inexpensive to construct • Extremely fast at classifying unknown records • Easy to interpret for small-sized trees • Accuracy is comparable to other classification techniques for many simple data sets

  26. Decision trees • Decision trees are popular for pattern recognition because the models they produce are easier to understand. • Parts of a tree: the root node, the internal nodes, the leaves (terminal nodes), and the branches (decision points). (Figure: a small example tree whose leaves carry class labels such as A, B and C.)

  27. Weather Data: Play or not Play? Note: Outlook is the Forecast, no relation to Microsoft email program

  28. Example Tree for “Play?”
Outlook = sunny    → test Humidity: high → No, normal → Yes
Outlook = overcast → Yes
Outlook = rain     → test Windy: false → Yes, true → No

  29. Building Decision Tree [Q93] • Top-down tree construction • At start, all training examples are at the root. • Partition the examples recursively by choosing one attribute each time. • Bottom-up tree pruning • Remove subtrees or branches, in a bottom-up manner, to improve the estimated accuracy on new cases.

  30. Choosing the Splitting Attribute • At each node, available attributes are evaluated on the basis of separating the classes of the training examples. A Goodness function is used for this purpose. • Typical goodness functions: • information gain (ID3/C4.5) • information gain ratio • gini index witten&eibe

  31. Which attribute to select? witten&eibe

  32. A criterion for attribute selection • Which is the best attribute? • The one which will result in the smallest tree • Heuristic: choose the attribute that produces the “purest” nodes • Popular impurity criterion: information gain • Information gain increases with the average purity of the subsets that an attribute produces • Strategy: choose attribute that results in greatest information gain witten&eibe

  33. Computing information • Information is measured in bits • Given a probability distribution, the info required to predict an event is the distribution’s entropy • Entropy gives the information required in bits (this can involve fractions of bits!) • Formula for computing the entropy: entropy(p1, p2, …, pn) = – p1 log2 p1 – p2 log2 p2 – … – pn log2 pn witten&eibe

  34. Alternative Splitting Criteria based on INFO • Entropy at a given node t: Entropy(t) = – Σj p(j | t) log2 p(j | t) (NOTE: p(j | t) is the relative frequency of class j at node t). • Measures homogeneity of a node. • Maximum (log nc) when records are equally distributed among all classes, implying least information • Minimum (0.0) when all records belong to one class, implying most information • Entropy-based computations are similar to the GINI index computations

  35. Examples for computing Entropy
P(C1) = 0/6 = 0, P(C2) = 6/6 = 1: Entropy = – 0 log2 0 – 1 log2 1 = – 0 – 0 = 0
P(C1) = 1/6, P(C2) = 5/6: Entropy = – (1/6) log2 (1/6) – (5/6) log2 (5/6) = 0.65
P(C1) = 2/6, P(C2) = 4/6: Entropy = – (2/6) log2 (2/6) – (4/6) log2 (4/6) = 0.92
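A quick sketch to verify those three entropy values by direct computation; nothing is assumed beyond the formula above.

```python
# Sketch: entropy of a class distribution, as on slide 35.
import math

def entropy(counts):
    """Entropy in bits of a class-count vector, with 0*log(0) taken as 0."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

print(entropy([0, 6]))  # 0.0   (all records in one class: most information)
print(entropy([1, 5]))  # ~0.65
print(entropy([2, 4]))  # ~0.92
```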

  36. Example: attribute “Outlook”, 1 witten&eibe

  37. Example: attribute “Outlook”, 2 • “Outlook” = “Sunny”: info([2,3]) = entropy(2/5, 3/5) = 0.971 bits • “Outlook” = “Overcast”: info([4,0]) = entropy(1, 0) = 0 bits • “Outlook” = “Rainy”: info([3,2]) = entropy(3/5, 2/5) = 0.971 bits • Expected information for attribute: info([2,3],[4,0],[3,2]) = (5/14)·0.971 + (4/14)·0 + (5/14)·0.971 = 0.693 bits • Note: log(0) is not defined, but we evaluate 0*log(0) as zero witten&eibe

  38. Computing the information gain • Information gain: (information before split) – (information after split) • Compute for attribute “Humidity” witten&eibe

  39. Example: attribute “Humidity” • “Humidity” = “High”: info([3,4]) = entropy(3/7, 4/7) = 0.985 bits • “Humidity” = “Normal”: info([6,1]) = entropy(6/7, 1/7) = 0.592 bits • Expected information for attribute: info([3,4],[6,1]) = (7/14)·0.985 + (7/14)·0.592 = 0.788 bits • Information Gain: gain(“Humidity”) = 0.940 – 0.788 = 0.152 bits

  40. Computing the information gain • Information gain: (information before split) – (information after split) • Information gain for attributes from weather data: gain(“Outlook”) = 0.247 bits, gain(“Temperature”) = 0.029 bits, gain(“Humidity”) = 0.152 bits, gain(“Windy”) = 0.048 bits witten&eibe
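A sketch that reproduces these gains is given below. It assumes the standard 14-instance weather data set from Witten & Eibe, which appears to be the (missing) table behind slide 27; the numbers only match if that is indeed the data set used.

```python
# Sketch: information gain for each attribute of the (assumed) Witten & Eibe weather data.
import math
from collections import Counter, defaultdict

# columns: outlook, temperature, humidity, windy, play
data = [
    ("sunny", "hot", "high", False, "no"),     ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"), ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"), ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"), ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"), ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),  ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"), ("rainy", "mild", "high", True, "no"),
]
attrs = ["outlook", "temperature", "humidity", "windy"]

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(col):
    labels = [row[-1] for row in data]
    branches = defaultdict(list)
    for row in data:
        branches[row[col]].append(row[-1])          # split the class labels by attribute value
    after = sum(len(b) / len(data) * entropy(b) for b in branches.values())
    return entropy(labels) - after                  # info before split - info after split

for i, name in enumerate(attrs):
    print(f"gain({name}) = {info_gain(i):.3f} bits")
# expected roughly: outlook 0.247, temperature 0.029, humidity 0.152, windy 0.048
```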

  41. Continuing to split witten&eibe

  42. The final decision tree • Note: not all leaves need to be pure; sometimes identical instances have different classes • Splitting stops when data can’t be split any further witten&eibe

  43. Highly-branching attributes • Problematic: attributes with a large number of values (extreme case: ID code) • Subsets are more likely to be pure if there is a large number of values • Information gain is biased towards choosing attributes with a large number of values • This may result in overfitting (selection of an attribute that is non-optimal for prediction) witten&eibe

  44. Weather Data with ID code

  45. Split for ID Code Attribute • Entropy of split = 0 (since each leaf node is “pure”, having only one case). • Information gain is maximal for ID code witten&eibe

  46. Gain ratio • Gain ratio: a modification of the information gain that reduces its bias on high-branch attributes • The intrinsic information of a split is • large when the data is evenly spread over the branches • small when all data belong to one branch • Gain ratio takes number and size of branches into account when choosing an attribute • It corrects the information gain by taking the intrinsic information of a split into account (i.e. how much info do we need to tell which branch an instance belongs to) witten&eibe

  47. Gain Ratio and Intrinsic Info. • Intrinsic information: entropy of the distribution of instances into branches: IntrinsicInfo(S, A) = – Σi (|Si| / |S|) log2 (|Si| / |S|) • Gain ratio (Quinlan’86) normalizes info gain by the intrinsic information: GainRatio(S, A) = Gain(S, A) / IntrinsicInfo(S, A)

  48. Computing the gain ratio • Example: intrinsic information for ID code: IntrinsicInfo([1,1,…,1]) = 14 × (– (1/14) log2 (1/14)) = 3.807 bits • Importance of attribute decreases as intrinsic information gets larger • Example of gain ratio: GainRatio(“ID code”) = 0.940 / 3.807 = 0.247 • Example: GainRatio(“Outlook”) = 0.247 / 1.577 = 0.157 witten&eibe
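A sketch of the same correction, reusing the weather data assumed above (14 instances, class counts [9,5]; Outlook branches of size 5, 4 and 5). The numbers match the slide only under that assumption.

```python
# Sketch: gain ratio = information gain / intrinsic information of the split.
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

n = 14                                   # instances in the (assumed) weather data
info_class = entropy([9/14, 5/14])       # info([9,5]) = 0.940 bits

# "ID code": 14 singleton branches -> the split is pure, so the gain is the full
# 0.940 bits, but the intrinsic information of the split is large
intrinsic_id = entropy([1/n] * n)        # log2(14) = 3.807 bits
print("gain ratio(ID code) =", round(info_class / intrinsic_id, 3))   # ~0.247

# "Outlook": branches of size 5, 4, 5; gain = 0.247 bits (computed earlier)
intrinsic_outlook = entropy([5/14, 4/14, 5/14])                       # ~1.577 bits
print("gain ratio(Outlook) =", round(0.247 / intrinsic_outlook, 3))   # ~0.157
```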

  49. More on the gain ratio • “Outlook” still comes out top • However: “ID code” has greater gain ratio • Standard fix: ad hoc test to prevent splitting on that type of attribute • Problem with gain ratio: it may overcompensate • May choose an attribute just because its intrinsic information is very low • Standard fix: • First, only consider attributes with greater than average information gain • Then, compare them on gain ratio witten&eibe

  50. *CART Splitting Criteria: Gini Index • If a data set T contains examples from n classes, the gini index gini(T) is defined as gini(T) = 1 – Σj pj², where pj is the relative frequency of class j in T. • gini(T) is minimized if the classes in T are skewed.
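A final sketch computing the Gini index from class counts; only the formula above is used.

```python
# Sketch: Gini index of a node, gini(T) = 1 - sum_j p_j^2.
def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

print(gini([6, 0]))  # 0.0    -> one class only (maximally skewed, minimum gini)
print(gini([1, 5]))  # ~0.278
print(gini([3, 3]))  # 0.5    -> evenly split (maximum for two classes)
```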
