
Data Mining


Presentation Transcript


  1. Data Mining Flisha Fernandez

  2. What Is Data Mining? • Data Mining is the discovery of hidden knowledge, unexpected patterns and new rules in large databases • Automating the process of searching patterns in the data Emily Thomas, State University of New York

  3. Objective of Data Mining • Any of four types of relationships are sought: • Classes – stored data is used to locate data in predetermined groups • Clusters – data items grouped according to logical relationships or consumer preferences • Associations – Walmart example (beer and diapers) • Sequential Patterns – data mined to anticipate behavior patterns and trends anderson.ucla.edu

  4. Data Mining Techniques • Rule Induction – extraction of if-then rules • Nearest Neighbors • Artificial Neural Networks – models that learn • Clustering • Genetic Algorithms – concept of natural evolution • Factor Analysis • Exploratory • Stepwise Regression • Data Visualization – usage of graphic tools • Decision Trees Emily Thomas, State University of New York

  5. Classification • A major data mining operation • Given one attribute (e.g. Wealth), try to predict the value of new people’s wealth by means of some of the other available attributes • Applies to categorical outputs • Categorical attribute: an attribute which takes on two or more discrete values, also known as a symbolic attribute • Real attribute: a column of real numbers, also known as a continuous attribute Flisha Fernandez
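
As a rough illustration of the two attribute types (a minimal sketch; the records and column names below are invented for this example, not taken from any dataset in the slides):

```python
# Categorical (symbolic) attributes take a few discrete values; real (continuous)
# attributes are plain numbers; the target column ("wealth" here) is the
# categorical attribute whose value we try to predict for new people.
records = [
    {"education": "bachelors",   "age": 35.0, "hours_per_week": 40.0, "wealth": "rich"},
    {"education": "high_school", "age": 22.0, "hours_per_week": 25.0, "wealth": "poor"},
    {"education": "masters",     "age": 41.0, "hours_per_week": 50.0, "wealth": "rich"},
]

categorical_attributes = {"education", "wealth"}   # two or more discrete values
real_attributes = {"age", "hours_per_week"}        # columns of real numbers
target = "wealth"                                  # the class to predict
```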

  6. Dataset (table omitted: a sample dataset with both real and symbolic attributes and a classification column) Kohavi 1995

  7. Decision Trees Flisha Fernandez

  8. Decision Trees • Also called classification trees or regression trees • Based on recursive partitioning of the sample space • Tree-shaped structures that represent sets of decisions/data which generate rules for the classification of a (new, unclassified) dataset • The resulting classification tree becomes an input for decision making anderson.ucla.edu / Wikipedia

  9. Decision Trees • A Decision Tree is a tree in which: • The non-leaf nodes are labeled with attributes • The arcs out of a node labeled with attribute A are labeled with each of the possible values of the attribute A • The leaves of the tree are labeled with classifications Learning Decision Trees
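
A minimal sketch of that structure in Python (the class names Node and Leaf and the classify helper are my own naming, not from the slides):

```python
class Leaf:
    """A leaf of the tree, labeled with a classification."""
    def __init__(self, classification):
        self.classification = classification

class Node:
    """A non-leaf node, labeled with an attribute; each outgoing arc is labeled
    with one of that attribute's possible values and leads to a subtree."""
    def __init__(self, attribute, branches):
        self.attribute = attribute
        self.branches = branches  # maps attribute value -> Node or Leaf

def classify(tree, example):
    """Follow the arcs that match the example's attribute values until a leaf."""
    while isinstance(tree, Node):
        tree = tree.branches[example[tree.attribute]]
    return tree.classification
```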

  10. Advantages • Simple to understand and interpret • Requires little data preparation (other techniques need data normalization, dummy variables, etc.) • Can support both numerical and categorical data • Uses white box model (explained by boolean logic) • Reliable – possible to validate model using statistical tests • Robust – large amounts of data can be analyzed in a short amount of time Wikipedia

  11. Decision Tree Example Flisha Fernandez

  12. David’s Debacle • Status Quo: Sometimes people play golf, sometimes they do not • Objective: Come up with optimized staff schedule • Means: Predict when people will play golf, when they will not From Wikipedia

  13. David’s Dataset Quinlan 1989

  14. David’s Diagram (diagram omitted: the root tests outlook?, with humid? and windy? tests on the branches below) From Wikipedia

  15. David’s Decision Tree (tree omitted: the root node tests outlook with arcs Sunny / Overcast / Rain; non-leaf nodes are labeled with attributes, arcs with possible values such as humidity Normal / High and windy False / True; leaves are labeled with classifications) Flisha Fernandez

  16. David’s Decision • Dismiss staff when it is • Sunny AND Humid • Rainy AND Windy • Hire extra staff when it is • Overcast • Sunny AND Not Humid • Rainy AND Not Windy From Wikipedia
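
Written as plain if/else rules (a sketch only; the attribute names follow the golf dataset, and the staffing actions are the slide’s own wording):

```python
def staffing_decision(outlook, humidity, windy):
    """David's rules from the slide above, expressed as nested if/else tests."""
    if outlook == "overcast":
        return "hire extra staff"                    # people play when it is overcast
    if outlook == "sunny":                           # play depends on humidity
        return "dismiss staff" if humidity == "high" else "hire extra staff"
    if outlook == "rain":                            # play depends on wind
        return "dismiss staff" if windy else "hire extra staff"
    raise ValueError("unknown outlook: %r" % outlook)

print(staffing_decision("sunny", "normal", windy=False))   # hire extra staff
print(staffing_decision("rain", "high", windy=True))       # dismiss staff
```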

  17. Decision Tree Induction Flisha Fernandez

  18. Decision Tree Induction Algorithm • Basic Algorithm (Greedy Algorithm) • Tree is constructed in a top-down recursive divide-and-conquer manner • At start, all the training examples are at the root • Attributes are categorical (if continuous-valued, they are discretized in advance) • Examples are partitioned recursively based on selected attributes • Test attributes are selected on the basis of a certain measure (e.g., information gain) Simon Fraser University, Canada
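
A minimal sketch of that greedy, top-down procedure, reusing the Leaf and Node classes from the earlier sketch; select_attribute stands in for whatever measure (e.g. information gain) is used to pick the test attribute:

```python
from collections import Counter

def induce_tree(examples, attributes, target, select_attribute):
    """Top-down, recursive, divide-and-conquer tree construction."""
    classes = [ex[target] for ex in examples]
    # Stop when the node is pure or there are no attributes left to test;
    # label the leaf with the majority class.
    if len(set(classes)) == 1 or not attributes:
        return Leaf(Counter(classes).most_common(1)[0][0])
    # Pick the test attribute by the chosen measure (e.g. highest information gain).
    best = select_attribute(examples, attributes, target)
    branches = {}
    # Partition the examples on each value of the chosen attribute and recurse.
    for value in {ex[best] for ex in examples}:
        subset = [ex for ex in examples if ex[best] == value]
        branches[value] = induce_tree(subset,
                                      [a for a in attributes if a != best],
                                      target, select_attribute)
    return Node(best, branches)
```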

  19. Issues • How should the attributes be split? • What is the appropriate root? • What is the best split? • When should we stop splitting? • Should we prune? Decision Tree Learning

  20. Two Phases • Tree Growing (Splitting) • Splitting data into progressively smaller subsets • Analyzing the data to find the independent variable (such as outlook, humidity, windy) that when used as a splitting rule will result in nodes that are most different from each other with respect to the dependent variable (play) • Tree Pruning DBMS Data Mining Solutions Supplement

  21. Splitting Flisha Fernandez

  22. Splitting Algorithms • Random • Information Gain • Information Gain Ratio • GINI Index DBMS Data Mining Solutions Supplement

  23. Dataset AIXploratorium

  24. Random Splitting AIXploratorium

  25. Random Splitting • Disadvantages • Trees can grow huge • Hard to understand • Less accurate than smaller trees AIXploratorium

  26. Splitting Algorithms • Random • Information Gain • Information Gain Ratio • GINI Index DBMS Data Mining Solutions Supplement

  27. Information Entropy • A measure of the uncertainty associated with a random variable • A measure of the average information content the recipient is missing when they do not know the value of the random variable • A long string of repeating characters has an entropy of 0, since every character is predictable • Example: Coin Toss • Independent fair coin flips have an entropy of 1 bit per flip • A double-headed coin has an entropy of 0 – each toss of the coin delivers no information Wikipedia
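
A small helper that computes this (Shannon entropy in bits), matching the coin examples above:

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits: the average information the recipient is missing
    when they do not know the value of the random variable."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))   # fair coin flip: 1.0 bit per flip
print(entropy([1.0]))        # double-headed coin: 0.0, each toss delivers no information
```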

  28. Information Gain • A good measure for deciding the relevance of an attribute • The value of Information Gain is the reduction in the entropy of X achieved by learning the state of the random variable A • Can be used to define a preferred sequence (decision tree) of attributes to investigate to most rapidly narrow down the state of X • Used by ID3 and C4.5 Wikipedia

  29. Calculating Information Gain • First compute the information content of each branch • Ex. Attribute Thread = New, Skips = 3, Reads = 7 • -(3/10) * log (3/10) - (7/10) * log (7/10) = 0.881 (using log base 2) • Ex. Attribute Thread = Old, Skips = 6, Reads = 2 • -(6/8) * log (6/8) - (2/8) * log (2/8) = 0.811 (using log base 2) • Information Gain is the entropy of the whole set minus the weighted average of the branch entropies • Of the 18 examples, 10 threads are new and 8 are old; the whole set has 9 skips and 9 reads, so its entropy is 1.0 • 1.0 - [ (10/18) * 0.881 + (8/18) * 0.811 ] = 0.150 Flisha Fernandez
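
Reproducing those numbers with the entropy helper from the earlier sketch (a check of the arithmetic, nothing more):

```python
new_threads = entropy([3 / 10, 7 / 10])   # Thread = New: 3 skips, 7 reads  -> ~0.881
old_threads = entropy([6 / 8, 2 / 8])     # Thread = Old: 6 skips, 2 reads  -> ~0.811

# Whole set: 18 examples (9 skips, 9 reads), entropy 1.0; subtract the
# weighted average entropy of the two branches to get the information gain.
gain = 1.0 - ((10 / 18) * new_threads + (8 / 18) * old_threads)
print(round(new_threads, 3), round(old_threads, 3), round(gain, 3))   # 0.881 0.811 0.15
```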

  30. Information Gain AIXploratorium

  31. Test Data AIXploratorium

  32. Drawbacks of Information Gain • Prefers attributes with many values (such as real attributes) • E.g. prefers an attribute like AudienceSize with values {1, 2, 3, ..., 150, 151, ..., 1023, 1024, ...} • But attributes with more values are not necessarily better • Example: credit card number • Has a high information gain because it uniquely identifies each customer • But deciding how to treat a customer based on their credit card number is unlikely to generalize to customers we haven't seen before Wikipedia / AIXploratorium

  33. Splitting Algorithms • Random • Information Gain • Information Gain Ratio • GINI Index DBMS Data Mining Solutions Supplement

  34. Information Gain Ratio • Works by penalizing multiple-valued attributes • The intrinsic information of a split is • Large when data is evenly spread across many branches • Small when all data belong to one branch • Gain ratio takes number and size of branches into account when choosing an attribute • It corrects the information gain by taking the intrinsic information of a split into account (i.e. how much info do we need to tell which branch an instance belongs to) http://www.it.iitb.ac.in/~sunita

  35. Dataset http://www.it.iitb.ac.in/~sunita

  36. Calculating Gain Ratio • Intrinsic information: entropy of the distribution of instances into branches • Gain ratio (Quinlan ’86) normalizes info gain by the intrinsic information of the split: GainRatio(A) = Gain(A) / IntrinsicInfo(A) http://www.it.iitb.ac.in/~sunita
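
A sketch of both quantities, reusing the entropy helper from the earlier sketch; the branch sizes at the end are only illustrations (an ID-code-style split of 14 instances into singleton branches versus a three-way split), not values taken from the slides’ dataset:

```python
def intrinsic_information(branch_sizes):
    """Entropy of the distribution of instances over the branches of a split."""
    total = sum(branch_sizes)
    return entropy([n / total for n in branch_sizes])

def gain_ratio(information_gain, branch_sizes):
    """Quinlan's gain ratio: information gain divided by the intrinsic information."""
    return information_gain / intrinsic_information(branch_sizes)

print(intrinsic_information([1] * 14))    # one branch per instance: log2(14), ~3.807 bits
print(intrinsic_information([5, 4, 5]))   # a three-way split: ~1.577 bits
```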

  37. Computing Gain Ratio • Example: intrinsic information for ID code (each instance falls into its own branch, so the intrinsic information is log2 of the number of instances) • Importance of an attribute decreases as its intrinsic information gets larger • Example of gain ratio: GainRatio(ID code) = Gain(ID code) / IntrinsicInfo(ID code) http://www.it.iitb.ac.in/~sunita

  38. Dataset http://www.it.iitb.ac.in/~sunita

  39. Information Gain Ratios http://www.it.iitb.ac.in/~sunita

  40. More on Gain Ratio • “Outlook” still comes out top • However, “ID code” has greater gain ratio • Standard fix: ad hoc test to prevent splitting on that type of attribute • Problem with gain ratio: it may overcompensate • May choose an attribute just because its intrinsic information is very low • Standard fix: • First, only consider attributes with greater than average information gain • Then, compare them on gain ratio http://www.it.iitb.ac.in/~sunita
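
The standard fix in the last bullet, as a small sketch (the candidates mapping is assumed to be computed elsewhere; the function name is mine):

```python
def choose_attribute(candidates):
    """candidates maps attribute name -> (information_gain, gain_ratio).
    Keep only attributes with at least average information gain, then pick
    the one with the highest gain ratio among them."""
    average_gain = sum(gain for gain, _ in candidates.values()) / len(candidates)
    eligible = [a for a, (gain, _) in candidates.items() if gain >= average_gain]
    return max(eligible, key=lambda a: candidates[a][1])
```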

  41. Information Gain Ratio • The decision tree built using information gain ratio (tree image omitted) AIXploratorium

  42. Test Data AIXploratorium

  43. Splitting Algorithms • Random • Information Gain • Information Gain Ratio • GINI Index DBMS Data Mining Solutions Supplement

  44. GINI Index • All attributes are assumed continuous-valued • Assumes there exist several possible split values for each attribute • May need other tools, such as clustering, to get the possible split values • Can be modified for categorical attributes • Used in CART, SLIQ, SPRINT Flisha Fernandez

  45. GINI Index • If a data set T contains examples from n classes, the gini index gini(T) is defined as gini(T) = 1 – Σj pj², where pj is the relative frequency of class j in T • If the data set T is split into two subsets T1 and T2 with sizes N1 and N2 respectively, the gini index of the split data is defined as ginisplit(T) = (N1/N) · gini(T1) + (N2/N) · gini(T2) • The attribute that provides the smallest ginisplit(T) is chosen to split the node khcho@dblab.cbu.ac.kr

  46. GINI Index • Maximum (1 - 1/nc) when records are equally distributed among all classes, implying least interesting information • Minimum (0.0) when all records belong to one class, implying most interesting information khcho@dblab.cbu.ac.kr

  47. Examples for Computing Gini • P(C1) = 0/6 = 0, P(C2) = 6/6 = 1 → Gini = 1 – P(C1)² – P(C2)² = 1 – 0 – 1 = 0 • P(C1) = 1/6, P(C2) = 5/6 → Gini = 1 – (1/6)² – (5/6)² = 0.278 • P(C1) = 2/6, P(C2) = 4/6 → Gini = 1 – (2/6)² – (4/6)² = 0.444 khcho@dblab.cbu.ac.kr
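
The same calculations in code, plus the weighted form used to score a split (a sketch; the function names are mine):

```python
def gini(class_counts):
    """Gini index of a node: 1 minus the sum of squared class frequencies."""
    total = sum(class_counts)
    return 1.0 - sum((n / total) ** 2 for n in class_counts)

def gini_split(subsets):
    """Weighted Gini of a split: sum over subsets of (N_i / N) * gini(subset)."""
    total = sum(sum(counts) for counts in subsets)
    return sum((sum(counts) / total) * gini(counts) for counts in subsets)

print(gini([0, 6]))   # 0.0   -- pure node, "most interesting"
print(gini([1, 5]))   # ~0.278
print(gini([2, 4]))   # ~0.444
print(gini([3, 3]))   # 0.5 = 1 - 1/2, the maximum for two equally distributed classes
```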

  48. Stopping Rules • Pure nodes • Maximum tree depth, or maximum number of nodes in a tree • Because of overfitting problems • Minimum number of elements in a node considered for splitting, or its near equivalent • Minimum number of elements that must be in a new node • A threshold for the purity measure can be imposed such that if a node has a purity value higher than the threshold, no partitioning will be attempted regardless of the number of observations DBMS Data Mining Solutions Supplement / AI Depot
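
A sketch of these rules as a single test used during tree growing; the threshold values are arbitrary placeholders, not recommendations from the slides:

```python
from collections import Counter

def should_stop(node_examples, depth, target,
                max_depth=5, min_split_size=10, purity_threshold=0.95):
    """Return True if no further split should be attempted at this node."""
    classes = [ex[target] for ex in node_examples]
    majority_fraction = Counter(classes).most_common(1)[0][1] / len(classes)
    return (len(set(classes)) == 1                     # pure node
            or depth >= max_depth                      # maximum tree depth reached
            or len(node_examples) < min_split_size     # too few elements to split
            or majority_fraction >= purity_threshold)  # purity already above threshold
```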

  49. Overfitting • The generated tree may overfit the training data • Too many branches, some may reflect anomalies due to noise or outliers • This results in poor accuracy for unseen samples • Two approaches to avoid overfitting • Prepruning: Halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold • Difficult to choose an appropriate threshold • Postpruning: Remove branches from a “fully grown” tree to get a sequence of progressively pruned trees • Use a set of data different from the training data to decide which is the “best pruned tree” Flisha Fernandez
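
If scikit-learn is available, both approaches can be sketched with its DecisionTreeClassifier (the iris data and the parameter values below are illustrative only, chosen just to make the sketch runnable):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy data purely to make the sketch self-contained.
X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Prepruning: halt construction early with thresholds on depth and node size.
prepruned = DecisionTreeClassifier(max_depth=4, min_samples_split=20).fit(X_train, y_train)

# Postpruning: grow a full tree, compute its cost-complexity pruning path, and
# keep the pruned tree that scores best on data not used for training.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
best_pruned = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
     for a in path.ccp_alphas),
    key=lambda tree: tree.score(X_val, y_val),
)
print(best_pruned.get_depth(), best_pruned.score(X_val, y_val))
```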

  50. Pruning • Used to make a tree more general, more accurate • Removes branches that reflect noise DBMS Data Mining Solutions Supplement
