
Data mining and Machine Learning


Presentation Transcript


  1. Data mining and Machine Learning Sunita Sarawagi Sunita@iitb.ac.in

2. Data Mining • Data mining is the process of semi-automatically analyzing large databases to find useful patterns • Prediction based on past history • Predict whether a credit card applicant poses a good credit risk, based on some attributes (income, job type, age, ...) and past history • Predict whether a pattern of phone calling card usage is likely to be fraudulent • Some examples of prediction mechanisms: • Classification • Given a new item whose class is unknown, predict to which class it belongs • Regression formulae • Given a set of mappings for an unknown function, predict the function result for a new parameter value

3. Data Mining (Cont.) • Descriptive Patterns • Associations • Find books that are often bought by “similar” customers. If a new such customer buys one such book, suggest the others too. • Associations may be used as a first step in detecting causation • E.g. an association between exposure to chemical X and cancer. • Clusters • E.g. typhoid cases were clustered in an area surrounding a contaminated well • Detection of clusters remains important in detecting epidemics

4. Data mining • Data: of various shapes and sizes • Patterns/Models: of various shapes and sizes; an abstraction of the data into something understandable and useful • Basic structure of data: • A set of instances/objects/cases/rows/points/examples • Each instance: a fixed set of attributes/dimensions/columns • Continuous • Categorical • Patterns: • Express one attribute as a function of the others: classification, regression • Group together related instances: clustering, projection, factorization, itemset mining

5. Classification • Given old data about customers and payments, predict a new applicant’s loan eligibility. [Figure: during training, labeled data on previous customers (age, salary, profession, location, customer type) is fed to the learner, which produces a classifier with decision rules such as “Salary > 5 L” and “Prof. = Exec” predicting the class label good/bad; at deployment, the classifier labels a new customer’s unlabeled data.]

6. Applications • Ad placement in search engines • Book recommendation • Citation databases: Google Scholar, CiteSeer • Resume organization and job matching • Retail data mining • Banking: loan/credit card approval • predict good customers based on old customers • Customer relationship management: • identify those who are likely to leave for a competitor • Targeted marketing: • identify likely responders to promotions • Machine translation • Speech and handwriting recognition • Fraud detection: telecommunications, financial transactions • from an online stream of events, identify the fraudulent ones

7. Applications (continued) • Medicine: disease outcome, effectiveness of treatments • analyze patient disease history: find relationships between diseases • Molecular/pharmaceutical: identify new drugs • Scientific data analysis: • identify new galaxies by searching for sub-clusters • Image and vision: • object recognition from images • removing noise from images • identifying scene breaks

8. The KDD process • Problem formulation • Data collection • subset data: sampling might hurt if the data is highly skewed • feature selection: principal component analysis, heuristic search • Pre-processing: cleaning • name/address cleaning, different meanings (annual, yearly), duplicate removal, supplying missing values • Transformation: • map complex objects, e.g. time-series data, to features, e.g. frequencies • Choosing the mining task and mining method • Result evaluation and visualization Knowledge discovery is an iterative process

9. Mining products [Figure: a data warehouse feeds data (extracted via ODBC) into preprocessing utilities (sampling, attribute transformation), then into mining operations (scalable algorithms for association, classification, clustering, sequence mining), and finally into visualization tools.] • Commercial tools: SAS Enterprise Miner, SPSS, IBM Intelligent Miner, Microsoft SQL Server data mining services, Oracle Data Mining (ODM) • Free: Weka, individual algorithms

10. Mining operations • Classification/Regression: classification trees, neural networks, Bayesian learning, nearest neighbour, radial basis functions, support vector machines; meta-learning methods: bagging, boosting • Clustering: hierarchical, EM, density-based • Sequence mining: time-series similarity, temporal patterns • Itemset mining: association rules, causality • Sequential classification: graphical models, hidden Markov models

11. Classification methods Goal: predict class C = f(x1, x2, ..., xn) • Regression: linear or any other polynomial • Decision tree classifiers: divide the decision space into piecewise constant regions • Neural networks: partition by non-linear boundaries • Probabilistic/generative models • Lazy learning methods: nearest neighbor • Support vector machines: find a boundary that maximally separates the classes

  12. Decision tree learning

13. Decision tree classifiers • Widely used learning method • Easy to interpret: can be re-represented as if-then-else rules • Approximates a function by piecewise constant regions • Does not require any prior knowledge of the data distribution; works well on noisy data • Has been applied to: • classifying medical patients by disease, • equipment malfunctions by cause, • loan applicants by likelihood of repayment, • and many other applications

14. Decision trees • Tree where internal nodes are simple decision rules on one or more attributes and leaf nodes are predicted class labels. [Figure: an example tree that splits on Salary < 1 M at the root, then on Prof = teaching and Age < 30, with Good/Bad class labels at the leaves.]
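As a minimal illustration of the if-then-else view, the example tree above can be written directly as nested rules. The tests come from the figure, but which leaf carries which label is not recoverable from the slide, so the labels and branch layout below are illustrative:

```python
def loan_risk(salary, profession, age):
    # Hypothetical tree from the slide; leaf labels are assumed, not given.
    if salary < 1_000_000:            # Salary < 1 M
        if profession == "teaching":  # Prof = teaching
            return "Bad"
        return "Good"
    if age < 30:                      # Age < 30
        return "Bad"
    return "Good"

print(loan_risk(500_000, "teaching", 45))  # "Bad"
```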

  15. Training Dataset This follows an example from Quinlan’s ID3

16. Output: A Decision Tree for “buys_computer”
age?
• <=30 → student? (no → no, yes → yes)
• 31..40 → yes
• >40 → credit_rating? (excellent → no, fair → yes)

17. Weather Data: Play or not Play? Note: Outlook is the forecast; no relation to the Microsoft email program

18. Example Tree for “Play?”
Outlook
• sunny → Humidity (high → No, normal → Yes)
• overcast → Yes
• rain → Windy (true → No, false → Yes)

19. Topics to be covered • Tree construction: • Basic tree learning algorithm • Measures of predictive ability • High-performance decision tree construction: SPRINT • Tree pruning: • Why prune • Methods of pruning • Other issues: • Handling missing data • Continuous class labels • Effect of training size

  20. Tree learning algorithms • ID3 (Quinlan 1986) • Successor C4.5 (Quinlan 1993) • CART • SLIQ (Mehta et al) • SPRINT (Shafer et al)

21. Basic algorithm for tree building • Greedy top-down construction:
Gen_Tree(node, data):
  if the stopping criterion holds, make node a leaf and stop
  find the best attribute and the best split on that attribute (selection criteria)
  partition the data on the split condition
  for each child j of node: Gen_Tree(node_j, data_j)
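A minimal runnable sketch of this greedy recursion, assuming binary splits on numeric attributes and Gini impurity as the selection criterion (the function names, tuple-based tree representation, and depth cutoff are all illustrative):

```python
from collections import Counter

def gini(labels):
    # Impurity of a set of class labels: 1 - sum of squared class proportions.
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    # Score every (attribute, threshold) pair by weighted impurity of the parts.
    best = None  # (score, attribute index, threshold)
    for j in range(len(rows[0])):
        for t in sorted({r[j] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[j] <= t]
            right = [y for r, y in zip(rows, labels) if r[j] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, j, t)
    return best

def gen_tree(rows, labels, depth=0, max_depth=3):
    # Make a leaf (majority label) if the node is pure, deep, or unsplittable.
    split = best_split(rows, labels)
    if len(set(labels)) == 1 or depth == max_depth or split is None:
        return Counter(labels).most_common(1)[0][0]
    _, j, t = split
    le = [(r, y) for r, y in zip(rows, labels) if r[j] <= t]
    gt = [(r, y) for r, y in zip(rows, labels) if r[j] > t]
    return (j, t,
            gen_tree([r for r, _ in le], [y for _, y in le], depth + 1, max_depth),
            gen_tree([r for r, _ in gt], [y for _, y in gt], depth + 1, max_depth))

def classify(tree, row):
    # Walk the nested (attr, threshold, left, right) tuples down to a leaf label.
    while isinstance(tree, tuple):
        j, t, left, right = tree
        tree = left if row[j] <= t else right
    return tree
```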

22. Split criteria • Select the attribute that is best for classification. • Intuitively, pick the one that best separates instances of different classes. • Quantifying the intuition, i.e. measuring separability: • First define the impurity of an arbitrary set S consisting of K classes. • Impurity should be smallest when S consists of only one class, and highest when all classes are present in equal numbers. • It should also allow computation in multiple stages.

23. Measures of impurity • Entropy: Entropy(S) = −Σi pi log pi • Gini: Gini(S) = 1 − Σi pi², where pi is the fraction of S belonging to class i (i = 1, ..., K)
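A small sketch of both measures; the log base is a convention (base 2 gives bits, and the worked example on slide 25 uses base 10):

```python
import math

def entropy(p, base=2.0):
    # p: list of class proportions; 0 * log(0) is treated as 0 by convention.
    return -sum(pi * math.log(pi, base) for pi in p if pi > 0)

def gini(p):
    return 1.0 - sum(pi * pi for pi in p)

print(entropy([0.5, 0.5]), gini([0.5, 0.5]))  # 1.0 0.5 -> maximal two-class impurity
print(entropy([1.0]), gini([1.0]))            # 0.0 0.0 -> a pure set
```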

24. Information gain • Information gain on partitioning S into r subsets S1, ..., Sr: Gain = Impurity(S) − Σj (|Sj| / |S|) · Impurity(Sj), i.e. the impurity of S minus the weighted impurity of each subset. [Figure: entropy and Gini impurity of a two-class set plotted against p1; both are 0 at p1 = 0 and p1 = 1 and peak at p1 = 0.5.]

25. Information gain: example • K = 2, |S| = 100, p1 = 0.6, p2 = 0.4: E(S) = −0.6 log(0.6) − 0.4 log(0.4) = 0.29 (base-10 logs throughout) • Partition S into S1 and S2: • |S1| = 70, p1 = 0.8, p2 = 0.2: E(S1) = −0.8 log 0.8 − 0.2 log 0.2 = 0.21 • |S2| = 30, p1 = 0.13, p2 = 0.87: E(S2) = −0.13 log 0.13 − 0.87 log 0.87 = 0.16 • Information gain: E(S) − (0.7 E(S1) + 0.3 E(S2)) ≈ 0.1
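The slide's numbers check out with base-10 logarithms; a few lines reproduce them:

```python
import math

def entropy(p, base):
    return -sum(pi * math.log(pi, base) for pi in p if pi > 0)

E_S  = entropy([0.60, 0.40], base=10)   # ~0.292
E_S1 = entropy([0.80, 0.20], base=10)   # ~0.217
E_S2 = entropy([0.13, 0.87], base=10)   # ~0.168
gain = E_S - (0.7 * E_S1 + 0.3 * E_S2)  # ~0.090, the slide's ~0.1
print(E_S, E_S1, E_S2, gain)
```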

  26. Weather Data: Play or not Play?

27. Which attribute to select? [Figure: the four candidate single-attribute splits of the weather data, one stump per attribute.] (Witten & Eibe)

28. Example: attribute “Outlook” • “Outlook” = “Sunny”: info([2,3]) = entropy(2/5, 3/5) = 0.971 bits • “Outlook” = “Overcast”: info([4,0]) = entropy(1, 0) = 0 bits • “Outlook” = “Rainy”: info([3,2]) = entropy(3/5, 2/5) = 0.971 bits • Expected information for the attribute: info([2,3], [4,0], [3,2]) = (5/14) × 0.971 + (4/14) × 0 + (5/14) × 0.971 = 0.693 bits • Note: log(0) is not defined, but we evaluate 0 × log(0) as zero (Witten & Eibe)

29. Computing the information gain • Information gain = (information before split) − (information after split): gain(“Outlook”) = info([9,5]) − info([2,3], [4,0], [3,2]) = 0.940 − 0.693 = 0.247 bits • Information gain for the attributes of the weather data: gain(“Outlook”) = 0.247 bits, gain(“Temperature”) = 0.029 bits, gain(“Humidity”) = 0.152 bits, gain(“Windy”) = 0.048 bits (Witten & Eibe)
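These gains can be reproduced from the per-branch class counts of the standard 14-day weather data (a short sketch; entropy in bits):

```python
import math

def info(counts):
    # Entropy in bits of a class-count list such as [9, 5].
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c)

def gain(parent, partition):
    # parent: class counts before the split; partition: counts in each branch.
    n = sum(parent)
    return info(parent) - sum(sum(p) / n * info(p) for p in partition)

before = [9, 5]  # 9 "yes", 5 "no" over the 14 instances
print(gain(before, [[2, 3], [4, 0], [3, 2]]))  # outlook     ~0.247
print(gain(before, [[2, 2], [4, 2], [3, 1]]))  # temperature ~0.029
print(gain(before, [[3, 4], [6, 1]]))          # humidity    ~0.152
print(gain(before, [[6, 2], [3, 3]]))          # windy       ~0.048
```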

30. Continuing to split [Figure: splitting the “sunny” branch further; humidity gives the highest gain there.] (Witten & Eibe)

31. The final decision tree • Note: not all leaves need to be pure; sometimes identical instances have different classes • Splitting stops when the data can’t be split any further (Witten & Eibe)

32. Preventing overfitting • A tree T overfits if there is another tree T’ that gives higher error on the training data yet lower error on unseen data. • An overfitted tree does not generalize to unseen instances. • Overfitting happens when the data contains noise or irrelevant attributes and the training size is small. • Overfitting can reduce accuracy drastically: • by 10–25%, as reported in Mingers’ 1989 Machine Learning paper • An example of overfitting with binary data follows.

33. Training Data vs. Test Data Error Rates • Compare error rates measured on • the training (“learn”) data • a large test set • The training error R(T) always decreases as the tree grows (Q: why?) • The test error Rts(T) first declines, then increases (Q: why?) • Overfitting is the result of too much reliance on the training R(T) • Can lead to disasters when applied to new data

No. terminal nodes   R(T)   Rts(T)
        71           .00    .42
        63           .00    .40
        58           .03    .39
        40           .10    .32
        34           .12    .32
        19           .20    .31
      **10           .29    .30
         9           .32    .34
         7           .41    .47
         6           .46    .54
         5           .53    .61
         2           .75    .82
         1           .86    .91

(** marks the tree with the lowest test error; digit recognition dataset, CART book)

34. Overfitting example • Consider the case where a single attribute xj is adequate for classification, but with an error of 20% • Consider lots of other noise attributes that enable zero error during training • During testing, this detailed tree will have an expected error of 0.8 × 0.2 + 0.2 × 0.8 = 32% (its noise-fitted leaves agree with the xj signal 80% of the time, and the signal agrees with the true label 80% of the time), whereas the pruned tree with only a single split on xj will have an error of only 20%.

35. Approaches to prevent overfitting • Two approaches: • 1. Stop growing the tree beyond a certain point • Tricky, since even when the information gain is zero an attribute might be useful (XOR example) • 2. First overfit, then post-prune (more widely used) • Tree building divided into phases: • Growth phase • Prune phase

36. Criteria for finding the correct final tree size: • Three criteria: • Cross-validation with separate test data • Statistical bounds: use all the data for training, but apply a statistical test to decide the right size (a cross-validation dataset may be used to set the threshold) • Use some criterion function to choose the best size • Example: minimum description length (MDL) criterion

37. Cross validation • Partition the dataset into two disjoint parts: • 1. Training set, used for building the tree • 2. Validation set, used for pruning the tree • Rule of thumb: 2/3 training, 1/3 validation • Evaluate the tree on the validation set, and at each leaf and internal node keep a count of correctly labeled data. • Starting bottom-up, prune a node if its own validation error is lower than that of its children. • What if the training dataset size is limited? • n-fold cross validation: partition the training data into n parts D1, D2, ..., Dn • Train n classifiers, with D − Di as training data and Di as the test set • Average the n error estimates (e.g. to pick the best tree size, then rebuild on all the data)
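A minimal sketch of the n-fold estimate, where train_fn builds a tree on the training folds and error_fn measures its error on the held-out fold (both names are placeholders for whatever learner is being tuned):

```python
def cross_val_error(rows, labels, train_fn, error_fn, n=10):
    # Deal the instances into n folds, hold each fold out once, and average.
    errors = []
    for i in range(n):
        held = set(range(i, len(rows), n))  # every n-th instance, offset i
        tr_rows = [r for k, r in enumerate(rows) if k not in held]
        tr_lab  = [y for k, y in enumerate(labels) if k not in held]
        model = train_fn(tr_rows, tr_lab)
        errors.append(error_fn(model,
                               [rows[k] for k in sorted(held)],
                               [labels[k] for k in sorted(held)]))
    return sum(errors) / n  # averaged error estimate
```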

38. Extracting Classification Rules from Trees • Represent the knowledge in the form of IF-THEN rules • One rule is created for each path from the root to a leaf • Each attribute-value pair along a path forms a conjunction • The leaf node holds the class prediction • Rules are easier for humans to understand • Example (matching the tree of slide 16):
IF age = “<=30” AND student = “no” THEN buys_computer = “no”
IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
IF age = “31…40” THEN buys_computer = “yes”
IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “no”
IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “yes”

39. Rule-based pruning • Tree-based pruning limits the kinds of pruning possible: if a node is pruned, all subtrees under it have to be pruned. • Rule-based: for each leaf of the tree, extract a rule as the conjunction of all tests from the root down to that leaf. • On the validation set, independently prune tests from each rule to get the highest accuracy for that rule. • Sort the rules by decreasing accuracy.
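One plausible reading of the per-rule step, sketched below: a rule is a list of test predicates, and accuracy_fn (an assumption, standing for validation-set accuracy of a conjunction of tests) guides which tests to drop. Tests are greedily removed while accuracy does not fall:

```python
def prune_rule(tests, accuracy_fn):
    # tests: predicates (row -> bool) whose conjunction forms one rule.
    best = list(tests)
    improved = True
    while improved and best:
        improved = False
        for t in list(best):
            candidate = [u for u in best if u is not t]  # drop one test
            if accuracy_fn(candidate) >= accuracy_fn(best):
                best, improved = candidate, True
                break
    return best
```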

40. Regression trees • Decision trees with continuous class labels: • Regression trees approximate the function with piecewise constant regions. • Split criterion for regression trees: • Predicted value for a set S = average of all values in S • Error = sum of the squared errors of each member of S from the predicted average • Pick the split with the smallest total error. • Splits on categorical attributes: • Can they be handled better than for discrete class labels? • Homework.
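A sketch of that criterion for a single numeric attribute, scanning every candidate threshold (the names are illustrative):

```python
def sse(values):
    # Sum of squared errors from the mean, the leaf's predicted value.
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_regression_split(x, y):
    # Minimize the total SSE of the two sides over all thresholds on x.
    best = None  # (total error, threshold)
    for t in sorted(set(x))[:-1]:  # the largest value can't split anything off
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        err = sse(left) + sse(right)
        if best is None or err < best[0]:
            best = (err, t)
    return best

print(best_regression_split([1, 2, 3, 10, 11], [1.0, 1.1, 0.9, 5.0, 5.2]))
# -> splits at x <= 3, separating the two flat regions
```

If x has only one distinct value, no split exists and the function returns None.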

41. Other types of trees • Multi-way trees on low-cardinality categorical data • Multiple splits on continuous attributes [Fayyad 93, Multi-interval discretization of continuous attributes] • Multi-attribute tests on nodes to handle correlated attributes • multivariate linear splits [oblique trees, Murthy 94]

  42. Issues • Methods of handling missing values • assume majority value • take most probable path • Allowing varying costs for different attributes

43. Pros and Cons of decision trees • Cons • Not effective for very high dimensional data where information about the class is spread thinly across many correlated features • Example: words in text classification • Not robust to the dropping of important features, even when correlated substitutes exist in the data • Pros • Reasonable training time • Fast application • Easy to interpret • Easy to implement • Intuitive

44. The k-Nearest Neighbor Algorithm • All instances correspond to points in the n-dimensional space. • Nearest neighbors are defined in terms of Euclidean distance. • The target function could be discrete- or real-valued. • For discrete-valued targets, k-NN returns the most common value among the k training examples nearest to xq. • Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples. [Figure: query point xq among + and − training points.] From Jiawei Han's slides
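A minimal k-NN sketch: Euclidean distance plus a majority vote over the k closest training points (toy data, illustrative names):

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    # train: list of (point, label) pairs; query: a point as a list of numbers.
    dist = lambda a, b: math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [([0, 0], "-"), ([0, 1], "-"), ([5, 5], "+"), ([6, 5], "+")]
print(knn_predict(train, [4, 4], k=3))  # "+": two of the three nearest are +
```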

  45. Other lazy learning methods • Locally weighted regression: • learn a new regression equation by weighting each training instance based on distance from new instance • Radial Basis Functions • Pros • Fast training • Cons • Slow during application. • No feature selection. • Notion of proximity vague

46. Bayesian learning • Assume a probability model for the generation of the data. • Apply Bayes’ theorem to find the most likely class: c* = argmax_c P(c | x1, ..., xn) = argmax_c P(x1, ..., xn | c) P(c) / P(x1, ..., xn) • Naïve Bayes: assume the attributes are conditionally independent given the class value, so P(x1, ..., xn | c) = Πi P(xi | c) • The probabilities are easy to learn by counting, in one pass over the data • Useful in some domains, e.g. text • Numeric attributes must be discretized
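A one-pass counting sketch of naïve Bayes for categorical attributes. Laplace (add-one) smoothing is added so unseen values don't zero out the product; the smoothing and all names are implementation choices, not from the slide:

```python
from collections import Counter, defaultdict

def train_nb(rows, labels):
    # Single counting pass: class priors plus per-(attribute, class) value counts.
    priors = Counter(labels)
    counts = defaultdict(Counter)   # (attribute index, class) -> value counts
    domains = defaultdict(set)      # attribute index -> values seen
    for row, c in zip(rows, labels):
        for j, v in enumerate(row):
            counts[(j, c)][v] += 1
            domains[j].add(v)
    return priors, counts, domains

def predict_nb(model, row):
    priors, counts, domains = model
    n = sum(priors.values())
    scores = {}
    for c, nc in priors.items():
        p = nc / n                   # prior P(c)
        for j, v in enumerate(row):  # naive independence: multiply P(xj = v | c)
            p *= (counts[(j, c)][v] + 1) / (nc + len(domains[j]))  # Laplace
        scores[c] = p
    return max(scores, key=scores.get)

model = train_nb([["sunny", "high"], ["rainy", "normal"], ["sunny", "normal"]],
                 ["no", "yes", "yes"])
print(predict_nb(model, ["sunny", "high"]))  # "no"
```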

47. Bayesian belief networks • Find the joint probability over a set of variables, making use of conditional independence whenever it is known. • Learning the parameters is hard when there are hidden units: use gradient descent / EM algorithms. • Learning the structure of the network is harder. [Figure: example network over variables a, b, d, e, c, with a conditional probability table over combinations of a and d; variable e is independent of d given b.]

48. Neural networks • Useful for learning complex data like handwriting, speech and image recognition. [Figure: decision boundaries learned by linear regression, a classification tree, and a neural network.]

49. Pros and Cons of Neural Networks • Cons • Slow training time • Hard to interpret • Hard to implement: trial and error for choosing the number of nodes • Pros • Can learn more complicated class boundaries • Fast application • Can handle a large number of features Conclusion: use neural nets only if decision trees / nearest neighbour fail.

  50. Linear discriminants
