Final Review

Final Review This is not a comprehensive review but highlights certain key areas

Top-Level Data Mining Tasks • At highest level, data mining tasks can be divided into: • Prediction Tasks (supervised learning) • Use some variables to predict unknown or future values of other variables • Classification • Regression • Description Tasks (unsupervised learning) • Find human-interpretable patterns that describe the data • Clustering • Association Rule Mining

Classification: Definition • Given a collection of records (training set) • Each record contains a set of attributes; one of the attributes is the class, which is to be predicted. • Find a model for the class attribute as a function of the values of the other attributes. • Model maps record to a class value • Goal: previously unseen records should be assigned a class as accurately as possible. • A test set is used to determine the accuracy of the model • Can you think of classification tasks?

Classification • Simple linear • Decision trees (entropy, GINI) • Naïve Bayesian • Nearest Neighbor • Neural Networks

Regression • Predict a value of a given continuous (numerical) variable based on the values of other variables • Greatly studied in statistics • Examples: • Predicting sales amounts of new product based on advertising expenditure. • Predicting wind velocities as a function of temperature, humidity, air pressure, etc. • Time series prediction of stock market indices

Clustering • Given a set of data points, find clusters so that • Data points in the same cluster are similar • Data points in different clusters are dissimilar • You try it on the Simpsons. How can we cluster these 5 “data points”?

Association Rule Discovery • Given a set of records, each of which contains some number of items from a given collection • Produce dependency rules which will predict the occurrence of an item based on occurrences of other items. • Rules Discovered: {Milk} --> {Coke}, {Diaper, Milk} --> {Beer}

Attribute Values • Attribute values are numbers or symbols assigned to an attribute • Distinction between attributes and attribute values • Same attribute can be mapped to different attribute values • Example: height can be measured in feet or meters • Different attributes can be mapped to the same set of values • Example: Attribute values for ID and age are integers • But properties of attribute values can be different • ID has no limit but age has a maximum and minimum value

Types of Attributes • There are different types of attributes • Nominal (Categorical) • Examples: ID numbers, eye color, zip codes • Ordinal • Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short} • Interval • Examples: calendar dates, temperatures in Celsius or Fahrenheit. • Ratio • Examples: temperature in Kelvin, length, time, counts

Decision Tree Representation • Each internal node tests an attribute • Each branch corresponds to an attribute value • Each leaf node assigns a classification [Figure: decision tree with root outlook; sunny leads to a humidity test (high: no, normal: yes), overcast leads to yes, rain leads to a wind test (strong: no, weak: yes)]

How do we construct the decision tree? • Basic algorithm (a greedy algorithm) • Tree is constructed in a top-down recursive divide-and-conquer manner • At start, all the training examples are at the root • Attributes are categorical (if continuous-valued, they can be discretized in advance) • Examples are partitioned recursively based on selected attributes. • Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain) • Conditions for stopping partitioning • All samples for a given node belong to the same class • There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf • There are no samples left • Pre-pruning/post-pruning

How To Split Records • Random Split • The tree can grow huge • These trees are hard to understand. • Larger trees are typically less accurate than smaller trees. • Principled Criterion • Selection of an attribute to test at each node - choosing the most useful attribute for classifying examples. • How? • Information gain • measures how well a given attribute separates the training examples according to their target classification • This measure is used to select among the candidate attributes at each step while growing the tree
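The information gain criterion described above can be sketched as follows. This is a minimal illustration, assuming entropy as the impurity measure and categorical attributes stored as dictionary keys; the function names and the toy weather-style rows are hypothetical and not taken from the slides.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting on `attribute`."""
    total = len(labels)
    # Group the class labels by the value the attribute takes in each row.
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    # Weighted average entropy of the partitions after the split.
    remainder = sum(len(part) / total * entropy(part) for part in partitions.values())
    return entropy(labels) - remainder

# Hypothetical toy data (an assumption for illustration only).
rows = [
    {"outlook": "sunny", "wind": "weak"},
    {"outlook": "sunny", "wind": "strong"},
    {"outlook": "overcast", "wind": "weak"},
    {"outlook": "rain", "wind": "weak"},
    {"outlook": "rain", "wind": "strong"},
]
labels = ["no", "no", "yes", "yes", "no"]

gains = {a: information_gain(rows, labels, a) for a in ("outlook", "wind")}
best_attribute = max(gains, key=gains.get)  # attribute chosen at this node
```

A greedy tree builder would call `information_gain` for every candidate attribute at each node and split on the one with the largest gain, then recurse on each partition.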

Advantages/Disadvantages of Decision Trees • Advantages: • Easy to understand (Doctors love them!) • Easy to generate rules • Disadvantages: • May suffer from overfitting. • Classifies by rectangular partitioning (so does not handle correlated features very well). • Can be quite large – pruning is necessary. • Does not handle streaming data easily

Overfitting (another view) • Learning a tree that classifies the training data perfectly may not lead to the tree with the best generalization to unseen data. • There may be noise in the training data that the tree is erroneously fitting. • The algorithm may be making poor decisions towards the leaves of the tree that are based on very little data and may not reflect reliable trends. [Figure: accuracy on training data and on test data, plotted against hypothesis complexity/size of the tree (number of nodes)]

Notes on Overfitting • Overfitting results in decision trees (models in general) that are more complex than necessary • Training error no longer provides a good estimate of how well the tree will perform on previously unseen records • Need new ways for estimating errors

Evaluation • Accuracy • Recall/Precision/F-measure
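The slide only names these measures, so the sketch below uses their standard definitions for a binary problem. The function names and the small label vectors are hypothetical, chosen purely for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of records whose predicted class equals the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F-measure with respect to the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F-measure: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical labels: 1 is the positive class.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
acc = accuracy(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(y_true, y_pred, positive=1)
```

Here there are 2 true positives, 1 false positive and 1 false negative, so accuracy is 6/8 while precision and recall are both 2/3.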

Bayes Classifiers • That was a visual intuition for a simple case of the Bayes classifier, also called: • Idiot Bayes • Naïve Bayes • Simple Bayes • We are about to see some of the mathematical formalisms, and more examples, but keep in mind the basic idea. • Find out the probability of the previously unseen instance belonging to each class, then simply pick the most probable class. • Go through all the examples on the slides and be ready to generate tables similar to the ones presented in class and the one you created for your HW assignment. • Smoothing

Bayesian Classifiers • Bayesian classifiers use Bayes theorem, which says p(cj | d) = p(d | cj) p(cj) / p(d) • p(cj | d) = probability of instance d being in class cj. This is what we are trying to compute • p(d | cj) = probability of generating instance d given class cj. We can imagine that being in class cj causes you to have feature d with some probability • p(cj) = probability of occurrence of class cj. This is just how frequent the class cj is in our database • p(d) = probability of instance d occurring. This can actually be ignored, since it is the same for all classes
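As a rough illustration of how these quantities combine, the sketch below trains a naïve Bayes model from count tables and applies add-one (Laplace) smoothing. The smoothing form, the function names, and the toy data are assumptions made for this example, not the tables used in class.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Build the count tables needed to estimate p(c) and p(value | c)."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (attribute, class) -> value -> count
    values_seen = defaultdict(set)       # attribute -> distinct training values
    for row, label in zip(rows, labels):
        for attribute, value in row.items():
            value_counts[(attribute, label)][value] += 1
            values_seen[attribute].add(value)
    return class_counts, value_counts, values_seen

def posterior(model, row, alpha=1.0):
    """Return p(c | row) for every class, using add-alpha smoothing."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / total  # prior p(c)
        for attribute, value in row.items():
            count = value_counts[(attribute, c)][value]
            k = len(values_seen[attribute])  # number of distinct values seen
            score *= (count + alpha) / (n_c + alpha * k)  # smoothed p(value | c)
        scores[c] = score
    evidence = sum(scores.values())  # plays the role of p(d); same for all classes
    return {c: s / evidence for c, s in scores.items()}

# Hypothetical toy data (an assumption for illustration only).
rows = [{"outlook": "sunny"}, {"outlook": "sunny"},
        {"outlook": "rain"}, {"outlook": "overcast"}]
labels = ["no", "no", "yes", "yes"]
model = train_naive_bayes(rows, labels)
post = posterior(model, {"outlook": "overcast"})
predicted = max(post, key=post.get)
```

Without smoothing, the class "no" would get probability zero because "overcast" never occurs with it in the training data; with add-one smoothing it keeps a small non-zero probability.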

Bayesian Classification • Statistical method for classification. • Supervised Learning Method. • Assumes an underlying probabilistic model, the Bayes theorem. • Can solve diagnostic and predictive problems. • Particularly suited when the dimensionality of the input is high • In spite of the over-simplified independence assumption, it often performs well in many complex real-world situations

Advantages/Disadvantages of Naïve Bayes • Advantages: • Fast to train (single scan). Fast to classify • Not sensitive to irrelevant features • Handles real and discrete data • Handles streaming data well • Disadvantages: • Assumes independence of features

Nearest-Neighbor Classifiers • Requires three things • The set of stored records • Distance metric to compute distance between records • The value of k, the number of nearest neighbors to retrieve • To classify an unknown record: • Compute distance to other training records • Identify k nearest neighbors • Use class labels of nearest neighbors to determine the class label of unknown record (e.g., by taking majority vote)
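The three ingredients and the classification steps above can be sketched as follows. This is a minimal version assuming Euclidean distance and an unweighted majority vote; the function names and sample points are hypothetical.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(train_points, train_labels, query, k=3, distance=euclidean):
    """Label `query` by majority vote among its k nearest stored records."""
    # Compute the distance to every stored record and keep the k closest.
    neighbors = sorted(zip(train_points, train_labels),
                       key=lambda pair: distance(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    # Ties are broken by whichever tied label was counted first.
    return votes.most_common(1)[0][0]

# Hypothetical stored records (an assumption for illustration only).
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = ["a", "a", "a", "b", "b", "b"]
near_a = knn_classify(points, labels, (2, 2), k=3)
near_b = knn_classify(points, labels, (7, 7), k=3)
```

Note that all the work happens at classification time: there is no training step beyond storing the records, which is why classification is slow for large training sets.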

Up to now we have assumed that the nearest neighbor algorithm uses the Euclidean distance; however, this need not be the case… [Figure panels: Max (p=inf), Manhattan (p=1), Weighted Euclidean, Mahalanobis]
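Several of the alternative distances named above belong to the Minkowski family, parameterized by p. The sketch below is illustrative only: the function names are hypothetical, and Mahalanobis distance is omitted because it needs the inverse covariance matrix of the data.

```python
import math

def minkowski(a, b, p):
    """Minkowski distance: p=1 gives Manhattan, p=2 gives Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def max_distance(a, b):
    """Limit of the Minkowski distance as p goes to infinity."""
    return max(abs(x - y) for x, y in zip(a, b))

def weighted_euclidean(a, b, weights):
    """Euclidean distance with a non-negative weight per attribute."""
    return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))

a, b = (0, 0), (3, 4)
manhattan = minkowski(a, b, 1)               # |3| + |4|
euclid = minkowski(a, b, 2)                  # sqrt(9 + 16)
chebyshev = max_distance(a, b)               # max(|3|, |4|)
first_only = weighted_euclidean(a, b, (1, 0))  # second attribute ignored
```

Setting a weight to zero removes an attribute from the comparison, which is one simple way to encode domain knowledge in the distance function.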

Strengths and Weaknesses • Strengths: • Simple to implement and use • Comprehensible – easy to explain prediction • Robust to noisy data by averaging k-nearest neighbors • Distance function can be tailored using domain knowledge • Can learn complex decision boundaries • Much more expressive than linear classifiers & decision trees • More on this later • Weaknesses: • Need a lot of space to store all examples • Takes much more time to classify a new example than with a parsimonious model (need to compare distance to all other examples) • Distance function must be designed carefully with domain knowledge

Perceptrons • The perceptron is a type of artificial neural network which can be seen as the simplest kind of feedforward neural network: a linear classifier • Introduced in the late 50s • Perceptron convergence theorem (Rosenblatt 1962): • Perceptron will learn to classify any linearly separable set of inputs. • Perceptron is a network: • single-layer • feed-forward: data only travels in one direction [Figure: XOR function (no linear separation)]

Perceptron: Artificial Neuron Model • Model the network as a graph with cells as nodes and synaptic connections as weighted edges from node i to node j, wji • The input value received by a neuron is calculated by summing the weighted input values from its input links • A threshold function is applied to this sum [Figure: threshold function and vector notation]

Examples (step activation function) w0 – t
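To make the step-activation perceptron concrete, here is a small sketch using the classic perceptron learning rule. It assumes the threshold is folded into a bias weight w0 and that the unit outputs 1 when the weighted sum is at least zero; the function names and the training settings are hypothetical.

```python
def step(z):
    """Step activation: fire (1) when the weighted sum reaches zero."""
    return 1 if z >= 0 else 0

def predict(weights, x):
    """weights[0] is the bias w0; the remaining weights pair with the inputs."""
    return step(weights[0] + sum(w * xi for w, xi in zip(weights[1:], x)))

def train_perceptron(samples, targets, learning_rate=1.0, max_epochs=100):
    """Return (weights, converged) after applying the perceptron rule."""
    weights = [0.0] * (len(samples[0]) + 1)
    for _ in range(max_epochs):
        errors = 0
        for x, t in zip(samples, targets):
            delta = t - predict(weights, x)
            if delta != 0:
                errors += 1
                weights[0] += learning_rate * delta
                for i, xi in enumerate(x):
                    weights[i + 1] += learning_rate * delta * xi
        if errors == 0:  # a full pass with no mistakes: training is done
            return weights, True
    return weights, False

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_weights, and_converged = train_perceptron(inputs, [0, 0, 0, 1])
xor_weights, xor_converged = train_perceptron(inputs, [0, 1, 1, 0])
```

AND is linearly separable, so training stops with an error-free pass. XOR is not, so no weight vector classifies all four inputs and the loop runs out of epochs.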

Summary of Neural Networks • When are Neural Networks useful? • Instances represented by attribute-value pairs • Particularly when attributes are real valued • The target function is • Discrete-valued • Real-valued • Vector-valued • Training examples may contain errors • Fast evaluation times are necessary • When not? • Fast training times are necessary • Understandability of the function is required

Types of Clusterings • A clustering is a set of clusters • Important distinction between hierarchical and partitional sets of clusters • Partitional Clustering • A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset • Hierarchical clustering • A set of nested clusters organized as a hierarchical tree

K-means Clustering • Partitional clustering approach • Each cluster is associated with a centroid (center point) • Each point is assigned to the cluster with the closest centroid • Number of clusters, K, must be specified • The basic algorithm is very simple • K-means tutorial available from http://maya.cs.depaul.edu/~classes/ect584/WEKA/k-means.html

K-means Clustering • Ask user how many clusters they’d like. (e.g. k=3) • Randomly guess k cluster Center locations • Each datapoint finds out which Center it’s closest to. • Each Center finds the centroid of the points it owns… • …and jumps there • …Repeat until terminated!
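The loop above can be sketched in a few lines. This is a minimal version, assuming Euclidean distance, initial centers passed in by the caller, and termination when the centers stop moving; the function names and sample points are hypothetical.

```python
import math
import random

def assign(points, centers):
    """For each point, the index of the closest center."""
    return [min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points]

def kmeans(points, centers, max_iters=100):
    """Return (final centers, owner index per point)."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iters):
        owners = assign(points, centers)
        new_centers = []
        for j, old in enumerate(centers):
            owned = [p for p, o in zip(points, owners) if o == j]
            if owned:
                # Jump to the centroid (coordinate-wise mean) of the owned points.
                new_centers.append(tuple(sum(c) / len(owned) for c in zip(*owned)))
            else:
                new_centers.append(old)  # a center that owns nothing stays put
        if new_centers == centers:  # no center moved: terminate
            break
        centers = new_centers
    return centers, assign(points, centers)

# Hypothetical data (an assumption for illustration only).
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
random_start = random.Random(0).sample(points, 2)  # "randomly guess" k=2 centers
random_centers, random_owners = kmeans(points, random_start)
fixed_centers, fixed_owners = kmeans(points, [(0, 0), (10, 10)])
```

Because the result depends on the initial guess, practical use often runs the algorithm from several random starts and keeps the clustering with the lowest total squared distance.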

K-means Clustering: Step 1 [Figures: scatter plots of the data with the three cluster centers k1, k2, k3, shown over successive steps]

Strengths of Hierarchical Clustering • Do not have to assume any particular number of clusters • Any desired number of clusters can be obtained by ‘cutting’ the dendrogram at the proper level • They may correspond to meaningful taxonomies • Example in biological sciences (e.g., animal kingdom, phylogeny reconstruction, …)

Hierarchical Clustering • Two main types of hierarchical clustering • Agglomerative: • Start with the points as individual clusters • At each step, merge the closest pair of clusters until only one cluster (or k clusters) is left • Divisive: • Start with one, all-inclusive cluster • At each step, split a cluster until each cluster contains a single point (or there are k clusters) • Agglomerative is most common

How to Define Inter-Cluster Similarity • MIN • MAX • Group Average • Distance Between Centroids • Other methods driven by an objective function • Ward’s Method uses squared error [Figure: proximity matrix over points p1, p2, p3, p4, p5, ...]
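The sketch below shows a naive agglomerative procedure with three of the inter-cluster measures listed above (MIN, MAX, group average). It assumes Euclidean distance between points and stops when k clusters remain; the function names and sample points are hypothetical, and no attempt is made at efficiency.

```python
import math
from itertools import combinations

def linkage_distance(a, b, method):
    """Distance between clusters a and b under the chosen linkage."""
    pair_distances = [math.dist(p, q) for p in a for q in b]
    if method == "min":      # single link: closest pair
        return min(pair_distances)
    if method == "max":      # complete link: farthest pair
        return max(pair_distances)
    if method == "average":  # group average: mean over all pairs
        return sum(pair_distances) / len(pair_distances)
    raise ValueError(f"unknown linkage: {method}")

def agglomerative(points, k, method="min"):
    """Merge the closest pair of clusters until only k clusters are left."""
    clusters = [[p] for p in points]  # start with every point on its own
    while len(clusters) > k:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: linkage_distance(clusters[ij[0]],
                                                   clusters[ij[1]], method))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because i < j
    return clusters

# Hypothetical data (an assumption for illustration only).
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
two_clusters = agglomerative(points, k=2, method="min")
```

Recording the sequence of merges instead of stopping at k would give the full hierarchy that a dendrogram displays.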

DBSCAN • DBSCAN is a density-based algorithm. • Density = number of points within a specified radius (Eps) • A point is a core point if it has more than a specified number of points (MinPts) within Eps • These are points that are at the interior of a cluster • A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point • A noise point is any point that is not a core point or a border point.
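The core/border/noise definitions above can be checked directly with a brute-force pass over the data. This sketch follows the slide's wording (a core point has more than MinPts points within Eps) and assumes the point itself counts as part of its own neighborhood; the function name and sample points are hypothetical.

```python
import math

def label_points(points, eps, min_pts):
    """Label every point as 'core', 'border' or 'noise'."""
    # Neighborhood of each point: indices of all points within eps (itself included).
    neighborhoods = [[j for j, q in enumerate(points) if math.dist(p, q) <= eps]
                     for p in points]
    core = {i for i, nbrs in enumerate(neighborhoods) if len(nbrs) > min_pts}
    labels = []
    for i, nbrs in enumerate(neighborhoods):
        if i in core:
            labels.append("core")
        elif any(j in core for j in nbrs):
            labels.append("border")  # not dense itself, but near a core point
        else:
            labels.append("noise")
    return labels

# Hypothetical data (an assumption for illustration only).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (2.4, 1), (10, 10)]
point_labels = label_points(points, eps=1.5, min_pts=3)
```

The four corner points each have four points within Eps and are core points, (2.4, 1) is only close to the core point (1, 1) and is a border point, and (10, 10) is noise.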

What Is Association Mining? • Association rule mining: • Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories. • Applications: • Market Basket analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.

Association Rule Mining • We are interested in rules that are • non-trivial (and possibly unexpected) • actionable • easily explainable

Support and Confidence • Find all the rules X --> Y with minimum confidence and support • Support = probability that a transaction contains {X, Y} • i.e., ratio of transactions in which X, Y occur together to all transactions in database. • Confidence = conditional probability that a transaction having X also contains Y • i.e., ratio of transactions in which X, Y occur together to those in which X occurs. • In general, the confidence of a rule LHS => RHS can be computed as the support of the whole itemset divided by the support of LHS: Confidence(LHS => RHS) = Support(LHS ∪ RHS) / Support(LHS) [Figure: Venn diagram of customers who buy diapers, customers who buy beer, and customers who buy both]
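The two ratios above translate almost directly into code. The five transactions below are an assumed market-basket table used only for illustration (they are not reproduced from the slides), and the function names are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(transactions, lhs, rhs):
    """Support of the whole itemset divided by the support of the LHS."""
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)

# Assumed example transactions (illustrative only).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
rule_support = support(transactions, {"Diaper", "Milk", "Beer"})
rule_confidence = confidence(transactions, {"Diaper", "Milk"}, {"Beer"})
```

In this table {Diaper, Milk} appears in three transactions and two of those also contain Beer, so the rule {Diaper, Milk} --> {Beer} has support 2/5 and confidence 2/3.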

Definition: Frequent Itemset • Itemset • A collection of one or more items • Example: {Milk, Bread, Diaper} • k-itemset • An itemset that contains k items • Support count (σ) • Frequency of occurrence of itemset • E.g. σ({Milk, Bread, Diaper}) = 2 • Support • Fraction of transactions that contain an itemset • E.g. s({Milk, Bread, Diaper}) = 2/5 • Frequent Itemset • An itemset whose support is greater than or equal to a minsup threshold

The Apriori algorithm • The best known algorithm • Two steps: • Find all itemsets that have minimum support (frequent itemsets, also called large itemsets). • Use frequent itemsets to generate rules. • E.g., a frequent itemset {Chicken, Clothes, Milk} [sup = 3/7] and one rule from the frequent itemset: Clothes --> Milk, Chicken [sup = 3/7, conf = 3/3] (Source: CS583, Bing Liu, UIC)
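The first of the two steps (finding frequent itemsets level by level) can be sketched as below. This is a simplified version that assumes transactions are Python sets and uses a plain join-and-prune candidate generation; the function names and the five example transactions are hypothetical, and rule generation (step two) is left out.

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return {frequent itemset (frozenset): support} for the given minsup."""
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    def keep_frequent(candidates):
        scored = {c: support(c) for c in candidates}
        return {c: s for c, s in scored.items() if s >= minsup}

    items = {item for t in transactions for item in t}
    level = keep_frequent(frozenset([item]) for item in items)  # frequent 1-itemsets
    frequent = {}
    k = 1
    while level:
        frequent.update(level)
        k += 1
        # Join: combine frequent (k-1)-itemsets into candidate k-itemsets.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune: drop candidates that have an infrequent (k-1)-subset.
        pruned = {c for c in joined
                  if all(frozenset(sub) in level for sub in combinations(c, k - 1))}
        level = keep_frequent(pruned)
    return frequent

# Assumed example transactions (illustrative only).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
frequent_itemsets = apriori(transactions, minsup=0.6)
```

The prune step relies on the Apriori property: every subset of a frequent itemset must itself be frequent, so a candidate with any infrequent subset never needs to be counted.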

Associations: Pros and Cons • Pros • can quickly mine patterns describing business/customers/etc. without major effort in problem formulation • virtual items allow much flexibility • unparalleled tool for hypothesis generation • Cons • unfocused • not clear exactly how to apply mined “knowledge” • only hypothesis generation • can produce many, many rules! • may only be a few nuggets among them (or none)

Association Rules • Association rule types: • Actionable Rules – contain high-quality, actionable information • Trivial Rules – information already well-known by those familiar with the business • Inexplicable Rules – no explanation and do not suggest action • Trivial and Inexplicable Rules occur most often
