Introduction to Biomedical Informatics Data Mining: Predictive Modeling

Presentation Transcript

  1. Introduction to Biomedical Informatics Data Mining: Predictive Modeling

  2. Outline. Predictive modeling (focus on classification): logistic regression, decision trees. Key concepts: overfitting and generalization, optimization algorithms for fitting models, model evaluation. Additional resources and recommended reading. Discussion of project proposals.
  3. Sample Data Set. Notation: columns may be called “measurements”, “variables”, “features”, “attributes”, “fields”, etc.; rows may be individuals, entities, objects, samples, etc.
  4. Prediction: build a model that can predict this variable given the values of all the other variables.
  5. Notation for Prediction. $y$ is the “target variable”, the variable whose value we wish to predict; $x$ is a vector of input variables used to predict $y$. Mathematically, we wish to build models $y = f(x, \theta)$, where $f$ is the functional form of the model and $\theta$ are the parameters of the model, e.g., $y = a x_1 + b x_2 + c$ with $\theta = \{a, b, c\}$. Regression: $y$ is real-valued. Classification: $y$ is categorical, e.g., $y \in \{\text{has disease}, \text{does not}\}$.
  6. Training and Test Data. Training data: use this data to fit the parameters of your model. Test data: use this data to get a true estimate of how well the model will perform in practice.
  7. Basic Components of Predictive Modeling. Training data: a set of x’s and y’s; the goal is to learn a model that predicts y given x. Model: a functional form f(x) that we will use to predict y; the function has parameters that we can vary, and it could be a simple linear function or a complicated nonlinear function. Objective or error function: how well does a particular model fit the data? Optimization algorithm: an algorithm that finds the parameters that minimize the error on the training data.
  8. Simple Example: Fitting a Line. Training data: a set of x’s and y’s. Model: $f(x) = a x + b$ (we have 2 parameters, $a$ and $b$). Squared error function: $E = \sum_i [y_i - f(x_i)]^2 = \sum_i [y_i - a x_i - b]^2$. Important: note that $E$ is a function of $a$ and $b$, i.e., it is really $E(a, b)$. Optimization (minimizing $E$): here we can solve directly for the optimal $a$ and $b$ (it is usually not this easy).
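The direct solution mentioned on this slide can be sketched in a few lines. This is a minimal illustration, assuming the standard closed-form least-squares formulas for a single input; the function names and the dosage/weight-gain numbers are hypothetical and are not taken from the lecture.

```python
def fit_line(xs, ys):
    """Return (a, b) minimizing E(a, b) = sum((y - a*x - b)**2)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def squared_error(xs, ys, a, b):
    """The squared error E(a, b) on the given data."""
    return sum((y - a * x - b) ** 2 for x, y in zip(xs, ys))


# Hypothetical drug-dosage and weight-gain values, for illustration only.
dosage = [1.0, 2.0, 3.0, 4.0]
weight_gain = [3.1, 4.9, 7.2, 8.8]
a, b = fit_line(dosage, weight_gain)
```

Any other choice of slope or intercept gives a larger `squared_error` on the same data, which is what "optimal a and b" means here.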
  9. Simple Example: Fitting a Line. [Figure: scatter plot with axes Weight Gain and Drug Dosage]
  10. Simple Example: Fitting a Line. The a and b for this line will have a high error value E. [Figure: axes Weight Gain and Drug Dosage]
  11. Simple Example: Fitting a Line. This is a much better fit, with a lower E value. [Figure: axes Weight Gain and Drug Dosage]
  12. [Diagram: inputs $x$ feed the model $f(x \mid a)$, which produces predictions that are compared with the targets $y$.] $E = \text{distance}(\text{predictions}, \text{targets})$, e.g., $E = \sum (y - f(x \mid a))^2$. Learning: minimizing $E$ as a function of $a$.
  13. 1: DATA (inputs $x$, targets $y$). 2: MODEL $f(x \mid a)$, which produces predictions. 3: OBJECTIVE FUNCTION $E = \text{distance}(\text{predictions}, \text{targets})$, e.g., $E = \sum (y - f(x \mid a))^2$. 4: OPTIMIZATION. Learning: minimizing $E$ as a function of $a$.
  14. The Importance of Optimization. Typical optimization problem in machine learning: minimize $E(a) = \sum_i d(y_i, f(x_i \mid a))$, where $d$ is some distance function like squared error. As a function of $a$, this is a $p$-dimensional optimization problem. There is usually no direct solution; many iterative gradient-based techniques exist.
  15. Iterative Optimization. For many optimization problems the solution (the location of the point in parameter space with minimum error) cannot be calculated directly. A common approach is iterative “local search” optimization: start in some random location; at each iteration, compute what direction is “downhill” from the current point (e.g., using the gradient of the function if we can compute it) and move a step in that direction; repeat until there is no downhill direction from the current location. This simple heuristic search method can be very effective. It is often referred to as “hill climbing” (when maximizing) or “gradient descent” (when minimizing).
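The local-search loop described on this slide can be sketched as follows. This is a minimal one-parameter illustration, not the lecture's own code: the step size, tolerance, fixed starting point (used instead of a random one so the run is repeatable), and the example error function $E(a) = (a - 3)^2$ are all assumptions.

```python
def gradient_descent(grad, start, step=0.1, tol=1e-8, max_iter=10_000):
    """Follow the negative gradient until it (almost) vanishes."""
    a = start
    for _ in range(max_iter):
        g = grad(a)
        if abs(g) < tol:      # no downhill direction left: stop
            break
        a -= step * g         # move a small step downhill
    return a


# Hypothetical convex error function E(a) = (a - 3)**2, gradient 2*(a - 3).
a_min = gradient_descent(lambda a: 2.0 * (a - 3.0), start=-5.0)
```

On a convex function like this one the loop ends at the single global minimum (here near a = 3); on a non-convex function the same loop can stop in a local minimum that depends on the starting point.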
  16. Optimization. [Figure: $E(a)$ plotted against $a$; easy (convex) case]
  17. Optimization. [Figure: $E(a)$ plotted against $a$ in two panels: easy (convex) and hard (non-convex)]
  19. Classification Problems. Training data: e.g., {x, y} pairs, where y = binary target 0/1 and x = input vector. Prediction model: the goal is to learn a function that can classify future x’s as either y=0 or y=1. We are often interested in modeling p(y=1|x), so that we have an idea of how likely it is that y=1 given the values x.
  20. Examples of Classification Problems
  21. Examples of Classifiers. Naïve Bayes: simple, but often effective in high dimensions. Logistic regression: simple, linear in “odds” space, widely used in industry. Neural network: non-linear extension of logistic regression, can be difficult to work with. Support vector machines: generalization of linear discriminants, can be quite effective, computational complexity can be an issue. k-nearest neighbor: simple, can scale poorly in high dimensions. Decision trees: often effective in high dimensions, but biased.
  22. Nearest Neighbor Classifiers. kNN: select the k nearest neighbors to x from the training data and select the majority class from these neighbors. k is a parameter: small k gives “noisier” estimates, large k gives “smoother” estimates; the best value of k is often chosen by cross-validation. Comments: virtually assumption free; gives piecewise linear boundaries (i.e., non-linear overall). Disadvantages: can scale poorly with the number of variables and is sensitive to the distance metric; requires fast lookup at run-time to do classification with large n; does not provide any interpretable “model”.
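The kNN rule above is short enough to write out directly. This is a minimal sketch that assumes Euclidean distance (the slide does not fix a metric) and a brute-force scan of the training data; the function name and the toy points are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Return the majority class among the k training points nearest to query."""
    # Rank training points by Euclidean distance to the query point.
    order = sorted(range(len(train_x)),
                   key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-dimensional training data with two classes.
train_x = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = [0, 0, 0, 1, 1, 1]
label = knn_predict(train_x, train_y, (0.5, 0.5), k=3)
```

The full sort over the training set at prediction time is the "fast lookup" cost the slide refers to; with large n one would normally replace it with an index structure.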
  23. Logistic Regression. Target variable y is binary, e.g., does this person have disease Y or not? Candidate model: $f(x \mid a) = a_0 + a_1 x_1 + a_2 x_2 + \dots + a_d x_d$. Problem: this can give predictions that are negative, > 1, etc.
  24. Logistic Regression. Target variable y is binary, e.g., does this person have disease Y or not? Candidate model: $f(x \mid a) = a_0 + a_1 x_1 + a_2 x_2 + \dots + a_d x_d$. Problem: this can give predictions that are negative, > 1, etc. Better approach: $f(x) = g(a_0 + a_1 x_1 + a_2 x_2 + \dots + a_d x_d) = g(\sum_j a_j x_j)$ (taking $x_0 = 1$), where $g(z) = 1 / [1 + e^{-z}]$. This is known as a logistic regression model.
  25. Another View of Logistic Regression. $f(x) = 1 / [1 + \exp(-\sum_j a_j x_j)]$. What happens as $\sum_j a_j x_j$ goes to + infinity? What happens as $\sum_j a_j x_j$ goes to - infinity? What happens when $\sum_j a_j x_j = 0$?
  26. Another View of Logistic Regression. $f(x) = 1 / [1 + \exp(-\sum_j a_j x_j)]$. What happens as $\sum_j a_j x_j$ goes to + infinity? What happens as $\sum_j a_j x_j$ goes to - infinity? What happens when $\sum_j a_j x_j = 0$? As the weighted sum $\sum_j a_j x_j$ varies from negative to positive values, the function $f(x)$, our model for p(y=1|x), varies between 0 and 1.
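The three questions on this slide can be checked numerically. The sketch below assumes only the definition $g(z) = 1/[1 + e^{-z}]$; the function name and the specific values of z (moderate ones, chosen to avoid floating-point overflow) are illustrative.

```python
import math


def logistic(z):
    """g(z) = 1 / (1 + exp(-z)), the model for p(y=1|x) with z = sum_j a_j x_j."""
    return 1.0 / (1.0 + math.exp(-z))


at_zero = logistic(0.0)          # weighted sum equal to 0
large_pos = logistic(20.0)       # weighted sum large and positive
large_neg = logistic(-20.0)      # weighted sum large and negative
```

Here `at_zero` is exactly 0.5, `large_pos` is just below 1, and `large_neg` is just above 0, so the output always stays strictly between 0 and 1.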
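  27. More on Logistic Regression. $p(y=1 \mid x) = f(x) = 1 / [1 + \exp(-\sum_j a_j x_j)]$. We can rewrite this as $\log [\, p(y=1 \mid x) / p(y=0 \mid x) \,] = \sum_j a_j x_j$, where the left-hand side is the log-odds and the right-hand side is the weighted sum.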
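The rewriting step on this slide follows from a short calculation, worked out here for reference. The shorthand $z = \sum_j a_j x_j$ is introduced only for this derivation.

```latex
\begin{align*}
p(y=1 \mid x) &= \frac{1}{1 + e^{-z}}, \qquad z = \sum_j a_j x_j \\
p(y=0 \mid x) &= 1 - \frac{1}{1 + e^{-z}} = \frac{e^{-z}}{1 + e^{-z}} \\
\frac{p(y=1 \mid x)}{p(y=0 \mid x)} &= \frac{1}{e^{-z}} = e^{z} \\
\log \frac{p(y=1 \mid x)}{p(y=0 \mid x)} &= z = \sum_j a_j x_j
\end{align*}
```

This is the sense in which logistic regression is linear in log-odds space: each weight $a_j$ is the change in log-odds per unit change in $x_j$.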
  28. Objective Function for Logistic Regression. Objective function = log probability of the training data. $P(\text{training data}) = \prod_i p(y_i \mid x_i)$, where $p(y_i \mid x_i) = f(x_i)$ if $y_i = 1$ and $1 - f(x_i)$ if $y_i = 0$, so $\log P(\text{training data}) = \sum_i \log p(y_i \mid x_i) = \sum_i [\, y_i \log f(x_i) + (1 - y_i) \log(1 - f(x_i)) \,]$ (the product/sum is over the x’s in the training data). Why is this useful? We seek the model (weights) that gives the highest probability to the observed data. In statistics this is known as maximum likelihood estimation, which has a number of good properties.
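  29. Learning/Optimization for Logistic Regression. The negative log probability, $-\log P$, for the logistic regression model is a convex function of the weights $a$ (equivalently, $\log P$ is concave), so there are no local optima other than the global one, which makes optimization easy. An iterative algorithm called IRLS (iteratively reweighted least squares) is often used in practice. Start with random weights, move downhill on $-\log P$, repeat until convergence. [Figure: $-\log P$ plotted against $a$]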
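As a rough illustration of fitting the weights by maximum likelihood, the sketch below uses plain gradient ascent on the log probability rather than IRLS, which the slide names as the usual practical choice. The data, step size, and iteration count are hypothetical, the first component of each input is a constant 1 acting as the intercept, and the weights start at zero (not at random values) so the run is repeatable.

```python
import math


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(weights, xs, ys):
    """Sum over the training data of log p(y | x) under the logistic model."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = logistic(sum(w * xj for w, xj in zip(weights, x)))
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total


def fit_logistic(xs, ys, step=0.1, iters=500):
    """Gradient ascent on the log-likelihood (a simple stand-in for IRLS)."""
    weights = [0.0] * len(xs[0])
    for _ in range(iters):
        grad = [0.0] * len(weights)
        for x, y in zip(xs, ys):
            p = logistic(sum(w * xj for w, xj in zip(weights, x)))
            for j, xj in enumerate(x):
                grad[j] += (y - p) * xj   # d logP / d a_j = sum (y - p) * x_j
        weights = [w + step * g for w, g in zip(weights, grad)]
    return weights


# Hypothetical, non-separable data: (intercept term, one input variable).
xs = [(1.0, 0.5), (1.0, 1.0), (1.0, 1.5), (1.0, 2.0), (1.0, 2.5), (1.0, 3.0)]
ys = [0, 0, 1, 0, 1, 1]
weights = fit_logistic(xs, ys)
```

Because the objective has a single optimum, the starting point does not affect where the weights end up, only how many iterations are needed. The data are deliberately not perfectly separable; with separable data the maximum-likelihood weights grow without bound.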
  30. Two Applications of Logistic Regression
  31. EXAMPLE: Recommending friends on Facebook. Sources: L. Backstrom, invited keynote talk at ESWC 2012 (video and slides available online); L. Backstrom and J. Leskovec, Supervised Random Walks: Predicting and Recommending Links in Social Networks, ACM Conference on Web Search and Data Mining (WSDM), 2011.
  32. Learning to Suggest Friends on Facebook. Problem: automatically suggest friends. Restrict to “friends of friends”; this still leaves 40,000 possibilities on average.
  33. Learning to Suggest Friends. Solution: learn a prediction model. Target: whether or not the user clicks on the recommendation. Features: mutual friends, age, geography, etc. Models: decision trees, logistic regression. Significant engineering: feature computation in real-time, plus real-time feedback.
  34. Learning to Suggest Friends Significant improvement in click-through rate when system went live
  35. EXAMPLE: Detecting Flu Outbreaks from Search Engine Data. Source: J. Ginsberg, M. Mohebbi, R. Patel, L. Brammer, M. Smolinski, L. Brilliant, Detecting influenza epidemics using search engine query data, Nature, February 2009.
  36. Detecting Flu Outbreaks. Problem: influenza epidemics cause 250k to 500k deaths per year (worldwide); new strains can emerge quickly and be very dangerous; a key factor in reducing risk is quick detection of an outbreak. How are flu outbreaks currently detected? The Centers for Disease Control and Prevention (CDC) gathers counts of “influenza-like illness” (ILI) physician visits; surveillance data are published nationally and regionally on a weekly basis. Problem: 1 to 2 week reporting lag. Other approaches: monitor over-the-counter flu medication sales; monitor calls to health advice lines.
  37. Using Search Queries. Idea: discover influenza-related search queries and use these to predict ILI counts. Motivation: should be much faster than the 1-2 week lag of CDC data. Data: look at all search queries in Google from 2003 to 2008 (several hundred billion individual searches in the United States); keep track of only the 50 million most common queries; keep a weekly count for each query; also keep counts of each query by geographic region (requires use of geo-location from IP addresses: >95% accurate). So there are counts for 50 million queries × 170 weeks × 9 regions.
  38. Building a Predictive Model. Target variable to be predicted, for each week and each region: $I(t)$ = percentage of physician visits that are ILI (as compiled by CDC). Input variable: $Q(t)$ = highest correlated queries / total number of queries that week. Logistic model: $\log( I(t) / [1 - I(t)] ) = a \log( Q(t) / [1 - Q(t)] ) + \text{noise}$.
  39. Addition of terms that generally correlate with flu season, but not with specific outbreaks, e.g., “high school basketball”
  40. Evaluating the Model. Model fit to weekly data between 2003 and 2007 (128 weeks). Predictions made on 42 weeks of unseen test data in 2007-2008, for 9 regions. Correlations per region of predicted ILI with actual ILI: mean = 0.97, min = 0.92, max = 0.99.
  41. Key point in these graphs is that the CDC data was lagging Google predictions by 1 to 2 weeks
  42. Decision Tree Classification Algorithms
  43. Decision Tree Classifiers. Widely used in practice. Can handle both real-valued and categorical variables (unusual). Useful with large numbers of variables. Popular in areas like medical diagnosis due to interpretability. Historically, developed both in statistics and computer science. Statistics: Breiman, Friedman, Olshen and Stone, CART, 1984. Computer science: Quinlan, ID3, C4.5 (1980s to 1990s).
  44. Decision Tree Example. [Figure: data plotted on axes Blood Pressure and Temperature]
  45. Decision Tree Example. [Figure: axes Blood Pressure and Temperature, with a first split at threshold t1 on Temperature (branch Temperature > t1)]
  46. Decision Tree Example. [Figure: axes Blood Pressure and Temperature, with splits at t1 on Temperature (Temperature > t1) and at t2 on Blood Pressure (Blood Pressure > t2)]
  47. Decision Tree Example. [Figure: axes Blood Pressure and Temperature, with splits at t1 on Temperature (Temperature > t1), at t2 on Blood Pressure (Blood Pressure > t2), and at t3 on Temperature (Temperature > t3)]
  48. Decision Tree Pseudocode:
      node = tree-design(Data = {X, C})
          for i = 1 to d
              quality_variable(i) = quality_score(X_i, C)
          end
          node = {X_split, Threshold} for max{quality_variable}
          {Data_right, Data_left} = split(Data, X_split, Threshold)
          if node is a leaf
              return(node)
          else
              node_right = tree-design(Data_right)
              node_left = tree-design(Data_left)
          end
      end
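The recursive tree-design procedure can be sketched in runnable form as below. This is one possible reading of the pseudocode, not the lecture's implementation: the quality score is assumed to be the weighted Gini impurity (lower is better), leaves are assumed to predict the majority class, `max_depth` is an added stopping rule, and all names and the toy data are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Return (score, variable index, threshold) with the lowest weighted Gini."""
    best = None
    for j in range(len(rows[0])):
        # Candidate thresholds: every observed value except the largest.
        for t in sorted({r[j] for r in rows})[:-1]:
            left = [y for r, y in zip(rows, labels) if r[j] <= t]
            right = [y for r, y in zip(rows, labels) if r[j] > t]
            score = len(left) * gini(left) + len(right) * gini(right)
            if best is None or score < best[0]:
                best = (score, j, t)
    return best


def build_tree(rows, labels, max_depth=3):
    """Recursively split the data; a leaf is just the majority class label."""
    can_split = max_depth > 0 and len(set(labels)) > 1
    split = best_split(rows, labels) if can_split else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    _, j, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[j] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[j] > t]
    return (j, t,
            build_tree([r for r, _ in left], [y for _, y in left], max_depth - 1),
            build_tree([r for r, _ in right], [y for _, y in right], max_depth - 1))


def predict(tree, row):
    """Walk from the root to a leaf, following the threshold tests."""
    while isinstance(tree, tuple):
        j, t, left, right = tree
        tree = left if row[j] <= t else right
    return tree


# Hypothetical data: two variables per row, two classes.
rows = [(1, 1), (2, 1), (1, 2), (5, 5), (6, 5), (5, 6)]
labels = [0, 0, 0, 1, 1, 1]
tree = build_tree(rows, labels)
```

Each internal node tests a single variable against a threshold, which is why the resulting decision boundaries are parallel to the axes.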
  49. How to Choose the Right-Sized Tree? [Figure: predictive error plotted against size of decision tree, with one curve for error on training data and one for error on test data; the ideal range for tree size is marked]
  50. Example with Real Data: Accuracy versus Tree Size From Tom Mitchell, Machine Learning, 1997
  51. Choosing a Good Tree for Prediction. General idea: grow a large tree; prune it back to create a family of subtrees (“weakest link” pruning); score the subtrees and pick the best one. Massive data sizes (e.g., n ~ 100k data points): use a training data set to fit a set of trees; use a validation data set to score the subtrees. Smaller data sizes (e.g., n ~ 1k or less): use cross-validation; use explicit penalty terms (e.g., Bayesian methods).
  52. Example: Spam Email Classification. Data set (from the UCI Machine Learning Archive): 4601 email messages from 1999, manually labelled as spam (40%) or non-spam (60%). 54 features: percentage of words matching a specific word/character (business, address, internet, free, george, !, $, etc.), plus average/longest/sum lengths of uninterrupted sequences of CAPS. Error rates (Hastie, Tibshirani, Friedman, 2001), with training on 3065 emails and testing on 1536 emails: decision tree = 8.7%; logistic regression = 7.6%; Naïve Bayes = 10% (typically).
  53. Data Mining Lectures Lectures 9/10: Classification Padhraic Smyth, UC Irvine
  54. Other Aspects of Classification Trees. Why use binary splits? Multiway splits can be used, but cause fragmentation. Linear combination splits? These can produce small improvements, but optimization is much more difficult (need weights and split point) and the trees are much less interpretable. Model instability: a small change in the data can lead to a completely different tree; model averaging techniques (like bagging) can be useful. Tree “bias”: poor at approximating non-axis-parallel boundaries. Producing rule sets from tree models (e.g., C5.0).
  55. Why Trees are Useful in Practice. Can handle high-dimensional data: builds a model using one dimension at a time. Can handle any type of input variables (categorical, real-valued, etc.); most other methods require data of a single type (e.g., only real-valued). Trees are (somewhat) interpretable: a domain expert can “read off” the tree’s logic.
  56. Limitations of Trees. High bias: for classification, piecewise linear boundaries parallel to the axes; for regression, piecewise constant surfaces. Trees do not scale well to massive data sets (e.g., N in millions): repeated (unpredictable) access of subsets of the data. High variance: trees can be “unstable” as a function of the sample, e.g., a small change in the data can give a completely different tree. This causes two problems: 1. high variance contributes to prediction error; 2. high variance reduces interpretability. Trees are good candidates for model combining: often used with boosting and bagging.
  57. Variants of Decision Trees: random forests, boosted trees, bagging.
  58. Ch. 15. Linear Classifiers: Which Hyperplane? There are lots of possible solutions for the weights. Some methods find a separating hyperplane, but not an optimal one (according to some criterion of expected goodness). The Support Vector Machine (SVM) finds a “maximum margin” solution: it maximizes the distance between the hyperplane and the “difficult points” close to the decision boundary. One intuition: if there are no points near the decision surface, then there are no very uncertain classification decisions.
  59. Sec. 15.1. Support Vector Machine (SVM). SVMs maximize the margin around the separating hyperplane (a.k.a. large margin classifiers). The decision function is fully specified by a subset of training samples, the support vectors. Solving SVMs is a quadratic programming optimization problem (convex). Seen by many as the most successful current text classification method (but other discriminative methods often perform very similarly). [Figure: support vectors on the margin; the hyperplane that maximizes the margin is contrasted with one that has a narrower margin]
  60. SVM Optimization Problem. SVM optimization is a quadratic programming optimization problem. Good news: convex function of the unknowns, unique optimum; a variety of well-known algorithms exist for finding this optimum. Bad news: quadratic programming in general scales as $O(n^3)$. In practice it takes $O(n^a)$, where $a \approx 1.6$ to $2$ (e.g., S. Chakrabarti, Mining the Web: Discovering Knowledge from Hypertext Data, Chapter 5, p. 166). Faster methods are also available, specialized for SVMs, e.g., the cutting plane method of Joachims, 2006.
  61. Overall Summary on Classification Algorithms. Simple models (can be effective): logistic regression, Naïve Bayes, k nearest-neighbors. Decision trees: good for high-dimensional problems with different data types. More sophisticated: support vector machines, boosting (e.g., boosting with Naïve Bayes or with decision stumps). Many tradeoffs in interpretability, score functions, etc.
  62. Model Averaging. We can average over parameters and models, e.g., a weighted linear combination of predictions from multiple models, $y = \sum_k w_k y_k$. Why? Any prediction from a point estimate of parameters or from a single model has only a small chance of being the best. Averaging makes our predictions more stable and less sensitive to random variations in a particular data set (good for less stable models like trees).
  63. Additional Reading. The Elements of Statistical Learning, T. Hastie, R. Tibshirani, and J. Friedman, Springer Verlag, 2009 (2nd ed.). Classification and Regression Trees, Breiman, Friedman, Olshen, and Stone, Wadsworth Press, 1984. SVMs: T. Joachims, Learning to Classify Text using Support Vector Machines, Kluwer, 2002.
  64. Model Selection
  65. Quote from G. E. P. Box: “All models are wrong, but some are useful.”
  66. Example: selecting the best subset of k predictors. Linear regression: find the best subset of k variables to put in the model. This is a generic problem when p is large (it arises with all types of models, not just linear regression). Now we have models with different complexity, e.g., $p$ models with a single variable, $p(p-1)/2$ models with 2 variables, etc., and $2^p$ possible models in total. We can think of the space of models as a lattice. Note that when we add or delete a variable, the optimal weights on the other variables will change in general: the best subset of k variables is not the same as the k individually best variables. Aside: what does “best” mean here? (We will return to this shortly.)
  67. Search Problem. How can we search over all $2^p$ possible models? Exhaustive search is clearly infeasible. Heuristic search is used to search over model space: forward search (greedy); backward search (greedy); generalizations (add or delete), thinking of these as operators in search space; branch and bound techniques. This type of variable selection problem is common to many data mining algorithms: an outer loop that searches over variable combinations, and an inner loop that evaluates each combination.
  68. Empirical Learning. Squared error score (as an example; we could use other scores): $E(\theta) = \sum_i [y(i) - f(x(i); \theta)]^2$, where $E(\theta)$ is defined on the training data $D$. We are really interested in finding the $f(x; \theta)$ that best predicts $y$ on future data. Empirical learning: minimize $E(\theta)$ on the training data $D_{\text{train}}$. If $D_{\text{train}}$ is large and the model is simple, we are assuming that the best $f$ on the training data is also the best predictor $f$ on future test data $D_{\text{test}}$.
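  69. Complexity versus Goodness of Fit. [Figure: training data plotted as y against x]
  70. Complexity versus Goodness of Fit. [Figure: two panels of y against x: the training data, and a fit labelled “Too simple?”]
  71. Complexity versus Goodness of Fit. [Figure: three panels of y against x: the training data, a fit labelled “Too simple?”, and a fit labelled “Too complex?”]
  72. Complexity versus Goodness of Fit. [Figure: four panels of y against x: the training data, and fits labelled “Too simple?”, “Too complex?”, and “About right?”]
  73. Complexity and Generalization. [Figure: error function (e.g., squared error) plotted against complexity (number of free parameters in the model), with curves $E_{\text{test}}(\theta)$ and $E_{\text{train}}(\theta)$; the optimal model complexity is marked]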
  74. Bias and Variance in Models. [Diagram: the true model, the set of all models we are considering, and the best model in our set]
  75. Bias and Variance in Models. [Diagram: the true model, the set of all models we are considering, and the best model in our set] The distance between the true model and our model is the bias.
  76. Bias and Variance in Models. [Diagram: the true model, the set of all models we are considering, and the best model in our set] The distance between the true model and our model is the bias. In practice, given different data sets of finite size, we could find other suboptimal models; this is model variance.
  77. What happens if we decrease the bias? [Diagram: the true model and a larger set of all models we are considering] The variance has now increased significantly, and there is a much greater chance that we will overfit.
  78. Complexity and Generalization. [Figure: error function (e.g., squared error) plotted against complexity, with curves $E_{\text{test}}(\theta)$ and $E_{\text{train}}(\theta)$; one end of the complexity axis is labelled high bias, low variance and the other low bias, high variance]
  79. Complexity versus Goodness of Fit. [Figure: four panels of y against x: the training data, and fits labelled “Too simple?”, “Too complex?”, and “About right?”]
  80. Defining what “best” means. How do we measure “best” empirically? Best performance on the training data? Then k = p will be best (i.e., use all variables), e.g., p = 10,000, so this is not useful in general: performance on the training data will in general be optimistic. Practical alternatives: measure performance on a single validation set; measure performance using multiple validation sets (cross-validation); add a penalty term to the score function that “corrects” for optimism, e.g., “regularized” regression: $\text{SSE} + \lambda \sum_j w_j^2$ (SSE plus $\lambda$ times the sum of the squared weights).
  81. Cross-Validation Methodology. If you have a relatively small data set, dividing the data into train, validation, and test sets may be inefficient and noisy. An alternative is “V-fold cross-validation”, e.g., V = 10: partition the data into 10 disjoint subsets, with points assigned randomly to each subset; train the model 10 times, each time leaving out one of the subsets; for each trained model, evaluate accuracy on the left-out subset; cross-validated accuracy = average accuracy on the 10 left-out subsets. In effect this simulates the train/test idea, but does it multiple times to combat the effects of noise in the training and test sets. Widely used in practice. Computationally expensive.
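The V-fold procedure can be sketched as below. The `train_fn`/`predict_fn` interface, the fixed random seed, and the trivial majority-class model used to exercise the function are all assumptions made for this illustration.

```python
import random
from collections import Counter


def cross_validate(xs, ys, train_fn, predict_fn, v=10, seed=0):
    """Return the average accuracy over v left-out folds (needs len(xs) >= v)."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)          # random assignment to folds
    folds = [indices[i::v] for i in range(v)]     # v disjoint subsets
    accuracies = []
    for held_out in folds:
        held = set(held_out)
        train_idx = [i for i in indices if i not in held]
        model = train_fn([xs[i] for i in train_idx],
                         [ys[i] for i in train_idx])
        correct = sum(predict_fn(model, xs[i]) == ys[i] for i in held_out)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / v


# Hypothetical baseline model: always predict the majority training class.
def train_majority(train_x, train_y):
    return Counter(train_y).most_common(1)[0][0]


def predict_majority(model, x):
    return model


xs = list(range(20))
ys = [0] * 15 + [1] * 5
cv_accuracy = cross_validate(xs, ys, train_majority, predict_majority, v=10)
```

The model is retrained v times, which is where the computational expense noted on the slide comes from. Any classifier can be plugged in through the two function arguments.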
  82. Evaluating Predictive Models
  83. Evaluating Classifiers. Evaluate on independent test data (as with regression). Measures of performance on test data: classification accuracy (or error), or a cost function if the “costs” of errors are not symmetric. Confusion matrices: a K x K matrix where entry (i, j) contains the number of test examples that were predicted to be class i and truly belonged to class j; diagonal elements = examples classified correctly; off-diagonal elements = misclassified examples; useful with more than 2 classes for figuring out which classes are most “confused”. Log-probability score on test data: useful if we want to measure how good (well-calibrated) the p(c|x) estimates are. Ranking performance: how well does a classifier rank new examples? Receiver-operating characteristics; lift curves.
  84. Imbalanced Class Distributions. It is common in data mining to have one class be much less likely than the others, e.g., 0.1% of examples are fraudulent or have a disease. If we train a standard classifier on a random sample of data it is very difficult to beat the “majority classifier” in terms of accuracy. Approaches: (1) Stratified sampling: artificially create training data with 50% of each class being present, and then “correct” for this in prediction, e.g., learn p(x|c) on stratified data and use the true p(c) when predicting with a probabilistic model. (2) Use a different score function: we are often interested in scoring/screening/ranking cases when using the model, so scores such as “how many of the class of interest are ranked in the top 1% of predictions” may be more relevant than overall accuracy (e.g., in document retrieval).
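A confusion matrix with the slide's convention (row = predicted class, column = true class) can be built as in this minimal sketch. The class names and the example predictions are hypothetical.

```python
def confusion_matrix(predicted, actual, classes):
    """matrix[i][j] = number of examples predicted as class i that are truly class j."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for p, a in zip(predicted, actual):
        matrix[index[p]][index[a]] += 1
    return matrix


def accuracy(matrix):
    """Fraction of examples on the diagonal, i.e., classified correctly."""
    total = sum(sum(row) for row in matrix)
    return sum(matrix[i][i] for i in range(len(matrix))) / total


# Hypothetical three-class test results.
predicted = ["a", "b", "b", "c", "a", "c"]
actual = ["a", "b", "c", "c", "a", "a"]
matrix = confusion_matrix(predicted, actual, classes=["a", "b", "c"])
```

The off-diagonal entries show which pairs of classes are confused with each other, which a single accuracy number hides.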
  85. Ranking and Lift Curves. In many problems we are interested in ranking examples in terms of how likely they are to be in the “positive” class, e.g., credit scoring, fraud detection, medical screening, document retrieval. E.g., use the classifier to rank N test examples according to p(c|x) and then pick the top K, where K is much smaller than N. Lift curve: n = number of true positives that appear in the top K% of the ranked list; r = number of true positives that would appear if we ranked randomly; n/r is the “lift” provided by the classifier for the top K%. E.g., K = 10%, r = 200, n = 300 gives lift = 1.5, or a 50% increase in lift. Random ranking gives lift = 1, or a 0% increase in lift.
  86. Target variable = response/no-response from customers. Training and test sets each of size 250k. Standard model had 80 variables; variable selection reduced this to 7.
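The lift calculation can be sketched as follows. The function name, the rounding used to turn K% into a count, and the example scores are assumptions; labels are taken to be 1 for the positive class and 0 otherwise, with at least one positive present.

```python
def lift_at_top(scores, labels, fraction):
    """Lift n/r for the top `fraction` of examples ranked by score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top = max(1, round(fraction * len(ranked)))
    n = sum(label for _, label in ranked[:top])    # true positives in the top K%
    r = sum(labels) * top / len(labels)            # expected under random ranking
    return n / r


# Hypothetical classifier scores for 10 test examples, 3 of them positive.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
lift = lift_at_top(scores, labels, fraction=0.2)
```

With the slide's own numbers (n = 300, r = 200) the same ratio gives 300 / 200 = 1.5. Evaluating `lift_at_top` over a range of fractions traces out the lift curve.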
  87. Receiver Operating Characteristic (ROC) plots. Rank the N test examples by p(c|x), or by whatever real number our classifier produces that indicates likelihood of belonging to class 1. Let k = number of true class 1 examples and m = number of true class 0 examples, with k + m = N. For all possible thresholds t for this ranked list: count the number of true positives $k_t$ (true positive rate = $k_t / k$) and count the number of “false alarms” $m_t$ (false positive rate = $m_t / m$). ROC plot = plot of true positive rate $k_t / k$ versus false positive rate $m_t / m$.
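The threshold sweep described above can be sketched as below, together with a trapezoid-rule area under the curve. The sketch assumes no tied scores and at least one example of each class; the ranking used (N = 10, k = 6, m = 4) is hypothetical and is not the ranking from the lecture's example.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    k = sum(labels)                  # number of true class 1 examples
    m = len(labels) - k              # number of true class 0 examples
    points = [(0.0, 0.0)]
    true_pos = false_pos = 0
    for _, label in ranked:          # lower the threshold one example at a time
        if label == 1:
            true_pos += 1
        else:
            false_pos += 1
        points.append((false_pos / m, true_pos / k))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical ranking: labels listed from highest score to lowest.
labels = [1, 1, 0, 1, 1, 0, 1, 0, 1, 0]
scores = [1.0 - 0.1 * i for i in range(len(labels))]
points = roc_points(scores, labels)
area = auc(points)
```

The area equals the fraction of (class 1, class 0) pairs in which the class 1 example is ranked higher; here that is 17 of 24 pairs.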
  88. ROC Example N = 10 examples, k = 6 true class 1’s, m = 4 class 0’s The first column is an example of a ranking produced by a classifier
  89. ROC Plot. Area under the curve (AUC) is often used as a metric to summarize the ROC (examples are available online). The diagonal line corresponds to random ranking.
  90. Calibration In addition to ranking we may be interested in how accurate our estimates of p(c|x) are, i.e., if the model says p(c|x) = 0.9, how accurate is this number? Calibration: a model is well-calibrated if its probabilistic predictions match real-world empirical frequencies i.e., if a classifier predicts p(c|x) = 0.9 for 100 examples, then on average we would expect about 90 of these examples to belong to class c, and 10 not to. We can estimate calibration curves by binning a classifier’s probabilistic predictions, and measuring how many
  91. Example of Calibration in Probabilistic Prediction
  92. General Comments on Data Mining
  93. Myths and Legends in Data Mining. “Data analysis can be fully automated”: human judgement is critical in almost all applications; “semi-automation” is however very useful.
  94. Myths and Legends in Data Mining. “Data analysis can be fully automated”: human judgement is critical in almost all applications; “semi-automation” is however very useful. “With massive data sets you don’t need statistics”: massiveness brings heterogeneity and structure, which calls for even more statistics!
  95. Myths and Legends in Data Mining. “Data analysis can be fully automated”: human judgement is critical in almost all applications; “semi-automation” is however very useful. “With massive data sets you don’t need statistics”: massiveness brings heterogeneity and structure, which calls for even more statistics! “Data mining is a silver bullet for every open problem”: data mining has its strengths and weaknesses; main strength: leverages data effectively; main weakness: works independently of prior knowledge.
  96. Data Mining Projects in Practice. [Figure: bar chart of effort (%), on a scale from 0 to 60, for the project stages Business Objectives Determination, Data Preparation, Data Mining, and Analysis & Assimilation]
  97. What I did not mention: awkward questions. Should algorithms interface directly with relational databases? How to handle missing data? How do you define what the variables are? How do you validate data when you have 5000 different variables? How to match a model/algorithm to a given problem? How to update a predictive model over time? When is it ok to subsample to fit the data in main memory? When should one stop trying to improve the model? How can one explain the results to the customer? Does it really require a team of 6 PhDs to build each successful application?
  98. Conferences Related to Data Mining. ACM SIGKDD Conference, IEEE ICDM Conference: more focus on algorithms and applications than statistics. ICML, Machine Learning Conference: algorithm focused, fewer applications. ACM SIGMOD/VLDB Conferences: database view of data mining. NIPS: more statistical/probabilistic than the other conferences.
  99. Journals Related to Data Mining. Journal of Data Mining and Knowledge Discovery. Journal of Machine Learning Research (electronic journal). Machine Learning Journal. IEEE Transactions on Pattern Analysis and Machine Intelligence. Journal of Computational and Graphical Statistics.
  100. Web Resources on Data Mining. Knowledge Discovery Nuggets Web site: the most authoritative site on data mining; biweekly email newsletter with 10k subscribers on topics broadly related to data mining; extensive set of Web links to FAQs, publications, conferences, commercial companies, data sites, jobs, etc. UC Irvine Machine Learning and KDD Data Archives: benchmark data archives for research purposes.
  101. General References on Data Mining. Basic ideas: Programming Collective Intelligence, T. Segaran, O’Reilly Media, 2007. A bit more academic: Principles of Data Mining, Hand, Mannila, Smyth, MIT Press, 2001. More advanced: The Elements of Statistical Learning, Hastie, Tibshirani, Friedman, Springer Verlag, 2009 (2nd ed.).
  102. Software Packages. Commercial packages: SAS, Statistica, many others. Free packages: Weka. Programming environments: R, MATLAB (commercial), Python.
  103. Class Projects. Project proposal due Wednesday May 2nd. Final project report due Monday May 21st. We will discuss in more detail next week.
  105. (optional) Binary split selection criteria. Total error across the 2 branches: $Q(t) = N_1 Q_1(t) + N_2 Q_2(t)$, where $t$ is the threshold, $N_1$ and $N_2$ are the numbers of data points in the two branches, and $Q_1(t)$, $Q_2(t)$ are the errors in each branch. Select the threshold $t$ such that the total error is minimized. Examples of error criteria (2-class case for simplicity): let $p_1$ be the proportion of positive class data points for branch 1. Misclassification error: $Q_1(t) = \min\{1 - p_1, p_1\}$. Gini index: $Q_1(t) = p_1 (1 - p_1)$. Cross-entropy: $Q_1(t) = -[\, p_1 \log p_1 + (1 - p_1) \log (1 - p_1) \,]$. Cross-entropy and Gini work better in practice than direct minimization of classification error at each node.
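The threshold search for a single real-valued variable can be sketched as below. The three criteria are written so that lower is better for each: misclassification error is taken as the smaller of $p_1$ and $1 - p_1$, and cross-entropy carries a leading minus sign. The function names and the toy data are hypothetical, and labels are assumed to be 0/1.

```python
import math


def misclassification(p1):
    return min(p1, 1.0 - p1)


def gini(p1):
    return p1 * (1.0 - p1)


def cross_entropy(p1):
    if p1 in (0.0, 1.0):          # a pure branch has zero entropy
        return 0.0
    return -(p1 * math.log(p1) + (1.0 - p1) * math.log(1.0 - p1))


def total_error(values, labels, t, criterion):
    """Q(t) = N1*Q1(t) + N2*Q2(t) for the split value <= t versus value > t."""
    left = [y for v, y in zip(values, labels) if v <= t]
    right = [y for v, y in zip(values, labels) if v > t]
    total = 0.0
    for branch in (left, right):
        if branch:
            p1 = sum(branch) / len(branch)   # proportion of positive class
            total += len(branch) * criterion(p1)
    return total


def best_threshold(values, labels, criterion):
    """Try each observed value (except the largest) as the threshold."""
    candidates = sorted(set(values))[:-1]
    return min(candidates, key=lambda t: total_error(values, labels, t, criterion))


# Hypothetical one-variable data set with 0/1 labels.
values = [1, 2, 3, 4, 5, 6]
labels = [0, 0, 0, 1, 1, 1]
t_gini = best_threshold(values, labels, gini)
```

All three criteria are zero for a pure branch and largest at $p_1 = 0.5$; they differ in how strongly they reward splits that make one branch nearly pure.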
  106. (optional) Computational Complexity for a Binary Tree. At the root node, for each of the $p$ variables: sort all values and compute the quality for each split, which takes $O(pN \log N)$ time for real-valued or ordinal variables. Subsequent internal node operations each take $O(N' \log N')$. This assumes the data are in main memory. If the data are on disk, then repeated access of subsets at different nodes may be very slow (impossible to pre-index). Note: the time difference between retrieving data in RAM and data on disk may be $O(10^3)$ or more.
  107. (optional) Splitting on a nominal attribute. Nominal attribute with $m$ values, e.g., the name of a state or a city in marketing data. There are $2^{m-1}$ possible subsets, so exhaustive search is $O(2^{m-1})$. For small $m$, a simple approach is to branch on specific values, but for large $m$ this may not work well. Neat trick for the 2-class problem: for each predictor value calculate the proportion of class 1’s; order the $m$ values according to these proportions; now treat as an ordinal variable and select the best split (linear in $m$). This gives the optimal split for the Gini index, among all possible $2^{m-1}$ splits (Breiman et al., 1984).
  108. Multivariate Linear Regression. Task: predict real-valued $y$, given real-valued vector $x$. Objective function: least squares is often used, e.g., $E(\theta) = \sum_i [y(i) - f(x(i); \theta)]^2$, where $y(i)$ is the target value and $f(x(i); \theta)$ is the predicted value. Model structure: linear, $f(x; \theta) = a_0 + \sum_j a_j x_j$. Model parameters: $\theta = \{a_0, a_1, \dots, a_p\}$.
  109. (optional) Note that we can write $E(\theta) = \sum_i [y(i) - \sum_j a_j x_j(i)]^2 = \sum_i e_i^2 = e'e = (y - X\theta)'(y - X\theta)$, where $e = y - X\theta$, $\theta$ is the $(p+1) \times 1$ vector of parameter values, $y$ is the $N \times 1$ vector of target values, and $X$ is the $N \times (p+1)$ matrix of input values.
  110. (optional) $E(\theta) = \sum_i e_i^2 = e'e = (y - X\theta)'(y - X\theta) = y'y - \theta'X'y - y'X\theta + \theta'X'X\theta = y'y - 2\theta'X'y + \theta'X'X\theta$. Taking the derivative of $E(\theta)$ with respect to the components of $\theta$ gives $dE/d\theta = -2X'y + 2X'X\theta$. Set this to 0 to find the minimum of $E$ as a function of $\theta$.
  111. (optional) Set to 0 to find the minimum of $E$ as a function of $\theta$: $-2X'y + 2X'X\theta = 0$, so $X'X\theta = X'y$ (known in statistics as the Normal Equations). Letting $X'X = C$ and $X'y = b$, we have $C\theta = b$, i.e., a set of linear equations. We could solve this directly, e.g., by matrix inversion: $\theta = C^{-1}b = (X'X)^{-1}X'y$.
  112. (optional) Solving for the $\theta$'s. The problem is equivalent to inverting the $X'X$ matrix. The inverse does not exist if the matrix is not of full rank, e.g., if one column is a linear combination of the others (collinearity). Note that $X'X$ is closely related to the covariance of the X data, so we are in trouble if 2 or more variables are perfectly correlated. Numerical problems can also occur if variables are almost collinear. Solving for $\theta$ is equivalent to solving a system of p+1 linear equations. There are many good numerical methods for doing this, e.g., Gaussian elimination, LU decomposition, etc. These are numerically more stable than direct inversion. Alternative: gradient descent. Compute the gradient and move downhill; this is better than direct solutions for some problems (e.g., very sparse data).
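The gradient-descent alternative can be sketched for the squared-error objective, whose gradient is 2X'(X theta - y). This is a hedged illustration: the function name `gradient_descent`, the fixed step size, and the iteration count are hypothetical choices, and the step must be small enough for the data at hand or the iterates will diverge.

```python
from typing import List

Matrix = List[List[float]]


def gradient_descent(X: Matrix, y: List[float],
                     step: float = 0.01, iters: int = 5000) -> List[float]:
    """Minimize the sum of squared errors by fixed-step gradient descent."""
    n, d = len(X), len(X[0])
    theta = [0.0] * d
    for _ in range(iters):
        # residual_i = prediction_i - target_i
        residual = [sum(X[i][j] * theta[j] for j in range(d)) - y[i]
                    for i in range(n)]
        # gradient of the squared error: 2 X'(X theta - y)
        grad = [2.0 * sum(X[i][j] * residual[i] for i in range(n))
                for j in range(d)]
        # move downhill
        theta = [theta[j] - step * grad[j] for j in range(d)]
    return theta
```

Each iteration touches the data only through products with X, which is one reason this approach can suit very sparse data: no (p+1) x (p+1) matrix ever has to be formed or factorized.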
  113. Comments on Multivariate Linear Regression. The prediction model is a linear function of the parameters. Error function: quadratic in predictions and parameters. The derivative of the score is linear in the parameters, which leads to a linear algebra optimization problem, i.e., $C\theta = b$. Model structure is simple: a (p-1)-dimensional hyperplane in p dimensions. Linear weights => interpretability. Often useful as a baseline model, e.g., to compare more complex models to. Note: even if it’s the wrong model for the data (e.g., a poor fit), it can still be useful for prediction.
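  114. Model Evaluation. Let $\mathrm{MSE}_{\mathrm{test}}$ be the mean-square error of our learned predictor function, evaluated on test data. It is useful to report $\mathrm{MSE}_{\mathrm{test}} / \mathrm{MSE}_{\mathrm{baseline}}$, e.g., where $\mathrm{MSE}_{\mathrm{baseline}} = \frac{1}{N_{\mathrm{test}}} \sum_i [y(i) - \mu_y]^2$ (over the $N_{\mathrm{test}}$ test data points) and $\mu_y$ is the mean of the y values on the training data. Ideally we would like $\mathrm{MSE}_{\mathrm{test}} / \mathrm{MSE}_{\mathrm{baseline}}$ to be much less than 1. We can also plot histograms of individual errors: MSE might be dominated by outliers.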
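The ratio described on this slide is simple to compute. A minimal sketch, assuming plain Python lists of numbers; the helper names `mse` and `mse_ratio` are hypothetical.

```python
from typing import List


def mse(y_true: List[float], y_pred: List[float]) -> float:
    """Mean-square error between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mse_ratio(y_train: List[float], y_test: List[float],
              y_pred: List[float]) -> float:
    """MSE of the model on test data divided by the MSE of a constant baseline.

    The baseline always predicts the mean of the TRAINING targets, and both
    errors are evaluated on the test points.
    """
    mu = sum(y_train) / len(y_train)
    baseline = mse(y_test, [mu] * len(y_test))
    return mse(y_test, y_pred) / baseline
```

A ratio near 1 means the model does no better than predicting the training mean; a ratio well below 1 means it explains much of the variation in the test targets.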
  115. Non-linear (both model and parameters). We can generalize further to models that are nonlinear in all aspects: $f(x; \theta) = \alpha_0 + \sum_k \alpha_k g_k(\beta_{k0} + \sum_j \beta_{kj} x_j)$, where the g’s are non-linear functions with fixed functional forms. In machine learning this is called a neural network. In statistics this might be referred to as a generalized linear model or projection-pursuit regression. For almost any score function of interest, e.g., squared error, the score function is a non-linear function of the parameters. Closed-form (analytical) solutions are rare. Thus, we have a multivariate non-linear optimization problem (which may be quite difficult!).
  116. Other non-linear models. Splines: “patch” together different low-order polynomials over different parts of the x-space; these work well in 1 dimension, less well in higher dimensions. Memory-based models: $y' = \sum w(x', x)\, y$, where the y’s are from the training data and $w(x', x)$ is a function of the distance of x from x'. Local linear regression: $y' = \alpha_0 + \sum_j \alpha_j x_j$, where the $\alpha$’s are fit at prediction time just to the (y, x) pairs that are close to x'.
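A memory-based predictor of the weighted-sum form can be sketched in one dimension as follows. The slide does not specify the weight function, so the Gaussian kernel, the normalization of the weights to sum to 1, and the names `kernel_predict` and `bandwidth` are all assumptions made for illustration.

```python
import math
from typing import List


def kernel_predict(x_query: float, xs: List[float], ys: List[float],
                   bandwidth: float = 1.0) -> float:
    """Predict y at x_query as a distance-weighted average of training y's."""
    # Weight decays with the distance between the query and each training x.
    weights = [math.exp(-0.5 * ((x_query - x) / bandwidth) ** 2) for x in xs]
    total = sum(weights)  # may underflow to 0 if the query is far from all xs
    return sum(w * y for w, y in zip(weights, ys)) / total
```

No parameters are fit in advance: the training data themselves are the model, and all computation happens at prediction time. A small `bandwidth` makes the prediction follow the nearest training points; a large one averages over more of them.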
  117. Another Approach with Many Predictors: Regularization. Modified score function: $S_\lambda(\theta) = \sum_i [y(i) - f(x(i); \theta)]^2 + \lambda \sum_j \theta_j^2$. The second term is for “regularization”: when we minimize, it encourages keeping the $\theta_j$’s near 0. Bayesian interpretation: minimizing $-\log P(\mathrm{data} \mid \theta) - \log P(\theta)$. L1 regularization: $S_\lambda(\theta) = \sum_i [y(i) - f(x(i); \theta)]^2 + \lambda \sum_j |\theta_j|$ (basis of the popular “Lasso” method; see, e.g., Rob Tibshirani’s page on lasso methods).
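For the special case of a linear model (an assumption here; the score function on this slide allows any f), the L2-penalized score has a closed-form minimizer analogous to the normal equations. A sketch of the derivation in the same matrix notation as the least-squares slides, with every component of the parameter vector penalized:

```latex
\begin{align*}
S_\lambda(\theta) &= (y - X\theta)'(y - X\theta) + \lambda\,\theta'\theta \\
\frac{dS_\lambda}{d\theta} &= -2X'y + 2X'X\theta + 2\lambda\theta = 0 \\
(X'X + \lambda I)\,\theta &= X'y \\
\theta &= (X'X + \lambda I)^{-1} X'y
\end{align*}
```

For any $\lambda > 0$ the matrix $X'X + \lambda I$ is invertible even when the columns of X are collinear, which is one reason this penalty helps with many predictors. The L1 penalty has no comparable closed form in general and is minimized numerically.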
  118. Time-series prediction as regression. Measurements over time: $x_1, \ldots, x_t$. We want to predict $x_{t+1}$ given $x_1, \ldots, x_t$. Autoregressive model: $x_{t+1} = f(x_1, \ldots, x_t; \theta) = \sum_k \alpha_k x_{t-k}$, where the number of coefficients K is the memory of the model. We can take advantage of regression techniques in general to solve this problem (e.g., linear in parameters, score function = squared error, etc.). Generalizations: vector x; a non-linear function instead of a linear one; added terms for time-trend (linear, seasonal), for “jumps”, etc.
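Casting time-series prediction as regression amounts to building rows of lagged values. The sketch below constructs that regression data set for a memory of K and, for the simplest case K = 1 with no intercept, fits the single coefficient by least squares. The names `lagged_rows` and `fit_ar1` are hypothetical, and the no-intercept form is an assumption made to keep the example short.

```python
from typing import List, Tuple


def lagged_rows(series: List[float],
                K: int) -> Tuple[List[List[float]], List[float]]:
    """Build (X, y): each row is [x_t, x_{t-1}, ..., x_{t-K+1}], target x_{t+1}."""
    X, y = [], []
    for t in range(K - 1, len(series) - 1):
        X.append([series[t - k] for k in range(K)])
        y.append(series[t + 1])
    return X, y


def fit_ar1(series: List[float]) -> float:
    """Least-squares coefficient a for the model x_{t+1} = a * x_t."""
    num = sum(a * b for a, b in zip(series[:-1], series[1:]))
    den = sum(a * a for a in series[:-1])
    if den == 0.0:
        raise ValueError("series is all zeros; coefficient is undefined")
    return num / den
```

For larger K, the `(X, y)` pair returned by `lagged_rows` can be handed to any linear least-squares routine, exactly as in ordinary multivariate regression.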
  119. Other aspects of regression. Diagnostics: useful in low dimensions. Weighted regression: useful when rows have different weights. Different score functions: e.g., absolute error, or additive noise that varies as a function of x. Predicting y values constrained to a certain range, e.g., y > 0, or 0 < y < 1. Predicting binary y values: regression as a generalization of classification.
  120. Decision Trees are not stable. Moving just one example slightly may lead to quite different trees and a quite different partition of the space, i.e., a lack of stability against small perturbations of the data. Figure from Duda, Hart & Stork, Chap. 8.