
Data Mining


Presentation Transcript


  1. Data Mining Joyeeta Dutta-Moscato July 10, 2013

  2. Data Mining Wherever we have large amounts of data, we need systems capable of learning information from that data – predictions in medicine – text and web page classification – speech recognition. Learning the underlying patterns is useful to – predict the presence of a disease for future patients – describe the dependencies between diseases and symptoms. Data Mining focuses on the discovery of (previously) unknown properties in data, using techniques from Machine Learning.

  3. Data • 4 attributes / features: Outlook, Temperature, Humidity, Windy • Each attribute takes a small set of values (3, 3, 2, and 2 respectively) • 3 × 3 × 2 × 2 = 36 possible combinations • 14 combinations present in this example
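The 36-combination count can be checked by enumerating the Cartesian product of the value sets. The sketch below assumes the attribute values of the classic Witten & Frank weather dataset (sunny/overcast/rainy, hot/mild/cool, high/normal, true/false); the exact value names are an assumption, not stated on the slide.

    import itertools

    # Attribute values assumed from the classic Witten & Frank weather data
    outlook = ["sunny", "overcast", "rainy"]   # 3 values
    temperature = ["hot", "mild", "cool"]      # 3 values
    humidity = ["high", "normal"]              # 2 values
    windy = [True, False]                      # 2 values

    combos = list(itertools.product(outlook, temperature, humidity, windy))
    print(len(combos))  # 3 * 3 * 2 * 2 = 36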

  4. Data → Prediction A set of rules to predict whether we will get to play could look like this:
  If outlook = sunny and humidity = high then play = no
  If outlook = rainy and windy = true then play = no
  If outlook = overcast then play = yes
  If humidity = normal then play = yes
  If none of the above then play = yes
  → A decision list
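Such a decision list translates directly into ordered if-statements. This minimal sketch (an illustration, not code from the lecture) shows the defining property: rules are tried top to bottom, and the first match wins.

    def predict_play(outlook, humidity, windy):
        """Decision list: rules are tried top to bottom;
        the first matching rule determines the prediction."""
        if outlook == "sunny" and humidity == "high":
            return "no"
        if outlook == "rainy" and windy:
            return "no"
        if outlook == "overcast":
            return "yes"
        if humidity == "normal":
            return "yes"
        return "yes"  # default rule: if none of the above, play

    print(predict_play("sunny", "high", windy=False))   # -> no
    print(predict_play("rainy", "normal", windy=True))  # -> no (rule 2 fires before rule 4)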

  5. Decision Tree Learning F: <Outlook, Humidity, Wind, Temp> → Play Tennis? The goal is to create a model that predicts the value of a target variable based on several input variables.

  6. Decision Tree Learning Problem Setting: • Set of possible instances X • Each instance x in X is a feature vector x = <x1, x2, ..., xn> • Unknown target function f: X → Y • Y is discrete-valued • Set of function hypotheses H = { h | h : X → Y } • Each hypothesis h is a decision tree Input: • Training examples {<x(i), y(i)>} of unknown target function f Output: • Hypothesis h ∈ H that best approximates target function f
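To make the problem setting concrete, here is a minimal scikit-learn sketch (an illustration, not the lecture's code): a few integer-encoded weather instances serve as X, the play/no-play labels as Y, and the fitted decision tree plays the role of the hypothesis h.

    from sklearn.tree import DecisionTreeClassifier

    # Feature vectors <outlook, humidity, windy>, integer-encoded:
    # outlook: sunny=0, overcast=1, rainy=2; humidity: high=0, normal=1; windy: no=0, yes=1
    X = [[0, 0, 0],   # sunny, high, calm
         [0, 0, 1],   # sunny, high, windy
         [1, 0, 0],   # overcast, high, calm
         [2, 1, 0],   # rainy, normal, calm
         [2, 1, 1]]   # rainy, normal, windy
    y = ["no", "no", "yes", "yes", "no"]

    h = DecisionTreeClassifier().fit(X, y)  # choose h in H from the training examples
    print(h.predict([[1, 1, 0]]))           # apply the hypothesis to an unseen instance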

  7. Supervised Learning Given a set of training examples of the form {(x1, y1), …, (xn, yn)}, a learning algorithm seeks a function g : X → Y, where X is the input space and Y is the output space. Example: • Classify the universe of music into ‘like’ and ‘dislike’ for one person • Training set: a list of songs that the person heard and marked as ‘like’ or ‘dislike’ • Task: infer a function of features (of these songs) to predict which other songs the person will like
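The music example could look like the sketch below, where tempo, loudness, and duration are invented stand-in features and a 1-nearest-neighbor classifier stands in for g (any supervised learner would do here).

    from sklearn.neighbors import KNeighborsClassifier

    # Hypothetical song features: [tempo_bpm, loudness_dB, duration_min]
    songs = [[120, -5.0, 3.5], [90, -12.0, 4.1], [128, -4.2, 3.0], [70, -15.0, 5.2]]
    labels = ["like", "dislike", "like", "dislike"]  # marked by the listener

    g = KNeighborsClassifier(n_neighbors=1).fit(songs, labels)  # g : X -> Y
    print(g.predict([[125, -5.5, 3.2]]))  # prediction for a song the person hasn't heard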

  8. Supervised Learning Given a model family, we are interested in finding the best model parameters, such that the misfit (measured by an error function) between the data and the model is minimized. In the optimal scenario, the algorithm correctly determines the class labels for unseen instances.

  9. Supervised Learning Considerations: • The learning algorithm must generalize from the training data to unseen situations in a "reasonable" way: Avoid overfitting • Bias-variance tradeoff • Number of training examples versus model complexity

  10. Supervised Learning Common methods of supervised learning: • Regression: X discrete or continuous → Y continuous Examples: – debt, equity, orders, sales → stock price – age, height, weight, race, VKORC1 genotype, CYP2C9 genotype → warfarin dose • Classification: X discrete or continuous → Y discrete Examples: – family history, history of head trauma, age, gender, race, APOE status → Alzheimer’s disease – arrangement of pixels in a handwritten digit → “3”

  11. Supervised Learning • Linear Regression • Fitting the data to the model • Objective: minimize the mean squared error
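As a minimal sketch of fitting a line by minimizing the mean squared error (synthetic data and NumPy only; not the lecture's example):

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0, 10, 50)
    y = 2.0 * x + 1.0 + rng.normal(0, 1, size=x.shape)  # noisy line: true slope 2, intercept 1

    # Least squares picks the (slope, intercept) minimizing the mean squared error
    A = np.column_stack([x, np.ones_like(x)])
    (slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)
    mse = np.mean((A @ np.array([slope, intercept]) - y) ** 2)
    print(slope, intercept, mse)  # slope ~2, intercept ~1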

  12. Regression Does a mean squared error of 0 (i.e. no difference between prediction and target) mean this is the best model? • Overfitting • The real test of the ‘best model’ is its performance on data it has not been trained on
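The point can be made concrete with polynomial fits (a synthetic illustration, not from the slides): a degree-9 polynomial drives the training error on 10 points to nearly zero, yet typically does worse on held-out data than a modest fit.

    import numpy as np

    rng = np.random.default_rng(1)
    x_train = np.linspace(0, 1, 10)
    y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, size=10)
    x_test = np.linspace(0, 1, 100)               # unseen data
    y_test = np.sin(2 * np.pi * x_test)

    for degree in (1, 3, 9):
        coeffs = np.polyfit(x_train, y_train, degree)  # may warn at degree 9 (ill-conditioned)
        train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
        test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
        print(degree, round(train_mse, 4), round(test_mse, 4))
    # Training error keeps shrinking with degree; test error typically does not.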

  13. Regression What does this mean about the relationship between x and y?

  14. Classification • Linear classifier: hard threshold • Logistic regression: uses the logistic function, which goes between 0 and 1 – a soft threshold
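A quick sketch of the soft threshold (an illustration of mine): the logistic function σ(z) = 1 / (1 + e^(−z)) maps any real score into (0, 1), so outputs can be read as class probabilities rather than a hard 0/1 jump.

    import numpy as np

    def logistic(z):
        """Logistic (sigmoid) function: a soft threshold into (0, 1)."""
        return 1.0 / (1.0 + np.exp(-z))

    for z in (-4.0, -1.0, 0.0, 1.0, 4.0):
        print(z, round(float(logistic(z)), 3))
    # -4 -> 0.018, 0 -> 0.5, 4 -> 0.982: a smooth transition, not a hard jump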

  15. Other common methods in Supervised Learning More sophisticated algorithms are needed for data that are not linearly separable: • Support Vector Machines • Artificial Neural Networks (can also be unsupervised) • k-nearest neighbors • Graphical models, Bayesian models
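To see why a linear classifier is not enough, consider XOR-patterned labels; a kernel SVM (sketched below with scikit-learn, with parameters chosen for illustration, not taken from the lecture) separates them by implicitly mapping the points into a richer feature space.

    import numpy as np
    from sklearn.svm import SVC

    # XOR pattern: no single line separates the 0s from the 1s
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
    y = np.array([0, 1, 1, 0])

    clf = SVC(kernel="rbf", gamma=2.0).fit(X, y)  # RBF kernel -> non-linear boundary
    print(clf.predict(X))  # expected [0 1 1 0]: fits the XOR pattern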

  16. Unsupervised Learning Learn relationships among the inputs x1, …, xn; no y is given. Clustering – group inputs based on some measure of similarity – a common “first pass” exploratory data mining technique

  17. Hierarchical Clustering A method of cluster analysis which aims to partition the data into groups that are “close” to each other according to some distance metric.
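A minimal sketch with SciPy (toy points of my choosing, not the lecture's data): agglomerative linkage repeatedly merges the closest clusters, and the resulting tree can be cut into any number of groups.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    # Six 2-D points forming two visible groups
    points = np.array([[0, 0], [0, 1], [1, 0], [9, 9], [9, 10], [10, 9]])

    # Agglomerative clustering: repeatedly merge the two closest clusters
    Z = linkage(points, method="average", metric="euclidean")
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 groups
    print(labels)  # e.g. [1 1 1 2 2 2]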

  18. k-means Clustering A method of cluster analysis which aims to partition the data into k clusters in which each observation belongs to the cluster with the nearest mean.
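The same toy points, clustered with k-means via scikit-learn (a sketch, not code from the lecture):

    import numpy as np
    from sklearn.cluster import KMeans

    points = np.array([[0, 0], [0, 1], [1, 0], [9, 9], [9, 10], [10, 9]])

    # k-means alternates: assign each point to its nearest mean,
    # then recompute each mean from its assigned points.
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
    print(km.labels_)           # cluster index for each point
    print(km.cluster_centers_)  # the two means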

  19. Acknowledgments
  Shyam Visweswaran, Dept. of Biomedical Informatics
  Tom Mitchell, Dept. of Machine Learning, CMU
  Ian H. Witten, Eibe Frank, and Mark A. Hall, “Data Mining: Practical Machine Learning Tools and Techniques”
