Data Mining with Statistical Learning
Theodoros Evgeniou, Massachusetts Institute of Technology

Presentation Transcript


  1. Data Mining with Statistical Learning. Theodoros Evgeniou, Massachusetts Institute of Technology

  2. Outline
  • What is data mining?
    - Industry – why data mining?
  • Data mining projects
    - E-support system
    - Detecting patterns in multimedia data
  • Mathematics for complex data mining
    - Statistical Learning Theory
    - Data mining tools
  • Concluding remarks

  3. Part I
  • What is data mining?
    - Industry – why data mining?
  • Data mining projects
    - E-support system
    - Detecting patterns in multimedia data
  • Mathematics for complex data mining
    - Statistical Learning Theory
    - Data mining tools
  • Concluding remarks

  4. What is Data Mining?
  Goal: to classify or find trends in data in order to improve future decisions.
  Examples:
  - financial data modeling
  - forecasting
  - customer profiling
  - fraud detection

  5. Example: Fraud Detection
  [Diagram: labeled customer records feed a data-mining step that builds a fraud system; the system then classifies a new record (Age: .., Occ.: ..) as OK or FRAUD.]
  Training data:
  - Age: 24, Occ.: student, Spend: $100, Buy: … → OK
  - Age: 39, Occ.: engineer, Spend: $5000, Buy: … → OK
  - Age: 27, Occ.: ???????, Spend: $400, Buy: … → FRAUD
  - Age: 53, Occ.: small b., Spend: $1300, Buy: … → OK

  6. Example: Customer Profiling
  [Diagram: labeled customer records feed a data-mining step that builds a profiling system; the system then classifies a new record (Age: .., Occ.: ..) as BUY or NO.]
  Training data:
  - Age: 24, Occ.: student, Spend: $100, Buy: … → NO
  - Age: 39, Occ.: engineer, Spend: $5000, Buy: … → NO
  - Age: 27, Occ.: ???????, Spend: $400, Buy: … → BUY
  - Age: 53, Occ.: small b., Spend: $1300, Buy: … → NO

  7. Data Mining: More Examples
  • Sales analysis for inventory control
  • Diagnostics (manufacturing, health, …)
  • Information filtering/retrieval (e.g. emails, multimedia)
  • E-Customer Relationship Management:
    - E-customer profiling (personalization, marketing, …)
    - E-customer support

  8. Market Interest
  • Fraud: US 1999: $12B credit card fraud, 50% of it on the internet (IDC); also insurance, telecom, … Fraud detection using data mining: HNC/eHNC, 1999: ~$500M M.C., 2000: ~$2B M.C.
  • Personalization: 20% of e-companies use internet customer info, 70% will by 2001 (Forrester Research). Targeted marketing, collaborative filtering, … (privacy?). engage, netperceptions, …: ~$10B.
  • Email: only 30% of Fortune 500 companies using email respond to it on time (IDC). Email filtering/response software: $20M now, $350M in 2003 (IDC). Kana, eGain, aptex, …: ~$10B.

  9. Part II • What is data mining? - Industry – why data mining? • Data mining projects - E-support system - Detecting patterns in multimedia data • Mathematics for complex data mining - Statistical Learning Theory - Data mining tools • Concluding remarks

  10. An E-Support System
  Companies need to respond efficiently and accurately to customers' emails… how can they manage this when they receive thousands of emails a day?
  1 trillion emails/year in 1999, 5 trillion by 2003 (IDC).

  11. An Email Classification System
  [Diagram: labeled example emails feed a data-mining step that builds an e-support system; the system then routes a new email to PROBLEM or ACCOUNT.]
  Training data:
  - "…bought a piece of… some broken part…" → PROBLEM
  - "…would like to return… not satisfied with…" → PROBLEM
  - "…send a receipt… previous payment…" → ACCOUNT
  - "…request a copy of the report… balance of…" → ACCOUNT

  12. An Image Mining System
  How can we detect objects in an image?

  13. An Image Mining System
  [Diagram: labeled example images feed a data-mining step that builds an image system; the system then classifies a new image as Pedestrian or Car.]

  14. General System Architecture
  [Diagram: labeled example data feed a data-mining step that builds a system; the system then maps new data to Decision A or Decision B.]

  15. A Data Mining Process
  Data exist in many different forms (text, images, web clicks, …).
  STEP 1: represent data in numerical form (feature vectors).
  [Diagram: raw data (text, images) → problem-specific feature extraction → feature vector, e.g. (12, 3, …).]

  16. A Data Mining Process (cont.)
  STEP 2: statistical analysis of the numerical data (feature vectors): regression, classification, clustering.

  17. Step 1: Text Representation
  What is the representation?
  • Bag of words
  • Bag of combinations of words
  • Natural language processing features
  (Yang, McCallum, Joachims, …)
  Example: "…drive.. far.. see.. later… left.. drive.." → (2, 0, 1, 1, 1, 1, …)
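A bag-of-words representation like the one on this slide can be sketched in a few lines of Python. The vocabulary and example sentence below are illustrative, not taken from the slide's hidden figure:

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    """Map a text to a feature vector of word counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]  # Counter returns 0 for absent words

# Illustrative vocabulary and sentence (assumptions, not from the slide)
vocab = ["drive", "far", "see", "later", "left", "fast"]
vector = bag_of_words("drive far see later left drive", vocab)
print(vector)  # [2, 1, 1, 1, 1, 0]
```

Richer representations (word combinations, NLP features) change only the feature-extraction step; the downstream vector format stays the same.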

  18. Step 1: Image Representation
  What is the representation?
  • Pixel values
  • Projections on filters (wavelets)
  • PCA
  • Feature selection
  Example vector: (12, 92, 74, 0, 12, …, 124)
  (Papageorgiou et al., 1999; Evgeniou et al., 2000)
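The simplest representation listed, raw pixel values, amounts to flattening the image into one vector. A minimal sketch (the pixel values are illustrative, loosely echoing the vector on the slide):

```python
def pixel_features(image_rows):
    """Flatten a grayscale image (list of rows of pixel values) into one feature vector."""
    return [p for row in image_rows for p in row]

img = [[12, 92, 74],
       [0, 12, 124]]
print(pixel_features(img))  # [12, 92, 74, 0, 12, 124]
```

Wavelet projections, PCA, and feature selection replace this identity mapping with a learned or hand-designed transform, but still produce a fixed-length vector per image.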

  19. Step 2: "Learn" a Decision Surface
  [Diagram: labeled feature vectors such as (4,24,…), (7,33,…), (1,13,…), (41,11,…), (4,71,…), (92,10,…), (19,3,…) separated by a decision surface.]

  20. Learning Methods
  Other approaches:
  • Bayesian methods
  • Nearest Neighbor
  • Neural Networks
  • Decision Trees
  • Expert systems
  New approach:
  • The Statistical Learning approach

  21. Part III
  • What is data mining?
    - Industry – why data mining?
  • Data mining projects
    - E-support system
    - Detecting patterns in multimedia data
  • Mathematics for complex data mining
    - Statistical Learning Theory
    - Data mining tools
  • Concluding remarks

  22. Roadmap
  • Formal setting of learning from examples
  • Standard learning methods
  • The Statistical Learning approach
  • Tools and contributions

  23. Formal Setting of the Problem
  Given a set of $\ell$ examples (data) $\{(x_1, y_1), \dots, (x_\ell, y_\ell)\}$.
  Question: find a function $f$ such that $f(x)$ is a good predictor of $y$ for a future input $x$.

  24. The Ideal Solution
  What is a "good predictor"? If data $(x, y)$ appear according to an (unknown) probability distribution $P(x, y)$, then we want our solution to minimize the expected error $\int V(y, f(x)) \, dP(x, y)$.
  $V(y, f(x))$: loss function measuring the "cost" of predicting $f(x)$ instead of $y$ (e.g. $(y - f(x))^2$).

  25. (I) Empirical Error Minimization
  We only have example data, so go for the obvious: minimize the empirical error $\frac{1}{\ell} \sum_{i=1}^{\ell} V(y_i, f(x_i))$ … and hope that the solution also has a small expected error.

  26. (II) Function Space
  Where do we choose $f$ from? Can $f$ be any constant function? Can $f$ be any polynomial?

  27. Standard Learning Methods
  A standard way of building learning methods:
  • Step 1: define a function space H
  • Step 2: define the loss function V(y, f(x))
  • Step 3: find f in H that minimizes the empirical error
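The three steps can be made concrete with the simplest possible choices (an illustrative sketch, not from the slides): take H to be the constant functions and V the squared loss, in which case the empirical-error minimizer in H is the sample mean of the labels:

```python
def empirical_error(f_const, data):
    """Empirical error of the constant predictor f(x) = f_const under squared loss."""
    return sum((y - f_const) ** 2 for _, y in data) / len(data)

# Step 1: H = constant functions.  Step 2: V(y, f(x)) = (y - f(x))^2.
# Step 3: the minimizer over H is the mean of the y's (illustrative data below).
data = [(1, 2.0), (2, 4.0), (3, 6.0)]
f_star = sum(y for _, y in data) / len(data)
print(f_star)  # 4.0
print(empirical_error(f_star, data) <= empirical_error(3.0, data))  # True
```

Richer spaces H (lines, polynomials, …) make step 3 an optimization problem rather than a closed-form average, which is where the questions on the next slides arise.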

  28. Standard Learning Methods
  A standard way of building learning methods:
  • Step 1: define a function space H (How?)
  • Step 2: define the loss function V(y, f(x)) (OK?)
  • Step 3: find f in H that minimizes the empirical error (Enough?)

  29. The Central Questions
  • How do we choose the function space H?
  • What if there are many solutions in H minimizing the empirical error (ill-posed problem)?
  • Does a function f that minimizes the empirical error in H also minimize the expected error?

  30. Statistical Learning Approach (Vapnik, Chervonenkis, 1968–)
  • Choose the function space H according to its complexity. Formal complexity measures are provided (e.g. the VC-dimension).
  • With appropriate control of the complexity of the function space, the problem becomes well-posed: there is a unique solution.
  • The theory provides necessary and sufficient conditions for the uniform convergence of the empirical error to the expected error in a function space, in terms of the complexity of the space.

  31. Important Bound (Vapnik, Chervonenkis, 1971)
  The theory provides bounds on the distance between the expected and the empirical error. These bounds can be used to choose the function space H.
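The bound itself appears on the slide only as an image; a standard statement of the Vapnik–Chervonenkis bound (an assumption about the exact form shown) is that, with probability at least $1 - \eta$, for every $f$ in a space of VC-dimension $h$, given $\ell$ examples:

```latex
I_{\mathrm{exp}}[f] \;\le\; I_{\mathrm{emp}}[f] \;+\;
\sqrt{\frac{h\left(\ln\frac{2\ell}{h} + 1\right) - \ln\frac{\eta}{4}}{\ell}}
```

The second term grows with the complexity $h$ and shrinks with the number of examples $\ell$, which is exactly the trade-off pictured on the next two slides.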

  32. Using the Bound
  [Figure: two fits of the same data, one underfitting and one overfitting.]

  33. Using the Bound
  [Figure: expected and empirical error plotted against complexity h; the expected error is minimized at h_opt.]

  34. Standard Approaches
  A standard way of building learning methods:
  • Step 1: define a function space H (How?)
  • Step 2: define the loss function V(y, f(x)) (OK?)
  • Step 3: find f in H that minimizes the empirical error (Enough?)

  35. The Statistical Learning Approach
  The new way of building learning methods: minimize Empirical Error + Complexity, by trying many H.
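Written out, "Empirical Error + Complexity" is the regularization functional minimized over each candidate space $H$ (a standard form consistent with slides 44 and 49; the trade-off parameter $\lambda$ and the norm $\|f\|_K$ are the usual choices, not recoverable verbatim from the slides):

```latex
\min_{f \in H} \;\;
\underbrace{\frac{1}{\ell} \sum_{i=1}^{\ell} V\bigl(y_i, f(x_i)\bigr)}_{\text{empirical error}}
\;+\;
\underbrace{\lambda \, \|f\|_{K}^{2}}_{\text{complexity}}
```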

  36. The Statistical Learning Approach
  Solves the problems of the standard methods:
  • Step 1: define a function space H
  • Step 2: define the loss function V(y, f(x))
  • Step 3: find f in H that minimizes the empirical error

  37. Example

  38. aka Perceptron (Neural Network)
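Slides 37–38 show a linear-separation example described as a perceptron; the figures themselves are not in the transcript. A minimal sketch of the classic perceptron update rule, on illustrative 2-D data (the points and epoch count are assumptions):

```python
def perceptron(data, epochs=20):
    """Classic perceptron: learn w, b so that sign(w.x + b) matches the labels."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in data:  # labels y in {-1, +1}
            if y * (w[0] * x[0] + w[1] * x[1] + b) <= 0:  # misclassified point
                w = [w[0] + y * x[0], w[1] + y * x[1]]    # nudge the line toward it
                b += y
    return w, b

# Illustrative linearly separable points
data = [([2.0, 2.0], 1), ([1.0, 3.0], 1), ([-2.0, -1.0], -1), ([-1.0, -2.0], -1)]
w, b = perceptron(data)
print(all(y * (w[0] * x[0] + w[1] * x[1] + b) > 0 for x, y in data))  # True
```

On separable data the perceptron finds *some* separating line, but not a unique or complexity-controlled one, which is the gap the statistical learning approach on the next slides addresses.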

  39. Statistical Learning Approach
  What if we restrict the set of lines (the function space), and thereby control complexity?

  40. Statistical Learning Approach

  41. Benefits of Statistical Learning
  • The problem becomes well-posed
  • The solution has a smaller expected error

  42. Empirical Error vs. Complexity
  What if we further restrict complexity?

  43. Benefits of Statistical Learning
  Avoid overfitting (important for high-dimensional data!)

  44. Support Vector Machines (Vapnik, Cortes, 1995)
  [Slide shows the Empirical Error + Complexity objective.]

  45. Non-linear Function Spaces
  Generally $f$ can be any "linear" function in some very complex feature space: $f(x) = w \cdot \Phi(x)$ for a feature map $\Phi$.

  46. Example: Second Order Features
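The body of this slide is an image; a standard second-order feature map that it plausibly shows (an assumption, consistent with slide 47's "second order features" and the polynomial kernel on slide 48) is:

```latex
\Phi(x_1, x_2) \;=\;
\bigl(x_1^2,\; x_2^2,\; \sqrt{2}\,x_1 x_2,\; \sqrt{2}\,x_1,\; \sqrt{2}\,x_2,\; 1\bigr),
\qquad
\Phi(x) \cdot \Phi(x') \;=\; \bigl(1 + x \cdot x'\bigr)^2
```

A linear function of $\Phi(x)$ is a quadratic function of $(x_1, x_2)$, and the dot product in feature space reduces to a simple kernel evaluation.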

  47. Second Order Polynomials
  Using more complex features (second-order features).

  48. Reproducing Kernel Hilbert Space
  RKHS: a space of linear functions in a feature space satisfying some conditions (functional analysis…). Examples:
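The example kernels on the slide are images; the standard examples in this setting (an assumption about what the slide lists) are the polynomial and Gaussian kernels:

```latex
K(x, x') = (1 + x \cdot x')^{d},
\qquad
K(x, x') = \exp\!\left(-\frac{\|x - x'\|^{2}}{2\sigma^{2}}\right)
```

Each kernel implicitly defines the feature space, so the learning algorithm never needs to compute $\Phi(x)$ explicitly.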

  49. Support Vector Machines: General
  Objective: Empirical Error + Complexity. Training: quadratic programming.
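The slide says SVM training solves a quadratic program; as a rough illustration of the same "empirical error + complexity" objective, here is a sketch that minimizes hinge loss plus an L2 penalty by sub-gradient descent instead (the data, $\lambda$, step size, and epoch count are all assumptions; real SVM solvers use QP as stated):

```python
def train_linear_svm(data, lam=0.01, eta=0.1, epochs=500):
    """Sub-gradient descent on: (1/l) * sum of hinge losses + lam * ||w||^2."""
    w, b = [0.0, 0.0], 0.0
    l = len(data)
    for _ in range(epochs):
        # sub-gradient of the regularized objective at the current (w, b)
        gw0, gw1, gb = 2 * lam * w[0], 2 * lam * w[1], 0.0
        for x, y in data:  # labels y in {-1, +1}
            if y * (w[0] * x[0] + w[1] * x[1] + b) < 1:  # margin violation: hinge active
                gw0 -= y * x[0] / l
                gw1 -= y * x[1] / l
                gb -= y / l
        w = [w[0] - eta * gw0, w[1] - eta * gw1]
        b -= eta * gb
    return w, b

# Illustrative linearly separable data
data = [([2.0, 2.0], 1), ([1.0, 3.0], 1), ([-2.0, -1.0], -1), ([-1.0, -2.0], -1)]
w, b = train_linear_svm(data)
print(all(y * (w[0] * x[0] + w[1] * x[1] + b) > 0 for x, y in data))  # True
```

Unlike the plain perceptron, the penalty term here pushes toward a small-norm (large-margin) separator, which is the "complexity" half of the objective.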

  50. Kernel Machines
  Objective: Empirical Error + Complexity. Choices to make: the loss V, the function space (kernel), and the regularization parameter λ.
