1 / 27

LIACS Data Mining course

LIACS Data Mining course. Arno Knobbe Arne Koopman. an introduction. Course Textbook. Data Mining Practical Machine Learning Tools and Techniques second edition, Morgan Kaufmann, ISBN 0-12-088407-0 by Ian Witten and Eibe Frank. Course Information. New Website:

susankeller
Télécharger la présentation

LIACS Data Mining course

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. LIACS Data Mining course Arno Knobbe Arne Koopman an introduction

  2. Course Textbook Data Mining Practical Machine Learning Tools and Techniques second edition, Morgan Kaufmann, ISBN 0-12-088407-0 by Ian Witten and Eibe Frank

  3. Course Information • New Website: http://www.liacs.nl/~akoopman/DaMi/ • Old website discontinued: http://www.liacs.nl/~joost/DM/CollegeDataMining.htm • New lecturers  new style • Old book. This may change next year • Updated slides + some new material • Practical exercises • New style of exam • fewer definitions, more understanding and applying • old exams should not be used • exam preparation (Dec 8) important

  4. Course Outline 08-Sep Knobbe today 15-Sep Knobbe 22-Sep Koopman 29-Sep Knobbe 06-Oct Knobbe practical exercise 13-Oct Koopman 20-Oct Koopman 27-Oct Knobbe 03-Nov Knobbe practical exercise 10-Nov Koopman practical exercise 17-Nov no lecture! 24-Nov Koopman start at 9:00! 01-Dec Koopman 08-Dec Koopman,Knobbe exam preparation! 14-Jan, 10:00-13:00 exam

  5. Introduction Data Mining an overview and some examples

  6. Data Mining definitions Data Mining: the concept of extracting previously unknown and potentially useful information from large sets of data. secondary statistics: analyzing data that wasn’t originally collected for analysis.

  7. Data Mining, the big idea • Organizations collect large amounts of data • Often for administrative purposes • Large body of experience • Learning from experience • Goals • Prediction • Optimization • Forecasting • Diagnostics • …

  8. 2 Streams • Mining for insight • Understanding a domain • Finding regularities between variables • Goal of Data Mining is mostly undefined • Interpretable models • Examples: Medicine, production, maintenance • ‘Black-box’ Mining • Don’t care how you do it, just do it well • Optimization • Examples: Marketing, forecasting (financial, weather)

  9. Data Mining customer model example: Direct Mail Optimize the response to a mailing, by targeting only those that are likely to respond: • more response • fewer letters Customer information response 3% test mailing Customer information response 30% final mailing remainder

  10. example: Bioinformatics • Find genes involved in disease (Parkinson’s, Celiac, Neuroblastoma) • Measurements from patients (1) and controls (0) • Gene expression: measurements of 20k genes • dataset 20,001 x 100 • Challenges • many variables • few examples (patients), testing is expensive • interactions between genes

  11. Data Mining paradigms • Classification • binary class variable • predict class of future cases • most popular paradigm • Clustering • divide dataset into groups of similar cases • Regression • numeric target variable • Association • find dependencies between variables • basket analysis, …

  12. 0.64 Yes Age < 35 0.4 Rent 0.25 No Age ≥ 35 0.51 Yes Price < 200K 0.1 Buy 0.01 No Price ≥ 200K 0.07 No Other Classification Predict the class (often 0/1) of an object on the basis of examples of other objects (with a class given). 0.2

  13. Age < 35 Rent Age ≥ 35 Price < 200K Buy Price ≥ 200K Other Building (inducing) a decision tree Age Gender House Price Mortgage? 21 M Rent - No 30 F Rent - Yes 40 M Rent - No 32 F Buy 300K No 30 F Rent - Yes 55 M Buy 260K No 25 F Buy 180K Yes … Age Gender House Price Mortgage? 21 M Rent - No 30 F Rent - Yes 40 M Rent - No 32 F Buy 300K No 30 F Rent - Yes 55 M Buy 260K No 25 F Buy 180K Yes …

  14. Yes Age < 35 Rent No Age ≥ 35 Yes Price < 200K Buy No Price ≥ 200K No Other Applying a classifier (decision tree) New customer: (House = Rent, Age = 32, …) prediction = Yes

  15. y + + + + + + + x < t - + - + y < t’ - + - x  t + - - - y  t’ + - 0 x Graphical interpretation • dataset with two variables + 1 class (+/-) • graphical interpretation of decision tree

  16. y + + + + + + + - + - + - + - + - - - + - 0 x Graphical interpretation • dataset with two variables + 1 class (+/-) • other classifiers Support Vector Machine Neural Network

  17. Applications of DM • Marketing • outgoing • incoming • Bioinformatics & Medicine • Fraud detection • Risk management • Insurance • Enterprise resource planning

  18. Rhinoplastic surgery

  19. InfraWatch: monitoring of infrastructure Continuous monitoring of a large bridge ‘Hollandse Brug’ • 145 sensors • time-dependent, at frequencies up to 100Hz • multi-modal (sensor, video, differen freq.) • managing large data quantities, >1 Gb per day

  20. InfraWatch: monitoring of infrastructure • 34 `geo-phones' (vibration sensors) • 44 embedded strain-gauges, 47 gauges outside • 20 thermometers • video camera • weather station

  21. InfraWatch sensors

  22. Real-world application:Maintenance planning at KLM • Routine checks of aircrafts • Maintenance requires up to 10k different parts • Ordering parts incurs delay (costs)… • … but so does stocking • In theory 10k individual predictions • Input • maintenance history • flight history, Sahara/North Pole • Only few parts predictable

  23. Cashflow Online • Online personal finance overview • All bank transactions are loaded into the application • transactions are classified into different categories • Data Mining predicts category

  24. 67 Categories Gas Water Licht Onderhoud huis en tuin Telefoon + Internet + TV Contributie (sport-)verenigingen Levensverzekering / Lijfrente Rente ontvangen Boodschappen Hypotheekrente Naar spaarrekening Geldopname/chipknip Verzekeringen overig Loterijen Cadeau's Interne boeking Vakantie & Recreatie Uitgaan, hobby's en sport Creditcard Ziektekostenverzekering Brandstof Woonhuis / Opstalverzekering Huishouden overig School- en Studiekosten Inkomsten overig Kleding & Schoenen Lenen Openbaar vervoer/Taxi …

  25. Fragmented results: Boodschappen (groceries) Contributie

  26. Decision Tree over all categories true false

  27. Data Mining at LIACS • Applications • bioinformatics (LUMC) • law enforcement (KLPD, NFI) • rhinoplastic surgery (NKI) • Hollandse Brug (Strukton, RWS, Reef Infra) • … • Complex data • graphical data (molecules) • relational data (criminal careers) • stream data (sensor-data, click-streams) • …

More Related