
Data Mining: How to make islands of knowledge emerging out of oceans of data

Data Mining: How to make islands of knowledge emerging out of oceans of data. Hugues Bersini, IRIDIA - ULB. Plan: rapid intro to data warehouse; data mining: understand and predict; two super (but incomprehensible) data mining techniques, including Lazy for time series prediction.


Presentation Transcript


  1. Data Mining: How to make islands of knowledge emerging out of oceans of data. Hugues Bersini, IRIDIA - ULB

  2. PLAN • Rapid intro to data warehouse • Data mining: understand and predict • Two super (but incomprehensible) data mining techniques: Lazy for time series prediction; Bagfs for classification

  3. The Data Miner Steps • Data Warehousing • Data Preparation • Cleaning + Homogenization • Transformation - Composition • Reduction • For time series: time adjustment • Data Modelling: what researchers are mainly interested in.

  4. Data Warehouse

  5. Re-organization of data • Subject-oriented • Integrated • Transversal • With history • Non-volatile • From production data ---> to decision-based data

  6. Data Mining: Understand and predict

  7. Modelling the data: only if there are structure and regularities in the data • Data mining IS NOT OLAP • WHY?? To predict new data; to understand the data

  8. The main techniques of data-mining • Clustering • Outlier detection • Association analysis • Forecasting • Classification

  9. Data Mining: to understand and/or to predict • Discovering structure in data • Discovering I/O relationship in data

  10. Nothing new under the sun • New methods extending old ones in the non-linear (NN) and symbolic (decision tree) domains • Exponential explosion of data • Extracting from huge databases: more sensitive than ever

  11. [Figure: several data stores, exploited to support decisions] • Data volume doubles every 18 months world-wide • Problem: how to extract relevant knowledge for our decisions from such amounts of data? • Solutions: throw it away before using it (most popular); query it (Query and OLAP tools); summarize it: extract the essence from the bulk according to the targeted decision (Data Mining). CEDITI, September 2, 1998

  12. Discovering structure in data • When in a space with a metric • Hierarchical clustering • K-Means • NN clustering - Kohonen’s map • In space without any metric but a cost function: • Grouping Genetic Algorithms ....
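As an illustration of clustering in a space with a metric, here is a minimal K-means sketch in plain Python. It is not taken from the slides: the function names, the squared Euclidean distance, and the seeding strategy (picking k data points at random) are all assumptions made for the example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points (assumed metric)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centers):
    """Put every point in the cluster of its nearest center."""
    clusters = [[] for _ in centers]
    for p in points:
        distances = [squared_distance(p, c) for c in centers]
        clusters[distances.index(min(distances))].append(p)
    return clusters


def kmeans(points, k, iterations=20, seed=0):
    """Alternate assignment and center-update steps until the centers stop moving."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumed initialisation: k distinct data points
    for _ in range(iterations):
        clusters = assign(points, centers)
        new_centers = [
            tuple(sum(col) / len(members) for col in zip(*members)) if members else old
            for old, members in zip(centers, clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assign(points, centers)


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    centers, clusters = kmeans(data, k=2)
    print(centers)
    print(clusters)
```

Hierarchical clustering and Kohonen maps use the same metric differently; K-means is shown only because it is the shortest to write down.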

  13. Clustering and outlier

  14. Market Basket Analysis: Association analysis [Figure axis: quantity bought]

  15. Calculation of Improvement
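The transcript does not preserve the calculation itself. The sketch below assumes that "improvement" means the usual ratio confidence(A → B) / support(B), often called lift, in market basket analysis. The function names and the toy baskets are hypothetical.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated probability of the consequent given the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


def improvement(transactions, antecedent, consequent):
    """Assumed definition: confidence of the rule divided by support of the consequent.

    A value above 1 means the antecedent makes the consequent more likely than it
    is on average; a value of 1 means the two are independent in the data.
    """
    return confidence(transactions, antecedent, consequent) / support(transactions, consequent)


if __name__ == "__main__":
    # Hypothetical baskets, for illustration only.
    baskets = [
        {"bread", "butter"},
        {"bread", "butter", "milk"},
        {"bread"},
        {"milk"},
    ]
    print(improvement(baskets, {"bread"}, {"butter"}))
```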

  16. Discovering I/O relationship in data • Classification: I = (x, y), O = the class • Time series prediction: I = x(t), O = x(t+1) • Understanding the I/O relationship • Predicting which O for a new I
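To make the time series case concrete, the following sketch turns a series into (I, O) training pairs with I = x(t) and O = x(t+1). The function name and the `order` parameter (using several past values as input) are assumptions added for illustration; the slide only uses a single past value.

```python
def make_io_pairs(series, order=1):
    """Build (input, output) pairs from a series x(0), x(1), ...

    With order=1 each pair is ((x(t),), x(t+1)), as on the slide.
    With order=n the input holds the n most recent values (an assumed extension).
    """
    return [
        (tuple(series[t - order + 1 : t + 1]), series[t + 1])
        for t in range(order - 1, len(series) - 1)
    ]


if __name__ == "__main__":
    for pair in make_io_pairs([1, 2, 3, 4]):
        print(pair)
```

Any supervised learner can then be trained on these pairs exactly as in the classification case, where the input is (x, y) and the output is the class.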

  17. IRIDIA's track record in data mining • Recognition of glass defects at Glaverbel • Prediction of stock market fluctuations with MasterFood and dieteren • Incident recognition and electrical load prediction with Tractebel • Analysis of air traffic delays with Eurocontrol • Industrial process modelling with Honeywell, FAFER and Siemens • User-friendly Internet search engine with the Région Wallonne • Pixel classification for satellite images

  18. Financial prediction Task: predict the future trends of the financial series. Goal: automatic trading system to anticipate the fluctuations of the market.

  19. Economic variables Task: predict how many cars will be registered next year. Goal: support the marketing campaign of a car dealer.

  20. Modeling of industrial plants Rolling steel mill Task: predict the flow stress of the steel plate as a function of the chemical and physical properties of the material. Goal: cope with different types of metals, reduce the production time and improve final quality.

  21. Control Waste water treatment plant Task: model the dynamics of the plant on the basis of accessible information. Goal: control the level of water pollutants.

  22. Environmental problems Algae summer blooming Task: predict the biological state (e.g. density of algae communities) as a function of chemicals. Goal: automate the analysis of the state of the river by monitoring chemical concentrations.

  23. In the medical domain • automatic diagnosis of cancer • detection of respiratory problems • electrocardiogram analysis • assistance for paraplegics

  24. APPLICATION OF DATA MINING IN THE CANCER DOMAIN: Application to diagnostic and prognostic support in tumour pathology. In collaboration with the Laboratoire d'Histopathologie (R. Kiss), Faculté de Médecine, U.L.B.

  25. Histological criteria: loss of differentiation; invasion • Cytological criteria: nuclear size; mitoses; areas of hyperchromatism • Clinical assessment, patient, tumour, surgery • DIAGNOSIS (pathologists) • Adjuvant treatment: low, moderate, high • Improved diagnosis • Appropriate treatment • Increased survival

  26. Example: Primary brain tumours (adults): GLIOMAS

  27. "Objectification" of diagnostic elements • Quantification of criteria (cytological and histological) • Computer-assisted microscopy • Data processing • Extraction of reliable and reproducible diagnostic and/or prognostic information

  28. 500 to 1000 nuclei per tumour • 30 tumour variables: • mean • standard deviation

  29. Application to remote sensing

  30. Bagfs

  31. On the Internet • The Hyperprisme project • Text Mining • Automatic profiling of users • Key words: positive, negative,… • Automatic grouping of users on the basis of their profiles • See Web

  32. Different approaches [Figure labels: Data, Model; Non readable, Non comprehensible, Comprehensible; Accuracy of prediction; SVM; Local, Global]

  33. Understanding and Predicting Building Models A model needs data to exist but, once it exists, it can exist without the data. Structure To fit the data Model Parameters Linear, NN, Fuzzy, ID3, Wavelet, Fourier, Polynomials,...

  34. From data to prediction RAW DATA TRAINING DATA PREPROCESSING MODEL LEARNING PREDICTION

  35. Supervised learning input PHENOMENON output error OBSERVATIONS MODEL prediction • Finite amount of noisy observations. • No a priori knowledge of the phenomenon.

  36. Model learning MODEL GENERATION PARAMETRIC IDENTIFICATION MODEL VALIDATION STRUCTURAL IDENTIFICATION MODEL SELECTION

  37. The Practice of Modelling Accurate Simple Robust Understandable good for decision Data + Optimisation Methods THE MODEL Physical Knowledge Engineering Models Rules of Thumb Linguistic Rules

  38. Comprehensible models • Decision trees • Qualitative attributes • Force the attributes to be treated separately • classification surfaces parallel to the axes • good for comprehension because they select and separate the variables

  39. Decision trees • Widely used in practice. One of the favorite data mining methods • Work with noisy data (statistical approaches) and can learn a logical model out of data, expressed by and/or rules • ID3, C4.5 ---> Quinlan • Favoring small trees --> simple models

  40. At every stage the most discriminant attribute • The tree is constructed top-down, adding a new attribute at each level • The choice of the attribute is based on a statistical criterion called "the information gain" • $\mathrm{Entropy} = -p_{\mathrm{yes}} \log_2 p_{\mathrm{yes}} - p_{\mathrm{no}} \log_2 p_{\mathrm{no}}$ • Entropy = 0 if $p_{\mathrm{yes}} = 1$ or $p_{\mathrm{no}} = 1$ • Entropy = 1 if $p_{\mathrm{yes}} = p_{\mathrm{no}} = 1/2$
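A short sketch of the entropy criterion for a set of class labels. The function name is an assumption; the formula is the one on the slide, written so that it also covers more than two classes.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return sum(-(count / n) * log2(count / n) for count in Counter(labels).values())


if __name__ == "__main__":
    print(entropy(["yes", "yes", "yes", "yes"]))  # a pure set
    print(entropy(["yes", "no", "yes", "no"]))    # an evenly mixed set
```

The two calls correspond to the two boundary cases listed on the slide: a pure set and an evenly split set.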

  41. Information gain • S = set of instances, A = an attribute, v = a value of attribute A, $S_v$ = the instances of S for which A takes value v • $\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$ • The best A is the one that maximises the Gain • The algorithm runs in a recursive way • The same mechanism is reapplied at each level
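The gain formula and the recursive construction can be sketched as follows. This is an ID3-style illustration under stated assumptions, not Quinlan's full algorithm: instances are dictionaries, attributes are qualitative, and there is no pruning. All function names and the toy data are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Gain(S, A) = Entropy(S) - sum over v of |S_v|/|S| * Entropy(S_v)."""
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row[target] for row in rows if row[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy([row[target] for row in rows]) - remainder


def build_tree(rows, attributes, target):
    """Pick the attribute with maximal gain, split on it, and recurse on each subset."""
    labels = [row[target] for row in rows]
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    best = max(attributes, key=lambda a: information_gain(rows, a, target))
    remaining = [a for a in attributes if a != best]
    return {
        best: {
            value: build_tree([r for r in rows if r[best] == value], remaining, target)
            for value in sorted({r[best] for r in rows})
        }
    }


if __name__ == "__main__":
    # Hypothetical toy data, for illustration only.
    clients = [
        {"employed": "yes", "owns_home": "yes", "good_client": "yes"},
        {"employed": "yes", "owns_home": "no", "good_client": "yes"},
        {"employed": "no", "owns_home": "yes", "good_client": "no"},
        {"employed": "no", "owns_home": "no", "good_client": "no"},
    ]
    print(build_tree(clients, ["owns_home", "employed"], "good_client"))
```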

  42. But!!!! Is a good client if (x - y) > 30000. [Figure axes: loan repayment; monthly salary, with a mark at 30000]
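The difficulty here is that the rule (x - y) > 30000 is an oblique boundary, while a decision tree tests one attribute at a time. The sketch below illustrates this on a synthetic grid. It assumes x is the monthly salary and y the loan repayment (the transcript does not say which is which), and the grid values and helper name are hypothetical.

```python
def best_threshold_accuracy(values, labels):
    """Best accuracy of a single test 'value > threshold', or of its negation."""
    best = 0.0
    for threshold in sorted(set(values)):
        correct = sum((v > threshold) == label for v, label in zip(values, labels))
        accuracy = correct / len(labels)
        best = max(best, accuracy, 1 - accuracy)
    return best


# Synthetic grid of (x, y) pairs; x and y are assumed to be salary and repayment.
points = [(x, y) for x in range(0, 100001, 10000) for y in range(0, 100001, 10000)]
labels = [x - y > 30000 for x, y in points]

accuracy_x = best_threshold_accuracy([x for x, _ in points], labels)
accuracy_y = best_threshold_accuracy([y for _, y in points], labels)
accuracy_diff = best_threshold_accuracy([x - y for x, y in points], labels)

if __name__ == "__main__":
    print("single test on x:    ", accuracy_x)
    print("single test on y:    ", accuracy_y)
    print("single test on x - y:", accuracy_diff)
```

No single axis-parallel test separates the classes, whereas one test on the combined variable x - y does. A tree restricted to x and y separately needs a staircase of many splits to approximate the same boundary.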

  43. Other comprehensible models • Fuzzy logic • Realize an I/O mapping with linguistic rules • If I eat "a lot" then I gain weight "a lot"
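One simple way to turn the linguistic rule into an I/O mapping is sketched below. Everything numeric is an assumption made for illustration: the ramp-shaped membership function, the breakpoints of 2000 and 4000 calories, and the maximum gain of 2 kg. Real fuzzy systems usually combine several rules and a proper defuzzification step.

```python
def a_lot(value, low, high):
    """Degree (between 0 and 1) to which `value` counts as "a lot".

    Assumed shape: 0 below `low`, 1 above `high`, linear ramp in between.
    """
    if value <= low:
        return 0.0
    if value >= high:
        return 1.0
    return (value - low) / (high - low)


def weight_gain_kg(calories_per_day):
    """Single rule: IF I eat "a lot" THEN I gain weight "a lot"."""
    firing_strength = a_lot(calories_per_day, low=2000, high=4000)  # hypothetical breakpoints
    max_gain_kg = 2.0  # hypothetical output level standing for "a lot" of weight
    return firing_strength * max_gain_kg


if __name__ == "__main__":
    for calories in (1500, 3000, 4500):
        print(calories, weight_gain_kg(calories))
```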

  44. Trivial example: linear, optimal, automatic, simple [Figure: Y versus X]
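The slide's figure is not preserved in the transcript. Assuming it shows a straight line fitted to (X, Y) points, the closed-form least-squares fit below illustrates why the linear case is called optimal, automatic and simple. The function name and the sample data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept (closed form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


if __name__ == "__main__":
    slope, intercept = fit_line([0, 1, 2], [1, 3, 5])
    print(slope, intercept)
```

There is no structure to choose and no iterative search: the two parameters follow directly from the data.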

  45. Fuzziness
