
DATA MINING: Algorithms, Applications and Beyond

DATA MINING: Algorithms, Applications and Beyond. Chandan K. Reddy Department of Computer Science Wayne State University, Detroit, MI – 48202. Organization. Introduction Basic components Fundamental Topics Classification Clustering Association Analysis Research Topics


Presentation Transcript


  1. DATA MINING: Algorithms, Applications and Beyond • Chandan K. Reddy • Department of Computer Science • Wayne State University, Detroit, MI – 48202.

  2. Organization • Introduction • Basic components • Fundamental Topics • Classification • Clustering • Association Analysis • Research Topics • Probabilistic Graphical Models • Boosting Algorithms • Active Learning • Mining under Constraints • Teaching

  3. Lots of Data …. • Customer Transactions • Bioinformatics • Banking • Internet / Web • Biomedical Imaging

  4. So What ????? • Computers have become cheaper and more powerful, so storage is not an issue • There is often information “hidden” in the data that is not readily evident • Human analysts may take weeks to discover useful information • Much of the data is never analyzed at all We are drowning in data, but starving for knowledge!!!

  5. Data Mining is … • “the nontrivial extraction of implicit, previously unknown, and potentially useful information from data” • “the science of extracting useful information from large data sets or databases” - Wikipedia.org • A more appropriate term would be Knowledge Discovery in Databases

  6. Steps in Knowledge Discovery

  7. Steps in the KDD Procedure • Data Cleaning • (removal of noise and inconsistent records) • Data Integration • (combining multiple sources) • Data Selection • (only data relevant for the task are retrieved from the database) • Data Transformation • (converting data into a form more appropriate for mining) • Data Mining • (application of intelligent methods in order to extract data patterns) • Model Evaluation • (identification of truly interesting patterns representing knowledge) • Knowledge Presentation • (visualization or other knowledge presentation techniques)

  8. What can Data Mining do? • Figure out intelligent ways of handling the data • Find valuable information hidden in large volumes of data • Analyze the data and find patterns and regularities in it • Mining analogy: in a mining operation, large amounts of low-grade material are sifted through in order to find something of value • Identify abnormal or suspicious activities • Provide guidelines to humans on what to look for in a dataset

  9. Related CS Topics Database Systems Pattern Recognition Optimization Data Mining Algorithms Artificial Intelligence Statistics Machine Learning Visualization

  10. Typical Data Mining Tasks are … • Prediction Methods (you know what to look for) • Use some variables to predict unknown or future values of other variables. • Description Methods (you don’t know what to look for) • Find human-interpretable patterns that describe the data. From [Fayyad et al.], Advances in Knowledge Discovery and Data Mining, 1996

  11. Basic components • Data Pre-processing • Data Visualization • Model Evaluation • Classification • Clustering • Association Analysis

  12. Different kinds of Data • Record Data • Data Matrix • Document Data • Transaction Data • Graph Data • Ordered • Temporal Data • Sequence Data • Spatio-Temporal Data

  13. Record Data • Data that consists of a collection of records, each of which consists of a fixed set of attributes

  14. Document Data • Each document becomes a `term' vector, • each term is a component (attribute) of the vector, • the value of each component is the number of times the corresponding term occurs in the document.
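
The term-vector idea on this slide can be sketched in a few lines. This is a minimal illustration, assuming whitespace tokenization and lowercasing; the function name `term_vector` and the two sample documents are hypothetical and not from the presentation.

```python
from collections import Counter


def term_vector(document, vocabulary):
    """Return the count of each vocabulary term in `document`.

    Assumption: terms are whitespace-separated and case-insensitive.
    """
    counts = Counter(document.lower().split())
    # Counter returns 0 for terms that never occur in the document.
    return [counts[term] for term in vocabulary]


# Hypothetical toy documents, used only to illustrate the representation.
docs = ["the team won the game", "the game was a timeout game"]

# The vocabulary is the sorted set of all terms seen in the collection.
vocabulary = sorted({word for doc in docs for word in doc.lower().split()})

# One row per document, one column (attribute) per term.
vectors = [term_vector(doc, vocabulary) for doc in docs]

for doc, vec in zip(docs, vectors):
    print(doc, "->", dict(zip(vocabulary, vec)))
```

Each row of `vectors` is one document and each column is one term, so the collection becomes an ordinary data matrix.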

  15. Transaction Data • A special type of record data, where • Each record (transaction) involves a set of items. • The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.

  16. Graph Data • Data with relationships among objects • Examples: (a) Generic Web Data (b) Citation Data Analysis

  17. Ordered Data • Time Series data – a series of measurements taken over a certain time frame • E.g. financial data

  18. Ordered Data • Sequence data – no time stamps, but order is still important. E.g. Genome data

  19. Ordered Data • Spatio-Temporal Data • Average monthly temperature of land and ocean, collected for a variety of geographical locations (a total of 250,000 data points)

  20. Data Pre-Processing • Removal of noise and outliers • Will improve the performance of mining • Sampling is employed for data selection • Processing the entire data set might be expensive • Dealing with high-dimensional data • Curse of dimensionality • Data Normalization • Different features have different ranges of values, e.g. human age, height, weight • Feature Selection • Remove unnecessary features – redundant or irrelevant
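
The normalization bullet can be made concrete with two common rescaling schemes. The slide does not say which scheme is meant, so min-max scaling and z-scores are shown here as assumptions; the function names and sample values are hypothetical.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no range information.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Center values on their mean and divide by the population std. deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical features measured on very different scales.
ages_years = [20, 30, 40, 60]
heights_cm = [150, 160, 170, 190]

print(min_max_scale(ages_years))
print(min_max_scale(heights_cm))
print(z_score(ages_years))
```

After rescaling, the two features lie in the same range, so neither dominates a distance computation simply because of its units.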

  21. Data Visualization • Visualization is the conversion of data into a visual or tabular format so that the characteristics of the data and the relationships among data items or attributes can be analyzed or reported. • Examples: Histograms, Pie Chart

  22. Scatter Plot Array of Iris Attributes

  23. Contour Plot Example (scale: Celsius)

  24. Parallel Coordinates Plots for Iris Data

  25. Chernoff Faces for Iris Data Setosa Versicolour Virginica

  26. A Sample Data Cube • Dimensions: Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, PC, VCR, sum), Country (U.S.A, Canada, Mexico, sum) • Highlighted cell: total annual sales of TV in U.S.A. • Apex cell: (All, All, All)

  27. Organization • Introduction • Basic components • Fundamental Topics • Classification • Clustering • Association Analysis • Research Topics • Probabilistic Graphical Models • Boosting Algorithms • Active Learning • Mining under Constraints • Teaching

  28. Classification • Training Phase: Existing Data → Training Algorithm → Learn Model • Testing Phase: New Data → Apply Model → Result (???)

  29. Classification models (decision tree) • Root node: Outlook, with branches Sunny, Rainy, Overcast • Internal nodes: Windy (True / False), Humidity (High / Normal) • Leaves: Yes / No
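
A decision tree like this one can be applied to a record by walking from the root to a leaf. The transcript does not preserve which branch leads to which node, so the layout below is an assumption that follows the widely used textbook "weather" example (Sunny → Humidity, Rainy → Windy, Overcast → Yes). The names `TREE` and `classify` are hypothetical.

```python
# Assumed tree layout (the slide's figure did not survive in the transcript).
# An internal node is (attribute_name, {branch_value: subtree}); a leaf is a string.
TREE = (
    "Outlook",
    {
        "Sunny": ("Humidity", {"High": "No", "Normal": "Yes"}),
        "Overcast": "Yes",
        "Rainy": ("Windy", {"True": "No", "False": "Yes"}),
    },
)


def classify(node, record):
    """Follow the branches that match `record` until a leaf label is reached."""
    while not isinstance(node, str):
        attribute, branches = node
        # str() lets boolean attribute values match the "True"/"False" branch keys.
        node = branches[str(record[attribute])]
    return node


new_record = {"Outlook": "Sunny", "Humidity": "High", "Windy": False}
print(classify(TREE, new_record))
```

Only the attributes on the path from root to leaf are consulted, which is part of what makes this kind of model easy to interpret.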

  30. Metrics for Performance Evaluation Most widely-used metric:
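
The formula on this slide did not survive in the transcript. Assuming the metric meant is classification accuracy (the fraction of records predicted correctly), a minimal sketch looks like this. The function names and the sample labels are hypothetical.

```python
def confusion_counts(actual, predicted, positive="Yes"):
    """Count true/false positives and negatives for a two-class problem."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        else:
            if a == positive:
                fn += 1
            else:
                tn += 1
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Accuracy = correctly classified records / all records."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


# Hypothetical labels, used only to exercise the functions.
actual = ["Yes", "Yes", "No", "No", "Yes"]
predicted = ["Yes", "No", "No", "Yes", "Yes"]

counts = confusion_counts(actual, predicted)
print(counts, accuracy(*counts))
```

Accuracy alone can be misleading when one class is much rarer than the other, which is why the four counts are kept separately here.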

  31. Evaluating Data Mining techniques • Predictive Accuracy (ability of a model to predict the future) or • Descriptive Quality (ability of a model to find meaningful descriptions of the data, e.g. clusters) • Speed (computation cost involved in generating and using the model) • Robustness (ability of a model to work well even with noisy or missing data) • Scalability (ability of a model to scale up well with large amounts of data) • Interpretability (level of understanding and insight provided by the model)

  32. Clustering • No class labels – so, no prediction • Groupings in the data (descriptive) • Can be used to summarize the data • Can help in removing outliers and noise • Applications: image segmentation, document clustering, gene expression data, etc.

  33. Association Analysis • Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction • Market-Basket transactions • Example of Association Rules: {Diaper} → {Beer}, {Milk, Bread} → {Eggs, Coke}, {Beer, Bread} → {Milk} • Implication means co-occurrence, not causality!
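
To show how a rule such as {Diaper} → {Beer} can be scored, the sketch below computes support and confidence, two standard measures that the slide itself does not name. The transaction table from the slide is not in the transcript, so the five baskets below are a hypothetical stand-in.

```python
def count_containing(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def support(itemset, transactions):
    """Fraction of all transactions that contain `itemset`."""
    return count_containing(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Among transactions containing the antecedent, the fraction that
    also contain the consequent."""
    base = count_containing(antecedent, transactions)
    both = count_containing(set(antecedent) | set(consequent), transactions)
    return both / base if base else 0.0


# Hypothetical market-basket data (not taken from the presentation).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

print(support({"Diaper", "Beer"}, transactions))
print(confidence({"Diaper"}, {"Beer"}, transactions))
```

Both numbers only describe how often items appear together, which matches the slide's warning that a rule expresses co-occurrence rather than causality.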

  34. Organization • Introduction • Basic components • Fundamental Topics • Classification • Clustering • Association Analysis • Research Topics • Probabilistic Graphical Models • Boosting Algorithms • Active Learning • Mining under Constraints • Teaching

  35. Probabilistic Graphical Models • Real World Data is very complicated • We would like to understand the underlying distribution that generated the data • If it is unimodal, then it is easy to solve • But, usually the distribution is multimodal – not unimodal

  36. Parameter Estimation • Modeling with Probabilistic Graphical Models • Mixture Models • Hidden Markov Models • Mixture-of-Experts • Bayesian Networks • Mixture of Factor Analyzers • Neural Networks • And so on….. We don’t want Sub-optimal models

  37. Example

  38. Motivation • “Searching for a needle in a haystack”

  39. Problems with Local Optimization • Local methods offer only “fine-tuning” capability, so there is a need for a method that explores a subspace in a systematic manner.

  40. TRUST-TECH Approach • Systematic Tier-by-Tier search

  41. Mixture Models • Let $x = [x_1, x_2, \ldots, x_d]^T$ be the $d$-dimensional feature vector • Assumption: $k$ components in the mixture model • Let $\Theta = \{\alpha_1, \alpha_2, \ldots, \alpha_k, \theta_1, \theta_2, \ldots, \theta_k\}$ represent the collection of parameters

  42. Maximum Likelihood Estimation • Let $X = \{x^{(1)}, x^{(2)}, \ldots, x^{(n)}\}$ be the set of $n$ i.i.d. samples • Goal: Find $\Theta$ that maximizes the likelihood function • Difficulty: (i) no closed-form solution and (ii) the likelihood surface is highly nonlinear

  43. EM Algorithm • Initialization: Set the initial parameters $\Theta$ • Iteration: Iterate the following until convergence • E-Step: Compute the Q-function, i.e. the expectation of the log likelihood given the current parameters • M-Step: Maximize the Q-function with respect to $\Theta$
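
The E-step / M-step loop can be sketched for a simple case. This is a minimal illustration that assumes a one-dimensional mixture of Gaussians with mixing weights `alphas`; the presentation does not fix the component family. It shows plain EM only, not the TRUST-TECH extension, and the synthetic data and starting values are hypothetical.

```python
import math
import random


def gaussian_pdf(x, mu, var):
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def log_likelihood(data, alphas, mus, variances):
    return sum(
        math.log(sum(a * gaussian_pdf(x, m, v)
                     for a, m, v in zip(alphas, mus, variances)))
        for x in data
    )


def em_step(data, alphas, mus, variances):
    # E-step: posterior probability (responsibility) of each component for
    # each sample; these are the quantities the Q-function is built from.
    resp = []
    for x in data:
        weighted = [a * gaussian_pdf(x, m, v)
                    for a, m, v in zip(alphas, mus, variances)]
        total = sum(weighted)
        resp.append([w / total for w in weighted])

    # M-step: closed-form parameter updates that maximize the Q-function.
    n = len(data)
    new_alphas, new_mus, new_vars = [], [], []
    for j in range(len(alphas)):
        nj = sum(r[j] for r in resp)
        mu = sum(r[j] * x for r, x in zip(resp, data)) / nj
        var = sum(r[j] * (x - mu) ** 2 for r, x in zip(resp, data)) / nj
        new_alphas.append(nj / n)
        new_mus.append(mu)
        new_vars.append(max(var, 1e-6))  # guard against a collapsing component
    return new_alphas, new_mus, new_vars


def fit_em(data, alphas, mus, variances, tol=1e-8, max_iter=500):
    """Iterate EM until the log likelihood stops improving by more than `tol`."""
    previous = log_likelihood(data, alphas, mus, variances)
    for _ in range(max_iter):
        alphas, mus, variances = em_step(data, alphas, mus, variances)
        current = log_likelihood(data, alphas, mus, variances)
        if abs(current - previous) < tol:
            break
        previous = current
    return alphas, mus, variances


# Synthetic data: two well-separated clusters (hypothetical example).
rng = random.Random(0)
data = ([rng.gauss(0.0, 1.0) for _ in range(200)]
        + [rng.gauss(6.0, 1.0) for _ in range(200)])

alphas, mus, variances = fit_em(data, [0.5, 0.5], [1.0, 4.0], [1.0, 1.0])
print(alphas, mus, variances)
```

Because EM only climbs to a nearby local maximum of the likelihood, the result depends on the initial parameters. That dependence is the limitation the surrounding slides on local optimization are concerned with.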

  44. Nonlinear Transformation [ JCB ’06 ] • One-to-one correspondence of the critical points (Original Function ↔ Dynamical System) • Local Minimum ↔ Stable Equilibrium Point • Saddle Point ↔ Decomposition Point • Local Maximum ↔ Source • Likelihood Function ↔ Energy Function

  45. Experimental Results [ IEEE PAMI ’08 ]

  46. Finding Motifs using Probabilistic Models

  47. Results

  48. Results Different Motifs and the average score using random starts. The first tier and second tier improvements [ BMC AMB ’06 ]

  49. Neural Network Diagram • Inputs: $x_i$ • Output: $y$ • Weights: $w_{ij}$ • Biases: $b_i$ • Targets: $t$ • # of Input Nodes: $n$ • # of Hidden Layers: 1 • # of Hidden Nodes: $k$ • # of Output Nodes: 1
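
The network described here (n inputs, one hidden layer of k nodes, one output) can be sketched as a forward pass. The slide does not state the activation or error function, so the sigmoid hidden units, linear output and squared error below are assumptions; all names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, hidden_weights, hidden_biases, output_weights, output_bias):
    """Compute the output y of a network with one hidden layer.

    hidden_weights: k rows of n weights (one row per hidden node)
    hidden_biases:  k biases
    output_weights: k weights from the hidden nodes to the single output
    Assumption: sigmoid hidden activations and a linear output node.
    """
    hidden = [
        sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(hidden_weights, hidden_biases)
    ]
    return sum(w * h for w, h in zip(output_weights, hidden)) + output_bias


def squared_error(y, t):
    """Assumed training criterion: half the squared difference from the target."""
    return 0.5 * (y - t) ** 2


# n = 3 inputs, k = 2 hidden nodes, 1 output (hypothetical values).
x = [1.0, 2.0, 3.0]
hidden_weights = [[0.1, -0.2, 0.05], [0.3, 0.1, -0.1]]
hidden_biases = [0.0, 0.1]
output_weights = [0.5, -0.4]
output_bias = 0.2

y = forward(x, hidden_weights, hidden_biases, output_weights, output_bias)
print(y, squared_error(y, t=1.0))
```

Training adjusts the weights and biases to reduce the error against the targets. The error surface is nonlinear in those parameters, so a local optimizer can settle on a sub-optimal set of weights.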

  50. Results – Classification Error (%) [ IJCNN ’07 ]
