
Intro to Data Mining

Natasha Balac, Ph.D., Director, Predictive Analytics Center of Excellence, San Diego Supercomputer Center, University of California, San Diego.


Presentation Transcript


  1. Intro to Data Mining • Natasha Balac, Ph.D. • Director, Predictive Analytics Center of Excellence • San Diego Supercomputer Center • University of California, San Diego

  2. Necessity is the Mother of Invention • Problem • Data explosion • Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories • “We are drowning in data, but starving for knowledge!” (John Naisbitt, 1982)

  3. Necessity is the Mother of Invention • Solution • Predictive Analytics or Data Mining • Extraction or “mining” of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases • Data-driven discovery and modeling of hidden patterns in large volumes of data • Extraction of implicit, previously unknown and unexpected, potentially extremely useful information from data

  4. Analytical Tools • Surface: SQL tools for simple queries and reporting • Shallow: Statistical & OLAP tools for summaries and analysis • Hidden: Data mining methods for knowledge discovery • (Diagram labels: Top-Down Methodology, Bottom-Up Methodology, Predictive Analytics, Data)

  5. DM Enables Predictive Analytics • (Diagram: Role of Software vs. Business Insight) • Role of software: Passive, Interactive, Proactive • Tools: Canned reporting, Ad-hoc reporting, OLAP, Data mining, Predictive Analytics • Business insight: Presentation, Exploration, Discovery

  6. What Is Data Mining? • Data mining: • Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases

  7. Data Mining is NOT • Data Warehousing • (Deductive) query processing • SQL/Reporting • Software Agents • Expert Systems • Online Analytical Processing (OLAP) • Statistical Analysis Tool • Data visualization

  8. Machine Learning Techniques • Technical basis for data mining: algorithms for acquiring structural descriptions from examples • Methods originate from artificial intelligence, statistics, and research on databases

  9. Multidisciplinary Field • (Diagram: Data Mining and its contributing disciplines) • Database Technology • Statistics • Machine Learning • Visualization • Artificial Intelligence • Other Disciplines

  10. Multidisciplinary Field • Database technology • Artificial Intelligence • Machine Learning including Neural Networks • Statistics • Pattern recognition • Knowledge-based systems/acquisition • High-performance computing • Data visualization

  11. History of Data Mining

  12. History • Emerged in the late 1980s • Flourished in the 1990s • Roots traced back along three family lines • Classical Statistics • Artificial Intelligence • Machine Learning

  13. Statistics • Foundation of most DM technologies • Regression analysis, standard distribution/deviation/variance, cluster analysis, confidence intervals • Building blocks • Significant role in today’s data mining – but alone is not powerful enough

  14. Artificial Intelligence • Heuristics vs. Statistics • Human-thought-like processing • Requires vast computer processing power • Supercomputers

  15. Machine Learning • Union of statistics and AI • Blends AI heuristics with advanced statistical analysis • Lets computer programs learn about the data they study and make different decisions based on the quality of that data • Uses statistics for fundamental concepts and adds more advanced AI heuristics and algorithms

  16. Data Mining • Adoption of machine learning techniques for real-world problems • Union: Statistics, AI, Machine learning • Used to find previously hidden trends or patterns • Finding increasing acceptance in science and business areas that need to analyze large amounts of data to discover trends which could not be found otherwise

  17. Terminology • Gold Mining • Knowledge mining from databases • Knowledge extraction • Data/pattern analysis • Knowledge Discovery in Databases (KDD) • Information harvesting • Business intelligence

  18. What Does Data Mining Do? • Explores your data • Finds patterns • Performs predictions

  19. What can we do with Data Mining? • Exploratory Data Analysis • Predictive Modeling: Classification and Regression • Descriptive Modeling • Cluster analysis/segmentation • Discovering Patterns and Rules • Association/Dependency rules • Sequential patterns • Temporal sequences • Deviation detection

  20. CRISP-DM: Cross-Industry Standard Process for Data Mining • (Diagram: CRISP-DM process model)

  21. Data Mining Applications • Science: Chemistry, Physics, Medicine • Biochemical analysis, remote sensors on a satellite, telescopes (star/galaxy classification), medical image analysis • Bioscience • Sequence-based analysis, protein structure and function prediction, protein family classification, microarray gene expression • Pharmaceutical companies, Insurance and Health care, Medicine • Drug development, identifying successful medical therapies, claims analysis, fraudulent behavior, medical diagnostic tools, predicting office visits • Financial Industry, Banks, Businesses, E-commerce • Stock and investment analysis, identifying loyal vs. risky customers, predicting customer spending, risk management, sales forecasting

  22. Data Mining Applications • Market analysis and management • Target marketing, customer relationship management, market basket analysis, cross selling, market segmentation (Credit Card scoring, Personalization & Customer Profiling) • Risk analysis and management • Forecasting, customer retention, improved underwriting, quality control, competitive analysis (Banking and Credit Card scoring) • Fraud detection and management • Fraud, waste and abuse including: illegal practices, waste, payment error, non-compliance, incorrect billing practices

  23. Data Mining Applications • Sports and Entertainment • IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for New York Knicks and Miami Heat • Astronomy • JPL and the Palomar Observatory discovered 22 quasars with the help of data mining • Campaign Management and Database Marketing

  24. Data Mining Tasks • Concept/Class description: Characterization and discrimination • Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions; “normal” vs. fraudulent behavior • Association (correlation and causality) • Multi-dimensional interactions and associations • age(X, “20-29”) ^ income(X, “60-90K”) → buys(X, “TV”) • Hospital(area code) ^ procedure(X) → claim(type) ^ claim(cost)
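
An association rule such as "age 20-29 and income 60-90K implies buys TV" is usually judged by its support (how often the whole rule holds) and its confidence (how often the consequent holds when the antecedent does). The following is a minimal sketch of both measures; the transactions and the item encoding (`"age=20-29"` and so on) are hypothetical and chosen only to mirror the rule on this slide.

```python
from typing import FrozenSet, Iterable, List

# Hypothetical toy data: one set of attribute=value items per customer.
TRANSACTIONS: List[FrozenSet[str]] = [
    frozenset({"age=20-29", "income=60-90K", "buys=TV"}),
    frozenset({"age=20-29", "income=60-90K", "buys=TV"}),
    frozenset({"age=20-29", "income=60-90K"}),
    frozenset({"age=30-39", "income=60-90K", "buys=TV"}),
    frozenset({"age=20-29", "income=30-60K"}),
]


def support(itemset: Iterable[str], transactions: List[FrozenSet[str]]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def confidence(antecedent: Iterable[str], consequent: Iterable[str],
               transactions: List[FrozenSet[str]]) -> float:
    """support(antecedent and consequent) / support(antecedent)."""
    a = frozenset(antecedent)
    both = a | frozenset(consequent)
    return support(both, transactions) / support(a, transactions)


RULE_IF = {"age=20-29", "income=60-90K"}
RULE_THEN = {"buys=TV"}

rule_support = support(RULE_IF | RULE_THEN, TRANSACTIONS)
rule_confidence = confidence(RULE_IF, RULE_THEN, TRANSACTIONS)
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

In practice an algorithm such as Apriori searches for all rules above chosen support and confidence thresholds; this sketch only scores a single rule that is already given.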

  25. Data Mining Tasks • Classification and Prediction • Finding models (functions) that describe and distinguish classes or concepts for future prediction • Example: classify countries based on climate, classify cars based on gas mileage, or detect fraud based on claims information • Presentation: • IF-THEN rules, decision tree, classification rule, neural network • Prediction: Predict some unknown or missing numerical values

  26. Data Mining Tasks • Cluster analysis • Class label is unknown: Group data to form new classes • Example: cluster claims or providers to find distribution patterns of unusual behavior • Clustering based on the principle: maximizing the intra-class similarity and minimizing the interclass similarity
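
The clustering principle above (high similarity within a cluster, low similarity between clusters) can be illustrated with k-means, one of the methods named later in the deck. This is a minimal sketch in plain Python, not a production implementation: the two-dimensional points and the choice of initial centroids are hypothetical, and real tools typically add random restarts and smarter initialization.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def squared_distance(p: Point, q: Point) -> float:
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def mean_point(points: Sequence[Point]) -> Point:
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def k_means(points: Sequence[Point], initial_centroids: Sequence[Point],
            max_iterations: int = 100) -> Tuple[List[Point], List[int]]:
    """Alternate between assigning points and recomputing centroids."""
    centroids = list(initial_centroids)
    labels: List[int] = []
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)),
                key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            new_centroids.append(mean_point(members) if members else old)
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    return centroids, labels


# Hypothetical data: two visually separate groups.
POINTS: List[Point] = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
                       (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = k_means(POINTS, initial_centroids=[POINTS[0], POINTS[3]])
print(labels, centroids)
```

Because no class labels are supplied, the cluster indices returned here are arbitrary names for the discovered groups.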

  27. Data Mining Tasks • Outlier analysis • Outlier: a data object that does not comply with the general behavior of the data • Mostly considered as noise or exception, but is quite useful in fraud detection, rare events analysis • Trend and evolution analysis • Trend and deviation: regression analysis • Sequential pattern mining, periodicity analysis
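
One simple way to flag an object that "does not comply with the general behavior of the data" is a z-score test: measure how many standard deviations each value lies from the mean. The sketch below assumes a single numeric attribute; the claim amounts and the threshold of 2.0 are hypothetical, and the slide does not prescribe this particular method.

```python
import statistics
from typing import List


def z_score_outliers(values: List[float], threshold: float = 2.0) -> List[float]:
    """Return the values whose absolute z-score exceeds `threshold`."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mean) / stdev > threshold]


# Hypothetical claim amounts with one unusually large claim.
CLAIMS = [100.0, 102.0, 98.0, 101.0, 99.0, 100.0, 500.0]
outliers = z_score_outliers(CLAIMS)
print(outliers)
```

A large outlier inflates both the mean and the standard deviation, so on small samples a robust alternative (for example one based on the median) is often preferred.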

  28. KDD Process • (Diagram labels) Database • Selection • Transformation • Data Preparation • Training Data • Data Mining • Model, Patterns • Evaluation, Verification

  29. Learning and Modeling Methods • Decision Tree Induction (C4.5, J48) • Regression Tree Induction (CART, M5P) • Multivariate Adaptive Regression Splines (MARS) • Clustering (K-means, EM, Cobweb) • Artificial Neural Networks (Backpropagation, Recurrent) • Support Vector Machines (SVM) • Various other models

  30. TAXONOMY • Predictive Methods • Use some variables to predict some unknown or future values of other variables • Descriptive Methods • Find human-interpretable patterns that describe the data • Supervised vs. Unsupervised • Labeled vs. unlabeled data

  31. Data Mining Challenges • Computationally expensive to investigate all possibilities • Dealing with noise/missing information and errors in data • Mining methodology and user interaction • Mining different kinds of knowledge in databases • Incorporation of background knowledge • Handling noise and incomplete data • Pattern evaluation: the interestingness problem • Expression and visualization of data mining results

  32. Data Mining Heuristics and Guide • Choosing appropriate attributes/input representation • Finding the minimal attribute space • Finding adequate evaluation function(s) • Extracting meaningful information • Model evaluation: accuracy vs. overfitting
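
The accuracy vs. overfitting trade-off in the last bullet is easiest to see by scoring a model on data it was not trained on. The sketch below compares a 1-nearest-neighbor model, which memorizes the training set, with a simple threshold rule. All data here is hypothetical and deliberately constructed (two training labels are noisy) to make the effect visible; it is an illustration, not a benchmark.

```python
from typing import Callable, List, Tuple

Example = Tuple[float, int]  # (attribute value, class label)

# Hypothetical training data; (3.0, 1) and (8.0, 0) act as label noise.
TRAIN: List[Example] = [(1.0, 0), (2.0, 0), (3.0, 1), (4.0, 0),
                        (6.0, 1), (7.0, 1), (8.0, 0), (9.0, 1)]
# Hypothetical holdout data that follows the underlying pattern.
HOLDOUT: List[Example] = [(1.5, 0), (2.6, 0), (3.4, 0),
                          (7.6, 1), (8.4, 1), (9.5, 1)]


def nearest_neighbor_predict(x: float) -> int:
    """Memorizing model: copy the label of the closest training example."""
    _, label = min(TRAIN, key=lambda example: abs(example[0] - x))
    return label


def threshold_predict(x: float) -> int:
    """Simple model: a single cut point at 5.0."""
    return 1 if x >= 5.0 else 0


def accuracy(predict: Callable[[float], int], examples: List[Example]) -> float:
    return sum(1 for x, y in examples if predict(x) == y) / len(examples)


results = {
    "1-NN": (accuracy(nearest_neighbor_predict, TRAIN),
             accuracy(nearest_neighbor_predict, HOLDOUT)),
    "threshold": (accuracy(threshold_predict, TRAIN),
                  accuracy(threshold_predict, HOLDOUT)),
}
for name, (train_acc, holdout_acc) in results.items():
    print(f"{name}: train={train_acc:.2f} holdout={holdout_acc:.2f}")
```

The memorizing model is perfect on the training data because it also fits the noise, and that is exactly what hurts it on the holdout set; the simpler rule gives up some training accuracy and generalizes better on this toy example.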

  33. Available Data Mining Tools • COTS: IBM Intelligent Miner, SAS Enterprise Miner, Oracle ODM, MicroStrategy, Microsoft DBMiner, Pentaho, Matlab, Teradata • Open source: WEKA, KNIME, Orange, RapidMiner, NLTK, R, Rattle

  34. Data Mining Trends • Many data mining problems involve large, complex databases, complicated modeling techniques and substantial computer processing • Scalability is very important due to the rapid growth in the amount of data and the need to build and deploy models at rapid rates

  35. Needs for Scalable High Performance Data Mining • Database size, as well as the demands of building, testing, validating and deploying a data mining solution • Taking advantage of parallel database/file systems and additional CPUs • Work with more data, build more models, and improve their accuracy by simply adding additional CPUs • Build a good data mining model as quickly as possible!

  36. Scalable High Performance Data Mining • One solution: run parallel data mining algorithms and parallel DBMSs on parallel hardware • Once the model is built, parallel computing is still needed to apply the model to a large amount of data • Scoring: the data mining model is applied to a batch of data, or to one record or event at a time; this requires that a scalable solution be deployed

  37. Data Mining Applications on Gordon • DM suites • Computational packages with DM tools (MathWorks, Octave) • Libraries for building tools • Others as requested

  38. Summary • Discovering interesting patterns from large amounts of data • CRISP-DM Industry standard • Learn from the past • High quality, evidence based decisions • Predict for the future • Prevent future instances of fraud, waste & abuse • React to changing circumstances • Current models, continuous learning

  39. Thank you! Questions: natashab@sdsc.edu

  40. An Introduction to the Gordon Architecture San Diego Supercomputer Center

  41. Gordon Design Partnership

  42. Gordon is not about FLOPS, but … A conservative estimate puts Gordon in the top 50 on the Top 500 list

  43. Access to Big Data Comes with a Latency Penalty • (Figure: memory and storage hierarchy; axis notes "lower is better" and "higher is better") • L1 Cache: KB • L2 Cache: KB • L3 Cache: MB • DDR3 Memory: 10’s of GB • Quick Path Interconnect: 10’s of GB • QDR InfiniBand Interconnect: 100’s of GB • 64 I/O nodes: 300 TB Intel SSD • Data Oasis Lustre: 4 PB PFS


  45. Flash Drives are a Good Fit for Data Intensive Computing • Apart from the differences between HDD and SSD, it is not common to find local storage “close” to the compute. We have found this to be attractive in our Trestles cluster, which has local flash on the compute, but is used for traditional HPC applications (not high IOPS).

  46. Flash Glossary of Terms

  47. Gordon System Specification * May be revised

  48. Gordon Applications Software • Chemistry: adf, amber, gamess, gaussian, gromacs, lammps, namd, nwchem • Visualization: idl, NCL, paraview, tecplot, visit, VTK • Genomics: abyss, blast, hmmer, soapdenovo, velvet • Data mining: IntelligentMiner, RapidMiner, RATTLE, Weka • Libraries: ATLAS, BLACS, fftw, HDF5, Hypre, SPRNG, superLU • Distributed computing: globus, Hadoop, MapReduce • Compilers/languages: gcc, intel, pgi; MATLAB, Octave, R; PGAS (UPC) • DB2, PostgreSQL

  49. MapReduce • Paradigmatic example: string counting • Scheduler: manages threads, initiates the data split and calls Map • Map: counts strings, outputs key=string & value=count • Scheduler: re-partitions keys & values • Reduce: sums up counts • MR provides parallelization, concurrency, and intermediate data functions (by key & value) • User defines keys & values • User-defined functions
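
The string-counting flow on this slide (split, Map, re-partition by key, Reduce) can be sketched in a few lines of Python. This is a single-process illustration of the data flow only: the function names and the input chunks are hypothetical, and a real framework such as Hadoop would run the map and reduce calls in parallel across nodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

KeyValue = Tuple[str, int]


def map_count(chunk: str) -> List[KeyValue]:
    """Map: count strings in one chunk, emit key=string, value=count."""
    counts: Dict[str, int] = defaultdict(int)
    for word in chunk.split():
        counts[word] += 1
    return list(counts.items())


def repartition(mapped: Iterable[List[KeyValue]]) -> Dict[str, List[int]]:
    """Scheduler step: group all emitted values by key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_sum(key: str, values: List[int]) -> KeyValue:
    """Reduce: sum up the counts for one key."""
    return key, sum(values)


def string_count(chunks: List[str]) -> Dict[str, int]:
    mapped = [map_count(chunk) for chunk in chunks]  # one Map call per split
    grouped = repartition(mapped)
    return dict(reduce_sum(key, values) for key, values in grouped.items())


# Hypothetical input, already split into two chunks.
CHUNKS = ["data mining data", "mining big data"]
totals = string_count(CHUNKS)
print(totals)
```

Only `map_count` and `reduce_sum` correspond to the user-defined functions; the splitting and re-partitioning are what the framework provides.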
