High Performance Data Mining
Presentation Transcript

  1. High Performance Data Mining Vipin Kumar Army High Performance Computing Research Center Department of Computer Science University of Minnesota http://www.cs.umn.edu/~kumar Research sponsored by AHPCRC/ARL, DOE, NASA, and NSF

  2. Overview • Introduction to Data Mining (What, Why, and How?) • Issues and Challenges in Designing Parallel Data Mining Algorithms • Case Study: Discovery of Patterns in Global Climate Data using Data Mining • Summary

  3. What is Data Mining? • Many Definitions • Non-trivial extraction of implicit, previously unknown and potentially useful information from data • Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns

  4. What is (not) Data Mining? • What is not Data Mining? • Look up a phone number in a phone directory • Query a Web search engine for information about “Amazon” • What is Data Mining? • Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in the Boston area) • Group together similar documents returned by a search engine according to their context (e.g. Amazon rainforest, Amazon.com)

  5. Why Mine Data? Commercial Viewpoint • Lots of data is being collected and warehoused • Web data, e-commerce • purchases at department/grocery stores • Bank/Credit Card transactions • Computers have become cheaper and more powerful • Competitive Pressure is Strong • Provide better, customized services for an edge (e.g. in Customer Relationship Management)

  6. Why Mine Data? Scientific Viewpoint • Data collected and stored at enormous speeds (GB/hour) • remote sensors on a satellite • telescopes scanning the skies • microarrays generating gene expression data • scientific simulations generating terabytes of data • Traditional techniques infeasible for raw data • Data mining may help scientists • in classifying and segmenting data • in Hypothesis Formation

  7. From: R. Grossman, C. Kamath, V. Kumar, “Data Mining for Scientific and Engineering Applications” Mining Large Data Sets - Motivation • There is often information “hidden” in the data that is not readily evident • Human analysts may take weeks to discover useful information • Much of the data is never analyzed at all [Chart: “The Data Gap”: total new disk storage (TB) since 1995 vs. number of analysts]

  8. Origins of Data Mining • Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems • Traditional techniques may be unsuitable due to • Enormity of data • High dimensionality of data • Heterogeneous, distributed nature of data [Venn diagram: Statistics/AI, Machine Learning/Pattern Recognition, and Database Systems overlapping at Data Mining]

  9. Role of Parallel and Distributed Computing • Many algorithms use computation time more than O(n) • High Performance Computing (HPC) is often critical for scalability to large data sets • Sequential computers have limited memory • This may require multiple, expensive I/O passes over data • Data may be distributed • due to privacy reasons • physically dispersed over many different geographic locations [Venn diagram: Data Mining at the intersection of Statistics/AI, Machine Learning/Pattern Recognition, Database Systems, and High Performance Computing]

  10. Data Mining Tasks • Data Clustering • Predictive Modeling • Anomaly Detection • Association Rules

  11. Predictive Modeling • Find a model for the class attribute as a function of the values of the other attributes • Example: learn a classifier for tax evasion from categorical (Refund, Marital Status) and continuous (Taxable Income) attributes [Learned decision tree: Refund = Yes → NO; Refund = No and Married → NO; Refund = No, Single/Divorced, Income < 80K → NO, otherwise YES]
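A learned decision tree of this kind is just a nest of attribute tests. The sketch below applies a tree like the one on this slide; the thresholds are reconstructed from the slide and the record format (three arguments, income in thousands) is an assumption for illustration.

```python
# Sketch: applying a learned tax-evasion decision tree (reconstructed
# from the slide; record format and units are illustrative assumptions).

def predict_evade(refund: str, marital: str, income_k: float) -> str:
    """Classify a taxpayer record as 'YES' (evades) or 'NO'."""
    if refund == "Yes":            # categorical split on Refund
        return "NO"
    if marital == "Married":       # categorical split on Marital Status
        return "NO"
    # continuous split on taxable income (in $1000s)
    return "NO" if income_k < 80 else "YES"

print(predict_evade("No", "Single", 95))  # → YES
```

Training the tree means choosing these tests automatically from labeled data, which is the subject of the decision-tree slides later in the talk.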

  12. Predictive Modeling: Applications • Targeted Marketing • Customer Attrition/Churn • Classifying Galaxies • Class: • Stages of Formation Early • Attributes: • Image features, • Characteristics of light waves received, etc. Intermediate Late • Sky Survey Data Size: • 72 million stars, 20 million galaxies • Object Catalog: 9 GB • Image Database: 150 GB Courtsey: http://aps.umn.edu

  13. Clustering • Given a set of data points, find groupings such that • Data points in one cluster are more similar to one another • Data points in separate clusters are less similar to one another
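The grouping criterion above can be made concrete with k-means, the algorithm used later in the talk for the climate data. This is a minimal one-dimensional sketch; the points, initial centers, and iteration count are illustrative.

```python
# Minimal 1-D k-means sketch: alternate assignment and update steps.
# Data and starting centers are made up for illustration.

def kmeans_1d(points, centers, iters=10):
    for _ in range(iters):
        # assignment step: each point joins its nearest center
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda j: abs(p - centers[j]))
            clusters[i].append(p)
        # update step: each center moves to the mean of its cluster
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

centers, clusters = kmeans_1d([1.0, 1.2, 0.8, 9.0, 9.5, 8.5], [0.0, 5.0])
print(centers)  # the two centers settle near the two tight groups
```

Points in one cluster end up closer to their own center than to the other, matching the similarity criterion on the slide.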

  14. Clustering: Applications • Market Segmentation • Gene expression clustering • Document Clustering

  15. Association Rule Discovery • Given a set of records, find dependency rules which will predict occurrence of an item based on occurrences of other items in the record • Applications • Marketing and Sales Promotion • Supermarket shelf management • Inventory Management Rules Discovered: {Milk} --> {Coke} (s=0.6, c=0.75) {Diaper, Milk} --> {Beer} (s=0.4, c=0.67)
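The s (support) and c (confidence) numbers attached to the discovered rules can be computed directly. The slide's basket table is not shown in the transcript, so the five baskets below are an assumption, chosen so that {Diaper, Milk} → {Beer} gets s=0.4, c≈0.67 as on the slide.

```python
# Support and confidence of an association rule over market baskets.
# The baskets are illustrative (the slide's table is not reproduced).

baskets = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    # fraction of baskets containing every item in the itemset
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(lhs, rhs):
    # of the baskets containing lhs, the fraction also containing rhs
    return support(lhs | rhs) / support(lhs)

print(support({"Diaper", "Milk", "Beer"}))        # s = 0.4
print(confidence({"Diaper", "Milk"}, {"Beer"}))   # c ≈ 0.67
```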

  16. Deviation/Anomaly Detection • Detect significant deviations from normal behavior • Applications: • Credit Card Fraud Detection • Network Intrusion Detection • Typical network traffic at a university may exceed 100 million connections per day

  17. General Issues and Challenges in Parallel Data Mining • Dense vs. Sparse • Structured vs. Unstructured • Static vs. Dynamic • Data mining computations tend to be unstructured, sparse, and dynamic.

  18. Specific Issues and Challenges in Parallel Data Mining • Disk I/O • Data is often too large to fit in main memory • Spatial locality is critical • Hash Tables • Many efficient data mining algorithms require fast access to large hash tables.

  19. Constructing a Decision Tree • Key Computation: the class counts for each branch of each candidate split • Refund: Yes → Pay 3, Evade 0; No → Pay 4, Evade 3 • Marital Status: Married → Pay 4, Evade 0; Single/Divorced → Pay 3, Evade 3
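These per-branch class counts are exactly what a splitting criterion consumes. A common choice (an assumption here; the slide does not name one) is weighted Gini impurity, computed below from the slide's Refund counts (Yes: 3 pay / 0 evade, No: 4 pay / 3 evade).

```python
# The slide's "key computation" turned into a split score: class counts
# per branch, then weighted Gini impurity. Gini as the criterion is an
# assumed example; the counts come from the slide's Refund table.

def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

branches = {"Refund=Yes": [3, 0], "Refund=No": [4, 3]}
total = sum(sum(c) for c in branches.values())

# weight each branch's impurity by the fraction of records it receives
weighted = sum(sum(c) / total * gini(c) for c in branches.values())
print(round(weighted, 3))  # → 0.343
```

A lower weighted impurity means a purer split; the tree builder evaluates this score for every candidate attribute and picks the best.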

  20. Constructing a Decision Tree [Diagram: the data is partitioned into Refund = Yes and Refund = No subsets]

  21. Constructing a Decision Tree in Parallel • Partitioning of data only (n records, m categorical attributes per processor) • only a global reduction per tree node is required • a large number of classification tree nodes gives high communication cost

  22. Constructing a Decision Tree in Parallel • Partitioning of classification tree nodes • natural concurrency • load imbalance as the amount of work associated with each node varies • child nodes use the same data as used by the parent node • loss of locality • high data movement cost [Diagram: 10,000 training records split into children of 7,000 and 3,000 records, then into 2,000 / 5,000 and 2,000 / 1,000]

  23. Challenges in Constructing Parallel Classifier • Partitioning of data only • large number of classification tree nodes gives high communication cost • Partitioning of classification tree nodes • natural concurrency • load imbalance as the amount of work associated with each node varies • child nodes use the same data as used by parent node • loss of locality • high data movement cost • Hybrid algorithms: partition both data and tree
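The data-partitioning half of the hybrid can be simulated serially: each "processor" counts classes over its own chunk of records, and the per-node global reduction is just a sum of the local counters. The records and partitioning below are illustrative.

```python
# Simulated "partition the data only" scheme: local class counts per
# processor chunk, then a global reduction. Records are illustrative
# (only the class label matters for this step).

from collections import Counter

records = ["pay", "evade", "pay", "pay", "evade", "pay", "pay"]
n_proc = 2

# deal records out to processors round-robin
chunks = [records[i::n_proc] for i in range(n_proc)]

local = [Counter(chunk) for chunk in chunks]   # local counting, no communication
global_counts = sum(local, Counter())          # the global reduction
print(dict(global_counts))
```

The communication per tree node is just these small count vectors, which is why this scheme is cheap until the number of tree nodes grows large.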

  24. Experimental Results (Srivastava, Han, Kumar, and Singh, 1999) • Data set • function 2 data set discussed in the SLIQ paper (Mehta, Agrawal and Rissanen, EDBT’96) • 2 class labels, 3 categorical and 6 continuous attributes • IBM SP2 with 128 processors • 66.7 MHz CPU with 256 MB real memory • AIX version 4 • high performance switch

  25. Speedup Comparison of the Three Parallel Algorithms [Charts: 0.8 million examples; 1.6 million examples]

  26. Splitting Criterion Verification in the Hybrid Algorithm [Charts: 0.8 million examples on 8 processors; 1.6 million examples on 16 processors]

  27. Speedup of the Hybrid Algorithm with Different Size Data Sets

  28. Scaleup of the Hybrid Algorithm

  29. Hash Table Access • Some efficient decision tree algorithms require random access to large data structures • Example: SPRINT (Shafer, Agrawal, Mehta) • Storing the entire hash table on one processor makes the algorithm unscalable.
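The scalable alternative (used by ScalParC on the next slide) is to shard the hash table so each key has exactly one owning processor. A minimal sketch of that ownership rule, with made-up record ids and a trivial hash:

```python
# Sketch of a distributed hash table: each key is owned by one
# processor, determined by a deterministic hash. Keys, the hash, and
# the stored values are illustrative.

P = 4  # number of (simulated) processors

def owner(key: int) -> int:
    return key % P  # trivial hash; any deterministic hash works

# per-processor shards of a record-id -> tree-node mapping
shards = [{} for _ in range(P)]
for record_id in range(12):
    shards[owner(record_id)][record_id] = "node-root"

print([len(s) for s in shards])  # → [3, 3, 3, 3]
```

Every lookup or update for a key goes to exactly one processor, so no processor ever needs memory for the whole table.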

  30. ScalParC (Joshi, Karypis, Kumar, 1998) • ScalParC is a scalable parallel decision tree construction algorithm • Scales to large number of processors • Scales to large training sets • ScalParC is memory efficient • The hash-table is distributed among the processors • ScalParC performs minimum amount of communication

  31. This Design is Inspired by.. • Communication Structure of Parallel Sparse Matrix-Vector Algorithms.

  32. Parallel Runtime (Joshi, Karypis, Kumar, 1998) • 128-processor Cray T3D

  33. Computing Association Patterns 1. Market-basket transactions 2. Find item combinations (itemsets) that occur frequently in data 3. Generate association rules

  34. Computing Association Patterns Requires Exponential Computation • Given m items, there are 2^m − 1 possible item combinations • e.g. for m = 4: {a} {b} {c} {d} {a,b} {a,c} {a,d} {b,c} {b,d} {c,d} {a,b,c} {a,b,d} {a,c,d} {b,c,d} {a,b,c,d}
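The 2^m − 1 count is easy to verify by enumerating the lattice with the standard library, here for the slide's four items:

```python
# Enumerate all nonempty itemsets of m items; there are 2**m - 1.

from itertools import combinations

items = ["a", "b", "c", "d"]
itemsets = [frozenset(c)
            for r in range(1, len(items) + 1)
            for c in combinations(items, r)]

print(len(itemsets))  # → 15, i.e. 2**4 - 1
```

At m in the thousands this lattice is astronomically large, which is why the next slide's support-based pruning is essential.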

  35. Handling Exponential Complexity • Given n transactions and m different items: • number of possible association rules: 3^m − 2^(m+1) + 1 • computation complexity: up to O(n · m · 2^m) if every candidate is checked against every transaction • Systematic search for all patterns, based on a support constraint [Agrawal & Srikant]: • If {A,B} has support at least a, then both A and B have support at least a. • If either A or B has support less than a, then {A,B} has support less than a. • Use patterns of k-1 items to find patterns of k items.

  36. Illustrating Apriori Principle (Agrawal and Srikant, 1994) • Minimum Support = 3 • Level 1: items (1-itemset candidates) • Level 2: pairs of frequent items (2-itemset candidates) • Level 3: triplets (3-itemset candidates) • If every subset is considered: 6C1 + 6C2 + 6C3 = 41 candidates • With support-based pruning: 6 + 6 + 1 = 13
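The pruning works level by level: candidate k-itemsets are generated only from itemsets that survived level k-1. A sketch for the first two levels, on made-up transactions with an illustrative support threshold:

```python
# Support-based (Apriori) pruning, levels 1 and 2 only.
# Transactions and minsup are illustrative.

from itertools import combinations

transactions = [{"Bread", "Milk"},
                {"Bread", "Diaper", "Beer"},
                {"Milk", "Diaper", "Beer"},
                {"Bread", "Milk", "Diaper"},
                {"Bread", "Milk", "Coke"}]
minsup = 3

def count(itemset):
    return sum(itemset <= t for t in transactions)

# level 1: keep only frequent single items
f1 = {i for t in transactions for i in t if count({i}) >= minsup}

# level 2: candidate pairs come from frequent items only
c2 = [set(p) for p in combinations(sorted(f1), 2)]
f2 = [c for c in c2 if count(c) >= minsup]
print(sorted(map(sorted, f2)))
```

With 5 distinct items there are 10 possible pairs, but pruning leaves only 3 candidates to count, mirroring the 41-vs-13 reduction on the slide.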

  37. Counting Candidates • Frequent itemsets are found by counting candidates • Simple way: search for each of the M candidates in each of the N transactions (expensive!)
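The "simple way" is an M-by-N nested scan, sketched below on illustrative data; hash trees exist precisely to avoid testing every candidate against every transaction.

```python
# The naive O(M*N) candidate-counting loop from the slide.
# Candidates and transactions are illustrative.

candidates = [frozenset(c) for c in ({"a", "b"}, {"a", "c"}, {"b", "c"})]
transactions = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"}, {"a", "d"}]

counts = {c: 0 for c in candidates}
for t in transactions:        # N transactions ...
    for c in candidates:      # ... times M candidates
        if c <= t:            # subset test
            counts[c] += 1

print({tuple(sorted(c)): n for c, n in counts.items()})
```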

  38. Parallel Formulation of Association Rules (Han, Karypis, and Kumar, 2000) • Need: • Huge Transaction Datasets (10s of TB) • Large Number of Candidates • How? • Partition the Transaction Database among processors • communication needed for global counts • local memory on each processor should be large enough to store the entire hash tree • Partition the Candidates among processors • redundant I/O for transactions • Partition both Candidates and Transaction Database
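The first scheme (partition the transactions, keep all candidates everywhere) can be simulated serially: each "processor" counts every candidate over its local transactions, then the local counts are summed, standing in for the global-count communication step. Data below is illustrative.

```python
# Simulated transaction-partitioned counting: local counts per
# processor, then a global reduction. Transactions are illustrative.

from collections import Counter
from itertools import combinations

partitions = [  # transaction database split across 2 processors
    [{"a", "b", "c"}, {"a", "b"}],
    [{"b", "c"}, {"a", "b", "c"}],
]
candidates = [frozenset(p) for p in combinations("abc", 2)]

def local_counts(transactions):
    # every processor holds ALL candidates (the memory cost the slide notes)
    return Counter({c: sum(c <= t for t in transactions)
                    for c in candidates})

global_counts = sum((local_counts(p) for p in partitions), Counter())
print({tuple(sorted(c)): n for c, n in global_counts.items()})
```

Only the count vectors cross processor boundaries; the transactions never move, which is the scheme's appeal when the dataset is tens of terabytes.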

  39. Parallel Association Rules: Scaleup Results (100K, 0.25%) (Han, Karypis, and Kumar, 2000)

  40. Parallel Association Rules: Response Time (np=64, 50K) (Han, Karypis, and Kumar, 2000)

  41. Discovery of Patterns in the Global Climate System • Research Goals: • Find global climate patterns of interest to Earth Scientists • Global snapshots of values for a number of variables on land surfaces or water • Monthly, over a range of 10 to 50 years • # grid points: 67K land, 40K ocean • Current data size range: 20–400 MB

  42. Importance of Global Climate Patterns and NPP • Net Primary Production (NPP) is the net assimilation of atmospheric carbon dioxide (CO2) into organic matter by plants. • Keeping track of NPP is important because it includes the food source of humans and all other organisms. • NPP is impacted by global climate patterns. Image from http://www.pmel.noaa.gov/co2/gif/globcar.png

  43. Patterns of Interest • Zone Formation • Find regions of the land or ocean which have similar behavior. • Associations • Find relations between climate events and land cover. • Teleconnections • Teleconnections are the simultaneous variation in climate and related processes over widely separated points on the Earth. • El Niño is associated with droughts in Australia and Southern Africa and heavy rainfall along the western coast of South America. [Chart: Sea Surface Temperature Anomalies off Peru (ANOM 1+2)]

  44. Clustering of Raw NPP and Raw SST (Num clusters = 2)

  45. K-Means Clustering of Raw NPP and Raw SST (Num clusters = 2) Land Cluster Cohesion: North = 0.78 South = 0.59 Ocean Cluster Cohesion: North = 0.77 South = 0.80

  46. Ocean Climate Indices: Connecting the Ocean and the Land • An OCI is a time series of temperature or pressure • Based on Sea Surface Temperature (SST) or Sea Level Pressure (SLP) • OCIs are important because • They distill climate variability at a regional or global scale into a single time series. • They are related to well-known climate phenomena such as El Niño.

  47. Ocean Climate Indices – ANOM 1+2 • ANOM 1+2 is associated with El Niño and La Niña. • Defined as the Sea Surface Temperature (SST) anomalies in a region off the coast of Peru • El Niño is associated with • Droughts in Australia and Southern Africa • Heavy rainfall along the western coast of South America • Milder winters in the Midwest [Chart: El Niño events]
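"Anomalies" here means departures from the climatological mean for each calendar month. A minimal sketch of turning a raw monthly SST series into such an anomaly index; the temperatures below are made up and the series is shortened to three months per year for readability.

```python
# Sketch: monthly anomalies = raw value minus that calendar month's
# mean across years. SST values are made up; 3 months/year for brevity.

sst = [20.0, 22.0, 24.0,   # year 1 (months 1-3)
       21.0, 22.0, 25.0,   # year 2
       19.0, 22.0, 23.0]   # year 3
n_months = 3

# climatology: the mean of each calendar month across all years
clim = [sum(sst[m::n_months]) / (len(sst) // n_months)
        for m in range(n_months)]

anomalies = [v - clim[i % n_months] for i, v in enumerate(sst)]
print(anomalies)
```

Subtracting the climatology removes the seasonal cycle, so what remains are the year-to-year departures that indices like ANOM 1+2 track.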

  48. Connection of ANOM 1+2 to Land Temp OCIs capture teleconnections, i.e., the simultaneous variation in climate and related processes over widely separated points on the Earth.
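One common way to quantify such a connection (a sketch; the talk does not specify its exact measure) is the Pearson correlation between the OCI time series and the land variable at each grid point. Pure-Python version, with made-up series values:

```python
# Pearson correlation between an OCI and land temperature anomalies at
# one grid point. Series values are illustrative.

from math import sqrt

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

oci = [0.5, 1.2, -0.3, -1.0, 0.8]        # e.g. an ANOM-style index
land_temp = [0.4, 1.0, -0.2, -0.9, 0.6]  # anomalies at one land point
print(round(pearson(oci, land_temp), 3))
```

Mapping this coefficient over all 67K land grid points yields the kind of teleconnection map this slide refers to.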

  49. Ocean Climate Indices - NAO • The North Atlantic Oscillation (NAO) is associated with climate variation in Europe and North America. • Defined as the normalized pressure difference between Ponta Delgada, Azores and Stykkisholmur, Iceland. • Associated with warm and wet winters in Europe and with cold and dry winters in northern Canada and Greenland • The eastern US experiences mild and wet winter conditions.

  50. Connection of NAO to Land Temp