Parallel and Distributed Computing for Data Mining
Presentation Transcript

  1. Parallel and Distributed Computing for Data Mining • Mohammad Alshamimi • April 28, 2011

  2. Outline • What is data mining • Techniques of Data Mining • Distributed Data Mining • Example: Credit Card Fraud Detection • Knowledge Grid

  3. What is data mining? • Data mining is the discovery of valuable information in large volumes of data, using computationally efficient techniques. • Some call data mining “knowledge discovery”

  4. Techniques in Data Mining • Clustering • Association rules • Classification • Sequential pattern • Outlier detection • Decision Trees

  5. Techniques in Data Mining: Clustering • Clustering partitions a given set of data points into distinct groups (clusters) based on similarity, so that similar data points fall in the same cluster • E.g. grouping companies with similar stock behavior or similar growth, or identifying genes and proteins that have similar functions
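
The slide does not name a clustering algorithm. As one possible illustration, the sketch below uses k-means (Lloyd's algorithm) on made-up 2-D points; the function name `kmeans`, the sample points, and the choice of k = 2 are assumptions for illustration only.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Lloyd's algorithm on a list of (x, y) tuples; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(
                range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2 + (p[1] - centroids[c][1]) ** 2,
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centroids, labels

# Hypothetical data: two visibly separated groups of points.
points = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8), (8.0, 8.1), (8.2, 7.9), (7.9, 8.3)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

Points that share a label were judged similar; which group receives label 0 depends on the random starting centroids.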

  6. Techniques in Data Mining: Association • Association rule mining finds all rules that correlate the presence of one set of items with that of another set of items. • E.g. identifying the items that sell together in a supermarket by mining the sales transactions
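
A minimal sketch of the supermarket example, assuming the usual support and confidence measures (the slide does not specify any). The function `pair_rules`, the thresholds, and the four baskets are hypothetical.

```python
from itertools import combinations

def pair_rules(transactions, min_support=0.5, min_confidence=0.7):
    """Return rules (lhs, rhs, support, confidence) between single items."""
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})

    def support(itemset):
        # Fraction of transactions that contain every item in itemset.
        return sum(1 for t in transactions if itemset <= t) / n

    rules = []
    for a, b in combinations(items, 2):
        pair_support = support({a, b})
        if pair_support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            # Confidence of lhs -> rhs: how often rhs appears when lhs does.
            confidence = pair_support / support({lhs})
            if confidence >= min_confidence:
                rules.append((lhs, rhs, pair_support, confidence))
    return rules

# Hypothetical sales transactions (one set of items per basket).
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
]
rules = pair_rules(transactions)
print(rules)  # [('butter', 'bread', 0.5, 1.0)]
```

With these baskets the only rule that passes both thresholds is "butter implies bread": every basket with butter also contains bread.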

  7. Techniques in Data Mining: Classification • Classification assigns objects to predefined categories or classes • E.g. in credit evaluation, each case is classified as good or bad

  8. Techniques in Data Mining: Sequential pattern discovery • Sequential pattern discovery determines strong sequential dependencies among different events • E.g. medical diagnosis or sales-transactions analysis to determine which customers are likely to buy a specific product in the future
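
As a rough sketch of the sales-transaction example, the code below counts how many customers bought one product and some other product later. The function `frequent_successions`, the support threshold, and the purchase histories are made up for illustration; practical sequential-pattern algorithms handle longer patterns and time constraints.

```python
from collections import Counter

def frequent_successions(sequences, min_support=2):
    """Find ordered pairs (first, later) that occur in at least min_support sequences."""
    counts = Counter()
    for seq in sequences:
        seen = set()
        for i, first in enumerate(seq):
            for later in seq[i + 1:]:
                if first != later:
                    seen.add((first, later))
        counts.update(seen)  # count each pattern at most once per customer
    return {pattern: c for pattern, c in counts.items() if c >= min_support}

# Hypothetical purchase histories, one ordered list per customer.
sequences = [
    ["phone", "case", "charger"],
    ["phone", "charger"],
    ["case", "phone"],
    ["laptop", "mouse"],
]
patterns = frequent_successions(sequences)
print(patterns)  # {('phone', 'charger'): 2}
```

Here two customers bought a charger after a phone, which suggests that recent phone buyers are candidates for a charger offer.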

  9. Techniques in Data Mining: Detection of outliers • Outlier detection finds data points that differ significantly from the majority of the data points in a given data set • E.g. medical diagnosis and credit card fraud detection
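
One simple way to make "differ significantly" concrete is a z-score test: flag values that lie more than a chosen number of standard deviations from the mean. This is only one of many outlier tests; the threshold of 2.0 and the amounts below are assumptions.

```python
import statistics

def zscore_outliers(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds threshold standard deviations."""
    mean = statistics.fmean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mean) / spread > threshold]

# Hypothetical card amounts: seven ordinary purchases and one unusual one.
amounts = [20, 22, 19, 21, 23, 20, 22, 500]
outliers = zscore_outliers(amounts)
print(outliers)  # [500]
```

A single extreme value inflates both the mean and the standard deviation, so robust variants (for example, median-based scores) are often preferred on real data.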

  10. Techniques in Data Mining: Decision Trees • A decision tree is a classifier built as a tree of decision points, each chosen to split the remaining data effectively • Example: deciding whether to play golf • Attributes: • Outlook • Windy • Humidity
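
The slide does not say how a split is judged effective. A common criterion, used here as an assumption, is information gain: the drop in label entropy after splitting on an attribute. The six play-golf rows are invented for illustration and are not taken from the presentation.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attribute, target="Play"):
    """Entropy of the target minus the weighted entropy after splitting on attribute."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Hypothetical toy data using the three attributes named on the slide.
rows = [
    {"Outlook": "sunny", "Windy": False, "Humidity": "high", "Play": "no"},
    {"Outlook": "sunny", "Windy": True, "Humidity": "high", "Play": "no"},
    {"Outlook": "overcast", "Windy": False, "Humidity": "high", "Play": "yes"},
    {"Outlook": "rainy", "Windy": False, "Humidity": "normal", "Play": "yes"},
    {"Outlook": "rainy", "Windy": True, "Humidity": "normal", "Play": "no"},
    {"Outlook": "overcast", "Windy": True, "Humidity": "normal", "Play": "yes"},
]
gains = {a: information_gain(rows, a) for a in ("Outlook", "Windy", "Humidity")}
best = max(gains, key=gains.get)
print(best)  # Outlook
```

On this toy data Outlook separates the classes best, so it would become the root decision point; the same test is then repeated on each branch.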

  11. Distributed Data Mining (DDM) • Goals: • Promote awareness of the benefits of using parallel and distributed computing platforms to solve problems in data-mining applications • Discover hidden patterns in a complete data set that is partitioned and physically distributed • Advantages over centralized DM: • Privacy • Data transmission bandwidth

  12. Distributed Data Mining (DDM) • Can be implemented as a service at each database plus a global service controller • A global brokering service coordinates a group of expert services • Each service performs local analysis on a particular data partition • The global service performs further analysis on the local results and integrates them into a global result
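
The local/global split can be sketched with a deliberately simple statistic. Each site reduces its own partition to a small summary, and only the summaries travel to the global service. The site names, the data, and the functions `local_analysis` and `global_integration` are hypothetical.

```python
def local_analysis(partition):
    """Runs at one data source: reduce the raw records to a small summary."""
    return {"n": len(partition), "total": sum(partition)}

def global_integration(summaries):
    """Runs at the global service: combine local summaries into one global result."""
    n = sum(s["n"] for s in summaries)
    total = sum(s["total"] for s in summaries)
    return total / n

# Hypothetical partitions held at three separate sites.
partitions = {"site_a": [10, 20, 30], "site_b": [40, 50], "site_c": [60]}
summaries = [local_analysis(p) for p in partitions.values()]
global_mean = global_integration(summaries)
print(global_mean)  # 35.0
```

Only two numbers per site cross the network, and no raw record leaves its source. The same pattern applies when the local results are models rather than sums.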

  13. Distributed Data Mining (DDM) • Challenges: • Heterogeneity: • Nodes or data sources may not run the same system or the same hardware • Autonomy: • Each node keeps control over its own data and management

  14. Credit Card Fraud Detection • Credit cards have become the standard means of payment on the web and in e-commerce • The risk of fraudulent transactions is very high • Data mining is used to detect fraudulent transactions

  15. Credit Card Fraud Detection • Challenges facing DM in this problem: • Billions of credit card transactions are processed daily (massive amount of data) • The data are highly skewed: • Far more transactions are legitimate than fraudulent • Each transaction has a different amount: • The potential loss varies, so the misclassification cost is not fixed
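
To see why a variable cost matters, the sketch below scores a flagging policy by money lost rather than by error count. The cost model is an assumption, not the presentation's: a missed fraud costs the full transaction amount, and every flagged transaction costs a fixed review overhead of 10.0.

```python
def total_cost(transactions, predictions, overhead=10.0):
    """Sum the cost of a set of fraud predictions under the assumed cost model."""
    cost = 0.0
    for (amount, is_fraud), flagged in zip(transactions, predictions):
        if flagged:
            cost += overhead  # every flagged transaction is reviewed
        elif is_fraud:
            cost += amount    # a missed fraud loses the whole amount
    return cost

# Hypothetical transactions as (amount, is_fraud).
transactions = [(20.0, False), (900.0, True), (5.0, True), (60.0, False)]
flag_nothing = total_cost(transactions, [False, False, False, False])
flag_large = total_cost(transactions, [False, True, False, False])
flag_all = total_cost(transactions, [True, True, True, True])
print(flag_nothing, flag_large, flag_all)  # 905.0 15.0 40.0
```

Flagging only the large fraud is cheapest here even though it misses one fraudulent transaction, which plain accuracy would not reveal.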

  16. Credit Card Fraud Detection • Approach: • Prepare a set of labeled data • Divide the large labeled data set into smaller subsets • Distribute the subsets to different processors • Each processor creates a local classifier • Integrate the local classifiers into a global classifier using meta-learning • Use the global classifier to classify new transactions • Repeat the process periodically • The classifier can be a decision tree, a neural network, or another type of classifier
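
The steps above can be sketched as follows, with several simplifications that are assumptions rather than the presented method: worker threads stand in for separate processors, each local classifier is a single amount threshold, and the local classifiers are combined by majority vote, a simpler stand-in for meta-learning. The labeled data are invented.

```python
from concurrent.futures import ThreadPoolExecutor

def train_local(subset):
    """Local classifier: threshold halfway between mean legitimate and mean fraud amount."""
    legit = [amount for amount, label in subset if label == 0]
    fraud = [amount for amount, label in subset if label == 1]
    return (sum(legit) / len(legit) + sum(fraud) / len(fraud)) / 2

def classify(thresholds, amount):
    """Global decision: 1 (fraud) if a strict majority of local classifiers flag it."""
    votes = sum(1 for t in thresholds if amount > t)
    return 1 if votes * 2 > len(thresholds) else 0

# Hypothetical labeled transactions as (amount, label), with label 1 meaning fraud.
labeled = [
    (12.0, 0), (25.0, 0), (900.0, 1),
    (18.0, 0), (30.0, 0), (1200.0, 1),
    (22.0, 0), (15.0, 0), (950.0, 1),
]
# Divide the labeled data into subsets of three records each.
chunks = [labeled[i:i + 3] for i in range(0, len(labeled), 3)]
# Train one local classifier per subset, one worker per subset.
with ThreadPoolExecutor(max_workers=3) as pool:
    thresholds = list(pool.map(train_local, chunks))

print(thresholds)                   # [459.25, 612.0, 484.25]
print(classify(thresholds, 500.0))  # 1
print(classify(thresholds, 40.0))   # 0
```

In a real meta-learning setup, a second-level learner would be trained on the local classifiers' predictions instead of counting votes.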

  17. Credit Card Fraud Detection • This approach may result in a large number of local classifiers • Pruning techniques are used to remove redundant classifiers
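
The slide does not define "redundant". One simple reading, assumed here, is that a classifier is redundant if it makes exactly the same predictions on a validation set as a classifier that has already been kept. The threshold classifiers and validation amounts are hypothetical.

```python
def predict(threshold, amount):
    """Threshold classifier: 1 (fraud) if the amount exceeds the threshold."""
    return 1 if amount > threshold else 0

def prune_redundant(thresholds, validation_amounts):
    """Keep only classifiers whose validation predictions differ from all kept ones."""
    kept, seen = [], set()
    for t in thresholds:
        signature = tuple(predict(t, a) for a in validation_amounts)
        if signature not in seen:
            seen.add(signature)
            kept.append(t)
    return kept

kept = prune_redundant([459.25, 484.25, 612.0], [10.0, 100.0, 500.0, 700.0])
print(kept)  # [459.25, 612.0]
```

The second classifier is dropped because it agrees with the first on every validation amount. Published pruning methods also weigh accuracy and diversity, which this sketch ignores.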

  18. Distributed Data Mining (DDM) in the Grid • Such an environment can be thought of as a data grid • Grid: a geographically distributed platform of heterogeneous machines accessible through a single interface • Data grid: • A grid designed to allow large data sets to be stored and moved easily • Handles data sets without constant or repeated authentication • Supports distributed data-intensive applications

  19. DM in the Grid • Motivation: a data grid that provides: • High performance • Security • A robust data transfer mechanism • Together with: • A set of tools for creating and manipulating replicas of large data sets • A mechanism for maintaining a catalog of data set replicas • The Knowledge Grid was introduced to meet these infrastructure needs

  20. Knowledge Grid • Knowledge Grid: built on top of grid toolkits and services • The Knowledge Grid can be used to: • Perform data mining on very large data sets available over grids • Make scientific discoveries • Improve industrial processes • Uncover valuable business information

  21. Knowledge Grid • The Knowledge Grid has two hierarchical levels: • Core K-Grid layer • High-level K-Grid layer

  22. Knowledge Grid

  23. Knowledge Grid: Core K Grid • The Core K Grid layer offers basic services for the definition, composition, and execution of distributed knowledge discovery • Main services: • Knowledge Directory Service • Manages knowledge metadata and tools • Resource Allocation & Execution Management • Finds the best mapping between an execution plan and the available resources to meet application requirements
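
The mapping problem can be illustrated with a toy greedy heuristic. This is not the Knowledge Grid's actual allocation algorithm, which the slide does not describe. Task costs, node speeds, and the function `greedy_mapping` are all assumptions.

```python
def greedy_mapping(task_costs, node_speeds):
    """Assign tasks, largest first, to the node that would finish them earliest."""
    busy_until = {node: 0.0 for node in node_speeds}
    plan = {}
    for task, cost in sorted(task_costs.items(), key=lambda kv: kv[1], reverse=True):
        # Choose the node with the earliest completion time for this task.
        best = min(node_speeds, key=lambda n: busy_until[n] + cost / node_speeds[n])
        busy_until[best] += cost / node_speeds[best]
        plan[task] = best
    return plan, max(busy_until.values())

# Hypothetical execution plan (work units per task) and resources (units per second).
task_costs = {"clean": 4, "cluster": 8, "classify": 6}
node_speeds = {"node_a": 2.0, "node_b": 1.0}
plan, makespan = greedy_mapping(task_costs, node_speeds)
print(plan)      # {'cluster': 'node_a', 'classify': 'node_b', 'clean': 'node_a'}
print(makespan)  # 6.0
```

A real allocator would also account for data location, transfer cost, and task dependencies, all of which this sketch leaves out.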

  24. Knowledge Grid: High Level K Grid • The High Level K Grid includes services to: • Compose • Validate • Execute parallel and distributed knowledge discovery computations • It also provides services to store and analyze the discovered knowledge

  25. Knowledge Grid: High Level K Grid • Main services: • Data Access Service (DAS) • Search, selection, extraction, transformation, and delivery of the data to be mined • Tools and Algorithms Access Service (TAAS) • Search, selection, and downloading of data mining tools • Execution Plan Management Service (EPMS) • Semi-automatic tool that takes data and programs and generates different execution plans • Results Presentation Service (RPS) • Generates, presents, visualizes, and stores knowledge models

  26. Conclusion • The need to transform data into knowledge is pressing • Data mining is about discovering valuable information • Data mining needs two main components: • Data • Efficient algorithms • Parallel computing can lead to very efficient algorithms when applied to data mining

  27. Conclusion • Data mining on the grid, or the Knowledge Grid, is a very efficient solution that distributes the data mining process • It also preserves data privacy, since each data source is responsible for the computation on its own data

  28. References • Cannataro, M., Talia, D., Trunfio, P. “Distributed data mining on the grid”. Future Generation Computer Systems 18 (2002) 1101–1112 • Fran, W. “Distributed data mining in credit card fraud detection”. IEEE Intelligent Systems & Their Applications 6 (14) (2000) 67 • Lou, P., Lu, K., Shi, Z., He, Q. “Distributed data mining in grid computing environments”. Future Generation Computer Systems 23 (2007) 84–91 • Talia, D., Trunfio, P. “How Distributed Data Mining Tasks can Thrive as Knowledge Services”. 7 (53) (2010) 132–137 • Zomaya, A., El-Ghazawi, T., Frieder, O. “Parallel and Distributed Computing for Data Mining”. IEEE Concurrency (October–December) (1999) 11–13