Parallel and Distributed Computing for Data Mining


##### Presentation Transcript

1. Parallel and Distributed Computing for Data Mining Mohammad Alshamimi April 28, 2011

2. Outline • What is data mining • Techniques of Data Mining • Distributed Data Mining • Example: Credit Card Fraud Detection • Knowledge Grid

3. What is data mining? • Data mining is the discovery of valuable information in large volumes of data, using computationally efficient techniques. • Some call data mining “knowledge discovery”

4. Techniques in Data Mining • Clustering • Association rules • Classification • Sequential pattern • Outlier detection • Decision Trees

5. Techniques in Data Mining: Clustering • Clustering is the process of partitioning a given set of data points into distinct groups, or clusters, based on similarity. Similar data points end up in the same cluster • E.g. group companies with similar stock behavior or similar growth, or identify genes and proteins that have similar functions
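The grouping described on this slide can be sketched with k-means, one common clustering algorithm (the slides do not name a specific one). The company figures, the function name `k_means`, and the choice of the first `k` points as starting centroids are all assumptions made for illustration.

```python
from math import dist

def k_means(points, k, iterations=10):
    """Group points into k clusters by assigning each point to its
    nearest centroid and then recomputing the centroids."""
    # Assumption: the first k points are the initial centroids (deterministic run).
    centroids = [tuple(p) for p in points[:k]]
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # New centroid = mean of the members; an empty cluster keeps its old centroid.
        centroids = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical (growth %, volatility %) pairs for six companies.
companies = [(1.0, 2.0), (9.0, 8.5), (1.2, 1.8), (8.8, 9.1), (0.9, 2.2), (9.3, 8.7)]
centroids, clusters = k_means(companies, k=2)
```

With this toy data the three low-growth companies land in one cluster and the three high-growth companies in the other.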

6. Techniques in Data Mining: Association • Association rule mining finds all rules that correlate the presence of one set of items with that of another set of items. • E.g. identify the items that sell together in a supermarket by mining the sales transactions
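A rule such as "bread implies milk" is usually scored by its support and confidence. The sketch below computes both for a handful of made-up baskets; the data and the function names `support` and `confidence` are illustrative assumptions, not part of the original slides.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated probability of the consequent given the antecedent."""
    both = support(transactions, set(antecedent) | set(consequent))
    base = support(transactions, antecedent)
    return both / base if base else 0.0

# Hypothetical supermarket baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]
rule_support = support(baskets, {"bread", "milk"})
rule_confidence = confidence(baskets, {"bread"}, {"milk"})
```

Here bread and milk appear together in half of the baskets, and two of the three baskets that contain bread also contain milk.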

7. Techniques in Data Mining: Classification • Classification assigns objects to predefined categories or classes • E.g. in credit evaluation, an applicant can be classified as good or bad

8. Techniques in Data Mining: Sequential pattern discovery • Sequential pattern discovery determines strong sequential dependencies among different events • E.g. medical diagnosis or sales-transactions analysis to determine which customers are likely to buy a specific product in the future

9. Techniques in Data Mining: Detection of outliers • Outlier detection finds data points that differ significantly from the majority of the data points in a given data set • E.g. medical diagnosis and credit card fraud detection
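One simple way to find points that "differ significantly from the majority" is a z-score test, sketched below. The slides do not prescribe a method, so the z-score rule, the threshold of 2.0, and the transaction amounts are assumptions chosen for illustration.

```python
from statistics import mean, pstdev

def z_score_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard
    deviations away from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical transaction amounts; one is far larger than the rest.
amounts = [20, 22, 19, 21, 23, 20, 500]
flagged = z_score_outliers(amounts)
```

On this toy data only the amount 500 is flagged. A single extreme value inflates the standard deviation, so robust statistics such as the median are often preferred on real data.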

10. Techniques in Data Mining: Decision Trees • A decision tree is a classifier built as a tree of decision points, each chosen because it splits the remaining data effectively • Example: deciding whether to play golf • Attributes: • Outlook • Windy • Humidity
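How a tree picks an "effective" decision point can be illustrated with information gain, a common splitting criterion (the slide does not say which criterion is meant). The six golf records below are invented for this sketch, and the attribute values are assumptions.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target="play"):
    """Entropy reduction obtained by splitting rows on attribute."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Hypothetical golf records using the attributes named on the slide.
rows = [
    {"outlook": "sunny", "windy": False, "humidity": "high", "play": "no"},
    {"outlook": "sunny", "windy": True, "humidity": "high", "play": "no"},
    {"outlook": "overcast", "windy": False, "humidity": "high", "play": "yes"},
    {"outlook": "rainy", "windy": False, "humidity": "normal", "play": "yes"},
    {"outlook": "rainy", "windy": True, "humidity": "normal", "play": "no"},
    {"outlook": "overcast", "windy": True, "humidity": "normal", "play": "yes"},
]
gains = {a: information_gain(rows, a) for a in ("outlook", "windy", "humidity")}
best_attribute = max(gains, key=gains.get)
```

For these records the outlook attribute gives the largest gain, so it would become the root decision point. The same procedure is then repeated on each resulting subset.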

11. Distributed Data Mining (DDM) • Goal: • Promote awareness of the benefits of using parallel and distributed computing platforms to solve problems in data-mining applications • Discover hidden patterns in a complete data set that is partitioned and physically distributed • Better than centralized DM in: • Privacy • Data transmission bandwidth

12. Distributed Data Mining (DDM) • Can be implemented as a service at each database plus a global service controller • A global brokering service coordinates a group of expert services • Each service performs local analysis on a particular data partition • The global service performs further analysis on the local results to produce an integrated global result
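The local-then-global pattern on this slide can be sketched as follows: each site summarizes its own partition, and only the summaries travel to the global service. Item counting is used as a stand-in analysis; the site data and the names `local_analysis` and `global_integration` are hypothetical.

```python
from collections import Counter

def local_analysis(partition):
    """Runs at one data site: summarize the local partition so that
    raw records never leave the site."""
    return Counter(item for transaction in partition for item in transaction)

def global_integration(local_results, min_count):
    """Runs at the global service: merge local summaries and keep
    the items that are frequent across all sites."""
    total = Counter()
    for result in local_results:
        total.update(result)
    return {item: n for item, n in total.items() if n >= min_count}

# Hypothetical partitions held by two different databases.
site_a = [["bread", "milk"], ["bread"]]
site_b = [["milk", "eggs"], ["bread", "milk"]]

local_results = [local_analysis(p) for p in (site_a, site_b)]
frequent = global_integration(local_results, min_count=3)
```

Only the per-site counts are transmitted, which is where the privacy and bandwidth advantages over centralized mining come from.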

13. Distributed Data Mining (DDM) • Challenges: • Heterogeneity: • Nodes or data sources may not run the same system or use the same hardware • Autonomy: • Each node keeps its own control over its data and management

14. Credit Card Fraud Detection • Credit cards have become the standard way to pay on the web and in e-commerce • The risk of fraudulent transactions is very high • Data mining is used to detect fraudulent transactions

15. Credit Card Fraud Detection • Challenges facing DM in this problem: • Billions of credit card transactions are processed daily (massive amount of data) • The data is highly skewed • Many more transactions are legitimate than fraudulent • Each transaction record has a different amount • The potential loss varies, so the misclassification cost is not fixed
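Because the loss depends on the transaction amount, a fraud detector is better judged by cost than by accuracy. The sketch below uses an assumed cost model that is not taken from the slides: a missed fraud costs the full transaction amount, and every alert costs a fixed investigation overhead. All numbers are hypothetical.

```python
def total_cost(transactions, predictions, overhead=10.0):
    """Sum the cost of a set of fraud predictions.

    Assumed cost model: each flagged transaction costs `overhead`
    to investigate; each missed fraud costs its full amount."""
    cost = 0.0
    for (amount, is_fraud), flagged in zip(transactions, predictions):
        if flagged:
            cost += overhead
        elif is_fraud:
            cost += amount
    return cost

# Hypothetical (amount, is_fraud) records.
transactions = [(25.0, False), (900.0, True), (40.0, False), (15.0, True), (60.0, False)]

flag_nothing = [False] * len(transactions)
flag_large = [amount > 500 for amount, _ in transactions]

cost_nothing = total_cost(transactions, flag_nothing)
cost_large = total_cost(transactions, flag_large)
```

Under these assumptions, flagging only the large transaction is far cheaper than flagging nothing, even though it still misses the small fraud.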

16. Credit Card Fraud Detection • Approach: • Prepare a set of labeled test data • Divide the large set of known, labeled data into smaller subsets • Distribute the subsets to different processors • Each processor creates a local classifier • Integrate the local classifiers into a global classifier using meta-learning • Use the global classifier to classify new transactions • Repeat the process periodically • The classifier can be a decision tree, a neural network, or another type of classifier
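The approach above can be sketched end to end on toy data. Several simplifications are assumed here: each local classifier is a one-threshold "stump" on the transaction amount, threads stand in for separate processors, and a majority vote stands in for a trained meta-learner. The data and all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def train_stump(subset):
    """Local classifier: choose the amount threshold that makes the
    fewest errors on this subset (predict fraud if amount > threshold)."""
    best_threshold, best_errors = None, None
    for threshold in sorted({amount for amount, _ in subset}):
        errors = sum((amount > threshold) != is_fraud for amount, is_fraud in subset)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

def classify(amount, thresholds):
    """Global classifier: flag as fraud when most local classifiers do.
    Majority voting is a simple stand-in for meta-learning."""
    votes = sum(amount > t for t in thresholds)
    return votes * 2 > len(thresholds)

# Hypothetical labeled data: (amount, is_fraud).
labeled = [(10, False), (20, False), (500, True),
           (15, False), (30, False), (700, True),
           (25, False), (12, False), (650, True)]

# Divide the labeled data into smaller subsets, one per worker.
subsets = [labeled[i:i + 3] for i in range(0, len(labeled), 3)]

# Each worker builds its own local classifier.
with ThreadPoolExecutor(max_workers=3) as pool:
    thresholds = list(pool.map(train_stump, subsets))

flagged = classify(600, thresholds)
```

In a real deployment each subset would live on a different machine, and the combining step would itself be learned from the local classifiers' predictions on held-out data.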

17. Credit Card Fraud Detection • This approach may result in a large number of local classifiers • Pruning techniques are used to remove redundant classifiers
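The slides do not say which pruning technique is used. One simple possibility, shown here purely as an assumption, is to drop any classifier whose predictions on a validation set duplicate those of a classifier already kept. The threshold classifiers and validation amounts are hypothetical.

```python
def prune_redundant(classifiers, validation_inputs):
    """Keep only classifiers whose predictions on the validation
    inputs differ from every classifier kept so far."""
    kept, seen = [], set()
    for clf in classifiers:
        signature = tuple(clf(x) for x in validation_inputs)
        if signature not in seen:
            seen.add(signature)
            kept.append(clf)
    return kept

def make_stump(threshold):
    """Hypothetical local classifier: fraud if amount exceeds threshold."""
    return lambda amount: amount > threshold

classifiers = [make_stump(t) for t in (20, 30, 25, 22)]
validation_amounts = [10, 28, 600]
pruned = prune_redundant(classifiers, validation_amounts)
```

Three of the four stumps behave identically on the validation amounts, so only two classifiers survive pruning in this example.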

18. Distributed Data Mining (DDM) in the Grid • Such an environment can be thought of as a data grid • Grid: a geographically distributed platform of heterogeneous machines accessible through a single interface • Data Grid: • A grid designed to allow large data sets to be stored and moved easily • Handles data sets without constant or repeated authentication • Supports distributed data-intensive applications

19. DM in the Grid • Motivation: a data grid that is: • High performance • Secure • Equipped with a robust data transfer mechanism • With: • A set of tools for creating and manipulating replicas of large data sets • A mechanism for maintaining a catalog of data set replicas • The Knowledge Grid is introduced to meet these infrastructure needs
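A replica catalog, at its simplest, maps a logical data set name to the sites that hold a copy. The class below is a minimal in-memory sketch of that idea; the class name, its methods, and the host names are hypothetical and do not correspond to any real grid toolkit API.

```python
from collections import defaultdict

class ReplicaCatalog:
    """Maps a logical data set name to the sites that hold a copy of it."""

    def __init__(self):
        self._locations = defaultdict(set)

    def register(self, logical_name, site):
        """Record that `site` now holds a replica."""
        self._locations[logical_name].add(site)

    def unregister(self, logical_name, site):
        """Record that `site` no longer holds a replica."""
        self._locations[logical_name].discard(site)

    def lookup(self, logical_name):
        """Return the known replica sites in a stable order."""
        return sorted(self._locations.get(logical_name, set()))

catalog = ReplicaCatalog()
catalog.register("transactions-2011", "site-a.example.org")
catalog.register("transactions-2011", "site-b.example.org")
catalog.unregister("transactions-2011", "site-a.example.org")
replicas = catalog.lookup("transactions-2011")
```

A mining job would consult such a catalog to find a nearby copy of a data set instead of moving the data across the grid.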

20. Knowledge Grid • Knowledge Grid: defined on top of grid toolkits and services • The Knowledge Grid can: • Be used to perform data mining on very large data sets available over grids • Make scientific discoveries • Improve industrial processes • Uncover valuable business information

21. Knowledge Grid • The Knowledge Grid has two hierarchical levels: • Core K-Grid layer • High-level K-Grid layer

22. Knowledge Grid

23. Knowledge Grid: Core K Grid • The Core K Grid layer offers basic services for the definition, composition and execution of distributed knowledge discovery • Main services: • Knowledge Directory Service • Manages metadata and knowledge tools • Resource Allocation & Execution Management • Finds the best mapping between an execution plan and the available resources to meet application requirements
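Mapping an execution plan onto available resources is a scheduling problem. The greedy heuristic below is only an illustration of that idea, not the Knowledge Grid's actual algorithm: it places the largest task first on whichever node would finish it earliest. Task sizes, node speeds, and all names are hypothetical.

```python
def allocate(tasks, speeds):
    """Greedy mapping of tasks to nodes.

    tasks:  {task_name: amount_of_work}
    speeds: {node_name: work_units_per_time_unit}
    Returns (plan, makespan), where plan maps each task to a node."""
    ready_at = {node: 0.0 for node in speeds}
    plan = {}
    # Largest tasks first, each on the node that finishes it earliest.
    for task, work in sorted(tasks.items(), key=lambda kv: kv[1], reverse=True):
        node = min(speeds, key=lambda n: ready_at[n] + work / speeds[n])
        ready_at[node] += work / speeds[node]
        plan[task] = node
    return plan, max(ready_at.values())

# Hypothetical steps of a knowledge discovery application.
tasks = {"preprocess": 4, "cluster": 8, "classify": 6}
speeds = {"node_a": 2.0, "node_b": 1.0}
plan, makespan = allocate(tasks, speeds)
```

A real allocator would also weigh data location, network cost, and the application's stated requirements, which this sketch ignores.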

24. Knowledge Grid: High Level K Grid • The High Level K Grid includes services to: • Compose • Validate • Execute parallel and distributed knowledge discovery computations • It also provides services to store and analyze the discovered knowledge

25. Knowledge Grid: High Level K Grid • Main services: • Data Access Service (DAS) • Search, selection, extraction, transformation and delivery of the data to be mined • Tools and Algorithms Access Service (TAAS) • Search, selection and downloading of data mining tools • Execution Plan Management Service (EPMS) • Semi-automatic tool that takes data and programs and generates different execution plans • Results Presentation Service (RPS) • Generates, presents, visualizes and stores knowledge models

26. Conclusion • The need to transform data into knowledge is pressing • Data mining is about discovering valuable information • Data mining needs two main components: • Data • Efficient algorithms • Parallel computing can lead to very efficient algorithms when used in data mining

27. Conclusion • Data mining in the Grid, or Knowledge Grid, is a very efficient solution that distributes the data mining process • It also preserves the privacy of the data, since each data source is responsible for the computation on its own data

28. References • Cannataro, M., Talia, D., Trunfio, P. “Distributed data mining on the grid”. Future Generation Computer Systems 18 (2002) 1101–1112 • Fran, W. “Distributed data mining in credit card fraud detection”. IEEE Intelligent Systems & Their Applications 6 (14) (2000) 67 • Lou, P., Lu, K., Shi, Z., He, Q. “Distributed data mining in grid computing environments”. Future Generation Computer Systems 23 (2007) 84–91 • Talia, D., Trunfio, P. “How Distributed Data Mining Tasks can Thrive as Knowledge Services”. 7(53) (2010) 132–137 • Zomaya, A., El-Ghazawi, T., Frieder, O. “Parallel and Distributed Computing for Data Mining”. IEEE Concurrency (October-December) (1999) 11–13