Parallel and Distributed Computing for Data Mining
Parallel and Distributed Computing for Data Mining • Mohammad Alshamimi • April 28, 2011
Outline • What is data mining • Techniques of Data Mining • Distributed Data Mining • Example: Credit Card Fraud Detection • Knowledge Grid
What is data mining? • Data mining is the discovery of valuable information in large volumes of data, using computationally efficient techniques • It is also called “knowledge discovery”
Techniques in Data Mining • Clustering • Association rules • Classification • Sequential pattern discovery • Outlier detection • Decision trees
Techniques in Data Mining: Clustering • Clustering is the process of partitioning a given set of data points into distinct groups (clusters) based on similarity, so that similar data points end up in the same cluster • E.g. group companies with similar stock behavior or similar growth, or identify genes and proteins that have similar functions
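The slide does not name a particular clustering algorithm. As one possible illustration, here is a minimal k-means sketch in Python. The company figures are made-up values, and seeding the centroids with the first k points is an arbitrary simplification.

```python
from math import dist

def k_means(points, k, iterations=10):
    """Lloyd's algorithm; the first k points seed the centroids (arbitrary choice)."""
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members;
        # an empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical (growth %, volatility %) pairs for six companies.
companies = [(1.0, 2.0), (9.0, 8.0), (1.2, 1.8), (8.5, 9.1), (0.8, 2.2), (9.2, 8.7)]
centroids, clusters = k_means(companies, k=2)
```

With this toy input, the three low-growth companies and the three high-growth companies fall into separate clusters.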
Techniques in Data Mining: Association • Association rule mining finds all rules that correlate the presence of one set of items with that of another set of items • E.g. identify the items that sell together in a supermarket by mining its sales transactions
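The supermarket example can be sketched with the two usual rule measures, support and confidence. This is a simplified illustration limited to rules between single items; the baskets and the two thresholds are made-up assumptions, not values from the presentation.

```python
from itertools import combinations

def pair_rules(transactions, min_support, min_confidence):
    """Return (lhs, rhs, support, confidence) for single-item rules lhs -> rhs."""
    n = len(transactions)

    def support(items):
        # Fraction of transactions that contain every item in `items`.
        return sum(items <= t for t in transactions) / n

    found = []
    for a, b in combinations(sorted(set().union(*transactions)), 2):
        pair_support = support({a, b})
        if pair_support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = pair_support / support({lhs})
            if confidence >= min_confidence:
                found.append((lhs, rhs, pair_support, confidence))
    return found

# Hypothetical supermarket baskets.
baskets = [{"bread", "milk"}, {"bread", "butter"}, {"bread", "milk", "butter"}, {"milk"}]
rules = pair_rules(baskets, min_support=0.5, min_confidence=0.9)
```

Here the only rule that passes both thresholds is butter -> bread: every basket with butter also contains bread.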
Techniques in Data Mining: Classification • Classification assigns objects to predefined categories or classes • E.g. in credit evaluation, the class is either good or bad
Techniques in Data Mining: Sequential pattern discovery • Sequential pattern discovery determines strong sequential dependencies among different events • E.g. medical diagnosis or sales-transactions analysis to determine which customers are likely to buy a specific product in the future
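A very small sketch of the idea, assuming each customer history is an ordered list of purchases: count how many histories contain item A somewhere before item B and keep the pairs that occur often enough. Real sequential pattern miners handle longer patterns and time gaps; the purchase histories and the threshold below are made up.

```python
from collections import Counter

def frequent_ordered_pairs(sequences, min_support):
    """Support of each (earlier, later) pair = share of sequences containing it."""
    counts = Counter()
    for seq in sequences:
        seen_pairs = set()  # count each pair at most once per sequence
        for i, earlier in enumerate(seq):
            for later in seq[i + 1:]:
                if earlier != later:
                    seen_pairs.add((earlier, later))
        counts.update(seen_pairs)
    n = len(sequences)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}

# Hypothetical purchase histories, one list per customer, oldest purchase first.
histories = [
    ["phone", "case", "charger"],
    ["phone", "charger"],
    ["case", "phone"],
    ["phone", "case"],
]
patterns = frequent_ordered_pairs(histories, min_support=0.5)
```

In this toy data, half of the customers bought a case after a phone, and half bought a charger after a phone.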
Techniques in Data Mining: Detection of outliers • Outlier detection finds data points that differ significantly from the majority of the data points in a given data set • E.g. medical diagnosis and credit card fraud detection
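The slide does not say how "differ significantly" is measured. One common choice, used here purely as an assumption, is the modified z-score based on the median absolute deviation (MAD), with the conventional cutoff of 3.5. The transaction amounts are made up.

```python
from statistics import median

def mad_outliers(values, cutoff=3.5):
    """Return the values whose modified z-score exceeds the cutoff."""
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        return []  # no spread to measure against
    return [v for v in values if 0.6745 * abs(v - center) / mad > cutoff]

# Hypothetical card transaction amounts; one is far from the rest.
amounts = [25, 30, 28, 27, 31, 29, 950]
suspicious = mad_outliers(amounts)
```

The median-based score is used instead of the mean and standard deviation because a single extreme value would otherwise inflate the very spread it is compared against.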
Techniques in Data Mining: Decision Trees • A decision tree is a classifier built as a tree of decision points, each chosen because it splits the remaining data effectively • Example: deciding whether to play golf • Attributes: Outlook, Windy, Humidity
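How "effective" a split is can be measured in several ways. The sketch below assumes information gain (entropy reduction), as used by ID3-style learners, and applies it to a six-row play-golf table invented for this illustration; only the attribute names come from the slide.

```python
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target="play"):
    """Entropy of the target minus the weighted entropy after splitting on attribute."""
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row[target] for row in rows if row[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy([row[target] for row in rows]) - remainder

# Made-up toy table; not taken from the presentation.
rows = [
    {"outlook": "sunny",    "windy": False, "humidity": "high",   "play": "no"},
    {"outlook": "sunny",    "windy": True,  "humidity": "high",   "play": "no"},
    {"outlook": "overcast", "windy": False, "humidity": "high",   "play": "yes"},
    {"outlook": "overcast", "windy": True,  "humidity": "normal", "play": "yes"},
    {"outlook": "rainy",    "windy": False, "humidity": "normal", "play": "yes"},
    {"outlook": "rainy",    "windy": True,  "humidity": "normal", "play": "no"},
]
gains = {a: information_gain(rows, a) for a in ("outlook", "windy", "humidity")}
best = max(gains, key=gains.get)  # attribute to test at the root of the tree
```

On this table Outlook gives the largest gain, so it would become the first decision point; the same computation is then repeated on each branch.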
Distributed Data Mining (DDM) • Goals: promote the benefits of using parallel and distributed computing platforms to solve problems in data mining applications; discover hidden patterns in a complete data set that is partitioned and physically distributed • Advantages over centralized data mining: privacy; reduced data transmission bandwidth
Distributed Data Mining (DDM) • Can be implemented as a service at each database plus a global service controller • A global brokering service coordinates a group of expert services • Each service performs local analysis on a particular data partition • The global service performs further analysis on the local results to integrate them into a global result
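The local/global split described on this slide can be sketched as follows. This is an illustration only: threads stand in for remote database sites, the "analysis" is a simple item count, and the function names `local_analysis` and `global_service` are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def local_analysis(partition):
    """Runs at one data site: returns a compact summary, never the raw records."""
    counts = Counter(item for transaction in partition for item in transaction)
    return counts, len(partition)

def global_service(partitions):
    """Coordinates the local services and integrates their summaries."""
    with ThreadPoolExecutor() as pool:  # threads simulate independent sites
        local_results = list(pool.map(local_analysis, partitions))
    total_counts, total_n = Counter(), 0
    for counts, n in local_results:
        total_counts += counts
        total_n += n
    # Global item support, computed without moving any transaction off its site.
    return {item: c / total_n for item, c in total_counts.items()}

# Two hypothetical sites, each holding its own transactions.
sites = [[{"a", "b"}, {"a"}], [{"b"}, {"a", "b"}]]
global_support = global_service(sites)
```

Only the per-site counts cross the network in this design, which is where the privacy and bandwidth advantages of DDM come from.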
Distributed Data Mining (DDM) • Challenges • Heterogeneity: nodes or data sources may not run the same system or the same hardware • Autonomy: each node keeps its own control over its data and management
Credit Card Fraud Detection • Credit cards have become the standard means of payment on the web and in e-commerce • The risk of fraudulent transactions is very high • Data mining is used to detect fraudulent transactions
Credit Card Fraud Detection • Challenges facing data mining in this problem • Massive amount of data: billions of credit card transactions are processed daily • Highly skewed data: many more transactions are legitimate than fraudulent • Each transaction has a different amount, so the potential loss varies and the misclassification cost is not fixed
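To see why a variable misclassification cost matters, consider a simplified cost model. It is an assumption made for this sketch, not a model stated on the slide: every flagged transaction costs a fixed investigation overhead, and every missed fraud costs its full amount. The overhead value and the transactions are made up.

```python
def total_cost(transactions, flagged, overhead=50.0):
    """Cost of a set of predictions under the assumed cost model."""
    cost = 0.0
    for (amount, is_fraud), is_flagged in zip(transactions, flagged):
        if is_flagged:
            cost += overhead  # each flagged transaction is investigated
        elif is_fraud:
            cost += amount    # a missed fraud loses the whole amount
    return cost

# Hypothetical (amount, is_fraud) records.
transactions = [(20.0, False), (1200.0, True), (35.0, False), (15.0, True)]

flag_nothing = [False, False, False, False]
flag_large_fraud = [False, True, False, False]

cost_nothing = total_cost(transactions, flag_nothing)
cost_large = total_cost(transactions, flag_large_fraud)
```

Catching only the large fraud cuts the cost from 1215 to 65 even though one fraud is still missed, so a classifier should be judged by cost saved and not by accuracy alone.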
Credit Card Fraud Detection • Approach: • Prepare a set of labeled data • Divide the large labeled data set into smaller subsets • Distribute the subsets to different processors • Each processor creates a local classifier • Integrate the local classifiers into a global classifier using meta-learning • Use the global classifier to classify new transactions • Repeat the process periodically • The classifier can be a decision tree, a neural network, or another type of classifier
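The steps above can be sketched in a few lines. Everything concrete here is an assumption made for illustration: the local classifier is a one-threshold "stump" on the transaction amount, the meta-classifier is a lookup table from base predictions to the majority label on a held-out set, and the data is invented. In a real deployment each call to `train_stump` would run on its own processor.

```python
from collections import Counter, defaultdict

def train_stump(rows):
    """Local classifier: flag as fraud when amount >= the learned threshold."""
    def errors(t):
        return sum((amount >= t) != label for amount, label in rows)
    return min(sorted({amount for amount, _ in rows}), key=errors)

def train_meta(thresholds, validation):
    """Meta-classifier: map each vector of base predictions to the majority label."""
    votes = defaultdict(Counter)
    for amount, label in validation:
        votes[tuple(amount >= t for t in thresholds)][label] += 1
    return {key: c.most_common(1)[0][0] for key, c in votes.items()}

def classify(amount, thresholds, meta):
    key = tuple(amount >= t for t in thresholds)
    # Unseen combination of base predictions: fall back to a majority vote.
    return meta.get(key, sum(key) * 2 > len(key))

# Hypothetical labeled data: (amount, is_fraud).
labeled = [(10, False), (900, True), (25, False), (700, True), (40, False), (850, True),
           (15, False), (30, False), (950, True), (60, False), (20, False), (800, True)]
validation = [(12, False), (45, False), (720, True), (990, True)]

partitions = [labeled[i::3] for i in range(3)]          # divide into subsets
thresholds = [train_stump(part) for part in partitions]  # one local classifier each
meta = train_meta(thresholds, validation)                # integrate via meta-learning
```

New transactions are then classified with `classify(amount, thresholds, meta)`; rerunning the training steps on fresh labeled data corresponds to the periodic repetition in the last bullet.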
Credit Card Fraud Detection • This approach may result in a large number of local classifiers • Pruning techniques are used to remove redundant classifiers
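The slide does not specify a pruning criterion. One simple possibility, assumed here, is to treat two classifiers as redundant when they give identical predictions on a validation set and to keep only the first of each group. The threshold classifiers and validation amounts are made up.

```python
def make_classifier(threshold):
    """Hypothetical local classifier: fraud when amount >= threshold."""
    return lambda amount: amount >= threshold

def prune_redundant(classifiers, validation_inputs):
    """Keep one classifier per distinct prediction pattern on the validation set."""
    kept, seen = [], set()
    for clf in classifiers:
        signature = tuple(clf(x) for x in validation_inputs)
        if signature not in seen:
            seen.add(signature)
            kept.append(clf)
    return kept

classifiers = [make_classifier(t) for t in (700, 710, 900)]
validation_amounts = [12, 45, 720, 990]
kept = prune_redundant(classifiers, validation_amounts)
```

The classifiers with thresholds 700 and 710 agree on every validation amount, so only one of them survives alongside the 900 classifier.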
Distributed Data Mining (DDM) in the Grid • Such an environment can be thought of as a data grid • Grid: a geographically distributed platform of heterogeneous machines accessible through a single interface • Data grid: a grid designed to allow large data sets to be stored and moved easily; it handles data sets without constant or repeated authentication and supports distributed data-intensive applications
DM in the Grid • Motivation: a data grid with a high-performance, secure, and robust data transfer mechanism • Together with: a set of tools for creating and manipulating replicas of large data sets; a mechanism for maintaining a catalog of data set replicas • The Knowledge Grid was introduced to provide this infrastructure
Knowledge Grid • The Knowledge Grid is defined on top of grid toolkits and services • It can be used to: • Perform data mining on very large data sets available over grids • Make scientific discoveries • Improve industrial processes • Uncover valuable business information
Knowledge Grid • The Knowledge Grid has two hierarchical levels: • Core K-Grid layer • High-level K-Grid layer
Knowledge Grid: Core K-Grid • The Core K-Grid layer offers basic services for the definition, composition, and execution of distributed knowledge discovery • Main services: • Knowledge Directory Service: manages metadata and knowledge tools • Resource Allocation and Execution Management: finds the best mapping between an execution plan and the available resources to meet application requirements
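As a rough illustration of mapping an execution plan onto resources, the sketch below uses a greedy heuristic: take the largest task first and place it on the node that would finish it earliest. This is not the Knowledge Grid's actual allocation algorithm, a greedy choice is not guaranteed to be the best mapping, and the task sizes, node names, and speeds are all hypothetical.

```python
def map_tasks(tasks, node_speeds):
    """Greedy mapping of tasks (work units) to nodes (work units per second)."""
    finish = {node: 0.0 for node in node_speeds}  # time at which each node is free
    mapping = {}
    for task, work in sorted(tasks.items(), key=lambda kv: -kv[1]):
        # Choose the node on which this task would complete earliest.
        node = min(finish, key=lambda n: finish[n] + work / node_speeds[n])
        finish[node] += work / node_speeds[node]
        mapping[task] = node
    return mapping, max(finish.values())

# Hypothetical execution plan and grid nodes.
tasks = {"extract": 40, "cluster": 100, "classify": 60}
node_speeds = {"node_a": 10, "node_b": 5}
mapping, makespan = map_tasks(tasks, node_speeds)
```

A real allocation service would also consult the metadata in the Knowledge Directory Service, for example data location and tool availability, before choosing a node.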
Knowledge Grid: High-Level K-Grid • The High-Level K-Grid layer includes services to compose, validate, and execute parallel and distributed knowledge discovery computations • It also provides services to store and analyze the discovered knowledge
Knowledge Grid: High-Level K-Grid • Main services: • Data Access Service (DAS): search, selection, extraction, transformation, and delivery of the data to be mined • Tools and Algorithms Access Service (TAAS): search, selection, and downloading of data mining tools • Execution Plan Management Service (EPMS): a semi-automatic tool that takes data and programs and generates different execution plans • Results Presentation Service (RPS): generates, presents, visualizes, and stores knowledge models
Conclusion • The demand for turning data into knowledge is very high • Data mining is about discovering valuable information • Data mining needs two main components: • Data • Efficient algorithms • Parallel computing can lead to very efficient data mining algorithms
Conclusion • Data mining on a grid, or Knowledge Grid, is a very efficient solution that distributes the data mining process • It also preserves data privacy, since each data source is responsible for the computation on its own data
References • Cannataro, M., Talia, D., Trunfio, P. “Distributed data mining on the grid”. Future Generation Computer Systems 18 (2002) 1101–1112 • Fran, W. “Distributed data mining in credit card fraud detection”. IEEE Intelligent Systems & Their Applications 6 (14) (2000) 67 • Lou, P., Lu, K., Shi, Z., He, Q. “Distributed data mining in grid computing environments”. Future Generation Computer Systems 23 (2007) 84–91 • Talia, D., Trunfio, P. “How Distributed Data Mining Tasks can Thrive as Knowledge Services”. 7(53) (2010) 132–137 • Zomaya, A., El-Ghazawi, T., Frieder, O. “Parallel and Distributed Computing for Data Mining”. IEEE Concurrency (October-December) (1999) 11–13