Parallel and Distributed Computing for Data Mining

Parallel and Distributed Computing for Data Mining Mohammad Alshamimi April 28, 2011

Outline • What is data mining • Techniques of Data Mining • Distributed Data Mining • Example: Credit Card Fraud Detection • Knowledge Grid

What is data mining? • Data mining is the discovery of valuable information in large volumes of data, using computationally efficient techniques. • It is also called “knowledge discovery”.

Techniques in Data Mining • Clustering • Association rules • Classification • Sequential pattern • Outlier detection • Decision Trees

Techniques in Data Mining: Clustering • Clustering partitions a given set of data points into distinct groups (clusters) based on similarity, so that similar data points fall in the same cluster • E.g. group companies with similar stock behavior or similar growth, or identify genes and proteins that have similar functions
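
To make the grouping idea concrete, here is a minimal sketch of k-means, one common clustering algorithm. The slides do not name a specific algorithm, so the choice of k-means, the function name `kmeans`, and the sample points are all assumptions made for illustration.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means on 2-D points; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(
                range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2
                + (p[1] - centroids[c][1]) ** 2,
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return centroids, labels

# Hypothetical data: two visibly separated groups of points.
points = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8), (8.0, 8.2), (7.9, 8.1), (8.3, 7.7)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

The first three points end up sharing one label and the last three the other; which group gets label 0 depends on the random start.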

Techniques in Data Mining: Association • Association rule mining finds all rules that correlate the presence of one set of items with that of another set of items • E.g. identify the items that sell together in a supermarket by mining the sales transactions
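
As an illustration of the supermarket example, the sketch below scores one candidate rule with the two standard measures, support and confidence. The basket contents and the helper names `count`, `support` and `confidence` are hypothetical; a real miner such as Apriori would also enumerate the candidate rules.

```python
# Hypothetical sales transactions, one set of items per basket.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

def count(itemset, baskets):
    """Number of baskets that contain every item in itemset."""
    return sum(itemset <= basket for basket in baskets)

def support(itemset, baskets):
    """Fraction of all baskets that contain the itemset."""
    return count(itemset, baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Of the baskets with the antecedent, the fraction that also hold the consequent."""
    return count(antecedent | consequent, baskets) / count(antecedent, baskets)

rule_support = support({"bread", "milk"}, transactions)          # 3 of 5 baskets
rule_confidence = confidence({"bread"}, {"milk"}, transactions)  # 3 of 4 bread baskets
print(rule_support, rule_confidence)
```

A rule is usually kept only if both measures exceed user-chosen minimums.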

Techniques in Data Mining: Classification • Classification assigns objects to predefined categories or classes • E.g. in credit evaluation, each case is classified as good or bad

Techniques in Data Mining: Sequential pattern discovery • Sequential pattern discovery determines strong sequential dependencies among different events • E.g. medical diagnosis or sales-transactions analysis to determine which customers are likely to buy a specific product in the future
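
A very reduced sketch of the sales-transaction example: count how many customer histories contain item A some time before item B, and keep the ordered pairs that are frequent. The purchase histories, the function name `sequential_pairs` and the 50% threshold are assumptions; full sequential-pattern miners handle longer patterns and time constraints.

```python
from collections import Counter
from itertools import combinations

def sequential_pairs(sequences, min_support):
    """Return ordered pairs (a, b), a bought before b, found in enough sequences."""
    counts = Counter()
    for seq in sequences:
        # combinations() keeps the original order, so (a, b) means a came first.
        # The set makes each pair count once per sequence.
        pairs = {(a, b) for a, b in combinations(seq, 2) if a != b}
        counts.update(pairs)
    threshold = min_support * len(sequences)
    return {pair: n for pair, n in counts.items() if n >= threshold}

# Hypothetical purchase histories, one list per customer, oldest purchase first.
histories = [
    ["phone", "case", "charger"],
    ["phone", "charger"],
    ["case", "phone", "charger"],
    ["laptop", "mouse"],
]
frequent = sequential_pairs(histories, min_support=0.5)
print(frequent)
```

Here a customer who has bought a phone but no charger yet would be a candidate for a charger offer, because the pair (phone, charger) appears in three of the four histories.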

Techniques in Data Mining: Detection of outliers • Outlier detection finds data points that differ significantly from the majority of the data points in a given data set • E.g. medical diagnosis and credit card fraud detection
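
One simple way to flag points that "differ significantly from the majority" is a robust score based on the median and the median absolute deviation (MAD). This is only one of many possible detectors; the cutoff of 3.5, the function name `mad_outliers` and the sample amounts are assumptions for illustration.

```python
from statistics import median

def mad_outliers(values, threshold=3.5):
    """Flag values whose modified z-score (median/MAD based) exceeds threshold."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread to measure against, so nothing is flagged
    # 0.6745 rescales the MAD so the score is comparable to a standard z-score.
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]

# Hypothetical transaction amounts with one unusual value.
amounts = [20, 22, 19, 21, 23, 20, 500]
outliers = mad_outliers(amounts)
print(outliers)
```

The median is used instead of the mean because a single extreme value inflates the mean and standard deviation enough to hide itself.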

Techniques in Data Mining: Decision Trees • A decision tree is a classifier built as a tree of decision points, each chosen because it splits the remaining data effectively • Example: the decision to play golf • Attributes: Outlook, Windy, Humidity
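
To show what "effective to split" can mean, the sketch below picks the first decision point for the golf example by information gain, the criterion used by ID3-style tree builders. The slides do not say which criterion is intended, and the 14 rows are an illustrative table, not data from the presentation.

```python
from collections import Counter
from math import log2

# Illustrative rows: (outlook, humidity, windy, play). Not taken from the slides.
ROWS = [
    ("sunny", "high", False, "no"),
    ("sunny", "high", True, "no"),
    ("overcast", "high", False, "yes"),
    ("rainy", "high", False, "yes"),
    ("rainy", "normal", False, "yes"),
    ("rainy", "normal", True, "no"),
    ("overcast", "normal", True, "yes"),
    ("sunny", "high", False, "no"),
    ("sunny", "normal", False, "yes"),
    ("rainy", "normal", False, "yes"),
    ("sunny", "normal", True, "yes"),
    ("overcast", "high", True, "yes"),
    ("overcast", "normal", False, "yes"),
    ("rainy", "high", True, "no"),
]
ATTRIBUTES = {"outlook": 0, "humidity": 1, "windy": 2}

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, column):
    """Entropy reduction obtained by splitting rows on the given column."""
    base = entropy([r[-1] for r in rows])
    remainder = 0.0
    for value in {r[column] for r in rows}:
        subset = [r[-1] for r in rows if r[column] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

gains = {name: information_gain(ROWS, col) for name, col in ATTRIBUTES.items()}
best = max(gains, key=gains.get)
print(best, gains)
```

On this table Outlook gives the largest gain, so it would become the root; the same selection is then repeated on each branch's remaining rows.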

Distributed Data Mining (DDM) • Goal: increase awareness of the benefits of using parallel and distributed computing platforms to solve problems in data-mining applications • Discover hidden patterns in a complete data set that is partitioned and physically distributed • Better than centralized DM in: privacy, data transmission bandwidth

Distributed Data Mining (DDM) • Can be implemented as a service at each database plus a global service controller • A global brokering service coordinates a group of expert services • Each service performs local analysis on a particular data partition • The global service performs further analysis on the local results and integrates them into a global result
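
The local-then-global pattern can be sketched with a deliberately small "analysis": each site summarizes its own partition, and the global service combines only the summaries. The site names, the values and the functions `local_analysis` and `global_integration` are hypothetical, and the sites are simulated in one process rather than over a network.

```python
def local_analysis(partition):
    """Runs at a data source: summarize the local partition without exporting rows."""
    return {"count": len(partition), "total": sum(partition)}

def global_integration(summaries):
    """Runs at the global service: merge local summaries into one global result."""
    count = sum(s["count"] for s in summaries)
    total = sum(s["total"] for s in summaries)
    return total / count

# Hypothetical partitions held at three separate databases.
sites = {
    "site_a": [10, 20, 30],
    "site_b": [40, 50],
    "site_c": [60],
}
summaries = [local_analysis(data) for data in sites.values()]
global_mean = global_integration(summaries)
print(global_mean)
```

Only two numbers per site cross the network, which is where the bandwidth and privacy advantages over centralized mining come from.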

Distributed Data Mining (DDM) • Challenges: • Heterogeneity: nodes or data sources may not run the same system or the same hardware • Autonomy: each node has its own control over its data and management

Credit Card Fraud Detection • Credit cards have become the standard way to pay on the web and in e-commerce • The risk of fraudulent transactions is very high • Data mining is used to detect fraudulent transactions

Credit Card Fraud Detection • Challenges facing DM in this problem: • Billions of credit card transactions are processed daily (massive amount of data) • Data is highly skewed: many more transactions are legitimate than fraudulent • Each transaction record has a different amount, so the potential loss is variable and there is no fixed misclassification cost
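
The point about variable loss can be illustrated with a small cost calculation. The cost model here is an assumption made for the sketch: a missed fraud costs the full transaction amount, and every flagged transaction costs a fixed investigation overhead. The constant `OVERHEAD`, the function `total_cost` and the sample transactions are hypothetical.

```python
OVERHEAD = 5.0  # hypothetical cost of investigating one flagged transaction

def total_cost(transactions, flags):
    """Sum the cost of a set of fraud/no-fraud decisions under the assumed model."""
    cost = 0.0
    for (amount, is_fraud), flagged in zip(transactions, flags):
        if flagged:
            cost += OVERHEAD   # every alert is investigated, fraud or not
        elif is_fraud:
            cost += amount     # missed fraud loses the whole amount
    return cost

# Hypothetical labeled transactions: (amount, is_fraud).
transactions = [(120.0, True), (15.0, False), (900.0, True), (40.0, False)]

flag_nothing = [False] * len(transactions)
flag_large = [amount >= 100 for amount, _ in transactions]

cost_nothing = total_cost(transactions, flag_nothing)
cost_large = total_cost(transactions, flag_large)
print(cost_nothing, cost_large)
```

Two classifiers with the same error count can therefore differ widely in cost, which is why plain accuracy is a poor yardstick on this problem.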

Credit Card Fraud Detection • Approach: • Prepare a set of labeled data • Divide the large labeled data set into smaller subsets • Distribute the subsets to different processors • Each processor creates a local classifier • Integrate the local classifiers into a global classifier using meta-learning • Use the global classifier to classify new transactions • Repeat the process periodically • The classifier can be a decision tree, a neural network or another type of classifier
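
The steps above can be sketched end to end on toy data. Two simplifications are assumed: the "processors" are simulated sequentially in one process, and the local classifiers are combined by majority vote, which is a simpler stand-in for the meta-learning step named on the slide. The one-feature threshold classifier, the function names and the amounts are all hypothetical.

```python
def split(data, n_parts):
    """Divide the labeled data into n_parts interleaved subsets."""
    return [data[i::n_parts] for i in range(n_parts)]

def train_local(partition):
    """Local classifier: threshold at the midpoint between the two class means.

    Assumes the partition contains at least one example of each class.
    """
    fraud = [amount for amount, label in partition if label == 1]
    legit = [amount for amount, label in partition if label == 0]
    threshold = (sum(fraud) / len(fraud) + sum(legit) / len(legit)) / 2
    return lambda amount: 1 if amount > threshold else 0

def combine(local_classifiers):
    """Global classifier: majority vote over the local classifiers."""
    def predict(amount):
        votes = sum(clf(amount) for clf in local_classifiers)
        return 1 if votes * 2 > len(local_classifiers) else 0
    return predict

# Hypothetical labeled transactions: (amount, label), label 1 = fraud.
labeled = [
    (12, 0), (950, 1), (30, 0), (700, 1), (25, 0), (820, 1),
    (18, 0), (40, 0), (990, 1), (22, 0), (15, 0), (760, 1),
]
partitions = split(labeled, 3)                     # one subset per processor
local_classifiers = [train_local(p) for p in partitions]
classify = combine(local_classifiers)

print(classify(800), classify(35))
```

In a real deployment each `train_local` call would run on a different machine, and only the trained classifiers, not the transactions, would be sent to the combiner.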

Credit Card Fraud Detection • This approach may result in a large number of local classifiers • Pruning techniques are used to remove redundant classifiers
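
The slides do not say how redundancy is measured. As one possible reading, the sketch below treats a classifier as redundant when it makes exactly the same predictions on a validation set as a classifier that is already kept. The functions `prune_redundant` and `threshold_classifier`, the thresholds and the validation amounts are hypothetical.

```python
def prune_redundant(classifiers, validation_inputs):
    """Keep one classifier per distinct prediction pattern on the validation set."""
    kept, seen = [], set()
    for clf in classifiers:
        signature = tuple(clf(x) for x in validation_inputs)
        if signature not in seen:
            seen.add(signature)
            kept.append(clf)
    return kept

def threshold_classifier(threshold):
    """Toy local classifier: flag amounts above the given threshold."""
    return lambda amount: 1 if amount > threshold else 0

# Hypothetical local classifiers, several of them nearly identical.
classifiers = [threshold_classifier(t) for t in (300, 310, 500, 505, 900)]
validation = [50, 400, 600, 1000]

pruned = prune_redundant(classifiers, validation)
print(len(classifiers), len(pruned))
```

The thresholds 300 and 310 agree on every validation amount, as do 500 and 505, so only three of the five classifiers survive.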

Distributed Data Mining (DDM) in the Grid • Such an environment can be thought of as a Data Grid • Grid: a geographically distributed platform of heterogeneous machines accessible through a single interface • Data Grid: a grid designed to allow large data sets to be stored and moved easily; it handles data sets without constant or repeated authentication and supports distributed data-intensive applications

DM in the Grid • Motivation: a data grid that provides a high-performance, secure and robust data transfer mechanism • Together with: a set of tools for creating and manipulating replicas of large data sets; a mechanism for maintaining a catalog of data set replicas • The Knowledge Grid is introduced to provide this infrastructure

Knowledge Grid • Knowledge Grid: defined on top of grid toolkits and services • The Knowledge Grid can be used to: perform data mining on very large data sets available over grids; make scientific discoveries; improve industrial processes; uncover valuable business information

Knowledge Grid • The Knowledge Grid has two hierarchical levels: • Core K-Grid layer • High-level K-Grid layer

Knowledge Grid: Core K-Grid • The Core K-Grid layer offers basic services for the definition, composition and execution of distributed knowledge discovery • Main services: • Knowledge Directory Service: manages metadata and knowledge tools • Resource Allocation and Execution Management: finds the best mapping between an execution plan and the available resources to meet application requirements

Knowledge Grid: High-Level K-Grid • The High-Level K-Grid layer includes services to compose, validate and execute parallel and distributed knowledge discovery computations • It also provides services to store and analyze the discovered knowledge

Knowledge Grid: High-Level K-Grid • Main services: • Data Access Service (DAS): search, selection, extraction, transformation and delivery of the data to be mined • Tools and Algorithms Access Service (TAAS): search, selection and downloading of data mining tools • Execution Plan Management Service (EPMS): a semi-automatic tool that takes data and programs and generates different execution plans • Results Presentation Service (RPS): generates, presents, visualizes and stores knowledge models

Conclusion • There is a strong demand for turning data into knowledge • Data mining is about discovering valuable information • Data mining needs two main components: data and efficient algorithms • Parallel computing can lead to very efficient algorithms when used in data mining

Conclusion • Data mining in the Grid, or Knowledge Grid, is a very efficient solution that distributes the data mining process • It also preserves data privacy, since each data source is responsible for the computation on its own data
