The Power of Data Mining and Machine Learning Techniques for Network Construction and Analysis

Reda Alhajj
University of Calgary, Calgary, Alberta, Canada
Global University, Beirut, Lebanon
alhajj@ucalgary.ca

General Overview
• The network model provides a powerful platform to study a group of entities and their relationships.
• The semantics of the links in the network are determined by the application domain under investigation.
• A network can be constructed by considering the pairwise correlation between entities, or by deriving the correlation between two entities from a global view of the data.
• Data mining and machine learning techniques allow for better investigation because they take a global view of the data to derive the strength of pairwise links.
• Combining data mining, machine learning and network analysis leads to a comprehensive and robust framework for data analysis.

Outline of the talk
• Background on association rules mining (ARM), clustering, the network model and fuzziness
• From frequent pattern mining (FPM), ARM and clustering to a network
• Some application domains:
  • database design
  • web mining
  • terror network analysis
  • outlier detection
  • disease biomarkers
  • database search
• Conclusions and research directions

Overview of Association Rules Mining
• A general model for mining domains where there is a many-to-many relationship between two sets of entities, e.g., baskets and items, or documents and words.
• Consider a set of items I = {I1, I2, I3, …, Im}.
• Consider a database of transactions D, where each transaction T is a set of items such that T ⊆ I.
• If A is a set of items, a transaction T is said to contain A if and only if A ⊆ T.
• An association rule is an implication or correlation of the form A ⇒ B, where A ⊂ I, B ⊂ I, and A ∩ B = ∅.
• Support and confidence are the measures generally used to filter the rules.
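The two filtering measures can be made concrete with a minimal Python sketch. The baskets are hypothetical and the function names `support` and `confidence` are chosen here for illustration. Support is taken as the fraction of transactions containing an itemset, and the confidence of A ⇒ B as support(A ∪ B) / support(A), which are the usual textbook definitions rather than anything specific to this talk.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Share of transactions containing the antecedent that also contain the consequent."""
    both = frozenset(antecedent) | frozenset(consequent)
    return support(both, transactions) / support(antecedent, transactions)


# Hypothetical baskets, not taken from the talk.
baskets = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]

print(support({"bread", "milk"}, baskets))       # 2 of 4 baskets
print(confidence({"bread"}, {"milk"}, baskets))  # 2 of the 3 bread baskets
```

A rule is then kept only if both values reach user-chosen minimum thresholds.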

Association Rules Mining: Two Steps
• In general, association rules mining can be reduced to the following two steps:
  • Find all frequent itemsets. Each of these itemsets occurs at least as frequently as a minimum support count.
  • Generate strong association rules from the frequent itemsets. These rules satisfy the minimum support and confidence measures.
• We use the outcome of the first step in one part of the research and the outcome of the second step in another part.
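The second step can be sketched as follows, assuming the first step has already produced a complete table of frequent itemsets with their support counts. The counts below are hypothetical (they correspond to a toy database of four transactions {A,C,D}, {B,C,E}, {A,B,C,E}, {B,E} with minimum support count 2), and `strong_rules` is an illustrative name, not a function from the talk.

```python
from itertools import combinations


def strong_rules(frequent, min_conf):
    """Split each frequent itemset into antecedent => consequent and keep confident rules.

    `frequent` maps frozenset -> support count and must contain every subset of
    each itemset (true for the complete output of a frequent-itemset miner).
    """
    rules = []
    for itemset, count in frequent.items():
        for size in range(1, len(itemset)):
            for left in combinations(sorted(itemset), size):
                antecedent = frozenset(left)
                conf = count / frequent[antecedent]
                if conf >= min_conf:
                    rules.append((antecedent, itemset - antecedent, conf))
    return rules


# Hypothetical support counts for illustration only.
frequent = {
    frozenset(items): count
    for items, count in {
        "A": 2, "B": 3, "C": 3, "E": 3,
        "AC": 2, "BC": 2, "BE": 3, "CE": 2,
        "BCE": 2,
    }.items()
}

for antecedent, consequent, conf in strong_rules(frequent, min_conf=0.9):
    print(sorted(antecedent), "=>", sorted(consequent), conf)
```

Because every rule is derived from a frequent itemset, the minimum support requirement is met automatically and only confidence has to be checked here.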

Association Rules Mining: Apriori Algorithm
• Any subset of a frequent itemset must be frequent.
• Apriori pruning principle: if an itemset is infrequent, its supersets should not be generated or tested.
• Minimum support = 2
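A compact level-wise sketch of the idea, assuming a toy four-transaction database (hypothetical data, not the figure from the slide) and the same minimum support count of 2. The function name `apriori` and its helper names are illustrative; candidates of size k are built from frequent itemsets of size k-1 and discarded as soon as one of their subsets is known to be infrequent.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {itemset: support count} for all itemsets with count >= min_count."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidate):
        return sum(1 for t in transactions if candidate <= t)

    def keep_frequent(candidates):
        counted = {c: count(c) for c in candidates}
        return {c: n for c, n in counted.items() if n >= min_count}

    # Level 1: frequent single items.
    level = keep_frequent({frozenset([i]) for t in transactions for i in t})
    frequent = dict(level)
    k = 2
    while level:
        previous = set(level)
        # Join step: unions of two frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in previous for b in previous if len(a | b) == k}
        # Prune step: drop a candidate if any (k-1)-subset is infrequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in previous for s in combinations(c, k - 1))
        }
        level = keep_frequent(candidates)
        frequent.update(level)
        k += 1
    return frequent


# Hypothetical toy database.
database = [set("ACD"), set("BCE"), set("ABCE"), set("BE")]
result = apriori(database, min_count=2)
for itemset in sorted(result, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), result[itemset])
```

In this toy run the candidates {A,B,C} and {A,C,E} are pruned without scanning the database, because {A,B} and {A,E} are already known to be infrequent.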

Association Rule Mining: Frequent Closed Itemset
• A frequent itemset X is closed if none of its immediate supersets has the same support as X.
• Example (figure). Image reference: http://www.siam.org/meetings/sdm06/proceedings/038lucchesec.pdf
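The definition translates directly into a filter over a table of frequent itemsets. The sketch below uses hypothetical support counts and an illustrative function name, `closed_itemsets`. It relies on the fact that a superset with the same support as a frequent itemset is itself frequent, so only the frequent table needs to be searched.

```python
def closed_itemsets(frequent):
    """Keep itemsets that have no immediate superset with the same support count."""
    closed = {}
    for itemset, count in frequent.items():
        has_equal_superset = any(
            len(other) == len(itemset) + 1 and itemset < other and other_count == count
            for other, other_count in frequent.items()
        )
        if not has_equal_superset:
            closed[itemset] = count
    return closed


# Hypothetical support counts for illustration only.
frequent = {
    frozenset(items): count
    for items, count in {
        "A": 2, "B": 3, "C": 3, "E": 3,
        "AC": 2, "BC": 2, "BE": 3, "CE": 2,
        "BCE": 2,
    }.items()
}

for itemset, count in closed_itemsets(frequent).items():
    print(sorted(itemset), count)
```

Here {A} is not closed because {A,C} has the same count, while {C} is closed because all of its supersets are less frequent.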

Clustering
• It is an unsupervised learning process.
• It is the process of distributing a given set of data instances into groups such that the similarity of instances is high within each group and low between groups.
• Similarity within a cluster (intra-cluster) is measured using variance, average variance or TWCV.
• Similarity across clusters (inter-cluster) is measured based on linkage.
• For clustering we need to know at least the characteristics of the instances and the similarity measure to be used in the process.
• Various algorithms exist for clustering, e.g., k-means and DBSCAN.
• Each algorithm has its advantages and disadvantages.

Clustering
• Example 1 (figure)
• Example 2 (figure)

Overview of Social Network Analysis
• A social network is a set of entities called actors and the links connecting them.
  • Examples: students enrolled in the same courses, people and likes, etc.
• A social network is mostly represented as a graph called a sociogram.
• Social Network Analysis (SNA) is powerful because it has foundations in mathematics and graph theory.
• SNA provides a set of tools to empirically extend our theoretical intuition of the patterns that compose a social structure.
• SNA provides a set of relational methods for systematically understanding and identifying connections among actors.
• SNA embodies a range of theories relating types of observable social spaces to individual and group behavior.

Social Network Analysis: Centrality Measures
• Degree
  • Sum of connections (sum of the weights of connections in the case of weighted graphs) from or to an actor
• Closeness
  • Distance of one actor to all others in the network
• Betweenness
  • The number of shortest paths that pass through an actor
• Eigenvector
  • Measures the importance of an actor based on the importance of its neighbors
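Three of these measures can be sketched in plain Python for an unweighted, undirected graph stored as an adjacency dictionary. The graph is hypothetical, the function names are illustrative, closeness is normalized as (reachable actors) / (sum of distances), which is one common convention, and betweenness uses Brandes' accumulation scheme. Eigenvector centrality is left out of this sketch.

```python
from collections import deque


def degree_centrality(graph):
    return {u: len(neighbours) for u, neighbours in graph.items()}


def closeness_centrality(graph):
    """(number of reachable actors) / (sum of shortest-path distances to them)."""
    result = {}
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in graph[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total = sum(dist.values())
        result[source] = (len(dist) - 1) / total if total else 0.0
    return result


def betweenness_centrality(graph):
    """Shortest paths through each actor (fractional when paths tie), Brandes' scheme."""
    score = {u: 0.0 for u in graph}
    for s in graph:
        order, dist = [], {s: 0}
        paths = {u: 0 for u in graph}          # number of shortest paths from s
        paths[s] = 1
        preds = {u: [] for u in graph}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in graph[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    paths[v] += paths[u]
                    preds[v].append(u)
        dependency = {u: 0.0 for u in graph}
        for w in reversed(order):
            for u in preds[w]:
                dependency[u] += paths[u] / paths[w] * (1 + dependency[w])
            if w != s:
                score[w] += dependency[w]
    # Each unordered pair was counted from both endpoints in an undirected graph.
    return {u: value / 2 for u, value in score.items()}


# Hypothetical network: A - B - C, with D and E both attached to C.
graph = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D", "E"], "D": ["C"], "E": ["C"]}
print(degree_centrality(graph))
print(closeness_centrality(graph))
print(betweenness_centrality(graph))
```

In this small example actor C scores highest on all three measures, and B comes second on betweenness because every path from A to the rest of the network passes through it.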

Social Network Analysis: Centrality Measures (example)
Two example networks (figures: Example 1, Example 2):
• The red nodes have the highest degree centrality.
• The blue node has the highest closeness and betweenness centrality.
• Node 7 has the highest degree centrality.
• Node 8 has the highest betweenness centrality.
• Nodes 4 and 5 have the highest closeness centrality.
Image references: http://mande.co.uk/special-issues/network-models/ and http://www.biomedcentral.com/

Social Network Analysis: Graph Clustering Algorithms
• MST-based clustering
  • First finds a Minimum Spanning Tree (MST) of the graph
  • Removes the edges with the highest weight from the MST to form clusters of vertices (actors)
• Edge betweenness clustering
  • The betweenness of an edge is defined as the extent to which the edge lies along shortest paths
  • First computes edge betweenness for all edges in the current graph
  • Removes the edges having the highest betweenness from the graph
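The MST-based variant can be sketched with Kruskal's algorithm and a union-find structure. This is only an illustration under stated assumptions: the weighted graph is hypothetical, it is assumed to be connected, the number of clusters k is passed in by the caller (the slides do not say how many edges to remove), and `mst_clusters` is an illustrative name. Edge betweenness clustering is not covered here.

```python
def mst_clusters(nodes, edges, k):
    """Build an MST with Kruskal's algorithm, drop its k-1 heaviest edges,
    and return the resulting connected components as sorted lists."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    mst = []
    for weight, u, v in sorted(edges):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:               # the edge joins two different trees
            parent[root_u] = root_v
            mst.append((weight, u, v))

    # Keep everything except the k-1 heaviest MST edges.
    kept = sorted(mst)[: max(len(mst) - (k - 1), 0)]

    # Rebuild the components from the kept edges only.
    for n in nodes:
        parent[n] = n
    for _, u, v in kept:
        parent[find(u)] = find(v)

    clusters = {}
    for n in nodes:
        clusters.setdefault(find(n), []).append(n)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical weighted graph given as (weight, node, node) triples.
nodes = ["a", "b", "c", "d", "e"]
edges = [(1, "a", "b"), (2, "b", "c"), (8, "c", "d"), (1, "d", "e"), (9, "a", "e")]
print(mst_clusters(nodes, edges, k=2))
print(mst_clusters(nodes, edges, k=3))
```

With k=2 the single heaviest MST edge (c, d) is removed, which separates {a, b, c} from {d, e}.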

One Mode versus Two Mode Networks
• Queries (users) versus tables is a two-mode network.
• Folding is used to produce one-mode networks from a two-mode network.
• Folding is simply the multiplication of the adjacency matrix of the two-mode network by its transpose.
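Folding can be shown on a small queries-by-tables incidence matrix. The matrix below is hypothetical, and the helper names `transpose`, `matmul` and `fold` are illustrative. With B holding one row per query and one column per table, B·Bᵀ gives the query-to-query network and Bᵀ·B gives the table-to-table network.

```python
def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def fold(incidence, onto="columns"):
    """Fold a two-mode incidence matrix onto its column entities or its row entities."""
    if onto == "columns":
        return matmul(transpose(incidence), incidence)
    return matmul(incidence, transpose(incidence))


# Hypothetical two-mode data: rows are queries Q1..Q3,
# columns are the tables EMPLOYEE, DEPARTMENT, PROJECT.
incidence = [
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 1],
]

table_network = fold(incidence, onto="columns")
query_network = fold(incidence, onto="rows")
print(table_network)
print(query_network)
```

Off-diagonal entries count shared partners (for tables, the number of queries that use both). Diagonal entries count how often each entity occurs and are usually ignored or zeroed before network analysis.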

Fuzzy Sets
• Generalizes classical set theory by means of a characteristic membership function.
• A membership function introduces a grey area between the black and white areas.
• Consider a fuzzy set A, its domain D, and an object x.
• The membership function µ specifies the degree of membership of x in A:
  • µA(x): D → [0, 1]
  • µA(x) = 0 means x does not belong to A.
  • µA(x) = 1 means x completely belongs to A.
  • Intermediate values 0 < µA(x) < 1 represent varying degrees of membership.

Example on Membership
The ranges of fuzzy sets:
Income        Range        Centroid
Quite poor    10-10-30     -
Poor          10-30-70     30
Moderate      30-70-120    70
Rich          70-120-120   -
Figure: the membership functions found according to the centroids (membership axis from 0.0 to 1.0).
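One plausible reading of the table is that each range triple gives the left foot, peak and right foot of a triangular membership function, with the first and last sets acting as shoulders. That reading is an assumption made for this sketch, as are the function name `triangular` and the choice to return 0 outside the listed range.

```python
def triangular(x, left, peak, right):
    """Membership degree of x in a triangular fuzzy set (left, peak, right).

    left == peak or peak == right gives a one-sided (shoulder) shape.
    Values outside [left, right] get membership 0 in this sketch.
    """
    if x < left or x > right:
        return 0.0
    if x == peak:
        return 1.0
    if x < peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


# Range triples copied from the table; reading them as triangles is an assumption.
income_sets = {
    "Quite poor": (10, 10, 30),
    "Poor": (10, 30, 70),
    "Moderate": (30, 70, 120),
    "Rich": (70, 120, 120),
}


def memberships(income):
    return {name: triangular(income, *triple) for name, triple in income_sets.items()}


print(memberships(50))
print(memberships(120))
```

Under this reading an income of 50 belongs to Poor and to Moderate with degree 0.5 each, which is the kind of grey area the fuzzy set definition describes.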

From FPM to Network Construction
• Given a data set of M instances and N features per instance
• Prepare the data for FPM by deciding on the baskets and items. Keep in mind that the items are the actors in the network.
• Apply the FPM algorithm of your choice to find frequent sets of items; it is possible to narrow these down to closed or maximal frequent patterns (FPs).
• Construct the network from the frequent sets as follows:
  • Add a link between two actors i and j iff i and j exist together in at least one FP; the weight of the link is set to the number of common FPs.
  • It is possible to normalize the weights and/or remove some links based on certain criteria, such as a below-average weight or a weight below a predefined threshold.
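The construction step can be sketched as pair counting over the mined patterns. The patterns below are hypothetical, the function names are illustrative, and two choices are assumptions made here because the slide leaves them open: weights are normalized by the maximum weight, and "below average" links are those whose weight is under the mean link weight.

```python
from collections import Counter
from itertools import combinations


def pattern_network(patterns):
    """Edge weight = number of frequent patterns in which both actors occur."""
    weights = Counter()
    for pattern in patterns:
        for i, j in combinations(sorted(pattern), 2):
            weights[(i, j)] += 1
    return dict(weights)


def normalize(weights):
    """Scale weights into (0, 1] by dividing by the largest weight (one possible choice)."""
    top = max(weights.values())
    return {edge: w / top for edge, w in weights.items()}


def drop_below_average(weights):
    """Remove links whose weight is below the mean link weight."""
    mean = sum(weights.values()) / len(weights)
    return {edge: w for edge, w in weights.items() if w >= mean}


# Hypothetical closed frequent patterns.
patterns = [{"C"}, {"A", "C"}, {"B", "E"}, {"B", "C", "E"}]

network = pattern_network(patterns)
print(network)
print(drop_below_average(normalize(network)))
```

Single-item patterns contribute no links, and in this toy case only the B-E link survives the below-average filter.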

From FPM to Network Construction

From ARM to Network Construction
• Given a data set of M instances and N features per instance
• Prepare the data for ARM by deciding on the baskets and items. Keep in mind that the items are the actors in the network; they will form the antecedents and consequents of the rules.
• Apply the ARM algorithm of your choice to find all association rules (ARs) that satisfy certain criteria.
• Construct the network from the ARs as follows:
  • Add a link between two actors i and j iff i and j exist together in at least one AR; the weight of the link is set to the number of common ARs. It is possible to concentrate on the antecedent, the consequent or both.
  • It is possible to normalize the weights and/or remove some links based on certain criteria, such as a below-average weight or a weight below a predefined threshold.
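A sketch of the rule-based construction, assuming rules are available as (antecedent, consequent) pairs. The rules are hypothetical, and the `scope` parameter is an illustrative way to express the choice between antecedent, consequent or both; the talk does not prescribe an interface.

```python
from collections import Counter
from itertools import combinations


def rule_network(rules, scope="both"):
    """Edge weight = number of rules in which both actors occur.

    scope="antecedent" or "consequent" restricts co-occurrence to that side
    of the rule; scope="both" uses all items of the rule.
    """
    weights = Counter()
    for antecedent, consequent in rules:
        if scope == "antecedent":
            items = set(antecedent)
        elif scope == "consequent":
            items = set(consequent)
        else:
            items = set(antecedent) | set(consequent)
        for i, j in combinations(sorted(items), 2):
            weights[(i, j)] += 1
    return dict(weights)


# Hypothetical association rules as (antecedent, consequent) pairs.
rules = [
    ({"A"}, {"C"}),
    ({"B"}, {"E"}),
    ({"E"}, {"B"}),
    ({"B", "C"}, {"E"}),
    ({"C", "E"}, {"B"}),
]

print(rule_network(rules))
print(rule_network(rules, scope="antecedent"))
```

Restricting the scope changes the network considerably: with these rules the antecedent-only network has just two links.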

From ARM to Network Construction

From Clustering to Network Construction
• Given a data set of M instances and N features per instance
• Prepare the data for clustering by deciding on the features to consider in computing the similarity measure.
• Either apply one clustering algorithm several times with different values of its input parameters, or apply several clustering algorithms, to obtain one clustering solution per run.
• Construct the network from the clusters as follows:
  • Add a link between two actors i and j iff i and j exist together in the same cluster in at least one clustering solution; the weight of the link is set to the number of common clusters across the solutions.
  • It is possible to normalize the weights and/or remove some links based on certain criteria, such as a below-average weight or a weight below a predefined threshold.
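The following sketch runs one clustering algorithm twice with different parameters and links actors that share a cluster. Everything concrete in it is an assumption for illustration: the one-dimensional data, the small deterministic k-means variant (`kmeans_1d`, seeded with evenly spaced values so that runs are repeatable, and requiring k ≥ 2), and the choice of k = 2 and k = 3.

```python
from collections import Counter
from itertools import combinations


def kmeans_1d(values, k, iterations=100):
    """Lloyd's k-means on 1-D data; returns {name: cluster index}. Requires k >= 2."""
    names = sorted(values, key=values.get)
    n = len(names)
    # Deterministic start: k values spread evenly over the sorted data.
    centroids = [values[names[round(i * (n - 1) / (k - 1))]] for i in range(k)]
    labels = {}
    for _ in range(iterations):
        new_labels = {
            name: min(range(k), key=lambda c: abs(values[name] - centroids[c]))
            for name in names
        }
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [values[name] for name in names if labels[name] == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return labels


def co_cluster_network(solutions):
    """Edge weight = number of clustering solutions that put both actors together."""
    weights = Counter()
    for labels in solutions:
        for i, j in combinations(sorted(labels), 2):
            if labels[i] == labels[j]:
                weights[(i, j)] += 1
    return dict(weights)


# Hypothetical actors with a single numeric feature.
points = {"a": 1.0, "b": 2.0, "c": 8.0, "d": 9.0, "e": 20.0}
solutions = [kmeans_1d(points, k) for k in (2, 3)]
network = co_cluster_network(solutions)
print(network)
```

Pairs that stay together in both runs (a-b and c-d) receive weight 2, pairs that are grouped only in the coarser run receive weight 1, and the isolated actor e gets no links.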

Network Construction: Multiple clustering solutions

From the Data to Network Construction
• Given a data set of M instances and N features per instance
• Prepare the data for processing by deciding on the P features to consider in the analysis.
• Construct an M×P matrix A by taking every instance as a row and every feature as a column.
• Find the transpose of matrix A.
• Multiply matrix A by its transpose to get the M×M adjacency matrix of the target network.
• It is possible to normalize the weights and/or remove some links based on certain criteria, such as a below-average weight or a weight below a predefined threshold.

NetDriller: A Powerful Social Network Analysis Tool*
Negar Koochakzadeh, Atieh Sarraf, Keivan Kianmehr, Jon Rokne, Reda Alhajj
{nkoochak, sarrafsa}@ucalgary.ca, kkianmeh@uwo.ca, {alhajj, rokne}@ucalgary.ca
• Social Network Analysis (SNA) is a technique first used in sociology.
• Recently, computer scientists have realized that this model is general enough to be applied to any domain where the entities and their interconnections can be separated into actors and their links, respectively.
• Data mining techniques can strengthen SNA.
• Network construction from a raw dataset of people and their attributes (figure)
• Searching in the network:
  • Example 1: Find individuals who could monitor the information flow in an organization better than most others.
  • Example 2: Find individuals who have the best picture of what is happening in the network as a whole.
• Closeness centrality reveals how long it takes information to spread from one individual to others in the network. Individuals with high closeness have the shortest paths to all others in the network.
• Betweenness centrality indicates the extent to which an individual is a broker of indirect connections among all others in a network. Someone with high betweenness can be thought of as a gatekeeper of information flow. People who occur on many shortest paths between other people have the highest betweenness values.
• Degree centrality indicates the extent to which an individual sends information to, or receives it from, the neighbors.
• Eigenvector centrality is based on the principal eigenvector of the network. A node is central to the extent that its neighbors are central.
• Fuzzy query example: find individuals with high centralities.
  • Social network: based on community detection
  • Fuzzy sets: based on multi-objective GA optimization
  • Fuzzy query result: color hue shows DofM
http://cpsc.ucalgary.ca/~nkoochak/NetDriller/
* ICDM 2011, IEEE International Conference on Data Mining

Improving Database Performance by Building and Analyzing a Network of Tables from Query Access Patterns

Problem Definition
• Response time in a distributed or parallel database system is largely determined by how data is organized and stored on different machines/sites.
• The goal is to place related data on nearby, or preferably the same, sites to minimize the response time.
• The study of data distribution requires solving two problems:
  • The partitioning problem
  • The allocation problem

Queries (users) versus Tables

Overview of the analysis process
• Three main steps:
  • Considering tables as items and queries as transactions, extract the frequent closed itemsets.
    • A kind of fuzzy set can be built from the closed itemsets in this step.
  • Use the itemsets extracted in the previous step to build the network of tables.
  • Use network analysis to extract information about the tables from the network of tables.

Step 1: Items and Transactions
• Sample database
  • EMPLOYEE (Ssn, Fname, Lname, Dno)
  • DEPARTMENT (Dnumber, Dname)
  • PROJECT (Pnumber, Pname, Plocation, Dno)
• Sample query (Q1)
  • SELECT Lname FROM EMPLOYEE, DEPARTMENT WHERE Dno = Dnumber AND Dname = 'Research'
• Items
  • EMPLOYEE, DEPARTMENT, PROJECT
• Transactions
  • Q1: EMPLOYEE, DEPARTMENT
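Turning a query log into transactions needs the set of tables each query touches. The helper below is a deliberately simple, hypothetical sketch (`tables_in_query` is not part of the talk): it only handles a comma-separated FROM clause like the one in Q1 and ignores JOIN syntax, subqueries and schema-qualified names, which a real SQL parser would have to cover.

```python
import re


def tables_in_query(sql):
    """Return the table names listed in a simple comma-separated FROM clause."""
    match = re.search(
        r"\bFROM\b(.*?)(?:\bWHERE\b|\bGROUP\b|\bORDER\b|;|$)",
        sql,
        flags=re.IGNORECASE | re.DOTALL,
    )
    if not match:
        return frozenset()
    # The first word of each comma-separated part is the table name;
    # anything after it (an alias) is ignored.
    return frozenset(
        part.split()[0].upper() for part in match.group(1).split(",") if part.strip()
    )


q1 = "SELECT Lname FROM EMPLOYEE, DEPARTMENT WHERE Dno = Dnumber AND Dname = 'Research'"
# A second, hypothetical query to show aliases being ignored.
q2 = "select Pname from PROJECT p, DEPARTMENT d where p.Dno = d.Dnumber"

transactions = [tables_in_query(q) for q in (q1, q2)]
print(transactions)
```

Each resulting set is one transaction whose items are table names, which is the input shape a frequent itemset miner expects.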

Step 1: Example (Sample Database)
Sample database schema from Fundamentals of Database Systems, Elmasri/Navathe

Step 1: Example (List of Queries)

Step 1: Example (Closed Itemsets)
• List of frequent closed itemsets with min-support-threshold = 2
• Note: 1-itemsets are omitted from the results

Step 1: Example (Fuzzy Sets)

Example (Fuzzy Sets)

Step 2: Building the Network
• Each item (table) is a node in the network.
• An edge exists between two nodes if they appear together in at least one frequent closed itemset.
• The weight of an edge between two nodes is related to the number of frequent closed itemsets in which the corresponding tables appear together.
• The weight is normalized.

Step 2: Example
Network of tables (figure)
Note: Table DEPT_LOCATIONS is not included in the graph since this table did not appear in any of the queries.

Step 3: Applying Network Analysis
• Various network analysis techniques can be used to extract the relationships between tables from the network of tables.
• Centrality measures can be used to identify the tables that are related to many other tables and consequently play a key role in linking data from different tables together.
• Graph clustering algorithms can be applied to find groups of tables that are frequently accessed together in queries.

Step 3: Example (Centrality Measures)

Step 3: Example (Clustering Results)
• Edge betweenness clusters
  • C1: EMPLOYEE, PROJECT, DEPARTMENT
  • C2: WORKS_ON
  • C3: DEPENDENT
• MST clusters
  • C1: DEPENDENT
  • C2: EMPLOYEE, WORKS_ON, PROJECT
  • C3: DEPARTMENT
• The clustering results may seem meaningless, since in this example the graph consists of 5 highly correlated nodes.

Experiment 1: Centrality Measures
• This experiment was run on a synthetic dataset of 14 tables (T0 to T13) and 20 queries, with min-support-threshold = 2.
• High degree nodes
  • T10: 6
  • T14: 4
• High closeness nodes
  • T10: 0.25
  • T14: 0.20
• High betweenness nodes
  • T10: 86
  • T14: 49

Experiment 1: Clustering Result
• Edge betweenness clusters
  • C1: T11, T12, T13, T14
  • C2: T1, T0, T2
  • C3: T4, T5, T10, T8, T3
• MST clusters
  • C1: T11
  • C2: T4, T3
  • C3: T5, T10, T12, T13, T8, T14, T1, T0, T2

Experiment 2: Centrality Measures
• This experiment was run on a synthetic dataset of 14 tables (T0 to T13) and 30 queries, with min-support-threshold = 1.
• High degree nodes
  • T7: 12
  • T10: 11
• High closeness nodes
  • T10: 0.20
  • T7: 0.19
• High betweenness nodes
  • T7: 43
  • T10: 31

Experiment 2: Clustering Result
• Edge betweenness clusters
  • C1: T6
  • C2: T8
  • C3: T4, T5, T3, T2
  • C4: T1, T0
  • C5: T7, T10, T11, T12, T13, T14, T9
• MST clusters
  • C1: T6, T8
  • C2: T11
  • C3: T7, T9
  • C4: T10, T12, T13, T14, T1, T0, T2
  • C5: T4, T5, T3

To further demonstrate the effectiveness of the proposed approach in practice
• We conducted another experiment using a synthetic query set of 1000 queries on 50 tables.
• Finding real data is very hard because this type of data is very sensitive and hence highly confidential.
• We generated the data by restricting the number of tables that can appear in the same query to at most 20.
• That is, one query may access at most 20 different tables, though in practice a query accesses no more than four or five tables.


These are four example communities:
• {T6, T8, T9, T22, T23, T24, T33}
• {T6, T9, T21, T37, T42, T45}
• {T5, T6, T11, T13, T14, T16, T19}
• {T6, T7, T9, T10, T12, T13, T19}

From Frequent Patterns to Network Construction

Overview
• Given a dataset, e.g., emails exchanged between a group of people, such as employees of the same company
• Partition the dataset into groups based on the criteria to be studied
  • To study the employees, all emails are grouped such that the emails of the same employee form one group
• Decide on the items to be considered in the analysis
  • E.g., each email could be a transaction, and the words/emails within the header/text could be items
• Mine FPs within each group and globally
• Find relevant features for each group based on entropy

The Proposed Framework (diagram)
• Components: Feature Extraction Model, Network Creation Model, Statistical Analysis Model, Front End Interface and Visualization Tool
• Processing steps: mine frequent closed patterns; select suitable features based on entropy ranking; calculate weights of features to create feature vectors
• Data passed between components: Features; Freq. Closed Pats.
