Practical Graph Mining with R Graph-based Proximity Measures Nagiza F. Samatova William Hendrix John Jenkins Kanchana Padmanabhan Arpan Chakraborty Department of Computer Science North Carolina State University
Outline • Defining Proximity Measures • Neumann Kernels • Shared Nearest Neighbor
Similarity and Dissimilarity • Similarity • Numerical measure of how alike two data objects are • Higher when objects are more alike • Often falls in the range [0,1] • Examples: cosine, Jaccard, Tanimoto • Dissimilarity • Numerical measure of how different two data objects are • Lower when objects are more alike • Minimum dissimilarity is often 0 • Upper limit varies • Proximity refers to either a similarity or a dissimilarity Src: “Introduction to Data Mining” by Vipin Kumar et al
Distance Metric • A distance d(p, q) between two points p and q is a dissimilarity measure if it satisfies: 1. Positive definiteness: d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 only if p = q. 2. Symmetry: d(p, q) = d(q, p) for all p and q. 3. Triangle inequality: d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r. • Examples: • Euclidean distance • Minkowski distance • Mahalanobis distance Src: “Introduction to Data Mining” by Vipin Kumar et al
Is this a distance metric? [Figure: example measures, one failing positive definiteness, one failing symmetry, and one failing the triangle inequality; none of them is a distance metric.]
Distance: Euclidean, Minkowski, Mahalanobis • Euclidean: $d(p, q) = \sqrt{\sum_{k=1}^{d} (p_k - q_k)^2}$ • Minkowski: $d(p, q) = \left( \sum_{k=1}^{d} |p_k - q_k|^r \right)^{1/r}$ • Mahalanobis: $d(p, q) = \sqrt{(p - q)^T \Sigma^{-1} (p - q)}$
Standardized/Normalized Euclidean Distance • Standardize each attribute before computing the distance: $p'_k = (p_k - \mu_k) / \sigma_k$, where $\mu_k$ is the mean and $\sigma_k$ is the standard deviation of attribute k. • Standardization is necessary if the attribute scales differ.
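A minimal R sketch of both ideas; the point set X and all variable names are illustrative, not from the slides. scale() standardizes each column, and stats::mahalanobis() returns squared Mahalanobis distances.
X  <- matrix(c(1, 10, 2, 40, 3, 20, 4, 30), ncol = 2, byrow = TRUE)
Xs <- scale(X)                        # standardize each attribute
D  <- dist(Xs, method = "euclidean")  # standardized Euclidean distances
S  <- cov(X)                          # covariance matrix of the attributes
d2 <- mahalanobis(X, center = colMeans(X), cov = S)  # squared Mahalanobis distances from the mean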
Distance Matrix • Input data table: P (file name: points.dat) • Output distance matrix: D
P = as.matrix(read.table(file = "points.dat"))
D = dist(P[, 2:3], method = "euclidean")
L1 = dist(P[, 2:3], method = "minkowski", p = 1)
help(dist)
Src: “Introduction to Data Mining” by Vipin Kumar et al
Covariance of Two Vectors, cov(p, q) • One definition, using the means of the attributes: $\mathrm{cov}(p, q) = \frac{1}{n-1} \sum_{k=1}^{n} (p_k - \bar{p})(q_k - \bar{q})$ • Or a better definition: $\mathrm{cov}(p, q) = E[(p - E[p])(q - E[q])]$, where E is the expected value of a random variable.
Covariance, or Dispersion, Matrix • For N points $x_1, \ldots, x_N$ in d-dimensional space with mean $\mu$, the covariance, or dispersion, matrix is: $\Sigma = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)(x_i - \mu)^T$ • The inverse, $\Sigma^{-1}$, is the concentration matrix or precision matrix.
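A minimal R sketch with illustrative vectors; note that cov() uses the sample 1/(n−1) normalization.
p <- c(1, 2, 3, 4)
q <- c(2, 4, 5, 8)
cov(p, q)         # sample covariance of the two vectors
X <- cbind(p, q)  # N = 4 points in d = 2 dimensions, one row per point
S <- cov(X)       # 2 x 2 covariance (dispersion) matrix
P <- solve(S)     # its inverse: the concentration/precision matrix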
Common Properties of a Similarity • Similarities also have some well-known properties: • s(p, q) = 1 (or maximum similarity) only if p = q • s(p, q) = s(q, p) for all p and q (symmetry) where s(p, q) is the similarity between points (data objects) p and q. Src: “Introduction to Data Mining” by Vipin Kumar et al
Similarity Between Binary Vectors • Suppose p and q have only binary attributes • Compute similarities using the following quantities: • M01 = the number of attributes where p was 0 and q was 1 • M10 = the number of attributes where p was 1 and q was 0 • M00 = the number of attributes where p was 0 and q was 0 • M11 = the number of attributes where p was 1 and q was 1 • Simple Matching and Jaccard Coefficients: SMC = number of matches / number of attributes = (M11 + M00) / (M01 + M10 + M11 + M00) J = number of 1-1 matches / number of not-both-zero attribute values = (M11) / (M01 + M10 + M11) Src: “Introduction to Data Mining” by Vipin Kumar et al
SMC versus Jaccard: Example p = 1 0 0 0 0 0 0 0 0 0 q = 0 0 0 0 0 0 1 0 0 1 M01 = 2 (the number of attributes where p was 0 and q was 1) M10 = 1 (the number of attributes where p was 1 and q was 0) M00 = 7 (the number of attributes where p was 0 and q was 0) M11 = 0 (the number of attributes where p was 1 and q was 1) SMC = (M11 + M00)/(M01 + M10 + M11 + M00) = (0+7) / (2+1+0+7) = 0.7 J = (M11) / (M01 + M10 + M11) = 0 / (2 + 1 + 0) = 0
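The same computation is easy to reproduce in R; a minimal sketch using the vectors above:
p <- c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
q <- c(0, 0, 0, 0, 0, 0, 1, 0, 0, 1)
M11 <- sum(p == 1 & q == 1)  # 0
M00 <- sum(p == 0 & q == 0)  # 7
M10 <- sum(p == 1 & q == 0)  # 1
M01 <- sum(p == 0 & q == 1)  # 2
SMC <- (M11 + M00) / (M00 + M01 + M10 + M11)  # 0.7
J   <- M11 / (M01 + M10 + M11)                # 0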
Cosine Similarity • If d1 and d2 are two document vectors, then cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||), where • indicates the vector dot product and ||d|| is the length of vector d. • Example: d1 = 3 2 0 5 0 0 0 2 0 0 d2 = 1 0 0 0 0 0 0 1 0 2 d1 • d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5 ||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481 ||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = (6)^0.5 = 2.449 cos(d1, d2) = 5 / (6.481 * 2.449) ≈ 0.315 Src: “Introduction to Data Mining” by Vipin Kumar et al
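A minimal R sketch of the same calculation:
d1 <- c(3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
d2 <- c(1, 0, 0, 0, 0, 0, 0, 1, 0, 2)
cos_sim <- sum(d1 * d2) / (sqrt(sum(d1^2)) * sqrt(sum(d2^2)))
cos_sim  # approximately 0.315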
Extended Jaccard Coefficient (Tanimoto) • A variation of Jaccard for continuous or count attributes: $T(p, q) = \frac{p \cdot q}{\|p\|^2 + \|q\|^2 - p \cdot q}$ • Reduces to the Jaccard coefficient for binary attributes. Src: “Introduction to Data Mining” by Vipin Kumar et al
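A minimal R sketch, reusing the same d1 and d2 as in the cosine sketch above:
tanimoto <- sum(d1 * d2) / (sum(d1^2) + sum(d2^2) - sum(d1 * d2))
tanimoto  # 5 / (42 + 6 - 5), approximately 0.116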
Correlation (Pearson Correlation) • Correlation measures the linear relationship between objects. • To compute correlation, we standardize the data objects p and q and then take their dot product: $\mathrm{corr}(p, q) = \frac{\mathrm{cov}(p, q)}{\sigma_p \sigma_q}$ Src: “Introduction to Data Mining” by Vipin Kumar et al
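A minimal R sketch showing that standardizing and taking the (scaled) dot product agrees with the built-in cor(); the vectors are illustrative:
p  <- c(3, 6, 0, 3, 6)
q  <- c(1, 2, 0, 1, 2)
ps <- (p - mean(p)) / sd(p)     # standardize p
qs <- (q - mean(q)) / sd(q)     # standardize q
sum(ps * qs) / (length(p) - 1)  # equals cor(p, q); here 1, since p = 3q
cor(p, q)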
Visually Evaluating Correlation [Figure: scatter plots showing correlation values ranging from –1 to 1.] Src: “Introduction to Data Mining” by Vipin Kumar et al
General Approach for Combining Similarities • Sometimes attributes are of many different types, but an overall similarity is needed. • For the k-th attribute, compute a similarity $s_k(p, q)$ in [0, 1]. • Define an indicator $\delta_k$ that is 0 if the k-th comparison cannot be made (e.g., a missing value, or an asymmetric attribute where both objects are 0) and 1 otherwise. • Combine: $\mathrm{similarity}(p, q) = \frac{\sum_k \delta_k \, s_k(p, q)}{\sum_k \delta_k}$ Src: “Introduction to Data Mining” by Vipin Kumar et al
Using Weights to Combine Similarities • We may not want to treat all attributes the same. • Use weights $w_k$ that are between 0 and 1 and sum to 1: $\mathrm{similarity}(p, q) = \frac{\sum_k w_k \, \delta_k \, s_k(p, q)}{\sum_k w_k \, \delta_k}$ Src: “Introduction to Data Mining” by Vipin Kumar et al
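A minimal R sketch of the weighted combination; the per-attribute similarities s, validity indicators delta, and weights w are all illustrative values:
s     <- c(0.8, 0.5, 1.0)  # similarity of p and q on each attribute
delta <- c(1, 1, 0)        # 1 if the k-th comparison is valid, else 0
w     <- c(0.5, 0.3, 0.2)  # attribute weights, summing to 1
sim   <- sum(w * delta * s) / sum(w * delta)  # weighted overall similarity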
Outline • Defining Proximity Measures • Neumann Kernels • Shared Nearest Neighbor
Neumann Kernels (NK) • A generalization of HITS, named after John von Neumann • Input: undirected or directed graph • Output: within-graph proximity measures • Importance • Relatedness
NK: Citation Graph [Figure: directed citation graph on vertices n1–n8.] • Input: a directed graph whose vertices n1, …, n8 are articles • Edges indicate citations • A citation matrix C can be formed: a cell is 1 if the corresponding edge exists between two vertices, and 0 otherwise
NK: Co-citation Graph [Figure: co-citation graph on vertices n1–n8.] • Co-citation graph: a graph in which two nodes are connected if they appear simultaneously in the reference list of a third node in the citation graph. • In the graph above, n1 and n2 are connected because both are referenced by the same node, n5, in the citation graph. • Co-citation matrix: $C_c = C^T C$
NK: Bibliographic Coupling Graph [Figure: bibliographic coupling graph on vertices n1–n8.] • Bibliographic coupling graph: a graph in which two nodes are connected if they share one or more bibliographic references. • In the graph above, n5 and n6 are connected because both reference the same node, n2, in the citation graph. • Bibliographic coupling matrix: $C_b = C C^T$
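Both derived graphs can be computed directly from the citation matrix in R. A minimal sketch; the citation edges below are illustrative, since the figure's full edge set is not recoverable:
C <- matrix(0, 8, 8)        # C[i, j] = 1 if article i cites article j
C[5, 1] <- 1; C[5, 2] <- 1  # n5 cites n1 and n2 (illustrative)
C[6, 2] <- 1; C[6, 3] <- 1  # n6 cites n2 and n3 (illustrative)
Cc <- t(C) %*% C            # co-citation: Cc[1, 2] = 1, co-cited by n5
Cb <- C %*% t(C)            # bibliographic coupling: Cb[5, 6] = 1, both cite n2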
NK: Document and Term Correlation • Term-document matrix: a matrix in which the rows represent terms, the columns represent documents, and the entries represent a function of their relationship (e.g., the frequency of the given term in the document). • Example: D1: “I like this book” D2: “We wrote this book” • Term-document matrix X:
        D1  D2
I        1   0
like     1   0
this     1   1
book     1   1
we       0   1
wrote    0   1
NK: Document and Term Correlation (2) • Document correlation matrix: a matrix in which the rows and the columns represent documents, and the entries represent the semantic similarity between two documents. • Example: D1: “I like this book” D2: “We wrote this book” • Document correlation matrix K = XᵀX; for the X above:
      D1  D2
D1     4   2
D2     2   4
NK: Document and Term Correlation (3) • Term correlation matrix: a matrix in which the rows and the columns represent terms, and the entries represent the semantic similarity between two terms. • Example: D1: “I like this book” D2: “We wrote this book” • Term correlation matrix T = XXᵀ, a 6 × 6 matrix of term co-occurrence counts.
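A minimal R sketch computing both correlation matrices for the two-document example:
X <- matrix(c(1, 0,   # I
              1, 0,   # like
              1, 1,   # this
              1, 1,   # book
              0, 1,   # we
              0, 1),  # wrote
            ncol = 2, byrow = TRUE)
K  <- t(X) %*% X  # 2 x 2 document correlation matrix: rows (4, 2) and (2, 4)
Tm <- X %*% t(X)  # 6 x 6 term correlation matrix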
Neumann Kernel Block Diagram • Input: a graph • Output: two n × n matrices, K_γ and T_γ • Diffusion/decay factor γ: a tunable parameter that controls the balance between relatedness and importance.
NK: Diffusion Factor - Equation & Effect • The Neumann kernel defines two matrices incorporating a diffusion factor γ: $K_\gamma = K \sum_{n=0}^{\infty} \gamma^n K^n = K (I - \gamma K)^{-1}, \qquad T_\gamma = T \sum_{n=0}^{\infty} \gamma^n T^n = T (I - \gamma T)^{-1}$ • When γ = 0, this simplifies to our definitions of K and T: $K_0 = K$ and $T_0 = T$ (pure relatedness). • When γ approaches its maximal admissible value, the rankings become dominated by the principal eigenvector and converge to HITS-style importance.
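A minimal R sketch of the closed form, assuming the reconstruction above; γ must be small enough that the inverses exist:
neumann_kernel <- function(X, gamma) {
  K  <- t(X) %*% X                               # document correlation
  Tm <- X %*% t(X)                               # term correlation
  Kg <- K  %*% solve(diag(ncol(X)) - gamma * K)  # K_gamma
  Tg <- Tm %*% solve(diag(nrow(X)) - gamma * Tm) # T_gamma
  list(Kg = Kg, Tg = Tg)
}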
NK: Diffusion Factor - Terminology [Figure: small directed graph on vertices A, B, C, D.] • Indegree: the indegree δ⁻(v) of vertex v is the number of edges leading to v; in the figure, δ⁻(B) = 1. • Outdegree: the outdegree δ⁺(v) of vertex v is the number of edges leading away from v; in the figure, δ⁺(A) = 3. • Maximal indegree: Δ⁻ is the maximum of the indegrees over all vertices of the graph; here Δ⁻(G) = 2. • Maximal outdegree: Δ⁺ is the maximum of the outdegrees over all vertices of the graph; here Δ⁺(G) = 3.
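These quantities are one-liners with igraph; a minimal sketch with an illustrative edge set (not the exact graph in the figure):
library(igraph)
g <- graph.formula(A -+ B, A -+ C, A -+ D, C -+ B)  # hypothetical edges
degree(g, v = "B", mode = "in")   # indegree of B
degree(g, v = "A", mode = "out")  # outdegree of A
max(degree(g, mode = "in"))       # maximal indegree of the graph
max(degree(g, mode = "out"))      # maximal outdegree of the graph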
NK: Choice of Diffusion Factor and Its Effect on the Neumann Algorithm • When γ = 0, the Neumann kernel outputs relatedness between documents and between terms. • When γ is larger, the kernel output matches HITS importance.
Comparing NK, HITS, and Co-citation/Bibliographic Coupling [Figure: citation graph on vertices n1–n8.] • HITS authority ranking for the graph above: n3 > n4 > n2 > n1 > n5 = n6 = n7 = n8 • Computing the Neumann kernel with γ = 0.207, the maximum possible value of γ in this case, gives the same ranking: n3 > n4 > n2 > n1 > n5 = n6 = n7 = n8 • For higher values of γ, the Neumann kernel converges to HITS.
Outline • Defining Proximity Measures • Neumann Kernels • Shared Nearest Neighbor
Shared Nearest Neighbor (SNN) • An indirect approach to similarity • Builds a k-nearest-neighbor graph and uses it to determine the similarity between nodes • If two vertices have at least k neighbors in common, they can be considered similar to one another even if no direct link exists between them
SNN - Proximity Graphs • A graph obtained by connecting two points in a set by an edge if the two points are, in some sense, close to each other
SNN – Proximity Graphs (continued) [Figure: various types of proximity graphs on six points: linear, cyclic, and radial.]
SNN – Proximity Graphs (continued) [Figure: other types of proximity graphs: nearest neighbor graph, minimum spanning tree, relative neighbor graph, Gabriel graph, and the Voronoi diagram.]
SNN – Shared Nearest Neighbor Graph • An SNN graph is a special type of kNN graph. • If an edge exists between two vertices, then they both belong to each other's k-neighborhood. In the figure to the left, each of the two black vertices, i and j, has eight nearest neighbors, including each other. Four of those nearest neighbors are shared (shown in red). Thus, the two black vertices are similar in the SNN graph with parameter k = 4.
SNN – The Algorithm
Input: G: an undirected graph
Input: k: a natural number (minimum number of shared neighbors)
for i = 1 to N(G) do
  for j = i + 1 to N(G) do
    counter = 0
    for m = 1 to N(G) do
      if vertex i and vertex j both have an edge with vertex m then
        counter = counter + 1
      end if
    end for
    if counter ≥ k then
      Connect vertex i and vertex j by an edge in the SNN graph
    end if
  end for
end for
return SNN graph
SNN – Time Complexity: O(n³) • Let n be the number of vertices of graph G. • The “for loops” over i and m each iterate once per vertex of G (n times); the “for loop” over j iterates at most n − 1 times (O(n)). • Cumulatively, this results in a total running time of O(n³).
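The triple loop can also be written in matrix form: for a symmetric 0/1 adjacency matrix A, entry (i, j) of A %*% A counts the neighbors that i and j share. A minimal sketch (naive matrix multiplication is likewise O(n³)):
snn_graph <- function(A, k) {
  shared <- A %*% A        # shared-neighbor counts for every vertex pair
  S <- (shared >= k) * 1   # keep pairs with at least k shared neighbors
  diag(S) <- 0             # no self-loops
  S                        # adjacency matrix of the SNN graph
}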
SNN – R Code Example
library(igraph)
library(ProximityMeasure)
data = c(0, 1, 0, 0, 1, 0,
         1, 0, 1, 1, 1, 0,
         0, 1, 0, 1, 0, 0,
         0, 1, 1, 0, 1, 1,
         1, 1, 0, 1, 0, 0,
         0, 0, 0, 1, 0, 0)
mat = matrix(data, 6, 6)
G = graph.adjacency(mat, mode = c("directed"), weighted = NULL)
V(G)$label <- c('A', 'B', 'C', 'D', 'E', 'F')
tkplot(G)
SNN(mat, 2)
Output:
[0] A -- D
[1] B -- D
[2] B -- E
[3] C -- E
SNN – Outlier/Anomaly Detection [Figure: an SNN graph in which a sparsely connected vertex stands out as an outlier/anomaly.]