
Graph-based Proximity Measures


Presentation Transcript


  1. Practical Graph Mining with R Graph-based Proximity Measures Nagiza F. Samatova William Hendrix John Jenkins Kanchana Padmanabhan Arpan Chakraborty Department of Computer Science North Carolina State University

  2. Outline • Defining Proximity Measures • Neumann Kernels • Shared Nearest Neighbor

  3. Similarity and Dissimilarity • Similarity • Numerical measure of how alike two data objects are • Higher when objects are more alike • Often falls in the range [0, 1] • Examples: Cosine, Jaccard, Tanimoto • Dissimilarity • Numerical measure of how different two data objects are • Lower when objects are more alike • Minimum dissimilarity is often 0; the upper limit varies • Proximity refers to either a similarity or a dissimilarity Src: “Introduction to Data Mining” by Vipin Kumar et al

  4. Distance Metric • A distance d(p, q) between two points p and q is a dissimilarity measure if it satisfies: 1. Positive definiteness: d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 only if p = q. 2. Symmetry: d(p, q) = d(q, p) for all p and q. 3. Triangle inequality: d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r. • Examples: • Euclidean distance • Minkowski distance • Mahalanobis distance Src: “Introduction to Data Mining” by Vipin Kumar et al

  5. Is this a distance metric? A measure fails to be a distance metric if it violates any of the three requirements: positive definiteness, symmetry, or the triangle inequality. [Figure: example measures and the properties they violate]

  6. Distance: Euclidean, Minkowski, Mahalanobis • Euclidean: d(p, q) = (Σ_k (p_k − q_k)^2)^0.5 • Minkowski: d(p, q) = (Σ_k |p_k − q_k|^r)^(1/r); r = 1 gives the Manhattan distance and r = 2 the Euclidean distance • Mahalanobis: d(p, q) = ((p − q)^T Σ⁻¹ (p − q))^0.5, where Σ is the covariance matrix of the data
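All three distances are available in base R: dist() covers the Euclidean and Minkowski cases, and mahalanobis() returns squared Mahalanobis distances from a given center. A minimal sketch on made-up 2-D points:

    x <- matrix(c(1, 2, 3, 5, 4, 6), ncol = 2)   # three made-up points in 2-D
    dist(x, method = "euclidean")                # pairwise Euclidean distances
    dist(x, method = "minkowski", p = 3)         # Minkowski distance with r = 3
    # Squared Mahalanobis distance of each point from the sample mean:
    mahalanobis(x, center = colMeans(x), cov = cov(x))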

  7. Standardized/Normalized Vector • Given the mean μ_k and standard deviation σ_k of each attribute k, the standardized/normalized vector has components p'_k = (p_k − μ_k) / σ_k • The Euclidean distance is then computed on the standardized vectors • Standardization is necessary if scales differ.
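In R, scale() performs exactly this standardization, giving every column (attribute) mean 0 and standard deviation 1, so that a subsequent distance computation is not dominated by the attribute with the largest scale. A small sketch on made-up data:

    P <- cbind(height_cm = c(160, 175, 182),   # made-up attributes on
               weight_kg = c(55, 80, 74))      # very different scales
    dist(scale(P), method = "euclidean")       # distance on standardized vectors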

  8. Distance Matrix • Input: data table P (file name: points.dat); output: distance matrix D • P <- as.matrix(read.table(file = "points.dat")) • D <- dist(P[, 2:3], method = "euclidean") • L1 <- dist(P[, 2:3], method = "minkowski", p = 1) • help(dist) Src: “Introduction to Data Mining” by Vipin Kumar et al

  9. Covariance of Two Vectors, cov(p, q) • One definition, using the means of the attributes: cov(p, q) = (1 / (n − 1)) Σ_k (p_k − mean(p)) (q_k − mean(q)) • Or a better definition: cov(p, q) = E[(p − E[p]) (q − E[q])], where E[·] is the expected value of a random variable.

  10. Covariance, or Dispersion, Matrix • For N points in d-dimensional space, the covariance (dispersion) matrix Σ is the d × d matrix whose (j, k) entry is the covariance between attributes j and k • The inverse, Σ⁻¹, is the concentration matrix or precision matrix
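In R, cov() computes the sample covariance (dispersion) matrix of a data matrix, and solve() inverts it to obtain the concentration/precision matrix; a quick sketch on synthetic data:

    set.seed(1)
    X <- matrix(rnorm(100 * 3), ncol = 3)   # 100 synthetic points in 3-D
    Sigma <- cov(X)                         # 3 x 3 covariance matrix
    Precision <- solve(Sigma)               # its inverse: the precision matrix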

  11. Common Properties of a Similarity • Similarities also have some well-known properties: • s(p, q) = 1 (or maximum similarity) only if p = q • s(p, q) = s(q, p) for all p and q (symmetry) where s(p, q) is the similarity between points (data objects) p and q. Src: “Introduction to Data Mining” by Vipin Kumar et al

  12. Similarity Between Binary Vectors • Suppose p and q have only binary attributes • Compute similarities using the following quantities: • M01 = the number of attributes where p was 0 and q was 1 • M10 = the number of attributes where p was 1 and q was 0 • M00 = the number of attributes where p was 0 and q was 0 • M11 = the number of attributes where p was 1 and q was 1 • Simple Matching and Jaccard Coefficients: SMC = number of matches / number of attributes = (M11 + M00) / (M01 + M10 + M11 + M00) J = number of 11 matches / number of not-both-zero attribute values = (M11) / (M01 + M10 + M11) Src: “Introduction to Data Mining” by Vipin Kumar et al

  13. SMC versus Jaccard: Example p = 1 0 0 0 0 0 0 0 0 0 q = 0 0 0 0 0 0 1 0 0 1 M01 = 2 (the number of attributes where p was 0 and q was 1) M10 = 1 (the number of attributes where p was 1 and q was 0) M00 = 7 (the number of attributes where p was 0 and q was 0) M11 = 0 (the number of attributes where p was 1 and q was 1) SMC = (M11 + M00)/(M01 + M10 + M11 + M00) = (0+7) / (2+1+0+7) = 0.7 J = (M11) / (M01 + M10 + M11) = 0 / (2 + 1 + 0) = 0
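The example is easy to check in R, where the M counts fall out of vectorized comparisons (a sketch; the variable names just mirror the slide):

    p <- c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
    q <- c(0, 0, 0, 0, 0, 0, 1, 0, 0, 1)
    M11 <- sum(p == 1 & q == 1);  M00 <- sum(p == 0 & q == 0)
    M10 <- sum(p == 1 & q == 0);  M01 <- sum(p == 0 & q == 1)
    (M11 + M00) / (M01 + M10 + M11 + M00)   # SMC = 0.7
    M11 / (M01 + M10 + M11)                 # J = 0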

  14. Cosine Similarity • If d1 and d2 are two document vectors, then cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||), where • indicates the vector dot product and ||d|| is the length of vector d. • Example: d1 = 3 2 0 5 0 0 0 2 0 0 d2 = 1 0 0 0 0 0 0 1 0 2 d1 • d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5 ||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481 ||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = (6)^0.5 = 2.449 cos(d1, d2) = 5 / (6.481 * 2.449) = 0.3150 Src: “Introduction to Data Mining” by Vipin Kumar et al
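The same computation is a one-liner in R using elementwise products and sums:

    d1 <- c(3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
    d2 <- c(1, 0, 0, 0, 0, 0, 0, 1, 0, 2)
    sum(d1 * d2) / (sqrt(sum(d1^2)) * sqrt(sum(d2^2)))   # 0.3150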

  15. Extended Jaccard Coefficient (Tanimoto) • Variation of Jaccard for continuous or count attributes: T(p, q) = (p • q) / (||p||^2 + ||q||^2 − p • q) • Reduces to Jaccard for binary attributes Src: “Introduction to Data Mining” by Vipin Kumar et al

  16. Correlation (Pearson Correlation) • Correlation measures the linear relationship between objects: corr(p, q) = cov(p, q) / (σ_p σ_q) • To compute correlation, we standardize the data objects p and q and then take their dot product Src: “Introduction to Data Mining” by Vipin Kumar et al
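R's cor() implements Pearson correlation directly; equivalently, one can standardize with scale() and take the scaled dot product, as the slide describes. A sketch on made-up vectors:

    p <- c(1, 2, 3, 4, 5)
    q <- c(2, 4, 5, 4, 6)                        # made-up data
    cor(p, q, method = "pearson")
    # Equivalent: dot product of the standardized vectors, divided by n - 1
    sum(scale(p) * scale(q)) / (length(p) - 1)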

  17. Visually Evaluating Correlation Scatter plots showing the similarity from –1 to 1. Src: “Introduction to Data Mining” by Vipin Kumar et al

  18. General Approach for Combining Similarities • Sometimes attributes are of many different types, but an overall similarity is needed • One general approach: compute a similarity in [0, 1] for each attribute separately, then average the per-attribute similarities over the attributes that are comparable for the pair of objects Src: “Introduction to Data Mining” by Vipin Kumar et al

  19. Using Weights to Combine Similarities • We may not want to treat all attributes the same • Use weights wk that are between 0 and 1 and sum to 1, so the combined similarity becomes a weighted average of the per-attribute similarities Src: “Introduction to Data Mining” by Vipin Kumar et al

  20. Graph-Based Proximity Measures

  21. Outline • Defining Proximity Measures • Neumann Kernels • Shared Nearest Neighbor

  22. Neumann Kernels: Agenda

  23. Neumann Kernels (NK) • A generalization of HITS, named for John von Neumann • Input: undirected or directed graph • Output: within-graph proximity measures • Importance • Relatedness

  24. NK: Citation Graph [Figure: directed citation graph on vertices n1–n8] • Input: a directed graph whose vertices n1, …, n8 are articles • An edge indicates a citation • A citation matrix C can be formed: if an edge between two vertices exists, the corresponding matrix cell is 1; otherwise it is 0

  25. NK: Co-citation Graph [Figure: co-citation graph on vertices n1–n8] • Co-citation graph: a graph in which two nodes are connected if they appear together in the reference list of a third node of the citation graph • In the graph above, n1 and n2 are connected because both are referenced by the same node n5 in the citation graph • Co-citation matrix: CᵀC

  26. NK: Bibliographic Coupling Graph [Figure: bibliographic coupling graph on vertices n1–n8] • Bibliographic coupling graph: a graph in which two nodes are connected if they share one or more bibliographic references • In the graph above, n5 and n6 are connected because both reference the same node n2 in the citation graph • Bibliographic coupling matrix: CCᵀ
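Both derived matrices are plain matrix products of the citation matrix C (with C[i, j] = 1 if article i cites article j). A minimal sketch in R on a made-up three-article graph:

    C <- matrix(0, 3, 3)
    C[3, 1] <- 1
    C[3, 2] <- 1                 # article 3 cites articles 1 and 2
    t(C) %*% C                   # co-citation matrix: 1 and 2 are co-cited by 3
    C %*% t(C)                   # bibliographic coupling matrix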

  27. NK: Document and Term Correlation • Term-document matrix: a matrix in which the rows represent terms, the columns represent documents, and the entries represent a function of their relationship (e.g., the frequency of the given term in the document) • Example: D1: “I like this book”; D2: “We wrote this book” • Term-document matrix X:

    Term    D1  D2
    I        1   0
    like     1   0
    this     1   1
    book     1   1
    we       0   1
    wrote    0   1

  28. NK: Document and Term Correlation (2) • Document correlation matrix: a matrix in which the rows and the columns represent documents, and the entries represent the semantic similarity between two documents • Example: D1: “I like this book”; D2: “We wrote this book” • Document correlation matrix: K = XᵀX

  29. NK: Document and Term Correlation (3) • Term correlation matrix: a matrix in which the rows and the columns represent terms, and the entries represent the semantic similarity between two terms • Example: D1: “I like this book”; D2: “We wrote this book” • Term correlation matrix: T = XXᵀ
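Both correlation matrices follow from the term-document matrix X of slide 27 with one matrix product each. In R (Tmat is used in place of T, which R treats as a shorthand for TRUE):

    # rows: I, like, this, book, we, wrote; columns: D1, D2
    X <- matrix(c(1, 1, 1, 1, 0, 0,
                  0, 0, 1, 1, 1, 1), ncol = 2)
    K <- t(X) %*% X      # 2 x 2 document correlation matrix
    Tmat <- X %*% t(X)   # 6 x 6 term correlation matrix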

  30. Neumann Kernel Block Diagram • Input: graph • Output: two n × n matrices, K_γ and T_γ • Diffusion/decay factor γ: a tunable parameter that controls the balance between relatedness and importance.

  31. NK: Diffusion Factor – Equation and Effect • The Neumann kernel defines two matrices incorporating a diffusion factor γ: K_γ = K (I − γK)⁻¹ and T_γ = T (I − γT)⁻¹, the closed form of the series K Σ_{n ≥ 0} γⁿKⁿ (and likewise for T), which simplifies with our definitions of K and T • When γ = 0: K_γ = K and T_γ = T, pure relatedness • When γ grows toward its maximum: the rankings converge to HITS importance
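By the geometric-series identity, the kernel has the closed form K_γ = K(I − γK)⁻¹, which is a few lines of R. This is a sketch under the definition above (the function name is illustrative, and γ must stay below one over the largest eigenvalue of K for the series to converge):

    neumann_kernel <- function(K, gamma) {
      I <- diag(nrow(K))
      # closed form of K %*% (I + gamma*K + gamma^2*K^2 + ...)
      K %*% solve(I - gamma * K)
    }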

  32. NK: Diffusion Factor – Terminology [Figure: example directed graph on vertices A–D] • Indegree: the indegree δ⁻(v) of vertex v is the number of edges leading to v; here δ⁻(B) = 1 • Outdegree: the outdegree δ⁺(v) of vertex v is the number of edges leading away from v; δ⁺(A) = 3 • Maximal indegree: Δ⁻ is the maximum of the indegrees over all vertices of the graph; Δ⁻(G) = 2 • Maximal outdegree: Δ⁺ is the maximum of the outdegrees over all vertices of the graph; Δ⁺(G) = 3
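With igraph (the package used in the SNN example later in the deck), each of these quantities is a single call. The graph below is made up, chosen only to reproduce the counts quoted above:

    library(igraph)
    g <- graph.formula(A -+ B, A -+ C, A -+ D, C -+ D)  # made-up directed graph
    degree(g, mode = "in")                              # indegrees; B has indegree 1
    degree(g, mode = "out")                             # outdegrees; A has outdegree 3
    max(degree(g, mode = "in"))                         # maximal indegree = 2
    max(degree(g, mode = "out"))                        # maximal outdegree = 3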

  33. NK: Diffusion Factor - Algorithm

  34. NK: Choice of Diffusion Factor and Its Effect on the Neumann Algorithm • The Neumann kernel outputs relatedness between documents and between terms when γ is small (at γ = 0 it reduces to K and T exactly) • Similarly, when γ is larger, the kernel output matches HITS

  35. Comparing NK, HITS, and Co-citation/Bibliographic Coupling [Figure: citation graph on vertices n1–n8] • HITS authority ranking for the graph above: n3 > n4 > n2 > n1 > n5 = n6 = n7 = n8 • Calculating the Neumann kernel with γ = 0.207, the maximum possible value of γ in this case, gives the same ranking: n3 > n4 > n2 > n1 > n5 = n6 = n7 = n8 • For higher values of γ, the Neumann kernel converges to HITS.

  36. Strengths and Weaknesses

  37. Outline • Defining Proximity Measures • Neumann Kernels • Shared Nearest Neighbor

  38. Shared Nearest Neighbor (SNN) • An indirect approach to similarity • Builds on a k-nearest-neighbor graph to determine the similarity between nodes • If two vertices have at least k neighbors in common, they can be considered similar to one another even if no direct link exists between them

  39. SNN - Agenda

  40. SNN – Understanding Proximity

  41. SNN – Proximity Graphs • A graph obtained from a set of points by connecting two points with an edge if the two points are, in some sense, close to each other

  42. SNN – Proximity Graphs (continued) Various types of proximity graphs: linear, cyclic, and radial. [Figure: example proximity graphs on points 1–6]

  43. SNN – Proximity Graphs (continued) Other types of proximity graphs: Gabriel graph, nearest neighbor graph (Voronoi diagram), relative neighbor graph, and minimum spanning tree.

  44. SNN – Proximity Graphs (continued)

  45. SNN – Kth Nearest Neighbor (k-NN) Graph

  46. SNN – Shared Nearest Neighbor Graph • An SNN graph is a special type of k-NN graph • If an edge exists between two vertices, then they both belong to each other’s k-neighborhood In the figure to the left, each of the two black vertices, i and j, has eight nearest neighbors, including each other. Four of those nearest neighbors are shared (shown in red). Thus, the two black vertices are similar for the SNN graph parameter k = 4.

  47. SNN – The Algorithm
  Input: G: an undirected graph
  Input: k: a natural number (minimum number of shared neighbors)
  for i = 1 to N(G) do
    for j = i + 1 to N(G) do
      counter = 0
      for m = 1 to N(G) do
        if vertex i and vertex j both have an edge with vertex m then
          counter = counter + 1
        end if
      end for
      if counter >= k then
        Connect vertex i and vertex j with an edge in the SNN graph.
      end if
    end for
  end for
  return the SNN graph
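A direct R translation of the pseudocode, operating on an adjacency matrix; the inner loop over m becomes one vectorized comparison. This is a sketch with illustrative names, not the SNN() function from the ProximityMeasure package used on the next slide:

    snn_graph <- function(adj, k) {
      n <- nrow(adj)
      snn <- matrix(0, n, n)
      for (i in 1:(n - 1)) {
        for (j in (i + 1):n) {
          shared <- sum(adj[i, ] == 1 & adj[j, ] == 1)   # neighbors common to i and j
          if (shared >= k) snn[i, j] <- snn[j, i] <- 1   # connect i and j
        }
      }
      snn
    }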

  48. SNN – Time Complexity: O(n³) • Let n be the number of vertices of graph G • The “for loops” over i and m each iterate once per vertex in G (n times) • The “for loop” over j iterates at most n − 1 times (O(n)) • Cumulatively this results in a total running time of O(n · n · n) = O(n³)

  49. SNN – R Code Example
  library(igraph)
  library(ProximityMeasure)
  data <- c(0, 1, 0, 0, 1, 0,
            1, 0, 1, 1, 1, 0,
            0, 1, 0, 1, 0, 0,
            0, 1, 1, 0, 1, 1,
            1, 1, 0, 1, 0, 0,
            0, 0, 0, 1, 0, 0)
  mat <- matrix(data, 6, 6)
  G <- graph.adjacency(mat, mode = "directed", weighted = NULL)
  V(G)$label <- c("A", "B", "C", "D", "E", "F")
  tkplot(G)
  SNN(mat, 2)
  Output: [0] A -- D [1] B -- D [2] B -- E [3] C -- E

  50. SNN – Outlier/Anomaly Detection [Figure: an outlier/anomaly vertex identified in an SNN graph]
