This document delves into the intricacies of community detection within networks. It covers previously studied network structures highlighting short paths, navigability, and heavy-tailed degrees. Key concepts include cohesive nodes, link structures with dense interconnections, and varied community definitions. Operational motivations for community analysis range from network compression to predicting biological functions and link recommendations. It also discusses the role of social capital in group dynamics, emphasizing micro-markets, information propagation, and the significance of weak ties in enhancing community engagement and growth over time.
Communities • Previously studied structure of the network • Short paths; navigability; heavy-tailed degrees • Communities • cohesive set of nodes • Link structure: “many” edges inside, “few” outside • “more” interconnections than “expected” • many different definitions… • Over the next few classes we will cover some community definitions in more detail • Density • Modularity • Local Spectral
Why Communities • Purely operational reasons • Compressing a network • Identifying link-spam • Predicting biological functions in protein networks • Link recommendation • Studying the network at appropriate level of detail • zooming in and out • Information propagation in links that cross communities • “weak ties” [Granovetter ‘73] • Social capital theory • groups with higher social capital “flourish” more
Micro-markets • Micro-markets are important for experimenting with pricing
Micro-markets in ad-analytics • What is the CTR and advertiser ROI of sports gambling keywords? • [Figure: 1.4 million advertisers and 10 million keywords, with clusters labeled Movies, Media, Sports, Sport videos, Gambling, Sports Gambling]
Communities capture network dynamics • Zachary’s karate club study • Small-scale, in-depth study of how communities predict change in a network • Group dynamics in LiveJournal and DBLP • How do people decide to join groups?
Zachary’s karate club (1977) • Studied 34 members for a period of 2 years • Edges denote friendships • During the study, the club broke into two pieces, one around the instructor and one around the administrator • The split followed the minimum cut separating the instructor from the administrator
Group Evolution (Backstrom, Kleinberg ’05) • Membership: what determines whether a person will join a group? • Randomly? • Along friendship edges? • Growth: what characteristics of a group make it grow? • Change: how do groups change over time?
Probability of joining • Probability of joining increases with number of friends in group, but has diminishing returns
Second-order effect • Of the two candidates x and y, who is more likely to join? • Two possible theories: • “Weak ties = information” [Granovetter ‘73]: more information obtained when ties are “weak” • “Social capital = more active community” [Coleman ‘88]: safety and more engagement when community is active • [Figure: candidates x and y and their friends in the group]
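Dense subgraphs • Define community in terms of dense subgraphs • Densest subgraph: find $S \subseteq V$ maximizing the density $\rho(S) = |E(S)| / |S|$ • -density: • Densest at least k (DalkS): maximize $\rho(S)$ subject to $|S| \ge k$ • Densest at most k (DamkS): maximize $\rho(S)$ subject to $|S| \le k$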
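The three named problems can be made concrete on a toy graph. The sketch below assumes the usual density $|E(S)| / |S|$ (the formula itself is missing from the slide text) and uses exhaustive search, so it is only meant for graphs with a handful of nodes. The names `density`, `densest_brute_force`, and `EDGES` are illustrative, not from the slides.

```python
from itertools import combinations


def density(edges, nodes):
    """Return |E(S)| / |S| for the node set S = nodes (assumed definition)."""
    s = set(nodes)
    if not s:
        return 0.0
    inside = sum(1 for u, v in edges if u in s and v in s)
    return inside / len(s)


def densest_brute_force(edges, min_size=1, max_size=None):
    """Try every node subset with min_size <= |S| <= max_size.

    min_size=k gives the "densest at least k" variant,
    max_size=k gives the "densest at most k" variant.
    Exponential time: a definition check, not a practical algorithm.
    """
    vertices = sorted({x for e in edges for x in e})
    if max_size is None:
        max_size = len(vertices)
    best, best_density = (), -1.0
    for k in range(min_size, max_size + 1):
        for candidate in combinations(vertices, k):
            d = density(edges, candidate)
            if d > best_density:
                best, best_density = candidate, d
    return set(best), best_density


# Hypothetical toy graph: a 4-clique {0, 1, 2, 3} with a path 3-4-5 attached.
EDGES = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]

if __name__ == "__main__":
    print(densest_brute_force(EDGES))              # unconstrained
    print(densest_brute_force(EDGES, min_size=5))  # densest at least 5
    print(densest_brute_force(EDGES, max_size=3))  # densest at most 3
```

On this graph the unconstrained optimum is the 4-clique, while the size constraints force a sparser answer in both directions.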
Mining dense graphs • For the webgraph • For detecting link-spam, for use in webgraph compression • For search engines • Looking for reasonably large dense subgraphs • Should be scalable = small memory + time • Algorithm • [Gibson, Kumar, Tomkins] give a fast heuristic • no theoretical guarantee • mines dense bipartite subgraphs
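Tool: Minhashing • Sets A and B • Want to keep a sketch such that the Jaccard coefficient $J(A, B) = |A \cap B| / |A \cup B|$ can be estimated quickly • Consider a random permutation $\pi$ of the universe • For each set A, maintain $\min \pi(A)$; then $\Pr[\min \pi(A) = \min \pi(B)] = J(A, B)$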
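A minimal sketch of the minhash idea, assuming the standard construction: keep the minimum rank of each set under several independent random permutations, and estimate the Jaccard coefficient as the fraction of permutations on which two sets agree. Explicit permutations are used here for clarity; the helper names and the parameter `c` (number of permutations) are illustrative, and sets are assumed to be non-empty.

```python
import random


def make_permutations(universe, c, seed=0):
    """Build c random permutations of the universe as item -> rank tables."""
    rng = random.Random(seed)
    universe = list(universe)
    perms = []
    for _ in range(c):
        ranks = list(range(len(universe)))
        rng.shuffle(ranks)
        perms.append(dict(zip(universe, ranks)))
    return perms


def minhash_signature(items, perms):
    """Sketch of a non-empty set: its minimum rank under each permutation."""
    return [min(perm[x] for x in items) for perm in perms]


def jaccard(a, b):
    """Exact Jaccard coefficient |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)


def estimate_jaccard(sig_a, sig_b):
    """Fraction of permutations on which the two minima coincide."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


if __name__ == "__main__":
    A = set(range(0, 60))
    B = set(range(30, 90))
    perms = make_permutations(A | B, c=500)
    sig_a = minhash_signature(A, perms)
    sig_b = minhash_signature(B, perms)
    print("exact:", jaccard(A, B))
    print("estimate:", estimate_jaccard(sig_a, sig_b))
```

Once the signatures are stored, comparing two sets costs time proportional to `c` rather than to the set sizes, which is the point of keeping a sketch.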
Shingling and Minhashing • For each node, consider “shingles”: s-subsets of its neighbors • Compute c-minhash signatures for each node • For each signature, find the s-shingle of nodes that contain it • Compute minhash, iterate • Output all connected components as clusters • Each original node can now appear in multiple nodes, so clusters are overlapping • Example with s = 3: v1 [s1, s2, s3], v2 [s1, s4, s5], v3 [s4, s7, s6]
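The first level of this procedure can be sketched as follows. This is a simplified, single-pass illustration under stated assumptions, not the published algorithm: each node gets `c` shingles (the `s` lowest-ranked neighbors under each of `c` random rankings), nodes sharing a shingle are merged with union-find, and the second shingling round is omitted, so the clusters here do not overlap. The names `shingles` and `cluster_by_shingles` are hypothetical.

```python
import random
from collections import defaultdict


def shingles(neighbors, rank_tables, s):
    """One s-shingle per ranking: the s lowest-ranked neighbors."""
    if len(neighbors) < s:
        return []  # too few neighbors to form a shingle
    result = []
    for table in rank_tables:
        lowest = sorted(neighbors, key=table.__getitem__)[:s]
        result.append(tuple(sorted(lowest)))
    return result


def cluster_by_shingles(adj, s=3, c=4, seed=0):
    """Group nodes that share at least one (s, c)-shingle of neighbors."""
    rng = random.Random(seed)
    universe = sorted({v for nbrs in adj.values() for v in nbrs})
    rank_tables = []
    for _ in range(c):
        ranks = list(range(len(universe)))
        rng.shuffle(ranks)
        rank_tables.append(dict(zip(universe, ranks)))

    # Invert: shingle -> nodes whose neighborhood produced it.
    by_shingle = defaultdict(list)
    for node, nbrs in adj.items():
        for sh in set(shingles(nbrs, rank_tables, s)):
            by_shingle[sh].append(node)

    # Union-find over nodes that share a shingle.
    parent = {node: node for node in adj}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for members in by_shingle.values():
        for other in members[1:]:
            parent[find(other)] = find(members[0])

    groups = defaultdict(set)
    for node in adj:
        groups[find(node)].add(node)
    return [g for g in groups.values() if len(g) > 1]


if __name__ == "__main__":
    adj = {
        "a1": {1, 2, 3, 4, 5}, "a2": {1, 2, 3, 4, 5}, "a3": {1, 2, 3, 4, 5},
        "b1": {10, 11, 12, 13}, "b2": {10, 11, 12, 13},
        "lonely": {1, 10},
    }
    print(cluster_by_shingles(adj))
```

Raising `s` makes two nodes less likely to share a shingle unless their neighborhoods overlap heavily, while raising `c` gives them more chances to collide, which is why (s, c) acts as a density knob.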
Results • Run on graph with 50M nodes and 11B edges • Can play around with (s, c) to get different density • Most clusters were attributable to link-spam • Justified using handpicked editorial labeling • Dense subgraphs exhibited faster growth
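Algorithms with Guarantees? • Finding the set S with maximum density $|E(S)| / |S|$ • Consider the decision version: is there a set S with density at least a given threshold? • Can be solved exactly using max-flow (or LP) • [Figure: flow network with nodes u and v labeled by their degrees $d_u$ and $d_v$]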
Algorithms • Can be solved using parametric max-flow because of special structure of the graph • Time $O(V^2 E)$ • Not practical for large networks • Can we get a faster approximation?
Faster Approximation • Theorem [Charikar, Khuller-Saha]: The above algorithm gives a 2-approximation to the densest subgraph problem • Can be implemented in linear time (how?)
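The algorithm the theorem refers to did not survive in the slide text. Assuming it is the standard greedy peeling procedure (repeatedly delete a minimum-degree node and remember the densest intermediate subgraph), a minimal sketch looks like this. It uses a lazy heap, which gives O(m log n) time; replacing the heap with degree buckets is one way to reach linear time. The function name is illustrative.

```python
import heapq
from collections import defaultdict


def greedy_densest_subgraph(edges):
    """Peel minimum-degree nodes; return the densest subgraph seen."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)

    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    heap = [(d, v) for v, d in degree.items()]
    heapq.heapify(heap)
    alive = set(adj)
    m = sum(degree.values()) // 2  # edges among alive nodes

    best_density = m / len(alive) if alive else 0.0
    removal_order = []
    best_prefix = 0  # removals done before the best snapshot

    while alive:
        d, v = heapq.heappop(heap)
        if v not in alive or d != degree[v]:
            continue  # stale heap entry
        alive.remove(v)
        removal_order.append(v)
        m -= degree[v]  # degree[v] counts only alive neighbors
        for w in adj[v]:
            if w in alive:
                degree[w] -= 1
                heapq.heappush(heap, (degree[w], w))
        if alive and m / len(alive) > best_density:
            best_density = m / len(alive)
            best_prefix = len(removal_order)

    best_set = set(adj) - set(removal_order[:best_prefix])
    return best_set, best_density


if __name__ == "__main__":
    # Hypothetical toy graph: a 4-clique with a path 3-4-5 attached.
    edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]
    print(greedy_densest_subgraph(edges))
```

On the toy graph the peeling removes the path nodes first and stops improving once it reaches the clique.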
Reducing passes of Greedy Algorithm (Bahmani, Kumar, Vassilvitskii ’12) • Will need only $O(\log(n)/\epsilon)$ passes • Subgraph reduced by a constant factor at each stage • $(2 + 2\epsilon)$-approximation, similar proof • Similar constructions for DalkS, directed versions
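A sketch of the batched variant, assuming the removal rule usually stated for this algorithm: in each pass, delete every node whose degree is at most $2(1+\epsilon)$ times the current density, and keep the densest intermediate subgraph. Since the average degree is twice the density, at least one node falls below the threshold in every pass, so the loop terminates. The function name and return shape are illustrative.

```python
from collections import defaultdict


def multipass_densest_subgraph(edges, eps=0.1):
    """Batched peeling; returns (best node set, its density, passes used)."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)

    alive = set(adj)
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    m = sum(degree.values()) // 2  # edges among alive nodes

    best_set = set(alive)
    best_density = m / len(alive) if alive else 0.0
    passes = 0

    while alive:
        passes += 1
        rho = m / len(alive)
        threshold = 2 * (1 + eps) * rho
        # One pass: collect all low-degree nodes, then delete them together.
        to_remove = {v for v in alive if degree[v] <= threshold}
        alive -= to_remove
        for v in to_remove:
            for w in adj[v]:
                if w in alive:
                    degree[w] -= 1
        m = sum(degree[v] for v in alive) // 2
        if alive and m / len(alive) > best_density:
            best_density = m / len(alive)
            best_set = set(alive)

    return best_set, best_density, passes


if __name__ == "__main__":
    # Hypothetical toy graph: a 4-clique with a path 3-4-5 attached.
    edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]
    print(multipass_densest_subgraph(edges, eps=0.1))
```

Each pass only needs node degrees and the current edge count, which is what makes the scheme suitable for streaming or MapReduce-style settings where a pass over the edges is the expensive unit.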
Empirical Results • Much better performance in practice than predicted • Number of passes is at most 10-12 in practice
Variants: approximations or hardness • Exact-k • getting PTAS is UGC-hard (Khot ’05) • Feige et al. $O(n^{0.33-a})$ approx. • Bhaskara et al. $O(n^{0.25+a})$ approx. • DalkS (at least k): • NP-hard • 2-approx (Andersen-Chellapilla, Khuller-Saha) • DamkS (at most k): • As hard as Exact-k • Directed graphs • 2-approx (Charikar, Khuller-Saha)
Applications of densest subgraph • Important in many theoretical settings, mostly as a candidate hard problem • Establishing hardness of financial derivatives; cryptography • variants hard even for random graphs • Practical implications • In making reachability and distance queries efficient • biology • Mining coherent dense subgraphs across massive biological networks for functional discovery [HYHHZ ’05] • dense protein interaction subgraph corresponds to a protein complex [BD’03] [SM’03]