This article explores efficient methods for solving graph-related problems using streaming and MapReduce techniques. We cover various topics including finding connected components, estimating the clustering coefficient, minimum spanning trees, and maximum matching. Focusing on restricted memory models, we highlight algorithms that maintain low space requirements and fast update times. Additionally, we provide insights into estimating moments, triangle counting, and random walks in graph streams, all while addressing challenges like connectivity and clustering. Our goal is to present innovative strategies for processing large-scale graph data efficiently.
Sample problems • How do we solve these problems? • finding connected components • estimating the clustering coefficient • minimum spanning tree (weighted) • minimum cut and other partitioning • maximum matching (weighted) • random walks
Streaming Model • Stream = m elements from a universe of size n (possibly with weights), e.g. …, (v1, 1), (v2, 2), (v2, 1), (v1, 300), … • Vector interpretation • stream over universe [n] => vector of size n • Restrictions • Restricted memory, preferably logarithmic • Small number of passes over input, preferably constant • Fast update time • Different models • Simple: (e, w) – each element arrives only once • Cash register: multiple arrivals, i.e. updates (e, +w) arrive but are all increments • Turnstile: (e, ±w) – both positive and negative updates
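The vector interpretation above can be made concrete with a few lines of Python. This is a minimal sketch (the function name is illustrative): folding a turnstile stream of (element, weight) updates into the underlying frequency vector, stored sparsely.

```python
from collections import defaultdict

def apply_stream(updates):
    """Fold a stream of (element, weight) updates into the frequency vector."""
    vec = defaultdict(int)
    for e, w in updates:
        vec[e] += w  # cash register: w > 0 only; turnstile: w may be negative
    return dict(vec)
```

A streaming algorithm, of course, never materializes this vector; it maintains a small sketch of it.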
Estimating moments • stream over [n] => vector f • Estimate the moments F_p = Σ_i f_i^p • To a factor (1±ε) w.p. 1−δ • (AMS) In order to (ε, δ)-estimate F_p • O(ε^−2 log(1/δ) · polylog(n)) space is sufficient for 0 < p ≤ 2 • Ω(n^(1−2/p)) space is necessary for p > 2
Estimating F2 • Pick a random hash function h: [n] → {+1, −1} • Maintain a counter Z; for each update (i, v) perform Z ← Z + v·h(i) • At the end estimate X = Z² • Finally, use median of means.
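The AMS F2 sketch above can be simulated in a few lines. A sketch under stated assumptions: function names are illustrative, and a full random sign table stands in for the 4-wise independent hash family used in the actual algorithm.

```python
import random
import statistics

def ams_f2(stream, n, trials=30, seed=0):
    """Median-of-means AMS estimate of F2 = sum_i f_i^2 over universe [n].

    Each trial keeps one counter Z = sum over updates of v * h(i), with
    h: [n] -> {+1, -1}; E[Z^2] = F2.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        h = [rng.choice((-1, 1)) for _ in range(n)]  # sign table stands in for the hash
        z = 0
        for i, v in stream:
            z += v * h[i]
        estimates.append(z * z)
    # median of means: average small groups of trials, then take the median
    group = 5
    means = [statistics.mean(estimates[j:j + group])
             for j in range(0, len(estimates), group)]
    return statistics.median(means)
```

Averaging within groups reduces variance; the median across groups boosts the success probability to 1−δ.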
Estimating F0 • Define a hash function h: [n] → [M] • Set k = 1/ε² • On element (x, v) • compute h(x) and maintain v = the k-th minimum hash value seen so far • Finally, output X = k·M/v
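The k-th-minimum distinct-count estimator above can be sketched as follows (function names are illustrative; SHA-1 stands in for the random hash function):

```python
import hashlib

def f0_estimate(stream, k=64, M=2**32):
    """Estimate the number of distinct elements via the k-th minimum hash value.

    h maps elements to [0, M); if v is the k-th smallest hash value seen,
    the estimate is k*M/v.
    """
    def h(x):
        return int(hashlib.sha1(str(x).encode()).hexdigest(), 16) % M

    minima = set()  # the (up to) k smallest distinct hash values seen
    for x, _w in stream:
        hx = h(x)
        if hx in minima:
            continue
        minima.add(hx)
        if len(minima) > k:
            minima.discard(max(minima))  # keep only the k smallest
    if len(minima) < k:
        return len(minima)  # fewer than k distinct elements: exact count
    v = max(minima)  # the k-th minimum
    return k * M / v
```

Intuitively, if there are F0 distinct elements with roughly uniform hashes, the k-th minimum lands near k·M/F0, so inverting it recovers F0 up to a (1±ε) factor for k ≈ 1/ε².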
Graph Streams and Problems • Stream = edges • e1, e2, e3,…. • other variants too • Space used = O(n*polylog(n)) • Problems • Connectivity • Matching • Spanners • Clustering coefficient • Moments of degree distribution
Connectivity • Doable in O(n log n) space • keep a label L(u) with every node u • same labels indicate same component • update label information as a new edge (u, v) arrives • L(w) ← L(u) for all w with label L(v) • At the end each connected component has the same label
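The label-merging rule above can be sketched as follows (function names are illustrative; relabeling the smaller class is a standard optimization that keeps the total relabeling work low):

```python
def connected_component_labels(nodes, edge_stream):
    """One-pass connectivity labeling: merge label classes as edges arrive."""
    label = {u: u for u in nodes}
    members = {u: {u} for u in nodes}  # nodes currently carrying each label
    for u, v in edge_stream:
        lu, lv = label[u], label[v]
        if lu == lv:
            continue  # already in the same component
        # relabel the smaller class, so each node is relabeled O(log n) times
        if len(members[lu]) < len(members[lv]):
            lu, lv = lv, lu
        for w in members[lv]:
            label[w] = lu
        members[lu] |= members.pop(lv)
    return label
```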
Connectivity • Not doable in o(n) space • P is a “balanced” property if there exist G and a node u such that • V1 = {v: G + (u, v) satisfies P} • V2 = {v: G + (u, v) does not satisfy P} • min(|V1|, |V2|) = Ω(n) • Any such P needs Ω(n) space
Spanners • d_G(u, v) = shortest path distance in G • Want a subgraph H = (V, E’) such that d_G(u, v) ≤ d_H(u, v) ≤ α·d_G(u, v) • such an H is an α-spanner • Can construct a (2t−1)-spanner in O(n^(1+1/t)) space
Spanner Algorithm • Initialize H = empty • For each new edge (u, v) • if the current d_H(u, v) > 2t − 1, include (u, v) in H • Claim • H is a (2t − 1)-spanner • Number of edges = O(n^(1+1/t)), since H has girth > 2t • Takes time O(n) per edge, but faster algorithms exist
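The greedy rule above can be sketched directly (function names are illustrative; the distance check is a plain BFS capped at depth 2t−1, which is where the O(n)-per-edge cost comes from — faster data structures exist):

```python
def greedy_spanner(n, edge_stream, t):
    """Streaming (2t-1)-spanner: keep edge (u, v) only if u and v are
    currently at distance > 2t-1 in the spanner built so far."""
    adj = {u: set() for u in range(n)}

    def dist_at_most(u, v, limit):
        """BFS from u, capped at depth `limit`; is d_H(u, v) <= limit?"""
        if u == v:
            return True
        frontier, seen, d = {u}, {u}, 0
        while frontier and d < limit:
            d += 1
            frontier = {w for x in frontier for w in adj[x] if w not in seen}
            if v in frontier:
                return True
            seen |= frontier
        return False

    H = []
    for u, v in edge_stream:
        if not dist_at_most(u, v, 2 * t - 1):
            adj[u].add(v)
            adj[v].add(u)
            H.append((u, v))
    return H
```

For t = 1 this just deduplicates edges; larger t trades stretch for fewer kept edges.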
Counting triangles • Clustering coefficient = (#closed triplets)/(#connected triplets) • signature of community structure • Different types of signed triangles measure the “balance” of the network (+++ or −−− vs. ++−) • Algorithms • sampling based: sparsify the graph so that it fits into memory • streaming: reduce to frequency moments • linear algebra based: reduce triangle counting to a trace estimation problem and use randomized approximations
Naïve triangle counting • Time O(mn)
Improving Exact Counting (Alon, Yuster, Zwick) • Algorithm: • Divide vertices into low- and high-degree sets according to a degree threshold • For all low-degree vertices • check neighbor pairs and whether they are connected • For the high-degree subgraph • use matrix multiplication to count the triangles • Asymptotically the fastest algorithm, but not practical for large graphs.
AYZ triangle counting • Use a degree threshold Δ • Time spent counting triangles with low-degree pivots = O(EΔ) • Number of high-degree vertices ≤ 2E/Δ • Time spent in matrix multiplication = O((2E/Δ)^ω) • Total time = O(EΔ + (2E/Δ)^ω) • By appropriate choice of Δ, minimized at Δ = E^((ω−1)/(ω+1)), giving total time O(E^(2ω/(ω+1)))
Naïve sampling • Take r independent samples of three distinct vertices • T_i = number of triplets with exactly i edges • Estimate T3 as (fraction of sampled triplets that are triangles) · (n choose 3) • Then the estimate is within (1±ε) with probability at least 1−δ, for r = O((T0+T1+T2+T3)/T3 · ε^−2 · log(1/δ)) • Works for dense graphs, e.g., T3 ≥ n² log n — WAW '10
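The naïve sampler above in a few lines (function names are illustrative). As the slide notes, on sparse graphs almost no sampled triple is a triangle, so r must be enormous; on dense graphs it works:

```python
import random

def sample_triangle_fraction(nodes, edges, r, seed=0):
    """Estimate T3: sample r random vertex triples, count how many are
    triangles, and scale the fraction up by C(n, 3)."""
    rng = random.Random(seed)
    E = {frozenset(e) for e in edges}
    hits = 0
    for _ in range(r):
        a, b, c = rng.sample(nodes, 3)
        if {frozenset((a, b)), frozenset((b, c)), frozenset((a, c))} <= E:
            hits += 1
    n = len(nodes)
    total_triples = n * (n - 1) * (n - 2) // 6  # C(n, 3)
    return hits / r * total_triples
```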
Edge sampling • Triangle Sparsifiers • Keep each edge independently with probability p. Count the triangles in the sparsified graph and multiply by 1/p³ • If the graph has Ω(n·polylog(n)) triangles we get concentration • Proof of concentration is tricky • uses the Kim–Vu concentration result for multivariate polynomials, which have a bad Lipschitz constant but behave “well” on average • improved using a colorability result of Hajnal–Szemerédi • works for suitably large p as a function of t, Δ, d; t = #triangles, Δ = max degree, d = avg. degree
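The sparsify-and-rescale estimator above is tiny once an exact counter is available (function names are illustrative; an exact triple-enumeration counter is included to keep the sketch self-contained):

```python
import random
from itertools import combinations

def count_triangles(edges):
    """Exact triangle count by enumerating all vertex triples (small graphs only)."""
    E = {frozenset(e) for e in edges}
    nodes = sorted({u for e in edges for u in e})
    return sum(
        1 for a, b, c in combinations(nodes, 3)
        if {frozenset((a, b)), frozenset((b, c)), frozenset((a, c))} <= E
    )

def sparsified_estimate(edges, p, seed=0):
    """Keep each edge independently with probability p; a triangle survives
    with probability p^3, so rescale the sparsified count by 1/p^3."""
    rng = random.Random(seed)
    kept = [e for e in edges if rng.random() < p]
    return count_triangles(kept) / p ** 3
```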
Streaming Triangle counting • Consider a pseudo-stream where each element is a triplet of vertices: for each edge (a, b), emit the triplet (a, b, c) for every third vertex c • T_i = number of triplets with i edges • Estimate F0, F1, F2 for this pseudo-stream • using sketches • Use the relation T3 = F0 − 1.5·F1 + 0.5·F2 to estimate T3 • Number of samples grows with (T1+T2+T3)/T3 • Better in the incidence model
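The moment relation can be checked exactly on a small graph. This demo (function names are illustrative) materializes the triplet pseudo-stream and computes F0, F1, F2 exactly; the streaming algorithm replaces the exact moments with sketches. A triplet with i of its 3 edges present appears i times in the pseudo-stream, so F0 = T1+T2+T3, F1 = T1+2T2+3T3, F2 = T1+4T2+9T3, and the combination F0 − 1.5·F1 + 0.5·F2 cancels T1 and T2, leaving T3:

```python
from collections import Counter

def t3_via_moments(n, edges):
    """Recover the triangle count T3 from the frequency moments of the
    triplet pseudo-stream: T3 = F0 - 1.5*F1 + 0.5*F2."""
    freq = Counter()
    for u, v in edges:
        for w in range(n):
            if w not in (u, v):
                freq[frozenset((u, v, w))] += 1
    f0 = len(freq)                          # number of distinct triplets
    f1 = sum(freq.values())                 # stream length
    f2 = sum(c * c for c in freq.values())  # second moment
    return f0 - 1.5 * f1 + 0.5 * f2
```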
Random Walks on a stream • Naïve methods • For each step of the random walk, do a pass over the network • Using space O(n), k steps need k passes • Or sample O(kn) edges, k from every node • In one pass, can do a walk of length k • Main result: • Using space Õ(n), can do k steps of the random walk using only k^(1/2) passes • Used to approximate PageRank, conductance, etc.
Random walks • Multiple start points: sample each node w.p. p and create a w-length random walk from each in w passes • Will try to stitch these together • Can get stuck because • the endpoint was not in the original sample (i.e. no random walk from there) • the endpoint was already used (i.e. cannot take independent steps) • Handling stuck nodes: • Maintain the stuck node(s) and the set of “used” start points • Take a new random sample of s edges from each of these (maybe multiple times) • Crucial step: • Whenever stuck, either the new random sample is enough to make progress, or we discover new nodes (and there are not many of them)
MapReduce: input key-value pairs → map → intermediate key-value pairs → group by key → reduce → output key-value pairs [slide from J. Ullman cs345A]
MapReduce formalization • Number of machines = N^(1−ε) • Memory per machine = N^(1−ε) • Total communication = N^(2−2ε) • Over all rounds • MRC^k = problems that need ≤ k rounds • Each round has a 2-phase map-then-reduce structure • Ideally, want the same “total work” as the optimal sequential algorithm
MST • Suppose |E| = |V|^(1+c) • Assume #machines = |V|^(1−ε) • memory per machine = |V|^(1−ε) • Claim: • number of iterations = ⌈c/ε⌉ • total work = O(m·α(m, n)·c/ε)
Back to triangle counting: curse of the last reducer • Naïve MapReduce algorithm • In the first pass, collect edge pairs [(u, v), (v, x)] • In the second pass, count triangles • Problem • Reducers that deal with high-degree vertices take a long time
Trick 1: pivoting on smallest degrees • Pivot on the smallest-degree node of the triangle • Reduces counting time to O(m^(3/2)) • Intuition: • Similar to the AYZ proof, divide the analysis by the pivot-degree threshold m^(1/2) • In the MapReduce setting, just use this trick to decide which vertices should be pivots
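The pivoting trick is easy to state sequentially before moving it to MapReduce: charge each triangle to its lowest-degree vertex (breaking ties by id), and only enumerate neighbor pairs at that pivot. A minimal sketch (function name is illustrative):

```python
from collections import defaultdict

def count_triangles_pivot(edges):
    """Count each triangle exactly once, at its lowest-degree vertex
    (ties broken by vertex id). Enumerating neighbor pairs only at the
    smallest-degree endpoint gives O(m^(3/2)) total work."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    rank = {u: (len(adj[u]), u) for u in adj}  # total order by (degree, id)
    count = 0
    for u in adj:
        higher = [v for v in adj[u] if rank[v] > rank[u]]
        for i in range(len(higher)):
            for j in range(i + 1, len(higher)):
                if higher[j] in adj[higher[i]]:
                    count += 1  # triangle {u, higher[i], higher[j]}, pivot u
    return count
```

Each vertex enumerates at most O(√m) higher-ranked neighbors, which is where the O(m^(3/2)) bound comes from.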
Trick 2: Overlapping partitions • Divide vertices V = {V1, V2, …, Vt} • V_ijk = V_i ∪ V_j ∪ V_k; E_ijk = the corresponding edges • In the first pass, partition the graph and weight each triangle so that it is counted exactly once • Run the previous algorithm on each partition in parallel • Total work done is still O(m^(3/2))
Runtimes • Note the reduction in the number of paths using Trick 1 • However, running it on MapReduce incurs overheads
Models + Bag of Algorithmic tricks • Models • streaming, semi-streaming, stream + sort, MapReduce • Algorithmic tricks • Moment estimation on a data stream • Edge sampling rather than triplet sampling • Reducing triangle counting to moment estimation • Piecing together random walks • Pivoting on the smallest degree to count triangles • Overlapping partitions to fit the graph into memory
Not covered • Streaming + dynamic: • Model in which graph edges can appear/disappear • How can we test connectivity? • Multigraph streams • How do we compute different functions of node degrees? • Streaming + sort • Can solve a number of the discussed problems in polylog space • Interesting only if there is an efficient way to do disk-based sort • Clustering • Are these the right computational models?