Explore graph streaming algorithms and their applications in processing massive graphs such as web graphs and recommendation systems. Learn about balanced properties, lower bounds, sparsification techniques, and algorithms for various graph problems.
Graph Problems in the Streaming Model Sampath Kannan University of Pennsylvania Work done with Joan Feigenbaum, Andrew McGregor, Siddharth Suri and Jian Zhang
Graph Streaming • G=(V,E), • V known; |V| = n • E revealed in arbitrary order (e1, e2, …) • Space allowed O(n polylog n): Semi streaming
Motivation? • Fundamental problems … help ‘calibrate’ model • Massive graphs such as the webgraph can appear as stream • Recommendation systems… and more generally data mining
Why so much space? Even simple problems need it: Given u, v, and a streamed graph G, is there a path of length 2 between u and v? Requires Ω(n) space. More generally … for balanced graph properties …
Balanced Properties • A property is balanced if there exists a stream of edges such that, before seeing the last edge, there exists a vertex v such that, when the last edge is (v,x): for Ω(n) choices of x the property holds, and for Ω(n) choices of x the property doesn’t hold.
Lower Bound for Balanced Props Consider all isomorphic versions of the graph that demonstrates the balance property. Before seeing last edge, streaming algorithm has to remember the subset x of vertices such that the addition of edge (v,x) causes property to hold. As we range over isomorphisms... this is an arbitrary subset of the given cardinality... and there are exponentially many possibilities.
“Exceptions” • Counting Local Structures • Counting triangles (Bar-Yossef et al, Buriol et al) • Counting |E(G²)| (Ganguly et al) • Duplicate elimination and aggregation • (Cormode, Muthukrishnan)
One algorithm design technique • Sparsification (Eppstein, Galil, Italiano, Nissenzweig ‘97) • For graph property P: G’ is a strong certificate for G if ∀ H: (G ⋃ H) ∈ P ⇔ (G’ ⋃ H) ∈ P. • Existence of quickly computable, sparse, strong certificates leads to good semi-streaming algorithms
Sparsification-based algorithms • Bipartiteness, 1-, 2-, 3-vertex connected components, 2-, 3-edge connected components: O(α(n)) per edge • MST, 4-vertex connected comps., 3-edge connected comps.: O(log n) • Higher connectivities: Õ(n). (Zelke)
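As an illustration of the O(α(n))-per-edge regime, here is a minimal sketch of a one-pass bipartiteness test. It is not taken from the talk: it assumes vertices are numbered 0..n-1 and uses a union-find structure that also records each vertex's side relative to its root (the class and function names are hypothetical). The structure stores O(n) words, which fits the semi-streaming budget.

```python
class ParityUnionFind:
    """Union-find that also tracks each vertex's side (0 or 1) relative to its root."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.parity = [0] * n  # side of a vertex relative to its parent
        self.rank = [0] * n

    def find(self, v):
        """Return (root, side of v relative to root), compressing the path."""
        path = []
        while self.parent[v] != v:
            path.append(v)
            v = self.parent[v]
        root = v
        side = 0
        # Re-hang every vertex on the path directly under the root,
        # starting with the vertex closest to the root.
        for node in reversed(path):
            side ^= self.parity[node]
            self.parent[node] = root
            self.parity[node] = side
        return root, (self.parity[path[0]] if path else 0)

    def add_edge(self, u, v):
        """Record edge (u, v); return False if it closes an odd cycle."""
        ru, pu = self.find(u)
        rv, pv = self.find(v)
        if ru == rv:
            return pu != pv  # endpoints must lie on opposite sides
        if self.rank[ru] > self.rank[rv]:
            ru, rv, pu, pv = rv, ru, pv, pu
        self.parent[ru] = rv
        self.parity[ru] = pu ^ pv ^ 1  # forces u and v onto opposite sides
        if self.rank[ru] == self.rank[rv]:
            self.rank[rv] += 1
        return True


def is_bipartite_stream(n, edge_stream):
    """One pass over the edges, O(n) words of memory."""
    uf = ParityUnionFind(n)
    return all(uf.add_edge(u, v) for u, v in edge_stream)
```

With path compression and union by rank, the amortized cost per edge is O(α(n)). The edges that trigger a union form a spanning forest, which plays the role of the sparse certificate.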
Bipartite Matching Matching (maximal) Augmenting path Approximable with local greed
Constant-pass 2/3-approx for bip. matching • Maximal matching is a 0.5-approx: If M’ is maximum and M is maximal, then M matches at least one endpoint of each edge in M’ … so M has at least |M’|/2 edges. • If M has only α|M| vertex-disjoint augmenting paths of length 3 ⇒ |M| (1 + α) ≥ 2·OPT/3 • M’ maximum: M’ ∆ M is a bunch of augmenting paths. Count!
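The 0.5-approximation can be obtained in a single pass with the standard greedy rule. The sketch below is illustrative only; the function name and the representation of the stream as (u, v) pairs are assumptions.

```python
def greedy_maximal_matching(edge_stream):
    """Keep an edge exactly when both of its endpoints are still free.

    Stores one flag per matched vertex plus the matching itself: O(n) space.
    """
    matched = set()
    matching = []
    for u, v in edge_stream:
        if u != v and u not in matched and v not in matched:
            matching.append((u, v))
            matched.update((u, v))
    return matching
```

On the path 0-1-2-3, the arrival order (1,2), (0,1), (2,3) yields the single edge (1,2), while the maximum matching has two edges, so the factor 1/2 is tight for this rule. The leftover path 0-1-2-3 is exactly a length-3 augmenting path of the kind the multi-pass algorithm looks for.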
Can find maximal matching • To go beyond: Need to get most aug. paths of length 3. • Randomly project all free vertices into Layer 0 or Layer 3 • Matched edges go from layer 1 to layer 2. • Expect half the augmenting paths of length 3 to respect layering • Use maximal matchings between successive layers to get constant fraction of these. • Gives constant-pass 2/3 - approximation
To get approximation scheme: Need to find most augmenting paths of length • Again project vertices into k+1 layers to find augmenting paths of length k • Use carefully chosen maximal matching algorithms between successive layers • Repeat a constant number of times • Gives a streaming linear-time approximation scheme for unweighted matching in general graphs (McGregor)
A 1/6 Approximation in 1 Pass • At all times we store some matching M. • On seeing edge e = (u,v), we compare w(e) with the total weight W of the edges e1 and e2 in M incident on u and v. • If w(e) > 2W then M ← M ∪ {e} \ {e1, e2}
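A minimal sketch of this replacement rule, assuming the stream delivers (u, v, weight) triples with positive weights and comparable vertex identifiers; the function name and data layout are hypothetical.

```python
def one_pass_weighted_matching(edge_stream):
    """Replace conflicting matched edges when the new edge is more than twice as heavy."""
    partner = {}  # matched vertex -> (other endpoint, weight of its matched edge)
    for u, v, w in edge_stream:
        if u == v:
            continue
        # Collect the (at most two) matched edges e1, e2 touching u or v.
        conflicts = set()
        for x in (u, v):
            if x in partner:
                y, wy = partner[x]
                conflicts.add((min(x, y), max(x, y), wy))
        total = sum(weight for _, _, weight in conflicts)  # this is W
        if w > 2 * total:
            for a, b, _ in conflicts:
                del partner[a]
                del partner[b]
            partner[u] = (v, w)
            partner[v] = (u, w)
    return sorted({(min(x, y), max(x, y), w) for x, (y, w) in partner.items()})
```

For the stream (0,1,1), (1,2,2), (2,3,5), the second edge is rejected because 2 is not greater than 2·1, and the result is [(0, 1, 1), (2, 3, 5)].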
To show 1/6 approx: Account for the weight of edges lost in terms of the weight of edges that survive • Can improve approx to 1/2 − ε (McGregor) in a constant number of passes: • Choose an edge if it is (1 + γ) times the weight of the edges that it kills.
The “Sketch” Approach • A two-stage approach • First stage: While going through the stream, construct a small sketch of the input graph. • Second stage: Compute the distance using the sketch, without further access to the stream. • Perform BFS-like computations in the second stage.
Graph Spanners as Sketches • Multiplicative t-spanner: Edge subgraph H of a graph G, s.t., for any pair of vertices u and v, dist_H(u,v) ≤ t·dist_G(u,v). • There is a t-spanner with O(n^(1+1/t)) edges. • Reduce streaming graph distance to streaming spanner construction. • BFS-like subroutines are used in most existing spanner constructions.
Streaming Spanner Construction • For each incoming edge, decide whether it should be in the spanner. • If the edge causes a cycle of length t, do not put the edge in the spanner. • This gives a t-spanner, because there is a path P of length < t connecting the two endpoints of any discarded edge. • This spanner is sparse. Thm [Bollobás78]: A graph whose girth is larger than k can only have O(n^(1+2/(k-1))) edges. • Need to know: For an incoming edge, does a short path exist?
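The cycle rule above can be sketched naively as follows. This is an illustration under assumptions, not the one-pass algorithm of the talk: it answers the "does a short path exist?" question with a depth-bounded BFS over the spanner built so far, and it reads the rule as "discard the edge if a path of at most t − 1 edges already joins its endpoints". The function names are hypothetical.

```python
from collections import deque


def within_distance(adj, src, dst, limit):
    """Bounded BFS: True if dst is reachable from src using at most `limit` edges."""
    if src == dst:
        return True
    dist = {src: 0}
    queue = deque([src])
    while queue:
        x = queue.popleft()
        if dist[x] == limit:
            continue  # do not expand past the depth limit
        for y in adj.get(x, ()):
            if y not in dist:
                if y == dst:
                    return True
                dist[y] = dist[x] + 1
                queue.append(y)
    return False


def streaming_spanner(edge_stream, t):
    """Keep an edge only if it would not close a cycle of length at most t."""
    adj = {}
    spanner = []
    for u, v in edge_stream:
        if u == v or within_distance(adj, u, v, t - 1):
            continue
        spanner.append((u, v))
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return spanner
```

Only the spanner is stored, so the space is bounded by the girth theorem quoted above. The cost is the BFS per edge, which is the bottleneck that the clustering approach on the following slides is designed to avoid.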
Baswana & Sen show an almost linear time non-streaming algorithm for spanners… growing BFS-trees from appropriate nodes. Difficult to do in streaming fashion… Instead we grow a BFS-like tree not just from its root! • Clusters: Rooted BFS trees • Preclusters: Free floating pieces of BFS trees … will attach to clusters
Summary of the One-Pass Algorithm • Use a vertex-labeling scheme to construct clusters. • Structure of the algorithm: • In the pre-processing phase, generate a multi-level set of labels for the vertices. • Go through the stream; for each edge: • According to the current assignment of labels to vertices, decide whether to put this edge in the spanner. • Depending on the type of edge, possibly assign more labels to one of its endpoints. • Next, an example with t = log n
Labels • (log n)/2 levels • w.h.p., there are top-level labels. • Semantics of labels: • The set of vertices assigned the same top-level label forms a cluster. • The set of vertices assigned the same lower-level label forms a “pre-cluster.” [Figure: label hierarchy with Level 2 labels (2,2), (2,7); Level 1 labels (1,2), (1,4), (1,7), (1,11); Level 0 labels (0,1) through (0,12).]
Initial Label Assignment [Figure: vertices v1 … v12 with Level 0 labels (0,1) through (0,12); Level 1 labels (1,2), (1,4), (1,7), (1,11); Level 2 labels (2,2), (2,7).]
On arrival of an edge • Already know what to do with: • Intra-cluster/pre-cluster edges • Inter-cluster edges • Edges connecting pre-clusters: the sticky edges • They are added to the spanner. • They may lead to new label assignment and cluster growth.
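“Good” Neighbor (1) [Figure: edge between v and u; the neighbor has marked labels. Labels shown: (3,2), (2,2), (1,2), (0,2) and (1,6), (0,6).]
“Good” Neighbor (2) [Figure: edge between v and u, with clusters and pre-clusters C(3,2), C(2,2), C(1,2), C(1,6).]
“Bad” Neighbor [Figure: edge between v and u; no marked labels. Labels shown: (1,6), (3,2).]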
Properties of the Clusters • Small diameter • Number of clusters bounded by . • Do not need to cover the whole graph with clusters, but the uncovered subgraph is sparse. The uncovered subgraph consists of sticky edges, and there are not too many of them.
Sticky Edges are Rare [Figure: vertex v with neighbors u1, u2, u3, u4 … arriving in sequence.] • A neighbor is good with probability at least ½. • After seeing at most (log n)/2 good neighbors, v will be assigned a top-level label and be included in a cluster. No more sticky edges for v. • The number of sticky edges can be bounded by the length of the shortest prefix of the sequence u1, u2, u3, u4 … that contains (log n)/2 good neighbors.
One-pass diameter lower bound • Theorem: For any γ, any one-pass algorithm that returns a k-approximation (k slightly better than 1/γ) to the diameter of a weighted graph requires Ω(n^(1+γ)) space. • Proof (sketch): • Some properties of a random graph G in G_{n,p} with p = 1/n^(1−γ): • w.h.p. it contains a set E’ of edges with |E’| = n^(1+γ)/64 such that: • no edge in E’ is in a cycle of length k or less. • When all edges in E’ are removed, the graph still has diameter < 2/γ. • Fix one such G = (V, E ∪ E’)
Sketch (cont’d): Reduce from INDEX (hard for communication complexity) • INDEX: Alice has an m-bit string x and Bob has an index i. The one-way communication complexity for Bob to learn x_i is Ω(m). • Reduction: the m edges in E’ are enumerated 1 .. m. • Alice constructs the prefix of the stream corresponding to multiple copies of H = (V, E ∪ E’’), where E’’ ⊆ E’ consists of the edges whose indices i have x_i = 1. All of Alice’s edges have weight 1. • Bob constructs the rest of the stream: if his index corresponds to edge (a,b) in E’, • he connects vertex b in one copy with vertex a in the next copy at weight 0. • He also creates a source s and a sink t, and connects s to a in the first copy and b in the last copy to t at high weight. • Properties: If x_i = 1, where i is Bob’s index, then small diameter; • else large diameter. • A small-space streaming algorithm would violate the communication lower bound.
Open Problems • Are there interesting subclasses of graphs for which distances and diameters are “easier” in streaming model? • Is there a more generous but reasonable model?
Network Intrusion Detection Systems • Current techniques fairly primitive: • Misuse: Pattern match packets with misuse signatures in database • Anomaly: Look for statistical anomalies in individual packet headers and payload • Needed: • Look across multiple packets for intrusions • Deal with interleaved traffic
An Example: Browsing habits • You read sports and cartoons. You’re equally likely to read both. You do not remember what you read last. • You’d expect a “random” sequence SCSSCSSCSSCCSCCCSSSSCSC…
Two readers • I like health, entertainment, and politics • I always read entertainment first, health next and politics last • The sequence would be EHPEHPEHPEHPEHPEHPEHP…
Two readers, one log file • If there is one log file… • Assume there is no correlation between us SECHSSPECSHPESCSSHCPCESCHCCPSESHPESSHPE… Is there enough information to tell that there are two people browsing? What are they browsing? How are they browsing?
Clues in stream? • Yes, under model assumptions. • H,E, P have special relationship. • They cannot belong to different (uncorrelated) people. • Not clear about S and C ... These could be two people or one person. SECHSSPECSHPESCSSHCPCESCHCCPSESHPESSHPE…
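Markov Chains as Stochastic Sources [Figure: a Markov chain on states 1 to 7 with transition probabilities on the arrows.] Output sequence: 1 4 7 7 1 2 5 7 ...
Modeled by … Markov chains on S, E, C, H, F [Figure: one chain on {S, C} with every transition probability 1/2; one chain on {E, H, F} with every transition probability 1.]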
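To make the two-reader example concrete, here is a small simulation sketch. It is illustrative only: it assumes the log picks which reader acts next uniformly at random, it writes P for politics as in the log example, and all function names are hypothetical.

```python
import random


def markov_source(transitions, start, rng):
    """Yield an endless sequence of states from a Markov chain."""
    state = start
    while True:
        yield state
        next_states, probs = zip(*transitions[state].items())
        state = rng.choices(next_states, weights=probs)[0]


def interleaved_log(length, seed=0):
    """Merge two independent readers into one log, choosing the next reader at random."""
    rng = random.Random(seed)
    fair = {"S": 0.5, "C": 0.5}
    reader_a = markov_source({"S": fair, "C": fair}, "S", rng)
    reader_b = markov_source({"E": {"H": 1}, "H": {"P": 1}, "P": {"E": 1}}, "E", rng)
    return "".join(next(rng.choice((reader_a, reader_b))) for _ in range(length))


def project(log, alphabet):
    """Restrict the log to the symbols of one candidate reader."""
    return "".join(c for c in log if c in alphabet)
```

Projecting the merged log onto {E, H, P} always recovers the strict EHPEHP… cycle, which is the "special relationship" among H, E and P. The projection onto {S, C} looks random, so it cannot distinguish one reader from two.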
Need more realistic generalizations of such analysis to • deal with: • Worm detection • Anomaly detection at high traffic links in a network • TCP compliance • BGP policy behavior
Partial Solution: Clusters (1) • A cluster is a subset of vertices and a small diameter spanning tree built on these vertices. • Intra-cluster edge
Partial Solution: Clusters (2) • Inter-cluster edges Bollobás’s result no longer applies. Need to control the number of clusters (i.e., make it ).
Open Shortest Path First (OSPF) • Packet routing protocol: • Each link broadcasts its weight (initially could be 1/bw...) • To route from A to B, each router sends along a shortest path to B, dividing traffic evenly if there are many shortest paths. • Adjustments: • A human operator observing congestion on a link could raise its weight • Local decisions could lead to oscillation & suboptimality • Link latency: Convex function of its utilization • Goal: Minimize max link latency, total link latency, expected path latency, etc. • Exact optimizations typically NP-hard
Streaming problem • Can we automate the weight adjustments? • Simple scenario: • Assume weights have been optimized for current traffic matrix • Assume we now have a new (unknown) traffic matrix • observed at routers • Assume some simple goal ... minimize time to converge to new solution ... or something ... • Streaming algorithm should itself be allowed to generate traffic for communication between monitors and for • diagnostics, but this overhead should be low.
Early Worm Detection • The EarlyBird system [Singh et al] identifies the following characteristics: • Substantial volume of identical traffic • Rising infection levels (# sources & destinations increasing) • Random probing (infected source tries many IP addresses) • 1. A top-k type streaming algorithm can identify a high volume of identical traffic at one location. Can we do better in a distributed fashion? • 2. How do we communicate to detect rising infection levels? • 3. Sophisticated worms may not use random probing. What other discriminating tests are possible? • 4. Sophisticated worms are polymorphic… not “identical” traffic.
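One possible instance of a "top-k type streaming algorithm" for point 1 is the Misra-Gries frequent-items summary, sketched below. This is a swapped-in textbook technique for illustration, not the method used by EarlyBird; the function name and the idea of keying on a payload string are assumptions.

```python
def misra_gries(stream, k):
    """Frequent-items summary using at most k - 1 counters.

    Any item occurring more than len(stream) / k times is guaranteed to
    survive in the returned dictionary; the stored counts are underestimates.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement everything and drop the zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

For example, a stream with 60 copies of one payload followed by 40 distinct payloads, summarized with k = 3, still reports the repeated payload. Polymorphic worms (point 4) defeat exact-match counting like this, since their payloads are not identical.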