
Leveraging Big Data: Lecture 12






Presentation Transcript


  1. http://www.cohenwang.com/edith/bigdataclass2013 Leveraging Big Data: Lecture 12. Instructors: Edith Cohen, Amos Fiat, Haim Kaplan, Tova Milo

  2. Today • All-Distances Sketches • Applications of All-Distances Sketches • Back to linear sketches (random linear transformations)

  3. All-Distances Sketches (ADSs). N_d(v): the set of nodes that are within distance at most d from v. A node u is in ADS(v) iff u is in the Min-Hash sketch of N_d(v) for some d. The bottom-k ADS(v) is a list of pairs (u, d(v,u)), where u is a node ID; u is included iff h(u) < the k-th smallest hash of the nodes that are closer to v than u.
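As a sanity check on the definition, the bottom-k ADS can be built by brute force from the full distance map: scan nodes by increasing distance and keep a node whenever its hash is among the k smallest seen so far. A minimal sketch (the node names and hash values below are illustrative; distance ties are broken by hash):

```python
import heapq

def bottom_k_ads(dist, hashes, k):
    """Bottom-k ADS from a distance map: u is kept iff h(u) is below the
    k-th smallest hash among the nodes preceding it in the scan (ties in
    distance broken by hash, as a stand-in for the strict definition)."""
    order = sorted(dist, key=lambda u: (dist[u], hashes[u]))
    ads, kept = [], []   # kept: max-heap (negated) of the k smallest hashes so far
    for u in order:
        if len(kept) < k or hashes[u] < -kept[0]:
            ads.append((u, dist[u]))
            heapq.heappush(kept, -hashes[u])
            if len(kept) > k:
                heapq.heappop(kept)
    return ads
```

Note that the node itself (distance 0) is always the first entry, and larger k keeps more nodes at every distance.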

  4. ADS example. [Figure: a graph with edge lengths, the shortest-path distances from a source node, and a random permutation of the nodes given by hash values in [0,1].]

  5. ADS example. [Figure: nodes sorted by shortest-path distance from the source node, with their hash values.]

  6. ADS example. [Figure: the same sorted list, highlighting the entries retained in the bottom-k ADS.]

  7. Expected Size of Bottom-k ADS. Lemma: E[|ADS(v)|] = Σ_{i=1}^{n} min{1, k/i} ≤ k(1 + ln(n/k)). Proof: The i-th closest node to v is included with probability min{1, k/i}.* *Same argument as in lecture 2 to bound the number of updates to a Min-Hash sketch of a stream: distance instead of time.
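The lemma can be checked numerically: by linearity of expectation, the expected bottom-k ADS size is the sum of the inclusion probabilities min{1, k/i}, and this sum stays below k(1 + ln(n/k)). A quick sketch (n and k below are arbitrary choices):

```python
import math

def expected_ads_size(n, k):
    # The i-th closest node is included with probability min(1, k/i):
    # its hash must be among the k smallest of the i closest nodes.
    return sum(min(1.0, k / i) for i in range(1, n + 1))

n, k = 10_000, 4
size = expected_ads_size(n, k)        # equals k + k*(H_n - H_k)
bound = k * (1 + math.log(n / k))     # since H_n - H_k <= ln(n/k)
```

So a bottom-k ADS is only a logarithmic factor larger than a plain bottom-k Min-Hash sketch.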

  8. Computing the bottom-k ADS for all nodes: pruned Dijkstra's. Iterate over source nodes u by increasing h(u): Run Dijkstra's algorithm from u on the reverse graph. When visiting v: IF ADS(v) has fewer than k entries with distance ≤ d(u,v) • add (u, d(u,v)) to ADS(v) • Continue on v • ELSE, prune at v
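A sketch of the pruned computation, assuming the graph is given as a dict of dicts of edge lengths (for directed graphs, pass the reverse graph so that the search from u populates entries (u, d(u,v))). Since sources are processed by increasing hash, the pruning test reduces to counting existing entries at distance at most d:

```python
import heapq

def all_ads_pruned_dijkstra(graph, hashes, k):
    """graph[v][w] = length of edge (v, w); hashes: node -> hash in [0,1].
    Returns ads[v] = list of (hash, distance) entries of the bottom-k ADS."""
    ads = {v: [] for v in graph}

    def qualifies(v, d):
        # fewer than k current entries with distance <= d; all current
        # entries have smaller hash (sources processed in hash order)
        return sum(1 for _, dist in ads[v] if dist <= d) < k

    for u in sorted(graph, key=lambda x: hashes[x]):   # increasing hash
        heap, done = [(0, u)], set()
        while heap:
            d, v = heapq.heappop(heap)
            if v in done:
                continue
            done.add(v)
            if not qualifies(v, d):
                continue            # prune: do not relax edges out of v
            ads[v].append((hashes[u], d))
            for w, length in graph[v].items():
                if w not in done:
                    heapq.heappush(heap, (d + length, w))
    return ads
```

The pruning is what keeps the total work near-linear: low-hash sources travel far, later sources are cut off almost immediately.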

  9. ADS Computation: Dijkstra. Perform pruned Dijkstra from nodes by increasing hash. [Figure: the example graph, annotated with the nodes' hash values.]

  10. Computing bottom-k ADSs (unweighted edges): Dynamic Programming. Initialize: each node v places (v, 0) in ADS(v) and sends the update (h(v), 0) to its neighbors. Iterate on d = 1, 2, … until no updates: For all nodes v • If an update (h, d−1) was received and h is smaller than the k-th smallest hash currently in ADS(v), create the new entry (h, d) • If a new entry was created, send the update (h, d) to the neighbors of v
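For unweighted edges the dynamic program runs in synchronous rounds: entries created at distance d−1 are offered to all neighbors, which keep them at distance d if the hash qualifies. A minimal sketch (adjacency lists, assumed symmetric for an undirected graph; hashes assumed distinct):

```python
def all_ads_dp(adj, hashes, k):
    """adj: node -> list of neighbors (unweighted). Returns
    ads[v] = list of (hash, distance) entries, built by increasing distance."""
    ads = {v: [(hashes[v], 0)] for v in adj}     # round 0: (v, 0) in ADS(v)
    frontier = {v: {hashes[v]} for v in adj}     # updates created last round
    d = 0
    while any(frontier.values()):
        d += 1
        new_frontier = {v: set() for v in adj}
        for v in adj:
            for h in frontier[v]:
                for w in adj[v]:
                    # keep (h, d) iff h beats the k-th smallest hash already
                    # in ADS(w); existing entries are at distance < d
                    # (or were added earlier this round)
                    kth = sorted(x for x, _ in ads[w])[:k]
                    if (len(kth) < k or h < kth[-1]) and all(h != x for x, _ in ads[w]):
                        ads[w].append((h, d))
                        new_frontier[w].add(h)
        frontier = new_frontier
    return ads
```

Each round is one pass over the edge set, matching the "diameter many sequential passes" implementation mentioned below.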

  11. ADS Computation: Dynamic Programming. Iteration d computes the entries with distance d. We display the lowest hash received in the current iteration. [Figure: the example graph with hash values.]

  12. ADS Computation: Dynamic Programming. Start: Each node v places (v, 0) in ADS(v). [Figure: each node labeled with its own hash.]

  13. ADS Computation: Dynamic Programming. Start: Each node v places (v, 0) in ADS(v).

  14-16. ADS Computation: Dynamic Programming. Iteration 1: each node sends its distance-0 hash to all neighbors; a receiving node creates an ADS entry with distance 1 if the hash is lower than what it already holds.

  17. ADS Computation: Dynamic Programming. Iteration 2: each node sends its distance-1 entries to all neighbors; a receiving node creates an ADS entry with distance 2 if the hash is lower.

  18. ADS computation: Analysis • Pruned Dijkstra's introduces ADS entries by increasing hash. • DP introduces entries by increasing distance. • With either approach, the inNeighbors of a node are used only after an update to its ADS, so the expected number of edge traversals is bounded by the sum over nodes of ADS size times inDegree: Σ_v E[|ADS(v)|]·inDeg(v) = O(k m log n).

  19. ADS computation: Comments • Pruned Dijkstra ADS computation can be parallelized, similarly to BFS reachability sketches, to reduce dependency. • DP can be implemented via (diameter many) sequential passes over the edge set. We only need to keep in memory k entries for each node (the k smallest hashes so far). • It is also possible to perform a distributed computation where nodes asynchronously communicate with neighbors. Entries can be modified or removed; this incurs overhead.

  20. Next: Some ADS applications • Cardinality/similarity of d-neighborhoods by extracting Min-Hash sketches • Closeness centralities: HIP estimators • Closeness similarities • Distance oracles


  22. Using ADSs. Extract the Min-Hash sketch of the d-neighborhood N_d(v) of v from ADS(v): the bottom-k hashes among entries with distance ≤ d. Directly using Min-Hash sketches (lectures 2-3): • Can estimate the cardinality |N_d(v)| • Can estimate the Jaccard similarity of N_d(u) and N_d(v)
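Extracting the sketch and estimating cardinality takes a few lines; the (k−1)/(k-th smallest hash) estimator is the standard bottom-k cardinality estimator from lectures 2-3, assuming hashes uniform on [0,1]:

```python
def neighborhood_sketch(ads, d, k):
    """Bottom-k Min-Hash sketch of N_d(v), read off the bottom-k ADS of v
    (a list of (hash, distance) pairs)."""
    return sorted(h for h, dist in ads if dist <= d)[:k]

def cardinality_estimate(sketch, k):
    """Standard bottom-k estimator, hashes uniform on [0, 1]."""
    if len(sketch) < k:
        return len(sketch)           # the whole neighborhood was seen
    return (k - 1) / sketch[-1]      # (k-1) / (k-th smallest hash)
```

One ADS thus answers neighborhood-cardinality queries for every distance d at once.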

  23. Some ADS applications • Cardinality/similarity of d-neighborhoods by extracting Min-Hash sketches • Closeness Centralities: HIP estimators • Closeness similarities • Distance oracles

  24. Closeness Centrality. Based on distances to all other nodes. Bavelas (1948): C(v) = 1 / Σ_u d(v,u). Issues: • Does not work for disconnected graphs • Emphasis on the contribution of "far" nodes. Correction: Harmonic mean of distances: C(v) = Σ_{u≠v} 1/d(v,u).

  25. Closeness Centrality. Based on distances to all other nodes. More general definition: C(v) = Σ_{u≠v} α(d(v,u)) β(u), where • α is non-increasing; β is some filter • Harmonic mean: α(x) = 1/x • Exponential decay with distance: e.g. α(x) = 2^{−x} • Degree centrality: α(x) = 1 for x ≤ 1, 0 otherwise • Neighborhood size: α(x) = 1 for x ≤ d, 0 otherwise
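For intuition, the general definition is easy to compute exactly (by BFS) on a small graph; the ADS machinery exists precisely to avoid this all-pairs computation at scale. A sketch with harmonic α and a uniform filter β:

```python
from collections import deque

def closeness_centrality(adj, v, alpha, beta):
    """Exact C(v) = sum over reachable u != v of alpha(d(v,u)) * beta(u),
    for an unweighted graph given as adjacency lists."""
    dist = {v: 0}
    queue = deque([v])
    while queue:
        x = queue.popleft()
        for w in adj[x]:
            if w not in dist:
                dist[w] = dist[x] + 1
                queue.append(w)
    return sum(alpha(d) * beta(u) for u, d in dist.items() if u != v)

# Harmonic centrality: alpha(x) = 1/x, uniform filter beta = 1.
harmonic = lambda adj, v: closeness_centrality(adj, v, lambda x: 1 / x, lambda u: 1)
```

Swapping in a different α or β recovers the other variants on the slide (degree, neighborhood size, exponential decay).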

  26. Closeness Centrality. Based on distances to all other nodes. More general definition: C(v) = Σ_{u≠v} α(d(v,u)) β(u), with α non-increasing and β some filter. Centrality with respect to a filter β: • Education level, community (TAU graduates), geography, language, product type • Applications for filters: attribute completion, targeted ads

  27. HIP estimators for ADSs • For each node u, we estimate the "presence" of u with respect to ADS(v) (= 1 if u ∈ ADS(v), 0 otherwise). • The estimate is 0 if u ∉ ADS(v). • If u ∈ ADS(v), we compute the probability τ that it is included, conditioned on fixed hash values of all nodes that are closer to v than u. We then use the inverse-probability estimate 1/τ. • For bottom-k: τ = the k-th smallest hash among the nodes closer to v than u (and τ = 1 when there are fewer than k such nodes).
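The HIP weights can be computed in a single pass over an ADS sorted by distance, because the k smallest hashes among the nodes closer than any entry are themselves always present in the ADS. A sketch for bottom-k with hashes uniform on [0,1] (τ = 1 when fewer than k closer entries exist):

```python
def hip_weights(ads, k):
    """ads: list of (hash, distance) pairs of a bottom-k ADS.
    Returns (hash, distance, 1/tau) per entry, where tau is the k-th
    smallest hash among the strictly closer entries."""
    out, seen = [], []       # seen: hashes of the closer entries so far
    for h, d in sorted(ads, key=lambda e: e[1]):
        smallest = sorted(seen)[:k]
        tau = smallest[-1] if len(smallest) == k else 1.0
        out.append((h, d, 1.0 / tau))
        seen.append(h)
    return out
```

Summing the weights 1/τ of entries with distance ≤ d (optionally multiplied by α(d)β(u)) gives the HIP estimate of neighborhood cardinality or of the centrality sum.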

  28. Example: HIP estimates. Bottom-k ADS of a node, k = 2. [Figure: the ADS entries with, for each entry, τ = the 2nd smallest hash amongst the closer nodes.]

  29. Example: HIP estimates. [Figure: the resulting estimates 1/τ for each entry, by distance.]

  30. Example: HIP estimates. Only good guys (β(u) = 1 iff u is good): sum the estimates 1/τ over the "good" entries. [Figure.]

  31. Example: HIP estimates. Only bad guys (β(u) = 1 iff u is bad): sum the estimates 1/τ over the "bad" entries. [Figure.]

  32. Estimating Closeness Centrality. C(v) = Σ_{u≠v} α(d(v,u)) β(u), with α non-increasing and β some filter. Lemma: The HIP estimator has CV ≤ 1/√(2(k−1)) for uniform β, or when the ADSs are computed with respect to β. We do not give the proof here.

  33. Closeness Centrality interpreted as the L1 norm of closeness vectors • View nodes as features, weighted by some β • The relevance of node u to node v decreases with the distance d(v,u), according to α. The closeness vector of node v: c_v, with entries c_{vu} = α(d(v,u)) β(u).

  34. Next: Some ADS applications • Cardinality/similarity of d-neighborhoods by extracting Min-Hash sketches • Closeness Centralities: HIP estimators • Closeness similarities • Distance oracles

  35. Closeness Similarity, computed from the closeness vectors c_u, c_v • Weighted Jaccard coefficient: Σ_w min(c_{uw}, c_{vw}) / Σ_w max(c_{uw}, c_{vw}) • Cosine similarity: c_u · c_v / (‖c_u‖ ‖c_v‖)
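For reference, both measures are straightforward when the closeness vectors are available explicitly (the sketches only estimate them). Vectors are represented below as sparse dicts mapping node to c_{vu} = α(d(v,u))β(u):

```python
import math

def weighted_jaccard(cu, cv):
    """Sum of coordinate-wise minima over sum of coordinate-wise maxima."""
    keys = set(cu) | set(cv)
    num = sum(min(cu.get(w, 0.0), cv.get(w, 0.0)) for w in keys)
    den = sum(max(cu.get(w, 0.0), cv.get(w, 0.0)) for w in keys)
    return num / den if den else 0.0

def cosine(cu, cv):
    """Dot product normalized by the L2 norms."""
    dot = sum(x * cv.get(w, 0.0) for w, x in cu.items())
    nu = math.sqrt(sum(x * x for x in cu.values()))
    nv = math.sqrt(sum(x * x for x in cv.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

With 0/1 vectors (α an indicator, β = 1), the weighted Jaccard coefficient reduces to the ordinary Jaccard similarity of the d-neighborhoods.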

  36. Closeness Similarity: choices of α, β. Similarity of d-Neighborhoods: α(x) = 1 when x ≤ d, α(x) = 0 when x > d • d-Neighborhood Jaccard: β(u) = 1 • d-Neighborhood Adamic-Adar: β(u) = 1/ln(deg(u))

  37. Estimating Closeness Similarity. Lemma: We can estimate the weighted Jaccard coefficient or the cosine similarity of the closeness vectors of two nodes from their ADSs, with mean square error that decreases with k.* *For uniform β, or when the ADSs are computed with respect to β. We do not give the proof here.

  38. Next: Some ADS applications • Cardinality/similarity of d-neighborhoods by extracting Min-Hash sketches • Closeness Centralities: HIP estimators • Closeness similarities • Distance oracles

  39. Estimating SP distance. We can use ADS(u) and ADS(v) to obtain an upper bound on d(u,v): min over w ∈ ADS(u) ∩ ADS(v) of d(u,w) + d(w,v). Comment: For directed graphs we need a "forward" ADS and a "backward" ADS. What can we say about the quality of this bound?
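The upper bound itself is a small computation over the two sketches. A minimal version, with each ADS represented as a dict from node ID to distance (for directed graphs, ads_u would be the forward ADS of u and ads_v the backward ADS of v):

```python
def distance_upper_bound(ads_u, ads_v):
    """ads_*: dict node -> distance. Returns the minimum of
    d(u,w) + d(w,v) over the common nodes w, or None if the
    sketches share no node."""
    common = set(ads_u) & set(ads_v)
    if not common:
        return None
    return min(ads_u[w] + ads_v[w] for w in common)
```

Any common node w gives a valid upper bound by the triangle inequality; the oracle simply takes the best one available in the sketches.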

  40. Bottom-k ADSs of u and v. [Figure: the two ADSs, each a list of hash values with the corresponding distances.]

  41. Common nodes in the bottom-k ADSs of u and v. [Figure: the candidate sums d(u,w) + d(w,v) for the common nodes, e.g. 10+15=25, 10+4=14, 10+15=25; the minimum, 14, is the distance estimate.]

  42. Query time improvement: only test the Min-Hash nodes' membership in the other ADS. [Figure: the two ADSs with only the pivot entries tested.]

  43. Query time • Basic version: intersection of the two ADSs. • Faster version: tests the presence of "pivots" in the other ADS; the query time (requires data structures) is smaller. • Can (will not show in class) further reduce query time by noting dependencies between pivots: no need to test all of them. Comment: The theoretical worst-case upper bound on the stretch is the same, but in practice, estimate quality deteriorates with these query-time "improvements": better to use the full sketch.

  44. Bounding the stretch. Stretch: the ratio between the approximate and the true distance. Theorem: On undirected graphs, for any integer c ≥ 1, if we use k = n^{1/c}, there is constant probability that the estimate is at most (2c − 1) times the actual distance. • c = 1: stretch is at most 1, but k = n. • With fixed k we get O(log n) stretch. • The stretch/representation-size tradeoff is worst-case tight (under some hardness assumptions). • In practice, the stretch is typically much better.

  45. Bounding the stretch. We prove a slightly weaker version in class. Theorem: On undirected graphs, for any integer c ≥ 1, if we use k = n^{1/c}, there is constant probability that the estimate is at most O(c) times the actual distance.

  46. Proof outline: • Part 1: We show that if the ratio between the cardinalities of two consecutive neighborhoods is small, then the min-hash of the smaller set is likely to be a member of the bottom-k of the larger set; the bound we get depends on the level of the larger set. • Part 2: If all consecutive pairs have a large ratio, we show that the number of levels cannot be too big, and the minimum-hash node gives good stretch.

  47. Part 2: If all consecutive pairs have a large ratio: consider neighborhoods of growing "distance" from u and v. If the growth ratio is at least r at every level, the set sizes grow geometrically, so the number of levels is at most log_r n. This means all nodes are within a bounded distance from (one of) u, v. In particular, the node with minimum hash must be of distance at most this bound from one and from the other; we obtain the claimed "stretch".

  48. Proof outline: • Part 1: We show that if the ratio between the cardinalities of two consecutive neighborhoods is small, then the min-hash of the smaller set is likely to be a member of the bottom-k of the larger set; the bound we get depends on the level of the larger set. • Part 2: If all consecutive pairs have a large ratio, we show that the number of levels cannot be too big, and the minimum-hash node gives good stretch.

  49. If the minimum hash is in the bottom-k of both neighborhoods, our estimate is at most the sum of the two corresponding distances. More generally…
