
Leveraging Big Data: Lecture 12






Presentation Transcript


  1. http://www.cohenwang.com/edith/bigdataclass2013 Leveraging Big Data: Lecture 12. Instructors: Edith Cohen, Amos Fiat, Haim Kaplan, Tova Milo

  2. Today • All-Distances Sketches • Applications of All-Distances Sketches • Back to linear sketches (random linear transformations)

  3. All-Distances Sketches (ADSs). N_d(v): the set of nodes that are within distance at most d from v. A node u is in ADS(v) iff u is in the Min-Hash sketch of N_d(v) for some d. The bottom-k ADS(v) is a list of pairs (u, d(v,u)), where u is a node ID; u is included iff h(u) < the k-th smallest hash of the nodes that are closer to v than u.
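As a sanity check on the definition, the bottom-k ADS can be built by brute force from the full distance map: scan nodes by increasing distance and keep a node whenever its hash is among the k smallest seen so far. A minimal sketch (the node names and hash values below are illustrative; distance ties are broken by hash):

```python
import heapq

def bottom_k_ads(dist, hashes, k):
    """Bottom-k ADS from a distance map: u is kept iff h(u) is below the
    k-th smallest hash among the nodes preceding it in the scan (ties in
    distance broken by hash, as a stand-in for the strict definition)."""
    order = sorted(dist, key=lambda u: (dist[u], hashes[u]))
    ads, kept = [], []   # kept: max-heap (negated) of the k smallest hashes so far
    for u in order:
        if len(kept) < k or hashes[u] < -kept[0]:
            ads.append((u, dist[u]))
            heapq.heappush(kept, -hashes[u])
            if len(kept) > k:
                heapq.heappop(kept)
    return ads
```

Note that the node itself (distance 0) is always the first entry, and larger k keeps more nodes at every distance.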

  4. ADS example. [Figure: a graph with edge lengths, the shortest-path distances from a source node, and a random permutation of the nodes given by hash values in [0,1].]

  5. ADS example. [Figure: nodes sorted by shortest-path distance from the source node, with their hash values.]

  6. ADS example. [Figure: the same sorted list, highlighting the entries retained in the bottom-k ADS.]

  7. Expected Size of Bottom-k ADS. Lemma: E[|ADS(v)|] = Σ_{i=1}^{n} min{1, k/i} ≤ k(1 + ln(n/k)). Proof: The i-th closest node to v is included with probability min{1, k/i}.* *Same argument as in lecture 2 to bound the number of updates to a Min-Hash sketch of a stream: distance instead of time.
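The lemma can be checked numerically: by linearity of expectation, the expected bottom-k ADS size is the sum of the inclusion probabilities min{1, k/i}, and this sum stays below k(1 + ln(n/k)). A quick sketch (n and k below are arbitrary choices):

```python
import math

def expected_ads_size(n, k):
    # The i-th closest node is included with probability min(1, k/i):
    # its hash must be among the k smallest of the i closest nodes.
    return sum(min(1.0, k / i) for i in range(1, n + 1))

n, k = 10_000, 4
size = expected_ads_size(n, k)        # equals k + k*(H_n - H_k)
bound = k * (1 + math.log(n / k))     # since H_n - H_k <= ln(n/k)
```

So a bottom-k ADS is only a logarithmic factor larger than a plain bottom-k Min-Hash sketch.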

  8. Computing the bottom-k ADS for all nodes: pruned Dijkstra's. Iterate over source nodes u by increasing h(u): Run Dijkstra's algorithm from u on the reverse graph. When visiting v: IF ADS(v) has fewer than k entries with distance ≤ d(u,v) • add (u, d(u,v)) to ADS(v) • Continue on v • ELSE, prune at v
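A sketch of the pruned computation, assuming the graph is given as a dict of dicts of edge lengths (for directed graphs, pass the reverse graph so that the search from u populates entries (u, d(u,v))). Since sources are processed by increasing hash, the pruning test reduces to counting existing entries at distance at most d:

```python
import heapq

def all_ads_pruned_dijkstra(graph, hashes, k):
    """graph[v][w] = length of edge (v, w); hashes: node -> hash in [0,1].
    Returns ads[v] = list of (hash, distance) entries of the bottom-k ADS."""
    ads = {v: [] for v in graph}

    def qualifies(v, d):
        # fewer than k current entries with distance <= d; all current
        # entries have smaller hash (sources processed in hash order)
        return sum(1 for _, dist in ads[v] if dist <= d) < k

    for u in sorted(graph, key=lambda x: hashes[x]):   # increasing hash
        heap, done = [(0, u)], set()
        while heap:
            d, v = heapq.heappop(heap)
            if v in done:
                continue
            done.add(v)
            if not qualifies(v, d):
                continue            # prune: do not relax edges out of v
            ads[v].append((hashes[u], d))
            for w, length in graph[v].items():
                if w not in done:
                    heapq.heappush(heap, (d + length, w))
    return ads
```

The pruning is what keeps the total work near-linear: low-hash sources travel far, later sources are cut off almost immediately.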

  9. ADS Computation: Dijkstra. Perform pruned Dijkstra from nodes by increasing hash. [Figure: the example graph, annotated with the nodes' hash values.]

  10. Computing bottom-k ADSs (unweighted edges): Dynamic Programming. Initialize: each node v places (v, 0) in ADS(v) and sends the update (h(v), 0) to its neighbors. Iterate on d = 1, 2, … until no updates: For all nodes v • If an update (h, d−1) was received and h is smaller than the k-th smallest hash currently in ADS(v), create the new entry (h, d) • If a new entry was created, send the update (h, d) to the neighbors of v
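For unweighted edges the dynamic program runs in synchronous rounds: entries created at distance d−1 are offered to all neighbors, which keep them at distance d if the hash qualifies. A minimal sketch (adjacency lists, assumed symmetric for an undirected graph; hashes assumed distinct):

```python
def all_ads_dp(adj, hashes, k):
    """adj: node -> list of neighbors (unweighted). Returns
    ads[v] = list of (hash, distance) entries, built by increasing distance."""
    ads = {v: [(hashes[v], 0)] for v in adj}     # round 0: (v, 0) in ADS(v)
    frontier = {v: {hashes[v]} for v in adj}     # updates created last round
    d = 0
    while any(frontier.values()):
        d += 1
        new_frontier = {v: set() for v in adj}
        for v in adj:
            for h in frontier[v]:
                for w in adj[v]:
                    # keep (h, d) iff h beats the k-th smallest hash already
                    # in ADS(w); existing entries are at distance < d
                    # (or were added earlier this round)
                    kth = sorted(x for x, _ in ads[w])[:k]
                    if (len(kth) < k or h < kth[-1]) and all(h != x for x, _ in ads[w]):
                        ads[w].append((h, d))
                        new_frontier[w].add(h)
        frontier = new_frontier
    return ads
```

Each round is one pass over the edge set, matching the "diameter many sequential passes" implementation mentioned below.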

  11. ADS Computation: Dynamic Programming. Iteration d computes the entries with distance d. We display the lowest hash received in the current iteration. [Figure: the example graph with hash values.]

  12. ADS Computation: Dynamic Programming. Start: Each node v places (v, 0) in ADS(v). [Figure: each node labeled with its own hash.]

  13. ADS Computation: Dynamic Programming. Start: Each node v places (v, 0) in ADS(v).

  14-16. ADS Computation: Dynamic Programming. Iteration 1: each node sends its distance-0 hash to all neighbors; a receiving node creates an ADS entry with distance 1 if the hash is lower than what it already holds.

  17. ADS Computation: Dynamic Programming. Iteration 2: each node sends its distance-1 entries to all neighbors; a receiving node creates an ADS entry with distance 2 if the hash is lower.

  18. ADS computation: Analysis • Pruned Dijkstra's introduces ADS entries by increasing hash. • DP introduces entries by increasing distance. • With either approach, the inNeighbors of a node are used only after an update to its ADS, so the expected number of edge traversals is bounded by the sum over nodes of ADS size times inDegree: Σ_v E[|ADS(v)|]·inDeg(v) = O(k m log n).

  19. ADS computation: Comments • Pruned Dijkstra ADS computation can be parallelized, similarly to BFS reachability sketches, to reduce dependency. • DP can be implemented via (diameter many) sequential passes over the edge set. We only need to keep in memory k entries for each node (the k smallest hashes so far). • It is also possible to perform a distributed computation where nodes asynchronously communicate with neighbors. Entries can be modified or removed; this incurs overhead.

  20. Next: Some ADS applications • Cardinality/similarity of d-neighborhoods by extracting Min-Hash sketches • Closeness centralities: HIP estimators • Closeness similarities • Distance oracles


  22. Using ADSs. Extract the Min-Hash sketch of the d-neighborhood N_d(v) of v from ADS(v): the bottom-k hashes among entries with distance ≤ d. Directly using Min-Hash sketches (lectures 2-3): • Can estimate the cardinality |N_d(v)| • Can estimate the Jaccard similarity of N_d(u) and N_d(v)
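Extracting the sketch and estimating cardinality takes a few lines; the (k−1)/(k-th smallest hash) estimator is the standard bottom-k cardinality estimator from lectures 2-3, assuming hashes uniform on [0,1]:

```python
def neighborhood_sketch(ads, d, k):
    """Bottom-k Min-Hash sketch of N_d(v), read off the bottom-k ADS of v
    (a list of (hash, distance) pairs)."""
    return sorted(h for h, dist in ads if dist <= d)[:k]

def cardinality_estimate(sketch, k):
    """Standard bottom-k estimator, hashes uniform on [0, 1]."""
    if len(sketch) < k:
        return len(sketch)           # the whole neighborhood was seen
    return (k - 1) / sketch[-1]      # (k-1) / (k-th smallest hash)
```

One ADS thus answers neighborhood-cardinality queries for every distance d at once.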

  23. Some ADS applications • Cardinality/similarity of d-neighborhoods by extracting Min-Hash sketches • Closeness Centralities: HIP estimators • Closeness similarities • Distance oracles

  24. Closeness Centrality. Based on distances to all other nodes. Bavelas (1948): C(v) = 1 / Σ_u d(v,u). Issues: • Does not work for disconnected graphs • Emphasis on the contribution of "far" nodes. Correction: Harmonic mean of distances: C(v) = Σ_{u≠v} 1/d(v,u).

  25. Closeness Centrality. Based on distances to all other nodes. More general definition: C(v) = Σ_{u≠v} α(d(v,u)) β(u), where • α is non-increasing; β is some filter • Harmonic mean: α(x) = 1/x • Exponential decay with distance: e.g. α(x) = 2^{−x} • Degree centrality: α(x) = 1 for x ≤ 1, 0 otherwise • Neighborhood size: α(x) = 1 for x ≤ d, 0 otherwise
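For intuition, the general definition is easy to compute exactly (by BFS) on a small graph; the ADS machinery exists precisely to avoid this all-pairs computation at scale. A sketch with harmonic α and a uniform filter β:

```python
from collections import deque

def closeness_centrality(adj, v, alpha, beta):
    """Exact C(v) = sum over reachable u != v of alpha(d(v,u)) * beta(u),
    for an unweighted graph given as adjacency lists."""
    dist = {v: 0}
    queue = deque([v])
    while queue:
        x = queue.popleft()
        for w in adj[x]:
            if w not in dist:
                dist[w] = dist[x] + 1
                queue.append(w)
    return sum(alpha(d) * beta(u) for u, d in dist.items() if u != v)

# Harmonic centrality: alpha(x) = 1/x, uniform filter beta = 1.
harmonic = lambda adj, v: closeness_centrality(adj, v, lambda x: 1 / x, lambda u: 1)
```

Swapping in a different α or β recovers the other variants on the slide (degree, neighborhood size, exponential decay).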

  26. Closeness Centrality. Based on distances to all other nodes. More general definition: C(v) = Σ_{u≠v} α(d(v,u)) β(u), with α non-increasing and β some filter. Centrality with respect to a filter β: • Education level, community (TAU graduates), geography, language, product type • Applications for filters: attribute completion, targeted ads

  27. HIP estimators for ADSs • For each node u, we estimate the "presence" of u with respect to ADS(v) (= 1 if u ∈ ADS(v), 0 otherwise). • The estimate is 0 if u ∉ ADS(v). • If u ∈ ADS(v), we compute the probability τ that it is included, conditioned on fixed hash values of all nodes that are closer to v than u. We then use the inverse-probability estimate 1/τ. • For bottom-k: τ = the k-th smallest hash among the nodes closer to v than u (and τ = 1 when there are fewer than k such nodes).
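The HIP weights can be computed in a single pass over an ADS sorted by distance, because the k smallest hashes among the nodes closer than any entry are themselves always present in the ADS. A sketch for bottom-k with hashes uniform on [0,1] (τ = 1 when fewer than k closer entries exist):

```python
def hip_weights(ads, k):
    """ads: list of (hash, distance) pairs of a bottom-k ADS.
    Returns (hash, distance, 1/tau) per entry, where tau is the k-th
    smallest hash among the strictly closer entries."""
    out, seen = [], []       # seen: hashes of the closer entries so far
    for h, d in sorted(ads, key=lambda e: e[1]):
        smallest = sorted(seen)[:k]
        tau = smallest[-1] if len(smallest) == k else 1.0
        out.append((h, d, 1.0 / tau))
        seen.append(h)
    return out
```

Summing the weights 1/τ of entries with distance ≤ d (optionally multiplied by α(d)β(u)) gives the HIP estimate of neighborhood cardinality or of the centrality sum.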

  28. Example: HIP estimates. Bottom-k ADS of a node, k = 2. [Figure: the ADS entries with, for each entry, τ = the 2nd smallest hash amongst the closer nodes.]

  29. Example: HIP estimates. [Figure: the resulting estimates 1/τ for each entry, by distance.]

  30. Example: HIP estimates. Only good guys (β(u) = 1 iff u is good): sum the estimates 1/τ over the "good" entries. [Figure.]

  31. Example: HIP estimates. Only bad guys (β(u) = 1 iff u is bad): sum the estimates 1/τ over the "bad" entries. [Figure.]

  32. Estimating Closeness Centrality. C(v) = Σ_{u≠v} α(d(v,u)) β(u), with α non-increasing and β some filter. Lemma: The HIP estimator has CV ≤ 1/√(2(k−1)) for uniform β, or when the ADSs are computed with respect to β. We do not give the proof here.

  33. Closeness Centrality interpreted as the L1 norm of closeness vectors • View nodes as features, weighted by some β • The relevance of node u to node v decreases with the distance d(v,u), according to α. The closeness vector of node v: c_v, with entries c_{vu} = α(d(v,u)) β(u).

  34. Next: Some ADS applications • Cardinality/similarity of d-neighborhoods by extracting Min-Hash sketches • Closeness Centralities: HIP estimators • Closeness similarities • Distance oracles

  35. Closeness Similarity, computed from the closeness vectors c_u, c_v • Weighted Jaccard coefficient: Σ_w min(c_{uw}, c_{vw}) / Σ_w max(c_{uw}, c_{vw}) • Cosine similarity: c_u · c_v / (‖c_u‖ ‖c_v‖)
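For reference, both measures are straightforward when the closeness vectors are available explicitly (the sketches only estimate them). Vectors are represented below as sparse dicts mapping node to c_{vu} = α(d(v,u))β(u):

```python
import math

def weighted_jaccard(cu, cv):
    """Sum of coordinate-wise minima over sum of coordinate-wise maxima."""
    keys = set(cu) | set(cv)
    num = sum(min(cu.get(w, 0.0), cv.get(w, 0.0)) for w in keys)
    den = sum(max(cu.get(w, 0.0), cv.get(w, 0.0)) for w in keys)
    return num / den if den else 0.0

def cosine(cu, cv):
    """Dot product normalized by the L2 norms."""
    dot = sum(x * cv.get(w, 0.0) for w, x in cu.items())
    nu = math.sqrt(sum(x * x for x in cu.values()))
    nv = math.sqrt(sum(x * x for x in cv.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

With 0/1 vectors (α an indicator, β = 1), the weighted Jaccard coefficient reduces to the ordinary Jaccard similarity of the d-neighborhoods.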

  36. Closeness Similarity: choices of α, β. Similarity of d-Neighborhoods: α(x) = 1 when x ≤ d, α(x) = 0 when x > d • d-Neighborhood Jaccard: β(u) = 1 • d-Neighborhood Adamic-Adar: β(u) = 1/ln(deg(u))

  37. Estimating Closeness Similarity. Lemma: We can estimate the weighted Jaccard coefficient or the cosine similarity of the closeness vectors of two nodes from their ADSs, with mean square error that decreases with k.* *For uniform β, or when the ADSs are computed with respect to β. We do not give the proof here.

  38. Next: Some ADS applications • Cardinality/similarity of d-neighborhoods by extracting Min-Hash sketches • Closeness Centralities: HIP estimators • Closeness similarities • Distance oracles

  39. Estimating SP distance. We can use ADS(u) and ADS(v) to obtain an upper bound on d(u,v): min over w ∈ ADS(u) ∩ ADS(v) of d(u,w) + d(w,v). Comment: For directed graphs we need a "forward" ADS and a "backward" ADS. What can we say about the quality of this bound?
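The upper bound itself is a small computation over the two sketches. A minimal version, with each ADS represented as a dict from node ID to distance (for directed graphs, ads_u would be the forward ADS of u and ads_v the backward ADS of v):

```python
def distance_upper_bound(ads_u, ads_v):
    """ads_*: dict node -> distance. Returns the minimum of
    d(u,w) + d(w,v) over the common nodes w, or None if the
    sketches share no node."""
    common = set(ads_u) & set(ads_v)
    if not common:
        return None
    return min(ads_u[w] + ads_v[w] for w in common)
```

Any common node w gives a valid upper bound by the triangle inequality; the oracle simply takes the best one available in the sketches.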

  40. Bottom-k ADSs of u and v. [Figure: the two ADSs, each a list of hash values with the corresponding distances.]

  41. Common nodes in the bottom-k ADSs of u and v. [Figure: the candidate sums d(u,w) + d(w,v) for the common nodes, e.g. 10+15=25, 10+4=14, 10+15=25; the minimum, 14, is the distance estimate.]

  42. Query time improvement: only test the Min-Hash nodes' membership in the other ADS. [Figure: the two ADSs with only the pivot entries tested.]

  43. Query time • Basic version: intersection of the two ADSs. • Faster version: tests the presence of "pivots" in the other ADS; the query time (requires data structures) is smaller. • Can (will not show in class) further reduce query time by noting dependencies between pivots: no need to test all of them. Comment: The theoretical worst-case upper bound on the stretch is the same, but in practice, estimate quality deteriorates with these query-time "improvements": better to use the full sketch.

  44. Bounding the stretch. Stretch: the ratio between the approximate and the true distance. Theorem: On undirected graphs, for any integer c ≥ 1, if we use k = n^{1/c}, there is constant probability that the estimate is at most (2c − 1) times the actual distance. • c = 1: stretch is at most 1, but k = n. • With fixed k we get O(log n) stretch. • The stretch/representation-size tradeoff is worst-case tight (under some hardness assumptions). • In practice, the stretch is typically much better.

  45. Bounding the stretch. We prove a slightly weaker version in class. Theorem: On undirected graphs, for any integer c ≥ 1, if we use k = n^{1/c}, there is constant probability that the estimate is at most O(c) times the actual distance.

  46. Proof outline: • Part 1: We show that if the ratio between the cardinalities of two consecutive neighborhoods is small, then the min-hash of the smaller set is likely to be a member of the bottom-k of the larger set; the bound we get depends on the level of the larger set. • Part 2: If all consecutive pairs have a large ratio, we show that the number of levels cannot be too big, and the minimum-hash node gives good stretch.

  47. Part 2: If all consecutive pairs have a large ratio: consider neighborhoods of growing "distance" from u and v. If the growth ratio is at least r at every level, the set sizes grow geometrically, so the number of levels is at most log_r n. This means all nodes are within a bounded distance from (one of) u, v. In particular, the node with minimum hash must be of distance at most this bound from one and from the other; we obtain the claimed "stretch".

  48. Proof outline: • Part 1: We show that if the ratio between the cardinalities of two consecutive neighborhoods is small, then the min-hash of the smaller set is likely to be a member of the bottom-k of the larger set; the bound we get depends on the level of the larger set. • Part 2: If all consecutive pairs have a large ratio, we show that the number of levels cannot be too big, and the minimum-hash node gives good stretch.

  49. If the minimum hash is in the bottom-k of both neighborhoods, our estimate is at most the sum of the two corresponding distances. More generally…
