Anthony Okorodudu CSE 6392 2006-4-25

Counting at Large: Efficient Cardinality Estimation in Internet-Scale Data Networks
By Nikos Ntarmos, Peter Triantafillou, and Gerhard Weikum

Outline
• Introduction
• Motivation
• Related Work
• Distributed Hash Tables (DHT)
• Hash Sketches
• Distributed Hash Sketches (DHS)
• Counting with DHS
• Conclusion

Introduction
• Peer-to-peer (P2P) started as a way of sharing files and CPU cycles among end users
• Evolved into the cutting-edge networks of today
• Distributed Hash Tables (DHT) made this feasible
• Probabilistic guarantees for degree of efficiency, fault tolerance, and availability
• Data management systems of huge scale

Motivation
• Need for distributed counting mechanisms
• File-sharing P2P systems: total number of documents shared by users
• Sensor networks: compute aggregates in a duplicate-insensitive manner
• Internet-scale DB system: build histograms for query access plans

Central Goals
• Efficiency: number of nodes contacted for counting must be small
• Scalability and availability: large numbers of nodes may need to add elements to a (multi-)set
• Access and storage load balancing: counting and related overheads should be fairly distributed across all nodes

Central Goals (continued)
• Accuracy: tunable, robust, and highly accurate cardinality estimation
• Simplicity and ease of integration: special, solution-based indexing structures should be avoided
• Duplicate (in)sensitivity: count total number of items as well as the number of unique items in multi-sets

Distributed Counting Protocols
• One-node-per-counter protocols
• Gossip-based protocols
• Broadcast/convergecast-type protocols
• Sampling-based protocols

One-node-per-counter
• Select a node in the DHT overlay and use it to maintain the counter value
• Poor scalability
• Resembles a centralized system

Gossip-based
• Provide weak probabilistic semantics of “eventual consistency” for outcome
• Every node exchanges information with a set of nodes
• Low bandwidth
• Not efficient in terms of number of nodes to be contacted
• Low accuracy

Broadcast/Convergecast-type
• Broadcast phase
  • Querying node broadcasts the query through the network, creating a tree of nodes as the query propagates through the overlay
• Convergecast phase
  • Each node sends its local part of the answer, along with the answers received from nodes deeper down the tree, to its “parent” node
• Similar to gossip-based
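The two phases on this slide can be sketched in a few lines of Python. This is a minimal single-process illustration, not a protocol from the paper: the overlay is assumed to be a plain adjacency dictionary, and the names `broadcast`, `convergecast`, `neighbors`, and `local_count` are hypothetical.

```python
from collections import deque


def broadcast(neighbors, root):
    """Broadcast phase: flood the query from root.

    The first node a query arrives from becomes the receiver's parent,
    which builds a tree over the overlay. Returns the parent map and
    the order in which nodes were reached.
    """
    parent = {root: None}
    order = [root]
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for nb in neighbors[node]:
            if nb not in parent:
                parent[nb] = node
                order.append(nb)
                queue.append(nb)
    return parent, order


def convergecast(parent, order, local_count):
    """Convergecast phase: push partial sums up the tree to the root."""
    partial = {node: local_count[node] for node in order}
    # Reversed reach order visits every child before its parent.
    for node in reversed(order):
        p = parent[node]
        if p is not None:
            partial[p] += partial[node]
    return partial[order[0]]


# Hypothetical four-node overlay with per-node item counts.
neighbors = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a", "d"], "d": ["b", "c"]}
local_count = {"a": 3, "b": 1, "c": 4, "d": 2}
parent, order = broadcast(neighbors, "a")
total = convergecast(parent, order, local_count)
```

Here `total` is 10, and `order` contains all four nodes: every node in the overlay is contacted, which is the efficiency cost of this family of protocols. A plain sum like this is also duplicate-sensitive.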

Sampling-based
• Estimate the value of the counter by selectively querying a set of nodes in the network
• Sampling-based techniques suffer from accuracy issues
• Larger samples give higher accuracy, but more nodes need to be contacted
• Sampling-based techniques are usually duplicate-sensitive
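The accuracy versus contacted-nodes trade-off can be illustrated with a simple scale-up estimator. This is a sketch under stated assumptions: nodes are sampled uniformly without replacement, each node reports its local count, and the function name `sample_estimate` is hypothetical.

```python
import random


def sample_estimate(node_counts, sample_size, rng):
    """Estimate the network-wide total from a uniform sample of nodes.

    node_counts: list with one local count per node (stand-in for the network).
    sample_size: number of nodes contacted.
    rng: a random.Random instance, passed in so runs are reproducible.
    """
    sampled = rng.sample(node_counts, sample_size)
    # Scale the sample mean up to the full population of nodes.
    return sum(sampled) / sample_size * len(node_counts)


rng = random.Random(0)
counts = [rng.randint(0, 100) for _ in range(1000)]
small = sample_estimate(counts, 10, rng)      # few nodes contacted, noisy
full = sample_estimate(counts, 1000, rng)     # every node contacted, exact
```

Contacting every node reproduces the exact total, while a small sample is cheap but noisy. Because the estimator adds up local counts, an item stored on several nodes is counted several times, so it is duplicate-sensitive.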

Distributed Hash Tables (DHT)
• Family of structured P2P network overlays exposing a hash-table-like interface
  • insert(key, value)
  • lookup(key)
• Highly efficient for point queries
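A toy, single-process stand-in for the insert/lookup interface is sketched below. It assumes one common design in which node names and keys are hashed onto the same identifier ring and a key is stored on the first node at or after its hash. The class `ToyDHT` and its `owner` helper are hypothetical, and real DHTs additionally route each request over multiple overlay hops, which is not modeled here.

```python
import hashlib
from bisect import bisect_left


class ToyDHT:
    """Keys and nodes share one identifier ring; no networking is modeled."""

    def __init__(self, node_names, bits=32):
        self.space = 1 << bits
        self.ring = sorted((self._h(name), name) for name in node_names)
        self.ids = [node_id for node_id, _ in self.ring]
        self.store = {name: {} for name in node_names}

    def _h(self, text):
        # Hash a string onto the identifier ring.
        digest = hashlib.sha1(text.encode()).digest()
        return int.from_bytes(digest, "big") % self.space

    def owner(self, key):
        # First node clockwise from the key's hash, wrapping around the ring.
        idx = bisect_left(self.ids, self._h(key)) % len(self.ring)
        return self.ring[idx][1]

    def insert(self, key, value):
        self.store[self.owner(key)][key] = value

    def lookup(self, key):
        return self.store[self.owner(key)].get(key)


dht = ToyDHT(["node-1", "node-2", "node-3"])
dht.insert("song.mp3", "peer 17")
found = dht.lookup("song.mp3")
```

A point query touches only the node that owns the key, which is why this interface is efficient for exact-match lookups but offers no direct way to count items across the whole network.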

Hash Sketches
• First proposed as a means of estimating the cardinality of a multiset in a database
• Used in many application domains for counting distinct elements in multi-sets
• Approximate query answering in very large DBs, data mining on the internet graph, stream processing

Hash Sketches (continued)
• PCSA (Probabilistic Counting with Stochastic Averaging) algorithm assumes a pseudo-uniform hash function
• Super-LogLog algorithm relaxes the pseudo-uniform hash function constraints of PCSA
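A minimal, centralized sketch of a PCSA-style estimator follows. It is an illustration of the general technique rather than the presenters' or the paper's code: SHA-1 stands in for the pseudo-uniform hash, the class name `PCSASketch` and the parameters `m` and `width` are hypothetical, and 0.77351 is the correction constant from Flajolet and Martin's analysis. Super-LogLog is not shown.

```python
import hashlib

PHI = 0.77351  # Flajolet-Martin correction constant


def _hash64(item):
    # SHA-1 truncated to 64 bits, standing in for a pseudo-uniform hash.
    return int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")


def rho(x, width):
    """Index of the least-significant 1-bit of x (width if x == 0)."""
    return width if x == 0 else (x & -x).bit_length() - 1


class PCSASketch:
    def __init__(self, m=64, width=32):
        self.m = m                # number of bitmap vectors
        self.width = width        # bits per bitmap
        self.bitmaps = [0] * m

    def add(self, item):
        h = _hash64(item)
        j = h % self.m            # choose a bitmap (stochastic averaging)
        r = min(rho(h // self.m, self.width), self.width - 1)
        self.bitmaps[j] |= 1 << r

    def merge(self, other):
        # The sketch of a union is the bitwise OR of the two sketches.
        self.bitmaps = [a | b for a, b in zip(self.bitmaps, other.bitmaps)]

    def estimate(self):
        total = 0
        for bitmap in self.bitmaps:
            r = 0
            while bitmap & (1 << r):   # position of the lowest 0 bit
                r += 1
            total += r
        return self.m / PHI * 2 ** (total / self.m)


once, twice = PCSASketch(), PCSASketch()
for i in range(20000):
    once.add(i)
    twice.add(i)
    twice.add(i)                  # duplicates set the same bit again
```

Both sketches end up with identical bitmaps, so duplicates do not change the estimate. The same property is what makes merging by bitwise OR safe when several parties insert overlapping items.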

Distributed Hash Sketches (DHS)
• Fully decentralized, scalable, and efficient mechanism capable of providing estimates on the cardinality of multi-sets
• Satisfies all the central goals
• Implemented using PCSA (DHS-PCSA) or super-LogLog (DHS-sLL) hash sketches

DHS
• O(log N) cost to insert an object in an N-node DHS
• O(b * log N) bandwidth consumption if the size of the data is b bytes
• Data items are deleted if not updated within their time-to-live, so deleting an item incurs no extra cost
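The time-to-live rule in the last bullet is a soft-state scheme, which might look like the sketch below on a single node. This is an assumption-laden illustration: the class `SoftStateStore`, the key names, and the explicit `now` argument (a logical clock instead of wall time) are all hypothetical and not taken from the paper.

```python
class SoftStateStore:
    """Entries vanish unless refreshed within ttl time units."""

    def __init__(self, ttl):
        self.ttl = ttl
        self.expiry = {}          # key -> time at which the entry expires

    def put(self, key, now):
        # Inserting and refreshing are the same operation.
        self.expiry[key] = now + self.ttl

    def live_keys(self, now):
        # Drop expired entries lazily; no delete message is ever sent.
        self.expiry = {k: t for k, t in self.expiry.items() if t > now}
        return set(self.expiry)


store = SoftStateStore(ttl=10)
store.put("bit-3", now=0)
store.put("bit-5", now=0)
store.put("bit-3", now=8)         # refreshed, now expires at time 18
alive_at_12 = store.live_keys(now=12)
alive_at_18 = store.live_keys(now=18)
```

At time 12 only `bit-3` is still alive, and by time 18 the store is empty. An owner deletes an item simply by no longer refreshing it, which is why deletion adds no extra messages.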

DHS (continued)
• Accuracy of hash sketches increases with multiple bitmap vectors
• Either PCSA or super-LogLog algorithm is applied for counting

Counting with DHS

Conclusion
• Distributed Hash Sketches is a fully decentralized, scalable, and efficient mechanism for providing estimates on the cardinality of multi-sets in internet-scale information systems
• DHS can be implemented using either PCSA or super-LogLog hash sketches
• DHS histograms can introduce great performance savings during query optimization

References
• N. Ntarmos, P. Triantafillou, and G. Weikum. Counting at Large: Efficient Cardinality Estimation in Internet-Scale Data Networks. ICDE 2006.

Thanks
