
SCUD: Scalable Counting of Unique Data

Dmitry Kit, Prince Mahajan, Navendu Jain, Praveen Yalagandula*, Mike Dahlin, and Yin Zhang
Laboratory for Advanced Systems Research, The University of Texas at Austin
* Hewlett-Packard Labs, Palo Alto, CA




Problem Description

Counting information about unique data is an important basic operation in large-scale distributed applications.

• Several applications:
• Distributed storage [e.g., Bamboo/OpenDHT]: top-k users in a storage system, grouped by activity type (put stores information; get retrieves information).
• Network monitoring [e.g., heavy hitters]: top-k flows by bytes and packets; MAX/MIN/AVG aggregates over incoming flows in an organization.
• Content distribution networks [e.g., Akamai]: counting the number of unique accesses per web page.
• Telematics applications [e.g., BMW Assist]: counting the number of unique cars per highway.

Our Approach

Push query processing into the network.

• Leverage the SDIMS system:
• Multiple aggregation trees.
• Flexible API to install aggregation functions and control aggregate-value propagation.
• Chaining of aggregation functions.

Demo

• Web demo:
• Perform two operations on the system: put and get.
• Specify the address from which requests originate.
• View the top-10 list of clients who performed a put or a get.
• Details:
• Uses the Bamboo DHT as the storage system.
• When Bamboo receives a "put" or "get" request, it notifies SDIMS with the new access count (old_count + 1).
• SDIMS sends Bamboo the aggregated list of top-10 client IPs by "put" and "get" operations.
• Bamboo updates its local top-10 list; if the list changes, it is inserted into SDIMS under a different aggregation.
• The lists from each root are combined to form the final top-10 list.
• Demo URL: http://z.cs.utexas.edu/users/dkit/bamboo_test.php
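The chaining idea (a per-client SUM aggregation whose output feeds a Top-K aggregation) can be sketched as follows. This is a minimal standalone illustration, not the SDIMS API; the function names and data shapes are hypothetical.

```python
from collections import Counter
from heapq import nlargest

# Hypothetical sketch (not the real SDIMS API): each aggregation function
# merges the partial results of a node's children on the way to the root.

def sum_aggregate(child_counts):
    """Aggregation 1: per-client SUM. Each child contributes a
    {client_ip: access_count} mapping; merging adds counts per key."""
    total = Counter()
    for counts in child_counts:
        total.update(counts)
    return dict(total)

def top_k_aggregate(child_lists, k=10):
    """Aggregation 2: Top-K. Each child contributes its local
    [(client_ip, count), ...] list; merging keeps the k largest."""
    merged = {}
    for lst in child_lists:
        for ip, count in lst:
            merged[ip] = max(merged.get(ip, 0), count)
    return nlargest(k, merged.items(), key=lambda pair: pair[1])

# Chaining: the output of SUM (per-client totals) feeds Top-K.
leaf_counts = [{"9.2.5.89": 3, "9.2.5.15": 1}, {"9.2.5.89": 2, "9.2.5.12": 4}]
totals = sum_aggregate(leaf_counts)
top = top_k_aggregate([list(totals.items())], k=2)
print(top)  # [('9.2.5.89', 5), ('9.2.5.12', 4)]
```

In the real system each aggregation runs inside the network on SDIMS trees; here both stages are collapsed into ordinary function calls to show only the data flow.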
• Use one aggregation function to aggregate information per attribute.
• Example: the number of accesses made by a client can be aggregated across multiple DHT trees, using a basic SUM aggregation for each client.
• Chain the output of the first aggregation function into a second aggregation [e.g., Top-K].
• The results of the first aggregation reside on a large number of nodes (e.g., 1,000,000).
• Chaining allows this distributed information to be combined.
• Example: maintain a top-10 list of heavy-hitter flows by source IP, ranked by total bytes sent.

Challenges

• Scalability:
• Large number of unique items.
• Large number of distributed data sources at which information about these items is updated.
• Existing schemes don't scale:
• Centralized: high bandwidth cost, large processing delay, high response latency.
• Hierarchical aggregation: the root node and nodes near it incur O(# items) message and storage costs.

[Figure: top-10 aggregation example. Leaf nodes hold per-source byte counts (<9.2.5.89, 4.2 MB>, <9.2.5.15, 2.2 MB>, <9.2.5.12, 1.5 MB>, <9.2.5.67, 1.3 MB>, <9.2.5.25, 900 KB>, <9.2.5.23, 800 KB>, <9.2.5.20, 760 KB>, <9.2.5.10, 653 KB>, <9.2.5.11, 400 KB>, <9.2.5.56, 257 KB>, <9.2.5.21, 120 KB>, <9.2.5.24, 100 KB>) that are combined into a top-10 list.]

Further Information

• URL: http://www.cs.utexas.edu/users/ypraveen/sdims
• Email: sdims@cs.utexas.edu
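The scalability argument against naive hierarchical aggregation can be made concrete with a small sketch: if every node forwards only its k largest <source IP, bytes> entries, each message carries O(k) entries instead of O(# unique items). The node structure below is assumed for illustration and is not the SDIMS implementation.

```python
from heapq import nlargest

def aggregate_up(node, k=10):
    """Return a node's top-k [(src_ip, bytes), ...] list.

    A leaf is a dict {src_ip: bytes_sent}; an internal node is a list of
    children. Each hop forwards only k entries (O(k) message cost) rather
    than full per-item state. Note: if one source IP's traffic is split
    across subtrees, truncating to k at each hop makes the result
    approximate; the sketch assumes each IP is observed at one leaf.
    """
    if isinstance(node, dict):
        merged = dict(node)
    else:
        merged = {}
        for child in node:
            for ip, nbytes in aggregate_up(child, k):
                merged[ip] = merged.get(ip, 0) + nbytes
    return nlargest(k, merged.items(), key=lambda pair: pair[1])

# Leaves hold per-source byte counts, as in the top-10 aggregation example.
tree = [
    {"9.2.5.89": 4_200_000, "9.2.5.15": 2_200_000},
    [{"9.2.5.12": 1_500_000}, {"9.2.5.67": 1_300_000, "9.2.5.23": 800_000}],
]
print(aggregate_up(tree, k=4))
```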

[Figure: chaining pipeline. Sensor data <IP1, count1>, <IP2, count2>, <IP3, count3>, …, <IPn, countn> feeds Aggregation 1 (SUM), whose output feeds Aggregation 2 (Top-K), producing a top-K list of <IP, count> pairs; AVG, MAX, and MIN aggregations are shown alongside.]

[Figure: telematics example. Per-road counts of cars/minute in Houston, Dallas, and San Antonio are aggregated into the major roads for each city (identified by traffic greater than some threshold), and then into the major roads in Texas.]

Demo output (estimated network size: 18 nodes):

Top 10 gets by client:

  Client IP         Number of gets
  127.0.0.1         1182
  128.83.144.30      243
  128.83.120.172     168
  128.83.120.245     147
  128.83.120.138     120
  128.83.120.21       90
  128.83.144.43       75
  128.83.130.11       48
  128.83.120.114      27
  128.83.144.241      12

Top 10 puts by client:

  Client IP         Number of puts
  127.0.0.1         1100
  128.83.120.138     180
  128.83.144.30      144
