Finding hot-lists
E N D
Presentation Transcript
Finding hot-lists Given a data stream S, find the k most popular items Most popular product in retail sales data Hot-spots for intrusion detection, fraud detection, network congestion etc. Frequently used blocks in executed code
Finding hot-lists – negative result [AMS96]: Approximating the frequency of the most frequent item in a sequence requires Omega(n) memory bits. Proof using Razborov’s element disjointness in communication complexity.
Communication Complexity ALICE input A BOB input B Cooperatively compute function f(A,B) Minimize bits communicated Unbounded computational power Communication Complexity C(f) – bits exchanged by optimal protocol Π Protocols? 1-way versus 2-way deterministic versus randomized Cδ(f) – randomized complexity for error probability δ
Adaptive Sampling [GM98] - Sample elements from the input set - Frequently occurring elements will be sampled more often - Sampling probability determined at runtime, according to the allowed memory usage - Tradeoff between overhead and accuracy - Give an estimate of the sample’s accuracy
Concise Samples - Uniform random sampling - Maintain an <id, count> pair for each element - The sample size can be much larger than the memory size - For skewed input sets the gain is much larger - Sampling is not applied at every block - Vitter’s reservoir sampling
Comparison of Hot List Algorithms 500K values in [1,500] Zipf parameter = 1.5 Footprint = 100 20