Dealing with MASSIVE Data Feifei Li (lifeifei@cs.fsu.edu), Dept. of Computer Science, FSU. Sep 9, 2008
Brief Bio • B.A.S. in computer engineering from Nanyang Technological University in 2002 • Ph.D. in computer science from Boston University in 2007 • Research intern/visitor at AT&T Labs, IBM T. J. Watson Research Center, and Microsoft Research • Now: Assistant Professor in the CS Department at FSU
Research Areas • Database applications: indexing, query processing, spatial databases, geographic information systems, data security and privacy, probabilistic data • Algorithms and data structures: data streams, I/O-efficient algorithms, computational geometry, streaming algorithms, misc.
Massive Data • Massive datasets are being collected everywhere • Storage management software is a billion-$ industry Examples (2002): • Phone: AT&T 20TB phone call database, wireless tracking • Consumer: WalMart 70TB database, buying patterns • Web: web crawl of 200M pages and 2000M links, Google’s huge indexes • Geography: NASA satellites generate 1.2TB per day
Example: LIDAR Terrain Data • Massive (irregular) point sets (1-10m resolution) • Becoming relatively cheap and easy to collect • The Appalachian Mountains alone: between 50GB and 5TB • Exceeds the memory limit and needs to be stored on disk
Example: Network Flow Data • AT&T IP backbone generates 500 GB per day • Gigascope: A data stream management system • Compute certain statistics • Can we do computation without storing the data?
Traditional Random Access Machine (RAM) Model • Standard theoretical model of computation: • Infinite memory (how nice!) • Uniform access cost • Simple model crucial for the success of the computer industry
How to Deal with MASSIVE Data (when there is not enough memory)?
Solution 1: Buy More Memory • Expensive • (Probably) not scalable • The growth rate of data is higher than the growth rate of memory
Solution 2: Cheat! (by random sampling) • Provide approximate solution for some problems • average, frequency of an element, etc. • What if we want the exact result? • Many problems can’t be solved by sampling • maximum, and all problems mentioned later
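A tiny sketch of why sampling works for averages but fails for maxima; the function names and parameters are my own illustration, not from the talk:

```python
import random

def sampled_average(data, k=1000):
    # Estimate the average from a uniform random sample of k items.
    sample = random.sample(data, min(k, len(data)))
    return sum(sample) / len(sample)

def sampled_max(data, k=1000):
    # The same idea fails for the maximum: a small sample almost surely
    # misses the single largest element of a huge dataset.
    sample = random.sample(data, min(k, len(data)))
    return max(sample)

data = range(10_000_000)
print(sampled_average(data))   # close to the true average, 4,999,999.5
print(sampled_max(data))       # usually far below the true max, 9,999,999
```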
Solution 3: Use the Right Computation Model • External memory model • Streaming model • Probabilistic model (brief)
Computation Model for Massive Data (1): External Memory Model • Internal memory is limited but fast • External memory is unlimited but slow
Memory Hierarchy • Modern machines have a complicated memory hierarchy (CPU, L1 and L2 caches, RAM, disk) • Levels get larger and slower further away from the CPU • Block sizes and memory sizes differ across levels! • There have been a few attempts to model the full hierarchy, but none were successful: they are too complicated!
Slow I/O • Disk access is ~10^6 times slower than main memory access (a disk read must move the read/write arm to the right track on the magnetic surface) • “The difference in speed between modern CPU and disk technologies is analogous to the difference in speed in sharpening a pencil using a sharpener on one’s desk or by taking an airplane to the other side of the world and using a sharpener on someone else’s desk.” (D. Comer) • Disk systems try to amortize the large access time by transferring large contiguous blocks of data (8-16 KB) • Important to store/access data so as to take advantage of blocks (locality)
Puzzle #1: Majority Counting • A huge file of characters stored on disk • Question: is there a character that appears > 50% of the time? • Solution 1: sort + scan • A few passes (O(log_{M/B} N)): we will come to this later • Solution 2: divide-and-conquer • Load a chunk into memory: N/M chunks • Count within each chunk, return its majority • The overall majority must be the majority in at least one chunk • Iterate on the candidates until they fit in memory (< M) • Very few passes (O(log_M N)), candidate set geometrically decreasing • Solution 3: O(1) memory, 2 passes (answer to be posted later; one possibility sketched below) • Example stream: d a a e a b a a f a g b b a e c a d a a
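The slide leaves Solution 3 open, but one classic technique with exactly these bounds is the Boyer-Moore majority vote; a minimal sketch (my own illustration, not necessarily the intended answer):

```python
def majority(stream_file):
    # Pass 1 (Boyer-Moore vote): keep one candidate and a counter in O(1) memory.
    candidate, count = None, 0
    with open(stream_file) as f:
        for ch in iter(lambda: f.read(1), ''):
            if count == 0:
                candidate, count = ch, 1
            elif ch == candidate:
                count += 1
            else:
                count -= 1
    # Pass 2: verify the candidate really appears > 50% of the time.
    total = occurrences = 0
    with open(stream_file) as f:
        for ch in iter(lambda: f.read(1), ''):
            total += 1
            occurrences += (ch == candidate)
    return candidate if occurrences * 2 > total else None
```

Reading one character at a time is slow in practice; a real implementation would read block-sized buffers, which is exactly the locality point made above.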
External Memory Model [AV88] • N = # of items in the problem instance • B = # of items per disk block • M = # of items that fit in main memory • I/O: move one block between memory and disk • Performance measure: # of I/Os performed by the algorithm • We assume (for convenience) that M > B^2 • [Figure: CPU (P) with main memory (M) and disk (D); data moves in blocks via I/O]
Sorting in External Memory • Break all N elements into N/M chunks of size M each • Sort each chunk individually in memory • Merge them together • Can merge up to M/B sorted lists (queues) at once, keeping one block per list (M/B blocks) in main memory
Sorting in External Memory • Merge sort: • Create N/M memory-sized sorted runs • Repeatedly merge runs together, Θ(M/B) at a time • O(log_{M/B}(N/M)) merge phases using O(N/B) I/Os each • Total: O((N/B) log_{M/B}(N/B)) I/Os
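A minimal Python sketch of this two-phase idea, under simplifying assumptions: it merges all runs in one go via heapq, whereas a true external sort merges at most M/B runs per pass and manages blocks explicitly; file layout (newline-terminated records) and parameter names are mine:

```python
import heapq, os, tempfile

def external_sort(infile, outfile, M=1_000_000):
    # Phase 1: read roughly memory-sized chunks, sort each, write sorted runs.
    runs = []
    with open(infile) as f:
        while True:
            chunk = f.readlines(M)          # roughly M bytes of lines per chunk
            if not chunk:
                break
            chunk.sort()
            run = tempfile.NamedTemporaryFile('w+', delete=False)
            run.writelines(chunk)
            run.seek(0)
            runs.append(run)
    # Phase 2: merge the sorted runs (heapq.merge streams them lazily,
    # reading only one line per run at a time).
    with open(outfile, 'w') as out:
        out.writelines(heapq.merge(*runs))
    for run in runs:
        run.close()
        os.unlink(run.name)
```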
External Searching: B-Tree • Each node (except the root) has fan-out between B/2 and B • Size: O(N/B) blocks on disk • Search: O(log_B N) I/Os following a root-to-leaf path • Insertion and deletion: O(log_B N) I/Os
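To make the O(log_B N) search cost concrete, here is a small, hypothetical sketch of B-tree search over in-memory stand-ins for disk pages (the Node class and names are illustrative, not a real disk-based implementation):

```python
import bisect

class Node:
    # One disk block: a sorted list of at most B keys, plus children if internal.
    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children   # None for leaves

def btree_search(node, key):
    # Each node visited costs one block I/O; a root-to-leaf path touches
    # O(log_B N) nodes because every non-root node holds at least B/2 keys.
    i = bisect.bisect_left(node.keys, key)
    if i < len(node.keys) and node.keys[i] == key:
        return True
    if node.children is None:      # reached a leaf without finding the key
        return False
    return btree_search(node.children[i], key)

root = Node([5], [Node([1, 3]), Node([7, 9])])
print(btree_search(root, 7))   # True, after visiting 2 "blocks"
```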
Fundamental Bounds (internal vs. external) • Scanning: N vs. N/B • Sorting: N log N vs. (N/B) log_{M/B}(N/B) • Searching: log N vs. log_B N More Results • List ranking: N internal • Minimal spanning tree: N log N internal • Offline union-find: N internal • Interval searching: log N + T internal vs. log_B N + T/B external • Rectangle enclosure: log N + T internal vs. log N + T/B external • R-tree search
Does All the Theory Matter? • Programs developed in the RAM model still run even when there is not enough memory • They run on large datasets because the OS moves blocks as needed • The OS utilizes paging and prefetching strategies • But if a program makes scattered accesses, even a good OS cannot take advantage of block access: Thrashing! • [Figure: running time vs. data size, with a sharp slowdown once the data outgrows main memory and spills to disk]
Toy Experiment: Permuting • Problem: • Input: N elements out of order: 6, 7, 1, 3, 2, 5, 10, 9, 4, 8 • Each element knows its correct position • Output: store them on disk in the right order • Internal memory solution: • Just scan the original sequence and move every element to its right place! • O(N) time, but O(N) I/Os (one random access per element) • External memory solution: • Use sorting • O(N log N) time, O((N/B) log_{M/B}(N/B)) I/Os, which is far fewer than N I/Os in practice
A Practical Example on Real Data • Computing persistence on large terrain data
Takeaways • Need to be very careful when your program’s space usage exceeds physical memory size • If program mostly makes highly localized accesses • Let the OS handle it automatically • If program makes many non-localized accesses • Need I/O-efficient techniques • Three common techniques (recall the majority counting puzzle): • Convert to sort + scan • Divide-and-conquer • Other tricks
Want to know more about I/O-efficient algorithms? A course on I/O-efficient algorithms is offered as CIS5930 (Advanced Topics in Data Management)
Computation Model for Massive Data (2): Streaming Model • You get to look at each element only once! • Cannot store, don’t want to store, or can’t wait to store the data before processing it
Streaming Algorithms: Applications • Network Operations Center (NOC) monitoring streams from enterprise networks, PSTN, and DSL/cable networks • Example queries: • What are the top (most frequent) 1000 (source, dest) pairs seen over the last month? • How many distinct (source, dest) pairs have been seen? (set-expression query) • SQL join query: SELECT COUNT (R1.source, R2.dest) FROM R1, R2 WHERE R1.dest = R2.source • Traditional alternative: back-end data warehouse / DBMS (Oracle, DB2) with off-line analysis: slow, expensive • Other applications: • Sensor networks • Network security • Financial applications • Web logs and clickstreams
Puzzle #2: Find the Missing Mahjong Tile • How to find the missing tile by making one pass over everything? • Assuming you can’t memorize everything (of course) • Assign a number to each type of tile (e.g., three tile types numbered 8, 14, and 22) • Compute the sum of all remaining tiles • (1+…+9 + 11+…+19 + 21+…+29) × 4 − sum = missing tile!
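The trick is easy to express in code; a small sketch of the one-pass, O(1)-memory recovery, using the numeric tile encoding from the slide:

```python
# Tile types are encoded as 1-9, 11-19, 21-29; a full set has 4 of each.
TILE_TYPES = [base + i for base in (0, 10, 20) for i in range(1, 10)]
FULL_SUM = sum(TILE_TYPES) * 4

def missing_tile(remaining):
    # One pass, O(1) memory: the missing tile is the gap in the total sum.
    return FULL_SUM - sum(remaining)

# Example: remove one tile (say 14) from a full set and recover it.
tiles = TILE_TYPES * 4
tiles.remove(14)
print(missing_tile(tiles))   # -> 14
```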
A Research Problem: Count # Distinct Elements • Unfortunately, there is a lower bound saying you can’t do this exactly without using Ω(n) memory • But if we allow some error, we can approximate it well • Example stream: d a a e a b a a f a g b b a e c a d a a → # distinct elements = 7
Solution: FM Sketch [FM85, AMS99] • Take a (pseudo)random hash function h : {1,…,n} → {1,…,2^d}, where 2^d > n • For each incoming element x, compute h(x) • e.g., h(5) = 10101100010000 • Count how many trailing zeroes it has • Remember the maximum number of trailing zeroes in any h(x); call it Y • Can show E[2^Y] ≈ # distinct elements • With 2 elements, “on average” there is one h(x) with 1 trailing zero • With 4 elements, “on average” there is one h(x) with 2 trailing zeroes • With 8 elements, “on average” there is one h(x) with 3 trailing zeroes • …
Counting Paintballs • Imagine the following scenario: • A bag of n paintballs is emptied at the top of a long staircase • At each step, each paintball either bursts and marks the step, or bounces to the next step: 50/50 chance either way • Looking only at the pattern of marked steps, what was n?
Counting Paintballs (cont.) • What does the distribution of paintball bursts look like? • The number of bursts at each step follows a binomial distribution: B(n, 1/2) at the 1st step, B(n, 1/4) at the 2nd, …, B(n, 1/2^Y) at the Y-th • The expected number of bursts drops geometrically • Few bursts after log_2 n steps
Solution: FM Sketch [FM85, AMS99] • So 2^Y is an unbiased estimator for the # distinct elements • However, it has a large variance • Use O(1/ε^2 · log(1/δ)) copies to guarantee an estimator that, with probability 1−δ, is within relative error ε • Applications: • How many distinct IP addresses used a given link to send their traffic from the beginning of the day? • How many new IP addresses appeared today that didn’t appear before?
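A toy Python sketch of the estimator under these simplifications: Python's salted hash stands in for truly random hash functions, and the copies are simply averaged rather than run through the full median-of-means analysis (class and parameter names are mine):

```python
import random

def trailing_zeros(x):
    # Position of the lowest set bit = number of trailing zero bits (x > 0).
    return (x & -x).bit_length() - 1

class FMSketch:
    def __init__(self, copies=64):
        # One random salt per copy, standing in for independent hash functions.
        self.salts = [random.getrandbits(64) for _ in range(copies)]
        self.max_zeros = [0] * copies

    def add(self, x):
        for i, salt in enumerate(self.salts):
            h = hash((salt, x)) & 0xFFFFFFFF or 1   # 32-bit hash; avoid h == 0
            self.max_zeros[i] = max(self.max_zeros[i], trailing_zeros(h))

    def estimate(self):
        # Average 2^Y over the copies to cut the variance; the real FM analysis
        # also applies a correction constant and takes medians of averages.
        return sum(2 ** y for y in self.max_zeros) / len(self.salts)

sk = FMSketch()
for x in "daaeabaafagbbaecadaa":
    sk.add(x)
print(sk.estimate())   # a ballpark estimate of the 7 distinct elements
```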
Finding Heavy Hitters • Which elements appeared in the stream more than 10% of the time? • Applications: • Networking: finding IP addresses sending the most traffic • Databases: iceberg queries • Data mining: finding “hot” items (item sets) in transaction data • Solution: • An exact solution is difficult • If we allow an approximation error of ε • Use O(1/ε) space and O(1) time per element in the stream (see the sketch below)
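The slide does not name an algorithm, but the stated O(1/ε)-space bound matches the classic Misra-Gries frequent-items algorithm; a sketch with k = 1/ε counters, which reports every element occurring more than an N/k fraction of the stream and undercounts each by at most N/k:

```python
def misra_gries(stream, k):
    # Maintain at most k-1 counters; decrements are O(k) in the worst case
    # but O(1) amortized per element.
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # Decrement all counters; drop those that reach zero.
            for y in list(counters):
                counters[y] -= 1
                if counters[y] == 0:
                    del counters[y]
    return counters   # candidate heavy hitters (verify with a second pass)

print(misra_gries("daaeabaafagbbaecadaa", 10))   # 'a' (50% of the stream) survives
```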
Streaming in a Distributed World • [Figure: a query site issuing Q(S1 ∪ S2 ∪ …) over streams S1,…,S6 held at remote sites, coordinated by a Network Operations Center (NOC)] • Large-scale querying/monitoring: inherently distributed! • Streams physically distributed across remote sites, e.g., a stream of UDP packets through a subset of edge routers • Challenge is “holistic” querying/monitoring • Queries over the union of distributed streams Q(S1 ∪ S2 ∪ …) • Streaming data is spread throughout the network
Streaming in a Distributed World (cont.) • Need timely, accurate, and efficient query answers • Additional complexity over centralized data streaming! • Need space/time- and communication-efficient solutions • Minimize network overhead • Maximize network lifetime (e.g., sensor battery life) • Cannot afford to “centralize” all streaming data
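One reason sketches fit this distributed setting: many of them compose under stream union. For the FM sketch above, the sketch of S1 ∪ S2 is just the coordinate-wise maximum of the two sites' counters, so each site ships a handful of numbers to the query site instead of its raw stream. A sketch, reusing the hypothetical FMSketch class from earlier (sites must agree on the hash functions up front):

```python
def merge_fm(a, b):
    # Requires both sketches to use identical salts (i.e., hash functions).
    assert a.salts == b.salts
    merged = FMSketch(copies=len(a.salts))
    merged.salts = a.salts
    merged.max_zeros = [max(x, y) for x, y in zip(a.max_zeros, b.max_zeros)]
    return merged

site1 = FMSketch()
site2 = FMSketch()
site2.salts = site1.salts        # share hash functions across sites
for x in "aabbcc":
    site1.add(x)
for x in "ccddee":
    site2.add(x)
print(merge_fm(site1, site2).estimate())   # ballpark of 5 distinct in S1 ∪ S2
```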
Want to know more about streaming algorithms? A graduate-level course on streaming algorithms will be approximately offered in the next next next semester, with an error guarantee of 5%! Or, talk to me tomorrow!
Top-k Queries • Extremely useful in information retrieval • top-k sellers, popular movies, Google search results, etc. • Well-known approaches: the Threshold Algorithm, RankSQL • Example: top-2 = {t3, t5}
Top-k Queries on Uncertain Data • The top-k answer depends on the interplay between score and confidence • Example (score, confidence) pairs: (sensor reading, sensor reliability), (PageRank, how well the page matches the query)
Top-k Definition: U-Topk • The k tuples with the maximum probability of being the top-k • {t3, t5}: 0.2 × 0.8 = 0.16 • {t3, t4}: 0.2 × (1−0.8) × 0.9 = 0.036 • {t5, t4}: (1−0.2) × 0.8 × 0.9 = 0.576 • ... • Potential problem: the top-k could be very different from the top-(k+1)
Top-k Definition: U-kRanks • The i-th tuple is the one with the maximum probability of being at rank i, for i = 1,…,k • Rank 1: t3: 0.2; t5: (1−0.2) × 0.8 = 0.64; t4: (1−0.2) × (1−0.8) × 0.9 = 0.144; ... • Rank 2: t3: 0; t5: 0.2 × 0.8 = 0.16; t4: 0.9 × (0.2 × (1−0.8) + (1−0.2) × 0.8) = 0.612 • Potential problem: duplicated tuples in the top-k • (Both definitions are checked by the enumeration sketch below)
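Both definitions can be verified by brute-force enumeration of possible worlds. A small sketch that reproduces the numbers above for three independent tuples, assuming (as the slide's numbers imply) the score order t3 > t5 > t4; it is exponential in the number of tuples and for illustration only:

```python
from itertools import product

# (name, score, probability): independent tuples, higher score ranks first.
TUPLES = [("t3", 30, 0.2), ("t5", 20, 0.8), ("t4", 10, 0.9)]

def possible_worlds():
    # Yield each subset of tuples (a possible world) with its probability.
    for bits in product([0, 1], repeat=len(TUPLES)):
        prob, world = 1.0, []
        for (name, score, p), present in zip(TUPLES, bits):
            prob *= p if present else (1 - p)
            if present:
                world.append((score, name))
        world.sort(reverse=True)               # rank by score, descending
        yield [name for _, name in world], prob

def u_topk(k):
    # The k-set with the maximum probability of being the top-k.
    scores = {}
    for world, prob in possible_worlds():
        if len(world) >= k:
            key = tuple(world[:k])
            scores[key] = scores.get(key, 0.0) + prob
    return max(scores.items(), key=lambda kv: kv[1])

def u_kranks(k):
    # For each rank i, the tuple most likely to sit exactly at rank i.
    best = []
    for i in range(k):
        scores = {}
        for world, prob in possible_worlds():
            if len(world) > i:
                scores[world[i]] = scores.get(world[i], 0.0) + prob
        best.append(max(scores.items(), key=lambda kv: kv[1]))
    return best

print(u_topk(2))     # ≈ (('t5', 't4'), 0.576)
print(u_kranks(2))   # ≈ [('t5', 0.64), ('t4', 0.612)]
```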
Uncertain Data Models • An uncertain data model represents a probability distribution of database instances (possible worlds) • Basic model: mutual independence among all tuples • Complete models: able to represent any distribution of possible worlds • Atomic independent random Boolean variables • Each tuple corresponds to a Boolean formula, appears iff the formula evaluates to true • Exponential complexity
Uncertain Data Model: x-relations • Each x-tuple represents a discrete probability distribution over its alternative tuples • x-tuples are mutually independent; alternatives within an x-tuple are disjoint (mutually exclusive) • x-tuples may be single-alternative or multi-alternative • Example (x-relation table not reproduced here): U-Top2 = {t1, t2}, U-2Ranks = (t1, t3)
Want to know more about uncertain data management? A graduate-level course on uncertain data management will be (likely, probably) offered in the next next next next next semester. Or, talk to me tomorrow!
Recap • External memory model • Main memory is fast but limited • External memory is slow but unlimited • Aim to optimize I/O performance • Streaming model • Main memory is fast but small • Can’t store, not willing to store, or can’t wait to store the data • Compute the desired answers in one pass • Probabilistic data model • Can’t store or query the exponentially many possible instances (possible worlds) • Compute the desired answers on the succinct representation of the probabilistic data (efficiently!! possibly allowing some errors)
Thanks! Questions?