## BIG DATA


**BIG DATA**
Algorithms

**Big data**
• Everyone talks about it,
• nobody really knows how to do it,
• everyone thinks everyone else is doing it,
• so everyone claims they are doing it...

**Is there anything fundamentally new?**
• Massive Data vs Big Data
• The 3 V’s
  • Volume
  • Velocity
  • Variety

**Big Data Algorithms**
• Distributed algorithms
• Data stream algorithms
• External memory algorithms
• Parallel algorithms

**Computational models for big data**
“All models are wrong, but some are useful.” (George E. P. Box)

**What’s the Bottleneck?**
• CPU speed approaching limit
  • Does it matter?
• From CPU-intensive computing to data-intensive computing
• Algorithm has to be near-linear, linear, or even sub-linear!
• Data movement, i.e., communication, is the bottleneck!

**Random Access Machine Model**
• Standard theoretical model of computation:
  • Unlimited memory
  • Uniform access cost
• Simple model crucial for success of computer industry

**Hierarchical Memory**
[Figure: memory hierarchy with levels L1, L2, and RAM]
• Modern machines have a complicated memory hierarchy
• Levels get larger and slower further away from the CPU
• Data is moved between levels using large blocks

**Slow I/O**
[Figure: disk with tracks, read/write arm, and magnetic surface]
• Disk access is $10^6$ times slower than main memory access
• “The difference in speed between modern CPU and disk technologies is analogous to the difference in speed in sharpening a pencil using a sharpener on one’s desk or by taking an airplane to the other side of the world and using a sharpener on someone else’s desk.” (D. Comer)
• Disk systems try to amortize the large access time by transferring large contiguous blocks of data (8-16 KB)
• Important to store/access data to take advantage of blocks (locality)

**Scalability Problems**
[Figure: running time vs. data size]
• Most programs are developed in the RAM model
• They run on large datasets because the OS moves blocks as needed
• Modern OSs use sophisticated paging and prefetching strategies
• But if a program makes scattered accesses, even a good OS cannot take advantage of block access
• Scalability problems!

**External Memory Model**
[Figure: processor P, main memory M, disk D, block I/O between M and D]
• N = # of items in the problem instance
• B = # of items per disk block
• M = # of items that fit in main memory
• I/O: # blocks moved between memory and disk
• CPU time is ignored
• Successful model used extensively in the massive data algorithms and database communities

**Fundamental Bounds**

| | Internal | External |
|---|---|---|
| Scanning | $N$ | $\frac{N}{B}$ |
| Sorting | $N \log N$ | $\frac{N}{B}\log_{M/B}\frac{N}{B}$ |
| Permuting | $N$ | $\min\{N, \frac{N}{B}\log_{M/B}\frac{N}{B}\}$ |
| Searching | $\log_2 N$ | $\log_B N$ |

• Note:
  • Linear I/O: $O(N/B)$
  • Permuting is not linear
  • Permuting and sorting bounds are equal in all practical cases
  • The B factor is VERY important: $\frac{N}{B} < \frac{N}{B}\log_{M/B}\frac{N}{B} \ll N$

**Queues and Stacks**
• Queue:
  • Maintain push and pop blocks in main memory
  • $O(1/B)$ I/Os per operation (amortized)
• Stack:
  • Maintain push/pop block in main memory
  • $O(1/B)$ I/Os per operation (amortized)

**Sorting**
• Merge sort:
  • Create N/M memory-sized sorted lists
  • Repeatedly merge lists together, $\Theta(M/B)$ at a time
  • $O(\log_{M/B}\frac{N}{M})$ phases using $O(N/B)$ I/Os each, for $O(\frac{N}{B}\log_{M/B}\frac{N}{B})$ I/Os in total

**Sorting**
• Fewer than M/B sorted lists (queues) can be merged in $O(N/B)$ I/Os
  • M/B blocks in main memory
  • The M/B head elements are kept in a heap in main memory

**Toy Experiment: Permuting**
• Problem:
  • Input: N elements out of order: 6, 7, 1, 3, 2, 5, 10, 9, 4, 8
    • Each element knows its correct position
  • Output: Store them on disk in the right order
• Internal memory solution:
  • Just scan the original sequence and move every element to the right place!
  • $O(N)$ time, $O(N)$ I/Os
• External memory solution:
  • Use sorting
  • $O(N \log N)$ time, $O(\frac{N}{B}\log_{M/B}\frac{N}{B})$ I/Os

**Searching in External Memory**
• Store N elements in a data structure such that
  • Given a query element x, find it or its predecessor

**B-trees**
• BFS-blocking naturally corresponds to a tree with fan-out $\Theta(B)$
• B-trees are balanced by allowing node degree to vary
  • Rebalancing is performed by splitting and merging nodes

**(a,b)-tree**
[Figure: a (2,4)-tree]
• T is an (a,b)-tree ($a \ge 2$ and $b \ge 2a-1$) if:
  • All leaves are on the same level (and contain between a and b elements)
  • Except for the root, all nodes have degree between a and b
  • The root has degree between 2 and b
• An (a,b)-tree uses linear space and has height $O(\log_a N)$
• Choosing $a, b = \Theta(B)$: each node/leaf is stored in one disk block
  • $O(N/B)$ space and $O(\log_B N)$ query

**(a,b)-Tree Insert**
• Insert:
  • Search and insert element in leaf v
  • WHILE v has b+1 elements/children DO
    • Split v: make nodes v’ and v’’ with $\lfloor (b+1)/2 \rfloor$ and $\lceil (b+1)/2 \rceil$ elements
    • Insert element (ref) in parent(v) (make new root if necessary)
    • v = parent(v)
• Insert touches $O(\log_a N)$ nodes

**(a,b)-Tree Delete**
• Delete:
  • Search and delete element from leaf v
  • WHILE v has a-1 elements/children DO
    • Fuse v with sibling v’: move children of v’ to v
    • Delete element (ref) from parent(v) (delete root if necessary)
    • If v has > b (and ≤ a+b-1 < 2b) children, split v
    • v = parent(v)
• Delete touches $O(\log_a N)$ nodes

**(a,b)-Tree**
[Figure: a (2,3)-tree under alternating delete and insert]
• (a,b)-tree properties:
  • Every update can cause $O(\log_a N)$ rebalancing operations
  • If b > 2a, updates cause only $O(1)$ rebalancing operations amortized
  • Why?
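The insert procedure on the (a,b)-Tree Insert slide can be sketched in a few lines. This is a minimal in-memory sketch, not the slides' implementation: the names `Node`, `ABTree`, `leaves`, and the `splits` counter are hypothetical, internal nodes are assumed to hold router keys while elements live in the leaves, keys are assumed distinct, and deletion (fuse) is left out. No disk blocks are modelled; with a, b = Θ(B), each `Node` would correspond to one block.

```python
import bisect


class Node:
    def __init__(self, leaf, keys=None, children=None):
        self.leaf = leaf
        self.keys = keys or []          # leaf: elements; internal: router keys
        self.children = children or []  # internal: len(children) == len(keys) + 1


class ABTree:
    def __init__(self, a=2, b=4):
        assert a >= 2 and b >= 2 * a - 1
        self.a, self.b = a, b
        self.root = Node(leaf=True)
        self.splits = 0  # number of rebalancing (split) operations so far

    def _size(self, v):
        return len(v.keys) if v.leaf else len(v.children)

    def _split(self, v):
        # Split an overfull node into halves of floor/ceil((b+1)/2) items.
        if v.leaf:
            mid = len(v.keys) // 2
            left, right = Node(True, v.keys[:mid]), Node(True, v.keys[mid:])
            return left, right.keys[0], right
        mid = len(v.children) // 2
        left = Node(False, v.keys[:mid - 1], v.children[:mid])
        right = Node(False, v.keys[mid:], v.children[mid:])
        return left, v.keys[mid - 1], right

    def insert(self, x):
        # Search: walk down to the leaf, remembering the path.
        path, v = [], self.root
        while not v.leaf:
            i = bisect.bisect_right(v.keys, x)
            path.append((v, i))
            v = v.children[i]
        bisect.insort(v.keys, x)
        # Rebalance: split while v has b+1 elements/children.
        while self._size(v) > self.b:
            left, sep, right = self._split(v)
            self.splits += 1
            if not path:  # v was the root: make a new root
                self.root = Node(False, [sep], [left, right])
                break
            parent, i = path.pop()
            parent.keys.insert(i, sep)
            parent.children[i:i + 1] = [left, right]
            v = parent


def leaves(v, depth=0):
    """Yield (depth, leaf) pairs from left to right."""
    if v.leaf:
        yield depth, v
    else:
        for c in v.children:
            yield from leaves(c, depth + 1)
```

Inserting a batch of keys and dividing `splits` by the number of inserts gives a rough empirical feel for the amortized rebalancing cost asked about in the "Why?" bullet; trying both b = 2a-1 and b > 2a makes the comparison concrete.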
**Summary/Conclusion: B-tree**
• B-trees: (a,b)-trees with $a, b = \Theta(B)$
  • $O(N/B)$ space
  • $O(\log_B N)$ query
  • $O(\log_B N)$ update
• B-trees with elements in the leaves are sometimes called B+-trees
  • Now B-tree and B+-tree are used as synonyms
• Construction in $O(\frac{N}{B}\log_{M/B}\frac{N}{B})$ I/Os
  • Sort elements and construct leaves
  • Build tree level-by-level bottom-up

**Internal Priority Queues**
[Figure: binary max-heap example, animated over several slides: an Insertion followed by a DeleteMax]
• Operations:
  • Required:
    • Insert
    • DeleteMax
    • Max
  • Optional:
    • Delete
    • Update
• Implementation:
  • Binary tree
  • Heap

**How to Make the Heap I/O-Efficient**
• I/O Technique 1: Make it many-way
• I/O Technique 2: Buffering!

**External Heap**
[Figure: insert buffer in main memory above the heap; labels "in memory" and "may not be half-full"]
• Heap has fan-out $\Theta(M/B)$
• Each node has $\Theta(M/B)$ blocks
• Heap property: all elements in a child are smaller than those in its parent

**External Heap: Insert**
[Figure: the same structure, animated; frames labeled "sift-up" and "swap"]

**External Heap: DeleteMax**
[Figure: the same structure, animated; frames labeled "refill" and "merge"]

**External Heap: I/O Analysis**
• What is the I/O cost for a sequence of N mixed insertions / deletemax operations? (The analysis in the paper is too complicated.)
• Height of heap: $\Theta(\log_{M/B}\frac{N}{B})$
• Insertions:
  • Wait until the insert buffer is full (served at least $\Omega(M)$ inserts)
  • Then do one (occasionally two) bottom-up chains of sift-ups
  • Cost: $O(\frac{M}{B}\log_{M/B}\frac{N}{B})$
  • Amortized cost per insert: $O(\frac{1}{B}\log_{M/B}\frac{N}{B})$
• DeleteMax:
  • Wait until the root is below half full (served at least $\Omega(M)$ deletemax)
  • Then do one, two, sometimes a lot of refills… dead
  • Do one sift-up: this is easy

**External Heap: I/O Analysis**
• Cost of all refills:
  • Need a global argument
  • Idea: trace individual elements
  • Total amount of “work”: $O(N \log_{M/B}\frac{N}{B})$
    • One unit of work: move one element up one level
  • Refills do positive work
  • Sift-ups do both positive and negative work
  • |positive work done by refills| + |positive work done by sift-ups| - |negative work done by sift-ups| = $O(N \log_{M/B}\frac{N}{B})$
  • But note: |positive work done by sift-ups| > |negative work done by sift-ups|
  • So, |positive work done by refills| = $O(N \log_{M/B}\frac{N}{B})$
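To get a feel for the magnitudes in the analysis above, the height and amortized insert bounds can be evaluated for concrete parameters. This is only a back-of-the-envelope sketch: the function name `external_heap_bounds` and the sample values of N, M, and B are hypothetical, and all constant factors hidden by the O-notation are dropped.

```python
import math


def external_heap_bounds(n, m, b):
    """Return (height, amortized I/Os per insert), constants dropped.

    height     ~ log_{M/B}(N/B)
    per insert ~ (1/B) * log_{M/B}(N/B)
    """
    height = math.log2(n / b) / math.log2(m / b)
    return height, height / b


# Hypothetical sizes, all measured in items.
n, m, b = 2**30, 2**20, 2**10
height, per_insert = external_heap_bounds(n, m, b)
print(height)      # 2.0
print(per_insert)  # 0.001953125
```

With these numbers the bound works out to about one I/O per 512 inserts, which illustrates why the factor 1/B from buffering matters far more than the logarithm.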