BIG DATA

Presentation Transcript

  1. BIG DATA Algorithms

  2. Google trend

  3. Big data • everyone talks about it, • nobody really knows how to do it, • everyone thinks everyone else is doing it, • so everyone claims they are doing it...

  4. Is there anything fundamentally new? • Massive Data vs Big Data • The 3 V’s • Volume • Velocity • Variety

  5. Big data ecosystem

  6. Big data Applications

  7. Big Data Algorithms • Distributed algorithms • Data stream algorithms • External memory algorithms • Parallel algorithms

  8. Computational models for big data • "All models are wrong, but some are useful." (George E. P. Box)

  9. What’s the Bottleneck? • CPU speed approaching limit • Does it matter? • From CPU-intensive computing to data-intensive computing • Algorithm has to be near-linear, linear, or even sub-linear! • Data movement, i.e., communication is the bottleneck!

  10. Random Access Machine Model • Standard theoretical model of computation: • Unlimited memory • Uniform access cost • Simple model, crucial for the success of the computer industry [Figure: RAM]

  11. Hierarchical Memory • Modern machines have a complicated memory hierarchy • Levels get larger and slower further away from the CPU • Data is moved between levels using large blocks [Figure: L1, L2, RAM]

  12. Slow I/O • Disk access is 10^6 times slower than main memory access • "The difference in speed between modern CPU and disk technologies is analogous to the difference in speed in sharpening a pencil using a sharpener on one's desk or by taking an airplane to the other side of the world and using a sharpener on someone else's desk." (D. Comer) • Disk systems try to amortize the large access time by transferring large contiguous blocks of data (8-16 Kbytes) • It is important to store and access data so as to take advantage of blocks (locality) [Figure: disk with read/write arm, track, magnetic surface]

  13. Scalability Problems • Most programs are developed in the RAM model • They run on large datasets because the OS moves blocks as needed • Modern OSes use sophisticated paging and prefetching strategies • But if a program makes scattered accesses, even a good OS cannot take advantage of block access ⇒ scalability problems! [Figure: plot of running time against data size]
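
The effect of scattered accesses can be illustrated with a small simulation. This is only a sketch under stated assumptions: the LRU block cache, the function name `count_block_fetches`, and the sizes `N`, `B`, `M_BLOCKS` are all hypothetical choices for illustration, not something taken from the slides or from any real OS paging policy.

```python
import random
from collections import OrderedDict


def count_block_fetches(accesses, block_size, cache_blocks):
    """Count block fetches for a sequence of item indices.

    Assumption: memory holds `cache_blocks` blocks and evicts the least
    recently used one (a simplification of real paging).
    """
    cache = OrderedDict()  # block id -> True, ordered by recency
    fetches = 0
    for i in accesses:
        blk = i // block_size
        if blk in cache:
            cache.move_to_end(blk)  # hit: mark as most recently used
        else:
            fetches += 1  # miss: one block transfer
            cache[blk] = True
            if len(cache) > cache_blocks:
                cache.popitem(last=False)  # evict least recently used
    return fetches


# Hypothetical sizes: N items, B items per block, M_BLOCKS blocks of memory.
N, B, M_BLOCKS = 100_000, 1_000, 10

sequential = list(range(N))
scattered = sequential[:]
random.Random(0).shuffle(scattered)  # fixed seed for repeatability

seq_fetches = count_block_fetches(sequential, B, M_BLOCKS)
scat_fetches = count_block_fetches(scattered, B, M_BLOCKS)
print("sequential:", seq_fetches, "scattered:", scat_fetches)
```

The sequential scan touches each block once, so it needs N/B fetches; the shuffled order misses on most accesses because only a small fraction of the blocks fits in memory at a time.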

  14. External Memory Model • N = # of items in the problem instance • B = # of items per disk block • M = # of items that fit in main memory • I/O: # of blocks moved between memory and disk • CPU time is ignored • Successful model, used extensively in the massive data algorithms and database communities [Figure: processor P, main memory M, disk D, block I/O between M and D]

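  15. Fundamental Bounds (Internal vs. External) • Scanning: N vs. N/B • Sorting: N log N vs. (N/B) log_{M/B}(N/B) • Permuting: N vs. min{N, (N/B) log_{M/B}(N/B)} • Searching: log_2 N vs. log_B N • Note: • Linear I/O: O(N/B) • Permuting is not linear • Permuting and sorting bounds are equal in all practical cases • The B factor is VERY important: N/B < (N/B) log_{M/B}(N/B) << N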
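
To get a feel for how much the B factor matters, the standard external-memory bounds can be evaluated for concrete parameters. The sketch below ignores constant factors, and the function names and the sample values of N, M and B are assumptions chosen only for illustration.

```python
import math


def scan_ios(N, B):
    """Linear I/O: about N/B block transfers."""
    return math.ceil(N / B)


def sort_ios(N, M, B):
    """Sorting bound (N/B) * log_{M/B}(N/B), constants ignored."""
    n_blocks = N / B
    return n_blocks * max(1.0, math.log(n_blocks, M / B))


def permute_ios(N, M, B):
    """Permuting bound: the smaller of N and the sorting bound."""
    return min(N, sort_ios(N, M, B))


def search_ios(N, B):
    """Searching bound log_B N (e.g. one root-to-leaf path in a B-tree)."""
    return math.log(N, B)


# Hypothetical instance: 2^30 items, 2^20 items of memory, 2^10 items per block.
N, M, B = 2**30, 2**20, 2**10
print("scan   :", scan_ios(N, B))
print("sort   :", sort_ios(N, M, B))
print("permute:", permute_ios(N, M, B))
print("search :", search_ios(N, B))
```

For these parameters the logarithmic factor is only 2, so sorting costs about twice a scan, while moving items one at a time would cost about N I/Os, roughly B/2 times more.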

  16. Queues and Stacks • Queue: • Maintain push and pop blocks in main memory ⇒ O(1/B) I/O per operation (amortized) • Stack: • Maintain push/pop block in main memory ⇒ O(1/B) I/O per operation (amortized) [Figure: Push, Pop]
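
A minimal sketch of the stack case, with the disk simulated by a Python list of blocks. The class name `ExternalStack`, the `ios` counter, and the choice of a 2B-item buffer (which avoids repeated I/Os when pushes and pops alternate at a block boundary) are assumptions for illustration, not details given on the slide.

```python
class ExternalStack:
    """Stack with an in-memory buffer of at most 2B items; the rest is on 'disk'."""

    def __init__(self, block_size):
        self.B = block_size
        self.buffer = []  # top of the stack, kept in main memory
        self.disk = []    # simulated disk: list of full blocks, oldest first
        self.ios = 0      # number of block transfers performed

    def push(self, x):
        if len(self.buffer) == 2 * self.B:
            # Buffer full: write the oldest B items out as one block.
            self.disk.append(self.buffer[:self.B])
            self.buffer = self.buffer[self.B:]
            self.ios += 1
        self.buffer.append(x)

    def pop(self):
        if not self.buffer:
            if not self.disk:
                raise IndexError("pop from empty stack")
            # Buffer empty: read the most recently written block back.
            self.buffer = self.disk.pop()
            self.ios += 1
        return self.buffer.pop()


# Usage: n pushes followed by n pops with block size B.
n, B = 10_000, 100
stack = ExternalStack(B)
for i in range(n):
    stack.push(i)
popped = [stack.pop() for _ in range(n)]
print("I/Os for", 2 * n, "operations:", stack.ios)
```

Each block transfer is preceded by at least B operations that caused no I/O, which is where the O(1/B) amortized bound comes from.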

  17. Sorting • Merge sort: • Create N/M memory-sized sorted lists • Repeatedly merge lists together, Θ(M/B) at a time ⇒ O(log_{M/B}(N/M)) phases using O(N/B) I/Os each ⇒ O((N/B) log_{M/B}(N/B)) I/Os

  18. Sorting • Fewer than M/B sorted lists (queues) can be merged in O(N/B) I/Os • The M/B head elements are kept in a heap in main memory [Figure: M/B blocks in main memory]
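
The two sorting slides can be sketched in memory as follows. This is an illustration only: the lists stand in for disk-resident runs, the function names are hypothetical, and the fan-in of M/B - 1 (reserving one block for output) is an assumption rather than something stated in the slides.

```python
import heapq
import random


def merge_sorted_runs(runs):
    """Merge sorted runs using a heap that holds one head element per run."""
    heap = [(run[0], idx, 0) for idx, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        value, idx, pos = heapq.heappop(heap)
        out.append(value)
        if pos + 1 < len(runs[idx]):
            # Advance in the run the smallest element came from.
            heapq.heappush(heap, (runs[idx][pos + 1], idx, pos + 1))
    return out


def external_merge_sort(items, M, B):
    """Return (sorted list, number of merge phases)."""
    fan_in = max(2, M // B - 1)  # assumption: one block reserved for output
    # Run formation: sort memory-sized chunks.
    runs = [sorted(items[i:i + M]) for i in range(0, len(items), M)]
    phases = 0
    while len(runs) > 1:
        runs = [merge_sorted_runs(runs[i:i + fan_in])
                for i in range(0, len(runs), fan_in)]
        phases += 1
    return (runs[0] if runs else []), phases


# Usage with hypothetical sizes: 1000 items, M = 100, B = 10.
data = list(range(1000))
random.Random(1).shuffle(data)
result, phases = external_merge_sort(data, M=100, B=10)
print("merge phases:", phases)
```

Each phase reads and writes every item once, which corresponds to O(N/B) I/Os per phase in the external-memory model; the large fan-in is what keeps the number of phases small.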

  19. Toy Experiment: Permuting • Problem: • Input: N elements out of order: 6, 7, 1, 3, 2, 5, 10, 9, 4, 8 • Each element knows its correct position • Output: Store them on disk in the right order • Internal memory solution: • Just scan the original sequence and move every element to the right place! • O(N) time, O(N) I/Os • External memory solution: • Use sorting • O(N log N) time, O((N/B) log_{M/B}(N/B)) I/Os

  20. Searching in External Memory • Store N elements in a data structure such that • Given a query element x, find it or its predecessor

  21. B-trees • BFS-blocking naturally corresponds to a tree with fan-out Θ(B) • B-trees are balanced by allowing the node degree to vary • Rebalancing is performed by splitting and merging nodes

  22. (a,b)-tree • T is an (a,b)-tree (a≥2 and b≥2a-1) if: • All leaves are on the same level (and contain between a and b elements) • Except for the root, all nodes have degree between a and b • The root has degree between 2 and b • An (a,b)-tree uses linear space and has height O(log_a N) • Choosing a,b = Θ(B), each node/leaf is stored in one disk block ⇒ O(N/B) space and O(log_B N) query [Figure: (2,4)-tree]

  23. (a,b)-Tree Insert • Insert: • Search and insert element in leaf v • DO while v has b+1 elements/children: • Split v: make nodes v' and v'' with ⌊(b+1)/2⌋ and ⌈(b+1)/2⌉ elements • Insert element (ref) in parent(v) (make new root if necessary) • v = parent(v) • Insert touches O(log_a N) nodes [Figure: v split into v' and v'']
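
The split step can be sketched on its own. This is a hypothetical helper, not the full insert procedure: `split_overfull` only shows why the condition b ≥ 2a-1 guarantees that both halves of an overfull node are legal nodes again.

```python
def split_overfull(elements, a, b):
    """Split a node holding b+1 elements into two nodes with
    floor((b+1)/2) and ceil((b+1)/2) elements."""
    if a < 2 or b < 2 * a - 1:
        raise ValueError("need a >= 2 and b >= 2a - 1")
    if len(elements) != b + 1:
        raise ValueError("node is not overfull by exactly one element")
    mid = (b + 1) // 2
    left, right = elements[:mid], elements[mid:]
    # b >= 2a - 1 implies floor((b+1)/2) >= a, so both halves are valid.
    assert a <= len(left) <= b and a <= len(right) <= b
    return left, right


# Usage: a (2,4)-tree node that has grown to 5 elements.
left, right = split_overfull([10, 20, 30, 40, 50], a=2, b=4)
print(left, right)
```

In a full implementation the caller would then insert a reference to the new node into the parent and repeat the check one level up, as the loop on the slide describes.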

  24. (a,b)-Tree Insert

  25. (a,b)-Tree Delete • Delete: • Search and delete element from leaf v • DO while v has a-1 elements/children: • Fuse v with sibling v': move children of v' to v • Delete element (ref) from parent(v) (delete root if necessary) • If v has >b (and ≤ a+b-1 < 2b) children, split v • v = parent(v) • Delete touches O(log_a N) nodes [Figure: v fused with sibling]

  26. (a,b)-Tree Delete

  27. (a,b)-Tree • (a,b)-tree properties: • Every update can cause O(log_a N) rebalancing operations • If b>2a: O(1) rebalancing operations amortized • Why? [Figure: (2,3)-tree, delete, insert]

  28. Summary/Conclusion: B-tree • B-trees: (a,b)-trees with a,b = Θ(B) • O(N/B) space • O(log_B N) query • O(log_B N) update • B-trees with elements in the leaves are sometimes called B+-trees • Now B-tree and B+-tree are used as synonyms • Construction in O((N/B) log_{M/B}(N/B)) I/Os: • Sort elements and construct leaves • Build tree level-by-level bottom-up
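
The bottom-up construction can be sketched as follows, assuming the keys are already sorted. The function name `build_levels`, the use of each child's smallest key as its routing key, and the plain-list node representation are illustrative assumptions; the sketch also does not rebalance a possibly underfull last node on each level, which a real B-tree build would have to handle.

```python
def build_levels(sorted_keys, fanout):
    """Build tree levels bottom-up; returns a list of levels, leaves first.

    Each level is a list of nodes; each node is a list of keys. An internal
    node stores the smallest key of each of its children as a routing key.
    """
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    # Leaves: consecutive groups of `fanout` sorted keys.
    level = [sorted_keys[i:i + fanout] for i in range(0, len(sorted_keys), fanout)]
    levels = [level]
    while len(level) > 1:
        routers = [node[0] for node in level]  # smallest key below each child
        level = [routers[i:i + fanout] for i in range(0, len(routers), fanout)]
        levels.append(level)
    return levels


# Usage: 100 sorted keys with a hypothetical fan-out of 10.
levels = build_levels(list(range(100)), fanout=10)
print("number of levels:", len(levels))
print("root:", levels[-1][0])
```

Each level is produced by one scan of the level below, so after the initial sort the build itself costs only a linear number of block transfers.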

  29. Basic Structures: I/O-Efficient Priority Queue

  30. Internal Priority Queues • Operations: • Required: Insert, DeleteMax, Max • Optional: Delete, Update • Implementation: • Binary tree • Heap [Figure: Insertion, heap keys 100 40 90 40 30 50 29 23 15 65]

  31. Internal Priority Queues (continued) [Figure: Insertion, heap keys 100 40 90 40 30 65 50 23 15 29]

  32. Internal Priority Queues (continued) [Figure: DeleteMax, heap keys 40 90 40 30 65 50 23 15 29]

  33. Internal Priority Queues (continued) [Figure: DeleteMax, heap keys 90 40 40 30 65 50 23 15 29]

  34. Internal Priority Queues (continued) [Figure: DeleteMax, heap keys 90 40 65 40 30 50 29 23 15]
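
The required operations can be sketched with Python's standard `heapq` module. Since `heapq` implements a min-heap, this sketch stores negated keys; the class name `MaxHeap` is hypothetical, and the keys are the ones shown in the slide figures.

```python
import heapq


class MaxHeap:
    """Max-priority queue built on heapq by storing negated keys."""

    def __init__(self, items=()):
        self._heap = [-x for x in items]
        heapq.heapify(self._heap)

    def insert(self, x):
        heapq.heappush(self._heap, -x)  # sift-up happens inside heappush

    def max(self):
        return -self._heap[0]

    def delete_max(self):
        return -heapq.heappop(self._heap)  # sift-down happens inside heappop

    def __len__(self):
        return len(self._heap)


# Usage with the keys from the figures: insert 65, then delete the maximum.
pq = MaxHeap([100, 40, 90, 40, 30, 50, 29, 23, 15])
pq.insert(65)
removed = pq.delete_max()
print("removed:", removed, "new max:", pq.max())
```

Both `insert` and `delete_max` follow one root-to-leaf path, which is cheap in internal memory but costs one I/O per level when the heap lives on disk; that is the problem the next slides address.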

  35. How to Make the Heap I/O-Efficient • I/O Technique 1: Make it many-way • I/O Technique 2: Buffering!

  36. External Heap • Heap has fan-out Θ(M/B) • Each node has Θ(M/B) blocks • Heap property: all elements in a child are smaller than those in its parent [Figure labels: insert buffer, main memory, in memory, may not be half-full]

  37. External Heap: Insert [Figure: same heap and labels as slide 36]

  38. External Heap: Insert [Figure: same heap and labels as slide 36]

  39. External Heap: Insert [Figure: same heap as slide 36; step shown: sift-up]

  40. External Heap: Insert [Figure: same heap as slide 36; step shown: sift-up]

  41. External Heap: Insert [Figure: same heap as slide 36; step shown: swap]

  42. External Heap: DeleteMax [Figure: same heap and labels as slide 36]

  43. External Heap: DeleteMax [Figure: same heap and labels as slide 36]

  44. External Heap: DeleteMax [Figure: same heap as slide 36; step shown: refill]

  45. External Heap: DeleteMax [Figure: same heap as slide 36; step shown: refill]

  46. External Heap: DeleteMax [Figure: same heap as slide 36; steps shown: refill, merge]

  47. External Heap: I/O Analysis • What is the I/O cost for a sequence of N mixed insertions / deletemax? (the analysis in the paper is too complicated) • Height of heap: Θ(log_{M/B}(N/B)) • Insertions: • Wait until the insert buffer is full (it has served at least Ω(M) inserts) • Then do one (occasionally two) bottom-up chains of sift-ups • Cost: O(M/B ∙ log_{M/B}(N/B)) • Amortized cost per insert: O(1/B ∙ log_{M/B}(N/B)) • DeleteMax: • Wait until the root is below half full (it has served at least Ω(M) deletemax) • Then do one, two, sometimes a lot of refills… (hard to bound individually) • Do one sift-up: this is easy
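
The amortization for inserts is a one-line division, which the following sketch evaluates numerically. Constant factors are ignored, and the function names and sample parameters are assumptions made only for illustration.

```python
import math


def heap_height(N, M, B):
    """Height of the external heap: log_{M/B}(N/B), constants ignored."""
    return max(1.0, math.log(N / B, M / B))


def flush_cost(N, M, B):
    """I/Os for one chain of sift-ups: M/B blocks per level of the heap."""
    return (M / B) * heap_height(N, M, B)


def amortized_insert_cost(N, M, B):
    """One flush is paid for by the M inserts that filled the buffer."""
    return flush_cost(N, M, B) / M


# Hypothetical instance: 2^30 items, 2^20 items of memory, 2^10 items per block.
N, M, B = 2**30, 2**20, 2**10
print("height         :", heap_height(N, M, B))
print("cost per flush :", flush_cost(N, M, B))
print("cost per insert:", amortized_insert_cost(N, M, B))
```

The result is a small fraction of one I/O per insert, compared with about log_B N I/Os per update when each operation walks a B-tree path on its own.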

  48. External Heap: I/O Analysis • Cost of all refills: • Need a global argument • Idea: trace individual elements • Total amount of "work": O(N log_{M/B}(N/B)) • One unit of work: move one element up one level • Refills do positive work • Sift-ups do both positive and negative work • |positive work done by refills| + |positive work done by sift-ups| - |negative work done by sift-ups| = O(N log_{M/B}(N/B)) • But note: |positive work done by sift-ups| > |negative work done by sift-ups| • So, |positive work done by refills| = O(N log_{M/B}(N/B))