240 likes | 375 Vues
I/O-Algorithms. Thomas M ølhave Spring 2012 February 9 , 2012. Massive Data. Pervasive use of computers and sensors Increased ability to acquire/store/process data → Massive data collected everywhere Society increasingly “data driven” → Access/process data anywhere any time
E N D
I/O-Algorithms Thomas Mølhave Spring 2012 February 9, 2012
Massive Data • Pervasive use of computers and sensors • Increased ability to acquire/store/process data → Massive data collected everywhere • Society increasingly “data driven” → Access/process data anywhere any time Nature/Science special issues • 2/06,9/08, 2/11 • Scientific data size growing exponentially, while quality and availability improving • Paradigm shift: Science will be about mining data • Obviously not only in sciences: • Economist 02/10: • From 150 Billion Gigabytes five years ago • to 1200 Billion today • Managing data deluge difficult; doing so • will transform business/public life
I/O-Algorithms Example: LIDAR Terrain Data • Massive (irregular) point sets (~1m resolution) • Becoming relatively cheap and easy to collect • Sub-meter resolution using mobile mapping
~ 1,5 m betweenmeasurements ~1,2 km ~ 280 km/h at 1500-2000m I/O-Algorithms Example: LIDAR Terrain Data
I/O-Algorithms Example: LIDAR Terrain Data • ~2 million points at 30 meter (<1GB) • ~18 billion points at 1 meter (>1TB)
Example: Detailed Data Essential • Mandø with 2 meter sea-level raise 80 meter terrain model 2 meter terrain model
I/O-Algorithms Random Access Machine Model • Standard theoretical model of computation: • Infinite memory • Uniform access cost • Simple model crucial for success of computer industry R A M
R A M L 1 L 2 I/O-Algorithms Hierarchical Memory • Modern machines have complicated memory hierarchy • Levels get larger and slower further away from CPU • Data moved between levels using large blocks
read/write head read/write arm track magnetic surface I/O-Algorithms Slow I/O • Disk access is 106 times slower than main memory access “The difference in speed between modern CPU and disk technologies is analogous to the difference in speed in sharpening a pencil using a sharpener on one’s desk or by taking an airplane to the other side of the world and using a sharpener on someone else’s desk.” (D. Comer) • Disk systems try to amortize large access time transferring large contiguous blocks of data (8-16Kbytes) • Important to store/access data to take advantage of blocks (locality)
running time data size I/O-Algorithms Scalability Problems • Most programs developed in RAM-model • Run on large datasets because OS moves blocks as needed • Moderns OS utilizes sophisticated paging and prefetching strategies • But if program makes scattered accesses even good OS cannot take advantage of block access Scalability problems!
I/O-Algorithms External Memory Model N = # of items in the problem instance B = # of items per disk block M = # of items that fit in main memory T = # of items in output I/O: Move block between memory and disk D Block I/O M P
I/O-Algorithms Scalability Problems: Block Access Matters • Example: Traversing linked list (List ranking) • Array size N = 10 elements • Disk block size B = 2 elements • Main memory size M = 4 elements (2 blocks) • Large difference between N and N/B large since block size is large • Example: N = 256 x 106, B = 8000 , 1ms disk access time N I/Os take 256 x 103 sec = 4266 min = 71 hr N/B I/Os take 256/8 sec = 32 sec 1 5 2 6 3 8 9 4 10 1 2 10 9 5 6 3 4 7 7 8 Algorithm 1: N=10 I/Os Algorithm 2: N/B=5 I/Os
I/O-algorithms Fundamental Bounds Internal External • Scanning: N • Sorting: N log N • Searching: • Note: • Linear I/O: O(N/B) • B factor VERY important: • Cannot sort optimally with search tree
I/O-algorithms External Search Trees • Binary search tree: • Standard method for search among N elements • Search traces at least one root-leaf path • If nodes stored arbitrarily on disk • Search in I/Os
I/O-algorithms External Search Trees • BFS blocking: • Block height • Output elements blocked • • Rangesearch in I/Os • Optimal: O(N/B) space and query
I/O-algorithms External Search Trees • Maintaining BFS blocking during updates? • Balance normally maintained in search trees using rotations • Seems very difficult to maintain BFS blocking during rotation x y y x
I/O-algorithms B-trees • BFS-blocking naturally corresponds to tree with fan-out • B-trees balanced by allowing node degree to vary • Rebalancing performed by splitting and merging nodes
I/O-algorithms (a,b)-tree • T is an (a,b)-tree (a≥2 and b≥2a-1) • All leaves on the same level and contain between a and b elements • Except for the root, all nodes have degree between a and b • Root has degree between 2 and b (2,4)-tree • (a,b)-tree uses linear space and has height • • Choosing a,b = each node/leaf stored in one disk block • • O(N/B) space and query
I/O-algorithms (a,b)-Tree Insert • Insert: Search and insert element in leaf v DO v has b+1 elements/children Splitv: make nodes v’ and v’’ with and elements insert element (ref) in parent(v) (make new root if necessary) v=parent(v) • Insert touch nodes v v’ v’’
I/O-algorithms (2,4)-Tree Insert
I/O-algorithms (a,b)-Tree Delete • Delete: Search and delete element from leaf v DO v has a-1 elements/children Fusev with sibling v’: move children of v’ to v delete element (ref) from parent(v) (delete root if necessary) If v has >b (and ≤ a+b-1<2b) children split v v=parent(v) • Delete touch nodes v v
I/O-algorithms (2,4)-Tree Delete
I/O-algorithms Summary/Conclusion: B-tree • B-trees: (a,b)-trees with a,b = • O(N/B) space • O(logB N) query • O(logB N) update • B-trees with elements in the leaves sometimes called B+-tree • Construction in I/Os • Sort elements and construct leaves • Build tree level-by-level bottom-up
I/O-algorithms Merge Sort • Merge sort: • Create N/M memory sized sorted runs • Merge runs together M/B at a time phases using I/Os each • Distribution sort similar (but harder – partition elements)