
CSE332: Data Abstractions Lecture 8: AVL Delete; Memory Hierarchy

Dan Grossman, Spring 2010.



Presentation Transcript


  1. CSE332: Data Abstractions. Lecture 8: AVL Delete; Memory Hierarchy. Dan Grossman, Spring 2010

  2. The AVL Tree Data Structure
     Structural properties
       • Binary tree property
       • Balance property: the balance of every node is between -1 and 1
       • Result: worst-case depth is O(log n)
     Ordering property
       • Same as for BST
     [Figure: example AVL tree with keys 8, 5, 11, 2, 6, 10, 12, 4, 7, 9, 13, 14, 15]
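The two properties on this slide can be checked mechanically. The following is a minimal sketch, not code from the lecture: the `Node` class, the `height` and `is_avl` helpers, and the convention that an empty tree has height -1 are all assumptions made for illustration.

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key = key
        self.left = left
        self.right = right


def height(node):
    """Height of a subtree: -1 for an empty tree, 0 for a single leaf."""
    if node is None:
        return -1
    return 1 + max(height(node.left), height(node.right))


def is_avl(node, lo=float("-inf"), hi=float("inf")):
    """True if the subtree satisfies the BST ordering and the AVL balance property."""
    if node is None:
        return True
    if not (lo < node.key < hi):                                # ordering property
        return False
    if abs(height(node.left) - height(node.right)) > 1:         # balance property
        return False
    return is_avl(node.left, lo, node.key) and is_avl(node.right, node.key, hi)


balanced = Node(6, Node(3, Node(1)), Node(7))
chain = Node(3, Node(2, Node(1)))          # ordered, but the root has balance 2
print(is_avl(balanced), is_avl(chain))     # True False
```

This version recomputes heights on every call, which is fine for a check but too slow for real operations; an actual AVL implementation would typically store a height field in each node.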

  3. AVL Tree Deletion
       • Similar to insertion: do the delete and then rebalance
           • Rotations and double rotations
           • Imbalance may propagate upward, so rotations at multiple nodes along the path to the root may be needed (unlike with insert)
       • Simple example: a deletion on the right causes the left-left grandchild to be too tall
           • Call this the left-left case, despite the deletion being on the right
           • insert(6), insert(3), insert(7), insert(1), delete(7)
     [Figure: the example trees before and after delete(7), annotated with node heights]

  4. Properties of BST delete
     We first do the normal BST deletion:
       • 0 children: just delete it
       • 1 child: delete it, connect child to parent
       • 2 children: put successor in your place, delete successor leaf
     Which nodes’ heights may have changed:
       • 0 children: path from deleted node to root
       • 1 child: path from deleted node to root
       • 2 children: path from deleted successor leaf to root
     We will rebalance as we return along the “path in question” to the root.
     [Figure: example BST with keys 12, 5, 15, 2, 9, 20, 7, 10]
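The three deletion cases can be sketched as a recursive function. This is an illustrative sketch rather than the lecture's code: `Node`, `bst_delete`, and `inorder` are hypothetical names, and the shape of the example tree is an assumption (the keys come from the slide's figure, but the figure itself is not reproduced here).

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key = key
        self.left = left
        self.right = right


def bst_delete(node, key):
    """Return the root of the subtree after removing key (no-op if key is absent)."""
    if node is None:
        return None
    if key < node.key:
        node.left = bst_delete(node.left, key)
    elif key > node.key:
        node.right = bst_delete(node.right, key)
    elif node.left is None:        # 0 children, or only a right child
        return node.right
    elif node.right is None:       # only a left child
        return node.left
    else:                          # 2 children: copy the successor's key here,
        succ = node.right          # then delete the successor from the right subtree
        while succ.left is not None:
            succ = succ.left
        node.key = succ.key
        node.right = bst_delete(node.right, succ.key)
    return node


def inorder(node):
    return [] if node is None else inorder(node.left) + [node.key] + inorder(node.right)


# Assumed shape for the slide's keys: 12 at the root, 5 and 15 below it, and so on.
root = Node(12,
            Node(5, Node(2), Node(9, Node(7), Node(10))),
            Node(15, None, Node(20)))
root = bst_delete(root, 5)         # 2 children: the successor 7 takes the place of 5
print(root.left.key, inorder(root))
```

Because the recursion returns along the path from the removed node back to the root, each `return node` is the natural place to hook in the rebalancing step described on the slide.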

  5. Case #1: Left-left due to right deletion
       • Start with some subtree where, if the right child becomes shorter, we are unbalanced due to the height of the left-left grandchild
       • A delete in the right child could cause this right-side shortening
     [Figure: node a (height h+3) with left child b (height h+2) and right subtree Z (height h+1, shrinking to h); b has left subtree X (height h+1) and right subtree Y (height h)]

  6. Case #1: Left-left due to right deletion
     [Figure: before the rotation, a (h+3) has left child b (h+2) and right subtree Z (h), and b has subtrees X (h+1) and Y (h); after it, b (h+2) is on top with left subtree X (h+1) and right child a (h+1), which holds Y (h) and Z (h)]
       • Same single rotation as when an insert in the left-left grandchild caused imbalance due to X becoming taller
       • But here the “height” at the top decreases, so more rebalancing farther up the tree might still be necessary
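A sketch of the single rotation for the left-left case, applied to the small example from slide 3 (the tree left over after delete(7)). The `Node` class with a stored `height` field and the helper names `h`, `update`, and `rotate_right` are assumptions for illustration; the names a and b follow the slide's figure.

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right
        self.height = 1 + max(h(left), h(right))


def h(node):
    """Stored height, with -1 for an empty subtree."""
    return -1 if node is None else node.height


def update(node):
    node.height = 1 + max(h(node.left), h(node.right))


def rotate_right(a):
    """Left-left case: a's left child b becomes the new top, a becomes b's right child."""
    b = a.left
    a.left = b.right      # subtree Y moves from b over to a
    b.right = a
    update(a)             # a is now the lower node, so fix its height first
    update(b)
    return b


a = Node(6, left=Node(3, left=Node(1)))     # 6 has balance 2 once 7 is gone
top_before = a.height
root = rotate_right(a)
print(root.key, root.left.key, root.right.key)   # 3 1 6
print(top_before, root.height)                   # 2 1
```

The second line of output shows the point made on the slide: the height at the top of the subtree drops, so an ancestor farther up may now be out of balance as well.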

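  7. Case #2: Left-right due to right deletion
     [Figure: before, a (h+3) has left child b (h+2) and right subtree Z (h); b has left subtree X (h) and right child c (h+1) with subtrees U and V (heights h and h-1); after the double rotation, c (h+2) is on top with children b (h+1, holding X and U) and a (h+1, holding V and Z)]
       • Same double rotation as when an insert in the left-right grandchild caused imbalance due to c becoming taller
       • But here the “height” at the top decreases, so more rebalancing farther up the tree might still be necessary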
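The double rotation can be written as two single rotations: first a left rotation at b, then a right rotation at a. This is a minimal sketch under the same assumptions as a typical height-field AVL node; `Node`, `h`, `update`, and the rotation function names are hypothetical, and the three-key tree is an example chosen here, not one from the slides.

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right
        self.height = 1 + max(h(left), h(right))


def h(node):
    return -1 if node is None else node.height


def update(node):
    node.height = 1 + max(h(node.left), h(node.right))


def rotate_right(a):
    b = a.left
    a.left, b.right = b.right, a
    update(a)
    update(b)
    return b


def rotate_left(a):
    b = a.right
    a.right, b.left = b.left, a
    update(a)
    update(b)
    return b


def rotate_left_right(a):
    """Left-right case: lift the grandchild c above b, then above a."""
    a.left = rotate_left(a.left)
    return rotate_right(a)


a = Node(10, left=Node(5, right=Node(7)))   # a = 10, b = 5, c = 7
root = rotate_left_right(a)
print(root.key, root.left.key, root.right.key, root.height)   # 7 5 10 1
```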

  8. No third right-deletion case needed
     So far we have handled these two cases: left-left and left-right.
     [Figure: the left-left and left-right starting configurations from the previous slides, shown side by side]
     But what if the two left grandchildren are now both too tall (h+1)?
       • Then it turns out the left-left solution still works
       • The children of the “new top node” will have heights differing by 1 instead of 0, but that’s fine
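The claim about the "new top node" can be checked with height arithmetic alone. The sketch below is an illustration, not lecture code: `after_single_rotation` is a hypothetical helper that takes the heights of X, Y, and Z and returns the heights that result from a single right rotation at a.

```python
def after_single_rotation(hx, hy, hz):
    """Heights after rotating right at a, where b = a.left holds X and Y, and a.right = Z.

    Returns (height of new top's left child, height of its right child, height of new top).
    """
    a = 1 + max(hy, hz)      # a now holds Y and Z
    b = 1 + max(hx, a)       # b is the new top and holds X and a
    return hx, a, b


h = 3                                                   # any h >= 0 behaves the same way
both_tall = after_single_rotation(h + 1, h + 1, h)      # X and Y both h+1
left_left = after_single_rotation(h + 1, h, h)          # only X is h+1
print(both_tall)   # (4, 5, 6): children differ by 1, top stays at h+3
print(left_left)   # (4, 4, 5): children differ by 0, top drops to h+2
```

One consequence visible in the numbers: when both grandchildren are tall, the height at the top stays at h+3 after the rotation, so in that case the imbalance does not propagate any farther up.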

  9. And the other half
       • Naturally there are two mirror-image cases not shown here
           • Deletion in left causes right-right grandchild to be too tall
           • Deletion in left causes right-left grandchild to be too tall
           • (Deletion in left causes both right grandchildren to be too tall, in which case the right-right solution still works)
       • And, remember, “lazy deletion” is a lot simpler and often sufficient in practice
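Putting the cases together, deletion can rebalance every node on the way back up the recursion. The following is a sketch under stated assumptions rather than the course's reference implementation: all names (`Node`, `h`, `update`, `rebalance`, `insert`, `delete`) are hypothetical, heights use -1 for an empty tree, and the tie-breaking comparison is chosen so that the "both grandchildren too tall" situation falls into the single-rotation branch.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None
        self.height = 0


def h(node):
    return -1 if node is None else node.height


def update(node):
    node.height = 1 + max(h(node.left), h(node.right))


def rotate_right(a):
    b = a.left
    a.left, b.right = b.right, a
    update(a)
    update(b)
    return b


def rotate_left(a):
    b = a.right
    a.right, b.left = b.left, a
    update(a)
    update(b)
    return b


def rebalance(node):
    """Fix node's height and apply at most one single or double rotation."""
    update(node)
    if h(node.left) - h(node.right) > 1:                # left side too tall
        if h(node.left.left) < h(node.left.right):      # left-right case only
            node.left = rotate_left(node.left)
        return rotate_right(node)                       # left-left, or both tall
    if h(node.right) - h(node.left) > 1:                # mirror-image cases
        if h(node.right.right) < h(node.right.left):    # right-left case only
            node.right = rotate_right(node.right)
        return rotate_left(node)                        # right-right, or both tall
    return node


def insert(node, key):
    if node is None:
        return Node(key)
    if key < node.key:
        node.left = insert(node.left, key)
    elif key > node.key:
        node.right = insert(node.right, key)
    return rebalance(node)


def delete(node, key):
    if node is None:
        return None
    if key < node.key:
        node.left = delete(node.left, key)
    elif key > node.key:
        node.right = delete(node.right, key)
    elif node.left is None:
        return node.right
    elif node.right is None:
        return node.left
    else:
        succ = node.right
        while succ.left is not None:
            succ = succ.left
        node.key = succ.key
        node.right = delete(node.right, succ.key)
    return rebalance(node)       # runs at every node on the path back to the root


root = None
for k in [6, 3, 7, 1]:           # the example sequence from slide 3
    root = insert(root, k)
root = delete(root, 7)
print(root.key, root.left.key, root.right.key)   # 3 1 6
```

Calling `rebalance` in every returning frame is what handles upward propagation: each ancestor re-checks its own balance after its subtree has been fixed.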

  10. Pros and Cons of AVL Trees
        • Arguments for AVL trees:
            • All operations are logarithmic worst-case because trees are always balanced
            • The height balancing adds no more than a constant factor to the speed of insert and delete
        • Arguments against AVL trees:
            • Difficult to program & debug
            • More space for height field
            • Asymptotically faster but rebalancing takes a little time
            • Most large searches are done in database-like systems on disk and use other structures (e.g., B-trees, our next data structure)
            • If amortized (later, I promise) logarithmic time is enough, use splay trees (skipping, see text)

  11. Now what?
        • We have a data structure for the dictionary ADT that has worst-case O(log n) behavior
            • One of several interesting/fantastic balanced-tree approaches
        • We are about to learn another balanced-tree approach: B Trees
        • First, to motivate why B trees are better for really large dictionaries (say, over 1GB = 2^30 bytes), we need to understand some memory-hierarchy basics
            • Don’t always assume “every memory access has an unimportant O(1) cost”
            • Learn more in CSE351/333/471 (and CSE378); focus here on relevance to data structures and efficiency

  12. A typical hierarchy
      “Every desktop/laptop/server is different” but here is a plausible configuration these days:
        • CPU: instructions (e.g., addition): 2^30/sec
        • L1 Cache: 128KB = 2^17 bytes; get data in L1: 2^29/sec = 2 insns
        • L2 Cache: 2MB = 2^21 bytes; get data in L2: 2^25/sec = 30 insns
        • Main memory: 2GB = 2^31 bytes; get data in main memory: 2^22/sec = 250 insns
        • Disk: 1TB = 2^40 bytes; get data from a “new place” on disk: 2^7/sec = 8,000,000 insns; “streamed”: 2^18/sec
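The "= N insns" figures are the instruction rate divided by the access rate at each level. A small sketch of that arithmetic, using only the rates on this slide (the dictionary keys and the function name `instruction_equivalents` are made up for illustration):

```python
INSTRUCTIONS_PER_SEC = 2**30

ACCESSES_PER_SEC = {
    "L1 cache": 2**29,
    "L2 cache": 2**25,
    "main memory": 2**22,
    "disk (new place)": 2**7,
    "disk (streamed)": 2**18,
}


def instruction_equivalents(level):
    """How many instructions could run in the time of one access at this level."""
    return INSTRUCTIONS_PER_SEC // ACCESSES_PER_SEC[level]


for level in ACCESSES_PER_SEC:
    print(f"{level:18s} {instruction_equivalents(level):>10,d} instructions")
```

The exact quotients are powers of two (2, 32, 256, 8,388,608), which the slide rounds to 2, 30, 250, and 8,000,000.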

  13. Morals
      It is much faster to do:
        • 5 million arithmetic ops than 1 disk access
        • 2500 L2 cache accesses than 1 disk access
        • 400 main memory accesses than 1 disk access
      Why are computers built this way?
        • Physical realities (speed of light, closeness to CPU)
        • Cost (price per byte of different technologies)
        • Disks get much bigger, not much faster
            • Spinning at 7200 RPM accounts for much of the slowness, and disks are unlikely to spin faster in the future
        • Speedup at higher levels makes lower levels relatively slower
        • Later in the course: more than 1 CPU!

  14. “Fuggedaboutit”, usually
      The hardware automatically moves data into the caches from main memory for you
        • Replacing items already there
        • So algorithms are much faster if “data fits in cache” (often does)
      Disk accesses are done by software (e.g., ask the operating system to open a file or database to access some data)
      So most code “just runs”, but sometimes it’s worth designing algorithms / data structures with knowledge of the memory hierarchy
        • And when you do, you often need to know one more thing…

  15. Block/line size
        • Moving data up the memory hierarchy is slow because of latency (think distance-to-travel)
            • May as well send more than just the one int/reference asked for (think “giving friends a car ride doesn’t slow you down”)
            • Sends nearby memory because:
                • It’s easy
                • And likely to be asked for soon (think fields/arrays)
        • The amount of data moved from disk into memory is called the “block” size or the “(disk) page” size
            • Not under program control
        • The amount of data moved from memory into cache is called the “line” size
            • Not under program control

  16. Connection to data structures
        • An array benefits more than a linked list from block moves
            • Language (e.g., Java) implementation can put the list nodes anywhere, whereas an array is typically contiguous memory
        • Suppose you have a queue to process with 2^23 items of 2^7 bytes each on disk and the block size is 2^10 bytes
            • An array implementation needs 2^20 disk accesses
                • If “perfectly streamed”, about 4 seconds (2^20 accesses at 2^18/sec)
                • If “random places on disk”, 8000 seconds (> 2 hours)
            • A list implementation in the worst case needs 2^23 “random” disk accesses (> 16 hours), though it is probably not that bad
        • Note: “array” doesn’t mean “good”
            • Binary heaps “make big jumps” to percolate (different block)
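The queue estimates follow from the block size and the disk rates given on slide 12 (2^7 random accesses per second, 2^18 streamed). A back-of-the-envelope sketch, with variable names invented here for illustration:

```python
ITEMS = 2**23
ITEM_BYTES = 2**7
BLOCK_BYTES = 2**10
RANDOM_PER_SEC = 2**7        # "new place" disk accesses per second (slide 12)
STREAMED_PER_SEC = 2**18     # streamed disk accesses per second (slide 12)

items_per_block = BLOCK_BYTES // ITEM_BYTES          # 8 items arrive with each block
array_accesses = ITEMS // items_per_block            # 2**20 block reads for the array

array_streamed_sec = array_accesses / STREAMED_PER_SEC
array_random_sec = array_accesses / RANDOM_PER_SEC
list_random_sec = ITEMS / RANDOM_PER_SEC             # worst case: one access per node

print(f"array, streamed: {array_streamed_sec:.0f} s")
print(f"array, random:   {array_random_sec:.0f} s ({array_random_sec / 3600:.1f} h)")
print(f"list, random:    {list_random_sec:.0f} s ({list_random_sec / 3600:.1f} h)")
```

At those rates the streamed case works out to about 4 seconds, the random-block case to roughly 8000 seconds, and the worst-case list to roughly 18 hours. The factor of 8 between the last two is exactly the number of items that fit in one block.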

  17. BSTs?
        • Since looking things up in balanced binary search trees is O(log n), even for n = 2^39 (512GB) we don’t have to worry about minutes or hours
        • Still, the number of disk accesses matters
            • An AVL tree could have a height of 55 (see lecture7.xlsx)
            • So each find could take about 0.5 seconds, or about 100 finds a minute
            • Most of the nodes will be on disk: the tree is shallow, but it is still many gigabytes big, so the tree cannot fit in memory
            • Even if memory holds the first 25 nodes on our path, we still need 30 disk accesses
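The height estimate can be reproduced approximately from the standard recurrence for the sparsest AVL tree of a given height, S(h) = S(h-1) + S(h-2) + 1. The sketch below is an assumption-laden illustration: the spreadsheet lecture7.xlsx is not available here, the function name `max_avl_height` is hypothetical, and the convention that a single node has height 0 may differ by one from the convention behind the slide's figure of 55.

```python
def max_avl_height(n):
    """Largest height (single node = 0) whose sparsest AVL tree has at most n nodes."""
    prev, cur = 0, 1            # minimum node counts for heights -1 and 0
    height = 0
    while True:
        nxt = cur + prev + 1    # minimum node count for height + 1
        if nxt > n:
            return height
        prev, cur = cur, nxt
        height += 1


DISK_ACCESSES_PER_SEC = 2**7    # "new place" rate from slide 12

bound = max_avl_height(2**39)
print("worst-case height for n = 2^39:", bound)
print("seconds for 55 disk accesses:", 55 / DISK_ACCESSES_PER_SEC)
print("seconds for 30 disk accesses:", 30 / DISK_ACCESSES_PER_SEC)
```

The bound lands in the mid-50s, in line with the slide, and 55 random disk accesses at 2^7 per second take a bit under half a second, which matches the "about 0.5 seconds" estimate.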

  18. Note about numbers; moral
        • All the numbers in this lecture are “ballpark”, “back of the envelope” figures
        • Even if they are off by, say, a factor of 5, the moral is the same: if your data structure is mostly on disk, you want to minimize disk accesses
        • A better data structure in this setting would exploit the block size and relatively fast memory access to avoid disk accesses…
