430 likes | 511 Vues
Learn about disk workings, memory sizes, and database design impact. Compare disk features of IBM Deskstar and others. Understand the implications for storage strategies in databases.
 
                
                E N D
Seminar 1 – CG171 Disk architecture
Intention • Using the web and other resources: • Find out more about how disks work • What do the terms seek time, rotational delay, transfer time, platter, sector, block, track and cylinder mean? • For two different manufacturer’s disk-packs (say IBM mainframe, Microsoft PC) find typical magnitudes for the above features. • Also identify typical main memory sizes for the corresponding computers. • Suggest how the characteristics affect physical design of databases
Picture of Disk Shows pack: http://www.cs.unb.ca/profs/hyslop/cs2003/disks.pdf Gives some definitions.
Disk and Memory parameters Ramakrishnan p.199 IBM Deskstar 14GPX is: 3.5 in, 14.4 GB hard disk, average seek time 9.1 millisecs (msecs) – minimum 2.2, max 15.5. average rotational delay (latency) 4.17 msecs 5 double-sided platters that spin at 7,200 revs/minute each platter holds 3.35Gb of data (density = 2.6 gigabit/square inch) data transfer rate is 13MB per second. Typical disk access takes 10 msecs (check with above) Main memory access is less than 60 nanoseconds
Latest performance data Some disks now revolve at over 10,000rpm Some performance data: http://www.pctechguide.com/04disks.htm By late 2001 the fastest high-performance drives were capable of an average latency of less than 3ms, an average seek time of between 4 and 7ms and maximum data transfer rates in the region of 50 and 60MBps for EIDE and SCSI-based drives respectively. Note the degree to which these maximum DTRs are below the bandwidths of the current versions of the drive's interfaces - Ultra ATA/100 and UltraSCSI 160 - which are rated at 100MBps and 160MBps respectively.
Further terminology http://www.makeitsimple.com/articles/hdguide/hd_guide2.htm
Relative Times • Implication for strategy: • Small databases should be loaded into main memory • Blocks should be as large as possible • Preference for holding related data: • First on same track • Second on same cylinder • Third on adjacent cylinder
Seminar 2 Nick Rossiter
Heap Files • Consider a problem of data storage/retrieval: • Available on the machine for an application: • 64Mb main memory • 1Mb main memory allocation for buffer (area of main memory in which data from disk is placed) • 64Gb disk
Application • The application has 10,000,000 records of 100 bytes each. • The key value is a single attribute employee_number.
Questions and Answers 1-3 • Does the data fit on the disk? Yes, 1Gb (1,000,00,000) is the requirement and 64Gb is available. • Does all the data fit in main memory at once? No, main memory even if all available holds only 64Mb (64,000,000). • Assume that 2,000 records fit per disk page. • How many pages are required to hold the data? So R=2,000 (records/page) and there are 10,000,000 records so number of pages needed (B) is 10,000,000/2,000 = 5,000. • With a disk access taking 0.015 msecs, how long does it take for an exhaustive search of the file stored as a heap? B(D+RC) = 5,000(0.015 + 2,000*10-7) = 76
Questions and Answers 4 • Repeat 3) with a) 100 records per disk page and b) 10,000 records/disk page. • If R=100, then B=100,000 so B(D+RC) = 100,000(0.015+ 100*10-7) = 1,501 secs • If R=10,000 then B= 1,000 so B(D+RC) = 1,000(0.015+10,000*10-7) = 16.0 secs Note that simply packing more records/page dramatically reduces the process time. So increase R to as high a figure as the amount of main memory available.
Questions and Answers 5-6 • Can the number of records/page be increased further? No as buffer size of 1Mb (1,000,000) in main memory can only fit 10,000 records at 100 bytes each • What is the average cost of finding a particular record in this file with 2,000 records/page using: • a heap access method? B(D+RC)/2=5,000(0.015 + 2,000*10-7) = 38 secs • a sequential (sorted) access method? D log2B+Clog2R = (0.015*13)+(10-7*11) = 0.195+(1.1*10-6)  0.195 secs So sequential is much faster than heaped in finding a particular record. Note: here log to base 2 of X is the power of 2 needed to reach or exceed X. These logs are whole numbers for our purposes. E.g. log28 = 3; log2256=8; log2300=9.
Questions and Answers 7 • What is the average cost of inserting a particular record into this file with 2,000 records/page using: • a heap access method? 2D + C = (2*0.015)+10-7 0.03 secs • a sequential (sorted) access method? D log 2B + C log 2R + 2*(0.5B(D+RC)) = (0.015*13)+(10-7*11)+(5,000*(0.015+2,000*10-7)) = 0.195+(1.1*10-6)+76  76.2 secs So heap is far faster for insertions.
Improvements • Any suggestions for making the system work faster?? Coming soon!!
Seminar 3 Nick Rossiter
Hashed Files • Consider a problem of data storage/retrieval: • Available on the machine for an application: • 64Mb main memory • 1Mb main memory allocation for buffer (area of main memory in which data from disk is placed) • 64Gb disk
Application • The application has 5,000,000 records of 100 bytes each. • The key value is a single attribute employee_number of the form: • Annnnnnn (7 n’s) • where A is a capital letter and n is a number
Questions and Answers 1-2 • What is the total number of possible employee numbers? Annnnnnn – nnnnnnn gives the range 0000000-9999999 (10,000,000 combinations), A gives 26 possibilities (A-Z) so the key can be represented in 260,000,000 ways, ample for 5,000,000 actual records. • Can you find a hash function which maps the data into 1,000 pages? Yes: remove initial letter, divide by 1,000 and take the remainder as the page number giving the range of page numbers 0 .. 999
Questions and Answers 3-5 • How many records will this produce per page? Number of records in file/number of pages in the file, that is 5,000,000/1,000 = 5,000 • Is this the number of records per page to be used in the implementation? No, aim for about 80% packing so have perhaps 6,000 records/page. • What assumptions does this hash function make about distribution of key values? That they are random with respect to the hash function. If disproportionately high numbers of key values ended in 000, page 0 would become full very quickly leading to excessive collisions.
Questions and Answers 6 • How long will it take to find the record with key E0011223? Trace through the steps taken. Apply hash function to key value: Remove initial letter giving 00112233 Find remainder after dividing 00112233 by 1000: giving page number of 233. Retrieve page 233 from disk (transfer to main memory, cost = D) Search page 233 in main memory (assume found on average by searching half page, cost = 0.5RC). Total Cost = D + 0.5RC = 0.015 + (0.5*5000*10-7) = 0.01530 secs (very fast)
Questions and Answers 7 • How long will it take to insert a record with key J9910657? Trace through the steps taken. Apply hash function to key value: Remove initial letter giving 9910657 Find remainder after dividing 9910657 by 1000: giving page number of 657. Retrieve page 657 from disk (transfer to main memory, cost = D) Insert new record into page 657 in main memory (assume anywhere as in heap, cost = C) Write page 657 back to disk (cost = D) Cost = D + C + D= 0.015 + 10-7 + 0.015 = 0.0300001 secs (very fast)
Questions 8-11 • How long will it take to find all employee numbers in the range K0011200..K0011299? • How does the cost here compare with that in sorted and heap files? • How long will it take to find all employee numbers beginning with K? • How does the cost here compare with that in sorted and heap files?
Q8 – Hash Search • K0011200..K0011299 • How many pages will this be distributed over? • Not known. Really need idea of denseness - does every possible key exist? • Maximum is over pages 200, 201, …., 299 so 100 pages; minimum is over 0 pages (does not exist) • Because of uncertainty, hash method with any range (even on key) involves complete scan • Cost = 1000*(D + RC) =1000*( 0.015 + (5000*10-7)) = 15.5 secs (very long)
Q9 – Sorted/Heap • Heap cost: search all of file – cost = B(D+RC) = 833*(0.015 + 6,000*10-7) = 13.0 secs (very long, B=833 as can pack 6,000 records per page at 100% full) • Sorted cost: find initial page by binary chop then find all records in one, possibly two, accesses. Cost = D log2B+Clog2R = (0.015*10)+ (10-7*13) = 0.15+(1.1*10-6)  0.15 secs (much the fastest) • There is a minor extra cost for sorted. What is it?
Q10/Q11 – Cost K? • Cannot say how many records there are. • Hashed -- cost will be search of whole of file =B*(D + RC) =1000*( 0.015 + (5000*10-7)) = 15.5 secs (very long, longer than heap as <100% full) • Heap – cost will be search of all of file = B(D+RC) = 833(0.015 + 6,000*10-7) = 13.0 secs (very long) • Sorted – cost will be D log2B+Clog2R = (0.015*10)+ (10-7*13)  0.15 secs • So sorted is very much faster • Note hashing gives fast single-record access but is slow for searching on ranges
Seminar 4 Nick Rossiter
ISAM example • See word document • ISAM Example • On BB
Questions 1-3 • What do you assume as the value for R (number of records per page)? 3 (very low for efficiency purposes, but easier to draw!) • How many levels does the index have? 2 • If the index is say 100k, can this fit into main memory? Probably – small in relation to normal main memory availability. Would speed up searches if held in main memory.
Question 4 • Trace the mechanism by which: • Record with key 3 is found • Pointers to left of key value go to nodes/pages with lower key values • Pointers to right of key values go to nodes/pages with equal or higher key values So in root node, take pointer left-hand of 20 to node on left at next level, then take pointer left of 7 to page 0 in data, retrieve 3 by serial search of page. • Record with key 4 is not found Similar route to page 0 (not page 1 as this page contains records with key values from 7-12). Serial search of page reports not there.
Questions 5-7 • Suggest some key values for pages 3 and 4. Any 3 values 20-25 for page 3, 26-32 for page 4 • What is the cost of finding record 3? D+0.5*RC=0.015+(0.5*3*10-7) = 0.01500015 secs (index in main memory, half a page is searched to find a record on average) • If the whole file of 1,000 pages is to be scanned: • What is the strategy? Process data file only in one complete pass (by-pass index) • What is the cost? B(D+RC) = 1000(0.015+3*10-7 ) = 15.0003 secs
Question 8 • The user wants to add record 4. • Where does this go? Should go in page 0 but no room here as space is occupied with 100% packing density. Form overflow page linked to page 0 with a pointer. • What problems does it cause? Now may need 2 disk accesses to search for record 4. If get intense local growth in key values will get uneven access times across the file. Eventually re-organise the file to sort overflow records into the main sequence and re-build the index.
Seminar 5 Nick Rossiter
B-trees • An example B-tree is given on BB
Capacity • Take a B-tree with: • Order 200 (d = 200) • How many records fit into leaf nodes in tree at: • one level? 1 node at 400 records (400) • two levels? 401 nodes at 400 records (160,400) • three levels? 401*401 nodes at 400 records (64,320,400) • four levels? 401*401*401 nodes at 400 records (25,792,480,400)
Search Times • What is the cost (seconds) of finding record with key 13 in the B-tree provided? If root node is in memory, the cost is for searching 2 levels of tree and on average finding the record half-way through the page. (T-1)*(D+0.5RC) = (3-1)*(0.015+(0.5*4*10-7)) = 2*0.0150002 = 0.0300004 secs • Is the search cost the same for all other records? Yes, all of actual data is in leaf nodes
Search Times • What is the cost of finding record 13 and sequentially processing the preceding three records? Cost for q2 – finding record 13 (0.0300004 secs). Add on 2 further (disk read + search of whole page) = 2(D+RC) = 2*(0.015+(4*10-7)) = 2*0.0150004 = 0.0300008. Total = 0.0600012 secs
Differences from ISAM • What are the essential differences between ISAM and B-Trees? ISAM is static approach, B-tree a dynamic one. B-tree height adjusts automatically as data inserted and deleted. ISAM height is fixed at initial load of data. B-trees are more rigorously defined than ISAM ones.
Insertions • Add in turn, records with key values: • 5 • 6 • 8 • 10 • 12 • Give the tree (or at least the critical part of it) after each insertion. • The state returned after the insertion must ensure that each node has d…2d records.
Adding records • Add 5 (re-distribute) • Child nodes: 1,2,3,4 5,7,9 13,14 • Parent node: 5,13 • Add 6 (into node directly – no adjustments) • Child nodes: 1,2,3,4 5,6,7,9 13,14 • Parent node: 5,13
Adding records • Add 8 (re-distribute) • Child nodes: 1,2,3,4 5,6,7,8, 9,13,14 • Parent node: 5,9 • Add 10 (into node directly – no adjustments) • Child nodes: 1,2,3,4 5,6,7,8, 9,10,13,14 • Parent node: 5,9
Adding records - Splitting • Add 12 (push up and split) • No redistribution possible, all siblings full with 4 records each • Right-hand sibling is 9,10,12,13,14 (breaks rules) • Push middle record (12) up tree • Split child node into two nodes • Child nodes: 1,2,3,4 5,6,7,8, 9,10 12,13,14 • Parent node: 5,9,12
Adding records - Splitting • Parent node can still grow by one more record • When it overflows, its central record is pushed upwards • When overflow occurs at top of tree, tree grows in height by one level