CMPT 454

Data Storage and Disk Access CMPT 454

Data Storage and Disk Access • Memory hierarchy • Hard disks • Architecture • Processing requests • Writing to disk • Hard disk reliability and efficiency • RAID • Solid State Drives • Buffer management • Data storage

Memory

DBMS and Memory [Figure labels: cache, main memory, DBMS, disk, virtual memory, file system, tertiary storage]

Memory Hierarchy • Primary memory: volatile • Main memory • Cache • Secondary memory: non-volatile • Solid State Drive (SSD) • Magnetic Disk (Hard Disk Drive, HDD) • Tertiary memory: non-volatile • CD/DVD • Tape - sequential access • Usually used as backup or for long-term storage [Figure axes: cost, speed]

Main Memory vs Secondary Storage • Speed • Main memory is much faster than secondary memory • 10 – 100 nanoseconds to move data in main memory • That is, 0.00001 to 0.0001 milliseconds • 10 milliseconds to read a block from an HDD • 0.1 milliseconds to read a block from an SSD • Cost • Main memory is around 100 times more expensive than secondary memory • SSDs are more expensive than HDDs

Main Memory vs Secondary Storage • System limitations • On a 32-bit system only 2^32 bytes (4 GiB) can be directly referenced • Many databases are larger than that • Volatility • Data must be maintained between program executions, which requires non-volatile memory • Non-volatile storage retains its contents when the device is turned off, or if there is a power failure • Main memory is volatile, secondary storage is not

Hard Disk Drives

Magnetic Disks • Database data is usually stored on disks • A database will often be too large to be retained in main memory • When a query is processed data will need to be retrieved from storage • Data is stored on disk blocks • Also referred to as blocks, or in relation to the OS, pages • A block is a contiguous sequence of bytes and • The unit in which data is written to, and read from, the disk • Block size is typically between 4 and 16 kilobytes

Magnetic Disk Structure • A hard disk consists of a number of platters • A platter can store data on one or both of its surfaces, and so is referred to as • Single-sided or double-sided • Surfaces are composed of concentric rings called tracks • The set of all tracks with the same diameter is called a cylinder • Sectors are arcs of a track • And are typically 4 kilobytes in size • Block size is set when the disk is initialized, and is usually a small multiple of the sector size (hence 4 to 16 kilobytes)

Diagram of a Disk [Figure labels: platter, surfaces, track, cylinder; annotated values 3* and 2*] *statistics for Western Digital Caviar Black 1 TB hard drive

Disk Heads • Data is transferred to or from a surface by a disk head • There is one disk head for each surface • These disk heads are moved as a unit (called a disk head array) • Therefore all the heads are in identical positions with respect to their surfaces • To read or write a block a disk head must be positioned over it • Only one disk head can read or write at a time

Disk Anatomy [Figure: platters and a track; the disk spins at around 7,200 rpm; the disk head array moves in and out]

Disk Controller • Disk drives are controlled by a processor called a disk controller which • Controls the actuator that moves the head assembly • Selects sectors and determines when the disk has rotated to a sector • Transfers data between the disk and main memory • Some controllers buffer data from tracks in the expectation that the data will be required

Accessing Data in a Disk • The disk constantly spins at 7,200 rpm* • The head pivots over the desired track • The desired block is read as it passes underneath the head • *Western Digital Caviar Black 1 TB hard drive (again)

Accessing A Block • The disk head is moved in or out to the track • This seek time is typically about 10 milliseconds • WD Caviar Black 1TB: 8.9 ms • Wait until the block rotates under the disk head • This rotational delay is typically about 4 milliseconds • WD Caviar Black 1TB: 4.2 ms • The data on the block is transferred to memory • This transfer time is the time it takes for the block to completely rotate past the disk head • Typically less than 1 millisecond

Transfer Time • The seek time and rotational delay depend on • Where the disk head is before the request, • Which track is being requested, and • How far the disk has to rotate • The transfer time depends on the request size • The transfer time (in ms) for one block equals • (60,000 / disk rpm) / blocks per track • The transfer time (in ms) for an entire track equals • (60,000 / disk rpm)
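The two transfer-time formulas can be checked with a short Python sketch. The function names and the figure of 100 blocks per track are illustrative assumptions, not values from the slides. The average rotational delay is taken to be half a rotation, which for a 7,200 rpm disk agrees with the 4.2 ms quoted for the WD Caviar Black.

```python
def rotation_time_ms(rpm: float) -> float:
    """Time for one full rotation, in milliseconds."""
    return 60_000 / rpm


def transfer_time_ms(rpm: float, blocks_per_track: int, blocks: int = 1) -> float:
    """Time for `blocks` consecutive blocks to pass under the head."""
    return rotation_time_ms(rpm) / blocks_per_track * blocks


def avg_rotational_delay_ms(rpm: float) -> float:
    """On average the disk must turn half a rotation to reach a block."""
    return rotation_time_ms(rpm) / 2


# Hypothetical disk: 7,200 rpm with an assumed 100 blocks per track.
print(transfer_time_ms(7200, 100))
print(avg_rotational_delay_ms(7200))
```

Reading an entire track (all 100 blocks in this assumed layout) takes exactly one rotation, which is the second formula on the slide.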

Main Memory versus Disk • Typical access time for a block on a hard disk • 15 milliseconds • Typical access time for a main memory frame • 60 nanoseconds • What’s the difference? • 1 millisecond = 1,000,000 nanoseconds • 60 ns = 0.00006 ms • Accessing a hard drive is around 250,000 times slower than accessing main memory

Reducing Disk Access Time • Disk latency (access time) has three components • seek time + rotational delay + transfer time • The overall access time can be shortened by reducing, or even eliminating seek time and rotational delay • Related data should be stored in close proximity • Accessing two records in adjacent blocks on a track • Seek the desired track, rotate to first block, and transfer two blocks = 10 + 4 + 2*1 = 16ms • Accessing two records on different tracks • Seek the desired track, rotate to the block, and transfer the block, then repeat = (10 + 4 + 1)*2 = 30ms
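The two access patterns above can be written as a small cost model. This is a sketch using the slide's round figures (10 ms seek, 4 ms rotational delay, 1 ms transfer per block); the constant and function names are made up for illustration.

```python
SEEK_MS = 10      # typical seek time from the slide
ROTATE_MS = 4     # typical rotational delay from the slide
TRANSFER_MS = 1   # transfer time per block, rounded up as on the slide


def adjacent_blocks_ms(n_blocks: int) -> int:
    """One seek and one rotational delay, then n consecutive transfers."""
    return SEEK_MS + ROTATE_MS + n_blocks * TRANSFER_MS


def scattered_blocks_ms(n_blocks: int) -> int:
    """Every block pays the full seek + rotate + transfer cost."""
    return n_blocks * (SEEK_MS + ROTATE_MS + TRANSFER_MS)


print(adjacent_blocks_ms(2))   # 16
print(scattered_blocks_ms(2))  # 30
```

The gap widens with the number of blocks: the adjacent case grows by 1 ms per block, the scattered case by 15 ms per block.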

Order of Closeness • What does it mean to say that related data should be stored close to each other? • The term close refers not to physical proximity but to how the access time is affected • In order of closeness: • Same block • Adjacent blocks on the same track • Same track • Same cylinder, but different surfaces • Adjacent cylinders • …

Which is Closer? • Is 2 or 3 "closer" to 1? • 2 is in the adjacent track • And is clearly physically closer, but • The disk head must be moved to access it • 3 is in the same cylinder • The disk head does not have to be moved • Which is why 3 is closer [Figure: three block positions marked 1, 2 and 3]

Fulfilling Disk Requests • A fair algorithm would take a first-come, first-served (FIFO) approach • Insert requests in a queue and process them in the order in which they are received

Elevator Algorithm • The elevator algorithm usually performs better than FIFO • Requests are buffered and the disk head moves in one direction, processing requests • The arm then reverses direction

Requests – Discussion • The elevator algorithm gives much better performance than FIFO on average • And is a relatively fair algorithm • The elevator algorithm is not optimal • The shortest-seek-first algorithm is closer to optimal but can result in a high variance in response time • And may even result in starvation for distant requests • In some cases the elevator algorithm can perform worse than FIFO
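A simplified comparison of the two scheduling policies can be sketched in Python. It assumes that all requests are already queued when scheduling starts and that cost is measured only as the number of cylinders the head crosses. A real controller accepts new requests while the arm is moving. The function names and cylinder numbers are illustrative.

```python
def fifo_order(requests):
    """Serve requests in arrival order."""
    return list(requests)


def elevator_order(start, requests, direction=1):
    """Sweep in `direction` (+1 = towards higher cylinders), then reverse."""
    ahead = sorted(r for r in requests if (r - start) * direction >= 0)
    behind = sorted(r for r in requests if (r - start) * direction < 0)
    if direction > 0:
        return ahead + behind[::-1]   # go up, then come back down
    return ahead[::-1] + behind       # go down, then come back up


def total_seek_distance(start, order):
    """Total number of cylinders crossed when serving `order` from `start`."""
    position, total = start, 0
    for cylinder in order:
        total += abs(cylinder - position)
        position = cylinder
    return total


queue = [95, 10, 60, 20, 80]
print(total_seek_distance(50, fifo_order(queue)))          # 280
print(total_seek_distance(50, elevator_order(50, queue)))  # 130

# A case where the elevator loses: the head is moving up, but the first
# request to arrive is just below it.
small = [40, 90]
print(total_seek_distance(50, fifo_order(small)))          # 60
print(total_seek_distance(50, elevator_order(50, small)))  # 90
```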

Modifying a Record • To modify an existing record (on a disk) the following steps must be taken • Read the record • Modify the record in main memory • Write the modified record back to disk • It is important to remember that the smallest unit of transfer to / from a disk is a block • A single disk block usually contains many records

Read – Modify – Write Cycle Read one block into main memory … … modify the desired record … … and write it back.

Inserting Records • Consider creating a new record • The user enters the data for the record • Through some application interface • The record is created in main memory • And then written to disk • Does this process require a read-modify-write process? • YES! • Because, otherwise, the existing contents of the disk block will be overwritten
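The read-modify-write cycle can be illustrated with a toy in-memory "disk" whose only operations are reading and writing whole blocks. The `Disk` class, the 16-byte block and the 4-byte fixed-length record are assumptions chosen to keep the sketch short.

```python
BLOCK_SIZE = 16   # toy block size in bytes (real blocks are 4 to 16 KB)
RECORD_SIZE = 4   # toy fixed-length record


class Disk:
    """Hypothetical disk that transfers whole blocks only."""

    def __init__(self, num_blocks: int):
        self.blocks = [bytes(BLOCK_SIZE) for _ in range(num_blocks)]

    def read_block(self, block_no: int) -> bytearray:
        return bytearray(self.blocks[block_no])

    def write_block(self, block_no: int, data: bytes) -> None:
        assert len(data) == BLOCK_SIZE, "only whole blocks can be written"
        self.blocks[block_no] = bytes(data)


def store_record(disk: Disk, block_no: int, slot: int, record: bytes) -> None:
    """Insert or update one record without disturbing its neighbours."""
    assert len(record) == RECORD_SIZE
    buffer = disk.read_block(block_no)                 # 1. read the block
    start = slot * RECORD_SIZE
    buffer[start:start + RECORD_SIZE] = record         # 2. modify in memory
    disk.write_block(block_no, buffer)                 # 3. write it back


disk = Disk(num_blocks=2)
store_record(disk, 0, 1, b"ABCD")
store_record(disk, 0, 2, b"WXYZ")   # the earlier record survives
print(disk.blocks[0])
```

Skipping step 1 and writing a block that contains only the new record would overwrite every other record stored in that block.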

Disk Failures • Intermittent failure • Multiple attempts are required to read or write a sector • Media decay • A bit or a number of bits are permanently corrupted and it is impossible to read a sector • Write failure • A sector cannot be written to or retrieved • Often caused by a power failure during a write • Disk crash • The entire disk becomes unreadable

Checksums • An intermittent failure may result in incorrect data being read by the disk controller • Such incorrect data can be detected by a checksum • Each sector contains additional bits whose values are based on the data bits in the sector • A simple single-bit checksum is to maintain an even parity on the sector • If there is an odd number of 1s the parity is odd • If there is an even number of 1s the parity is even

Parity Bits • Assume that there are seven data bits and a single checksum bit • Data bits 0111011 – parity is odd • Checksum bit is set to 1 so that the overall parity is even • Using a single checksum bit allows errors of only one bit to be detected reliably • Several checksum bits can be maintained to reduce the chance of failing to notice an error • e.g. maintain 8 checksum bits, one for each bit position in the data bytes
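The even-parity rule can be sketched in a few lines of Python. Bits are held in a string for readability, and the function names are illustrative.

```python
def parity_bit(data_bits: str) -> int:
    """Checksum bit that makes the total number of 1s even."""
    return data_bits.count("1") % 2


def looks_valid(bits_with_checksum: str) -> bool:
    """True if the overall parity is even."""
    return bits_with_checksum.count("1") % 2 == 0


data = "0111011"                       # five 1s, so the parity is odd
stored = data + str(parity_bit(data))  # "01110111"

one_flip = "11110111"    # first bit corrupted
two_flips = "10110111"   # first two bits corrupted

print(looks_valid(stored))     # True
print(looks_valid(one_flip))   # False: a single-bit error is detected
print(looks_valid(two_flips))  # True: a two-bit error goes unnoticed
```

The last case shows why several checksum bits are kept in practice: any even number of flipped bits leaves a single parity bit unchanged.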

Stable Storage • Checksums can detect errors but can't correct them • Stable storage can be implemented on a disk to allow errors to be corrected • Sectors are paired, with each pair representing a single sector • Pairs are usually referred to as Left and Right • Errors in a sector (L or R) are detected using checksums • Stable storage can cope with media failures and write failures

Stable Storage Policy • For writing, write the value of some sector X into XL • Check that the value is correct (using checksums) • If the value is not correct after a given number of attempts then assume that the sector has failed • A spare sector should be substituted for XL • Repeat the process for XR • For reading, XL and XR are read from in turn until a correct value is returned
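The write and read policy can be sketched with simulated sectors. Everything here is an assumption made for illustration: the `Sector` and `StablePair` classes, the `broken` flag used to simulate a failed sector, the limit of three attempts, and the use of CRC-32 (from Python's `zlib`) as the checksum. The sketch also assumes that a substituted spare sector is healthy.

```python
import zlib

MAX_ATTEMPTS = 3  # assumed retry limit


class Sector:
    """Simulated sector that stores data together with a checksum."""

    def __init__(self):
        self.data = b""
        self.checksum = zlib.crc32(b"")
        self.broken = False  # hypothetical switch simulating a failed sector

    def write(self, data: bytes) -> None:
        self.data = data
        self.checksum = zlib.crc32(data)

    def read(self):
        """Return the data, or None if the sector is bad or corrupted."""
        if self.broken or zlib.crc32(self.data) != self.checksum:
            return None
        return self.data


def write_verified(sector: Sector, data: bytes, spares: list) -> Sector:
    """Write and check; substitute a spare if the sector keeps failing."""
    for _ in range(MAX_ATTEMPTS):
        sector.write(data)
        if sector.read() == data:
            return sector
    spare = spares.pop()   # assumes a healthy spare is available
    spare.write(data)
    return spare


class StablePair:
    """Logical sector X stored as the pair (XL, XR)."""

    def __init__(self, spares: list):
        self.left, self.right, self.spares = Sector(), Sector(), spares

    def write(self, data: bytes) -> None:
        self.left = write_verified(self.left, data, self.spares)    # XL first
        self.right = write_verified(self.right, data, self.spares)  # then XR

    def read(self):
        for _ in range(MAX_ATTEMPTS):
            for sector in (self.left, self.right):  # XL and XR in turn
                value = sector.read()
                if value is not None:
                    return value
        return None


pair = StablePair(spares=[Sector()])
pair.left.broken = True        # XL has failed
pair.write(b"hello")           # XL is replaced by the spare
pair.left.data = b"hellp"      # later media decay in XL
print(pair.read())             # still recovered from XR
```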

RAID

Problems with Hard Disks • Hard disks act as bottlenecks for processing • DB data is stored on disks, and must be fetched into main memory to be processed, and • Disk access is considerably slower than main memory processing • There are also reliability issues with disks • Disks contain mechanical components that are more prone to failure than electronic components • One solution is to use multiple disks

Multiple Disks • Multiple disks • Each disk contains multiple platters • Disks can be read in parallel, and • Different disks can read from different cylinders • e.g. the first disk can access data from cylinder 6,000, while the second disk is accessing data from cylinder 11,000 • Single disk • Multiple platters • Disk heads are always over the same cylinder

Improving Efficiency • Using multiple disks to store data improves efficiency as the disks can be read in parallel • To satisfy a request the physical disks and disk blocks that the data resides on must be identified • The data may be on a single disk, or it may be split over multiple disks • The way in which data is distributed over the disks affects the cost of accessing it • In the same way that related data should be stored close to each other on a single disk

Data Striping • A disk array gives the user the abstraction of a single, large, disk • When an I/O request is issued the physical disk blocks to be retrieved have to be identified • How the data is distributed over the disks in the array affects how many disks are involved in an I/O request • Data is divided into partitions called striping units • The striping unit is usually either a block or a bit • Striping units are distributed over the disks using a round robin algorithm

Striping • Notional file: the data is divided into striping units of a given size • The striping units are distributed across a RAID system in a round robin fashion • The size of the striping unit has an impact on the behaviour of the system [Figure: striping units of a notional file distributed over disk 1 to disk 4]

Striping Units – Block Striping • Assume that a file is to be distributed across a four disk RAID system using block striping and that, purely for the sake of illustration, the block size is just one byte • Notional file: the numbers represent a sequence of individual bits in the file • Distribute these bits across a 4 disk RAID system using BLOCK striping [Figure: blocks 1 to 3 of the file placed on disks 1 to 4]

Striping Units – Bit Striping • Here is the same file to be distributed across a four disk RAID system, this time using bit striping; again, purely for the sake of illustration, the block size is just one byte • Notional file: the numbers represent a sequence of individual bits in the file • Distribute these bits across a 4 disk RAID system using BIT striping [Figure: the bits of blocks 1 to 3 spread over disks 1 to 4]
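Round robin distribution of striping units can be sketched as follows. The 24 numbered bits stand in for the notional file of three one-byte blocks, and the `stripe` function is an illustrative helper, not part of any RAID software.

```python
def stripe(data, num_disks: int, unit_size: int):
    """Split `data` into units of `unit_size` and deal them out round robin."""
    disks = [[] for _ in range(num_disks)]
    units = [data[i:i + unit_size] for i in range(0, len(data), unit_size)]
    for index, unit in enumerate(units):
        disks[index % num_disks].append(unit)
    return disks


bits = list(range(1, 25))   # bits 1..24 = three one-byte blocks

block_striped = stripe(bits, num_disks=4, unit_size=8)  # unit = one block
bit_striped = stripe(bits, num_disks=4, unit_size=1)    # unit = one bit

for name, layout in (("block", block_striped), ("bit", bit_striped)):
    print(name)
    for disk_no, units in enumerate(layout, start=1):
        print(f"  disk {disk_no}: {units}")
```

With block striping each block sits entirely on one disk, so reading one block involves one disk. With bit striping every block is spread over all four disks, so every read involves all of them.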

Disk Array Performance • Assume that a disk array consists of D disks • Data is distributed across the disks using data striping • How does it perform compared to a single disk? • To answer this question we must specify the kinds of requests that will be made • Random read – reading multiple, unrelated records • Random write • Sequential read – reading a number of records (such as one file or table), stored on more than D blocks • Sequential write

The Basic Idea … • Use all D disks to improve efficiency, and distribute data using block striping • Random read performance • Very good – up to D different records can be read at once • Depending on which disks the records reside on • Random write performance – same as read performance • Sequential read performance • Very good – as related data are distributed over all D disks performance is D times faster than a single disk • Sequential write performance – same as read performance • But what about reliability …

Reliability • Hard disks contain mechanical components and are less reliable than other, purely electronic, components • Increasing the number of hard disks decreases reliability, reducing the mean-time-to-failure (MTTF) • The MTTF of a hard disk is about 50,000 hours, or 5.7 years • In a disk array the overall MTTF decreases • Because the number of disks is greater • MTTF of a 100 disk array is 21 days – (50,000/100) / 24 • This assumes that failures occur independently and • The failure probability does not change over time • Reliability is improved by storing redundant data
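The MTTF arithmetic can be reproduced directly. The sketch relies on the same assumptions as the slide (independent failures and a constant failure rate); the function name is illustrative.

```python
HOURS_PER_DAY = 24
HOURS_PER_YEAR = 24 * 365


def array_mttf_hours(disk_mttf_hours: float, num_disks: int) -> float:
    """Expected time until the first of `num_disks` disks fails."""
    return disk_mttf_hours / num_disks


single = 50_000
print(round(single / HOURS_PER_YEAR, 1))                        # 5.7 years
print(round(array_mttf_hours(single, 100) / HOURS_PER_DAY))     # 21 days
```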

Redundancy • Reliability of a disk array can be improved by storing redundant data • If a disk fails the redundant data can be used to reconstruct the data lost on the failed disk • The data can either be stored on a separate check disk or • Distributed uniformly over all the disks • Redundant data is typically stored using one of two methods • Mirroring, where each disk is duplicated • A parity scheme, where sufficient redundant data is maintained to recreate the data in any one disk • Other redundancy schemes provide greater reliability

Parity Scheme • For each bit on the data disks there is a parity bit on a check disk • If the sum of the data disks' bits is even the parity bit is set to zero • If the sum of the bits is odd the parity bit is set to one • The data on any one failed disk can be recreated bit by bit [Figure: a 4 data disk system showing individual bit values, with a 5th check disk containing parity data]

Parity Scheme Read and Write • Reading • The parity scheme does not affect reading • Writing • A naïve approach would be to calculate the new value of the parity bit from all the data disks • A better approach is to compare the old and new values of the disk that is written to • And change the value of a parity bit if the corresponding bits have changed
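Even parity across disks is the bitwise XOR of the data, which makes both reconstruction and the cheaper write approach easy to sketch. Each disk is modelled here as a small integer whose bits stand for the bits stored on that disk. The function names and bit patterns are illustrative.

```python
from functools import reduce
from operator import xor


def parity(data_disks):
    """Parity for each bit position: 1 where the count of 1s is odd."""
    return reduce(xor, data_disks, 0)


def reconstruct(surviving_disks, check_disk):
    """Recreate the one failed disk from the survivors and the check disk."""
    return reduce(xor, surviving_disks, check_disk)


def updated_parity(old_parity, old_data, new_data):
    """Flip exactly the parity bits whose data bit changed."""
    return old_parity ^ old_data ^ new_data


disks = [0b1011, 0b0110, 0b1100, 0b0001]   # four data disks
check = parity(disks)

# Disk 3 (index 2) fails; rebuild it from the other three plus parity.
rebuilt = reconstruct(disks[:2] + disks[3:], check)
print(rebuilt == disks[2])

# Write a new value to disk 2 (index 1) using only the old data, the new
# data and the old parity, without reading the other data disks.
new_value = 0b0111
check = updated_parity(check, disks[1], new_value)
disks[1] = new_value
print(check == parity(disks))
```

The naive write reads every data disk to recompute the parity. The better approach touches only the disk being written and the check disk.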

Introducing RAID • A RAID system consists of several disks organized to increase performance and improve reliability • Performance is improved through data striping • Reliability is improved through redundancy • RAID stands for Redundant Arrays of Independent Disks • There are several RAID schemes or levels • The levels differ in terms of their • Read and write performance, • Reliability, and • Cost

RAID Level 0 • All D disks are used to improve efficiency, and data is distributed using block striping • No redundant information is kept • Read and write performance is very good • But, reliability is poor • Unless data is regularly backed up a RAID 0 system should only be used when the data is not important • A RAID 0 system is the cheapest of all RAID levels • As there are no disks used for storing redundant data
