STORAGE AND I/O Jehan-François Pâris firstname.lastname@example.org
Chapter Organization • Availability and Reliability • Technology review • Solid-state storage devices • I/O Operations • Reliable Arrays of Inexpensive Disks
Reliability and Availability • Reliability • Probability R(t) that system will be up at time t if it was up at time t = 0 • Availability • Fraction of time the system is up • Reliability and availability do not measure the same thing!
Which matters? • It depends: • Reliability for real-time systems • Flight control • Process control, … • Availability for many other applications • DSL service • File server, web server, …
MTTF, MTTR and MTBF • MTTF is the mean time to failure • MTTR is the mean time to repair • 1/MTTF is the failure rate λ • MTBF, the mean time between failures, is MTBF = MTTF + MTTR
Reliability • As a first approximation, R(t) = exp(–t/MTTF) • Not true if the failure rate varies over time
Availability • Measured by MTTF/(MTTF + MTTR) = MTTF/MTBF • MTTR is very important • A good MTTR requires that we detect failures quickly
The nine notation • Availability is often expressed in "nines" • 99 percent is two nines • 99.9 percent is three nines • … • Formula is –log10(1 – A) • Example: –log10(1 – 0.999) = –log10(10⁻³) = 3
Example • A server crashes on average once a month • When this happens, it takes 12 hours to reboot it • What is the server availability?
Solution • MTBF = 30 days • MTTR = 12 hours = ½ day • MTTF = 29 ½ days • Availability is 29.5/30 = 98.3%
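A minimal Python sketch (not part of the original slides) that reproduces this computation and applies the "nines" formula from the previous slide:

```python
import math

mttf_days = 29.5                       # mean time to failure
mttr_days = 0.5                        # mean time to repair (12 hours)
mtbf_days = mttf_days + mttr_days      # MTBF = MTTF + MTTR = 30 days

availability = mttf_days / mtbf_days   # MTTF / MTBF
nines = -math.log10(1 - availability)  # -log10(1 - A)

print(f"Availability = {availability:.3f}")  # ~0.983, i.e. 98.3 %
print(f"Nines        = {nines:.2f}")         # ~1.8 nines
```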
Keep in mind • A 99 percent availability is not as great as we might think • One hour down every 100 hours • Fifteen minutes down every 24 hours
Example • A disk drive has a MTTF of 20 years. • What is the probability that the data it contains will not be lost over a period of five years?
Example • A disk farm contains 100 disks whose MTTF is 20 years. • What is the probability that no data will be lost over a period of five years?
Solution • The aggregate failure rate of the disk farm is 100 × 1/20 = 5 failures/year • The mean time to failure of the farm is 1/5 year • We apply the formula R(t) = exp(–t/MTTF) = exp(–5×5) = exp(–25) ≈ 1.4 × 10⁻¹¹ • Almost zero chance!
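A small Python sketch (an illustration, not from the slides) applying R(t) = exp(–t/MTTF) to both examples, the single 20-year-MTTF disk and the 100-disk farm:

```python
import math

def reliability(t_years, mttf_years):
    """Probability of surviving t years, assuming a constant failure rate."""
    return math.exp(-t_years / mttf_years)

# Single disk with MTTF = 20 years, over 5 years
print(reliability(5, 20))      # exp(-0.25) ~ 0.78

# Farm of 100 disks: aggregate failure rate is 100/20 = 5 failures/year,
# so the farm's MTTF is 1/5 year
print(reliability(5, 1 / 5))   # exp(-25) ~ 1.4e-11
```

The first call also answers the earlier single-disk example: the probability of keeping the data for five years is only about 78%.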
Disk drives • See previous chapter • Recall that the disk access time is the sum of • The disk seek time (to get to the right track) • The disk rotational latency • The actual transfer time
Flash drives • Widely used in flash drives, most MP3 players and some small portable computers • Similar technology to EEPROM • Two technologies: NOR and NAND
What about flash? • Widely used in flash drives, most MP3 players and some small portable computers • Several important limitations • Limited write bandwidth • Must erase a whole block of data before overwriting it • Limited endurance • 10,000 to 100,000 write cycles
Storage Class Memories • Solid-state storage • Non-volatile • Much faster than conventional disks • Numerous proposals: • Ferro-electric RAM (FRAM) • Magneto-resistive RAM (MRAM) • Phase-Change Memories (PCM)
Phase-Change Memories • No moving parts • (Figures: a data cell; crossbar organization)
Phase-Change Memories • Cells contain a chalcogenide material that has two states • Amorphous with high electrical resistivity • Crystalline with low electrical resistivity • Quickly cooling material from above fusion point leaves it in amorphous state • Slowly cooling material from above crystallization point leaves it in crystalline state
Projections • Target date 2012 • Access time 100 ns • Data rate 200–1000 MB/s • Write endurance 10⁹ write cycles • Read endurance no upper limit • Capacity 16 GB • Capacity growth > 40% per year • MTTF 10–50 million hours • Cost < $2/GB
Interesting Issues (I) • Disks will remain much cheaper than SCM for some time • Could use SCMs as an intermediate level between main memory and disks • (Diagram: Main memory → SCM → Disk)
A last comment • The technology is still experimental • Not clear when it will come to market • It might never come to market at all
Interesting Issues (II) • Rather narrow gap between SCM access times and main memory access times • Main memory and SCM will interact • As the L3 cache interacts with the main memory • Not as the main memory now interacts with the disk
Today’s Motivation • We use RAID today for • Increasing disk throughput by allowing parallel access • Eliminating the need to make disk backups • Disks are too big to be backed up in an efficient fashion
RAID LEVEL 0 • No replication • Advantages: • Simple to implement • No overhead • Disadvantage: • If the array has n disks, its failure rate is n times the failure rate of a single disk
RAID levels 0 and 1 • (Figure: a RAID level 0 array; a RAID level 1 array in which each disk has a mirror)
RAID LEVEL 1 • Mirroring: • Two copies of each disk block • Advantages: • Simple to implement • Fault-tolerant • Disadvantage: • Requires twice the disk capacity of normal file systems
RAID LEVEL 2 • Instead of duplicating the data blocks we use an error correction code • Very bad idea because disk drives either work correctly or do not work at all • Only possible errors are omission errors • We need an omission correction code • A parity bit is enough to correct a single omission
RAID levels 2 and 3 • (Figure: a RAID level 2 array with several check disks; a RAID level 3 array with a single parity disk)
RAID LEVEL 3 • Requires N+1 disk drives • N drives contain data (1/N of each data block) • Block b[k] now partitioned into N fragments b[k,1], b[k,2], ..., b[k,N] • Parity drive contains the exclusive or of these N fragments: p[k] = b[k,1] ⊕ b[k,2] ⊕ ... ⊕ b[k,N]
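A minimal Python sketch (an assumption, not code from the slides) of this parity computation over byte-string fragments:

```python
from functools import reduce

def parity(fragments):
    """XOR equal-size byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column)
                 for column in zip(*fragments))

# Block b[k] split into N = 4 fragments, one per data drive
fragments = [b"\x12\x34", b"\xab\xcd", b"\x0f\xf0", b"\x55\xaa"]
p = parity(fragments)   # p[k], stored on the parity drive
```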
How parity works • Truth table for XOR (same as parity): • 0 ⊕ 0 = 0 • 0 ⊕ 1 = 1 • 1 ⊕ 0 = 1 • 1 ⊕ 1 = 0
Recovering from a disk failure • Small RAID level 3 array with data disks D0 and D1 and parity disk P can tolerate failure of either D0 or D1
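A small sketch (illustrative values only) showing why the failed data disk can be rebuilt from the surviving data disk and the parity disk:

```python
def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

d0 = b"\x10\x20\x30"
d1 = b"\x0a\x0b\x0c"
p = xor_bytes(d0, d1)            # parity written when the data was stored

# Suppose D0 fails: rebuild its contents from D1 and P
recovered_d0 = xor_bytes(d1, p)
assert recovered_d0 == d0
```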
How RAID level 3 works (I) • Assume we have N + 1 disks • Each block is partitioned into N equal chunks (N = 4 in the example) • (Figure: a block split into four chunks)
How RAID level 3 works (II) • XOR the data chunks to compute the parity chunk • Each chunk is written to a separate disk • (Figure: four data chunks and the parity chunk, one per disk)
How RAID level 3 works (III) • Each read/write involves all disks in RAID array • Cannot do two or more reads/writes in parallel • Performance of array not better than that of a single disk
RAID LEVEL 4 (I) • Requires N+1 disk drives • N drives contain data • Individual blocks, not chunks • Blocks with the same disk address form a stripe • (Figure: a stripe of blocks, one per drive)
RAID LEVEL 4 (II) • Parity drive contains the exclusive or of the N blocks in the stripe: p[k] = b[k] ⊕ b[k+1] ⊕ ... ⊕ b[k+N-1] • Parity block now reflects contents of several blocks! • Can now do parallel reads/writes
RAID levels 4 and 5 • (Figure: a RAID level 4 array, where the single parity disk is a bottleneck; a RAID level 5 array with parity distributed over all disks)
RAID LEVEL 5 • The single parity drive of RAID level 4 is involved in every write • Will limit parallelism • RAID level 5 distributes the parity blocks among the N+1 drives • Much better
The small write problem • Specific to RAID 5 • Happens when we want to update a single block • Block belongs to a stripe • How can we compute the new value of the parity block p[k]? • (Figure: a stripe containing p[k], b[k], b[k+1], b[k+2], ...)
First solution • Read values of the N-1 other blocks in the stripe • Recompute p[k] = b[k] ⊕ b[k+1] ⊕ ... ⊕ b[k+N-1] • Solution requires • N-1 reads • 2 writes (new block and new parity block)
Second solution • Assume we want to update block b[m] • Read old values of b[m] and parity block p[k] • Compute new p[k] = new b[m] ⊕ old b[m] ⊕ old p[k] • Solution requires • 2 reads (old values of block and parity block) • 2 writes (new block and new parity block)
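A short Python sketch (illustrative, with assumed values) checking that the two small-write strategies produce the same new parity block:

```python
from functools import reduce

def xor(*blocks):
    """Byte-wise XOR of any number of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

stripe = [b"\x01", b"\x02", b"\x03"]   # b[k], b[k+1], b[k+2]
old_parity = xor(*stripe)

new_block = b"\x07"                    # new value of b[k+1]

# First solution: read the N-1 other blocks and recompute the parity
parity_v1 = xor(stripe[0], new_block, stripe[2])

# Second solution: read only the old block and the old parity
parity_v2 = xor(new_block, stripe[1], old_parity)

assert parity_v1 == parity_v2
```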
RAID level 6 (I) • Not part of the original proposal • Two check disks • Tolerates two disk failures • More complex updates
RAID level 6 (II) • Has become more popular as disks become • Bigger • More vulnerable to irrecoverable read errors • The most frequent cause of RAID level 5 array failures is • An irrecoverable read error occurring while the contents of a failed disk are being reconstituted
RAID level 6 (III) • Typical array size is 12 disks • Space overhead is 2/12 = 16.7% • Sole real issue is the cost of small writes • Three reads and three writes: • Read old value of block being updated, old parity block P, old parity block Q • Write new value of block being updated, new parity block P, new parity block Q
CONCLUSION (II) • Low cost of disk drives made RAID level 1 attractive for small installations • Otherwise pick • RAID level 5 for higher parallelism • RAID level 6 for higher protection • Can tolerate one disk failure and irrecoverable read errors