STORAGE AND I/O

STORAGE AND I/O Jehan-François Pâris jfparis@uh.edu

Chapter Organization • Availability and Reliability • Technology review • Solid-state storage devices • I/O Operations • Reliable Arrays of Inexpensive Disks

DEPENDABILITY

Reliability and Availability • Reliability • Probability R(t) that system will be up at time t if it was up at time t = 0 • Availability • Fraction of time the system is up • Reliability and availability do not measure the same thing!

Which matters? • It depends: • Reliability for real-time systems • Flight control • Process control, … • Availability for many other applications • DSL service • File server, web server, …

MTTF, MMTR and MTBF • MTTF is mean time to failure • MTTR is mean time to repair • 1/MTTF is failure rate l • MTTBF, the mean time between failures, is MTBF = MTTF + MTTR

Reliability • As a first approximationR(t) = exp(–t/MTTF) • Not true if failure rate varies over time

Availability • Measured by(MTTF)/(MTTF + MTTR) = MTTF/MTBF • MTTR is very important • A good MTTR requires that we detect quickly the failure

The nine notation • Availability is often expressed in "nines" • 99 percent is two nines • 99.9 percent is three nines • … • Formula is –log10 (1 – A) • Example:–log10 (1 – 0.999) = –log10 (10-3) = 3

Example • A server crashes on the average once a month • When this happens, it takes 12 hours to reboot it • What is the server availability?

Solution • MTBF = 30 days • MTTR = 12 hours = ½ day • MTTF = 29 ½ days • Availability is 29.5/30 =98.3 %

Keep in mind • A 99 percent availability is not as great as we might think • One hour down every 100 hours • Fifteen minutes down every 24 hours

Example • A disk drive has a MTTF of 20 years. • What is the probability that the data it contains will not be lost over a period of five years?

Example • A disk farm contains 100 disks whose MTTF is 20 years. • What is the probability that no data will be lost over a period of five years?

Solution • The aggregate failure rate of the disk farm is 100x1/20 =5 failures/year • The mean time to failure of the farm is 1/5 year • We apply the formula R(t) = exp(–t/MTTF) = -exp(–5×5) = 1.4 ×10-11 • Almost zero chance!

TECHNOLOGY OVERVIEW

Disk drives • See previous chapter • Recall that the disk access time is the sum of • The disk seek time (to get to the right track) • The disk rotational latency • The actual transfer time

Flash drives • Widely used in flash drives, most MP3 players and some small portable computers • Similar technology as EEPROM • Two technologies

What about flash? • Widely used in flash drives, most MP3 players and some small portable computers • Several important limitations • Limited write bandwidth • Must erase a whole block of data before overwriting them • Limited endurance • 10,000 to 100,000 write cycles

Storage Class Memories • Solid-state storage • Non-volatile • Much faster than conventional disks • Numerous proposals: • Ferro-electric RAM (FRAM) • Magneto-resistive RAM (MRAM) • Phase-Change Memories (PCM)

Phase-Change Memories No moving parts A data cell Crossbarorganization

Phase-Change Memories • Cells contain a chalcogenide material that has two states • Amorphous with high electrical resistivity • Crystalline with low electrical resistivity • Quickly cooling material from above fusion point leaves it in amorphous state • Slowly cooling material from above crystallization point leaves it in crystalline state

Projections • Target date 2012 • Access time 100 ns • Data Rate 200–1000 MB/s • Write Endurance 109 write cycles • Read Endurance no upper limit • Capacity 16 GB • Capacity growth > 40% per year • MTTF 10–50 million hours • Cost < $2/GB

Interesting Issues (I) • Disks will remain much cheaper than SCM for some time • Could use SCMs as intermediary level between main memory and disks Main memory SCM Disk

A last comment • The technology is still experimental • Not sure when it will come to the market • Might even never come to the market

Interesting Issues (II) • Rather narrow gap between SCM access times and main memory access times • Main memory and SCM will interact • As the L3 cache interact with the main memory • Not as the main memory now interacts with the disk

RAID Arrays

Today’s Motivation • We use RAID today for • Increasing disk throughput by allowing parallel access • Eliminating the need to make disk backups • Disks are too big to be backed up in an efficient fashion

RAID LEVEL 0 • No replication • Advantages: • Simple to implement • No overhead • Disadvantage: • If array has n disks failure rate is n times the failure rate of a single disk

RAID level 0 RAID levels 0 and 1 Mirrors RAID level 1

RAID LEVEL 1 • Mirroring: • Two copies of each disk block • Advantages: • Simple to implement • Fault-tolerant • Disadvantage: • Requires twice the disk capacity of normal file systems

RAID LEVEL 2 • Instead of duplicating the data blocks we use an error correction code • Very bad idea because disk drives either work correctly or do not work at all • Only possible errors are omission errors • We need an omission correction code • A parity bit is enough to correct a single omission

RAID levels 2 and 3 Check disks RAID level 2 Parity disk RAID level 3

RAID LEVEL 3 • Requires N+1 disk drives • N drives contain data (1/N of each data block) • Block b[k] now partitioned into N fragments b[k,1], b[k,2], ... b[k,N] • Parity drive contains exclusive or of these N fragments p[k] = b[k,1] b[k,2]  ...b[k,N]

How parity works? • Truth table for XOR (same as parity)

Recovering from a disk failure • Small RAID level 3 array with data disks D0 and D1 and parity disk P can tolerate failure of either D0 or D1

Block Chunk Chunk Chunk Chunk How RAID level 3 works (I) • Assume we have N + 1 disks • Each block is partitioned into N equal chunks N = 4 in example

    Parity How RAID level 3 works (II) • XOR data chunks to compute the parity chunk • Each chunk is written into a separate disk Parity

How RAID level 3 works (III) • Each read/write involves all disks in RAID array • Cannot do two or more reads/writes in parallel • Performance of array not better than that of a single disk

RAID LEVEL 4 (I) • Requires N+1 disk drives • N drives contain data • Individual blocks, not chunks • Blocks with same disk address form a stripe x x x x ?

RAID LEVEL 4 (II) • Parity drive contains exclusive or of theN blocks in stripe p[k] = b[k] b[k+1]  ...b[k+N-1] • Parity block now reflects contents of several blocks! • Can now do parallel reads/writes

RAID levels 4 and 5 Bottleneck RAID level 4 RAID level 5

RAID LEVEL 5 • Single parity drive of RAID level 4 is involved in every write • Will limit parallelism • RAID-5 distribute the parity blocks among the N+1 drives • Much better

The small write problem • Specific to RAID 5 • Happens when we want to update a single block • Block belongs to a stripe • How can we compute the new value of the parity block p[k] b[k] b[k+1] b[k+2] ...

First solution • Read values of N-1 other blocks in stripe • Recompute p[k] = b[k] b[k+1]  ...b[k+N-1] • Solution requires • N-1 reads • 2 writes (new block and new parity block)

Second solution • Assume we want to update block b[m] • Read old values of b[m] and parity block p[k] • Compute p[k] = new b[m]  old b[m]  old p[k] • Solution requires • 2 reads (old values of block and parity block) • 2 writes (new block and new parity block)

RAID level 6 (I) • Not part of the original proposal • Two check disks • Tolerates two disk failures • More complex updates

RAID level 6 (II) • Has become more popular as disks become • Bigger • More vulnerable to irrecoverable read errors • Most frequent cause for RAID level 5 array failures is • Irrecoverable read error occurring while contents of a failed disk are reconstituted

RAID level 6 (III) • Typical array size is 12 disks • Space overhead is 2/12 = 16.7 % • Sole real issue is cost of small writes • Three reads and three writes: • Read old value of block being updated,old parity block P, old party block Q • Write new value of block being updated, new parity block P, new party block Q

CONCLUSION (II) • Low cost of disk drives made RAID level 1 attractive for small installations • Otherwise pick • RAID level 5 for higher parallelism • RAID level 6 for higher protection • Can tolerate one disk failure and irrecoverable read errors

STORAGE AND I/O

STORAGE AND I/O

Presentation Transcript

Designing systems for continuous availability - multi-node with remote file storage

Digital Logic Design

Chapter 5 Data Storage Technology

Huawei Storage Innovations

Chapter 12: Mass-Storage Systems

Integrity Testing of Aboveground Storage Tanks

Chapter 11: Storage and File Structure

Hands-On Microsoft Windows Server 2008

Chapter 7 Storage Systems

Chapter 2

Intelligent Storage Systems

IBM Tivoli Storage Manager Operational Reporting Walk-Through Tivoli Storage Architecture

Chapter 10: Storage and File Structure

EMC Velocity SE Portfolio

Chapter 8. Storage Management and Indexing Techniques

Host and Storage System Environment

Chapter 11: Storage and File Structure

Smart Storage and Linux An EMC Perspective

ANSI/ISA S88.01 Batch Standard