
Design Tradeoffs for SSD Performance


Presentation Transcript


  1. Design Tradeoffs for SSD Performance Ted Wobber Principal Researcher Microsoft Research, Silicon Valley

  2. Rotating Disks vs. SSDs We have a good model of how rotating disks work… what about SSDs?

  3. Rotating Disks vs. SSDs: Main take-aways • Forget everything you knew about rotating disks. SSDs are different • SSDs are complex software systems • One size doesn’t fit all

  4. A Brief Introduction Microsoft Research – a focus on ideas and understanding

  5. Will SSDs Fix All Our Storage Problems? • Excellent read latency; sequential bandwidth • Lower $/IOPS/GB • Improved power consumption • No moving parts • Form factor, noise, … Performance surprises?

  6. Performance Surprises • Latency/bandwidth • “How fast can I read or write?” • Surprise: Random writes can be slow • Persistence • “How soon must I replace this device?” • Surprise: Flash blocks wear out

  7. What’s in This Talk • Introduction • Background on NAND flash, SSDs • Points of comparison with rotating disks • Write-in-place vs. write-logging • Moving parts vs. parallelism • Failure modes • Conclusion

  8. What’s *NOT* in This Talk • Windows • Analysis of specific SSDs • Cost • Power savings

  9. Full Disclosure • “Black box” study based on the properties of NAND flash • A trace-based simulation of an “idealized” SSD • Workloads • TPC-C • Exchange • Postmark • IOzone

  10. Background: NAND flash blocks • A flash block is a grid of cells: 64 page lines × (4096 + 128) bit-lines • Erase: Quantum release for all cells • Program: Quantum injection for some cells • Read: NAND operation with a page selected • Can’t reset bits to 1 except with erase
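A minimal sketch of the program/erase asymmetry described above (illustrative Python, not from the talk): programming can only clear bits, and only erasing the whole block returns cells to 1.

```python
# Illustrative sketch of NAND program/erase semantics (not from the talk).
# Sizes follow the slide: 64 page lines, 4096-byte data pages.

PAGE_SIZE = 4096
PAGES_PER_BLOCK = 64

class FlashBlock:
    def __init__(self):
        self.erase()

    def erase(self):
        # Erase releases charge in every cell: all bits go back to 1.
        self.pages = [bytearray(b"\xff" * PAGE_SIZE) for _ in range(PAGES_PER_BLOCK)]

    def program(self, page_no, data):
        # Programming injects charge, i.e. it can only clear bits (1 -> 0).
        page = self.pages[page_no]
        for i, byte in enumerate(data):
            if byte & ~page[i] & 0xFF:
                raise ValueError("cannot flip a 0 back to 1 without erasing the block")
            page[i] &= byte

    def read(self, page_no):
        return bytes(self.pages[page_no])
```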

  11. Background: 4GB flash package (SLC) [Figure: block diagram of a 4GB SLC flash package: two dies (die 0, die 1), each with four planes (0–3), per-plane registers, and a serial output] • MLC (multiple bits per cell): slower, less durable

  12. Background: SSD Structure • Flash Translation Layer (proprietary firmware) [Figure: simplified block diagram of an SSD]

  13. Write-in-place vs. Logging (What latency can I expect?)

  14. Write-in-Place vs. Logging • Rotating disks • Constant map from LBA to on-disk location • SSDs • Writes always to new locations • Superseded blocks cleaned later

  15. Log-based Writes: Map granularity = 1 block [Figure: block-granularity LBA map; writing page P relocates the whole flash block containing P] • Pages are moved – read-modify-write (in foreground): Write Amplification

  16. Log-based Writes: Map granularity = 1 page [Figure: page-granularity map; each write goes to a fresh page, superseding the old copy] • Blocks must be cleaned (in background): Write Amplification
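The page-granularity write path can be made concrete with a small sketch (class and field names are illustrative, not the firmware described in the talk): every write is appended to a fresh page, and the old copy is merely marked superseded for later cleaning.

```python
# Hedged sketch of a page-granularity, log-structured write path.

class PageMapFTL:
    def __init__(self, num_blocks, pages_per_block):
        self.pages_per_block = pages_per_block
        self.lba_to_page = {}                        # LBA -> (block, page)
        self.valid = [[False] * pages_per_block for _ in range(num_blocks)]
        self.free_blocks = list(range(num_blocks))
        self.active_block = self.free_blocks.pop()
        self.next_page = 0

    def write(self, lba):
        # Supersede any old copy; it stays on flash until cleaning erases it.
        old = self.lba_to_page.get(lba)
        if old is not None:
            b, p = old
            self.valid[b][p] = False
        # Append at the log head; open a fresh block when the current one fills.
        if self.next_page == self.pages_per_block:
            self.active_block = self.free_blocks.pop()   # cleaning refills this list
            self.next_page = 0
        self.lba_to_page[lba] = (self.active_block, self.next_page)
        self.valid[self.active_block][self.next_page] = True
        self.next_page += 1
```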

  17. Log-based Writes: Simple simulation result • Map granularity = flash block (256KB) • TPC-C average I/O latency = 20 ms • Map granularity = flash page (4KB) • TPC-C average I/O latency = 0.2 ms

  18. Log-based Writes: Block cleaning • Move valid pages so the block can be erased • Cleaning efficiency: Choose blocks to minimize page movement [Figure: valid pages relocated out of a candidate block before it is erased]
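A hedged sketch of greedy cleaning on top of the PageMapFTL sketch above (the victim-selection rule here is the textbook "fewest valid pages" heuristic, not necessarily the simulator's): pick the cheapest block, relocate its valid pages, then erase it.

```python
# Illustrative greedy cleaner for the PageMapFTL sketch above.
# Pages moved here are background write amplification.

def clean_one_block(ftl):
    candidates = [b for b in range(len(ftl.valid))
                  if b != ftl.active_block and b not in ftl.free_blocks]
    # Best cleaning efficiency: the block with the fewest valid pages to move.
    victim = min(candidates, key=lambda b: sum(ftl.valid[b]))
    moved = [lba for lba, (b, _) in ftl.lba_to_page.items() if b == victim]
    for lba in moved:
        ftl.write(lba)                    # relocate the valid page to the log head
    ftl.valid[victim] = [False] * ftl.pages_per_block    # erase the block
    ftl.free_blocks.append(victim)
    return len(moved)                     # cleaning cost in page moves
```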

  19. Over-provisioning: Putting off the work • Keep extra (unadvertised) blocks • Reduces “pressure” for cleaning • Improves foreground latency • Reduces write-amplification due to cleaning

  20. Delete Notification: Avoiding the work • SSD doesn’t know what LBAs are in use • Logical disk is always full! • If the SSD can know which pages are unused, these can be treated as “superseded” • Better cleaning efficiency • De-facto over-provisioning • “Trim” API: an important step forward
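With the page map sketched earlier, a delete notification is cheap to honor (hedged sketch; the real Trim command semantics are richer): the FTL drops the mapping and marks the page superseded, so cleaning never has to move it.

```python
# Illustrative trim handling on top of the PageMapFTL sketch above.
def trim(ftl, lba):
    entry = ftl.lba_to_page.pop(lba, None)
    if entry is not None:
        b, p = entry
        ftl.valid[b][p] = False      # de-facto over-provisioning: one more reclaimable page
```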

  21. Delete Notification: Cleaning Efficiency • Postmark trace • One-third as many pages moved • Cleaning efficiency improved by a factor of 3 • Block lifetime improved

  22. LBA Map Tradeoffs • Large granularity • Simple; small map size • Low overhead for sequential write workload • Foreground write amplification (R-M-W) • Fine granularity • Complex; large map size • Can tolerate random write workload • Background write amplification (cleaning)
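To put "small map" vs. "large map" in numbers, here is a back-of-the-envelope count for a hypothetical 64 GB SSD (the capacity is an assumption; the 256 KB block and 4 KB page sizes are from the talk):

```python
capacity = 64 * 2**30                      # hypothetical 64 GiB of flash

block_entries = capacity // (256 * 2**10)  # 262,144 entries at block granularity
page_entries  = capacity // (4 * 2**10)    # 16,777,216 entries at page granularity

print(block_entries, page_entries)         # the page-level map is 64x larger
```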

  23. Write-in-place vs. Logging: Summary • Rotating disks • Constant map from LBA to on-disk location • SSDs • Dynamic LBA map • Various possible strategies • Best strategy deeply workload-dependent

  24. Moving Parts vs. Parallelism (How many IOPS can I get?)

  25. Moving Parts vs. Parallelism • Rotating disks • Minimize seek time and impact of rotational delay • SSDs • Maximize number of operations in flight • Keep chip interconnect manageable

  26. Improving IOPS: Strategies • Request-queue sort by sector address • Defragmentation • Application-level block ordering • Rotating disks: one request at a time per disk head • SSDs: null seek time • Defragmentation for cleaning efficiency is unproven: the next write might re-fragment

  27. Flash Chip Bandwidth • Serial interface is the performance bottleneck • Reads constrained by the serial bus • 25ns/byte = 40 MB/s (not so great) [Figure: two dies, each plane with a register, sharing an 8-bit serial bus]
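The 40 MB/s figure falls straight out of the 25 ns/byte interface:

```python
bytes_per_second = 1 / 25e-9      # one byte every 25 ns over the serial pins
print(bytes_per_second / 1e6)     # 40.0 MB/s over one chip's serial interface
```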

  28. SSD Parallelism: Strategies • Striping • Multiple “channels” to host • Background cleaning • Operation interleaving • Ganging of flash chips

  29. Striping • LBAs striped across flash packages • Single request can span multiple chips • Natural load balancing • What’s the right stripe size? [Figure: LBAs 0–47 striped round-robin across 8 flash packages behind the controller]
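A hedged sketch of the layout in the figure (assuming a stripe unit of one page across 8 packages): LBA n lands on package n % 8 at offset n // 8.

```python
NUM_PACKAGES = 8            # as in the figure; real controllers vary

def place(lba):
    # Round-robin striping with a one-page stripe unit.
    return lba % NUM_PACKAGES, lba // NUM_PACKAGES   # (package, offset within package)

# e.g. place(0) -> (0, 0), place(9) -> (1, 1), place(47) -> (7, 5)
```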

  30. Operations in Parallel • SSDs are akin to RAID controllers • Multiple onboard parallel elements • Multiple request streams are needed to achieve maximal bandwidth • Cleaning on inactive flash elements • Non-trivial scheduling issues • Much like “Log-Structured File System”, but at a lower level of the storage stack

  31. Interleaving • Concurrent ops on a package or die • E.g., register-to-flash “program” on die 0 concurrent with serial line transfer on die 1 • 25% extra throughput on reads, 100% on writes • Erase is slow, can be concurrent with other ops
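A rough check of where the 25% and 100% figures could come from, assuming typical SLC timings (about 25 µs array read, 200 µs program, and roughly 100 µs to move a 4 KB page over the 40 MB/s serial bus); these per-page timings are assumptions, not quoted from the talk.

```python
read_us, program_us, xfer_us = 25, 200, 100   # assumed per-page timings

# Reads: one die pays read + transfer per page; with two dies interleaved,
# the shared bus (or half the per-die cycle, whichever is larger) sets the pace.
one_die_read = read_us + xfer_us
two_die_read = max(xfer_us, one_die_read / 2)
print(one_die_read / two_die_read - 1)        # 0.25 -> ~25% extra read throughput

# Writes: one die pays transfer + program; with two dies the programs overlap.
one_die_write = xfer_us + program_us
two_die_write = max(xfer_us, one_die_write / 2)
print(one_die_write / two_die_write - 1)      # 1.0 -> ~100% extra write throughput
```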

  32. Interleaving: Simulation • TPC-C and Exchange • No queuing, no benefit • IOzone and Postmark • Sequential I/O component results in queuing • Increased throughput

  33. Intra-plane Copy-back • Block-to-block transfer internal to the chip • But only within the same plane! • Cleaning on-chip! • Optimizing for this can hurt load balance • Conflicts with striping • But data needn’t cross serial I/O pins

  34. Cleaning with Copy-back: Simulation • Copy-back operation for intra-plane transfer • TPC-C shows 40% improvement in cleaning costs • No benefit for IOzone and Postmark (already at perfect cleaning efficiency)

  35. Ganging • Optimally, all flash chips are independent • In practice, too many wires! • Flash packages can share a control bus, with or without separate data channels • Operations in lock-step or coordinated [Figure: shared-control gang vs. shared-bus gang]

  36. Shared-bus Gang: Simulation • Scaling capacity without scaling pin-density • Workload (Exchange) requires 900 IOPS • A 16-gang is fast enough

  37. Parallelism Tradeoffs • No one scheme is optimal for all workloads • With a faster serial connect, intra-chip ops become less important

  38. Moving Parts vs. Parallelism: Summary • Rotating disks • Seek, rotational optimization • Built-in assumptions everywhere • SSDs • Operations in parallel are key • Lots of opportunities for parallelism, but with tradeoffs

  39. Failure Modes (When will it wear out?)

  40. Failure Modes: Rotating disks • Media imperfections, loose particles, vibration • Latent sector errors [Bairavasundaram 07] • E.g., with uncorrectable ECC • Frequency of affected disks increases linearly with time • Most affected disks (80%) have < 50 errors • Temporal and spatial locality • Correlation with recovered errors • Disk scrubbing helps

  41. Failure Modes: SSDs • Types of NAND flash errors (mostly when erases > wear limit) • Write errors: Probability varies with # of erasures • Read disturb: Increases with # of reads • Data retention errors: Charge leaks over time • Little spatial or temporal locality (within equally worn blocks) • Better ECC can help • Errors increase with wear: Need wear-leveling

  42. Wear-leveling: Motivation • Example: 25% over-provisioning to enhance foreground performance

  43. Wear-leveling: Motivation • Prematurely worn blocks = reduced over-provisioning = poorer performance

  44. Wear-leveling: Motivation • Over-provisioning budget consumed: writes no longer possible! • Must ensure even wear

  45. Wear-leveling: Modified "greedy" algorithm [Figure: expiry meter for block A; cold content from block B migrated into block A during cleaning] • If Remaining(A) < Throttle-Threshold, reduce probability of cleaning A • If Remaining(A) < Migrate-Threshold, clean A, but migrate cold data into A • If Remaining(A) >= Migrate-Threshold, clean A
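A hedged sketch of the rule above; the threshold ordering (Migrate-Threshold < Throttle-Threshold), the concrete numbers, and the throttling probability are assumptions, not stated in the talk.

```python
import random

MIGRATE_THRESHOLD = 128      # assumed values; Remaining(A) is the erase budget left
THROTTLE_THRESHOLD = 512
THROTTLE_PROBABILITY = 0.1   # how often a throttled block may still be cleaned

def cleaning_decision(remaining):
    """Return (clean_now, migrate_cold_data) for candidate victim block A."""
    if remaining < MIGRATE_THRESHOLD:
        return True, True                         # clean A, then refill it with cold content
    if remaining < THROTTLE_THRESHOLD:
        return random.random() < THROTTLE_PROBABILITY, False   # rate-limit cleaning of A
    return True, False                            # healthy block: clean normally
```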

  46. Wear-leveling: Results • Fewer blocks reach expiry with rate-limiting • Smaller standard deviation of remaining lifetimes with cold-content migration • Cost of migrating cold pages (~5% avg. latency) [Figure: block wear in IOzone]

  47. Failure Modes: Summary • Rotating disks • Reduce media tolerances • Scrubbing to deal with latent sector errors • SSDs • Better ECC • Wear-leveling is critical • Greater density → more errors?

  48. Rotating Disks vs. SSDs • Don’t think of an SSD as just a faster rotating disk • Complex firmware/hardware system with substantial tradeoffs [Figure: rotating disk ≠ SSD]

  49. SSD Design Tradeoffs • Write amplification → more wear
