
Challenges in Getting Flash Drives Closer to CPU



Presentation Transcript


  1. Challenges in Getting Flash Drives Closer to CPU Myoungsoo Jung (UT-Dallas) Mahmut Kandemir (PSU) The University of Texas at Dallas

  2. Take-away
  • Leveraging the PCIe bus as a storage interface ≠ conventional memory-system interconnects ≠ thin storage interfaces; it requires a new SSD architecture and storage stack
  • Motivation: few studies focus on the system characteristics of these emerging PCIe SSD platforms
  • Contributions: we quantitatively analyze the challenges PCIe SSDs face in getting flash memory closer to the CPU:
    • Memory consumption
    • Computation resource requirements
    • Performance as a shared storage system
    • Latency impact of their storage-level queuing mechanisms

  3. Bandwidth Trend • Bandwidth improvement (150MB/s ~ 600MB/s)

  4. Bandwidth Trend
  • SSDs have improved their bandwidth by 4x
  • SSDs begin to blur the distinction between block-access and memory-access semantic devices

  5. Flash Storage Migration
  [Figure: multiple CPU cores on one side, multiple flash devices on the other, with the storage interface as the bottleneck between them]
  • Interface bottleneck: take SSDs out of the I/O controller hub and locate them as close to the CPU side as possible
  • The PCIe interface is by far one of the easiest ways to integrate flash memory into the processor-memory complex

  6. Flash Integration • Bridge-based PCIe SSD (BSSD) • From-scratch PCIe SSD (FSSD)

  7. Bridge-based PCIe SSD (BSSD)
  • Multiple traditional SAS/SATA SSD controllers sit behind a bridge controller, which exposes their aggregated SAS/SATA SSD performance over PCIe
  (RC = Root Complex, CTRL = Controller, EP = Endpoint, HBA = Host Bus Adapter)

  8. Bridge-based PCIe SSD (BSSD)
  • Pros: high compatibility, fast development process
  • Cons: redundant control logic, computational overheads, protocol encoding/decoding overheads

  9. From-scratch PCIe SSD (FSSD)
  • FSSD is built from the bottom up by directly interconnecting the NAND flash interface and the external PCIe link through a point-to-point PCIe link network
  • PCIe endpoints (EPs) have upstream and downstream buffers, which control in-bound and out-bound I/O requests
  • The PCIe EPs and switch are implemented as native PCIe controllers
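  A minimal sketch of the per-endpoint buffering described above: one fixed-size ring per direction, with back-pressure when a ring fills. The structure names and the depth of 64 are assumptions for illustration, not details of the FSSD design.

    /* Sketch of upstream/downstream buffering at a PCIe endpoint (assumed layout). */
    #include <stdint.h>

    #define RING_DEPTH 64

    struct pcie_ring {
        uint64_t slots[RING_DEPTH];   /* request descriptors               */
        unsigned head, tail;          /* producer / consumer indices       */
    };

    struct pcie_endpoint {
        struct pcie_ring downstream;  /* in-bound: root complex -> flash   */
        struct pcie_ring upstream;    /* out-bound: flash -> root complex  */
    };

    /* Push a descriptor; returns -1 when the ring is full (back-pressure). */
    static int ring_push(struct pcie_ring *r, uint64_t desc)
    {
        unsigned next = (r->head + 1) % RING_DEPTH;
        if (next == r->tail)
            return -1;
        r->slots[r->head] = desc;
        r->head = next;
        return 0;
    }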

  10. From-scratch PCIe SSD (FSSD)
  • Pros: highly scalable, exposes the raw flash performance
  • Cons: protocol design/implementation effort, tailoring of SW/HW, competition for host resources

  11. Flash Software Stack
  • Host: file system / database → host block storage layer → HBA device driver, exposing a logical block I/O interface
  • Storage: host interface layer (NVMHC) → flash software (FTL: buffer cache, address mapping, wear-leveling) → hardware abstraction layer
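  As a rough illustration of the FTL components named above (buffer cache, address mapping, wear-leveling), the toy write path below shows how a logical page write touches each of them; all sizes, names, and policies are invented and are not the FTL used in the talk.

    /* Toy FTL write path: buffer cache, out-of-place address mapping, and
     * wear-leveling statistics. Everything here is illustrative only. */
    #include <stdint.h>
    #include <string.h>

    #define PAGES      1024
    #define PAGE_BYTES 4096

    static uint32_t l2p[PAGES];               /* logical -> physical page map       */
    static uint32_t prog_cnt[PAGES];          /* statistics fed to the wear-leveler */
    static uint8_t  cache[PAGES][PAGE_BYTES]; /* write-back buffer cache            */
    static uint32_t next_free;                /* toy free-page allocator            */

    void ftl_write(uint32_t lpn, const void *buf)
    {
        if (lpn >= PAGES)
            return;                           /* out of range for this toy device   */
        memcpy(cache[lpn], buf, PAGE_BYTES);  /* 1. stage the data in the cache     */
        uint32_t ppn = next_free++ % PAGES;   /* 2. out-of-place allocation         */
        l2p[lpn] = ppn;                       /*    update logical-to-physical map  */
        prog_cnt[ppn]++;                      /* 3. wear-leveling bookkeeping       */
        /* the hardware abstraction layer would program flash page ppn here */
    }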

  12. Experimental Setup
  • Host configuration: quad-core i7 Sandy Bridge, 3.4 GHz; 16 GB memory (4 x 4 GB DDR3-1333 DIMMs); an extra external HDD for logging the footprints
  • Most performance values observed with FSSD are about 40% better than with BSSD

  13. Tools
  • Synthesized micro-benchmark workloads with Iometer
  • Modified Iometer:
    • Time-series evaluation: a script that generates log data every second
    • Memory-usage evaluation: a module added to Iometer that calls the system API GlobalMemoryStatusEx()
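  A minimal sketch of what such a memory-logging module could look like, assuming a Windows host; the slides only state that GlobalMemoryStatusEx() is called from Iometer, so the function name, sampling loop, and CSV format below are our assumptions.

    /* Hypothetical per-second memory logger along the lines of the module
     * added to Iometer; only the use of GlobalMemoryStatusEx() comes from
     * the slide. */
    #include <windows.h>
    #include <stdio.h>

    static void log_memory_usage(FILE *log, int seconds)
    {
        for (int i = 0; i < seconds; i++) {
            MEMORYSTATUSEX ms;
            ms.dwLength = sizeof(ms);          /* must be set before the call */
            if (GlobalMemoryStatusEx(&ms)) {
                /* system-wide physical memory in use, in GB */
                double used_gb =
                    (double)(ms.ullTotalPhys - ms.ullAvailPhys) / (1ULL << 30);
                fprintf(log, "%d,%.2f\n", i, used_gb);
                fflush(log);                   /* keep the log on the extra HDD current */
            }
            Sleep(1000);                       /* one sample per second */
        }
    }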

  14. Memory Usage (Overall)
  • Physical memory consumption for request sizes of 1 ~ 512 sectors, measured separately for writes and reads
  • BSSD stays at about 0.6 GB in both cases
  • FSSD consumes 2.5x more memory space for one access type and 3x~16x more for the other

  15. Memory Usage (BSSD)
  • Memory consumption: the host submits I/Os whenever the device is available, through a queue of 128 entries
  • BSSD requires only 0.6 GB of memory space regardless of the I/O type and size
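  A sketch of why a fixed 128-entry queue keeps host memory flat: buffers exist only for in-flight entries, so consumption is capped by the queue depth. QUEUE_DEPTH, REQ_BYTES, and struct io_req are our illustrative names, not the BSSD driver's actual structures.

    /* Bounded submission: memory never exceeds QUEUE_DEPTH * REQ_BYTES (~32 MB here). */
    #include <stdlib.h>

    #define QUEUE_DEPTH 128
    #define REQ_BYTES   (512 * 512)          /* up to 512 sectors of 512 B each */

    struct io_req {
        int   in_use;
        char *buf;                           /* data buffer for this slot */
    };

    static struct io_req queue[QUEUE_DEPTH];

    /* Submit whenever a slot is free; otherwise the caller must wait. */
    static struct io_req *try_submit(void)
    {
        for (int i = 0; i < QUEUE_DEPTH; i++) {
            if (!queue[i].in_use) {
                if (!queue[i].buf)
                    queue[i].buf = malloc(REQ_BYTES);
                queue[i].in_use = 1;
                return &queue[i];
            }
        }
        return NULL;                         /* all 128 entries are outstanding */
    }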

  16. Memory Usage (FSSD)
  • FSSD already requires about 2 GB of memory; as the I/O process progresses, memory usage keeps increasing in a logarithmic fashion and reaches 10 GB
  • Using 10 GB of memory to manage only the underlying SSD may not be acceptable in many applications

  17. CPU Usage (BSSD)
  • Host-level CPU usage, time series
  • BSSD consumes 15%~30% of the total CPU cycles for handling I/O requests

  18. CPU Usage (FSSD)
  • FSSD requires much higher CPU usage (50%~90%), with up to 60% of the cycles spent on the host-side CPU
  • I/O service with queue-mode operation requires 50% more CPU cycles
  • CPU usage over 60% just for I/O processing can degrade overall system performance
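  One way to reproduce host-level CPU usage figures like these on a Windows host is sketched below; the slides do not say which API was used for the time series, so GetSystemTimes() here is our assumption.

    /* Sample system-wide CPU utilization over ~1 second using GetSystemTimes(). */
    #include <windows.h>

    static ULONGLONG ft2ull(FILETIME ft)
    {
        return ((ULONGLONG)ft.dwHighDateTime << 32) | ft.dwLowDateTime;
    }

    static double cpu_usage_percent(void)
    {
        FILETIME i0, k0, u0, i1, k1, u1;
        GetSystemTimes(&i0, &k0, &u0);
        Sleep(1000);
        GetSystemTimes(&i1, &k1, &u1);

        ULONGLONG idle = ft2ull(i1) - ft2ull(i0);
        /* kernel time includes idle time, so subtract it to get busy cycles */
        ULONGLONG busy = (ft2ull(k1) - ft2ull(k0)) + (ft2ull(u1) - ft2ull(u0)) - idle;
        return 100.0 * (double)busy / (double)(busy + idle);
    }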

  19. FSSD performance (multi-threads)
  • Latency: worse than four workers by 118%, worse than a single worker by 289%
  • Throughput: 2.2x better than a single worker
  • FSSD offers very stable and predictable performance

  20. FSSD resource usages (multi-threads)
  • Memory consumption: requires 134% more memory space
  • CPU usage: requires 201% more computation resources
  • The multi-thread advantage decreases because of the high memory requirement and CPU usage

  21. BSSD resource usages (multi-threads)
  • Memory consumption: similar memory requirements (less than 0.66 GB) irrespective of the number of threads
  • CPU usage: similar CPU usage (less than 30%) irrespective of the number of threads

  22. BSSD performance (multi-threads)
  • Latency: worse than four workers by 289%, worse than a single worker by 708%
  • Throughput: no differences with varying numbers of workers; a write cliff occurs (garbage collection impact)

  23. Latency Impact on a Queuing Method
  • For both FSSD and BSSD, requests served through the storage-level queuing mechanism show far higher latency than a legacy request: 106x, 99x, 86x, and 184x worse across the measured cases
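  A back-of-the-envelope model of why a request issued through a deep storage-level queue can see roughly two orders of magnitude higher latency than a legacy (queue-depth-1) request: it waits behind every earlier entry. The service time and queue depth below are invented numbers, not measurements from the talk.

    /* Queued vs. legacy request latency under a simple FIFO service model. */
    #include <stdio.h>

    int main(void)
    {
        double service_us = 80.0;   /* assumed per-request device service time */
        int    depth      = 128;    /* assumed storage-level queue depth       */

        double legacy = service_us;           /* nothing queued ahead of it    */
        double queued = depth * service_us;   /* waits for depth-1, then runs  */

        printf("legacy: %.0f us, queued worst case: %.0f us (%.0fx)\n",
               legacy, queued, queued / legacy);
        return 0;
    }

  With these assumed numbers the ratio comes out to 128x, the same order of magnitude as the 86x~184x slowdowns reported above.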

  24. Summary
  • Design trade-off between performance and resource utilization
    • All-flash arrays
    • Data-center/HPC local-node SSDs
  • Software stack optimization
    • Co-operative approaches
    • Unified/direct file systems
    • Garbage collection schedulers
    • Queue control
  • We are constructing an environment for automated SSD evaluation at camelab.org
