
Triple-A: A Non-SSD Based Autonomic All-Flash Array for High Performance Storage Systems


Presentation Transcript


  1. Triple-A: A Non-SSD Based Autonomic All-Flash Array for High Performance Storage Systems Myoungsoo Jung (UT-Dallas), Wonil Choi (UT-Dallas), John Shalf (LBNL), Mahmut Kandemir (PSU)

  2. Executive Summary • Challenge: SSD arrays might not be suitable for high-performance computing storage • Our goal: propose a new high-performance storage architecture • Observations • High maintenance cost: caused by worn-out flash-SSD replacements • Performance degradation: caused by shared-resource contention • Key Ideas • Cost reduction: take the bare NAND flash out of the SSD box • Contention resolution: distribute the excessive I/Os that generate bottlenecks • Triple-A: a new architecture for HPC storage • Consists of non-SSD bare flash memories • Automatically detects and resolves performance bottlenecks • Results: non-SSD all-flash arrays are expected to save 35~50% of cost and offer 53% higher throughput than a traditional SSD array

  3. Outline • Motivations • Triple-A Architecture • Triple-A Management • Evaluations • Conclusions

  4. HPC Starts to Employ SSDs • SSD arrays are in position to (partially) replace HDD arrays • [Figure: SSD buffers on compute nodes, SSD caches on HDD arrays, and pure SSD arrays in an HPC deployment]

  5. High-Cost Maintenance of SSD Arrays • As time goes by, worn-out SSDs must be replaced • A thrown-away SSD has complex internals • The other parts are still useful; only the flash memories are actually worn out • [Figure: worn-out SSDs in the array are replaced and discarded]

  6. I/O Services Suffer in SSD Arrays • Varying data locality in an array consisting of 80 SSDs • A hot region is a group of SSDs holding 10% of the total data • Arrays without a hot region show reasonable latency • As the number of hot regions increases, the performance of SSD arrays degrades

  7. Why Is Latency Delayed? Link Contention • A single data bus is shared by a group of SSDs • When the target SSD is ready and the shared bus is idle, the I/O request is serviced right away • Problems arise when excessive I/Os are destined to a specific group of SSDs • [Figure: eight SSDs on a shared bus, with incoming requests tagged by their destination SSD]

  8. Why Is Latency Delayed? Link Contention • When the shared bus is busy, I/O requests must stay in the buffer even though the target SSD is ready • This stall occurs because the SSDs in a group share a data bus  link contention • [Figure: requests stall while the shared bus is busy, even though their target SSDs are ready]

  9. Why Is Latency Delayed? Storage Contention • Problems also arise when excessive I/Os are destined to a specific SSD • [Figure: many requests destined to SSD-8 while the other SSDs sit ready]

  10. Why Is Latency Delayed? Storage Contention • When excessive I/Os are destined to a specific SSD and that SSD is busy, I/O requests must stay in the buffer even though the link is available • This stall occurs because a specific SSD is continuously busy  storage contention • [Figure: requests to the busy SSD-8 stall while the link stays idle]
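The stall scenarios on slides 7-10 boil down to a simple two-resource check. The sketch below is not from the paper; the `Device`, `SharedLink`, and `classify_stall` names are ours for illustration. It shows how a queued request's stall can be attributed either to the shared data bus (link contention) or to the target device itself (storage contention).

```python
from dataclasses import dataclass

@dataclass
class Device:
    busy: bool          # is the target SSD/FIMM currently servicing an I/O?

@dataclass
class SharedLink:
    busy: bool          # is the data bus shared by the group currently in use?

def classify_stall(target: Device, link: SharedLink) -> str:
    """Classify why a queued request cannot be issued right now.

    Link contention:    the target is ready, but the shared bus is occupied.
    Storage contention: the bus is free, but the target itself is busy.
    """
    if not target.busy and not link.busy:
        return "issue now"            # both resources free: no stall
    if not target.busy and link.busy:
        return "link contention"      # stalled only by the shared data bus
    if target.busy and not link.busy:
        return "storage contention"   # stalled only by the hot device
    return "both contended"

# Excessive I/Os to one group keep the bus busy (slide 8);
# excessive I/Os to one SSD keep that device busy (slide 10).
print(classify_stall(Device(busy=False), SharedLink(busy=True)))   # link contention
print(classify_stall(Device(busy=True),  SharedLink(busy=False)))  # storage contention
```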

  11. Outline • Motivations • Triple-A Architecture • Triple-A Management • Evaluations • Conclusions

  12. Unboxing the SSD for Cost Reduction • Worn-out flash packages should be replaced • Much of the logic in an SSD, including H/W controllers and firmware, is wasted when a worn-out SSD is replaced • Instead of a whole SSD, let's use only bare flash packages • [Figure: the SSD internals, where the host interface controller, flash controllers, microprocessors, DRAM buffers, and firmware (35~50% of total SSD cost) are still useful and reusable, while the bare NAND flash packages are the parts that wear out and are replaced]

  13. Use of Unboxed Flash Packages: FIMM • Multiple NAND flash packages integrated on a board • Looks like a passive memory device, such as a DIMM • Referred to as a Flash Inline Memory Module (FIMM) • Control signals and pin assignments follow the NV-DDR2 interface defined by ONFi • For convenient replacement of worn-out FIMMs, a FIMM has a hot-swappable connector • [Figure: a FIMM board populated with flash packages]
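As a rough illustration of how a FIMM could be modeled in software, the sketch below treats it as a passive board of bare NAND packages behind a hot-swappable ONFi NV-DDR2 slot. The `FIMM` and `FlashPackage` classes and the 8 GB package size are assumptions, not details from the paper.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class FlashPackage:
    capacity_gb: int
    erase_count: int = 0      # tracks wear on the bare NAND package

@dataclass
class FIMM:
    """Flash Inline Memory Module: bare NAND packages on a DIMM-like board.

    The board is passive (no SSD firmware or controller); it is addressed
    through an ONFi NV-DDR2 slot and can be hot-swapped when worn out.
    """
    slot_interface: str = "ONFi NV-DDR2"
    hot_swappable: bool = True
    packages: List[FlashPackage] = field(default_factory=list)

    @property
    def capacity_gb(self) -> int:
        return sum(p.capacity_gb for p in self.packages)

# A 64 GB FIMM built from eight 8 GB packages (illustrative sizes only).
fimm = FIMM(packages=[FlashPackage(capacity_gb=8) for _ in range(8)])
print(fimm.capacity_gb)   # 64
```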

  14. How Are FIMMs Connected? • PCI-E technology provides a high-performance interconnect • Root complex – the component where I/O starts • Switch – middle-layer components • Endpoint – where FIMMs are directly attached • Link – the bus connecting components • [Figure: a PCI-E tree from the HPC host's root complex through switches to endpoints, each with attached FIMMs]

  15. Connection between FIMMs and PCI-E • A PCI-E endpoint is where the "PCI-E fabric" and the "FIMMs" meet • Front end: PCI-E protocol for the PCI-E fabric • Back end: ONFi NV-DDR2 interface for the FIMMs • An endpoint consists of three parts • PCI-E device layers: handle the PCI-E interface • Control logic: handles the FIMMs over the ONFi interface • Upstream/downstream buffers: manage traffic in both directions

  16. Connection between FIMMs and PCI-E • Communication example • (1) A PCI-E packet arrives at the target endpoint • (2) The PCI-E device layers disassemble the packet • (3) The disassembled packet is enqueued into the downstream buffer • (4) The HAL dequeues the packet and constructs a NAND flash command • Hot-swappable connector for FIMMs • ONFi 78-pin NV-DDR2 slot
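A minimal sketch of this four-step path, assuming hypothetical function names (`endpoint_receive`, `hal_issue_next`) and a plain queue for the downstream buffer; it only mirrors the flow described above, not the actual endpoint hardware.

```python
from collections import deque

downstream_buffer = deque()   # traffic from the PCI-E fabric toward the FIMMs

def pcie_device_layers_disassemble(packet: dict) -> dict:
    """(2) Strip the PCI-E framing and keep only the payload fields."""
    return {"fimm": packet["fimm"], "op": packet["op"], "addr": packet["addr"]}

def endpoint_receive(packet: dict) -> None:
    """(1) A PCI-E packet arrives at the target endpoint,
       (3) its disassembled payload is enqueued into the downstream buffer."""
    downstream_buffer.append(pcie_device_layers_disassemble(packet))

def hal_issue_next() -> str:
    """(4) The HAL dequeues the payload and constructs a NAND flash command
       to be driven to the target FIMM over the ONFi NV-DDR2 back end."""
    req = downstream_buffer.popleft()
    return f"NAND {req['op'].upper()} @ FIMM{req['fimm']} addr=0x{req['addr']:08x}"

endpoint_receive({"fimm": 2, "op": "read", "addr": 0x1F400})
print(hal_issue_next())   # NAND READ @ FIMM2 addr=0x0001f400
```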

  17. Triple-A Architecture • PCI-E allows architects to build arbitrary configurations • An endpoint is where FIMMs are directly attached • Triple-A composes a set of FIMMs using PCI-E • The useful parts of SSDs are aggregated on top of the PCI-E fabric • [Figure: multi-core processors and DRAMs at the root complexes, PCI-E switches, and endpoints with attached FIMMs]

  18. Triple-A Architecture • Flash control logic is also moved out of the SSD internals • Address translation, garbage collection, I/O scheduling, and so on • Autonomic I/O contention management • Triple-A interacts with hosts or compute nodes • [Figure: a management module alongside the multi-core processors and DRAMs, above the PCI-E fabric, switches, endpoints, and FIMMs; hosts and compute nodes (CNs) attach at the top]

  19. Outline • Motivations • Triple-A Architecture • Triple-A Management • Evaluations • Conclusions

  20. Link Contention Management • (1) Hot cluster detection – I/Os stalled due to link contention • [Figure: a hot cluster of FIMMs whose shared data bus behind a PCI-E switch and endpoint is busy]

  21. Link Contention Management • Hot cluster detection – I/Os stalled due to link contention • Cold cluster securement – clusters with a free link • Autonomic data migration – from the hot cluster to a cold cluster • Shadow cloning can hide the migration overheads • [Figure: data migrated across the PCI-E switch from the busy hot cluster to an idle cold cluster]
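A minimal sketch of this three-step loop, under assumed thresholds and counters (`link_stalls`, `link_utilization`, `stall_threshold`); the actual Triple-A detection and migration policies are more involved than this illustration.

```python
def detect_hot_clusters(clusters, stall_threshold):
    """A cluster is 'hot' when I/Os stall on its shared link too often."""
    return [c for c in clusters if c["link_stalls"] > stall_threshold]

def secure_cold_cluster(clusters, idle_threshold, exclude):
    """Pick a cluster whose link is mostly idle as the migration target."""
    cold = [c for c in clusters
            if c is not exclude and c["link_utilization"] < idle_threshold]
    return min(cold, key=lambda c: c["link_utilization"]) if cold else None

def rebalance(clusters, stall_threshold=100, idle_threshold=0.2):
    for hot in detect_hot_clusters(clusters, stall_threshold):
        cold = secure_cold_cluster(clusters, idle_threshold, exclude=hot)
        if cold is None:
            break
        # Shadow cloning would piggyback the migration read on a normal
        # read I/O so the copy does not add traffic to the hot link.
        print(f"migrate hottest data: {hot['name']} -> {cold['name']}")

rebalance([
    {"name": "cluster-0", "link_stalls": 240, "link_utilization": 0.95},
    {"name": "cluster-1", "link_stalls": 3,   "link_utilization": 0.10},
])
```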

  22. Storage Contention Management • Laggard detection – I/Os stalled due to storage contention • Autonomic data-layout reshaping for stalled I/Os in the queue • [Figure: a request queue behind an endpoint, where requests destined to the laggard FIMM-3 stall while the others are issued]

  23. Storage Contention Management • Laggard detection – I/Os stalled due to storage contention • Autonomic data-layout reshaping for stalled I/Os in the queue • Write I/O – physical data-layout reshaping (to non-laggard neighbors) • Read I/O – shadow copying (to non-laggard neighbors) & reshaping • [Figure: stalled requests for the laggard FIMM-3 are redirected to FIMM-2 and FIMM-4 and can then be issued]
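The sketch below illustrates the laggard policy in the same spirit: stalled writes are redirected to a non-laggard neighbor, while stalled reads require a shadow copy before the layout can change. The per-FIMM `stalled_ios` counter and the threshold are illustrative assumptions, not the paper's exact mechanism.

```python
def is_laggard(fimm, stall_threshold=50):
    """A FIMM whose queue keeps stalling I/Os is treated as a laggard."""
    return fimm["stalled_ios"] > stall_threshold

def pick_neighbor(fimms, laggard):
    """Choose the least-loaded non-laggard neighbor as the new data home."""
    candidates = [f for f in fimms if f is not laggard and not is_laggard(f)]
    return min(candidates, key=lambda f: f["stalled_ios"]) if candidates else None

def reshape(fimms, request):
    target = fimms[request["fimm"]]
    if not is_laggard(target):
        return f"issue {request['op']} to {target['name']}"
    neighbor = pick_neighbor(fimms, target)
    if neighbor is None:
        return f"stall {request['op']} on {target['name']}"
    if request["op"] == "write":
        # Writes can be redirected directly: physical data-layout reshaping.
        return f"reshape write {target['name']} -> {neighbor['name']}"
    # Reads need the data copied to the neighbor first: shadow copy, then reshape.
    return f"shadow-copy then reshape read {target['name']} -> {neighbor['name']}"

fimms = [{"name": f"FIMM-{i}", "stalled_ios": n} for i, n in enumerate([5, 120, 8, 2])]
print(reshape(fimms, {"fimm": 1, "op": "write"}))  # reshape write FIMM-1 -> FIMM-3
print(reshape(fimms, {"fimm": 1, "op": "read"}))   # shadow-copy then reshape read ...
```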

  24. Outline • Motivations • Triple-A Architecture • Triple-A Management • Evaluations • Conclusions

  25. Experimental Setup • Flash array network simulation model • Captures PCI-E specific characteristics • Data movement delay, switching and routing latency (PLX 3734), contention cycles • Configures diverse system parameters • Will be made publicly available (we are preparing an open-source framework) • Baseline all-flash array configuration • 4 switches x 16 endpoints x 4 FIMMs (64GB each) = 16TB • 80-cluster, 320-FIMM network evaluation • Workloads • Enterprise workloads (cfs, fin, hm, mds, msnfs, …) • HPC workload (an eigensolver simulated on an LBNL supercomputer) • Micro-benchmarks (read/write, sequential/random)
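As a quick sanity check on the baseline configuration arithmetic (assuming the 16 endpoints are per switch, which is what makes the 16 TB total work out), the snippet below recomputes the FIMM count and capacity, plus the 320-FIMM count used in the 80-cluster evaluation.

```python
# Baseline all-flash array from slide 25:
# 4 switches x 16 endpoints per switch x 4 FIMMs per endpoint, 64 GB per FIMM.
switches, endpoints_per_switch, fimms_per_endpoint, fimm_gb = 4, 16, 4, 64

total_fimms = switches * endpoints_per_switch * fimms_per_endpoint
total_tb = total_fimms * fimm_gb / 1024
print(total_fimms, "FIMMs,", total_tb, "TB")            # 256 FIMMs, 16.0 TB

# Scaled network-size evaluation: 80 clusters (endpoints) of 4 FIMMs each.
print(80 * fimms_per_endpoint, "FIMMs in the 80-cluster evaluation")  # 320
```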

  26. Latency Improvement • Triple-A latency normalized to a non-autonomic all-flash array • Real-world workloads: enterprise and HPC I/O traces • On average, 5x shorter latency • Specific workloads (cfs and web) generate no hot clusters

  27. Throughput Improvement • Triple-A IOPS (system throughput) normalized to a non-autonomic all-flash array • On average, 6x higher IOPS • Specific workloads (cfs and web) generate no hot clusters • Triple-A boosts the storage system by resolving contentions

  28. Queue Stall Time Decrease • Queue stall time comes from the two resource contentions • On average, stall time is shortened by 81% • According to our analysis, Triple-A dramatically decreases link-contention time • msnfs shows a low I/O ratio on hot clusters

  29. Network Size Sensitivity • We increase the number of clusters (endpoints) • Execution time is broken down into stall times and storage latency • Triple-A shows better performance on larger networks • Stall times at the PCI-E components are effectively reduced • FIMM latency itself is outside Triple-A's control • [Figure: non-autonomic array vs. Triple-A]

  30. Related Works (1) • Market products (SSD arrays) • [Pure Storage] one large-pool storage system with 100% NAND-flash-based SSDs • [Texas Memory Systems] 2D flash RAID • [Violin Memory] flash memory array of 1000s of flash cells • Academic studies (SSD arrays) • [A. M. Caulfield, ISCA'13] proposed an SSD-based storage area network (QuickSAN) by integrating the network adapter into SSDs • [A. Akel, HotStorage'11] proposed a prototype of a PCM-based storage array (Onyx) • [A. M. Caulfield, MICRO'10] proposed a high-performance storage array architecture for emerging non-volatile memories

  31. Related Works (2) • Academic studies (SSD RAID) • [M. Balakrishnan, TS'10] proposed an SSD-optimized RAID for better reliability by creating age disparities within arrays • [S. Moon, HotStorage'13] investigated the effectiveness of SSD-based RAID and discussed its reliability potential • Academic studies (NVM usage for HPC) • [A. M. Caulfield, ASPLOS'09] exploited flash memory in clusters for performance and power consumption • [A. M. Caulfield, SC'10] explored the impact of NVMs on HPC

  32. Conclusions • Challenge: SSD arrays might not be suitable for high-performance storage • Our goal: propose a new high-performance storage architecture • Observations • High maintenance cost: caused by worn-out flash-SSD replacements • Performance degradation: caused by shared-resource contention • Key Ideas • Cost reduction: take the bare NAND flash out of the SSD box • Contention resolution: distribute the excessive I/Os that generate bottlenecks • Triple-A: a new architecture suitable for HPC storage • Consists of non-SSD bare flash memories • Automatically detects and resolves performance bottlenecks • Results: non-SSD all-flash arrays are expected to save 35~50% of cost and offer 53% higher throughput than traditional SSD arrays

  33. Backup

  34. Data Migration Overhead: Naïve Migration • A data migration comprises a series of steps • (1) Data read from the source FIMM • (2) Data moved to the parental switch • (3) Data moved to the target endpoint • (4) Data written to the target FIMM • A naïve data migration shares the all-flash array resources with normal I/O requests • I/O latency is delayed due to resource contention

  35. Data Migration Overhead: Shadow Cloning vs. Naïve Migration • The data read of a migration (the first step) seriously hurts system performance • Shadow cloning overlaps a normal read I/O request with the data read of the migration • Shadow cloning successfully hides the data migration overhead and minimizes the system performance degradation
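A simplified sketch of the shadow-cloning idea, with an invented `pending_migration` map and clone cache; it only demonstrates how a migration's source read can piggyback on a host read that touches the same page, so the hot FIMM is not read a second time.

```python
pending_migration = {("FIMM-1", 0x40): "cluster-3"}   # pages queued to move
shadow_clones = {}                                    # data captured from normal reads

def on_normal_read(fimm, addr, data):
    """Piggyback on a host read: capture the page if it is awaiting migration."""
    key = (fimm, addr)
    if key in pending_migration:
        shadow_clones[key] = data     # the migration's source read is now free

def drain_migrations():
    """Write cloned pages to their target clusters without re-reading the hot FIMM."""
    for key, data in list(shadow_clones.items()):
        target = pending_migration.pop(key)
        shadow_clones.pop(key)
        print(f"write {key} to {target} (source read was hidden behind a host read)")

on_normal_read("FIMM-1", 0x40, b"page-data")
drain_migrations()
```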

  36. Real Workload Latency (1) • CDF of workload latency for the non-autonomic all-flash array and Triple-A • Triple-A significantly improves I/O request latency • Relatively low latency improvement for msnfs • The ratio of I/O requests heading to hot clusters is not very high • Hot clusters are detected, but they are not that hot (less hot) • [Figure: latency CDFs for proj and msnfs]

  37. Real Workload Latency (2) • prxy experiences a large latency improvement with Triple-A • websql did not get as much benefit as expected • It has more and hotter clusters than prxy • But all of its hot clusters are located under the same switch • In addition to 1) hotness, 2) the balance of I/O requests among switches determines the effectiveness of Triple-A • [Figure: latency CDFs for prxy and websql]

  38. Network Size Sensitivity • Triple-A successfully reduces both contention times • By distributing the extra load of hot clusters • Through data migration and physical data reshaping • Link contention time is almost completely eliminated • Storage contention time is steadily reduced • It is bounded by the number of I/O requests to the target clusters • [Figure: results normalized to non-autonomic all-flash arrays]

  39. Why Is Latency Delayed? Storage Contention • Regardless of the array's condition, individual SSDs can be busy or idle (ready to serve a new I/O) • When the SSD an I/O is destined to is ready, the I/O is serviced right away • When the SSD an I/O is destined to is busy, the I/O must wait • [Figure: eight SSDs, some ready and some busy, with requests tagged by destination]
