Prefetching with Adaptive Cache Culling for Striped Disk Arrays

2008 USENIX Annual Technical Conference. Sung Hoon Baek and Kyu Ho Park (shbaek@core.kaist.ac.kr, kpark@ee.kaist.ac.kr), Korea Advanced Institute of Science and Technology (KAIST), School of Electrical Engineering and Computer Science

Introduction Our Work

Disk Prefetching Schemes
• Accurate Prediction (slide callouts: impractical, high overhead)
  • Offline Prefetching
  • History-Based Prefetching
  • Application-Hint-Based Prefetching
• Sequential Prediction
  • Sequential Prefetching
  • Most widely used, but never beneficial to non-sequential accesses
• Our Scheme
  • Goal
    • Beneficial to non-sequential reads as well as sequential reads
    • As practical as Sequential Prefetching
  • Approach
    • Low prefetch cost at the expense of prediction accuracy
    • Considers both prefetch buffer management and prefetching
    • For striped disk arrays: RAID-0, RAID-5, RAID-6, SSD, etc.

Prior Work
• Buffer Management for Prefetched Data
  • Related work: TIP [1, 2]
    • Deterministic cost estimation makes errors
    • Scan overhead: searching for the least-valuable block is O(N)
  • Adaptive Strip Prefetching: the proposed scheme
    • Practical scheme
    • Low overhead: O(1)
    • Inspired by ARC and SARC, which manage cached data
    • A more analytical method, for prefetched data
    • Specialized for RAID
[1] R. H. Patterson et al., “Informed Prefetching and Caching,” ACM SOSP (Dec. 1995)
[2] A. Tomkins et al., “Informed Multiprocess Prefetching and Caching,” ACM Int’l Conf. on MMCS (June 1997)

Prior Works vs. Our Work
• Prior Works
  • Buffer management for prefetched data (TIP) [1]: O(N); similar goal to our prefetch buffer management
  • Adaptive cache management (ARC [2], SARC): cached data management; similar method to our prefetch buffer management
• Our Work (three tightly integrated parts)
  • (1) A new prefetching scheme: for non-sequential and sequential reads, very practical, for RAID
  • (2) Prefetch buffer management: prefetched data management with a more analytical method, O(1); resolves bad cache utilization
  • (3) An online cost estimator
[1] R. H. Patterson et al., “Informed Prefetching and Caching,” ACM SOSP (Dec. 1995)
[2] Megiddo and Modha, “ARC: A Self-Tuning, Low Overhead Replacement Cache,” USENIX FAST, 2003

Problem: Independency RAID Layout

My Work: Adaptive Strip Prefetching Adaptive Strip Prefetching (ASP) • Strip Prefetching • Read all blocks of a strip • Segment prefetching • Segment = Strip • Bad cache utilization, unused data pollutes the cache • Adaptive Cache Culling • Buffer Management for Prefetched data • Differential Feedback • Online Prefetch Cost Estimation

Strip Prefetching (SP). Non-sequential reads may or may not benefit from SP. However, most non-sequential reads in real workloads also exhibit spatial locality, unlike random reads over a huge workspace, so in many cases SP provides a performance gain. For random reads over a huge workspace, SP is deactivated by the online disk simulator.

Best Segment Size for Segment Prefetching? [Figure: bandwidth versus prefetch size (half strip, one strip, two strips, three strips) for request size / strip size of 128 KiB / 128 KiB, 256 KiB / 256 KiB, 128 KiB / 256 KiB, 256 KiB / 128 KiB, and 384 KiB / 128 KiB] • 200 threads perform random reads with a fixed read size • Three UltraSCSI disks (15 krpm)

My Work: Adaptive Strip Prefetching (ASP) • Strip Prefetching • Bad cache utilization: useless data pollutes the cache • Adaptive Cache Culling (prefetch buffer management) • Mitigates the disadvantage of strip prefetching • Buffer management for prefetched data • Culls uselessly prefetched data • Maximizes total hit rate = prefetch hit rate + cache hit rate • Under a given cache management • Differential feedback (automatic) • Prefetch hit: a request on a prefetched block • Cache hit: a request on a cached block • Online Prefetching Cost Estimation

Block States in Adaptive Strip Prefetching Downstream Upstream

Basic Operations of ASP (1/2) Empty block Strip cache Prefetched block Cached block Adding a new strip cache to the upstream Culling Upstream NU: # of strip caches, variable Downstream Get free block caches

Basic Operations of ASP (2/2) Cache hit Cache hit Cache hit Cache miss Cache miss : strip prefetching Upstream NU: max. # of strip caches, adaptively controlled variable

Cache Replacement Policy Prefetch Buffer Management Culling (ASP) Hit pointing Hit Eviction (no ASP) MRU LRU A Global LRU list Cache Replacement Policy Global Bottom

NU vs. hit rate [Figure: hit rate at each position, shown for NU = 9 and for NU = 7] • Prefetch hit: a hit on a prefetched block • ΔP: partial prefetch hit rate (hit rate on prefetched blocks) • ΔC: partial cache hit rate (hit rate on cached blocks) • With NU = 7 instead of NU = 9: reduced prefetch hit rate, but additional cached blocks and an additional cache hit rate

Total Hit Rate vs. NU (1/2) • Find the optimal NU that maximizes the total hit rate • Feedback control: NU ← NU + s × slope

Total Hit Rate vs. NU (2/2) • Monotonically increasing function • Slope ≥ 0 • NU ← min(NU + C × slope, NUmax) • Forces NU to the maximum value • Monotonically decreasing function • Slope ≤ 0 • NU ← max(NU + C × slope, NUmin) • Forces NU to zero
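The clamped update on this slide can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `update_nu`, its argument names, and the example numbers are hypothetical and do not come from the talk.

```python
def update_nu(nu, slope, gain, nu_min, nu_max):
    """One feedback step: NU <- NU + gain * slope, clamped to [nu_min, nu_max].

    `gain` plays the role of the constant multiplying the slope on the slide.
    All names here are illustrative assumptions.
    """
    candidate = nu + gain * slope
    # min() enforces the upper bound NUmax, max() enforces the lower bound NUmin.
    return max(nu_min, min(candidate, nu_max))


# Monotonically increasing total hit rate (slope >= 0): NU drifts up to NUmax.
print(update_nu(nu=9, slope=2.0, gain=1.0, nu_min=0, nu_max=10))
# Monotonically decreasing total hit rate (slope <= 0): NU drifts down to NUmin.
print(update_nu(nu=1, slope=-3.0, gain=1.0, nu_min=0, nu_max=10))
```

With a persistently positive slope the repeated update saturates at the upper bound, and with a persistently negative slope it saturates at the lower bound, which matches the two cases listed on the slide.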

Derivative vs. Marginal Utility bottom Additional allocation • Derivative Original upstream • Marginal Utility (inspired by SARC)

Differential Feedback [Figure: upstream and downstream lists, with culling, the Upstream Bottom (Ub), and the Global Bottom (Gb)] • ΔP: # of prefetch hits in Ub during a time interval • ΔC: # of cache hits in Gb during a time interval • Proportional control • Further work: PID (proportional-integral-derivative) control

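Differential Feedback Diagram [Block diagram: the workload drives the cache with strip prefetching, which produces ΔP and ΔC; ΔC is scaled by α and subtracted from ΔP; the difference passes through the gain S and is accumulated through a delay into NU, which is fed back to the cache through a zero-order hold (ZOH)] It maximizes the total hit rate under a given buffer management, and resolves the disadvantage of strip prefetching.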
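One possible reading of the diagram is the proportional update NU ← NU + S × (ΔP − α × ΔC), applied once per time interval. The Python sketch below follows that reading. The class name, method names, the lower bound of zero, and the `nu_max` bound are assumptions made for illustration, not details given in the talk.

```python
from dataclasses import dataclass


@dataclass
class DifferentialFeedback:
    """Proportional controller for NU (hypothetical helper, not from the talk)."""
    nu: float          # current NU: number of strip caches in the upstream
    gain: float        # S in the diagram
    alpha: float       # coefficient alpha applied to cache hits
    nu_max: float      # assumed upper bound on NU
    delta_p: int = 0   # prefetch hits seen in the upstream bottom this interval
    delta_c: int = 0   # cache hits seen in the global bottom this interval

    def record_prefetch_hit(self) -> None:
        self.delta_p += 1

    def record_cache_hit(self) -> None:
        self.delta_c += 1

    def end_interval(self) -> float:
        """Apply NU <- NU + S * (dP - alpha * dC), clamp, and reset the counters."""
        error = self.delta_p - self.alpha * self.delta_c
        self.nu = min(max(self.nu + self.gain * error, 0.0), self.nu_max)
        self.delta_p = 0
        self.delta_c = 0
        return self.nu


fb = DifferentialFeedback(nu=10.0, gain=0.5, alpha=2.0, nu_max=20.0)
for _ in range(8):
    fb.record_prefetch_hit()
for _ in range(2):
    fb.record_cache_hit()
print(fb.end_interval())  # the upstream grows when prefetch hits dominate
```

In this sketch, prefetch hits in the upstream bottom push NU up and cache hits in the global bottom push it down, so memory shifts toward whichever side is producing more hits at the margin.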

Initial Condition • The two bottoms overlap: there is an upstream but no downstream, so the Upstream Bottom and the Global Bottom coincide • Na ← cache size / strip size • Init: NU ← Na • No feedback until NU + ND <= Na • Strip Prefetching is forced until NU + ND <= Na

Ghosts [Figure labels: cache miss, eviction, downstream, upstream, ghosts, culling] • Past cached block: a block that was a cached block before it became a ghost • Culling does not evict either past cached blocks or cached blocks

Which strip becomes a ghost strip? • Our goal: easy implementation • RAID drivers manage destage caches in terms of stripes • A stripe cache includes its strip caches • Example • Stripe2 has live strip caches for strip2A and strip2B • When strip2A is evicted, it becomes a ghost • When strip2B is then evicted, both are completely removed
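The example on this slide can be sketched as a small Python model. It assumes that a stripe cache tracks only the strips that currently have caches and that a ghost keeps its entry until the last live sibling is evicted. The class and method names are hypothetical and are not taken from the authors' RAID driver.

```python
class StripeCache:
    """Toy model of a stripe cache holding per-strip states (illustrative only)."""

    def __init__(self, stripe_id, strip_ids):
        self.stripe_id = stripe_id
        # Every strip cache starts out live.
        self.state = {strip_id: "live" for strip_id in strip_ids}

    def evict(self, strip_id):
        """Evict one strip cache.

        The strip becomes a ghost while any sibling is still live. When the
        last live strip is evicted, the whole stripe cache (ghosts included)
        is removed. Returns True when the stripe cache has been removed.
        """
        self.state[strip_id] = "ghost"
        if all(s == "ghost" for s in self.state.values()):
            self.state.clear()
            return True
        return False


stripe2 = StripeCache(2, ["strip2A", "strip2B"])
print(stripe2.evict("strip2A"), stripe2.state)  # strip2A is now a ghost
print(stripe2.evict("strip2B"), stripe2.state)  # stripe cache fully removed
```

Tying the ghost lifetime to the enclosing stripe cache means no separate ghost list has to be maintained, which is in the spirit of the "easy implementation" goal stated on the slide.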

Online Cost Estimation (1/2) • The differential feedback resolves the disadvantage of strip prefetching • But it does not help random reads • Random reads rarely produce prefetch hits or cache hits • The Online Cost Estimation • Determines whether Strip Prefetching or no prefetching is the better choice • Activates/deactivates Strip Prefetching

Online Cost Estimation (2/2) • Low Overhead • O(1) Complexity

Evaluation
• Implemented a RAID-5 driver in Linux 2.6.18
• Five SCSI320 disks (15 krpm, 73 GB)
• Dual Xeon 3.0 GHz, 1 GB of memory
• Combinations
  • ASP+MSP
  • ASP+SEQP
  • MSP+SEQP
  • ASP+MSP+SEQP
• Abbreviations
  • SEQP: Sequential Prefetching of Linux
  • SEQPX: SEQP with X KiB of prefetching size
  • SP: Strip Prefetching
  • ASP: Adaptive Strip Prefetching
  • MSP: Massive Stripe Prefetching
• Measurement: six repetitions, low deviation

PCMark05 Over-provisioned memory • General Application Usage • Word, Winzip, PowerCrypt, Antivirus, Winamp, WMP, Internet, etc 2.2 times

Dbench • Dbench: Realistic workload like a file server 11 times 2.2 times 30 %

Tiobench: Decision Correctness No prefetching Random Reads: Extremely low cache/prefetching hit rate Feedback does not work The online cost estimator makes the decision

Maximum Latency & CPU Load • Tiobench (random read) Maximum latency CPU load / Throughput

IOZone : Independency • IOZone Benchmark • Concurrent sequential reads Parallelism loss Including MSP The best Independency loss Independency loss Including SEQP Parallelism loss

IOZone: Stride/Reverse Read • Stride Read 40 times ASP included Sequential Prefetching • Reverse Read

TPC Benchmark™ H • TPC-H: business-oriented database server benchmark • DBMS: MySQL • Stride reads and non-sequential reads • Gain of ASP+MSP over SEQP128 (values as shown on the chart): 27%, 37%, 24%, 52%, 141%, 41%, 134%, 721%, 73%, 199%, 20%, 27%

Real Scenarios • cscope: C source file indexing of the kernel source • cscope1: excludes object files • cscope2: includes object files • glimpse: text file indexing (/usr/share/doc) for cross reference • link: linking kernel object code • [Figure values: 107%, 44%, 10%, 116%]

Linux Booting 30%

Summary • Benefits non-sequential reads as well as sequential reads • Database queries, building search indices • Link, booting, file server • General application usage • Prefetch Buffer Management (Differential Feedback) • Resolves the bad cache utilization of strip prefetching • Online Disk Cost Simulation • Resolves the bad prefetch cost of strip prefetching • Practical, low overhead, great performance gain for practical RAID systems

Q&A

Step response Realistic NU NU Initial NU Real NU by the feedback control Desired NU Time

Backup Slides

Prior Work: for parallelism Massive Stripe Prefetching • Adaptive Strip Prefetching (ASP) • Good for large numbers of concurrent IOs • Bad Parallelism for small numbers of concurrent IOs • Massive Stripe Prefetching (MSP) • Our Prior Work • Resolve Parallelism Loss • Activated for a small number of concurrent sequential reads • Prefetching multiple stripes • Perfect parallelism of disks

Proposed scheme: for parallelism The Prefetching Size and Time of MSP Prefetch size Stripe size MSP is aligned in stripe MSP MSP + SEQP SEQP The amount of sequential accesses in a file

The Coefficient α The amount of memory in the increased region in U = the amount of memory in the reduced region in D

Further Work • Optimal S ? or Dynamically controlling S • Optimal Size of Upstream Bottom |Ub| ? • Ideal Derivative, Great Errors • Impractical
