“The Five-Minute Rule Ten Years Later, and Other Computer Storage Rules of Thumb” Authors: Jim Gray, Goetz Graefe Reviewed by: Nagapramod Mandagere, Biplob Debnath
Outline • Problem Statement • Motivation • Importance and Relevance • Main Contributions and Validation • Key Ideas • Illustrations • New Metrics • Assumptions • Re-write Today • Questions
Problem Statement • Broader Problem: Viewing developments over a long period of time to extract important technology trends. • Specific Instance: Inferring rules of thumb for buffer replacement policies in a number of settings, including RAID environments. • Given: Trends over time for parameters such as memory cost, disk cost, and tape cost • Find: Rules of thumb for deciding where to store data and when to evict data from the memory buffer • Objectives: Simple rules, extensible rules • Constraints: Hierarchical Storage Model
Typical Database Administrator's Dilemma The performance isn’t good. Am I doing something wrong? Should I cache on the client? Should I cache this data in memory? Should I store data back on disk (local or network disk)? Should I move data to tape?
Importance & Relevance • Parameters change at different rates • Seeks/second & disk capacity: 10x to 100x • Disk MB/K$ & DRAM MB/K$: 1000x
Importance & Relevance • The location of data is very important • Main Memory: Very fast, expensive, limited size • Disk Storage: Much slower than main memory, inexpensive, close to unlimited size • Tape Storage: Slowest, dirt cheap, unlimited capacity • How can one decide what data resides where? • The system learns from data access patterns and adapts (admins hate to give up control) • The administrator controls data locality using experience or historical performance information (rules of thumb)
Main Contributions & Validation • The Five Minute Rule • Randomly accessed buffer pages can be replaced if unused for more than 5 minutes. • Sequentially accessed buffer pages can be replaced if unused for more than 1 minute. • Metrics for storage performance characterization • Cost/Access • Maps: Megabyte accesses per second • Scan: Time it takes to sequentially read or write all the data in the device • Validation Methodology • Examples • Random access • One pass sort • Two pass sort • Trends observed over a period of time
Key Ideas • Tradeoff between the cost of RAM and the cost of disk accesses. • The tradeoff is that caching pages in the extra memory can save disk IOs. • The break-even point is met when the rent on the extra memory for cache ($/page/sec) exactly matches the savings in disk accesses per second ($/disk_access/sec).
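The break-even condition above can be written out as a short derivation. The symbols are shorthand introduced here for the quantities the slides name: $P_{RAM}$ is the price per MB of DRAM, $N$ is pages per MB of RAM, $P_{disk}$ is the price per disk drive, $A$ is accesses per second per disk, and $T$ is the interval in seconds between references to a page.

```latex
% Cost of keeping one page in RAM
C_{\text{page}} = \frac{P_{RAM}}{N}

% Cost of the disk capacity needed to serve one access every T seconds
C_{\text{access}}(T) = \frac{P_{disk}}{A} \cdot \frac{1}{T}

% Break-even: the two costs are equal
\frac{P_{RAM}}{N} = \frac{P_{disk}}{A \, T}
\quad\Longrightarrow\quad
T_{\text{break-even}} = \frac{N}{A} \cdot \frac{P_{disk}}{P_{RAM}}
```

Pages referenced more often than once every $T_{\text{break-even}}$ seconds are cheaper to keep in memory; pages referenced less often are cheaper to re-read from disk.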
Illustration – Typical System in 1997 • For a system with the following characteristics • PagesPerMBofRAM = 128 pages/MB (8KB pages) • AccessesPerSecondPerDisk = 64 accesses/sec/disk • PricePerDiskDrive = 2000 $/disk (9GB + controller) • PricePerMBofDRAM = 15 $/MB_DRAM • The break-even inter-reference interval is about 266 seconds, roughly 5 minutes
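The 1997 figure can be reproduced with a few lines of Python. This is a minimal sketch that assumes the break-even interval is (pages per MB of RAM / accesses per second per disk) × (price per disk drive / price per MB of DRAM); the function and variable names are illustrative, not taken from the paper.

```python
def break_even_interval_seconds(pages_per_mb_ram, accesses_per_sec_per_disk,
                                price_per_disk, price_per_mb_dram):
    """Reference interval (seconds) at which the cost of holding a page in RAM
    equals the cost of the disk accesses it saves."""
    technology_ratio = pages_per_mb_ram / accesses_per_sec_per_disk
    economic_ratio = price_per_disk / price_per_mb_dram
    return technology_ratio * economic_ratio


# Values from the 1997 slide: 8 KB pages, 64 accesses/sec/disk,
# $2000 per disk (with controller), $15 per MB of DRAM.
interval_1997 = break_even_interval_seconds(128, 64, 2000, 15)
print(f"{interval_1997:.1f} s = {interval_1997 / 60:.1f} min")
```

The exact result is about 266.7 seconds, which the slide truncates to 266 and rounds to five minutes.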
Illustration • One pass algorithms • read the data once and never reference it again, • so there is no need to cache the data in RAM. • The system needs only enough buffer memory to allow data to stream from disk to main memory. • Typically, two or three one-track buffers (~100 KB) per disk are adequate to buffer disk operations and allow the device to stream data to the application.
Illustration • Two pass algorithms • Sequential operations that read a large dataset and then revisit parts of the data. • Examples: database join, cube, rollup, and sort operators • Sorting uses two passes if the memory size is smaller than the data set size • Inter-reference time is typically about a minute (sequential data access)
Illustration – Two Pass Sort • A one pass sort needs a larger amount of memory than a two pass sort • Memory needed by a one pass sort grows faster with the size of the input file • For files bigger than the memory size, two pass is the only option
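The difference in memory growth can be sketched with a simplified textbook model of external merge sort, which is an assumption here and not the paper's exact formula. A one pass sort holds the whole file in memory. A two pass sort with memory M and merge buffers of size B writes runs of about M and merges about M/B of them, so it handles files up to roughly M²/B. The 1 MB buffer size is a hypothetical value chosen for round numbers.

```python
import math


def one_pass_memory_mb(file_mb):
    # A one pass sort keeps the entire input in memory: linear growth.
    return file_mb


def two_pass_memory_mb(file_mb, buffer_mb):
    # Simplified model: runs of size M merged with fan-in M / B cover
    # M * (M / B) of input, so M must be at least sqrt(B * file size).
    return math.sqrt(buffer_mb * file_mb)


BUFFER_MB = 1.0  # hypothetical merge buffer size

for file_mb in (100, 10_000, 1_000_000):
    print(f"file {file_mb:>9} MB: "
          f"one pass {one_pass_memory_mb(file_mb):>9.0f} MB, "
          f"two pass {two_pass_memory_mb(file_mb, BUFFER_MB):>6.0f} MB")
```

Under this model a 100x larger file needs 100x more memory for a one pass sort but only 10x more for a two pass sort.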
Disk vs Tape Tradeoff • Tape: larger penalty (slower access), lowest cost • Solution: larger break-even point, bigger page size
New Metrics • Motivated by data flow applications that stream huge amounts of data, such as data mining and multimedia applications • Kaps: kilobyte accesses per second • Maps: megabyte accesses per second • Scan: time taken to sequentially read or write all data on a device • These metrics combined with rent costs provide a price/performance metric
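As a rough illustration of how the three metrics relate to device characteristics, the sketch below assumes a simple access model: each access costs a fixed positioning latency plus the transfer time for the object. The model, the function names, and the latency and transfer rate values are hypothetical; only the 9 GB capacity echoes the 1997 disk mentioned earlier.

```python
def accesses_per_second(object_bytes, latency_s, transfer_bytes_per_s):
    # One access = positioning latency + time to transfer the object.
    return 1.0 / (latency_s + object_bytes / transfer_bytes_per_s)


def scan_seconds(capacity_bytes, transfer_bytes_per_s):
    # Time to read or write the whole device sequentially.
    return capacity_bytes / transfer_bytes_per_s


# Hypothetical drive parameters (assumptions for illustration only).
LATENCY_S = 0.010              # 10 ms seek plus rotational delay
TRANSFER_BYTES_PER_S = 10e6    # 10 MB/s sequential transfer
CAPACITY_BYTES = 9e9           # 9 GB

kaps = accesses_per_second(1e3, LATENCY_S, TRANSFER_BYTES_PER_S)
maps = accesses_per_second(1e6, LATENCY_S, TRANSFER_BYTES_PER_S)
scan = scan_seconds(CAPACITY_BYTES, TRANSFER_BYTES_PER_S)

print(f"Kaps: {kaps:.1f} accesses/s")
print(f"Maps: {maps:.1f} accesses/s")
print(f"Scan: {scan:.0f} s ({scan / 60:.0f} min)")
```

In this model Kaps is dominated by positioning latency while Maps and Scan are dominated by transfer rate, which is why the three are reported separately.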
Assumptions • All disk storage has the same characteristics (cost/performance). The paper assumes that the disk storage system is homogeneous and does not consider the more recent shift towards hierarchical/heterogeneous storage systems. • The tradeoff considers only the performance aspect; security and fault tolerance are assumed to be uniform throughout.
Re-write • Re-evaluate the rules of thumb considering more recent costs and the more recent trends in storage systems like heterogeneous/hierarchical storage • Take into account SAN, NAS characteristics
Questions • Does the five minute rule hold good today? • No (with reservations) • If one changes the page size to the megabyte range, the five minute rule still applies. • Pages/MB of RAM = 16 (64 KB pages) • Accesses/sec/disk = 64 • Price/disk drive = $400 • Price/MB of RAM = $0.1 • Break-even point ~ 1000 s • Further evidence: Jim Gray (keynote in FAST 2004) http://www.usenix.org/events/fast05/
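The break-even figure on this slide can be checked by plugging the listed values into the break-even formula (pages per MB of RAM over accesses per second per disk, times price per disk drive over price per MB of RAM). Note that 16 pages per MB corresponds to 64 KB pages.

```latex
T_{\text{break-even}}
  = \frac{16\ \text{pages/MB}}{64\ \text{accesses/sec/disk}}
    \times \frac{\$400/\text{disk}}{\$0.1/\text{MB}}
  = 0.25 \times 4000
  = 1000\ \text{s} \approx 17\ \text{minutes}
```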