
Green HDFS



Presentation Transcript


  1. GreenHDFS: a Self-Adaptive, Energy-Conserving Variant of the Hadoop Distributed File System (Kumar Sharshembiev)

  2. Presentation plan • 1. Current energy issues with HDFS and large server farms • 2. Past approaches and solutions for energy conservation and cost reduction • 3. GreenHDFS's unique design and solution • 4. Conclusions and references

  3. Current energy issues with HDFS • The purpose of HDFS is to provide a scalable file system that runs on a large number of commodity servers – currently ~155,500 at Yahoo!

  4. Current energy issues with HDFS • Large numbers of servers generate heat and consume energy in very large quantities • Over the lifetime of a server, the operating energy cost is comparable to the initial acquisition cost, and ownership costs keep growing – power, cooling, etc. • A lot of effort and research has gone into energy conservation for extremely large-scale server farms

  5. Past approaches and solutions • One commonly used technique is the “scale-down” approach – transitioning servers into a low-power state • Example: many datacenters migrate workloads and their state to a smaller number of servers during low-activity hours • The problem? This approach works only when servers are stateless – i.e. they get all of their data from NAS/SAN

  6. Past approaches and solutions • “Scale-down” approaches work only with NAS/SAN, since all of the data is stored on dedicated storage devices – this makes it possible to migrate workloads to a smaller number of servers

  7. Problem with the past solutions • Hadoop distributes its files across many servers – any of the thousands of nodes can be participating at any moment

  8. GreenHDFS solution • Self-adaptive – depends only on HDFS and file access patterns • Applies data-classification techniques • Performs energy-aware placement of data • Trades off cost, performance, and power by separating the cluster into logical zones

  9. Key observations during research • The team did a detailed analysis of files in a production Yahoo! Hadoop cluster: • Files are heterogeneous in access and lifespan patterns – some are rarely accessed, some are deleted shortly after creation, some stay around for a while • 60% of the data is “cold”, or dormant – it sits without being accessed but needs to be kept as history files

  10. Key observations during research • 95-98% of files had a very short “hotness” lifespan of less than 3 days – i.e. they were actively used only during their first 3 days • 90% of files in the top-level directory were dormant, or “cold”, for more than 18 days • The majority of the data had a news-server-like access pattern – most of the computation on a file happens soon after its creation

  11. GreenHDFS design • GreenHDFS organizes servers into logical Hot and Cold Zones, managed by different policies – FMP, SCP, and FRP – that trade off performance, cost, and power: • Hot Zone – files currently being accessed and newly created files; high energy usage and high performance • Cold Zone – files with low to rare access; low energy use, servers in sleeping mode
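A minimal sketch of the two-zone placement rule described above, assuming only that newly created files always land in the Hot Zone; the names (Zone, ZonedCluster, createFile) are illustrative and are not GreenHDFS or HDFS APIs:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative two-zone model: files start in the Hot Zone and only the
// migration policy (FMP, sketched later) moves them to the Cold Zone.
enum Zone { HOT, COLD }

class ZonedCluster {
    private final Map<String, Zone> fileZone = new HashMap<>();

    // Newly created files are always placed in the Hot Zone.
    void createFile(String path) {
        fileZone.put(path, Zone.HOT);
    }

    Zone zoneOf(String path) {
        return fileZone.getOrDefault(path, Zone.HOT);
    }

    public static void main(String[] args) {
        ZonedCluster cluster = new ZonedCluster();
        cluster.createFile("/data/2010/12/events.log");
        System.out.println(cluster.zoneOf("/data/2010/12/events.log")); // HOT
    }
}
```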

  12. GreenHDFS design • The goal of GreenHDFS is to keep the maximum number of servers in the Hot Zone and minimize the number in the Cold Zone • Servers in the Cold Zone are storage-heavy • GreenHDFS relies heavily on the “temperature” of the files – the higher the dormancy (the more rarely a file is accessed), the lower its temperature, and vice versa • Dormancy is determined simply by recording the last-access time on every file read
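A minimal sketch of how a file's “temperature” could be derived from last-access information, assuming dormancy is simply the time since the last read; the particular mapping (negated days since last access) is one illustrative monotone choice, not the GreenHDFS implementation:

```java
import java.time.Duration;
import java.time.Instant;

// Illustrative temperature model: higher dormancy -> lower temperature.
final class FileTemperature {
    // Dormancy = time elapsed since the file was last read.
    static Duration dormancy(Instant lastAccess, Instant now) {
        return Duration.between(lastAccess, now);
    }

    // Any monotone decreasing mapping of dormancy works; here we use
    // the negated number of days since the last access.
    static long temperature(Instant lastAccess, Instant now) {
        return -dormancy(lastAccess, now).toDays();
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        Instant lastAccess = now.minus(Duration.ofDays(20));
        System.out.println("dormancy    = " + dormancy(lastAccess, now).toDays() + " days");
        System.out.println("temperature = " + temperature(lastAccess, now));
    }
}
```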

  13. FMP – File Migration Policy • FMP runs in the Hot Zone and monitors the dormancy of files • This gives the Hot Zone higher storage efficiency, as rarely accessed files are moved to the Cold Zone • It also yields significant energy conservation • [Diagram: files whose coldness exceeds a threshold are migrated from the Hot Zone (heavy computation) to the Cold Zone (idle servers); files whose hotness exceeds a threshold move back]
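A minimal sketch of the migration decision FMP makes; the 18-day coldness threshold is only borrowed from the dormancy observation on slide 10 for illustration, and the class and method names are assumptions, not GreenHDFS code:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative File Migration Policy: Hot Zone files whose dormancy exceeds
// the coldness threshold are selected for migration to the Cold Zone.
final class FileMigrationPolicy {
    private final Duration coldnessThreshold;

    FileMigrationPolicy(Duration coldnessThreshold) {
        this.coldnessThreshold = coldnessThreshold;
    }

    // Scan Hot Zone files (path -> last access time) and return those to move.
    List<String> selectForMigration(Map<String, Instant> hotZoneLastAccess, Instant now) {
        List<String> toCold = new ArrayList<>();
        for (Map.Entry<String, Instant> e : hotZoneLastAccess.entrySet()) {
            Duration dormancy = Duration.between(e.getValue(), now);
            if (dormancy.compareTo(coldnessThreshold) > 0) {
                toCold.add(e.getKey());
            }
        }
        return toCold;
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        Map<String, Instant> hot = Map.of(
                "/logs/2010/clicks.log", now.minus(Duration.ofDays(30)),  // dormant
                "/jobs/today/output",    now.minus(Duration.ofHours(2))); // still hot
        FileMigrationPolicy fmp = new FileMigrationPolicy(Duration.ofDays(18));
        System.out.println(fmp.selectForMigration(hot, now)); // [/logs/2010/clicks.log]
    }
}
```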

  14. SCP – Server Power Conserver Policy • SCP runs in the Cold Zone and determines which servers can go into standby/sleep mode • SCP uses hardware techniques to transition the CPU, disks, and DRAM into a low-power state • SCP wakes a server up only if: • data on that server is accessed, or • new data needs to be placed on that server
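A minimal sketch of SCP's bookkeeping, assuming a simple per-server power state; the actual hardware transitions for CPU, disks, and DRAM are not modeled here, and all names are illustrative:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative Server Power Conserver Policy: Cold Zone servers sleep by
// default and are woken only on data access or new data placement.
final class PowerConserverPolicy {
    enum PowerState { ACTIVE, SLEEPING }

    private final Map<String, PowerState> serverState = new HashMap<>();

    void register(String server) {
        serverState.put(server, PowerState.SLEEPING); // default: low power
    }

    // Wake the server when its data is read or new data must be placed on it.
    void onDataAccessOrPlacement(String server) {
        serverState.put(server, PowerState.ACTIVE);
    }

    void onIdle(String server) {
        serverState.put(server, PowerState.SLEEPING);
    }

    PowerState stateOf(String server) {
        return serverState.get(server);
    }

    public static void main(String[] args) {
        PowerConserverPolicy scp = new PowerConserverPolicy();
        scp.register("cold-node-17");
        System.out.println(scp.stateOf("cold-node-17")); // SLEEPING
        scp.onDataAccessOrPlacement("cold-node-17");
        System.out.println(scp.stateOf("cold-node-17")); // ACTIVE
    }
}
```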

  15. FRP – File Reversal Policy • FRP runs in the Cold Zone and ensures that QoS, bandwidth, and response time are managed well if files become “popular” again • If the number of accesses to a file rises above a threshold, the file's metadata is updated and the file is “moved” back to the Hot Zone • The threshold values of FMP, SCP, and FRP should be chosen so that they yield maximum energy efficiency
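A minimal sketch of FRP's reversal decision, assuming a simple per-file access counter in the Cold Zone; the threshold value and all names are illustrative assumptions:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative File Reversal Policy: a Cold Zone file that is read more often
// than the threshold is flagged for a move back to the Hot Zone.
final class FileReversalPolicy {
    private final int accessThreshold;
    private final Map<String, Integer> coldZoneAccessCounts = new HashMap<>();

    FileReversalPolicy(int accessThreshold) {
        this.accessThreshold = accessThreshold;
    }

    // Record one read of a Cold Zone file; return true if the file has become
    // "popular" and should be reversed to the Hot Zone.
    boolean onColdFileRead(String path) {
        int count = coldZoneAccessCounts.merge(path, 1, Integer::sum);
        return count > accessThreshold;
    }

    public static void main(String[] args) {
        FileReversalPolicy frp = new FileReversalPolicy(2);
        System.out.println(frp.onColdFileRead("/archive/2009/report.csv")); // false
        System.out.println(frp.onColdFileRead("/archive/2009/report.csv")); // false
        System.out.println(frp.onColdFileRead("/archive/2009/report.csv")); // true -> move to Hot Zone
    }
}
```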

  16. File lifespan – files are not equal • A file goes through several stages in its lifetime: • file creation – just created • hot period – frequently used • dormant period – not accessed • deletion • GreenHDFS introduced several lifespan metrics and analyzed their distributions to determine optimal threshold values for its policies (see the sketch below): • FileLifeSpanCFR – file creation to first read • FileLifeSpanCLR – file creation to last read • FileLifeSpanLRD – last read to deletion • FileLifeSpanFLR – first read to last read • FileLifeTime – creation to deletion
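A minimal sketch of the lifespan metrics listed above, computed from a file's creation, first-read, last-read, and deletion timestamps; the class is illustrative, not part of GreenHDFS:

```java
import java.time.Duration;
import java.time.Instant;

// Illustrative computation of the per-file lifespan metrics.
final class LifespanMetrics {
    final Instant created, firstRead, lastRead, deleted;

    LifespanMetrics(Instant created, Instant firstRead, Instant lastRead, Instant deleted) {
        this.created = created;
        this.firstRead = firstRead;
        this.lastRead = lastRead;
        this.deleted = deleted;
    }

    Duration lifeSpanCFR() { return Duration.between(created, firstRead); }  // creation -> first read
    Duration lifeSpanCLR() { return Duration.between(created, lastRead); }   // creation -> last read ("hotness")
    Duration lifeSpanLRD() { return Duration.between(lastRead, deleted); }   // last read -> deletion (dormancy)
    Duration lifeSpanFLR() { return Duration.between(firstRead, lastRead); } // first read -> last read
    Duration lifeTime()    { return Duration.between(created, deleted); }    // creation -> deletion

    public static void main(String[] args) {
        Instant t0 = Instant.parse("2010-01-01T00:00:00Z");
        LifespanMetrics m = new LifespanMetrics(
                t0,
                t0.plus(Duration.ofHours(6)),
                t0.plus(Duration.ofDays(3)),
                t0.plus(Duration.ofDays(25)));
        System.out.println("CLR (hotness)  = " + m.lifeSpanCLR().toDays() + " days");
        System.out.println("LRD (dormancy) = " + m.lifeSpanLRD().toDays() + " days");
    }
}
```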

  17. FileLifeSpanCFR – creation to first read

  18. FileLifeSpanCLR – last read / hotness • The majority of files have a short hotness lifespan

  19. FileLifeSpanLRD – file dormancy • 80% of the files in directory d have a dormancy period > 20 days

  20. GreenHDFS simulation • A simulation was run to test the energy conservation of GreenHDFS

  21. Energy savings with GreenHDFS • 24% reduction in energy consumption – roughly $2.1 million saved for 38,000 servers, or about $8.5 million on the ~155K servers of today
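As a rough sanity check of those figures, assuming the savings scale linearly with cluster size: $2.1 million / 38,000 servers ≈ $55 saved per server, and $55 × 155,500 servers ≈ $8.6 million, consistent with the $8.5 million figure quoted above.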

  22. Storage efficiency with GreenHDFS • More servers and space available = better performance

  23. Conclusion • GreenHDFS is a policy-driven, self-adaptive variant of HDFS • It relies on data-classification-driven data placement that creates significant periods of idleness on a subset of servers • It organizes files into 2 zones: Hot and Cold • It applies a set of policies to classify files as Hot or Cold

  24. Conclusion • Energy consumption was reduced by 24%, saving $2.1 million for 38,000 servers at the time; today more than $8.5 million could be saved • Storage efficiency also increased, since dormant files are moved to the Cold Zone • More space and better utilization of the Hot Zone leads to better performance for HDFS/MapReduce

  25. References • http://www.cs.odu.edu/~mukka/cs775s11/Presentations/papers/kaushik.pdf • http://images.google.com/ • http://cloudera.com/ • http://hadoop.apache.org/
