This talk explores the role of intermediate data in cloud computations, focusing on dataflow programming frameworks such as MapReduce, Pig, and Hive. It argues that intermediate data is critical to execution, that regenerating it on loss is costly, and that a storage system, rather than the frameworks themselves, is the right abstraction for keeping it available. The presentation outlines a solution and directions for further research.
On Availability of Intermediate Data in Cloud Computations
Steven Y. Ko, Imranul Hoque, Brian Cho, and Indranil Gupta
Distributed Protocols Research Group (DPRG), University of Illinois at Urbana-Champaign
Our Position
• Intermediate data as a first-class citizen for dataflow programming frameworks in clouds
• Dataflow programming frameworks
• The importance of intermediate data
• Outline of a solution
• This talk
  • Builds up the case
  • Emphasizes the need, not the solution
Dataflow Programming Frameworks
• Runtime systems that execute dataflow programs
• MapReduce (Hadoop), Pig, Hive, etc.
• Gaining popularity for massive-scale data processing
• Distributed and parallel execution on clusters
• A dataflow program consists of
  • Multi-stage computation
  • Communication patterns between stages
Example 1: MapReduce
• Two-stage computation with all-to-all communication
• Introduced by Google, open-sourced by Yahoo! (Hadoop)
• Two functions, Map and Reduce, supplied by the programmer (sketched below)
• Massively parallel execution of Map and Reduce
[Diagram: Stage 1 (Map) → Shuffle (all-to-all) → Stage 2 (Reduce)]
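To make the two-function model concrete, here is a minimal pure-Python sketch (our illustration, not code from the talk): the framework applies a user-supplied map function to each input record, shuffles the intermediate key-value pairs by key, and applies a user-supplied reduce function per key. Word count is the canonical example.

```python
from collections import defaultdict

# User-supplied Map: input record -> list of (key, value) pairs.
def word_count_map(line):
    return [(word, 1) for word in line.split()]

# User-supplied Reduce: key plus all its values -> (key, result).
def word_count_reduce(word, counts):
    return (word, sum(counts))

def run_mapreduce(records, map_fn, reduce_fn):
    # Stage 1: Map (parallelized across machines in a real framework).
    intermediate = [pair for record in records for pair in map_fn(record)]
    # Shuffle: all-to-all grouping of intermediate data by key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Stage 2: Reduce (also parallel, one task per key partition).
    return [reduce_fn(key, values) for key, values in groups.items()]

print(run_mapreduce(["a b a", "b c"], word_count_map, word_count_reduce))
# [('a', 2), ('b', 2), ('c', 1)]
```

The list of (key, value) pairs built between the two stages is exactly the intermediate data this talk is about.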
Example 2: Pig and Hive
• Pig from Yahoo! & Hive from Facebook
• Built atop MapReduce
• Declarative, SQL-style languages
• Automatic generation & execution of multiple MapReduce jobs
Example 2: Pig and Hive (continued)
• Multi-stage with either all-to-all or 1-to-1 communication (sketched below)
[Diagram: Stage 1 (Map) → Shuffle (all-to-all) → Stage 2 (Reduce) → 1-to-1 comm. → Stage 3 (Map) → Stage 4 (Reduce)]
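As a rough sketch of that multi-stage shape, the snippet below chains two MapReduce jobs using the run_mapreduce helper from the previous sketch (so it is not standalone, and it mimics the kind of job chain Pig or Hive might generate, not their actual compiled output): job 1 counts words, and job 2's map consumes job 1's reduce output 1-to-1 to group words by their counts.

```python
# Reuses run_mapreduce, word_count_map, word_count_reduce defined above.

def count_key_map(pair):
    # 1-to-1: each reduce output of job 1 is one input record of job 2.
    word, count = pair
    return [(count, word)]

def collect_reduce(count, words):
    return (count, sorted(words))

stage12_output = run_mapreduce(["a b a", "b c"],
                               word_count_map, word_count_reduce)
stage34_output = run_mapreduce(stage12_output,
                               count_key_map, collect_reduce)
print(stage34_output)  # [(2, ['a', 'b']), (1, ['c'])]
```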
Usage
• Google (MapReduce)
  • Indexing: a chain of 24 MapReduce jobs
  • ~200K jobs processing 50 PB/month (in 2006)
• Yahoo! (Hadoop + Pig)
  • WebMap: a chain of 100 MapReduce jobs
• Facebook (Hadoop + Hive)
  • ~300 TB total, adding 2 TB/day (in 2008)
  • 3K jobs processing 55 TB/day
• Amazon
  • Elastic MapReduce service (pay-as-you-go)
• Academic clouds
  • Google-IBM Cluster at UW (Hadoop service)
  • CCT at UIUC (Hadoop & Pig service)
One Common Characteristic
• Intermediate data: the data passed between stages
• Similar to traditional intermediate data (e.g., .o object files)
• Critical to producing the final output
• Short-lived, written once, read once, and used immediately
One Common Characteristic (continued)
• Written locally & read remotely
• Potentially a very large amount of intermediate data (depending on the workload)
• Acts as a computational barrier
[Diagram: Stage 1 (Map) → computational barrier → Stage 2 (Reduce)]
Computational Barrier + Failures
• Availability becomes critical
• If intermediate data is lost before or during a task's execution, the task cannot proceed
[Diagram: Stage 1 (Map) → Stage 2 (Reduce)]
Current Solution
• Store locally & re-generate when lost
• Re-run the affected Map & Reduce tasks
• No support from a storage system
• Assumption: re-generation is cheap and easy
[Diagram: Stage 1 (Map) → Stage 2 (Reduce)]
Hadoop Experiment
• Emulab setting (for all plots in this talk)
  • 20 machines sorting 36 GB
  • 4 LANs and a core switch (all 100 Mbps)
• Normal execution: Map, Shuffle, Reduce
[Plot: normal execution timeline showing the Map, Shuffle, and Reduce phases]
Hadoop Experiment (continued)
• 1 failure injected after Map
• Re-execution of Map, Shuffle, and Reduce
• ~33% increase in completion time
[Plot: timeline showing the Map, Shuffle, and Reduce phases executed twice due to the failure]
Re-Generation for Multi-Stage
• Cascaded re-execution: expensive (see the sketch below)
[Diagram: Stage 1 (Map) → Stage 2 (Reduce) → Stage 3 (Map) → Stage 4 (Reduce)]
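A small sketch (ours, with hypothetical names such as run_stage) of why purely local storage cascades: regenerating a lost output needs the previous stage's output, which may itself be lost, so re-execution recurses back to the last surviving copy.

```python
def regenerate(stage, outputs):
    """Recursively re-run stages whose intermediate output was lost.

    outputs maps stage number -> output data, or None if lost;
    stage 0 is the durable job input (e.g., stored in GFS/HDFS)."""
    if outputs.get(stage) is not None:
        return outputs[stage]
    # Cascaded re-execution: this stage's input is the previous
    # stage's output, which may itself need regeneration.
    upstream = regenerate(stage - 1, outputs)
    print(f"re-running stage {stage}")
    outputs[stage] = run_stage(stage, upstream)  # hypothetical runner
    return outputs[stage]

def run_stage(stage, data):
    return data  # stand-in for the real Map or Reduce computation

# A machine failing after stage 3 lost its local outputs of stages 1-3:
outputs = {0: "durable input", 1: None, 2: None, 3: None}
regenerate(3, outputs)  # re-runs stages 1, 2, and 3 in order
```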
Importance of Intermediate Data
• Why does it matter?
  • Critical for execution (the computational barrier)
  • Very costly when lost
• Current systems handle it themselves
  • Re-generation when lost can lead to expensive cascaded re-execution
  • No support from the storage layer
• We believe storage, not the dataflow frameworks, is the right abstraction
Our Position
• Intermediate data as a first-class citizen for dataflow programming frameworks in clouds
• Dataflow programming frameworks
• The importance of intermediate data
• Outline of a solution
  • Why is storage the right abstraction?
  • Challenges
  • Research directions
Why is Storage the Right Abstraction?
• Replication stops cascaded re-execution
[Diagram: Stage 1 (Map) → Stage 2 (Reduce) → Stage 3 (Map) → Stage 4 (Reduce)]
So, Are We Done?
• No!
• Challenge: minimal interference
  • The network is heavily utilized during Shuffle
  • Replication requires network transmission too
  • Minimizing interference is critical for overall job completion time
• Any existing approaches?
  • HDFS (Hadoop's default file system): heavy interference (next slide)
  • Background replication with TCP-Nice: not designed for network utilization & control (see our paper)
Modified HDFS Interference
• Unmodified HDFS: heavy overhead from synchronous replication
• Modified for asynchronous replication, measured at four increasing levels of interference:
  • Hadoop: original, no replication, no interference
  • Read: disk read only, no network transfer, no actual replication
  • Read-Send: disk read & network send, no actual replication
  • Rep.: full replication
Modified HDFS Interference (continued)
• Even asynchronous replication interferes: network utilization makes the difference (see the sketch below)
• Both Map & Shuffle are affected
• Some Map tasks end up reading their input remotely
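For concreteness, a minimal sketch, assuming a simple queue-based design rather than the actual HDFS modification, of what asynchronous replication means here: the writer returns as soon as the local copy is on disk, and a background thread later pushes blocks to remote nodes. That background traffic is exactly what competes with Shuffle on the network.

```python
import queue
import threading

replication_queue = queue.Queue()

def write_block(block, local_disk, remote_node):
    local_disk.append(block)                      # synchronous local write
    replication_queue.put((block, remote_node))   # replicate later
    # Returns immediately: the writing task never blocks on the network.

def replicator():
    # Background thread; its sends are the interference the talk
    # measures against Map and Shuffle traffic.
    while True:
        block, remote_node = replication_queue.get()
        remote_node.append(block)                 # stand-in for a network send
        replication_queue.task_done()

threading.Thread(target=replicator, daemon=True).start()

local, remote = [], []
write_block("map-output-0", local, remote)
replication_queue.join()  # wait for the replica (for this demo only)
print(local, remote)
```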
Our Position
• Intermediate data as a first-class citizen for dataflow programming frameworks in clouds
• Dataflow programming frameworks
• The importance of intermediate data
• Outline of a new storage system design
  • Why is storage the right abstraction?
  • Challenges
  • Research directions
Research Directions
• Two requirements
  • Intermediate data availability, to stop cascaded re-execution
  • Interference minimization, focusing on network interference
• Solution: replication with minimal interference
Research Directions (continued)
• Replication using spare bandwidth
  • Little network activity during Map & Reduce computation
  • Requires tight bandwidth monitoring & control
• Deadline-based replication
  • Replicate every N stages
• Replication based on a cost model
  • Replicate only when re-execution is more expensive (see the sketch below)
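One plausible reading of the cost-model direction, sketched with hypothetical parameters (a failure probability, per-stage run times, and a fixed replication cost, none of which come from the talk): replicate a stage's output only if the expected cost of cascaded re-execution back to the last replicated output exceeds the cost of replicating now.

```python
def should_replicate(stage_times, stage, p_fail, replication_cost):
    """Hypothetical cost-model check: replicate this stage's output iff
    the expected re-execution cost exceeds the cost of replicating now.

    stage_times[i] = time to re-run stage i; we assume every stage back
    to the last replicated output (index 0 here) must be re-run."""
    cascaded_cost = sum(stage_times[: stage + 1])
    expected_reexecution = p_fail * cascaded_cost
    return expected_reexecution > replication_cost

# Example: four stages of 10 minutes each, 5% failure chance, and
# replication costing the equivalent of 1 minute of work.
times = [10, 10, 10, 10]
for s in range(4):
    print(s, should_replicate(times, s, p_fail=0.05, replication_cost=1.0))
# Later stages accumulate more cascaded cost, so they justify
# replication sooner than early ones.
```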
Summary
• Our position: intermediate data as a first-class citizen for dataflow programming frameworks in clouds
• Problem: cascaded re-execution
• Requirements
  • Intermediate data availability
  • Interference minimization
• Further research needed
Default HDFS Interference
• Replication of Map and Reduce outputs
• Replication policy: one copy local, then remote-rack
• Synchronous replication