Enhancing Resource Management in Cloud Computing for Optimal Performance

CS 294-42: Project Suggestions September 14, 2011 Ion Stoica (http://www.cs.berkeley.edu/~istoica/classes/cs294/11/)

Projects • This is a project-oriented class • Reading papers should be a means to a great project, not a goal in itself! • Strongly prefer groups of two • Perfectly fine to use the same project for CS 262 • Today, I’ll present some suggestions • But, you are free to come up with your own proposal • Main goal: just do a great project

Where I’m Coming From? • Key challenge: maximize economic value of data, i.e., • Extract value from data while reducing costs (e.g., storage, computation)

Where I’m Coming From? • Tools to extract value from big-data • Scalability • Response time • Accuracy • Provide high cluster utilization for heterogeneous workloads • Support diverse SLAs • Predictable performance • Isolation • Consistency

Caveats • Cloud computing is HOT, but there is a lot of NOISE! • Not easy to • differentiate between narrow engineering solutions and fundamental tradeoffs • predict the importance of the problem you solve • Cloud computing is akin to a Gold Rush!

Background: Mesos • Rapid innovation in cloud computing • No single framework optimal for all applications • Running each framework on its dedicated cluster • Expensive • Hard to share data • Need to run multiple frameworks on same cluster [Figure: example frameworks Dryad, Cassandra, Hypertable, Pregel]

Background: Mesos – Where We Want to Go • Today: static partitioning (uniprogramming) • Mesos: dynamic sharing (multiprogramming) [Figure: Hadoop, Pregel, and MPI on a shared cluster]

Background: Mesos – Solution • Mesos is a common resource sharing layer over which diverse frameworks can run [Figure: Hadoop, MPI, … on top of Mesos, which runs on the cluster nodes]

Background: Workload in Datacenters [Figure: Priority and Response axes, each ranging from High to Low, with Interactive (low-latency) and Batch workloads]

Datacenter OS: Resource Management, Scheduling

Hierarchical Scheduler (for Mesos) • Allow administrators to organize into groups • Provide resource guarantees per group • Share available resources (fairly) across groups • Research questions • Abstraction (when using multiple resources)? • How to implement using resource offers? • What policies are compatible at different levels in the hierarchy?
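
One way to picture fair sharing across and within groups is a two-level max-min split of a single resource. The sketch below is only an illustration under assumptions that are not in the slides: one resource type, known demands, equal weights, no per-group guarantees, and hypothetical names (`max_min_share`, `hierarchical_share`). It does not use Mesos resource offers.

```python
def max_min_share(capacity, demands):
    """Max-min fair split of one resource among children with known demands."""
    alloc = {}
    remaining = capacity
    pending = dict(demands)
    while pending:
        share = remaining / len(pending)
        # Children asking for no more than the equal share are fully satisfied.
        satisfied = {k: d for k, d in pending.items() if d <= share}
        if not satisfied:
            # Everyone left wants more than the equal share: split evenly.
            for k in pending:
                alloc[k] = share
            return alloc
        for k, d in satisfied.items():
            alloc[k] = d
            remaining -= d
            del pending[k]
    return alloc


def hierarchical_share(capacity, groups):
    """Share across groups first, then within each group (two levels only)."""
    group_demand = {g: sum(members.values()) for g, members in groups.items()}
    group_alloc = max_min_share(capacity, group_demand)
    return {g: max_min_share(group_alloc[g], members)
            for g, members in groups.items()}


# Hypothetical groups and demands, in arbitrary resource units.
groups = {"ads": {"a1": 60, "a2": 60}, "research": {"r1": 10}}
print(hierarchical_share(100, groups))
```

Here the unused part of the "research" share flows to "ads", which is the across-group sharing behavior named above. Extending this to multiple resource types is exactly the open abstraction question.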

Cross Application Resource Management • An app uses many services (e.g., file systems, key-value storage, databases, etc.) • If an app has high priority and the service it uses doesn’t, the app SLA (Service Level Agreement) might be violated • Research questions • Abstraction, e.g., resource delegation, priority propagation? • Clean-slate mechanisms vs. incremental deployability • This is also highly challenging in single-node OSes!

Resource Management using VMs • Most cluster resource managers use Linux containers (e.g., Mesos) • Thus, schedulers assume no task migration • Research questions: • Develop scheduler for VM environments (e.g., extend DRF) • Tradeoffs between migration, delay, and preemption

Task Granularity Selection (Yanpei Chen) • Problem: number of tasks per stage in today’s MapReduce apps is (highly) sub-optimal • Research question: • Derive algorithms to pick the number of tasks to optimize various performance metrics, e.g., • utilization, response time, network traffic • subject to various constraints, e.g., • capacity, network

Resource Revocation • Which task should we revoke/preempt? • Two questions • Which slot has least impact on the giving framework? • Is the slot acceptable to receiving framework? • Research questions • Identify feasible slot for receiving framework with least impact on giving framework • Light-weight protocol design

Control Plane Consistency Model • What type of consistency is “good-enough” for various control plane functions? • File system metadata (Hadoop) • Routing (Nicira) • Scheduling • Coordinated caching • … • Research question • What are the trade-offs between performance and consistency? • Develop generic framework for control plane

Decentralized vs. Centralized Scheduling • Decentralized schedulers • E.g., Mesos, Hadoop 2.0 • Delegate decision to apps (i.e., frameworks, jobs) • Advantages: scale and separation of concerns (i.e., apps know best where and which tasks to run) • Centralized schedulers • Know all app requirements • Advantages: optimal • Research challenge: • Evaluate centralized vs. decentralized schedulers • Characterize class of workloads for which decentralized scheduler is good enough

Opportunistic Scheduling • Goal: schedule interactive jobs (e.g., <100ms latency) • Existing schedulers: high overhead (e.g., Mesos needs to decide on every offer) • Research challenge: • Tradeoff between utilization and response time • Evaluate hybrid approach

Background: Dominant Resource Fairness • Implement fair (proportional) allocation for multiple types of resources • Key properties • Strategy proof: users cannot get an advantage by lying about their demands • Sharing incentives: users are incentivized to share a cluster rather than partitioning it
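
The allocation idea can be sketched with progressive filling: repeatedly launch one task for the user whose dominant share (its largest fractional use of any resource) is currently smallest. This is a minimal sketch, not the Mesos implementation. The function name `drf_allocate`, the tie-breaking by user name, the rule that a user drops out once its next task no longer fits, and the example capacities and demands are all assumptions made for illustration.

```python
from fractions import Fraction


def drf_allocate(capacity, demands):
    """Return the number of tasks launched per user under progressive filling.

    capacity: {resource: total amount}
    demands:  {user: {resource: amount needed per task}}
    """
    used = {r: 0 for r in capacity}
    tasks = {u: 0 for u in demands}
    active = set(demands)

    def dominant_share(user):
        # Largest fraction of any single resource held by this user.
        return max(Fraction(tasks[user] * demands[user][r], capacity[r])
                   for r in capacity)

    while active:
        # Pick the user with the smallest dominant share (ties: by name).
        user = min(sorted(active), key=dominant_share)
        need = demands[user]
        if all(used[r] + need[r] <= capacity[r] for r in capacity):
            for r in capacity:
                used[r] += need[r]
            tasks[user] += 1
        else:
            # Assumption: a user whose next task does not fit stops competing.
            active.discard(user)
    return tasks


# Illustrative numbers: 9 CPUs and 18 GB shared by two users.
capacity = {"cpu": 9, "mem": 18}
demands = {"A": {"cpu": 1, "mem": 4}, "B": {"cpu": 3, "mem": 1}}
print(drf_allocate(capacity, demands))
```

With these inputs user A is memory-dominant and user B is CPU-dominant, and both end with a dominant share of 2/3.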

DRF for Non-linear Resources/Demands • DRF assumes resources & demands are additive • E.g., task 1 needs (1CPU, 1GB) and task 2 needs (1CPU, 3GB) ⇒ both tasks together need (2CPU, 4GB) • Sometimes demands are non-linear • E.g., shared memory • Sometimes resources are non-linear • E.g., disk throughput, caches • Research challenge: • DRF-like scheduler for non-linear resources & demands (could be two projects here!)

DRF for OSes • DRF designed for clusters using resource offer mechanism • Redesign DRF to support multi-core OSes • Research questions: • Is resource offer best abstraction? • How to best leverage preemption? (in Mesos tasks are not preempted by default) • How to support gang scheduling?

Storage & Data Processing

Resource Isolation for Storage Services • Share storage (e.g., key-value store) between • Frontend, e.g., web services • Backend, e.g., analytics on freshest data • Research challenge • Isolation mechanism: protect front-end performance from back-end workload

“Quicksilver” DB • Goal: interactive queries with bounded error on “unbounded” data • Trade between efficiency and accuracy • Query response time target: < 100ms • Approach: random pre-sampling across different dimensions (columns) • Research question: given a query and an error bound, find • Smallest sample to compute result • Sample minimizing disk (or memory) access times • (Talk with Sameer, if interested)
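
To make the "smallest sample for a given error bound" question concrete, here is a rough sketch for one narrow case: estimating a column mean from pre-built random samples of increasing size. The normal-approximation confidence interval, the z value of 1.96, the function name `smallest_sample_for_mean`, and the synthetic data are assumptions for illustration, not the project's actual design.

```python
import math
import random
import statistics


def smallest_sample_for_mean(samples, error_bound, z=1.96):
    """Return (mean, half_width, size) for the smallest pre-built sample whose
    approximate confidence-interval half-width is within error_bound,
    or None if no sample is accurate enough."""
    for sample in sorted(samples, key=len):
        if len(sample) < 2:
            continue  # stdev needs at least two points
        mean = statistics.mean(sample)
        half_width = z * statistics.stdev(sample) / math.sqrt(len(sample))
        if half_width <= error_bound:
            return mean, half_width, len(sample)
    return None


# Synthetic column values and three pre-built samples (hypothetical sizes).
rng = random.Random(0)
population = [rng.gauss(50, 10) for _ in range(100_000)]
samples = [rng.sample(population, n) for n in (100, 1_000, 10_000)]
print(smallest_sample_for_mean(samples, error_bound=1.0))
```

A smaller sample means fewer disk or memory accesses, which is why the scan goes from the smallest sample upward and stops at the first one that meets the bound.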

Split-Privacy DB (1/2) [Figure: Private DB with f_private, Public DB with f_public, and the combined result] • Partition data & computation • Private • Public (stored on cloud) • Goal: use cloud without revealing the computation result • Example: • Operation f(x, y) = x + y, where • x: private • y: public • Pick random number a, and compute x’ = x + a • compute f(x’, y) = r’ = x’ + y • recover result: r = r’ – a = (x’ – a) + y = x + y
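
The additive example on this slide can be written out as a few lines of code. This is a minimal sketch: the function names (`blind`, `public_f`, `unblind`) and the 64-bit range for the random value a are assumptions, and the code makes no claim about how strong the resulting privacy is.

```python
import secrets


def blind(x):
    """Private side: pick a random a and return (x', a) with x' = x + a."""
    a = secrets.randbelow(2**64)  # assumed range for the random mask
    return x + a, a


def public_f(x_blinded, y):
    """Cloud side: compute f(x', y) = x' + y without ever seeing x."""
    return x_blinded + y


def unblind(r_blinded, a):
    """Private side: recover r = r' - a = x + y."""
    return r_blinded - a


x, y = 42, 100              # x is private, y is public
x_blinded, a = blind(x)
r_blinded = public_f(x_blinded, y)
print(unblind(r_blinded, a) == x + y)
```

The cloud only ever handles x’ and r’, while the mask a stays on the private side.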

Split-Privacy DB (2/2) [Figure: Private DB with f_private, Public DB with f_public, and the combined result] • Partition data & computation • Private • Public (stored on cloud) • Example: patient data (private), public clinical and genomics data sets • Goal: use cloud without revealing the computation result • Research questions: • What types of computation can be implemented? • Any more powerful than privacy-preserving computation / Data Mining?

RDDs as an OS Abstraction • Resilient Distributed Datasets (RDDs) • Fault-tolerant (in-memory) parallel data structures • Allow Spark apps to efficiently reuse data • Design cross-application RDDs • Research questions • RDD reconstruction (track software and platform changes) • Enable users to share intermediate results of queries (identify when two apps compute same RDD) • RDD cluster-wide caching

Provenance-based Efficient Storage (Peter B and Patrick W) • Reduce storage by deleting data that can be recreated • Generalization of previous project • Research challenges: • Identify data that can be deterministically recreated and the code to do so • Use hints? • Tradeoff between re-creation and storage • May take into account access pattern, frequency, performance

Very-low Latency Streaming • Challenge: stragglers, failures • Approaches to reduce latency: • Redundant computations • Speculative execution • Research questions • Theoretical trade-off between response time and accuracy? • Achieve target latency and accuracy, while minimizing the overhead
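
A toy simulation can show why redundant computation helps against stragglers and what it costs. Everything in this sketch is an assumption made for illustration: the latency model (10 to 20 ms normally, 10x slower for a straggler), the 5% straggler probability, and the function name `p99_latency`. With k replicas the result is ready when the fastest copy finishes, at roughly k times the compute cost.

```python
import random


def p99_latency(replicas, trials=10_000, straggler_prob=0.05, seed=0):
    """99th-percentile latency (ms) when each task runs `replicas` copies
    and the first copy to finish wins."""
    rng = random.Random(seed)

    def one_copy():
        base = rng.uniform(10, 20)            # normal latency in ms (assumed)
        if rng.random() < straggler_prob:     # occasional straggler
            return base * 10
        return base

    latencies = sorted(min(one_copy() for _ in range(replicas))
                       for _ in range(trials))
    return latencies[int(0.99 * trials)]


for k in (1, 2, 3):
    print(k, "replica(s): p99 =", round(p99_latency(k), 1), "ms")
```

The overhead side of the trade-off is the extra copies themselves, which is what the last research question asks to minimize for a given latency target.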
