ROAR: Increasing the Flexibility and Performance of Distributed Search
Costin Raiciu, University College London
Joint work with Felipe Huici, Mark Handley, David S. Rosenblum
We Rely on Distributed Search Every Day
• Distributed search apps
• Web search (Google, Bing, etc.)
• Online database search (Wikipedia, Amazon, eBay, etc.)
• More general: parallel databases
• Characteristics
• Data too big to fit on one server
• Latency too high if queries are run on one server
Distributed Search At Work [Barroso et al., 2003]
• N=6 servers
• The query is partitioned, P=3: each server stores 1/3 of the data
• Data is replicated, R=2
• P × R = N
• A frontend server spreads each query across the P partitions
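As a concrete illustration of the P × R = N layout, here is a minimal sketch (ours, not the paper's code; the server names and the `make_clusters`/`route_query` helpers are invented for the example):

```python
# Minimal sketch of the classic cluster-based layout (ours, not the
# paper's code): N servers form a P x R grid, so P * R = N must hold.
import random

def make_clusters(servers, p, r):
    """Split N servers into P partitions of R replicas each."""
    assert len(servers) == p * r, "cluster layout requires P * R = N"
    return [servers[i * r:(i + 1) * r] for i in range(p)]

def route_query(clusters):
    """The frontend picks one replica per partition, so the chosen
    servers together cover all of the data exactly once."""
    return [random.choice(replicas) for replicas in clusters]

servers = [f"s{i}" for i in range(6)]        # N = 6
clusters = make_clusters(servers, p=3, r=2)  # P = 3, R = 2
print(route_query(clusters))                 # e.g. ['s1', 's2', 's5']
```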
P Affects System Behavior
• It dictates how much data each node stores
• It impacts
• Query latency
• Overheads
Problem: P is difficult to change
Our contribution: a system that can change P efficiently at runtime
Partitioning Determines Latency and Cost
• [Figure: the work to be done for one query, spread across P=4 servers vs. P=2 servers]
Partitioning Dictates Latency
• [Figure: the lower bound on P needed to meet the delay target, at 6 q/s and at 10 q/s]
Partitioning Dictates Cost
• [Figure: query cost as a function of P]
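The plots are not reproduced here, but the trade-off they show can be captured with a toy model (our own illustration; the data size `D`, match rate `s`, and per-server overhead `c` are assumed numbers, not measurements from the paper): each of the P servers scans 1/P of the data in parallel, so latency falls with P, while the fixed per-server overhead is paid P times per query, so total cost rises.

```python
# Toy latency/cost model (our illustration; D, s and c are assumptions).
D = 1e9    # total data to match per query, bytes
s = 50e6   # per-server matching rate, bytes/second
c = 0.5    # fixed per-server overhead per query, seconds

def latency(p):
    """Each of the P servers scans D/P bytes in parallel."""
    return D / p / s + c

def cost(p):
    """Total server-seconds per query: the overhead is paid P times."""
    return p * latency(p)

for p in (2, 4, 8, 16):
    print(f"P={p:2d}  latency={latency(p):5.2f}s  cost={cost(p):5.2f} server-s")
```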
The Problem
• P is very difficult to change with existing solutions
• Google changes it out of necessity, when the web index outgrows memory
• Not changing it dynamically means
• The system is either inefficient, OR
• It misses the target delay for some workloads
How Google Changes P [Jeffrey Dean, Google, 2009]
• [Figure: queries keep flowing to the existing Clusters 1-2 while the repartitioned Clusters 1'-3' are built, then queries switch over]
• Requires over-provisioning
• Copies a lot of data
• Our estimate: 20TB/data center
Our Proposal: Rendez-Vous On A Ring (ROAR)
• Key observation: we do not need clusters to ensure each query meets all the data!
• Changes the way data are partitioned and replicated
• Allows on-the-fly reconfiguration with minimal bandwidth cost
Rendez-Vous On A Ring (ROAR)
• Uses consistent hashing [Karger et al.]
• Parameter P: number of query partitions
• [Figure: servers placed on a unit ring running from 0 to 1; one server's range spans, e.g., 0.4 to 0.5]
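A minimal consistent-hashing sketch, assuming MD5-derived ring positions and one position per server (the real implementation may differ; `h` and `owner` are our helper names):

```python
# Consistent hashing on the unit ring [0, 1): servers sit at hashed
# positions, and a server owns the range ending at its position.
import bisect, hashlib

def h(key):
    """Hash any key to a point on the unit ring [0, 1)."""
    d = hashlib.md5(str(key).encode()).digest()
    return int.from_bytes(d[:8], "big") / 2**64

def owner(points, x):
    """points: sorted (position, server) pairs. The first server at or
    after x owns the range that x falls into (wrapping at 1.0)."""
    i = bisect.bisect_left(points, (x,))
    return points[i % len(points)][1]

points = sorted((h(f"s{i}"), f"s{i}") for i in range(6))
print(owner(points, 0.42))   # the server whose range covers ring point 0.42
```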
ROAR: Storing Data (P=4)
• [Figure: objects are hashed to IDs that place them on the ring]
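Our reading of the placement scheme, as a sketch: an object hashes to a point x and is replicated on every server whose range intersects the arc [x, x + 1/P), so a query point anywhere in that arc finds a copy. The `replicas` helper is our own name:

```python
# ROAR-style placement sketch (our reading of the scheme): replicate each
# object over the arc [h(object), h(object) + 1/P) of the unit ring.
import bisect, hashlib

def h(key):
    d = hashlib.md5(str(key).encode()).digest()
    return int.from_bytes(d[:8], "big") / 2**64

def replicas(points, start, length):
    """Walk clockwise from the owner of `start`, collecting every server
    whose range intersects the arc [start, start + length)."""
    n = len(points)
    i = bisect.bisect_left(points, (start,)) % n
    out = [points[i][1]]
    while (points[i][0] - start) % 1.0 < length:  # next range starts in arc
        i = (i + 1) % n
        if points[i][1] in out:                   # wrapped around the ring
            break
        out.append(points[i][1])
    return out

points = sorted((h(f"s{i}"), f"s{i}") for i in range(6))
print(replicas(points, h("object-17"), 1.0 / 4))  # P = 4: arc length 1/4
```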
ROAR: Running Queries (P=4)
• [Figure: the query starts at a chosen ring point ("start here") and is spread over P=4 servers]
ROAR Can Run Queries at Higher P
• Data is stored with P=4; the query runs with PQ=5
• [Figure: object A is matched twice, object B once; every object is matched at least once]
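A query-side sketch under the same assumptions: pick a random start t and send the query to the owners of PQ evenly spaced ring points (with PQ = P this is the basic query of the previous slide). Any PQ >= P works, because an object's arc has length 1/P >= 1/PQ and therefore contains at least one query point; some arcs contain two, which is why objects can be matched twice.

```python
# Query routing sketch (our reading of the slides): PQ query points
# spaced 1/PQ apart always intersect every replication arc of length 1/P
# as long as PQ >= P.
import bisect, hashlib, random

def h(key):
    d = hashlib.md5(str(key).encode()).digest()
    return int.from_bytes(d[:8], "big") / 2**64

def owner(points, x):
    i = bisect.bisect_left(points, (x,))
    return points[i % len(points)][1]

def run_query(points, pq):
    """Send the query to the owners of PQ evenly spaced ring points."""
    t = random.random()
    return [owner(points, (t + k / pq) % 1.0) for k in range(pq)]

points = sorted((h(f"s{i}"), f"s{i}") for i in range(20))
print(run_query(points, pq=5))   # data stored at P=4, query run at PQ=5
```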
ROAR Copies Zero Data to Increase P
• Starting state: P=4; delay is higher than target => change P to 5
• The frontend tells each server: set P=5
• [Figure: the objects stay in place on the ring; no data is copied]
Minimal Data is Copied When P Decreases
• Starting state: P=5; delay is lower than target => change P to 4
• The frontend tells servers to switch to P=4
• It starts using P=4 once the servers finish loading the extra replicas
• Copying happens while latency is low
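The repartitioning rule follows directly from the arc geometry (this sketch reflects our reading of the slides; `extra_arc` is an invented helper): raising P only shrinks arcs, so nothing is copied, while lowering P grows each arc by 1/P_new - 1/P_old, and only that sliver must be replicated.

```python
def extra_arc(x, p_old, p_new):
    """Data to copy for the object hashed at x when P changes: None for
    an increase (arcs shrink, zero-copy); otherwise the (start, length)
    sliver appended to the end of the old arc."""
    if p_new >= p_old:
        return None
    start = (x + 1.0 / p_old) % 1.0       # where the old arc ended
    length = 1.0 / p_new - 1.0 / p_old    # how much the arc grows
    return (start, length)

print(extra_arc(0.30, p_old=4, p_new=5))  # None: increasing P copies nothing
print(extra_arc(0.30, p_old=5, p_new=4))  # (0.5, 0.05): only a small sliver
```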
ROAR Tolerates Faults
• [Figure: data stored with P=4, query run with PQ=8; one server has failed (X), and the query routes around it]
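A quick coverage check for the failure case (our illustration; the `covered` helper is invented): with P=4 and PQ=8, each object's arc contains two query points, so losing the points owned by one failed server, assuming its range is narrower than the 1/PQ point spacing, still leaves every object reachable.

```python
# Coverage check (our illustration): every object's arc of length 1/P
# contains two of the PQ=2*P query points, so at most one of them can
# fall inside a narrow failed server's range.
import random

def covered(x, t, p, pq, dead_lo, dead_hi):
    """True if the object hashed to x (arc [x, x + 1/p)) is hit by at
    least one query point t + k/pq outside the dead server's range."""
    for k in range(pq):
        q = (t + k / pq) % 1.0
        if (q - x) % 1.0 < 1.0 / p and not (dead_lo <= q < dead_hi):
            return True
    return False

P, PQ = 4, 8
dead_lo, dead_hi = 0.30, 0.34   # failed server's range, narrower than 1/PQ
t = random.random()
assert all(covered(random.random(), t, P, PQ, dead_lo, dead_hi)
           for _ in range(10_000))
print("every object is still matched despite the failure")
```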
Experimental Evaluation
• We have implemented ROAR
• Tested
• Extensively on ~50 servers in UCL's HEN testbed
• Briefly on 1000 servers in Amazon EC2
Application
• Privacy Preserving Search (PPS)
• Servers match encrypted queries against encrypted data
• PPS is CPU-bound
• Applications with different bottlenecks should have qualitatively similar behavior
Can ROAR Repartition Dynamically?
• Workload
• Index of 1M files
• Generate 4 random queries / second
• Target average delay: 1s
• The frontend server changes P dynamically based on the average delay (see the control-loop sketch below)
• Start the network with P=40
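A toy version of the frontend's adaptation loop (our illustration; the paper's actual policy may differ, and `adapt_p` with its `slack` parameter is invented):

```python
# Toy control loop: raise P when the measured average delay exceeds the
# target, lower it when there is comfortable slack.
def adapt_p(p, avg_delay, target=1.0, slack=0.8):
    if avg_delay > target:
        return p + 1            # less data per server -> lower delay
    if avg_delay < slack * target:
        return max(1, p - 1)    # fewer, larger partitions -> cheaper queries
    return p

print(adapt_p(40, avg_delay=0.3))   # over-partitioned: step P down
```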
ROAR Changes P Efficiently
• [Figure: the frontend changes P to 5 and later to 10; query delays stay stable during both changes]
Other experiments in the paper
• Fault tolerance
• Load balancing
• Energy savings
• Scaling: 1000 servers on Amazon EC2
• Unexpected delay variation caused by a high packet loss rate
Conclusion
• Today's cluster-based distributed search is rigid
• It locks the system to specific values of P
• When load exceeds expectations: the target delay is missed
• When load undershoots: resources are wasted
• Changing P is costly
• We don't have to accept a fixed operating point
• ROAR dynamically adapts P to fluctuations in load
• With minimal resources
• Without disrupting queries
• It tolerates faults and balances load
ROAR Scales to 1000 Servers on Amazon EC2
• Frontend overhead: 25ms to schedule a query on 1000 servers
• Matching delay at each server decreases as expected
• Unexpected problem: huge variation of end-to-end query delay
Does ROAR Tolerate Failures?
• Experiment
• Set P=20 (R ~ 2)
• Generate 6 queries/second
• Kill one server
• Measure query delay, plus the load on the neighbors and on the rest of the servers
• Expect
• No disruption to queries
• Load to increase by 10% on the neighbors
ROAR Load Balancing: Fast Solution
• Experiment
• 43 servers, of which 15 are more powerful (8x faster)
• Equal ranges, 3 queries/sec
• Use PQ > P to dynamically reduce query delay
• Faster servers get more work, implicitly balancing load (see the simulation sketch below)
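Why finer query partitioning balances heterogeneous servers, as a small simulation (ours; the greedy earliest-finish assignment and the `makespan` helper are assumptions, not the paper's scheduler): cutting the query into many slices lets the 8x-faster servers simply complete more slices, shrinking the overall completion time.

```python
# Load-balancing intuition: equal-size slices assigned greedily to the
# server that will finish earliest; finer slicing (PQ >> P) evens out
# server heterogeneity.
import heapq

def makespan(speeds, slices, work=1.0):
    """Query completion time when `work` is split into equal slices and
    each slice goes to the server with the earliest finish time."""
    heap = [(0.0, s) for s in range(len(speeds))]
    heapq.heapify(heap)
    for _ in range(slices):
        t, s = heapq.heappop(heap)
        heapq.heappush(heap, (t + (work / slices) / speeds[s], s))
    return max(t for t, _ in heap)

speeds = [8.0] * 15 + [1.0] * 28      # 15 fast (8x) + 28 slow = 43 servers
print(makespan(speeds, slices=43))    # coarse: one slice per server
print(makespan(speeds, slices=430))   # fine slicing: much lower delay
```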