LifeRaft: Data-Driven, Batch Processing for the Exploration of Scientific Databases

BETTER LUCK NEXT TIME!

Problem (figure: queries Q1, Q2, Q3, and Q4)

Goals
• Eliminate redundant I/O to improve query throughput
• Batch queries that exhibit data sharing
• Pre-process queries to identify data sharing
• Co-schedule queries that access the same data
• Access contentious data first to maximize sharing
• Starvation resistance
• Avoid indefinite queuing times (response time)
• Enforce some constraints on completion order

Target Applications
• Data-intensive scan queries
• Executed against a clustered index
• Clustered and federated databases (e.g., joins that correlate multiple nodes)
• Peta-scale astronomy (Pan-STARRS)
• Data are partitioned spatially
• Many queries scan the full DB and last hours or days
• Cross-match
• Probabilistic spatial join across multiple databases

Filter and Refine
• Filter queries
• Pre-process queries to determine join buckets
• Buckets B1,…,Bn and queries Q1,…,Qm
• Workload Wij denotes the objects from Qi that overlap Bj
• Refinement
• Read buckets one at a time
• Sort-merge join (sort by HTM ID)
• Query-specific predicates applied to output tuples
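The filter step above can be sketched as a grouping pass that builds the workloads Wij. This is a minimal illustration, not the LifeRaft implementation: the function names, the integer object IDs, and the rule that each bucket covers a contiguous range of 100 IDs are all assumptions made for the example.

```python
from collections import defaultdict


def build_workloads(queries, bucket_of):
    """Group each query's objects by the bucket they overlap.

    Returns {bucket_id: {query_id: [objects]}}, i.e. one W_ij per
    (bucket, query) pair that shares at least one object.
    """
    workloads = defaultdict(lambda: defaultdict(list))
    for query_id, objects in queries.items():
        for obj in objects:
            workloads[bucket_of(obj)][query_id].append(obj)
    return workloads


# Hypothetical partitioning: objects are integer IDs (standing in for
# HTM IDs) and each bucket covers a contiguous range of 100 IDs.
def bucket_of(object_id):
    return object_id // 100


queries = {"Q1": [5, 120, 130], "Q2": [110, 250]}
workloads = build_workloads(queries, bucket_of)

for bucket_id in sorted(workloads):
    print(bucket_id, dict(workloads[bucket_id]))
```

In this toy input, bucket 1 is the only bucket that both queries touch, so reading it once can serve both workloads during refinement.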

Workload Throughput Metric
• Schedule buckets greedily in order of decreasing workload throughput
• Exploits data regions that experience contention
• May starve requests
• Favors buckets experiencing frequent reuse
• No guarantee that a particular bucket or query receives service
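A greedy scheduler of this kind can be sketched as follows. The slides do not give the metric's formula, so the definition used here (pending objects divided by an estimated cost of bucket I/O plus per-object join work) is an assumption, as are the function names and cost constants.

```python
def workload_throughput(pending_objects, cached,
                        io_cost=1.0, join_cost_per_object=0.001):
    """Assumed metric: objects served per second of estimated work.

    A cached bucket skips the I/O cost, so it scores higher than an
    uncached bucket with the same amount of pending work.
    """
    cost = (0.0 if cached else io_cost) + join_cost_per_object * pending_objects
    return pending_objects / cost if cost > 0 else 0.0


def pick_next_bucket(pending, cache):
    """Greedy choice: the bucket with the highest workload throughput.

    pending maps bucket_id -> number of queued objects that overlap it.
    """
    return max(pending, key=lambda b: workload_throughput(pending[b], b in cache))


pending = {"B1": 40, "B2": 300, "B3": 5}
print(pick_next_bucket(pending, cache=set()))   # most contended bucket
print(pick_next_bucket(pending, cache={"B1"}))  # cached bucket wins
```

Under this rule a lightly used bucket such as B3 is served only when nothing else scores higher, which is the starvation risk the slide points out.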

Aged Workload Throughput Metric
• Inspired by disk-drive head scheduling
• Balance arrival order (low response time) with contention (high throughput)
• Adaptive trade-offs based on workload saturation
• Maximize the rate at which objects are joined during saturated workloads
• Enforce completion order (queuing times) to prevent indefinite starvation during low saturation
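One way to picture the trade-off is a score that blends contention and age with a single bias parameter, where bias 1 means purely age-based and bias 0 means purely contention-based scheduling. The linear blend and the normalization below are assumptions made for illustration; the slides do not state the exact formula.

```python
def aged_score(throughput, age, bias, max_throughput, max_age):
    """Assumed blend of normalized throughput and normalized age.

    bias = 0 ranks buckets by contention only; bias = 1 ranks them by
    the age of their oldest pending request only.
    """
    t = throughput / max_throughput if max_throughput else 0.0
    a = age / max_age if max_age else 0.0
    return (1.0 - bias) * t + bias * a


def pick_by_aged_throughput(buckets, bias):
    """buckets maps bucket_id -> (workload throughput, oldest request age in sec)."""
    max_t = max(t for t, _ in buckets.values())
    max_a = max(a for _, a in buckets.values())
    return max(buckets, key=lambda b: aged_score(*buckets[b], bias, max_t, max_a))


# Hypothetical state: B1 holds an old request, B2 is heavily contended.
buckets = {"B1": (10.0, 30.0), "B2": (200.0, 2.0)}
print(pick_by_aged_throughput(buckets, bias=0.0))  # contention decides
print(pick_by_aged_throughput(buckets, bias=1.0))  # age decides
```

Intermediate bias values move the choice between the two extremes, which is the knob that would be tuned against workload saturation.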

Scheduling Behavior
Sub-divide queries by bucket:
• Qi – Qi1, Qi2, Qi3
• Qj – Qj3, Qj4, Qj5, Qj6, Qj7, Qj8
• Qj – Qj5, Qj6, Qj7, Qj8
Assumptions:
• Inter-query time of 1 sec
• I/O for each bucket of 1 sec
• Cache size of 2
• Join cost is negligible

Arrival order with no sharing
(figure: timeline of query arrivals, sub-query execution, and bucket reads)
Completion times: Qi – 3 sec, Qj – 8 sec, Qk – 13 sec
Avg – 8 sec; Tp – .2 qry/sec

Age based scheduling (bias 1)
(figure: timeline of query arrivals, co-scheduled sub-queries, and bucket reads)
Completion times: Qi – 3 sec, Qj – 7 sec, Qk – 7 sec
Avg – 5.6 sec; Tp – .33 qry/sec

Contention based scheduling (bias 0)
(figure: timeline of query arrivals, co-scheduled sub-queries, and bucket reads)
Completion times: Qi – 7 sec, Qj – 5 sec, Qk – 6 sec
Avg – 6 sec (5.6 with bias 1); Tp – .38 qry/sec (.33 with bias 1)
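The averages and throughput figures for the three schedules can be checked with a few lines of arithmetic. The sketch below rests on two assumptions that the slides do not state outright: the listed completion times are measured from each query's arrival, and the queries arrive at 0, 1, and 2 sec (from the 1 sec inter-query time). Throughput is then taken as the number of queries divided by the time at which the last query finishes.

```python
# Assumed arrival times, derived from the 1 sec inter-query time.
ARRIVALS = {"Qi": 0, "Qj": 1, "Qk": 2}


def summarize(response_times):
    """Return (average response time, throughput in queries per second)."""
    avg = sum(response_times.values()) / len(response_times)
    # The last finish time is arrival + response time, maximized over queries.
    makespan = max(ARRIVALS[q] + r for q, r in response_times.items())
    return avg, len(response_times) / makespan


schedules = {
    "arrival order, no sharing": {"Qi": 3, "Qj": 8, "Qk": 13},
    "age based (bias 1)": {"Qi": 3, "Qj": 7, "Qk": 7},
    "contention based (bias 0)": {"Qi": 7, "Qj": 5, "Qk": 6},
}

for name, times in schedules.items():
    avg, tp = summarize(times)
    print(f"{name}: avg {avg:.2f} sec, throughput {tp:.2f} qry/sec")
```

Under these assumptions the three schedules finish at 15, 9, and 8 sec, which gives throughputs of 3/15, 3/9, and 3/8 qry/sec, in line with the .2, .33, and .38 shown on the slides.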

Throughput Performance

Tuning the age bias
• Throughput performance gap grows while response time gap is insensitive to saturation
• Increasing age bias is more attractive at low saturation

Parameter tuning using trade-off curves

Discussion
• Impact of caching strategies
• Workload overflow
• Large intermediate join results
• Migrate pairs of workload and bucket
• Beyond completion order
• Higher priority for interactive queries
• Batch processing in a clustered environment
P. Agrawal, D. Kifer, and C. Olston. Scheduling Shared Scans of Large Data Files. In VLDB, 2008.

WHAT ABOUT US?

Filter and refine • Partition data into buckets

Average Response Time

Outline
• Motivation
• Goals for data-driven, batch scheduling
• Target application (SkyQuery)
• LifeRaft scheduler
• Filter and refine queries
• Throughput maximizing metric
• Starvation resistance
• Differences in outcomes
• Workload adaptive parameter selection
