Network-Aware Join Processing in Global-Scale Database Federations

X. Wang, R. Burns, A. Terzis (Johns Hopkins University)
A. Deshpande (University of Maryland)

Time/Cost Trade-off for Reaching Isla Mujeres
[Figure: route map from "You are here" to Isla Mujeres, showing Downtown, Puerto Juarez, and Playa Tortuga. One route totals 55 min, 107 pesos (legs of 5 min/30p, 30 min/70p, 20 min/7p); the other totals 40 min, 157 pesos (legs of 35 min/150p, 5 min/7p).]

Outline
• Target Application
• Join scheduling in SkyQuery
• Incorporating network structure
• Balanced network utilization metric
• Exploit high throughput paths
• Limitations
• Algorithms
• Two-approximate, MST-based solution
• Heuristic extensions (clustering, semi-joins, bushy plans)

SkyQuery
• Publicly accessible federation of sky surveys (a virtual telescope)
• Autonomous, heterogeneous, and geographically distributed sites (30 across NA, EA, EU)
• Data intensive workload
• Terabyte data sets
• Hundred megabyte intermediate join results
• Queries take ten to over a hundred seconds
• Network transfers consume up to 70% of the time
• Principal federated query is cross-match

Cross-Match Queries
• Join by increasing cardinality (count *)
• Minimal I/O
• Fewer bytes on the network
[Figure: mediator, query, probe query, and results across three sites with Count: 800, Count: 100, and Count: 30]
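The count * ordering on this slide can be sketched as a sort over the cardinalities returned by the probe queries. This is a minimal illustration, not SkyQuery's actual code: the function name `count_star_order` and the site names are hypothetical, and only the counts 800, 100, and 30 come from the slide.

```python
def count_star_order(counts):
    """Return site names ordered by increasing probe-query cardinality.

    `counts` maps a site name to the count(*) its probe query returned.
    Ties are broken by site name so the order is deterministic.
    """
    return sorted(counts, key=lambda site: (counts[site], site))


# Hypothetical site names; the counts are the ones shown on the slide.
probe_counts = {"site_1": 800, "site_2": 100, "site_3": 30}
join_order = count_star_order(probe_counts)
```

With these counts the join starts at the site that reported 30 and ends at the site that reported 800, so the smallest candidate set is the one shipped first.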

Incorporating Network Structure

Balanced Network Utilization Metric
• Exploits excess capacity and avoids long-haul paths
• Minimizes aggregate time on the network
• Similar metrics are used for stream processing, multicast, and optimal link-layer routing (Bertsekas & Gallager)
• Minimizes response time for serial schedules
• Avoids over-utilizing resources for bushy schedules
• Does not account for I/O
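The slide does not give the metric's formula. One plausible reading of "aggregate time on the network" is the sum, over all transfers in a schedule, of bytes sent divided by the throughput of the path used. The sketch below assumes that reading; the function name, the site names, and all numbers are hypothetical.

```python
def aggregate_network_time(transfers, throughput):
    """Sum of transfer times (seconds) over every hop in a schedule.

    `transfers` is a list of (source, destination, bytes) tuples.
    `throughput` maps (source, destination) to bytes per second.
    This is an assumed form of the metric, not one stated on the slide.
    """
    total = 0.0
    for src, dst, nbytes in transfers:
        total += nbytes / throughput[(src, dst)]
    return total


# Hypothetical two-hop serial schedule ending at the mediator.
schedule = [("A", "B", 100e6), ("B", "mediator", 50e6)]
paths = {("A", "B"): 10e6, ("B", "mediator"): 25e6}
cost = aggregate_network_time(schedule, paths)  # 10 s + 2 s
```

For a serial schedule the hops run one after another, so under this reading the sum is also the network part of the response time, which matches the slide's claim for serial schedules.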

How to Extract Network Structure?

Volatility in TCP Throughput

Limitations
• Perfect join selectivity assumption
• Observations against the same sky
• Allows for polynomial-time solutions
• No attribute aggregation
• Address heuristically
• Local optimizations at the mediators
• Decentralized to achieve scale using aggregate stats
• Routing at the application layer
• Improve end performance and preserve I/O

Spanning Tree Approximation (STA)
[Figure: graph of sites A through H and the mediator; one site is labeled "min"]

STA: Find MST
[Figure: graph of sites A through H and the mediator with the minimum spanning tree; one site is labeled "min"]
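The MST step can be sketched with Prim's algorithm over a complete graph of sites. The slide does not say which algorithm or edge weight the authors use, so both are assumptions here: the sketch accepts any symmetric cost function, and the link costs in the example are made up.

```python
import heapq


def minimum_spanning_tree(nodes, weight, root):
    """Prim's algorithm on a complete graph.

    Returns a list of (parent, child, weight) tree edges in the order
    they were added, starting from `root`.
    """
    in_tree = {root}
    tree_edges = []
    heap = [(weight(root, v), root, v) for v in nodes if v != root]
    heapq.heapify(heap)
    while heap and len(in_tree) < len(nodes):
        w, u, v = heapq.heappop(heap)
        if v in in_tree:
            continue  # stale entry: v was already reached by a cheaper edge
        in_tree.add(v)
        tree_edges.append((u, v, w))
        for x in nodes:
            if x not in in_tree:
                heapq.heappush(heap, (weight(v, x), v, x))
    return tree_edges


# Hypothetical symmetric link costs between a mediator and three sites.
COST = {
    frozenset(("mediator", "A")): 4,
    frozenset(("mediator", "B")): 1,
    frozenset(("mediator", "C")): 5,
    frozenset(("A", "B")): 2,
    frozenset(("A", "C")): 6,
    frozenset(("B", "C")): 3,
}


def link_cost(u, v):
    return COST[frozenset((u, v))]


tree = minimum_spanning_tree(["mediator", "A", "B", "C"], link_cost, "mediator")
```

In this example the tree avoids the expensive mediator-to-A, mediator-to-C, and A-to-C links and routes everything through B.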

STA: Join Using Paths on the MST
[Figure: graph of sites A through H and the mediator, with hops along MST paths numbered 1 to 13, starting at the site labeled "min"]
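A walk of this kind can be sketched as a depth-first traversal of the tree that starts at one site, ends at the mediator, and crosses each edge on the start-to-mediator path once and every other edge twice. This is the textbook tree walk behind two-approximate tour constructions, offered as an assumption about what the slide depicts; the tree and all names below are hypothetical.

```python
def tree_walk(tree, start, end):
    """Visit every node of `tree` (adjacency dict), from `start` to `end`.

    Edges on the start-to-end path are crossed once; all others twice.
    Returns the sequence of nodes visited, with repeats.
    """
    def find_path(node, parent):
        if node == end:
            return [node]
        for nxt in tree[node]:
            if nxt != parent:
                tail = find_path(nxt, node)
                if tail:
                    return [node] + tail
        return None

    spine = find_path(start, None)
    on_spine = set(spine)
    walk = []

    def tour(node, parent):
        # Walk the subtree hanging off `node`, coming back to `node`
        # after each side branch.
        walk.append(node)
        for nxt in sorted(tree[node]):
            if nxt != parent and nxt not in on_spine:
                tour(nxt, node)
                walk.append(node)

    for node in spine:
        tour(node, None)
    return walk


# Hypothetical tree: A-B, B-C, B-D, D-M, where M stands for the mediator.
example_tree = {
    "A": ["B"],
    "B": ["A", "C", "D"],
    "C": ["B"],
    "D": ["B", "M"],
    "M": ["D"],
}
hops = tree_walk(example_tree, "A", "M")
```

A tree with e edges and a start-to-mediator path of p edges gives 2e − p hops. The slide's figure numbers 13 hops, which would be consistent with 8 tree edges and a 3-edge path, though the figure's exact tree is not recoverable from the transcript.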

STA: Shortcutting in Metric Regions
[Figure: graph of sites A through H and the mediator, with hops numbered 1 to 10 after shortcutting]
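Shortcutting can be sketched as dropping repeat visits from a tree walk, so the schedule jumps directly to the next unvisited site. Skipping a revisit cannot increase cost where the triangle inequality holds, which is presumably why the slide limits it to metric regions. The sketch below assumes the whole walk lies in one such region and does not model the restriction; the function name and walks are hypothetical.

```python
def shortcut(walk):
    """Remove repeat visits from a walk while keeping its final node last.

    `walk` is a list of nodes with repeats, ending at the mediator.
    Assumes the triangle inequality holds for every hop that is skipped.
    """
    end = walk[-1]
    seen = {end}  # reserve the end node so it is emitted only once, last
    order = []
    for node in walk:
        if node not in seen:
            seen.add(node)
            order.append(node)
    order.append(end)
    return order


# Hypothetical walk over a small tree, ending at mediator "M".
direct = shortcut(["A", "B", "C", "B", "D", "M"])
```

Here the return hop from C to B is replaced by a direct hop from C to D, shortening the schedule from five hops to four.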

C-STA: Clustering TCP Throughput

C-STA: Combine STA & Count *
[Figure: graph of sites A through H and the mediator, with hops numbered 1 to 9]

STA-SJ: Semi-joins and Attribute Aggregation
[Figure: graph of sites A through H and the mediator, with hops numbered 1 to 13 and a label "Join Attr. Aggregation"]

STA-BP: Exploring Bushy Plans
• Poly-time DP algorithm that explores bushy plans using MST paths
• Evaluates regions in parallel when beneficial (avoids sending data down the tree)
• May operate on larger intermediate results
• Intuition: Do not need to traverse STA paths twice if sites have low cardinality
[Figure labels: R, R, ≤ 2R, > 2R]

Experiments: Network Utilization

Experiments: I/O Overhead

Experiments: Algorithms Compared

Discussion
• DP solution without the selectivity, aggregation, and MST-based assumptions: time O(n·3^n), space O(n·2^n)
• Applicability beyond SkyQuery (distributed OLAP/DSS)
• May tolerate exponential complexity
• Value in capturing network structure
• Does not address multi-query optimization
• Incomplete info about link layer
• Global knowledge incurs high overhead
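The 3^n factor in the time bound is what a dynamic program over subsets of sites would incur if, for each subset, it tried every way of splitting off a sub-subset: the number of (subset, sub-subset) pairs over n sites is exactly 3^n. Whether the DP mentioned on the slide has this structure is an assumption; the sketch below only checks the count, using the standard submask enumeration trick.

```python
def count_subset_splits(n):
    """Count (subset, sub-subset) pairs over n sites by enumeration.

    Subsets are bitmasks; `(sub - 1) & mask` steps through every
    submask of `mask` in decreasing order, ending at the empty set.
    """
    total = 0
    for mask in range(1 << n):
        sub = mask
        while True:
            total += 1
            if sub == 0:
                break
            sub = (sub - 1) & mask
    return total


pairs_for_4_sites = count_subset_splits(4)
```

A subset with k sites has 2^k sub-subsets, and summing C(n, k)·2^k over k gives 3^n by the binomial theorem, so four sites yield 81 pairs.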

Which Path to Choose?
[Figure: route map from "You are here" to Isla Mujeres, showing Downtown, Puerto Juarez, and Playa Tortuga, with one route totaling 55 min, 107 pesos (legs of 5 min/30p, 30 min/70p, 20 min/7p)]

Questions ???
