1 / 24

Processing Data Intensive Queries in Scientific Database Federations

Processing Data Intensive Queries in Scientific Database Federations. Problem. Data avalanche in scientific databases Exponential growth in data size (Pan-STARRS) Accumulation of data at multiple data sources (clustered and federated databases) Exploring massive, widely distributed data

mulan
Télécharger la présentation

Processing Data Intensive Queries in Scientific Database Federations

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Processing Data Intensive Queries in Scientific Database Federations

  2. Problem • Data avalanche in scientific databases • Exponential growth in data size (Pan-STARRS) • Accumulation of data at multiple data sources (clustered and federated databases) • Exploring massive, widely distributed data • Joins to find correlations across multiple databases • Queries are data intensive: large transfers over the network, and scan large portions of the data • Query throughput limits scale of exploration • To improve overall query throughput but potentially sacrifice performance of individual queries

  3. Target Application • SkyQuery Federation of Astronomy Databases • Dozens of multi-terabyte databases across three Continents • Queries that perform full db scans lasting hours or days • Intermediate join results that are hundreds of MBs • Scalability concerns both in data size and number of sites • Cross-match • Probabilistic spatial join across multiple databases • Join results are accumulated, shipped from site to site, and delivered to scientists

  4. Cross-Match Workload • A forward looking analysis shows that network dominates 90% of performance • A quarter of the cross-match queries execute for minutes to several hours

  5. Incorporating Network Structure

  6. Network-Aware Join Processing • Capture heterogeneity in global-scale federations • Metric to exploit high throughput paths • Decentralized, local optimizations using aggregate stats • Routing at the application layer • Two-approximate, MST-based solution with extensions that employ semi-joins and explore bushy plans • Clustering to explore trade-offs with computation cost • Over a ten-fold reduction in network utilization for large joins

  7. A Case for Batch Processing • Top ten buckets accessed by 61% of queries and reuse occur close temporally • 2% of buckets capture more than half of the workload and should be cached

  8. LifeRaft: Data-Driven Batch Proc. • Eliminate redundant I/O to improve query throughput • Batch queries with that exhibit data sharing • Pre-process queries to identify data sharing • Co-schedule queries that access the same data • Access contentious data first to maximize sharing • Improves performance by two-fold

  9. Discussion • Cache replacement for LifeRaft • Benefits contentions data regions that experience reuse (Cache hit for LifeRaft is 40% compared with 7% for arrival order processing) • Evaluate strategies that exploit I/O behavior of batch workloads (segmented strategy) • Buffering and workload overflow • Large intermediate join results • Migrate pairs of workload and bucket • Better support for interactive queries • Short and selective queries that focus on small region • Indefinite queuing times in presence of batch workloads

  10. Discussion (cont.) • Batch processing in a distributed environment • Network-aware scheduling does not consider computation cost • Batch processing for a single system environment • Federating LifeRaft • Coordinate exec. of query that join multiple DBs • Batch proc. requires databases to buffer results • Maximize overall batch size while alleviating memory used for buffering and network cost

  11. Exploring Alt. Join Schedules

  12. Discussion (cont.) • Explore both join schedules and opportunities for batching simultaneously • Bushy and semi-join plans increase computation while clustering decrease computation • Skew in join workload (ie. sites close to end user) • Quantify trade-offs with computation cost (ie. number of buckets in batch processing) • Users submit cross-match queries in batches • Applying LifeRaft to other data-intensive, temporal-spatial data such as Turbulence database

  13. Supplementary Slides

  14. Cross-Match Queries • Join by increasing cardinality (count *) • Minimal I/O • Fewer bytes on the network Mediator Query Probe Query Result Result Result Count: 800 Count: 100 Count: 30

  15. Spanning Tree Approximation (STA) min G H A F B E C D mediator

  16. STA: Find MST min G H A F B E C D mediator

  17. STA: Join Using Paths on the MST min 7 G H 6 A 9 4 F 10 5 1 8 B E 2 3 13 12 C 11 D mediator

  18. Filter and refine • Partition data into buckets

  19. Scheduling Behavior Qi Qj Qk Qk Sub-divide queries by bucket: • Assumptions: • Inter-query time of 1 sec • I/O for each bucket of 1 sec • Cache size of 2 • Join cost is negligible Qi – Qi1, Qi2, Qi3 Qj – Qj3, Qj4, Qj5, Qj6 , Qj7, Qj8 Qj – Qj5, Qj6 , Qj7, Qj8

  20. Qi Qj Qk Qk Qi Arr Qj Arr Qk Arr Qk End Qi End Qj End Qi1 Qi2 Qi3 Qk8 Qj7 Qj1 Qj6 Qj8 Qk1 Qj3 Qk4 Qj4 B1 B2 B3 B7 B1 B1 B3 B6 B4 B8 B4 B8 Arrival order with no sharing … Completion Times: Qi – 3 sec Qj – 8 sec Qk – 13 sec Avg – 8 sec Tp – .2 qry/sec

  21. Qi Qj Qk Qk Qi Arr Qj Arr Qk Arr Qi End Qj End Qk End Qi1 Qi2 Qi5 Qj4Qk4 Qj7Qk7 Qj6Qk6 Qj1Qk1 Qi3Qj3 Qj8Qk8 B1 B2 B5 B3 B1 B4 B7 B8 B6 Age based scheduling (bias 1) Completion Times: Qi – 3 sec Qj – 7 sec Qk – 7 sec Avg – 5.6 sec Tp – .33 qry/sec

  22. Qi Qj Qk Qk Qi Arr Qj Arr Qk Arr Qi End Qj End Qk End Qi1 Qi2 Qi3Qj3 Qk5 Qj6Qk6 Qj7Qk7 Qj8Qk8 Qj1Qk1Qj4Qk4 B1 B2 B5 B3 B7 B8 B1 B4 B6 Contention based scheduling (bias 0) Completion Times: Qi – 7 sec Qj – 5 sec Qk – 6 sec Avg – 6 sec Tp – .38 qry/sec (5.6) (.33)

  23. Parameter tuning using trade-off curves

  24. Tuning theage bias • Throughput performance gap grows while response time gap is insensitive to saturation • Increasing age bias is more attractive at low saturation

More Related