
Servicing Range Queries on Multidimensional Datasets with Partial Replicas


Presentation Transcript


  1. Servicing Range Queries on Multidimensional Datasets with Partial Replicas Li Weng, Umit Catalyurek, Tahsin Kurc, Gagan Agrawal, Joel Saltz

  2. Outline • Introduction • Motivation • Partial Replicas Considered • System overview • Query execution and algorithm design • Computing goodness value • Replica selection algorithm • Experimental results • Related work • Conclusions CCGRID 2005

  3. Motivating Applications • [Figures: Oil Reservoir Management; Magnetic Resonance Imaging] • Data-driven applications from science, engineering, and biomedicine: • Oil reservoir management • Water contamination studies • Cancer studies using MRI • Telepathology with digitized slides • Satellite data processing • Virtual microscope • We use oil reservoir management as a case study.

  4. Motivation • The combination of large dataset sizes, geographically distributed users and resources, and complex analysis creates a need for efficient access and high-performance processing. • To achieve good performance across query types and under growing client demand, we apply an optimization technique: partial replication. • In a distributed environment, efficiently assembling the data a query requires from the replicas and the original dataset is an interesting challenge.

  5. Partial Replicas Considered • A replica information file describes the replicas created by users. • Hot range • Use a group of representative queries to identify the portions of the dataset to be replicated. • Chunking • Allows flexible chunk shapes and sizes. • Affects data read cost. • Dimension order • Lays out chunks following different dimension sequences. • Affects data seek cost.
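The effect of dimension order on seek cost can be illustrated with a minimal sketch. It assumes a regular grid of equally sized chunks stored one after another in a single file, and it counts a seek for every non-contiguous jump between the chunks a range query touches. The function `count_seeks` and the grid sizes are hypothetical and are not taken from the presentation.

```python
from itertools import product


def count_seeks(grid_shape, dim_order, query):
    """Count contiguous runs of chunk positions touched by a range query.

    grid_shape: number of chunks along each dimension, e.g. (4, 8)
    dim_order:  dimensions listed from slowest- to fastest-varying on disk
    query:      inclusive (lo, hi) chunk-index range for each dimension
    """
    # Stride of each dimension in the linearized chunk file.
    strides = {}
    stride = 1
    for dim in reversed(dim_order):
        strides[dim] = stride
        stride *= grid_shape[dim]

    # Linear file positions of every chunk that intersects the query.
    ranges = [range(lo, hi + 1) for lo, hi in query]
    positions = sorted(
        sum(idx[d] * strides[d] for d in range(len(grid_shape)))
        for idx in product(*ranges)
    )

    # One seek to reach the first chunk, plus one per gap between positions.
    return 1 + sum(1 for a, b in zip(positions, positions[1:]) if b != a + 1)


# Hypothetical 4 x 8 chunk grid; the query spans all of dimension 0
# and two chunk columns of dimension 1.
query = ((0, 3), (2, 3))
seeks_dim1_fastest = count_seeks((4, 8), (0, 1), query)
seeks_dim0_fastest = count_seeks((4, 8), (1, 0), query)
```

With these assumed sizes, the same query needs several seeks under one layout and a single seek under the other, which is the kind of difference the choice of dimension order is meant to exploit.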

  6. Partial Replicas Considered • To maximize I/O parallelism, users need to partition each chunk of a replica across all available data source nodes. • After the hot ranges of the dataset are re-organized, re-distributed, and re-ordered, there is no longer a one-to-one mapping between data chunks in the original dataset and those in the replicas.

  7. System Overview • The Replica Selection Module is tightly coupled with our prior work on supporting SQL SELECT queries on scientific datasets in a cluster environment.

  8. STORM Runtime System • A middleware to support data selection, data partitioning, and data transfer operations on flat-file datasets hosted on a parallel system. • Services: • Query service • Data source service • Indexing service • Filtering service • Partition generation service • Data mover service

  9. Outline • Introduction • Motivation • Partial Replicas Considered • System overview • Query execution and algorithm design • Computing goodness value • Replica selection algorithm • Experimental results • Related work • Conclusions

  10. Computing Goodness Value • $\text{goodness} = \text{useful data}_{\text{per-chunk}} / \text{cost}_{\text{per-chunk}}$ • Full chunks and partial chunks of a partial replica • Chunk retrieval cost • $\text{Cost} = k_1 \cdot C_{\text{read-operation}} + k_2 \cdot C_{\text{seek-operation}}$ • $k_1$: average read time for a page • $C_{\text{read-operation}}$: number of pages fetched • $k_2$: average seek time • $C_{\text{seek-operation}}$: number of seeks • Fragment • An intermediate unit between a replica and its chunks • A group of full or partial chunks having the same goodness value in a replica • $\text{goodness} = \text{useful data}_{\text{per-fragment}} / \text{cost}_{\text{per-fragment}}$
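The cost and goodness definitions on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the constants `K1` and `K2`, the page counts, and the helper names are assumed values chosen only to make the arithmetic visible, and useful data is measured in pages for simplicity.

```python
from collections import defaultdict


def retrieval_cost(pages_read, seeks, k1, k2):
    """Cost = k1 * (number of pages fetched) + k2 * (number of seeks)."""
    return k1 * pages_read + k2 * seeks


def goodness(useful, pages_read, seeks, k1, k2):
    """Useful data retrieved per unit of retrieval cost."""
    return useful / retrieval_cost(pages_read, seeks, k1, k2)


def group_into_fragments(chunks, k1, k2):
    """Group the chunks of one replica by identical goodness value."""
    fragments = defaultdict(list)
    for chunk in chunks:
        fragments[goodness(*chunk, k1, k2)].append(chunk)
    return dict(fragments)


# Assumed constants: 0.1 ms to read a page, 8 ms per seek.
K1, K2 = 0.1, 8.0

# Each chunk is (useful pages, pages read, seeks). Three full chunks are
# entirely inside the query; two partial chunks are only partly useful.
replica_chunks = [(64, 64, 1)] * 3 + [(32, 64, 1), (16, 64, 1)]
fragments = group_into_fragments(replica_chunks, K1, K2)

full_goodness = goodness(64, 64, 1, K1, K2)
partial_goodness = goodness(16, 64, 1, K1, K2)
```

Under these assumptions the three full chunks share one goodness value and form a single fragment, while the two partial chunks each form their own, giving three fragments in total. A partial chunk costs as much to read as a full one but yields less useful data, so its goodness is lower.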

  11. An Example – Query and Intersecting Replicas • Replica 1: 3 full chunks and 2 partial chunks; 3 fragments • Replica 2: 10 full chunks; 1 fragment

  12. Replica Selection Algorithm: Greedy Strategy • Q: an issued query • R: the partial replicas • D: the original dataset • F: all fragments intersecting the query boundary • Fmax: the fragment with the maximum goodness value in F • S: the ordered list of candidate fragments, in decreasing order of their goodness value • Flowchart: Input Q, R, D → Calculate the fragment set F → Remove Fmax from F → Append Fmax to S → If an overlap with Fmax exists in F, subtract the overlap and re-compute the goodness value → If F is not empty, repeat from "Remove Fmax from F" → Add D if needed → Output S • The runtime complexity is $O(m^2)$, where m is the number of fragments intersecting the query boundary.
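The greedy loop in this flowchart can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' code: the query and each fragment are modeled as sets of discrete cells, goodness is the number of useful cells divided by a fixed per-fragment cost, and subtracting an overlap leaves that cost unchanged. The function name `greedy_select` and the fragment data are hypothetical.

```python
def greedy_select(query_cells, fragments):
    """Return fragment names in the order the greedy strategy picks them.

    query_cells: set of cells requested by the query
    fragments:   dict mapping name -> (set of cells covered, retrieval cost)
    """
    # F: the useful part of every fragment that intersects the query.
    remaining = {
        name: (set(cells) & query_cells, cost)
        for name, (cells, cost) in fragments.items()
    }
    selected = []  # S, in decreasing order of (re-computed) goodness
    covered = set()

    while remaining:
        # Fmax: the fragment with the highest goodness = useful cells / cost.
        best = max(remaining, key=lambda n: len(remaining[n][0]) / remaining[n][1])
        cells, _cost = remaining.pop(best)
        if not cells:
            break  # no remaining fragment contributes useful data
        selected.append(best)
        covered |= cells
        # Subtract the overlap with Fmax; goodness is re-computed on the
        # next iteration from the reduced useful region.
        for name in list(remaining):
            other_cells, other_cost = remaining[name]
            remaining[name] = (other_cells - cells, other_cost)

    # Fall back to the original dataset D for anything still uncovered.
    if covered != query_cells:
        selected.append("D")
    return selected


query = set(range(12))
candidate_fragments = {
    "F1": ({0, 1}, 5.0),
    "F2": ({2, 3, 4, 5}, 4.0),   # overlaps F4 on cells 4 and 5
    "F3": ({10}, 4.0),
    "F4": ({4, 5, 6, 7, 8, 9}, 3.0),
}
plan = greedy_select(query, candidate_fragments)
```

Each pass scans the remaining fragments once to find the maximum and once to subtract overlaps, so the loop is quadratic in the number of fragments, in line with the complexity stated on the slide.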

  13. An Example – 4 Fragments from 2 Replicas • Assume Fragment 4 has the maximum goodness value. • The candidate fragment set is { 1, 2 (with overlap), 3, 4, D }.

  14. Replica Selection Algorithm: Extension to the Greedy Algorithm • S: the ordered list of candidate fragments, in decreasing order of their goodness value • Fi: a fragment in S • C: a chunk in Fi • r: the union range contained by the filtered areas of other fragments • Flowchart: Input S → If redundant I/O exists: for each Fi ∈ S from the head of S, for each chunk C in Fi, if C ∈ r, drop C from Fi and modify other fragments in S to retrieve C → Output the recommended fragments • The runtime complexity is $O(n^2)$, where n is the number of chunks intersecting the query boundary.
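One way to read this extension is sketched below. The model is an assumption made for illustration: each chunk records the cells its I/O brings in (`read`) and the cells the plan keeps from it (`used`), so a chunk's filtered area is `read - used`. A chunk is dropped when everything it contributes already lies in the filtered areas of chunks from other fragments, and those chunks are then asked to keep the data instead. The function name and the dictionary layout are hypothetical.

```python
def drop_redundant_chunks(plan):
    """Remove chunks whose useful data other fragments already read.

    plan: list of dicts with keys 'fragment', 'read', 'used', in the order
          of S (highest-goodness fragment first). Returns the kept chunks.
    """
    # Work on copies so the caller's plan is left untouched.
    plan = [dict(c, read=set(c["read"]), used=set(c["used"])) for c in plan]
    kept = []
    for i, chunk in enumerate(plan):
        # Chunks of other fragments that are still part of the plan.
        others = [
            c for c in kept + plan[i + 1:]
            if c["fragment"] != chunk["fragment"]
        ]
        # r: union of the areas other fragments read but filter out.
        r = set()
        for other in others:
            r |= other["read"] - other["used"]

        if chunk["used"] and chunk["used"] <= r:
            # Redundant I/O: let the other fragments retrieve this data.
            needed = set(chunk["used"])
            for other in others:
                take = needed & (other["read"] - other["used"])
                other["used"] |= take
                needed -= take
        else:
            kept.append(chunk)
    return kept


# Fragment 4 was picked first, so Fragment 2 filters out cells 4 and 5
# even though its chunk reads them anyway.
initial_plan = [
    {"fragment": "F4", "read": {4, 5}, "used": {4, 5}},
    {"fragment": "F4", "read": {6, 7, 8, 9}, "used": {6, 7, 8, 9}},
    {"fragment": "F2", "read": {2, 3, 4, 5}, "used": {2, 3}},
]
final_plan = drop_redundant_chunks(initial_plan)
```

In this assumed example the first chunk of Fragment 4 is dropped and Fragment 2 keeps cells 4 and 5, so the plan reads two chunks instead of three and filters nothing. Each chunk is compared against all other chunks, which gives the quadratic behavior in the number of chunks that the slide states.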

  15. An Example – Recommended Chunks • Final recommendation: the overlap region has been deleted from Fragment 4 and is retrieved through Fragment 2 instead. • This results in fewer I/O operations and less filtering computation.

  16. Outline • Introduction • Motivation • Partial Replicas Considered • System overview • Query execution and algorithm design • Computing goodness value • Replica selection algorithm • Experimental results • Related work • Conclusions

  17. Experimental Setup & Design • A Linux cluster connected via switched Fast Ethernet. Each node has a PIII 933 MHz CPU, 512 MB of main memory, and three 100 GB IDE disks. • Scalability test as the data size increases • Performance test as the number of nodes hosting the dataset is varied • Demonstration of the robustness of the proposed algorithm

  18. [Table 1: replica sets]

  19. Query #1 • SELECT * from IPARS where TIME>=1000 and TIME<=TIMEVAL and X>=0 and X<=11 and Y>=0 and Y<=28 and Z>=0 and Z<=28; • Uses Set #1 from Table 1 on the previous slide. • Our algorithm chose replicas {0,1,2,4} out of the 6 replicas in Set #1. • The query filters 83% of the retrieved data when using the original dataset only, but only about 25% of the retrieved data when the replicas in Set #1 are available.

  20. Query #2 • SELECT * from IPARS where TIME>=1000 and TIME<=1599 and X>=0 and X<=11 and Y>=0 and Y<=31 and Z>=0 and Z<=31; • Uses Set #1 from Table 1. • Our algorithm chose replicas {0,1,2,4} out of the 6 replicas in Set #1. • Up to 4 nodes, query execution time scales linearly. • Because seek cost dominates the total I/O overhead, execution time is not reduced by half when using 8 nodes.

  21. Query #3 • SELECT * from IPARS where TIME>=1000 and TIME<=TIMEVAL and X>=0 and X<=15 and Y>=0 and Y<=63 and Z>=0 and Z<=63; • Uses Set #1 from Table 1. • The extension to our algorithm detects the redundant I/O in the candidate replicas for this query. The final recommendation is to avoid using replicas.

  22. Query #4 • SELECT * from IPARS where TIME>=1000 and TIME<=1199; • An accurate cost model should take into account both the seek cost and the read cost.

  23. Related Work • Parallel file systems and I/O libraries • Supporting regular strided access to uniformly distributed datasets • File-level and dataset-level replication and replica management • Exact replica copies • Availability and reliability • Data caching • Remote memory • Cooperative caches • Active semantic cache

  24. Conclusions • We have investigated a compiler-runtime approach for executing range queries in a distributed environment that employs partial replication. • We have proposed a cost metric and an algorithm to select the set of replicas, and possibly the original dataset, to answer a given query efficiently. • Experimental results demonstrate the efficacy, scalability, and robustness of our algorithm.
