Servicing Range Queries on Multidimensional Datasets with Partial Replicas

Li Weng, Umit Catalyurek, Tahsin Kurc, Gagan Agrawal, Joel Saltz

Outline
• Introduction
• Motivation
• Partial Replicas Considered
• System overview
• Query execution and algorithm design
• Computing goodness value
• Replica selection algorithm
• Experimental results
• Related work
• Conclusions
CCGRID 2005

Motivating Applications
[Figures: Oil Reservoir Management; Magnetic Resonance Imaging]
Data-driven applications from science, engineering, and biomedicine:
• Oil Reservoir Management
• Water Contamination Studies
• Cancer Studies using MRI
• Telepathology with Digitized Slides
• Satellite Data Processing
• Virtual Microscope
We use Oil Reservoir Management as a case study.

Motivation
• Large dataset sizes, geographically distributed users and resources, and complex analysis together call for efficient data access and high-performance processing.
• To achieve good performance across various query types and growing client demand, we apply an optimization technique: partial replication.
• In a distributed environment, efficiently assembling the data a query requires from the replicas and the original dataset is an interesting challenge.

Partial Replicas Considered
• A replica information file describes the replicas created by users.
• Hot range
  • A group of representative queries identifies the portions of the dataset to be replicated.
• Chunking
  • Allows flexible chunk shapes and sizes.
  • Affects data read cost.
• Dimension order
  • Lays out chunks following different dimension sequences.
  • Affects data seek cost.

Partial Replicas Considered
• To maximize I/O parallelism, users need to partition each chunk of a replica across all available data source nodes.
• After the hot ranges of the dataset are re-organized, re-distributed, and re-ordered, there is no one-to-one mapping between data chunks in the original dataset and those in the replicas.

System Overview
The Replica Selection Module is tightly coupled with our prior work on supporting SQL SELECT queries on scientific datasets in a cluster environment.

STORM Runtime System
A middleware that supports data selection, data partitioning, and data transfer operations on flat-file datasets hosted on a parallel system.
Services:
• Query service
• Data source service
• Indexing service
• Filtering service
• Partition generation service
• Data mover service

Computing Goodness Value
• $\text{goodness} = \text{useful data}_{\text{per-chunk}} / \text{cost}_{\text{per-chunk}}$
• Full chunks and partial chunks of a partial replica
• Chunk retrieval cost
  • $\text{Cost} = k_1 \cdot C_{\text{read-operation}} + k_2 \cdot C_{\text{seek-operation}}$
  • $k_1$: average read time for a page
  • $C_{\text{read-operation}}$: number of pages fetched
  • $k_2$: average seek time
  • $C_{\text{seek-operation}}$: number of seeks
• Fragment
  • An intermediate unit between a replica and its chunks
  • A group of full or partial chunks in a replica that have the same goodness value
  • $\text{goodness} = \text{useful data}_{\text{per-fragment}} / \text{cost}_{\text{per-fragment}}$
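The cost and goodness formulas on this slide can be sketched in a few lines of Python. This is only an illustration: the `ChunkAccess` type, the byte counts, and the constants `K1` and `K2` are hypothetical values chosen for the example, not measurements from the talk.

```python
from dataclasses import dataclass


@dataclass
class ChunkAccess:
    """I/O needed to fetch one chunk (or one fragment) for a query."""
    useful_bytes: int  # data that falls inside the query range
    pages_read: int    # C_read-operation: number of pages fetched
    seeks: int         # C_seek-operation: number of seeks


def retrieval_cost(access: ChunkAccess, k1: float, k2: float) -> float:
    """Cost = k1 * C_read-operation + k2 * C_seek-operation."""
    return k1 * access.pages_read + k2 * access.seeks


def goodness(access: ChunkAccess, k1: float, k2: float) -> float:
    """Useful data obtained per unit of retrieval cost."""
    return access.useful_bytes / retrieval_cost(access, k1, k2)


# Hypothetical constants: 0.1 ms average page read, 8 ms average seek.
K1, K2 = 0.1, 8.0

# A full chunk lies entirely inside the query; a partial chunk costs the
# same to read but only part of its data is useful.
full_chunk = ChunkAccess(useful_bytes=65536, pages_read=16, seeks=1)
partial_chunk = ChunkAccess(useful_bytes=16384, pages_read=16, seeks=1)

print(goodness(full_chunk, K1, K2) > goodness(partial_chunk, K1, K2))
```

With equal read and seek counts, the full chunk scores higher because more of what is read is actually used, which is the reason full and partial chunks of a replica end up in different fragments.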

An Example – Query and Intersecting Replicas
• Replica 1: 3 full chunks and 2 partial chunks; 3 fragments
• Replica 2: 10 full chunks; 1 fragment

Replica Selection Algorithm: Greedy Strategy
Notation:
• Q: an issued query
• R: the partial replicas
• D: the original dataset
• F: all fragments intersecting the query boundary
• Fmax: the fragment with the maximum goodness value in F
• S: the list of candidate fragments, in decreasing order of goodness value
Steps (from the flowchart):
1. Input Q, R, D.
2. Calculate the fragment set F.
3. If F is null, go to step 6.
4. Append Fmax to S and remove Fmax from F.
5. If any fragment in F overlaps Fmax, subtract the overlap and re-compute the goodness value. Return to step 3.
6. Add D if needed and output S.
The runtime complexity is O(m^2), where m is the number of fragments intersecting the query boundary.
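A minimal Python sketch of the greedy strategy follows. It rests on simplifying assumptions that are not in the talk: the query and each fragment are modeled as sets of abstract cell ids, each fragment has a fixed retrieval cost that does not change when overlap is subtracted, and the names `Fragment`, `select_fragments`, and `original_cost` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    name: str
    cells: frozenset  # query cells this fragment can supply (assumed model)
    cost: float       # retrieval cost of the whole fragment (assumed fixed)

    @property
    def goodness(self) -> float:
        return len(self.cells) / self.cost


def select_fragments(query_cells, fragments, original_cost):
    """Return S, the candidate fragments in decreasing order of goodness."""
    # F: fragments intersecting the query, clipped to the query range.
    F = [Fragment(f.name, f.cells & query_cells, f.cost) for f in fragments]
    F = [f for f in F if f.cells]
    S, covered = [], set()
    while F:
        f_max = max(F, key=lambda f: f.goodness)
        F.remove(f_max)
        S.append(f_max)
        covered |= f_max.cells
        # Subtract the overlap with f_max; goodness is re-computed on access.
        F = [Fragment(f.name, f.cells - f_max.cells, f.cost) for f in F]
        F = [f for f in F if f.cells]
    # Add the original dataset D only if the replicas leave a gap.
    missing = frozenset(query_cells - covered)
    if missing:
        S.append(Fragment("D", missing, original_cost))
    return S


query = frozenset(range(12))
candidates = [
    Fragment("f1", frozenset(range(0, 4)), cost=2.0),
    Fragment("f2", frozenset(range(2, 8)), cost=4.0),
    Fragment("f3", frozenset(range(6, 10)), cost=8.0),
]
S = select_fragments(query, candidates, original_cost=20.0)
print([f.name for f in S])  # ['f1', 'f2', 'f3', 'D']
```

Each round scans the remaining fragments once to find the maximum and once to subtract the overlap, so m fragments give on the order of m^2 work, in line with the complexity stated on the slide.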

An Example – 4 Fragments from 2 Replicas
Assume Fragment 4 has the maximum goodness value. The candidate fragment set is { 1, 2 (with overlap), 3, 4, D }.

Replica Selection Algorithm: Extension to the Greedy Algorithm
Notation:
• S: the list of candidate fragments, in decreasing order of goodness value
• Fi: a fragment in S
• C: a chunk in Fi
• r: the union of the ranges contained in the filtered areas of the other fragments
Steps (from the flowchart):
1. Input S.
2. If no redundant I/O exists, go to step 4.
3. For each Fi ∈ S, starting from the head of S, and for each chunk C in Fi: if C ∈ r, drop C from Fi and modify the other fragments in S to retrieve C.
4. Output the recommended fragments.
The runtime complexity is O(n^2), where n is the number of chunks intersecting the query boundary.
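The chunk-level pass can be sketched as follows. This is one possible reading of the flowchart, not the authors' implementation: each chunk is modeled by the set of cells it reads and the subset it keeps after filtering, and "C ∈ r" is interpreted as "every cell C keeps is already read and filtered out by some other fragment". The names `Chunk`, `Fragment`, and `remove_redundant_io` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(eq=False)  # compare chunks by identity, not by content
class Chunk:
    read: set  # cells fetched from disk when the chunk is read
    used: set  # subset of `read` kept after filtering


@dataclass(eq=False)
class Fragment:
    name: str
    chunks: list


def filtered_by_others(S, current):
    """r: union of the cells that other fragments read but filter out."""
    r = set()
    for frag in S:
        if frag is not current:
            for c in frag.chunks:
                r |= c.read - c.used
    return r


def remove_redundant_io(S):
    """Drop chunks whose useful data other fragments already read."""
    for frag in S:  # from the head of S
        for chunk in list(frag.chunks):
            r = filtered_by_others(S, frag)
            if chunk.used and chunk.used <= r:
                frag.chunks.remove(chunk)  # drop C from Fi
                needed = set(chunk.used)
                # Modify the other fragments so they keep C's cells
                # instead of filtering them out.
                for other in S:
                    if other is frag:
                        continue
                    for c in other.chunks:
                        take = needed & (c.read - c.used)
                        c.used |= take
                        needed -= take
    return [f for f in S if f.chunks]


# Fragment 2 reads cells 2-5 but filters out 2-3, which fragment 4 supplies.
frag4 = Fragment("fragment 4", [Chunk({0, 1}, {0, 1}), Chunk({2, 3}, {2, 3})])
frag2 = Fragment("fragment 2", [Chunk({2, 3, 4, 5}, {4, 5})])
recommended = remove_redundant_io([frag4, frag2])
print(len(frag4.chunks), sorted(frag2.chunks[0].used))
```

In this toy input the second chunk of fragment 4 is dropped and fragment 2 keeps cells 2 and 3 instead, so one chunk read is saved and fragment 2 no longer filters data it has already fetched.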

An Example – Recommended Chunks
Final recommendation: the overlap region has been deleted from Fragment 4 and is retrieved from Fragment 2 instead. This yields fewer I/O operations and less filtering computation.

Experimental Setup & Design
A Linux cluster connected via switched Fast Ethernet. Each node has a PIII 933 MHz CPU, 512 MB of main memory, and three 100 GB IDE disks.
• Scalability test as the data size increases.
• Performance test as the number of nodes hosting the dataset varies.
• Robustness test of the proposed algorithm.

Query #1
• SELECT * FROM IPARS WHERE TIME>=1000 AND TIME<=TIMEVAL AND X>=0 AND X<=11 AND Y>=0 AND Y<=28 AND Z>=0 AND Z<=28;
• Replicas: Set #1 in Table 1.
• Our algorithm chose {0,1,2,4} out of the 6 replicas in Set #1.
• Using only the original dataset, the query filters out 83% of the retrieved data; with the replicas in Set #1, it needs to filter out only about 25%.

Query #2
• SELECT * FROM IPARS WHERE TIME>=1000 AND TIME<=1599 AND X>=0 AND X<=11 AND Y>=0 AND Y<=31 AND Z>=0 AND Z<=31;
• Replicas: Set #1 in Table 1.
• Our algorithm chose {0,1,2,4} out of the 6 replicas in Set #1.
• Up to 4 nodes, query execution time scales linearly.
• Because seek cost dominates the total I/O overhead, execution time is not halved when using 8 nodes.

Query #3
• SELECT * FROM IPARS WHERE TIME>=1000 AND TIME<=TIMEVAL AND X>=0 AND X<=15 AND Y>=0 AND Y<=63 AND Z>=0 AND Z<=63;
• Replicas: Set #1 in Table 1.
• The extension to our algorithm detects the redundant I/O in the candidate replicas for this query. The final recommendation is to avoid using replicas.

Query #4
• SELECT * FROM IPARS WHERE TIME>=1000 AND TIME<=1199;
• An accurate cost model should take into account both the seek cost and the read cost.

Related Work
• Parallel file systems and I/O libraries
  • Supporting regular strided access to uniformly distributed datasets
• File-level and dataset-level replication and replica management
  • Exact replica copies
  • Availability and reliability
• Data caching
  • Remote memory
  • Cooperative caches
  • Active semantic cache

Conclusions
• We have investigated a compiler-runtime approach for executing range queries in a distributed environment when partial replication is employed.
• We have proposed a cost metric and an algorithm to select the set of replicas, and possibly the original dataset, to answer a given query efficiently.
• Experimental results demonstrate the efficacy, scalability, and robustness of our algorithm.
