Explore the state-of-the-art in distributed query processing, covering client-server, peer-to-peer, and middleware architectures. Learn about query shipping, data shipping, hybrid approaches, optimizations, and distributed query plans.
The State of the Art in Distributed Query Processing by Donald Kossmann Presented by Chris Gianfrancesco
Introduction • Distributed database technology is becoming an increasingly attractive enhancement to many database systems • Cost and scalability • Software integration • Legacy systems • New applications • Market forces
Introduction • Topics covered in this paper • Basics of distributed query processing • Client-server distributed DB models • Heterogeneous distributed DB models • Data placement techniques • Other distributed architectures
Client-Server Database Systems • Relationships between distributed nodes take a client-server form • Client: makes requests of the servers, usually the source of queries • Server: responds to client requests, usually the source of data • System architectures: peer-to-peer, strict client-server, middleware/multitier
Architectures: Peer-to-Peer • All nodes are equivalent • Each can be either a client or server on demand (can store data and/or make requests) • Ex: SHORE system • [Diagram: three peer nodes, each labeled "Server or Client"]
Architectures: Strict Client-Server • Client or server status is pre-defined and cannot change • Clients supply queries, servers supply data • Most common architecture in commercial DBMSs • [Diagram: Client (query source) and Server (data source)]
Architectures: Middleware/Multitier • Multiple levels of client-server interaction • Nodes act as clients to those below them and servers to those above • Ex: SAP R/3, web servers with DB backends • [Diagram: Node 1 (client to Node 2), Node 2 (server to Node 1, client to Node 3), Node 3 (server to Node 2)]
Architectures: Evaluation • Peer-to-Peer • Simplest setup • Equal load sharing • Strict Client-Server • Specialization • Administration for servers only • Middleware/Multitier • Functionality integration • Scalability
Client-Server Query Processing • Queries initiated at clients, data stored at servers • Where do we execute the query? • Query shipping: move the query down to the data • Data shipping: move the data up to the query • Hybrid shipping: combination of both
Query Shipping • SQL query code is sent down to the server • Server parses and evaluates query, returns result • Used in DB2, Oracle, MS SQL Server
Data Shipping • Client parses query and requests data from server • Server provides data, then client executes query • Data can be cached at client (main memory or disk)
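Query shipping and data shipping, as described on the last two slides, can be contrasted in a small Python sketch. This is a toy model, not how any commercial system implements either approach: `Server`, `query_shipping`, and `data_shipping` are hypothetical names, the table is an in-memory list, and the number of rows shipped stands in for communication cost.

```python
class Server:
    """Hypothetical server holding one table in memory."""

    def __init__(self, rows):
        self.rows = list(rows)

    def run_query(self, predicate):
        # Query shipping: the server evaluates the query itself.
        return [r for r in self.rows if predicate(r)]

    def fetch_all(self):
        # Data shipping: the server only hands over the raw data.
        return list(self.rows)


def query_shipping(server, predicate):
    """Send the query down; only the result travels back."""
    result = server.run_query(predicate)
    return result, len(result)  # (result, rows shipped over the network)


def data_shipping(server, predicate, cache):
    """Pull the data up (unless cached) and evaluate at the client."""
    shipped = 0
    if "table" not in cache:
        cache["table"] = server.fetch_all()
        shipped = len(cache["table"])
    result = [r for r in cache["table"] if predicate(r)]
    return result, shipped


server = Server([{"id": i} for i in range(5)])
wanted = lambda row: row["id"] > 2

qs_result, qs_shipped = query_shipping(server, wanted)

client_cache = {}
ds_result, ds_first = data_shipping(server, wanted, client_cache)
_, ds_second = data_shipping(server, wanted, client_cache)
```

In this sketch both strategies return the same rows. Query shipping moves only the matching rows but puts all the work on the server; data shipping moves the whole table once and then answers repeat queries from the client cache without contacting the server.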
Hybrid Shipping • Mix-and-match data shipping and query shipping • Query parts can be executed at any level according to query plan • Data is cached when beneficial
Evaluation • Query Shipping • Reliant on server performance • Scales poorly with increasing client load • Data Shipping • Good scalability • High communication costs • Hybrid • Potential to outperform other options • More complex optimizations
Hybrid Shipping Observations • Some cases where hybrid shipping gives the best performance • Sometimes it is better not to use the client cache: when network transfer cost < client access cost • Sometimes it is better to ship cached data down to the server: when the data is in the client's main memory and execution is at the server • Multiple small updates: maintain them at the client and post to the server only when necessary
Query Optimization • Query plans must also specify where the query pieces are executed • Data shipping: all execution done at client • Query shipping: all execution done at server • Hybrid: choice can be made for each operator • Results display to user is always at client
Distributed Query Plans • Each operator is annotated with a logical (not physical) site of execution, so plans are shareable between clients • client means the operator is executed at the client where the query is issued • server means: • for scan operators, execute at a location that has the necessary data • for updates, execute at all locations with the relevant data
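A rough Python sketch of such an annotated plan follows. The `Op` class, the `bind_sites` function, and the `catalog` mapping from tables to server names are all assumptions made for illustration; the slides do not prescribe a data structure. A `server` annotation on anything other than a scan or update is not defined on the slide, so the sketch rejects it.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Op:
    name: str
    kind: str                    # "scan", "update", "join", "display", ...
    site: str                    # logical annotation: "client" or "server"
    children: list = field(default_factory=list)
    table: Optional[str] = None  # only used by scans and updates


def bind_sites(op, client_node, catalog, out=None):
    """Resolve logical annotations to physical nodes for one client."""
    out = {} if out is None else out
    if op.site == "client":
        nodes = [client_node]
    elif op.kind == "scan":
        # Any one location that holds the data (here: the first listed).
        nodes = [catalog[op.table][0]]
    elif op.kind == "update":
        # Every location that holds the data.
        nodes = list(catalog[op.table])
    else:
        raise ValueError(f"'server' is undefined for operator kind {op.kind!r}")
    out[op.name] = nodes
    for child in op.children:
        bind_sites(child, client_node, catalog, out)
    return out


catalog = {"A": ["s1"], "B": ["s2", "s3"]}  # hypothetical data placement

plan = Op("display", "display", "client", [
    Op("join", "join", "client", [
        Op("scan_A", "scan", "server", table="A"),
        Op("scan_B", "scan", "server", table="B"),
    ]),
])

bound_c1 = bind_sites(plan, "c1", catalog)
bound_c2 = bind_sites(plan, "c2", catalog)
bound_update = bind_sites(Op("upd_B", "update", "server", table="B"), "c1", catalog)
```

Because the plan stores only `client` and `server`, the same `plan` object binds correctly for client `c1` and client `c2`, which is the sense in which plans are shareable.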
Query Optimization: Where? • Should optimization occur at the client or the server? • At client: less load on servers, better scalability • At server: more information about system statistics, especially server loads • Potential solution: primary parsing and query rewriting at client, further optimization at server
Query Optimization: Statistics • Even when optimization is done at a server, that server does not usually have full knowledge of the system • System can either: • Guess the status of other servers – less accuracy, less cost • Ask other servers their status – fully accurate, additional communication costs
Query Optimization: When? • Tradeoff of accuracy vs. cost • Traditional-style: optimize once, store plan • No support for changing DB conditions • No optimization cost incurred at query execution time • Plan sets: optimize for possible scenarios • Generate a few query plans for different conditions • Choose a plan based on runtime statistics • On-the-fly: observe intermediate results • Re-optimize the query if they differ from expectations
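The plan-set option can be illustrated with a minimal Python sketch: a few plans are prepared in advance, and one is picked when the query runs. The plan names, the cost formulas, and the statistics keys (`rows`, `server_load`, `network_cost`) are invented for this example and are not taken from the paper.

```python
# Compile time: keep a small set of candidate plans, each with a
# (hypothetical) cost function over runtime statistics.
PLAN_SET = {
    "join_at_server": lambda s: s["rows"] * s["server_load"],
    "join_at_client": lambda s: s["rows"] * s["network_cost"],
}


def choose_plan(plan_set, stats):
    """Runtime: return the name of the cheapest plan under current stats."""
    return min(plan_set, key=lambda name: plan_set[name](stats))


idle_server = {"rows": 1000, "server_load": 0.2, "network_cost": 0.5}
busy_server = {"rows": 1000, "server_load": 0.9, "network_cost": 0.5}

plan_when_idle = choose_plan(PLAN_SET, idle_server)
plan_when_busy = choose_plan(PLAN_SET, busy_server)
```

The expensive step, generating the candidate plans, happens once; the runtime step is only a comparison of a few cost estimates.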
Query Optimization: Two-Step • Compile-time: generate join order, etc. • Runtime: perform site selection • Reasonable cost at each end • Responds well to changing server loads • Fully utilizes client data caching
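The two steps can be sketched in Python as follows. Everything here is an illustrative assumption: the smallest-relation-first join ordering is a toy heuristic standing in for a real optimizer, and `replicas`, `load`, and `client_cache` are hypothetical inputs describing data placement, current server load, and cached tables.

```python
def compile_time_join_order(cardinalities):
    """Step 1 (compile time): fix a join order without looking at sites.

    Toy heuristic: smallest relation first.
    """
    return sorted(cardinalities, key=cardinalities.get)


def runtime_site_selection(join_order, replicas, load, client_cache):
    """Step 2 (runtime): choose where each relation is read.

    Use the client if the table is cached there, otherwise the
    least-loaded server that holds a copy.
    """
    sites = {}
    for table in join_order:
        if table in client_cache:
            sites[table] = "client"
        else:
            sites[table] = min(replicas[table], key=lambda s: load[s])
    return sites


cardinalities = {"orders": 1000, "customers": 50, "items": 200}
replicas = {"orders": ["s1", "s2"], "customers": ["s1"], "items": ["s2"]}
load = {"s1": 0.9, "s2": 0.2}   # sampled when the query starts
client_cache = {"customers"}

order = compile_time_join_order(cardinalities)
sites = runtime_site_selection(order, replicas, load, client_cache)
```

Note that `compile_time_join_order` never sees `replicas`, `load`, or `client_cache`; only the second step reacts to the current state of the system.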
Two-Step Optimization: Downside • Step 1: a plan that looks optimal is generated traditional-style • Step 2: site selection is performed on that plan • The true optimal plan can be missed, because the first optimization step was done with no knowledge of the system
Query Execution Techniques • Standard fare: row blocking, multithreading when possible • Issue: transactions that mix updates and retrieval queries under hybrid shipping • We want to delay propagating updates for efficiency's sake • Other option: perform the query before the update and temporarily pad the results
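Row blocking, the first technique listed, means shipping rows in batches rather than one message per row. A minimal Python sketch, assuming a hypothetical `row_blocks` helper and treating each yielded block as one network message:

```python
from itertools import islice


def row_blocks(rows, block_size):
    """Yield rows in lists of at most block_size (one list per message)."""
    it = iter(rows)
    while True:
        block = list(islice(it, block_size))
        if not block:
            return
        yield block


rows = range(7)
messages = list(row_blocks(rows, 3))
```

With seven rows and a block size of three, the sender issues three messages instead of seven, which amortizes the per-message overhead across the rows in each block.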
Questions? • Comments?