
Distributed Databases

Presentation Transcript


  1. Distributed Databases

  2. Outline • Distributed DBMS (DDBMS) • Features • Detailed insight into its objectives

  3. Distributed DBMS (1) • A distributed DBMS coordinates data access at the various nodes of a distributed database. • Although each site has a DBMS to manage the local database at that site, a distributed DBMS will also: • Keep track of data locations in a data dictionary. • Determine the location from which requested data can be retrieved. • Determine where each part of a distributed query can be processed.

  4. Distributed DBMS (2) • If required, translate a request at one node (using a local DBMS) into the proper request to another node (using a different DBMS and data model) and return data to the requesting node in an acceptable format. • Provide data management functions such as security, deadlock control, concurrency control, and recovery. • Provide consistency among data copies. • Use global primary keys. • Provide scalability, replication, and transparency. • Permit different DBMSs on different nodes.

  5. A Popular DDBMS Architecture (figure): each database site (site 2 through site n) has its own distributed DBMS, a distributed data repository (DD/D), application programs, a local database, and a communications controller; users (user 1 through user n) connect at each site, and the communications controllers link the sites together.

  6. DDBMS Architecture Description (1) • The architecture shown on the previous slide is a system in which each site has • a local DBMS, • a copy of the DDBMS, • and an associated DD/D or data repository. • The DD/D holds data definitions and information about the location of distributed data. • User requests for data are first processed by the DDBMS, which determines whether a local or global transaction is involved.

  7. DDBMS Architecture Description (2) • For local transactions, the DDBMS passes the request to the local DBMS. • For global transactions, the DDBMS routes the request to other sites. • In this architecture, it is assumed that copies of the distributed DBMS and the DD/D exist at each site. • In another setup, these may reside at a central site, but that setup is vulnerable to failures. • Other strategies are also possible.

  8. Key Objectives of DDBMS • Location Transparency • Replication Transparency • Failure Transparency • Commit protocol • Concurrency Transparency • Query Optimization

  9. Location Transparency • Users of the data need not know where the data are physically located. • Being unaware of the data distribution, the user perceives one single (physical) database. • Example: • A user at region A wants to see data for customers whose sales have exceeded $10,000: • Select * from customer where total_sales > 10000 • The distributed DBMS at the local site (region A) consults the distributed DD/D and determines that this request must be routed to region C. • When the results are displayed, it appears to the user as if the data were retrieved locally (unless there is a long communication delay).
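The routing in the example above can be sketched as a toy model. This is only an illustration, not a real DDBMS API: the DD/D mapping, site names, and customer rows are all invented for the example.

```python
# Sketch of location transparency: the local DDBMS consults the DD/D to find
# where "Customers" lives and routes the query there. All names are illustrative.

dd_d = {"Customers": "RegionC", "PriceList": "RegionA"}   # table -> owning site

region_c_customers = [                                     # data held at Region C
    {"name": "Acme", "total_sales": 15000},
    {"name": "Bolt", "total_sales": 4000},
]

def run_query(local_site, table, predicate):
    """Route the query to the owning site; the lookup is hidden from the user."""
    target = dd_d[table]
    rows = region_c_customers if target == "RegionC" else []
    return [r for r in rows if predicate(r)]

# User at RegionA: SELECT * FROM customer WHERE total_sales > 10000
result = run_query("RegionA", "Customers", lambda r: r["total_sales"] > 10000)
print([r["name"] for r in result])    # ["Acme"]
```

From the user's point of view the call looks identical whether the data is local or remote, which is exactly what location transparency promises.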

  10. Data Distribution Example (figure): Region A holds the RegionA parts, a PriceList, and Engineering parts; Region B holds the RegionB parts, a PriceList, and Accounting data; Region C holds the RegionC parts, a PriceList, and the Customers file.

  11. Replication Transparency (1) • Although a given data item may be replicated at several nodes in the network, with replication transparency a user may treat the item as if it were a single item at a single node. • Sometimes also called fragmentation transparency, although strictly that term refers to users not needing to know how data are partitioned. • Example: • A user wants to view the Price List file. • An identical copy of this file exists at all three nodes (three regions) – a case of full replication. • The distributed DBMS consults the DD/D and determines that this is a local transaction. • The user need not be aware that the same data are stored at other sites.

  12. Replication Transparency (2) • In case the requested data are not at the local site, the distributed DBMS decides the optimum route to the target site. • Consider another situation, when one or more users try to update replicated data: • A user at region C wants to change the price of a part. • This update must be reflected at all replicated sites to preserve data integrity. • With replication transparency this is handled automatically.
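The update scenario above can be sketched in a few lines: under full replication, a write to a replicated item must be applied at every copy. The site names, part id, and prices below are assumptions made up for the illustration.

```python
# Sketch of write propagation under full replication: an update to a
# replicated item must reach every copy. All names/values are illustrative.

replicas = {
    "SiteA": {"part-101": 9.50},
    "SiteB": {"part-101": 9.50},
    "SiteC": {"part-101": 9.50},
}

def update_price(part, new_price):
    """Apply the update at every site that holds a replica."""
    for site_db in replicas.values():
        site_db[part] = new_price

update_price("part-101", 10.25)
# Every replica now agrees on the new price
assert all(db["part-101"] == 10.25 for db in replicas.values())
```

A real DDBMS would do this inside a distributed transaction (see the commit protocol slides) so that a partial propagation cannot leave the copies inconsistent.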

  13. Failure Transparency (1) • Each site in a distributed environment is prone to the same types of failures as a centralized system (disk failure, erroneous data, etc.). • An additional risk is communication link failure. • A robust system must be able to • Detect a failure • Reconfigure the system so computation can continue • Recover when a processor or link is repaired • The first and second tasks are jobs of the communications controller or processor; the third is the job of the DDBMS. • The DDBMS has a software module for this – the Transaction Manager.

  14. Failure Transparency (2) • Transaction Manager: • Maintains a log of transactions and "before & after images" of the database • Maintains a concurrency control scheme to ensure data integrity during parallel execution of transactions at that site • For global transactions, the transaction managers at each participating site cooperate to ensure all update operations are synchronized • Example • A person at site1 wants to change a part's price in the StandardPriceList (copies are present at three different sites) • This global transaction must update every copy of the record for that part at the three sites • Suppose site1 and site2 are successfully updated but, due to a transmission failure, the update does not occur at site3 • With failure transparency, either all actions of a transaction are committed or none of them are
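The "before & after images" idea above can be sketched with a tiny write-ahead log: each write records the old and new values, so a failed global transaction (e.g. site3 never confirming) can be undone everywhere. The data values are illustrative assumptions.

```python
# Sketch of a transaction manager log holding "before & after images",
# so a failed global transaction can be rolled back. Values are illustrative.

db = {"part-101": 9.50}
log = []                              # entries: (key, before-image, after-image)

def write(key, value):
    log.append((key, db.get(key), value))
    db[key] = value

def rollback():
    """Undo all logged writes in reverse order using the before-images."""
    for key, before, _after in reversed(log):
        db[key] = before
    log.clear()

write("part-101", 10.25)
rollback()                # e.g. one site failed to confirm, so undo everywhere
assert db["part-101"] == 9.50
```

The after-images serve the opposite purpose: redoing a committed transaction's writes during recovery if the site crashed before the changes reached disk.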

  15. Commit Protocol • To ensure data integrity for distributed update operations, the cooperating transaction managers execute a commit protocol • The protocol ensures that a global transaction is either successfully completed at all sites or aborted everywhere • Two-phase Commit (an algorithm) • The most widely used protocol • Coordinates updates in a distributed environment • Ensures that concurrent transactions at multiple sites are processed as though they were executed in the same serial order at all sites • How does it work?

  16. How does the protocol work? (1) • The site originating the global transaction sends a request to each of the sites that will process some portion of the transaction. • Each site processes its sub-transaction but does not commit the result to the database (it holds the result in a temporary file). • Each site locks the portion of its database being updated. • Each site notifies the originating site when its sub-transaction completes. • When all sites have responded, the originating site initiates the two-phase commit protocol • Preparation Phase • Final Commit Phase

  17. How does the protocol work? (2) • Preparation Phase • A message is broadcast to every participating site, asking whether it is willing to commit its portion of the transaction at that site • Each site returns "OK" or "not OK" • The originating site collects all the messages.

  18. How does the protocol work? (3) • Final Commit Phase • If all replies are "OK", the originating site broadcasts a message to all sites to commit the portion of the transaction handled at each site • If one or more sites respond with "not OK", it broadcasts a message to all sites to abort the transaction • A transaction can fail during the commit phase • Such a transaction is said to be in limbo
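The two phases described above can be sketched as a toy coordinator loop. This is a simplification (no network, no logging, no limbo handling); the `Participant` class, its `healthy` flag, and the site names are assumptions invented for the sketch.

```python
# Toy two-phase commit: prepare() is a site's vote ("OK" -> True),
# and the coordinator commits only if every vote is True.

class Participant:
    def __init__(self, name, healthy=True):
        self.name, self.healthy = name, healthy
        self.state = "pending"

    def prepare(self):      # Preparation Phase: vote on the sub-transaction
        return self.healthy

    def commit(self):       # Final Commit Phase, success branch
        self.state = "committed"

    def abort(self):        # Final Commit Phase, failure branch
        self.state = "aborted"

def two_phase_commit(participants):
    # Phase 1: broadcast "prepare" and collect every site's vote
    votes = [p.prepare() for p in participants]
    # Phase 2: commit everywhere only if all replied "OK", else abort everywhere
    if all(votes):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.abort()
    return "aborted"

sites = [Participant("Site1"), Participant("Site2"), Participant("Site3", healthy=False)]
print(two_phase_commit(sites))    # one "not OK" vote aborts the whole transaction
```

A single "not OK" vote forces a global abort, which is what gives the protocol its all-or-nothing guarantee.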

  19. How does the protocol work? (4) • A limbo transaction can be identified by polling or by a timeout. With a timeout (no confirmation of commit for a specified period), it is not possible to distinguish between a failed and a busy site. • Polling is expensive in terms of network load and processing time. • If a commit confirmation is not received from one or more sites, the originating site forces all sites to undo their changes with an abort message.

  20. Improvement Strategies for Two-Phase Commit • The two-phase commit protocol is slow due to delays caused by the extensive coordination among sites. Some improvements developed are: • Read-only optimization • This approach identifies the read-only portions of a transaction and eliminates the need for confirmation messages on them, e.g., a transaction may include reading of data before inserting new data (an inventory balance check before creating a new order), so the data read can happen without a callback confirmation • Lazy commit optimization • This approach allows the sites that can update to proceed and lets those that cannot update "catch up" later • Linear commit optimization • This approach allows each part of a transaction (sub-transaction) to be committed in sequence, rather than holding up the whole transaction when some sub-transactions are delayed

  21. Concurrency Transparency • Concurrency control • When multiple users access and update a database, data integrity may be lost unless locking mechanisms protect the data from the effects of concurrent updates • Concurrency control is more complex in a distributed database because • Users are geographically distributed across sites • Data are often replicated at several sites • Concurrency Transparency • Allows a distributed system to run many transactions • Allows each transaction to appear as if it were the only activity in the system • When several transactions are processed concurrently, the results must be the same as if each transaction were processed in serial order • Transaction managers at each site must cooperate to provide concurrency control in a distributed database

  22. Concurrency Control Approaches • 1st approach - Locking • Data retrieved by a user for updating must be locked or denied to other users until that update is completed or aborted. Transactions can place locks on data resources. • Locks are implemented at the following levels • Database • Table • Block or Page • Record level • Field level • Lock Types • Shared locks (S lock or read lock) • Exclusive locks (X lock or write lock) • Details on page 474 (Hoffer - 6/e)
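The shared/exclusive lock rules on this slide can be sketched as a small compatibility check: any number of S locks may coexist, but an X lock requires that no other lock is held. The class and method names are invented for the sketch, and granularity is fixed at the record level.

```python
# Sketch of shared (S / read) vs exclusive (X / write) lock compatibility.

class LockTable:
    def __init__(self):
        self.locks = {}                  # record -> list of (txn, mode)

    def acquire(self, record, txn, mode):
        """Grant the lock if compatible with current holders, else refuse."""
        held = self.locks.setdefault(record, [])
        if mode == "S" and all(m == "S" for _, m in held):
            held.append((txn, "S"))      # shared locks are mutually compatible
            return True
        if mode == "X" and not held:
            held.append((txn, "X"))      # exclusive requires no other locks
            return True
        return False                     # request must wait (or be denied)

lt = LockTable()
assert lt.acquire("rec1", "T1", "S")
assert lt.acquire("rec1", "T2", "S")     # two readers share the record
assert not lt.acquire("rec1", "T3", "X") # a writer is blocked by the readers
```

A real lock manager would queue refused requests and also handle the coarser levels listed above (field, block/page, table, database).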

  23. Concurrency Control Approaches • 2nd approach – Versioning • Uses an optimistic approach (the opposite of the pessimistic locking approach), assuming that most of the time other users do not want the same record • Each transaction is restricted to a view of the database as of the time that transaction started, and when a transaction modifies a record, the DBMS creates a new record version instead of overwriting the old record • When there is no conflict (only one user transaction changes database records), the changes are merged directly into the database • If two users make conflicting changes, the changes made by one of the users are committed to the database – usually the earlier-timestamped transaction gets priority. The other user is told about the conflict.
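The versioning idea above can be sketched with a per-record version counter: a write succeeds only if no other transaction committed a newer version since the writer's read. The store layout and function names are assumptions for illustration.

```python
# Sketch of optimistic versioning: a transaction notes the version it read,
# and its write is rejected if another transaction committed in between.

store = {"part-101": (1, 9.50)}          # record -> (version, value)

def read(key):
    return store[key]                    # (version seen, current value)

def commit_write(key, read_version, new_value):
    """Commit only if the record is still at the version this txn read."""
    current_version, _ = store[key]
    if current_version != read_version:
        return False                     # conflict: the other user is notified
    store[key] = (current_version + 1, new_value)
    return True

v, price = read("part-101")
assert commit_write("part-101", v, 10.25)      # first writer wins
assert not commit_write("part-101", v, 11.00)  # stale version is rejected
```

Note that no locks are taken at any point; the cost of the optimism is that the losing transaction must be retried.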

  24. Concurrency Control Approaches • 3rd approach – Timestamping • Every transaction gets a globally unique timestamp (the clock time when the transaction occurred) • This ensures that even if two simultaneous transactions occur at different sites, each gets a unique timestamp • Timestamps ensure serial-order processing of transactions • Every record carries the timestamp of the transaction that last updated it • A transaction cannot process a record unless its timestamp is later than the one carried in the record • If a transaction's timestamp is earlier than that carried in the record, the transaction is assigned a new timestamp and restarted
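The rule above ("a transaction cannot touch a record carrying a later timestamp; otherwise it restarts with a new timestamp") can be sketched as follows. A monotonic counter stands in for the globally unique clock; everything else is an illustrative assumption.

```python
# Sketch of timestamp ordering: the record remembers the timestamp of the
# transaction that last updated it, and older transactions must restart.

import itertools

clock = itertools.count(1)     # stand-in for a globally unique timestamp source
record_ts = 0                  # timestamp carried by the record

def try_update(txn_ts):
    """Apply the update if the transaction is newer than the record, else signal restart."""
    global record_ts
    if txn_ts <= record_ts:
        return None            # too old: caller restarts with a fresh timestamp
    record_ts = txn_ts
    return txn_ts

t1, t2 = next(clock), next(clock)
assert try_update(t2) == t2            # the newer transaction succeeds
assert try_update(t1) is None          # the older one must restart...
assert try_update(next(clock)) == 3    # ...and succeeds with a new timestamp
```

Because every site draws from the same ordered timestamp space, all sites agree on the same serial order without exchanging locks.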

  25. Query Optimization • How a DDBMS processes a query is affected by the way the user formulates the query and by the intelligence of the distributed DBMS in developing a sensible plan • A distributed DBMS typically uses the following 3 steps to develop a query processing plan • Query decomposition • The query is simplified and rewritten into a structured relational algebra form • Data localization • The query is transformed from one referencing data across the network into one or more fragments that each explicitly reference data at only one site • Global optimization • In this final step, decisions are made about the order in which to execute query fragments, where to move data between sites, and where parts of the query will be executed

  26. Query Optimization • One technique for processing a distributed query efficiently is the semijoin operation. • In a semijoin, only the joining attribute is sent from one site to another, rather than all selected attributes from every qualified row. • Then only the required rows are returned. • Hence there is minimal transfer of data over the network. • A DDBMS also uses a cost model to predict the execution time of alternative execution plans.
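The semijoin described above can be sketched with two in-memory "sites": only the joining attribute (`cust_id`) crosses the network, and only the matching rows come back. The table contents and column names are invented for the example.

```python
# Sketch of a semijoin: Site 2 ships only the join keys to Site 1, which
# returns just the matching rows, minimizing data moved over the network.

site1_orders = [                      # rows held at Site 1
    {"cust_id": 1, "total": 500},
    {"cust_id": 2, "total": 80},
    {"cust_id": 3, "total": 1200},
]
site2_customers = [                   # rows held at Site 2
    {"cust_id": 1, "region": "C"},
    {"cust_id": 3, "region": "C"},
]

# Step 1: ship only the joining attribute, not whole rows
join_keys = {row["cust_id"] for row in site2_customers}

# Step 2: Site 1 returns only the rows whose key matches
matching = [row for row in site1_orders if row["cust_id"] in join_keys]
print([row["total"] for row in matching])    # [500, 1200]
```

Shipping `{1, 3}` instead of the full customer rows is the whole saving; the cost model mentioned above would compare this plan against shipping an entire table.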

  27. Reading assignment • Evolution of DDBMS

  28. That’s it for today…!
