Distributed Database

Distributed Database Alexandria Dodd Janelle Toungett

Distributed Database • What is a distributed database? • Distributed database concepts • Types of distributed databases • Basic architecture • Data fragmentation, replication, and allocation techniques • Query processing and optimization • Transaction management

Distributed Database • Concurrency control and recovery • Distributed catalog management • Current trends • Distributed databases in Oracle • Advantages of distributed databases • Disadvantages of distributed databases

What is a Distributed Database? A major motivation behind the development of database systems is the desire to integrate the operational data of an organization and provide controlled access to the data. • Computer networks promote a decentralized mode of work • Logically distributed • Divisions, departments, projects • Physically distributed • Offices, plants, factories A distributed database system makes all the data in each unit accessible and stores it to the location where it is most frequently used.

Distributed Database Concepts • Distributed database (DDB) – A collection of multiple, logically interrelated databases distributed over a computer network. • Distributed database management system (DDBMS) – A software system that manages a distributed database while making the distribution transparent to the user.

Distributed Database Concepts • A Distributed Database Management System (DDBMS) consists of a single logical database that is split into a number of fragments. • Each fragment is stored on computers that are under the control of a separate DBMS connected by a network • Each site is capable of processing data stored on other computers in the network. • Users access the distributed database on applications

Distributed Database Concepts • A Distributed Database Management System (DDBMS) has the following characteristics • A collection of logically related shared data • The data is split into a number of fragments • Fragments may be replicated • Fragments are allocated to sites • The sites are linked by a communications network • The data at each site is under the control of a DBMS • The DBMS at each site can handle the local applications autonomously • Each DBMS participates in at least one global application

Distributed Database Concepts • Distributed processing - The operation that occurs when an application distributes its tasks among different computers in a network • Distributed database system vs distributed processing • Replicated databases – The operation of copying and maintaining database objects in multiple databases belonging to a distributed system • Distributed database system vs Replicated database

Distributed Database Concepts • Transparency – the idea of hiding implementation details from end users • Data organization • Location • Naming • Replication • Fragmentation • Autonomy – the extent to which individual nodes in a connected DDB can operate independently • Design • Communication • Execution

Types of Distributed Databases • Homogeneous • all servers use identical software • all users use identical software • Ex. each database is an Oracle Database • Otherwise it is heterogeneous • Ex. at least one of the databases is not an Oracle Database

Types of Distributed Databases • Local Autonomy • There is a provision for the local site to function as a standalone DBMS. • Otherwise the system has no local autonomy • It can have some degree of local autonomy, if direct access by local transactions to a server is permitted

Types of Distributed Databases

Types of Distributed Databases • Federated database system (FDBS) • There is some global view or schema of the federation of databases that is shared by the applications • Multi-database system • Full local autonomy • Does not have a global schema but constructs one as needed • This is not very feasible in practice for enterprise databases

Types of Distributed Databases • Federated database system (FDBS) and Multi-database system • In these systems, each server is an independent and autonomous centralized DBMS. • They have their own local users, local transactions, and DBA, and a high degree of local autonomy. • Both are hybrids between distributed and centralized systems

Types of Distributed Databases Issues with FDBS • Differences in data models • In these systems, each server is an independent and autonomous centralized DBMS. • They have their own local users, local transactions, and DBA, and a high degree of local autonomy. • Both are hybrids between distributed and centralized systems • Differences in constraints • Different systems have different limitations • Global schema must deal with potential conflicts among constraints

Types of Distributed Databases Issues with FDBS • Differences in query languages • Even with same data model, the languages and their versions may vary • Ex. SQL has multiple versions and each has its own set of data types, operators, manipulation features, etc. • Semantic Heterogeneity • Occurs when there are differences in the meaning, interpretation, ad intended use of the same or related data. • This is the biggest problem in designing global schemas of heterogeneous databases.

Types of Distributed Databases Issues with FDBS • Freedom of choosing from these design parameters(below) which could affect the complexity of the FDBS • The universe of discourse from which the data is drawn • Ex. US and Japan have different sets of attributes about customer accounts • Currency rate fluctuations could also be a problem • Relations in these two DBs that have identical names may have some common and some entirely distinct information

Types of Distributed Databases Issues with FDBS • Representation and naming • Elements and structure of the model may be pre-specified for each local database • The understanding, meaning, and subjective interpretation of data • Transaction and policy constraints • Derivation of summaries • data processing features and operations supported by the system

Types of Distributed Databases Issues with FDBS • Semantic heterogeneity problems are being faced by all major multinational and governmental organizations in all application areas • Most enterprises are resorting to heterogeneous FDBSs, having heavily invested in the development of individual database systems

Types of Distributed Databases Issues with FDBS • Local component databases strive to preserve autonomy. • Communication autonomy • Ability to decide whether to communicate with another component DBS • Execution autonomy • Ability of a component DBS to execute local operations without interference from external operations by another component DBSs • Ability to decide order in which to execute • Association autonomy • Ability to decide whether and how much to share its functionality and resources with other component DBSs

Distributed Database Architectures • Distributed database architecture strives to provide the ultimate transparency

Distributed Database Architectures Shared memory – Multiple processors within a single system share system memory Shared disk – each processor accesses all disks, but each has its own memory Shared nothing – each processor is part of a complete system with its own memory and disk storage

Distributed Database Architectures This shows the logical structure of underlying data across all nodes It is represented by the global conceptual schema (GCS) which provides network transparency Each node has its own local internal schema (LIS) to accommodate potential heterogeneity The local conceptual schema (LCS) specifies the logical organization of data at each site

Distributed Database Architectures Three-tier Client-Server Architecture • A database server is the Oracle software managing a database • A client is an application that requests information from a server. • Each computer in a network is a node that can host one or more databases. • Each node in a distributed database system can act as a client, a server, or both, depending on the situation.

Distributed Database Architectures Presentation layer (client) Provides the user interface Handles user input, output, and navigation by accepting user commands and displaying the information in the form of static or dynamic web pages Application Layer (business logic) Programs the application logic Queries formulated from client – results formatted and sent to client Security checks, verification, and other functions Database server Handles query and update requests from the application layer, processes requests, and sends results Query results may be formatted into XML

Data Fragmentation, Replication, and Allocation Techniques A relation may be divided into a number of sub-relations called fragments which are then distributed. There are three types of fragmentation: • Horizontal • Subsets of tuples • Vertical • Subsets of attributes • Mixed (Hybrid)

Data Fragmentation, Replication, and Allocation Techniques Replication – the DDBMS may maintain a copy of a fragment at several different sites, or in extreme cases, replicate the whole database at every site • Advantages • Improves availability and performance • Disadvantages • Slows update operations and is expensive

Data Fragmentation, Replication, and Allocation Techniques Allocation – each fragment is stored at the site with ‘optimal’ distribution • Centralized • Fragmented • Complete replication • Selective replication

Query Processing and Optimization • Query mapping • Input query is specified formally using a query language • Then translated into an algebraic query on global relations • Does not take into account actual distribution and replication of data • This process is identical to a centralized DBMS • It is normalized, analyzed for semantic errors, simplified, and restructured into an algebraic query • Localization • Fragmentation results in relations being stored in separate sites, possibly replicated • Maps the distributed query on the global scheme to separate queries on individual fragments using data distribution and replication information

Query Processing and Optimization • Global Query Optimization • Selects a strategy that is closest to optimal • Time is the preferred unit for measuring cost • Total cost is a weighted combination of costs such as CPU cost, I/O costs, and communication costs • Communication costs over the network are the most significant, especially when the sites are connected through a wide area network (WAN) . 4. Local Query Optimization • Common to all sites in the DDB • Techniques similar to those in centralized systems • Stages 1, 2, and 3 are performed at a central control site, while this one is performed locally

Query Processing and Optimization • Data transfer costs • Cost of transferring data over the network • This is a factor that complicates query processing • Data includes intermediate files transferred to other sites for processing and final result files to be transferred to the site where the query result is needed • Costs may not be very high if sites are connected by a high performance local area network DDBMS query optimization algorithms consider the goal of reducing the amount of data transfer as criterion in choosing a distributed query execution strategy

Query Processing and Optimization

Query Processing and Optimization Consider the query: for each department, retrieve the department name and the name of the department manager 1. Transfer both the EMPLOYEE and the DEPARTMENT relations to the result site, and perform the join at site 3. A total of 1,000,000 + 3,500 = 1,003,500 bytes must be transferred 2. Transfer the EMPLOYEE relation to site 2, execute join at 2, and send result to 3. The size of the query result is 40 * 100 = 4,000 bytes, so 4,000 + 1,000,000 = 1,004,000 bytes must be transferred 3. Transfer the DEPARTMENT relation to site 1, execute join at 1, and send result to 3. 4,000 + 3,500 = 7,500 bytes must be transferred In this case we would choose strategy number 3

Query Processing and Optimization Using Semi-join • More complex, works better than the previous strategies • The idea is to reduce the number of tuples in a relation before transferring it to another site In other words: Send the joining column of one relation R to the site where the other relation S is located – column joined with S – join attributes along with attributes required for result are projected out and shipped back to original site and joined with R

Query Processing and Optimization Considering our previous example: • Project the join attributes of DEPARTMENT at site 2, transfer them to site 1. We transfer F = Mgr_ssn(DEPARTMENT), sizes 9 * 100 = 900 bytes • Join the transferred file with the employee at site 1, and transfer the required attributes to site 2. We transfer R = Mgr_ssn, Fname, Lname(F ⋈Mgr_ssn=Ssn EMPLOYEE), size is 39 * 100 = 3,900 bytes • Execute the query by joining the transferred file R with DEPARTMENT, and present the result to the user at site 2 Using semijoin we transferred 4,800 bytes. We limited the attributes and tuples transmitted to only those that will actually be joined.

Transaction Management • The transaction manager coordinates transactions on behalf of the application programs and communicates to the scheduler • The recover manager ensures the database is restored to the state it was in before the start of the transaction if there is a failure • There is a global transaction manager at each site to coordinate the execution of both global and local transactions

Transaction Management

Concurrency Control and Recovery • Problems occur in a distributed DBMS environment for concurrency control and recovery that are not encountered in a centralized DBMS environment. These are: • Dealing with multiple copies of the data items • Concurrency control is responsible for maintaining consistency among these copies • Recovery is responsible for making a copy consistent with other copies if the site on which the copy is stored fails and recovers later • Failure of individual sites • The DDBMS should continue to operate with its running sites when one or more individual sites fail • When site recovers it must be brought up to date with the rest of the sites before it rejoins the system

Concurrency Control and Recovery • Failure of communication links • Must be able to deal with failure of 1+ communication links • Network partitioning may occur – this breaks up the sites into 2+ partitions, the sites within each partition can only communicate with one another and not with sites in other partitions • Distributed commit • Problems arise with committing a transaction that is accessing databases stored on multiple sites if some sites fail during process – use two-phase commit protocol • Distributed deadlock • May occur among several sites • Must take this into account

Concurrency Control and Recovery Distinguished copy of a data item • This deals with replicated data items • Idea is to designate a particular copy of each data item as a distinguished copy • Locks for this data item are associated with the distinguished copy and all locking and unlocking requests are sent to the site that contains that copy

Concurrency Control and Recovery • Primary site technique • All distinguished copies are kept at the same site • A single primary site is designated to be the coordinator site for all database items • All locks kept at that site, all requests for locking/unlocking are sent there • Advantage: not overly complex, simple extension of the centralized approach • Disadvantages: (1) All locking requests are sent to a single site, chance for overloading and causing a system bottleneck and (2) Failure of the primary site paralyzes the system since all locking info is kept at that site. This can limit system reliability and availability.

Concurrency Control and Recovery • Primary Site with Backup Site • This fixes second disadvantage of the primary site by designating a second site to be a backup site. All locking information is maintained at both sites. • If primary site failure, backup site takes over as primary site, new backup site chosen • This technique slows down the process of acquiring locks because all requests/granting of locks must be recorded at both sites before a response is sent to the requesting transaction

Concurrency Control and Recovery • Primary Copy Technique • Attempts to distribute the load of lock coordination among various sites by having the distinguished copies of different data items stored at different sites • Can also use backup sites to enhance reliability and availability

Concurrency Control and Recovery • Choosing A New Coordinator Site in Case of Failure • When a coordinator site fails, the sites still running must choose a new coordinator • Primary site with no backup – executing transactions must be aborted and restarted • With backup sites – transaction processing is suspended while backup becomes new primary and new backup chosen and locking information copies are sent to new primary

Concurrency Control and Recovery • Election - If no backup site existed or if the primary and backup sites are down this is used to choose new coordinator site • Any site that attempts to communicate with the coordinator site continuously and fails starts the election process. • Messages are sent to all running sites proposing which site should become the new coordinator

Concurrency Control and Recovery • Voting method • No distinguished copy • Lock request is sent to all sites that includes a copy of the data item • Each copy maintains its own lock and can grant or deny the request for it • Lock requests are granted/denied by a majority vote from the sites involved

Concurrency Control and Recovery • Recovery process quite involved. • It is difficult to determine whether a site is down without exchanging numerous messages with other sites. • Another problem is with distributed commit • When transaction is updating data at several sites, it cannot commit until it is sure that the effect of the transaction on every site cannot be lost

Distributed Catalog Management Catalogs are databases containing metadata about the DDS • Centralized Catalog – The entire catalog is stored on a single site • Easy to implement but loses reliability, availability, and autonomy • Fully Replicated Catalog – Identical copies of the complete catalog are present at each site • Faster reads but updates must be broadcasted to each site • Partially Replicated Catalogs – Each site is permitted to cache entries received from remote sites • Autonomy not restricted but cached copies will not be the most recent

Current Trends • Cloud Computing • Infrastructure, platforms, and software over the internet • Peer-to-Peer Database Systems • Nodes are autonomous and are linked only to a small number of peers

Distributed Database in Oracle • Homogeneous Distributed Database

Distributed Database