Investigating Distributed Database Systems
This document investigates the complexities and technologies associated with distributed database systems. It defines a distributed database as a collection of interrelated databases across a computer network, highlighting various architectures, including client/server and peer-to-peer models. Key concepts such as fragmentation and replication are discussed, emphasizing their roles in data distribution and access. The document also addresses challenges in query processing, concurrency control, reliability protocols, and replication strategies needed to ensure performance and consistency across distributed environments.
Presentation Transcript
Investigating Distributed Database Systems: Challenges and Technology • Kishore Puppala Rao
Definitions • A database is a logically related collection of data, stored in one or many files • A distributed database is a collection of multiple, logically interrelated databases distributed over a computer network
Architecture • Client/server architectures • Multiple clients, single server – this is the most common and straightforward implementation • Multiple clients, multiple servers – more flexible. DB distributed over multiple servers. Each client directs requests to a “home” server.
Architecture (cont’d) • DB is physically distributed by fragmenting and replicating data (discussed later) • Regardless of architecture, implementation details of queries, transactions and DB operations should be transparent to users.
Architecture (Peer-to-peer) • No distinction between client and server • Each site has the functionality of both client and server • E.g. file-sharing apps such as BearShare, LimeWire • Sophisticated protocols are needed to manage data distributed across multiple sites
Fragmentation • Partitions the data • Subdivides each relation either vertically (by project operation) or horizontally (by selection operation) • Facilitates the placement of data close to its place of use, reducing transmission costs
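The two fragmentation styles can be illustrated with a small in-memory relation. This is only a sketch: the relation `EMPLOYEES`, its columns, and the helper names are hypothetical and chosen for the example. Horizontal fragments are built by selection, vertical fragments by projection (keeping the key so the relation can be reconstructed).

```python
# Hypothetical relation held as a list of rows; names and data are illustrative only.
EMPLOYEES = [
    {"id": 1, "name": "Ann", "city": "Boston", "salary": 70000},
    {"id": 2, "name": "Bob", "city": "Denver", "salary": 65000},
    {"id": 3, "name": "Cy", "city": "Boston", "salary": 82000},
]


def horizontal_fragment(relation, predicate):
    """Selection: keep the whole rows that satisfy the predicate."""
    return [dict(row) for row in relation if predicate(row)]


def vertical_fragment(relation, columns, key="id"):
    """Projection: keep only some columns, always including the key
    so that the fragments can be rejoined later."""
    wanted = [key] + [c for c in columns if c != key]
    return [{c: row[c] for c in wanted} for row in relation]


def rejoin(left, right, key="id"):
    """Reconstruct full rows by joining two vertical fragments on the key."""
    right_by_key = {row[key]: row for row in right}
    return [{**row, **right_by_key[row[key]]} for row in left]


# Horizontal fragments: Boston rows could be stored at a Boston site.
boston = horizontal_fragment(EMPLOYEES, lambda r: r["city"] == "Boston")
others = horizontal_fragment(EMPLOYEES, lambda r: r["city"] != "Boston")

# Vertical fragments: payroll data kept apart from directory data.
directory = vertical_fragment(EMPLOYEES, ["name", "city"])
payroll = vertical_fragment(EMPLOYEES, ["salary"])
```

The union of the horizontal fragments, or the join of the vertical fragments on `id`, gives back the original relation.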
Replication • Refers to duplication of data for access and/or security purposes • Fragments or the whole database may be replicated • Replication involves keeping physically separate copies of data at different sites
Distributed vs. Parallel • Distributed DBMS are not parallel DBMS, although the distinction may be unclear • Distributed DBMS assume a loose connection between processors that operate independently, perhaps under different operating systems
Parallel DBMS • Multiple processors under the same operating system • Architecture: shared-nothing, shared-disk, or shared-memory • Shared-nothing: Each processor has exclusive access to its main memory and disk. Each processing element (PE) is a local site.
Parallel DBMS (cont’d) • Shared-memory: Each PE has access to any memory module or disk through some fast interconnect (e.g. a high-speed bus or cross-bar switch) • Shared-disk: Each PE has exclusive access to its own memory, but shared access to any disk via a fast connection. A PE accesses DB pages on the shared disk and copies them to its local cache
Transparency • Distributed (and parallel) DBMS must provide the same functionality and consistency as a centralized DBMS • Transparency implies presenting a consistent view that shields the user from implementation details such as fragmentation, replication, and distribution • Providing this transparency introduces major challenges
Challenges • Query processing and optimization • Concurrency control • Reliability protocols • Replication protocols
Query Processing and Optimization • Techniques are needed to address the difficulties arising from data distribution and fragmentation; localization techniques are employed • Algebraic queries on global relations are transformed to operate on fragments • Opportunities for parallel processing are identified (fragments are stored at different sites), and unnecessary work is eliminated (not all fragments may be involved in the query)
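Localization can be sketched as follows, assuming horizontal fragments defined by ranges on a single attribute. The fragment layout, site names, and function names are hypothetical. A selection on the global relation is rewritten to run only on fragments whose defining range can overlap the query range, and the partial results are unioned.

```python
# Hypothetical horizontal fragments of one global relation. Each fragment is
# defined by a half-open range [low, high) on "id" and stored at one site.
FRAGMENTS = {
    "site_A": {"range": (0, 100), "rows": [{"id": 5}, {"id": 42}]},
    "site_B": {"range": (100, 200), "rows": [{"id": 150}]},
    "site_C": {"range": (200, 300), "rows": [{"id": 250}, {"id": 299}]},
}


def relevant_sites(low, high):
    """Return the sites whose fragment range overlaps the query range [low, high)."""
    return [
        site
        for site, frag in FRAGMENTS.items()
        if frag["range"][0] < high and low < frag["range"][1]
    ]


def localized_select(low, high):
    """Run the selection only on the relevant fragments and union the results.
    In a real system each per-fragment selection could run in parallel at its site."""
    result = []
    for site in relevant_sites(low, high):
        rows = FRAGMENTS[site]["rows"]
        result.extend(r for r in rows if low <= r["id"] < high)
    return result
```

For example, a query for ids in [120, 260) never touches `site_A`, because that fragment's range cannot contain a matching row.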
Query optimization • Determining the execution sites for distributed operations • Identifying the best distributed algorithm for distributed operations • Changing the order of operations in a query
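Choosing an execution site can be illustrated with a deliberately simple cost model that counts only the bytes shipped over the network. This is an assumption made for the sketch; real optimizers also weigh local processing cost, and the function and site names here are hypothetical.

```python
def cheapest_join_site(operand_sizes):
    """operand_sizes maps a site name to the size in bytes of the join operand
    stored there. Executing the join at a site means shipping every other
    operand to it, so the transfer cost is the total size minus the local size.
    Returns (best_site, bytes_shipped)."""
    total = sum(operand_sizes.values())
    costs = {site: total - size for site, size in operand_sizes.items()}
    best = min(costs, key=costs.get)
    return best, costs[best]


# The large relation stays put and the small one is shipped to it.
site, shipped = cheapest_join_site({"s1": 10_000, "s2": 500})
```

Under this model the join runs where the largest operand already resides.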
Concurrency Control • The challenge in synchronizing user transactions is to extend serializability and concurrency control to the distributed execution environment • Serializability: the ability to execute a set of transactions concurrently with the same effect as if they had been executed in some serial order. In a distributed DBMS this requires that: • (a) the execution of the set of transactions at each site is serializable • (b) the serialization orders of these transactions at all of these sites are identical
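Condition (b) can be checked mechanically once each site reports its local serialization order. The sketch below assumes condition (a) already holds and that each site supplies its order as a list of transaction ids (a hypothetical input format). It merges the per-site orders into one precedence graph and looks for a global order compatible with all of them, using the standard-library `graphlib` module (Python 3.9+).

```python
from graphlib import CycleError, TopologicalSorter


def global_serial_order(site_orders):
    """site_orders maps a site name to the list of transaction ids in that
    site's local serialization order. Returns one global order consistent
    with every site, or None if the sites disagree."""
    sorter = TopologicalSorter()
    for order in site_orders.values():
        for txn in order:
            sorter.add(txn)
        # Each transaction must follow its predecessor at this site.
        for earlier, later in zip(order, order[1:]):
            sorter.add(later, earlier)
    try:
        return list(sorter.static_order())
    except CycleError:
        # Some sites order the same transactions in conflicting ways.
        return None
```

If site A serializes T1 before T2 while site B serializes T2 before T1, the merged graph contains a cycle and no global serial order exists.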
Concurrency (cont’d) • If locking-based algorithms are used, lock management may be centralized or distributed • Deadlocks must be avoided • Deadlock detection and management in a distributed database can be difficult
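One reason distributed deadlock detection is difficult is that a deadlock may be invisible at every individual site. A minimal sketch, assuming each site can export its local wait-for graph to a single detector (a centralized scheme; the data layout and function name are hypothetical):

```python
def find_deadlock(local_wait_for_graphs):
    """Each local graph maps a transaction id to the set of transactions it
    waits for at that site. Returns a list of transactions forming a cycle
    in the merged (global) wait-for graph, or None if there is no deadlock."""
    global_graph = {}
    for graph in local_wait_for_graphs:
        for txn, waits_for in graph.items():
            global_graph.setdefault(txn, set()).update(waits_for)
            for other in waits_for:
                global_graph.setdefault(other, set())

    path, finished = [], set()

    def dfs(txn):
        if txn in finished:
            return None
        if txn in path:
            # Reached a transaction already on the current path: a cycle.
            return path[path.index(txn):]
        path.append(txn)
        for nxt in sorted(global_graph[txn]):
            cycle = dfs(nxt)
            if cycle:
                return cycle
        path.pop()
        finished.add(txn)
        return None

    for txn in sorted(global_graph):
        cycle = dfs(txn)
        if cycle:
            return cycle
    return None


# T1 waits for T2 at site 1, T2 waits for T1 at site 2.
# Neither local graph has a cycle, but the merged graph does.
site1 = {"T1": {"T2"}}
site2 = {"T2": {"T1"}}
deadlock = find_deadlock([site1, site2])
```

Once a cycle is found, a system would typically abort one of the transactions in it; the choice of victim is outside this sketch.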
Reliability protocols • Several types of failures: System, media, transaction, communication • May be difficult to differentiate type of failure • Distributed reliability protocols enforce transaction atomicity (commit all or commit nothing)
Reliability (cont’d) • E.g. of Atomic commitment protocol: Two-phase commit • All sites involved in the execution of a distributed transaction must agree to commit the transaction before it is made permanent.
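The agreement step of two-phase commit can be sketched in a few lines. This is a simplified, single-process illustration: the `Participant` class and its methods are hypothetical, and timeouts, logging, and recovery from coordinator or site failure are deliberately omitted.

```python
class Participant:
    """Hypothetical stand-in for one site taking part in a distributed transaction."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "active"

    def prepare(self):
        # Phase 1: the site votes yes only if it can guarantee a later commit.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Coordinator logic: commit only if every participant votes yes."""
    # Phase 1 (voting): collect a vote from every site.
    votes = [p.prepare() for p in participants]
    # Phase 2 (decision): broadcast the same outcome to all sites.
    if all(votes):
        for p in participants:
            p.commit()
        return "commit"
    for p in participants:
        p.abort()
    return "abort"
```

A single "no" vote aborts the transaction at every site, which is how the protocol enforces the all-or-nothing property.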
Replication protocols • Each logical data item has a number of physical instances • The challenge is to maintain (or approximate) consistency among the physical copies as users update the logical data • Example criterion: One-copy equivalence – all physical copies of a logical data item should be equivalent after being updated by a transaction • Read-One/Write-All (ROWA) protocol – enforces one-copy equivalence. Disadvantage: failure of one site may block the entire transaction
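ROWA and its main weakness can be shown with a small sketch. The `Replica` class and the function names are hypothetical; a real system would perform the write under a commit protocol rather than in a plain loop.

```python
class SiteUnavailable(Exception):
    """Raised when a required replica site cannot be reached."""


class Replica:
    """Hypothetical physical copy of one logical data item."""

    def __init__(self, value=None, available=True):
        self.value = value
        self.available = available


def rowa_read(replicas):
    """Read-one: any single available copy is enough."""
    for replica in replicas:
        if replica.available:
            return replica.value
    raise SiteUnavailable("no replica available for reading")


def rowa_write(replicas, value):
    """Write-all: every copy must be updated, so one failed site blocks the write."""
    if not all(replica.available for replica in replicas):
        raise SiteUnavailable("a replica is down; the write cannot proceed")
    for replica in replicas:
        replica.value = value
```

Reads stay cheap and available as long as one copy is reachable, while a write needs every site to be up.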
Replication (cont’d) • Alternative algorithms relax ROWA by mapping each write to a subset of the physical copies • Quorum-based voting: Copies are assigned votes; read and (especially) write operations have to collect votes and reach a quorum to commit data. (see class notes)
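A minimal sketch of quorum-based voting follows. It assumes the commonly used constraints that the read quorum r and write quorum w satisfy r + w > V and 2w > V, where V is the total number of votes, and it tracks a version number per copy so that a reader can pick the newest value. The class and function names are hypothetical.

```python
class Copy:
    """Hypothetical physical copy carrying a number of votes."""

    def __init__(self, votes=1):
        self.votes = votes
        self.version = 0
        self.value = None
        self.available = True


def valid_quorums(copies, r, w):
    """Check the overlap constraints: r + w > V and 2w > V."""
    total = sum(c.votes for c in copies)
    return r + w > total and 2 * w > total


def collect(copies, quorum):
    """Gather available copies until their votes reach the quorum, else None."""
    chosen, votes = [], 0
    for copy in copies:
        if copy.available:
            chosen.append(copy)
            votes += copy.votes
            if votes >= quorum:
                return chosen
    return None


def quorum_write(copies, value, w):
    """Update a write quorum of copies; returns False if no quorum is reachable."""
    chosen = collect(copies, w)
    if chosen is None:
        return False
    # Any two write quorums overlap, so the newest version is among the chosen.
    new_version = max(c.version for c in chosen) + 1
    for copy in chosen:
        copy.value, copy.version = value, new_version
    return True


def quorum_read(copies, r):
    """Read from a read quorum and return the value with the highest version."""
    chosen = collect(copies, r)
    if chosen is None:
        return None
    return max(chosen, key=lambda c: c.version).value
```

With three one-vote copies and r = w = 2, a write succeeds even when one site is down, and every later read quorum still contains at least one up-to-date copy.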
Research and Trends • Workflow models (advanced transaction models) • Network scaling problems • Multi-database systems and interoperability • Distributed object management
Trends (cont’d) • Primitive objects are no longer simple-structured data; they can consist of programs, voice, images, etc. • Distributed DBMS must handle increasingly large data objects. E.g. 1 MB of storage is needed for one digital X-ray image (1024x1024) @ 8 bits/pixel • Most commercial DBMS (e.g. MS SQL Server 2000, Oracle 8i) provide some sort of distribution • Emergence of broadband networks eliminates the network as a bottleneck
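The X-ray storage figure above follows directly from the image dimensions, assuming an uncompressed image with no header. The helper name is hypothetical.

```python
def image_bytes(width, height, bits_per_pixel):
    """Uncompressed image size in bytes, rounded up to a whole byte."""
    total_bits = width * height * bits_per_pixel
    return (total_bits + 7) // 8


# 1024 x 1024 pixels at 8 bits (1 byte) per pixel.
xray_bytes = image_bytes(1024, 1024, 8)
xray_megabytes = xray_bytes / 2**20
```

That is 1,048,576 bytes, or exactly 1 MB when a megabyte is taken as 2^20 bytes.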
Trends (cont’d) • Mobile computing is growing in interest and prevalence • Mobile stations may download data as needed • Alternatively, more powerful mobile stations may store native data for sharing with others • Mobility raises issues of address migration, maintenance of directories, and determining the location of stations • Distributed object platforms, e.g. CORBA (platform independent), COM/OLE (MS-specific)
CORBA • Common Object Request Broker Architecture • Facilitates maintenance of, and database access to, data from a number of autonomous and heterogeneous sources (e.g. file systems, spreadsheets) via a multidatabase approach • Provides a generic platform for distributed computing
CORBA (cont’d) • In multidatabase systems, the main problem is the heterogeneity that exists at four levels: platform, communication, database system, and semantic • CORBA facilitates implementation transparency by providing client access via interfaces defined in a special Interface Definition Language (IDL), independent of the database's actual software and hardware environment • Provides location transparency, allowing clients to access DB objects independent of location and communication protocols
CORBA (cont’d) • Provides a common interface to mask heterogeneity among native database system implementations based on different data models (e.g. flat-file, relational, spreadsheet) and query languages • Common interface overcomes semantic conflicts such as schema and data conflicts
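The idea of a common interface over heterogeneous sources can be illustrated without CORBA itself. The sketch below is plain Python, not CORBA or IDL: the `RecordSource` interface, the two wrapper classes, and the `items` table are all hypothetical. It wraps a flat-file source and a relational source (an in-memory SQLite database from the standard library) behind one method so that client code does not depend on either data model.

```python
import csv
import io
import sqlite3
from abc import ABC, abstractmethod


class RecordSource(ABC):
    """Hypothetical common interface; clients see only this."""

    @abstractmethod
    def find(self, key):
        """Return the name stored under an integer key, or None."""


class FlatFileSource(RecordSource):
    """Wraps CSV text with the columns id,name."""

    def __init__(self, text):
        reader = csv.DictReader(io.StringIO(text))
        self._names = {int(row["id"]): row["name"] for row in reader}

    def find(self, key):
        return self._names.get(key)


class RelationalSource(RecordSource):
    """Wraps a SQLite connection holding a table items(id, name)."""

    def __init__(self, connection):
        self._connection = connection

    def find(self, key):
        row = self._connection.execute(
            "SELECT name FROM items WHERE id = ?", (key,)
        ).fetchone()
        return row[0] if row else None


def lookup(sources, key):
    """Client code: query every source through the same interface."""
    for source in sources:
        name = source.find(key)
        if name is not None:
            return name
    return None


connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
connection.execute("INSERT INTO items VALUES (2, 'bolt')")
sources = [FlatFileSource("id,name\n1,nut\n"), RelationalSource(connection)]
```

The caller of `lookup` never learns which data model or query language answered the request, which is the kind of masking the slide describes.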
References • M.T. Ozsu and P. Valduriez, "Distributed and Parallel Database Systems – Technology and Current State-of-the-Art", ACM Computing Surveys, 28(1): 125-128, March 1996. • A. Dogac, C. Dengi and M.T. Ozsu, "Distributed Object Computing Platforms", Communications of the ACM, 41(9): 95-103, September 1998. • J.N. Gray, "Notes on Data Base Operating Systems", in Operating Systems: An Advanced Course, R. Bayer, R.M. Graham (eds.), New York: Springer-Verlag, 1979, pp. 393-481.
References (cont’d) • M.T. Ozsu, "The Push/Pull Effect - Can Distributed Database Technology Meet The Challenges of New Applications?", Database Programming & Design, April 1997.