Distributed Database Design & Semantic Data Control

Distributed Database Design& Semantic Data Control Univ.-Prof. Dr. Peter Brezany Institut für Scientific Computing Universität Wien Tel. 4277 39425 Sprechstunde: Di, 13.00-14.00 LV-Portal: www.par.univie.ac.at/~brezany/teach/gckfk/06ss/300658.html

Distributed Database Design

A General Framework Considered in the Distributed Database Design Problem • The organization of distributed systems can be investigated along 3 dimensions: • Level of Sharing • (a) no sharing: each application and its data execute at one site, and there • is no communication with any other program or access to any data file • at other sites (not very common today) (b) data sharing: all the programs are replicated at all the sites, but data files are not (c) data-plus-program sharing: a program at a given site can request a service from another program at a second site, which, in turn, may have to access a data file located at a 3. site.

A General Framework Considered in the Distributed Database Design Problem (2) 2. Access patterns (along this dimension, the relationship between distributed database design and query processing is established) (a) static (they do not change over time) – the design is easier for static environments; unfortunately, it is difficult to find many real-life applications that would be classified as static. (b) dynamic 3. Level of knowledge about the access pattern(AP) behavior (a) no knowledge (a theoretical possibility) it is very difficult to design a distr. database that can effectively cope with this situation (b) complete knowledge – AP can be re- sonably (almost precicely) predicted (c) partial information – there are deviations from the predictions

Top-Down approach Bottom-Up approach In most database designs, the two approaches may need to be applied to complement one another. It is a joint function of the database, enterprise, and application system administrators (or of the administrator performing all three roles) Alternative Design Strategies

Top-Down Design Process

Top-down design is a suitable approach when a database system is being designed from scratch. Commonly, a number of databases already exist, and the design task involves integrating them into one database.  the bottom-up approach is suited for this type of environment. The starting point is individual local conceptual schemas. The process consists of integrating local schemas into the global conceptual schemas. This type of environment exists primarily in the context of heterogeneous databases. Bottom-Up Design Process

Why fragment at all? How should we fragment? How much should we fragment? Is there any way to test the correctness of decomposition? How should we allocate? What is necessary information for fragmentation and allocation? Distribution Design Issues - Fragmentation

The important issue is the appropriate unit of distribution: a relation is not a suitable unit – why? Application views are usually subsets of relations, therefore the locality of accesses of applications is defined not on entire relations but on their subsets  it is only natural to consider subsets of relations as distribution units. If the applications that have views defined on a given relation reside at different sites, two alternatives can be followed, with the entire relation being the unit of distribution: The relation is not replicated  this results into in an unnecessarily high volume of remote data accesses. The relation is replicated at all or some of the sites where the applications reside  problems in executing updates and may not be desirable if storage is limited. The decomposition of a relation into fragments, each being treated as a unit, permits a large number of transactions to execute concurrently and intraquery concurrency is enabled. Disadvantages of fragmentation Applications whose views are defined on more than one fragment may suffer performance degradation. Integrity checking (semantic data control): Attributes participating in a dependency may be decomposed into different fragments which might be allocated to different sites. Reasons for Fragmentation

Fragmentation Alternatives:Horizontal and Vertical Fragmentation Example Database

Example of Horizontal Fragmentation

Example of Vertical Fragmentation Remark: The fragmentation may, of course, be nested. If the nestings are of different types, one gets hybrid fragmentation.

Hybrid (Mixed) Fragmentation Relation A Fragment H1 Fragment H2V1H1 Fragment H2V2 Fragment H2V1H2 Fragment H2V1H3

Completness. If a relation R is decomposed into fragments R1, R2, ..., Rn, each data item that can be found in R can also be found in one or more of Ri´s. Reconstruction. If a relation R is decomposed into a set fragments FR = {R1, R2, ..., Rn}, it should be possible to define a relational operator  such that R = Ri, Ri  FR The operator  will be different for for the different forms of fragmentation. The reconstructability of the relation from its fragments ensures that constraints defined on the data in the form of dependencies are preserved. Disjointness. If a relation R is horizontally decomposed into fragments R1, R2, ..., Rn and data item di is in Rj, it is not in any other fragment Rk (k  j).If relation R is vertically decomposed, its primary key attributes are typically repeated in all its fragments  disjointness is defined only on the nonprimary key attributes of a relation. Correctness Rules of Fragmentation

Allokation • Sei F = {F1, ..., Fn} eine Menge von Fragmenten, S = {S1, ...,Sm} ein Netzwerk gegeben durch die Menge seiner Sites, und Q = {Q1, ..., Qp} die Menge der relevanten Anwendungen. • Allokationsproblem: • Was ist die „optimale“ Zuordnung von F zu S bzgl. Q ? • Optimaltätskriterium: • Minimalität der Kosten gegeben durch die Speicherkosten der Fian den Sites Sj, der Anfragekosten für Fi an Site Sj, der Änderungskosten der Fi an allen Sites an den sie gespeichert sind, und die Kosten der Datenkommunikation. • Performanz im Sinne von Antwortzeiten oder Systemdurchsatz.

Sites R1 R1,1 S1 R2 R R2,1 R1,2 R3 R4 Globale Relation S2 R2,2 Fragmente Fragmente und ihre Allokation Allokation an den Sites R3,3 S3 R4,3

Zusätzliche Beispiele Beispiel 1: horizontale Fragmentierung

Beispiel 2: abgeleitete horizontale Fragmentierung

Beispiel 3: vertikale Fragmentierung

Semantic Data Control

View management Security control Semantic integrity control. The above items can be defined by rules that the system automatically enforces. The violation of some such rule by a user program (a set of database operations) generally implies the rejection of the effects of that program. The rules are defined by the DB administration. The cost of enforcing semantic data control, which is high in terms of resource utilization in a centralized DBMS, can be prohibitive in a distributed environment. The rules must be stored in a catalog (directory)  efficient management in a distributed environment is important Main Objectives

In a rel. DB system a view is a virtual relation, defined as the result of a query on base relations (or real relations), but not materialized like a base relation, which is stored in the DB. An external schema can be defined as a set of views and/or base relations. Besides their use in external schemas, views are useful for ensuring data security in a simple way  if users may only access the database through views, they cannot see or manipulate the hidden data, which are therefore secure. In a distributed DBMS, a view can be derived from distributed relations, and the access to a view requires the execution of the distributed query corresponding to the view definition. An important issue in a distributed DBMS is to make view materialization efficient. View Management

Example 1 The view of system analysts (SYSAN) derived from relation EMP (ENO,ENAME,TITLE), can be defined by the following SQL query: CREATE VIEW SYSAN(ENO, ENAME) AS SELECT ENO, ENAME FROM EMP WHERE TITLE = ``Syst. Anal.´´ The single effect is the storage of the view defintion in the catalog. No other information needs not be recorded. Therefore, the result to the query defining the view (i.e., a relation having the attributes ENO and ENAME for the system analysts as shown in the figure below) is not produced. However, the view SYSAN can be manipulated as a base relation. Views in Centralized DBMSs ENO ENAME E2 M.Smith E5 B. Casey E8 J.Jones

Example 2 The query „Find the names of all the system analysts with their project number and responsibility(ies)“ involving the view SYSAN and relation ASG(ENO,PNO,RESP,DUR) can be expressed as SELECT ENAME, PNO, RESP FROM SYSAN, ASG WHERE SYSAN.ENO = ASG.ENO Mapping a query expresed on views into a query expressed on base relations can be done by query modification at compile time. The preceding can be modified to SELECT ENAME, PNO, RESP FROM EMP, ASG WHERE EMP.ENO = ASG.ENO AND TITLE = ``Syst. Anal.´´ Views in Centralized DBMSs (cont.) ENAME PNO RESP M.Smith P1 Analyst M.Smith P2 Analyst B.Casey P3 Manager J.Jones P4 Manager

The definition of a view is similar as in centralized systems – distinction, a view may be derived from fragmented relations stored at different sites. View definitions can be centralized at one site, partially duplicate, or fully duplicated. Processing a query expressed on views: distributed query processor. Views derived from distributed relations may be costly to evaluate. Since in a given organization it is likely that many users access the same views, some proposals have been made to optimize view derivation. An alternative solution is to avoid view derivation by maintaining actual versions of the views – snapshots. A snapshot represents a particular state of the database and is therefore static, meaning that it does not reflect updates to base relations. Views in Distributed DBMSs

Data security includes 2 aspects: Data protection. It is required to prevent unauthorized users from understanding the physical content of data. Authorization control. It must guarantee that only authorized users perform operations they are allowed to perform on DB. Checking whether a given triple (user, operation, object) can be allowed to proceed. The introduction of a user: (user name, password). The user name uniquely identifies the users of that name in the system, while the password, known only to the users of that name, authenticates the users. Distributed authorization control Additional complexity: objects and subjects are distributed remote user authentication management of distributed authorization rules handling of views and of user groups (DB administration is simplified) Data Security

How to guarantee database consistency? A DB state is said to be consistent if the DB satisfies a set of constraints, called semantic integrity constraints. Mantaining a consistent DB requires various mechanisms: concurrency control, reliability, protection, and semantic integrity control (SIC). SIC ensures DB consistency by rejecting update programs which lead to inconsistent DB states, or by activating specific actions on the DB state, which compensate for the effects of the update programs. SI constraints are rules that represent the knowledge about the properties of an application. Structural constraints (e.g. unique key constraints in the rel. model) Behavioral constraints regulate the application behavior. Semantic Integrity Control

The language for expression and manipulating integrity assertions. Examples: Employee number in relation EMP cannot be null: ENO NOT NULL in EMP The pair (ENO,PNO) is the unique key in relation ASG: (ENO,PNO) UNIQUE IN ASG Enforcement mechanism that performs specific actions to enforce DB integrity and updates. Examples: The query for increasing the budget of the CAD/CAM project by 10%, which would be specified as UPDATE PROJ SET BUDGET = BUDGET * 1.1 WHERE PNAME = ``CAD/CAM´´ will be transformed into the following query in order to enforce the domain constraint: CHECK ON PROJ (BUDGET >= 500000 AND BUDGET <= 1000000) UPDATE PROJ SET BUDGET = BUDGET * 1.1 WHERE PNAME = ``CAD/CAM´´ AND NEW.BUDGET >= 500000 AND NEW.BUDGET <= 1000000) Centralized Semantic Integrity Control

Individual assertions. They refer only to tuples to be updated independently of the rest of the database. Example: CHECK ON PROJ (BUDGET >= 500000 AND BUDGET <= 1000000). The assertion definition is sent to all other sites that contain fragments of the relation involved in the assertion. The assertion must be compatible with the relation data at each site. Compatibility can be checked at two levels: predicate and data. First, predicate compatibility is verified by comparing the assertion predicate with the fragment predicate. An asertion C is not compatible with a fragment predicate p if „C is true“ implies that „p is false,“ and is compatible with p otherwise. If noncompatibility is found at one of the sites, the assertion definition is globally rejected because tuples of that fragment do not satisfy the integration constraints. Second, if predicate compatibility has been found, the assertion is tested against the instance of the fragment. If it is not satisfied by that instance, the assertion is also globally rejected. If compatibility is found, the assertion is stored at each site. Distributed Semantic Integrity Control

Distributed Individual Assertions: Example Consider relation EMP, horizontally fragmented across three sites using the predicates p1 : 0 <= ENO < „E3“ p2: „E3“ <= ENO <= „E6“ p3: ENO > „E6“ and the domain assertion C: ENO < „E4“. Assertion is compatible with p1 (if C is true, p1 is true) and p2 (if C is true, p2 is not necessarily false), but is not with p3 (if C is true, then p3 is false). Therefore, assertion C should be globally rejected because the tuples at site 3 cannot satisfy C, and thus relation EMP does not satisfy C.

They include single-relation multivariable constraints such as functional dependency (Example 1) and multirelation multi-variable constraints such as foreign key constraints (Example 2) Example 1: The employee number functionally determines the employee name. ENO IN EMP DETERMINES ENAME Example 2: The project number PNO in relation ASG is a foreign key matching the primary key PNO of relation PROJ. In other words, a project referred to in relation ASG must exist in relation PROJ. PNO IN ASG REFERENCES PNO IN PROJ Set-Oriented Assertions

Example: The total duration for all employees in the CAD/CAM project is less than 100. CHECK ON g:ASG, j:PROJ (SUM(g.DUR WHERE g.PNO=j.PNO) < 100 IF j.PNAME=``CAD/CAM´´) These assertions are among the most costly to test because they require the calculations of the aggregate functions. Assertions Involving Aggregates

Distributed Database Design & Semantic Data Control