Database II

Database II

Goals and Objectives • The course goals are to introduce students to the Advance Topics of Databases Systems. • DBMS are at the heart of modern commercial application development. • use extends beyond this to many applications • Large amounts of data Storage with efficient update and retrieval. • The aim of this course is to study the internals of DBMSs. Students are introduced to the major problems to be faced when building a DBMS and also the ideas behind and algorithms to implement solutions of these problems.

Text Book & Reference Books • Text Book • Fundamentals of Database Systems by R. Elmasri and S. Navathe, 5th Edition 2007, Benjamin/Cummings • Database Systems Concepts By Abraham Silberschatz (Fifth Edition) • Reference Books • Modern Database management , Eight Edition, By Jefrey A. Hoffer & Mary B. Prescott.

Background: Prerequisite lectures Relational Algebra Sects: 6.1 –– 6.3 – 6.4 – 6.5 Basic Algebra Operations (projection, selection, and join) Types of join (Θ-join, Equi-join, and Natural join) Set Operations (Union, Intersection, and Difference) Division Operation SQL Sects: 8.4 – 8.5 – 8.8 Basic SELECT-FROM-WHERE query block Nested queries Views Indexing Sects: 14.1 – 14.2 – 14.3

Contents • Chapter 15 - Algorithms for Query Processing and Optimization 3-4 Weeks • Translating SQL Queries into Relational Algebra • Algorithms for External Sorting • Algorithms for SELECT and JOIN Operations • Algorithms for PROJECT and SET Operations • Implementing Aggregate Operations and OUTER JOINS • Combining Operations Using Pipelining • Using Heuristics in Query Optimization • Using Selectivity and Cost Estimates in Query Optimization • Overview of Query Optimization in Oracle • Semantic Query Optimization

Contents Cont… • Transaction Processing 5-6 Weeks • Chapter 17 - Introduction to Transaction Processing Concepts and Theory • Introduction to Transaction Processing • Transaction and System Concepts • Desirable Properties of Transactions • Characterizing Schedules Based on Recoverability • Characterizing Schedules Based on Serializability • Transaction Support in SQL

Contents Cont… • Chapter 18 - Concurrency Control Techniques • Two-Phase Locking Techniques for Concurrency Control • Concurrency Control Based on Timestamp Ordering • Multiversion Concurrency Control Techniques • Validation (Optimistic) Concurrency Control Techniques • Granularity of Data Items and Multiple Granularity Locking • Using Locks for Concurrency Control in Indexes • Other Concurrency Control Issues

Contents Cont… • Chapter 19 - Database Recovery Techniques • Recovery Concepts • Recovery Techniques Based on Deferred Update • Recovery Techniques Based on Immediate Update • Shadow Paging • The ARIES Recovery Algorithm • Recovery in Multidatabase Systems • Database Backup and Recovery from Catastrophic Failures • Security 2 Weeks • Chapter 23: Database Security

QUERY PROCESSING and OPTIMIZATION

Agenda of QUERY PROCESSING and OPTIMIZATION • Query Processing and Optimization: Why? • Steps of Processing 0. Introduction to Query Processing 1. Translating SQL Queries into Relational Algebra 2. Algorithms for External Sorting 3. Algorithms for SELECT and JOIN Operations 4. Algorithms for PROJECT and SET Operations 5. Implementing Aggregate Operations and Outer Joins 6. Combining Operations using Pipelining

Agenda of QUERY PROCESSING and OPTIMIZATION III. Methods of Optimization 7. Using Heuristics in Query Optimization • Heuristic (Logical Transformations) • Transformation Rules • Heuristic Optimization Guidelines 8. Using Selectivity and Cost Estimates in Query Optimization • Cost Based (Physical Execution Costs) • Data Storage/Access Refresher • Catalog & Costs 9. Overview of Query Optimization in Oracle 10. Semantic Query Optimization IV. What All This Means To YOU?

Query Processing & Optimization What is Query Processing? • Steps required to transform high level SQL query into a correct and “efficient” strategy for execution and retrieval. What is Query Optimization? • The activity of choosing a single “efficient” execution strategy (from hundreds) as determined by database catalog statistics.

Questions for Query Optimization • Which relational algebra expression, equivalent to the given query, will lead to the most efficient solution plan? • For each algebraic operator, what algorithm (of several available) do we use to compute that operator? • How do operations pass data (main memory buffer, disk buffer,…)? • Will this plan minimize resource usage? (CPU/Response Time/Disk)

Query Processing: Who needs it? A motivating example: • Identify all managers who work in a London branch • SELECT * • FROM Staff s, Branch b • WHERE s.branchNo = b.branchNo AND s.position = ‘Manager’ AND b.city = ‘london’; Results in these equivalent relational algebra statements (1) s(position=‘Manager’)^(city=‘London’)^(Staff.branchNo=Branch.branchNo) (Staff X Branch) (2) s(position=‘Manager’)^(city=‘London’) (StaffwvStaff.branchNo = Branch.branchNoBranch) (3) [s(position=‘Manager’) (Staff)] wvStaff.branchNo = Branch.branchNo [s(city=‘London’) (Branch)]

A Motivating Example (cont…) Assume: • 1000 tuples in Staff. • ~ 50 Managers • 50 tuples in Branch. • ~ 5 London branches • No indexes or sort keys • All temporary results are written back to disk (memory is small) • Tuples are accessed one at a time (not in blocks)

Motivating Example: Query 1 (Bad) s(position=‘Manager’)^(city=‘London’)^(Staff.branchNo=Branch.branchNo)(Staff X Branch) • Requires (1000+50) disk accesses to read from Staff and Branch relations • Creates temporary relation of Cartesian Product (1000*50) tuples • Requires (1000*50) disk access to read in temporary relation and test predicate Total Work = (1000+50) + 2*(1000*50) = 101,050 I/O operations

Motivating Example: Query 2 (Better) s(position=‘Manager’)^(city=‘London’)(Staff wvStaff.branchNo = Branch.branchNoBranch) • Again requires (1000+50) disk accesses to read from Staff and Branch • Joins Staffand Branch on branchNo with 1000 tuples (1 employee : 1 branch ) • Requires (1000) disk access to read in joined relation and check predicate Total Work = (1000+50) + 2*(1000) = 3050 I/O operations 3300% Improvement over Query 1

Motivating Example: Query 3 (Best) [ s(position=‘Manager’) (Staff) ]wvStaff.branchNo = Branch.branchNo [ s(city=‘London’) (Branch) ] • Read Staff relation to determine ‘Managers’ (1000 reads) • Create 50 tuple relation(50 writes) • Read Branch relation to determine ‘London’ branches (50 reads) • Create 5 tuple relation(5 writes) • Join reduced relations and check predicate (50 + 5 reads) Total Work = 1000 + 2*(50) + 5 + (50 + 5) = 1160 I/O operations 8700% Improvement over Query 1 Consider if Staff and Branch relations were 10x size? 100x? Yikes!

0. Introduction to Query Processing (1) • Query optimization: • The process of choosing a suitable execution strategy for processing a query. • Two internal representations of a query: • Query Tree • Query Graph

Introduction to Query Processing (2) Check Attributes and Relations Identification of the query language Creation of possible Relational algebra Check Syntaxes Choose the strategy for executing the query and produce the optimal executing plan

1. Translating SQL Queries into Relational Algebra (1) • Query block: • The basic unit that can be translated into the algebraic operators and optimized. • A query block contains a single SELECT-FROM-WHERE expression, as well as GROUP BY and HAVING clause if these are part of the block. • Nested queries within a query are identified as separate query blocks. • Aggregate operators in SQL must be included in the extended algebra.

Translating SQL Queries into Relational Algebra (2) SELECT LNAME, FNAME FROM EMPLOYEE WHERE SALARY > ( SELECT MAX (SALARY) FROM EMPLOYEE WHERE DNO = 5); SELECT LNAME, FNAME FROM EMPLOYEE WHERE SALARY > C SELECT MAX (SALARY) FROM EMPLOYEE WHERE DNO = 5 πLNAME, FNAME(σSALARY>C(EMPLOYEE)) ℱMAX SALARY(σDNO=5 (EMPLOYEE))

2. Algorithms for External Sorting (1) • External sorting: • Refers to sorting algorithms that are suitable for large files of records stored on disk that do not fit entirely in main memory, such as most database files. • Sort-Merge strategy: • Starts by sorting small subfiles (runs) of the main file and then merges the sorted runs, creating larger sorted subfiles that are merged in turn. • Sorting phase: nR = (b/nB) • Merging phase: dM = Min (nB-1, nR); nP = (logdM(nR)) • nR: number of initial runs; b: number of file blocks; • nB: available buffer space; dM: degree of merging; • nP: number of passes.

Sort/Merge Sort Sort Merge Partition Sort Sort

MergingSingle phase

MergingMultiple phase

External SortingSort-merge strategy • The Sorting phase • Runs of file are read into main memory • Runs are sorted using an internal sorting algorithm • Runs are written back to disk as temporary sorted subfiles nR : number of initial runs b : number of file blocks nB: available buffer space nR: Example: nB: 5 blocks, b: 1024 blocks nB== 205 runs

External SortingSort-merge strategy (Cont.) • Merging phase Sorted runs are merged during one or more passses. dM: degree of merging (dM–way merging)number of runs that can be merged together in each pass dM= min ( ( nB-1), nR) number of passes = ┌ ┐

205 1 1 1 1 … 1 52 1 4 4 ….. 1 13 2 13 16 ….. 4 3 64 ….. 61 4 205 External SortingSort-merge strategy (Cont.) Example: (2＊b + 2＊ ) dM = 4 ( 4-way merging)

2. Algorithms for External Sorting (1) • The typical external sorting algorithm uses a sort-merge strategy, which starts by sorting small subfiles—called runs—of the main file and then merges the sorted runs, creating larger sorted subfiles that are merged in turn. The sort- merge algorithm, like other database algorithms, requires buffer space in main memory, where the actual sorting and merging of the runs is performed. two phases: the sorting phase and the merging phase.

2. Algorithms for External Sorting (1) • In the sorting phase, runs (portions or pieces) of the file that can fit in the available buffer space are read into main memory, sorted using an internal sorting algorithm, and written back to disk as temporary sorted subfiles (or runs). • The size of a run and number of initial runs (nR) is dictated by the number of file blocks (b) and the available buffer space (nB). For example, if nB = 5 blocks and the size of the file b = 1024 blocks, then nR= ⎡(b/nB)⎤ or 205 initial runs each of size 5 blocks (except the last run which will have 4 blocks). • Hence, after the sort phase, 205 sorted runs are stored as temporary subfiles on disk.

2. Algorithms for External Sorting (1) • In the merging phase, the sorted runs are merged during one or more passes. The degree of merging (dM) is the number of runs that can be merged in each pass. • In each pass, one buffer block is needed to hold one block from each of the runs being merged, and one block is needed for containing one block of the merge result. • Hence, dM is the smaller of (nB − 1) and nR, and the number of passes is ⎡(logdM(nR))⎤. In our example, dM = 4 (four-way merging), so the 205 initial sorted runs would be merged into 52 at the end of the first pass, which are then merged into 13, then 4, then 1 run, which means that four passes are needed.

2. Algorithms for External Sorting (1) • The minimum dM of 2 gives the worst-case performance of the algorithm, which is (2 ∗ b) + (2 ∗ (b ∗ (log2 b))). • The first term represents the number of block accesses for the sort phase, since each file block is accessed twice—once for reading into memory and once for writing the records back to disk after sorting. • The second term represents the number of block accesses for the merge phase, assuming the worst-case dM of 2. In general, the log is taken to the base dM and the expression for number of block accesses becomes (2 ∗ b) + (2 ∗ (b ∗ (logdM nR))).

User select * from R, S where … Resolve the references, Syntax errors etc. Converts the query to an internal format relational algebra like Query Parser Query Optimizer Results Find the best way to evaluate the query Which index to use ? What join method to use ? … Query Processor Read the data from the files Do the query processing joins, selections, aggregates … R, B+Tree on R.a S, Hash Index on S.a … Overview

“Cost” • Complicated to compute • We will focus on disk: • Number of I/Os ? • Not sufficient • Number of seeks matters a lot… why ? • tT – time to transfer one block • tS – time for one seek • Cost for b block transfers plus S seeksb * tT + S * tS • Measured in seconds

3. Algorithms for SELECT and JOIN Operations (1) • Implementing the SELECT Operation • Examples: • (OP1): sSSN='123456789' (EMPLOYEE) • (OP2): sDNUMBER>5(DEPARTMENT) • (OP3): sDNO=5(EMPLOYEE) • (OP4): sDNO=5 AND SALARY>30000 AND SEX=F(EMPLOYEE) • (OP5): sESSN=123456789 AND PNO=10(WORKS_ON)

Basic Algorithms for Executing Query Operations ImplementingSELECT (OP1) σSSN=12345689 (EMPLOYEE)equality comparison on key attribute (OP2) σDNUMBER > 5 (DEPARMENT)nonequality comparison on key attribute (OP3) σDNO=5 (EMPLOYEE) equality comparison on non key attribute (OP4) σDNO=5 AND SALARY >30000 AND SEX=F(EMPLOYEE)conjunctive condition (OP5) σESSN=123456789 AND DNO=10 (WORKS_ON) conjunctive condition and composite key

Search Methods for Selectionfile scans / index scans S1. Linear Search (brute force) S2. Binary Search SSN=123456789 (OP1) ordering attribute for EMPLOYEE S3. Use primary index or hash key (single record) SSN=123456789 (OP1) Primary index or hash key S4. Use primary index (multiple records) DNUMBER>5 (OP2)primary index S5. Use clustering index (multiple records) DNO=5 (OP3)clustering index • Locate • Find proceeding subsequent

Algorithms for SELECT and JOIN Operations (2) • Implementing the SELECT Operation (contd.): • Search Methods for Simple Selection: • S1 Linear search (brute force): • Retrieve every record in the file, and test whether its attribute values satisfy the selection condition. • S2 Binary search: • If the selection condition involves an equality comparison on a key attribute on which the file is ordered, binary search (which is more efficient than linear search) can be used. (See OP1). • S3 Using a primary index or hash key to retrieve a single record: • If the selection condition involves an equality comparison on a key attribute with a primary index (or a hash key), use the primary index (or the hash key) to retrieve the record.

Selection Operation • select * from person where SSN = “123” • Option 1: Sequential Scan • Read the relation start to end and look for “123” • Can always be used (not true for the other options) • Cost ? • Let br = Number of relation blocks • Then: • 1 seek and br block transfers • So: • tS + br * tT sec • Improvements: • If SSN is a key, then can stop when found • So on average, br/2 blocks accessed

Selection Operation • select * from person where SSN = “123” • Option 2 : Binary Search: • Pre-condition: • The relation is sorted on SSN • Selection condition is an equality • E.g. can’t apply to “Name like ‘%424%’” • Do binary search • Cost of finding the first tuple that matches • log2(br) * (tT + tS) • All I/Os are random, so need a seek for all • The last few are closeby, but we ignore such small effects • Not quite: What if 10000 tuples match the condition ? • Incurs additional cost

Primary index on the ordering key field

Algorithms for SELECT and JOIN Operations (3) • Implementing the SELECT Operation (contd.): • Search Methods for Simple Selection: • S4 Using a primary index to retrieve multiple records: • If the comparison condition is >, ≥, <, or ≤ on a key field with a primary index, use the index to find the record satisfying the corresponding equality condition, then retrieve all subsequent records in the (ordered) file. • S5 Using a clustering index to retrieve multiple records: • If the selection condition involves an equality comparison on a non-key attribute with a clustering index, use the clustering index to retrieve all the records satisfying the selection condition. • S6 Using a secondary index: • On an equality comparison, this search method can be used to retrieve a single record if the indexing field has unique values (is a key) or to retrieve multiple records if the indexing field is not a key. • In addition, it can be used to retrieve records on conditions involving >,>=, <, or <=. (FOR RANGE QUERIES)

A Clustering Index Example • FIGURE 14.2A clustering index on the DEPTNUMBER ordering non-key field of an EMPLOYEE file.

A Two-level Primary Index

Example of a Dense Secondary Index

An Example of a Secondary Index

3. Algorithms for SELECT and JOIN Operations (1) • Implementing the SELECT Operation • Examples: • (OP1): sSSN='123456789' (EMPLOYEE) • (OP2): sDNUMBER>5(DEPARTMENT) • (OP3): sDNO=5(EMPLOYEE) • (OP4): sDNO=5 AND SALARY>30000 AND SEX=F(EMPLOYEE) • (OP5): sESSN=123456789 AND PNO=10(WORKS_ON)

Algorithms for SELECT and JOIN Operations (4) • Implementing the SELECT Operation (contd.): • Search Methods for Simple Selection: • S7 Conjunctive selection: • If an attribute involved in any single simple condition in the conjunctive condition has an access path that permits the use of one of the methods S2 to S6, use that condition to retrieve the records and then check whether each retrieved record satisfies the remaining simple conditions in the conjunctive condition. • S8 Conjunctive selection using a composite index • If two or more attributes are involved in equality conditions in the conjunctive condition and a composite index (or hash structure) exists on the combined field, we can use the index directly.

Database II

Database II

Presentation Transcript

CS4432: Database Systems II

CS4432: Database Systems II

CS4432: Database Systems II

CST221: Database Systems (II)

Applied Database II

CS4432: Database Systems II

Run II Database Applications

CS4432: Database Systems II

EPICS Database II

Relational Database Design II

CS4432: Database Systems II

CS4432: Database Systems II

CS4432: Database Systems II

Secondary Database Analysis II

CS4432: Database Systems II

CS4432: Database Systems II

EPICS Database II

II. Selected Database Issues