1 / 35

Cost Estimation for Relational Algebra Queries

Cost Estimation for Relational Algebra Queries. Suppose I have a query like P ( ename) [ s (ID# = 1234) (Emp)] How to retrieve the data from the disk? The best choice depends on the type of query and the type of file structure present. Simple Query Types. Exact matches on a unique attribute.

emil
Télécharger la présentation

Cost Estimation for Relational Algebra Queries

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Cost Estimation for Relational Algebra Queries • Suppose I have a query like P(ename)[s(ID# = 1234)(Emp)] • How to retrieve the data from the disk? • The best choice depends on the type of query and the type of file structure present.

  2. Simple Query Types • Exact matches on a unique attribute. • Exact matches on a non-unique attribute. • Range queries. • Print the entire file unordered. • Print the entire file, sorted on a particular attribute.

  3. Possible File Structures • Access unsorted data file. • Access sorted data file. • Use an index/b-tree/b+-tree (“Secondary Index” or “Alternative 2/3”) • Use a hash table • Use a clustered B-Tree (“Primary Index” or “Alternative 1”) • Use clustered files.

  4. Cost Estimation Introduction • To determine the best choice, we must have a way of determining the cost of each of the previous file structure access schemes with all of the possible query types. • This requires we understand how files are stored and retrieved from secondary storage.

  5. File Systems Review: • A table will be stored as a file; the following metrics should be review: • r --the number of records in the file. • |r| -- the size of a record, usually bytes/record. • Block Size -- the number of bytes/block on the disk. A block is the number of bytes that can be read with one disk access. • bf -- the blocking factor of the file. It is the ëBlock Size/|r|û Note: ë û represents the floor function.

  6. File Systems Review II: • b -- the number of blocks needed for the file. It is ér/ bfù. Note: éù represents the ceiling function. • For a b-tree, the number of levels in the tree is L. This is important because each level must be accessed once for a simple search (worst case). • Also in a b-tree the maximum number of levels is given by: • L <= log ém/2ù((N+1)/2) + 1 where m is the degree of the tree, and N is the number of key values.

  7. Finding m: • |node| <=block size • (m-1)*|key| + (m)*|ChildPtr| + (m-1)*|RecPtr| <= block size

  8. File Systems Review III: • bL is the number of nodes of a B-tree at the terminal level. This is important for determining the number of disk accesses for non-unique retrievals. • “d” is the number of distinct values of a particular attribute. • “s” is the selectivity of a particular attribute. It is r/d.

  9. Cost Estimation -- Unsorted File • CUu = b (worst case) b/2 (average case). • CUn = b • CUr = b • CUpu = b • CUps = b log (b) + b

  10. Cost Estimation -- Sorted File • CSu = log(b) • CSn = log(b) + és/bf ù - 1 • CSr = log(b) + b/2 • CSpu = b • CSps = b

  11. Cost Estimation -- B-, B+-tree (Secondary) • CBu = L+1 • CBn = L+ (és/(m-1) ù - 1) + s • CBr = L+ bL/2 + r/2 • CBpu = L + bL + r • CBps = L + bL + r

  12. Cost Estimation -- B-, B+-tree (Primary) • CBPu = L • CBPn = L+ (és/(m-1) ù - 1) • CBPr = L+ bL/2 • CBPpu = L + bL • CBPps = L + bL

  13. Cost Estimation -- Hash Table • CHu = 1+1 = 2 • CHn = 1+s • CHr = NA • CHpu = #bins + r • CHps = NA

  14. Cost Estimation -- Clustered Files • CCu = log(bR+bS) • CCn = log(bR+bS) + és/bf ù - 1 • CCr = log(bR+bS) + (bR+bS)/2 • CCpu = bR+bS • CCps = bR+bS • Clustered files are used to speed up joins, so these would be worse than a sorted file by design.

  15. Example -- File Information • Emp(Fn, Minit, LN, SSN, Bdate, Addr, Sex, Salary, SuperSSN, Dno) • r = 10,000 records • bf = 5 records/block • b = 2,000 blocks • Hash Table on SSN • B+-Tree on Dno: • Ldno = 4, ddno = 125, sdno= 80, mdno = 50 • BLdno = 200

  16. Example -- File Information II • B+-Tree on Salary: • Ldno = 4, ddno = 10,000, sdno= 1, mdno = 45 • BLdno = 250

  17. Example -- The Queries • sssn = 1234(Emp) • sDno = 5(Emp) • sDno > 5(Emp) • sDno = 5 Ù Salary =30000 Ù Sex = ‘F’ (Emp)

  18. Example -- Query #1 sssn = 1234(Emp) • How can we perform this query? • Access the EMP table unsorted: • CUu = b = 2,000 disk accesses • Use the Hash table on SSN: • CHu = 2 • Best choice: • Use the hash table.

  19. Example -- Query #2 • sDno = 5(Emp) • How can we perform this query? • Access the EMP table unsorted: • CUn = b = 2,000 disk accesses • Use B+-Tree on Dno: • CBn = L+ (és/m ù - 1) + s = 4 + 1 + 80 = 85 • Best Choice: • Use the B+-Tree.

  20. Example -- Query #3 sDno > 5(Emp) • How can we perform this query? • Access the EMP table unsorted: • CUr = b = 2,000 disk accesses • Use B+-Tree on Dno: • CBr = L+ bL/2 + r/2 = 4 + 100 + 5000 = 5104 • Best Choice: • Use the unsorted file.

  21. Example -- Query #4 • sDno = 5 Ù Salary = 30000 Ù Sex = ‘F’ (Emp) • How can we perform this query? • Access the EMP table unsorted: • CUn = b = 2,000 disk accesses • Use B+-Tree on Dno = 85 disk accesses • Then apply remaining conjuncts to result in memory.

  22. Example -- Query #4 (cont) • Use B+-Tree on Salary: • CBn = L+ (és/m ù - 1) + s = 4 + 0 + 1 = 5 • Use unsorted file access for Sex = ‘F’: • CUn = b = 2,000 • Which is best? • Use B+-Tree on salary, then apply the others to result in memory.

  23. Join Cost Functions • Suppose I have R ê S. How do I estimate the cost of this? a=b • CJ1: Nested Loop Approach: • Access the blocks of R and for each record, access the blocks of S looking for matches. • CJ1 = bR + bR * bS + Write-to-disk-cost

  24. Join Cost Functions II • CJ2: Use B-Tree (for example) on S.b: • Access each block of R, then search the B-Tree looking for matches. • CJ2 = bR + rR * (LS.b + SS.b)+ WTDC • Note: Using a B-Tree on R.a and accessing the blocks of S first could yield a different cost.

  25. Join Cost Functions III • CJ3: If the two files are clustered: • CJ3 = bR + bS + WTDC

  26. Write to Disk Cost • WTDC is the cost to write an intermediate result to disk. This would be the number of blocks in the resulting table. • WTDC = |RêS| / BFRêS • I need |RêS|, but this is hard to calculate directly. • The “Join Selectivity” is easier to estimate: • JS = |RêS| / |R ´ S|

  27. Join Selectivity • JS = |RêS| / |R ´ S| • JS = |RêS| / |R| ´ |S| • |RêS| = JS ´ |R| ´ |S| • WTDC = |RêS| / BFRêS • WTDC = JS ´ |R| ´ |S| / BFRêS

  28. Join Selectivity II • Note: if R.a is the key of R, then |RêS| <= |S| • WTDC <= |S| / BFRêS

  29. Example Join-- File Information • Emp(Fn, Minit, LN, SSN, Bdate, Addr, Sex, Salary, SuperSSN, Dno) • r = 10,000 records • bf = 5 records/block • b = 2,000 blocks • B+-Tree on Dno: • Ldno = 4, ddno = 125, sdno= 80, mdno = 50 • BLdno = 200

  30. Example Join-- File Information • Dept(D#, Dname, MGRSSN, MgrStartDate) • r = 125 records • bf = 10 records/block • b = 13 blocks • B+-Tree on D#: • Ldno = 2, ddno = 125, sdno= 1, mdno = 50 • BLdno = 3

  31. The Sample Query: • I want to use Emp ê Dept. What are my options? Dno=D# • Nested Loop approach. • Use B-Tree on Emp.Dno. • Use B-Tree on Dept.D# • Cluster Employee and Department together.

  32. Join Query -- Nested Loop Emp ê Dept. Dno=D# • CJ1 = bEmp * bDept = 2000 * 13 = 26,000

  33. Join Query -- B-Tree on D# Emp ê Dept. Dno=D# • CJ2D# = bEmp + rEmp * (LDept.D# + SDept.D#) = 2000 + 10000 * (2 + 1) = 32,000

  34. Join Query -- B-Tree on Dno Emp ê Dept. Dno=D# • CJ2D# = bDept + rDept * (LEmp.Dno + SEmp.Dno) = 13 + 125 * (4 + 80) = 10,513 • This is the better choice. • Note: I have ignored the WTDC here because it would be the same in all cases.

  35. Join Query -- Clustered Files Emp ê Dept. Dno=D# • CJ3 = bR + bs = 2000 + 13 = 2,013 • This is the best choice. • Note: it does restrict how the files are stored.

More Related