1 / 19

Evaluation of Relational Operations

Evaluation of Relational Operations. Notation. M = Memory allocated for pages T R = # of tuples in R. b R = # of tuples of R per page. B R = # of pages (blocks) occupied by R. b = # of bytes per page. I R.A = Image of attribute A in R = # of distinct values of A in R.

Télécharger la présentation

Evaluation of Relational Operations

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Evaluation of Relational Operations

  2. Notation • M = Memory allocated for pages • TR= # of tuples in R. • bR = # of tuples of R per page. • BR= # of pages (blocks) occupied by R. • b = # of bytes per page. • IR.A = Image of attribute A in R = # of distinct values of A in R. • C = Cost of the relational operation. • Cost metric: # of I/Os. We will ignore output costs.

  3. Selection SELECT * FROMSailors S WHERE S.rating = 10 • Number of tuples with rating =10 is TS / IS.rating; v.g., ifTS.rating =100, and TS = 10,000 there are, on average, 100 sailors with rating = 10. • If there is a non-clustered index on rating, in the worst case each satisfying tuple can be in a different page => Cost = TS / IS.rating(note that here the unit is # of I/O) • If there is a clustered index on rating, all satisfying tuples are packed together, thus Cost = TS / (bS * IS.rating) = BS / IS.rating • Whenever Cost >BS => Scan the relation and check the condition on-the-fly. • If condition is rating < 10, if no other information exists, assume that 1/2 of the tuples satisfy the inequality => Cost = 0.5 TS or 0.5 BS with unclustered or clustered index, respectively.

  4. Selection SELECT * FROMSailors S WHERE S.rating = 10 AND S. age = 20 • Number of tuples with rating =10 and age = 20is TS /(IS.rating * IS.age); • The images of rating and age must be considered. The file will be accessed through the index with the higher image. If IS.rating =100 and IS.age =10 (10 different ages for sailors), on the average there will be 1000 sailors with age=20 and 100 sailors with rating = 10 => access the file through the index over rating, and with the pages in main memory, check if age = 20. This is called the Most selective access path approach. • A compound index may also be used.

  5. Selection SELECT * FROMSailors S WHERE S.rating = 10 OR S. age = 20 • Number of tuples with rating =10 or age = 20is (roughly) TS /IS.rating + TS IS.age; • We can only use indices if both attributes are indexed. If IS.rating =100 and IS.age =10, in the worst case, the 100 sailors with rating = 10 can be all different than the 1000 sailors with age = 20 => the strategy of the most selective path does not work. • If there is an index over rating and there is NO index over age, scan the relation.

  6. Intersection (Union) of Rids • Another approach for disjunctive or conjunctive queries • Get sets of rids of data records using each matching index. • Then intersect these sets of rids • Retrieve the records and apply any remaining terms. • Consider day<8/9/94 AND bid=5 AND sid=3. If we have a B+ tree index on day and an index on sid, we can retrieve rids of records satisfying day<8/9/94 using the first, rids of recs satisfying sid=3 using the second, intersect, retrieve records and check bid=5. • We can proceed analogously with disjunctions, but instead of intersection, we apply union of rids.

  7. Cross Product SELECT * FROM R , S • Assume BR < BS • Two cases: M > BR and M < BR • If M > BR , load all R into memory, and one page of S at a time. Load R in main memory (cost= BR); foreach page ps in S do Load ps into main memory for each tuple r in R do for each tuple s in ps do add <r, s> to result C = BR +BS

  8. Cross Product (cont.) • If M < BR • Partition R in segments of M-1 pages => BR /(M-1) segments • Load one segment of R and one page of S at a time. for each segment Ri of R Load segment Ri in main memory; foreach page ps in S do Load ps in main memory for each tuple s in ps do for each tuple r in Ri do add <r, s> to result C = BR +BS *BR /(M-1)

  9. Equality Joins With One Join Column SELECT * FROM R,S WHERE R.B=S.B • In algebra: R S. Common! Must be carefully optimized. R S is large; so, R S followed by a selection is inefficient (except for small relations). • First alternative: page-oriented nested loops as seen before. But previously to the output <r,s>, check the join condition.

  10. Cost of Joins (cont.) • Number of tuples in the output of a join R S • For each tuple r in R, the estimate # of joining tuples isTS / IS.B . Thus, for all the tuples, TRxTS / IS.A. The same could be done in terms of S, thus, • # of tuples in R S =TRxTS / max (IR.B , IS.B) • In general, if the join is a conjunction of equalities: # of tuples in R S =TRxTS / (max (IR.B , IS.B) x max (IR.C , IS.C) x ….)

  11. Index Nested Loops Join • If there is an index on the join column of one relation (say S), can make it the inner and exploit the index. read R into main memory foreach tuple r in R do use index on S.B for accessing pages of S with tuples s.t. s.B == r.B for each tuple that matches, add <r, s> to result jones 25 smith 2 …. 25 25 red 2 blue …. 2 25 pink 4 black …. R Index on S.B S

  12. Index Nested Loops Join (cont.) • If the index is non-clustered • Cost. for each tuple r in R, the # of matching tuples in S is TS / IS.B ; then, Cost = BR + TS *TS / IS.B • For a clustered index: • Cost. for each tuple r in R, the # of matching tuples in S is TS / IS.B ; but matching tuples are packed together; then, Cost = BR + TS *BS / IS.B

  13. Sort-Merge Join (R S) i=j • Sort R and S on the join column, then scan them to do a ``merge’’ (on join col.), and output result tuples. • Advance scan of R until current R-tuple >= current S tuple, then advance scan of S until current S-tuple >= current R tuple; do this until current R tuple = current S tuple. • At this point, all R tuples with same value in Ri (current R group) and all S tuples with same value in Sj (current S group) match; output <r, s> for all pairs of such tuples. • Then resume scanning R and S. • R is scanned once; each S group is scanned once per matching R tuple. (Multiple scans of an S group are likely to find needed pages in buffer.)

  14. Example of Sort-Merge Join • Cursors now are in sid=28 and will advance to sid=31. Before that, <28, yuppy, 9, 35.0, 103, 12/4/96…> and <28, yuppy, 9, 35.0, 103, 11/3/96…> are added to the output set. • Cost: M log M + N log N + (M+N)

  15. Original Relation Partitions OUTPUT 1 1 2 INPUT 2 hash function h . . . B-1 B-1 B main memory buffers Disk Disk Hash-Join • Partition both relations using hash fn h: R tuples in partition i will only match S tuples in partition i. • Then, check in each bucket for the matching tuples. Note that only tuples of R and S in the same bucket must be checked.

  16. Cost of Hash-Join • In partitioning phase, read+write both relns; 2(BR+BS). In matching phase, read both relns; BR+BS I/Os. • Sort-Merge Join vs. Hash Join: • Given a minimum amount of memory both have a cost of 3(BR+BS) I/Os. Hash Join superior on this count if relation sizes differ greatly. Also, Hash Join shown to be highly parallelizable. • Sort-Merge less sensitive to data skew; result is sorted.

  17. General Join Conditions • Equalities over several attributes (e.g., R.B=S.B ANDR.C=S.C): • For Index Nested Loops, build index on <B, C> (if S is inner); or use existing indexes on B or C. • For Sort-Merge and Hash Join, sort/partition on combination of the two join columns. • Inequality conditions (e.g., R.B < S.B): • For Index NL, need (clustered!) B+ tree index. • Range probes on inner; # matches likely to be much higher than for equality joins. • Hash Join, Sort Merge Join not applicable. • Block NL quite likely to be the best join method here.

  18. Set Operations • Intersection special case of join. • Union (Distinct) and Except similar; we’ll do union. • Sorting based approach to union: • Sort both relations (on combination of all attributes). • Scan sorted relations and merge them. • Hash based approach to union: • Partition R and S using hash function h. • For each S-partition, build in-memory hash table, scan corr. R-partition and add tuples to table while discarding duplicates.

  19. Aggregate Operations (AVG, MIN, etc.) • Without grouping: • In general, requires scanning the relation. • Given index whose search key includes all attributes in the SELECT or WHERE clauses, can do index-only scan. • With grouping: • Sort on group-by attributes, then scan relation and compute aggregate for each group. (Can improve upon this by combining sorting and aggregate computation.) • Similar approach based on hashing on group-by attributes. • Given tree index whose search key includes all attributes in SELECT, WHERE and GROUP BY clauses, can do index-only scan; if group-by attributes form prefix of search key, can retrieve data entries/tuples in group-by order.

More Related