1 / 16

Join Processing in Database Systems with Large Main Memories

Join Processing in Database Systems with Large Main Memories. Leonard D. Shapiro. ACM Transactions on Database Systems Vol. 11, No. 3, Sep 1986. Donghui Zhang, UC Riverside, May 17, 2002. Content. Join Definition Internal-Memory Joins External-Memory Model External-Memory Joins

Télécharger la présentation

Join Processing in Database Systems with Large Main Memories

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Join Processing in Database Systems with Large Main Memories Leonard D. Shapiro ACM Transactions on Database Systems Vol. 11, No. 3, Sep 1986 Donghui Zhang, UC Riverside, May 17, 2002

  2. Content • Join Definition • Internal-Memory Joins • External-Memory Model • External-Memory Joins • Performance Comparisons • Conclusions

  3. Problem Definition • Join operator: given relation R and S, report all pairs (r, s) s.t. r R, s S and the two records satisfy some given condition. • Equi-Join: join if a certain attribute of r is equal to an attribute of s.

  4. Department Faculty SELECT FacName, DeptName FROM Faculty, Department WHERE Faculty.DeptID =Department.DeptID

  5. Internal-Memory Solutions • Nested-loop join: check for every record in R and every record in S; cost={R}{S} • Sort-merge join: sort R, S followed by merging; cost=O({S}*log{S})(if {R}<{S}) • Hash join: build a hash table for R; for every record in S, probe the hash table; cost =O({S})

  6. External-Memory Model • Both relations R, S reside on disk; • Each disk page holds up to B records; • A disk page (block) has to be read in memory before records in it can be processed; • Need to extend the internal-memory join algorithms to the external-memory model; • Important for join: I/O, CPU.

  7. Block-Nested Loop Join • for every block in R • scan through S; • join records in the R block with the S records. • I/O: |R|*|S|, where |R| is the number of blocks R occupies; • Works good for small buffer (e.g. two blocks).

  8. External Sort-Merge Join • Extend the internal-memory sort-merge join by changing the sorting algorithm to external-memory merge sort. • Merge sort:

  9. External Sort-Merge Join (cont.) • Optimization: omit the final pass of merge sort by pipelining the sort result to join; • If buffer size  , can sort by reading R and S twice; • E.g. page size=8KB, each relation has 10,000 pages (80MB), buffer size=100 pages (<1MB), two passes are enough.

  10. Classic Hash Join Works when the smaller relation R fits in memory. • Build a in-memory hash table for the smaller relation; • For each record in the larger relation, probe the hash table. If the smaller relation does not fit in memory, partition into smaller buckets!

  11. Simple Hash Join • for each logical bucket j • for each record r in R • if r is in bucket j then • insert r into the hash table; • for each record s in S • if s is in bucket j then • probe the hash table; • Classic hash join is a special case, with one bucket; • Optimization: write the tuples not in bucket j to disk; • Works good when memory is large (nearly as large as |R|).

  12. GRACE Hash Join • partition R into n buckets so that each bucket fits in memory; • partition S into n buckets; • for each bucket j do • for each record r in Rj do • insert into a hash table; • for each record s in Sj do • probe the hash table. • Works good when memory is small.

  13. Hybrid Hash Join • Hybrid of simple hash join and GRACE; • When partitioning R, keep the records of the first bucket in memory as a hash table; • Typically this means that the first bucket uses more pages in memory (all other partitions are 1 page each) • When partitioning S, for records of the first bucket, probe the hash table directly; • Saving: no need to write R1 and S1 to disk or read back to memory. • Works good for large and small memory.

  14. Handle Partition Overflow • Case 1, overflow on disk: an R partition is larger than memory size (note: don’t care about the size of S partitions). • Solution (a) small partitions first and combine before join; • Solution (b) recursive partition. • Case 2, overflow in memory: the in-memory hash table of R becomes too large. • Solution: revise the partitioning scheme and keep a smaller partition in memory.

  15. Conclusions • Addressed equi-join problem in the external-memory environment; • With decreasing cost of memory, hash-based join is better than nested-loop and sort-merge joins; • Proposed three hash-based algorithms (simple hash join, GRACE join and hybrid join), out of which the hybrid hash join is the best.

  16. Hash-based Nested Loop Join • This is a hybrid of hash-based and nested-loop join; • In pure hash-based joins, we have two steps: first, partition the source relations; next, join each partition separately; • Hash-based nested loops join: no need to partition; Read some pages of R to fill memory and build a hash table for it, then scan through S.

More Related