Inspector Joins
Presentation Transcript
Inspector Joins — Shimin Chen(1), Anastassia Ailamaki(1), Phillip B. Gibbons(2), Todd C. Mowry(1,2) — (1) Carnegie Mellon University, (2) Intel Research Pittsburgh
Exploiting Information about Data
• The ability to improve a query depends on information quality
• General stats on relations are inadequate
  • May lead to incorrect decisions for specific queries
  • Especially true for join queries
• Previous approaches exploiting dynamic information
  • Collecting information from previous queries
    • Multi-query optimization [Sellis'88]
    • Materialized views [Blakeley et al.'86]
    • Join indices [Valduriez'87]
  • Dynamic re-optimization of query plans [Kabra & DeWitt'98] [Markl et al.'04]
• This study exploits the inner structure of hash joins
Exploiting the Multi-Pass Structure of Hash Joins
• Idea:
  • Examine the actual data during the I/O partitioning phase (inspection)
  • Extract useful information to improve the join phase
• The extra information greatly helps the join phase
Using Extracted Information
• Enable a new join phase algorithm
  • Reduce the primary performance bottleneck in hash joins, i.e., poor CPU cache performance
  • Optimized for multi-processor systems
• Choose the most suitable join phase algorithm for special input cases
[Figure: the I/O partitioning phase performs inspection; the extracted information lets the join phase decide among simple hash join, cache partitioning, cache prefetching, and the new algorithm]
Outline
• Motivation
• Previous hash join algorithms
• Hash join performance on SMP systems
• Inspector join
• Experimental results
• Conclusions
GRACE Hash Join
• I/O partitioning phase:
  • Divide the input relations into partitions with a hash function
• Join phase (simple hash join):
  • Build a hash table on the build partition, then probe it with the probe partition
  • Random memory accesses cause poor CPU cache performance
  • Over 70% of execution time is stalled on cache misses!
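The two phases above can be sketched as follows. This is a minimal illustrative model (function names and the partition count are assumptions, not the paper's code) that ignores I/O and uses Python dictionaries as hash tables:

```python
NUM_PARTITIONS = 4  # assumption: chosen so each partition fits in memory

def partition(relation, key_of):
    """I/O partitioning phase: split tuples by a hash of the join key."""
    parts = [[] for _ in range(NUM_PARTITIONS)]
    for tup in relation:
        parts[hash(key_of(tup)) % NUM_PARTITIONS].append(tup)
    return parts

def simple_hash_join(build_part, probe_part, key_of):
    """Join phase: build a hash table on the build partition, then probe it."""
    table = {}
    for tup in build_part:
        table.setdefault(key_of(tup), []).append(tup)
    return [(b, p) for p in probe_part for b in table.get(key_of(p), [])]

def grace_hash_join(build_rel, probe_rel, key_of=lambda t: t[0]):
    # Both relations are partitioned with the same hash function, so matching
    # tuples always land in the same partition pair
    results = []
    for b_part, p_part in zip(partition(build_rel, key_of),
                              partition(probe_rel, key_of)):
        results.extend(simple_hash_join(b_part, p_part, key_of))
    return results
```

The probe step's `table.get` is the random memory access the slide blames for cache misses: consecutive probe tuples hash to unrelated hash-table buckets.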
Cache Partitioning [Shatdal et al.'94] [Boncz et al.'99] [Manegold et al.'00]
• Recursively produce cache-sized partitions from memory-sized partitions after I/O partitioning
• Avoid cache misses when joining cache-sized partitions
• Overhead: re-partitioning cost
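The re-partitioning step can be sketched as below (a hypothetical helper; a tuple count stands in for the real cache-size threshold). The extra pass that copies every tuple again is exactly the re-partitioning overhead noted above:

```python
CACHE_TUPLES = 100  # assumption: number of tuples that fit in the CPU cache

def cache_partition(mem_partition, key_of, fanout=8, level=0):
    """Recursively split a memory-sized partition into cache-sized ones."""
    if len(mem_partition) <= CACHE_TUPLES:
        return [mem_partition]
    sub = [[] for _ in range(fanout)]
    for tup in mem_partition:
        # Use fresh hash bits at each level so the recursion makes progress
        sub[(hash(key_of(tup)) // fanout ** (level + 1)) % fanout].append(tup)
    if any(len(s) == len(mem_partition) for s in sub):
        return [mem_partition]  # cannot split further (e.g., all-duplicate keys)
    out = []
    for s in sub:
        if s:
            out.extend(cache_partition(s, key_of, fanout, level + 1))
    return out
```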
Cache Prefetching [Chen et al.'04]
• Reduce the impact of cache misses
  • Exploit available memory bandwidth
  • Overlap cache misses and computation
• Insert cache prefetch instructions into the code
• Still incurs the same number of cache misses
Outline
• Motivation
• Previous hash join algorithms
• Hash join performance on SMP systems
• Inspector join
• Experimental results
• Conclusions
Hash Joins on SMP Systems
• Previous studies mainly focus on uni-processors
• On an SMP, memory bandwidth is precious
• Each processor joins a pair of partitions in the join phase
[Figure: four CPUs, each with a private cache, share a bus to main memory; CPU i joins Build_i with Probe_i]
Previous Algorithms on SMP Systems
• Join phase performance of joining a 500MB relation with a 2GB relation (details later in the talk)
• Aggregate performance degrades dramatically beyond 4 CPUs
  • CPUs share memory bandwidth; cache partitioning also pays its re-partition cost
• Reduce data movement (memory to memory, memory to cache)
[Figure: wall-clock time and aggregate time on all CPUs vs. number of CPUs used, for GRACE, cache partitioning, and cache prefetching]
Inspector Joins
• Extracted information: summary of matching relationships
  • Every K contiguous pages in a build partition form a sub-partition
  • The summary tells which sub-partition(s) every probe tuple matches
[Figure: I/O partitioning produces a build partition divided into sub-partitions 0-2, a probe partition, and the summary of matching relationships, all consumed by the join phase]
Cache-Stationary Join Phase
• Recall cache partitioning: it pays a re-partitioning cost, copying both the build and probe partitions
• We want to achieve zero copying
Cache-Stationary Join Phase
• Join a sub-partition with its matching probe tuples
  • The sub-partition is small enough to fit in the CPU cache
• Zero copying for generating recursive cache-sized partitions
• Cache prefetching hides the remaining cache misses
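The cache-stationary join phase can be sketched as follows, assuming the summary was already produced during partitioning (names, the sub-partition size, and the data layout here are illustrative):

```python
SUB_SIZE = 4  # assumption: tuples per sub-partition (really K contiguous pages)

def cache_stationary_join(build_part, probe_part, summary, key_of=lambda t: t[0]):
    """summary[i] = set of sub-partition indices probe_part[i] may match."""
    n_subs = (len(build_part) + SUB_SIZE - 1) // SUB_SIZE
    results = []
    for s in range(n_subs):
        # One sub-partition at a time: small enough to stay cache-resident,
        # and sliced in place rather than copied into new partitions
        sub = build_part[s * SUB_SIZE:(s + 1) * SUB_SIZE]
        table = {}
        for tup in sub:
            table.setdefault(key_of(tup), []).append(tup)
        # Visit only the probe tuples the summary maps to this sub-partition
        for i, p in enumerate(probe_part):
            if s in summary[i]:
                results.extend((b, p) for b in table.get(key_of(p), []))
    return results
```

Because every hash-table lookup hits a cache-resident sub-partition, the random accesses that dominate simple hash join stay inside the cache.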
Filters in I/O Partitioning
• How to extract the summary efficiently?
• Extend the filter scheme used in commercial hash joins
• Conventional single-filter scheme:
  • Construct: represent all build join keys in one filter
  • Test: filter out probe tuples having no matches
Background: Bloom Filter
• A bit vector
• A key is hashed d (e.g., d=3) times and represented by d bits
• Construct: for every build join key, set its 3 bits in the vector
• Test: given a probe join key, check whether all its 3 bits are 1
  • Discard the tuple if some bits are 0
  • May have false positives
[Figure: a bit vector with positions Bit0=H0(key), Bit1=H1(key), Bit2=H2(key) set]
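A minimal Bloom filter sketch matching the slide's d=3 scheme (the hash derivation here is illustrative, not the system's actual hash functions):

```python
FILTER_BITS = 1024  # assumption: filter size
D = 3               # hash functions per key, as on the slide

def bit_positions(key):
    # Hash the key D times to derive D bit positions
    return [hash((seed, key)) % FILTER_BITS for seed in range(D)]

def construct(filter_bits, build_keys):
    # Construct: set each build key's D bits
    for key in build_keys:
        for b in bit_positions(key):
            filter_bits[b] = 1

def may_match(filter_bits, probe_key):
    # Test: all D bits set -> the key *may* be present (false positives
    # possible); any bit clear -> the key is definitely absent
    return all(filter_bits[b] for b in bit_positions(probe_key))

bloom = [0] * FILTER_BITS
construct(bloom, [10, 20, 30])
```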
Multi-Filter Scheme
• Single filter: maps a probe tuple to the entire build relation
• Our goal: map a probe tuple to sub-partitions
• Construct a filter for every sub-partition
  • Replace a single large filter with multiple small filters
Testing Multi-Filters
• When partitioning the probe relation:
  • Test a probe tuple against all the filters of a partition
  • This tells which sub-partition(s) the tuple may have matches in
  • Store the summary of matching relationships in the intermediate partitions
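The construct and test steps of the multi-filter scheme can be sketched as below (parameters and names are illustrative; a real implementation sizes filters and sub-partitions in pages, not tuples):

```python
S_FILTER_BITS = 256  # assumption: bits per sub-partition filter
D = 3                # hash functions per key

def positions(key):
    # Derive D bit positions from one key (illustrative hashing)
    return [hash((seed, key)) % S_FILTER_BITS for seed in range(D)]

def build_filters(sub_partitions, key_of=lambda t: t[0]):
    """Construct one small filter per sub-partition of a build partition."""
    filters = []
    for sub in sub_partitions:
        f = [0] * S_FILTER_BITS
        for tup in sub:
            for b in positions(key_of(tup)):
                f[b] = 1
        filters.append(f)
    return filters

def matching_subs(filters, probe_key):
    """Summary entry for one probe tuple: the sub-partition(s) it may match."""
    pos = positions(probe_key)
    return {s for s, f in enumerate(filters) if all(f[b] for b in pos)}
```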
Minimizing Cache Misses for Testing Filters
• Single-filter scheme:
  • Compute 3 bit positions
  • Test 3 bits
• Multi-filter scheme, with S sub-partitions per partition:
  • Compute 3 bit positions
  • Test the same 3 bit positions in every filter, altogether 3*S bits
  • May cause 3*S cache misses!
Vertical Filters for Testing
• Store bits at the same position contiguously in memory
  • Horizontal-to-vertical conversion after partitioning the build relation
  • Very small overhead in practice
• 3 cache misses instead of 3*S cache misses!
More Details in the Paper
• Moderate memory space requirement for filters
• Summary information representation in intermediate partitions
• Preprocessing for the cache-stationary join phase
• Prefetching for improving efficiency and robustness
Outline
• Motivation
• Previous hash join algorithms
• Hash join performance on SMP systems
• Inspector join
• Experimental results
• Conclusions
Experimental Setup
• Relation schema: 4-byte join attribute + fixed-length payload
• No selection, no projection
• 50MB memory per CPU available for the join phase
• The same join algorithm runs on every CPU, joining different partitions
• Detailed cycle-by-cycle simulations
  • A shared-bus SMP system with 1.5GHz processors
  • Memory hierarchy based on the Itanium 2 processor
Partition Phase Wall-Clock Time
• Setup: 500MB joins 2GB; 100B tuples, 4B keys; 50% of probe tuples have no matches; a build tuple matches 2 probe tuples
• I/O partitioning can take advantage of multiple CPUs
  • Cut the input relations into equal-sized chunks
  • Partition one chunk on every CPU
  • Concatenate the outputs from all CPUs
• Enhanced cache partitioning: cache partitioning + advanced prefetching
• Inspection incurs very small overhead
[Figure: wall-clock time vs. number of CPUs used, for GRACE, cache prefetching, cache partitioning, enhanced cache partitioning, and inspector join]
Join Phase Aggregate Time
• Same setup: 500MB joins 2GB; 100B tuples, 4B keys; 50% of probe tuples have no matches; a build tuple matches 2 probe tuples
• Inspector join achieves significantly better performance when 8 or more CPUs are used
  • 1.7-2.1X speedups over cache prefetching
  • 1.6-2.0X speedups over enhanced cache partitioning
[Figure: aggregate time on all CPUs vs. number of CPUs used]
Results on Choosing a Suitable Join Phase
• Case #1: a large number of duplicate build join keys
  • Choose enhanced cache partitioning when a probe tuple on average matches 4 or more sub-partitions
• Case #2: nearly sorted input relations
  • Surprisingly, cache-stationary join is very good
[Figure: the join phase uses the extracted information to decide among simple hash join, cache partitioning, cache prefetching, and cache-stationary join]
Conclusions
• Exploit the multi-pass structure for higher-quality information about the data
• Achieve significantly better cache performance
  • 1.6X or greater speedups over previous cache-friendly algorithms when 8 or more CPUs are used
• Choose the most suitable algorithm for special input cases
• The idea may be applicable to other multi-pass algorithms
Thank You!
CPU-Cache-Friendly Hash Joins
• Recent studies focus on CPU cache performance
  • I/O partitioning gives good I/O performance
  • But random memory accesses cause poor CPU cache performance
• Cache partitioning [Shatdal et al.'94] [Boncz et al.'99] [Manegold et al.'00]
  • Recursively produce cache-sized partitions from memory-sized partitions
  • Avoid cache misses during the join phase
  • Pay a re-partitioning cost
• Cache prefetching [Chen et al.'04]
  • Exploit memory system parallelism
  • Use prefetches to overlap multiple cache misses with computation
Example Special Input Cases
• Example case #1: a large number of duplicate build join keys
  • Count the average number of sub-partitions a probe tuple matches
  • Must check the tuple against all possible sub-partitions
  • If this count is too large, cache-stationary join works poorly
• Example case #2: nearly sorted input relations
  • A merge-based join phase might be better?
Varying the Number of Duplicates per Build Join Key
• Join phase aggregate performance
• Choose enhanced cache partitioning when a probe tuple on average matches 4 or more sub-partitions
Nearly Sorted Cases
• Sort both input relations, then randomly move 0%-5% of the tuples
• Join phase aggregate performance
• Surprisingly, cache-stationary join is very good
  • Even better than merge join when over 1% of tuples are out of order
Analyzing the Nearly Sorted Case
• The partitions are also nearly sorted
• Probe tuples matching a sub-partition are almost contiguous
  • Similar memory behavior to merge join
  • No cost for sorting out-of-order tuples