
SIMD-Scan: Ultra Fast in-Memory Table Scan using on-Chip Vector Processing Units

Thomas Willhalm and Nicolae Popovici (Intel GmbH); Yazan Boshmaf (SAP AG); Hasso Plattner, Alexander Zeier, and Jan Schaffner (Hasso-Plattner-Institute, University of Potsdam). VLDB 2009, August 25, 2009.


Presentation Transcript


  1. SIMD-Scan: Ultra Fast in-Memory Table Scan using on-Chip Vector Processing Units
     Thomas Willhalm and Nicolae Popovici, Intel GmbH
     Yazan Boshmaf, SAP AG
     Hasso Plattner, Alexander Zeier, Jan Schaffner, Hasso-Plattner-Institute, University of Potsdam
     VLDB 2009, August 25, 2009

  2. Agenda
     • Column-Store and Light-Weight Database Compression
     • Single Instruction Multiple Data (SIMD)
     • Using SIMD for Decompression
     • Using SIMD for Predicate Handling
     Acknowledgement: We would like to thank Franz Färber, Günter Radestock, Tobias Mindnich, and Christoph Weyerhäuser from SAP for the fruitful discussion and the tremendous help in integrating and testing the SIMD routines.

  3. Column-Store
     • Column-oriented
     • Columns compressed independently
     • Completely in main memory
     • For each query, do a full-table scan, i.e.:
       • Decompress the required columns
       • Aggregate data according to the predicate
       • Further processing
     Data is stored in memory as columns.
     [Figure: a table with columns A, B, C, D; one row is labeled.]

  4. SAP* NetWeaver Business Warehouse Accelerator (BWA)
     • Processing is highly parallelized across multiple cores and blades
     • Shared-nothing approach
     • Public demo available at http://microfinance.sap.com/

  5. Light-Weight Database Compression
     Tables for storing the "Amount" attribute:
     • Sales Table (ID, Amount); Amount column: 12.23, 1.02, 132.13, …
     • Attribute Table (DocId, ValueId); ValueId column: 42, 4, 100000, 128, 31455
     • Dictionary (ValueId, Value): 1 → 0.02, 2 → 0.10, 3 → 1.00, …, 100000 → 132.13
     Our focus: loading the ValueId column.
     • From the Dictionary, those values are (0, 1, 2, 3, …, 100000)
     • The maximum is 100000, which needs 17 bits to represent (2^17 - 1 = 131071)
     • Idea: instead of 32 bits, use 17 bits to store each value
     • Accessing "Value" needs decompression into 32 bits
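The bit-width arithmetic on this slide can be checked with a few lines of C. This is a minimal sketch; `bits_needed` is a hypothetical helper name, not something from the presentation.

```c
#include <stdint.h>

/* Smallest number of bits that can represent every integer in [0, max_value]. */
static unsigned bits_needed(uint32_t max_value) {
    unsigned bits = 1;                 /* even the value 0 occupies one bit */
    while (bits < 32 && (max_value >> bits) != 0)
        ++bits;                        /* one more bit for each remaining high bit */
    return bits;
}
```

For a dictionary whose largest ValueId is 100000, `bits_needed(100000)` gives the 17 bits quoted on the slide, so each stored ValueId takes 17 bits instead of 32.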

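  6. Integers are compressed as packed bit-fields
     Example: packed 17-bit fields
     [Figure: a 16-byte segment (byte positions F down to 0) holding the packed 17-bit values ..., 110300, 65536, 1772, 2702, 2, 42. DECOMPRESS expands 1772, 2702, 2, 42 from 17 bits to 32 bits each. The results are used as indexes into the Dictionary (entries shown: 3.14, 2.55, 2.73, 1.23, 0.02).]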
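A scalar reference for packed bit-fields helps when reasoning about the SIMD variants that follow. The sketch below is not taken from the presentation: the helper names are hypothetical, and it assumes that field 0 occupies the lowest bits of byte 0 and that fields follow each other without padding.

```c
#include <stddef.h>
#include <stdint.h>

/* Write the low `bits` bits of `value` as field number `index` of the stream. */
static void pack_field(uint8_t *buf, size_t index, unsigned bits, uint32_t value) {
    size_t pos = index * bits;                     /* absolute bit position */
    for (unsigned b = 0; b < bits; ++b, ++pos) {
        uint8_t bit = (uint8_t)(1u << (pos % 8));
        if ((value >> b) & 1u)
            buf[pos / 8] |= bit;                   /* set the bit */
        else
            buf[pos / 8] &= (uint8_t)~bit;         /* clear the bit */
    }
}

/* Read field number `index` back into a full 32-bit integer. */
static uint32_t unpack_field(const uint8_t *buf, size_t index, unsigned bits) {
    size_t pos = index * bits;
    uint32_t value = 0;
    for (unsigned b = 0; b < bits; ++b, ++pos)
        value |= (uint32_t)((buf[pos / 8] >> (pos % 8)) & 1u) << b;
    return value;
}
```

The bit-by-bit loops are deliberately slow and simple; their only purpose is to define what a correct DECOMPRESS result looks like.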

  7. Using SIMD for Full-Table Scans

  8. Single Instruction Multiple Data (SIMD)
     • Scalar processing: the traditional mode; one instruction produces one result
     • SIMD processing: with Intel® SSE(2,3,4); one instruction produces multiple results
     [Figure: a scalar OP combines sources X and Y into DEST "X op Y". An SSE/2/3 OP combines the 128-bit sources (X4, X3, X2, X1) and (Y4, Y3, Y2, Y1) into DEST (X4 op Y4, X3 op Y3, X2 op Y2, X1 op Y1).]

  9. Single Instruction Multiple Data (SIMD)
     • 128-bit wide with Intel® SSE(2,3,4)
       • 2 64-bit integer ops/cycle
       • 4 32-bit integer ops/cycle
       • 8 16-bit integer ops/cycle
       • 16 8-bit integer ops/cycle
     • 256-bit with AVX (Sandy Bridge)
     • 512-bit with Larrabee
     Vector-processing unit built into standard processors.
     [Figure: SSE operation. In clock cycle 1, an SSE2 OP combines the 128-bit sources (X4, X3, X2, X1) and (Y4, Y3, Y2, Y1) into DEST (X4 op Y4, X3 op Y3, X2 op Y2, X1 op Y1).]

  10. Using SIMD for Decompression

  11. DECOMPRESS unaligned bit fields
      Example: packed 17-bit fields
      1. Load a pre-fetched 128-bit segment of input data into an SSE register.
      2. Copy the compressed values to target DWORDs (32-bit segments).
      3. Align the values from the unequally shifted DWORDs.
      4. Store the uncompressed values.
      [Figure: byte positions F down to 0 of a segment holding ..., 110300, 65536, 1772, 2702, 2, 42; _mm_shuffle_epi8 moves 1772, 2702, 2, 42 into separate DWORDs.]
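The four steps above can be modeled in portable scalar C. This is a sketch, not the authors' SSE code: `shuffle_bytes` is a hypothetical stand-in that mimics the byte selection of `_mm_shuffle_epi8`, the function name `decompress4_17bit` is made up, and the layout (field 0 in the lowest bits of byte 0) is an assumption.

```c
#include <stdint.h>

/* Stand-in for a byte shuffle: dst[i] = src[mask[i]], or 0 if the mask byte
 * has its high bit set. */
static void shuffle_bytes(const uint8_t src[16], const uint8_t mask[16],
                          uint8_t dst[16]) {
    for (int i = 0; i < 16; ++i)
        dst[i] = (mask[i] & 0x80) ? 0 : src[mask[i] & 0x0F];
}

/* Decompress the first four 17-bit fields of a 16-byte segment. */
static void decompress4_17bit(const uint8_t in[16], uint32_t out[4]) {
    /* Field i starts at bit 17*i: byte offsets 0, 2, 4, 6 and bit shifts 0, 1, 2, 3. */
    static const uint8_t mask[16] = {0, 1, 2, 3,  2, 3, 4, 5,
                                     4, 5, 6, 7,  6, 7, 8, 9};
    static const unsigned shift[4] = {0, 1, 2, 3};
    uint8_t t[16];

    shuffle_bytes(in, mask, t);                       /* step 2: copy to DWORDs */
    for (int lane = 0; lane < 4; ++lane) {
        uint32_t dword = (uint32_t)t[4 * lane]
                       | (uint32_t)t[4 * lane + 1] << 8
                       | (uint32_t)t[4 * lane + 2] << 16
                       | (uint32_t)t[4 * lane + 3] << 24;
        out[lane] = (dword >> shift[lane]) & 0x1FFFFu; /* step 3: align and mask */
    }
}
```

The per-lane shift in the last loop is exactly the "independent shift" that SSE lacks as a single instruction, which is why the later slides list workarounds for it.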

  12. Problem: There are values that span across 5 bytes
      Example: packed 27-bit fields
      • The 3rd value spans across 5 bytes.
      • Shuffle cannot copy all of its bits directly into a 4-byte space.
      [Figure: byte positions F down to 0 holding the packed values 128 (partial), 32766, 17, 127321873, 42; after Shuffle the DWORDs hold 32766, ??, 127321873, 42.]
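Whether a field fits into one DWORD depends only on its width and its starting bit within a byte. The following sketch makes that arithmetic explicit; `bytes_spanned` is a hypothetical helper, and it assumes fields are packed back to back starting at bit 0.

```c
/* Number of bytes touched by field number `index` when fields are `bits` wide. */
static unsigned bytes_spanned(unsigned bits, unsigned index) {
    unsigned start_bit = (index * bits) % 8;   /* offset inside the first byte */
    return (start_bit + bits + 7) / 8;         /* round up to whole bytes */
}
```

For 27-bit fields, field 2 (the 3rd value) starts at bit 54, that is 6 bits into byte 6, so it covers 6 + 27 = 33 bits and touches 5 bytes, while fields 0, 1, and 3 touch 4 bytes each.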

  13. Solution: Shift 5-byte values into 4 bytes and blend
      Example: packed 27-bit fields
      [Figure: byte positions F down to 0. Shuffle (_mm_shuffle_epi8), followed by 64-bit shifts (_mm_srli_epi64, _mm_slli_epi64) and a Blend (_mm_blend_epi16), produces the DWORDs 32766, 27, 4, 42.]
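A scalar model of the shift-and-blend idea, under stated assumptions: the value that spans 5 bytes is handled in a 64-bit lane, where a 64-bit shift can bring it into the low 4 bytes, and the result is then blended with the lanes that fit into 32 bits. The function and helper names are hypothetical, and the exact instruction sequence of the authors is not reproduced here.

```c
#include <stdint.h>

static uint32_t load_le32(const uint8_t *p) {
    return (uint32_t)p[0] | (uint32_t)p[1] << 8
         | (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

static uint64_t load_le64(const uint8_t *p) {
    return (uint64_t)load_le32(p) | (uint64_t)load_le32(p + 4) << 32;
}

/* Decompress the first four 27-bit fields of a 16-byte segment. */
static void decompress4_27bit(const uint8_t in[16], uint32_t out[4]) {
    /* Field i starts at bit 27*i: byte offsets 0, 3, 6, 10 and shifts 0, 3, 6, 1. */
    static const unsigned offset[4] = {0, 3, 6, 10};
    static const unsigned shift[4]  = {0, 3, 6, 1};
    const uint32_t field_mask = 0x7FFFFFFu;            /* 27 one-bits */

    /* 64-bit path for lane 2: 6 + 27 = 33 bits do not fit into a DWORD. */
    uint32_t wide = (uint32_t)((load_le64(in + 6) >> 6) & field_mask);

    for (int lane = 0; lane < 4; ++lane) {
        /* 32-bit path: correct for lanes 0, 1, 3; lane 2 would lose its top bit. */
        uint32_t narrow = (load_le32(in + offset[lane]) >> shift[lane]) & field_mask;
        out[lane] = (lane == 2) ? wide : narrow;        /* blend */
    }
}
```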

  14. Different workarounds for "independent shift" are used
      • Direct shuffle for nicely aligned values
      • Integer multiplication (16-bit and 32-bit) to simulate independent shift
      • Use 2 shifts and blend the results
      • Integer comparison to propagate the value (1-bit compression)

  15. Shift-1: Direct shuffle of aligned values
      Example: packed 24-bit fields
      Data is nicely aligned (cases 8, 16, and 24 bits). Copy only the interesting parts and zero out the others.
      [Figure: byte positions F down to 0 holding …, 32766, 128, 31415, 5, 114; Shuffle yields the DWORDs 128, 31415, 5, 114.]
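For byte-aligned widths no shift is needed at all; one shuffle that copies three bytes per value and zeroes the fourth is enough. The sketch below models that in scalar C. The mask layout follows the semantics of `_mm_shuffle_epi8` (a mask byte with the high bit set yields zero); the function name and the little-endian field order are assumptions.

```c
#include <stdint.h>

/* Decompress the first four 24-bit fields of a 16-byte segment. */
static void decompress4_24bit(const uint8_t in[16], uint32_t out[4]) {
    /* Three source bytes per DWORD; 0x80 means "write zero" for the top byte. */
    static const uint8_t mask[16] = {0, 1,  2,  0x80,  3, 4,  5,  0x80,
                                     6, 7,  8,  0x80,  9, 10, 11, 0x80};
    uint8_t t[16];

    for (int i = 0; i < 16; ++i)                      /* the shuffle */
        t[i] = (mask[i] & 0x80) ? 0 : in[mask[i] & 0x0F];

    for (int lane = 0; lane < 4; ++lane)              /* read back as DWORDs */
        out[lane] = (uint32_t)t[4 * lane]
                  | (uint32_t)t[4 * lane + 1] << 8
                  | (uint32_t)t[4 * lane + 2] << 16
                  | (uint32_t)t[4 * lane + 3] << 24;
}
```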

  16. Shift-2: Use multiplication to simulate independent left shift
      Example: packed 15-bit fields
      After the shuffle, the four DWORDs (32766, 27, 4, 42) are shifted by 5, 6, 7, and 0 bits respectively; a multiplication brings all of them to a 7-bit shift.
      Multiply to shift left: __m128i mult_msk = _mm_set_epi32(0x04,0x02,0x01,0x80); __m128i mult_rslt = _mm_mullo_epi32(shfl_rslt, mult_msk);
      Shift right: _mm_srli_epi32(mult_rslt, 7);
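The multiplication trick can be traced lane by lane in scalar C. In this sketch the multipliers are the ones from the slide's `_mm_set_epi32(0x04,0x02,0x01,0x80)` (lane 0 is the last argument); the byte offsets, the final 15-bit mask, and the function names are assumptions derived from packing 15-bit fields back to back from bit 0.

```c
#include <stdint.h>

static uint32_t load_le32(const uint8_t *p) {
    return (uint32_t)p[0] | (uint32_t)p[1] << 8
         | (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* Decompress the first four 15-bit fields of a 16-byte segment. */
static void decompress4_15bit(const uint8_t in[16], uint32_t out[4]) {
    /* Field i starts at bit 15*i: byte offsets 0, 1, 3, 5 and shifts 0, 7, 6, 5. */
    static const unsigned offset[4] = {0, 1, 3, 5};
    /* Multiplying by 2^(7 - shift) moves every field to bit position 7. */
    static const uint32_t mult[4] = {0x80, 0x01, 0x02, 0x04};

    for (int lane = 0; lane < 4; ++lane) {
        uint32_t dword = load_le32(in + offset[lane]);   /* after the shuffle */
        uint32_t prod  = dword * mult[lane];             /* per-lane left shift */
        out[lane] = (prod >> 7) & 0x7FFFu;               /* common right shift */
    }
}
```

The multiplication may overflow the DWORD, but only bits above the 15-bit field are lost, so the masked result is unaffected.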

  17. Shift-3: Blend results of different shift amounts
      Example: packed 4-bit fields
      Steps shown: Shuffle; shift right by 4; Shuffle; Blend; mask with "and"; use a second Blend for the remaining values.
      [Figure: byte positions F down to 0; the 4-bit fields numbered 0 to 31 are rearranged step by step into the DWORDs 3, 2, 1, 0.]
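With 4-bit fields only two shift amounts occur (0 and 4), so one can compute an unshifted and a shifted copy and pick the right one per lane. The sketch below models that idea with byte-sized output lanes for brevity; it is a simplification, not the slide's exact shuffle/blend sequence, and the function name is hypothetical.

```c
#include <stdint.h>

/* Decompress sixteen 4-bit fields (8 input bytes) into sixteen bytes. */
static void decompress16_4bit(const uint8_t in[8], uint8_t out[16]) {
    uint8_t unshifted[16], shifted[16];

    for (int i = 0; i < 16; ++i) {
        uint8_t src = in[i / 2];                 /* "shuffle": each byte feeds 2 lanes */
        unshifted[i] = src & 0x0F;               /* mask only */
        shifted[i]   = (uint8_t)((src >> 4) & 0x0F); /* shift right by 4, then mask */
    }
    for (int i = 0; i < 16; ++i)                 /* "blend": odd lanes take the shifted copy */
        out[i] = (i & 1) ? shifted[i] : unshifted[i];
}
```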

  18. Shift-4: Use integer comparison to propagate values
      Packed 1-bit fields only.
      Labels in the figure: shuffle, and, compare, and, shuffle.
      [Figure: byte positions F down to 0; lanes holding 0x01, 0xFF, and 0x00 patterns are combined to produce the output bits 0, 1, 0, 1.]
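For 1-bit fields a comparison can replace the shift: broadcast the source byte, isolate one bit per lane with "and", and compare against the same bit pattern so that a set bit becomes 0xFF. The following scalar sketch is one reading of that idea with hypothetical names; the final "and" with 0x01, which turns 0xFF into 1, is an assumption about how the result is normalized.

```c
#include <stdint.h>

/* Decompress sixteen 1-bit fields (2 input bytes) into sixteen bytes of 0 or 1. */
static void decompress16_1bit(const uint8_t in[2], uint8_t out[16]) {
    for (int i = 0; i < 16; ++i) {
        uint8_t replicated = in[i / 8];                  /* shuffle: broadcast byte */
        uint8_t bit        = (uint8_t)(1u << (i % 8));   /* per-lane bit pattern */
        uint8_t isolated   = replicated & bit;           /* and */
        uint8_t cmp        = (isolated == bit) ? 0xFF : 0x00; /* compare equal */
        out[i] = cmp & 0x01;                             /* and: 0xFF becomes 1 */
    }
}
```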

  19. Decompression is 1.58x faster with SIMD

  20. Using SIMD for Predicate Handling

  21. COMPRESSEDSEARCH searches on compressed values
      • Algorithmic optimization: DECOMPRESS only the range of values that are of interest.
      • Returns the indexes of the matching "Index Values" instead of the decompressed "Index Values".
      [Figure: COMPRESSEDSEARCH(1,30) on an input vector ..., y, x, 42, 4, 27, 32766. The values at Index=112 to Index=115 are decompressed and compared, and the indexes of the hits (114, 113) are stored in the result buffer.]

  22. Basic idea of COMPRESSEDSEARCH
      Example: COMPRESSEDSEARCH(3,30) for packed 17-bit fields
      1. Load a pre-fetched 128-bit segment of input data into an SSE register.
      2. Copy the compressed values to target DWORDs (32-bit segments).
      3. Make a parallel comparison of each DWORD (1 <= Value < 30).
      4. Store the indexes.
      [Figure: byte positions F down to 0 holding ..., 49, 270, 42, 4, 27, 32766; the DWORDs 42, 4, 27, 32766 yield comparison masks of 0x00000000 or 0xFFFFFFFF, and the indexes 114 and 113 are stored.]

  23. Compare shifted values
      Example: COMPRESSEDSEARCH(3,30) for packed 15-bit fields
      • Mask the values with pand; the DWORDs keep their different shifts (0, 5, 6, and 7 bits in the example).
      • "Greater than" comparison against the shifted lower bound (3, 3, 3, 3).
      • "Less than" comparison against the shifted upper bound (30, 30, 30, 30).
      • Combine both comparison masks with "and".
      [Figure: byte positions F down to 0; the values 4, 32766, 42, 27 produce per-DWORD masks of 0xFFFFFFFF or 0x00000000 for each comparison and for the combined result.]
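Instead of aligning each value before the comparison, the bounds can be shifted to match each lane, which removes the alignment step from the search. The scalar sketch below illustrates this for 15-bit fields. It reads the slide's "greater than" and "less than" as strict comparisons, which is an assumption, and the names and byte offsets are hypothetical.

```c
#include <stdint.h>

static uint32_t load_le32(const uint8_t *p) {
    return (uint32_t)p[0] | (uint32_t)p[1] << 8
         | (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* For the first four 15-bit fields, set mask_out[lane] to 0xFFFFFFFF when
 * lo < value < hi, else 0. lo and hi must fit into 15 bits. */
static void range_mask4_15bit(const uint8_t in[16], uint32_t lo, uint32_t hi,
                              uint32_t mask_out[4]) {
    static const unsigned offset[4] = {0, 1, 3, 5};   /* byte offset per lane */
    static const unsigned shift[4]  = {0, 7, 6, 5};   /* bit shift per lane */

    for (int lane = 0; lane < 4; ++lane) {
        /* pand: keep only this lane's field, still in its shifted position */
        uint32_t field = load_le32(in + offset[lane]) & (0x7FFFu << shift[lane]);
        int gt = field > (lo << shift[lane]);          /* shifted lower bound */
        int lt = field < (hi << shift[lane]);          /* shifted upper bound */
        mask_out[lane] = (gt && lt) ? 0xFFFFFFFFu : 0u; /* and */
    }
}
```

Shifting both sides of a comparison by the same amount preserves the ordering, so the result equals a comparison on the aligned values.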

  24. Hits are stored with a look-up table
      • Test first whether there are any hits with _mm_testz_si128; its implicit "pand" saves 1 instruction.
      • Extract bits with "movemask" and use them for the table look-up (0b0110 in the example, from the masks 0xFFFFFFFF and 0x00000000).
      • Maintain a loop variable with the current indexes (112 to 115 in the example).
      • Shuffle the indexes of the hits (shuffle mask from the look-up table): 114, 113.
      • Append the result to the list of hits.
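The look-up-table step can be sketched in scalar C as follows. The table maps each 4-bit movemask result to the lanes that hit, which plays the role of the shuffle mask on the slide. All names (`build_hit_table`, `movemask4`, `append_hits`) are hypothetical, and building the table at run time rather than storing it as a constant is a choice made here for brevity.

```c
#include <stddef.h>
#include <stdint.h>

static uint8_t hit_lanes[16][4];   /* for each bit pattern: lanes that hit */
static uint8_t hit_count[16];      /* for each bit pattern: number of hits */

/* Fill the look-up table once before scanning. */
static void build_hit_table(void) {
    for (unsigned bits = 0; bits < 16; ++bits) {
        uint8_t n = 0;
        for (uint8_t lane = 0; lane < 4; ++lane)
            if (bits & (1u << lane))
                hit_lanes[bits][n++] = lane;
        hit_count[bits] = n;
    }
}

/* Collect the top bit of each DWORD mask into a 4-bit integer. */
static unsigned movemask4(const uint32_t mask[4]) {
    unsigned bits = 0;
    for (unsigned lane = 0; lane < 4; ++lane)
        bits |= (mask[lane] >> 31) << lane;
    return bits;
}

/* Append the indexes of all hits in this group; returns the new hit count. */
static size_t append_hits(const uint32_t mask[4], uint32_t base_index,
                          uint32_t *hits, size_t n_hits) {
    unsigned bits = movemask4(mask);
    if (bits == 0)
        return n_hits;                       /* test first: no hits, nothing to store */
    for (unsigned k = 0; k < hit_count[bits]; ++k)
        hits[n_hits + k] = base_index + hit_lanes[bits][k];
    return n_hits + hit_count[bits];
}
```

With the masks from the slide (hits in lanes 1 and 2) and a base index of 112, the movemask value is 0b0110 and the indexes 113 and 114 are appended.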

  25. Full-table Scan is 1.63x faster with SIMD

  26. Best performance is achieved for small result sets

  27. SIMD-Scan scales with the number of cores

  28. Summary
      • Data is stored in memory as columns
      • Integers are compressed as packed bit-fields
      • Use the vector-processing unit built into standard processors
      • Decompression is 1.58x faster with SIMD
      • Full-table scan is 1.63x faster with SIMD
      • Best performance is achieved for small result sets
      • SIMD-Scan scales with the number of cores

  29. Trademarks
      • Intel and Xeon are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries.
      • SAP, SAP NetWeaver, BusinessObjects, BusinessObjects Explorer, and other SAP products and services mentioned herein as well as their respective logos are trademarks or registered trademarks of SAP AG in Germany and other countries.
      • Other names and brands may be claimed as the property of others.

  30. Load problem of unaligned data
      Example: packed 15-bit fields
      [Figure: a bit stream (positions 0 to 63 and beyond) of packed 15-bit values, shown against a 128-bit register (bits 127 down to 0, byte positions F down to 0). "Load the first group" covers 32766, 27, 4, 42, 128, 31415, 5, 114; "Load the second group" covers 99, 115, 57, …, 44, 6, 12. Values cut off at the edges of a load are marked with "~", and one of them is labeled "What about this value?".]

  31. Solution: Load using palign
      • Use the same technique for further loads with a "suitable" shift amount.
      • Unroll the loop, because the shift amount is an immediate (loop unrolling can be done by the Intel compiler with a pragma).
      • Use unaligned loads on the Intel® Xeon® processor 5500 series.
      [Figure: _mm_load_si128 loads data aligned to 128 bits (Start to END, bits 0 to 127); _mm_alignr_epi8 (palign) builds a new SSE register from two loads, neglecting 15 bytes by shifting.]
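The effect of palign can be modeled in scalar C: two 16-byte registers are treated as one 32-byte value, and 16 bytes are extracted starting at a byte offset. The sketch below follows the operand order of `_mm_alignr_epi8(hi, lo, n)`; the function name is hypothetical and the model only covers shift amounts from 0 to 16.

```c
#include <stdint.h>

/* dst = the 16 bytes starting at byte n of the 32-byte concatenation hi:lo
 * (lo holds bytes 0..15, hi holds bytes 16..31). Requires 0 <= n <= 16. */
static void alignr_bytes(const uint8_t hi[16], const uint8_t lo[16],
                         unsigned n, uint8_t dst[16]) {
    for (unsigned i = 0; i < 16; ++i) {
        unsigned src = i + n;
        dst[i] = (src < 16) ? lo[src] : hi[src - 16];
    }
}
```

Eight 15-bit fields occupy exactly 15 bytes, so a shift of 15 bytes makes the next group start at byte 0 of the new register. In the real instruction the shift amount is an immediate, which is why the slide recommends unrolling the loop.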
