SIMD-Scan: Ultra Fast in-Memory Table Scan using on-Chip Vector Processing Units

Thomas Willhalm and Nicolae Popovici, Intel GmbH
Yazan Boshmaf, SAP AG
Hasso Plattner, Alexander Zeier, Jan Schaffner, Hasso-Plattner-Institute, University of Potsdam
VLDB 2009, August 25, 2009

Agenda
• Column-Store and Light-Weight Database Compression
• Single Instruction Multiple Data (SIMD)
• Using SIMD for Decompression
• Using SIMD for Predicate Handling

Acknowledgement: We would like to thank Franz Färber, Günter Radestock, Tobias Mindnich, and Christoph Weyerhäuser from SAP for the fruitful discussions and the tremendous help in integrating and testing the SIMD routines.

Column-Store
• Column-oriented
• Columns compressed independently
• Completely in main memory
• For each query, do a full-table scan, i.e.
  • Decompress required columns
  • Aggregate data according to predicate
  • Further processing
[Figure: a table with columns A, B, C, D and a marked row]
Data is stored in memory as columns

SAP* NetWeaver Business Warehouse Accelerator (BWA)
• Processing highly parallelized across multiple cores and blades
• Shared-nothing approach
• Public demo available at http://microfinance.sap.com/

Light-Weight Database Compression
Tables for storing the “Amount” attribute:
Sales Table (ID, Amount), Amount values shown: 12.23, 1.02, 132.13, …
Attribute Table (DocId, ValueId), ValueId values shown: 42, 4, 100000, 128, 31455
Dictionary (ValueId, Value): 1 → 0.02, 2 → 0.10, 3 → 1.00, …, 100000 → 132.13
• Our focus: loading the ValueId column of the Attribute Table
• From the “Dictionary”, those values are (0, 1, 2, 3, …, 100000)
• Max is 100000, which needs 17 bits to represent (2^17 − 1 = 131071)
• Idea: instead of 32 bits, use 17 bits to store each value
• Accessing “Value” needs decompression into 32 bits
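The bit-width arithmetic on this slide can be checked with a few lines of C. This is only an illustrative sketch; the function name `bits_needed` is hypothetical and does not come from the presentation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: number of bits needed to store any value in
 * the range [0, max_value] as an unsigned bit-field. */
unsigned bits_needed(uint32_t max_value)
{
    unsigned bits = 1;                       /* even 0 occupies one bit */
    /* Grow the width until nothing is left above it. The bits < 32
     * guard avoids an undefined shift by 32. */
    while (bits < 32 && (max_value >> bits) != 0)
        bits++;
    assert(bits >= 1 && bits <= 32);
    return bits;
}
```

With a dictionary whose largest ValueId is 100000, `bits_needed(100000)` yields 17, so each entry takes 17 bits instead of 32.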

Integers are compressed as packed bit-fields
Example: packed 17-bit fields
[Figure: a 128-bit input (bytes F … 0) holding the packed 17-bit values … 110300, 65536, 1772, 2702, 2, 42. DECOMPRESS expands 1772, 2702, 2, 42 from 17 bits to 32 bits each; the results are used as indexes into the dictionary (entries such as 3.14, 2.55, 2.73, 1.23, 0.02).]
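As a scalar reference for what "packed bit-fields" means, the following sketch packs and unpacks fixed-width fields one at a time. It assumes a little-endian bit order (field 0 occupies the lowest bits of byte 0), which the slide does not state explicitly; the names `pack_field` and `unpack_field` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store `value` (assumed < 2^bits) as field number `idx`.
 * The buffer must be zero-initialised before the first call. */
void pack_field(uint8_t *buf, size_t idx, unsigned bits, uint32_t value)
{
    assert(bits >= 1 && bits <= 32);
    size_t bitpos = idx * bits;
    for (unsigned b = 0; b < bits; b++, bitpos++)
        if ((value >> b) & 1u)
            buf[bitpos / 8] |= (uint8_t)(1u << (bitpos % 8));
}

/* Read field number `idx` back into a full 32-bit integer. */
uint32_t unpack_field(const uint8_t *buf, size_t idx, unsigned bits)
{
    assert(bits >= 1 && bits <= 32);
    size_t bitpos = idx * bits;
    size_t byte = bitpos / 8;
    unsigned shift = (unsigned)(bitpos % 8);
    /* Gather only the bytes the field touches (at most 5) into a
     * 64-bit window, then shift and mask the field out. */
    size_t nbytes = (shift + bits + 7) / 8;
    uint64_t window = 0;
    for (size_t i = 0; i < nbytes; i++)
        window |= (uint64_t)buf[byte + i] << (8 * i);
    uint64_t mask = (1ull << bits) - 1;
    return (uint32_t)((window >> shift) & mask);
}
```

This processes one value per call; the SIMD routines described on the following slides handle several values per instruction instead.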

Using SIMD for Full-Table Scans

Single Instruction Multiple Data (SIMD)
• Scalar processing
  • traditional mode
  • one instruction produces one result
• SIMD processing
  • with Intel® SSE(2,3,4)
  • one instruction produces multiple results
[Figure: a scalar OP computes DEST = X op Y; an SSE/2/3 OP on 128-bit registers (bits 127 … 0) computes X4opY4, X3opY3, X2opY2, X1opY1 at once.]

Single Instruction Multiple Data (SIMD)
• 128-bit wide with Intel® SSE(2,3,4)
  • 2 64-bit integer ops/cycle
  • 4 32-bit integer ops/cycle
  • 8 16-bit integer ops/cycle
  • 16 8-bit integer ops/cycle
• 256-bit with AVX (Sandy Bridge)
• 512-bit with Larrabee
[Figure: SSE operation; in clock cycle 1 a single SSE2 OP on 128-bit registers (bits 127 … 0) produces X4opY4, X3opY3, X2opY2, X1opY1.]
Vector-processing unit built into standard processors

Using SIMD for Decompression

DECOMPRESS unaligned bit fields
Example: packed 17-bit fields
1. Load a pre-fetched 128-bit segment of input data into an SSE register.
2. Copy compressed values to target DWORDs (32-bit segments).
3. Align the values from unequally shifted DWORDs.
4. Store uncompressed values.
[Figure: _mm_shuffle_epi8 copies the packed values 1772, 2702, 2, 42 from the input (… 110300, 65536, 1772, 2702, 2, 42; bytes F … 0) into four DWORDs.]

Problem: There are values that span across 5 bytes
Example: packed 27-bit fields
• The 3rd value spans across 5 bytes.
• Shuffle cannot copy all of its bits into a 4-byte space directly.
[Figure: bytes F … 0 holding packed 27-bit values; after the Shuffle, the DWORD for the third value is marked “??”.]
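Which fields straddle five bytes follows directly from the bit positions. The sketch below, with the hypothetical helper name `bytes_spanned`, counts the bytes a field touches when fields are packed back to back starting at bit 0.

```c
#include <assert.h>
#include <stddef.h>

/* Number of bytes touched by field `idx` when fields of `bits` bits
 * each are packed back to back, starting at bit 0 of byte 0. */
unsigned bytes_spanned(size_t idx, unsigned bits)
{
    assert(bits >= 1 && bits <= 32);
    size_t first_bit = idx * bits;
    size_t last_bit = first_bit + bits - 1;
    return (unsigned)(last_bit / 8 - first_bit / 8 + 1);
}
```

For 27-bit fields, the third field (index 2) covers bits 54 to 80, that is bytes 6 to 10, so `bytes_spanned(2, 27)` is 5, while its neighbours fit in 4 bytes. For 17-bit fields every one of the first four fields fits in 3 bytes.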

Solution: Shift 5-byte values into 4 bytes and blend
Example: packed 27-bit fields
[Figure: the values 32766, 27, 4, 42 are processed by Shuffle (_mm_shuffle_epi8), 64-bit shifts (_mm_srli_epi64, _mm_slli_epi64) and Blend (_mm_blend_epi16).]

Different workarounds for “independent shift” are used
• Direct shuffle for nicely aligned values
• Integer multiplication (16-bit and 32-bit) to simulate independent shift
• Use 2 shifts and blend results
• Integer comparison to propagate value (1-bit compression)

Shift-1: Direct shuffle of aligned values
Example: packed 24-bit fields
• Data is nicely aligned (the 8-, 16- and 24-bit cases).
• Copy interesting parts only and “zero” out the others.
[Figure: Shuffle copies the values 128, 31415, 5, 114 from the packed input (… 32766, 128, 31415, 5, 114; bytes F … 0) into four DWORDs.]

Shift-2: Use multiplication to simulate independent left shift
Example: packed 15-bit fields
[Figure: after the Shuffle, the values 32766, 27, 4, 42 sit at bit offsets 5, 6, 7 and 0 within their DWORDs; after the multiplication all four sit at a 7-bit offset.]
Multiply to shift left:
__m128i mult_msk = _mm_set_epi32(0x04,0x02,0x01,0x80);
__m128i mult_rslt = _mm_mullo_epi32(shfl_rslt, mult_msk);
Shift right:
_mm_srli_epi32(mult_rslt, 7);
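The multiply trick can be traced lane by lane in portable scalar C. The sketch below models the four 32-bit lanes with a loop rather than SSE registers. It assumes a little-endian bit stream with field i starting at bit 15·i, and the names `load_dword` and `unpack4x15` are hypothetical. The multipliers are the ones from the slide's `_mm_set_epi32(0x04,0x02,0x01,0x80)`, listed here from lane 0 upward.

```c
#include <assert.h>
#include <stdint.h>

/* Read 4 bytes at buf[off] as a little-endian DWORD. This stands in
 * for what the byte shuffle places into one 32-bit lane. */
static uint32_t load_dword(const uint8_t *buf, unsigned off)
{
    return (uint32_t)buf[off]
         | (uint32_t)buf[off + 1] << 8
         | (uint32_t)buf[off + 2] << 16
         | (uint32_t)buf[off + 3] << 24;
}

/* Decode the first four packed 15-bit fields. Reads buf[0..8]. */
void unpack4x15(const uint8_t *buf, uint32_t out[4])
{
    assert(buf != 0 && out != 0);
    /* Field i starts at bit 15*i: byte offsets 0, 1, 3, 5 and
     * bit offsets 0, 7, 6, 5 inside the loaded DWORD. */
    static const unsigned byte_off[4] = {0, 1, 3, 5};
    /* Multiplying by 2^k is a left shift by k; k = 7, 0, 1, 2 moves
     * every field to bit offset 7. */
    static const uint32_t mult[4] = {0x80, 0x01, 0x02, 0x04};
    for (int lane = 0; lane < 4; lane++) {
        uint32_t v = load_dword(buf, byte_off[lane]);
        v *= mult[lane];         /* per-lane "independent" left shift */
        v >>= 7;                 /* one common right shift            */
        out[lane] = v & 0x7FFFu; /* drop bits of neighbouring fields  */
    }
}
```

The point of the trick is that SSE at this level offers no per-lane shift amounts, whereas a per-lane multiply exists (`_mm_mullo_epi32`), so different left shifts are expressed as different power-of-two factors.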

Shift-3: Blend results of different shift amounts
Example: packed 4-bit fields
[Figure: bit positions 0 to 31 of the packed 4-bit values as they move through the steps: Shuffle; Shift right by 4; Shuffle; Blend; Mask with “and”; use a second Blend for the remaining values.]

Shift-4: Use integer comparison to propagate values
Packed 1-bit fields ONLY
[Figure: bytes F … 0 taking the values 0x01, 0xFF and 0x00 across the steps shuffle, “and”, compare, “and”, shuffle; result bits 0 1 0 1.]

Decompression is 1.58x faster with SIMD

Using SIMD for Predicate Handling

COMPRESSEDSEARCH searches on compressed values
• Algorithmic optimization: only decompress the range of values that are of interest
• Returns the indexes of matching “Index Values” instead of the decompressed “Index Values”
[Figure: input vector (bytes F … 0) with the packed values 42, 4, 27, 32766 at indexes 112 to 115; COMPRESSEDSEARCH(1,30) decompresses them, compares, and stores the indexes of the hits in the result buffer.]

Basic idea of COMPRESSEDSEARCH
Example: COMPRESSEDSEARCH(3,30) for packed 17-bit fields
1. Load a pre-fetched 128-bit segment of input data into an SSE register.
2. Copy compressed values to target DWORDs (32-bit segments).
3. Compare all DWORDs in parallel (1 <= Value < 30).
4. Store the indexes.
[Figure: the values 42, 4, 27, 32766 are copied into four DWORDs; the parallel comparison yields 0x00000000 or 0xFFFFFFFF per DWORD; the indexes of the hits (114, 113) are stored.]

Compare shifted values
Example: COMPRESSEDSEARCH(3,30) for packed 15-bit fields
• Mask values with pand
• Compare “greater than” against the shifted lower bound (3 in every DWORD)
• Compare “less than” against the shifted upper bound (30 in every DWORD)
• Combine both comparison results with “and”
[Figure: the values 4, 32766, 42, 27 remain at their bit offsets (5-bit, 0-bit, 7-bit and 6-bit shift) within the DWORDs; each comparison yields 0xFFFFFFFF or 0x00000000 per DWORD.]
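The idea of comparing values where they sit, by shifting the bounds instead of the data, can be sketched for a single lane in scalar C. The function name `in_range_shifted` is hypothetical, and the strict bounds (lo < value < hi) follow the “greater than” and “less than” labels on this slide, which is an interpretation of the figure.

```c
#include <assert.h>
#include <stdint.h>

/* Test lo < value < hi for a `bits`-wide field that sits at bit
 * offset `shift` inside `dword`, without moving the field to bit 0.
 * Both bounds must fit into `bits` bits. Returns 1 or 0. */
int in_range_shifted(uint32_t dword, unsigned shift, unsigned bits,
                     uint32_t lo, uint32_t hi)
{
    assert(bits >= 1 && shift + bits <= 32);
    uint64_t low_mask = (1ull << bits) - 1;
    assert(lo <= low_mask && hi <= low_mask);
    /* Keep only the field's bits (the role of "pand"). */
    uint32_t field_mask = (uint32_t)(low_mask << shift);
    uint32_t masked = dword & field_mask;
    /* Shifting both bounds by the same amount preserves the order,
     * so comparing in place gives the same answer as aligning first. */
    return masked > (lo << shift) && masked < (hi << shift);
}
```

In a vectorised version the shifted bounds would be precomputed once per lane outside the loop, so the per-value alignment step disappears from the search entirely.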

Hits are stored with look-up table
• Test first whether there are any hits with _mm_testz_si128
  • Implicit “pand” (saves 1 instruction)
• Extract bits with “movemask”; use this for table look-up
• Maintain loop variable with current indexes
• Shuffle indexes of hits (shuffle mask from look-up table)
• Append result to list of hits
[Figure: comparison result 0xFFFFFFFF 0x00000000 0x00000000 0xFFFFFFFF; extracted bits 0b0110; current indexes 114, 112, 115, 113; hits after the shuffle: 114, 113.]
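A scalar model of the look-up-table step is sketched below. It takes a 4-bit hit mask (as a movemask would deliver for four lanes) and appends the indexes of the hits to a result list. The table layout and the names `append_hits`, `lane_lut` and `hit_count` are assumptions made for illustration; the presentation only states that a shuffle mask is fetched from a look-up table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Append the indexes base..base+3 whose bit is set in the 4-bit
 * `hit_mask` to out[], starting at position `count`.
 * Returns the new number of stored hits. */
size_t append_hits(uint32_t *out, size_t count, uint32_t base,
                   unsigned hit_mask)
{
    /* lane_lut[m] lists, lowest lane first, the lanes whose bit is
     * set in m; unused slots are 0. This is the scalar counterpart
     * of a table of shuffle masks. */
    static const uint8_t lane_lut[16][4] = {
        {0,0,0,0}, {0,0,0,0}, {1,0,0,0}, {0,1,0,0},
        {2,0,0,0}, {0,2,0,0}, {1,2,0,0}, {0,1,2,0},
        {3,0,0,0}, {0,3,0,0}, {1,3,0,0}, {0,1,3,0},
        {2,3,0,0}, {0,2,3,0}, {1,2,3,0}, {0,1,2,3}
    };
    /* Number of set bits for each possible mask. */
    static const uint8_t hit_count[16] =
        {0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4};

    assert(hit_mask < 16);
    if (hit_mask == 0)          /* early exit when nothing matched */
        return count;
    for (unsigned k = 0; k < hit_count[hit_mask]; k++)
        out[count + k] = base + lane_lut[hit_mask][k];
    return count + hit_count[hit_mask];
}
```

For example, with `base` 112 and mask 0b0110 the function appends 113 and 114. The table replaces a data-dependent branch per lane with one indexed load, which is the reason for using it in a tight scan loop.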

Full-table Scan is 1.63x faster with SIMD

Best performance is achieved for small result sets

SIMD-Scan scales with the number of cores

Summary
• Data is stored in memory as columns
• Integers are compressed as packed bit-fields
• Use the vector-processing unit built into standard processors
• Decompression is 1.58x faster with SIMD
• Full-table scan is 1.63x faster with SIMD
• Best performance is achieved for small result sets
• SIMD-Scan scales with the number of cores

Trademarks
• Intel and Xeon are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries.
• SAP, SAP NetWeaver, BusinessObjects, BusinessObjects Explorer, and other SAP products and services mentioned herein as well as their respective logos are trademarks or registered trademarks of SAP AG in Germany and other countries.
• Other names and brands may be claimed as the property of others.

Load problem of unaligned data
Example: packed 15-bit fields
[Figure: two consecutive 128-bit loads (bits 127 … 0), labelled “Load the first group” and “Load the second group”; values marked “~” are cut off at a load boundary, with the question “What about this value?”]

Solution: Load using palign
• Use the same technique for further loads with a “suitable” shift amount.
• Unroll the loop because the shift amount is an immediate (loop unrolling can be done by the Intel compiler with a pragma).
• Use unaligned loads on the Intel® Xeon® processor 5500 series.
[Figure: two aligned loads with _mm_load_si128 (Start … END); _mm_alignr_epi8 (palign) combines them into a new 128-bit SSE register, neglecting the already processed data by shifting 15 bytes.]
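What the palign step computes can be modelled byte by byte in scalar C. The sketch below mirrors the documented behaviour of `_mm_alignr_epi8(hi, lo, n)` for shift amounts up to 16 bytes; the function name `alignr_bytes` and the restriction to n <= 16 are choices made for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of _mm_alignr_epi8(hi, lo, n) for n <= 16:
 * treat hi:lo as one 32-byte value (lo holds the low bytes), shift
 * it right by n bytes and keep the low 16 bytes. */
void alignr_bytes(uint8_t dst[16], const uint8_t hi[16],
                  const uint8_t lo[16], unsigned n)
{
    assert(n <= 16);
    for (unsigned i = 0; i < 16; i++) {
        unsigned src = i + n;   /* byte position in the 32-byte pair */
        dst[i] = (src < 16) ? lo[src] : hi[src - 16];
    }
}
```

With packed 15-bit fields, eight values occupy exactly 15 bytes, so after one group has been processed a shift of 15 bytes lines the next group up at byte 0 of the new register. Because the intrinsic takes the shift amount as an immediate, each distinct shift needs its own instruction, which is why the slide recommends unrolling the loop.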
