
Database Operations on GPU

Database Operations on GPU. Changchang Wu, 4/18/2007. Outline: Database Operations on GPU; Point List Generation on GPU; Nearest Neighbor Searching on GPU. Design Issues: low bandwidth between GPU and CPU; avoid frame buffer readbacks; no arbitrary writes.

Albert_Lan

Presentation Transcript


  1. Database Operations on GPU Changchang Wu 4/18/2007

  2. Outline • Database Operations on GPU • Point List Generation on GPU • Nearest Neighbor Searching on GPU

  3. Database Operations on GPU

  4. Design Issues • Low bandwidth between GPU and CPU • Avoid frame buffer readbacks • No arbitrary writes • Avoid data rearrangements • Programmable pipeline has poor branching • Evaluate branches using fixed function tests

  5. Design Overview • Use depth test functionality of GPUs for performing comparisons • Implements all possible comparisons <, <=, >=, >, ==, !=, ALWAYS, NEVER • Use stencil test for data validation and storing results of comparison operations • Use occlusion query to count number of elements that satisfy some condition

  6. Basic Operations • Basic SQL query: SELECT A FROM T WHERE C • A = attributes or aggregations (SUM, COUNT, MAX, etc.) • T = relational table • C = Boolean combination of predicates (using operators AND, OR, NOT)

  7. Basic Operations • Predicates – ai op constant or ai op aj • Op is one of <,>,<=,>=,!=, =, TRUE, FALSE • Boolean combinations – Conjunctive Normal Form (CNF) expression evaluation • Aggregations – COUNT, SUM, MAX, MEDIAN, AVG

  8. Predicate Evaluation • ai op constant (d) • Copy the attribute values ai into the depth buffer • Define the comparison operation using the depth test • Draw a screen-filling quad at depth d • glDepthFunc(…); glStencilOp(fail, zfail, zpass);
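
The three steps on this slide can be imitated on the CPU. The following is a minimal sketch, not the presenter's implementation: `evaluate_predicate`, `DEPTH_FUNCS`, and `MIRRORED` are hypothetical names, plain lists stand in for the depth and stencil buffers, and attribute values are used directly instead of being normalized to the depth range. The one OpenGL fact it relies on is that the depth test compares the incoming fragment depth against the stored depth, so evaluating "ai op d" with the quad at depth d requires the mirrored comparison.

```python
import operator

# Depth comparison functions, named after the OpenGL enums they imitate.
# Each one receives (incoming fragment depth, stored depth).
DEPTH_FUNCS = {
    "GL_LESS": operator.lt,
    "GL_LEQUAL": operator.le,
    "GL_GREATER": operator.gt,
    "GL_GEQUAL": operator.ge,
    "GL_EQUAL": operator.eq,
    "GL_NOTEQUAL": operator.ne,
    "GL_ALWAYS": lambda incoming, stored: True,
    "GL_NEVER": lambda incoming, stored: False,
}

# The quad depth d is the incoming value and a_i is the stored value,
# so "a_i op d" needs the mirror image of op as the depth function.
MIRRORED = {"<": "GL_GREATER", "<=": "GL_GEQUAL", ">": "GL_LESS",
            ">=": "GL_LEQUAL", "==": "GL_EQUAL", "!=": "GL_NOTEQUAL"}


def evaluate_predicate(attribute_values, op, d):
    """Simulate drawing a screen-filling quad at depth d.

    Returns (stencil, count): stencil[i] is 1 where a_i op d holds, and
    count is what an occlusion query around the draw would report.
    """
    depth_buffer = list(attribute_values)      # step 1: copy a_i into depth
    depth_func = DEPTH_FUNCS[MIRRORED[op]]     # step 2: choose the depth test
    stencil = [0] * len(depth_buffer)
    count = 0
    for i, stored in enumerate(depth_buffer):  # step 3: one fragment per record
        if depth_func(d, stored):
            stencil[i] = 1                     # stands in for the zpass stencil op
            count += 1
    return stencil, count
```

For instance, `evaluate_predicate(values, ">=", 128)` marks every record whose attribute is at least 128 and returns how many there are, which is the quantity the COUNT aggregation reads back.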

  9. Predicate Evaluation • Comparing two attributes: • ai op aj is treated as (ai – aj) op 0 • Semi-linear queries • Easy to compute with a fragment shader

  10. Boolean Combinations • Expression provided as a CNF • CNF is of the form (A1 AND A2 AND … AND Ak), where Ai = (Bi1 OR Bi2 OR … OR Bimi) • The evaluated CNF has no NOT operator: if one appears, invert the comparison operation to eliminate it, e.g. NOT (ai < d) => (ai >= d) • Example: compute ai within [low, high] • Evaluated as (ai >= low) AND (ai <= high)
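
One way to picture CNF evaluation with a stencil buffer is sketched below. This is an illustration under assumptions, not the algorithm from the slides: the stencil value of a record is taken to be the number of leading clauses it has satisfied, and `eval_cnf`, `OPS`, and the record layout are hypothetical.

```python
import operator

OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt,
       ">=": operator.ge, "==": operator.eq, "!=": operator.ne}


def eval_cnf(records, clauses):
    """Evaluate (A1 AND ... AND Ak), each Ai an OR of simple predicates.

    records: list of dicts mapping attribute name -> value.
    clauses: list of clauses; each clause is a list of
             (attribute, op, constant) predicates.
    Returns one boolean per record.
    """
    # stencil[r] counts how many leading clauses record r has satisfied.
    stencil = [0] * len(records)
    for i, clause in enumerate(clauses, start=1):
        for attribute, op, constant in clause:   # one "quad" per predicate
            compare = OPS[op]
            for r, record in enumerate(records):
                # Only records that passed clauses 1..i-1 and have not yet
                # passed clause i are candidates for this predicate.
                if stencil[r] == i - 1 and compare(record[attribute], constant):
                    stencil[r] = i
    return [s == len(clauses) for s in stencil]
```

The range query from the slide becomes two single-predicate clauses, `[[("a", ">=", low)], [("a", "<=", high)]]`; a record survives only if its stencil value reaches 2.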

  11. Algorithm

  12. Range Query • Compute ai within [low, high] • Evaluated as ( ai >= low ) AND ( ai <= high )

  13. Aggregations • COUNT, MAX, MIN, SUM, AVG • No data rearrangements

  14. COUNT • Use occlusion queries to get pixel pass count • Syntax: • Begin occlusion query • Perform database operation • End occlusion query • Get count of number of attributes that passed database operation • Involves no additional overhead!

  15. MAX, MIN, MEDIAN • We compute the Kth-largest number • Traditional algorithms require data rearrangements • We perform no data rearrangements and no frame buffer readbacks

  16. K-th Largest Number • By comparing and counting, determine every bit of the result in order from MSB to LSB

  17. Example: Parallel Max • S = {10, 24, 37, 99, 192, 200, 200, 232} • Step 1: Draw quad at 128 (10000000) • 192, 200, 200, 232 pass • Step 2: Draw quad at 192 (11000000) • 192, 200, 200, 232 pass • Step 3: Draw quad at 224 (11100000) • 232 passes • Step 4: Draw quad at 240 (11110000) • No values pass • Step 5: Draw quad at 232 (11101000) • 232 passes • Steps 6, 7, 8: Draw quads at 236, 234, 233 • No values pass, so Max is 232
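
The bit-by-bit search can be written as a short CPU sketch. It assumes non-negative integers that fit in `bits` bits; `kth_largest` is a hypothetical name, and the Python `sum(...)` stands in for the pass count that an occlusion query would return after drawing a quad at the candidate depth.

```python
def kth_largest(values, k, bits=8):
    """Return the k-th largest of non-negative integers below 2**bits."""
    if not 1 <= k <= len(values):
        raise ValueError("k must be between 1 and len(values)")
    x = 0
    for bit in range(bits - 1, -1, -1):   # most significant bit first
        candidate = x | (1 << bit)        # tentatively set this bit
        # On the GPU this count would come from an occlusion query around
        # a quad drawn at depth `candidate`.
        count = sum(1 for v in values if v >= candidate)
        if count >= k:                    # at least k values are this large,
            x = candidate                 # so the bit belongs to the answer
    return x
```

With k = 1 and the set S from the example, the candidates tried are 128, 192, 224, 240, 232, 236, 234, 233, and the bits kept are exactly those of 232. Other values of k give MIN (k = n) or a median without rearranging the data.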

  18. Accumulator, Mean • Accumulator - Use sorting algorithm and add all the values • Mean – Use accumulator and divide by n • Interval range arithmetic • Alternative algorithm • Use fragment programs – requires very few renderings • Use mipmaps [Harris et al. 02], fragment programs [Coombe et al. 03]

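  19. Accumulator • Each data value is represented as a_k 2^k + a_{k-1} 2^{k-1} + … + a_0, where each a_i is a bit • Sum = sum(a_k) 2^k + sum(a_{k-1}) 2^{k-1} + … + sum(a_0), where sum(a_i) is the number of values whose i-th bit is set • Current GPUs support no bit-masking operations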

  20. The Algorithm >=0.5 means i-th bit is 1
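
The accumulator idea can be sketched on the CPU as follows. The slide only states that ">= 0.5 means the i-th bit is 1"; reading this as a test on the fractional part of value / 2^(i+1) is an assumption of this sketch, as are the names `bit_is_set` and `accumulate`. The per-bit count stands in for an occlusion query result.

```python
def bit_is_set(value, i):
    """Test bit i without bit masks: frac(value / 2**(i + 1)) >= 0.5."""
    scaled = value / float(2 ** (i + 1))
    return (scaled % 1.0) >= 0.5


def accumulate(values, bits=8):
    """Sum non-negative integers below 2**bits by counting set bits."""
    total = 0
    for i in range(bits):
        # Number of values whose i-th bit is 1 (one counting pass per bit).
        count = sum(1 for v in values if bit_is_set(v, i))
        total += count * (2 ** i)
    return total
```

Dividing the result by the number of values gives the mean described on the Accumulator, Mean slide.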

  21. Implementation • Algorithm • CPU – Intel compiler 7.1 with hyper-threading, multi-threading, SIMD optimizations • GPU – NVIDIA Cg Compiler • Hardware • Dell Precision Workstation with Dual 2.8GHz Xeon Processor • NVIDIA GeForce FX 5900 Ultra GPU • 2GB RAM

  22. Benchmarks • TCP/IP database with 1 million records and four attributes • Census database with 360K records

  23. Copy Time

  24. Predicate Evaluation

  25. Range Query

  26. Multi-Attribute Query

  27. Semi-linear Query

  28. Kth-Largest

  29. Kth-Largest

  30. Kth-Largest conditional

  31. Accumulator

  32. Analysis: Issues • Precision • Copy time • Integer arithmetic • Depth compare masking • Memory management • No Branching • No random writes

  33. Analysis: Performance • Relative Performance Gain • High Performance – Predicate evaluation, multi-attribute queries, semi-linear queries, count • Medium Performance – Kth-largest number • Low Performance - Accumulator

  34. High Performance • Parallel pixel processing engines • Pipelining • Early Z-cull • Eliminate branch mispredictions

  35. Medium Performance • Parallelism • FX 5900 has a clock speed of 450MHz and 8 pixel processing engines • Rendering a single 1000x1000 quad takes 0.278ms • Rendering 19 such quads takes 5.28ms; the observed time is 6.6ms • 80% efficiency in parallelism!!
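
The figures on this slide can be reproduced with a few lines of arithmetic. The sketch assumes that each pixel engine retires one pixel per clock cycle, which the slide does not state explicitly; the variable names are illustrative.

```python
clock_hz = 450e6               # FX 5900 clock speed from the slide
pixel_engines = 8              # pixel processing engines from the slide
pixels_per_quad = 1000 * 1000  # one 1000x1000 quad
quads = 19
observed_ms = 6.6

# Assumption: one pixel per engine per clock cycle.
ideal_quad_ms = pixels_per_quad / (pixel_engines * clock_hz) * 1e3
ideal_total_ms = quads * ideal_quad_ms
efficiency = ideal_total_ms / observed_ms

print(f"ideal time per quad: {ideal_quad_ms:.3f} ms")
print(f"ideal time for {quads} quads: {ideal_total_ms:.2f} ms")
print(f"efficiency: {efficiency:.0%}")
```

Under that assumption the ideal times round to the 0.278ms and 5.28ms quoted on the slide, and the ratio to the observed 6.6ms is about 80%.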

  36. Low Performance • No gain over SIMD based CPU implementation • Two main reasons: • Lack of integer-arithmetic • Clock rate

  37. Advantages • Algorithms progress at GPU growth rate • Offload CPU work • Fast due to massive parallelism on GPUs • Algorithms could be generalized to any geometric shape • E.g. max value within a triangular region • Commodity hardware!

  38. GPU Point List Generation • Data compaction

  39. Overall task

  40. 3D to 2D mapping

  41. Current Problem

  42. The solution

  43. Overview, Data Compaction

  44. Algorithm: Discriminator

  45. Algorithm: Histogram Builder

  46. Histogram Output

  47. Algorithm: PointList Builder

  48. PointList Output

  49. Timing Reduces a highly sparse matrix with N elements to a list of its M active entries in O(N) + M (log N) steps,
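
The cost quoted above is consistent with a histogram-pyramid style of compaction: a reduction pass builds counts over the N cells, and each of the M outputs is then located by one top-down walk of log N levels. The sketch below is a 1D CPU illustration under that assumption, not the presenter's GPU code; `build_pyramid` and `compact` are hypothetical names, and the slides work on 2D and 3D data.

```python
def build_pyramid(active):
    """Return levels[0..top]: levels[0] holds 0/1 flags, the top holds M."""
    n = 1
    while n < len(active):
        n *= 2                        # pad to a power of two
    level = [1 if a else 0 for a in active] + [0] * (n - len(active))
    levels = [level]
    while len(level) > 1:             # each pass halves the level: O(N) total
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def compact(active):
    """List the indices of the active entries in ascending order."""
    levels = build_pyramid(active)
    total = levels[-1][0]             # M, the number of active entries
    points = []
    for key in range(total):          # each output is found independently
        index = 0
        for level in reversed(levels[:-1]):  # walk down: O(log N) per output
            index *= 2
            left = level[index]       # active entries under the left child
            if key >= left:           # target lies under the right child
                key -= left
                index += 1
        points.append(index)
    return points
```

Because every output index is resolved independently of the others, the traversal maps naturally onto one fragment per output point, which avoids the arbitrary writes listed among the design issues.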

  50. Applications • Image Analysis • Feature Detection • Volume Analysis • Sparse Matrix Generation
