
Optimized Algorithms for Data Analysis in Parallel Database Systems



  1. Optimized Algorithms for Data Analysis in Parallel Database Systems Wellington M. Cabrera Advisor: Dr. Carlos Ordonez

  2. Outline • Motivation • Background • Parallel DBMSs under shared-nothing architecture • Data sets • Review of pre-proposal work • Linear Models with Parallel Matrix Multiplication • Variable Selection, Linear Regression, PCA • Presentation of recent work • Graph Analytics with Parallel Matrix-Matrix Multiplication • Transitive Closure, All Pairs Shortest Path, Triangle Counting • Graph Analytics with Parallel Matrix-Vector Multiplication • PageRank, Connected Components • Reachability, SSSP • Conclusions

  3. Motivation & Contributions

  4. Motivation • Large datasets are found in every domain. • Data grows continuously • Number of records • Number of attributes/features • DBMSs are mature systems with a large body of research behind them. • Query optimizer • Optimized I/O • Parallelism • DBMSs offer increased security compared with ad-hoc file management.

  5. Issues • Most data analysis, model computation, and graph analytics is done outside the database, by exporting CSV files. • It is difficult to express complex models and graph algorithms in a DBMS. • No support for matrix operations • Queries may become hard to program • Algorithms programmed without a deep understanding of DBMS technology may run with poor performance. • What's wrong with exporting the data set to external systems? • Data privacy threat • Wasted time • Analysis is delayed

  6. Contributions History/Timeline • First part of PhD • Linear Models with Parallel Matrix Multiplication [1, 2] • Variable Selection, Linear Regression, PCA • Second part of PhD • Graph Analytics with Parallel Matrix-Matrix Multiplication [3] • Transitive Closure, All Pairs Shortest Path, Triangle Counting • Graph Analytics with Parallel Matrix-Vector Multiplication [4] • PageRank, Connected Components • Reachability, SSSP [1] The Gamma Matrix to Summarize Dense and Sparse Data Sets for Big Data Analytics. IEEE TKDE 28(7): 1905-1918 (2016). [2] Accelerating a Gibbs sampler for variable selection on genomics data with summarization and variable pre-selection combining an array DBMS and R. Machine Learning 102(3): 483-504 (2016). [3] Comparing columnar, row and array DBMSs to process recursive queries on graphs. Inf. Syst. 63: 66-79 (2017). [4] Unified Algorithm to Solve Several Graph Problems with Relational Queries. Alberto Mendelzon International Workshop on Foundations of Data Management (2016).

  7. BACKGROUND

  8. Definitions • Data set for Linear Models • Let X = {x1, ..., xn} be the input data set with n data points, where each point has d dimensions. • X is a d × n matrix, where the data point xi is represented by a column vector (thus, equivalent to a d × 1 matrix). • Y is a 1 × n vector, representing the dependent variable. • Generally n > d; therefore X is a rectangular matrix. • Big data: n >> d

  9. Definitions

  10. Definition: Graph data set • Let G = (V, E), m = |E|, n = |V|. We denote the adjacency matrix of G as E. E is an n × n matrix, generally sparse. • S: a vector of vertices used in graph computations, |S| = |V| = n • Each entry Si represents a vertex attribute: distance from a specific source, membership, probability. • We omit values in S that carry no information (such as ∞ for distances, 0 for probabilities). • Notice E is n × n, but X is d × n.

  11. DBMS Storage classes • Row Store: Legacy, transactions • Column Store: Modern, analytics • Array Store: Emerging, Scientific

  12. Linear Models: data set storage in columnar/row DBMS • Case n >> d • Low and high dimensional datasets • n in millions/billions; d up to a few hundred • Most data sets: marketing, public health, sensor networks. • Data point xi stored as a row, with d columns • Extra column to store the outcome Y • Thus, the data set is stored as a table T, where T has n rows and d+1 columns. • Parallel databases may partition T either by a hash function or a mod function, as sketched below.
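As an illustration of this layout, a minimal DDL sketch for d = 2 (the table name T follows the slide; the Greenplum-style DISTRIBUTED BY clause and the column names are assumptions, since partitioning syntax varies across parallel DBMSs):

CREATE TABLE T (
  i  BIGINT,  -- data point identifier
  x1 FLOAT,   -- dimension 1
  x2 FLOAT,   -- dimension 2
  y  FLOAT    -- extra column for the outcome Y
) DISTRIBUTED BY (i);  -- hash partitioning on the point id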

  13. Linear Models: data set storage in columnar/row DBMS • Case d > n • Very high d, low n • d in thousands. Examples: • Gene expression (microarray) data • Word frequency in documents • Cannot keep the n > d layout: the number of columns goes beyond most row DBMS limits. • Data point xi stored as a column • Extra row to store the outcome Y • Thus, the data set is stored in a table T, which has n columns and d+1 rows.

  14. Linear Models: data set representation in an array DBMS • Array databases store data as multidimensional arrays, instead of relational tables. • Arrays are partitioned in chunks (bi-dimensional data blocks). • All chunks in a specific array have the same shape and size. • Data point xi stored as a row, with an extra column for the outcome yi • Thus, the data set is represented as a bi-dimensional array, with n rows and d+1 columns.

  15. Graph data set • Row and columnar DBMS: edge table E(i, j, v) • Array DBMS: E as an n × n sparse array
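For concreteness, a minimal sketch of the edge table in SQL, following the E(i, j, v) schema above (the column types are assumptions):

CREATE TABLE E (
  i INT,   -- source vertex
  j INT,   -- destination vertex
  v FLOAT  -- edge weight (1 for unweighted graphs)
);

Only existing edges are stored as rows, which keeps the representation sparse: m rows instead of n × n cells.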

  16. Linear Models COMPUTATION with MATRIX MULTIPLICATION

  17. Gamma Matrix Γ = Z · Zᵀ
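Expanding the product, with Z built by augmenting X with a row of 1s and the row vector Y as in [1] (the block layout below is reconstructed from the statistics n, L and Q referenced on later slides):

\Gamma = Z Z^T =
\begin{bmatrix}
n & L^T & \sum_i y_i \\
L & Q & XY^T \\
\sum_i y_i & (XY^T)^T & YY^T
\end{bmatrix},
\qquad L = X\mathbf{1} = \sum_i x_i, \quad Q = XX^T = \sum_i x_i x_i^T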

  18. Models Computation • 2-step algorithm for PCA, LR, VS: one pass over the dataset. • Compute the summarization matrix (Gamma) in one pass. • Compute the models (PCA, LR, VS) using Gamma. • Preselection & 2-step algorithm for very high-dimensional VS (two passes); a preprocessing step is incorporated: • Compute partial Gamma and perform preselection. • Compute the summarization matrix (Gamma) in one pass. • Compute VS using Gamma.

  19. Models Computation • 2-step algorithm • Compute the summarization matrix Gamma in the DBMS (cluster, multiple nodes/cores). • Compute the model locally, exploiting Gamma and parallel matrix operations (LAPACK), using any programming language (e.g., R, C++, C#). • This approach was published in our work [1].

  20. First step: one-pass data set summarization • We introduced the Gamma Matrix in [1]. • The Gamma Matrix (or Γ) is a square matrix with d+2 rows and columns that contains a set of sufficient statistics, useful to compute several statistical indicators and models: • PCA, VS, LR, covariance/correlation matrices. • Computed in parallel with multiple cores or multiple nodes.

  21. Matrix Multiplication Z·Zᵀ • Parallel computation with a multicore CPU (single node) in one pass. • Aggregate UDFs are processed in parallel, in four phases (initialize, accumulate, merge, terminate), which enables multicore processing. • Initialize: variables set up • Accumulate: partial Gammas are calculated via vector products. • Merge: the final Gamma is computed by adding the partial Gammas. • Terminate: control returns to main processing. • Computation with LAPACK (main memory) • Computation with OpenMPI
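Although Γ is computed with an aggregate UDF in this work, the same one-pass summarization can be sketched with plain SQL aggregates; shown here for d = 2 over a table T(i, x1, x2, y) as assumed earlier (illustrative only, not the UDF implementation):

SELECT count(*)   AS n,
       sum(x1)    AS l1,  sum(x2)    AS l2,   -- L: sum of points
       sum(x1*x1) AS q11, sum(x1*x2) AS q12,  -- Q = X Xᵀ (upper triangle)
       sum(x2*x2) AS q22,
       sum(x1*y)  AS xy1, sum(x2*y)  AS xy2,  -- X Yᵀ
       sum(y)     AS sy,  sum(y*y)   AS syy
FROM T;

Since Γ is symmetric (q21 = q12), only the upper triangle needs to be computed, and the query scans T exactly once.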

  22. Matrix Multiplication Z·Zᵀ • Parallel computation with multiple nodes • Computation in a parallel array database. • Each worker can process with one or multiple cores. • Each core computes its own partial Gamma, using its own local data. • The master node receives the partial Gammas from the workers. • The master node computes the final Gamma with matrix addition.

  23. Gamma: Z · Zᵀ

  24. Models Computation • Contribution summary: • Enables the analysis of very high dimensional data sets in the DBMS. • Overcomes the problem of data sets larger than RAM (d < n). • 10s to 100s of times faster than the standard approach.

  25. PCA • Compute Γ, which contains n, L and Q • Compute ρ, solve the SVD of ρ, and select the k principal components
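The correlation matrix ρ is derived from Γ alone, with no further passes over X; a standard derivation from the sufficient statistics (a sketch consistent with [1]):

\mu = L/n, \qquad \Sigma = Q/n - \mu\mu^T, \qquad \rho_{ab} = \frac{\Sigma_{ab}}{\sqrt{\Sigma_{aa}\,\Sigma_{bb}}}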

  26. LR Computation
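Linear regression is likewise solved from Γ alone; under ordinary least squares, with X augmented by the row of 1s so the intercept is included (a sketch using the Γ blocks above):

X X^T \hat{\beta} = X Y^T \quad\Longrightarrow\quad \hat{\beta} = Q^{-1}(XY^T)

Both Q and XYᵀ are read directly out of Γ, so the fit reduces to a small (d+1) × (d+1) system solvable in main memory with LAPACK.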

  27. Variable Selection: Pre-selection + 2-Step Algorithm • Pre-selection • Based on marginal correlation ranking • Calculate the correlation between each variable and the outcome • Sort in descending order • Take the best d variables • The top d variables are considered for further analysis • Compute Γ, which contains Qγ and XγYᵀ • Iterate the Gibbs sampler for a sufficiently large number of iterations to explore the model space

  28. Optimizing the Gibbs Sampler • Non-conjugate Gaussian priors require the full Markov Chain. • Conjugate priors simplify the computation: • β, σ integrated out. • Marin & Robert formulation • Zellner g-prior for β and Jeffreys prior for σ

  29. PCA • DBMS: SciDB • System: local, 1 node, 2 instances • Dataset: KDDnet

  30. LR • DBMS: SciDB • System: local, 1 node, 2 instances • Dataset: KDDnet

  31. VS • DBMS: SciDB • System: local, 1 node, 2 instances • Dataset: Brain Cancer - miRNA

  32. MATRIX-VECTOR Computation IN PARALLEL DBMS

  33. Algorithms

  34. Matrix-vector multiplication with relational queries
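A minimal sketch of E · S as a relational query, using the schemas E(i, j, v) and S(i, v) introduced earlier; the form mirrors the matrix-matrix queries on slide 48 (the exact join direction varies per problem):

SELECT E.i, sum(E.v * S.v) AS v
FROM E JOIN S ON E.j = S.i
GROUP BY E.i;

The join picks up the current value of each edge's destination vertex and the GROUP BY produces one entry per vertex, so every iteration is one parallel join plus one parallel aggregation.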

  35. Optimizing the Parallel Join: Data Partitioning • Join locality: E and S are partitioned by hashing on the joining columns. • Sorted tables: a merge join is possible, with complexity O(n)
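For illustration, Greenplum-style DDL that co-partitions both tables on the joining columns (the DISTRIBUTED BY clause is an assumption; the syntax differs across parallel DBMSs):

CREATE TABLE E (i INT, j INT, v FLOAT) DISTRIBUTED BY (j);
CREATE TABLE S (i INT, v FLOAT) DISTRIBUTED BY (i);

Rows satisfying E.j = S.i then hash to the same worker, so the matrix-vector join runs on local data, with data movement deferred to the final aggregation.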

  36. Optimizing the Parallel Join: Data Partitioning • S is split into N chunks • E is split into N × N square chunks

  37. Handling Skewed Data • Chunk density for a social network data set on an 8-instance cluster. • Skewness results in an uneven data distribution (right). • Chunk density after repartitioning (left). • Edges per worker, before (right) and after (left) repartitioning.

  38. Unified Algorithm • The Unified Algorithm solves: • Reachability from a source vertex, SSSP • WCC, PageRank • Each is expressed as iterated matrix-vector multiplication under a suitable semiring, as sketched below.
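As a concrete instance, one relaxation step of SSSP under the (min, +) semiring, in the style of the slide 48 queries (a sketch; the result must still be merged with the current S by taking the minimum per vertex):

SELECT E.j AS i, min(S.v + E.v) AS v
FROM S JOIN E ON S.i = E.i
GROUP BY E.j;

Replacing min(S.v + E.v) with sum(S.v * E.v) gives a PageRank-style iteration (with edge weights holding transition probabilities, and damping handled separately), which is what makes the algorithm unified.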

  39. Data Partitioning in Array DBMS • Data is partitioned by chunks: ranges • Vector S is evenly partitioned across the cluster. • Sensitive to skewness • Redistribution using a mod function

  40. Experimental Validation • Time complexity close to linear • Compared with a classical optimization: replication of the smallest table

  41. Experimental Validation • Optimized Queries in array DBMS vs ScaLAPACK

  42. Comparing columnar vs array vs Spark

  43. Experimental Validation • Speed up with real data sets

  44. Matrix Powers

  45. Matrix Powers with recursive queries
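The connection, stated as a sketch: evaluating a linear recursive query to depth k accumulates powers of the adjacency matrix under a semiring,

R_k = E \oplus E^2 \oplus \cdots \oplus E^k

so transitive closure instantiates \oplus/\otimes with the (+, ×) (or boolean) semiring, and all pairs shortest paths with (min, +), matching the two queries on slide 48.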

  46. Recursive Queries

  47. Recursive Queries: Powers of a Matrix
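A minimal sketch with a standard SQL:1999 recursive CTE, here under the (min, +) semiring; the explicit depth bound k < 4 is an assumption for the example, needed so the recursion terminates on cyclic graphs:

WITH RECURSIVE R (i, j, v, k) AS (
  SELECT i, j, v, 1 FROM E             -- paths of length 1: E itself
  UNION ALL
  SELECT R.i, E.j, R.v + E.v, R.k + 1  -- extend each path by one edge: R·E
  FROM R JOIN E ON R.j = E.i
  WHERE R.k < 4
)
SELECT i, j, min(v) AS v               -- shortest distance over E, E², E³, E⁴
FROM R
GROUP BY i, j;

Each recursion level performs the join from slide 48; the semiring aggregation (min) is applied at the end, since SQL disallows aggregates inside the recursive term.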

  48. Matrix Multiplication with SQL Queries • Matrix-Matrix Multiplication, (+, ×) semiring: SELECT R.i, E.j, sum(R.v * E.v) FROM R JOIN E ON R.j = E.i GROUP BY R.i, E.j • Matrix-Matrix Multiplication, (min, +) semiring: SELECT R.i, E.j, min(R.v + E.v) FROM R JOIN E ON R.j = E.i GROUP BY R.i, E.j

  49. Data partitioning for parallel computation in Columnar DBMS

  50. Data partitioning for parallel computation in Array DBMS • Distributed storage of R, E in array DBMS
