
A Perfect Hybrid



Presentation Transcript


  1. A Perfect Hybrid: Split Query Processing in Polybase. biaobiaoqi, shenyapeng@gmail.com, 2013/4/25

  2. Outline • Background • Related Work • Overview of Polybase • Architecture of PDW • Polybase Implementation • Performance Evaluation

  3. Background • Structured data & unstructured data • RDBMS & Big Data • [Diagram: combining RDBMS and Hadoop yields insight]

  4. Related Work • Sqoop: transfers bulk data between Hadoop and structured data stores such as relational databases • Teradata & Aster Data • Greenplum & Vertica: external tables • Oracle: external tables and OLH (Oracle Loader for Hadoop) • IBM: a split mechanism that uses MapReduce to access the appliance • Hadapt (HadoopDB): designed from the outset to support SQL-like queries across both unstructured and structured data sets

  5. Overview of Polybase • A new feature in PDW V2 • Uses standard SQL • Handles both structured and unstructured data (in SQL Server and Hadoop) • Leverages the capabilities of SQL Server PDW, especially its cost-based parallel query optimizer and execution engine

  6. PDW Architecture • Parallel Data Warehouse • Shared-nothing system

  7. Components in PDW • Node: a SQL Server instance runs on each node; data are hash-partitioned across the compute nodes • Control node [the PDW Engine runs here]: • query parsing • optimization • creating the distributed execution plan (DSQL) for the compute nodes • tracking the execution of plan steps on the compute nodes • Compute node: • storage • query processing • DMS (Data Movement Service): • (1) repartitions rows of a table among the SQL Server instances on PDW compute nodes • (2) converts fields of rows being loaded into the appliance into the appropriate ODBC types
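DMS's repartitioning step (1) amounts to hashing each row's distribution key to a compute-node index. The sketch below is illustrative only: the class and method names are hypothetical, and PDW's actual hash function is not documented here.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of DMS-style hash repartitioning (not PDW's actual hash).
public class HashRepartition {
    // Map a distribution-key value to one of numNodes compute nodes.
    static int targetNode(Object key, int numNodes) {
        // floorMod keeps the bucket index non-negative even for negative hash codes
        return Math.floorMod(key.hashCode(), numNodes);
    }

    // Split rows (arrays of ints) into one bucket per compute node by the key column.
    static List<List<int[]>> repartition(List<int[]> rows, int keyCol, int numNodes) {
        List<List<int[]>> buckets = new ArrayList<>();
        for (int i = 0; i < numNodes; i++) buckets.add(new ArrayList<>());
        for (int[] row : rows) {
            buckets.get(targetNode(row[keyCol], numNodes)).add(row);
        }
        return buckets;
    }
}
```

Because the same key always hashes to the same node, co-partitioned tables can later be joined without further data movement.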

  8. Components in Polybase • External table • makes data in Hadoop visible to SQL Server • definition includes: namenode, namenode port, jobtracker, jobtracker port, input format class, and the table's column types • HDFS Bridge in DMS • transfers data in parallel between the nodes of the Hadoop and PDW clusters • Cost-based query optimizer • wraps the optimizer from PDW Version 1

  9. HDFS Bridge

  10. HDFS Bridge • HDFS shuffle phase (reading data from Hadoop): • 1. Communicate with the namenode to get file metadata • 2. Balance the number of bytes read by each DMS instance (based on the HDFS metadata and the DMS instance count) • 3. Invoke openRecordReader() to obtain a RecordReader instance that communicates directly with the datanodes • 4. Read the data and convert it into ODBC types (may be done in a MapReduce job) • 5. Apply a hash function to determine the target node for each record • Writing to Hadoop is almost the same
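Step 2 above boils down to splitting a file's byte range as evenly as possible across DMS instances. A minimal sketch under simplifying assumptions (names are hypothetical; real HDFS splits would also be aligned to block and record boundaries):

```java
// Illustrative even split of a file's byte range among n DMS instances.
public class SplitBalancer {
    // Returns {offset, length} for instance i of n over a file of totalBytes.
    // The first (totalBytes % n) instances each take one extra byte.
    static long[] assign(long totalBytes, int n, int i) {
        long base = totalBytes / n;
        long rem = totalBytes % n;
        long offset = i * base + Math.min(i, rem);
        long length = base + (i < rem ? 1 : 0);
        return new long[]{offset, length};
    }
}
```

The ranges are contiguous and cover the whole file, so no instance reads more than one byte over its fair share.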

  11. Query Optimizer • Parsing • A Memo data structure of alternative serial plans • Parallel optimization [in PDW V1] • a bottom-up optimizer inserts data movement operators into the serial plans • Cost-based query optimizer [decides whether to push work to Hadoop] • based on: • statistics • the relative sizes of the two clusters, and other factors

  12. Statistics • Define a statistics table for an external table • CREATE STATISTICS hdfsCustomerStats ON hdfsCustomer (c_custkey); • Steps to obtain statistics on HDFS data: • 1. Read a block-level sample of the data via DMS or map jobs • 2. Partition the samples across the compute nodes • 3. Each node calculates a histogram on its portion • 4. Merge all histograms and store the result in the database catalog • An alternative implementation • In Hadoop V2, let the Hadoop cluster compute the histograms (can be costly) • makes better use of the Hadoop cluster's computational resources
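Step 4 above merges the per-node histograms into one. Assuming every node produced equi-width histograms with the same bucket boundaries (an assumption made here for illustration; the class name is hypothetical), the merge is just a bucket-wise sum:

```java
// Merge per-node histograms by summing aligned bucket counts.
// Assumes every node used the same bucket boundaries (illustrative only).
public class HistogramMerge {
    static long[] merge(long[][] perNode) {
        long[] merged = new long[perNode[0].length];
        for (long[] h : perNode)
            for (int b = 0; b < h.length; b++)
                merged[b] += h[b];
        return merged;
    }
}
```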

  13. Else • Semantic compatibility • Data types • Java primitive types • non-primitive types • third-party types that can be implemented • types that cannot be implemented in Java are marked [they can only be processed in PDW] • SQL semantics • evaluation of expressions: implemented in Java • returning null: e.g. A+B becomes (A==null || B==null) ? null : (A+B) • expressions that cannot be implemented in Java are marked [they can only be processed in PDW] • Error handling • exceptions that SQL would raise should also be thrown in Java
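The null-propagation rewrite quoted above maps directly onto Java's boxed types, where null stands in for SQL NULL. A minimal sketch (the class name is hypothetical):

```java
// SQL null semantics for A+B: the result is NULL if either operand is NULL.
public class SqlExpr {
    static Integer sqlAdd(Integer a, Integer b) {
        // Boxed Integer lets us represent SQL NULL as Java null.
        return (a == null || b == null) ? null : a + b;
    }
}
```

Using primitive int here would be wrong: it cannot represent NULL, which is exactly why the generated Java must use boxed types for nullable columns.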

  14. Example • SELECT count(*) FROM Customer WHERE acctbal < 0 GROUP BY nationkey

  15. Optimized Query Plan #1

  16. Optimized Query Plan #2

  17. MapReduce Join • Distributed hash join • supports equi-joins • Implementation: • build side: the side with the smaller amount of data; it is materialized in HDFS • probe side: the other side • the build side is partitioned so each partition fits in memory, to speed up the join • the build side may also be replicated
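The build/probe structure described above can be sketched in a few lines. This is a single-partition, in-memory illustration (class and method names are hypothetical), not the distributed implementation: each row is an int array and the join key is a column index.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative build/probe phases of an in-memory hash equi-join.
public class HashJoin {
    // Build phase: hash the smaller (build) side on its key column.
    static Map<Integer, List<int[]>> build(List<int[]> buildRows, int keyCol) {
        Map<Integer, List<int[]>> table = new HashMap<>();
        for (int[] r : buildRows)
            table.computeIfAbsent(r[keyCol], k -> new ArrayList<>()).add(r);
        return table;
    }

    // Probe phase: stream the larger side and emit matching (build, probe) pairs.
    static List<int[][]> probe(Map<Integer, List<int[]>> table, List<int[]> probeRows, int keyCol) {
        List<int[][]> out = new ArrayList<>();
        for (int[] p : probeRows)
            for (int[] b : table.getOrDefault(p[keyCol], List.of()))
                out.add(new int[][]{b, p});
        return out;
    }
}
```

Keeping the build side in memory is what makes the probe a constant-time lookup per row, which is why the smaller input is chosen as the build side.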

  18. Performance Evaluation • Test configurations: • C-16/48: 16-node PDW cluster, 48-node Hadoop cluster • C-30/30: 30-node PDW cluster, 30-node Hadoop cluster • C-60: 60-node PDW cluster and 60-node Hadoop cluster • Test database: • two identical tables T1 and T2 • 10 billion rows each • 13 integer attributes and 3 string attributes (~200 bytes/row) • about 2 TB uncompressed • one copy of each table in HDFS • HDFS block size of 256 MB • stored as a compressed RCFile • RCFiles store rows "column-wise" inside a block • one copy of each table in PDW • block-wise compression enabled

  19. Selection on HDFS Table • SELECT u1, u2, u3, str1, str2, str4 FROM T1 (in HDFS) WHERE (u1 % 100) < sf • [Chart comparing Polybase Phase 1 (PB.1) and Polybase Phase 2 execution times across selectivity factors; crossover point: above a selectivity factor of ~80%, PB Phase 2 is slower]

  20. Join HDFS Table with PDW Table • SELECT * FROM T1 (HDFS), T2 (PDW) WHERE T1.u1 = T2.u2 AND (T1.u2 % 100) < sf AND (T2.u2 % 100) < 50 • [Chart comparing Polybase Phase 1 (PB.1) and Polybase Phase 2 execution times]

  21. Join Two HDFS Tables • SELECT * FROM T1 (HDFS), T2 (HDFS) WHERE T1.u1 = T2.u2 AND (T1.u2 % 100) < SF AND (T2.u2 % 100) < 10 • Plans compared: • PB.1: all operators on PDW • PB.2P: selections on T1 and T2 pushed to Hadoop, join performed on PDW • PB.2H: selections and join on Hadoop • [Chart comparing the three plans across selectivity factors]

  22. Performance Wrap-up • Split query processing really works! • Up to 10X performance improvement! • A cost-based optimizer is clearly required to decide when an operator should be pushed • Optimizer must also incorporate relative cluster sizes in its decisions

  23. References • Split Query Processing in Polybase (SIGMOD '13, June 22-27, 2013, New York, NY, USA), Microsoft Corporation • Polybase: What, Why, How (slides), Microsoft Corporation • Query Optimization in Microsoft SQL Server PDW (SIGMOD '12, May 20-24, 2012, Scottsdale, AZ, USA), Microsoft Corporation

  24. THANKS
