Jan-Jan Wu Chao- Jui Chang

Big Data Management System Jan-Jan Wu Chao-Jui Chang

Outline • Choices for DBMS • Better DBMS • SQL vs. NoSQL • Introduction to Lab Research • SQLMR • HSQL • Experiment • Conclusion

A Better DBMS High Performance Management Tools Better DBMS System Data Fault Tolerant NoSQL Scalability SQL Compliant What is a better DBMS system?

A Better DBMS High Performance Management Tools E-Learning Web Applications Better DBMS System Information Retrieval LargeData Analysis Data Fault Tolerant NoSQL Scalability Customer Relation Management Data Mining SQL Compliant Data Warehouse Trend Analysis Application for DBMS

NoSQL In computing, NoSQL (sometimes expanded to "not only SQL") is a broad class of database management systems that differ from the classic model of the relational database management system (RDBMS) in some significant ways. These data stores may not require fixed table schemas, usually avoid join operations, and typically scale horizontally.

SQL vs. NoSQL SQL is more User Friendly NoSQL is More Scalable

Normalization? SQL - Normalization NoSQL – De-Normalization Join Hello world Reply 1: Hi Reply 2: Hello

Query Language SQL: SELECT * FROMtable WHEREcolumn > 10 NoSQL: (Hbase) Filter.setColumn(column); Filter.GreaterThan(10); Query.setFilter(Filter); Query.SCAN(table); SQL: SELECT cust_id, SUM(li.qty) as qty FROM orders o, order_lineitem li WHEREli.order_id = o.id GROUP BY cust_id NoSQL: (MongoDB) db.orders.aggregate( [ { $unwind: "$items" }, { $group: { _id: "$cust_id", qty: { $sum: "$items.qty" } } } ] ) Different

The Goal of our research SQL User Friendly NoSQL Easy Scalable + A highly scalable & user friendly DBMS • Combine the advantage of • SQL User Friendly Interface • NoSQL Scalability

Use case Business Process Periodic Archive order products Multiple Customers Shop-website Business Strategy Information Business Data Warehouse Data miming, Decision making Database Servers Transaction Data Shopping Website

Our Solution Online Transaction Processing (HSQL) For multiple users Example: Shop website, Mass-order website Online AnalysisProcessing (SQLMR) For data analysis Example: Data-warehouse, Decision Support System, CRM + HDFS • According to different use scenario • We provide a hybrid system

Use case Require quickly response tomultiple customers HSQL Business Process Periodic Archive order products Require large scaledata analysis SQLMR Multiple Customers Shop-website Business Strategy Information Business Data Warehouse Data miming, Decision making Hadoop Servers Transaction Data Shopping Website

SQLMR

SQLMR SQLMR : Online Analysis Processing It is for Online-Analysis-Processing(OLAP). Support large scale analytical data processing. Providing a query interface from SQL to Map-Reduce. Index and partition optimization

SQLMR Architecture SQLMR : Online Analysis Processing

Data Partition and Indexing SQLMR : Online Analysis Processing

Experiment SQLMR : Online Analysis Processing • TPC-H • Standard Database Benchmark • Data size: 128GB • Compare with Hive • Another SQL to MapReduce Framework • Experiment • Aggregation • Join • FloatPoint

Experiment 1/3 – TPC-H Benchmark – Aggregation SQLMR : Online Analysis Processing 資料過濾計算：查詢 ”大量資料中一定時間內”產品內容資訊並做出分析出訊息 select sum(l_quantity) as sum_qty, l_linestatus, sum(l_extendedprice) as sum_base_price, sum(l_extendedprice* (1 - l_discount) * (1 + l_tax)) as sum_disc_price, sum(l_extendedprice* (1 - l_discount) * (1 + l_tax)) as sum_charge, sum(l_discount),sum(l_tax), sum(l_quantity) as avg_qty, sum(l_extendedprice) as avg_price,sum(l_discount) as avg_disc From lineitem Where l_shipdate >= "1998-09-01" group by l_linestatus;

Experiment 1/3 – TPC-H Benchmark – Aggregation  SQLMR is 6x faster then Hive SQLMR : Online Analysis Processing

Experiment 2/3 – TPC-H Benchmark – Join SQLMR : Online Analysis Processing 高維度查詢：查詢”大量資料中一定時間內”所訂購的清單的詳細資料及優先順序 select o_orderpriority, count(*) from orders JOIN lineitem on ( orders.o_orderkey = lineitem.l_orderkey ) where o_orderdate >= '1993-07-01‘ and orders.o_orderdate < "1993-10-01" and lineitem.l_commitdate < lineitem.l_receiptdate group by o_orderpriority order by o_orderpriority

Experiment 2/3 – TPC-H Benchmark – Join  SQLMR is 1.39x faster then Hive SQLMR : Online Analysis Processing

Experiment 3/3 – TPC-H Benchmark – FloatPoint SQLMR : Online Analysis Processing 資料過濾計算：查詢”大量資料中一定時間內”產品內容資訊的折扣等小數運算做分析 select sum(l_extendedprice * l_discount) as revenue From lineitem Where l_shipdate >= date '1996-01-01' and l_shipdate < date '1996-01-01' + interval '1' year and l_discount between 0.05 - 0.01 and 0.05 + 0.01 and l_quantity < 25;

Experiment 3/3 – TPC-H Benchmark – FloatPoint  SQLMR is 4.78x faster then Hive SQLMR : Online Analysis Processing

HSQL

HSQL HSQL : Online Transaction Processing It is for Online-Transaction-Processing (OLTP). Support massive multi-user concurrent queries. Combine the programming advantage of SQL and the scalability and fault tolerance of HBase. A distributed indexing scheme for HBase to offer more flexibility in indexing and improvement in performance and scalability.

HBase Architecture HSQL : Online Transaction Processing

HSQL Architecture HSQL : Online Transaction Processing

API for HSQL HSQL : Online Transaction Processing • The user of HSQL can access the database by using familiar SQLlanguage. • User can create column index for arbitrary column by using HSQL B-tree API. • Use row key as primary key in Hbase • Multi-column index is allowable

Supported Queries HSQL : Online Transaction Processing

Optimization A distributed indexing scheme for HBase to offer more flexibility in indexing and improvement in performance and scalability.

Row-Key Based Index in HBase Index by row key ! HSQL : Online Transaction Processing • Pros: • Simple and efficient • Cons: • Incur performance degradation for data queries on column fields. • Solution: • Secondary index

Table-Based Secondary Index • Generate too many tables • Need to process two queries to locate the data • Client needs to take care of index updating HSQL : Online Transaction Processing

Coprocessor-Based Secondary Index HSQL : Online Transaction Processing Support indexing for arbitrary, non-row-key columns. Support combined indexing for multiple columns. Perform data search in a parallel, distributed and scalable way.

Parallel Query Processing with Coprocessors HSQL : Online Transaction Processing

Coprocessor-Based Distributed B-tree Column Indexing HSQL : Online Transaction Processing Each coprocessor in a region server hosts a B-tree for its portion of the data table. The coprocessors search the B-trees concurrently.

Experimental Result HSQL : Online Transaction Processing • Experiment setting • Use SysBench as benchmarks. • Table size: 5GB~40GB (12millions~100millions records) • Evaluate • Point query • Update • Aggregation • concurrent queries

Response Time of Point Select 4.8 ms 0. 13 ms HSQL improves HBase (table-based index) by 37X. MySQL response time is very close to HSQL. HSQL : Online Transaction Processing SELECT c FROM sbtest WHERE id = N

Response Time of Point Update HBaseresponse time > 2ms (table-based index) HSQL improves HBaseby 8X and is 1.15X faster than MySQL. HSQL : Online Transaction Processing UPDATE sbtest SET k=k+1 WHERE id = N

Response Time of Aggregation Query SELECT SUM(a*(1-c)*d) FROM sbtest WHERE id BETWEEN m AND n HSQL improve Hbase by 12.2X with 1% of data HSQL : Online Transaction Processing

Average Response Time of Concurrent Point Queries HSQL improves Hbase by 2X, and is 1.43X faster than MySQL. HSQL : Online Transaction Processing

Conclusion • We provide a user friendly and high scalable database management system • SQLMR for large scale data analysis • HSQL for multiple user processing • The improvement ratios are expected to increase with data size.

Conclusion HSQL : Online Transaction Processing Coprocessor-based distributed B-tree index can support fast parallel B-tree search for non-row-key columns. The experiments on a not-so-big database show an average of 6.2X speed up for range query and 12.2X speed up for aggregation query over HBase. The improvement ratios are expected to increase with data size.

Compare with HadoopDB/MySQL Sysbench - SUM

Parallel Query Processing with Coprocessors (1) HSQL : Online Transaction Processing HBase is a row-based, column-family distributed database. A data table is sorted by row key and split into tablets. The tablets are distributed over region servers. Region server can use Coprocessors to facilitate parallel, user-defined query processing.

Query Translation An Example of SQL Query to HBase Operations HSQL : Online Transaction Processing

Response Time of Range Query SELECT c FROM sbtest WHERE id BETWEEN m AND n HSQL improves HBase by 6.2X with 1% of data. HSQL : Online Transaction Processing

Jan-Jan Wu Chao- Jui Chang

Jan-Jan Wu Chao- Jui Chang

Presentation Transcript

Sales Guide V1.00 Andy Wu Jan, 2012

Jan. 26

Jan 29

Jan

Jan sê : “ (Jan)

JAN

Jan

Jan

Jan. 2013