1 / 48

Jan-Jan Wu Chao- Jui Chang

Big Data Management System. Jan-Jan Wu Chao- Jui Chang. Outline. Choices for DBMS Better DBMS SQL vs. NoSQL Introduction to Lab Research SQLMR HSQL Experiment Conclusion. A Better DBMS. High Performance. Management Tools. Better DBMS System. Data Fault Tolerant .

lora
Télécharger la présentation

Jan-Jan Wu Chao- Jui Chang

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Big Data Management System Jan-Jan Wu Chao-Jui Chang

  2. Outline • Choices for DBMS • Better DBMS • SQL vs. NoSQL • Introduction to Lab Research • SQLMR • HSQL • Experiment • Conclusion

  3. A Better DBMS High Performance Management Tools Better DBMS System Data Fault Tolerant NoSQL Scalability SQL Compliant What is a better DBMS system?

  4. A Better DBMS High Performance Management Tools E-Learning Web Applications Better DBMS System Information Retrieval LargeData Analysis Data Fault Tolerant NoSQL Scalability Customer Relation Management Data Mining SQL Compliant Data Warehouse Trend Analysis Application for DBMS

  5. NoSQL In computing, NoSQL (sometimes expanded to "not only SQL") is a broad class of database management systems that differ from the classic model of the relational database management system (RDBMS) in some significant ways. These data stores may not require fixed table schemas, usually avoid join operations, and typically scale horizontally.

  6. SQL vs. NoSQL SQL is more User Friendly NoSQL is More Scalable

  7. Normalization? SQL - Normalization NoSQL – De-Normalization Join Hello world Reply 1: Hi Reply 2: Hello

  8. Query Language SQL: SELECT * FROMtable WHEREcolumn > 10 NoSQL: (Hbase) Filter.setColumn(column); Filter.GreaterThan(10); Query.setFilter(Filter); Query.SCAN(table); SQL: SELECT cust_id, SUM(li.qty) as qty FROM orders o, order_lineitem li WHEREli.order_id = o.id GROUP BY cust_id NoSQL: (MongoDB) db.orders.aggregate( [ { $unwind: "$items" }, { $group: { _id: "$cust_id", qty: { $sum: "$items.qty" } } } ] ) Different

  9. The Goal of our research SQL User Friendly NoSQL Easy Scalable + A highly scalable & user friendly DBMS • Combine the advantage of • SQL User Friendly Interface • NoSQL Scalability

  10. Use case Business Process Periodic Archive order products Multiple Customers Shop-website Business Strategy Information Business Data Warehouse Data miming, Decision making Database Servers Transaction Data Shopping Website

  11. Our Solution Online Transaction Processing (HSQL) For multiple users Example: Shop website, Mass-order website Online AnalysisProcessing (SQLMR) For data analysis Example: Data-warehouse, Decision Support System, CRM + HDFS • According to different use scenario • We provide a hybrid system

  12. Use case Require quickly response tomultiple customers HSQL Business Process Periodic Archive order products Require large scaledata analysis SQLMR Multiple Customers Shop-website Business Strategy Information Business Data Warehouse Data miming, Decision making Hadoop Servers Transaction Data Shopping Website

  13. SQLMR

  14. SQLMR SQLMR : Online Analysis Processing It is for Online-Analysis-Processing(OLAP). Support large scale analytical data processing. Providing a query interface from SQL to Map-Reduce. Index and partition optimization

  15. SQLMR Architecture SQLMR : Online Analysis Processing

  16. SQLMR Architecture SQLMR : Online Analysis Processing

  17. Data Partition and Indexing SQLMR : Online Analysis Processing

  18. Experiment SQLMR : Online Analysis Processing • TPC-H • Standard Database Benchmark • Data size: 128GB • Compare with Hive • Another SQL to MapReduce Framework • Experiment • Aggregation • Join • FloatPoint

  19. Experiment 1/3 – TPC-H Benchmark – Aggregation SQLMR : Online Analysis Processing 資料過濾計算:查詢 ”大量資料中一定時間內”產品內容資訊並做出分析出訊息 select sum(l_quantity) as sum_qty, l_linestatus, sum(l_extendedprice) as sum_base_price, sum(l_extendedprice* (1 - l_discount) * (1 + l_tax)) as sum_disc_price, sum(l_extendedprice* (1 - l_discount) * (1 + l_tax)) as sum_charge, sum(l_discount),sum(l_tax), sum(l_quantity) as avg_qty, sum(l_extendedprice) as avg_price,sum(l_discount) as avg_disc From lineitem Where l_shipdate >= "1998-09-01" group by l_linestatus;

  20. Experiment 1/3 – TPC-H Benchmark – Aggregation  SQLMR is 6x faster then Hive SQLMR : Online Analysis Processing

  21. Experiment 2/3 – TPC-H Benchmark – Join SQLMR : Online Analysis Processing 高維度查詢:查詢”大量資料中一定時間內”所訂購的清單的詳細資料及優先順序 select o_orderpriority, count(*) from orders JOIN lineitem on ( orders.o_orderkey = lineitem.l_orderkey ) where o_orderdate >= '1993-07-01‘ and orders.o_orderdate < "1993-10-01" and lineitem.l_commitdate < lineitem.l_receiptdate group by o_orderpriority order by o_orderpriority

  22. Experiment 2/3 – TPC-H Benchmark – Join  SQLMR is 1.39x faster then Hive SQLMR : Online Analysis Processing

  23. Experiment 3/3 – TPC-H Benchmark – FloatPoint SQLMR : Online Analysis Processing 資料過濾計算:查詢”大量資料中一定時間內”產品內容資訊的折扣等小數運算做分析 select sum(l_extendedprice * l_discount) as revenue From lineitem Where l_shipdate >= date '1996-01-01' and l_shipdate < date '1996-01-01' + interval '1' year and l_discount between 0.05 - 0.01 and 0.05 + 0.01 and l_quantity < 25;

  24. Experiment 3/3 – TPC-H Benchmark – FloatPoint  SQLMR is 4.78x faster then Hive SQLMR : Online Analysis Processing

  25. HSQL

  26. HSQL HSQL : Online Transaction Processing It is for Online-Transaction-Processing (OLTP). Support massive multi-user concurrent queries. Combine the programming advantage of SQL and the scalability and fault tolerance of HBase. A distributed indexing scheme for HBase to offer more flexibility in indexing and improvement in performance and scalability.

  27. HBase Architecture HSQL : Online Transaction Processing

  28. HSQL Architecture HSQL : Online Transaction Processing

  29. API for HSQL HSQL : Online Transaction Processing • The user of HSQL can access the database by using familiar SQLlanguage. • User can create column index for arbitrary column by using HSQL B-tree API. • Use row key as primary key in Hbase • Multi-column index is allowable

  30. Supported Queries HSQL : Online Transaction Processing

  31. Optimization A distributed indexing scheme for HBase to offer more flexibility in indexing and improvement in performance and scalability.

  32. Row-Key Based Index in HBase Index by row key ! HSQL : Online Transaction Processing • Pros: • Simple and efficient • Cons: • Incur performance degradation for data queries on column fields. • Solution: • Secondary index

  33. Table-Based Secondary Index • Generate too many tables • Need to process two queries to locate the data • Client needs to take care of index updating HSQL : Online Transaction Processing

  34. Coprocessor-Based Secondary Index HSQL : Online Transaction Processing Support indexing for arbitrary, non-row-key columns. Support combined indexing for multiple columns. Perform data search in a parallel, distributed and scalable way.

  35. Parallel Query Processing with Coprocessors HSQL : Online Transaction Processing

  36. Coprocessor-Based Distributed B-tree Column Indexing HSQL : Online Transaction Processing Each coprocessor in a region server hosts a B-tree for its portion of the data table. The coprocessors search the B-trees concurrently.

  37. Experimental Result HSQL : Online Transaction Processing • Experiment setting • Use SysBench as benchmarks. • Table size: 5GB~40GB (12millions~100millions records) • Evaluate • Point query • Update • Aggregation • concurrent queries

  38. Response Time of Point Select 4.8 ms 0. 13 ms HSQL improves HBase (table-based index) by 37X. MySQL response time is very close to HSQL. HSQL : Online Transaction Processing SELECT c FROM sbtest WHERE id = N

  39. Response Time of Point Update HBaseresponse time > 2ms (table-based index) HSQL improves HBaseby 8X and is 1.15X faster than MySQL. HSQL : Online Transaction Processing UPDATE sbtest SET k=k+1 WHERE id = N

  40. Response Time of Aggregation Query SELECT SUM(a*(1-c)*d) FROM sbtest WHERE id BETWEEN m AND n HSQL improve Hbase by 12.2X with 1% of data HSQL : Online Transaction Processing

  41. Average Response Time of Concurrent Point Queries HSQL improves Hbase by 2X, and is 1.43X faster than MySQL. HSQL : Online Transaction Processing

  42. Conclusion • We provide a user friendly and high scalable database management system • SQLMR for large scale data analysis • HSQL for multiple user processing • The improvement ratios are expected to increase with data size.

  43. Conclusion HSQL : Online Transaction Processing Coprocessor-based distributed B-tree index can support fast parallel B-tree search for non-row-key columns. The experiments on a not-so-big database show an average of 6.2X speed up for range query and 12.2X speed up for aggregation query over HBase. The improvement ratios are expected to increase with data size.

  44. Compare with HadoopDB/MySQL Sysbench - SUM

  45. Parallel Query Processing with Coprocessors (1) HSQL : Online Transaction Processing HBase is a row-based, column-family distributed database. A data table is sorted by row key and split into tablets. The tablets are distributed over region servers. Region server can use Coprocessors to facilitate parallel, user-defined query processing.

  46. Query Translation An Example of SQL Query to HBase Operations HSQL : Online Transaction Processing

  47. Response Time of Range Query SELECT c FROM sbtest WHERE id BETWEEN m AND n HSQL improves HBase by 6.2X with 1% of data. HSQL : Online Transaction Processing

More Related