
ES2: A Cloud Storage for Supporting both OLTP and OLAP


Presentation Transcript


  1. ES2: A Cloud Storage for Supporting both OLTP and OLAP Yu Cao, Chun Chen, Fei Guo, Dawei Jiang, Yuting Lin, Beng Chin Ooi, Hoang Tam Vo, Sai Wu, Quanqing Xu

2. “NoSQL” vs. “CloudDB”
• Mixing ad-hoc requirements with trying to find/develop open-source systems: Amazon SimpleDB, BigTable, Azure Table Storage, HBase, Cassandra, PNUTS, Scalaris, Hypertable, InfoGrid, SciDB, HypergraphDB, MemcacheDB, LightCloud, Objectivity, Perst, GenieDB, CouchDB, KAI
• Knowing the targets and building proprietary systems: MongoDB, Dynamo, Voldemort, Dynomite
• … and variants (http://nosql-database.org)

3. System Design by Choice/Need
[Diagram: today's systems split by workload. OLTP & updates from Web 2.0 applications go to DFS-based databases (HBase, Cassandra, Bigtable, …) or to a DBMS with its own QP and storage engines, while OLAP queries go to a data warehouse with processing engines (MR, Dryad, Hive, Pig, …), fed from files and streams/query results through an ETL reader.]
Problems: 1. Data freshness / real-time search. 2. Storage / investment cost.

4. Challenges from applications
• Real-time analysis
• Updatable warehouse
• Database as a Service (DaaS)
[Diagram: an available-to-promise example. A new order arrives, stock levels are aggregated to answer “available to promise?”; if yes, the order is placed, if no, a supplier request is issued.]
Challenge 1: how to support both OLTP and OLAP within the same storage and processing engine.
Challenge 2: how to provide functions similar to those of centralized DBMSes, such as indexes, in a scalable/elastic cloud environment.

5. Design of Cloud Data Management Systems
• Goals: scalable DaaS; integrate OLTP and OLAP into one data storage and processing system without loss of performance
• Scalable OLTP: redesign the OLTP module to tackle huge volumes of insertions/deletions; low latency is important
• Scalable OLAP: redesign the OLAP module to perform cloud-scale data analysis, with fault-tolerance support in query processing

6. epiC: elastic power-aware data-intensive Cloud
• One data management system instance: shared-nothing structure, running on all nodes
• Integrate OLTP and OLAP: OLTP and OLAP are separate modules (not separate systems) that share the same storage layer
• Dispatch workload based on query type (see the sketch below)
[Diagram: users/businesses send queries to a Query Dispatcher in front of the data management system, whose OLTP and OLAP modules sit on the shared storage layer (ES2), running on virtual machines in the cloud environment.]
More info at the epiC project: http://www.comp.nus.edu.sg/~epiC
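To make the dispatching step concrete, here is a minimal Python sketch of routing by query type; the Query class, the is_analytical heuristic, and the engine objects are hypothetical stand-ins for illustration, not epiC's actual interfaces.

from dataclasses import dataclass

@dataclass
class Query:
    sql: str

def is_analytical(q: Query) -> bool:
    # Crude heuristic: aggregation-heavy statements go to the OLAP module.
    s = q.sql.lower()
    return any(kw in s for kw in ("group by", "sum(", "avg(", "count("))

def dispatch(q: Query, oltp_engine, olap_engine):
    # OLTP and OLAP are modules of one system sharing the ES2 storage layer,
    # not two separate systems connected by an ETL pipeline.
    engine = olap_engine if is_analytical(q) else oltp_engine
    return engine.execute(q)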

7. ES2 – an elastic cloud storage system
• Key features:
• Elastic scaling
• Hybrid storage, supporting both OLTP and OLAP
• Flexible data partitioning based on the database workload
• Load-adaptive replication
• Transactional semantics for bundled updates
• DBMS-like index functionality: multiple indexes of different types, e.g. hash, range, multi-dimensional, and bitmap indexes
• Comparisons to other systems: Cassandra, PNUTS, and MegaStore
Towards Elastic Transactional Cloud Storage with Range Query Support. H. T. Vo, B. C. Ooi, C. Chen. PVLDB 3(1): 506-517 (2010).

8. Architecture of ES2
[Diagram: data requests enter through Data Access Control (Data Access Interface and Data Manipulator), which returns results; Data Import Control (Import Manager and Write Cache) ingests data from plain files, databases, and applications; both layers rely on the Meta-data Catalog, Distributed Indexing, and the Storage Components on top of a Distributed File System.]

9. Hybrid Data Partitioning
[Diagram: a logical table Emp(id, name, age, salary, dept) is split into horizontal partitions (rows {1, 2} and rows {3, 4, 5}) and vertical partitions; the (id, name, age) fragments hold (1, 2 | Alice, Fred | 32, 49) and (3, 4, 5 | Malice, Fred, Smith | 37, 24, 30), while the (id, salary, dept) fragments hold (1, 2 | 2.5, 3 | HR, FI) and (3, 4, 5 | 4, 3.5, 6 | MA, FI, HR); each fragment is stored in the PAX layout.]
Workload trace:
select name from Emp where age > 35
select avg(salary) from Emp group by dept
update Emp set salary = 4K where id = 4 …
[PAX] Weaving Relations for Cache Performance. A. Ailamaki, D. J. DeWitt, M. D. Hill, M. Skounakis. VLDB 2001: 169-180.

10. Hybrid Data Partitioning
• The rationales:
• Vertical partitioning: group frequently accessed columns together to minimize disk I/Os and improve query performance
• Horizontal partitioning: facilitate parallelism and minimize the number of distributed transactions
• PAX storage layout: a cache-conscious layout that improves CPU cache hits and OLAP performance (see the sketch below)
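To illustrate the PAX idea, here is a minimal Python sketch assuming fixed-size pages; PaxPage and its methods are illustrative names rather than ES2 code, and the sample rows reuse the Emp data from the previous slide.

class PaxPage:
    """Rows of one horizontal partition share a page, but inside the page
    each column lives in its own contiguous 'minipage' (a list here), which
    is what improves CPU cache locality for column-oriented OLAP scans."""

    def __init__(self, columns):
        self.minipages = {c: [] for c in columns}  # one minipage per column

    def insert(self, record: dict):
        for col, minipage in self.minipages.items():
            minipage.append(record[col])

    def scan_column(self, col):
        # A scan touching one column reads a single contiguous minipage.
        return self.minipages[col]

page = PaxPage(["id", "name", "age", "salary", "dept"])
page.insert({"id": 1, "name": "Alice", "age": 32, "salary": 2.5, "dept": "HR"})
page.insert({"id": 2, "name": "Fred", "age": 49, "salary": 3.0, "dept": "FI"})
print(page.scan_column("age"))  # [32, 49]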

11. Distributed Indexing
• Why index? OLTP queries have high selectivity and low latency expectations, while a cloud storage system stores a huge volume of data; a parallel scan would read 1 TB of data just to retrieve 10 tuples.
• Why distributed? A central index server may become a bottleneck; distribution facilitates parallelism and load balancing.
• Objectives: provide DBMS-like indexes and multiple indexes of different types (hash, range, multi-dimensional, bitmap); extensibility of indexes is a separate research issue.

12. Idea from P2P Networks
• Each cluster node acts as a peer in the P2P overlay and maintains “local” indexes such as B+-trees and R-trees.
• Index building: when data are imported, the index entries are published to the different indexes based on P2P routing protocols (sketched below).
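A hedged Python sketch of that publishing step, assuming a hash-partitioned overlay; Peer, responsible_peer, and publish are hypothetical names, and a plain dict stands in for each peer's local B+-tree.

import hashlib

class Peer:
    def __init__(self, name):
        self.name = name
        self.local_index = {}  # stand-in for a local B+-tree or R-tree

    def insert_entry(self, key, locator):
        self.local_index.setdefault(key, []).append(locator)

def responsible_peer(key, peers):
    # Chord-like placement: hash the key onto the ring of peers.
    h = int(hashlib.sha1(str(key).encode()).hexdigest(), 16)
    return peers[h % len(peers)]

def publish(entries, peers):
    # On data import, route each (key -> record locator) entry to the peer
    # responsible for the key, which inserts it into its local index.
    for key, locator in entries:
        responsible_peer(key, peers).insert_entry(key, locator)

peers = [Peer(f"node{i}") for i in range(4)]
publish([(42, "node0:page7:slot3"), (17, "node2:page1:slot9")], peers)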

13. Challenges of Distributed Indexes
[Diagram: a query Q(x, y) is routed across overlay peers A–F; keys are mapped to peers via h(key1) and h(key2).]
• Different overlays are required to support different types of indexes: BATON for B+-trees [1], CAN for R-trees [2], Chord for hash indexes
• Overlay routing and maintenance cost is too high
• Load balancing issues
1. Efficient B+-tree Based Indexing for Cloud Data Processing. S. Wu, D. Jiang, B. C. Ooi, K. L. Wu. VLDB 2010.
2. Indexing Multi-dimensional Data in a Cloud System. J. Wang, S. Wu, H. Gao, J. Li, B. C. Ooi. ACM SIGMOD 2010.

14. Distributed Index Architecture
• Optimizations:
• Index + base table vs. index-only plans: materialize a portion of the data record inside the index entry (see the sketch below)
• Adaptive network connections
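A small Python sketch of the index-only idea, under the assumption that an index entry can carry a few materialized hot columns next to the record locator; all names here are illustrative, not ES2's API.

def lookup(index, key, needed_cols, fetch_base_record):
    entry = index[key]  # {"locator": ..., "cols": {materialized columns}}
    if needed_cols <= entry["cols"].keys():
        # Index-only plan: the query is answered without touching the base table.
        return {c: entry["cols"][c] for c in needed_cols}
    # Index + base table plan: follow the locator to fetch the full record.
    return fetch_base_record(entry["locator"])

index = {42: {"locator": "node0:page7:slot3",
              "cols": {"name": "Alice", "age": 32}}}
print(lookup(index, 42, {"age"}, fetch_base_record=lambda loc: None))  # {'age': 32}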

15. Data Access Optimizer
[Diagram: given a query Q, the optimizer estimates the data access cost of a parallel scan (c_pscan) versus an index scan (c_iscan) and emits the cheaper data access plan.]
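A minimal sketch of that choice in Python, with a deliberately simple, made-up cost model; choose_plan and its constants are assumptions for intuition, not ES2's actual estimator.

def choose_plan(num_pages, est_matches, probe_cost=3.0, fetch_cost=1.0,
                parallelism=8):
    c_pscan = num_pages / parallelism                # parallel scan of all pages
    c_iscan = probe_cost + est_matches * fetch_cost  # index probe + fetches
    return "index scan" if c_iscan < c_pscan else "parallel scan"

print(choose_plan(num_pages=100_000, est_matches=10))      # index scan
print(choose_plan(num_pages=100_000, est_matches=50_000))  # parallel scan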

16. BIDS: Bitmap Index for Database Service in the Cloud
• Challenge of supporting a large number of indexes: the index data themselves become large
• BIDS keeps the index size compact
• BIDS supports a wider range of queries: if a query involves only indexed attributes, it can be answered completely from the indexes

17. Query Processing with BIDS
For TPC-H Q6:

SELECT sum(extendedprice * discount) AS revenue
FROM Lineitem
WHERE shipdate >= x AND shipdate < x + 1 year
  AND discount >= y AND discount < y + 0.02
  AND quantity < z

• Retrieve the following bitmap indexes:
• B1: x <= shipdate < x + 1 year
• B2: y <= discount < y + 0.02
• B3: quantity < z
• B4: bitmap index for extendedprice
• B5: bitmap index for discount
• Filter = B1 AND B2 AND B3
• For tuples that pass the filter, compute extendedprice * discount via B4 and B5 (illustrated below)
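A toy Python illustration of the filter-then-aggregate evaluation above; the five-tuple bitmaps and value arrays are made up for the example, and plain arrays stand in for the bit-sliced indexes B4 and B5.

b1 = [1, 1, 0, 1, 0]  # x <= shipdate < x + 1 year
b2 = [1, 0, 0, 1, 1]  # y <= discount < y + 0.02
b3 = [1, 1, 1, 0, 1]  # quantity < z
extendedprice = [100.0, 80.0, 120.0, 90.0, 60.0]
discount = [0.05, 0.04, 0.06, 0.05, 0.03]

# Filter = B1 AND B2 AND B3, evaluated purely on the bitmaps.
filt = [a & b & c for a, b, c in zip(b1, b2, b3)]
# Only tuples passing the filter contribute extendedprice * discount.
revenue = sum(p * d for f, p, d in zip(filt, extendedprice, discount) if f)
print(revenue)  # only tuple 0 passes: 100.0 * 0.05 = 5.0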

18. BIDS: Bitmap Index for Database Service in the Cloud
• What if a column has too many unique values? The size of the bitmap index may exceed that of the original dataset.
• Compression solutions: WAH encoding ([1], sketched below), bit-sliced encoding [2], partial indexes
• All indexes are buffered in the distributed memory
• Index updates?
1. Kesheng Wu, Ekow J. Otoo, and Arie Shoshani. Compressing Bitmap Indexes for Faster Search Operations. SSDBM 2002.
2. Denis Rinfret, Patrick O'Neil, and Elizabeth O'Neil. Bit-Sliced Index Arithmetic. SIGMOD 2001.
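A simplified Python sketch of WAH encoding [1], assuming 32-bit words and ignoring corner cases such as runs overflowing the 30-bit counter; it illustrates the scheme rather than reproducing the reference implementation.

def wah_encode(bits):
    # Cut the bitmap into 31-bit groups (the last group may be shorter).
    groups = [bits[i:i + 31] for i in range(0, len(bits), 31)]
    words, i = [], 0
    while i < len(groups):
        g = groups[i]
        if len(g) == 31 and len(set(g)) == 1:  # an all-0 or all-1 group
            run = 1
            while i + run < len(groups) and groups[i + run] == g:
                run += 1
            # Fill word: flag bit, fill bit, then the run length in 30 bits.
            words.append((1 << 31) | (g[0] << 30) | run)
            i += run
        else:
            lit = 0
            for b in g:  # Literal word: the group's bits stored verbatim.
                lit = (lit << 1) | b
            words.append(lit)
            i += 1
    return words

print(wah_encode([0] * 62 + [1, 0, 1]))  # one fill word, one literal word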

19. Evaluation
• TPC-H dataset, scale 30 GB
• System size: 5 – 35 nodes
• Multi-dimensional query: a distributed multi-dimensional index on the (totalprice, orderdate) attributes of the Orders table
• Base approach?

SELECT custkey, count(orderkey), sum(totalprice)
FROM Orders
WHERE totalprice >= y AND totalprice <= y + 100
  AND orderdate >= z AND orderdate <= z + 1 month
GROUP BY custkey

20. Evaluation
• Data freshness observed by OLAP scans running concurrently with OLTP updates
• System size: 64 nodes
• Data size: 32 GB to 512 GB
• Update rate: 5 nodes, each submitting 100 ops/sec, following uniform and normal distributions
• Metric: maximal version difference (sketched below)
• Comparisons: ES2 vs. recent
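Assuming “maximal version difference” means the largest gap between the latest committed version of a tuple and the version an OLAP scan actually observed, the metric can be sketched as follows (hypothetical names, for intuition only).

def max_version_difference(scanned, latest):
    # scanned / latest: dicts mapping tuple id -> version number
    return max(latest[t] - scanned[t] for t in scanned)

print(max_version_difference({1: 10, 2: 7}, {1: 12, 2: 7}))  # 2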

21. Other results
• Data import performance
• Metadata catalog maintenance
• Additional index performance
• OLAP query performance of the epiC system
More info on the epiC project: http://www.comp.nus.edu.sg/~epiC

  22. Conclusions

  23. Thank you! Questions & Answers
