
C-Store: A Column-oriented DBMS



Presentation Transcript


  1. C-Store: A Column-oriented DBMS New England Database Group (Stonebraker, et al. Brandeis/Brown/MIT/UMass-Boston) Extended for Big Data Reading Group Presentation by Shimin Chen

  2. Relational Database Attribute1 Attribute2 Attribute3 Record 1 Record 2 Record 3 e.g. Customer(cid, name, address, discount) Product(pid, name, manufacturer, price, quantity) Order(oid, cid, pid, quantity)

  3. Current DBMS -- “Row Store” Record 1 Record 2 Record 3 Record 4 E.g. DB2, Oracle, Sybase, SQLServer, …

  4. Row Stores are Write Optimized (use white board) • Store fields in one record contiguously on disk • Use small (e.g. 4K) disk blocks • Use B-tree indexing • Align fields on byte or word boundaries • Assume shifting data values is costly • Transactions: write-ahead logging

  5. Row Stores are Write Optimized • Can insert and delete a record in one physical write • Good for on-line transaction processing (OLTP) • But not for read mostly applications • Data warehouses • Customer Relationship Management (CRM) • Electronic library card catalogs • …

  6. Column Stores

  7. At 100K Feet…. • Read-optimized: • Periodically a bulk load of new data • Long period of ad-hoc queries • Benefit: • Ad-hoc queries read 2 columns out of 20 • Column store reads 10% of what a row store reads • Previous pioneering work: • Sybase IQ (early ’90s) • Monet (see CIDR ’05 for the most recent description)
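The "reads 2 columns out of 20, so 10% of the I/O" claim above can be sketched with a little arithmetic (a hypothetical back-of-the-envelope model, not C-Store code; the row/column counts are illustrative):

```python
# Contrast how much data a row store vs. a column store must read to
# answer an ad-hoc query touching 2 of 20 attributes.
NUM_ROWS, NUM_COLS = 1000, 20
QUERY_COLS = [3, 17]  # the query reads only these two attributes

# Row store: records are stored contiguously, so scanning the two query
# columns still drags every field of every record off disk.
row_store_values_read = NUM_ROWS * NUM_COLS

# Column store: each column is stored contiguously, so only the two
# requested columns are read.
col_store_values_read = NUM_ROWS * len(QUERY_COLS)

print(col_store_values_read / row_store_values_read)  # 0.1, i.e. 10% of the I/O
```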

  8. C-Store Technical Ideas • Data storage: • Only materialized views (perhaps many) • Compress the columns to save space • No alignment • Big disk blocks • Innovative redundancy • Optimize for grid (cluster) computing • Focus on Sorting not indexing • Automatic physical DBMS design • Column optimizer and executor

  9. How to Evaluate This Paper…. • None of the ideas in isolation merit publication • Judge the complete system by its (hopefully intelligent) choice of • Small collection of inter-related powerful ideas • That together put performance in a new sandbox

  10. Outline • Overview • Read-optimized column store • Query execution and optimization • Handling transactional updates • Performance • Summary

  11. Data Model • Projection (materialized view): • some number of columns from a fact table • plus columns in a dimension table – with a 1-n join between fact and dimension table • (conceptually) no duplicate elimination • Stored in order of the sort key(s) • Note: the base table is not stored anywhere

  12. Example • Logical base tables: • EMP (name, age, salary, dept) • DEPT (dname, floor) • Example projections • EMP1 (name, age | age) • EMP2 (dept, age, DEPT.floor | DEPT.floor) • EMP3 (name, salary | salary) • DEPT1 (dname, floor | floor)
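The EMP1 (name, age | age) projection above can be sketched as column-wise storage sorted on the sort key "age" (a hypothetical illustration; the names and values are invented, not from the paper):

```python
# Base table EMP as (name, age) rows:
emp = [("bob", 25), ("ann", 31), ("carol", 22)]

# Build projection EMP1(name, age | age): sort rows on the sort key "age",
# then store each column contiguously.
sorted_rows = sorted(emp, key=lambda r: r[1])
emp1 = {
    "name": [r[0] for r in sorted_rows],
    "age":  [r[1] for r in sorted_rows],
}
# The storage key is just a record's ordinal position within its segment;
# it is computed from position when needed, not stored.
print(emp1["age"])  # [22, 25, 31]
```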

  13. Optimize for Grid Computing • I.e. shared-nothing • Horizontal partitioning and intra-query parallelism as in Gamma • Paper talks about “Grid computers … may have tens to hundreds of nodes …”

  14. Projection Detail #1 • Each projection is horizontally partitioned into “segments” • Segment identifier • Unit of distribution and parallelism • Value-based partitioning, key range of sort key(s) • Column-wise store inside segment • Storage key: ordinal record number in segment • calculated as needed
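Value-based partitioning on sort-key ranges can be sketched as follows (a hypothetical helper; the boundary values are invented for illustration):

```python
def segment_for(sort_key, boundaries):
    """Map a sort-key value to a segment ID by key range.

    boundaries: ascending upper bounds of each segment's key range;
    values above the last bound go to the final segment.
    """
    for segment_id, upper in enumerate(boundaries):
        if sort_key <= upper:
            return segment_id
    return len(boundaries)

# Three segments over ages: (-inf, 29], (29, 49], (49, +inf)
print([segment_for(age, [29, 49]) for age in [22, 35, 60]])  # [0, 1, 2]
```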

  15. Projection Detail #2 • Different encoding schemes for different columns • Depends on ordering and value distribution • Self-order, few distinct values:(value, position, num_entries) • Foreign-order, few distinct values:(value, bitmap), bitmap is run-length encoded • Self-order, many distinct values:block-oriented, delta value encoding • Foreign-order, many distinct values:gzip
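The first scheme above (self-ordered column, few distinct values) can be sketched as run-length triples; this is a minimal illustration of the idea, not the paper's actual encoder:

```python
def encode_runs(column):
    """Encode a sorted (self-ordered) column as (value, position, num_entries) triples."""
    runs, pos = [], 0
    while pos < len(column):
        start = pos
        while pos < len(column) and column[pos] == column[start]:
            pos += 1
        runs.append((column[start], start, pos - start))
    return runs

def decode_runs(runs):
    """Expand the triples back into the original column."""
    out = []
    for value, _position, count in runs:
        out.extend([value] * count)
    return out

col = ["sales", "sales", "sales", "shipping", "shipping", "toys"]
print(encode_runs(col))  # [('sales', 0, 3), ('shipping', 3, 2), ('toys', 5, 1)]
```

Because the column is sorted, each distinct value forms one run, so the encoded size depends on the number of distinct values rather than the number of rows.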

  16. Different Indexing

  17. Big Disk Blocks • Tunable • Big (minimum size is 64K)

  18. Reconstructing Base Table from Projections • Join index: • Projection T1 has M segments, projection T2 has N segments • T1 and T2 are on the same base table • Join index consists of M tables, one per T1 segment • Entry: segment ID and storage key of the corresponding record in T2 • In general, multiple join indices are needed to reconstruct a base table • Join indices are costly to store and maintain • Each column is expected to appear in multiple projections • which reduces the # of join indices needed
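A join index entry as described above can be sketched on the single-segment EMP projections (a hypothetical, simplified illustration: one segment per projection, invented values):

```python
# T2 = EMP3(name, salary | salary): one segment, sorted by salary.
t2_segment = {"name": ["bob", "ann", "carol"], "salary": [40, 50, 60]}

# T1 = EMP1(name, age | age): one segment, sorted by age (22, 25, 31).
t1_names = ["carol", "bob", "ann"]

# Join index, one entry per T1 record: (T2 segment ID, storage key in T2)
# locating the same base-table record in T2's order.
join_index = [(0, t2_segment["name"].index(n)) for n in t1_names]

# Reconstruct (name, salary) pairs in T1's sort order via the join index:
pairs = [(t1_names[i], t2_segment["salary"][sk])
         for i, (_seg, sk) in enumerate(join_index)]
print(pairs)  # [('carol', 60), ('bob', 40), ('ann', 50)]
```

The index is needed precisely because the two projections sort the same base-table records differently.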

  19. Innovative Redundancy • Hardly any warehouse is recovered by redo from log • Takes too long! • Store enough projections to ensure K-safety • Column can be in K different projections • Rebuild dead objects from elsewhere in the network

  20. Automatic Physical DBMS Design • Accept a “training set” of queries and a space budget • Choose the projections and join indices auto-magically • Re-optimize periodically based on a log of the interactions

  21. Outline • Overview • Read-optimized column store • Query execution and optimization • Handling transactional updates • Performance • Summary

  22. Operators • Decompress • Select: generate bitstring • Mask: bitstring + projection → selected rows • Project: choose a subset of columns • Concat: combine multiple projections that are sorted in the same order • Sort • Permute: according to a join index • Join • Aggregation operators • Bitstring operators
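The Select and Mask operators above can be sketched as a pair of small functions (a minimal illustration of the bitstring idea, not C-Store's executor):

```python
def select(column, predicate):
    """Select: evaluate a predicate over one column, producing a bitstring."""
    return [1 if predicate(v) else 0 for v in column]

def mask(bitstring, projection):
    """Mask: apply a bitstring to a projection, keeping only selected rows."""
    return {col: [v for v, bit in zip(values, bitstring) if bit]
            for col, values in projection.items()}

proj = {"name": ["ann", "bob", "carol"], "age": [31, 25, 22]}
bits = select(proj["age"], lambda a: a < 30)  # [0, 1, 1]
print(mask(bits, proj))  # {'name': ['bob', 'carol'], 'age': [25, 22]}
```

Keeping the intermediate result as a bitstring lets predicates on different columns be combined cheaply with the bitstring operators before any row is materialized.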

  23. Execution • Query plan: a tree of operators (data flow) • Leaf: accessing the data storage • Internal: calls “get_next” • Operators return 64KB blocks

  24. Column Optimizer (discussion) • Cost-based estimation for query plan construction • Chooses projections on which to run the query • Cost model includes compression types • When to perform “mask” operator • Build in snowflake schemas • Which are simple to optimize without exhaustive search • Looking at extensions

  25. Outline • Overview • Read-optimized column store • Query execution and optimization • Handling transactional updates • Performance • Summary

  26. Online Updates Are Necessary • Transactional updates are necessary even in read-mostly environment • Online updates for error corrections • Real-time data warehouses • Reduce the delay between OLTP system and warehouse towards zero

  27. Solution – a Hybrid Store • Write-optimized column store (WS) • Tuple mover (batch rebuilder) • Read-optimized column store (RS, what we have been talking about so far)

  28. Write Store • Column store • Horizontally partitioned like the read store • 1:1 mapping between RS segments and WS segments • Storage keys are explicitly stored • B-tree: sort key → storage key • No compression (the data size is small)

  29. Handling Updates • Optimize read-only queries: do not hold locks • Snapshot isolation • The query is run on a snapshot of the data • Ensure transactions related to this snapshot have already committed • Each WS site: insertion vector (with timestamps) and deletion vector (updates become insertions and deletions) • Maintain a high water mark and a low water mark across WS sites: • HWM: all transactions before HWM have committed • LWM: no records in the read store are inserted before LWM • Queries can specify a time before HWM
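The snapshot-visibility rule implied above can be sketched as follows (a hypothetical simplification using epoch numbers as timestamps; not the paper's actual data structures):

```python
def visible_at(insert_ts, delete_ts, snapshot_ts):
    """A record is visible at a snapshot iff it was inserted at or before the
    snapshot and not deleted at or before it (delete_ts is None if never deleted)."""
    inserted = insert_ts <= snapshot_ts
    deleted = delete_ts is not None and delete_ts <= snapshot_ts
    return inserted and not deleted

# (insert_epoch, delete_epoch or None) per record, as recorded in the
# insertion/deletion vectors of a WS site:
records = [(1, None), (2, 4), (5, None)]
hwm = 4  # all transactions at or before epoch 4 have committed

print([visible_at(ins, del_, hwm) for ins, del_ in records])
# [True, False, False]: the second record was deleted by epoch 4,
# the third was inserted after the snapshot.
```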

  30. HWM and Epochs • TA (time authority): updates the coarse timer (epochs)

  31. Transactions • Undo from a log (that does not need to be persistent) • Redo by rebuild from elsewhere in the network

  32. Tuple Mover • Read RS segment • Combine WS segment into a new version of the RS segment; do not update in place • Record the last move time for this segment in WS: Tlast_move → LWM • The time authority periodically sends out a new LWM epoch number
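The merge step above can be sketched as building a fresh, sorted RS segment and applying only the deletions old enough to purge (a hypothetical simplification with sort-key values standing in for whole records):

```python
def move_tuples(rs_segment, ws_inserts, ws_delete_epochs, lwm):
    """Merge a WS segment into a NEW version of the RS segment (no in-place update).

    ws_delete_epochs: sort-key value -> epoch of its deletion; only deletions
    at or before the new LWM are applied for good, later ones stay pending in WS.
    """
    dead = {v for v, epoch in ws_delete_epochs.items() if epoch <= lwm}
    return sorted(v for v in rs_segment + ws_inserts if v not in dead)

rs = [10, 20, 30]
new_rs = move_tuples(rs, ws_inserts=[15, 25], ws_delete_epochs={20: 5, 30: 9}, lwm=7)
print(new_rs)  # [10, 15, 25, 30] -- 20 purged (epoch 5 <= LWM), 30 kept (epoch 9 > LWM)
```

The old RS segment is discarded only after the new version is in place, which is what lets concurrent read-only queries keep running against the old snapshot.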

  33. Current Performance Varying storage: • 100X popular row store in 40% of the space • 10X popular column store in 70% of the space • 7X popular row store in 1/6th of the space • Code available with BSD license

  34. Summary • Column store is optimized for read queries • Cluster parallelism • Interesting data organization • Handling write queries
