1 / 26

C-Store: How Different are Column-Stores and Row-Stores?

C-Store: How Different are Column-Stores and Row-Stores?. Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY May. 8, 2009. Row and Column Stores. Limitation of Current Comparison. Assuming a row store does not use any column-oriented physical design.

janna
Télécharger la présentation

C-Store: How Different are Column-Stores and Row-Stores?

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. C-Store: How Different are Column-Stores and Row-Stores? Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY May. 8, 2009

  2. Row and Column Stores

  3. Limitation of Current Comparison • Assuming a row store does not use any column-oriented physical design. • Columns of the same logical tuple is traditionally stored together. • In fact it is possible to simulate a column store in a row store.

  4. Simulating a Column Store in a Row Store • Vertical Partitioning • Indexing Every Column • Materialized views • C-Tables • Vertical Partitioning + Run-Length Encoding

  5. Simulation 1: Vertical Partitioning • Each column makes its own table • N columns result in N tables, each table stores (tuple-id, attr) pairs. • Need to do tuple reconstruction • In C-store, columns are sorted in the same order. • In MonetDB, base columns are kept in insertion order, updates only happen to cracker columns. • In this “simulation” scenario, an integer “tuple-id” column is associated to each column. • Use hash join on “tuple-id” column to reconstruct tuples.

  6. Two Problems of Vertical Partitioning • It requires the “tuple-id” column to be explicitly stored in each column. • Waste space and disk bandwidth. • Most row-stores store a relatively large header on every tuple • further waste space.

  7. Tuple Header • A tuple header provides metadata about the tuple. • For example, in Postgres, each tuple contains a 27 bytes header including information such as • Insert transaction timestamp. • Number of attributes in the tuple. • NULL flags • Length of tuple header

  8. Tuple Header in a Column Store • A column-store puts tuple header in separate columns • Some information in the tuple header can be removed • For example, in MonetDB, each logical column stores (key, attr) pairs. • Hence, the number of attributes is always 2. • And there is no need for NULL flags, say we simply don’t store the tuple with NULL value.

  9. Simulation 2: Indexing Every Column • Base relations are stored using a standard, row-oriented physical design. • And an additional secondary B-Tree is built on every column of every table. • Join columns on tuple-id. • This approach answers queries by reading values directly from the indexes.

  10. One Problems of Indexing Every Column • If a column has no predicate on it, this approach requires • its index to be scanned to extract the needed values. • One optimization is to create indexes with composite keys SELECT AVG(salary) FROM emp WHERE age > 40; If we have a composite index with an (age, salary) key, we can answer this query directly from the index.

  11. Simulation 3: Materialized Views • For every query in the benchmark workload, create an optimal set of materialized views. • Involve only the columns needed to answer queries. • Do not pre-join columns from different tables in these materialized views. • Problem of materialized views. • It requires query workload to be known in advance.

  12. Comparison Results by Abadi et al. (SIGMOD’2008) • Using a simplified version of TPC-H called the Star Schema Benchmark (SSBM). • Major Results: • None of the three attemps to simulate a column store in a row store are particularly effective. • The Materialized View approach is the best • The Vertical Partitioning approach is the moderate • The Indexing Every Column approach is the worst.

  13. Relevant Factors for Successful Simulation • For successfully simulating a column store in a row store, we may have to • Store tuple headers separately • Use virtual tuple-id to do tuple reconstruction (i.e., joins). • Maintain heap files (for base relations) in guaranteed position order.

  14. Simulation 4: the C-Table Approach(By Bruno, CIDR’2009) The main idea is to extend the Vertical Partitioning approach to explicitly the Run-Length Encoding (RLE) of tuple values. A C-Table is a sequence of (f, v, c) triples: f is the starting tuple-id or position. v is the data value. c is the count (length). First sort by a, then b, and finally c.

  15. An Interesting Property of C-Table • For any pair of tuples t1 and t2, possibly on different c-tables, • the range [f1, f1 + c1 -1] and [f2, f2 + c2 -1] do not partially overlap, • i.e., they are either disjoint, or the one associated with the column deeper in the sort order is included in the other. • This property allows us to use specific query rewriting to combine information from different c-tables to answer queries.

  16. Query Rewriting for C-Tables The Schema D1: (lineitem | l_shipdate, l_suppkey)

  17. Comparison Results by Bruno. (CIDR’2009) • Using TPC-H. • Using for comparison a loose lower bound on any column store implementation • Manually compute how many pages in disk need to be read by any column store execution plan, • And measure the time taken to just read the pages. • Results: • The C-Table approach can compete with column stores. • Vision: • Row-stores should be able to incorporate column specific optimizations.

  18. Column-Specific Optimizations • Late materialization • Construct tuples as late as possible. • Block iteration • Multiple-tuples-at-a-time • Column-specific compression • Run-Length Encoding • Invisible join on star schema (New) • Like semi-join in distributed DBMS.

  19. Schema of the SSBM Benchmark

  20. An Example Query for Illustrating Invisible Join

  21. Invisible Join: First Phase Each predicate is applied to the appropriate dimension table to extract a list of dimension table keys that satisfy the predicate.

  22. Invisible Join: Second Phase Each hash table is used to extract the positions of tuples in the fact table that satisfy the corresponding predicate.

  23. Invisible Join: Third Phase The third phase uses the list of satisfying positions P in the fact table to get foreign key values and hence needed data values from the corresponding dimension table.

  24. Which Column-Specific Optimization is most significant? • Late materialization • Improves performance by a factor of 3. • Block iteration • Improves performance by 5% - 50%. • Column-specific compression • Improves performance by a factor of 2 on average • Improves performance by a factor of 10 on queries that access sorted data. • Invisible join on star schema • Improves performance by 50% - 70%.

  25. Open Question • Building a complete row-store that can transform into a column-store on workloads where column-stores perform well.

  26. References • Daniel J. Abadi, Samuel R. Madden, and Nabil Hachem. Column-Stores vs. Row-Stores: How Different Are They Really?. In SIGMOD, 2008. • Allison L. Holloway, David J. DeWitt. Read-optimized databases, in depth. In VLDB, 2008. • N Bruno. Teaching an Old Elephant New Tricks. In CIDR’2009.

More Related