Managing Web Data with Wide Tables: A Comprehensive Overview

Using Wide Table to Manage Web Data: A Survey Bin Yang byang@fudan.edu.cn

Outline • Why use the Wide Table • Data Model • Physical Implementation • Distributed Deployment • Query Executions • Other Issues

Why use the Wide Table? • Several Scenarios • New generation of e-commerce application • By Rekesh Agrawal, VLDB 2001, ICDE 2003 • By Jennifer L. Beckmann, ICDE 2006 &Eric Chu, SIGMOD 2007 • Data publishing and annotating service • del.icio.us, flicker • UTab CIDR 2007 • Novel web application • Google Base: allow user structured their uploaded content • Ebay: market-place • Craigslist: community portals • Search engine crawler data • Bigtable • Medical information system • Distributed workload management system • Condor • Semantic Web data storage • By Daniel J. Abadi, VLDB 2007, Best paper

The character of Wide Table • Data • Very wide schema, has many attributes • Sparsity • Schema evolves constantly and quickly • Query • More flexible query than SQL • More structured than keyword search

Conventional Database Horizontal Representation • N-ary horizontal representation • Optimized for dense and slowly evolved data • Problem • Large Number of Columns. Current database system has a restriction on the number of columns in one table. Eg: 1012 (DB2 and Oracle) • Sparsity. Many NULLs occupy storage, also increase the size of index, even affect the sort result. • Schema Evolution. Frequently altering schema is expensive • Query performance. If a few columns are used in the query, the query incurs a large performance penalty. • Debate • Whether such wide table is a real need or just inappropriate design • The distribution of non-NULL is unknown at schema design time • Schema evolvement • Reconstruct cost

Data Model • n-ary Horizontal Representation • 2-ary Binary Representation • 3-ary Vertical Representation • Hybrid Representation

n-ary Horizontal Representation • A basic example. See figure 1. • As many columns as the number of attributes • When a new attribute is brought in, the schema has to altered • Another approach: Attributes are divide into “dense” and “sparse”. Dense are stored in a horizontal table, Sparse are stored in a plain text object.

2-ary Binary Representation • DSM (Decomposed Storage Model) SIGMOD 1985 • Decomposed horizontal tables into as many 2-ary tables as the number of columns. • Number of columns. Just two. • Sparsity. NULL need not to store. Left Outer Join to reconstruct. • Schema evolution. Equals to add or delete a DSM table. On the other hand, too many tables makes DBMS hard to manage. • Performance of queries involved a few attributes are increased.

Number of columns. Just three. • Sparsity. NULL need not to store. • Schema evolution. Equals to add or delete a row. • Queries of vertical tables are much more complicated. • 3-ary Vertical Representation • Unlike Horizontal Representation, Binary Representation and Vertical Representation decouple the logical and physical storage of entities. • Because most current applications and development tools are designed for horizontal format, reconstruct from either Binary Representation and Vertical Representation to horizontal table is necessary.

Hybrid Representation • Used in Bigtable and HBase • Divide into several column family. Row + timestamp equals surrogate in DSM. • Less tables than DSM. • Vertical representation is used in each table

Physical Implementation • Row Oriented Storage (stores data row by row) • Positional Format • Interpreted Format • Column Oriented Storage (stores data column by column)

Positional Format • Always has a header • Relation-id:schema • Tuple-id, tuple length • NULL-bitmap, timestamp • Fixed length • 8, 255, 1, 10, 4 bytes respectively • Variable length • A,B: fixed • C,D,E: variable • Offset:length, pointer are in header • Variable length attributes always follows the fixed length attribute. • Fixed length attribute is pre-allocated no matter whether it is null • Schema evolving is expensive in both fixed and variable length record.

Deal with NULL in Positional Format Storage • NULL-bitmap: Record header using it to indicate which attribute is NULL • A bit in NULL-bitmap • Full size of the fixed length attribute is wasted • Variable length attribute is omitted • Bitmap only • Not pre-allocated for fixed length attribute • Location is more complicated than pre-allocated • PostgreSQL • A special value: The size should be small • Length-data pairs: Length is 0 • Interpreted format

Interpret Format

Column Oriented Storage • Row-oriented is write-optimized • Push all of the fields of a single record out to disk • Column-oriented is read-optimized • Firstly applied in data warehouse system • Advantages • Support efficient queries over wide schema, improved bandwidth utilization and cache locality • Efficiently deal with sparse columns, improve data compression • Use CPU cycles to save disk bandwidth • Query on compressed representation, just decompress when presented the data to user • Disadvantage • Increased cost of inserts • Increased record reconstruction cost • Typical example of Column Oriented • C-Store http://db.csail.mit.edu/projects/cstore/ • MonetDB http://monetdb.cwi.nl/projects/monetdb/Home/index.html

Compression method • Run Length Code (RLE) • A list of non-NULL values • Offset • Bitmap • Position Ranges • Challenge • A “storage wizard” to automatically decide positional? Interpreted? Horizontal? Vertical? • According to density, frequency of access

Distributed Deployment • In order to provide more storage capacity, high availability and high performance • Many nodes, each with private disk and private memory (shared-nothing architecture) • GFS (Google File System) • One master • single point failure – shadow • similar to Napster • Many Chunkserver • Data transfer between chunkserver without interaction with master • Each chunk has 3 replicas • Availability & Performance • Bigtable • Based on GFS • Partition horizontally according to row key, each partition is called tablet • C-Store • Just implement projections • Each projection is partition horizontally • Different projections may have same columns • Projections may have replicas, while different replicas can have different sort order • K-safe: can tolerate K failures

Query Execution • Query on Binary Representation • Query on Vertical Representation • Query on Interpreted Representation • Partial and sparse index • Keyword search • Partitions, Hidden Schema and Virtual Relation • Ranking

Query on Binary Representation • Suitable for queries which involved a small number of attributes • Transformation between binary and horizontal • Store non-NULL only • Physical layout • One is clustered on surrogate, suitable for B2H • The other is clustered on the attribute value. Suitable for specific queries.

Query on Vertical Representation • Transformation between vertical and horizontal • Physical layout • One is clustered on object identifier • The other is clustered on attribute name. Suitable for V2H. • Query on Interpreted Representation • No need to reconstruct the horizontal representation • EXTRACTION operation • Get the offset of each attribute • Expensive, execute in batch

Partial Index • In horizontal table, column involved in frequent queries is always indexed. • In vertical table, all three columns are indexed • The entire table are indexed. The size is much more bigger • The large indices adversely impact the performance of vertical representation (each update has to modify the index) • Partial index: only the rows of interest needs to be indexed • Proposed by Stonebraker 1989 SIGMOD • Use a predicate, only the tuples which is evaluated as true are indexed • Challenge: How to identify the real interest rows • Format: CREATE index-type INDEX on relationname(column name) where predicate CREATE B-tree INDEX on EMP(salary) where salary < 500 • Sparse Index • In interpreted table, only index the non-NULL values. • Index size is proportional to the number of non-NULL value. • For insertion and deletion operation, only the index on the attributes that are non-NULL need to update

Keyword search • Most suitable way • Ordinary user don’t know the exactly attribute name, because there are “too many attributes” • Ordinary user may not write a SQL query • Inverted index in classical IR • One inverted index on data value (traditional IR) • One inverted index on attribute name (UTab) • Problem • Too many records may contain a specific keyword • Zipf-like distribution, accepted by most users • Number of attributes contained the term • Number of rows contained the term • Imprecision • Maybe a user want to find keyword in some specific attribute • More structured keyword search

A example of structured keyword search • Fuzzy attribute • Name based schema matching techniques • WordNet semantic dictionary • Suppose: A2 is fuzzy attribute, and A2 is similar to C21 and C22 • Alternative • Run keyword search on the data value to obtain a set of objects Z. • Find out A, which is the set of attributes in Z that contain the keyword. Match A with fuzzy attributes to get the B. • Return objects in Z which has attribute B

Partitions, Hidden Schema and Virtual Relation • Vertical partition the data set in a Wide table • Scanning the vertical partition is more efficient than scanning the base table • How to partition • A reasonable number of partitions • Partitions contain minimal null values • Each base-table tuple is preferably store entirely in one partition • A common way • Group together co-occurring attributes • Challenge: How to define the degree of co-occurring

Classification • Disjoint partition (Hidden Schema) • Joint partition (Virtual Relation) • After get the partition • Materialized views (positional format because of it is dense) • Covering index (a way to find out the efficient partial index) • Provide browsing-based interface • Hidden Schema (by Eric Chu, Jeffrey Naugthon, wisconsin-madison) • Jaccard coefficient • Statistical information • Virtual relation (Applied in UTab by NUS) • Clustered on attributes and tags • Semantic information

Ranking • SQL query is always unordered set of qualifying records • Flexible query should have a order • Keyword search • Most classical method is tf-idf • Modern search engine always involve many aspects with different weights • Timestamp • Classification of keyword in attribute name and value data, different weights • Hidden schema or virtual relation should also have a ranking • Decreasing order according to the number of tuples it contains

Other Issues • Weak data types • Arbitrary string • Timestamp • Consistency • CAP theorem • Consistent, availability, tolerance of partitions • Write-once-read-many, most queries are read-only

[1] R. Agrawal, A. Somani, and Y. Xu. Storage and querying of ecommerce data. In Proc. Of VLDB, pages 149-158, 2001. • [2] G.P. Copeland and S.N. Khhoshafian. A decomposition storage model. In SIGMOD 1985. • [3] J. L. Beckmann, A. Halverson, R. Krishnamurthy, and J. F. Naughton. Extending RDBMSs to support sparse datasets using an interpreted attribute storage format. In Proc. of ICDE, 2006. • [4] B. Yu, G. Li, B. C. Ooi, and L.Z. Zhou. One Table Stores All: Enabling Painless Free-and- • Easy Data Publishing and Sharing. In CIDR 2007. • [5] M. Stonebraker, D.J. Abadi, A. Batkin et al. C-Store: a Column-Oriented DBMS. In Proc. • of VLDB 2005. • [6] E. Chu, J. Beckmann, J. Naughton. The Case for aWide-Table Approach to Manage Sparse • Relational Data Sets. In SIGMOD 2007. • [7] C. Cunningham, C.A. Galindo-Legaria, and G. Graefe. PIVOT and UNPIVOT: Optimization and Execution Strategies in an RDBMS. In VLDB 2004. • [8] F. Chang, J Dean, S. Ghemawat et al. Bigtable: A Distributed Storage System for Structured Data. In OSDI 2006. • [9] S. Ghemawat, H. Gobioff and S.T. Leung. The Google File System. In SOSP 2003. • [10] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. • In OSDI 2004. • [11] D.J. Abadi. Column Stores For Wide and Sparse Data. In CIDR 2007. • [12] J. Madhavan, S.R. Jeffery, and S. Cohen. Web-scale Data Integration: You can only afford to Pay As You Go. In CIDR 2007. • [13] A.S. Hoque. Storage and Querying of High Dimensional Sparsely Populated Data in Compressed Representation. EurAsia-ICT 2002. • [14] V. Hristidis and Y. Papakonstantinou. Discover: Keyword search in relational databases.In Proc. of VLDB, 2002. • [15] J. Madhavan, A. Halevy and S. Cohen et al. Structured Data Meets the Web: A Few Observations. IEEE Data Engineering Bulletion, 29(4), December 2006. • [16] Hadoop website. http://lucene.apache.org/hadoop/ • [17] HBase website. http://wiki.apache.org/lucene-hadoop/Hbase • [18] Delicious website. http://del.icio.us/ • [19] Flickr website. http://www.flickr.com/ • [20] Google Base website. http://base.google.com/ • [21] Google Co-op website. http://www.google.com/coop • [22] M. Stonebraker. The case for partial indexes. SIGMOD Record, 1989. • [23] WordNet website. http://wordnet.princeton.edu/ • [24] R. Agrawal, R. Srikant and Y. Xu. Database Technologies for Electronic Commerce. In Proc. of VLDB, 2002. • [25] Database Complete Book • [26] E.A.Brewer. Combining Systems and Databases: A Search Engine Retrospective.

Managing Web Data with Wide Tables: A Comprehensive Overview