1 / 28

Using Wide Table to Manage Web Data: A Survey

Using Wide Table to Manage Web Data: A Survey. Bin Yang byang@fudan.edu.cn. Outline. Why use the Wide Table Data Model Physical Implementation Distributed Deployment Query Executions Other Issues. Why use the Wide Table?. Several Scenarios New generation of e-commerce application

harris
Télécharger la présentation

Using Wide Table to Manage Web Data: A Survey

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Using Wide Table to Manage Web Data: A Survey Bin Yang byang@fudan.edu.cn

  2. Outline • Why use the Wide Table • Data Model • Physical Implementation • Distributed Deployment • Query Executions • Other Issues

  3. Why use the Wide Table? • Several Scenarios • New generation of e-commerce application • By Rekesh Agrawal, VLDB 2001, ICDE 2003 • By Jennifer L. Beckmann, ICDE 2006 &Eric Chu, SIGMOD 2007 • Data publishing and annotating service • del.icio.us, flicker • UTab CIDR 2007 • Novel web application • Google Base: allow user structured their uploaded content • Ebay: market-place • Craigslist: community portals • Search engine crawler data • Bigtable • Medical information system • Distributed workload management system • Condor • Semantic Web data storage • By Daniel J. Abadi, VLDB 2007, Best paper

  4. The character of Wide Table • Data • Very wide schema, has many attributes • Sparsity • Schema evolves constantly and quickly • Query • More flexible query than SQL • More structured than keyword search

  5. Conventional Database Horizontal Representation • N-ary horizontal representation • Optimized for dense and slowly evolved data • Problem • Large Number of Columns. Current database system has a restriction on the number of columns in one table. Eg: 1012 (DB2 and Oracle) • Sparsity. Many NULLs occupy storage, also increase the size of index, even affect the sort result. • Schema Evolution. Frequently altering schema is expensive • Query performance. If a few columns are used in the query, the query incurs a large performance penalty. • Debate • Whether such wide table is a real need or just inappropriate design • The distribution of non-NULL is unknown at schema design time • Schema evolvement • Reconstruct cost

  6. Data Model • n-ary Horizontal Representation • 2-ary Binary Representation • 3-ary Vertical Representation • Hybrid Representation

  7. n-ary Horizontal Representation • A basic example. See figure 1. • As many columns as the number of attributes • When a new attribute is brought in, the schema has to altered • Another approach: Attributes are divide into “dense” and “sparse”. Dense are stored in a horizontal table, Sparse are stored in a plain text object.

  8. 2-ary Binary Representation • DSM (Decomposed Storage Model) SIGMOD 1985 • Decomposed horizontal tables into as many 2-ary tables as the number of columns. • Number of columns. Just two. • Sparsity. NULL need not to store. Left Outer Join to reconstruct. • Schema evolution. Equals to add or delete a DSM table. On the other hand, too many tables makes DBMS hard to manage. • Performance of queries involved a few attributes are increased.

  9. Number of columns. Just three. • Sparsity. NULL need not to store. • Schema evolution. Equals to add or delete a row. • Queries of vertical tables are much more complicated. • 3-ary Vertical Representation • Unlike Horizontal Representation, Binary Representation and Vertical Representation decouple the logical and physical storage of entities. • Because most current applications and development tools are designed for horizontal format, reconstruct from either Binary Representation and Vertical Representation to horizontal table is necessary.

  10. Hybrid Representation • Used in Bigtable and HBase • Divide into several column family. Row + timestamp equals surrogate in DSM. • Less tables than DSM. • Vertical representation is used in each table

  11. Physical Implementation • Row Oriented Storage (stores data row by row) • Positional Format • Interpreted Format • Column Oriented Storage (stores data column by column)

  12. Positional Format • Always has a header • Relation-id:schema • Tuple-id, tuple length • NULL-bitmap, timestamp • Fixed length • 8, 255, 1, 10, 4 bytes respectively • Variable length • A,B: fixed • C,D,E: variable • Offset:length, pointer are in header • Variable length attributes always follows the fixed length attribute. • Fixed length attribute is pre-allocated no matter whether it is null • Schema evolving is expensive in both fixed and variable length record.

  13. Deal with NULL in Positional Format Storage • NULL-bitmap: Record header using it to indicate which attribute is NULL • A bit in NULL-bitmap • Full size of the fixed length attribute is wasted • Variable length attribute is omitted • Bitmap only • Not pre-allocated for fixed length attribute • Location is more complicated than pre-allocated • PostgreSQL • A special value: The size should be small • Length-data pairs: Length is 0 • Interpreted format

  14. Interpret Format

  15. Column Oriented Storage • Row-oriented is write-optimized • Push all of the fields of a single record out to disk • Column-oriented is read-optimized • Firstly applied in data warehouse system • Advantages • Support efficient queries over wide schema, improved bandwidth utilization and cache locality • Efficiently deal with sparse columns, improve data compression • Use CPU cycles to save disk bandwidth • Query on compressed representation, just decompress when presented the data to user • Disadvantage • Increased cost of inserts • Increased record reconstruction cost • Typical example of Column Oriented • C-Store http://db.csail.mit.edu/projects/cstore/ • MonetDB http://monetdb.cwi.nl/projects/monetdb/Home/index.html

  16. Compression method • Run Length Code (RLE) • A list of non-NULL values • Offset • Bitmap • Position Ranges • Challenge • A “storage wizard” to automatically decide positional? Interpreted? Horizontal? Vertical? • According to density, frequency of access

  17. Distributed Deployment • In order to provide more storage capacity, high availability and high performance • Many nodes, each with private disk and private memory (shared-nothing architecture) • GFS (Google File System) • One master • single point failure – shadow • similar to Napster • Many Chunkserver • Data transfer between chunkserver without interaction with master • Each chunk has 3 replicas • Availability & Performance • Bigtable • Based on GFS • Partition horizontally according to row key, each partition is called tablet • C-Store • Just implement projections • Each projection is partition horizontally • Different projections may have same columns • Projections may have replicas, while different replicas can have different sort order • K-safe: can tolerate K failures

  18. Query Execution • Query on Binary Representation • Query on Vertical Representation • Query on Interpreted Representation • Partial and sparse index • Keyword search • Partitions, Hidden Schema and Virtual Relation • Ranking

  19. Query on Binary Representation • Suitable for queries which involved a small number of attributes • Transformation between binary and horizontal • Store non-NULL only • Physical layout • One is clustered on surrogate, suitable for B2H • The other is clustered on the attribute value. Suitable for specific queries.

  20. Query on Vertical Representation • Transformation between vertical and horizontal • Physical layout • One is clustered on object identifier • The other is clustered on attribute name. Suitable for V2H. • Query on Interpreted Representation • No need to reconstruct the horizontal representation • EXTRACTION operation • Get the offset of each attribute • Expensive, execute in batch

  21. Partial Index • In horizontal table, column involved in frequent queries is always indexed. • In vertical table, all three columns are indexed • The entire table are indexed. The size is much more bigger • The large indices adversely impact the performance of vertical representation (each update has to modify the index) • Partial index: only the rows of interest needs to be indexed • Proposed by Stonebraker 1989 SIGMOD • Use a predicate, only the tuples which is evaluated as true are indexed • Challenge: How to identify the real interest rows • Format: CREATE index-type INDEX on relationname(column name) where predicate CREATE B-tree INDEX on EMP(salary) where salary < 500 • Sparse Index • In interpreted table, only index the non-NULL values. • Index size is proportional to the number of non-NULL value. • For insertion and deletion operation, only the index on the attributes that are non-NULL need to update

  22. Keyword search • Most suitable way • Ordinary user don’t know the exactly attribute name, because there are “too many attributes” • Ordinary user may not write a SQL query • Inverted index in classical IR • One inverted index on data value (traditional IR) • One inverted index on attribute name (UTab) • Problem • Too many records may contain a specific keyword • Zipf-like distribution, accepted by most users • Number of attributes contained the term • Number of rows contained the term • Imprecision • Maybe a user want to find keyword in some specific attribute • More structured keyword search

  23. A example of structured keyword search • Fuzzy attribute • Name based schema matching techniques • WordNet semantic dictionary • Suppose: A2 is fuzzy attribute, and A2 is similar to C21 and C22 • Alternative • Run keyword search on the data value to obtain a set of objects Z. • Find out A, which is the set of attributes in Z that contain the keyword. Match A with fuzzy attributes to get the B. • Return objects in Z which has attribute B

  24. Partitions, Hidden Schema and Virtual Relation • Vertical partition the data set in a Wide table • Scanning the vertical partition is more efficient than scanning the base table • How to partition • A reasonable number of partitions • Partitions contain minimal null values • Each base-table tuple is preferably store entirely in one partition • A common way • Group together co-occurring attributes • Challenge: How to define the degree of co-occurring

  25. Classification • Disjoint partition (Hidden Schema) • Joint partition (Virtual Relation) • After get the partition • Materialized views (positional format because of it is dense) • Covering index (a way to find out the efficient partial index) • Provide browsing-based interface • Hidden Schema (by Eric Chu, Jeffrey Naugthon, wisconsin-madison) • Jaccard coefficient • Statistical information • Virtual relation (Applied in UTab by NUS) • Clustered on attributes and tags • Semantic information

  26. Ranking • SQL query is always unordered set of qualifying records • Flexible query should have a order • Keyword search • Most classical method is tf-idf • Modern search engine always involve many aspects with different weights • Timestamp • Classification of keyword in attribute name and value data, different weights • Hidden schema or virtual relation should also have a ranking • Decreasing order according to the number of tuples it contains

  27. Other Issues • Weak data types • Arbitrary string • Timestamp • Consistency • CAP theorem • Consistent, availability, tolerance of partitions • Write-once-read-many, most queries are read-only

  28. [1] R. Agrawal, A. Somani, and Y. Xu. Storage and querying of ecommerce data. In Proc. Of VLDB, pages 149-158, 2001. • [2] G.P. Copeland and S.N. Khhoshafian. A decomposition storage model. In SIGMOD 1985. • [3] J. L. Beckmann, A. Halverson, R. Krishnamurthy, and J. F. Naughton. Extending RDBMSs to support sparse datasets using an interpreted attribute storage format. In Proc. of ICDE, 2006. • [4] B. Yu, G. Li, B. C. Ooi, and L.Z. Zhou. One Table Stores All: Enabling Painless Free-and- • Easy Data Publishing and Sharing. In CIDR 2007. • [5] M. Stonebraker, D.J. Abadi, A. Batkin et al. C-Store: a Column-Oriented DBMS. In Proc. • of VLDB 2005. • [6] E. Chu, J. Beckmann, J. Naughton. The Case for aWide-Table Approach to Manage Sparse • Relational Data Sets. In SIGMOD 2007. • [7] C. Cunningham, C.A. Galindo-Legaria, and G. Graefe. PIVOT and UNPIVOT: Optimization and Execution Strategies in an RDBMS. In VLDB 2004. • [8] F. Chang, J Dean, S. Ghemawat et al. Bigtable: A Distributed Storage System for Structured Data. In OSDI 2006. • [9] S. Ghemawat, H. Gobioff and S.T. Leung. The Google File System. In SOSP 2003. • [10] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. • In OSDI 2004. • [11] D.J. Abadi. Column Stores For Wide and Sparse Data. In CIDR 2007. • [12] J. Madhavan, S.R. Jeffery, and S. Cohen. Web-scale Data Integration: You can only afford to Pay As You Go. In CIDR 2007. • [13] A.S. Hoque. Storage and Querying of High Dimensional Sparsely Populated Data in Compressed Representation. EurAsia-ICT 2002. • [14] V. Hristidis and Y. Papakonstantinou. Discover: Keyword search in relational databases.In Proc. of VLDB, 2002. • [15] J. Madhavan, A. Halevy and S. Cohen et al. Structured Data Meets the Web: A Few Observations. IEEE Data Engineering Bulletion, 29(4), December 2006. • [16] Hadoop website. http://lucene.apache.org/hadoop/ • [17] HBase website. http://wiki.apache.org/lucene-hadoop/Hbase • [18] Delicious website. http://del.icio.us/ • [19] Flickr website. http://www.flickr.com/ • [20] Google Base website. http://base.google.com/ • [21] Google Co-op website. http://www.google.com/coop • [22] M. Stonebraker. The case for partial indexes. SIGMOD Record, 1989. • [23] WordNet website. http://wordnet.princeton.edu/ • [24] R. Agrawal, R. Srikant and Y. Xu. Database Technologies for Electronic Commerce. In Proc. of VLDB, 2002. • [25] Database Complete Book • [26] E.A.Brewer. Combining Systems and Databases: A Search Engine Retrospective.

More Related