180 likes | 299 Vues
This paper explores the development of a storage engine for the Semantic Web, addressing similar requirements found in e-commerce applications. It leverages findings from prior research on storing and retrieving e-commerce data, particularly in handling complex schemas and large attribute sets. The paper discusses various physical representation strategies, including vertical and horizontal models, and presents an algebra for transforming queries between these representations. Insights into performance optimizations, such as enhanced table functions and query mapping layers, are also emphasized, showcasing lessons learned applicable to Semantic Web storage solutions.
E N D
Assertion • Storage engine for semantic web has requirements similar to those for e-commerce aplications. • Draw upon results and lessons from • R. Agrawal, A. Somani, Y. Xu: Storage and Retrieval of E-Commerce Data. VLDB-2001.
Typical E-Commerce Data Characteristics An Experimental E-marketplace for Computer components • Constantly evolving schema • Sparsely populated data (about 50-100 attributes/component) • Nearly 2 Million components • More than 2000 leaf-level categories • Large number of Attributes (5000)
Alternative Physical Representations • Horizontal • One N-ary relation • Binary • N 2-ary relations • Vertical • One 3-ary relation
Conventional horizontal representation (n-ary relation) • DB Catalogs do not support thousands of columns (DB2/Oracle limit: 1012 columns) • Storage overhead of NULL values Nulls increase the index size and they sort high in DB2 B+ tree index • Hard to load/update • Schema evolution is expensive • Querying is straightforward
Monitor Name Val Height Output PAN DVD-L75 7 inch Name Val Name Val 3.75 KLH DVD221 Digital PAN DVD-L75 KLH DVD221 S-Video Binary Representation(N 2-ary relations) • Dense representation • Manageability is hard because of large number of tables • Schema evolution expensive • Decomposition Storage Model [Copeland et al SIGMOD 85], [Khoshafian et al ICDE 87] • Monet: Binary Attribute Tables [Boncz et al VLDB Journal 99] • Attribute Approach for storing XML Data [Florescu et al INRIA Tech Report 99]
Oid (object identifier) Key (attribute name) Val (attribute value) Vertical representation(One 3-ary relation) • Objects can have large number of attributes • Handles sparseness well • Schema evolution is easy • Implementation of SchemaSQL [LSS 99] • Edge Approach for storing XML Data [FK 99]
Querying over Vertical Representation is Complex • Simple query on a Horizontal scheme SELECT MONITOR FROM H WHERE OUTPUT=‘Digital’Becomes quite complex: SELECT v1.Val FROM vtable v1, vtable v2 • WHERE v1.Key = ‘Monitor’ • AND v2.Key = ‘Output’ • AND v2.Val = ‘Digital’ • AND v1.Oid = v2.Oid Writing applications becomes much harder. What can we do ?
Horizontal view (H) Attr1 Attr2 … Attrk … Query Mapping Layer Vertical table (V) Oid Key Val Solution • Provide horizontal view of the vertical table • Translation layer automatically maps operations on H to operations on V
Transformation Algebra • Defined an algebra for transforming expressions over horizontal views into expressions over the vertical representation. • Two key operators: • v2h () • h2v ()
Sample Algebraic Transforms • v2h () Operation – Convert from vertical to horizontal k(V) = [Oid(V)] [i=1,kOid,Val(Key=‘Ai’(V))] • h2V () Operation – Convert from horizontal to vertical k(H) = [i=1,kOid,’Ai’Ai(Ai ‘’(V))] [i=1,k Oid,’Ai’Ai(i=1,kAi=‘’(V)) • Similar operations such as Unfold/Fold and Gather/Scatter exist in SchemaSQL [LSS 99] and [STA 98] respectively • Complete transforms in VLDB-2001 Paper
From the Algebra to SQL • Equivalent SQL transforms for algebraic transforms • Select, Project • Joins (self, two verticals, a horizontal and a vertical) • Cartesian Product • Union, Intersection, Set difference • Aggregation • Extend DDL to provide the Horizontal View CREATE HORIZONTAL VIEW hview ON VERTICAL TABLE vtable USING COLUMNS (Attr1, Attr2, … Attrk, …)
Alternative Implementation Strategies • VerticalSQL • Uses only SQL-92 level capabilities • VerticalUDF • Exploits User Defined Functions and Table Functions to provide a direct implementation • Binary (hand-coded queries) • 2-ary representation with one relation per attribute (using only SQL-92 transforms)
density = 10%, 1000 cols x 20K rows 25 20 VerticalSQL_oid 15 Execution time (seconds) VerticalSQL_key 10 5 0 0.1% 1% 5% Join selectivity Data Organization Matters: Clustering by Key significantly outperforms by Oid Join
density = 10% 60 50 40 HorizontalSQL Execution time (seconds) 30 VerticalSQL 20 Binary 10 0 200x100K 400x50K 800x25K 1000x20K Table (#cols x #rows) • VerticalSQL comparable to Binary and outperforms Horizontal Projection of 10 columns
30 20 VerticalSQL Execution time (seconds) Binary 10 VerticalUDF 0 200x100K 400x50K 800x25K 1000x20K Table (#cols x #rows) • VerticalUDF is the best approach density = 10% Projection of 10 columns
Summary Vertical (w/ Mapping) Binary (w/ Mapping) Horizontal Manageability + + - Flexibility - + - + Querying + + - Performance + +
Remarks • Lessons of this study directly apply to building storage engine for semantics webs • Performance of vertical representation can be further improved by: • Enhanced table functions • First class treatment of table functions • Native support for v2h and h2v operations • Partial indices