DBXplorer : A System For Keyword-Based Search Over Relational Databases

Sanjay Agrawal Microsoft Research SurajitChaudhuri Microsoft Research Gautham Das Microsoft Research DBXplorer : A System For Keyword-Based Search Over Relational Databases Presented by AnushaMannem

Outline • Introduction • Overview Of DBXplorer • Publish & Design of Symbol Table • Finding Matches For Keyword Search • Support for Generalized Matches(Token Matching) • Experimental Results • Conclusions

Introduction • Internet search engines have popularized keyword based search. • Alternative to Keyword-Based search is Structured Search. Ex: Searching for books in Bookseller's database Books Travel  Lonely planet  Asia (or) Books Travel  Rough Guides  Europe Other similar example is http://m.www.yahoo.com/ • Example for Keyword-Based Search is searching for “Jim Gary” on Microsoft intranet to obtain matched rows, i.e., rows in DB where ‘Jim Gary’ occur.

Overview of DBXplorer • DBXplorer, an efficient and scalable keyword search utility for relational databases. • The goal of DBXplorer is to enable keyword searches on multiple databases without requiring the users to know the schema of respective databases. • How it works? When a DBXplorer is given a set of query keywords, it returns all the rows (either from single table, or by joining tables connected by foreign-key joins) such that each row contains all the keywords.

Overview Of DBXplorer • Enabling this keyword search requires 2 steps (a)Publish Step (b)Search Step

Overview of Publish Steps • Publish : A databases is enabled for keyword search through the following steps. • Step 1 : A database is identified, along with the set of tables or columns within the database to be published. • Step 2 : Auxiliary tables are created for supporting keyword searches. The most important structure is Symbol Table S that is used at search time to efficiently determine the locations of query keywords in the database

Overview of Search Steps • Search: Given a query consisting of a set of keywords, it is answered as follows. • Step 1: Looks up symbol table to identify the tables, and columns/rows of the database which contain the query keywords. • Step 2: Identify and enumerate all possible joins (subsets of table if joined) whose rows might contain required keywords • Step 3: Generate SQL statements for each join (gives rows which contain all keywords), and the final rows are ranked and presented to the user.

Architecture of DBXplorer • Publish and Search components are packaged as two separate COM objects.

Architecture of DBXplorer • Publish Component provides interface to • Select a database • Select tables/columns within the database to publish ,and • modify/remove/maintain the publication • For given set of keywords, the search component provides interfaces to • Retrieve matching databases from set of published databases. • Selectively identify tables, columns/rows that need to be searched within each database identified in step (a)

Different Symbol Table Designs • Exact match problem is considered i.e., each keyword in the query must match the value of an attribute in a row of a table. • Symbol Table S,which stores the information of the query keywords at different levels (row/column ) of granularity. • The two granularity levels are : • Column level granularity (Pub-Col) : for each keyword , we only store the list of columns where they occur. • Row Level granularity (Pub-Cell) : for each keyword we keep track of all the rows that contains the keyword.

Factors That Effect Granularity • Space and Time Requirements Pub-Col is much faster and smaller when compared to Pub-Cell. • Keyword Search Performance Pub-Col is effective if there are an indexes on the published columns • Ease of Maintenance Pub-Col is easier to maintain as it requires an update only if the insertions cause new values in column data. • Hence, Pub-Col symbol table is always better than Pub-Cell table, unless certain columns do not have indexes.

Pub-Col Representation • Symbol table is stored as a relation with two attributes : Keyword and ColId. • Alternative way is to keep hashed values of keywords in the symbol table . • FK-Comp: If the set of values in column c1 is a subset of the values in another columnc2 due to key-foreign key relationship, we retain only a single hash table entry for keywords common to both as (keyval,c1), since the foreign key constraint can be used to infer presence of keyval in c2.

Pub-Col Representation • CP-Comp : (a)Partition H into a minimum number of bipartite cliques (a bipartite clique is any subgraph of H with a maximal number of edges). (b) Compress each clique. (map the symbol table S to a bipartite graph H having two node sets, HashVal & ColId, where every row (v, c) in table S corresponds to an edge in H.) V2 V3 v4 c1 c2 x Uncompressed hash table ColumnsMap table Compressed hash table

Pub-Col Algorithm

Search Component • Step 1 : Symbol table is searched(using generated SQL) to identify the database tables, columns/cells that contain at least one query keyword. • Step 2 : Enumerating Join trees • Step 3 : Identifying matching rows

Enumerating Join Trees • Identify and enumerate all potential subsets of tables in the database that, if joined, might contain rows having all keywords. • The resulting relationwill contain all potential rows having all keywordsspecified in the query. • If we view theschema graph G asan undirected graph, this stepenumerates join trees, i.e., sub-trees of G such that: • the leaves belong to MatchedTables and • together, theleaves contain all keywords of the query keywords Join Trees

Searching For Rows • The input to this final search step is enumerated join trees. • Each join tree is then mapped to a single SQL statement that joins the tables as specified in the tree, and select those rows that contain all keywords. • The retrieved rows are ranked before being output. • Our approach is to rank the rows by number of joins involved E.g., documents in which keywords occur close to one another are ranked higher than documents in which keywords are far apart.

Search Algorithm

Token Matches • Token matches - the keyword in the querymatches only a token or sub-string of an attribute value (e.g., retrieve rows of address by specifying only a street name). • Pub-Prefix method • B+ tree indexes can be used toretrieve rows whose cell matches a given prefix string • This clause is of the form WHERE T.C LIKE ‘P%K%’ • During publishing of a database, for every keyword K,the entry (hash(K), T.C, P) is kept in the symbol table ifthere exists a string in column T.C which • contains atoken K, and • has prefix P

Token Matches Let the hash values of thesearchable tokens i.e., ‘string’, ‘ball’ and ‘round’ be 1, 2and 3 respectively Pub-Prefix table Database table T Consider searching keyword “string” Pub-Prefix table returns prefixes “th” and “no” and subsequent SQL will contain (T.C LIKE ‘th%string%’) OR (T.C LIKE ‘no%string%’)

Experimental Results • Setup: The Experiments are on a 450 MHz 256 MB Intel P-III machine. • Four databases were used for evaluation: • TPC-H data of sizes from 100 to 500 MB • USR is Microsoft employee address DB of 130 MB with 19 tables • ML is a 375 MB mailing list DB with 38 tables • KB is a 365 MB DB with 84 tables containing information on articles and help manuals on various shipped products

Experimental Results • In particular it has been showed that : • Pub-Col table is compact when compared to Pub-Cell • Pub-Col scales linearly with data size, and is independent of data distribution. • Pub-Prefix is compact compared to Pub-Cell and has a significantly better performance when full text indexes are not present

Size of Symbol Table Publish & Search performance of Pub-Col & Pub-Cell are compared:

Symbol Table Building Time

Query performance

User Interface For DBXplorer Fig: Matching Databases

User Interface For DBXplorer Fig: Join Trees

User Interface For DBXplorer Fig: Matching Rows

Conclusion • DBXplorer has been implemented using a commercial relational database and web server and allows users to interact via a browser at front-end. • Although the authors discussed only the case where there is a single database, this techniques can be extended to keyword search over multiple databases. • DBXplorer support exact matches and generalized matches up to some extent. • As mentioned earlier, the Pub-Col alternative is the best when columns have indexes on them. If a full-text index is available, use Pub-Col with the full-text index. Instead, if only a traditional index is available and the column width is small, use Pub-Prefix, otherwise use Pub-Cell.

DBXplorer : A System For Keyword-Based Search Over Relational Databases

DBXplorer : A System For Keyword-Based Search Over Relational Databases

Presentation Transcript

Design Theory for Relational Databases

Keyword Search on Structured and Semi-Structured Data

Chapter 5

Chapter 3: Relational Model

Data

Keyword Search on Structured and Semi-Structured Data

Evidence-Based Practice and Nursing Databases

Chapter 19: Distributed Databases

The PATENTSCOPE search system

Chapter 3: Relational Model

Chapter 6: Formal Relational Query Languages

INFORMATION SYSTEMS IN BIOLOGY: FOCUS ON DATABASE

EBI services

Chapter 19: Distributed Databases

How to search with the PATENTSCOPE search system

Chapter 3: Relational Model

Chapters 3-6 Relational Data Models, Relational Constraints, and Relational Algebra

HAPTER 4

Chapter 11: Storage and File Structure

Relational Algebra

Relational Algebra and My SQL(II)