1 / 29

DBXplorer : A System For Keyword-Based Search Over Relational Databases

Sanjay Agrawal Microsoft Research Surajit Chaudhuri Microsoft Research Gautham Das Microsoft Research. DBXplorer : A System For Keyword-Based Search Over Relational Databases. Presented by Anusha Mannem.

maili
Télécharger la présentation

DBXplorer : A System For Keyword-Based Search Over Relational Databases

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Sanjay Agrawal Microsoft Research SurajitChaudhuri Microsoft Research Gautham Das Microsoft Research DBXplorer : A System For Keyword-Based Search Over Relational Databases Presented by AnushaMannem

  2. Outline • Introduction • Overview Of DBXplorer • Publish & Design of Symbol Table • Finding Matches For Keyword Search • Support for Generalized Matches(Token Matching) • Experimental Results • Conclusions

  3. Introduction • Internet search engines have popularized keyword based search. • Alternative to Keyword-Based search is Structured Search. Ex: Searching for books in Bookseller's database Books Travel  Lonely planet  Asia (or) Books Travel  Rough Guides  Europe Other similar example is http://m.www.yahoo.com/ • Example for Keyword-Based Search is searching for “Jim Gary” on Microsoft intranet to obtain matched rows, i.e., rows in DB where ‘Jim Gary’ occur.

  4. Overview of DBXplorer • DBXplorer, an efficient and scalable keyword search utility for relational databases. • The goal of DBXplorer is to enable keyword searches on multiple databases without requiring the users to know the schema of respective databases. • How it works? When a DBXplorer is given a set of query keywords, it returns all the rows (either from single table, or by joining tables connected by foreign-key joins) such that each row contains all the keywords.

  5. Overview Of DBXplorer • Enabling this keyword search requires 2 steps (a)Publish Step (b)Search Step

  6. Overview of Publish Steps • Publish : A databases is enabled for keyword search through the following steps. • Step 1 : A database is identified, along with the set of tables or columns within the database to be published. • Step 2 : Auxiliary tables are created for supporting keyword searches. The most important structure is Symbol Table S that is used at search time to efficiently determine the locations of query keywords in the database

  7. Overview of Search Steps • Search: Given a query consisting of a set of keywords, it is answered as follows. • Step 1: Looks up symbol table to identify the tables, and columns/rows of the database which contain the query keywords. • Step 2: Identify and enumerate all possible joins (subsets of table if joined) whose rows might contain required keywords • Step 3: Generate SQL statements for each join (gives rows which contain all keywords), and the final rows are ranked and presented to the user.

  8. Architecture of DBXplorer • Publish and Search components are packaged as two separate COM objects.

  9. Architecture of DBXplorer • Publish Component provides interface to • Select a database • Select tables/columns within the database to publish ,and • modify/remove/maintain the publication • For given set of keywords, the search component provides interfaces to • Retrieve matching databases from set of published databases. • Selectively identify tables, columns/rows that need to be searched within each database identified in step (a)

  10. Different Symbol Table Designs • Exact match problem is considered i.e., each keyword in the query must match the value of an attribute in a row of a table. • Symbol Table S,which stores the information of the query keywords at different levels (row/column ) of granularity. • The two granularity levels are : • Column level granularity (Pub-Col) : for each keyword , we only store the list of columns where they occur. • Row Level granularity (Pub-Cell) : for each keyword we keep track of all the rows that contains the keyword.

  11. Factors That Effect Granularity • Space and Time Requirements Pub-Col is much faster and smaller when compared to Pub-Cell. • Keyword Search Performance Pub-Col is effective if there are an indexes on the published columns • Ease of Maintenance Pub-Col is easier to maintain as it requires an update only if the insertions cause new values in column data. • Hence, Pub-Col symbol table is always better than Pub-Cell table, unless certain columns do not have indexes.

  12. Pub-Col Representation • Symbol table is stored as a relation with two attributes : Keyword and ColId. • Alternative way is to keep hashed values of keywords in the symbol table . • FK-Comp: If the set of values in column c1 is a subset of the values in another columnc2 due to key-foreign key relationship, we retain only a single hash table entry for keywords common to both as (keyval,c1), since the foreign key constraint can be used to infer presence of keyval in c2.

  13. Pub-Col Representation • CP-Comp : (a)Partition H into a minimum number of bipartite cliques (a bipartite clique is any subgraph of H with a maximal number of edges). (b) Compress each clique. (map the symbol table S to a bipartite graph H having two node sets, HashVal & ColId, where every row (v, c) in table S corresponds to an edge in H.) V2 V3 v4 c1 c2 x Uncompressed hash table ColumnsMap table Compressed hash table

  14. Pub-Col Algorithm

  15. Search Component • Step 1 : Symbol table is searched(using generated SQL) to identify the database tables, columns/cells that contain at least one query keyword. • Step 2 : Enumerating Join trees • Step 3 : Identifying matching rows

  16. Enumerating Join Trees • Identify and enumerate all potential subsets of tables in the database that, if joined, might contain rows having all keywords. • The resulting relationwill contain all potential rows having all keywordsspecified in the query. • If we view theschema graph G asan undirected graph, this stepenumerates join trees, i.e., sub-trees of G such that: • the leaves belong to MatchedTables and • together, theleaves contain all keywords of the query keywords Join Trees

  17. Searching For Rows • The input to this final search step is enumerated join trees. • Each join tree is then mapped to a single SQL statement that joins the tables as specified in the tree, and select those rows that contain all keywords. • The retrieved rows are ranked before being output. • Our approach is to rank the rows by number of joins involved E.g., documents in which keywords occur close to one another are ranked higher than documents in which keywords are far apart.

  18. Search Algorithm

  19. Token Matches • Token matches - the keyword in the querymatches only a token or sub-string of an attribute value (e.g., retrieve rows of address by specifying only a street name). • Pub-Prefix method • B+ tree indexes can be used toretrieve rows whose cell matches a given prefix string • This clause is of the form WHERE T.C LIKE ‘P%K%’ • During publishing of a database, for every keyword K,the entry (hash(K), T.C, P) is kept in the symbol table ifthere exists a string in column T.C which • contains atoken K, and • has prefix P

  20. Token Matches Let the hash values of thesearchable tokens i.e., ‘string’, ‘ball’ and ‘round’ be 1, 2and 3 respectively Pub-Prefix table Database table T Consider searching keyword “string” Pub-Prefix table returns prefixes “th” and “no” and subsequent SQL will contain (T.C LIKE ‘th%string%’) OR (T.C LIKE ‘no%string%’)

  21. Experimental Results • Setup: The Experiments are on a 450 MHz 256 MB Intel P-III machine. • Four databases were used for evaluation: • TPC-H data of sizes from 100 to 500 MB • USR is Microsoft employee address DB of 130 MB with 19 tables • ML is a 375 MB mailing list DB with 38 tables • KB is a 365 MB DB with 84 tables containing information on articles and help manuals on various shipped products

  22. Experimental Results • In particular it has been showed that : • Pub-Col table is compact when compared to Pub-Cell • Pub-Col scales linearly with data size, and is independent of data distribution. • Pub-Prefix is compact compared to Pub-Cell and has a significantly better performance when full text indexes are not present

  23. Size of Symbol Table Publish & Search performance of Pub-Col & Pub-Cell are compared:

  24. Symbol Table Building Time

  25. Query performance

  26. User Interface For DBXplorer Fig: Matching Databases

  27. User Interface For DBXplorer Fig: Join Trees

  28. User Interface For DBXplorer Fig: Matching Rows

  29. Conclusion • DBXplorer has been implemented using a commercial relational database and web server and allows users to interact via a browser at front-end. • Although the authors discussed only the case where there is a single database, this techniques can be extended to keyword search over multiple databases. • DBXplorer support exact matches and generalized matches up to some extent. • As mentioned earlier, the Pub-Col alternative is the best when columns have indexes on them. If a full-text index is available, use Pub-Col with the full-text index. Instead, if only a traditional index is available and the column width is small, use Pub-Prefix, otherwise use Pub-Cell.

More Related