Effective Keyword-Based Selection of Relational Databases

Effective Keyword-Based Selection of Relational Databases By Bei Yu, Guoliang Li, Karen Sollins & Anthony K. H. Tung Presented by Deborah Kallina

Introduction • The database selection problem for relational data sources • Demand for keyword-based searches of structured databases • What about searches over multiple structured databases?

Current Approach • Document collections are summarized with keyword frequency lists • Inadequate because… • Relational tables are typically normalized • Results must consider number of join operations Example: query Q = {multimedia, database, VLDB}

The Usefulness of a Database in Answering a Keyword Query • A database is useful if it has high quality results to the keyword query. • What are “high quality” results? • The results contain all the query keywords. • There is a keyword relationship – the query keywords can be connected meaningfully in the database.

This paper proposes… • A method that summaries the relationships between keywords in a database. • A ranking method based on the keyword summaries in order to select the most useful databases for a given keyword query.

Why Summaries? • We have an equation which measures the usefulness of a database DB to query Q • But calculating this score isn’t feasible for every database in the system! • So we estimate the usefulness of the database using a summary of the relationships between all pairs of keywords in the database.

Results of a Keyword Query • A minimal tree of tuples containing all the query keywords * tuples in the tree are connected according to their relationships defined in the database schema • The more distant the relationships, the weaker the relationships connecting the tuples. • distance - the number of join operations in a joining sequence of tuples Examples: • t1 t2 t3, distance = 2 • t4, distance = 0 • DB2 of Figure 1, Q = {multimedia, binder} • t1 t5 t12or t4 t9 t12, distance = 2 • t15 t10 t1 t5 t12, distance = 4

The Matrices • Each relational database DB has several types of matrices • one D matrix – represents the presence or absence of each keyword in each tuple in DB • multiple T matrices – represents the relationship between tuples in a relational database at each distance • construct the Wd matrices – the frequencies of keyword pairs at distance d;Example: Figure 3

Calculate the Key Relationship Matrix (KRM) • wd(ki, kj) is the frequency of d-distance joining sequences to connect the pair of keywords ki and kj • is the maximum number of join operations (user-specified) Example: Figure 4 Q = {database:VLDB} for DB2 in Figure 3, the only non-zero is when d = 2 1/(2+1) * 1 occurrence = 0.33

Implementation in SQL • The generation of the matrices: D, T1, T2, … T and W0, W1, … , W , for each DB, can be performed conveniently inside the local RDBMS using SQL. • Tables are created for each matrix. • Tables can be maintained dynamically when tuples are added to the database or old tuples are deleted.

Data Sources Selection Using KRM • Estimate the score of the data source DB to Q through the scores of the individual pairs. • With the KR-summary, we can effectively rank a set of databases. A higher score of the data source DB to Q indicates a better ranking.

Definitions & Metrics for Comparing Experimental Results • real ranking – the “real” score of each database based on Equation 2-1 • recall – compares the accumulated score of the top l databases selected based on the summaries of the source databases against the total available score when we select top l databases according to real ranking (summaries vs. real ranking) • precision – measures the fraction of the top l selected databases which have high quality results

Experimental Results Observations: • The selection performance of KR-summaries generally gets better as gets larger. • The precision and recall performance for different values of tends to cluster into groups. • There are big gaps in both precisions and recalls between KR-summaries when 0 ≤ ≤ 1 and when is greater.

Conclusion “The estimation method is effective in assigning high ranks to useful databases, although less relevant or irrelevant databases might be selected.”

Effective Keyword-Based Selection of Relational Databases

Effective Keyword-Based Selection of Relational Databases

Presentation Transcript

DBXplorer : A System For Keyword-Based Search Over Relational Databases

A Graph Method for Keyword-based Selection of the top-K Databases

Relational Databases

Relational Databases

DBXplorer: A System for Keyword-Based Search over Relational Databases.

DBXplorer: A System for Keyword-Based Search over Relational Databases

Relational Databases

RELATIONAL DATABASES

A Graph Method for Keyword-based Selection of the top-K Databases

Relational Databases

Relational Databases

Relational Databases

Efficient IR-Style Keyword Search over Relational Databases

Relational Databases

DISCOVER: Keyword Search in Relational Databases

Efficient IR-Style Keyword Search over Relational Databases

Relational Databases

Relational Databases

Efficient Keyword Search across Heterogeneous Relational Databases

Relational Databases

Effective Keyword Based Selection of Relational Databases

Effective Keyword Search in Relational Databases