
Efficient IR-Style Keyword Search over Relational Databases




Presentation Transcript


  1. Seminar on Databases and the Internet The Hebrew University of Jerusalem, Winter 2006 Efficient IR-Style Keyword Search over Relational Databases 12 December 2005

  2. Introduction This presentation is mainly based upon the work of Hristidis, Gravano, and Papakonstantinou. The work presents several efficient algorithms for IR-style keyword search, based on the DISCOVER architecture.

  3. Contents • Introduction • Goal and Motivation • Framework and examples • Architecture • Algorithms • Experimental Results • Criticism and Conclusion


  5. Goal and Motivation We present a detailed framework and methods for IR-style keyword search over relational databases. What is information-retrieval keyword search in general? Mainly, it's this…

  6. Goal and Motivation …But not always: SELECT * FROM Complaints C WHERE CONTAINS (C.comment, 'disk crash', 1) > 0 ORDER BY score(1) DESC

  7. Goal and Motivation Current status: • RDBMSs (such as Oracle) provide querying capabilities for text attributes, provided that an exact column is specified. • Only AND semantics are being used. • Limited ranking functions. • Known approaches for query processing strategies are inefficient (and sometimes even infeasible).

  8. Goal and Motivation In particular, we'd like: • Efficient ways to generate "top-k" results according to some form of "ranking". • Use of AND and OR semantics (not just the default AND) when obtaining results. • Assembling keyword occurrences from multiple attributes, perhaps in "unforeseen" ways, without needing to specify columns.

  9. Goal and Motivation We would like to apply the same (or similar) methods and rules that apply in the web-search world: prioritizing (k-best results first), efficient searching, and use of AND/OR semantics.

  10. Goal and Motivation Why should we care?? • Keyword queries require little or no knowledge about the database semantics. • Ranking results correctly (and returning only relevant tuples) is, of course, highly desirable. • Efficient implementation should reduce the querying process to a fraction of the time of a naïve implementation.

  11. Contents • Introduction • Goal and Motivation • Framework and examples • Architecture • Algorithms • Experimental Results • Criticism and Conclusion


  13. Framework • Query Model: • A database with n relations R1,…, Rn. • Relations possibly have primary-key to foreign-key constraints. • The schema graph G is a directed graph in which, for each primary-to-foreign-key relationship between Ri and Rj, there is an edge (i,j):

  14. Framework A possible instance of the schema graph can be: Complaints Products Customers
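As a concrete illustration, the three example relations and their key constraints could be declared roughly as follows. This is only a sketch: the column names (prodId, custId, comments, and so on) are assumptions made for illustration and are not taken from the slides.

-- Customers and Products are referenced by Complaints, giving the two
-- primary-to-foreign-key edges of the schema graph G.
CREATE TABLE Customers (
  custId     NUMBER PRIMARY KEY,
  name       VARCHAR2(100),
  occupation VARCHAR2(100)
);
CREATE TABLE Products (
  prodId       NUMBER PRIMARY KEY,
  manufacturer VARCHAR2(100),
  model        VARCHAR2(100)
);
CREATE TABLE Complaints (
  prodId   NUMBER REFERENCES Products(prodId),
  custId   NUMBER REFERENCES Customers(custId),
  cdate    DATE,
  comments VARCHAR2(4000)
);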

  15. Framework Joining trees of tuples: • Given a schema graph G for a database, a joining tree of tuples T is a tree of tuples in which each edge (ti, tj), with ti ∈ Ri and tj ∈ Rj, satisfies two properties: • (Ri, Rj) ∈ G (the schema graph we talked about) • ti ⋈ tj ∈ Ri ⋈ Rj • The size(T) of a joining tree is the number of tuples in T.

  16. Framework A joining tree of tuples for our example: Complaints ⋈ Products ⋈ Customers

  17. Framework "Top-k" keyword query • A "top-k" keyword query is a list of keywords Q = {w1,…, wm}. The result of such a query is a list of the k joining trees of tuples T whose score(T,Q) is the highest, such that: • each tree T in a result is minimal: it cannot have a zero-scored leaf; • no tuple appears more than once in a joining tree of tuples.

  18. Framework For example, the query Q = {Netvista, Maxtor} should yield the following results: C1 (by itself) Complaints Products Customers

  19. Framework And the following: p2 ⋈ c3 Complaints Products Customers

  20. Framework And the following: p1 ⋈ c1 Complaints Products Customers

  21. Framework Score(ai,Q) • A method to evaluate the relevance of a tree of tuples. It consists of a single-attribute (ai) IR-style relevance scoring function, with: s - a constant tf - term frequency of w (w ∈ Q) in ai N - number of tuples in ai's relation dl, avdl - (average) attribute-value size df - number of tuples in ai's relation containing the word w
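The scoring formula itself appears only as an image on the original slide. For reference, the single-attribute function in the paper is the pivoted-normalization formula, roughly (reconstructed from the parameter list above, so treat the exact form as approximate):

Score(a_i, Q) = \sum_{w \in Q \cap a_i} \frac{1 + \ln\bigl(1 + \ln(tf)\bigr)}{(1 - s) + s \cdot \frac{dl}{avdl}} \cdot \ln\frac{N + 1}{df}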

  22. Framework Combined Score(T,Q) • Another function is used to combine the single-attribute scores into a final score for the tree; the specific functions shown are only optional candidates. • This framework can handle many functions, as long as they satisfy the tuple monotonicity property: • if every tuple in T' scores no higher (individually) than the corresponding tuple in T, then the combined score of T' is no higher than the combined score of T.
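For concreteness, one simple combining function that satisfies this property is the size-normalized sum of the single-attribute scores; the monotonicity requirement can then be stated for two trees T and T' of the same candidate network as:

Score(T, Q) = \frac{\sum_{a_i \in T} Score(a_i, Q)}{size(T)}, \qquad \bigl(\forall i:\ Score(t_i, Q) \le Score(t'_i, Q)\bigr) \Rightarrow Score(T, Q) \le Score(T', Q)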

  23. Framework Candidate Networks (CN) • A CN can be thought of as a join expression that involves tuple sets plus (perhaps) "base" relations that do not have occurrences of the query keywords, but help to connect relations that do… Q = {IBM, Architect}: ProductsQ ⋈ Complaints{} ⋈ CustomersQ
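A minimal sketch of the parameterized SQL such a CN could translate to, reusing the Oracle Text CONTAINS syntax from the earlier slide (the column names and the simple score combination are illustrative assumptions, not the paper's exact query):

-- CN: ProductsQ ⋈ Complaints{} ⋈ CustomersQ for Q = {IBM, Architect}.
-- Complaints is a "base"/free relation: it contributes no keyword matches,
-- it only connects the two keyword-bearing tuple sets.
SELECT p.prodId, cu.custId, score(1) + score(2) AS combined
FROM   Products p, Complaints c, Customers cu
WHERE  c.prodId = p.prodId
  AND  c.custId = cu.custId
  AND  CONTAINS(p.manufacturer, 'IBM', 1) > 0
  AND  CONTAINS(cu.occupation, 'Architect', 2) > 0
ORDER BY combined DESC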

  24. Framework For example, the slide shows all the candidate networks (with their scores) for Q = {Maxtor, Netvista}. Legend: P = Products, C = Complaints, U = Customers.

  25. Contents • Introduction • Goal and Motivation • Framework and examples • Architecture • Algorithms • Experimental Results • Criticism and Conclusion


  27. Architecture • What follows is a quick overview of the system architecture needed in order to efficiently implement top-k keyword queries. • The description relies heavily on the DISCOVER architecture, but is not really OS- or RDBMS-specific.

  28. Architecture • The architecture consists of: • an IR Engine • a CN Generator • an Execution Engine [Architecture diagram: the user's keywords go to the IR Engine (backed by the IR index), which produces tuple sets; the Candidate Network Generator (using the database schema) produces candidate networks; the Execution Engine issues parameterized SQL queries against the database.]

  29. Architecture IR Engine • Modern RDBMSs include IR-style text-indexing functionality (e.g. Oracle Text). • It is useful to think of the IR Engine as an indexer that gives a SCORE > 0 to tuples that have occurrences of the keywords.

  30. Architecture IR Engine • The proposed architecture exploits this functionality: upon arrival of a query Q, it generates for each relation R the tuple set RQ = { t ∈ R | Score(t,Q) > 0 }. • The tuple sets are then sorted by decreasing score and passed on to the next module.
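A minimal sketch of how such a tuple set could be materialized with Oracle Text, assuming a text index on Complaints.comments (query and column names follow the running example):

-- ComplaintsQ = { t ∈ Complaints | Score(t,Q) > 0 } for Q = {Maxtor, Netvista}
SELECT c.*, score(1) AS s
FROM   Complaints c
WHERE  CONTAINS(c.comments, 'Maxtor OR Netvista', 1) > 0
ORDER BY s DESC  -- sorted by decreasing score before being passed on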

  31. Architecture CN Generator • Receives the non-empty tuple sets (such as CQ, PQ) and the general schema graph. • Attempts to join those sets, perhaps using "base" relations (U{}… remember?), and generates the Candidate Networks (CNs)!

  32. Architecture CN Generator • Also receives a parameter M, which bounds the maximum number of tuple sets (either free or non-free) participating in a CN. • Why is this bound needed? Because the number of CNs might be exponential in the query size!

  33. Architecture CN Generator The generated CNs MUST satisfy: • No leaf of the CN is a "free" tuple set (P{}…). • No R-S-R tuple-set pattern exists: a tree of tuples cannot include duplicate tuples!

  34. Architecture Execution Engine • This is the module that actually contacts the RDBMS query tools in order to generate the top-k results. • This is our focus! (It is the hardest part to implement efficiently.)

  35. Sparse algorithm example Recall the database from before, with the query Q= {Maxtor, Netvista} Complaints Products Customers

  36. Architecture - demonstration The user issues the keywords {Maxtor, Netvista}. [Architecture diagram as before, showing the keywords flowing from the user through the IR Engine, the Candidate Network Generator, and the Execution Engine to the database.]

  37. Architecture - demonstration We now turn our attention to how THIS is done. [Same architecture diagram, with the keywords {Maxtor, Netvista}.]

  38. Contents • Introduction • Goal and Motivation • Framework and examples • Architecture • Algorithms • Experimental Results • Criticism and Conclusion


  40. First of all, what do we have so far? • An architecture that constructs Candidate Networks from keyword queries, using “black box” functions of modern RDBMSs, and some given SCORE functions. • A notion of what should be done in order to produce the keyword query results. So, how would you do it???

  41. Naïve algorithm • The naïve approach: simply issue an SQL query for each CN. • The results from all the queries are then combined using Sort-Merge-Join. • Main problem – runtime. • What characteristic(s) can we use in order to make our algorithm more efficient?

  42. Naïve algorithm is too slow • Remember that the IR Engine returns tuple sets that are ranked in DESCENDING order with respect to the SCORE() function. • So, when applying Combine(Score(T,Q)) for a whole CN CNi, we can get an ESTIMATE of its maximal possible score, MPSi. • We can use this knowledge to disregard "unfruitful" CNs!!
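For example, with the size-normalized sum assumed in the Framework section, the maximal possible score of CNi is obtained by plugging the top (first, highest) score of each of its non-free tuple sets into the combining function, since the tuple sets arrive sorted in descending order:

MPS_i = \frac{\sum_{TS \in CN_i,\; TS\ \text{non-free}} topScore(TS)}{size(CN_i)}

(a sketch under that particular combining function; other monotone combining functions work the same way).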

  43. Sparse Algorithm • For every CNi, compute MPSi. • If MPSi does not exceed the lowest score among the "best-k" matches for the query found so far, DISCARD CNi. • Otherwise, join the tuples in CNi as usual… • As a further optimization, CNs are evaluated in ASCENDING SIZE order: smaller CNs are evaluated first, while "heavy" CNs might be discarded after only short calculation steps!

  44. Sparse algorithm example Remember this database, with the query Q= {Maxtor, Netvista} ? Complaints Products Customers

  45. Sparse algorithm example • Suppose we want to find the top-2 best results for the query Q = {Maxtor, Netvista} on our existing database. • The CN generator supplies our execution engine with the candidate networks shown on the slide, with M = 3. • We start off with CQ; let's take a look:

  46. Sparse algorithm example CQ consists of all the tuples (with different scores, of course): ComplaintsQ C1 – its SCORE is 0.33 C2 – its SCORE is 0.33 C3 – its SCORE is 1.33

  47. Sparse algorithm example • We start off with CQ; there is no need to calculate MPS(CQ) – but we do it anyway! • We already know everything (we got these exact results from the IR engine). • We now turn to examine the CN PQ ... MPS(CQ) = 1.33

  48. Sparse algorithm example These are the relevant tuples that PQ consists of: ProductsQ P1 – its SCORE is 1 P2 – its SCORE is 1

  49. Sparse algorithm example • Let's look at how the algorithm works on PQ: • We calculate MPS(PQ) = 1, so it might still yield some result that can be added to the TOP-K queue. • We now turn to examine the CN CQ ⋈ PQ ... MPS(CQ) = 1.33 MPS(PQ) = 1

  50. Sparse algorithm example These are the joins of CQ ⋈ PQ: C1 ⋈ P1, SCORE: 0.66 C2 ⋈ P2, SCORE: 0.66 C3 ⋈ P2, SCORE: 1.17
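These numbers are consistent with the size-normalized sum assumed earlier (rounded to two decimals), e.g.:

Score(C3 \bowtie P2, Q) = \frac{1.33 + 1}{2} \approx 1.17, \qquad Score(C1 \bowtie P1, Q) = \frac{0.33 + 1}{2} \approx 0.66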
