Answering Approximate Queries over Autonomous Web Databases

Answering Approximate Queries over Autonomous Web Databases Xiangfu Meng, Z. M. Ma, and Li Yan College of Information Science and Engineering, Northeastern University Shenyang, 110004, China

Organization 1.Motivation 2.Approximate Query 3.Approximate Query Results Ranking 4.Algorithmic Solutions 5.Conclusion and Further Research

1 Motivation • Empty answers: When the query over the Web database is too selective, the answer may be empty or too little. In that case, it is desirable to have the option of relaxing the original query for presenting more relevant items that can meet user’s needs and preferences closely. • Approximate query results ranking: Users’ preferences are usually used to be a hint for ranking. In real applications, however, the preferences are often associated to specified contexts, and the ranking accuracy is affected greatly by the interest degrees of pair-wise preferences.

2.1 Definition of Approximate query 2 Approximate Query Definition 1. (approximate query) Consider an autonomous Web database R with categorical and numerical attributes A = {A1, A2,…, Am} and a query Q over R with a conjunctive selection condition of the form Q = ∧i{1,…, k}(Ai θ ai), where km and θ  {>, <, =, , , , between}. Note that, if is the operator between and ai is an interval which is represented by [ai1, ai2], Ai θ ai has the form of “Ai between ai1 AND ai2”.Each Ai in the query condition is an attribute from A and ai is a value (or interval) in its domain. By relaxing Q, an approximate query Q which is used to find all tuples of R that show similarity to Q above a threshold Tsim(0, 1) is obtained. Specifically, Q(R) = {t | tR, Similarity (Q, t) > Tsim}.

2.2 Attribute Weight Assignment • Importance of Specified Categorical Attributes • For a point query “Ai = v”, we define IDFi(v) as log(n/Fi(v)) which represents the importance of attribute value v in the database, where n is the number of tuples in the database and Fi(v) is the frequency of tuples in the database of Ai = v. In this paper, the importance of specified categorical attribute value is treated as • the importance of its corresponding attribute.

2.2 Attribute Weight Assignment • Importance of Specified Numerical Attributes • Let {v1, v2, …, vn} be the values of attribute A that occur in the database. For specified attribute value v in the query, it defined IDF(v) as shown in Equation (1) where h is the bandwidth parameter. • (1) • A popular estimate for the bandwidth is h = 1.06n−1/5, where  is the standard deviation of {v1, v2, …, vn}. Intuitively, the denominator in Equation represents the sum of “contributions” to v from every the other point vi in the database. These contributions are modeled as (scaled) Gaussian distributions, so that the further v is from vi, the smaller is the contribution from vi. The importance of specified • numerical attribute value is also treated as the importance of its corresponding attribute.

2.2 Attribute Weight Assignment Moreover, if query condition is generalized as “AiINQi”, where Qi is a set of values for categorical attributes, or a range [lb, ub] for numeric attributes, We define the maximum log(n/Fi(v)) of each different value v in Qi. The generalized importance measuring function is shown in Equation (2). (2) By normalized processing, the weight wi of attribute Ai specified by the query can be calculated by (3) in which, k is the number of attributes specified by the query.

2.3 Attribute Values Similarity Assessment • Categorical Attribute Values • The similarity between two AV-pairs can be measured as the similarity shown by their supertuples. The supertuples contain sets of keywords for each attribute in the relation and the Jaccard Coefficient is used to determine the similarity between two supertuples. Thus, in this paper the similarity coefficient between two categorical values is then calculated as a sum of the Set • similarity on each attribute, • (4) • where C1, C2 are supertuples with m attributes, A is the Set corresponding to the i-th attribute, J(,) is the Jaccard Coefficient • and is computed as J(A, B) = |AB|/|AB|.

2.3 Attribute Values Similarity Assessment • Numerical Attribute Values • Because of the continuity of numerical data, we propose an approach to estimate the similarity coefficient between two numerical values. Let {v1, v2, …, vn} be the values of numerical attribute A occurring in the database. Then the similarity coefficient NSim(v, q) between v and q can be defined by Equation (5), where h is the same as mentioned in Equation (1). • (5) • For a numerical condition Ai = q of Q, let i be a sub-threshold for Ai, according to Equation (5), we can then get the relaxation • range of numerical attribute Ai as follows: • (6)

2.4 Query Relaxation Algorithm For each condition Ci in the original query Q, by extracting values of its corresponding attribute Ai having similarity above the sub-threshold i and adding them into its query range, we can get the relaxed condition Ci. By join all the relaxed conditions, the approximate query Q is formed. Algorithm 1 The query rewriting algorithm Input: Original query Q= {C1, …,Ck}, sub-threshold {1,…, k}. Output: An approximate query Q = {C1, …, Ck}. 1. For i = 1, …, k 2. Ci ← Ci 3. If Ai is a categorical attribute 4. v Dom(Ai) 5. If VSim(ai, v) > i 6. Add v into the query range of Ci 7. End If 8. End If 9. If Aiis a numerical attribute 10. Replace query range of Ci with [q–h, q+h] 11. End If 12. Q = Q  Ci 13. End For 14. Return Q

3.1 Definition of Contextual Preferences 3 Approximate Query Results Ranking Consider a database relation r with n tuples r = {t1,…, tn} with schema R(A1,…, Ak). Let Dom(Ai) is the active domainof attribute Ai. Definition (contextual preferences) Contextual preferences are of the form {Ai = ai1Ai = ai2 , d | X}, where X is ∧jl(Aj θ aj), with ai1, ai2 Dom(Ai), l {1,…, k}, θ  {>, <, =, , , , between, in} and aj  Dom(Aj), d is the interest degree of preference (i.e., compared to ai2 , the interest degree of ai1 is d, where 0.5  d  1, and it can be learned from the past data or the users feedback). The left-hand side of a preference specifies the choice and the interest degree while the right-hand side is the context.

For collecting the contextual preferences with interest degrees, we employed the association-rules mining to execute on the workload of CarDB. We say that {A1 = a A1 = b, d | X } if conf( Xa )>conf(Xb), where conf(Xa) is the confidence of the association rule Xa in the workload, i.e., And then, the interest degree d can be obtained by:

3.2 Contextual Preferences Processing For any single preference p and any pair of tuples (ti, tj), p either prefers ti to tj (denoted by ti  ptj) or tj to ti (denoted by tjpti) or it is inapplicable with respect to ti and tj (denoted by ti~ ptj). Thus, every preference p defines an interest degree of preference (dpref) over any pair of tuples t, t’ that evaluates as follows:

3.3 Problem of Query Results Ranking Problem 1. (approximate query results ranking problem): Assume a set of preferences P ={ ,…, } and an approximate query Q. The approximate query results ranking problem asks for an order of of Q(R) such that where, The objective of the problem is to find the order over the set of tuples in Q(R) that agrees as much as possible with the input preferences. Additionally, the degree of agreement with a class of preferences is weighted by the similarity between the contexts of those preferences and the approximate query Q.

The similarity between context X and approximate query Q using their vector representations as follows: where, the vector representation of an approximate query Q is a vector VQ of size N. The i-th element of the vector corresponds to pair D[i]. If D[i] satisfies one of the conditions of Q, then VQ[i] can be computed as follows:

3.4 Approach of Query Results Ranking Proc 1 (creating orders): For each preference class PXicreate a order of the tuples in R such that The output of this procedure is a set of mcontext, order pairs of the form Xi, , where Xi is the context for the preferences in PXi and is the order of the tuples in R that (approximately) satisfies the following Equation. The score of tuple t in that corresponds to Xi is: where (t) represents the position of t in order .

Proc 2 (clustering orders): In order to reduce the number of context, order pairs, we needto find representative orders ,…, for m initial pairs Xi, , where, l < m. These orders partition the space of m initial context, order pairs into l groups. Each group i is characterized by order and a disjunction of contexts such that for each order is a representative order for the initial order . The score of tuple t in   is given by: where (t) represents the position of t in order .

Proc 3 (ranking the top-k answers) For an approximate query Q over relation R, using the output of step 2, compute the set Qk(R) Q(R) R with | Qk(R)| = k, such that tQk(R) and t’{R- Qk(R)} it holds that score(t, Q) > score(t’, Q), with . score(t, Q) =

4Algorithmic Solutions 4.1 Creating Orders Input:r ={t1,…, tn}, a set of preferences from a class Output: A pair X,  where is an order of the tuples in r such that as many of the preferences in are satisfied. 1. S = {t1,…,tn}, rank = 0 2. For all i{1,…, n} Do 3. 4. End For 5. While S  Do 6. rank + = 1, 7. (tv) = rank, S = S – { tv } 8. For all tS Do 9. p(t) = p(t) –wX (ttv) 10. End For 11. End While

4.2 Finding Representative Orders • Input:Tm = { ,…, }, U ={ ,  | Tm, Tm}, l • Output: A set of representative l orders Tl = { , ... , } • 1. Let B = {} be a buffer that can hold m ,  • 2. While Tm≠  and l > 0 Do • 3. B ←  • 4. For each Tm Do • Pick si =  ,  with minimum rsi from Ui = { ,  | Tm, | | = [2, • |Tm| - l + 1]} • 6. B ←B +{si} • 7. End For • 8. Pick s =  ,  with minimum rs from B • 9. Tm← Tm – – { }, Tl ←Tl+ s, l← l – 1 • 10. End While • 11. Return Tl

4.3 Ranking top-k Answers Input:Tl = { ,…, },  {X1,…,Xm}, an approximate query Q. Output: top-k answer tuples. 1. B ={} be a buffer that can hold k tuples ordered by score 2. L be an l size array storing the last score from each order 3. Repeat 4. For all i {1,…,l} Do 5. Retrieve next tuple t from 6. Compute as the score of t 7. Update L with score of t in 8. If tQ (R) 9. Get score of t from other orders { | Tland ji} via random access 10. score(t, Q) summing up of all the retrieved scores 11. Insert t, score(t, Q) in the correct position in B 12. End If 13. End For 14. Until 15. Return B

Conclusions and Outlook • Proposed an approximate query method for finding relevant answer items; • Given a formal definition of the contextual preference with interested degree; • Proposed an approximate query results ranking approach based on contextual preferences; • it would be interesting to research how to minimize the updating cost when the database and preferences are varied.

Answering Approximate Queries over Autonomous Web Databases

Answering Approximate Queries over Autonomous Web Databases

Presentation Transcript

Query Processing over Incomplete Autonomous Web Databases

Query Processing over Incomplete Autonomous Databases

Answering Imprecise Queries over Autonomous Web Databases

Evaluating Top-k Queries Over Web-Accessible Databases

CONGRESSIONAL SAMPLES FOR APPROXIMATE ANSWERING OF GROUP BY QUERIES

A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries

A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries

A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries

Answering Queries: Problems

Query Processing over Incomplete Autonomous Databases

Answering Approximate Queries Efficiently

Answering Imprecise Queries over Web Databases

Answering Relationship Queries on the Web

Completeness of Queries over Incomplete Databases

Approximate Selection Queries over Imprecise Data

Answering Queries Using Views

Evaluating top-k Queries over Web-Accessible Databases

Evaluating Top-k Queries over Web-Accessible Databases

A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries

Congressional Samples for Approximate Answering of Group-By Queries

Answering Approximate Queries Efficiently

Answering Imprecise Queries over Web Databases