Reverse Top- k Queries

Reverse Top-k Queries Akrivi Vlachou*, Christos Doulkeridis*, Yannis Kotidis#, Kjetil Nørvåg* *Norwegian University of Science and Technology (NTNU), Trondheim, Norway #Athens University of Economics and Business (AUEB), Greece

Outline • Motivation & Preliminaries • Monochromatic Reverse Top-k Queries • Bichromatic Reverse Top-k Queries • Threshold-based Algorithm • Materialized Views • Experimental Evaluation • Conclusions & Future Work

Rank-aware Query Processing • Huge amount of available data • Users prefer to retrievea limited set of k ranked data objects that best match their preferences (top-k queries)

Top-k Query • Given a scoring function f(), retrieve the k object that best match the user preferences • Linear scoring function f w(p)= Σw[i]*p[i] • Weight w[i]: • relative importance of attribute i • Definition TOPk(w): Given a weighting vector w and a positive integer k, find the k data points p with the minimumf(p) scores Query line of w at point p: defines the score of p Query space of w defined by point p: number of enclosed points determines the rank of p

customer customer customer customer Reversing the Top-k Query • From the perspective of manufacturers: • it is important that a product is returned in the highest ranked positions for as many user preferences as possible • estimate the impact of a product compared to their competitors products • advertise a product to potential customers Which customers would be interested? sales representative

customer customer customer customer Reversing the Top-k Query • Reverse top-k query: Given a potential product q and a positive integer k, which are the weighting vectors w for which q is in the top-k query result set? • Two different versions Monochromatic: • no knowledge of user preferences Bichromatic: • a dataset with user preferences is given sales representative

Car Database Example • A database containing information about different cars • Different users have different preferences • Bob prefers a cheap car, and does not care much about the age • the best choice (top-1) for Bob is the car p1 with score 2.5 • Tom prefers a newer car rather than a cheap car • the best choice for Tom and Max is the car p2

Car Database Example Query point q=p2, k=1: • Bichromatic reverse top-k: {(0.2,0.8), (0.5,0.5)} • advertise product to Tom and Max • Monochromatic reverse top-k: line segment w[price]=[1/7,5/6] • estimate the impact of p2 as 69% Query point q=p3, k=1: empty result set for the bichromatic query

mRTOP1(q) Monochromatic Reverse Top-k Query • mRTOPk(q): Given a point q, a positive number k and a dataset S, the result set of the monochromatic reverse top-k query is the locus for which there exists p in TOPk(wi) such that fwi(q) ≤ fwi(p). • The solution spaceW can be split into a finite set of non-adjacent partitions such that query point q has the same rank for all the weighting vectors. • For the monochromatic case: we focus on the 2-d space 2 1 2 Solution space

Geometric Interpretation d=2, k =1 • If q belongs to the convex hull, then there exists exactly one partition in mRTOP1(q) • Weighting vectors that are perpendicular to pq and qr define the line segment • For weighting vectors with smaller and larger slopes than w1, the relative order of p and q changes • Monochromatic reverse top-k,k>1: • The solution space may contain more than 1 partition

Bichromatic Reverse Top-k Query • bRTOPk(q): Given a point q, a positive number k and two datasets S and W, where S represents data points and W is a dataset containing different weighting vectors, a weighting vector wibelongs to the result set, if and only if there exists p in TOPk(wi) such that fwi(q) ≤ fwi(p) • Naïve approach: • for each weighting vector process the top-k query • test if query point q is in the top-k list

Threshold-based Algorithm (RTA) • Goal: • reduce the number of top-k evaluations by discarding weighting vectors • Threshold-based Algorithm (RTA): • sort the weighting vectors based on pairwise similarity • top-k queries defined by similar vectors, have similar result sets • evaluate the first top-k query, calculate a threshold • For each weighting vector • possibly prune based on threshold • refine threshold

p2 p3 p1 p4 p6 p5 p8 p9 p10 p7 w3 w1 w2 Example of RTA Algorithm (k=2) Buffer: p1, p2 • Evaluate top-2 query for w1 • Set threshold based on w2 • fw2(q) > threshold  discard w2 • Refine threshold for w3 q W=[ w1, w2, w3 ]

Materialized Views • Threshold-based Algorithm (RTA) • reduce the top-k evaluations by discarding some weighting vectors that are not in the reverse top-k result set • process at least as many top-k evaluations as the cardinality of the result set • Materialized Views • find weighting vectors that belong definitely to the result without top-k evaluation

w1, w2, w3 Materialized Views • Grid-based space partitioning • cell Ci • lower left corner CiL • upper right corner CiU • We store for each cell Cithe results of reverse top-k queries for corners CiLand CiU

w1, w2, w3 w1, w2, w3 , w4 Materialized Views • Given a point q enclosed in cell Ci • all weighting vectors in RTOPk(CiU) belong to the result set of q • only weighting vectors in RTOPk(CiL) - RTOPk(CiU) have to be examined • Materialized views can be generalized for arbitrary k<Kvalues

Experimental Setup • Comparison between Naïve and RTA (varying dimensionality, cardinality, data distribution – real data) • Queries: uniform and k-skyband points • Metrics: • time • I/Os • number of top-k evaluations

RTA vs. Naïve uniform distribution of S and uniform weights W |S|=10K, |W|=10K, top-k=10, skyband query points • RTA outperforms naive by 1 to 2 orders of magnitude • as dimensionality increases, |RTOPk(q)| decreases leading to fewer top-k evaluations

naive requires |W| top-k query evaluations |W|=5K, correlated dataset: RTA needs on 544 out of 5000 top-k evaluations (saves 89.12% of the cost) the average size of the result set is 459 Scalability of RTA Algorithm various distributions (UN, AC, CO) of S and uniform weights W |S|=10K or|W|=10K, d=5, top-k=10, skyband query points

Performance of RTA on Real Data NBAconsists of 17265 tuples, d=5(number of points scored,rebounds, assists, steals and blocks) HOUSE consists of 127930 tuples, d=6 (income spent on gas, electricity, water, heating, insurance, and property tax) • uniform and clustered weights W (|W|=10K) • clustered weights lead to fewer top-k evaluations

Outline • Motivation & Preliminaries • Example of Reverse Top-k Queries • Monochromatic Reverse Top-k Queries • Bichromatic Reverse Top-k Queries • Threshold-based Algorithm • Materialized Views • Experimental Evaluation • Conclusions & Future Work

Conclusions and Future Work • We introduced reverse top-k queries • geometric interpretation of the solution space • efficient algorithm for bichromatic reverse top-k query • materialized reverse top-k views • Future Work • interpretation of solution space for higher dimensions (monochromatic reverse top-k) • improve the performance of the bichromatic reverse top-k computation

Thank you! Related work: Akrivi Vlachou, Christos Doulkeridis, Yannis Kotidis, Kjetil Nørvåg: "Reverse Top-k Queries" Akrivi Vlachou, Christos Doulkeridis, Kjetil Nørvåg, Yannis Kotidis: "Identifying the Most Influential Data Objects with Reverse Top-k Queries" More information: http://www.idi.ntnu.no/~vlachou/

Reverse Top- k Queries

Reverse Top- k Queries

Presentation Transcript

Evaluating Top- K Selection Queries

Evaluating Top-k Queries Over Web-Accessible Databases

Influence Zone: Efficiently Processing Reverse k Nearest Neighbors Queries

On Top-n Reverse Top-k Queries: Variants, Algorithms, and Applications

Answering Top-k Queries Using Views

Top-k Queries on Temporal Data

SLICE: Reviving Regions-Based Pruning for Reverse k Nearest Neighbors Queries

Top- k Queries on Uncertain Data

Fast Algorithms for Top-k Personalized PageRank Queries

Influence Zone: Efficiently Processing Reverse k Nearest Neighbors Queries

Answering Top-k Queries Using Views

6 Rank Aggregation and Top-k Queries

Supporting top-k join queries in relational databases

Sliding-window Top-k Queries on Uncertain Streams

Answering Top-k Queries Using Views

Cleaning Uncertain Data for Top-k Queries

Best Position Algorithms for Top-k Queries

Continuous Top-k Dominating Queries

Evaluating top-k Queries over Web-Accessible Databases

Evaluating Top-k Queries over Web-Accessible Databases

Supporting Top- k join Queries in Relational Databases

Count / Top-k Continuous Queries on P2P Networks