Efficient Top-K Object Ranking via Document Relationships and Early Termination Techniques

Ranking Objects by Exploiting Relationships: Computing Top-K over Aggregation • Kaushik Chakrabarti, Venkatesh Ganti, Jiawei Han, Dong Xin • Presented by: Vaidergorn Eitan

Outline • Introduction • System Overview • Scoring Functions • SQL implementation • Early Termination Approach • Experiments • Conclusions

Introduction • In more and more document collections, the documents relate to objects. • Example: a laptop reviews site, where the reviews relate to laptops.

Introduction • OF (Object Finder) Queries • Example OF query over laptop reviews: I need the best “lightweight” & “business use” laptop.

Introduction • The goal: • Get the top K objects. • Exploit the relationships between documents and objects. • Exploit the fact that we need only K.

Introduction • Search Objects - SOs - Documents • Target Objects - TOs

System Overview • FTS (Full Text Search): • Input: Keyword/s. • Output: Ranked lists of documents

System Overview • FTS (Full Text Search): • Most relational DBMS now support FTS functionality.

System Overview • DBMS: • T • R • T is used only for the final lookup of the TO values.

Scoring Functions • The OF evaluation system returns the top K target objects that have the best scores according to the scoring function.

Scoring Functions • W={w1,w2,…,wN} • keywords in the OF query. • Li • ranked sorted list • <document id, DocScore> • Dt • list of documents related to

Scoring Functions • Score matrix Mt – for each t in TOs • Score(t) - the relevance score for the TO t. • compute rows score • compute cols score

Scoring Functions Row-marginal Class:

Scoring Functions Column-marginal Class:
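The formulas for the two classes did not survive on these slides. The sketch below assumes the usual reading of the score matrix Mt: one row per document in Dt, one column per keyword, each entry being the DocScore of that document for that keyword. Under that assumption, the row-marginal class combines each row with Fcomb and then aggregates the row scores with Fagg, while the column-marginal class aggregates each column with Fagg and then combines the column scores with Fcomb. The function names and the sample matrix are illustrative only, and MIN/SUM are just example choices.

```python
# Assumption: rows of the matrix are documents in Dt, columns are keywords.

def row_marginal(matrix, f_comb, f_agg):
    """Combine each row (one document) first, then aggregate the row scores."""
    return f_agg([f_comb(row) for row in matrix])

def column_marginal(matrix, f_comb, f_agg):
    """Aggregate each column (one keyword) first, then combine the column scores."""
    columns = list(zip(*matrix))
    return f_comb([f_agg(col) for col in columns])

# Hypothetical score matrix for one target object: 2 documents x 2 keywords.
m_t = [[1.0, 0.1],
       [0.1, 1.0]]

# With SUM for both functions the two classes coincide.
row_sum = row_marginal(m_t, f_comb=sum, f_agg=sum)
col_sum = column_marginal(m_t, f_comb=sum, f_agg=sum)

# With Fcomb = MIN and Fagg = SUM they differ.
row_min = row_marginal(m_t, f_comb=min, f_agg=sum)     # min per document, then sum
col_min = column_marginal(m_t, f_comb=min, f_agg=sum)  # sum per keyword, then min
```

With MIN as Fcomb, the row-marginal score only rewards documents that match every keyword, while the column-marginal score rewards an object whose documents cover the keywords collectively.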

Scoring Functions • Fcomb is monotonic: Fcomb(x1,…,xn) ≤ Fcomb(y1,…,yn) when xi ≤ yi for all i. • Fagg is subset monotonic: Fagg(S) ≤ Fagg(S’) if S ⊆ S’. • Fagg distributes over append: Fagg(R1 append R2) = Fagg(Fagg(R1), Fagg(R2)), where append is the ordered concatenation of tuples.
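As a quick sanity check, the three properties can be tested for SUM, which the experiments later in this talk use as both Fcomb and Fagg. The sketch assumes non-negative document scores (subset monotonicity of SUM depends on that); the helper names and sample values are hypothetical.

```python
def f_comb(xs):
    return sum(xs)

def f_agg(xs):
    return sum(xs)

# Monotonic: x_i <= y_i for every i implies Fcomb(x) <= Fcomb(y).
x = [0.2, 0.5]
y = [0.3, 0.5]
monotonic_ok = all(a <= b for a, b in zip(x, y)) and f_comb(x) <= f_comb(y)

# Subset monotonic: adding (non-negative) scores never lowers Fagg.
s = [0.4, 0.1]
s_prime = [0.4, 0.1, 0.7]
subset_ok = f_agg(s) <= f_agg(s_prime)

# Distributes over append: aggregating partial aggregates gives the same result.
r1 = [0.4, 0.1]
r2 = [0.7]
distributes_ok = abs(f_agg(r1 + r2) - f_agg([f_agg(r1), f_agg(r2)])) < 1e-12
```

The third property is what allows partial aggregates to be kept per target object and extended as more documents arrive, instead of re-aggregating from scratch.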

SQL Implementation

Early Termination Approach • Intuition: top scoring documents typically contribute the most to the scores of high scoring TOs. • The TOs related to these top scoring documents are most likely to be the best candidate matches. • We progressively retrieve documents in the decreasing order of their scores, and maintain upper and lower bound scores for the related TOs.

Early Termination Approach • Generate-only Approach: • Relies on bounds. • Stops when the best K TOs have been identified. • Generate-Prune Approach: • Candidate generation phase. • The stopping condition is more relaxed. • Pruning phase.

Candidate Generation • Ci • We retrieve documents from Li in chunks. • Prefix(Li) • documents retrieved so far from the ranked list Li. • SeenTOs • current aggregation scores. • AggResulti - for each Li, a table containing • numSeen • aggScore • upper bound and lower bound scores.

Candidate Generation

Candidate Generation • The Algorithm has 5 steps:

Candidate Generation • Step1 - Retrieve Documents : • we retrieve the next Ci from each Li. • Reduce the number of join queries (with R).

Candidate Generation • Step2 - Update SeenTOs: • [Figure: Prefix(L1), Prefix(L2), AggResult(1), AggResult(2)]
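A minimal sketch of Steps 1 and 2, assuming SUM as Fagg and an in-memory dictionary standing in for the relationship table R. All names, document ids and scores here are hypothetical; a real system would run this as a join inside the DBMS.

```python
from collections import defaultdict

N = 2  # number of keywords, i.e. number of ranked lists L_1..L_N

# Hypothetical stand-in for R: document id -> related target objects.
R = {"d1": ["t1", "t2"], "d2": ["t1"], "d3": ["t3"]}

def new_entry():
    # One slot per ranked list (0-indexed here, 1-indexed on the slides).
    return {"numSeen": [0] * N, "aggScore": [0.0] * N}

seen_tos = defaultdict(new_entry)

def update_seen_tos(i, chunk):
    """Fold one chunk of (doc_id, DocScore) pairs from list L_i into SeenTOs."""
    for doc_id, doc_score in chunk:
        for t in R.get(doc_id, []):
            entry = seen_tos[t]
            entry["numSeen"][i] += 1
            entry["aggScore"][i] += doc_score  # Fagg = SUM (assumption)

# Step 1 retrieves the next chunk from each list; Step 2 folds it in.
update_seen_tos(0, [("d1", 1.0), ("d2", 0.6)])
update_seen_tos(1, [("d3", 0.9), ("d1", 0.3)])
```

Processing a whole chunk per call mirrors the slide's point about chunked retrieval: fewer, larger joins against R rather than one join per document.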

Candidate Generation • Step3 - Compute bounds: • t.lb = Fcomb(t.aggScore[1],…,t.aggScore[N]).

Candidate Generation • B: • maximum number of documents in any ranked list Li that can contribute to the score of any target object t. • xi: • DocScore of the last document retrieved from Li. • t.ub[i] = Fagg(t.aggScore[i], Fagg(xi,xi,…,xi)), where xi is repeated (B - t.numSeen[i]) times. • t.ub = Fcomb(t.ub[1],…,t.ub[N]). • Example: t1.ub[1] = 1.0+1.0*(2-1) = 2 • t2.ub[1] = 1.0+1.0*(2-1) = 2
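The bound computations of Step 3 can be sketched as follows, assuming SUM for both Fagg and Fcomb (the choice used in the experiments of this talk). The function names are illustrative and not taken from the paper; lists are 0-indexed here, so t.ub[1] on the slide corresponds to index 0.

```python
def f_agg(scores):
    return sum(scores)   # assumption: Fagg = SUM

def f_comb(scores):
    return sum(scores)   # assumption: Fcomb = SUM

def lower_bound(agg_score):
    """t.lb: combine only what has been seen so far."""
    return f_comb(agg_score)

def upper_bound_i(agg_score_i, num_seen_i, x_i, B):
    """t.ub[i]: assume every still-unseen document scores x_i, the last score read from L_i."""
    unseen = [x_i] * (B - num_seen_i)
    return f_agg([agg_score_i, f_agg(unseen)])

def upper_bound(agg_score, num_seen, x, B):
    """t.ub: combine the per-list upper bounds."""
    return f_comb([upper_bound_i(a, n, x_i, B)
                   for a, n, x_i in zip(agg_score, num_seen, x)])

# The slide's example: aggScore = 1.0, one document seen, x_1 = 1.0, B = 2.
t1_ub_1 = upper_bound_i(agg_score_i=1.0, num_seen_i=1, x_i=1.0, B=2)
```

Because each list is read in decreasing score order, no unseen document in Li can score above xi, which is why padding with xi gives a valid upper bound.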

Candidate Generation • Step4 - Stopping Condition: We can stop when there are at least K objects in SeenTOs whose lower bound scores are higher than the upper bound score of any unseen TO. • UnseenUB = Fcomb(Fagg(x1,x1,…),…,Fagg(xN,xN,…)), where each xi is repeated B times. • So the stopping criterion is: LBK ≥ UnseenUB • LBK: the Kth highest LB.

Candidate Generation • Example: x1 = 0.2; x2 = 0.3 • [Figure: Prefix(L1)]

Candidate Generation • LBK ≥ UnseenUB • UnseenUB = ((0.2+0.2)+(0.3+0.3)) = 1 • LB3 = 1.1 • LB3 ≥ UnseenUB, so we can stop.
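The stopping test of Step 4 can be sketched like this, again assuming SUM for Fagg and Fcomb. The values x1 = 0.2, x2 = 0.3, B = 2 and LB3 = 1.1 come from the slides; the other lower bounds and all function names are hypothetical.

```python
def f_agg(scores):
    return sum(scores)   # assumption: Fagg = SUM

def f_comb(scores):
    return sum(scores)   # assumption: Fcomb = SUM

def unseen_ub(x, B):
    """Best score a completely unseen TO could still reach: B documents at x_i in every list."""
    return f_comb([f_agg([x_i] * B) for x_i in x])

def kth_highest(values, K):
    return sorted(values, reverse=True)[K - 1]

def can_stop(lower_bounds, x, B, K):
    """Stopping criterion: LB_K >= UnseenUB, with at least K seen TOs."""
    if len(lower_bounds) < K:
        return False
    return kth_highest(lower_bounds, K) >= unseen_ub(x, B)

x = [0.2, 0.3]
B = 2
lower_bounds = [2.0, 0.8, 1.3, 1.1]  # hypothetical, chosen so that LB_3 = 1.1

bound = unseen_ub(x, B)                      # (0.2+0.2) + (0.3+0.3)
stop = can_stop(lower_bounds, x, B, K=3)
```

Asking for K = 4 with the same data would not stop yet, since the fourth lower bound (0.8) is still below UnseenUB.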

Candidate Generation • Step5 - Identify candidates: • Top(List,X) • the top X elements in the list. • The set of candidates is defined by Top(UB,h) • h: the least value which satisfies LBK ≥ UB(h+1), where UB(h+1) is the (h+1)th highest UB.

Candidate Generation • LBK ≥ UB(h+1) • LB3 = 1.1 • LB3 ≥ UB(4+1) => h = 4 • Top(LB,3) = {t1,t3,t4} • Top(UB,4) = {t1,t2,t3,t4} • Top(UB,h) = {t1,t2,t3,t4}
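Step 5 can be sketched as a scan over the objects sorted by upper bound. The upper bounds of t1..t4 and LB3 = 1.1 follow the slides; the fifth object t5, the individual lower bounds, and the function name are hypothetical. As a conservative assumption of this sketch, the scan starts at h = K so that at least K candidates are always kept.

```python
def identify_candidates(lb, ub, K):
    """Return Top(UB, h) for the least h >= K with LB_K >= UB_(h+1)."""
    lb_k = sorted(lb.values(), reverse=True)[K - 1]
    ranked = sorted(ub, key=ub.get, reverse=True)   # TOs by decreasing upper bound
    for h in range(K, len(ranked)):
        # ranked[h] is the (h+1)-th object in UB order (0-based indexing).
        if lb_k >= ub[ranked[h]]:
            return ranked[:h]
    return ranked  # every seen TO is still a candidate

ub = {"t1": 2.5, "t2": 1.8, "t3": 1.6, "t4": 1.6, "t5": 1.0}  # t5 is hypothetical
lb = {"t1": 2.0, "t2": 0.8, "t3": 1.3, "t4": 1.1, "t5": 0.5}  # hypothetical, LB_3 = 1.1

candidates = identify_candidates(lb, ub, K=3)
```

Any object outside Top(UB,h) has an upper bound no higher than LBK, so it cannot displace the K objects that already reach LBK.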

Pruning to the Final Top-K

Pruning to the Final Top-K • K=3 • UB={t1(2.5), t2(1.8), t3(1.6), t4(1.6)} • Exact score of t1: ((1+0.1)+(0.1+1))=2.2 • Exact scores: t1=2.2, t2=1.6, t3=1.6, t4=1.6 • UB={t1(2.2), t2(1.6), t3(1.6), t4(1.6)} • The final top-K results are {t1, t2, t3}
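A minimal sketch of the pruning phase: compute the exact score of every candidate and keep the best K. It assumes SUM for both Fagg and Fcomb. The document scores of t1 match the slide's arithmetic; those of t2..t4 are hypothetical values chosen to give 1.6, and the function name is illustrative.

```python
def exact_score(doc_scores_per_list):
    """Column-marginal score with SUM/SUM: aggregate each list, then combine."""
    total = sum(sum(scores) for scores in doc_scores_per_list)
    return round(total, 6)  # rounding keeps equal scores exactly equal despite float error

# For each candidate: the DocScores of all its related documents, per ranked list.
all_doc_scores = {
    "t1": [[1.0, 0.1], [0.1, 1.0]],   # (1+0.1) + (0.1+1), as on the slide
    "t2": [[0.8], [0.8]],             # hypothetical
    "t3": [[0.8, 0.3], [0.5]],        # hypothetical
    "t4": [[0.6], [1.0]],             # hypothetical
}

K = 3
candidates = ["t1", "t2", "t3", "t4"]
exact = {t: exact_score(all_doc_scores[t]) for t in candidates}

# sorted() is stable, so ties keep candidate order.
top_k = sorted(candidates, key=exact.get, reverse=True)[:K]
```

Since t2, t3 and t4 tie at 1.6, which two of them fill the remaining slots is a tie-breaking choice; this sketch keeps the candidate order, which yields the set shown on the slide.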

Exact Top-K with Approximate Scores • [Figure: example with K=2] • Crossing Object: a target object whose rank in LB is more than K and whose rank in UB is K or less. • Boundary Objects: a pair of target objects (A,B) such that: • The top K in UB and LB are the same. • A is the Kth object in LB and the uth object in UB (u ≤ K). • B is the (K+1)th object in UB and the lth object in LB (l ≥ K+1). • LBK ≤ UB(K+1)
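The two definitions can be checked mechanically from the LB and UB rankings. The sketch below is one possible reading of the slide; the function names and the sample bounds are hypothetical, and ranks are 1-based as on the slide.

```python
def ranking(scores):
    """TOs ordered by decreasing score."""
    return sorted(scores, key=scores.get, reverse=True)

def crossing_objects(lb, ub, K):
    """TOs ranked below K by lower bound but within the top K by upper bound."""
    lb_rank = {t: r for r, t in enumerate(ranking(lb), start=1)}
    ub_rank = {t: r for r, t in enumerate(ranking(ub), start=1)}
    return [t for t in lb if lb_rank[t] > K and ub_rank[t] <= K]

def boundary_pair(lb, ub, K):
    """Return (A, B) if they form a boundary pair, else None."""
    by_lb, by_ub = ranking(lb), ranking(ub)
    if len(by_lb) <= K or set(by_lb[:K]) != set(by_ub[:K]):
        return None               # the top K in UB and LB must be the same
    a = by_lb[K - 1]              # Kth object in LB (so within the top K of UB)
    b = by_ub[K]                  # (K+1)th object in UB (so outside the top K of LB)
    return (a, b) if lb[a] <= ub[b] else None

K = 2
# Hypothetical bounds with a crossing object: t3 is 3rd in LB but 2nd in UB.
lb1 = {"t1": 1.5, "t2": 1.2, "t3": 0.9}
ub1 = {"t1": 1.8, "t2": 1.3, "t3": 1.6}
crossing = crossing_objects(lb1, ub1, K)

# Hypothetical bounds with a boundary pair: same top 2, but LB_2 <= UB_3.
lb2 = {"a": 1.5, "c": 1.2, "b": 0.9}
ub2 = {"a": 1.8, "c": 2.0, "b": 1.4}
pair = boundary_pair(lb2, ub2, K)
```

In both sample cases the bounds alone leave the membership of the top K unsettled: the interval of an object outside the LB top K overlaps that of an object inside it.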

Experiment • Our documents comprise a collection of 714,192 news articles from 2003-2004, obtained from the MSNBC news portal. • We index those news articles in the SQL Server FTS engine. • We extract three types of named entities: PersonNames, OrganizationNames, and LocationNames.

Experiment • To get realistic OF queries, we picked the following top 10 sport news queries on Google in 2004.

Experiment • “PersonNames” is the desired entity type for all the queries. • All our measurements are averaged across the 10 queries. • We implemented all 3 approaches to evaluate OF queries: SQL implementation, GenPrune, GenOnly. • SUM is the combination function. • SUM is the aggregation function.

Experiment

Conclusions • We introduced the class of OF queries and defined its semantics. • We defined two broad classes of scoring functions, which exploit relationships between documents and objects, to compute the relevance score of the target objects for a given set of keywords. • We presented early termination techniques; our approach is 4-5 times faster than the SQL implementation.

Questions
