This presentation discusses a novel approach to ranking target objects by leveraging relationships between documents in collections. It outlines a system that employs scoring functions, SQL implementation, and an early termination strategy to efficiently compute the top-K target objects. By focusing on high-scoring documents, the method aims to minimize retrieval costs while maximizing relevance. Experimental results demonstrate the effectiveness of the proposed algorithm in producing accurate rankings swiftly, making it highly applicable for real-time information retrieval tasks.
Ranking Objects by Exploiting Relationships: Computing Top-K over Aggregation • Kaushik Chakrabarti, Venkatesh Ganti, Jiawei Han, Dong Xin • Presented by: Vaidergorn Eitan
Outline • Introduction • System Overview • Scoring Functions • SQL implementation • Early Termination Approach • Experiments • Conclusions
Introduction • In more and more document collections, the documents relate to objects. • Example: a laptop review site, where each review relates to a laptop.
Introduction • OF (Object Finder) queries • Example OF query over laptop reviews: I need the best “lightweight” and “business use” laptop.
Introduction • The goal: • Get the top K target objects. • Exploit the relationships between documents and objects. • Exploit the fact that only K results are needed.
Introduction • Search Objects (SOs): the documents • Target Objects (TOs): the objects the documents relate to
System Overview • FTS (Full Text Search): • Input: keyword(s). • Output: ranked lists of documents.
System Overview • FTS (Full Text Search): • Most relational DBMS now support FTS functionality.
System Overview • DBMS tables: • T: the target objects • R: the relationships between documents and TOs • T is used only for the final lookup of the TO values.
Scoring Functions • The OF evaluation system returns the top K target objects that have the best scores according to a scoring function.
Scoring Functions • W = {w1, w2, …, wN}: the keywords in the OF query. • Li: the ranked list of <document id, DocScore> pairs for keyword wi, sorted by DocScore. • Dt: the list of documents related to the TO t.
Scoring Functions • Score matrix Mt, for each t in TOs • Score(t): the relevance score for the TO t. • Two ways to compute Score(t) from Mt: • compute the row scores first • compute the column scores first
Scoring Functions Row-marginal Class:
Scoring Functions Column-marginal Class:
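The formulas for the two classes did not survive on these slides, so the following is only a sketch under an assumption: that the row-marginal class combines each document's keyword scores with Fcomb and then aggregates over documents with Fagg, while the column-marginal class aggregates each keyword's scores with Fagg first and then combines with Fcomb. The names `row_marginal`, `column_marginal` and the matrix `m_t` are illustrative, not from the presentation.

```python
from typing import Callable, Sequence

# Assumed layout of the score matrix Mt: one row per document in Dt,
# one column per keyword w1..wN; each cell holds a DocScore.
Matrix = Sequence[Sequence[float]]
Fn = Callable[[Sequence[float]], float]


def row_marginal(m: Matrix, f_comb: Fn, f_agg: Fn) -> float:
    """Assumed row-marginal score: Fcomb per row, then Fagg over the rows."""
    return f_agg([f_comb(row) for row in m])


def column_marginal(m: Matrix, f_comb: Fn, f_agg: Fn) -> float:
    """Assumed column-marginal score: Fagg per column, then Fcomb over columns."""
    n = len(m[0]) if m else 0
    return f_comb([f_agg([row[j] for row in m]) for j in range(n)])


# Hypothetical matrix: two documents, two keywords.
m_t = [[1.0, 0.1],
       [0.1, 1.0]]

if __name__ == "__main__":
    # With MIN as Fcomb and SUM as Fagg the two classes disagree.
    print(row_marginal(m_t, min, sum))
    print(column_marginal(m_t, min, sum))
```

When Fcomb and Fagg are both SUM, the two orders give the same total, which is why the choice of class only matters for other function pairs such as MIN with SUM.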
Scoring Functions • Fcomb is monotonic: Fcomb(x1,…,xn) ≤ Fcomb(y1,…,yn) when xi ≤ yi for all i. • Fagg is subset monotonic: Fagg(S) ≤ Fagg(S’) if S ⊆ S’. • Fagg distributes over append: Fagg(R1 append R2) = Fagg(Fagg(R1), Fagg(R2)), where append is the ordered concatenation of tuples.
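As a quick illustration, the three properties can be spot-checked on sample inputs. The helper names and sample values below are hypothetical, and a passing check on a few inputs does not prove the property in general; scores are assumed to be non-negative.

```python
import math
from typing import Callable, Sequence

Fn = Callable[[Sequence[float]], float]


def comb_is_monotonic_on(f_comb: Fn, xs: Sequence[float], ys: Sequence[float]) -> bool:
    """Check Fcomb(xs) <= Fcomb(ys) for one pair with xs[i] <= ys[i] for all i."""
    assert all(x <= y for x, y in zip(xs, ys)), "precondition: xs <= ys pointwise"
    return f_comb(xs) <= f_comb(ys)


def agg_is_subset_monotonic_on(f_agg: Fn, s: Sequence[float], extra: Sequence[float]) -> bool:
    """Check Fagg(S) <= Fagg(S') where S' is S plus some extra scores."""
    return f_agg(list(s)) <= f_agg(list(s) + list(extra))


def agg_distributes_over_append_on(f_agg: Fn, r1: Sequence[float], r2: Sequence[float]) -> bool:
    """Check Fagg(R1 append R2) == Fagg(Fagg(R1), Fagg(R2)) for one pair."""
    left = f_agg(list(r1) + list(r2))
    right = f_agg([f_agg(list(r1)), f_agg(list(r2))])
    return math.isclose(left, right)


if __name__ == "__main__":
    # SUM and MAX both pass these spot checks on non-negative scores.
    for f in (sum, max):
        print(f.__name__,
              comb_is_monotonic_on(f, [0.1, 0.2], [0.3, 0.2]),
              agg_is_subset_monotonic_on(f, [0.9, 0.4], [0.3]),
              agg_distributes_over_append_on(f, [0.9, 0.4], [0.3]))
```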
Early Termination Approach • Intuition: top scoring documents typically contribute the most to the scores of high scoring TOs. • The TOs related to these top scoring documents are most likely to be the best candidate matches. • We progressively retrieve documents in decreasing order of their scores, and maintain upper and lower bound scores for the related TOs.
Early Termination Approach • Generate-only Approach: • Relies on the bounds alone. • Stops once the best K TOs are identified. • Generate-Prune Approach: • Candidate generation phase. • The stopping condition is more relaxed. • Pruning phase.
Candidate Generation • Ci: the chunk of documents retrieved from Li at a time; we retrieve in chunks. • Prefix(Li): the documents retrieved so far from the ranked list Li. • SeenTOs: the TOs seen so far, with their current aggregation scores. • AggResulti: for each Li, a table containing • numSeen • aggScore • upper bound and lower bound scores.
Candidate Generation • The Algorithm has 5 steps:
Candidate Generation • Step1 - Retrieve Documents: • We retrieve the next chunk Ci from each Li. • Retrieving in chunks reduces the number of join queries (with R).
Candidate Generation • Step2 - Update SeenTOs: [Figure: Prefix(L1) and Prefix(L2) with the corresponding AggResult(1) and AggResult(2) tables]
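A minimal in-memory sketch of Step 2, assuming SUM as Fagg: the newly retrieved chunk is joined with R, grouped by TO, and the per-list numSeen and aggScore in SeenTOs are updated. In the presented system this join runs inside the DBMS; the plain-Python data layout and the function names here are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def agg_result(chunk: List[Tuple[str, float]],
               r: List[Tuple[str, str]]) -> Dict[str, Tuple[int, float]]:
    """Join a chunk of (doc_id, DocScore) with R = (doc_id, to_id) and group by TO.

    Returns to_id -> (numSeen, aggScore) for this chunk, with SUM as Fagg.
    """
    tos_by_doc: Dict[str, List[str]] = defaultdict(list)
    for doc_id, to_id in r:
        tos_by_doc[doc_id].append(to_id)

    result: Dict[str, Tuple[int, float]] = {}
    for doc_id, score in chunk:
        for to_id in tos_by_doc.get(doc_id, []):
            num, agg = result.get(to_id, (0, 0.0))
            result[to_id] = (num + 1, agg + score)
    return result


def update_seen_tos(seen: Dict[str, dict], i: int,
                    result: Dict[str, Tuple[int, float]], n: int) -> None:
    """Merge the AggResult of list i (0-based) into SeenTOs, in place."""
    for to_id, (num, agg) in result.items():
        entry = seen.setdefault(to_id, {"numSeen": [0] * n, "aggScore": [0.0] * n})
        entry["numSeen"][i] += num
        entry["aggScore"][i] += agg


# Hypothetical data: three document-TO relationships and one chunk from L1.
R = [("d1", "t1"), ("d1", "t2"), ("d2", "t1")]
chunk_l1 = [("d1", 1.0), ("d2", 0.5)]

seen_tos: Dict[str, dict] = {}
update_seen_tos(seen_tos, 0, agg_result(chunk_l1, R), n=2)
```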
Candidate Generation • Step3 - Compute bounds: • t.lb = Fcomb(t.aggScore[1], …, t.aggScore[N]).
Candidate Generation • B: the maximum number of documents in any ranked list Li that can contribute to the score of any target object t. • xi: the DocScore of the last document retrieved from Li. • t.ub[i] = Fagg(t.aggScore[i], Fagg(xi, xi, …, xi)), with xi repeated (B − t.numSeen[i]) times. • t.ub = Fcomb(t.ub[1], …, t.ub[N]). • Example: t1.ub[1] = 1.0 + 1.0*(2−1) = 2; t2.ub[1] = 1.0 + 1.0*(2−1) = 2.
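The bound formulas can be sketched as follows, assuming SUM for both Fcomb and Fagg (the setting used in the experiments). The function names and list-based layout are illustrative; the example values reproduce t1.ub[1] from the slide.

```python
from typing import List, Tuple


def lower_bound(agg_score: List[float]) -> float:
    """t.lb = Fcomb(t.aggScore[1..N]) with Fcomb = SUM."""
    return sum(agg_score)


def upper_bound(agg_score: List[float], num_seen: List[int],
                last_scores: List[float], b: int) -> Tuple[float, List[float]]:
    """Return (t.ub, [t.ub[1..N]]) with Fcomb = Fagg = SUM.

    For list i, every one of the (B - numSeen[i]) documents not yet seen
    is assumed to score at most x_i, the last DocScore retrieved from L_i.
    """
    per_list = []
    for agg, seen, x in zip(agg_score, num_seen, last_scores):
        unseen_slots = max(b - seen, 0)
        per_list.append(sum([agg, sum([x] * unseen_slots)]))
    return sum(per_list), per_list


if __name__ == "__main__":
    # Slide example for one list: aggScore = 1.0, numSeen = 1, x1 = 1.0, B = 2.
    ub, per_list = upper_bound([1.0], [1], [1.0], b=2)
    print(per_list)  # one entry: 1.0 + 1.0 * (2 - 1)
```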
Candidate Generation • Step4 - Stopping Condition: we can stop when there are at least K objects in SeenTOs whose lower bound scores are at least as high as the upper bound score of any unseen TO. • UnseenUB = Fcomb(Fagg(x1, x1, …, x1), …, Fagg(xN, xN, …, xN)), with each xi repeated B times. • So the stopping criterion is: LBK ≥ UnseenUB. • LBK: the Kth highest lower bound.
Candidate Generation • Example (figure omitted): x1 = 0.2; x2 = 0.3
Candidate Generation • LBK ≥ UnseenUB • UnseenUB = ((0.2+0.2)+(0.3+0.3)) = 1 • LB3 = 1.1, so the stopping condition holds.
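The stopping test can be sketched like this, again assuming SUM for Fcomb and Fagg. B = 2 is inferred from the arithmetic in the example (each xi appears twice), and the list of lower bounds is hypothetical apart from LB3 = 1.1.

```python
from typing import List


def unseen_ub(last_scores: List[float], b: int) -> float:
    """UnseenUB = Fcomb(Fagg(x1 repeated B times), ..., Fagg(xN repeated B times))."""
    return sum(sum([x] * b) for x in last_scores)


def kth_highest(values: List[float], k: int) -> float:
    """Return the Kth highest value (K is 1-based)."""
    return sorted(values, reverse=True)[k - 1]


def can_stop(lower_bounds: List[float], last_scores: List[float], b: int, k: int) -> bool:
    """Stopping criterion LB_K >= UnseenUB; never stop with fewer than K seen TOs."""
    if len(lower_bounds) < k:
        return False
    return kth_highest(lower_bounds, k) >= unseen_ub(last_scores, b)


if __name__ == "__main__":
    # x1 = 0.2, x2 = 0.3, B = 2 (inferred); lower bounds are hypothetical
    # except that the third highest is 1.1, as on the slide.
    print(unseen_ub([0.2, 0.3], b=2))
    print(can_stop([2.0, 1.2, 1.1, 0.9], [0.2, 0.3], b=2, k=3))
```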
Candidate Generation • Step5 - Identify candidates: • Top(List, X): the top X elements in the list. • The set of candidates is defined by Top(UB, h). • h: the least value which satisfies LBK ≥ UB(h+1), where UB(h+1) is the (h+1)th highest upper bound.
Candidate Generation • LBK ≥ UB(h+1) • LB3 = 1.1 • LB3 ≥ UB(4+1) => h = 4 • Top(LB,3) = {t1, t3, t4}; Top(UB,4) = {t1, t2, t3, t4} • Top(UB,h) = {t1, t2, t3, t4}
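A sketch of the candidate selection, under two assumptions that are not stated on the slide: the search for h starts at K so that at least K candidates are kept, and if no (h+1)th seen TO exists the condition is treated as satisfied. The lower bounds and the fifth object t5 are hypothetical values chosen to be consistent with the slide (LB3 = 1.1, Top(LB,3) = {t1, t3, t4}, h = 4).

```python
from typing import Dict, List


def identify_candidates(lb: Dict[str, float], ub: Dict[str, float], k: int) -> List[str]:
    """Return Top(UB, h) for the least h >= K with LB_K >= UB_(h+1)."""
    lbk = sorted(lb.values(), reverse=True)[k - 1]
    by_ub = sorted(ub, key=ub.get, reverse=True)  # TO ids, highest upper bound first
    h = k
    # by_ub[h] (0-based) is the (h+1)th highest upper bound.
    while h < len(by_ub) and lbk < ub[by_ub[h]]:
        h += 1
    return by_ub[:h]


# Upper bounds for t1..t4 are from the slides; everything else is hypothetical.
ub_scores = {"t1": 2.5, "t2": 1.8, "t3": 1.6, "t4": 1.6, "t5": 1.0}
lb_scores = {"t1": 2.0, "t2": 0.9, "t3": 1.2, "t4": 1.1, "t5": 0.5}

if __name__ == "__main__":
    print(identify_candidates(lb_scores, ub_scores, k=3))
```

Here t2 is not among the top 3 by lower bound, yet it stays a candidate because its upper bound of 1.8 still exceeds LB3.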
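Pruning to the Final Top-K • K = 3 • Upper bounds of the candidates: UB = {t1(2.5), t2(1.8), t3(1.6), t4(1.6)} • Compute the exact score of each candidate, e.g. Score(t1) = ((1+0.1)+(0.1+1)) = 2.2 • Exact scores: t1 = 2.2, t2 = 1.6, t3 = 1.6, t4 = 1.6 • UB = {t1(2.2), t2(1.6), t3(1.6), t4(1.6)} • The final top-K results are {t1, t2, t3}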
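The pruning phase can be sketched as: compute the exact score of every candidate, then keep the K best. How exact scores are fetched is not shown on the slide, so `exact_score` is a hypothetical callback; breaking ties by candidate order is also an assumption, chosen because it reproduces the slide's result.

```python
from typing import Callable, List


def prune(candidates: List[str], exact_score: Callable[[str], float], k: int) -> List[str]:
    """Return the K candidates with the highest exact scores.

    Python's sort is stable, so candidates with equal scores keep their
    original order (an assumed tie-breaking rule).
    """
    scored = [(exact_score(t), t) for t in candidates]
    scored.sort(key=lambda pair: -pair[0])
    return [t for _, t in scored[:k]]


# Exact scores from the slide.
exact = {"t1": 2.2, "t2": 1.6, "t3": 1.6, "t4": 1.6}

if __name__ == "__main__":
    print(prune(["t1", "t2", "t3", "t4"], exact.__getitem__, k=3))
```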
Exact Top-K with Approximate Scores • Example in the figure: K = 2 • Crossing object: a target object whose rank in LB is more than K and whose rank in UB is K or less. • Boundary objects: a pair of target objects (A, B) such that: • The top K in UB and LB are the same. • A is the Kth object in LB and the uth object in UB (u ≤ K). • B is the (K+1)th object in UB and the lth object in LB (l ≥ K+1). • LBK ≤ UB(K+1)
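The crossing-object definition translates directly into a rank comparison. The sketch below uses hypothetical bounds for three objects with K = 2; the function names are illustrative.

```python
from typing import Dict, List


def ranks(bounds: Dict[str, float]) -> Dict[str, int]:
    """Map each TO id to its 1-based rank when sorted by bound, highest first."""
    ordered = sorted(bounds, key=bounds.get, reverse=True)
    return {t: position for position, t in enumerate(ordered, start=1)}


def crossing_objects(lb: Dict[str, float], ub: Dict[str, float], k: int) -> List[str]:
    """Objects whose rank in LB is more than K and whose rank in UB is K or less."""
    lb_rank, ub_rank = ranks(lb), ranks(ub)
    return [t for t in lb if lb_rank[t] > k and ub_rank[t] <= k]


# Hypothetical bounds: C ranks 3rd by lower bound but 2nd by upper bound.
lb_example = {"A": 1.5, "B": 1.0, "C": 0.8}
ub_example = {"A": 2.0, "B": 1.2, "C": 1.6}

if __name__ == "__main__":
    print(crossing_objects(lb_example, ub_example, k=2))
```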
Experiment • Our documents comprise a collection of 714,192 news articles from 2003-2004 obtained from the MSNBC news portal. • We index those news articles in the SQL Server FTS engine. • We extract three types of named entities: PersonNames, OrganizationNames, and LocationNames.
Experiment • To get realistic OF queries, we picked the top 10 sports news queries on Google in 2004.
Experiment • “PersonNames” is the desired entity type for all the queries. All our measurements are averaged across the 10 queries. • We implemented all 3 approaches to evaluating OF queries: SQL implementation, GenPrune, and GenOnly. • SUM is used as both the combination function and the aggregation function.
Conclusions • We introduced the class of OF queries and defined its semantics. • We defined two broad classes of scoring functions, which exploit relationships between documents and objects, to compute the relevance score of the target objects for a given set of keywords. • We presented early termination techniques; our approach is 4-5 times faster than the SQL implementation.