
The Effect of Database Size Distribution on Resource Selection Algorithms

Luo Si and Jamie Callan, School of Computer Science, Carnegie Mellon University (lsi@cs.cmu.edu, callan@cs.cmu.edu)


Presentation Transcript


  1. The Effect of Database Size Distribution on Resource Selection Algorithms Luo Si and Jamie Callan School of Computer Science Carnegie Mellon University lsi@cs.cmu.edu callan@cs.cmu.edu

  2. Abstract
     Task: Evaluate the performance of different resource selection algorithms in environments with different DB size distributions.
     • Extend the CORI resource selection algorithm
     • Extend the KL-divergence algorithm by using DB sizes as priors
     • Experiments on four testbeds with different characteristics show that ReDDE and the extended KL-divergence algorithm are more robust

  3. Previous Work: Resource Representation
     Resource Representation (Content Representation):
     • Query-Based Sampling (Callan, et al., 1999)
       • Submit randomly generated queries and analyze the returned docs
       • Does not need cooperation from individual DBs
     Resource Representation (Database Size Estimation):
     • Sample-Resample (Si and Callan, 2003)
       • Assumption: the search engine indicates the number of docs that match a one-term query
       • Strategy: estimate the df of a query term in the sampled docs and in the whole collection; scale the number of sampled docs to get the DB size
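The sample-resample strategy on slide 3 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `estimate_db_size`, the `probes` input format, and the choice to average the estimates from several probe terms are assumptions made here.

```python
def estimate_db_size(sample_size, probes):
    """Estimate a DB's size from one-term resample queries.

    sample_size: number of docs obtained by query-based sampling.
    probes: list of (df_in_sample, df_in_db) pairs, one per probe term, where
            df_in_db is the match count reported by the search engine.
    """
    estimates = [
        # If the sample is representative, df_samp / sample_size ~= df_db / N,
        # so N ~= df_db * sample_size / df_samp.
        df_db * sample_size / df_samp
        for df_samp, df_db in probes
        if df_samp > 0  # a term absent from the sample yields no estimate
    ]
    if not estimates:
        raise ValueError("no probe term occurs in the sample")
    # Averaging over probe terms is an assumption of this sketch.
    return sum(estimates) / len(estimates)
```

For example, with 300 sampled docs and a probe term that appears in 30 of them while the engine reports 1000 matches, the single-term estimate is 1000 * 300 / 30 docs.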

  4. Previous Work: Resource Selection & Results Merging
     Resource Selection:
     • gGlOSS (Gravano, et al., 1995)
       • Represent DBs and queries as vectors and calculate the similarities
     • Kullback-Leibler (KL) divergence (Xu and Croft, 1999)
       • Calculate the KL divergence between the word frequency distributions of the query and the DB
     • CORI (Callan, et al., 1995)
       • A Bayesian inference network model; has been shown to be effective on different testbeds
     Results Merging:
     • CORI results merging algorithm (Callan, et al., 1995)
     • Semi-Supervised Learning algorithm (Si and Callan, 2002)

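  5. Resource Selection Algorithms that Normalize DB Size: The Old Version of the CORI Algorithm
     The CORI algorithm uses a Bayesian inference network and an adaptation of the Okapi formula to rank resources. The belief of DB_i with respect to query term r_k is:
     $$T = \frac{df}{df + df\_base + df\_factor \cdot cw / avg\_cw}$$
     $$I = \frac{\log\left(\frac{|DB| + 0.5}{cf}\right)}{\log\left(|DB| + 1.0\right)}$$
     $$p(r_k \mid DB_i) = b + (1 - b) \cdot T \cdot I$$
     where df is the doc frequency of r_k in the (sampled) DB_i, cw is the length of the (sampled) DB_i, avg_cw is the average (sampled) DB length, cf is the DB frequency (the number of DBs that contain r_k), |DB| is the number of DBs, and b is the default belief.
     The belief of DB_i for the query is the sum of the beliefs for all query terms.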
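A minimal sketch of the CORI belief computation described on slide 5. The function names and argument layout are hypothetical, and the default values `df_base=50`, `df_factor=150`, and `b=0.4` are the commonly cited CORI defaults rather than values stated on the slide.

```python
import math


def cori_belief(df, cw, avg_cw, cf, num_dbs,
                df_base=50.0, df_factor=150.0, b=0.4):
    """Belief of one DB for one query term.

    df: doc frequency of the term in the (sampled) DB
    cw: length of the (sampled) DB; avg_cw: average (sampled) DB length
    cf: number of DBs containing the term (must be > 0); num_dbs: number of DBs
    Defaults are assumed values, not taken from the slides.
    """
    t = df / (df + df_base + df_factor * cw / avg_cw)
    i = math.log((num_dbs + 0.5) / cf) / math.log(num_dbs + 1.0)
    return b + (1.0 - b) * t * i


def cori_score(term_stats, num_dbs):
    """Sum the per-term beliefs for one DB.

    term_stats: list of (df, cw, avg_cw, cf) tuples, one per query term.
    """
    return sum(cori_belief(df, cw, avg_cw, cf, num_dbs)
               for df, cw, avg_cw, cf in term_stats)
```

A term that never occurs in the sampled DB (df = 0) contributes only the default belief b, so the ranking is driven by the terms that do occur.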

  6. Resource Selection Algorithms that Normalize DB Size: The Extended Version of the CORI Algorithm
     Three issues are addressed to incorporate the DB size factor:
     • df is scaled to estimate the actual df in the DB:
       $$df' = df \cdot \frac{\text{Estimated DB Size}}{\text{DB Sample Size}}$$
     • The DB length is scaled.
     • df_base and df_factor are scaled.
     CORI_ext1 addresses the first two points; CORI_ext2 addresses all three.
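The scaling steps on slide 6 can be sketched as below. This is an illustration under stated assumptions: the slide does not say how df_base and df_factor are scaled, so the sketch exposes a hypothetical `constant_scale` argument (1.0 corresponds to leaving them unscaled, as in CORI_ext1). The defaults 50 and 150 are the commonly cited CORI values, not taken from the slide.

```python
def scaled_t_component(df_samp, cw_samp, avg_cw_scaled,
                       est_db_size, sample_size,
                       df_base=50.0, df_factor=150.0, constant_scale=1.0):
    """T component of CORI with sample statistics scaled up to DB size.

    df_samp, cw_samp: doc frequency and length measured on the sample.
    avg_cw_scaled: average DB length after the same scaling has been applied
                   to every DB.
    constant_scale: hypothetical factor applied to df_base and df_factor;
                    how the authors scale them is not given on the slide.
    """
    ratio = est_db_size / sample_size
    df = df_samp * ratio  # estimated actual df in the DB
    cw = cw_samp * ratio  # estimated actual DB length
    base = df_base * constant_scale
    factor = df_factor * constant_scale
    return df / (df + base + factor * cw / avg_cw_scaled)
```

With the constants left unscaled, a large DB's scaled df quickly dominates df_base and df_factor, which is one way to see why the third point on the slide matters.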

  7. Resource Selection Algorithms that Normalize DB Size: The Old and Extended Versions of the KL-divergence Algorithm
     In the language model framework, the KL-divergence algorithm calculates the conditional probability of a DB given the query:
     $$P(C_i \mid Q) = \frac{P(Q \mid C_i)\, P(C_i)}{P(Q)}$$
     where P(Q) is a DB-independent constant.
     • In the original KL-divergence algorithm, P(C_i) is a uniform distribution.
     • In the extended KL-divergence algorithm, P(C_i) is set according to the DB size.
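A minimal sketch of the two KL-divergence variants on slide 7, ranking DBs by log P(Q | C_i) + log P(C_i). The linear smoothing of each DB language model with a global model, the weight `lam`, and all names are assumptions of this sketch; the slide only specifies the uniform versus size-based prior.

```python
import math


def kl_scores(query_terms, db_models, global_model, db_sizes=None, lam=0.5):
    """Score each DB by log P(Q|C_i) + log P(C_i), dropping the constant P(Q).

    db_models: {db_name: {term: P(term | DB)}}
    global_model: {term: P(term | all DBs)}, must be > 0 for every query term
    db_sizes: {db_name: estimated size}; None gives the uniform prior
              (original algorithm), otherwise the prior is proportional to
              DB size (extended algorithm).
    """
    total_size = sum(db_sizes.values()) if db_sizes else None
    scores = {}
    for name, model in db_models.items():
        if db_sizes:
            prior = db_sizes[name] / total_size
        else:
            prior = 1.0 / len(db_models)
        log_lik = sum(
            math.log(lam * model.get(q, 0.0) + (1.0 - lam) * global_model[q])
            for q in query_terms
        )
        scores[name] = log_lik + math.log(prior)
    return scores
```

Two DBs with identical language models tie under the uniform prior, while the size-based prior ranks the larger one first.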

  8. Resource Selection Algorithms that Normalize DB Size: The ReDDE Algorithm
     • The goal of resource selection:
       • Select the (few) DBs that have the most relevant documents
     • Common strategy:
       • Pick DBs that are the "most similar" to the query
       • But similarity measures don't always normalize well for DB size
     • Optimal strategy:
       • Rank DBs by the number of relevant documents they contain
       • It hasn't been clear how to do this
     • An approximation of the optimal strategy:
       • Rank DBs by the percentage of relevant documents they contain
       • This can be estimated a little more easily, but we need to make some assumptions

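  9. The ReDDE Algorithm: Estimating the Distribution of Relevant Documents
     • Estimate the number of relevant docs in the jth DB from its sampled docs, scaled by DB size:
       $$\widehat{Rel}_q(j) = \sum_{d_i \in DB_{j\_samp}} P(rel \mid d_i) \cdot \frac{\hat{N}_{DB_j}}{N_{DB_{j\_samp}}}$$
       where \hat{N}_{DB_j} is the estimated DB size and N_{DB_{j\_samp}} is the number of docs sampled from the jth DB.
     • "Everything at the top is (equally) relevant": P(rel | d_i) = C_q if d_i is ranked near the top of the centralized complete DB (CCDB), and 0 otherwise.
     • The rank of d_i in the CCDB is estimated from the ranking of the centralized sample DB (CSDB), scaled by DB size:
       $$Rank_{CCDB}(d_i) = \sum_{d_j:\ Rank_{CSDB}(d_j) < Rank_{CSDB}(d_i)} \frac{\hat{N}_{c(d_j)}}{N_{c(d_j)\_samp}}$$
       where \hat{N}_{c(d_j)} is the estimated number of docs in the DB that contains d_j and N_{c(d_j)\_samp} is the number of docs sampled from the DB that contains d_j.
     • Normalize, to eliminate the constant C_q:
       $$\widehat{Dist\_Rel}_q(j) = \frac{\widehat{Rel}_q(j)}{\sum_i \widehat{Rel}_q(i)}$$
     [Figure: docs ranked in the CSDB are mapped to ranks in the CCDB by scaling each sampled doc by the size of its DB.]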
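The ReDDE estimate on slide 9 can be sketched as follows. This is an illustration, not the authors' code: the function name, the input format, and the `ratio` parameter that defines "the top" of the centralized complete DB (with its default value) are assumptions of this sketch.

```python
def redde_distribution(ranked_doc_dbs, est_sizes, sample_sizes, ratio=0.003):
    """Estimate the distribution of relevant docs across DBs.

    ranked_doc_dbs: for the centralized sample DB (CSDB) ranking of the query,
                    the name of the DB each ranked sampled doc came from.
    est_sizes: {db: estimated DB size}; sample_sizes: {db: docs sampled}.
    ratio: assumed fraction of the centralized complete DB (CCDB) whose
           top-ranked docs are treated as (equally) relevant.
    """
    threshold = ratio * sum(est_sizes.values())
    rel = {db: 0.0 for db in est_sizes}
    central_rank = 0.0  # estimated number of CCDB docs ranked above this doc
    for db in ranked_doc_dbs:
        if central_rank >= threshold:
            break  # docs below the top of the CCDB count as non-relevant
        scale = est_sizes[db] / sample_sizes[db]  # docs each sample represents
        rel[db] += scale  # the constant C_q is omitted; it cancels below
        central_rank += scale
    norm = sum(rel.values())
    return {db: (v / norm if norm else 0.0) for db, v in rel.items()}
```

DBs are then ranked by the returned values. Because each sampled doc from a large DB stands in for many unseen docs, a large DB can dominate the distribution even with few sampled docs near the top.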

  10. Experimental Data
      Testbeds:
      • Trec123_100col: 100 DBs. Organized by source and publication date. DB sizes and the distribution of relevant documents are rather uniform.
      • Trec123_AP_WSJ_60col (Relevant): 62 DBs. 60 from above, 2 built by merging the AP and WSJ DBs. DB sizes are skewed and the large DBs have many more relevant docs.
      • Trec123_FR_DOE_81col (Non-Relevant): 83 DBs. 81 from above, 2 built by merging the FR and DOE DBs. DB sizes are skewed and the large DBs have few relevant docs.
      • Trec4_kmeans: 100 DBs. Organized by topic. DB sizes and the distribution of relevant documents are moderately skewed.
      • Trec123_10col: 10 DBs. Each DB is built by merging 10 DBs from Trec123_100col in a round-robin way. DB sizes are large.

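  11. Experimental Results: Resource Selection
      Measure: the percentage of relevant docs included by the evaluated ranking, compared with a relevance-based (best) ranking:
      $$R_k = \frac{\sum_{i=1}^{k} E_i}{\sum_{i=1}^{k} B_i}$$
      where E_i and B_i are the numbers of relevant docs in the ith-ranked DB of the evaluated ranking and of the best ranking, respectively.
      [Figure panels: Trec4-kmeans (100 DBs); Trec123-100col (100 DBs); Trec123_AP_WSJ_60col (2 large, 60 small DBs; large are relevant); Trec123_FR_DOE_81col (2 large, 81 small DBs; large are non-relevant)]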
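A minimal sketch of the measure on slide 11: the relevant docs covered by the first k DBs of an evaluated ranking, divided by what a relevance-based ranking would cover. The function name and input format are hypothetical.

```python
def recall_at_k(evaluated_ranking, rel_counts, k):
    """Fraction of the best achievable relevant docs covered by the top k DBs.

    evaluated_ranking: DB names in the order chosen by the selection algorithm.
    rel_counts: {db: number of relevant docs it contains for the query}.
    """
    # Relevance-based (best) ranking: DBs sorted by their relevant-doc counts.
    best = sorted(rel_counts.values(), reverse=True)[:k]
    covered = sum(rel_counts[db] for db in evaluated_ranking[:k])
    denom = sum(best)
    return covered / denom if denom else 0.0
```

By construction the value is at most 1, and it reaches 1 once the evaluated ranking has selected as many relevant docs as the best ranking of the same length.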

  12. Conclusion and Future Work
      Conclusions:
      • Database size plays an important role in resource selection algorithms, especially when the distribution of relevant documents is skewed.
      • The extended KL-divergence and ReDDE algorithms tend to be the most robust of the algorithms investigated in the paper.
      • In some cases, the performance of ReDDE decreases as more DBs are selected, possibly due to the parameter setting.
      Future work:
      • Adjust the parameters of the ReDDE algorithm automatically
