
Federated Search of Text Search Engines in Uncooperative Environments





  1. Federated Search of Text Search Engines in Uncooperative Environments
  Luo Si, Language Technology Institute, School of Computer Science, Carnegie Mellon University
  Advisor: Jamie Callan (Carnegie Mellon University)

  2. Outline
  • Introduction: introduction to federated search
  • Research Problems: the state of the art and our contributions
  • Demo: demo of a prototype system for a real-world application


  4. Introduction: Visible Web vs. Hidden Web
  • Visible Web: information that can be copied (crawled) and indexed by conventional search engines such as Google or AltaVista
  • Hidden Web: information hidden from conventional engines; it cannot be indexed (promptly) and must be searched by federated search
    - No arbitrary crawling of the data is allowed (e.g., the ACM digital library)
    - Updated too frequently to be crawled (e.g., buy.com)
    - Larger than the Visible Web (2-50 times)
    - Valuable: much of it is created by professionals
  • Federated search is a feature used by search engines such as www.find.com to compete with Google
  • Hidden Web sources are typically uncooperative information sources

  5. Introduction: Components of a Federated Search System
  [Diagram: search engines 1 through N accessed through three components]
  (1) Resource Representation
  (2) Resource Selection
  (3) Results Merging

  6. Introduction: Modeling Federated Search
  • Applications in the real world
    - But not enough relevance judgments and not enough control... thorough simulation is required
  • Modeling federated search in research environments
    - TREC testbeds with about 100 information sources
    - Normal or moderately skewed source sizes: Trec123 or Trec4_Kmeans
    - Skewed testbeds: Representative (large source with the same relevant-doc density), Relevant (large source with a higher relevant-doc density), Nonrelevant (large source with a lower relevant-doc density)
    - Multiple types of search engines to reflect uncooperative environments

  7. Outline
  • Introduction
  • Research Problems: the state of the art and our contributions
    - Resource Representation
    - Resource Selection
    - Results Merging
    - A Unified Framework
  • Demo

  8. Research Problems (Resource Representation)
  Previous research on resource representation:
  • Resource descriptions built from words and their occurrences
    - Query-Based Sampling (Callan, 1999): send queries to the source and collect the sampled documents (a sketch follows below)
  • Centralized sample database: pool the documents collected by Query-Based Sampling (QBS)
    - Used for query expansion (Ogilvie & Callan, 2001), not very successful
    - Utilized successfully for other problems throughout our new research
  • Information source size estimation
    - Capture-Recapture Model (Liu and Yu, 1999), but it requires a large number of interactions with the information sources
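Query-Based Sampling is simple enough to sketch in code. The following Python is a minimal illustration under assumed interfaces — `search_fn`, its signature, and all parameter defaults are hypothetical stand-ins, not the exact procedure from Callan (1999):

```python
import random

def query_based_sample(search_fn, seed_term, n_docs=300,
                       docs_per_query=4, max_queries=500):
    """Sample documents from a source by issuing single-term queries.

    search_fn(query, n) is assumed to return up to n (doc_id, text)
    pairs from the source's own search interface.
    """
    sampled = {}              # doc_id -> text of documents sampled so far
    candidates = [seed_term]  # pool of candidate query terms
    for _ in range(max_queries):
        if len(sampled) >= n_docs or not candidates:
            break
        term = random.choice(candidates)
        for doc_id, text in search_fn(term, docs_per_query):
            if doc_id not in sampled:
                sampled[doc_id] = text
                # grow the query pool from the documents seen so far
                candidates.extend(set(text.lower().split()))
    # the words and occurrences in these docs form the resource description
    return sampled
```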

  9. Research Problems (Resource Representation)
  New information source size estimation algorithm:
  • Sample-Resample Model (Si and Callan, 2003): estimate the document frequency (df) of a term within the sampled docs, obtain the term's total df from the source with a resample query, and scale the number of sampled docs to estimate the source size (see the sketch below)
  • Experiments measure: absolute error ratio between estimated and actual source size
  [Figure: estimated vs. actual source sizes omitted]
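The estimator itself is a single proportion: the probe term's df ratio in the sample is assumed to match its ratio in the full source. A minimal Python sketch (the variable names are mine, not the paper's):

```python
def sample_resample_size(df_sample: int, n_sample: int, df_source: int) -> float:
    """Estimate source size from df_sample/n_sample ~= df_source/size.

    df_sample: df of a probe term within the sampled documents
    n_sample:  number of sampled documents
    df_source: the term's df reported by the source for a resample query
    """
    if df_sample == 0:
        raise ValueError("probe term must occur in the sampled documents")
    return n_sample * df_source / df_sample

def absolute_error_ratio(estimated: float, actual: float) -> float:
    """The evaluation measure from the slide."""
    return abs(estimated - actual) / actual

# Example: a term occurs in 30 of 300 sampled docs and the source reports
# df = 5000, so the estimated size is 300 * 5000 / 30 = 50,000 documents.
print(sample_resample_size(df_sample=30, n_sample=300, df_source=5000))
```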

  10. Outline
  • Introduction
  • Research Problems: the state of the art and our contributions
    - Resource Representation
    - Resource Selection
    - Results Merging
    - A Unified Framework
  • Demo

  11. Research Problems (Resource Selection)
  Goal of resource selection for information source recommendation:
  • High-Recall: select the (few) information sources that contain the most relevant documents
  Previous research on resource selection:
  • "Big document" approach: treat each information source as one big document and rank sources by similarity to the user query
    - Examples: CVV, CORI and KL-divergence
    - These methods lose document boundaries and do not optimize the High-Recall goal
  New: Relevant Document Distribution Estimation (ReDDE) resource selection
  • Estimate the percentage of relevant documents in each source and rank the sources accordingly
  • "Relevant Document Distribution Estimation Method for Resource Selection" (Luo Si & Jamie Callan, SIGIR '03)

  12. Research Problems (Resource Selection)
  The Relevant Document Distribution Estimation (ReDDE) algorithm:
  • Estimated number of relevant docs in source i: the sum, over the sampled docs d of source i, of P(rel | d) multiplied by the source scale factor (estimated source size / number of sampled docs)
  • P(rel | d) follows "everything at the top is (equally) relevant": a positive constant when d falls in the top slice of a (simulated) ranking on the centralized complete DB, and zero otherwise
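A rough Python sketch of the scoring step, assuming the ranks of the sampled docs on the (simulated) centralized complete DB have already been computed; the input layout and the 0.003 top-slice ratio are illustrative assumptions, not values from the thesis:

```python
def redde_scores(sources, ratio=0.003):
    """Estimate the number of relevant docs per source (rank descending).

    sources: list of dicts with keys
      'central_ranks': ranks of this source's sampled docs on the
                       simulated centralized complete DB
      'est_size':      estimated source size (e.g., via sample-resample)
      'n_sampled':     number of sampled docs from this source
    """
    total_est = sum(s['est_size'] for s in sources)
    threshold = ratio * total_est  # "everything at the top is relevant"
    scores = []
    for s in sources:
        scale = s['est_size'] / s['n_sampled']  # source scale factor
        # each top-ranked sampled doc stands in for `scale` source docs
        rel = scale * sum(1 for r in s['central_ranks'] if r < threshold)
        scores.append(rel)
    return scores
```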

  13. Research Problems (Resource Selection)
  Experiments measure: compare the evaluated source ranking against the desired ranking (sources ordered by their number of relevant documents)
  [Figure: resource selection results omitted]

  14. Outline
  • Introduction
  • Research Problems: the state of the art and our contributions
    - Resource Representation
    - Resource Selection
    - Results Merging
    - A Unified Framework
  • Future Research

  15. Research Problems (Results Merging)
  Goal of results merging: make different result lists comparable and merge them into a single list
  Difficulties:
  • Information sources may use different retrieval algorithms
  • Information sources have different corpus statistics
  Previous research on results merging:
  • Some methods download all the returned documents and compute comparable scores, at large communication and computation costs
  • Some methods use heuristic score combination, e.g., the CORI merging method
  New: Semi-Supervised Learning (SSL) merging (Si & Callan, 2002, 2003)
  • Basic idea: approximate centralized document scores by linear regression
  • Estimate linear models from the overlap documents that appear both in the centralized sample DB and in the individual ranked lists

  16. Research Problems (Results Merging)
  SSL results merging (cont.):
  • In resource representation: build representations by QBS and collapse the sampled docs into a centralized sample DB
  • In resource selection: rank the sources and compute centralized scores for the docs in the centralized sample DB (the CSDB ranking)
  • In results merging: find the overlap docs, build the linear models, and estimate centralized scores for all returned docs (see the sketch below)
  [Diagram: Engines 1..N return ranked lists, which are aligned with the centralized sample DB ranking and merged into the final results]
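The regression at the heart of SSL is easy to sketch: per source, fit centralized_score ≈ a * source_score + b from the overlap docs, then map every returned score. A minimal Python illustration with an assumed data layout (it needs at least two overlap docs with distinct scores per source):

```python
def fit_linear(xs, ys):
    """Least-squares fit of y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

def ssl_merge(result_lists, central_scores):
    """result_lists:   {source: [(doc_id, source_score), ...]}
    central_scores: {doc_id: centralized score in the sample DB}
    Returns one merged list sorted by estimated centralized score."""
    merged = []
    for source, results in result_lists.items():
        overlap = [(score, central_scores[doc])
                   for doc, score in results if doc in central_scores]
        xs, ys = zip(*overlap)     # source scores -> centralized scores
        a, b = fit_linear(xs, ys)  # one linear model per source
        merged.extend((doc, a * score + b) for doc, score in results)
    return sorted(merged, key=lambda pair: pair[1], reverse=True)
```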

  17. Research Problems (Results Merging)
  Experiments on the Trec123 and Trec4_Kmeans testbeds with 10 sources selected:
  • "Using Sampled Data and Regression to Merge Search Engine Results" (Luo Si & Jamie Callan, SIGIR '02)
  • "A Semi-Supervised Learning Method to Merge Search Engine Results" (Luo Si & Jamie Callan, TOIS '03)
  [Figure: merging results omitted]

  18. Outline
  • Introduction
  • Research Problems: the state of the art and preliminary research
    - Resource Representation
    - Resource Selection
    - Results Merging
    - A Unified Framework
  • Demo

  19. Research Problems (Unified Utility Framework)
  Goal of the Unified Utility Maximization framework:
  • Integrate and adjust the individual components of federated search to achieve the globally desired result for each application, rather than simply combining individually effective components
  High-Recall vs. High-Precision:
  • High-Recall: select the sources that contain as many relevant docs as possible (information source recommendation)
  • High-Precision: select the sources that return many relevant docs in the top part of the final ranked list (federated document retrieval)
  • The two goals are correlated but NOT identical; previous research did not distinguish them

  20. Research Problems (Unified Utility Framework)
  Unified Utility Maximization framework (UUM):
  • Formalize federated search as a mathematical optimization problem with respect to the goals of different applications
  • Example, document retrieval with the High-Precision goal: maximize the number of relevant docs in the top part of the final ranked list, given the number of sources to select and a fixed number of docs retrieved from each selected source
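In symbols, the High-Precision selection can be sketched roughly as below; the notation (S, N_sel, k, P-hat) is my own shorthand for illustration, not necessarily the thesis's formulation:

```latex
% S: the set of selected sources, N_sel: how many sources to select,
% k: docs retrieved from each selected source,
% \hat{P}(rel | d_{ij}): estimated relevance probability of the j-th
% ranked document returned by source i.
\max_{S :\, |S| = N_{sel}} \;\; \sum_{i \in S} \sum_{j=1}^{k} \hat{P}(\mathrm{rel} \mid d_{ij})
```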

  21. Research Problems (Unified Utility Framework)
  UUM resource selection for federated document retrieval:
  • A variant retrieves a variable number of docs from each selected source, subject to a total number of documents to retrieve
  • There is no simple closed-form solution; it is solved by dynamic programming (a generic sketch follows below)
  • "Unified Utility Maximization Framework for Resource Selection" (Luo Si & Jamie Callan, CIKM '04)
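The variable-length variant is a budget-allocation problem that a standard dynamic program can solve. A generic Python sketch under assumed inputs — gain[i][d] is the expected number of relevant docs among the top d docs of source i — not the thesis's exact algorithm:

```python
def allocate_docs(gain, budget):
    """Split a total retrieval budget across sources.

    gain: gain[i][d] = expected relevant docs in the top d docs of
          source i, for d = 0..max (gain[i][0] must be 0).
    Returns the per-source document counts maximizing total gain.
    """
    n = len(gain)
    # best[i][b]: max gain using the first i sources with budget b
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    choice = [[0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for b in range(budget + 1):
            for d in range(min(b, len(gain[i - 1]) - 1) + 1):
                v = best[i - 1][b - d] + gain[i - 1][d]
                if v > best[i][b]:
                    best[i][b], choice[i][b] = v, d
    alloc, b = [], budget  # backtrack the chosen allocation
    for i in range(n, 0, -1):
        alloc.append(choice[i][b])
        b -= choice[i][b]
    return list(reversed(alloc))
```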

  22. Research Problems (Unified Utility Framework)
  Experiments: resource selection for federated document retrieval on the Trec123 Representative testbed (3 and 10 sources selected, SSL merging)
  [Figure: retrieval results omitted]

  23. Demo
  FedStats project: cooperative work with Jamie Callan, Thi Nhu Truong and Lawrence Yau

  24. Demo
  Results merging experiments of FedStats for CORI and SSL

  25. Conclusion
  • Federated search has been a hot research topic over the last decade
  • Most previous research is tied to the "big document" approach
  • The new research advances the state of the art:
    - A more theoretically solid foundation
    - Empirically more effective methods
    - Better modeling of real-world applications
  • A bridge from cool research to practical tool
