Metasearch Engines: Solutions and Challenges in Text Retrieval

VLDB'99 Tutorial
Metasearch Engines: Solutions and Challenges

Clement Yu, Dept. of EECS, U. of Illinois at Chicago, Chicago, IL 60607, yu@eecs.uic.edu
Weiyi Meng, Dept. of Computer Science, SUNY at Binghamton, Binghamton, NY 13902, meng@cs.binghamton.edu

The Problem
[Figure: search engines 1, 2, ..., n, each over its own text source 1, 2, ..., n.]
How am I going to find the 5 best pages on "Internet Security"?

Metasearch Engine Solution
[Figure: the user submits a query through the user interface; a query dispatcher sends it to search engines 1, 2, ..., n, each over its own text source; a result merger returns the combined result to the user.]

Some Observations
• most sources are not useful for a given query
• sending a query to a useless source would
  • incur unnecessary network traffic
  • waste local resources for evaluating the query
  • increase the cost of merging the results
• retrieving too many documents from a source is inefficient

A More Efficient Metasearch Engine
[Figure: as in the previous architecture, with a database selector and a document selector added between the user interface and the query dispatcher; the result merger returns the combined result to the user.]

Tutorial Outline
1. Introduction to Text Retrieval
   • consider only Vector Space Model
2. Search Engines on the Web
3. Introduction to Metasearch Engine
4. Database Selection
5. Document Selection
6. Result Merging
7. New Challenges

Introduction to Text Retrieval (1)
Document representation
• remove stopwords: of, the, ...
• stemming: stemming → stem
• d = (d1, ..., di, ..., dn), where di is the weight of the ith term in d
• tf*idf formula for computing di
Example: consider term t of document d in a database of N documents, where tf is the frequency of t in d, max_tf is the largest term frequency in d, and df is the number of documents containing t.
• tf weight of t in d (if tf > 0): 0.5 + 0.5*tf/max_tf
• idf weight of t: log(N/df)
• weight of t in d: (0.5 + 0.5*tf/max_tf)*log(N/df)
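The tf*idf weight above can be sketched in a few lines of Python. The function name and the sample numbers are illustrative assumptions, and the natural logarithm is assumed because the slide does not fix the base of the log.

```python
import math

def term_weight(tf, max_tf, n_docs, df):
    """Weight of a term in a document: (0.5 + 0.5*tf/max_tf) * log(N/df).

    tf: frequency of the term in the document
    max_tf: largest term frequency in the document
    n_docs: number of documents in the database (N)
    df: number of documents containing the term
    """
    if tf <= 0:
        return 0.0  # an absent term gets no weight
    tf_weight = 0.5 + 0.5 * tf / max_tf
    idf_weight = math.log(n_docs / df)  # natural log is an assumption
    return tf_weight * idf_weight

# Hypothetical numbers: a term occurring 3 times in a document whose most
# frequent term occurs 6 times, in a database of 1000 documents, 10 of
# which contain the term.
w = term_weight(tf=3, max_tf=6, n_docs=1000, df=10)
```

A term that appears in every document gets idf weight log(1) = 0, so it contributes nothing to the document vector.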

Introduction to Text Retrieval (2)
Query representation
• q = (q1, ..., qi, ..., qn), where qi is the weight of the ith term in q
• compute qi: tf weight only
• alternative: use idf weight for query terms, not document terms
• query expansion (e.g., add related terms)

Introduction to Text Retrieval (3)
Similarity Functions
• simple dot product: $sim(q, d) = q \cdot d = \sum_{i=1}^{n} q_i d_i$
  • favors long documents
• Cosine function: $sim(q, d) = \frac{q \cdot d}{|q| \, |d|}$
• other similarity functions exist
• normalized similarities: [0, 1.0]
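A minimal sketch of the two similarity functions, assuming the usual definitions (dot product as the sum of products of term weights, cosine as the dot product divided by the product of the vector lengths). The vectors are made-up examples chosen to show why the plain dot product favors long documents.

```python
import math

def dot(q, d):
    """Dot product of two equally long weight vectors."""
    return sum(qi * di for qi, di in zip(q, d))

def cosine(q, d):
    """Dot product normalized by the lengths of both vectors."""
    norm = math.sqrt(dot(q, q)) * math.sqrt(dot(d, d))
    return dot(q, d) / norm if norm else 0.0

q = [1.0, 1.0, 0.0]
short_doc = [1.0, 1.0, 0.0]   # exactly about the query terms
long_doc = [2.0, 2.0, 4.0]    # longer, mostly about a third term

dot_prefers_long = dot(q, long_doc) > dot(q, short_doc)
cosine_prefers_short = cosine(q, short_doc) > cosine(q, long_doc)
```

With non-negative weights the cosine value always falls in [0, 1], which is what makes similarities from different documents comparable.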

Introduction to Text Retrieval (4)
Retrieval Effectiveness
• relevant documents: documents useful to the user who issued the query
• recall: percentage of relevant documents that are retrieved
• precision: percentage of retrieved documents that are relevant
[Figure: precision plotted against recall.]
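The two measures can be computed directly from the set of retrieved documents and the set of relevant documents. The sketch below uses hypothetical document identifiers and returns fractions rather than percentages.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) as fractions in [0, 1]."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical run: 4 documents retrieved, 5 relevant overall, 2 in common.
p, r = precision_recall(["d1", "d2", "d3", "d4"],
                        ["d1", "d3", "d7", "d8", "d9"])
```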

Search Engines on the Web (1)
Search engine as a document retrieval system
• no control on web pages that can be searched
• web pages have rich structures and semantics
• web pages are extensively linked
• additional information for each page (time last modified, organization publishing it, etc.)
• databases are dynamic and can be very large
• few general-purpose search engines and numerous special-purpose search engines

Search Engines on the Web (2)
New indexing techniques
• partial-text indexing to improve scalability
• ignore and/or discount spamming terms
• use anchor terms to index linked pages
  e.g.: WWWW [McBr94], Google [BrPa98], Webor [CSM97]
[Figure: Page 1 contains the anchor text "airplane ticket and hotel", which links to Page 2: http://travelocity.com/]

Search Engines on the Web (3)
New term weighting schemes
• higher weights to terms enclosed by special tags
  • title (SIBRIS [WaWJ89], Altavista, HotBot, Yahoo)
  • special fonts (Google [BrPa98])
  • special fonts & tags (LASER [BoFJ96])
• Webor [CSM97] approach
  • partition tags into disjoint classes (title, header, strong, anchor, list, plain text)
  • assign different importance factors to terms in different classes
  • determine optimal importance factors

Search Engines on the Web (4)
New document ranking methods
• Vector Spreading Activation [YuLe96]
  • add a fraction of parents' similarities
Example: Suppose for query q
  sim(q, d1) = 0.4
  sim(q, d2) = 0.2
  sim(q, d3) = 0.2
  final score of d3 = 0.2 + 0.1*0.4 + 0.1*0.2 = 0.26
[Figure: d1 and d2 both link to d3.]
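The example can be reproduced with a short sketch. It assumes, as the arithmetic on the slide suggests, that d1 and d2 are the parents of d3 and that the fraction is 0.1; the function and variable names are illustrative.

```python
def spread_score(doc, sims, parents, fraction=0.1):
    """Own similarity plus a fraction of each parent's similarity."""
    return sims[doc] + fraction * sum(sims[p] for p in parents.get(doc, []))

sims = {"d1": 0.4, "d2": 0.2, "d3": 0.2}
parents = {"d3": ["d1", "d2"]}  # assumed link structure: d1 -> d3, d2 -> d3

score_d3 = spread_score("d3", sims, parents)
score_d1 = spread_score("d1", sims, parents)  # no parents, score unchanged
```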

Search Engines on the Web (5)
New document ranking methods
• combine similarity with rank
  • PageRank [PaBr98]: an important page is linked to by many pages and/or by important pages
• combine similarity with authority score
  • authority [Klei98]: an important content page is highly linked to among initially retrieved pages and their neighbors

Introduction to Metasearch Engine (1)
An Example
Query: Internet Security
Databases: NYT ... WP ... DB ... DB ...
Retrieved results: t1, t2, ... p1, p2, …
Merged results: p1, t1, ...

Introduction to Metasearch Engine (2)
Database Selection Problem
• Select potentially useful databases for a given query
  • essential if the number of local databases is large
  • reduce network traffic
  • avoid wasting local resources

Introduction to Metasearch Engine (3)
• Potentially useful database: one that contains potentially useful documents
• Potentially useful documents:
  • global similarity above a threshold
  • global similarity among the m highest
• Need some knowledge about each database in advance in order to perform database selection
  • Database Representative

Introduction to Metasearch Engine (4)
Document Selection Problem
Select potentially useful documents from each selected local database efficiently.
Step 1: Retrieve all potentially useful documents while minimizing the retrieval of useless documents
• from global similarity threshold to tightest local similarity threshold
  want all d: Gsim(q, d) > GT
  retrieve d from DBk: Lsim(q, d) > LTk
  LTk is the largest value such that Gsim(q, d) > GT ⇒ Lsim(q, d) > LTk

Introduction to Metasearch Engine (5)
Efficient Document Selection
Step 2: Transmit all potentially useful documents to result merger while minimizing the transmission of useless documents
• further filtering to reduce transmission cost and merge cost
Example: local DBk retrieves d1, …, ds; after filtering, only d2, d7, d10 are transmitted.

Introduction to Metasearch Engine (6)
Result Merging Problem
Objective: Merge returned documents from multiple sources into a single ranked list.
Difficulty: Local document similarities may be incomparable or not available.
Solutions: Generate "global similarities" for ranking.
[Figure: DB1 returns d11, d12, ...; DBN returns dN1, dN2, ...; the Merger outputs a single list d12, d54, ...]

Introduction to Metasearch Engine (7)
An Ideal Metasearch Engine:
• Retrieval effectiveness: the same as if all documents were in a single collection.
• Efficiency: optimize the retrieval process
Implications: it should aim at:
• selecting only useful search engines
• retrieving and transmitting only useful documents
• ranking documents according to their degrees of relevance

Introduction to Metasearch Engine (8)
Main Sources of Difficulties: [MYL99]
• autonomy of local search engines
  • design autonomy
  • maintenance autonomy
• heterogeneities among local search engines
  • indexing method
  • document/query term weighting schemes
  • similarity/ranking function
  • document database
  • document version
  • result presentation

Introduction to Metasearch Engine (9)
Impact of Autonomy and Heterogeneities [MLY99]
• local search engines may be unwilling to provide database representatives, or may provide different types of representatives
• difficult to find potentially useful documents
• difficult to merge documents from multiple sources

Database Selection: Basic Idea
Goal: Identify potentially useful databases for each user query.
General approach:
• use a representative to indicate approximately the content of each database
• use these representatives to select databases for each query
Diversity of solutions
• different types of representatives
• different algorithms using the representatives

Solution Classification
• Naive Approach: select all databases (e.g. MetaCrawler, NCSTRL)
• Qualitative Approaches: estimate the quality of each local database
  • based on rough representatives
  • based on detailed representatives
• Quantitative Approaches: estimate quantities that measure the quality of each local database more directly and explicitly
• Learning-based Approaches: database representatives are obtained through training or learning

Qualitative Approaches Using Rough Representatives
• typical representative:
  • a few words or a few paragraphs in certain format
  • manual construction often needed
• can work well for special-purpose local search engines
• very scalable storage requirement
• selection can be inaccurate as the description is too rough

Qualitative Approaches Using Rough Representatives
Example 1: ALIWEB [Kost94]
• Representative has a fixed format, e.g. for a site containing files for the Perl language:
    Template-Type: DOCUMENT
    Title: Perl
    Description: Information on the Perl Programming Language. Includes a local Hypertext Perl Manual, and the latest FAQ in Hypertext.
    Keywords: perl, perl-faq, language
• user query can match against one or more fields

Qualitative Approaches Using Rough Representatives
Example 2: NetSerf [ChHa95]
• Representative has a WordNet based structure, e.g. for a site for world facts listed by country:
    topic: country
      synset: [nation, nationality, land, country, a_people]
      synset: [state, nation, country, land, commonwealth, res_publica, body_politic]
      synset: [country, state, land, nation]
    info-type: facts
• user query is transformed to a similar structure before matching

Qualitative Approaches Using Detailed Representatives
Use detailed statistical information for each term
• employ special measures to estimate the usefulness/quality of each search engine for each query
• the measures reflect the usefulness in a less direct/explicit way compared to those used in quantitative approaches
• scalability starts to become an issue

Qualitative Approaches Using Detailed Representatives
Example 1: gGlOSS [GrGa95]
• representative: for term ti
  -- dfi: document frequency of ti
  -- Wi: the sum of weights of ti in all documents
• database usefulness: sum of high similarities
  $usefulness(q, D, T) = \sum_{d \in D,\ sim(q, d) \ge T} sim(q, d)$

gGlOSS (continued)
Suppose for query q, we have
  D1: d11: 0.6, d12: 0.5
  D2: d21: 0.3, d22: 0.3, d23: 0.2
  D3: d31: 0.7, d32: 0.1, d33: 0.1
Then
  usefulness(q, D1, 0.3) = 1.1
  usefulness(q, D2, 0.3) = 0.6
  usefulness(q, D3, 0.3) = 0.7
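The three values can be checked with a short sketch that sums the similarities at or above the threshold. Treating the threshold as inclusive is an assumption read off the example (D2 only reaches 0.6 if its two documents with similarity 0.3 are counted); the function name is illustrative.

```python
def usefulness(sims, threshold):
    """Sum of the document similarities that reach the threshold (inclusive)."""
    return sum(s for s in sims if s >= threshold)

# Similarities of each database's documents to the query, from the example.
databases = {
    "D1": [0.6, 0.5],
    "D2": [0.3, 0.3, 0.2],
    "D3": [0.7, 0.1, 0.1],
}

scores = {name: usefulness(sims, 0.3) for name, sims in databases.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

Raising the threshold changes the ranking: at 0.65 only D3 has any usefulness at all, which is the threshold dependence noted later for gGlOSS.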

gGlOSS (continued)
gGlOSS: usefulness is estimated for two cases
• high-correlation case: if dfi ≤ dfj, then every document having ti also has tj.
Example: Consider q = (1, 1, 1) with df1 = 2, df2 = 3, df3 = 4, W1 = 0.6, W2 = 0.6 and W3 = 1.2.

        actual weights       estimated weights (Wi/dfi)
        t1    t2    t3       t1    t2    t3
  d1    0.2   0.1   0.3      0.3   0.2   0.3
  d2    0.4   0.3   0.2      0.3   0.2   0.3
  d3    0     0.2   0.4      0     0.2   0.3
  d4    0     0     0.3      0     0     0.3

usefulness(q, D, 0.5) = W1 + W2 + df2*W3/df3 = 2.1
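A sketch of how the high-correlation estimate can be computed from the representative alone. It assumes that every document containing a term carries the average weight Wi/dfi, so sorting the terms by document frequency yields nested groups of documents with identical estimated similarity. The function name, the inclusive threshold, and the small tolerance (which absorbs floating-point rounding when a similarity equals the threshold) are assumptions of this sketch.

```python
EPS = 1e-9  # tolerance so a similarity equal to the threshold still counts

def high_correlation_usefulness(q, df, W, threshold):
    """Estimate usefulness assuming df_i <= df_j implies docs(t_i) within docs(t_j)."""
    # Query terms that occur in the database, rarest first.
    terms = sorted((i for i in range(len(q)) if q[i] > 0 and df[i] > 0),
                   key=lambda i: df[i])
    total, prev_df = 0.0, 0
    for pos, i in enumerate(terms):
        # Documents in this group contain term i and every more frequent term.
        sim = sum(q[j] * W[j] / df[j] for j in terms[pos:])
        group_size = df[i] - prev_df
        if sim + EPS >= threshold:
            total += group_size * sim
        prev_df = df[i]
    return total

estimate = high_correlation_usefulness(
    q=(1, 1, 1), df=(2, 3, 4), W=(0.6, 0.6, 1.2), threshold=0.5)
```

For the example, the groups are two documents with estimated similarity 0.8, one with 0.5 and one with 0.3; the first two groups reach the threshold, giving 2*0.8 + 0.5 = 2.1.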

gGlOSS (continued)
• disjoint case: for any two query terms ti and tj, no document contains both ti and tj.
Example: Consider q = (1, 1, 1) with df1 = 2, df2 = 1, df3 = 1, W1 = 0.5, W2 = 0.2 and W3 = 0.4.

        actual weights       estimated weights (Wi/dfi)
        t1    t2    t3       t1    t2    t3
  d1    0.2   0     0        0.25  0     0
  d2    0     0.2   0        0     0.2   0
  d3    0.3   0     0        0.25  0     0
  d4    0     0     0.4      0     0     0.4

$usefulness(q, D, 0.3) = \sum_{i:\ df_i > 0,\ q_i W_i / df_i \ge 0.3} q_i W_i = W_3 = 0.4$
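In the disjoint case each document contains at most one query term, so its estimated similarity is qi*Wi/dfi for that term, and the dfi documents of a qualifying term together contribute qi*Wi. The sketch below follows that reasoning; the function name, the inclusive threshold and the tolerance are assumptions.

```python
EPS = 1e-9  # tolerance so a similarity equal to the threshold still counts

def disjoint_usefulness(q, df, W, threshold):
    """Estimate usefulness assuming no document contains two query terms."""
    total = 0.0
    for qi, dfi, Wi in zip(q, df, W):
        # Each of the df_i documents has estimated similarity q_i * W_i / df_i.
        if dfi > 0 and qi * Wi / dfi + EPS >= threshold:
            total += qi * Wi  # df_i documents times q_i * W_i / df_i
    return total

estimate = disjoint_usefulness(
    q=(1, 1, 1), df=(2, 1, 1), W=(0.5, 0.2, 0.4), threshold=0.3)
```

Only t3 qualifies here (0.4 against 0.25 and 0.2 for the other terms), so the estimate is W3 = 0.4.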

gGlOSS (continued)
Some observations
• usefulness dependent on threshold
• representative has two quantities per term
• strong assumptions are used
  • high-correlation tends to overestimate
  • disjoint tends to underestimate
  • the two estimates tend to form bounds to the sum of the similarities ≥ T

Qualitative Approaches Using Detailed Representatives
Example 2: CORI Net [CaLC95]
• representative: (dfi, cfi) for term ti
  dfi -- document frequency of ti
  cfi -- collection frequency of ti
• cfi can be shared by all databases
• database usefulness
  usefulness(q, D) = sim(q, representative of D)
• analogy between database ranking and document ranking:
  usefulness ↔ similarity
  dfi ↔ tfi
  cfi ↔ dfi

CORI Net (continued)
Some observations
• estimates independent of threshold
• representative has less than two quantities per term
• similarity is computed based on inference network
• same method for ranking documents and ranking databases

Qualitative Approaches Using Detailed Representatives
Example 3: D-WISE [YuLe97]
• representative: dfi,j for term tj in database Di
• database usefulness: a measure of query term concentration in different databases
  $usefulness(q, D_i) = \sum_{j=1}^{k} CVV_j \cdot df_{i,j}$
  k: number of query terms
  CVVj: cue validity variance of term tj across all databases; larger CVVj ⇒ tj is more useful in distinguishing different databases

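D-WISE (continued)
$CVV_j = \frac{1}{N} \sum_{i=1}^{N} (CV_{i,j} - ACV_j)^2$
$CV_{i,j} = \frac{df_{i,j} / n_i}{df_{i,j} / n_i + \sum_{k \ne i} df_{k,j} / \sum_{k \ne i} n_k}$
N: number of databases
ACVj: average cue validity of tj over all databases
ni: number of documents in database Di
• Observations:
  • estimates independent of threshold
  • representative has one quantity per term
  • measure is difficult to understand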

Quantitative Approaches
Two types of quantities may be estimated with respect to query q:
• the number of documents in a database D with similarities higher than a threshold T:
  NoDoc(q, D, T) = |{ d : d ∈ D and sim(q, d) > T }|
• the global similarity of the most similar document in D:
  msim(q, D) = max { sim(q, d) : d ∈ D }
• either can be used to rank databases in descending order of similarity (or any desirability measure)
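Both quantities are simple to state when the true similarities are known; the estimation methods that follow try to approximate them from a compact representative instead. A minimal sketch of the exact definitions, with made-up similarity values:

```python
def no_doc(sims, threshold):
    """NoDoc(q, D, T): number of documents with similarity strictly above T."""
    return sum(1 for s in sims if s > threshold)

def msim(sims):
    """msim(q, D): similarity of the most similar document (0.0 if D is empty)."""
    return max(sims, default=0.0)

# Hypothetical similarities of the documents in one database to a query.
sims_in_D = [0.8, 0.5, 0.2]

count_above = no_doc(sims_in_D, 0.4)
best = msim(sims_in_D)
```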

Estimating NoDoc(q, D, T)
Basic Approach [MLYW98]
• representative: (pi, wi) for term ti
  pi: probability that ti appears in a document
  wi: average weight of ti among documents having ti
Example: normalized weights of ti in 10 documents are (0, 0, 0, 0, 0.2, 0.2, 0.4, 0.4, 0.6, 0.6).
  pi = 0.6, wi = 0.4
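The pair (pi, wi) for one term can be derived from the term's weights across the documents of a database. A minimal sketch using the ten weights from the example; the function name is illustrative.

```python
def term_representative(weights):
    """Return (p, w): fraction of documents containing the term and the
    average weight of the term among those documents."""
    present = [w for w in weights if w > 0]
    p = len(present) / len(weights)
    w_avg = sum(present) / len(present) if present else 0.0
    return p, w_avg

p_i, w_i = term_representative([0, 0, 0, 0, 0.2, 0.2, 0.4, 0.4, 0.6, 0.6])
```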

Estimating NoDoc(q, D, T)
Basic Approach (continued)
Example: Consider query q = (1, 1). Suppose p1 = 0.2, w1 = 2, p2 = 0.4, w2 = 1.
A generating function:
  $(0.2 X^2 + 0.8)(0.4 X + 0.6) = 0.08 X^3 + 0.12 X^2 + 0.32 X + 0.48$
  $a X^b$: a is the probability that a document in D has similarity b with q
NoDoc(q, D, 1) = 10*(0.08 + 0.12) = 2
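The expansion can be reproduced by representing each generating function as a mapping from exponent (similarity) to coefficient (probability) and multiplying the per-term factors. This is a sketch under the slide's assumptions (independent terms, uniform weight per term); the function names are illustrative, and the database size of 10 is taken from the factor 10 in the example.

```python
from collections import defaultdict

def multiply(a, b):
    """Multiply two generating functions stored as {exponent: coefficient}."""
    out = defaultdict(float)
    for ea, ca in a.items():
        for eb, cb in b.items():
            out[ea + eb] += ca * cb
    return dict(out)

def similarity_distribution(q, reps):
    """Product over query terms of (p_i * X^(q_i * w_i) + (1 - p_i))."""
    poly = {0: 1.0}
    for qi, (p, w) in zip(q, reps):
        factor = defaultdict(float)
        factor[qi * w] += p        # term present: contributes q_i * w_i
        factor[0] += 1.0 - p       # term absent: contributes nothing
        poly = multiply(poly, dict(factor))
    return poly

def estimate_no_doc(q, reps, n_docs, threshold):
    """Expected number of documents with similarity strictly above threshold."""
    poly = similarity_distribution(q, reps)
    return n_docs * sum(c for e, c in poly.items() if e > threshold)

q = (1, 1)
reps = [(0.2, 2), (0.4, 1)]   # (p_i, w_i) for each query term
dist = similarity_distribution(q, reps)
estimate = estimate_no_doc(q, reps, n_docs=10, threshold=1)
```

The coefficients of `dist` always sum to 1, since each factor is itself a probability distribution over the term's contribution.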

Estimating NoDoc(q, D, T)
Basic Approach (continued)
Consider query q = (q1, ..., qr).
Proposition. If the terms are independent and the weight of term ti, whenever present in a document, is wi (the average weight), 1 ≤ i ≤ r, then the coefficient of $X^s$ in the following generating function is the probability that a document in D has similarity s with q:
  $\prod_{i=1}^{r} \left( p_i X^{w_i q_i} + (1 - p_i) \right)$

Estimating NoDoc(q, D, T)
Subrange-based Approach [MLYW99]
• overcome the uniform term weight assumption
• additional information for term ti:
  σi: standard deviation of weights of ti in all documents
  mnwi: maximum normalized weight of ti

Estimating NoDoc(q, D, T)
Example: weights of term ti: 4, 4, 1, 1, 1, 1, 0, 0, 0, 0
• generating function (factor) using the average weight: $0.6 X^2 + 0.4$
• a more accurate function using subranges of weights: $0.2 X^4 + 0.4 X + 0.4$
In general, weights are partitioned into k subranges:
  $p_{i1} X^{m_{i1}} + \dots + p_{ik} X^{m_{ik}} + (1 - p_i)$
Probability pij and median mij can be estimated using σi and the average of the weights of ti.
A special implementation: use the maximum normalized weight as the first subrange by itself.
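The gain from subranges shows up when counting documents above a threshold. The sketch below builds both factors for the example directly from the raw weights, for a single-term query with query weight 1. Note the simplification: here each distinct weight forms its own subrange, whereas the approach on the slide estimates the subranges from the standard deviation and the average weight. The function names are illustrative.

```python
from collections import Counter

def average_factor(weights):
    """Single-subrange factor: p * X^avg + (1 - p)."""
    present = [w for w in weights if w > 0]
    if not present:
        return {0: 1.0}
    p = len(present) / len(weights)
    return {sum(present) / len(present): p, 0: 1 - p}

def per_value_factor(weights):
    """One subrange per distinct weight (simplifying assumption)."""
    n = len(weights)
    return {w: c / n for w, c in Counter(weights).items()}

def estimated_count(factor, n_docs, threshold):
    """Expected number of documents whose weight exceeds the threshold."""
    return n_docs * sum(c for e, c in factor.items() if e > threshold)

weights = [4, 4, 1, 1, 1, 1, 0, 0, 0, 0]
coarse = average_factor(weights)
fine = per_value_factor(weights)

true_count = sum(1 for w in weights if w > 3)
coarse_estimate = estimated_count(coarse, len(weights), 3)
fine_estimate = estimated_count(fine, len(weights), 3)
```

With threshold 3, the average-weight factor predicts no qualifying documents because the average weight is only 2, while the finer factor recovers the two documents with weight 4.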

Estimating NoDoc(q, D, T)
Combined-term Approach [LYMW99]
• relieve the term independence assumption
Example: Consider the query: Chinese medicine. Suppose the generating functions are:
  Chinese: $0.1 X^3 + 0.3 X + 0.6$
  medicine: $0.2 X^2 + 0.4 X + 0.4$
  Chinese medicine (terms treated as independent): $0.02 X^5 + 0.04 X^4 + 0.1 X^3 + \dots$
  "Chinese medicine" (combined term): $0.05 X^w + \dots$
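The independence-based line of the example is just the product of the two single-term generating functions. The sketch below multiplies them to confirm the three leading coefficients; it does not model the combined term itself, whose exponent w and remaining terms the slide leaves unspecified.

```python
from collections import defaultdict

def multiply(a, b):
    """Multiply two generating functions stored as {exponent: coefficient}."""
    out = defaultdict(float)
    for ea, ca in a.items():
        for eb, cb in b.items():
            out[ea + eb] += ca * cb
    return dict(out)

chinese = {3: 0.1, 1: 0.3, 0: 0.6}
medicine = {2: 0.2, 1: 0.4, 0: 0.4}

# Distribution of similarities if the two terms occurred independently.
independent = multiply(chinese, medicine)
```

The coefficient of X^3 collects two cases, 0.1*0.4 (only "Chinese" contributes weight 3) and 0.3*0.2 (weights 1 and 2 from the two terms), which together give 0.1.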

Estimating NoDoc(q, D, T)
Criteria for combining "Chinese" and "medicine":
• The maximum normalized weight of the combined term is higher than the maximum normalized weight of each of the two individual terms (w > 3);
• The sum of estimated probabilities of terms with exponents ≥ w under the term independence assumption is very different from 1/N, where N is the number of documents in the database;
• They are adjacent terms in previous queries.

Database Selection Using msim(q,D)
Optimal Ranking of Databases [YLWM99b]
User: for query q, find the m most similar documents (or the m documents with the largest degrees of relevance)
Definition: Databases [D1, D2, …, Dp] are optimally ranked with respect to q if there exists a k such that each of the databases D1, …, Dk contains at least one of the m most similar documents, and all of these m documents are contained in these k databases.

Database Selection Using msim(q,D)
Optimal Ranking of Databases
Example: For a given query q:
  D1: d1: 0.8, d2: 0.5, d3: 0.2, ...
  D2: d9: 0.7, d2: 0.6, d10: 0.4, ...
  D3: d8: 0.9, d12: 0.3, …
  other databases have documents with small similarities
When m = 5: pick D1, D2, D3

Database Selection Using msim(q,D)
Proposition: Databases [D1, D2, …, Dp] are optimally ranked with respect to a query q if and only if msim(q, Di) ≥ msim(q, Dj) whenever i < j
Example:
  D1: d1: 0.8, …
  D2: d9: 0.7, …
  D3: d8: 0.9, …
Optimal rank: [D3, D1, D2, …]
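The proposition can be illustrated on the similarities from the earlier m = 5 example: ranking by msim gives [D3, D1, D2], and for each m the databases holding the m most similar documents form a prefix of that ranking. The sketch keeps only the similarity values listed in that example and uses illustrative function names.

```python
def rank_by_msim(databases):
    """Order database names by the similarity of their best document."""
    return sorted(databases, key=lambda name: max(databases[name]), reverse=True)

def databases_needed(databases, m):
    """Names of the databases that hold the m globally most similar documents."""
    all_docs = sorted(((s, name) for name, sims in databases.items() for s in sims),
                      reverse=True)
    return {name for _, name in all_docs[:m]}

# Similarities listed in the example; other databases are omitted.
databases = {
    "D1": [0.8, 0.5, 0.2],
    "D2": [0.7, 0.6, 0.4],
    "D3": [0.9, 0.3],
}

ranking = rank_by_msim(databases)

def prefix_holds(m):
    """True if the needed databases are exactly a prefix of the msim ranking."""
    needed = databases_needed(databases, m)
    return needed == set(ranking[:len(needed)])
```

For m = 1 only D3 is needed, for m = 2 the prefix [D3, D1] suffices, and for m = 5 all three databases are required, matching the example.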
