
Presentation Transcript


  1. Presentation Topic: Searching Distributed Collections With Inference Networks. Keywords: searching, distributed collections, inference network. COSC 6341 - Information Retrieval.

  2. What is Distributed IR? • Homogeneous collections: a large single collection is partitioned and distributed over a network to improve efficiency (e.g., Google). • Heterogeneous collections: the Internet offers thousands of diverse collections available for searching on demand (e.g., P2P). These are the two main architectural aspects of distributed IR.

  3. P2P (Peer-to-Peer) Architecture

  4. How to search such a collection? Consider the entire set of collections as a single large virtual collection. [Figure: a virtual collection made up of collections C1-C6.]

  5. Search each collection individually and gather the results. [Figure: each collection C1-C6 of the virtual collection is searched separately (Search1-Search6) and returns its own result list (Results1-Results6).] Open questions: What are the communication costs? How should the results be merged? How much time is required?

  6. Solution: An IR system must automatically: • Rank collections • Select collections - choose the specific collections to be searched • Merge the ranked results - effectively merge the result lists returned by the selected collections

  7. Ranking Collections • Collection ranking can be addressed with an inference network. A CORI net (Collection Retrieval Inference Network) is used to rank collections, as opposed to the more common Document Retrieval Inference Network.

  8. Inference Network [Figure: inference network with document node dj, index-term nodes k1, k2, ..., ki, query nodes q, q1, q2, and information-need node I.] Document dj has index terms k1, k2, ki. Query q is composed of index terms k1, k2, ki. Boolean formulations: q1 = [(k1 AND k2) OR ki], q2 = (k1 AND k2). Information need: I = q OR q1.
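A minimal sketch (my own illustration, not part of the slides) of how the Boolean query formulations q1 and q2 from slide 8 could be evaluated against a document's index terms, treating each term node as a binary observed / not-observed event; the term names are just the slide's symbols:

```python
# Hypothetical illustration of slide 8's Boolean query nodes; term names are placeholders.
def q1(terms):
    """q1 = (k1 AND k2) OR ki"""
    return ("k1" in terms and "k2" in terms) or ("ki" in terms)

def q2(terms):
    """q2 = k1 AND k2"""
    return "k1" in terms and "k2" in terms

doc_dj = {"k1", "ki"}          # a hypothetical document indexed by k1 and ki
print(q1(doc_dj), q2(doc_dj))  # True False
```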

  9. Ranking Documents in an Inference Network using tf-idf strategies • Term frequency (tf): f(i, j) = P(ki | dj) • Inverse document frequency (idf_i): P(q | k̄) • In an inference network, the ranking of a document is computed as: P(q ∧ dj) = Cj * (1 / |dj|) * Σ_i [ f(i, j) * idf_i * 1 / (1 − f(i, j)) ] where P(ki | dj) is the influence of keyword ki on document dj, and P(q | k̄) is the influence of the index terms ki on the query node.
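A small sketch of the ranking formula on slide 9. Normalizing raw counts by the document's most frequent term and using a log-based idf are my assumptions for illustration; they are not fixed by the slide.

```python
import math

# Sketch of P(q ∧ dj) = Cj * (1/|dj|) * Σ f(i,j) * idf_i * 1/(1 - f(i,j))  (slide 9).
def document_belief(query_terms, doc_tf, doc_len, n_docs, df, c_j=1.0):
    max_tf = max(doc_tf.values())
    score = 0.0
    for term in query_terms:
        f_ij = doc_tf.get(term, 0) / max_tf          # f(i, j), a normalized tf in [0, 1]
        if f_ij == 0.0 or term not in df:
            continue
        idf = math.log(n_docs / df[term])            # assumed idf form
        score += f_ij * idf / (1.0 - f_ij + 1e-9)    # epsilon guards the f(i, j) == 1 case
    return c_j * score / doc_len

# Toy usage: one document with raw term counts, in a 1000-document collection.
doc_tf = {"inference": 3, "network": 2, "retrieval": 5}
df = {"inference": 40, "network": 300, "retrieval": 120}
print(document_belief(["inference", "network"], doc_tf, doc_len=10, n_docs=1000, df=df))
```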

  10. Analogy between a Document Retrieval and a Collection Retrieval Network: the tf-idf scheme for documents corresponds to a df-icf scheme for collections. • Documents: tf - term frequency, the number of occurrences of a term in a document; idf - inverse document frequency, a function of the number of documents containing the term; dl - document length; D - number of documents. • Collections: df - document frequency, the number of documents in a collection containing the term; icf - inverse collection frequency, a function of the number of collections containing the term; cl - collection length; C - number of collections.
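A minimal sketch (toy data and variable names of my own choosing) of how the df and icf statistics of slide 10 could be computed from a set of collections, each represented as a list of documents, each document a list of terms:

```python
from collections import defaultdict
from math import log

# Toy collections: collection name -> list of documents (each a list of terms).
collections = {
    "C1": [["inference", "network"], ["retrieval", "network"]],
    "C2": [["peer", "network"], ["peer", "protocol"]],
}

df = defaultdict(lambda: defaultdict(int))   # df[collection][term]: docs in collection containing term
cf = defaultdict(int)                        # number of collections containing the term
for cname, docs in collections.items():
    terms_in_collection = set()
    for doc in docs:
        for term in set(doc):
            df[cname][term] += 1
        terms_in_collection |= set(doc)
    for term in terms_in_collection:
        cf[term] += 1

# One possible icf form (assumed): log(|C| / cf).
icf = {t: log(len(collections) / n) for t, n in cf.items()}
print(df["C1"]["network"], icf["network"])   # 2, log(2/2) = 0.0
```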

  11. Comparison between Document Retrieval and Collection Retrieval Inference Networks • Document Retrieval Inference Network: retrieves documents based on a query. • Collection Retrieval Inference Network (CORI net): retrieves collections based on a query.

  12. Why use an Inference Network? • One system is used for ranking both documents and collections. Document retrieval becomes a simple process (see the sketch after this slide): • Use the query to retrieve a ranked list of collections • Select the top group of collections • Search the top group of collections, in parallel or sequentially • Merge the results from the various collections into a single ranking • To the retrieval algorithm, a CORI net looks like a document retrieval inference network with very big documents: • Each "document" is a surrogate for a complete collection
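A minimal sketch of the four-step process listed on slide 12. All function and variable names here are my own placeholders, not the paper's API; collection scores and result lists are toy stand-ins for CORI-net ranking and per-collection search.

```python
def distributed_search(query, collection_scores, search_fn, top_k=2, top_n=10):
    # 1. Use the query to rank the collections (here the scores are supplied directly).
    ranked = sorted(collection_scores, key=collection_scores.get, reverse=True)
    # 2. Select the top group of collections.
    selected = ranked[:top_k]
    # 3. Search the selected collections (sequentially here; could also run in parallel).
    hits = [hit for c in selected for hit in search_fn(c, query, top_n)]
    # 4. Merge the results from the various collections into a single ranking.
    return sorted(hits, key=lambda h: h[1], reverse=True)[:top_n]

# Toy usage: fake per-collection search results as (doc_id, score) pairs.
fake_results = {"C1": [("D7", 0.9), ("D5", 0.4)], "C2": [("D13", 0.7)], "C3": [("D21", 0.8)]}
print(distributed_search("inference networks",
                         collection_scores={"C1": 0.6, "C2": 0.2, "C3": 0.5},
                         search_fn=lambda c, q, n: fake_results[c][:n]))
```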

  13. Interesting facts: • The CORI net for a 1.2 GB collection is about 5 MB (0.4% of the original collection). • A CORI net for searching 3,000 document collections the size of the well-known CACM collection shows high values of df and icf, but this does not affect the computational complexity of retrieval.

  14. Experiments on the TREC Collection The belief p(rk | ci) in collection ci due to observing term rk is given by: T = d_t + (1 − d_t) * log(df + 0.5) / log(max_df + 1.0) I = log((|C| + 0.5) / cf) / log(|C| + 1.0) p(rk | ci) = d_b + (1 − d_b) * T * I where: df = number of documents in ci containing term rk; max_df = number of documents containing the most frequent term in ci; |C| = number of collections; cf = number of collections containing term rk; d_t = minimum term frequency component when term rk occurs in collection ci; d_b = minimum belief component when term rk occurs in collection ci.
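A direct sketch of the belief formula on slide 14. The default values for d_t and d_b below are illustrative assumptions, not values given by the slide.

```python
import math

def collection_belief(df, max_df, cf, n_collections, d_t=0.4, d_b=0.4):
    """Belief p(rk | ci) that collection ci contains documents about query term rk (slide 14)."""
    t = d_t + (1.0 - d_t) * math.log(df + 0.5) / math.log(max_df + 1.0)
    i = math.log((n_collections + 0.5) / cf) / math.log(n_collections + 1.0)
    return d_b + (1.0 - d_b) * t * i

# Toy usage: the term occurs in 120 documents of ci, the most frequent term in 5000;
# there are 100 collections overall and 7 of them contain the term.
print(collection_belief(df=120, max_df=5000, cf=7, n_collections=100))
```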

  15. Effectiveness • This approach to ranking collections was evaluated using the INQUERY retrieval system and the TREC collection. • Experiments were conducted with 100 queries (TREC Volume 1, Topics 51-150). • The mean squared error of the collection ranking for a single query is calculated as: (1 / |C|) * Σ_{i ∈ C} (Oi − Ri)^2 where Oi = the optimal rank for collection i, based on the number of relevant documents it contains; Ri = the rank for collection i determined by the retrieval algorithm; C = the set of collections being ranked.
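A small sketch of the mean squared error measure on slide 15, comparing an optimal collection ranking against the ranking produced by a retrieval algorithm (the toy numbers are mine):

```python
def ranking_mse(optimal_ranks, system_ranks):
    """(1/|C|) * Σ (Oi - Ri)^2 over the set of ranked collections (slide 15)."""
    assert set(optimal_ranks) == set(system_ranks)
    return sum((optimal_ranks[c] - system_ranks[c]) ** 2
               for c in optimal_ranks) / len(optimal_ranks)

# Toy usage: three collections; the system swaps the top two.
optimal = {"C1": 1, "C2": 2, "C3": 3}
system  = {"C1": 2, "C2": 1, "C3": 3}
print(ranking_mse(optimal, system))   # (1 + 1 + 0) / 3 ≈ 0.667
```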

  16. Results • Mean squared error averaged over the first 50 queries = 2.3471 • The ranking for almost 75% of the queries was perfect • The remaining 25% had disordered rankings Reason: ranking collections is not exactly the same as ranking documents; scaling df by max_df under-weights small sets of interesting documents. So, the modification is to scale df by df + K instead, where: K = k * ((1 − b) + b * cw / c̄w) cw = number of words in the collection, c̄w = mean number of words over all collections, and k, b are constants with b ∈ [0, 1]. Thus: T = d_t + (1 − d_t) * df / (df + K)
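A sketch of the modified term-frequency component from slide 16, using b = 0.75 and k = 200 (the best combination reported on slide 17); the default d_t is an illustrative assumption.

```python
def scaled_T(df, cw, avg_cw, d_t=0.4, b=0.75, k=200):
    """T = d_t + (1 - d_t) * df / (df + K), with K = k * ((1 - b) + b * cw / avg_cw)."""
    K = k * ((1.0 - b) + b * cw / avg_cw)
    return d_t + (1.0 - d_t) * df / (df + K)

# Toy usage: a term with df = 120 in a collection twice the average length.
print(scaled_T(df=120, cw=2_000_000, avg_cw=1_000_000))
```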

  17. Modified results The best combination of b and k: b = 0.75 and k = 200. Mean squared error averaged over 50 queries = 1.4586 (38% better than the previous results). Rankings for 30 queries improved, rankings for 8 queries changed slightly, and rankings for 12 queries did not change.

  18. Merging the Results Four approaches, in order of increasing effectiveness: • Interleaving • Raw scores • Normalized scores • Weighted scores

  19. Raw Score Interleaving [Figure: a single large homogeneous collection D1-D60 is partitioned into collections C1 (D1-D10) through C6 (D51-D60). Step 1 ranks the collections; Step 2 merges the per-collection results, using each document's score as its weight.] But these weights (scores) coming from different collections may not be directly comparable. Interleaving by itself is not satisfying either, since it uses only the document rankings and the collection rankings.
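A minimal sketch (toy data, not the slide's example) of two of the merging strategies from slides 18-19: interleaving by collection rank, and merging by raw document score.

```python
results = {                         # collection -> [(doc_id, raw_score), ...], best first
    "C1": [("D7", 80.0), ("D5", 12.0), ("D1", 10.0)],
    "C4": [("D37", 69.0), ("D44", 29.0)],
}
collection_rank = ["C4", "C1"]      # collections in ranked order

def interleave(results, collection_rank):
    """Take one document at a time from each collection, in collection-rank order."""
    merged, depth = [], 0
    while any(depth < len(results[c]) for c in collection_rank):
        for c in collection_rank:
            if depth < len(results[c]):
                merged.append(results[c][depth][0])
        depth += 1
    return merged

def merge_raw_scores(results):
    """Sort all documents by raw score, ignoring which collection they came from."""
    hits = [hit for hits in results.values() for hit in hits]
    return [doc for doc, _ in sorted(hits, key=lambda h: h[1], reverse=True)]

print(interleave(results, collection_rank))   # ['D37', 'D7', 'D44', 'D5', 'D1']
print(merge_raw_scores(results))              # ['D7', 'D37', 'D44', 'D5', 'D1']
```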

  20. Normalized Scores • In an inference network, normalizing scores requires a preprocessing step prior to query evaluation. • Preprocessing: the system obtains from each collection statistics about how many documents each query term matches; the statistics are merged to obtain normalized idf values (a global weighting scheme). • Problem: high communication and computational costs, especially over a widely distributed network.

  21. Weighted Scores • Weights can be based on a document's score and/or the collection ranking information. • Offers computational simplicity. • The weight W scales the results coming from different collections: W = 1 + |C| * (s − s̄) / s̄ where |C| = the number of collections searched, s = the collection's score (not its rank), and s̄ = the mean of the collection scores. Assumption: similar collections have similar weights.

  22. Weighted Scores • Rank(document) = score(document) * weight(collection) • This method favors documents from collections with high scores, but it still allows a good document from a poorly ranked collection to be retrieved, which is exactly what we are looking for. A small sketch follows.
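A sketch of weighted-score merging as described on slides 21-22; the collection scores and document scores below are toy values of my own.

```python
def collection_weight(score, all_scores):
    """W = 1 + |C| * (s - s_mean) / s_mean, from slide 21."""
    mean = sum(all_scores) / len(all_scores)
    return 1.0 + len(all_scores) * (score - mean) / mean

def merge_weighted(results, collection_scores):
    """Rank(document) = score(document) * weight(collection), from slide 22."""
    scores = list(collection_scores.values())
    merged = []
    for cname, hits in results.items():
        w = collection_weight(collection_scores[cname], scores)
        merged.extend((doc, doc_score * w) for doc, doc_score in hits)
    return sorted(merged, key=lambda h: h[1], reverse=True)

collection_scores = {"C1": 0.42, "C4": 0.61}
results = {"C1": [("D7", 37.0), ("D5", 12.0)], "C4": [("D37", 69.0), ("D44", 29.0)]}
print(merge_weighted(results, collection_scores))
```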

  23. Comparing results of the merging methods [Table omitted. Source: TREC collection, Volume 1, Topics 51-100.]

  24. Merging Results - Pros and Cons • Interleaving: extremely ineffective, with losses in average precision. (The reason is that highly ranked documents from non-relevant collections end up next to highly ranked documents from more relevant collections.) • Raw scores: scores from different collections may not be directly comparable (e.g., idf weights differ), so a term that is common within one collection can be penalized there even when it is rare elsewhere. • Normalized scores: comes closest to searching a single collection, but normalizing has significant communication and computational costs when collections are distributed across a wide-area network. • Weighted scores: as effective as normalized scores, but less robust (it introduces deviations in recall and precision).

  25. Collection Selection Approaches: • Top n collections • Any collection with a score greater than some threshold • Top group (based on clustering) • Cost-based selection

  26. Results Eliminating collections reduces Recall.

  27. Related work on collection selection: group manually and select manually • Collections are organized into groups with a common theme (e.g., financial, technology, appellate court decisions), and users select which group to search. Found in commercial service providers and used by experienced users such as librarians. • Groupings determined manually: (−) time consuming, inconsistent groupings, coarse groupings, not good for unusual information needs. • Groupings determined automatically: broker agents maintain a centralized cluster index by periodically querying the collections on each subject. (+) automatic creation, better consistency; (−) coarse groupings, not good for unusual information needs.

  28. Rule-Based Selection The contents of each collection are described in a knowledge base, and a rule-based system selects the collections for a query. EXPERT CONIT, a research system, was tested on static and homogeneous collections. (−) time consuming to create (−) inconsistent selection if rules change (−) coarse groupings, so not good for unusual information needs

  29. Optimization: Represent a Collection With a Subset of Terms • Build the inference network from only the most frequent terms; at least the 20% most frequent words must be retained.

  30. Proximity Information • Proximity of terms can be handled by a CORI net. • A CORI net with proximity information for one collection would be about 30% of the size of the original collection (compared with the 0.4% figure for the basic CORI net on slide 13).

  31. Retrieving Few Documents • Usually a user is interested only in the first 10, or at most 20, results; the rest are discarded (a waste of resources and time). Example: [Figure: collections C1 (10 docs), C2 (20 docs), C3 (30 docs) each return their top 10; the merged results keep only the top 10 for the user, so 20 documents are thrown out without the user ever looking at them.] With C = 3 collections and n = 10 rankings of interest: documents retrieved = C * n = 3 * 10 = 30; documents discarded = (C − 1) * n = (3 − 1) * 10 = 20.

  32. Experiments and results • The number of documents R(i) retrieved from the i-th ranked collection is: R(i) = M * n * [ 2(1 + C − i) / (C * (C + 1)) ] where M ∈ [1, (C + 1) / 2] and M * n = the number of documents to be retrieved from all collections.
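A small sketch of the allocation formula on slide 32; the usage line borrows the C = 5, M = 2 setting from slide 33 and assumes n = 10 rankings of interest.

```python
def docs_from_collection(i, C, n, M):
    """R(i) = M * n * 2(1 + C - i) / (C * (C + 1)), from slide 32."""
    assert 1 <= M <= (C + 1) / 2, "M must lie in [1, (C + 1) / 2]"
    return M * n * 2 * (1 + C - i) / (C * (C + 1))

# Toy usage: higher-ranked collections contribute more documents; the total is M * n = 20.
C, M, n = 5, 2, 10
allocation = [docs_from_collection(i, C, n, M) for i in range(1, C + 1)]
print(allocation, sum(allocation))
```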

  33. Result summary [Table omitted. Collection: TREC Volume 1, with C = 5 and M = 2.]

  34. Conclusions • Representing collections by terms and frequencies is effective. • Controlled vocabularies and schemas are not necessary. • Collections and documents can be ranked with one algorithm (using different statistics). e.g., GlOSS, inference networks • Rankings from different collections can be merged efficiently: – with precisely normalized scores (Infoseek’s method), or – without precisely normalized document scores, – with only minimal effort, and – with only minimal communication between client and server. • Large scale distributed retrieval can be accomplished now.

  35. Open Problems • Multiple representations: stemming, stopwords, query processing, indexing, cheating / spamming • How to integrate: relevance feedback, query expansion, browsing

  36. References • Primary source: J. P. Callan, Z. Lu, and W. B. Croft. Searching Distributed Collections With Inference Networks. • Secondary sources: 1. J. Allan (University of Massachusetts Amherst). Distributed Information Retrieval. 2. O. de Kretser, A. Moffat, T. Shimmin, and J. Zobel. Methodologies for Distributed Information Retrieval. (Proceedings of the 18th International Conference on Distributed Computing Systems.)

  37. Questions ??
