
Presentation Transcript


1. Distributed IR for Digital Libraries
Ray R. Larson
School of Information Management & Systems
University of California, Berkeley
ray@sims.berkeley.edu
ECDL 2003, Trondheim

2. Overview
• The problem area
• Distributed searching tasks and issues
• Our approach to resource characterization and search
• Experimental evaluation of the approach
• Application and use of this method in working systems

3. The Problem
• Prof. Casarosa's definition of the Digital Library vision in yesterday afternoon's plenary session: access for everyone to "all human knowledge"
• Lyman and Varian's estimates of the "Dark Web"
• Hundreds or thousands of servers with databases ranging widely in content, topic, and format
• Broadcast search is expensive, both in bandwidth and in processing too many irrelevant results
• How to select the "best" resources to search?
  • Which resource to search first?
  • Which to search next if more is wanted?
• Topical/domain constraints on the search selections
• Variable contents of databases (metadata only, full text, multimedia, ...)

4. Distributed Search Tasks
• Resource Description
  • How to collect metadata about digital libraries and their collections or databases
• Resource Selection
  • How to select relevant digital library collections or databases from a large number of databases
• Distributed Search
  • How to perform parallel or sequential searching over the selected digital library databases
• Data Fusion
  • How to merge query results from different digital libraries with their different search engines, differing record structures, etc.

5. An Approach for Distributed Resource Discovery
• Distributed resource representation and discovery
  • A new approach to building resource descriptions based on Z39.50
  • Instead of broadcast search across resources, we use two Z39.50 services:
    • Identification of database metadata using Z39.50 Explain
    • Extraction of distributed indexes using Z39.50 SCAN
• Evaluation
  • How efficiently can we build distributed indexes?
  • How effectively can we choose databases using the index?
  • How effective is merging search results from multiple sources?
  • Can we build hierarchies of servers (general / meta-topical / individual)?

6. Z39.50 Overview
[Diagram: a client UI and a server search engine communicate over the Internet; on each side the query and the results are mapped between the local form and the Z39.50 form.]

7. Z39.50 Explain
• Explain supports searches for:
  • Server-level metadata
    • Server name
    • IP addresses
    • Ports
  • Database-level metadata
    • Database name
    • Search attributes (indexes and combinations)
    • Support metadata (record syntaxes, etc.)
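
A minimal sketch, in Python, of the kind of record such an Explain probe might be collected into. The field names and example values here are illustrative assumptions, not the actual Z39.50 Explain record syntax.

    # Hypothetical, simplified container for metadata gathered via Explain.
    server_metadata = {
        "server_name": "z3950.example.org",      # assumed example host
        "ip_address": "192.0.2.10",
        "port": 210,                             # the conventional Z39.50 port
        "databases": [
            {
                "name": "books",
                "search_attributes": ["title", "topic", "author"],  # available indexes
                "record_syntaxes": ["SUTRS", "XML", "MARC"],
            }
        ],
    }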

8. Z39.50 SCAN
• Originally intended to support browsing
• Query consists of:
  • Database
  • Attributes plus term (i.e., index and start point)
  • Step size
  • Number of terms to retrieve
  • Position in response set
• Results include:
  • Number of terms returned
  • List of terms and their frequency in the database (for the given attribute combination)

9. Z39.50 SCAN Results
Syntax: zscan indexname1 term stepsize number_of_terms pref_pos

% zscan title cat 1 20 1
{SCAN {Status 0} {Terms 20} {StepSize 1} {Position 1}} {cat 27} {cat-fight 1} {catalan 19} {catalogu 37} {catalonia 8} {catalyt 2} {catania 1} {cataract 1} {catch 173} {catch-all 3} {catch-up 2} ...

% zscan topic cat 1 20 1
{SCAN {Status 0} {Terms 20} {StepSize 1} {Position 1}} {cat 706} {cat-and-mouse 19} {cat-burglar 1} {cat-carrying 1} {cat-egory 1} {cat-fight 1} {cat-gut 1} {cat-litter 1} {cat-lovers 2} {cat-pee 1} {cat-run 1} {cat-scanners 1} ...
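
As an illustration (not part of the original slides), a minimal Python sketch for turning the braced {term frequency} pairs shown above into a term list. It assumes the simple, unnested output format above; note that the leading SCAN status group ({Status 0}, {Terms 20}, ...) would also match and should be stripped or filtered out by the caller.

    import re

    def parse_zscan(output):
        """Extract (term, frequency) pairs from zscan output like the samples above."""
        return [(term, int(freq))
                for term, freq in re.findall(r"\{([^{}\s]+)\s+(\d+)\}", output)]

    sample = "{cat 27} {cat-fight 1} {catalan 19} {catalogu 37}"
    print(parse_zscan(sample))
    # [('cat', 27), ('cat-fight', 1), ('catalan', 19), ('catalogu', 37)]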

10. Resource Index Creation
• For all servers, or a topical subset:
  • Get Explain information
  • For each index:
    • Use SCAN to extract terms and frequencies
    • Add term + frequency + source index + database metadata to the XML "Collection Document" for the resource
• Planned extensions:
  • Post-process indexes (especially geographic names, etc.) for special types of data
    • e.g., create "geographical coverage" indexes
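
A minimal sketch, in Python, of the harvesting loop described above. The `explain` structure and the `scan` callable are assumptions standing in for real Z39.50 Explain and SCAN calls, and the XML element names are illustrative rather than the actual Cheshire collection document schema.

    from xml.etree import ElementTree as ET

    def build_collection_document(host, port, explain, scan):
        """Build an XML "collection document" for one server.

        explain: metadata already fetched via Z39.50 Explain (see the sketch
                 after slide 7), including each database's indexes.
        scan:    callable (database, index) -> iterable of (term, freq) pairs,
                 e.g. a wrapper around Z39.50 SCAN.
        """
        root = ET.Element("collectionDocument", host=host, port=str(port))
        for db in explain["databases"]:
            db_el = ET.SubElement(root, "database", name=db["name"])
            for index in db["search_attributes"]:
                idx_el = ET.SubElement(db_el, "index", name=index)
                # Walk the index with SCAN, recording terms and frequencies.
                for term, freq in scan(db["name"], index):
                    ET.SubElement(idx_el, "term", freq=str(freq)).text = term
        return ET.tostring(root, encoding="unicode")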

11. MetaSearch Approach
[Architecture diagram: a MetaSearch server with its own search engine and the distributed index sits between the client and a set of database servers (DB 1 through DB 6), each with its own search engine; Explain and SCAN queries, user queries, and results are mapped back and forth across the Internet.]
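
A toy Python sketch of the flow the diagram implies: rank the candidate collections for a query, forward the query to the top few, and merge the returned results. The scoring and merging here are deliberately simplified placeholders, not the logistic regression ranking described later in the talk.

    def metasearch(query, collection_scores, search_backends, top_k=3):
        """Select collections, search them, and fuse the results.

        collection_scores: collection id -> estimated relevance of that
                           collection for this query (output of the
                           collection ranking step).
        search_backends:   collection id -> callable(query) returning
                           (doc_id, score) pairs from that collection.
        """
        # Resource selection: take the top-k collections by estimated relevance.
        selected = sorted(collection_scores, key=collection_scores.get,
                          reverse=True)[:top_k]
        # Distributed search and data fusion: query each selected collection
        # and merge the result lists by their (assumed comparable) scores.
        merged = []
        for cid in selected:
            for doc_id, score in search_backends[cid](query):
                merged.append((score, cid, doc_id))
        return sorted(merged, reverse=True)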

12. Known Issues and Problems
• Not all Z39.50 servers support SCAN or Explain
• Solutions that appear to work well:
  • Probing for attributes instead of Explain (e.g., DC attributes or analogs)
  • We also support OAI and can extract OAI metadata from servers that support it
  • Query-based sampling (Callan)
• Collection documents are static and need to be replaced when the associated collection changes

13. Evaluation
• Test environment:
  • TREC Tipster data (approx. 3 GB)
  • Partitioned into 236 smaller collections based on source and date by month (no DOE)
  • High size variability (from 1 to thousands of records)
  • Same database as used in other distributed search studies by J. French and J. Callan, among others
• Used TREC topics 51-150 for evaluation (the only topics with relevance judgements for all 3 TIPSTER disks)

14. Test Database Characteristics

15. Test Database Characteristics

16. Harvesting Efficiency
• Tested using the databases on the previous slide plus the full FT database (210,158 records, ~600 MB)
• Average of 23.07 seconds per database to SCAN each database (3.4 indexes on average) and create a collection representative, over the network
• Average of 14.07 seconds
• Also tested larger databases: e.g., the TREC FT database (~600 MB, 7 indexes) was harvested in 131 seconds

17. Our Collection Ranking Approach
• We attempt to estimate the probability of relevance for a given collection with respect to a query using the logistic regression method developed at Berkeley (W. Cooper, F. Gey, D. Dabney, A. Chen), with a new algorithm for weight calculation at retrieval time
• Estimates from multiple extracted indexes are combined to provide an overall ranking score for a given resource (i.e., fusion of multiple query results)

18. Probabilistic Retrieval: Logistic Regression
The probability of relevance for a given index is based on logistic regression, with the coefficient values determined from a sample set of documents (TREC). At retrieval time the probability estimate is obtained from the fitted regression equation (a sketch of its general form follows).
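
The regression equation shown on this slide is not reproduced in the transcript. As an assumption about its general form (following the Berkeley logistic regression approach), the retrieval-time estimate combines the statistics X_1..X_6 listed on the next slide linearly, with coefficients c_0..c_6 fitted on the TREC training sample, and passes the result through the logistic function:

    \log O(R \mid Q, C) = c_0 + \sum_{i=1}^{6} c_i X_i,
    \qquad
    P(R \mid Q, C) = \frac{e^{c_0 + \sum_{i=1}^{6} c_i X_i}}{1 + e^{c_0 + \sum_{i=1}^{6} c_i X_i}}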

19. Statistics Used for Regression Variables
• Average absolute query frequency
• Query length
• Average absolute collection frequency
• Collection size estimate
• Average inverse collection frequency
• Number of terms in common between query and collection representative
(Details in the proceedings.)

20. Other Approaches
• GlOSS: developed by the Digital Library project at Stanford University; uses a fairly conventional TF-IDF ranking
• CORI: developed by J. Callan and students at CIIR; uses a ranking that exploits some of the features of the INQUERY system in merging evidence

21. Evaluation
• Effectiveness
  • Tested using the collection representatives described above (as harvested over the network) and the TIPSTER relevance judgements
  • Testing by comparing our approach to known algorithms for ranking collections
  • Results were measured against reported results for the Ideal and CORI algorithms and against the optimal "Relevance Based Ranking" (MAX)
  • Recall analog (how many of the relevant documents occurred in the top n databases, averaged over queries)

22. Titles only (short query)

23. Long Queries

24. Very Long Queries

25. Current Usage
• Mersey Libraries
• Distributed Archives Hub
• Related approaches:
  • JISC Resource Discovery Network (OAI-PMH harvesting with Cheshire search)
• Planned use with TEL by the BL

26. Future
• Logically clustering servers by topic
• Meta-meta servers (treating the MetaSearch database as just another database)

27. Distributed Metadata Servers
[Diagram: a hierarchy of replicated servers, with database servers at the bottom, meta-topical servers above them, and general servers at the top.]

28. Conclusion
• A practical method for metadata harvesting and an effective algorithm for distributed resource discovery
• Further research:
  • Continuing development of the Cheshire III system
  • Applicability of language modelling methods to resource discovery
  • Developing and evaluating methods for merging cross-domain results, such as text and image or text and GIS datasets (or, perhaps, when to keep them separate)

29. Further Information
• Full Cheshire II client and server source is available at ftp://cheshire.berkeley.edu/pub/cheshire/ (includes HTML documentation)
• Project web site: http://cheshire.berkeley.edu/


31. Probabilistic Retrieval: Logistic Regression Attributes
• Average absolute query frequency
• Query length
• Average absolute collection frequency
• Collection size estimate
• Average inverse collection frequency
• Inverse document frequency
(N = number of collections; M = number of terms in common between the query and the document)

32. CORI ranking
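
The CORI ranking formula shown on this slide is not reproduced in the transcript. For reference, the usual statement of CORI's collection score for a query term t and collection C_i, as given in the CORI literature (Callan et al.) rather than taken from the slide itself, is:

    T = \frac{df}{df + 50 + 150 \cdot cw_i / \overline{cw}},
    \qquad
    I = \frac{\log\left((|DB| + 0.5) / cf\right)}{\log\left(|DB| + 1.0\right)},
    \qquad
    p(t \mid C_i) = b + (1 - b) \cdot T \cdot I

where df is the number of documents in C_i containing t, cf is the number of collections containing t, cw_i is the word count of C_i, \overline{cw} is the mean word count over collections, |DB| is the number of collections, and b is a default belief (commonly 0.4). Term scores are then combined using INQUERY-style query operators, which is the "merging evidence" feature noted on slide 20.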

33. Measures for Evaluation
• Assume each database has some merit for a given query q
• Given a baseline ranking B and an estimated (test) ranking E for q:
  • Let db_bi and db_ei denote the databases in the i-th ranked position of rankings B and E
  • Let Bi = merit(q, db_bi) and Ei = merit(q, db_ei)
• We can define some measures (see the following slides)

34. Measures for Evaluation – Recall Analogs
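
The formulas on this slide are not reproduced in the transcript. A recall analog commonly used in the distributed IR evaluation literature (e.g., by French and Powell), and consistent with the definitions on slide 33, compares the merit accumulated in the top n databases of the estimated ranking E against that of the baseline ranking B:

    \mathcal{R}_n = \frac{\sum_{i=1}^{n} E_i}{\sum_{i=1}^{n} B_i}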

35. Measures for Evaluation – Precision Analog
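
Likewise, the precision analog formula is not reproduced. A standard form, again an assumption based on the distributed IR evaluation literature rather than the slide itself, is the fraction of the top n databases in the estimated ranking E that have any merit for the query:

    \mathcal{P}_n = \frac{\left|\{\, db \in \mathrm{Top}_n(E) \;:\; \mathrm{merit}(q, db) > 0 \,\}\right|}{n}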
