
Presentation Transcript


1. Distributed IR for Digital Libraries
Ray R. Larson
School of Information Management & Systems
University of California, Berkeley
ray@sims.berkeley.edu
ECDL 2003, Trondheim

2. Overview
• The problem area
• Distributed searching tasks and issues
• Our approach to resource characterization and search
• Experimental evaluation of the approach
• Application and use of this method in working systems

3. The Problem
• Prof. Casarosa's definition of the Digital Library vision in yesterday afternoon's plenary session: access for everyone to "all human knowledge"
• Lyman and Varian's estimates of the "Dark Web"
• Hundreds or thousands of servers with databases ranging widely in content, topic, and format
• Broadcast search is expensive, both in bandwidth and in processing too many irrelevant results
• How to select the "best" resources to search?
  • Which resource to search first?
  • Which to search next if more is wanted?
• Topical/domain constraints on the search selections
• Variable contents of databases (metadata only, full text, multimedia, ...)

4. Distributed Search Tasks
• Resource Description
  • How to collect metadata about digital libraries and their collections or databases
• Resource Selection
  • How to select relevant digital library collections or databases from a large number of databases
• Distributed Search
  • How to perform parallel or sequential searching over the selected digital library databases
• Data Fusion
  • How to merge query results from different digital libraries with their different search engines, differing record structures, etc.

5. An Approach for Distributed Resource Discovery
• Distributed resource representation and discovery
  • A new approach to building resource descriptions based on Z39.50
  • Instead of broadcast search across resources, we use two Z39.50 services:
    • Identification of database metadata using Z39.50 Explain
    • Extraction of distributed indexes using Z39.50 SCAN
• Evaluation
  • How efficiently can we build distributed indexes?
  • How effectively can we choose databases using the index?
  • How effective is merging search results from multiple sources?
  • Can we build hierarchies of servers (general / meta-topical / individual)?

6. Z39.50 Overview
[Diagram: a client UI and a server search engine communicate over the Internet; on each side the query and the results are mapped between the local form and the Z39.50 form.]

7. Z39.50 Explain
• Explain supports searches for:
  • Server-level metadata
    • Server name
    • IP addresses
    • Ports
  • Database-level metadata
    • Database name
    • Search attributes (indexes and combinations)
    • Support metadata (record syntaxes, etc.)
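
A minimal sketch, in Python, of the kind of record such an Explain probe might be collected into. The field names and example values here are illustrative assumptions, not the actual Z39.50 Explain record syntax.

    # Hypothetical, simplified container for metadata gathered via Explain.
    server_metadata = {
        "server_name": "z3950.example.org",      # assumed example host
        "ip_address": "192.0.2.10",
        "port": 210,                             # the conventional Z39.50 port
        "databases": [
            {
                "name": "books",
                "search_attributes": ["title", "topic", "author"],  # available indexes
                "record_syntaxes": ["SUTRS", "XML", "MARC"],
            }
        ],
    }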

8. Z39.50 SCAN
• Originally intended to support browsing
• Query consists of:
  • Database
  • Attributes plus term (i.e., index and start point)
  • Step size
  • Number of terms to retrieve
  • Position in response set
• Results include:
  • Number of terms returned
  • List of terms and their frequency in the database (for the given attribute combination)

9. Z39.50 SCAN Results
Syntax: zscan indexname1 term stepsize number_of_terms pref_pos

% zscan title cat 1 20 1
{SCAN {Status 0} {Terms 20} {StepSize 1} {Position 1}} {cat 27} {cat-fight 1} {catalan 19} {catalogu 37} {catalonia 8} {catalyt 2} {catania 1} {cataract 1} {catch 173} {catch-all 3} {catch-up 2} ...

% zscan topic cat 1 20 1
{SCAN {Status 0} {Terms 20} {StepSize 1} {Position 1}} {cat 706} {cat-and-mouse 19} {cat-burglar 1} {cat-carrying 1} {cat-egory 1} {cat-fight 1} {cat-gut 1} {cat-litter 1} {cat-lovers 2} {cat-pee 1} {cat-run 1} {cat-scanners 1} ...
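
As an illustration (not part of the original slides), a minimal Python sketch for turning the braced {term frequency} pairs shown above into a term list. It assumes the simple, unnested output format above; note that the leading SCAN status group ({Status 0}, {Terms 20}, ...) would also match and should be stripped or filtered out by the caller.

    import re

    def parse_zscan(output):
        """Extract (term, frequency) pairs from zscan output like the samples above."""
        return [(term, int(freq))
                for term, freq in re.findall(r"\{([^{}\s]+)\s+(\d+)\}", output)]

    sample = "{cat 27} {cat-fight 1} {catalan 19} {catalogu 37}"
    print(parse_zscan(sample))
    # [('cat', 27), ('cat-fight', 1), ('catalan', 19), ('catalogu', 37)]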

10. Resource Index Creation
• For all servers, or a topical subset:
  • Get Explain information
  • For each index:
    • Use SCAN to extract terms and frequencies
    • Add term + frequency + source index + database metadata to the XML "Collection Document" for the resource
• Planned extensions:
  • Post-process indexes (especially geographic names, etc.) for special types of data
    • e.g., create "geographical coverage" indexes
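
A minimal sketch, in Python, of the harvesting loop described above. The `explain` structure and the `scan` callable are assumptions standing in for real Z39.50 Explain and SCAN calls, and the XML element names are illustrative rather than the actual Cheshire collection document schema.

    from xml.etree import ElementTree as ET

    def build_collection_document(host, port, explain, scan):
        """Build an XML "collection document" for one server.

        explain: metadata already fetched via Z39.50 Explain (see the sketch
                 after slide 7), including each database's indexes.
        scan:    callable (database, index) -> iterable of (term, freq) pairs,
                 e.g. a wrapper around Z39.50 SCAN.
        """
        root = ET.Element("collectionDocument", host=host, port=str(port))
        for db in explain["databases"]:
            db_el = ET.SubElement(root, "database", name=db["name"])
            for index in db["search_attributes"]:
                idx_el = ET.SubElement(db_el, "index", name=index)
                # Walk the index with SCAN, recording terms and frequencies.
                for term, freq in scan(db["name"], index):
                    ET.SubElement(idx_el, "term", freq=str(freq)).text = term
        return ET.tostring(root, encoding="unicode")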

11. MetaSearch Approach
[Architecture diagram: a MetaSearch server with its own search engine and the distributed index sits between the client and a set of database servers (DB 1 through DB 6), each with its own search engine; Explain and SCAN queries, user queries, and results are mapped back and forth across the Internet.]
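
A toy Python sketch of the flow the diagram implies: rank the candidate collections for a query, forward the query to the top few, and merge the returned results. The scoring and merging here are deliberately simplified placeholders, not the logistic regression ranking described later in the talk.

    def metasearch(query, collection_scores, search_backends, top_k=3):
        """Select collections, search them, and fuse the results.

        collection_scores: collection id -> estimated relevance of that
                           collection for this query (output of the
                           collection ranking step).
        search_backends:   collection id -> callable(query) returning
                           (doc_id, score) pairs from that collection.
        """
        # Resource selection: take the top-k collections by estimated relevance.
        selected = sorted(collection_scores, key=collection_scores.get,
                          reverse=True)[:top_k]
        # Distributed search and data fusion: query each selected collection
        # and merge the result lists by their (assumed comparable) scores.
        merged = []
        for cid in selected:
            for doc_id, score in search_backends[cid](query):
                merged.append((score, cid, doc_id))
        return sorted(merged, reverse=True)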

12. Known Issues and Problems
• Not all Z39.50 servers support SCAN or Explain
• Solutions that appear to work well:
  • Probing for attributes instead of Explain (e.g., DC attributes or analogs)
  • We also support OAI and can extract OAI metadata from servers that support it
  • Query-based sampling (Callan)
• Collection documents are static and need to be replaced when the associated collection changes

13. Evaluation
• Test environment:
  • TREC Tipster data (approx. 3 GB)
  • Partitioned into 236 smaller collections based on source and date by month (no DOE)
  • High size variability (from 1 to thousands of records)
  • Same database as used in other distributed search studies by J. French and J. Callan, among others
• Used TREC topics 51-150 for evaluation (the only topics with relevance judgements for all 3 TIPSTER disks)

14. Test Database Characteristics

15. Test Database Characteristics

16. Harvesting Efficiency
• Tested using the databases on the previous slide plus the full FT database (210,158 records, ~600 MB)
• Average of 23.07 seconds per database to SCAN each database (3.4 indexes on average) and create a collection representative, over the network
• Average of 14.07 seconds
• Also tested larger databases: e.g., the TREC FT database (~600 MB, 7 indexes) was harvested in 131 seconds

17. Our Collection Ranking Approach
• We attempt to estimate the probability of relevance for a given collection with respect to a query using the logistic regression method developed at Berkeley (W. Cooper, F. Gey, D. Dabney, A. Chen), with a new algorithm for weight calculation at retrieval time
• Estimates from multiple extracted indexes are combined to provide an overall ranking score for a given resource (i.e., fusion of multiple query results)

18. Probabilistic Retrieval: Logistic Regression
The probability of relevance for a given index is based on logistic regression, with the coefficient values determined from a sample set of documents (TREC). At retrieval time the probability estimate is obtained from the fitted regression equation (a sketch of its general form follows).
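
The regression equation shown on this slide is not reproduced in the transcript. As an assumption about its general form (following the Berkeley logistic regression approach), the retrieval-time estimate combines the statistics X_1..X_6 listed on the next slide linearly, with coefficients c_0..c_6 fitted on the TREC training sample, and passes the result through the logistic function:

    \log O(R \mid Q, C) = c_0 + \sum_{i=1}^{6} c_i X_i,
    \qquad
    P(R \mid Q, C) = \frac{e^{c_0 + \sum_{i=1}^{6} c_i X_i}}{1 + e^{c_0 + \sum_{i=1}^{6} c_i X_i}}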

19. Statistics Used for Regression Variables
• Average absolute query frequency
• Query length
• Average absolute collection frequency
• Collection size estimate
• Average inverse collection frequency
• Number of terms in common between query and collection representative
(Details in the proceedings.)

20. Other Approaches
• GlOSS: developed by the Digital Library project at Stanford University; uses a fairly conventional TF-IDF ranking
• CORI: developed by J. Callan and students at CIIR; uses a ranking that exploits some of the features of the INQUERY system in merging evidence

21. Evaluation
• Effectiveness
  • Tested using the collection representatives described above (as harvested over the network) and the TIPSTER relevance judgements
  • Testing by comparing our approach to known algorithms for ranking collections
  • Results were measured against reported results for the Ideal and CORI algorithms and against the optimal "Relevance Based Ranking" (MAX)
  • Recall analog (how many of the relevant documents occurred in the top n databases, averaged over queries)

22. Titles only (short query)

23. Long Queries

24. Very Long Queries

25. Current Usage
• Mersey Libraries
• Distributed Archives Hub
• Related approaches:
  • JISC Resource Discovery Network (OAI-PMH harvesting with Cheshire search)
• Planned use with TEL by the BL

26. Future
• Logically clustering servers by topic
• Meta-meta servers (treating the MetaSearch database as just another database)

27. Distributed Metadata Servers
[Diagram: a hierarchy of replicated servers, with database servers at the bottom, meta-topical servers above them, and general servers at the top.]

28. Conclusion
• A practical method for metadata harvesting and an effective algorithm for distributed resource discovery
• Further research:
  • Continuing development of the Cheshire III system
  • Applicability of language modelling methods to resource discovery
  • Developing and evaluating methods for merging cross-domain results, such as text and image or text and GIS datasets (or, perhaps, when to keep them separate)

29. Further Information
• Full Cheshire II client and server source is available at ftp://cheshire.berkeley.edu/pub/cheshire/ (includes HTML documentation)
• Project web site: http://cheshire.berkeley.edu/


31. Probabilistic Retrieval: Logistic Regression Attributes
• Average absolute query frequency
• Query length
• Average absolute collection frequency
• Collection size estimate
• Average inverse collection frequency
• Inverse document frequency
(N = number of collections; M = number of terms in common between the query and the document)

32. CORI ranking
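
The CORI ranking formula shown on this slide is not reproduced in the transcript. For reference, the usual statement of CORI's collection score for a query term t and collection C_i, as given in the CORI literature (Callan et al.) rather than taken from the slide itself, is:

    T = \frac{df}{df + 50 + 150 \cdot cw_i / \overline{cw}},
    \qquad
    I = \frac{\log\left((|DB| + 0.5) / cf\right)}{\log\left(|DB| + 1.0\right)},
    \qquad
    p(t \mid C_i) = b + (1 - b) \cdot T \cdot I

where df is the number of documents in C_i containing t, cf is the number of collections containing t, cw_i is the word count of C_i, \overline{cw} is the mean word count over collections, |DB| is the number of collections, and b is a default belief (commonly 0.4). Term scores are then combined using INQUERY-style query operators, which is the "merging evidence" feature noted on slide 20.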

33. Measures for Evaluation
• Assume each database has some merit for a given query q
• Given a baseline ranking B and an estimated (test) ranking E for q:
  • Let db_bi and db_ei denote the databases in the i-th ranked position of rankings B and E
  • Let Bi = merit(q, db_bi) and Ei = merit(q, db_ei)
• We can define some measures (see the following slides)

34. Measures for Evaluation – Recall Analogs
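
The formulas on this slide are not reproduced in the transcript. A recall analog commonly used in the distributed IR evaluation literature (e.g., by French and Powell), and consistent with the definitions on slide 33, compares the merit accumulated in the top n databases of the estimated ranking E against that of the baseline ranking B:

    \mathcal{R}_n = \frac{\sum_{i=1}^{n} E_i}{\sum_{i=1}^{n} B_i}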

35. Measures for Evaluation – Precision Analog
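
Likewise, the precision analog formula is not reproduced. A standard form, again an assumption based on the distributed IR evaluation literature rather than the slide itself, is the fraction of the top n databases in the estimated ranking E that have any merit for the query:

    \mathcal{P}_n = \frac{\left|\{\, db \in \mathrm{Top}_n(E) \;:\; \mathrm{merit}(q, db) > 0 \,\}\right|}{n}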
