Enhancing Information Retrieval: Exploring Term Proximity and Multi-Term Query Techniques

Survey Jaehui Park 2008. 07. 17.

Introduction • Members • Jung-Yeon Yang, Jaehui Park, Sungchan Park, Jongheum Yeon • We are interested in • Issues in Information Retrieval • About crawling, indexing, searching and ranking methods • How to process multi-term queries in information retrieval environments • Ex) • Today • US Today • Today Weather • Paris Today Weather -> Multi-term queries express more complex information needs than single-term queries.

Main Topic • Long Queries in Keyword Search • Keywords: • Compound query, Evidence Combination, Phrasal Query, Multi-term Query, Multiple Keyword Search, Multiword Unit, and so on. • Issues • proximity or distance • syntactic structure (order) • semantic • NLP remedies • …

Proximity • An intuitive concept for processing multiple term queries • Readings • Term Proximity Scoring for Keyword-Based Retrieval Systems • [ECIR 2003] Yves Rasolofo and Jacques Savoy • Efficiency vs. Effectiveness in Terabyte-Scale Information Retrieval • [TREC 2005] Stefan Buttcher and Charles L. A. Clarke • Efficient Text Proximity Search • [SPIRE 2007] Ralf Schenkel, et al. • Why Bigger Windows Are Better Than Smaller Ones • [TR-UM 1997] Ron Papka and James Allan • …

Term Proximity Scoring for Keyword-Based Retrieval Systems • Yves Rasolofo and Jacques Savoy • European Colloquium on IR Research (ECIR) 2003, LNCS 2633 • 2008. 07. 17. Presented by Jaehui Park

Introduction • Phrase, term proximity or term distance in IR • Focus on adding a word pair scoring module • Okapi probabilistic model + proximity measurement • Previous work • Salton & McGill [1983] • Generating statistical phrases based on word co-occurrence • Fagan [1987] • Considering syntactic relations or syntactic structures • Mitra et al. [1997] • “Once a good basic ranking scheme is used, the use of phrases do not have a major effect on precision at high ranks” • Arampatzis et al. [2000] • The lack of success when using NLP techniques in IR • Hawking & Thistlewaite [1996] • The use of proximity scoring within the PADRE system (Z-mode method)

Okapi • Okapi [Robertson & Sparck Jones 1976] • Ranking function that scores documents by their relevance to a given search query, based on the probabilistic retrieval model • Considering • Term frequency • Document length • The weight for a given term ti in document d

Okapi • Okapi [Robertson & Sparck Jones 1976] (continued) • The weight for the term ti within a query • The retrieval status value (for a document according to a query)
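The formulas on the two Okapi slides did not survive in this transcript. As a rough illustration only, the three quantities named above (document term weight, query term weight, retrieval status value) can be sketched with one commonly cited form of the Okapi weighting. The function names, the exact formula variant and the default values of k1, b and k3 are assumptions here, not taken from the slides.

```python
import math

def okapi_term_weight(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Weight of a term in a document (assumed Okapi form).

    tf is the term frequency in the document; K normalizes by document length.
    """
    K = k1 * ((1 - b) + b * doc_len / avg_doc_len)
    return (k1 + 1) * tf / (K + tf)

def okapi_query_weight(qtf, df, n_docs, k3=1000.0):
    """Weight of a term in the query (assumed form).

    qtf is the term frequency in the query, df the document frequency.
    Note: the log factor turns negative when df exceeds half the collection.
    """
    return qtf / (k3 + qtf) * math.log((n_docs - df) / df)

def okapi_rsv(query_tf, doc_tf, doc_len, avg_doc_len, df, n_docs):
    """Retrieval status value: sum of document weight times query weight."""
    score = 0.0
    for term, qtf in query_tf.items():
        tf = doc_tf.get(term, 0)
        if tf == 0 or term not in df:
            continue  # term absent from the document or the index
        score += (okapi_term_weight(tf, doc_len, avg_doc_len)
                  * okapi_query_weight(qtf, df[term], n_docs))
    return score
```

With a document of exactly average length, K equals k1, so a single occurrence of a term gets a document weight of 1.0 under these assumed defaults.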

Term Proximity Weighting • Improving retrieval performance by using term proximity scoring • Assumption • If a document contains sentences having at least two query terms within them, the probability that this document will be relevant must be greater. • The closer the query terms are, the higher the relevance probability. • Objective • Assigning more importance to those keywords having a short distance between their occurrences.

Term Proximity Weighting • 1. Expand the request (query) using keyword pairs extracted from the query’s wording • 2. Compute a term pair instance weight • “information retrieval”: 1.0 • “the retrieval of medical information”: 0.11 (1/9)
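The two example values above are consistent with an instance weight of 1/d², where d is the distance in words between the two terms (d = 1 gives 1.0, d = 3 gives 1/9). A minimal sketch of steps 1 and 2 under that reading follows; the function names and the whitespace tokenization are illustrative assumptions, not part of the slides.

```python
from itertools import combinations

def query_term_pairs(query_terms):
    """Step 1: expand the query into unordered pairs of distinct terms."""
    unique_terms = list(dict.fromkeys(query_terms))  # drop repeats, keep order
    return list(combinations(unique_terms, 2))

def term_pair_instance_weight(pos_i, pos_j):
    """Step 2: weight of one co-occurrence, assumed to be 1 / distance^2."""
    distance = abs(pos_i - pos_j)
    if distance == 0:
        raise ValueError("the two positions must differ")
    return 1.0 / distance ** 2

def instance_weight_in_text(text, term_i, term_j):
    """Weight for the first occurrence of each term in a whitespace-split text."""
    tokens = text.lower().split()
    return term_pair_instance_weight(tokens.index(term_i), tokens.index(term_j))

print(instance_weight_in_text("information retrieval", "information", "retrieval"))
print(round(instance_weight_in_text(
    "the retrieval of medical information", "information", "retrieval"), 2))
```

For the query "Paris Today Weather", `query_term_pairs` yields the three pairs (paris, today), (paris, weather) and (today, weather).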

Term Proximity Weighting • 3. Sum the instance weights of all the corresponding term pairs • 4. Compute the contribution of all occurring term pairs in the document • 5. Compute the final retrieval status value
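The formulas for steps 3 to 5 are missing from this transcript, so the following is only a sketch of how they might fit together. It assumes an instance weight of 1/d², a maximum pair distance (`max_distance`), an Okapi-style saturation of the summed weights, the smaller of the two query term weights as the pair's query weight, and a final score that adds the proximity part to the Okapi score. All names, and each of these choices, are assumptions made for illustration.

```python
from itertools import combinations

def pair_instance_sum(positions_i, positions_j, max_distance=5):
    """Step 3: sum 1/d^2 over all co-occurrences within max_distance words."""
    total = 0.0
    for p in positions_i:
        for q in positions_j:
            d = abs(p - q)
            if 0 < d <= max_distance:
                total += 1.0 / (d * d)
    return total

def proximity_rsv(doc_tokens, query_weights, K, k1=1.2, max_distance=5):
    """Step 4: contribution of all query term pairs occurring in the document.

    query_weights maps each query term to its query weight; K is the
    document length normalization factor (assumed positive).
    """
    positions = {}
    for index, token in enumerate(doc_tokens):
        if token in query_weights:
            positions.setdefault(token, []).append(index)
    score = 0.0
    for ti, tj in combinations(sorted(positions), 2):
        tpi_sum = pair_instance_sum(positions[ti], positions[tj], max_distance)
        pair_weight = (k1 + 1) * tpi_sum / (K + tpi_sum)  # saturating, like tf
        score += pair_weight * min(query_weights[ti], query_weights[tj])
    return score

def final_rsv(okapi_score, doc_tokens, query_weights, K, k1=1.2, max_distance=5):
    """Step 5: base keyword score plus the proximity contribution."""
    return okapi_score + proximity_rsv(doc_tokens, query_weights, K, k1, max_distance)
```

Under these assumptions a document in which the query terms never occur within `max_distance` words of each other keeps its plain Okapi score, which matches the stated objective of rewarding only close occurrences.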

Experiments • Test Collections • TREC-8 documents (528,155 docs) • Financial Times, Federal Register, Foreign Broadcast Information Service, LA Times • TREC-9, TREC-10 (1,692,096 docs)

Experiments Evaluation

Conclusion • The impact of a new term proximity algorithm on retrieval effectiveness for keyword-based systems was examined. • Improves ranking for documents having query term pairs occurring within a given distance constraint. • The term proximity scoring approach • Improves precision after retrieving a few documents
