Enhancing Information Retrieval: Exploring Term Proximity and Multi-Term Query Techniques
This presentation by Jaehui Park and colleagues surveys challenges and methods in information retrieval (IR), focusing on multi-term queries. It covers term proximity, compound queries, and the role of syntactic and semantic structure in retrieval effectiveness. The featured paper adds term pair scoring and a proximity measure to a probabilistic ranking model in order to improve the relevance of search results. Its main finding is that ranking documents higher when query terms occur close together improves precision among the first few retrieved documents.
Presentation Transcript
Survey Jaehui Park 2008. 07. 17.
Introduction • Members • Jung-Yeon Yang, Jaehui Park, Sungchan Park, Jongheum Yeon • We are interested in • Issues in Information Retrieval • About crawling, indexing, searching and ranking methods • How to process multi-term queries in information retrieval environments • Ex) • Today • US Today • Today Weather • Paris Today Weather -> Multi-term queries express more complex information needs than single-term queries.
Main Topic • Long Queries in Keyword Search • Keywords: • Compound query, Evidence Combination, Phrasal Query, Multi-term Query, Multiple Keyword Search, Multiword Unit, and so on. • Issues • proximity or distance • syntactic structure (order) • semantic • NLP remedies • …
Proximity • An intuitive concept for processing multiple term queries • Readings • Term Proximity Scoring for Keyword-Based Retrieval Systems • [ECIR 2003] Yves Rasolofo and Jacques Savoy • Efficiency vs. Effectiveness in Terabyte-Scale Information Retrieval • [TREC 2005] Stefan Buttcher and Charles L. A. Clarke • Efficient Text Proximity Search • [SPIRE 2007] Ralf Schenkel, et al. • Why Bigger Windows Are Better Than Smaller Ones • [TR-UM 1997] Ron Papka and James Allan • …
Term Proximity Scoring for Keyword-Based Retrieval Systems Yves Rasolofo and Jacques Savoy European Colloquium on IR Research(ECIR) 2003, LNCS 2633 2008. 07. 17. Presented by Jaehui Park
Introduction • Phrase, term proximity or term distance in IR • Focus on adding a word pair scoring module • Okapi probabilistic model + proximity measurement • Previous work • Salton & McGill [1983] • Generating statistical phrases based on word co-occurrence • Fagan [1987] • Considering syntactic relations or syntactic structures • Mitra et al. [1997] • “Once a good basic ranking scheme is used, the use of phrases do not have a major effect on precision at high ranks” • Arampatzis et al. [2000] • The lack of success when using NLP techniques in IR • Hawking & Thistlewaite [1996] • The use of proximity scoring within the PADRE system (Z-mode method)
Okapi • Okapi [Robertson & Sparck Jones 1976] • A function that ranks documents according to their relevance to a given search query, based on the probabilistic retrieval model • Considering • Term frequency • Document length • The weight for a given term ti in document d
Okapi • Okapi [Robertson & Sparck Jones 1976] (continued) • The weight for the term ti within a query • The retrieval status value (for a document according to a query)
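The formulas from the two Okapi slides are not preserved in this transcript. As a rough illustration of the three quantities they name (document term weight, query term weight, retrieval status value), here is a minimal sketch using one commonly used BM25 form. The function names, the constants k1, b and k3, and the smoothed idf are assumptions for illustration and may differ from what the slides showed.

```python
import math
from collections import Counter


def term_weight(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Weight of a term in a document: grows with term frequency,
    normalized by document length (k1 and b are assumed defaults)."""
    K = k1 * ((1.0 - b) + b * doc_len / avg_doc_len)
    return (k1 + 1.0) * tf / (K + tf)


def query_weight(qtf, df, n_docs, k3=1000.0):
    """Weight of a term within the query: query term frequency factor
    times a smoothed idf (assumed form)."""
    idf = math.log((n_docs - df + 0.5) / (df + 0.5))
    return (k3 + 1.0) * qtf / (k3 + qtf) * idf


def rsv(query_terms, doc_terms, doc_freq, n_docs, avg_doc_len):
    """Retrieval status value: sum over query terms found in the document
    of (document term weight * query term weight)."""
    tf = Counter(doc_terms)
    qtf = Counter(query_terms)
    score = 0.0
    for term, q_count in qtf.items():
        if tf[term] == 0 or term not in doc_freq:
            continue  # term absent from the document or the collection
        w_doc = term_weight(tf[term], len(doc_terms), avg_doc_len)
        w_query = query_weight(q_count, doc_freq[term], n_docs)
        score += w_doc * w_query
    return score


# Hypothetical toy collection statistics, for illustration only.
doc_freq = {"information": 10, "retrieval": 5}
doc = "information retrieval systems".split()
score = rsv(["information", "retrieval"], doc, doc_freq, n_docs=1000, avg_doc_len=3.0)
```

Note that this score depends only on how often each query term occurs, not on where it occurs, which is the gap the proximity module is meant to fill.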
Term Proximity Weighting • Improving retrieval performance by using term proximity scoring • Assumption • If a document contains sentences having at least two query terms within them, the probability that this document will be relevant must be greater. • The closer the query terms are, the higher the relevance probability. • Objective • Assigning more importance to those keywords having a short distance between their occurrences.
Term Proximity Weighting • 1. expand the request (query) using keyword pairs extracted from the query’s wording • 2. compute a term pair instance weight from the distance between the two terms • “information retrieval” (adjacent terms): 1.0 • “the retrieval of medical information” (terms three positions apart): 0.11 (1/9)
Term Proximity Weighting 3. sum all the corresponding term pairs 4. compute the contribution of all occurring term pairs in the document 5. compute the final retrieval status value
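The five steps above can be sketched as follows. The slide formulas are not preserved in this transcript, so several details here are assumptions: the instance weight is taken as the inverse square of the word distance (which reproduces the 1.0 and 1/9 examples), the distance limit max_dist=5, the saturation constants k1 and K, and the use of the smaller of the two query term weights are illustrative choices, and all function names are hypothetical.

```python
from itertools import combinations


def positions(doc_terms):
    """Map each term to the list of word positions where it occurs."""
    pos = {}
    for i, term in enumerate(doc_terms):
        pos.setdefault(term, []).append(i)
    return pos


def pair_instance_weights(pos_a, pos_b, max_dist=5):
    """Step 2: one weight per co-occurrence within max_dist words,
    assumed to be 1 / distance**2."""
    return [
        1.0 / (abs(a - b) ** 2)
        for a in pos_a
        for b in pos_b
        if 0 < abs(a - b) <= max_dist
    ]


def proximity_score(query_terms, doc_terms, query_weights, k1=1.2, K=1.2, max_dist=5):
    """Steps 1, 3 and 4: build query term pairs, sum instance weights per
    pair, and turn each sum into a saturating contribution."""
    pos = positions(doc_terms)
    score = 0.0
    for ti, tj in combinations(sorted(set(query_terms)), 2):  # step 1
        instances = pair_instance_weights(pos.get(ti, []), pos.get(tj, []), max_dist)
        tpi_sum = sum(instances)  # step 3
        if tpi_sum == 0.0:
            continue  # pair never occurs within the distance limit
        pair_weight = (k1 + 1.0) * tpi_sum / (K + tpi_sum)  # step 4
        score += pair_weight * min(query_weights[ti], query_weights[tj])
    return score


def final_rsv(okapi_rsv, proximity_rsv):
    """Step 5: add the proximity contribution to the base Okapi score."""
    return okapi_rsv + proximity_rsv


query = ["information", "retrieval"]
weights = {"information": 1.0, "retrieval": 1.0}  # hypothetical query term weights
near = proximity_score(query, "information retrieval".split(), weights)
far = proximity_score(query, "the retrieval of medical information".split(), weights)
```

With these assumptions, the adjacent phrase contributes more than the looser one (near > far), and a document whose query terms never fall within the distance limit receives no proximity bonus, so its final score is just the Okapi score.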
Experiments • Test Collections • TREC-8 documents (528,155 docs) • Financial Times, Federal Register, Foreign Broadcast Information Service, LA Times • TREC-9, TREC-10 (1,692,096 docs)
Experiments Evaluation
Conclusion • The impact of a new term proximity algorithm on retrieval effectiveness for keyword-based systems was examined. • It improves the ranking of documents in which query term pairs occur within a given distance constraint. • The term proximity scoring approach • Improves precision after retrieving a few documents
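"Precision after retrieving a few documents" is commonly measured as precision at k: the fraction of the top k ranked documents that are relevant. The sketch below shows that measure on made-up data. The document IDs, relevance judgments and both rankings are hypothetical and are not results from the experiments.

```python
def precision_at_k(ranked_doc_ids, relevant_ids, k):
    """Fraction of the top-k ranked documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_doc_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k


relevant = {"d1", "d3"}                      # hypothetical relevance judgments
baseline = ["d2", "d1", "d4", "d3", "d5"]    # hypothetical ranking without proximity
reranked = ["d1", "d3", "d2", "d4", "d5"]    # hypothetical ranking with proximity

p2_before = precision_at_k(baseline, relevant, 2)
p2_after = precision_at_k(reranked, relevant, 2)
```

In this toy case both rankings contain the same documents, so precision at 5 is unchanged. Only precision at small k improves, which is the kind of effect the conclusion describes.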