This presentation explores a query-dependent ranking approach for information retrieval based on K-Nearest Neighbor (KNN) methods. Recognizing the limitations of a single ranking model, the proposed method uses KNN to find similar training queries for each test query, acting as a "soft" classification that avoids rigid query categories. A visualization of the query feature space, reduced to two dimensions with Principal Component Analysis, motivates the approach, and experiments with Rank-SVM on two datasets show promising ranking results. The presentation also discusses the complexity of the offline variants and directions for future research.
Query Dependent Ranking Using K-Nearest Neighbor SIGIR 2008 Presenter: Hsuan-Yu Lin
Outline • Introduction • Motivation • Method • Experiments • Conclusion
Introduction • Machine learning techniques have been proposed for ranking in information retrieval • In web search, queries vary widely in semantics and in users' intentions • Goal • Propose a query-dependent ranking approach • Use K-Nearest Neighbor (KNN) to find similar training queries for model training
Motivation • A single ranking model alone cannot handle all kinds of queries properly • A hard query classification approach is not used either • It is hard to draw clear boundaries between queries in different categories
Motivation • Reduce the query feature space from 27 to 2 dimensions using Principal Component Analysis in order to visualize the queries (a sketch of this step follows)
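As a rough sketch of this dimensionality-reduction step (the feature matrix below is random toy data; only the 27-to-2 PCA projection comes from the slide):

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy stand-in for the training-query feature matrix:
# one 27-dimensional feature vector per query (random values here).
rng = np.random.default_rng(0)
query_features = rng.normal(size=(1500, 27))  # 1,500 training queries

# Project the 27-dimensional query features onto the first two
# principal components, as done for visualization in the slides.
pca = PCA(n_components=2)
query_features_2d = pca.fit_transform(query_features)

print(query_features_2d.shape)  # (1500, 2)
```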
Motivation • Why use the KNN approach? • With high probability, a query belongs to the same category as its neighbors • KNN can be viewed as an algorithm performing a "soft" classification in the query feature space
KNN Online • Define the query feature: • For each query q, use a reference model (BM25) to find its top T ranked documents, and take the mean of the feature values of those T documents as the feature of the query (a sketch follows)
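A minimal sketch of this query-feature construction, assuming per-document feature vectors and BM25 scores are already available (the function and argument names are placeholders, not from the paper):

```python
import numpy as np

def query_feature(doc_features: np.ndarray, bm25_scores: np.ndarray, T: int = 50) -> np.ndarray:
    """Mean feature vector of the top-T documents ranked by a reference BM25 model.

    doc_features: (n_docs, n_features) feature matrix for the query's documents.
    bm25_scores:  (n_docs,) scores assigned by the reference BM25 model.
    """
    top_t = np.argsort(-bm25_scores)[:T]     # indices of the T highest-scoring docs
    return doc_features[top_t].mean(axis=0)  # mean over the top-T feature vectors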
KNN Online • Steps (b) and (c) of the algorithm are too costly to run online for every test query
Time Complexity • n: number of documents to be ranked for the test query • k: number of nearest neighbors • m: number of queries in the training data • Most of the online time is spent on training the local ranking model (a neighbor-search sketch follows)
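To make the per-query online cost concrete, a brute-force search over the m training-query features looks like this (a hypothetical helper; Euclidean distance is an assumption):

```python
import numpy as np

def k_nearest_training_queries(test_q: np.ndarray,
                               train_qs: np.ndarray,
                               k: int) -> np.ndarray:
    """Return indices of the k training queries closest to test_q.

    Brute force: O(m) distance computations per test query, where
    m = train_qs.shape[0] -- and this is before the cost of training
    a local ranking model on the selected neighbors.
    """
    dists = np.linalg.norm(train_qs - test_q, axis=1)  # Euclidean distances
    return np.argsort(dists)[:k]
```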
Experiments • Datasets • Dataset 1: 1,500 training queries, 400 test queries • Dataset 2: 3,000 training queries, 800 test queries • Labels: 5 grades (perfect, excellent, good, fair, bad) • Features: 200 • Learning approach • Rank-SVM • Parameters • λ: 0.01 • T (top-T documents per query): 50 • K: 400 (dataset 1), 800 (dataset 2) • Evaluation measure • NDCG (a reference sketch follows)
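For reference, a sketch of NDCG in its standard gain/discount form (the paper's exact NDCG variant may differ in details such as the discount function):

```python
import numpy as np

def ndcg_at_n(relevance: list[int], n: int) -> float:
    """NDCG@n with gain 2^rel - 1 and discount log2(rank + 1), ranks 1-indexed."""
    rel = np.asarray(relevance[:n], dtype=float)
    discounts = np.log2(np.arange(2, rel.size + 2))  # log2(2), log2(3), ...
    dcg = np.sum((2 ** rel - 1) / discounts)
    # Ideal DCG: the same gains under the best possible ordering.
    ideal = np.sort(np.asarray(relevance, dtype=float))[::-1][:n]
    idcg = np.sum((2 ** ideal - 1) / np.log2(np.arange(2, ideal.size + 2)))
    return dcg / idcg if idcg > 0 else 0.0
```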
Experiments • Baselines • Single: single-model approach • QC: query-classification-based approach • Classifies queries into three categories (topic distillation, named page finding, homepage finding)
Experiments • Results on Dataset 1 (NDCG comparison figure omitted)
Experiments • Y-axis: change ratio between the online and offline methods • A small change ratio means the two neighbor sets have large overlap (see the sketch below)
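Assuming "change ratio" measures the fraction of the k neighbors that differ between the two sets (an interpretation of the slide, not a definition taken from the paper), a sketch:

```python
def change_ratio(online_nn: set[int], offline_nn: set[int], k: int) -> float:
    """Fraction of the k neighbors that differ between the online and
    offline neighbor sets; 0.0 means the two sets overlap completely."""
    return 1.0 - len(online_nn & offline_nn) / k
```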
Conclusion • Use different ranking models for queries with different properties • Propose a K-Nearest Neighbor approach for selecting training data • Future work • The complexity of offline processing is still high • Use KD-trees or other advanced structures for nearest neighbor search (a sketch follows) • Improve the query feature definition
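As one possible realization of the KD-tree idea (an illustrative assumption using SciPy, not the authors' implementation):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
train_qs = rng.normal(size=(3000, 2))   # training-query features (e.g., after PCA)

tree = cKDTree(train_qs)                # one-time O(m log m) build
test_q = rng.normal(size=2)
dists, idx = tree.query(test_q, k=400)  # sublinear lookup vs. brute-force O(m)
```

KD-trees pay off when the query features are low-dimensional; in high dimensions the search degrades toward brute force, which is presumably why the feature definition itself is also listed as future work.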