
A Formal Study of Information Retrieval Heuristics



Presentation Transcript


  1. A Formal Study of Information Retrieval Heuristics Hui Fang, Tao Tao and ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign USA

  2. Empirical Observations in IR
  • Retrieval heuristics are necessary for good retrieval performance.
  • E.g., TF-IDF weighting, document length normalization.
  • Formulas that look similar may perform very differently.
  • Performance is sensitive to parameter settings.

  3. Empirical Observations in IR (Cont.)
  Three representative retrieval formulas:
  • Pivoted Normalization Method
  • Dirichlet Prior Method
  • Okapi Method
  Each combines the same building blocks: a term frequency component with an alternative TF transformation (e.g., 1+ln(c(w,d))), document length normalization, and inverse document frequency weighting; and each exhibits parameter sensitivity.
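To make the three formulas concrete, here is a minimal Python sketch of the per-term scoring components as these methods are commonly stated; the parameter defaults (s, mu, k1, b, k3) and function names are illustrative, not from the slides.

```python
import math

def pivoted(c_wd, c_wq, dl, avdl, N, df, s=0.2):
    """Pivoted normalization, per matched term (assumes c_wd >= 1)."""
    tf = 1 + math.log(1 + math.log(c_wd))      # sub-linear TF transformation
    norm = 1 - s + s * dl / avdl               # pivoted length normalization
    idf = math.log((N + 1) / df)               # inverse document frequency
    return tf / norm * c_wq * idf

def dirichlet_term(c_wd, c_wq, p_wC, mu=2000):
    """Dirichlet prior, per matched term; p_wC is the collection language model."""
    return c_wq * math.log(1 + c_wd / (mu * p_wC))

def dirichlet_doc(term_scores, q_len, dl, mu=2000):
    """Add the document-level length penalty |q| * ln(mu / (mu + |d|))."""
    return sum(term_scores) + q_len * math.log(mu / (mu + dl))

def okapi(c_wd, c_wq, dl, avdl, N, df, k1=1.2, b=0.75, k3=1000):
    """Okapi BM25, per matched term; note the IDF factor can go negative."""
    idf = math.log((N - df + 0.5) / (df + 0.5))
    tf = (k1 + 1) * c_wd / (k1 * ((1 - b) + b * dl / avdl) + c_wd)
    qtf = (k3 + 1) * c_wq / (k3 + c_wq)
    return idf * tf * qtf
```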

  4. Research Questions • How can we formally characterize these necessary retrieval heuristics? • Can we predict the empirical behavior of a method without experimentation?

  5. Outline • Formalized heuristic retrieval constraints • Analytical evaluation of the current retrieval formulas • Benefits of constraint analysis • Better understanding of parameter optimization • Explanation of performance difference • Improvement of existing retrieval formulas

  6. Term Frequency Constraints (TFC1)
  TF weighting heuristic I: Give a higher score to a document with more occurrences of a query term.
  • TFC1: Let q be a query with only one term w. If |d1| = |d2| and c(w,d1) > c(w,d2), then f(d1,q) > f(d2,q).
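A quick numeric probe of TFC1, using a single-term pivoted-normalization scorer with made-up collection statistics (avdl=100, N=1000, df=50); this toy helper is reused by the probes for the later constraints.

```python
import math

def term_score(c_wd, dl, avdl=100.0, N=1000, df=50, s=0.2):
    """Toy single-term pivoted normalization score (illustrative statistics)."""
    if c_wd == 0:
        return 0.0
    tf = 1 + math.log(1 + math.log(c_wd))
    return tf / (1 - s + s * dl / avdl) * math.log((N + 1) / df)

# TFC1: equal document lengths, more occurrences of w => strictly higher score.
assert term_score(c_wd=3, dl=100) > term_score(c_wd=2, dl=100)
```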

  7. Term Frequency Constraints (TFC2)
  TF weighting heuristic II: Favor a document with more distinct query terms.
  • TFC2: Let q be a query and w1, w2 be two query terms. Assume |d1| = |d2| and idf(w1) = idf(w2). If c(w1,d1) > 0, c(w2,d1) > 0, c(w2,d2) = 0 and c(w1,d2) = c(w1,d1) + c(w2,d1), then f(d1,q) > f(d2,q).
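The same toy scorer (term_score, defined under the TFC1 probe) illustrates TFC2 when both query terms share the same df:

```python
# TFC2: d1 covers both query terms; d2 piles the same total mass on w1 alone.
dl = 100
f_d1 = term_score(2, dl) + term_score(2, dl)   # c(w1,d1)=2, c(w2,d1)=2
f_d2 = term_score(4, dl) + term_score(0, dl)   # c(w1,d2)=4, c(w2,d2)=0
assert f_d1 > f_d2   # the document with more distinct query terms wins
```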

  8. Term Discrimination Constraint (TDC)
  IDF weighting heuristic: Penalize words that are popular in the collection; give higher weights to discriminative terms.
  Query: SVM Tutorial. Assume IDF(SVM) > IDF(Tutorial).
  Compare two documents, Doc 1 and Doc 2, that contain the same total number of query-term occurrences but distribute them differently between SVM and Tutorial: the document that allocates more occurrences to the discriminative term SVM should score higher.

  9. Term Discrimination Constraint (Cont.)
  • TDC: Let q be a query and w1, w2 be two query terms. Assume |d1| = |d2|, c(w1,d1) + c(w2,d1) = c(w1,d2) + c(w2,d2), and c(w,d1) = c(w,d2) for all other words w. If idf(w1) ≥ idf(w2) and c(w1,d1) ≥ c(w1,d2), then f(d1,q) ≥ f(d2,q).
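A numeric probe of TDC with the toy scorer from the TFC1 sketch; the df values are invented so that w1 (df=10) is more discriminative than w2 (df=200):

```python
# TDC: equal total query-term counts, equal lengths; d1 gives more
# occurrences to the higher-IDF term w1.
dl = 100
f_d1 = term_score(2, dl, df=10) + term_score(1, dl, df=200)
f_d2 = term_score(1, dl, df=10) + term_score(2, dl, df=200)
assert f_d1 >= f_d2   # favoring the discriminative term cannot hurt
```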

  10. Length Normalization Constraints (LNCs)
  Document length normalization heuristic: Penalize long documents (LNC1); avoid over-penalizing long documents (LNC2).
  • LNC1: Let q be a query. If, for some word w' not in q, c(w',d2) = c(w',d1) + 1, but c(w,d2) = c(w,d1) for all other words w, then f(d1,q) ≥ f(d2,q).
  • LNC2: Let q be a query and k > 1. If |d1| = k·|d2| and c(w,d1) = k·c(w,d2) for all words w, then f(d1,q) ≥ f(d2,q).
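Both length constraints can be probed with the same toy scorer (at its default s = 0.2):

```python
# LNC1: appending one non-query word (length grows, query counts unchanged)
# must not increase the score.
assert term_score(2, dl=100) >= term_score(2, dl=101)

# LNC2: d1 is d2 concatenated with itself (k=2, all counts doubled);
# the duplicated document must not score lower.
assert term_score(4, dl=200) >= term_score(2, dl=100)
```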

  11. TF-LENGTH Constraint (TF-LNC)
  TF-LN heuristic: Regularize the interaction of TF and document length.
  • TF-LNC: Let q be a query with only one term w. If c(w,d1) > c(w,d2) and |d1| = |d2| + c(w,d1) - c(w,d2), then f(d1,q) > f(d2,q).
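TF-LNC with the same toy scorer: d1 is d2 plus one extra occurrence of w, so its length grows by exactly that occurrence:

```python
# TF-LNC: c(w,d1)=3 > c(w,d2)=2 and |d1| = |d2| + 1.
assert term_score(3, dl=101) > term_score(2, dl=100)
```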

  12. Analytical Evaluation
  [Table: for each retrieval formula (pivoted normalization, Dirichlet prior, Okapi), whether each constraint is satisfied unconditionally, satisfied only conditionally (for certain parameter values), or violated.]

  13. Term Discrimination Constraint (TDC)
  IDF weighting heuristic: Penalize words that are popular in the collection; give higher weights to discriminative terms.
  Query: SVM Tutorial. Assume IDF(SVM) > IDF(Tutorial).
  Doc 1: … SVM SVM SVM Tutorial Tutorial …
  Doc 2: … Tutorial SVM SVM Tutorial Tutorial …
  Both documents contain five query-term occurrences; Doc 1 gives more of them to the discriminative term SVM, so TDC requires f(Doc 1, q) ≥ f(Doc 2, q).

  14. Benefits of Constraint Analysis
  • Provide an approximate bound for the parameters
  • A constraint may be satisfied only if the parameter is within a particular interval.
  • Compare different formulas analytically without experimentation
  • When a formula does not satisfy a constraint, it often indicates non-optimality of the formula.
  • Suggest how to improve current retrieval models
  • Violation of constraints may pinpoint where a formula needs to be improved.

  15. Benefits 1: Bounding Parameters
  • Pivoted Normalization Method: LNC2 implies s < 0.4.
  [Figure: parameter sensitivity of s; average precision plotted against s, with the optimal s for average precision lying below the 0.4 bound.]
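Sweeping s with the toy scorer from the TFC1 probe shows how LNC2 caps the parameter; the exact crossover below depends on the invented collection statistics, while the paper's s < 0.4 bound comes from real collections:

```python
# Where does pivoted normalization stop satisfying LNC2 (k=2)?
for s in (0.1, 0.2, 0.3, 0.4, 0.6, 0.8):
    ok = term_score(4, dl=200, s=s) >= term_score(2, dl=100, s=s)
    print(f"s={s:.1f}: LNC2 {'holds' if ok else 'violated'}")
```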

  16. Benefits 2: Analytical Comparison
  • Okapi Method: its IDF factor is negative when df(w) is large, which violates many constraints.
  [Figure: average precision vs. s or b for the Okapi and pivoted methods, on keyword queries and on verbose queries, where the violation hurts Okapi.]
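The sign problem is easy to verify numerically; N and the df values below are arbitrary:

```python
import math

# Okapi's IDF factor ln((N - df + 0.5) / (df + 0.5)) turns negative
# once a term appears in more than half of the N documents.
N = 1000
for df in (10, 300, 501, 900):
    idf = math.log((N - df + 0.5) / (df + 0.5))
    print(f"df={df:4d}: idf={idf:+.3f}")
```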

  17. Benefits 3: Improving Retrieval Formulas
  • Modified Okapi Method: make Okapi satisfy more constraints; expected to help verbose queries.
  [Figure: average precision vs. s or b for the pivoted, Okapi, and modified Okapi methods on keyword and verbose queries.]
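A sketch of the modification, assuming (as the paper proposes) that Okapi's IDF factor is replaced by the always-positive pivoted-style IDF ln((N+1)/df):

```python
import math

def modified_okapi(c_wd, c_wq, dl, avdl, N, df, k1=1.2, b=0.75, k3=1000):
    """Okapi with its original IDF swapped for ln((N+1)/df), which never
    goes negative and lets the formula satisfy more constraints."""
    idf = math.log((N + 1) / df)
    tf = (k1 + 1) * c_wd / (k1 * ((1 - b) + b * dl / avdl) + c_wd)
    qtf = (k3 + 1) * c_wq / (k3 + c_wq)
    return idf * tf * qtf
```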

  18. Conclusions and Future Work • Conclusions • Retrieval heuristics can be captured through formally defined constraints. • It is possible to evaluate a retrieval formula analytically through constraint analysis. • Future Work • Explore additional necessary heuristics • Apply these constraints to many other retrieval methods • Develop new retrieval formulas through constraint analysis

  19. The End Thank you!
