A Formal Study of Information Retrieval Heuristics

Hui Fang, Tao Tao, ChengXiang Zhai • University of Illinois at Urbana-Champaign • SIGIR 2004 • Presented by CHU Huei-Ming, 2004/01/17

Outline • Formal Definitions of Heuristic Retrieval Constraints • Analysis of Three Representative Retrieval Formulas • Pivoted Normalization Method • Okapi Method • Dirichlet Prior Method • Experiments • Conclusion and Future Work

Formal Definitions of Heuristic Retrieval Constraints • Six intuitive and desirable constraints that any reasonable retrieval formula should satisfy • Term Frequency Constraints (TFC1, TFC2) • Term Discrimination Constraint (TDC) • Length Normalization Constraints (LNC1, LNC2) • TF-Length Constraint (TF-LNC)

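Formal Definitions of Heuristic Retrieval Constraints • Term Frequency Constraints (TFCs) • Notation: c(w,d) is the count of term w in document d, |d| is the length of d, and f(d,q) is the retrieval score of d for query q • TFC1: Let q = {w} and assume |d1| = |d2|. If c(w,d1) > c(w,d2), then f(d1,q) > f(d2,q) • TFC2: Let q = {w} and assume |d1| = |d2| = |d3| and c(w,d1) > 0. If c(w,d2) - c(w,d1) = 1 and c(w,d3) - c(w,d2) = 1, then f(d2,q) - f(d1,q) > f(d3,q) - f(d2,q)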
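A minimal sketch of how TFC1 and TFC2 can be checked numerically for a term-frequency component. The double-log transform below is the TF part of the pivoted normalization formula; the helper names and the tested count range are illustrative assumptions, and a finite grid check is evidence, not a proof.

```python
import math

def tf_double_log(count):
    """TF component 1 + ln(1 + ln(c)); taken as 0 when the term is absent."""
    return 1.0 + math.log(1.0 + math.log(count)) if count > 0 else 0.0

def satisfies_tfc1(tf, max_count=50):
    """TFC1 (equal lengths): one more occurrence must give a strictly higher score."""
    return all(tf(c + 1) > tf(c) for c in range(0, max_count))

def satisfies_tfc2(tf, max_count=50):
    """TFC2: the gain from c to c+1 must exceed the gain from c+1 to c+2, for c > 0."""
    return all(tf(c + 1) - tf(c) > tf(c + 2) - tf(c + 1) for c in range(1, max_count))

print(satisfies_tfc1(tf_double_log), satisfies_tfc2(tf_double_log))  # True True
print(satisfies_tfc2(lambda c: float(c)))  # False: raw counts grow linearly
```

Raw term counts pass TFC1 but fail TFC2, which is why a sublinear transform is needed.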

Formal Definitions of Heuristic Retrieval Constraints • Term Discrimination Constraint (TDC) • Let q be a query and w1, w2 ∈ q be two query terms • Assume |d1| = |d2| and c(w1,d1) + c(w2,d1) = c(w1,d2) + c(w2,d2) • If idf(w1) ≥ idf(w2) and c(w1,d1) ≥ c(w1,d2), then f(d1,q) ≥ f(d2,q)

Formal Definitions of Heuristic Retrieval Constraints • Length Normalization Constraints (LNCs) • LNC1 • Let q be a query and d1, d2 be two documents • If for some word w' ∉ q, c(w',d2) = c(w',d1) + 1, but for any query term w, c(w,d2) = c(w,d1), then f(d1,q) ≥ f(d2,q) • LNC2 • Let q be a query and d1, d2 be two documents • For all k > 1, if |d1| = k · |d2| and for all terms w, c(w,d1) = k · c(w,d2), then f(d1,q) ≥ f(d2,q)

Formal Definitions of Heuristic Retrieval Constraints • TF-Length Constraint (TF-LNC) • Let q = {w} be a single-term query and d1, d2 be two documents • If c(w,d1) > c(w,d2) and |d1| = |d2| + c(w,d1) - c(w,d2), then f(d1,q) > f(d2,q)


Analysis of Three Representative Retrieval Formulas • Pivoted Normalization Method • Okapi Method • Dirichlet Prior Method

Analysis of Three Representative Retrieval Formulas: Pivoted Normalization Method • Retrieval function • Analysis
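A minimal sketch of the pivoted normalization retrieval function as it is commonly stated: a double-log TF, a length normalizer (1 - s) + s·|d|/avdl, the query term count, and the IDF ln((N + 1)/df(w)). The function name, argument names, token-list inputs, and the default s = 0.2 are assumptions made for illustration.

```python
import math
from collections import Counter

def pivoted_score(query, doc, df, num_docs, avdl, s=0.2):
    """Score a tokenized document against a tokenized query.

    df maps each term to its document frequency; avdl is the average document length.
    """
    q_counts, d_counts = Counter(query), Counter(doc)
    norm = (1.0 - s) + s * len(doc) / avdl           # pivoted length normalizer
    score = 0.0
    for w, qc in q_counts.items():
        c = d_counts.get(w, 0)
        if c == 0:
            continue                                  # only terms in both q and d contribute
        tf = 1.0 + math.log(1.0 + math.log(c))        # sublinear TF
        idf = math.log((num_docs + 1) / df[w])        # always positive for df <= N
        score += tf / norm * qc * idf
    return score

df = {"ir": 10}
one_match = pivoted_score(["ir"], ["ir", "a", "b", "c"], df, num_docs=100, avdl=4.0)
two_matches = pivoted_score(["ir"], ["ir", "ir", "a", "b"], df, num_docs=100, avdl=4.0)
print(one_match < two_matches)  # True: same length, more matches (TFC1)
```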

Analysis of Three Representative Retrieval Formulas: Pivoted Normalization Method • Check the TF-LNC constraint: when |d1| = avdl, it is equivalent to an inequality on s • TF-LNC is satisfied only if s is below a certain upper bound
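A small numerical illustration of the TF-LNC check for pivoted normalization, assuming a single-term query so the IDF and query-count factors can be dropped as positive constants. The document below is a deliberately contrived extreme (a short document made only of the query term, with |d1| = avdl), chosen so that the dependence on s is visible; the helper names are hypothetical.

```python
import math

def pivoted_single_term(count, doc_len, avdl, s):
    """Single-term pivoted score without the constant IDF and query-count factors."""
    tf = 1.0 + math.log(1.0 + math.log(count))
    return tf / ((1.0 - s) + s * doc_len / avdl)

def tf_lnc_holds(count, doc_len, extra, avdl, s):
    """TF-LNC: d1 is d2 with `extra` additional occurrences of the query term."""
    d1 = pivoted_single_term(count + extra, doc_len + extra, avdl, s)
    d2 = pivoted_single_term(count, doc_len, avdl, s)
    return d1 > d2

# d2: 9 words, all the query term; d1 adds one more occurrence, so |d1| = avdl = 10.
print(tf_lnc_holds(count=9, doc_len=9, extra=1, avdl=10.0, s=0.1))  # True
print(tf_lnc_holds(count=9, doc_len=9, extra=1, avdl=10.0, s=1.0))  # False
```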

Analysis of Three Representative Retrieval Formulas: Pivoted Normalization Method • Check the LNC2 constraint

Analysis of Three Representative Retrieval Formulas: Pivoted Normalization Method • Consider the common case |d2| = avdl • LNC2 can be violated when s is large, so performance can be poor for a large s
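A sketch of the LNC2 check in the case |d2| = avdl, assuming a single-term query. With tf1 and tf2 the double-log TF values for k·c and c occurrences, requiring f(d1,q) ≥ f(d2,q) rearranges to s ≤ (tf1 - tf2) / ((k - 1)·tf2). The helper names and the example counts are assumptions for illustration.

```python
import math

def tf_part(count):
    return 1.0 + math.log(1.0 + math.log(count))

def pivoted_single_term(count, doc_len, avdl, s):
    """Single-term pivoted score without the constant IDF and query-count factors."""
    return tf_part(count) / ((1.0 - s) + s * doc_len / avdl)

def lnc2_s_bound(count, k):
    """Largest s satisfying LNC2 when |d2| = avdl and d1 is d2 concatenated k times."""
    tf1, tf2 = tf_part(k * count), tf_part(count)
    return (tf1 - tf2) / ((k - 1) * tf2)

def lnc2_holds(count, k, s, avdl=100.0):
    d1 = pivoted_single_term(k * count, k * avdl, avdl, s)
    d2 = pivoted_single_term(count, avdl, avdl, s)
    return d1 >= d2

for k in (2, 5, 10):
    bound = lnc2_s_bound(count=1, k=k)
    # Below the bound LNC2 holds; above it the longer document is over-penalized.
    print(k, round(bound, 3), lnc2_holds(1, k, 0.5 * bound), lnc2_holds(1, k, 1.5 * bound))
```

The bound shrinks as k grows, which is consistent with preferring small values of s.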

Analysis of Three Representative Retrieval Formulas: Pivoted Normalization Method • Check the TDC constraint • It is equivalent to c(w2,d1) ≥ c(w1,d2), so TDC is only conditionally satisfied

Analysis of Three Representative Retrieval Formulas: Okapi Method • Retrieval function • Parameters: k1 (between 1.0 and 2.0), b (usually 0.75), and k3 (between 0 and 1000)
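A minimal sketch of the Okapi retrieval function in its commonly stated form: an IDF factor ln((N - df + 0.5)/(df + 0.5)), a saturating document TF factor controlled by k1 and b, and a query TF factor controlled by k3. The function name, argument names, and token-list inputs are assumptions; the default parameter values follow the ranges listed above.

```python
import math
from collections import Counter

def okapi_score(query, doc, df, num_docs, avdl, k1=1.2, b=0.75, k3=1000.0):
    """Score a tokenized document against a tokenized query with the Okapi formula."""
    q_counts, d_counts = Counter(query), Counter(doc)
    length_norm = (1.0 - b) + b * len(doc) / avdl
    score = 0.0
    for w, qc in q_counts.items():
        c = d_counts.get(w, 0)
        if c == 0:
            continue                                             # only matched terms contribute
        idf = math.log((num_docs - df[w] + 0.5) / (df[w] + 0.5))  # can be negative
        doc_tf = (k1 + 1.0) * c / (k1 * length_norm + c)
        query_tf = (k3 + 1.0) * qc / (k3 + qc)
        score += idf * doc_tf * query_tf
    return score

df = {"ir": 10}
score = okapi_score(["ir"], ["ir", "a", "b", "c"], df, num_docs=100, avdl=4.0)
print(round(score, 4))
```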

Analysis of Three Representative Retrieval Formulas: Okapi Method • Analysis • When df(w) > N/2, the IDF part of the formula is negative • When the IDF part is positive (mostly true for keyword queries) • TFCs and LNCs are satisfied • TF-LNC: in the common case |d2| = avdl, the constraint is equivalent to b ≤ avdl / c(w,d2) • TDC is equivalent to c(w2,d1) ≥ c(w1,d2), the same condition as for pivoted normalization
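A short sketch of the negative-IDF problem, assuming the usual Okapi IDF ln((N - df + 0.5)/(df + 0.5)) and a document of average length so the length normalizer equals 1. The collection size and document frequencies are made-up illustration values.

```python
import math

def okapi_idf(df, num_docs):
    return math.log((num_docs - df + 0.5) / (df + 0.5))

def okapi_tf(count, k1=1.2):
    """Okapi document TF factor for a document with |d| = avdl."""
    return (k1 + 1.0) * count / (k1 + count)

num_docs = 1000
for df in (10, 500, 900):
    idf = okapi_idf(df, num_docs)
    one, two = idf * okapi_tf(1), idf * okapi_tf(2)
    # For df > N/2 the IDF is negative, so a second match lowers the score.
    print(df, round(idf, 3), round(one, 3), round(two, 3))
```

With a negative IDF, more occurrences of the term reduce the score, which contradicts TFC1.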

Analysis of Three Representative Retrieval Formulas: Okapi Method • Modified Okapi method • Solves the problem of negative IDF • Replaces the original IDF in Okapi with the regular IDF from the pivoted normalization formula • Performs better on verbose queries • Analysis result
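A sketch comparing the two IDF variants, assuming the regular IDF is ln((N + 1)/df) as in the pivoted normalization formula. The function names and the collection size are illustrative assumptions.

```python
import math

def okapi_idf(df, num_docs):
    """Original Okapi IDF; negative once df exceeds N/2."""
    return math.log((num_docs - df + 0.5) / (df + 0.5))

def regular_idf(df, num_docs):
    """IDF from the pivoted normalization formula; positive for every df <= N."""
    return math.log((num_docs + 1.0) / df)

num_docs = 1000
for df in (10, 500, 900, 1000):
    print(df, round(okapi_idf(df, num_docs), 3), round(regular_idf(df, num_docs), 3))
```

Because the regular IDF stays positive even for very common terms, the TF factor can no longer flip the sign of a term's contribution.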

Analysis of Three Representative Retrieval Formulas: Dirichlet Prior Method • Retrieval function • Uses Dirichlet prior smoothing to smooth the document language model • Ranks documents by the likelihood of the query under each document's estimated language model
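A minimal sketch of Dirichlet prior scoring, assuming the usual smoothed estimate p(w|d) = (c(w,d) + μ·p(w|C)) / (|d| + μ). The first function is the plain query log-likelihood; the second is the rank-equivalent form that sums only over matched terms and adds a length term |q|·ln(μ/(|d| + μ)). Function names, token-list inputs, the collection probabilities, and μ = 2000 are assumptions for illustration.

```python
import math
from collections import Counter

def query_log_likelihood(query, doc, p_coll, mu=2000.0):
    """ln p(q|d) under a Dirichlet-smoothed document language model."""
    d_counts = Counter(doc)
    return sum(
        math.log((d_counts.get(w, 0) + mu * p_coll[w]) / (len(doc) + mu))
        for w in query
    )

def dirichlet_score(query, doc, p_coll, mu=2000.0):
    """Rank-equivalent form: drops the document-independent sum of ln p(w|C)."""
    q_counts, d_counts = Counter(query), Counter(doc)
    score = len(query) * math.log(mu / (len(doc) + mu))   # length normalization term
    for w, qc in q_counts.items():
        c = d_counts.get(w, 0)
        if c > 0:
            score += qc * math.log(1.0 + c / (mu * p_coll[w]))
    return score

p_coll = {"retrieval": 0.001, "model": 0.002}
q = ["retrieval", "model"]
d1 = ["retrieval", "model", "retrieval", "x", "y"]
d2 = ["model", "x", "y", "z"]
gap_ll = query_log_likelihood(q, d1, p_coll) - query_log_likelihood(q, d2, p_coll)
gap_score = dirichlet_score(q, d1, p_coll) - dirichlet_score(q, d2, p_coll)
print(abs(gap_ll - gap_score) < 1e-9)  # True: both forms rank documents identically
```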

Analysis of Three Representative Retrieval Formulas: Dirichlet Prior Method • Analysis • The LNC2 constraint is equivalent to c(w,d2) ≥ |d2| · p(w|C) • This is usually satisfied for content-carrying words • The TDC constraint leads to a lower bound for the parameter μ
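A sketch of the LNC2 condition for the Dirichlet prior method, assuming a single-term query. Comparing d1 (d2 concatenated k times) with d2 and cancelling common factors leaves c(w,d2) ≥ |d2|·p(w|C), independent of k and μ. The helper names and the example probabilities are made-up illustration values.

```python
import math

def dirichlet_single_term(count, doc_len, p_coll, mu=2000.0):
    """Single-term Dirichlet prior score (rank-equivalent form)."""
    return math.log(1.0 + count / (mu * p_coll)) + math.log(mu / (doc_len + mu))

def lnc2_holds(count, doc_len, p_coll, k, mu=2000.0):
    d1 = dirichlet_single_term(k * count, k * doc_len, p_coll, mu)
    d2 = dirichlet_single_term(count, doc_len, p_coll, mu)
    return d1 >= d2

# Rare, content-carrying word: c = 2 >= |d2| * p(w|C) = 100 * 0.001 = 0.1
print(lnc2_holds(count=2, doc_len=100, p_coll=0.001, k=3))  # True
# Very common word: c = 2 < |d2| * p(w|C) = 100 * 0.05 = 5
print(lnc2_holds(count=2, doc_len=100, p_coll=0.05, k=3))   # False
```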

Analysis of Three Representative Retrieval Formulas: Dirichlet Prior Method • Analysis • TDC: consider a common case for w2, p(w2|C) = 1/avdl • This means that for discriminative words with a high term frequency in a document, μ needs to be sufficiently large • in order to balance TF and IDF appropriately

Experiments: Setup • Document sets • AP: news articles, DOE: technical reports, FR: government documents • ADF: combination of AP, DOE, and FR • Web: web data used in TREC8 • Trec7: ad hoc data used in TREC7 • Trec8: ad hoc data used in TREC8

Experiments: Setup • Query types • Short-keyword (SK, keyword title) • Short-verbose (SV, one-sentence description) • Long-keyword (LK, keyword list) • Long-verbose (LV, multiple sentences) • Preprocessing • Stemming only, with the Porter stemmer • No stop words removed

Experiments: Parameter Sensitivity • Pivoted normalization method • The LNC2 analysis for the pivoted normalization method suggests that s should be smaller than 0.4

Experiments: Parameter Sensitivity • Okapi method: k1 = 1.2, k3 = 1000, b varies from 0.1 to 1.0

Experiments: Parameter Sensitivity • Dirichlet prior method

Experiments: Performance Comparison

Experiments: Performance Comparison • For any query type, the performance of the Dirichlet prior method is comparable to that of the pivoted normalization method • For keyword queries, the performance of Okapi is comparable to that of the other two retrieval formulas • For verbose queries, Okapi may perform worse than the others because the IDF part of the formula can be negative

Experiments: Performance Comparison • Average precision comparison

Conclusion and Future Work • Defined six basic constraints that any reasonable retrieval function should satisfy • When a constraint is not satisfied, it often indicates non-optimality of the method
