Learning to Rank

Learning to Rank Ming-Feng Tsai National Taiwan University

Ranking • Ranking vs. Classification • Training samples are not independent and identically distributed (i.i.d.) • The training criterion is not compatible with the evaluation criteria of IR • Many ML approaches have been applied to ranking • RankSVM • T. Joachims, SIGKDD, 2002 (SVM Light) • RankBoost • Freund Y., Iyer R., et al., Journal of Machine Learning Research, 2003 • RankNet • C.J.C. Burges et al., ICML, 2005 (MSN Search)

Motivation • RankNet • Pro • Probabilistic ranking model • Good properties • Con • Training is not efficient • The training criterion is not compatible with the evaluation criteria of IR • Motivation • Build on the probabilistic ranking model • Improve efficiency and the loss function

Probabilistic Ranking Model • Model the posterior probability that $x_i$ is ranked above $x_j$ by $P_{ij}$ • The map from outputs to probabilities is modeled using a sigmoid function: $P_{ij} = \frac{e^{o_{ij}}}{1 + e^{o_{ij}}}$, where $o_{ij} = f(x_i) - f(x_j)$ is the difference of the model outputs • Properties • Combined probabilities • Consistency requirement: if P(A>B)=0.5 and P(B>C)=0.5, then P(A>C)=0.5 • Confidence, or lack of confidence, builds as expected: if P(A>B)=0.6 and P(B>C)=0.6, then P(A>C)>0.6
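
The two properties on this slide can be checked numerically. The sketch below assumes the logistic form of the sigmoid and that output differences add along a chain (o_AC = o_AB + o_BC); the function names are illustrative and do not come from the slides.

```python
import math

def pair_probability(score_i: float, score_j: float) -> float:
    """P_ij = sigmoid(o_ij), with o_ij = score_i - score_j."""
    o_ij = score_i - score_j
    return 1.0 / (1.0 + math.exp(-o_ij))

def combine(p_ij: float, p_jk: float) -> float:
    """P_ik implied by P_ij and P_jk when o_ik = o_ij + o_jk (assumption)."""
    return p_ij * p_jk / (1.0 + 2.0 * p_ij * p_jk - p_ij - p_jk)

# Consistency: 0.5 and 0.5 combine to 0.5.
print(combine(0.5, 0.5))
# Confidence builds: 0.6 and 0.6 combine to a value above 0.6.
print(combine(0.6, 0.6))
```

The closed form in `combine` follows from adding the two log-odds and mapping the sum back through the sigmoid.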

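Probabilistic Ranking Model • Cross entropy loss function • Let $\bar{P}_{ij}$ be the desired target values, $P_{ij}$ the modeled probability and $o_{ij} = f(x_i) - f(x_j)$ the difference of model outputs: $C_{ij} = -\bar{P}_{ij} \log P_{ij} - (1 - \bar{P}_{ij}) \log(1 - P_{ij}) = -\bar{P}_{ij}\, o_{ij} + \log(1 + e^{o_{ij}})$ • Total cost function: $C = \sum_{i,j} C_{ij}$ • RankNet applies this loss function with a neural network (BP network) • Here the same loss function is applied with an additive model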
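
As a quick check of the pairwise cross entropy cost, the sketch below computes it in two ways: directly from the probabilities, and from the output difference using a numerically stable softplus. It assumes the logistic map from outputs to probabilities; the function names are illustrative.

```python
import math

def cross_entropy_from_probs(target: float, p_model: float) -> float:
    """C = -target*log(p) - (1-target)*log(1-p), for p strictly in (0, 1)."""
    return -target * math.log(p_model) - (1.0 - target) * math.log(1.0 - p_model)

def cross_entropy_from_output(target: float, o_ij: float) -> float:
    """Same cost written as -target*o + log(1 + exp(o))."""
    # Stable softplus: log(1 + exp(o)) = max(o, 0) + log1p(exp(-|o|)).
    softplus = max(o_ij, 0.0) + math.log1p(math.exp(-abs(o_ij)))
    return -target * o_ij + softplus

o = 1.3
p = 1.0 / (1.0 + math.exp(-o))
print(cross_entropy_from_probs(0.8, p), cross_entropy_from_output(0.8, o))
```

Note that for a target of 0.5 the minimum of this cost is log 2 rather than 0, and the cost is unbounded above for badly ordered pairs.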

Derivation of cross entropy loss function for Additive Model

Candidates of Loss Functions • Cross entropy • KL-divergence • As a loss function it is equivalent to cross entropy (the two differ only by a term that does not depend on the model) • Information radius • KL-divergence and cross entropy are asymmetric • Information radius is symmetric, that is, IRad(p,q)=IRad(q,p) • Minkowski norm • This seems simpler than cross entropy in the mathematical derivation for boosting
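
The symmetry claims above can be illustrated for two-outcome (Bernoulli) distributions. The sketch assumes the common definition IRad(p, q) = KL(p || m) + KL(q || m) with m = (p + q)/2; the helper names are illustrative.

```python
import math

def _xlogy(x: float, y: float) -> float:
    """x * log(y), with the convention 0 * log(y) = 0."""
    return 0.0 if x == 0.0 else x * math.log(y)

def entropy(p: float) -> float:
    return -(_xlogy(p, p) + _xlogy(1.0 - p, 1.0 - p))

def cross_entropy(p: float, q: float) -> float:
    """Cross entropy of model q against target p; q must lie in (0, 1)."""
    return -(_xlogy(p, q) + _xlogy(1.0 - p, 1.0 - q))

def kl(p: float, q: float) -> float:
    """KL(p || q) = cross_entropy(p, q) - entropy(p)."""
    return cross_entropy(p, q) - entropy(p)

def information_radius(p: float, q: float) -> float:
    m = 0.5 * (p + q)
    return kl(p, m) + kl(q, m)

print(kl(0.9, 0.6), kl(0.6, 0.9))                                   # asymmetric
print(information_radius(0.9, 0.6), information_radius(0.6, 0.9))   # symmetric
```

Because `entropy(p)` does not depend on the model probability q, minimizing `kl` over q and minimizing `cross_entropy` over q select the same q.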

Fidelity Loss Function • Fidelity • A more reasonable loss function, inspired by quantum computation • Holds the same properties as the probabilistic ranking model proposed by Burges et al. • New properties • Symmetric: F(p, q) = F(q, p) • The loss is bounded between 0 and 1 • The minimum loss value 0 can be attained • Loss convergence
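
The slide does not reproduce the formula, so the sketch below assumes the usual fidelity between two Bernoulli distributions, sqrt(p*q) + sqrt((1-p)*(1-q)), and takes one minus that value as the loss. Under this assumption the listed properties can be checked directly; the function name is illustrative.

```python
import math

def fidelity_loss(p_target: float, p_model: float) -> float:
    """1 - fidelity between Bernoulli(p_target) and Bernoulli(p_model)."""
    overlap = (math.sqrt(p_target * p_model)
               + math.sqrt((1.0 - p_target) * (1.0 - p_model)))
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 1.0 - overlap)

print(fidelity_loss(0.5, 0.5))   # minimum value, reached when the two agree
print(fidelity_loss(1.0, 0.0))   # maximum value
print(fidelity_loss(0.9, 0.3), fidelity_loss(0.3, 0.9))  # symmetric
```

Unlike the cross entropy cost, this loss reaches 0 even for a target of 0.5 and never exceeds 1 for a single pair.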

Fidelity Loss Function • Properties • Total loss function • Pair-level loss is considered, e.g. the loss between (5, 4, 3, 2, 1) and (4, 3, 2, 1, 0) is zero • Query-level loss is also considered • Larger penalty for pairs with a larger grade difference, e.g. (5, 0) versus (5, 4)
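
A small sketch of how these properties could arise. It rests on several assumptions that the slide does not state: target probabilities are taken as the sigmoid of the grade difference, model probabilities as the sigmoid of the score difference, and the query-level term is an average over the pairs of each query. All names are illustrative.

```python
import math
from itertools import combinations

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def fidelity_loss(p_target: float, p_model: float) -> float:
    overlap = (math.sqrt(p_target * p_model)
               + math.sqrt((1.0 - p_target) * (1.0 - p_model)))
    return max(0.0, 1.0 - overlap)

def query_loss(grades, scores) -> float:
    """Average pairwise loss within one query (needs at least two items)."""
    pairs = list(combinations(range(len(grades)), 2))
    total = sum(fidelity_loss(sigmoid(grades[i] - grades[j]),
                              sigmoid(scores[i] - scores[j]))
                for i, j in pairs)
    return total / len(pairs)

def total_loss(queries) -> float:
    """Sum of per-query losses; queries is a list of (grades, scores)."""
    return sum(query_loss(g, s) for g, s in queries)

# Shifting every score by a constant leaves all pairwise differences intact.
print(query_loss([5, 4, 3, 2, 1], [4, 3, 2, 1, 0]))
# Scoring a (5, 0) pair as a tie costs more than scoring a (5, 4) pair as a tie.
print(query_loss([5, 0], [1, 1]), query_loss([5, 4], [1, 1]))
```

Averaging within each query keeps queries with many documents from dominating the total, which is one way to read "query-level loss".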

Derivation for Additive Model We denote

Derivation for Additive Model

FRank
Algorithm: FRank
Given: ground truth
Initialize:
For t = 1, 2, …, T
  (a) For each weak learner candidate hi(x)
    (a.1) Compute the optimal αt,i
    (a.2) Compute the fidelity loss
  (b) Choose the weak learner ht,i(x) with the minimal loss as ht(x)
  (c) Choose the corresponding αt,i as αt
  (d) Update the pair weights Wi,j
Output the final ranking
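
The loop structure above can be sketched as a greedy additive fit. This is not the FRank update itself: the closed-form α and the pair-weight update of step (d) are not given on the slide, so the sketch swaps in a plain grid search over α and recomputes the total loss from scratch each time. The fidelity formula, the grid and all names are assumptions made for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fidelity_loss(p_target, p_model):
    overlap = (math.sqrt(p_target * p_model)
               + math.sqrt((1.0 - p_target) * (1.0 - p_model)))
    return max(0.0, 1.0 - overlap)

def total_loss(pairs, targets, score):
    """Sum of pairwise losses; score maps an item to its model output H(x)."""
    return sum(fidelity_loss(t, sigmoid(score(a) - score(b)))
               for (a, b), t in zip(pairs, targets))

def fit_additive_ranker(pairs, targets, weak_learners, rounds=5):
    """Greedy additive model H(x) = sum_t alpha_t * h_t(x)."""
    alpha_grid = [k / 10.0 for k in range(-30, 31)]  # hypothetical search grid
    model = []                                       # list of (alpha, h)

    def score(x, extra=None):
        s = sum(a * h(x) for a, h in model)
        if extra is not None:                        # trial term, not yet added
            s += extra[0] * extra[1](x)
        return s

    for _ in range(rounds):
        best = None
        for h in weak_learners:                      # step (a)
            for alpha in alpha_grid:                 # stands in for (a.1)
                loss = total_loss(pairs, targets,
                                  lambda x: score(x, (alpha, h)))  # (a.2)
                if best is None or loss < best[0]:
                    best = (loss, alpha, h)
        model.append((best[1], best[2]))             # steps (b) and (c)
    return lambda x: score(x)

# Toy data: one feature, three items, each pair ordered by that feature.
items = [(3.0,), (2.0,), (1.0,)]
pairs = [(items[0], items[1]), (items[0], items[2]), (items[1], items[2])]
targets = [1.0, 1.0, 1.0]
stumps = [lambda x, t=t: 1.0 if x[0] > t else 0.0 for t in (1.5, 2.5)]
ranker = fit_additive_ranker(pairs, targets, stumps)
print([ranker(x) for x in items])
```

The grid search is far slower than a closed-form α, which is presumably why the algorithm on the slide computes the optimal α directly and maintains pair weights.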

Implementation • Finished • Fast threshold implementation • 4 times faster • Fast alpha implementation • 120 seconds faster per weak learner in 3w • Fast total-loss implementation • 3 times faster • Resuming training • Plan • Multi-threaded implementation • Parallel computation • Margin consideration (fast, but with loss)

Preliminary Experimental Results • Data Set of BestRank Competition • Training Data: about 2,500 queries • Validation Data: about 1,100 queries • Features: 64 features • Evaluation • NDCG

Preliminary Experimental Results • Results of Validation Data

Next step

Interesting Analogy • Loss function • Pair-level loss • Query-level loss • Other considerations • Learning model • Boosting, additive model • LogitBoost • Boosted Lasso • SVM • Neural network • A whole new model • The dependence among retrieved web pages

Pairwise • [Figure: F(x) with input Xi; F(x,y) with inputs Xi and Xj] • Pairwise Training • Ranking is reduced to a classification problem by using pairs of items as training samples • This increases the data complexity from O(n) to O(n^2) • Suppose there are n samples evenly distributed over k ranks; the total number of pairwise samples is roughly n^2/k
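
The pair count depends on which pairs are used for training, which the slide does not spell out. The sketch below counts two conventions for n items spread evenly over k ranks: every pair with different ranks, and only pairs from adjacent ranks. Both conventions, and the function name, are assumptions for illustration.

```python
from itertools import combinations

def count_pairs(n: int, k: int):
    """Return (pairs with different ranks, pairs with adjacent ranks)."""
    ranks = [i % k for i in range(n)]   # n items spread evenly over k ranks
    different = sum(1 for a, b in combinations(ranks, 2) if a != b)
    adjacent = sum(1 for a, b in combinations(ranks, 2) if abs(a - b) == 1)
    return different, adjacent

n, k = 100, 10
different, adjacent = count_pairs(n, k)
print(different, adjacent, n * n // k)
```

When k divides n, the first count equals n^2 (k - 1) / (2k) and the second equals (k - 1) (n/k)^2. The second is close to n^2/k for larger k, so the adjacent-rank reading is one under which the slide's estimate holds approximately; either way the growth is quadratic in n.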

Pairwise • Pairwise Training • F(x,y) is a more general function than F(x) – F(y) • Find the properties that F(x,y) should model • Nonlinear relation between x and y • margin(r1, r30) > margin(r1, r10) > margin(r21, r30) • …

Pairwise • Pairwise Testing • In the testing phase, the ranking must be reconstructed from a partial-order graph, which may be inconsistent and incomplete • Topological sorting runs in linear time but can only handle a DAG • Problems • Inconsistent • How to find the best spanning tree • Incomplete • How to deal with nodes without labels
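
To make the DAG limitation concrete, here is a minimal sketch of Kahn's topological sort over pairwise preferences. It returns a ranking when the preferences are consistent and reports failure when they contain a cycle. The function name and the (better, worse) input format are illustrative choices.

```python
from collections import deque

def rank_from_preferences(items, preferences):
    """Kahn's algorithm; preferences is a list of (better, worse) pairs.

    Returns a ranking consistent with every preference, or None when the
    preference graph contains a cycle (inconsistent predictions).
    """
    successors = {x: [] for x in items}
    indegree = {x: 0 for x in items}
    for better, worse in preferences:
        successors[better].append(worse)
        indegree[worse] += 1

    ready = deque(x for x in items if indegree[x] == 0)
    order = []
    while ready:
        x = ready.popleft()
        order.append(x)
        for y in successors[x]:
            indegree[y] -= 1
            if indegree[y] == 0:
                ready.append(y)
    return order if len(order) == len(items) else None

print(rank_from_preferences(["a", "b", "c"], [("a", "b"), ("b", "c")]))
print(rank_from_preferences(["a", "b", "c"], [("a", "b"), ("b", "c"), ("c", "a")]))
```

Items that appear in no preference (the incomplete case) still receive a position, but that position is arbitrary, which is the second problem the slide raises.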

Pairwise • Spanning Tree • Related Content • Colley’s Bias Free College Football Ranking Method • Tree Reconstruction via partial order • …

Thank you for your attention • Q&A

Additive Model • AdaBoost • Construct a classifier H(x) as a linear combination of base classifiers h(x): $H(x) = \sum_{t=1}^{T} \alpha_t h_t(x)$ • To obtain the optimal base classifiers $\{h_t(x)\}$ and linear combination coefficients $\{\alpha_t\}$, we need to minimize the training error • For binary classification problems (labels 1 or -1), the training error of the classifier H(x) can be written as $err = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}[\, y_i H(x_i) \le 0 \,]$

Additive Model • AdaBoost • For simplicity of computation, it uses the exponential cost function as the objective function: $J = \frac{1}{N} \sum_{i=1}^{N} \exp(-y_i H(x_i))$ • The exponential cost function upper-bounds the training error err, because $\exp(-y_i H(x_i)) \ge 1$ whenever $y_i H(x_i) \le 0$

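Additive Model • AdaBoost • Write $H(x) = H_{T-1}(x) + \alpha_T h_T(x)$. By setting the derivative of the exponential cost with respect to $\alpha_T$ to zero, we have: $\sum_{i=1}^{N} y_i h_T(x_i) \exp(-y_i H_{T-1}(x_i)) \exp(-\alpha_T y_i h_T(x_i)) = 0$ • With the data distribution $W_T(i) = \frac{\exp(-y_i H_{T-1}(x_i))}{\sum_{j=1}^{N} \exp(-y_j H_{T-1}(x_j))}$ • the linear combination coefficient $\alpha_T$ can be written as $\alpha_T = \frac{1}{2} \ln \frac{\sum_{i:\, h_T(x_i) = y_i} W_T(i)}{\sum_{i:\, h_T(x_i) \ne y_i} W_T(i)}$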

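Additive Model
Given: $(x_i, y_i)$, $i = 1, \dots, N$
1. Initialize: $W_1(i) = 1/N$
2. For t = 1, 2, …, T
  (a) Train the weak learner $h_t$ using distribution $W_t$
  (b) Compute the weighted error $\epsilon_t = \sum_{i:\, h_t(x_i) \ne y_i} W_t(i)$
  (c) Compute $\alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t}$
  (d) Update $W_{t+1}(i) = W_t(i) \exp(-\alpha_t y_i h_t(x_i)) / Z_t$, where $Z_t$ normalizes $W_{t+1}$ to sum to 1
3. Output $H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$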
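
A minimal sketch of the AdaBoost loop for labels in {1, -1}, using the textbook error, coefficient and reweighting formulas. "Training" a weak learner is simplified here to picking the candidate with the lowest weighted error from a fixed pool; that simplification, the early stop, and all names are assumptions for illustration.

```python
import math

def adaboost(xs, ys, weak_learners, rounds):
    n = len(xs)
    w = [1.0 / n] * n                                  # 1. W_1(i) = 1/N
    ensemble = []                                      # list of (alpha, h)

    def weighted_error(h):
        return sum(wi for wi, x, y in zip(w, xs, ys) if h(x) != y)

    for _ in range(rounds):                            # 2. for t = 1..T
        h = min(weak_learners, key=weighted_error)     # (a) best candidate
        eps = weighted_error(h)                        # (b) weighted error
        if eps >= 0.5:                                 # no better than chance
            break
        eps = max(eps, 1e-10)                          # avoid division by zero
        alpha = 0.5 * math.log((1.0 - eps) / eps)      # (c) coefficient
        w = [wi * math.exp(-alpha * y * h(x))          # (d) reweight samples
             for wi, x, y in zip(w, xs, ys)]
        z = sum(w)
        w = [wi / z for wi in w]
        ensemble.append((alpha, h))

    def classify(x):                                   # 3. sign of weighted vote
        return 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1
    return classify

# Toy usage with threshold stumps on a single feature.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [-1, -1, 1, 1]
stumps = [lambda x, t=t: 1 if x > t else -1 for t in (0.5, 1.5, 2.5)]
classify = adaboost(xs, ys, stumps, rounds=3)
print([classify(x) for x in xs])
```

Misclassified samples have y * h(x) = -1, so step (d) multiplies their weights by exp(alpha) and shifts the next round's attention toward them.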

NDCG • K. Järvelin, J. Kekäläinen, ACM Transactions on Information Systems, 2002 • Example • Assume that relevance scores 0 to 3 are used. G’=<3, 2, 3, 0, 0, 1, 2, 2, 3, 0, …> • Cumulated Gain (CG) CG’=<3, 5, 8, 8, 8, 9, 11, 13, 16, 16, …>

NDCG • Discounted Cumulated Gain (DCG): the gain at rank i ≥ b is divided by log_b(i); let b=2, DCG’=<3, 5, 6.89, 6.89, 6.89, 7.28, 7.99, 8.66, 9.61, 9.61, …> • Normalized Discounted Cumulated Gain (NDCG) • Ideal vector I’=<3, 3, 3, 2, 2, 2, 1, 1, 1, 1, 0, 0, 0, …> • CGI’=<3, 6, 9, 11, 13, 15, 16, 17, 18, 19, 19, 19, 19, …> • DCGI’=<3, 6, 7.89, 8.89, 9.75, 10.52, 10.88, 11.21, 11.53, 11.83, 11.83, …> • Dividing CG’ by CGI’ gives the normalized CG: nCG’=<1, 0.83, 0.89, 0.73, 0.62, 0.6, 0.69, 0.76, 0.89, 0.84, …> • Dividing DCG’ by DCGI’ gives NDCG’=<1, 0.83, 0.87, 0.78, 0.71, 0.69, 0.73, 0.77, 0.83, 0.81, …>
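
The CG and DCG vectors in this example can be recomputed with a few lines. The sketch follows the definition used in the example (no discount below rank b, division by log_b of the rank from rank b onward) and obtains NDCG by dividing DCG position by position by the DCG of the ideal vector; the function names are illustrative.

```python
import math

def cumulated_gain(gains):
    out, total = [], 0.0
    for g in gains:
        total += g
        out.append(total)
    return out

def discounted_cumulated_gain(gains, b=2):
    out, total = [], 0.0
    for rank, g in enumerate(gains, start=1):
        # Ranks below b are not discounted; later ranks divide by log_b(rank).
        total += g if rank < b else g / math.log(rank, b)
        out.append(total)
    return out

def ndcg(gains, ideal, b=2):
    """Position-wise DCG / ideal DCG; the ideal vector must start with a gain > 0."""
    dcg = discounted_cumulated_gain(gains, b)
    ideal_dcg = discounted_cumulated_gain(ideal, b)
    return [d / i for d, i in zip(dcg, ideal_dcg)]

G = [3, 2, 3, 0, 0, 1, 2, 2, 3, 0]
I = [3, 3, 3, 2, 2, 2, 1, 1, 1, 1]
print(cumulated_gain(G))
print([round(v, 2) for v in discounted_cumulated_gain(G)])
print([round(v, 2) for v in ndcg(G, I)])
```

The first two printed vectors match CG’ and DCG’ from the example when rounded to two decimals.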
