
Learning to Rank (part 1)


Presentation Transcript


  1. Learning to Rank (part 1) NESCAI 2008 Tutorial Yisong Yue Cornell University

  2. Booming Search Industry

  3. Goals for this Tutorial Basics of information retrieval What machine learning contributes New challenges to address New insights on developing ML algorithms

  4. (Soft) Prerequisites Basic knowledge of ML algorithms Support Vector Machines Neural Nets Decision Trees Boosting Etc… Will introduce IR concepts as needed

  5. Outline (Part 1) Conventional IR Methods (no learning) 1970s to 1990s Ordinal Regression 1994 onwards Optimizing Rank-Based Measures 2005 to present

  6. Outline (Part 2) Effectively collecting training data E.g., interpreting clickthrough data Beyond independent relevance E.g., diversity Summary & Discussion

  7. Disclaimer This talk is very ML-centric Use IR methods to generate features Learn good ranking functions on feature space Focus on optimizing cleanly formulated objectives Outperform traditional IR methods

  8. Disclaimer This talk is very ML-centric Use IR methods to generate features Learn good ranking functions on feature space Focus on optimizing cleanly formulated objectives Outperform traditional IR methods Information Retrieval Broader than the scope of this talk Deals with more sophisticated modeling questions Will see more interplay between IR and ML in Part 2

  9. Brief Overview of IR Predated the internet As We May Think by Vannevar Bush (1945) Active research topic by the 1960s Vector Space Model (1970s) Probabilistic Models (1980s) Introduction to Information Retrieval (2008) C. Manning, P. Raghavan & H. Schütze

  10. Basic Approach to IR • Given query q and set of docs d1, … dn • Find documents relevant to q • Typically expressed as a ranking on d1,… dn

  11. Basic Approach to IR • Given query q and set of docs d1, … dn • Find documents relevant to q • Typically expressed as a ranking on d1, … dn • Similarity measure sim(a,b) → R • Sort by sim(q,di) • Optimal if the relevance of each document is independent of the others [Robertson, 1977]

  12. Vector Space Model Represent documents as vectors One dimension for each word Queries as short documents Similarity Measures Cosine similarity = normalized dot product

  13. Cosine Similarity Example
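
A minimal sketch of the idea from the two preceding slides, since the worked example itself did not survive the transcript: represent the query and documents as term-count vectors and sort by cosine similarity (normalized dot product). The helper names (term_vector, cosine) and the toy documents are illustrative assumptions, not from the tutorial.

```python
import math
from collections import Counter

def term_vector(text):
    """Bag-of-words term-count vector: one dimension per word."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity = normalized dot product of two sparse vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm > 0 else 0.0

# Treat the query as a short document and sort docs by sim(q, d_i).
docs = ["the quick brown fox",
        "learning to rank documents",
        "rank the documents by relevance"]
query = term_vector("rank documents")
print(sorted(docs, key=lambda d: cosine(query, term_vector(d)), reverse=True))
```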

  14. Other Methods TF-IDF [Salton & Buckley, 1988] Okapi BM25 [Robertson et al., 1995] Language Models [Ponte & Croft, 1998] [Zhai & Lafferty, 2001]

  15. Machine Learning • IR uses fixed models to define similarity scores • Many opportunities to learn models • Appropriate training data • Appropriate learning formulation • Will mostly use SVM formulations as examples • General insights are applicable to other techniques.

  16. Training Data • Supervised learning problem • Document/query pairs • Embedded in high dimensional feature space • Labeled by relevance of doc to query • Traditionally 0/1 • Recently ordinal classes of relevance (0,1,2,3,…)

  17. Feature Space • Used to learn a similarity/compatibility function • Based on existing IR methods • Can use raw values • Or transformations of raw values • Based on raw words • Capture co-occurrence of words

  18. Training Instances

  19. Learning Problem • Given training instances: • (x_{q,d}, y_{q,d}) for q = 1..N, d = 1..N_q • Learn a ranking function • f(x_{q,1}, …, x_{q,N_q}) → ranking • Typically decomposed into per-doc scores • f(x) → R (doc/query compatibility) • Sort by scores for all instances of a given q

  20. How to Train? Classification & Regression Learn f(x) → R in conventional ways Sort by f(x) for all docs for a query Typically does not work well 2 Major Problems Labels have ordering Additional structure compared to multiclass problems Severe class imbalance Most documents are not relevant

  21. Somewhat Relevant Very Relevant Not Relevant Conventional multiclass learning does not incorporate ordinal structure of class labels

  22. Somewhat Relevant Very Relevant Not Relevant Conventional multiclass learning does not incorporate ordinal structure of class labels

  23. Ordinal Regression Assume class labels are ordered True since class labels indicate level of relevance Learn hypothesis function f(x) → R Such that the ordering of f(x) agrees with the label ordering Ex: given instances (x, 1), (y, 1), (z, 2), we want f(x) < f(z) and f(y) < f(z), but don't care about f(x) vs f(y)

  24. Ordinal Regression Compare with classification Similar to multiclass prediction But classes have ordinal structure Compare with regression Doesn't necessarily care about the value of f(x) Only cares that the ordering is preserved

  25. Ordinal Regression Approaches • Learn multiple thresholds • Learn multiple classifiers • Optimize pairwise preferences

  26. Option 1: Multiple Thresholds Maintain T thresholds (b1, …, bT) with b1 < b2 < … < bT Learn model parameters together with (b1, …, bT) Goal Model predicts a score on each input example Minimize threshold violations of the predictions
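
As a small illustration of the multiple-thresholds idea, assuming a real-valued score f(x) and thresholds b1 < … < bT have already been learned: the predicted ordinal label is just the interval between consecutive thresholds into which the score falls. The function name predict_ordinal and the numeric values below are hypothetical.

```python
import bisect

def predict_ordinal(score, thresholds):
    """Predicted class = index of the interval (-inf, b1], (b1, b2], ..., (bT, inf)
    containing the model score, i.e., a label in 0..T."""
    return bisect.bisect_left(thresholds, score)

# Hypothetical learned thresholds b1 < b2 < b3 and model scores f(x).
thresholds = [-1.0, 0.5, 2.0]
for score in [-2.3, 0.1, 0.7, 3.5]:
    print(score, "->", predict_ordinal(score, thresholds))   # classes 0, 1, 2, 3
```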

  27. Ordinal SVM Example [Chu & Keerthi, 2005]

  28. Ordinal SVM Formulation (objective and constraints for j = 0..T) [Chu & Keerthi, 2005]
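
The objective and constraints on this slide were rendered as images and are missing from the transcript. As a hedged stand-in, a generic threshold-based ordinal SVM in the spirit of [Chu & Keerthi, 2005] (a sketch, not necessarily their exact formulation) can be written as:

```latex
\begin{aligned}
\min_{w,\,b,\,\xi}\;\; & \tfrac{1}{2}\|w\|^2 \;+\; C \sum_{i}\sum_{j=1}^{T} \xi_{i,j} \\
\text{s.t.}\;\; & w \cdot x_i \le b_j - 1 + \xi_{i,j} && \text{for all } i, j \text{ with } y_i < j, \\
                & w \cdot x_i \ge b_j + 1 - \xi_{i,j} && \text{for all } i, j \text{ with } y_i \ge j, \\
                & \xi_{i,j} \ge 0, \qquad b_1 \le b_2 \le \dots \le b_T .
\end{aligned}
```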

  29. Learning Multiple Thresholds • Gaussian Processes • [Chu & Ghahramani, 2005] • Decision Trees • [Kramer et al., 2001] • Neural Nets • RankProp [Caruana et al., 1996] • SVMs & Perceptrons • PRank [Crammer & Singer, 2001] • [Chu & Keerthi, 2005]

  30. Option 2: Voting Classifiers • Use T different training sets • Classifier 1 predicts 0 vs 1,2,…T • Classifier 2 predicts 0,1 vs 2,3,…T … • Classifier T predicts 0,1,…,T-1 vs T • Final prediction is combination • E.g., sum of predictions • Recent work • McRank [Li et al., 2007] • [Qin et al., 2007]
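
A rough sketch of this classifier-combination scheme. scikit-learn's LogisticRegression is used as a stand-in base learner (the slide does not prescribe one), and the helper names and toy data are made up; McRank-style methods combine class probabilities, but the plain sum of binary predictions below matches the "sum of predictions" combination mentioned above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_ordinal_ensemble(X, y, T):
    """Train T binary classifiers; classifier j answers 'is the label > j?'."""
    return [LogisticRegression().fit(X, (y > j).astype(int)) for j in range(T)]

def predict_ordinal(models, X):
    """Combine by summing the binary predictions; each positive vote adds one relevance level."""
    return sum(m.predict(X) for m in models)

# Toy data: one feature, ordinal labels 0..2, so T = 2 binary classifiers.
X = np.array([[0.1], [0.3], [0.5], [0.7], [0.9], [1.1]])
y = np.array([0, 0, 1, 1, 2, 2])
models = fit_ordinal_ensemble(X, y, T=2)
print(predict_ordinal(models, X))
```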

  31. Severe class imbalance • Near perfect performance by always predicting 0

  32. Option 3: Pairwise Preferences Most popular approach for IR applications Learn model to minimize pairwise disagreements Fraction of pairwise agreements = ROC-Area (AUC)

  33. 2 pairwise disagreements

  34. Optimizing Pairwise Preferences • Consider instances (x1,y1) and (x2,y2) • Label order has y1 > y2

  35. Optimizing Pairwise Preferences • Consider instances (x1,y1) and (x2,y2) • Label order has y1 > y2 • Create new training instance • (x’, +1) where x’ = (x1 – x2) • Repeat for all instance pairs with label order preference

  36. Optimizing Pairwise Preferences • Result: new training set! • Often represented implicitly • Has only positive examples • Mispredicting means that a lower-ordered instance received a higher score than a higher-ordered instance.
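
A minimal sketch of the pairwise transformation just described: every pair of instances with a label-order preference becomes one positive example on the difference vector. In practice this set is kept implicit, since it has O(n^2) pairs; the function name pairwise_transform and the toy data are mine.

```python
import numpy as np

def pairwise_transform(X, y):
    """Build the pairwise training set: one positive example x' = x_i - x_j
    for every pair with y_i > y_j."""
    diffs = [X[i] - X[j]
             for i in range(len(y))
             for j in range(len(y))
             if y[i] > y[j]]
    return np.array(diffs), np.ones(len(diffs))

# Toy feature vectors for one query, with ordinal relevance labels.
X = np.array([[1.0, 0.2], [0.4, 0.4], [0.1, 0.9]])
y = np.array([2, 1, 0])
X_pairs, y_pairs = pairwise_transform(X, y)
print(X_pairs.shape)   # (3, 2): one difference vector per ordered pair
```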

  37. Pairwise SVM Formulation [Herbrich et al., 1999] Training can be reduced to O(n log n) time [Joachims, 2005].
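
The constraint set that went missing from this slide is, in the standard ranking-SVM form (a reconstruction from the surrounding description, so details may differ from the original slide):

```latex
\begin{aligned}
\min_{w,\,\xi}\;\; & \tfrac{1}{2}\|w\|^2 \;+\; C \sum_{(i,j)\,:\,y_i > y_j} \xi_{ij} \\
\text{s.t.}\;\; & w \cdot (x_i - x_j) \ge 1 - \xi_{ij}, \qquad \xi_{ij} \ge 0
                \quad \text{for all pairs with } y_i > y_j .
\end{aligned}
```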

  38. Optimizing Pairwise Preferences Neural Nets RankNet [Burges et al., 2005] Boosting & Hedge-Style Methods [Cohen et al., 1998] RankBoost [Freund et al., 2003] [Long & Servidio, 2007] SVMs [Herbrich et al., 1999] SVM-perf [Joachims, 2005] [Cao et al., 2006]

  39. Rank-Based Measures Pairwise preferences are not quite right They assign equal penalty to errors no matter where in the ranking they occur People (mostly) care about the top of the ranking The IR community uses rank-based measures that capture this property.

  40. Rank-Based Measures Binary relevance Precision@K (P@K) Mean Average Precision (MAP) Mean Reciprocal Rank (MRR) Multiple levels of relevance Normalized Discounted Cumulative Gain (NDCG)

  41. Precision@K Set a rank threshold K Compute % relevant in top K Ignores documents ranked lower than K Ex: Prec@3 of 2/3 Prec@4 of 2/4 Prec@5 of 3/5
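
A small sketch of Precision@K. The relevance list below is one 0/1 labeling consistent with the numbers quoted on the slide (the original example figure is missing), and the function name is mine.

```python
def precision_at_k(relevance, k):
    """Fraction of the top-k ranked documents that are relevant.
    `relevance` is a list of 0/1 labels in rank order."""
    return sum(relevance[:k]) / k

ranking = [1, 0, 1, 0, 1]                  # hypothetical ranking, top to bottom
for k in (3, 4, 5):
    print(k, precision_at_k(ranking, k))   # 2/3, 2/4, 3/5 as on the slide
```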

  42. Mean Average Precision Consider the rank position of each relevant doc K1, K2, … KR Compute Precision@K for each K1, K2, … KR Average Precision = average of those P@K values MAP is Average Precision averaged across multiple queries/rankings
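
A sketch of Average Precision and MAP as just defined; the function names and the example ranking are illustrative assumptions.

```python
def average_precision(relevance):
    """Average of Precision@K taken at the rank K of each relevant doc (0/1 labels in rank order)."""
    hits, precisions = 0, []
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(rankings):
    """MAP = mean of Average Precision over multiple queries/rankings."""
    return sum(average_precision(r) for r in rankings) / len(rankings)

print(average_precision([1, 0, 1, 0, 1]))   # (1/1 + 2/3 + 3/5) / 3 ≈ 0.756
```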

  43. Mean Reciprocal Rank Consider the rank position, K, of the first relevant doc Reciprocal Rank score = 1/K MRR is the mean RR across multiple queries
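
A sketch of Reciprocal Rank and MRR; scoring 0 when no relevant document is retrieved is an assumed convention, and the toy rankings are made up.

```python
def reciprocal_rank(relevance):
    """1/K where K is the rank of the first relevant doc (0 if none is relevant)."""
    for k, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / k
    return 0.0

def mean_reciprocal_rank(rankings):
    """MRR = mean Reciprocal Rank across queries."""
    return sum(reciprocal_rank(r) for r in rankings) / len(rankings)

print(mean_reciprocal_rank([[0, 1, 0], [1, 0, 0]]))   # (1/2 + 1/1) / 2 = 0.75
```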

  44. NDCG Normalized Discounted Cumulative Gain Multiple levels of relevance DCG: contribution of the ith rank position, e.g., (2^rel_i – 1) / log2(1 + i) NDCG is DCG normalized by the DCG of the best possible ranking, so a perfect ranking has NDCG = 1
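
A sketch of DCG and NDCG. The gain (2^rel − 1) and log2(1 + rank) discount are one common convention in the learning-to-rank literature; the slide's exact formula did not survive the transcript, so treat that choice (and the toy example) as an assumption.

```python
import math

def dcg(relevance):
    """DCG with gain (2^rel - 1) and log2(1 + rank) discount (one common convention)."""
    return sum((2 ** rel - 1) / math.log2(1 + i)
               for i, rel in enumerate(relevance, start=1))

def ndcg(relevance):
    """DCG normalized by the DCG of the ideal ordering, so a perfect ranking scores 1."""
    ideal = dcg(sorted(relevance, reverse=True))
    return dcg(relevance) / ideal if ideal > 0 else 0.0

print(ndcg([2, 0, 1]))   # < 1: the level-1 doc is ranked below a level-0 doc
print(ndcg([2, 1, 0]))   # 1.0: ideal ordering
```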

  45. Optimizing Rank-Based Measures Let’s directly optimize these measures As opposed to some proxy (pairwise prefs) But… Objective function no longer decomposes Pairwise prefs decomposed into each pair Objective function flat or discontinuous

  46. Discontinuity Example • NDCG = 0.63

  47. Discontinuity Example • NDCG computed using rank positions • Ranking via retrieval scores

  48. Discontinuity Example • NDCG computed using rank positions • Ranking via retrieval scores • Slight changes to model parameters • Slight changes to retrieval scores • No change to ranking • No change to NDCG

  49. Discontinuity Example • NDCG computed using rank positions • Ranking via retrieval scores • Slight changes to model parameters • Slight changes to retrieval scores • No change to ranking • No change to NDCG NDCG is discontinuous w.r.t. model parameters!
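
A tiny numerical illustration of the point made on slides 47-49, under assumed data: with a linear scoring model, sweeping a weight changes the retrieval scores continuously, but NDCG only changes when the induced ranking changes, so it is piecewise constant in the model parameters (flat regions separated by jumps). All features, relevance labels, weights, and function names below are made up.

```python
import math
import numpy as np

def ndcg_of_scores(scores, relevance):
    """NDCG of the ranking induced by sorting documents by score (descending)."""
    order = np.argsort(-scores)
    gain = lambda r, i: (2 ** r - 1) / math.log2(1 + i)
    dcg = sum(gain(relevance[d], i) for i, d in enumerate(order, start=1))
    ideal = sum(gain(r, i) for i, r in enumerate(sorted(relevance, reverse=True), start=1))
    return dcg / ideal

# Linear model: score = w . x. Sweep the second weight and watch NDCG stay flat, then jump.
X = np.array([[1.0, 0.0], [0.8, 0.5], [0.2, 1.0]])
relevance = [2, 0, 1]
for w1 in [0.0, 0.25, 0.5, 0.75, 1.0, 1.25]:
    print(w1, round(ndcg_of_scores(X @ np.array([1.0, w1]), relevance), 3))
```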

  50. [Yue & Burges, 2007]
