
Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval


Presentation Transcript


  1. Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval Ben Carterette and Praveen Chandar Dept. of Computer and Information Science, University of Delaware, Newark, DE (CIKM '09) Date: 2010/05/03 Speaker: Lin, Yi-Jhen Advisor: Dr. Koh, Jia-Ling

  2. Agenda • Introduction - Motivation, Goal • Faceted Topic Retrieval - Task, Evaluation • Faceted Topic Retrieval Models - four models • Experiments & Results • Conclusion

  3. Introduction - Motivation • Modeling documents as independently relevant does not necessarily provide the optimal user experience.

  4. Introduction - Motivation • A traditional evaluation measure would reward System 1 for its higher recall, but we actually prefer System 2, since it covers more distinct information.

  5. Introduction • Novelty and diversity become the new definition of relevance and the basis of new evaluation measures. • They can be achieved by retrieving documents that are relevant to the query but cover different facets of the topic. We call this faceted topic retrieval.

  6. Introduction - Goal • The faceted topic retrieval system must be able to find a small set of documents that covers all of the facets • 3 documents that cover 10 facets are preferable to 5 documents that cover the same 10 facets

  7. Faceted Topic Retrieval - Task Define the task in terms of • Information need : A faceted topic retrieval information need is one that has a set of answers – facets – that are clearly delineated • How that need is best satisfied : Each answer is fully contained within at least one document

  8. Faceted Topic Retrieval - Task • Example: an information need maps to a set of facets (answers): shift to coal; invest in next-generation technologies; shift to biodiesel; invest in renewable energy sources; increase use of renewable energy sources; double ethanol in the gas supply

  9. Faceted Topic Retrieval • Input: a query (a short list of keywords) • Output (our system): a ranked list of documents D1, D2, …, Dn that contains as many unique facets as possible

  10. Faceted Topic Retrieval - Evaluation • S-recall • S-precision • Redundancy

  11. Evaluation – an example for S-recall and S-precision • Total: 10 facets (assume the facets in different documents do not overlap)

  12. Evaluation – an example for Redundancy
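To make the measures above concrete, here is a minimal Python sketch. The `facets` dict (document ID → set of facet IDs) is a hypothetical input; `s_recall` follows the standard subtopic-recall definition, while `redundancy` is only one plausible reading of the paper's measure.

```python
# Minimal sketch of the evaluation measures. `facets` is a hypothetical
# mapping from each document ID to the set of facet IDs it contains.

def s_recall(ranking, facets, total_facets, k):
    """Fraction of all facets covered by the top-k documents."""
    covered = set()
    for d in ranking[:k]:
        covered |= facets[d]
    return len(covered) / total_facets

def redundancy(ranking, facets, k):
    """Fraction of facet occurrences in the top k that repeat an
    already-seen facet (one plausible reading of the measure)."""
    seen, repeats, total = set(), 0, 0
    for d in ranking[:k]:
        for f in facets[d]:
            total += 1
            if f in seen:
                repeats += 1
            seen.add(f)
    return repeats / total if total else 0.0
```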

  13. Faceted topic retrieval models • Four models: - MMR (Maximal Marginal Relevance) - Probabilistic Interpretation of MMR - Greedy Result Set Pruning - A Probabilistic Set-Based Approach

  14. Models 1 & 2: MMR and a Probabilistic Interpretation of MMR • MMR selects the next document by trading off relevance to the query against similarity to the already-selected documents S: MMR(Di) = λ · sim1(Q, Di) − (1 − λ) · max_{Dj ∈ S} sim2(Di, Dj) • The probabilistic interpretation derives an MMR-style ranking from probabilities of relevance and novelty; the derivation sets c1 = 0 and c3 = c4
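A minimal sketch of MMR re-ranking under the formula above; `rel(d)` and `sim(a, b)` are assumed callables (a query-relevance score and an inter-document similarity such as cosine), and `lam` is the trade-off λ.

```python
# Sketch of MMR re-ranking: greedily pick the document that best balances
# relevance against similarity to what has already been selected.

def mmr_rerank(candidates, rel, sim, lam=0.5, k=10):
    selected, pool = [], list(candidates)
    while pool and len(selected) < k:
        def mmr_score(d):
            max_sim = max((sim(d, s) for s in selected), default=0.0)
            return lam * rel(d) - (1 - lam) * max_sim
        best = max(pool, key=mmr_score)
        selected.append(best)
        pool.remove(best)
    return selected
```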

  15. 3. Greedy Result Set Pruning • First, rank documents by relevance alone, without considering novelty • Second, step down the ranked list and prune documents whose similarity to a higher-ranked document exceeds some threshold θ, i.e., at rank i, remove any document Dj, j > i, with sim(Dj, Di) > θ
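A minimal sketch of the pruning step, assuming `ranking` is already sorted by relevance and `sim(a, b)` is a similarity function such as cosine.

```python
# Sketch of greedy result-set pruning: keep a document only if it is not
# too similar (> theta) to any higher-ranked document already kept.

def prune(ranking, sim, theta):
    kept = []
    for d in ranking:
        if all(sim(d, prev) <= theta for prev in kept):
            kept.append(d)
    return kept
```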

  16. 4. A Probabilistic Set-Based Approach • P(Fj ∈ D): the probability that the document set D contains facet Fj • The probability that a facet Fj occurs in at least one document in a set D is P(Fj ∈ D) = 1 − ∏_{Di ∈ D} (1 − P(Fj ∈ Di)) • The probability that all of the facets in a set F are captured by the documents D is P(F ⊆ D) = ∏_{Fj ∈ F} [1 − ∏_{Di ∈ D} (1 − P(Fj ∈ Di))]
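A small sketch of these coverage probabilities, assuming `p[i][j]` holds the per-document probability P(Fj ∈ Di) for the documents in the set, with facet occurrences treated as independent, as in the formulas above.

```python
import math

# p[i][j] = P(facet Fj occurs in document Di) for the documents in the set.

def p_facet_covered(p, j):
    """P(Fj occurs in at least one document): 1 - prod_i (1 - p[i][j])."""
    miss = 1.0
    for row in p:
        miss *= 1.0 - row[j]
    return 1.0 - miss

def p_all_facets_covered(p, num_facets):
    """P(every facet is covered by the set), assuming facet independence."""
    return math.prod(p_facet_covered(p, j) for j in range(num_facets))
```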

  17. 4. A Probabilistic Set-Based Approach • 4.1 Hypothesizing Facets • 4.2 Estimating Document-Facet Probabilities • 4.3 Maximizing Likelihood

  18. 4.1 Hypothesizing Facets Two unsupervised probabilistic methods: • Relevance modeling • Topic modeling with LDA Instead of extracting facets directly from any particular word or phrase, we build a "facet model" P(w|F)

  19. 4.1 Hypothesizing Facets • Since we do not know the facet terms or the set of documents relevant to each facet, we estimate them from the retrieved documents • Obtain m facet models from the top m retrieved documents by taking each document together with its k nearest neighbors as the basis for a facet model
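A sketch of that construction, assuming `vectors` maps each document to a term vector and `cosine(u, v)` is a standard cosine similarity; both names are hypothetical.

```python
# Sketch of hypothesizing facets: each of the top-m retrieved documents,
# together with its k nearest neighbors, seeds one facet model.

def facet_document_sets(top_docs, all_docs, cosine, vectors, m=10, k=5):
    sets = []
    for d in top_docs[:m]:
        neighbors = sorted(
            (o for o in all_docs if o != d),
            key=lambda o: cosine(vectors[d], vectors[o]),
            reverse=True,
        )[:k]
        sets.append([d] + neighbors)
    return sets
```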

  20. Relevance modeling • Estimate m "facet models" P(w|Fj) from the set of retrieved documents using the so-called RM2 estimator: P(w|Fj) ∝ P(w) ∏_k Σ_{D ∈ DFj} P(fk|D) P(D|w), where DFj is the set of documents relevant to facet Fj and the fk are facet terms

  21. Topic modeling with LDA • The probabilities P(w|Fj) and P(Fj) can be found through expectation maximization
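The slides do not give implementation details; as an illustration only, here is a sketch using scikit-learn's LatentDirichletAllocation (which fits LDA by variational inference rather than classic EM), with the 50 subtopics mentioned later in the deck.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def lda_facets(doc_texts, n_facets=50):
    vec = CountVectorizer(stop_words="english")
    X = vec.fit_transform(doc_texts)
    lda = LatentDirichletAllocation(n_components=n_facets, random_state=0)
    theta = lda.fit_transform(X)   # per-document facet mixtures ~ P(Fj|Di)
    phi = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
    return theta, phi, vec         # rows of phi approximate P(w|Fj)
```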

  22. 4.2 Estimating Document-Facet Probabilities • Both the facet relevance model and the LDA model produce generation probabilities P(Di|Fj) • P(Di|Fj): the probability that sampling terms from the facet model Fj will produce document Di
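A minimal sketch of that generation probability under a unigram assumption, computed in log space to avoid underflow; `doc_terms` (term → count) and `p_w_f` (term → P(w|Fj)) are hypothetical inputs, with a small floor for unseen terms.

```python
import math

def log_p_doc_given_facet(doc_terms, p_w_f, floor=1e-12):
    """log P(Di|Fj) = sum_w tf(w, Di) * log P(w|Fj) under a unigram model."""
    return sum(
        count * math.log(p_w_f.get(w, floor))
        for w, count in doc_terms.items()
    )
```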

  23. 4.3 Maximizing Likelihood • Define the likelihood function L(y) over indicator variables yi marking which documents are selected • Constraint: Σ yi = K, where K is the hypothesized minimum number of documents required to cover the facets • Maximizing L(y) is an NP-hard problem • Approximate solution: for each facet Fj, take the document Di with maximum P(Fj ∈ Di)
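A sketch of the greedy approximation, assuming `p[i][j]` = P(Fj ∈ Di) as before; because different facets can pick the same document, the selected set is often smaller than the number of facets.

```python
def greedy_cover(p, num_docs, num_facets):
    """For each facet, take the document most likely to contain it."""
    chosen = set()
    for j in range(num_facets):
        best = max(range(num_docs), key=lambda i: p[i][j])
        chosen.add(best)
    return sorted(chosen)
```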

  24. Experiment - Data • A query (a short list of keywords) is run against the TDT5 corpus (278,109 documents) with a query-likelihood language model • The top 130 retrieved documents D1, …, D130 are kept

  25. Experiment - Data • Two assessors judged the top 130 retrieved documents for each of 60 queries • 44.7 relevant documents per query on average • Each document contains 4.3 facets on average • 39.2 unique facets per query on average (roughly one unique facet per relevant document) • Agreement: 72% of all relevant documents were judged relevant by both assessors

  26. Experiment - Data • TDT5 sample topic definition: query and judgments (example not reproduced in the transcript)

  27. Experiment – Retrieval Engines Using the Lemur toolkit: • LM baseline: a query-likelihood language model • RM baseline: pseudo-relevance feedback with a relevance model • MMR: query-similarity scores from the LM baseline and cosine similarity for novelty • AvgMix (Prob MMR): the probabilistic MMR model using query-likelihood scores from the LM baseline and the AvgMix novelty score • Pruning: removing documents from the LM baseline ranking based on cosine similarity • FM: the set-based facet model

  28. Experiment – Retrieval Engines • FM: the set-based facet model • FM-RM: each of the top m documents and its K nearest neighbors become a "facet model" P(w|Fj); then compute the probability P(Di|Fj) • FM-LDA: use LDA to discover subtopics zj and obtain P(zj|D); we extract 50 subtopics

  29. Experiments - Evaluation • Five-fold cross-validation is used to train and test the systems • 48 queries in four folds are used to train model parameters • The trained parameters are used to obtain ranked results on the remaining 12 queries • We report S-recall at the minimum optimal rank, redundancy, and MAP

  30. Results

  31. Results

  32. Conclusion • We defined a type of novelty retrieval task called faceted topic retrieval: retrieve the facets of an information need within a small set of documents. • We presented two novel models: one that prunes a retrieval ranking and one formally motivated probabilistic set-based model. • Both models are competitive with MMR and outperform another probabilistic model.
