
A Generative Retrieval Model for Structured Documents


Presentation Transcript


  1. A Generative Retrieval Model for Structured Documents Le Zhao, Jamie Callan Language Technologies Institute, School of Computer Science, Carnegie Mellon University Oct 2008

  2. Background • Structured documents • Author-edited fields • Library systems: title, metadata of books • Web documents: HTML, XML • Automatic annotations • Part of speech, named entity, semantic role • Structured queries • Human-written • Automatically generated

  3. Example Structured Retrieval Queries • XML element retrieval • NEXI queries (Wikipedia XML): a) //article[about(., music)] b) //article[about(.//section, music)]//section[about(., pop)] • Question Answering • Indri query (ASSERT-style SRL annotation): #combine[sentence]( #combine[target]( love #combine[./arg0]( #any:person ) #combine[./arg1]( Mary ) ) ) • (Figure: example documents with sections about music and pop, and an SRL-annotated example sentence "[John] [loves] [Mary]" with arg0 = John, arg1 = Mary)

  4. Motivation • Basis: Language Model + Inference Network (the Indri search engine / Lemur) • Already supports field indexing & retrieval, and retrieving relations between annotations • Flexible query language – allows testing new query forms quickly • Main problems • Approximate matching (structure & keyword) • Evidence combination • Extension from the keyword retrieval model • Approximate structure & keyword matching • Combining evidence from inner fields • Goal: outperform keyword retrieval in precision, through • A coherent structured retrieval model & better understanding • Better smoothing & guiding query formulation • Finer control via accurate & robust structured queries

  5. Roadmap • Brief overview of Indri Field retrieval • Existing Problems • The generative structured retrieval model • Term & Field level Smoothing • Evidence combination alternatives • Experiments • Conclusions

  6. Indri Document Retrieval • “#combine(iraq war)” • Scoring scope is “document” • Returns a list of scores for documents • Language model built from the scoring scope, smoothed with the collection model • Because of smoothing, partial matches can also be returned (see the sketch below)
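For reference, a minimal sketch of the query-likelihood scoring this slide describes, assuming the usual Dirichlet smoothing of the document model with the collection model (the slide does not fix the smoothing method or parameters, so this is the standard Indri-style formulation rather than the exact configuration used here):

```latex
% Query likelihood of document D for query Q = q_1 ... q_n;
% the document model is Dirichlet-smoothed with the collection model C.
P(Q \mid D) = \prod_{i=1}^{n} P(q_i \mid D),
\qquad
P(q_i \mid D) = \frac{c(q_i, D) + \mu \, P(q_i \mid C)}{|D| + \mu}
```

Because the collection term P(q_i|C) is never zero, a document matching only "iraq" or only "war" still receives a nonzero score, which is the partial matching mentioned above.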

  7. Indri Field Retrieval • “#combine[title](iraq war)” • Scoring scope is “title” • Returns a list of scores for titles • Language model built from the scoring scope (title), smoothed with the document and collection models • Results on the Wikipedia collection:

    score      document-number  start  end  content
    -1.19104   199636.xml       0      2    Iraq War
    -1.58453   1595906.xml      0      3    Anglo-Iraqi War
    -1.58734   14889.xml        0      3    Iran-Iraq War
    -1.87811   184613.xml       0      4    2003 Iraq war timeline
    -2.07668   2316643.xml      0      5    Canada and the Iraq War
    -2.09957   202304.xml       0      5    Protests against the Iraq war
    -2.23997   2656581.xml      0      6    Saddam's Trial and Iran-Iraq War
    -2.35804   1617564.xml      0      7    List of Iraq War Victoria Cross

  Because of smoothing, partial matches can also be returned.

  8. Existing Problems

  9. Evidence Combination • Topic: A document with multiple sections about the Iraq war, which discusses Bush’s exit strategy. • Query: #combine( #combine[section](iraq war) bush #1(exit strategy) ) • The [section] sub-query could return scores (0.2, 0.2, 0.1, 0.002) for one document • Some options • #max (Bilotti et al 2007): Only considers one match • #or: Favors many matches, even if weak matches • #and: Biased against many matches, even if good matches • #average: Favors many good matches, hurt by weak matches • … • What about documents that don’t contain a section element, but do have a lot of matching terms?

  10. Evidence Combination Methods (1)
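The formulas from this slide are not preserved in the transcript. As a rough illustration of the operators listed on slide 9, here is a minimal Python sketch (hypothetical helper names; the actual Indri/Lemur operators work on belief nodes in an inference network and may differ in detail), applied to the per-section scores of one document:

```python
def combine_max(scores):
    """MAX: the document gets the score of its single best-matching section."""
    return max(scores)

def combine_or(scores):
    """Noisy-OR: favors many matches, even weak ones."""
    result = 1.0
    for s in scores:
        result *= (1.0 - s)
    return 1.0 - result

def combine_and(scores):
    """AND (product): biased against documents with many sections, even good ones."""
    result = 1.0
    for s in scores:
        result *= s
    return result

def combine_avg(scores):
    """Average: favors many good matches, but is pulled down by weak ones."""
    return sum(scores) / len(scores)

# Per-section scores from the slide 9 example document.
section_scores = [0.2, 0.2, 0.1, 0.002]
for name, combine in [("max", combine_max), ("or", combine_or),
                      ("and", combine_and), ("avg", combine_avg)]:
    print(name, combine(section_scores))
```

None of these heuristics by itself answers the question at the end of slide 9: a document with no [section] element produces an empty score list here, which is what the prior ("empty") fields of the new framework address.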

  11. Bias toward short fields • Topic: Who loves Mary? • Query: #combine( #combine[target]( Loves #combine[./arg0]( #any:person ) #combine[./arg1]( Mary ) ) ) • P_MLE(q_i|E) = count(q_i, E) / |E| • Produces very skewed scores when |E| is small • E.g., if |E| = 1, P_MLE(q_i|E) is either 0 or 1 • Biases toward #combine[target](Loves) • [target] is usually length 1, while arg0/arg1 are longer • The ratio between having and not having a [target] match is larger than that of [arg0/1] with Jelinek-Mercer smoothing (see the numerical sketch below)
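To make the bias concrete, a small numerical sketch (not from the slides; the λ and background probability values are illustrative) comparing a length-1 field such as [target] with longer fields under Jelinek-Mercer smoothing. The ratio between the smoothed probability with a match and without one shrinks as the field grows, so short fields dominate the combined score:

```python
def jm_smoothed(count, field_len, p_background, lam=0.6):
    """Jelinek-Mercer: lam * MLE(field) + (1 - lam) * background (document/collection) model."""
    return lam * (count / field_len) + (1.0 - lam) * p_background

p_bg = 0.001  # illustrative background probability of the query term

for field_len in (1, 5, 20):  # e.g. a [target] of length 1 vs. longer [arg0]/[arg1] extents
    with_match = jm_smoothed(1, field_len, p_bg)
    without_match = jm_smoothed(0, field_len, p_bg)
    print(field_len, with_match / without_match)
```

The two-level Dirichlet smoothing introduced on slide 17 is designed to remove this dependence on field length.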

  12. The Generative Structured Retrieval Model • A new framework for structured retrieval • A new term-level smoothing method

  13. A New Framework • #combine( #combine[section](iraq war) bush #1(exit strategy) ) • Query • Traditional: merely the sampled terms • New: specifies a graphical model, a generation process • Scoring scope is “document” • For one document, calculate the probability of the query under that document’s model • Sections are used as evidence of relevance for the document • a hidden variable in the graphical model • In general, inner fields are hidden and used to score outer fields • Hidden variables are summed over to produce the final score • Averaging the scores from the section fields (uniform prior over sections)
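One plausible way to write the generation process the slide describes (a sketch consistent with the slide's wording, not a verbatim reproduction of the paper's equations): the [section] fields S of document D are hidden variables with a uniform prior, and the section sub-query is scored by marginalizing over them:

```latex
% Marginalizing over the k hidden sections S of document D (uniform prior P(S|D) = 1/k):
P(Q_{\mathrm{section}} \mid D)
  = \sum_{S \in D} P(S \mid D)\, P(Q_{\mathrm{section}} \mid S, D, C)
  = \frac{1}{k} \sum_{S \in D} P(Q_{\mathrm{section}} \mid S, D, C)
```

This marginalization is the probabilistic average referred to on slide 15; P(Q_section | S, D, C) is the likelihood of the section sub-query under the section model smoothed with the document and collection models.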

  14. A New Framework: Field-Level Smoothing • Term-level smoothing (traditional): no [section] contains iraq or war – add “prior terms” to [section], a Dirichlet prior from the collection model • Field-level smoothing (new): no [section] field in the document – add “prior fields” (+1 prior field) • (Figure: term-level smoothing of a section S with the collection model P(w|C), Dirichlet parameter μ, section length |S|; field-level smoothing over the sections of document D, each scored with P(w|section, D, C))
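A minimal Python sketch of how the prior ("empty") field combines with the probabilistic average (illustrative only; the function and parameter names are hypothetical, and the real Indri/Lemur scoring operates in log space on inference-network nodes). A document with no matching [section] still contributes one prior field whose language model is just the document/collection model, so it can be scored instead of being dropped:

```python
def avg_with_prior_field(section_scores, prior_field_score, n_prior_fields=1):
    """Probabilistic average over observed sections plus n 'prior' (empty) fields.

    section_scores    -- P(Q_section | S, D, C) for each real [section] in the document
    prior_field_score -- the score of an empty section backed off entirely to the
                         document/collection model, i.e. P(Q_section | D, C)
    """
    scores = list(section_scores) + [prior_field_score] * n_prior_fields
    return sum(scores) / len(scores)

# A document with several matching sections vs. one with no [section] field at all.
print(avg_with_prior_field([0.2, 0.2, 0.1, 0.002], prior_field_score=0.01))
print(avg_with_prior_field([], prior_field_score=0.01))
```

This gives the soft matching of sections listed among the advantages on the next slide: a document without any [section] field still receives a nonzero, comparable score.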

  15. A New Framework: Advantages • Soft matching of sections • Matches documents even without [section] fields • “prior fields” (called “empty fields” in Bilotti et al 2007) • Aggregation of all matching fields • Heuristics: P-OR, Max, … • From our generative model: Probabilistic-Average

  16. Reduction to Keyword Likelihood Model • Assume a [term] tag around each term in the collection • Assume no document-level smoothing (μ_d = +inf, λ_d = 0); then, no matter how many empty smoothing fields there are, the AVG model degenerates to the keyword retrieval model, in the following way: • #combine( #combine[term]( u ) #combine[term]( v ) ) == #combine( u v ) • (the same collection-level smoothing, Dirichlet or Jelinek-Mercer, is preserved)

  17. Term-Level Smoothing Revisited • Two-level Jelinek-Mercer (traditional) • Equivalently, a more general parameterization • Two-level Dirichlet (new) • Corrects J-M’s bias toward shorter fields • Relative gain of matching is independent of field length (see the sketch below)
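The smoothing formulas themselves are not reproduced in the transcript. A sketch of the standard two-level forms the slide presumably refers to, assuming the field model E is smoothed with the document model D, which is itself smoothed with the collection model C (the exact parameterization in the paper may differ):

```latex
% Two-level Jelinek-Mercer: linear interpolation of field, document, and collection models
P_{\mathrm{JM}}(w \mid E, D, C)
  = \lambda_E \, P_{\mathrm{MLE}}(w \mid E)
  + \lambda_D \, P_{\mathrm{MLE}}(w \mid D)
  + (1 - \lambda_E - \lambda_D) \, P(w \mid C)

% Two-level Dirichlet: field counts smoothed with a Dirichlet prior drawn from
% the (already collection-smoothed) document model
P_{\mathrm{Dir}}(w \mid E, D, C)
  = \frac{c(w, E) + \mu_E \, P(w \mid D, C)}{|E| + \mu_E},
\qquad
P(w \mid D, C) = \frac{c(w, D) + \mu_D \, P(w \mid C)}{|D| + \mu_D}
```

Under the Dirichlet form, the ratio between a field that matches a query term once and one that does not is (1 + μ_E·P(w|D,C)) / (μ_E·P(w|D,C)), which does not depend on |E|: the relative gain of matching is independent of field length, unlike with Jelinek-Mercer.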

  18. Experiments • Smoothing • Evidence combination methods

  19. Datasets • XML retrieval: INEX 06, 07 (Wikipedia collection) • Goal: evaluate evidence combination (and smoothing) • Topics (modified): element retrieval -> document retrieval, e.g. #combine(#combine[section](wifi security)) • Assessments (modified): any element relevant -> document relevant • Smoothing parameters: trained on INEX 06 (62 topics), tested on INEX 07 (60 topics) • Question Answering: TREC 2002, AQUAINT corpus • Topics: training: 55 original topics -> 367 relevant sentences (new topics); test: 54 original topics -> 250 relevant sentences (new topics) • For example, Question: “Who loves Mary?”, Relevant sentence: “John says he loves Mary”, Query: #combine[target]( love #combine[./arg1](Mary) ) • Relevance feedback setup, stronger than (Bilotti et al 2007)

  20. Effects of 2-level Dirichlet smoothing Table 3. A comparison of two-level Jelinek-Mercer and two-level Dirichlet smoothing on the INEX and QA datasets. *: significance level < 0.04, **: significance level < 0.002, ***: significance level < 0.00001

  21. Optimal Smoothing Parameters • Optimization with grid search • Optimal Dirichlet values are related to the average length of the fields being queried

  22. Evidence Combination Methods • For QA, MAX is best • For INEX, evaluation at the document level does not discount irrelevant text portions, so it is not clear which combination method performs best

  23. Better Evaluation for INEX Datasets • NDCG • Assumptions: the degree of relevance is somehow given; the user spends a similar amount of effort on each document, and effort decreases with log-rank • With more informative element-level judgments: • Degree of relevance for a document = relevance density, the proportion of relevant text (in bytes) in the document • Discount lower-ranked relevant documents not by the number of documents ranked ahead, but by the length (in bytes) of text ranked ahead • Effectively discounts irrelevant text ranked ahead (see the sketch below)
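A minimal Python sketch of the modified measure as described above (illustrative; the names are hypothetical, and details such as the exact discount base and normalization may differ from the paper): the gain of a document is its relevance density (relevant bytes / total bytes), and the discount grows with the bytes ranked ahead rather than with the number of documents ranked ahead.

```python
import math

def byte_discounted_ndcg(ranked_docs, cutoff=None):
    """ranked_docs: list of (relevant_bytes, total_bytes) pairs in ranked order.

    Gain of a document = relevance density = relevant_bytes / total_bytes.
    The discount grows with the bytes of text ranked ahead of the document,
    so irrelevant text ranked ahead is effectively penalized.
    """
    docs = list(ranked_docs)[:cutoff] if cutoff else list(ranked_docs)

    def dcg(order):
        score, bytes_ahead = 0.0, 0
        for rel_bytes, total_bytes in order:
            density = rel_bytes / total_bytes
            # Illustrative discount: log of the kilobytes ranked ahead, offset so the
            # top-ranked document is undiscounted.
            score += density / math.log2(2.0 + bytes_ahead / 1000.0)
            bytes_ahead += total_bytes
        return score

    ideal_dcg = dcg(sorted(docs, key=lambda d: d[0] / d[1], reverse=True))
    return dcg(docs) / ideal_dcg if ideal_dcg > 0 else 0.0

# Toy ranking of (relevant bytes, document length in bytes).
print(byte_discounted_ndcg([(800, 1000), (0, 5000), (300, 1200)]))
```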

  24. Measuring INEX topics with NDCG • * p < 0.007 between AVG and MAX or AVG and OR • No significant difference between AVG and keyword!

  25. Error Analysis for INEX06 Queries and Correcting INEX07 Queries • Two kinds of changes (looking only at the training set) • Semantic mismatch with topic (mainly the keyword query) (22/70) • Lacking alternative fields: [image] -> [image,figure] • Wrong AND|OR semantics: (astronaut AND cosmonaut) -> (astronaut OR cosmonaut) • Misspellings: VanGogh -> Van Gogh • Over-restricted query terms using phrases: #1(origin of universe) -> #uw4(origin universe) • All [article] restrictions -> whole document (34/70) • Proofreading test (INEX07) queries • Retrieval results of the queries are not referenced in any way • Only looked at the keyword query + topic description

  26. Performance after query correction • df = 30: p < 0.006 for NDCG@10; p < 0.0004 for NDCG@20; p < 0.002 for NDCG@30

  27. Conclusions • A structured query specifies a generative model for P(Q|D); model parameters are estimated from D, and D is ranked by P(Q|D) • The best evidence combination strategy is task dependent • Dirichlet smoothing corrects the bias toward short fields and outperforms Jelinek-Mercer • Guidance for structured query formulation • Robust structured queries can outperform keyword queries

  28. Acknowledgements • Paul Ogilvie • Matthew Bilotti • Eric Nyberg • Mark Hoy

  29. Thanks! Comments & Questions?
