
Towards Information Retrieval with More Inferential Power

Presentation Transcript


  1. Towards Information Retrieval with More Inferential Power Jian-Yun Nie Department of Computer Science University of Montreal nie@iro.umontreal.ca

  2. Background • IR Goal: • Retrieve relevant information from a large collection of documents to satisfy the user’s information need • Traditional relevance: • Query Q and document D in a given corpus: Score(Q,D) • User-independent • Knowledge-independent • Independent of all contextual factors • Expected relevance: • Also depends on users (U) and contexts (C): Score(Q,D,U,C) • Requires reasoning with contextual information • Several approaches in IR can be viewed as simple inference • We have to consider more complex inference

  3. Overview • Introduction: Current Approaches to IR • Inference using term relations • A General Model to Integrate Contextual Factors • Constructing and Using Domain Models • Conclusion and Future Work

  4. Traditional Methods in IR • Each query term t matches a list of documents t: {…, D, …} • Final answer list = combination of the lists of all query terms • e.g. Vector space model: Score(Q,D) = Σ_t w(t,Q)·w(t,D); Language model: Score(Q,D) = P(Q|D) = ∏_{q∈Q} P(q|D) • 2 implicit assumptions: • The information need is specified only by the query terms • Query terms are independent
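
To make the second scoring scheme above concrete, here is a minimal sketch (not from the talk) of unsmoothed query-likelihood scoring with a unigram language model; the toy document and queries are invented for illustration. It also shows why the pure maximum-likelihood estimate is problematic: any query term absent from the document drives the score to zero.

```python
from collections import Counter
from math import log

def query_likelihood(query_terms, doc_terms):
    """Score a document by the (log) probability that its unigram language
    model generates the query: log P(Q|D) = sum_i log P(q_i|D)."""
    counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for q in query_terms:
        p = counts[q] / doc_len            # maximum-likelihood estimate of P(q|D)
        if p == 0.0:
            return float("-inf")           # unsmoothed model: an unseen term zeroes the score
        score += log(p)
    return score

doc = "tsunami hit the ocean coast of asia".split()
print(query_likelihood(["tsunami", "asia"], doc))      # finite score
print(query_likelihood(["natural", "disaster"], doc))  # -inf: motivates smoothing
```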

  5. Reality • A term is only one of the possible expressions of a meaning • Synonyms, related terms • A query is only a partial specification of the user’s information need • Many words can be omitted in the query: e.g. “Java hotel” may mean hotel booking on Java island, … • How to make the query more complete?

  6. Dealing with relations between terms • Previous methods try to enhance the query: • Query expansion (add some related terms) • Thesauri: WordNet, HowNet • Statistical co-occurrence: two terms that often co-occur in the same context • Pseudo-relevance feedback: top-ranked documents retrieved with the original query • User profile, background, preferences… (a set of background terms) • Used to re-rank the documents • Equivalent to a query expansion

  7. Question • Are these related to inference? • How can inference be performed in IR in general? • LM as a tool for implementing logical IR

  8. Overview • Introduction: Current Approaches to IR • Inference using term relations • A General Model to Integrate Contextual Factors • Constructing and Using Domain Models • Conclusion and Future Work

  9. What is logical IR? • Key: inference – infer the query from the document • D: Tsunami • Q: natural disaster • D → Q ?

  10. Using knowledge to make inference in IR • K ⊢ D → Q • K: general knowledge • No knowledge • Thesauri • Co-occurrence • … • K: user knowledge • Characterizes the knowledge of a particular user

  11. Simple inference – the core of logical IR • Logical deduction: (A → B) ∧ (B → C) ⊢ A → C • In IR: • (D → Q') ∧ (Q' → Q) ⊢ D → Q (doc. matching + inference on the query) • (D → D') ∧ (D' → Q) ⊢ D → Q (inference on the doc. + doc. matching)

  12. Is language modeling a reasonable framework? 1. Basic generative model: • P(Q|D) ~ P(D → Q) • Current smoothing: • E.g. D = Tsunami, P_ML(natural disaster|D) = 0 is changed to P(natural disaster|D) > 0 • Not inference: P(computer|D) > 0 just as P(natural disaster|D) > 0
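
The following small sketch (an assumed example, not from the slides) illustrates the point with Jelinek-Mercer smoothing against a collection model: an unseen related term and an unseen unrelated term receive exactly the same probability mass, which is why smoothing by itself is not inference.

```python
from collections import Counter

def jm_prob(term, doc_terms, collection_terms, lam=0.5):
    """Jelinek-Mercer smoothing: P(w|D) = (1 - lam) * P_ML(w|D) + lam * P(w|C)."""
    d, c = Counter(doc_terms), Counter(collection_terms)
    p_ml = d[term] / len(doc_terms)
    p_coll = c[term] / len(collection_terms)
    return (1 - lam) * p_ml + lam * p_coll

doc = "tsunami ocean asia".split()
collection = "tsunami ocean asia disaster computer programming news".split()
# A related unseen term and an unrelated one receive exactly the same mass:
print(jm_prob("disaster", doc, collection))
print(jm_prob("computer", doc, collection))
```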

  13. Effect of smoothing? • Doc: Tsunami, ocean, Asia, … • Smoothing ≠ inference • Probability mass is redistributed uniformly / according to the collection (also to unrelated terms) • (Figure: smoothed term distribution over Tsunami, ocean, Asia, computer, nat. disaster, …)

  14. Expected effect • Using Tsunami → natural disaster • Knowledge-based smoothing • (Figure: probability mass shifted towards related terms such as nat. disaster rather than unrelated ones such as computer)

  15. Inference: Translation model (Berger & Lafferty 99) • Traditional LM: P(Q|D) = ∏_i P(q_i|D) • Inference by translation: P(Q|D) = ∏_i Σ_w t(q_i|w) P(w|D)
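
The original slide formulas did not survive the transcript, so here is a rough sketch of the translation-model idea under toy probabilities; the table trans (playing the role of t(q|w)) and all numbers are assumptions for illustration, not values from Berger & Lafferty.

```python
def translation_model_score(query_terms, doc_model, trans):
    """Translation-model scoring in the spirit of Berger & Lafferty (1999):
    P(Q|D) = prod_i sum_w t(q_i|w) * P(w|D).
    `trans[w][q]` holds the translation probability t(q|w) (toy values here)."""
    score = 1.0
    for q in query_terms:
        score *= sum(trans.get(w, {}).get(q, 0.0) * p_w for w, p_w in doc_model.items())
    return score

doc_model = {"tsunami": 0.5, "ocean": 0.3, "asia": 0.2}          # P(w|D)
trans = {"tsunami": {"tsunami": 0.6, "disaster": 0.4},           # t(q|w)
         "ocean": {"ocean": 1.0},
         "asia": {"asia": 1.0}}
print(translation_model_score(["disaster"], doc_model, trans))   # 0.2: inferred via tsunami -> disaster
```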

  16. Using more types of knowledge for document expansion (Cao et al. 05) • Different ways to satisfy a query (term): • Directly, through the unigram model • Indirectly (by inference), through WordNet relations • Indirectly, through co-occurrence relations • … • D → t_i if D →_UG t_i or D →_WN t_i or D →_CO t_i

  17. Inference using different types of knowledge (Cao et al. 05) • (Figure: query term q_i is generated from the document through three component models combined with mixture weights λ1, λ2, λ3: a unigram (UG) model over the document words w_1 … w_n, a WordNet (WN) model P_WN(q_i|w), and a co-occurrence (CO) model P_CO(q_i|w))
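
A minimal sketch of how the three component models in the figure might be combined; the mixture weights λ and the toy probability tables are assumptions made for the example and are not the settings used by Cao et al.

```python
def mixed_term_prob(q, doc_model, p_ug, p_wn, p_co, lambdas=(0.6, 0.2, 0.2)):
    """Three-component mixture: the query term is generated directly by the
    unigram model, or inferred through WordNet or co-occurrence relations:
    P(q|D) = l1*P_UG(q|D) + l2*sum_w P_WN(q|w)*P(w|D) + l3*sum_w P_CO(q|w)*P(w|D)."""
    l1, l2, l3 = lambdas
    via_wn = sum(p_wn.get(w, {}).get(q, 0.0) * pw for w, pw in doc_model.items())
    via_co = sum(p_co.get(w, {}).get(q, 0.0) * pw for w, pw in doc_model.items())
    return l1 * p_ug.get(q, 0.0) + l2 * via_wn + l3 * via_co

doc_model = {"tsunami": 0.6, "asia": 0.4}                        # P(w|D)
p_ug = doc_model                                                  # direct unigram model
p_wn = {"tsunami": {"disaster": 0.5}}                             # WordNet relation P_WN(q|w)
p_co = {"tsunami": {"ocean": 0.7}}                                # co-occurrence relation P_CO(q|w)
print(mixed_term_prob("disaster", doc_model, p_ug, p_wn, p_co))   # > 0 only via the WN path
```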

  18. Experiments (Cao et al. 05) • Integrating more types of relations is useful

  19. Query expansion in LM • KL-divergence ranking: Score(Q,D) = −KL(θ_Q || θ_D) ∝ Σ_w P(w|θ_Q) log P(w|θ_D), with query model θ_Q and smoothed doc. model θ_D • With no query expansion, this is equivalent to the generative model
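
As an illustration of KL-divergence ranking, the sketch below scores a document by the cross entropy between a query model and an (already smoothed) document model; with an unexpanded, maximum-likelihood query model this ordering coincides with the generative query-likelihood ranking. All distributions are toy values.

```python
from math import log

def kl_rank_score(query_model, doc_model, floor=1e-10):
    """Ranking by -KL(theta_Q || theta_D) is equivalent (up to a query-only
    constant) to the cross entropy sum_w P(w|theta_Q) * log P(w|theta_D)."""
    return sum(p_q * log(doc_model.get(w, floor)) for w, p_q in query_model.items())

query_model = {"insider": 0.5, "trading": 0.5}                              # P(w|theta_Q)
doc_model = {"insider": 0.1, "trading": 0.2, "stock": 0.3, "market": 0.4}   # smoothed P(w|theta_D)
print(kl_rank_score(query_model, doc_model))
```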

  20. Expanding the query model • P(w|θ_Q') = λ P(w|θ_Q) + (1 − λ) Σ_{q_i} P(w|q_i) P(q_i|θ_Q) • First term: classical LM (original query model); second term: relation model

  21. Using co-occurrence information • Using an external knowledge base (e.g. Wordnet) • Pseudo-rel. feedback • Other term relationships • …

  22. Using co-occurrence relations • Use term co-occurrence relationships • Terms that often co-occur in the same window are related • Window size: 10 words • Unigram relationship (w_j → w_i), estimated as P(w_i|w_j) • Query expansion with the related terms
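
A possible implementation sketch of this step: count co-occurrences within a 10-word sliding window and normalize them into P(w_i|w_j); the top related terms of each query term can then be added to the query model. The toy text below is an assumption for the example.

```python
from collections import Counter, defaultdict

def unigram_relations(tokens, window=10):
    """Estimate unigram relations P(w_i | w_j) from co-occurrences inside a
    sliding window of 10 words, as described on the slide."""
    counts = defaultdict(Counter)
    for i, wj in enumerate(tokens):
        for wi in tokens[i + 1:i + window]:
            if wi != wj:
                counts[wj][wi] += 1
                counts[wi][wj] += 1
    return {wj: {wi: n / sum(ctr.values()) for wi, n in ctr.items()}
            for wj, ctr in counts.items()}

text = "java programming language runs on the java virtual machine".split()
relations = unigram_relations(text)
print(relations["java"])   # expansion candidates for the query term "java"
```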

  23. Problems with co-occurrence relations • Ambiguity • A term relationship links two single words, e.g. “Java → programming” • No information to determine the appropriate context, e.g. the query “Java travel” would still be expanded by “programming” • Solution: add some context information into the term relationships

  24. Overview • Introduction: Current Approaches to IR • Inference using term relations • Extracting context-dependent term relations • A General Model to Integrate Contextual Factors • Constructing and Using Domain Models • Conclusion and Future Work

  25. General Idea (Bai et al. 06) • Use (t1, t2, t3, …) → t instead of t1 → t • e.g. “(Java, computer, language) → programming” • Problem with an arbitrary number of terms in the condition: • Complexity with many words in the condition part • Difficult to obtain reliable relations • Our solution: • Limit the condition part to 2 words, e.g. “(Java, computer) → programming”, “(Java, travel) → island” • One word specifies the context for the other

  26. Hypotheses • Hypothesis 1: most words can be disambiguated with one useful context word • e.g. “Java + computer, Java + travel, Java + taste” • Hypothesis 2: users often choose useful related words to form their queries • A word in the query provides useful information to disambiguate another word • Possible queries: e.g. “windows version”, “doors and windows” • Rare case: users do not express their need clearly, e.g. “windows installation” ?

  27. Context-dependent co-occurrences (Bai et al. 06) • (w_i, w_j) → w_k, estimated as P(w_k|w_i, w_j) • New relation model: expansion terms are generated from pairs of query terms rather than from single terms
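
Here is a rough sketch of how such biterm relations could be estimated by counting, within text windows, how often a word co-occurs with a pair of condition words; the actual estimation in Bai et al. 06 is more elaborate, so this only illustrates the idea, and the toy windows are invented.

```python
from collections import Counter, defaultdict
from itertools import combinations

def biterm_relations(windows):
    """Estimate context-dependent relations P(w_k | w_i, w_j): the condition
    part is limited to two words, one of which disambiguates the other."""
    counts = defaultdict(Counter)
    for win in windows:
        terms = set(win)
        for wi, wj in combinations(sorted(terms), 2):
            for wk in terms - {wi, wj}:
                counts[(wi, wj)][wk] += 1
    return {pair: {wk: n / sum(ctr.values()) for wk, n in ctr.items()}
            for pair, ctr in counts.items()}

windows = ["java computer programming language".split(),
           "java travel island hotel".split()]
rel = biterm_relations(windows)
print(rel[("computer", "java")])   # -> programming, language
print(rel[("java", "travel")])     # -> island, hotel
```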

  28. Experimental Results (Average Precision) * and ** indicate the difference is statistically significant by t-test (*: p-value < 0.05, **: p-value < 0.01)

  29. Experimental Analysis (example) • Query #55: “Insider trading”
• Unigram relationships P(*|insider) or P(*|trading): stock:0.014177 market:0.0113156 US:0.0112784 year:0.010224 exchang:0.0101797 trade:0.00922486 report:0.00825644 price:0.00764028 dollar:0.00714267 1:0.00691906 govern:0.00669295 state:0.00659957 futur:0.00619518 million:0.00614666 dai:0.00605674 offici:0.00597034 peopl:0.0059315 york:0.00579298 issu:0.00571347 nation:0.00563911
• Bi-term relationships P(*|insider, trading): secur:0.0161779 charg:0.0158751 stock:0.0137123 scandal:0.0128471 boeski:0.0125011 inform:0.011982 street:0.0113332 wall:0.0112034 case:0.0106411 year:0.00908383 million:0.00869452 investig:0.00826196 exchang:0.00804568 govern:0.00778614 sec:0.00778614 drexel:0.00756986 fraud:0.00718055 law:0.00631543 ivan:0.00609914 profit:0.00566658
• => Expansion terms determined by BQE (bi-term query expansion) are more relevant than those from UQE (unigram query expansion)

  30. Logical point of view of the extensions • (Figure: inference chain from document D through intermediate terms t_j, …, to the query term t_i, i.e. D → t_j → t_i)

  31. Overview • Introduction: Current Approaches to IR • Inference using term relations • A General Model to Integrate Contextual Factors • Constructing and Using Domain Models • Conclusion and Future Work

  32. LM for context-dependent IR? • Context (X) = background knowledge of the user, the domain of interest, … • Document model smoothed by the context model: X ⊢ (D → Q) ≡ ⊢ (X + D) → Q • Similar to doc. expansion approaches • Query model smoothed by the context model: X ⊢ (D → Q) ≡ ⊢ D → (Q + X) • Similar to (Lau et al. 04) and query expansion approaches • Utilizations of context: • Domain knowledge (e.g. java → programming only in computer science) • Specification of the area of interest (e.g. science): background terms • Characteristics of the collection

  33. Contexts and Utilization (1) • General term relations (Knowledge) • Traditional term relations are context-independent: e.g. “Java → programming”, Prob(programming|Java) • Context-dependent term relations: add some context words into the term relations, e.g. “{Java, computer} → programming” (“programming” is only used to expand a query containing both “Java” and “computer”) • “{Java, computer}” identifies a better context than “Java” to determine expansion terms

  34. Contexts and Utilization (2) • Topic domains of the query (Domain background) • Consider a topic domain as specifying a set of background terms frequently used in the domain • However, these terms are often omitted in the queries, e.g. in the Computer Science domain, the term “computer” is often implied by queries but usually omitted • e.g. Computer Science domain: any query → “computer”, …

  35. Example: “bus services in Java” • 99 of the returned documents concern the “Java language”; only one is related to “transportation” (but it is irrelevant to the query) • Reason: the retrieval context is not considered, namely that the user is preparing a trip

  36. Example: “bus services in Java + transportation, hotel, flight” • 12 of the top 20 results are related to “transportation” • Reason: the additional terms specify the appropriate context and make the query less ambiguous

  37. Contexts and Utilization (3) • Query-specific collection characteristics (Feedback model) • What terms are useful to retrieve relevant documents in a particular corpus? • ~ What other topics are often described together with the query topic in the corpus? e.g. in a corpus, “terrorism” is often described together with “9-11, air hijacking, world trade center, …” • Expand the query with related terms • Feedback model: captures the query-related collection context

  38. Enhanced Query Model • Basic idea for query expansion: combine the original query model with the expansion models • Generalized model with 3 expansion models from 3 contextual factors: P(w|θ_Q) = Σ_{i∈X} α_i P(w|θ_Q^i), where X = {0, K, Dom, FB} is the set of all component models and α_i is the mixture weight • θ_Q^0: original query model • θ_Q^K: knowledge model • θ_Q^Dom: domain model • θ_Q^FB: feedback model
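
A minimal sketch of the generalized mixture: each contextual factor contributes a component distribution, and the final query model is their weighted sum. The component distributions and the α values below are toy assumptions, not trained weights.

```python
def enhanced_query_model(components, alphas):
    """P(w|theta_Q) = sum_{i in X} alpha_i * P(w|theta_Q^i), where
    X = {0 (original), K (knowledge), Dom (domain), FB (feedback)}."""
    vocab = set().union(*components.values())
    return {w: sum(alphas[i] * components[i].get(w, 0.0) for i in components)
            for w in vocab}

components = {
    "0":   {"java": 0.5, "hotel": 0.5},        # original query model
    "K":   {"island": 0.6, "booking": 0.4},    # knowledge model (term relations)
    "Dom": {"travel": 0.7, "flight": 0.3},     # domain model
    "FB":  {"resort": 0.5, "beach": 0.5},      # feedback model
}
alphas = {"0": 0.4, "K": 0.2, "Dom": 0.2, "FB": 0.2}   # mixture weights, sum to 1
print(enhanced_query_model(components, alphas))
```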

  39. Illustration: Expanded Query Model • A term t can be derived from the query model by several inference paths • Once a path is selected, the corresponding LM is used to generate the term t

  40. Overview • Introduction: Current Approaches to IR • Inference using term relations • A General Model to Integrate Contextual Factors • Constructing and Using Domain Models • Conclusion and Future Work

  41. Creating Domain Models • Assumption: each topic domain contains a set of example (in-domain) documents • Extract domain-specific terms from them • Use the EM algorithm to extract only the specific terms • Assume each in-domain document is generated from a mixture of the domain model θ_Dom and the collection model θ_C: P(w|d) = λ_Dom P(w|θ_Dom) + (1 − λ_Dom) P(w|θ_C), with λ_Dom = 0.5 • The domain model θ_Dom is estimated by EM so as to maximize the likelihood of the in-domain documents under this mixture
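
The EM step can be sketched as follows, assuming the mixture above with a fixed collection model θ_C: the E-step credits each word occurrence partly to the domain model, the M-step renormalizes, and after a few iterations common words such as "the" lose most of their probability. The toy documents and collection model are assumptions for illustration.

```python
from collections import Counter

def estimate_domain_model(in_domain_docs, collection_model, lam=0.5, iters=12):
    """EM estimation of the domain model theta_Dom, assuming each in-domain
    document is generated from lam * P(w|theta_Dom) + (1 - lam) * P(w|theta_C)
    (lam = 0.5 on the slide). Common words are absorbed by the fixed collection
    model theta_C, so theta_Dom concentrates on domain-specific terms."""
    counts = Counter(w for doc in in_domain_docs for w in doc)
    total = sum(counts.values())
    theta = {w: n / total for w, n in counts.items()}        # start from the ML estimate
    for _ in range(iters):
        expected = {}
        for w, n in counts.items():
            p_dom = lam * theta[w]
            p_mix = p_dom + (1 - lam) * collection_model.get(w, 1e-9)
            expected[w] = n * p_dom / p_mix                   # E-step: counts credited to theta_Dom
        z = sum(expected.values())
        theta = {w: e / z for w, e in expected.items()}       # M-step: renormalise
    return theta

docs = ["pollution climate the environment".split(), "the climate emission".split()]
coll = {"the": 0.4, "of": 0.4, "pollution": 0.05, "climate": 0.05,
        "environment": 0.05, "emission": 0.05}
print(estimate_domain_model(docs, coll))   # "the" is pushed down, specific terms pushed up
```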

  42. Effect of the EM Process • Term probabilities in the domain “Environment” before/after EM (12 iterations) • => EM extracts domain-specific terms while filtering out common words

  43. How to Gather In-domain Documents • Existing directories: ODP, Yahoo! directory • We assume that the user defines his own domains and assigns a domain to each of his queries (during the training phase) • Gather the relevant documents of the queries (by the user’s relevance judgments) (C1) • Simply collect the top-ranked documents (without the user’s relevance judgments) (C2) • (This strategy is used in order to test on TREC data)

  44. How to Determine the Domain of a New Query • 2 strategies to assign a domain to the query: • Manually (U1) • Automatically (U2) • Automatic query classification by LM: • Similar to text classification, but a query is much shorter than a text document • Select the domain with the lowest KL-divergence score for the query: Dom* = argmin_Dom KL(θ_Q || θ_Dom) • This is an extension of Naïve Bayes classification [Peng et al. 2003]
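
A small sketch of this classification step: because the query-model entropy is constant across domains, choosing the lowest KL-divergence amounts to choosing the domain model that gives the query terms the highest log-probability. The domain models below are toy assumptions.

```python
from math import log

def classify_query(query_model, domain_models, floor=1e-10):
    """Assign the query to the domain with the lowest KL(theta_Q || theta_Dom);
    since the entropy of theta_Q is constant, this is the same as maximising
    sum_w P(w|theta_Q) * log P(w|theta_Dom)."""
    def cross_ll(dom):
        return sum(p * log(domain_models[dom].get(w, floor))
                   for w, p in query_model.items())
    return max(domain_models, key=cross_ll)

query_model = {"insider": 0.5, "trading": 0.5}
domain_models = {
    "International Economics": {"trading": 0.2, "stock": 0.4, "insider": 0.1, "market": 0.3},
    "Environment":             {"pollution": 0.5, "climate": 0.5},
}
print(classify_query(query_model, domain_models))   # "International Economics"
```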

  45. Overview • Introduction: Current Approaches to IR • Inference using term relations • A General Model to Integrate Contextual Factors • Constructing and Using Domain Models • Experiments • Conclusion and Future Work

  46. Experimental Setting • Text collection statistics: TREC • Training collection: used to determine the parameter values (mixture weights)

  47. An Example: Query with Manually Assigned Domain
<top>
<head> Tipster Topic Description
<num> Number: 055
<dom> Domain: International Economics
<title> Topic: Insider Trading (only the title is used as the query)
<desc> Description: Document discusses an insider-trading case.
…
Figure: Distribution of the queries among 13 domains in TREC

  48. Baseline Methods • Document model: Jelinek-Mercer smoothing, P(w|D) = (1 − λ) P_ML(w|D) + λ P(w|C)

  49. Constructing and Using Domain Models • 2 strategies to create domain models (the current test query is excluded from domain model construction): • With the relevant documents of in-domain queries (C1) • The user judges which documents are relevant to the domain • Similar to manual construction of directories • With the top-100 documents retrieved by in-domain queries (C2) • The user specifies a domain for queries without judging relevant documents • The system gathers in-domain documents from the user’s search history • Once the domain models are constructed, 2 strategies to use them: • A domain can be assigned to a new query by the user manually (U1) • The domain is determined by the system automatically using query classification (U2)

  50. Creating Domain Models • C1 (constructed with relevant documents) vs. C2 (with top-100):
