
Information Retrieval Models


Presentation Transcript


  1. Information Retrieval Models PengBo Oct 30, 2010

  2. Review of the last lecture • Basic Index Techniques • Inverted index • Dictionary & Postings • Scoring and Ranking • Term weighting • tf·idf • Vector Space Model • Cosine Similarity • IR evaluation • Precision, Recall, F • Interpolation • MAP, interpolated AP

  3. Outline of this lecture • Information Retrieval Models • Vector Space Model (VSM) • Latent Semantic Model (LSI) • Language Model (LM)

  4. Relevance Feedback Query Expansion

  5. Vector Space Model

  6. Documents as vectors • Each document j can be viewed as a vector; each term is a dimension whose value is the log-scaled tf·idf weight • So we have a vector space • terms are axes • docs live in this space • High-dimensional space: even after stemming, we may have 20,000+ dimensions
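
As a concrete illustration of this weighting, here is a minimal Python sketch (the function name tfidf_vector, the df dictionary, and the sample counts are hypothetical; it uses wf = 1 + log10(tf) and idf = log10(N/df), i.e. the log-scaled tf·idf mentioned above):

```python
import math
from collections import Counter

def tfidf_vector(doc_tokens, df, n_docs):
    """Log-scaled tf-idf vector for one document (sparse dict: term -> weight)."""
    tf = Counter(doc_tokens)
    vec = {}
    for term, freq in tf.items():
        wf = 1.0 + math.log10(freq)             # log-scaled term frequency
        idf = math.log10(n_docs / df[term])     # inverse document frequency
        vec[term] = wf * idf
    return vec

# Hypothetical document frequencies in a collection of 10,000,000 documents.
df = {"digital": 10_000, "cameras": 50_000, "video": 100_000}
doc = ["digital", "cameras", "video", "cameras"]
print(tfidf_vector(doc, df, n_docs=10_000_000))
```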

  7. Intuition • [Figure: documents d1–d5 as vectors in the space spanned by terms t1, t2, t3; θ and φ are angles between document vectors] • Postulate: documents that are "close together" in the vector space talk about the same things • Use cases: query-by-example; a free-text query treated as a vector

  8. Cosine similarity • [Figure: vectors d1 and d2 in the space spanned by terms t1, t2, t3, separated by angle θ] • The "closeness" of vectors d1 and d2 can be measured by the angle between them • Concretely, the cosine of that angle is used as the vector similarity • Vectors are normalized by length (length normalization)
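
A minimal sketch of cosine similarity over sparse term-weight vectors like the ones built above (cosine_similarity is a hypothetical helper name; the formula is the standard dot product divided by the product of the Euclidean lengths):

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    dot = sum(w * vec_b.get(t, 0.0) for t, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

print(cosine_similarity({"digital": 1.0, "cameras": 2.0},
                        {"digital": 2.0, "cameras": 1.0, "video": 1.0}))
```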

  9. #1. COS Similarity • Compute the similarity between the query "digital cameras" and the document "digital cameras and video cameras". • Assume N = 10,000,000; both the query and the document use logarithmic term weighting (wf columns); the query additionally uses idf weighting and the document uses cosine normalization; "and" is treated as a stop word.

  10. #2. Evaluation • Define a precision-recall graph as follows: for a ranked result list, compute a precision/recall point at each returned document; the graph consists of these points. • On this graph, define a breakeven point as a point where precision equals recall. • Question: can such a graph have more than one breakeven point? If so, give an example; if not, prove it.
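
The sketch below does not answer the question; it only shows, on a hypothetical toy ranking, how the precision/recall points and break-even points of such a graph could be computed (pr_points and the example relevance judgments are made up for illustration):

```python
def pr_points(ranked_relevance, total_relevant):
    """Precision/recall point at each rank of a ranked result list."""
    points, hits = [], 0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        hits += int(is_relevant)
        points.append((hits / total_relevant, hits / rank))   # (recall, precision)
    return points

# Hypothetical ranking: relevant docs at ranks 1 and 3, four relevant docs overall.
points = pr_points([True, False, True, False], total_relevant=4)
breakevens = [(r, p) for (r, p) in points if abs(r - p) < 1e-9]
print(points)      # [(0.25, 1.0), (0.25, 0.5), (0.5, 0.666...), (0.5, 0.5)]
print(breakevens)  # [(0.5, 0.5)]
```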

  11. Latent Semantic Model

  12. Vector Space Model: Pros • Automatic selection of index terms • Partial matching of queries and documents (dealing with the case where no document contains all search terms) • Ranking according to similarity score (dealing with large result sets) • Term weighting schemes (improve retrieval performance) • Various extensions • Document clustering • Relevance feedback (modifying the query vector) • Geometric foundation

  13. Problems with Lexical Semantics • Polysemy: words often have a multitude of meanings and different usages. The Vector Space Model cannot distinguish between different senses of the same word, i.e. it cannot handle ambiguity. • Synonymy: different terms may have an identical or similar meaning. The Vector Space Model cannot express such associations between words.

  14. Issues in the VSM • Independence assumption between terms • Some terms are more likely to occur together • synonyms, related words, spelling mistakes, etc. • Depending on context, terms may have different meanings • The term-document matrix has very high dimensionality: does every document or term really have that many important features?

  15. Singular Value Decomposition • Apply Singular Value Decomposition to the t × d term-document matrix: W_{t×d} = T_{t×r} Σ_{r×r} (D_{d×r})^T • r is the rank of the matrix • Σ is the diagonal matrix of singular values, sorted in descending order • T and D have orthonormal columns (T^T T = I, D^T D = I) • The squared singular values are the eigenvalues of W W^T; the columns of T and D are eigenvectors of W W^T and W^T W respectively
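
A minimal numpy sketch of this decomposition on a made-up 4-term × 4-document matrix W (the matrix values are arbitrary; numpy.linalg.svd returns the factors with the singular values already sorted in descending order):

```python
import numpy as np

# Hypothetical 4-term x 4-document matrix W (rows = terms, columns = documents).
W = np.array([[1.0, 1.0, 0.0, 0.0],
              [1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0],
              [0.0, 0.0, 1.0, 1.0]])

# W = T * diag(sigma) * D^T, with orthonormal columns in T and D.
T, sigma, DT = np.linalg.svd(W, full_matrices=False)

print(sigma)                                        # singular values, descending
assert np.allclose(T @ np.diag(sigma) @ DT, W)      # exact reconstruction
assert np.allclose(T.T @ T, np.eye(T.shape[1]))     # T^T T = I
assert np.allclose(DT @ DT.T, np.eye(DT.shape[0]))  # D^T D = I
```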

  16. Singular Values • Σ gives an ordering to the dimensions • The values fall off very quickly • The singular values in the tail represent "noise" • Cutting off the low-value dimensions reduces noise and can improve performance

  17. Low-rank Approximation • Exact decomposition: W_{t×d} = T_{t×r} Σ_{r×r} (D_{d×r})^T • Rank-k approximation (k < r): W'_{t×d} ≈ T_{t×k} Σ_{k×k} (D_{d×k})^T, keeping only the k largest singular values

  18. Latent Semantic Indexing (LSI) • Perform a low-rank approximation of term-document matrix (typical rank 100-300) • General idea • Map documents (and terms) to a low-dimensional representation. • Design a mapping such that the low-dimensional space reflects semantic associations (latent semantic space). • Compute document similarity based on the inner product in this latent semantic space

  19. What it is • From the original term-document matrix A_r, we compute its rank-k approximation A_k • In A_k, each row still corresponds to a term and each column to a document • The difference: documents now live in a new space with k << r dimensions • How to compare two terms? A_k A_k^T = T Σ D^T (T Σ D^T)^T = (T Σ)(T Σ)^T, so term-term similarities are inner products of the rows of T Σ • How to compare two documents? A_k^T A_k = (Σ D^T)^T (Σ D^T), so document-document similarities are inner products of the rows of D Σ • How to compare a term and a document? Simply the entry A_k[i,j]

  20. LSI Term matrix T • T matrix • one vector per term in the LSI space • in the original matrix, term vectors are d-dimensional; in T they are much smaller • the dimensions correspond to groups of terms that tend to "co-occur" with this term in the same documents • synonyms, contextually-related words, variant endings • (T Σ) is used to compute term-term similarity

  21. Document matrix D • D matrix • the representation of documents in the LSI space • has the same dimensionality as the T vectors • (D Σ) is used to compute document-document similarity • can also be used to compute the similarity between a query and a document

  22. Retrieval with LSI • The LSI retrieval process: • the query is mapped/projected into the LSI document space D^T, which is called "folding in" • since W = T Σ D^T, if the projection of q into that space is q', then q = T Σ q', and hence q' = Σ^{-1} T^T q, i.e. q'^T = q^T T Σ^{-1} • folding in therefore amounts to multiplying the document/query vector by T Σ^{-1} • the document vectors of the collection are the columns of D^T • query-document similarity is computed via the dot product
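
A minimal sketch of this process on the same toy matrix as above (lsi_build and fold_in are hypothetical names; note that scaling conventions differ between LSI implementations, some multiply the document coordinates by Σ before taking dot products):

```python
import numpy as np

def lsi_build(W, k):
    """Truncated SVD of the term-document matrix W, keeping the top k dimensions."""
    T, sigma, DT = np.linalg.svd(W, full_matrices=False)
    return T[:, :k], sigma[:k], DT[:k, :]

def fold_in(vec, T_k, sigma_k):
    """Fold a raw term vector (query or new document) into the LSI space:
    q'^T = q^T T_k Sigma_k^{-1}, i.e. multiply by T Sigma^{-1} as on the slide."""
    return (vec @ T_k) / sigma_k

# Hypothetical 4-term x 4-document matrix (rows = terms, columns = documents).
W = np.array([[1.0, 1.0, 0.0, 0.0],
              [1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0],
              [0.0, 0.0, 1.0, 1.0]])
T_k, sigma_k, DT_k = lsi_build(W, k=2)

q = np.array([1.0, 1.0, 0.0, 0.0])   # query containing the first two terms
q_lsi = fold_in(q, T_k, sigma_k)
doc_vectors = DT_k.T                  # one k-dimensional row per document
scores = doc_vectors @ q_lsi          # dot-product similarity
print(np.argsort(-scores))            # documents ranked by similarity to q
```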

  23. Improved Retrieval with LSI • The performance gains come from… • removal of noise • no need to stem terms (variants will co-occur) • no need for a stop list • No improvement in speed or space, though…

  24. [Worked example: a term-document matrix C and its SVD factors T_r and Σ_r; the numeric matrices are not reproduced here]

  25. [Worked example continued: D_r^T, the truncated Σ_2, and Σ_2 D_2^T; the numeric matrices are not reproduced here]

  26. Example • Map into a 2-dimensional space

  27. Latent Semantic Analysis • Latent semantic space: illustrating example courtesy of Susan Dumais

  28. Empirical evidence • Experiments on TREC 1/2/3 (Dumais) • Precision at or above median TREC precision • Top scorer on almost 20% of TREC topics • Slightly better on average than straight vector spaces • Effect of dimensionality (chart not reproduced here)

  29. LSI has many other applications • In many settings we have a feature-object matrix. • The matrix is high-dimensional with a lot of redundancy, which allows a low-rank approximation. • For text retrieval, the terms are features and the docs are objects: Latent Semantic Indexing • For example, opinions and users… • Incomplete data (e.g., users' opinions) can be recovered in the low-dimensional space. • Powerful general analytical technique

  30. Language Models

  31. IR based on Language Model (LM) • [Diagram: an information need gives rise to a query; each document d1 … dn in the collection has a model that could generate that query] • The usual search heuristic: the user guesses which words the author of a relevant document would have used, and forms the query from them • The LM approach directly exploits that idea!

  32. Formal Language (Model) • A traditional generative model: generates strings • Finite state machines or regular grammars, etc. • Example: the language (I wish)* generates "I wish", "I wish I wish", "I wish I wish I wish", …

  33. Stochastic Language Models • Model the probability of generating strings in the language (commonly all strings over alphabet ∑) • Model M: P(the) = 0.2, P(a) = 0.1, P(man) = 0.01, P(woman) = 0.01, P(said) = 0.03, P(likes) = 0.02, … • For s = "the man likes the woman": P(s | M) = 0.2 × 0.01 × 0.02 × 0.2 × 0.01 = 0.00000008 (multiply the per-word probabilities)

  34. Stochastic Language Models • Model the probability of generating any string • Model M1: P(the) = 0.2, P(class) = 0.01, P(sayst) = 0.0001, P(pleaseth) = 0.0001, P(yon) = 0.0001, P(maiden) = 0.0005, P(woman) = 0.01 • Model M2: P(the) = 0.2, P(class) = 0.0001, P(sayst) = 0.03, P(pleaseth) = 0.02, P(yon) = 0.1, P(maiden) = 0.01, P(woman) = 0.0001 • For s = "the class pleaseth yon maiden": P(s | M1) = 0.2 × 0.01 × 0.0001 × 0.0001 × 0.0005 and P(s | M2) = 0.2 × 0.0001 × 0.02 × 0.1 × 0.01, so P(s | M2) > P(s | M1)

  35. Stochastic Language Models • A statistical model used for generating text • A probability distribution over strings in a given language M • For a four-word string s = w1 w2 w3 w4: P(s | M) = P(w1 | M) · P(w2 | M, w1) · P(w3 | M, w1 w2) · P(w4 | M, w1 w2 w3)

  36. Unigram and higher-order models • Unigram Language Models: P(w1 w2 w3 w4) = P(w1) P(w2) P(w3) P(w4) • Bigram (generally, n-gram) Language Models: P(w1 w2 w3 w4) = P(w1) P(w2 | w1) P(w3 | w2) P(w4 | w3) • Other Language Models • Grammar-based models (PCFGs), etc. • Probably not the first thing to try in IR • Unigram models: Easy. Effective!
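
A minimal sketch of a unigram and a bigram string probability, using the toy probabilities from slide 33 (the dictionaries and function names are illustrative; the unigram product reproduces the 0.00000008 result above):

```python
def unigram_prob(tokens, p_word):
    """P(s) under a unigram model: the product of per-word probabilities."""
    prob = 1.0
    for w in tokens:
        prob *= p_word[w]
    return prob

def bigram_prob(tokens, p_first, p_next):
    """P(s) under a bigram model: P(w1) * product of P(w_i | w_{i-1})."""
    prob = p_first[tokens[0]]
    for prev, cur in zip(tokens, tokens[1:]):
        prob *= p_next[(prev, cur)]
    return prob

# Toy unigram model M from slide 33.
model_m = {"the": 0.2, "a": 0.1, "man": 0.01, "woman": 0.01, "said": 0.03, "likes": 0.02}
s = ["the", "man", "likes", "the", "woman"]
print(unigram_prob(s, model_m))   # 0.2 * 0.01 * 0.02 * 0.2 * 0.01 = 8e-08
```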

  37. The fundamental problem of LMs • The model M is unknown • We only have sample text that is representative of the model • Estimate the model from the sample text • Then compute the probability of the observed text under the estimated model: P(s | M(sample text))

  38. Using Language Models in IR • Attach a language model to each document • Rank documents by P(d | q) • P(d | q) = P(q | d) × P(d) / P(q) • P(q) is the same for all documents, so ignore • P(d) [the prior] is often treated as the same for all d • But we could use criteria like authority, length, genre • P(q | d) is the probability of q given d's model • Very general formal approach

  39. Language Models for IR • Language Modeling Approaches • model the query generation process • rank documents by the probability that the query would be observed as a random sample from the respective document model • Multinomial approach

  40. Retrieval based on probabilistic LM • Treat query generation as a random process • Approach: • Infer a language model for each document. • Estimate the probability that each document model generates the query. • Rank the documents by that probability. • Usually a unigram model is used

  41. Query generation probability (1) • Ranking formula: P(Q | M_d) = ∏_{t ∈ Q} P(t | M_d) • With the maximum likelihood estimate: P̂(t | M_d) = tf(t,d) / dl(d), where M_d is the language model of document d, tf(t,d) is the raw tf of term t in document d, and dl(d) is the total number of tokens in document d • Unigram assumption: given a particular language model, the query terms occur independently

  42. Insufficient data • Zero probability: P(t | M_d) = 0 • arises when a document contains none of the occurrences of some query term… • General approach • a term that does not occur in the document is assigned its probability of occurring in the collection • If tf(t,d) = 0, use P(t | M_c) = cf(t) / cs, where cf(t) is the raw count of term t in the collection and cs is the raw collection size (total number of tokens in the collection)

  43. Insufficient data • Zero probabilities spell disaster • Use smoothing: smooth the probabilities • Discount nonzero probabilities • Give some probability mass to unseen things • There are many approaches, e.g. adding 1, ½, or a small ε to counts, Dirichlet priors, discounting, and interpolation • [See FSNLP ch. 6 if you want more] • Use a mixture model: mix the document multinomial with the collection multinomial distribution

  44. Mixture model • P(w | d) = λ · Pmle(w | Md) + (1 – λ) · Pmle(w | Mc) • The parameter λ is very important • a high λ makes the query "conjunctive-like", which suits short queries • a low λ is better for long queries • Tune λ to optimize performance • for example, make it depend on document length (cf. Dirichlet prior or Witten-Bell smoothing)

  45. Basic mixture model summary • General formulation of the LM for IR: P(Q | d) = ∏_{t ∈ Q} [ (1 – λ) · P(t | Mc) + λ · P(t | Md) ], where P(t | Mc) is the general (collection) language model and P(t | Md) is the individual-document model

  46. Example • Document collection (2 documents) • d1: Xerox reports a profit but revenue is down • d2: Lucent narrows quarter loss but revenue decreases further • Model: MLE unigram from documents;  = ½ • Query: revenue down • P(Q|d1) • = [(1/8 + 2/16)/2] x [(1/8 + 1/16)/2] • = 1/8 x 3/32 = 3/256 • P(Q|d2) • = [(1/8 + 2/16)/2] x [(0 + 1/16)/2] • = 1/8 x 1/32 = 1/256 • Ranking: d1 > d2
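
A minimal sketch that reproduces this example (query_likelihood is a hypothetical helper; it implements the λ-mixture from slide 44 with λ = ½ and MLE unigram estimates):

```python
from collections import Counter

def query_likelihood(query, doc, collection, lam=0.5):
    """P(Q|d) = product over query terms of [ lam*Pmle(t|Md) + (1-lam)*Pmle(t|Mc) ]."""
    doc_tf, coll_tf = Counter(doc), Counter(collection)
    score = 1.0
    for t in query:
        p_doc = doc_tf[t] / len(doc)              # MLE from the document
        p_coll = coll_tf[t] / len(collection)     # MLE from the whole collection
        score *= lam * p_doc + (1.0 - lam) * p_coll
    return score

d1 = "xerox reports a profit but revenue is down".split()
d2 = "lucent narrows quarter loss but revenue decreases further".split()
collection = d1 + d2
query = "revenue down".split()

print(query_likelihood(query, d1, collection))   # 3/256 ~ 0.0117 -> d1 ranked first
print(query_likelihood(query, d2, collection))   # 1/256 ~ 0.0039
```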

  47. Alternative Models of Text Generation • [Diagram: a searcher's query model generates the query; a writer's doc model generates the document; is this the same model?]

  48. Retrieval Using Language Models • [Diagram: a query and its query model, a document and its document model, linked by three arrows] • (1) Query likelihood: the document model generates the query • (2) Document likelihood: the query model generates the document • (3) Model comparison: compare the query model and the document model directly
