Information Retrieval

Information Retrieval PengBo Oct 28, 2010

Outline of this lecture • Introduction to Information Retrieval • Index Techniques • Scoring and Ranking • Evaluation

Basic Index Techniques

Document Collection • site:pkunews.pku.edu.cn • Baidu reports 12,800 pages • Google reports 6,820 pages

User Information Need • Search within this news site for • articles that talk about the culture of China and Japan, and do not talk about students abroad. • QUERY: • 中国日本文化 -留学生 site:pkunews.pku.edu.cn (中国 = China, 日本 = Japan, 文化 = culture, 留学生 = students abroad) • Baidu reports 38 results • Google reports 361 results

How to do it? • String matching: for example, grep through all web pages to find those containing “中国”, “文化”, and “日本”, then discard the pages containing “留学生”? • Slow (for large corpora) • NOT “留学生” is non-trivial • Other operations (e.g., find “中国” NEAR “日本”) are not feasible

Document Representation • Bag of words model • Document-term incidence matrix: an entry is 1 if the page contains the word, 0 otherwise

Incidence Vector • Transpose the document-term matrix • to obtain the term-document incidence matrix • Each term corresponds to a 0/1 vector, its incidence vector

Retrieval • Information need: • Search within this news site for articles that talk about the culture of China and Japan, and do not talk about students abroad. • To answer the query: • Read the term vectors for “中国”, “文化”, “日本”, and “留学生” (complemented) • Bitwise AND • 101110 AND 110010 AND 011011 AND 100011 = 000010
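The bitwise AND on this slide can be reproduced with a small sketch. The pairing of each bit vector with a term follows the order in which the slide lists them, and the uncomplemented vector for 留学生 (011100) is an assumption chosen so that its complement matches the 100011 shown above.

```python
# One bit per document, leftmost bit = first document (six documents in total).
N_DOCS = 6
MASK = (1 << N_DOCS) - 1  # 0b111111, keeps complements within six bits

# Term -> incidence vector. The term/vector pairing is assumed from slide order.
incidence = {
    "中国": 0b101110,
    "文化": 0b110010,
    "日本": 0b011011,
    "留学生": 0b011100,  # assumed raw vector; its complement is 100011
}

def boolean_and_not(required, excluded):
    """AND the vectors of required terms with the complements of excluded terms."""
    result = MASK
    for term in required:
        result &= incidence[term]
    for term in excluded:
        result &= ~incidence[term] & MASK
    return result

answer = boolean_and_not(["中国", "文化", "日本"], ["留学生"])
print(format(answer, "06b"))
```

Only the fifth document has its bit set in the result, which agrees with the 000010 on the slide.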

Let’s build a search system! • Consider the scale of the system: • Number of documents: N = 1 million documents, each with about 1K terms. • An average of 6 bytes/term => 6 GB of data in the documents. • Number of distinct terms: M = 500K • How large is this matrix? • 500K x 1M • Extremely sparse: no more than one billion 1’s • What’s a better representation?
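The arithmetic behind these figures can be checked in a few lines. This is only a back-of-the-envelope sketch; the variable names are illustrative and 1 GB is taken as 10^9 bytes.

```python
n_docs = 1_000_000        # N
terms_per_doc = 1_000     # about 1K terms per document
bytes_per_term = 6        # average term length
n_terms = 500_000         # M, distinct terms

collection_bytes = n_docs * terms_per_doc * bytes_per_term  # total text size
matrix_cells = n_terms * n_docs                             # cells in the incidence matrix
max_ones = n_docs * terms_per_doc                           # each term occurrence sets at most one cell
density = max_ones / matrix_cells                           # upper bound on the fraction of 1's

print(collection_bytes / 1e9, "GB")
print(matrix_cells, "cells, at most", max_ones, "ones")
print(density)
```

At most 0.2% of the cells can be 1, which is why storing only the 1 positions is the better representation.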

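In 1875, Mary Cowden Clarke compiled a concordance to the works of Shakespeare. In the preface she proudly wrote that she had offered a reliable guide to a treasury of wisdom, and hoped that sixteen years of hard labor had lived up to that ideal. • In 1911, Professor Lane Cooper published a concordance to the poems of William Wordsworth. It took 7 months and 67 people, using tools that included cards, scissors, glue, and stamps. • By 1965, compiling such material with a computer took only a few days, and the result was better ...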

Inverted index • For each term T, store a list of the IDs of the documents that contain T • Dictionary -> Postings: • 中国 -> 2 4 8 16 32 64 128 • 文化 -> 1 2 3 5 8 13 21 34 • 留学生 -> 13 16 • Postings are sorted by docID (more later on why).

Inverted index construction • Documents to be indexed: Friends, Romans, countrymen. • Tokenizer -> token stream: Friends Romans Countrymen • Linguistic modules -> modified tokens: friend roman countryman • Indexer -> inverted index: friend -> 2 4; roman -> 1 2; countryman -> 13 16

Indexer steps • Output: a sequence of <modified token, document ID> pairs. • Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me. • Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious

Core indexing step • Sort by terms.

Merge multiple occurrences of a term within one document, and add the term's frequency information.
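The indexer steps (emit pairs, sort by term, merge duplicates with a frequency) might look roughly like the sketch below, applied to Doc 1 and Doc 2 from the slides. The `tokenize` helper is a hypothetical stand-in for the tokenizer and linguistic modules; it only lowercases and splits on non-letters.

```python
import re
from collections import Counter

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

def tokenize(text):
    # Hypothetical normalization: lowercase, keep runs of letters and apostrophes.
    return re.findall(r"[a-z']+", text.lower())

# Step 1: emit <modified token, docID> pairs.
pairs = [(tok, doc_id) for doc_id, text in docs.items() for tok in tokenize(text)]

# Step 2: the core indexing step, sort by term (then by docID).
pairs.sort()

# Step 3: merge repeated (term, docID) pairs and record the term frequency.
counts = Counter(pairs)

# Step 4: split into a dictionary (the keys) and postings (the lists).
index = {}
for (term, doc_id), tf in sorted(counts.items()):
    index.setdefault(term, []).append((doc_id, tf))

print(index["caesar"])
print(index["killed"])
```

Each postings entry here is a `(docID, tf)` tuple, and every postings list comes out sorted by docID because the pairs were sorted first.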

Why split? • The result is split into a Dictionary file and a Postings file.

Boolean Query processing • Query: 中国 AND 文化 • Look up 中国 in the Dictionary; • retrieve its postings: 2 4 8 16 32 64 128 • Look up 文化 in the Dictionary; • retrieve its postings: 1 2 3 5 8 13 21 34 • “Merge” (AND) the two postings lists

The merge • Merging algorithm for the two lists: walk through both postings lists in step • 中国: 2 4 8 16 32 64 128 • 文化: 1 2 3 5 8 13 21 34 • Result: 2 8 • If the list lengths are x and y, the merge takes O(x+y) operations. • Crucial: postings sorted by docID.
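A minimal sketch of the linear merge, using the two postings lists from the slide. The function name `intersect` is illustrative.

```python
def intersect(p1, p2):
    """AND-merge two postings lists that are sorted by docID."""
    i = j = 0
    answer = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # docID present in both lists: keep it and advance both cursors.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the cursor that points at the smaller docID
        else:
            j += 1
    return answer

zhongguo = [2, 4, 8, 16, 32, 64, 128]   # postings for 中国
wenhua = [1, 2, 3, 5, 8, 13, 21, 34]    # postings for 文化
print(intersect(zhongguo, wenhua))
```

Each loop iteration advances at least one cursor, so the loop runs at most x+y times. This only works because both lists are sorted by docID.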

Boolean queries: Exact match • Queries using AND, OR and NOT together with query terms • Primary commercial retrieval tool for 3 decades. • Professional searchers (e.g., Lawyers) still like Boolean queries: • You know exactly what you’re getting.

Example: WestLaw • Largest commercial (paying subscribers) legal search service (started 1975; ranking added 1992) • About 7 terabytes of data; 700,000 users • Majority of users still use Boolean queries • Example query: • What is the statute of limitations in cases involving the federal tort claims act? • LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM • Characteristics: long, precise queries; proximity operators; incrementally developed; not like web search

Beyond Boolean term search • Phrases: • Find “Bill Gates”, not “Bill and Gates” • Proximity: • Find Gates NEAR Microsoft. • Zones within documents: • Find documents with (author = Ullman) AND (text contains automata). • Solution: • Record each term's field property • Record each term's position information in docs.

LAST COURSE REVIEW

Bag of words model • Vector representation doesn’t consider the ordering of words in a document • “John is quicker than Mary” and “Mary is quicker than John” have the same vectors • This is called the bag of words model. • In a sense, this is a step back: the positional index was able to distinguish these two documents. • We will look at “recovering” positional information later in this course. • For now: bag of words model

Inverted index • For each term T, store a list of the IDs of the documents that contain T • Dictionary -> Postings: • 中国 -> 2 4 8 16 32 64 128 • 文化 -> 1 2 3 5 8 13 21 34 • 留学生 -> 13 16 • Postings are sorted by docID (more later on why).

Simple Inverted Index

Inverted Index • with counts • supports better ranking algorithms

Inverted Index • with positions • supports proximity matches
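An index with positions could be sketched as a mapping term -> docID -> list of positions. The two toy documents and the helper names below are hypothetical, and the position comparison is a simple all-pairs check rather than the linear merge a real system would use.

```python
from collections import defaultdict

def build_positional_index(docs):
    """term -> {docID -> [positions]}, using whitespace tokenization (an assumption)."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index

def common_docs(index, t1, t2):
    return sorted(set(index.get(t1, {})) & set(index.get(t2, {})))

def near(index, t1, t2, k):
    """DocIDs where t1 and t2 occur within k positions of each other."""
    return [d for d in common_docs(index, t1, t2)
            if any(abs(p1 - p2) <= k
                   for p1 in index[t1][d] for p2 in index[t2][d])]

def phrase(index, t1, t2):
    """DocIDs where t2 immediately follows t1."""
    return [d for d in common_docs(index, t1, t2)
            if any(p2 - p1 == 1
                   for p1 in index[t1][d] for p2 in index[t2][d])]

# Hypothetical toy documents.
docs = {
    1: "bill gates founded microsoft",
    2: "gates opened and microsoft shares fell after the bill",
}
pos_index = build_positional_index(docs)
print(phrase(pos_index, "bill", "gates"))
print(near(pos_index, "gates", "microsoft", 2))
```

Both documents contain all three words, so a plain Boolean AND cannot tell them apart; the stored positions can.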

Query Processing • Document-at-a-time • Calculates complete scores for documents by processing all term lists, one document at a time • Term-at-a-time • Accumulates scores for documents by processing term lists one at a time • Both approaches have optimization techniques that significantly reduce time required to generate scores

Document-At-A-Time

Term-At-A-Time
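The two query-processing strategies can be compared on a toy index with counts. This is a sketch under assumptions: the index contents are made up, and the score is simply the sum of term frequencies, with none of the optimizations mentioned above.

```python
from collections import defaultdict

# Hypothetical index with counts: term -> [(docID, tf), ...], sorted by docID.
index = {
    "tropical": [(1, 2), (2, 1), (3, 1)],
    "fish": [(1, 1), (3, 2), (4, 1)],
}

def term_at_a_time(index, query):
    """Process one postings list at a time, accumulating partial scores per document."""
    accumulators = defaultdict(int)
    for term in query:
        for doc_id, tf in index.get(term, []):
            accumulators[doc_id] += tf
    return dict(accumulators)

def document_at_a_time(index, query):
    """Advance all postings lists in parallel, finishing one document's score at a time."""
    lists = [index.get(term, []) for term in query]
    cursors = [0] * len(lists)
    scores = {}
    while True:
        # DocIDs currently under each cursor that has not reached the end of its list.
        current = [lst[c][0] for lst, c in zip(lists, cursors) if c < len(lst)]
        if not current:
            break
        doc_id = min(current)
        score = 0
        for k, lst in enumerate(lists):
            c = cursors[k]
            if c < len(lst) and lst[c][0] == doc_id:
                score += lst[c][1]
                cursors[k] += 1
        scores[doc_id] = score  # complete score for this document
    return scores

query = ["tropical", "fish"]
print(term_at_a_time(index, query))
print(document_at_a_time(index, query))
```

Both functions return the same scores. They differ in when a document's score becomes final: only after the last list in term-at-a-time, but immediately in document-at-a-time.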

Scoring and Ranking

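Beyond Boolean Search • For most users ... • LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM • Most users would more likely type bill rights or bill of rights as the query • How should such full-text queries be interpreted and processed? • No Boolean connectives such as AND, OR, NOT • Some query terms need not appear in a result document • Users expect results to be returned in some order, with the documents most likely to be useful at the top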

Scoring: density-based • Given a query, assign each document a score and rank by score • Idea • If a document talks about a topic more, then it is a better match • If a document contains many occurrences of the query terms, it is relevant • This leads to term weighting.

Term frequency vectors • Consider the number of occurrences of term t in document d, denoted $tf_{t,d}$ • For a free-text query q: $Score(q,d) = \sum_{t \in q} tf_{t,d}$
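As a rough illustration of this score, the sketch below sums the raw term frequencies of the distinct query terms. The sample document is made up, and whitespace tokenization is an assumption.

```python
from collections import Counter

def tf_score(query, document):
    """Score(q, d) = sum over distinct query terms t of tf(t, d)."""
    tf = Counter(document.lower().split())
    return sum(tf[t] for t in set(query.lower().split()))

# Hypothetical document text.
doc = "the bill of rights amended the bill"
print(tf_score("bill rights", doc))
print(tf_score("bill of rights", doc))
```

A `Counter` returns 0 for terms that do not occur, so query terms missing from the document simply contribute nothing.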

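Problem of TF scoring • Word order is ignored • Positional information index • Long documents have an advantage • Normalizing for document length • $wf_{t,d} = tf_{t,d} / |d|$ • The importance of occurrences is not proportional to their count • Going from 0 occurrences to 1 means something very different from going from 100 to 101 • Smoothing • Different terms differ in importance • Consider the query 日本的汉字丼 (“the Japanese kanji 丼”) • Discrimination of terms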

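Discrimination of terms • How to measure how common a term is • collection frequency (cf): the total number of occurrences of the term in the collection • document frequency (df): the number of documents in the collection that contain the term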

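tf x idf term weights • The tf x idf weight formula: • term frequency (tf) • or wf, some measure of term density in a doc • inverse document frequency (idf) • expresses the importance (rarity) of a term • raw value $idf_t = 1/df_t$ • again, usually smoothed • Compute a tf.idf weight for each term in a document: $w_{t,d} = tf_{t,d} \times idf_t$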
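A small sketch of computing tf.idf weights. The slide gives only the raw value 1/df and notes that idf is usually smoothed; the code assumes the common log-scaled form idf = log10(N/df), which is one possible choice and not necessarily the one used in the lecture. The toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf_weights(docs):
    """docs: docID -> list of terms. Returns docID -> {term: tf * idf}."""
    n_docs = len(docs)
    df = Counter()
    for terms in docs.values():
        df.update(set(terms))  # count each term once per document
    weights = {}
    for doc_id, terms in docs.items():
        tf = Counter(terms)
        # Assumed smoothing: idf = log10(N / df) instead of the raw 1 / df.
        weights[doc_id] = {t: tf[t] * math.log10(n_docs / df[t]) for t in tf}
    return weights

# Hypothetical toy collection.
docs = {
    1: ["china", "culture", "culture"],
    2: ["china", "japan"],
    3: ["china"],
}
weights = tf_idf_weights(docs)
print(weights[1])
```

Under this choice a term that appears in every document gets weight 0, while a rare term that is repeated within a document gets the largest weight.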

Documents as vectors • Each document j can be viewed as a vector, with one dimension per term and tf.idf values as components • So we have a vector space • terms are axes • docs live in this space • High-dimensional: even with stemming, may have 20,000+ dimensions

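Intuition • [Figure: documents d1 to d5 drawn as vectors in the space of term axes t1, t2, t3, with angles θ and φ between vectors] • Postulate: documents that are “close together” in the vector space talk about the same things. • Use cases: query-by-example; free-text query as a vector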

Sec. 6.3 Formalizing vector space proximity • First cut: distance between two points • ( = distance between the end points of the two vectors) • Euclidean distance? • Euclidean distance is a bad idea . . . • . . . because Euclidean distance is large for vectors of different lengths.

Sec. 6.3 Why distance is a bad idea • The Euclidean distance between q and d2 is large even though the distribution of terms in the query q and the distribution of terms in the document d2 are very similar.

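Cosine similarity • [Figure: vectors d1 and d2 in the space of term axes t1, t2, t3, separated by angle θ] • The “closeness” of vectors d1 and d2 can be measured by the angle between them • Specifically, the cosine of the angle θ gives the vector similarity. • Vectors are normalized by length (normalization)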

Example • Docs: Austen's Sense and Sensibility (SAS), Pride and Prejudice (PAP); Bronte's Wuthering Heights (WH) • cos(SAS, PAP) = .996 x .993 + .087 x .120 + .017 x 0.0 = 0.999 • cos(SAS, WH) = .996 x .847 + .087 x .466 + .017 x .254 = 0.888
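The example can be checked with a short sketch of cosine similarity, $\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{|d_1|\,|d_2|}$. The three component values per novel are the ones in the example. They are treated here as already length-normalized, so the cosine reduces to a dot product. The slide does not say which terms the three dimensions stand for, so none are named.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine of the angle between u and v; works for non-normalized vectors too."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Length-normalized vectors taken from the example.
sas = [0.996, 0.087, 0.017]  # Sense and Sensibility
pap = [0.993, 0.120, 0.0]    # Pride and Prejudice
wh = [0.847, 0.466, 0.254]   # Wuthering Heights

# For unit-length vectors the dot product is (almost exactly) the cosine.
print(round(dot(sas, pap), 3), round(dot(sas, wh), 3))

# Scaling a vector changes its length but not its direction.
print(round(cosine(sas, [10 * x for x in sas]), 6))
```

The last line shows why the angle is preferred over Euclidean distance: a document ten times "longer" in the same direction still has cosine 1 with the original.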

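Notes on Index Structure • How should normalized tf-idf values be stored? • In every postings entry? • Store tf/normalization? • Space blowup because of floats • Typically: • tf is stored as an integer, which helps index compression • document length and idf are stored only once (document length once per document, idf once per term) instead of in every postings entry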

Sec. 6.4 tf-idf weighting has many variants • Columns headed ‘n’ are acronyms for weight schemes. • Why is the base of the log in idf immaterial?

Sec. 6.4 Weighting may differ in queries vs documents • A bad idea? • Many search engines allow for different weightings for queries vs. documents • SMART notation: denotes the combination in use in an engine, with the notation ddd.qqq, using the acronyms from the previous table • A very standard weighting scheme is: lnc.ltc • Document: logarithmic tf (l as first character), no idf (n), and cosine normalization (c) • Query: logarithmic tf (l in leftmost column), idf (t in second column), cosine normalization (c in third column) …
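One way to read lnc.ltc in code is sketched below. It makes several assumptions: logarithmic tf is taken as 1 + log10(tf), idf as log10(N/df), and the third letter c as cosine normalization on both the document and the query side. The document, query, collection size, and df values are illustrative numbers, not data from the lecture.

```python
import math
from collections import Counter

def log_tf(tf):
    # "l": logarithmic term frequency (assumed form 1 + log10(tf)).
    return 1 + math.log10(tf) if tf > 0 else 0.0

def cosine_normalize(weights):
    # "c": divide every weight by the Euclidean length of the vector.
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else weights

def lnc(doc_terms):
    """Document side: log tf, no idf, cosine normalization."""
    return cosine_normalize({t: log_tf(c) for t, c in Counter(doc_terms).items()})

def ltc(query_terms, df, n_docs):
    """Query side: log tf, idf (assumed log10(N/df)), cosine normalization."""
    weights = {t: log_tf(c) * math.log10(n_docs / df[t])
               for t, c in Counter(query_terms).items() if df.get(t)}
    return cosine_normalize(weights)

def score(query_vec, doc_vec):
    return sum(w * doc_vec.get(t, 0.0) for t, w in query_vec.items())

# Illustrative numbers only.
n_docs = 1_000_000
df = {"auto": 5_000, "best": 50_000, "car": 10_000, "insurance": 1_000}
doc_vec = lnc("car insurance auto insurance".split())
query_vec = ltc("best car insurance".split(), df, n_docs)
print(score(query_vec, doc_vec))
```

Because idf is applied on the query side only, a rare query term still dominates the score even though the document vectors carry no idf.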
