
Presentation Transcript


  1. Intelligent Information Retrieval (智能信息检索). Prof. Xiaoyong Du, Renmin University of China; Prof. Ji-Rong Wen, Microsoft Research Asia

  2. Overview of Key Techniques in IR Prof. Xiaoyong Du

  3. Core Techniques • Metadata-level techniques: Indexing, Search, Compression • Media-related interpretation: Text operator, Image operator, Video operator • Unstructured data

  4. English Text Operation • Word tokenization • Handling of "." • Handling of the apostrophe ( ' ) • Handling of "-" • Open source: http://nltk.sourceforge.net/ • G. Grefenstette's results (1994), over 52,511 sentences of the Brown corpus: • Simply treating "." as a sentence delimiter: 93.20% accuracy • Simple regular-expression rules: 97.66% accuracy • A lexicon can raise accuracy further • Proceedings of the 3rd Conf. on Computational Lexicography and Text Research, 1994
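The regular-expression approach mentioned on slide 4 can be sketched roughly as follows. This is a minimal illustration, not the rule set Grefenstette evaluated; the abbreviation list and the pattern are hypothetical.

```python
import re

def split_sentences(text):
    """Split on '.', '!' or '?' followed by whitespace and a capital letter."""
    abbreviations = {"Dr.", "Mr.", "Mrs.", "Prof.", "etc."}
    pieces = re.split(r'(?<=[.!?])\s+(?=[A-Z])', text)
    # Re-join pieces that were split right after a known abbreviation;
    # rules of roughly this kind reach ~97% accuracy on the Brown corpus.
    sentences = []
    for piece in pieces:
        if sentences and sentences[-1].split()[-1] in abbreviations:
            sentences[-1] += " " + piece
        else:
            sentences.append(piece)
    return sentences

print(split_sentences("Prof. Du teaches IR. The course covers indexing."))
# -> ['Prof. Du teaches IR.', 'The course covers indexing.']
```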

  5. English Text Operation • Stemming • Table lookup: list every word's stem in advance; wastes storage space • Rule-based Porter algorithm • Open source: http://www.tartarus.org/martin/porterStemmer/ • Other methods
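A minimal usage sketch of the rule-based Porter algorithm through NLTK (the toolkit linked on slide 4); it assumes the nltk package is installed.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["connection", "connected", "connecting", "studies", "argued"]:
    # Porter's rules strip suffixes in several passes; the output is a stem,
    # not necessarily a dictionary word (e.g. "studies" -> "studi").
    print(word, "->", stemmer.stem(word))
```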

  6. Chinese Lexical Analysis • Word segmentation • What counts as a "word" in Chinese? • Dictionary-based maximum matching • Forward Maximum Matching (FMM) • Reverse Maximum Matching (RMM) • Bi-directional Maximum Matching • If FMM and RMM agree, the segmentation can be taken as correct; otherwise apply further disambiguation (a minimal FMM sketch follows below)
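A minimal sketch of forward maximum matching with a toy dictionary (the entries are hypothetical; a real segmenter uses a full lexicon). Reverse maximum matching is the same greedy scan starting from the end of the string.

```python
def fmm_segment(text, dictionary):
    """Forward maximum matching: at each position take the longest dictionary word."""
    max_len = max(len(w) for w in dictionary)
    words, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            # Unmatched single characters fall through as one-character words.
            if candidate in dictionary or length == 1:
                words.append(candidate)
                i += length
                break
    return words

toy_dict = {"中国", "人民", "大学", "中国人民大学", "智能", "信息", "检索", "信息检索"}
print(fmm_segment("中国人民大学智能信息检索", toy_dict))
# -> ['中国人民大学', '智能', '信息检索']
```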

  7. Chinese Lexical Analysis • Segmentation ambiguities • Types of ambiguity • Overlapping ambiguity: A+X+B can be split as AX | B or A | XB, e.g. 苏副教授 • Combinational ambiguity: A+B can be one word AB or two words A | B, e.g. 马上 • Disambiguation with a statistical language model (sketched below)
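One way to realize the statistical disambiguation above is to score the competing segmentations with a simple unigram language model and keep the more probable one; the probabilities below are made-up numbers for illustration, and a real system would use at least a bigram model with surrounding context.

```python
import math

# Hypothetical unigram probabilities estimated from a corpus.
unigram_p = {"马上": 1e-4, "马": 5e-4, "上": 2e-3}

def log_prob(segmentation):
    """Log-probability of a segmentation under a unigram model."""
    return sum(math.log(unigram_p.get(w, 1e-8)) for w in segmentation)

candidates = [["马上"], ["马", "上"]]
print(max(candidates, key=log_prob))   # -> ['马上'] (higher unigram probability)
```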

  8. Chinese Lexical Analysis • Out-of-vocabulary (OOV) word recognition • New words not present in the lexicon • Categories of OOV words: • Person names: 张朝阳, 哈里·波特 • Place names: 海淀区, 李家庄 • Organization names: 中国人民大学 • Proper nouns: 道-琼斯 • Technical terms: 非典, 线性回归 • Numerals, time expressions, etc.: 1992年 • Named entity recognition • Information extraction

  9. Image Operation • OCR • Color • ……

  10. Indexing • Inverted Files • Suffix Trees • Signature Files (slide diagram: indexing built over an object representation of unstructured data)

  11. Inverted Files • Characteristics • A word-oriented mechanism based on a sorted list of keywords, with each keyword linked to the documents that contain it • Preprocessing • Each document is assigned a list of keywords or attributes • Each keyword (attribute) is associated with relevance weights

  12. Inversion of Word List 1. The input text is parsed into a list of words along with their location in the text. (time and storage consuming operation) 2. This list is inverted from a list of terms in location order to a list of terms in alphabetical order. 3. Add term weights, or reorganize or compress the files.

  13. Inversion of Word List
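The three steps on slide 12 can be sketched in memory as follows; this is a minimal single-machine version, whereas real indexers sort runs on disk because step 1 is the time- and storage-consuming part.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: {doc_id: text}. Returns {term: [(doc_id, [positions])]}, terms sorted."""
    # Step 1: parse each document into (term, doc_id, position) occurrences.
    occurrences = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            occurrences[term][doc_id].append(pos)
    # Step 2: invert from location order to alphabetical term order.
    # Step 3: term weights or compression would be added here.
    return {term: sorted(docs_map.items())
            for term, docs_map in sorted(occurrences.items())}

index = build_inverted_index({1: "a text has many words",
                              2: "words are made from letters"})
print(index["words"])   # -> [(1, [4]), (2, [0])]
```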

  14. Structure and Construction • Structure (split the index into two files) • Vocabulary: O(n^β) according to Heaps' Law • Occurrences: depends on the addressing granularity (document or block?) • Construction • Dictionary file: the vocabulary is stored in lexicographical order and points into the posting lists • Posting file: the lists of occurrences are stored contiguously • Heaps' Law: the vocabulary of a text of n words has size O(n^β), with β in (0.4, 0.6)

  15. Dictionary and Postings File (document #, frequency)

  16. Vocabulary and Posting File

  17. Structures used in Inverted Files • Vocabulary • Sorted Arrays • Hashing Structures • Keyword Trees: Tries (digital search trees) • The Search Procedure • Vocabulary search • Retrieval of occurrences • Manipulation of occurrences

  18. Size of an Inverted File • Block addressing: the text is divided into blocks, and occurrences point to blocks rather than exact positions, in contrast to full inverted indices, where exact occurrences are recorded • (Slide table omitted: index sizes as a percentage of the size of the whole text collection)
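A sketch of block addressing: the same kind of index as on slide 12, but each term records only the block numbers it occurs in, so exact positions must be recovered by scanning the candidate blocks. The block size is an arbitrary illustrative choice.

```python
def build_block_index(text, block_size=16):
    """Map each term to the set of blocks (of block_size words) it occurs in."""
    index = {}
    for pos, term in enumerate(text.lower().split()):
        index.setdefault(term, set()).add(pos // block_size)
    return index

index = build_block_index("a text has many words words are made from letters",
                          block_size=4)
print(index["made"])   # -> {1}: scan block 1 to find exact positions
```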

  19. Analysis for Block Addressing • Advantage • easy to implement • Disadvantage • updating the index is expensive

  20. Suffix Trees • Each position in the text is considered a text suffix • Index points are selected from the text; they mark the beginnings of the text positions that will be retrievable

  21. Suffix of a Text

  22. Suffix Tries and Suffix Trees

  23. Suffix Arrays • A suffix array is an array of pointers to the text suffixes, kept in lexicographical order • Requires less space than a suffix tree • Its main drawback is the costly construction process • Allows binary search, comparing the text at each pointer • Supra-indices can be added for large suffix arrays
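A minimal suffix-array sketch: the array holds suffix start positions sorted lexicographically, and a pattern is located by binary search over those pointers. This is the naive construction the slide calls costly; the key= form of bisect needs Python 3.10+.

```python
import bisect

def build_suffix_array(text):
    """Suffix start positions, sorted by the suffix each one points to."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find(text, sa, pattern):
    """Binary-search the suffix array; return start positions of all matches."""
    key = lambda i: text[i:i + len(pattern)]
    lo = bisect.bisect_left(sa, pattern, key=key)
    hi = bisect.bisect_right(sa, pattern, key=key)
    return sorted(sa[lo:hi])

text = "this is a text. a text has many words."
sa = build_suffix_array(text)
print(find(text, sa, "text"))   # -> [10, 18]
```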

  24. Signature Files • Characteristics • Word-oriented index structures based on hashing • Low overhead (10%~20% over the text size) at the cost of forcing a sequential search over the index • Suitable for not very large texts • Inverted files outperform signature files for most applications

  25. Construction and Search • Construction • A word-oriented index structure based on hashing • Each word is mapped to a bit mask (signature) of B bits • The text is divided into blocks of b words each • A block's mask is obtained by bitwise ORing the signatures of all the words in that block • Search • Hash the query word to a bit mask W • If W & Bi = W, text block i may contain the word

  26. Example • Four blocks: "This is a text. | A text has many | words. Words are | made from letters." with block signatures 000101, 110101, 100100, 101101 • Hash(text) = 000101 • Hash(many) = 110000 • Hash(words) = 100100 • Hash(made) = 001100 • Hash(letters) = 100001 • e.g. block 4: 001100 OR 100001 = 101101
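The example above can be reproduced with a few lines; the toy hash table copies the slide's made-up signatures, and only the five content words are hashed (an assumption made here so the block masks match the slide).

```python
# Toy word signatures from the slide (B = 6 bits per mask).
HASH = {"text": 0b000101, "many": 0b110000, "words": 0b100100,
        "made": 0b001100, "letters": 0b100001}

def block_signature(block_words):
    """OR together the signatures of the words in one text block."""
    mask = 0
    for w in block_words:
        mask |= HASH.get(w, 0)     # words outside the toy table add nothing
    return mask

blocks = [["this", "is", "a", "text"], ["a", "text", "has", "many"],
          ["words", "words", "are"], ["made", "from", "letters"]]
signatures = [block_signature(b) for b in blocks]
print([format(s, "06b") for s in signatures])   # 000101 110101 100100 101101

def candidate_blocks(query):
    """Blocks whose mask contains every bit of the query signature."""
    w = HASH[query]
    return [i + 1 for i, s in enumerate(signatures) if s & w == w]

print(candidate_blocks("letters"))   # -> [2, 4]; block 2 is a false drop
```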

  27. False Drop • False drop = false positive • Assume that m bits are randomly set in each word's mask, and let α = m/B • For a block of b words, the probability that a given bit of the block mask is set is 1 − (1 − 1/B)^(bm) ≈ 1 − e^(−bα) • Hence the probability that the m bits set in the query are also set in the block mask is Fd = (1 − e^(−bα))^(αB) • Fd is minimized for α = ln(2)/b • Then Fd = 2^(−m), with m = B·ln(2)/b
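A quick numeric check of the formula, with arbitrary illustrative values for B and b.

```python
import math

B = 1024   # bits per block signature (illustrative)
b = 64     # words per block (illustrative)

alpha = math.log(2) / b            # optimal fraction of bits set per word
m = round(alpha * B)               # bits per word signature: B*ln(2)/b ≈ 11
p_bit = 1 - math.exp(-b * alpha)   # probability a given mask bit is set (= 1/2)
Fd = p_bit ** m                    # false drop probability
print(m, Fd, 2.0 ** -m)            # Fd matches 2**-m ≈ 4.9e-4
```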

  28. end
