
File organization




  1. File organization

  2. File Organizations (Indexes) • Choices for accessing data during query evaluation • Scan the entire collection • Typical in early (batch) retrieval systems • Computational and I/O costs are O(characters in collection) • Practical for only “small” text collections • Large memory systems make scanning feasible • Use indexes for direct access • Evaluation time O(query term occurrences in collection) • Practical for “large” collections • Many opportunities for optimization • Hybrids: Use small index, then scan a subset of the collection

  3. Indexes • What should the index contain? • Database systems index primary and secondary keys • This is the hybrid approach • Index provides fast access to a subset of database records • Scan the subset to find the solution set • IR Problem: • Cannot predict the keys that people will use in queries • Every word in a document is a potential search term • IR Solution: index by all keys (words), i.e., full-text indexes

  4. Indexes • The index is accessed by the atoms of a query language; the atoms are called “features”, “keys”, or “terms” • Most common feature types: words in text, punctuation; manually assigned terms (controlled and uncontrolled vocabulary); document structure (sentence and paragraph boundaries); inter- or intra-document links (e.g., citations) • Composed features: feature sequences (phrases, names, dates, monetary amounts); feature sets (e.g., synonym classes) • Indexing and retrieval models drive the choices • Must be able to construct all components of those models • Indexing choices (there is no “right” answer): tokenization, case folding (e.g., New vs. new, Apple vs. apple), stopwords (e.g., the, a, its), morphology (e.g., computer, computers, computing, computed) • Index granularity has a large impact on speed and effectiveness

  5. Index Contents • The contents depend upon the retrieval model • Feature presence/absence • Boolean • Statistical (tf, df, ctf, doclen, maxtf) • Often about 10% the size of the raw data, compressed • Positional • Feature location within the document • Granularities include word, sentence, paragraph, etc. • Coarse granularities are less precise, but take less space • Word-level granularity is about 20-30% the size of the raw data, compressed

  6. Indexes: Implementation • Common implementations of indexes • Bitmaps • Signature files • Inverted files • Common index components • Dictionary (lexicon) • Postings • document ids • word positions (optional; omitted when no positional data is indexed)

  7. Indexes: Bitmaps • Bag-of-words index only • For each term, allocate vector with one bit per document • If feature present in document n, set nth bit to 1, otherwise 0 • Boolean operations very fast • Space efficient for common terms (why?) • Space inefficient for rare terms (why?) • Good compression with run-length encoding (why?) • Not widely used
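To make the bit-vector idea concrete, here is a minimal sketch in Python. The three-document toy collection and the helper names are illustrative assumptions, not taken from the slides.

```python
# A minimal bitmap-index sketch: one Python integer is the bit vector for each
# term, bit n is set when the term appears in document n, and Boolean queries
# become single bitwise operations.
docs = ["porridge pot", "pease porridge hot", "in the pot"]

bitmaps = {}                                   # term -> bit vector (one bit per document)
for doc_id, text in enumerate(docs):
    for term in set(text.split()):
        bitmaps[term] = bitmaps.get(term, 0) | (1 << doc_id)   # set bit doc_id

def matching_docs(bitmap, n_docs):
    """Return the doc ids whose bit is set in the vector."""
    return [d for d in range(n_docs) if bitmap & (1 << d)]

both   = bitmaps["porridge"] & bitmaps["pot"]   # Boolean AND
either = bitmaps["porridge"] | bitmaps["pot"]   # Boolean OR
print(matching_docs(both, len(docs)))    # [0]
print(matching_docs(either, len(docs)))  # [0, 1, 2]
```

Run-length encoding pays off here because a rare term's vector is almost all zeros, while a common term's vector has few zeros to store in the first place.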

  8. Indexes: Signature Files • Bag-of-words only • For each term, allocate a fixed-size s-bit vector (signature) • Define one or more hash functions: word → 1..s (selects which bits to set) • Each term has an s-bit signature • may not be unique! • OR the term signatures to form the document signature • Long documents are a problem (why?) • Usually segment them into smaller pieces / blocks
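A minimal sketch of this scheme follows. The signature width s = 16 and the use of two hash seeds are assumptions chosen for illustration; real systems tune both.

```python
# Signature-file sketch: hash each term to a few bit positions, OR the term
# signatures into a document signature, and test queries against it.
import hashlib

S = 16  # signature width in bits (illustrative)

def term_signature(term):
    """Set a few hashed bit positions to form the term's s-bit signature."""
    sig = 0
    for seed in ("a", "b"):                      # two hash functions
        h = hashlib.md5((seed + term).encode()).digest()
        sig |= 1 << (h[0] % S)
    return sig

def doc_signature(text):
    sig = 0
    for term in text.split():
        sig |= term_signature(term)              # OR of all term signatures
    return sig

def maybe_contains(doc_sig, term):
    """True if every 1-bit of the term signature is on: a possible, not certain, match."""
    t = term_signature(term)
    return doc_sig & t == t

d = doc_signature("pease porridge hot")
print(maybe_contains(d, "porridge"))   # True
print(maybe_contains(d, "pot"))        # usually False, but false positives can occur
```

Long documents are a problem because ORing many term signatures drives most bits to 1, so nearly every query "matches"; splitting into blocks keeps each signature sparse.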

  9. Signature File Example

  10. Signature File Example

  11. Indexes: Signature Files • At query time: • Look up the signature for the query (how?) • If all corresponding 1-bits are “on” in the document signature, the document probably contains that term • Vary s to control P(false positive) • Note the space tradeoff • Optimal s changes as the collection grows (why?) • Widely studied but not widely used

  12. Indexes: Inverted Lists • Inverted lists are currently the most common indexing technique • Source file: collection, organized by document • Inverted file: collection organized by term • one record per term, listing locations where term occurs • During evaluation, traverse lists for each query term • OR: the union of component lists • AND: an intersection of component lists • Proximity: an intersection of component lists • SUM: the union of component lists; each entry has a score
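The sketch below shows the two basic list operations named above, AND as a merge-style intersection and OR as a union, over sorted document-id postings; the toy postings are illustrative only.

```python
# AND (intersection) and OR (union) over sorted document-id postings lists.
postings = {
    "porridge": [1, 2, 5],
    "pot":      [2, 4, 5, 6],
}

def and_query(p1, p2):
    """Merge-style intersection of two sorted postings lists."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

def or_query(p1, p2):
    """Union of two sorted postings lists."""
    return sorted(set(p1) | set(p2))

print(and_query(postings["porridge"], postings["pot"]))  # [2, 5]
print(or_query(postings["porridge"], postings["pot"]))   # [1, 2, 4, 5, 6]
```

Proximity and SUM operators traverse the same lists; proximity additionally compares stored word positions, and SUM accumulates a score per document instead of a bare membership test.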

  13. Inverted Files

  14. Inverted Files

  15. Word-Level Inverted File

  16. How big is the index? • For an n-word collection: • Lexicon • Heaps’ Law: V = O(n^β), 0.4 < β < 0.6 • TREC-2: 1 GB text, 5 MB lexicon • Postings • at most one per occurrence of a word in the text: O(n)

  17. Inverted Search Algorithm • Find query elements (terms) in the lexicon • Retrieve postings for each lexicon entry • Manipulate postings according to the retrieval model

  18. Word-Level Inverted File • Figure: lexicon and postings, with answers for the queries 1. porridge & pot (BOOL), 2. “porridge pot” (BOOL), 3. porridge pot (VSM)

  19. Lexicon Data Structures • According to Heaps’ Law, the size of the lexicon may be very large • Hash table • O(1) lookup, given a constant-time h() and collision handling • Supports exact-match lookup • May be complex to expand • B-Tree • On-disk storage with fast retrieval and good caching behavior • Supports exact-match and range-based lookup • O(log n) lookups to find a list • Usually easy to expand

  20. In-memory Inversion Algorithm • Create an empty lexicon • For each document d in the collection: • read the document and parse it into terms • for each indexing term t: • f_{d,t} = frequency of t in d • if t is not in the lexicon, insert it • append <d, f_{d,t}> to the postings list for t • Output each postings list into the inverted file • For each term, start a new file entry • Append each <d, f_{d,t}> to the entry • Compress the entry • Write the entry out to the file
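A direct sketch of this loop in Python; the whitespace tokenizer and the in-memory "inverted file" are simplifications, and the compression step is omitted.

```python
# In-memory inversion: build term -> [(d, f_{d,t}), ...] postings in one pass.
from collections import Counter, defaultdict

def invert(collection):
    """collection: iterable of (doc_id, text). Returns term -> [(d, f_dt), ...]."""
    lexicon = defaultdict(list)                      # term -> postings list
    for doc_id, text in collection:
        freqs = Counter(text.lower().split())        # f_{d,t} for each term t in d
        for term, f_dt in freqs.items():
            lexicon[term].append((doc_id, f_dt))     # append <d, f_dt> to t's postings
    return lexicon

docs = [(1, "pease porridge hot"), (2, "pease porridge in the pot")]
index = invert(docs)
print(index["porridge"])   # [(1, 1), (2, 1)]

# Writing out: one entry per term; a real system would delta-encode and
# compress each postings list before appending it to the inverted file.
```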

  21. Complexity of In-memory Inv. • Time: O(n) for n-byte text • Space • Lexicon: space for unique words + offsets • Postings: 10 bytes per entry • document number: 4 bytes • frequency count: 2 bytes (allows up to 65,535 occurrences) • "next" pointer: 4 bytes • Is this affordable? • For a 5 GB collection, at 10 bytes/entry and 400M entries, we need 4 GB of main memory

  22. Idea 1: Partition the text • Invert one chunk of the text at a time • Then merge the sub-indexes into the complete main inverted file (multiway merge)

  23. Idea 2: Sort-based Inversion • Invert in two passes: • Output records <t, d, f_{d,t}> to a temporary file • Sort the records using external merge sort • read a chunk of the temp file • sort it using Quicksort • write it back into the same place • then merge-sort the chunks in place • Read the sorted file, and write the inverted file
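A compact sketch of the two passes; Python's in-memory sort stands in for the external merge sort of the temporary record file, and the toy documents are illustrative.

```python
# Sort-based inversion: emit <t, d, f_{d,t}> records, sort by term, then group.
from collections import Counter
from itertools import groupby

def sort_based_invert(collection):
    # Pass 1: one <t, d, f_dt> record per (term, document) pair.
    records = []
    for doc_id, text in collection:
        for term, f_dt in Counter(text.lower().split()).items():
            records.append((term, doc_id, f_dt))

    # Stand-in for the external merge sort of the temp file: sort by term, then doc id.
    records.sort()

    # Pass 2: read the sorted records and write one inverted-file entry per term.
    return {term: [(d, f) for _, d, f in group]
            for term, group in groupby(records, key=lambda r: r[0])}

docs = [(1, "pease porridge hot"), (2, "pease porridge in the pot")]
print(sort_based_invert(docs)["pease"])   # [(1, 1), (2, 1)]
```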

  24. Access Optimizations • Skip lists: • A table of contents to the inverted list • Embedded pointers that jump ahead n documents (why is this useful?) • Separating presence information from location information • Many operators only need presence information • Location information takes substantial space (I/O) • If split, • reduced I/O for presence operators • increased I/O for location operators (or larger index) • Common in CD-ROM implementations
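The sketch below shows why skip pointers help during intersection: every k-th posting acts as a skip entry, so the scan can jump ahead whenever the other list's current document id is far ahead. The skip interval k = 4 is an illustrative choice, not from the slides.

```python
# AND evaluation with skip pointers over sorted postings lists.
def intersect_with_skips(p1, p2, k=4):
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            while i + k < len(p1) and p1[i + k] <= p2[j]:   # follow skips on p1
                i += k
            if p1[i] < p2[j]:
                i += 1
        else:
            while j + k < len(p2) and p2[j + k] <= p1[i]:   # follow skips on p2
                j += k
            if p2[j] < p1[i]:
                j += 1
    return out

# Most of the 50-entry list is skipped over rather than scanned.
print(intersect_with_skips(list(range(0, 100, 2)), [8, 50, 97]))  # [8, 50]
```

Skips are most useful when one list is much shorter than the other, which is exactly the situation where a plain merge wastes time scanning the long list.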

  25. Inverted file compression • Inverted lists are usually compressed • Inverted files with word locations are about the size of the raw data • The distribution of numbers is skewed • Most numbers are small (e.g., word locations, term frequency) • The distribution can easily be made more skewed • Delta encoding: 5, 8, 10, 17 → 5, 3, 2, 7 • Simple compression techniques are often the best choice • Simple algorithms are nearly as effective as complex algorithms • Simple algorithms are much faster than complex algorithms • Goal: time saved by reduced I/O > time required to uncompress (e.g., with Huffman encoding)

  26. Inverted file compression • The longest lists, which take up the most space, belong to the most frequent (probable) words • Compressing the longest lists would save the most space • The longest lists should compress easily because they contain the least information (why?) • Algorithms: • Delta encoding • Variable-length encoding • Unary codes • Gamma codes • Delta codes • Variable-byte code • Golomb code

  27. Inverted List Indexes: Compression • Delta Encoding ("Storing Gaps") • Reduces the range of numbers • keep document ids in ascending order • 3, 5, 20, 21, 23, 76, 77, 78 • becomes: 3, 2, 15, 1, 2, 53, 1, 1 • Produces a more skewed distribution • Increases the probability of smaller numbers • Stemming also increases the probability of smaller numbers (why?)
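A small sketch of gap encoding and decoding for an ascending document-id list, reproducing the example above.

```python
# Delta ("gap") encoding of a sorted postings list, and the inverse.
from itertools import accumulate

def to_gaps(doc_ids):
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def from_gaps(gaps):
    return list(accumulate(gaps))

ids = [3, 5, 20, 21, 23, 76, 77, 78]
print(to_gaps(ids))              # [3, 2, 15, 1, 2, 53, 1, 1]  (the example above)
print(from_gaps(to_gaps(ids)))   # round-trips back to the original list
```

Stemming helps for the same reason: conflating word forms makes each list longer and denser, so the gaps between consecutive document ids shrink.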

  28. Variable-Byte Code • Binary, but use the minimum number of bytes • 7 bits store the value, 1 bit flags whether another byte follows • 1 ≤ x < 128: 1 byte • 128 ≤ x < 16,384: 2 bytes • 16,384 ≤ x < 2,097,152: 3 bytes • Integral byte sizes for easy coding • Very effective for medium-sized numbers • A little wasteful for very small numbers • Best trade-off between smallest index size and fastest query time
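A sketch of one common variable-byte convention: 7 data bits per byte, low-order bits first, with the high bit marking the last byte of each number (other texts invert the flag or the byte order; this choice is an assumption).

```python
# Variable-byte encode/decode for a sequence of gaps.
def vbyte_encode(n):
    out = bytearray()
    while True:
        if n < 128:
            out.append(n | 0x80)      # high bit set: last byte of this number
            return bytes(out)
        out.append(n & 0x7F)          # high bit clear: more bytes follow
        n >>= 7

def vbyte_decode(data):
    nums, n, shift = [], 0, 0
    for b in data:
        n |= (b & 0x7F) << shift
        if b & 0x80:                  # last byte of this number
            nums.append(n)
            n, shift = 0, 0
        else:
            shift += 7
    return nums

gaps = [3, 2, 15, 1, 2, 53, 130, 16000]
encoded = b"".join(vbyte_encode(g) for g in gaps)
print(len(encoded), vbyte_decode(encoded) == gaps)   # 10 True
```

The small gaps cost one byte each and the larger ones two, which is exactly the "wasteful for very small numbers, effective for medium-sized ones" trade-off noted above.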

  29. Index/IR toolkits • The Lemur Toolkit for Language Modeling and Information Retrieval: http://www.lemurproject.org/ • Nutch (built on Lucene): Java-based indexing and search technology that provides web search application software: http://lucene.apache.org/nutch/

  30. Summary • Common implementations of indexes • Bitmap, signature file, inverted file • Common index components • Lexicon & postings • Inverted Search Algorithm • Inversion algorithm • Inverted file access optimizations • compression

  31. The End

  32. Vector Space Model • Document d and query q are represented as two m-dimensional vectors in the vector space; each dimension is weighted by tf·idf, and their similarity is measured by the cosine of the angle between the vectors (using the raw tf and idf formulas):
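Written out, the cosine measure referred to above in its standard raw tf·idf formulation; the weight symbols below are made explicit here and are not copied verbatim from the slide.

```latex
% Cosine similarity between document d and query q over an m-term vocabulary,
% with raw tf-idf weights w_{i,d} = tf_{i,d} \cdot idf_i and idf_i = \log(N / df_i).
\[
  \operatorname{sim}(d, q) = \cos\theta
    = \frac{\sum_{i=1}^{m} w_{i,d}\, w_{i,q}}
           {\sqrt{\sum_{i=1}^{m} w_{i,d}^{2}}\;\sqrt{\sum_{i=1}^{m} w_{i,q}^{2}}}
\]
```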

  33. Vocabulary Growth (Heaps’ Law) • How does the size of the overall vocabulary (number of unique words) grow with the size of the corpus? • Vocabulary has no upper bound due to proper names, typos, etc. • New words occur less frequently as the vocabulary grows • If V is the size of the vocabulary and n is the length of the corpus in words: • V = K·n^β (0 < β < 1) • Typical constants: • K ≈ 10–100 • β ≈ 0.4–0.6 (roughly the square root of n)
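A rough back-of-the-envelope check of these constants; the specific K = 30, β = 0.5, and the ~5 bytes-per-word figure are assumptions for illustration, but the result is of the same order as the TREC-2 lexicon size mentioned on slide 16.

```python
# Heaps' Law estimate: V = K * n**beta, with illustrative constants.
K, beta = 30, 0.5
n = 200_000_000                 # ~1 GB of text at roughly 5 bytes per word (assumption)
V = K * n ** beta
print(f"estimated vocabulary: ~{V:,.0f} unique words")   # ~424,264
```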

  34. Query Answer • 1. porridge & pot (BOOL) • d2 • 2. “porridge pot” (BOOL) • null • 3. porridge pot (VSM) • d2 > d1 > d5

  35. 3. porridge pot (VSM) • Worked example with N = 6 documents; f(d1) denotes the term frequencies within d1
