CS 430 / INFO 430 Information Retrieval

CS 430 / INFO 430 Information Retrieval Lecture 4 Inverted Files

Course Administration • Assignment 1 will be posted soon. It is a programming assignment and is due on Monday, September 20 at 5 p.m. Follow the submission instructions carefully. Send questions to cs430-l@cs.cornell.edu. • Lecture 2, slides 13 and 15: the error in the definition of n has been corrected.

Document Vectors as Points on a Surface • Normalize all document vectors to be of length 1. Define d' = d / |d|. • Then the ends of the vectors d' all lie on a surface with unit radius. • For similar documents, we can represent parts of this surface as a flat region. • Similar documents are represented as points that are close together on this surface.
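
The normalization step can be sketched in a few lines of Python. This is an illustrative sketch only: the vectors and the helper names `normalize` and `dot` are made up for the example and are not part of the lecture.

```python
import math

def normalize(vector):
    """Return the vector scaled to length 1 (d' = d / |d|).

    The zero vector has no direction, so it is returned unchanged.
    """
    length = math.sqrt(sum(x * x for x in vector))
    if length == 0:
        return list(vector)
    return [x / length for x in vector]

def dot(a, b):
    """Inner product of two vectors of equal length."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical term vectors: d1 and d2 point in the same direction,
# d3 shares no terms with either of them.
d1 = normalize([3.0, 4.0, 0.0])
d2 = normalize([6.0, 8.0, 0.0])
d3 = normalize([0.0, 0.0, 5.0])
```

After normalization, d1 and d2 are the same point on the unit surface even though the original vectors had different lengths, while d3 lies far from both.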

Results of a Search [Figure. Labels: hits from search; x = documents found by search; query.]

Relevance Feedback (Concept) [Figure. Labels: hits from original search; x = documents identified as non-relevant; o = documents identified as relevant; original query; reformulated query.]

Document Clustering (Concept) [Figure: documents (x) grouped into clusters.] Document clusters are a form of automatic classification. A document may be in several clusters.

Use of Inverted Files for Calculating Similarities In the term vector space, if q is a query and dj a document, then q and dj have no terms in common iff q.dj = 0. 1. To calculate all the non-zero similarities, find all the documents, dj, that contain at least one term in the query: • Merge the inverted lists for each term ti in the query, with a logical OR, to establish a set of hits, R. • For each dj ∈ R, calculate Similarity(q, dj), using appropriate weights. 2. Return the elements of R in ranked order.
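
The two steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the toy `postings` data is made up, and Similarity(q, dj) is taken to be the plain inner product q.dj, whereas the lecture leaves the choice of weights open.

```python
from collections import defaultdict

# Hypothetical toy postings: term -> {document ID: weight of the term in that document}
postings = {
    "ant": {1: 1.0, 3: 2.0},
    "bee": {2: 1.0, 3: 1.0},
    "cat": {4: 3.0},
}

def rank(query_weights, postings):
    """Return (document ID, score) pairs for every document sharing a term with the query."""
    scores = defaultdict(float)
    # Step 1: walk the inverted list of each query term (logical OR) and
    # accumulate the inner product q.dj one term at a time.
    for term, q_weight in query_weights.items():
        for doc_id, d_weight in postings.get(term, {}).items():
            scores[doc_id] += q_weight * d_weight
    # Step 2: return the hit set R in ranked order (ties broken by document ID).
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

ranked = rank({"ant": 1.0, "bee": 1.0}, postings)
```

Document 4 never appears in the result because it shares no term with the query, so its similarity is zero and it is never touched.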

Representation of Inverted Files Document file: Stores the documents. Important for user interface design. [Repositories for the storage of document collections are covered in CS 431.] Index (word list, vocabulary) file: Stores list of terms (keywords). Designed for searching and sequential processing, e.g., for range queries, (lexicographic index). Often held in memory. Postings file: Stores an inverted list (postings list) of postings for each term. Designed for rapid merging of lists and calculation of similarities. Each list is usually stored sequentially.

Organization of Inverted Files [Figure: the index file holds each term (ant, bee, cat, dog, elk, fox, gnu, hog) with a pointer to its postings; the postings file holds the inverted lists; the inverted lists refer to the documents file.]

Decisions in Building Inverted Files: What is a Term? • Underlying character set, e.g., printable ASCII, Unicode, UTF8. • Is there a controlled vocabulary? If so, what words are included? • List of stopwords. • Rules to decide the beginning and end of words, e.g., spaces or punctuation. • Character sequences not to be indexed, e.g., sequences of numbers.
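
A few of these decisions can be made concrete in a short sketch. Everything specific here is an assumption chosen for illustration: the stopword list, the rule that a word is a run of ASCII letters and digits, the folding to lower case, and the choice to skip purely numeric sequences.

```python
import re

# Hypothetical stopword list; a real system would use a much longer one.
STOPWORDS = {"the", "a", "of", "and", "in"}

# Assumed word rule: a term is a maximal run of ASCII letters and digits,
# so spaces and punctuation mark the beginning and end of words.
TOKEN = re.compile(r"[A-Za-z0-9]+")

def terms(text):
    """Split text into index terms, dropping stopwords and pure numbers."""
    result = []
    for token in TOKEN.findall(text):
        token = token.lower()      # assumption: case is not significant
        if token.isdigit():        # character sequences not to be indexed
            continue
        if token in STOPWORDS:
            continue
        result.append(token)
    return result

example_terms = terms("The cat and the 42 dogs, in 2004!")
```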

Decisions in Building an Inverted File: Efficiency and Query Languages Some query options may require huge computation, e.g.: • Regular expressions: if inverted files are stored in lexicographic order, comp* can be processed efficiently, but *comp cannot be processed efficiently. • Boolean terms: if A and B are search terms, A or B can be processed by comparing two moderate-sized lists, but (not A) or (not B) requires two very large lists.
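
The difference between comp* and *comp can be sketched on a sorted term list. This is an illustration with a made-up vocabulary; `prefix_range` and `suffix_scan` are hypothetical helper names.

```python
from bisect import bisect_left

def prefix_range(sorted_terms, prefix):
    """Terms matching prefix*: binary search to the first match, then read sequentially."""
    lo = bisect_left(sorted_terms, prefix)
    hi = lo
    while hi < len(sorted_terms) and sorted_terms[hi].startswith(prefix):
        hi += 1
    return sorted_terms[lo:hi]

def suffix_scan(sorted_terms, suffix):
    """Terms matching *suffix: lexicographic order does not help, so every term is tested."""
    return [term for term in sorted_terms if term.endswith(suffix)]

# Hypothetical vocabulary, already in lexicographic order.
vocabulary = ["compact", "compile", "computer", "decompose", "recomp", "zebra"]

prefix_hits = prefix_range(vocabulary, "comp")
suffix_hits = suffix_scan(vocabulary, "comp")
```

The prefix query touches only the matching terms plus a logarithmic number of probes, whereas the suffix query has to examine the whole vocabulary.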

Efficiency Criteria • Storage: Inverted files are big, typically 10% to 100% the size of the collection of documents. • Update performance: It must be possible, with a reasonable amount of computation, to: (a) add a large batch of documents, (b) add a single document. • Retrieval performance: Retrieval must be fast enough to satisfy users and not use excessive resources.

Document File The documents file stores the documents that are being indexed. The documents may be: • primary documents, e.g., electronic journal articles • surrogates, e.g., catalog records or abstracts

Document File The storage of the document file may be: • Central (monolithic): all documents stored together on a single server (e.g., library catalog). • Distributed database: all documents managed together but stored on several servers (e.g., Medline, Westlaw, Dialog). • Highly distributed: documents are stored on independently managed servers (e.g., Web). Each requires: • a document ID, which is a unique identifier that can be used by the inverted file system to refer to the document, and • a location counter, which can be used to specify location within a document.

Documents File for Web Search System For web search systems: • A document is a web page. • The documents file is the web. • The document ID is the URL of the document. Indexes are built using a web crawler, which retrieves each page on the web (or a subset). After indexing, each page is discarded, unless stored in a cache. (In addition to the usual index file and postings file the indexing system stores contextual information, which will be discussed in a later lecture.)

Postings File The postings file stores the elements of a sparse matrix, the term assignment matrix. It is stored as a separate inverted list for each column, i.e., a list corresponding to each term in the index file. Each element in an inverted list is called a posting, i.e., the occurrence of a term in a document. Each list consists of one or many individual postings.
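
As a rough sketch of this structure, the inverted lists for a tiny collection can be built as follows. The documents are made up, and representing a posting as a (document ID, occurrence count) pair is an assumption for the example; the lecture does not fix the layout of a posting.

```python
from collections import defaultdict

# Hypothetical collection: document ID -> text
documents = {1: "ant bee ant", 2: "bee cat", 3: "ant cat dog"}

def build_postings(documents):
    """Return term -> inverted list of (document ID, count) postings."""
    lists = defaultdict(list)
    # Documents are processed in ID order, so every list ends up sorted by ID.
    for doc_id in sorted(documents):
        counts = {}
        for term in documents[doc_id].split():
            counts[term] = counts.get(term, 0) + 1
        for term, count in counts.items():
            lists[term].append((doc_id, count))
    return dict(lists)

postings = build_postings(documents)
```

Only the non-zero entries of the term assignment matrix are stored: "dog" has a single posting rather than a row of three cells.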

Postings File: A Linked List for Each Term [Figure: index entries 1 abacus, 2 actor, 3 aspen, 4 atoll, each pointing to a linked list of postings.] A linked list for each term is convenient to process sequentially, but slow to update when the lists are long.

Length of Postings File For a common term there may be a very large number of postings. Example: • 1,000,000,000 documents • 1,000,000 distinct words • average length 1,000 words per document • 10^12 postings By Zipf's law, the 10th ranking word occurs approximately (10^12/10)/10 = 10^10 times.
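
The arithmetic in this example can be checked directly. The sketch assumes the usual reading of Zipf's law, in which the word of rank r accounts for roughly one tenth of all occurrences divided by r; that interpretation of the first division by 10 is an assumption, not something stated on the slide.

```python
documents = 1_000_000_000
words_per_document = 1_000

# One posting per word occurrence: 10^9 documents x 10^3 words each.
total_postings = documents * words_per_document

def zipf_occurrences(total, rank, constant_divisor=10):
    """Approximate occurrences of the word of the given rank: (total / 10) / rank."""
    return total // constant_divisor // rank

tenth_word = zipf_occurrences(total_postings, 10)
```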

Postings File Merging inverted lists is the most computationally intensive task in many information retrieval systems. Since inverted lists may be long, it is important to match postings efficiently. Usually, the inverted lists will be held on disk and paged into memory for matching. Therefore algorithms for matching postings process the lists sequentially. For efficient matching, the inverted lists should all be sorted in the same sequence. Inverted lists are commonly cached to minimize disk accesses.
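
Sequential matching of two sorted inverted lists can be sketched as a single forward pass. This is an illustration only: the lists hold bare document IDs, and the function name `intersect` is made up.

```python
def intersect(list_a, list_b):
    """Return the document IDs present in both lists.

    Both lists must be sorted in the same sequence. Each list is read
    once, front to back, which suits lists that are paged in from disk.
    """
    i = j = 0
    matches = []
    while i < len(list_a) and j < len(list_b):
        if list_a[i] == list_b[j]:
            matches.append(list_a[i])
            i += 1
            j += 1
        elif list_a[i] < list_b[j]:
            i += 1   # list_a is behind, advance it
        else:
            j += 1   # list_b is behind, advance it
    return matches

# Hypothetical inverted lists for two terms.
common = intersect([3, 7, 19, 22, 56], [7, 19, 29, 45, 56])
```

The pass takes time proportional to the combined length of the two lists and never moves backwards, which is why a common sort order matters.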

Data for Calculating Weights The calculation of weights requires extra data to be held in the inverted file system. (See Lecture 7 for the use made of this data.) • For each term, tj, and document, di: fij = number of occurrences of tj in di • For each term, tj: nj = number of documents containing tj • For each document, di: mi = maximum frequency of any term in di • For the entire document file: n = total number of documents
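
The four quantities can be gathered in one pass over a collection. The sketch below uses a made-up three-document collection; the function name `collect_statistics` and the dictionary layouts are hypothetical.

```python
from collections import Counter

# Hypothetical collection: document ID -> text
documents = {1: "ant bee ant", 2: "bee cat", 3: "ant cat dog cat"}

def collect_statistics(documents):
    """Return (f, n_j, m_i, n) as defined above."""
    f = {}             # (term, document ID) -> occurrences of the term in the document
    n_j = Counter()    # term -> number of documents containing the term
    m_i = {}           # document ID -> maximum frequency of any term in the document
    for doc_id, text in documents.items():
        counts = Counter(text.split())
        for term, count in counts.items():
            f[(term, doc_id)] = count
            n_j[term] += 1
        m_i[doc_id] = max(counts.values(), default=0)
    return f, dict(n_j), m_i, len(documents)

f, n_j, m_i, n = collect_statistics(documents)
```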

Index File: Individual Records for Each Term The record for term j in the index file contains: • term j • pointer to inverted (postings) list for term j • number of documents in which term j occurs (nj)

Index Files • On disk: If an index is held on disk, search time is dominated by the number of disk accesses. • In memory: Suppose that an index has 1,000,000 distinct terms. Each index entry consists of the term, some basic statistics and a pointer to the inverted list, average 100 characters. Size of index is 100 megabytes, which can easily be held in memory of a dedicated computer.

Index File Structures: Linear Index Advantages: • Can be searched quickly, e.g., by binary search, O(log n) • Good for lexicographic processing, e.g., comp* • Convenient for batch updating • Economical use of storage Disadvantages: • Index must be rebuilt if an extra term is added

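Index File Structures: Binary Tree Input: elk, hog, bee, fox, cat, gnu, ant, dog. Resulting tree: elk is the root, with children bee (left) and hog (right); bee has children ant (left) and cat (right); hog has left child fox; cat has right child dog; fox has right child gnu.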

Binary Tree Advantages: • Can be searched quickly • Convenient for batch updating • Easy to add an extra term • Economical use of storage Disadvantages: • Less good for lexicographic processing, e.g., comp* • Tree tends to become unbalanced • If the index is held on disk, important to optimize the number of disk accesses

Binary Tree Calculation of maximum depth of tree. Illustrates importance of balanced trees. • Worst case: depth = n, i.e., O(n) • Ideal case: depth = log(n + 1)/log 2, i.e., O(log n)
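
The effect of insertion order on depth can be seen with a small sketch. It uses the eight terms from the binary tree slide; the `Node`, `insert`, `build` and `depth` names are made up for the example, and depth is counted in nodes from the root to the deepest leaf.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(root, key):
    """Insert key into an unbalanced binary search tree and return the root."""
    if root is None:
        return Node(key)
    node = root
    while True:
        if key < node.key:
            if node.left is None:
                node.left = Node(key)
                return root
            node = node.left
        elif key > node.key:
            if node.right is None:
                node.right = Node(key)
                return root
            node = node.right
        else:
            return root   # key already present

def build(keys):
    root = None
    for key in keys:
        root = insert(root, key)
    return root

def depth(node):
    """Number of nodes on the longest path from this node down to a leaf."""
    if node is None:
        return 0
    return 1 + max(depth(node.left), depth(node.right))

slide_order = ["elk", "hog", "bee", "fox", "cat", "gnu", "ant", "dog"]
mixed_depth = depth(build(slide_order))           # input order from the slide
sorted_depth = depth(build(sorted(slide_order)))  # worst case: sorted input
```

With the mixed input order the tree has depth 4, while inserting the same terms in sorted order degenerates into a chain of depth 8, one node per term.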

Right Threaded Binary Tree Threaded tree: A binary search tree in which each node uses an otherwise-empty left child link to refer to the node's in-order predecessor and an empty right child link to refer to its in-order successor. Right-threaded tree: A variant of a threaded tree in which only the right thread, i.e., the link to the successor, of each node is maintained. Can be used for lexicographic processing. A good data structure when the index is held in memory. Reference: Knuth, vol. 1, section 2.3.1, page 325.
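
A right-threaded tree can be sketched as below. This is an illustrative implementation, not the one from Knuth: the class and function names are made up, and a boolean flag is assumed as the way to tell a thread from a real right child.

```python
class TNode:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None              # real right child, or thread to the successor
        self.right_is_thread = False

def insert(root, key):
    """Insert key, maintaining right threads, and return the root."""
    new = TNode(key)
    if root is None:
        return new
    node = root
    while True:
        if key < node.key:
            if node.left is None:
                # New left child: its in-order successor is its parent.
                new.right = node
                new.right_is_thread = True
                node.left = new
                return root
            node = node.left
        elif key > node.key:
            if node.right is None or node.right_is_thread:
                # New right child inherits the parent's old successor link.
                new.right = node.right
                new.right_is_thread = node.right_is_thread
                node.right = new
                node.right_is_thread = False
                return root
            node = node.right
        else:
            return root   # key already present

def leftmost(node):
    while node.left is not None:
        node = node.left
    return node

def in_order(root):
    """Yield keys in lexicographic order without recursion or a stack."""
    node = leftmost(root) if root is not None else None
    while node is not None:
        yield node.key
        if node.right_is_thread:
            node = node.right            # follow the thread to the successor
        elif node.right is not None:
            node = leftmost(node.right)  # successor is leftmost in the right subtree
        else:
            node = None                  # last node in order

root = None
for term in ["elk", "hog", "bee", "fox", "cat", "gnu", "ant", "dog"]:
    root = insert(root, term)
ordered_terms = list(in_order(root))
```

Because every node can reach its successor directly, a range of terms can be read off in order once the first one has been located, which is what makes the structure usable for lexicographic processing.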

Right Threaded Binary Tree From: Robert F. Rossa

B-trees B-tree of order m: A balanced, multiway search tree: • Each node stores many keys • Root has between 2 and 2m keys. All other internal nodes have between m and 2m keys. • If ki is the ith key in a given internal node, then all keys in the (i-1)th child are smaller than ki and all keys in the ith child are bigger than ki • All leaves are at the same depth

B-trees B-tree example (order 2) [Figure: example B-tree of order 2, with keys 50 and 65 in the root.] Every arrow points to a node containing between 2 and 4 keys. A node with k keys has k + 1 pointers.
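
Searching a B-tree can be sketched as follows. The small two-level tree is hand-built for illustration and is not a reconstruction of the figure; `BTreeNode` and `btree_search` are hypothetical names, and insertion with node splitting is deliberately left out.

```python
from bisect import bisect_left

class BTreeNode:
    def __init__(self, keys, children=None):
        self.keys = keys                  # sorted keys held in this node
        self.children = children or []    # empty for a leaf, else len(keys) + 1 children

def btree_search(root, key):
    """Return (found, nodes_visited) for a lookup starting at root."""
    node = root
    visited = 0
    while node is not None:
        visited += 1
        i = bisect_left(node.keys, key)   # position of the first key >= target
        if i < len(node.keys) and node.keys[i] == key:
            return True, visited
        if not node.children:
            return False, visited         # reached a leaf without a match
        node = node.children[i]           # descend between keys i-1 and i
    return False, visited

# Hypothetical order-2 tree: every node holds between 2 and 4 keys.
tree = BTreeNode(
    [10, 19, 35],
    [
        BTreeNode([1, 5, 8, 9]),
        BTreeNode([12, 14, 18]),
        BTreeNode([21, 24, 28]),
        BTreeNode([36, 47]),
    ],
)

hit = btree_search(tree, 24)
miss = btree_search(tree, 20)
at_root = btree_search(tree, 19)
```

Each step down the tree inspects one node, so when nodes correspond to disk blocks the number of disk accesses is bounded by the depth of the tree.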

B+-tree • A B-tree is used as an index • Data is stored in the leaves of the tree, known as buckets Example: B+-tree of order 2, bucket size 4 [Figure: index nodes with keys 50 65, 10 25, 55 59, 70 81 90, ...; buckets holding D9, D51 ... D54, D66 ... D81, ...] (Implementation of B+-trees is covered in CS 432.)
