Chapter 8 Indexing and Searching

Presentation Transcript

  1. Chapter 8: Indexing and Searching. Hsin-Hsi Chen, Department of Computer Science and Information Engineering, National Taiwan University

  2. Introduction
  • Searching
    • Online text searching: scan the text sequentially
    • Indexed searching: build data structures over the text to speed up the search
    • Suited to semi-static collections: updated at reasonably regular intervals
  • Indexing techniques
    • Inverted files
    • Suffix (PAT) arrays
    • Signature files

  3. Assumptions
  • n: the size of the text database
  • m: the length of the search pattern (m < n)
  • M: the amount of memory available
  • n': the size of the texts that are modified (n' < n)
  • Experiments
    • 32-bit Sun UltraSparc-1, 167 MHz, with 64 MB of RAM
    • TREC-2 collection (WSJ, DOE, FR, ZIFF, AP)

  4. File Structures for IR
  • Lexicographical indices (indices that are sorted)
    • Inverted files
    • Patricia (PAT) trees (suffix trees and arrays)
  • Cluster file structures (see document clustering in Chapter 7)
  • Indices based on hashing
    • Signature files

  5. Inverted Files

  6. Inverted Files
  • Each document is assigned a list of keywords or attributes.
  • Each keyword (attribute) is associated with operational relevance weights.
  • An inverted file is the sorted list of keywords (attributes), with each keyword having links to the documents containing that keyword.
  • Penalty
    • The size of an inverted file ranges from 10% to 100% or more of the size of the text itself.
    • The index must be updated as the data set changes.
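The idea above can be sketched in a few lines of Python. This is a minimal illustration, not the chapter's implementation; the sample documents, the tokenization, and the function name are all made up for the example.

```python
# Minimal inverted-file sketch: each keyword maps to the sorted list of
# documents that contain it, and the keywords themselves are kept sorted.

def build_inverted_file(docs):
    """docs: dict mapping doc id -> text. Returns {keyword: [doc ids]}."""
    index = {}
    for doc_id in sorted(docs):
        for word in set(docs[doc_id].lower().split()):
            index.setdefault(word, []).append(doc_id)
    # Return the keywords in sorted (lexicographic) order.
    return dict(sorted(index.items()))

docs = {1: "this is a text",
        2: "a text has many words",
        3: "words are made from letters"}
index = build_inverted_file(docs)
print(index["text"])   # [1, 2]
print(index["words"])  # [2, 3]
```

Note that appending document ids in increasing doc-id order keeps each posting list sorted without an extra sort, which is what makes later merge operations cheap.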

  7. Inverted File Example
  Text: "This is a text. A text has many words. Words are made from letters." (word beginnings at character positions 1 6 9 11 17 19 24 28 33 40 46 50 55 60)
  Vocabulary and occurrences (character positions): letters: 60; made: 50; many: 28; text: 11, 19; words: 33, 40
  • Addressing granularity
    • Inverted list: word positions or character positions
    • Inverted file: documents
  • Heaps' law: the vocabulary grows as O(n^β), β: 0.4~0.6
    • Vocabulary for 1 GB of the TREC-2 collection: 5 MB (before stemming and normalization)
  • Occurrences: the extra space is O(n), about 30% ~ 40% of the text size

  8. Block Addressing
  • Block: fixed-size blocks, files, documents, Web pages, … (block = retrieval unit?)
  • Full inverted indices: point to the exact occurrences
  • Block addressing: point to the blocks where the word appears
    • Pointers are smaller
    • About 5% overhead over the text size
  Example text split into four blocks: "This is a text. | A text has many words. | Words are made | from letters."
  Inverted index (vocabulary with block occurrences): letters: 4; made: 4; many: 2; text: 1, 2; words: 3
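The block-addressing scheme above can be sketched as follows. The blocks mirror the slide's example; the function name and tokenization are illustrative assumptions.

```python
# Block-addressing sketch: split the text into blocks and store, per word,
# block numbers only (one pointer per block, so pointers are smaller).

def build_block_index(blocks):
    """blocks: list of block texts (block numbers start at 1)."""
    index = {}
    for block_no, text in enumerate(blocks, start=1):
        for word in text.lower().replace(".", "").split():
            postings = index.setdefault(word, [])
            if not postings or postings[-1] != block_no:  # one pointer per block
                postings.append(block_no)
    return index

blocks = ["This is a text.", "A text has many words.",
          "Words are made", "from letters."]
idx = build_block_index(blocks)
print(idx["text"])   # [1, 2]
print(idx["words"])  # [2, 3]
```

To recover exact positions for, say, a phrase query, a scan inside the matching blocks is still needed; that is the trade-off the slide mentions.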

  9. Sorted array implementation of an inverted file: the documents in which each keyword occurs

  10. Index Size Trade-offs
  • Full inversion: all words, exact positions, 4-byte pointers
  • Addressing documents: 1, 2, or 3 bytes per pointer, depending on the text size (documents of about 10 KB)
  • Addressing blocks: 1 or 2 bytes per pointer, independent of the text size
  • In each case, either all words are indexed or stop words are not indexed

  11. Searching: Three General Steps
  • Vocabulary search
    • Identify the words and patterns in the query
    • Search for them in the vocabulary
  • Retrieval of occurrences
    • Retrieve the lists of occurrences of all the words
  • Manipulation of occurrences
    • Solve phrase, proximity, or Boolean operations
    • Find the exact word positions when block addressing is used
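The three steps can be sketched over the slide-7 example. The sorted vocabulary array, parallel occurrence lists, and function names below are illustrative assumptions about the index layout.

```python
# The three search steps over an inverted file: (1) vocabulary search by
# binary search on a sorted array, (2) retrieval of occurrences, and
# (3) manipulation of occurrences (here, a simple Boolean AND).
from bisect import bisect_left

vocabulary = ["letters", "made", "many", "text", "words"]   # sorted array
occurrences = [[60], [50], [28], [11, 19], [33, 40]]        # word positions

def lookup(word):
    """Steps 1 and 2: find the word in the vocabulary, return its occurrences."""
    i = bisect_left(vocabulary, word)
    if i < len(vocabulary) and vocabulary[i] == word:
        return occurrences[i]
    return []

def and_query(w1, w2):
    """Step 3: Boolean AND -- both words must occur somewhere in the text."""
    return bool(lookup(w1)) and bool(lookup(w2))

print(lookup("text"))              # [11, 19]
print(and_query("text", "words"))  # True
```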

  12. Structures Used in Inverted Files
  • Sorted arrays
    • Store the list of keywords in a sorted array
    • Search with a standard binary search
    • Advantage: easy to implement
    • Disadvantage: updating the index is expensive
  • B-trees
  • Tries
  • Hashing structures
  • Combinations of these structures

  13. Trie
  Text: "This is a text. A text has many words. Words are made from letters." (word beginnings at positions 1 6 9 11 17 19 24 28 33 40 46 50 55 60)
  Vocabulary trie, branching on successive letters ('l', 'm' then 'a'/'n', 't', 'w'):
    letters: 60; made: 50; many: 28; text: 11, 19; words: 33, 40
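A vocabulary trie like the one on this slide can be sketched with nested dictionaries. The "$" end-of-word marker and the function names are conventions invented for this example, not from the chapter.

```python
# Vocabulary trie sketch: branch on successive characters; the node reached
# at the end of a word stores that word's occurrence positions.

def trie_insert(root, word, position):
    node = root
    for ch in word:
        node = node.setdefault(ch, {})
    node.setdefault("$", []).append(position)  # "$" marks end of word

def trie_lookup(root, word):
    node = root
    for ch in word:
        if ch not in node:
            return []
        node = node[ch]
    return node.get("$", [])

root = {}
for word, pos in [("letters", 60), ("made", 50), ("many", 28),
                  ("text", 11), ("text", 19), ("words", 33), ("words", 40)]:
    trie_insert(root, word, pos)
print(trie_lookup(root, "text"))  # [11, 19]
```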

  14. B-trees
  [Figure: a B-tree over the vocabulary. Internal nodes hold separator keys (e.g., F, M; Al, Br, E, Gr, H, Ja, L; Rut, Uni); leaves hold terms with their posting counts (e.g., … Afgan 2 … Russian 9, Ruthenian 1 …)]

  15. Sorted Arrays
  1. The input text is parsed into a list of words along with their locations in the text (a time- and storage-consuming operation).
  2. This list is inverted from a list of terms in location order to a list of terms in alphabetical order.
  3. Term weights are added, and the files are reorganized or compressed.
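The three steps can be sketched directly. Word numbers stand in for locations, and the function name is an illustrative assumption.

```python
# Sorted-array inversion sketch following the three steps above:
# parse into (term, location) pairs, sort into alphabetical order, group.

def invert(text):
    # Step 1: parse the text into a word list with locations (word numbers).
    pairs = [(word, loc) for loc, word in enumerate(text.lower().split(), start=1)]
    # Step 2: invert from location order to alphabetical order.
    pairs.sort()
    # Step 3: group locations per term (weights could be attached here).
    index = {}
    for term, loc in pairs:
        index.setdefault(term, []).append(loc)
    return index

idx = invert("to be or not to be")
print(idx["to"])  # [1, 5]
```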

  16. Inversion of Word List ("report" appears in two records)

  17. Dictionary and Postings File
  • Idea: the file to be searched should be as short as possible
  • Split the single file into two pieces: the dictionary (vocabulary) and the postings file (occurrences)
  • Example data set: 38,304 records, 250,000 unique terms, 88 postings/record; each posting is a (document #, frequency) pair

  18. Producing an Inverted File for Large Data Sets without Sorting
  • Idea: avoid an explicit sort by using a right-threaded binary tree
  • Each tree node stores the current number of postings for its term and the storage location of the postings list
  • To produce the inverted file, traverse the binary tree and the linked postings lists

  19. Indexing Statistics
  • Final index: only 8% of the input text size for the 50 MB database; 14% of the input text size for the larger (2 GB) database
  • Working storage (the storage needed to build the index): not much larger than the size of the final index for the new indexing method

  20. A Fast Inversion Algorithm
  • Principle 1: large primary memories are available. If databases can be split into memory loads that can be rapidly processed and then combined, the overall cost will be minimized.
  • Principle 2: exploit the inherent order of the input data. It is very expensive to use polynomial or even n log n sorting algorithms for large files.

  21. FAST-INV algorithm concept: postings/pointers (see p. 22)

  22. Sample Document Vector
  • Each entry is a document number paired with a concept number (one concept number for each unique word)
  • Similar to the document-word list shown in p. 16
  • The concept numbers are sorted within document numbers, and document numbers are sorted within the collection

  23. FAST-INV Notation
  • HCN = highest concept number in the dictionary (the total number of concepts)
  • L = number of <document, concept> pairs in the collection
  • M = available primary memory size, in bytes; M >> HCN but M < L
  • Split the collection into j parts so that L/j < M and each part will fit into primary memory; approximately HCN/j concepts are associated with each part
  • LL = length of the current load (8 bytes for each concept-weight pair)
  • S = spread of concept numbers in the current load (4 bytes for each count of postings)
  • Memory constraint: 8*LL + 4*S < M

  24. Preparation
  1. Allocate an array, con_entries_cnt, of size HCN.
  2. For each <doc#, con#> entry in the document vector file: increment con_entries_cnt[con#]
  Running count as the pairs are read: start 0; after (1,2), (1,4): 2; after (2,3): 3; after (3,1), (3,2), (3,5): 6; after (4,2), (4,3): 8; ...
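The counting pass can be sketched with the slide's own pairs. HCN = 5 is assumed from the example; the array is given a dummy slot 0 so that concept numbers index it directly.

```python
# FAST-INV preparation sketch: count the postings per concept in a
# con_entries_cnt array (1-based concept numbers via a dummy slot 0).

def count_concept_entries(pairs, hcn):
    con_entries_cnt = [0] * (hcn + 1)
    for _doc, con in pairs:
        con_entries_cnt[con] += 1
    return con_entries_cnt

pairs = [(1, 2), (1, 4), (2, 3), (3, 1), (3, 2), (3, 5), (4, 2), (4, 3)]
cnt = count_concept_entries(pairs, hcn=5)
print(cnt[1:])  # postings per concept 1..5: [1, 3, 2, 1, 1]
```

These per-concept counts are exactly what the next slide's load-table construction consumes.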

  25. Preparation (continued)
  5. For each <con#, count> pair obtained from con_entries_cnt: if there is no room for the documents with this concept to fit in the current load, create an entry in the load table and initialize the next load entry; otherwise, update the information for the current load table entry.

  26. Generating and Using the Load Table
  • The load table records the range of concepts for each primary load
  • LL: length of the current load; S: end concept - start concept + 1
  • Constraint: (space for a concept/weight pair) * LL + (space per concept to store the count of postings) * S < M
  • Note: S is needed to manage the bookkeeping data that is added
  • How is the load table produced? Read each (Doc, Con) pair and look up Con in the load table to determine which load the pair falls into
  • Invert each load file in turn; the offset in the CONPTR table gives the position where each entry should be written (copy rather than sort)

  27. PAT Trees and PAT Arrays (Suffix Trees and Suffix Arrays)

  28. PAT Trees and PAT Arrays
  • Problems of traditional IR models
    • A document and word structure is assumed.
    • Keywords must be extracted from the text (indexing).
    • Queries are restricted to keywords.
  • New indices for text
    • A text is regarded as one long string.
    • Each position corresponds to a semi-infinite string (sistring).
    • Suffix: a string that goes from a text position to the end of the text.
    • Each suffix is uniquely identified by its position.
    • No structures and no keywords are required.

  29. Suffixes of a Text
  Text: "This is a text. A text has many words. Words are made from letters."
  Suffixes (one per retrievable index point):
    "text. A text has many words. Words are made from letters."
    "text has many words. Words are made from letters."
    "many words. Words are made from letters."
    "Words are made from letters."
    "made from letters."
    "letters."
  Index points are selected from the text; they point to the beginnings of the text positions that are retrievable.

  30. PATRICIA
  • Trie
    • Branch (decision) nodes: search decision makers
    • Element nodes: real data
  • If branch decisions are made on each bit, a complete binary tree is formed whose depth equals the number of bits of the longest string
  • Many element nodes and branch nodes are null

  31. PATRICIA (Continued)
  • Compressed digital search trie
    • The null element nodes and branch nodes are removed
    • An additional field denoting the comparing bit for the branching decision is included in each decision node
    • A match between the search results and their search keys is required, because only some of the bits are compared during the search process

  32. PATRICIA (Continued)
  • Practical Algorithm to Retrieve Information Coded in Alphanumeric
  • Augmented branch node: an additional field for storing elements is included in each branch node
    • Each element is stored in an upper node or in itself
  • An additional root node: note that the number of leaf nodes is always greater than the number of internal nodes by one

  33. PAT-tree
  • PATRICIA + semi-infinite strings
  • For a text T with n basic units u1 u2 … un, the sistrings are u1 u2 … un …, u2 u3 … un …, u3 u4 … un …, …
    • Each has an end to the left but none to the right
  • Store the starting positions of the semi-infinite strings in the text using PATRICIA

  34. Semi-infinite Strings
  • Example
    Text:        Once upon a time, in a far away land …
    sistring 1:  Once upon a time …
    sistring 2:  nce upon a time …
    sistring 8:  on a time, in a …
    sistring 11: a time, in a far …
    sistring 22: a far away land …
  • Comparing the sistrings: 22 < 11 < 2 < 8 < 1
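The slide's ordering can be reproduced in Python. A sistring is just the text from a given position onward, so comparing sistrings is string comparison; a case-insensitive comparison is assumed here so that "Once…" sorts after "on a…", matching the slide.

```python
# Sistring comparison sketch: a sistring is identified by its 1-based
# starting position; comparing sistrings compares the text's suffixes.

text = "Once upon a time, in a far away land"

def sistring(pos):
    """Return the semi-infinite string starting at 1-based position pos."""
    return text[pos - 1:]

# Reproduce the slide's ordering (case-insensitive comparison assumed).
order = sorted([1, 2, 8, 11, 22], key=lambda p: sistring(p).lower())
print(order)  # [22, 11, 2, 8, 1]
```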

  35. PAT Tree
  • PAT tree: a Patricia tree constructed over all the possible sistrings of a text
  • Patricia tree
    • A digital tree where the individual bits of the keys are used to decide on the branching
    • Each internal node indicates which bit of the query is used for branching
      • Either an absolute bit position or a count of the number of bits to skip
    • Each external node is a sistring, i.e., the integer displacement

  36. Example
  Text: 01100100010111 …
    sistring 1: 01100100010111 …
    sistring 2: 1100100010111 …
    sistring 3: 100100010111 …
    sistring 4: 00100010111 …
    sistring 5: 0100010111 …
    sistring 6: 100010111 …
    sistring 7: 00010111 …
    sistring 8: 0010111 …
  [Figure: the PAT tree over sistrings 1-8. Each internal node carries a skip counter and pointer, the total displacement of the bit to be inspected; each external node is a sistring, identified by its integer displacement]

  37. Search Example
  Text: 01100100010111 …, with sistrings 1-8 as on the previous slide
  Search for 00101
  [Figure: the search follows the skip counters down the PAT tree to an external node, then verifies the candidate against the text]
  Note: four bits are needed to distinguish sistrings 3 (100100…) and 6 (100010…)

  38. Suffix Trie and Suffix Tree
  Text: "This is a text. A text has many words. Words are made from letters." (word beginnings at positions 1 6 9 11 17 19 24 28 33 40 46 50 55 60)
  • Suffix trie over the word-beginning suffixes (letters: 60; made: 50; many: 28; text: 11, 19; words: 33, 40)
    • Space overhead: 120% ~ 240% over the text size
  • Suffix tree: the trie with unary paths compressed; each internal node stores the character position to inspect (e.g., 1, 3, 5, 6)

  39. PAT Trees Represented as Arrays
  • Indirect binary search vs. sequential search: keep the external nodes in the bucket in the same relative order as they would be in the tree
  • Example: for the text 01100100010111 …, the PAT array is 7 4 8 5 1 6 3 2 (sistring positions in lexicographical order)

  40. From Suffix Tree to Suffix Array
  Text: "This is a text. A text has many words. Words are made from letters." (word beginnings at positions 1 6 9 11 17 19 24 28 33 40 46 50 55 60)
  (1) Suffix tree: 120% ~ 240% space overhead
  (2) Suffix array: the suffix positions in lexicographical order; about 40% overhead
  (3) Supra-index: sampled suffixes over the suffix array, used as a first level of the search
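A suffix array over the slide's text can be sketched directly. Index points at word beginnings and a case-insensitive comparison are assumptions chosen to match the example; a real implementation would use a linear-time construction rather than comparison sorting.

```python
# Suffix-array sketch: the word-beginning positions of the text, sorted
# lexicographically by the suffix that starts at each position.

text = "This is a text. A text has many words. Words are made from letters."

def word_suffix_array(text):
    # Index points: 1-based positions where a word begins.
    points = [i + 1 for i, ch in enumerate(text)
              if ch.isalpha() and (i == 0 or not text[i - 1].isalpha())]
    # Sort the positions by the suffix starting there (case-insensitive).
    return sorted(points, key=lambda p: text[p - 1:].lower())

sa = word_suffix_array(text)
print(sa[:5])  # [17, 9, 46, 55, 24]
```

Position 17 ("A text has …") precedes position 9 ("a text. A …") because the space after "text" compares below the period, which is exactly the "sorted by the text following the word" property used on the next slides.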

  41. Difference between a Suffix Array and an Inverted List
  • Suffix array: the occurrences of each word are sorted lexicographically by the text following the word
  • Inverted list: the occurrences of each word are sorted by text position
  Text: "This is a text. A text has many words. Words are made from letters." (word positions 1 6 9 11 17 19 24 28 33 40 46 50 55 60)

  42. Indexing Points
  • The examples above assume that every position in the text is indexed: n external nodes, one for each position in the text
  • For word and phrase searches, only the sistrings at the beginnings of words are necessary
  • Trade-off between the size of the index and the search requirements

  43. Prefix Searching
  • Idea: every subtree of the PAT tree contains all the sistrings with a given prefix.
  • Search time is proportional to the query length: exhaust the prefix or reach an external node.
  • Example: search for the prefix "10100" and its answer.

  44. Searching PAT Trees as Arrays
  • Prefix searching and range searching: do an indirect binary search over the array, with the results of the comparisons being less than, equal, or greater than
  • Example: search for the prefix 100 and its answer
  Text: 01100100010111 …; PAT array: 7 4 8 5 1 6 3 2
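The indirect binary search can be sketched on the slide's own bit text and PAT array. The helper names are illustrative; materializing all suffix keys up front is a simplification of the indirect comparison against the text.

```python
# Prefix search over a PAT (suffix) array by binary search: all sistrings
# with a given prefix occupy a contiguous range of the array.
from bisect import bisect_left

text = "01100100010111"
pat_array = [7, 4, 8, 5, 1, 6, 3, 2]  # sistring positions, lexicographic order

def suffix(pos):
    return text[pos - 1:]

def prefix_search(prefix):
    """All text positions whose sistring starts with the given prefix."""
    keys = [suffix(p) for p in pat_array]
    lo = bisect_left(keys, prefix)          # first suffix >= prefix
    hi = lo
    while hi < len(keys) and keys[hi].startswith(prefix):
        hi += 1
    return pat_array[lo:hi]

print(prefix_search("100"))  # [6, 3]
```

The answer [6, 3] matches the slide's example: sistrings 6 (100010111…) and 3 (100100010111…) are the ones beginning with 100.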

  45. Proximity Searching
  • Find all places where s1 is at most a fixed number of characters (given by the user) away from s2, e.g., "in 4 ation" ==> insulation, international, information
  • Algorithm
    1. Search for s1 and s2.
    2. Select the smaller answer set of the two and sort it by position.
    3. Traverse the unsorted answer set, searching for every position in the sorted set and checking whether the distance between positions satisfies the proximity condition.
  • Sort + traverse time: O((m1 + m2) log m1), assuming m1 < m2
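The algorithm above can be sketched as follows; the position lists and the function name are illustrative, and pos1 is assumed to be the smaller answer set.

```python
# Proximity-search sketch: sort the smaller answer set, then for each
# position in the larger set, binary-search for positions within distance k.
from bisect import bisect_left

def proximity(pos1, pos2, k):
    """Pairs (p1, p2) with |p1 - p2| <= k; pos1 is the smaller answer set."""
    sorted_small = sorted(pos1)                 # step 2: sort the smaller set
    matches = []
    for p in pos2:                              # step 3: traverse the other set
        i = bisect_left(sorted_small, p - k)    # first candidate >= p - k
        while i < len(sorted_small) and sorted_small[i] <= p + k:
            matches.append((sorted_small[i], p))
            i += 1
    return matches

print(proximity([10, 50], [12, 90, 47], k=4))  # [(10, 12), (50, 47)]
```

Each of the m2 traversal steps costs O(log m1), which gives the (m1 + m2) log m1 bound quoted on the slide once the initial sort is included.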

  46. Range Searching
  • Search for all the strings within a certain lexicographical range.
  • Example: the range "abc" .. "acc" contains "abracadabra" and "acacia" but not "abacus" or "acrimonious".
  • Algorithm
    1. Search for each end of the defining interval.
    2. Collect all the subtrees between (and including) them.
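Over a sorted vocabulary array, the same range search is two binary searches; the terms are the slide's examples and the function name is illustrative.

```python
# Lexicographical range search sketch: binary-search both ends of the range
# in a sorted term array and report everything between them.
from bisect import bisect_left, bisect_right

def range_search(sorted_terms, low, high):
    lo = bisect_left(sorted_terms, low)    # first term >= low
    hi = bisect_right(sorted_terms, high)  # first term > high (high inclusive)
    return sorted_terms[lo:hi]

terms = sorted(["abracadabra", "acacia", "abacus", "acrimonious"])
print(range_search(terms, "abc", "acc"))  # ['abracadabra', 'acacia']
```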

  47. Searching a Suffix Array
  • Find all suffixes S with P1 ≤ S < P2
  • Binary search for both limiting patterns in the suffix array
  • Report all the elements lying between both positions

  48. Longest Repetition Searching
  • The longest repetition is the match between two different positions of a text where this match is the longest in the entire text, e.g., in the text 01100100010111
  • In the PAT tree, the tallest internal node gives a pair of sistrings that match for the greatest number of characters
  Sistrings 1-8 of the text are as in the example of slide 36
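The suffix-array analogue of the tallest internal node is the largest longest-common-prefix between lexicographically adjacent suffixes; a quadratic sketch (not the chapter's method, and with an invented function name) is:

```python
# Longest-repetition sketch: the longest string occurring at two different
# positions is the maximum LCP of adjacent suffixes in sorted order.

def longest_repetition(text):
    suffixes = sorted(range(len(text)), key=lambda i: text[i:])
    best = ""
    for a, b in zip(suffixes, suffixes[1:]):  # adjacent suffixes share the LCP
        s, t = text[a:], text[b:]
        lcp = 0
        while lcp < min(len(s), len(t)) and s[lcp] == t[lcp]:
            lcp += 1
        if lcp > len(best):
            best = s[:lcp]
    return best

print(longest_repetition("01100100010111"))  # 0010
```

For the slide's text the answer is "0010" (at positions 4 and 8, counting from 1), the pair of sistrings that match for the greatest number of bits.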

  49. "Most Significant" or "Most Frequent" Matching
  • Find the most frequently occurring strings within the text database, e.g., the most frequent trigram
  • To find the most frequent trigram, find the largest subtree at a distance of 3 characters from the root
  • Example: bits 1, 2, 3 are the same for sistrings 100100010111 and 100010111

  50. Building PAT Trees as Patricia Trees
  • Bucketing of external nodes: collect more than one external node per bucket
  • A bucket replaces any subtree whose size is less than a certain constant (b), saving a significant number of internal nodes
  • The external nodes inside a bucket do not have any structure associated with them, which increases the number of comparisons for each search