Introduction to Data Structures and Algorithms Related to Information Retrieval

Introduction to Data Structures and Algorithms Related to Information Retrieval (Chapter 2, Frakes and Baeza-Yates)

Contents • Introduction • BASIC CONCEPTS • Strings • Similarity between Strings • Regular Expressions • Finite Automata • DATA STRUCTURES • Search Trees • Hashing • Digital Trees • Retrieval ALGORITHMS • Filtering Algorithms • Indexing Algorithms

1. Introduction • Information retrieval (IR) is a multidisciplinary field. In this chapter we study data structures and algorithms used in the implementation of IR systems. In this sense, many contributions from theoretical computer science have practical and regular use in IR systems. • The first section covers some basic concepts: strings, regular expressions, and finite automata. In section 2.3 we have a look at the three classical foundations of structuring data in IR: search trees, hashing, and digital trees. We give the main performance measures of each structure and the associated trade-offs. In section 2.4 we attempt to classify IR algorithms based on their actions. We distinguish three main classes of algorithms and give examples of their use. These are retrieval, indexing, and filtering algorithms.

2. BASIC CONCEPTS • We start by reviewing basic concepts related to text: strings, regular expressions (as a general query language), and finite automata (as the basic text processing machine). Strings appear everywhere, and the simplest model of text is a single long string. Regular expressions provide a powerful query language, of which word searching and Boolean expressions are particular cases. Finite automata are used for string searching (in software or in hardware) and for various kinds of text filtering and processing.

3. Strings • We use Σ to denote the alphabet (a set of symbols). We say that the alphabet is finite if there exists a bound on the size of the alphabet, denoted by |Σ|. Otherwise, if we do not know a priori a bound on the alphabet size, we say that the alphabet is arbitrary. A string over an alphabet Σ is a finite-length sequence of symbols from Σ. The empty string (ε) is the string with no symbols. If x and y are strings, xy denotes the concatenation of x and y. If ω = xyz is a string, then x is a prefix and z a suffix of ω. The length of a string x, written |x|, is the number of symbols of x. Any contiguous sequence of letters y from a string ω is called a substring. If the letters do not have to be contiguous, we say that y is a subsequence.
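The definitions above can be checked on a small example. The following is a minimal sketch, not part of the original chapter; the sample string "retrieval" and the helper name is_subsequence are assumptions chosen only for illustration.

```python
def is_subsequence(y: str, w: str) -> bool:
    """Return True if the symbols of y occur in w in the same order (gaps allowed)."""
    remaining = iter(w)
    # 'ch in remaining' consumes the iterator up to the first match,
    # so later symbols of y can only match later positions of w.
    return all(ch in remaining for ch in y)


w = "retrieval"
x, y, z = "re", "trie", "val"        # w = xyz

assert x + y + z == w                # concatenation
print(w.startswith(x))               # x is a prefix of w    -> True
print(w.endswith(z))                 # z is a suffix of w    -> True
print(y in w)                        # y is a substring of w -> True
print(len(w))                        # |w|                   -> 9
print("rtvl" in w)                   # not contiguous        -> False
print(is_subsequence("rtvl", w))     # but a subsequence     -> True
```

Every substring is also a subsequence, but not the other way round, as the last two lines show.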

4. Similarity between Strings • When manipulating strings, we need to know how similar a pair of strings is. For this purpose, several similarity measures have been defined. Each similarity model is defined by a distance function d that, for any strings x, y, and z, satisfies the following properties: d(x, x) = 0, d(x, y) ≥ 0, and d(x, z) ≤ d(x, y) + d(y, z).

Contd.. • The Hamming distance is defined over strings of the same length. The function d is defined as the number of positions at which the two strings have different symbols (the number of mismatches). For example, d(text, that) = 2. The edit distance is defined as the minimal number of symbols that it is necessary to insert, delete, or substitute to transform one string into the other.
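Both distances are easy to compute for short strings. The sketch below is illustrative only: it assumes the edit distance counts single-symbol insertions, deletions, and substitutions with unit cost (the usual Levenshtein formulation), and the function names are hypothetical.

```python
def hamming(x: str, y: str) -> int:
    """Number of positions where x and y differ; defined only for equal lengths."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs strings of the same length")
    return sum(a != b for a, b in zip(x, y))


def edit_distance(x: str, y: str) -> int:
    """Minimal number of unit-cost insertions, deletions, and substitutions
    turning x into y, computed row by row with dynamic programming."""
    prev = list(range(len(y) + 1))           # distances from "" to each prefix of y
    for i, a in enumerate(x, start=1):
        curr = [i]                           # distance from x[:i] to ""
        for j, b in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,                 # delete a
                curr[j - 1] + 1,             # insert b
                prev[j - 1] + (a != b),      # substitute a by b, or match
            ))
        prev = curr
    return prev[-1]


print(hamming("text", "that"))        # 2, the example from the text
print(edit_distance("text", "tax"))   # 2: substitute e by a, delete the final t
```

The dynamic program uses time proportional to |x| times |y| and keeps only two rows in memory.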

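5. Regular Expressions The plus or positive closure of a language L is defined by L+ = LL*, where L* is the star (Kleene) closure of L. We use L(r) to represent the set of strings in the language denoted by the regular expression r. The regular expressions over Σ and the languages that they denote (regular sets or regular languages) are defined recursively as follows: • ∅ is a regular expression and denotes the empty set. • ε (the empty string) is a regular expression and denotes the set {ε}. • For each symbol a in Σ, a is a regular expression and denotes the set {a}. • If p and q are regular expressions, then p + q (union), pq (concatenation), and p* (star) are regular expressions that denote L(p) ∪ L(q), L(p)L(q), and L(p)*, respectively.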
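The identity L+ = LL* can be tried out with Python's standard re module. This is only a sketch on a handful of sample strings, not a proof; the helper in_language and the sample list are assumptions made for illustration. Note that Python writes union as `|`, whereas textbook notation often writes it as `+`, and that Python's `+` operator is the positive closure.

```python
import re


def in_language(pattern: str, s: str) -> bool:
    """True if the whole string s belongs to the language denoted by pattern."""
    return re.fullmatch(pattern, s) is not None


samples = ["", "ab", "abab", "aba", "ba"]

# Positive closure of {ab} versus concatenation of {ab} with its star closure.
plus = [s for s in samples if in_language(r"(ab)+", s)]
concat_star = [s for s in samples if in_language(r"(ab)(ab)*", s)]

print(plus)                           # ['ab', 'abab']
print(plus == concat_star)            # True
print(in_language(r"(ab)*", ""))      # True: the star closure contains the empty string
print(in_language(r"a|b", "b"))       # True: union of {a} and {b}
```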

6. Finite Automata • A finite automaton is a mathematical model of a system. The automaton can be in any one of a finite number of states and is driven from state to state by a sequence of discrete inputs. Figure 2.1 depicts an automaton reading its input from a tape.
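As a small illustration of an automaton being driven from state to state by its input, the sketch below simulates a deterministic finite automaton. The particular machine (it accepts binary strings with an even number of 1s), the state names, and the function run_dfa are all assumptions chosen for the example; they are not taken from Figure 2.1.

```python
def run_dfa(transitions, start, accepting, tape):
    """Read the tape one symbol at a time and report whether the final
    state is an accepting state."""
    state = start
    for symbol in tape:
        state = transitions[(state, symbol)]   # one discrete input, one state change
    return state in accepting


# Two states: "even" and "odd" number of 1s seen so far.
EVEN_ONES = {
    ("even", "0"): "even",
    ("even", "1"): "odd",
    ("odd", "0"): "odd",
    ("odd", "1"): "even",
}

print(run_dfa(EVEN_ONES, "even", {"even"}, "0110"))   # True  (two 1s)
print(run_dfa(EVEN_ONES, "even", {"even"}, "0111"))   # False (three 1s)
```

The running time is linear in the length of the tape, and the memory used does not depend on the input, which is why automata suit sequential text scanning.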

7. DATA STRUCTURES • In this section we cover three basic data structures used to organize data: search trees, digital trees, and hashing. They are used not only for storing text in secondary memory, but also as components in searching algorithms (especially digital trees). We do not describe arrays, because they are a well-known structure that can be used to implement static search tables, bit vectors for set manipulation, suffix arrays (Chapter 5), and so on. These three data structures differ in how a search is performed. Trees define a lexicographical order over the data.

Contd.. • However, in search trees, we use the complete value of a key to direct the search, while in digital trees, the digital (symbol) decomposition is used to direct the search. On the other hand, hashing "randomizes" the data order, being able to search faster on average, with the disadvantage that scanning in sequential order is not possible (for example, range searches are expensive).

8. Search Trees • The best-known search tree is the binary search tree. Each internal node contains a key, and the left subtree stores all keys smaller than the parent key, while the right subtree stores all keys larger than the parent key. Binary search trees are adequate for main memory. However, for secondary memory, multiway search trees are better, because internal nodes are bigger. In particular, we describe a special class of balanced multiway search trees called B-trees.

Contd.. A B-tree of order m is defined as follows: • The root has between 2 and 2m keys, while all other internal nodes have between m and 2m keys. • If ki is the i-th key of a given internal node, then all keys in the (i - 1)-th child are smaller than ki, while all the keys in the i-th child are bigger. • All leaves are at the same depth. Usually, a B-tree is used as an index, and all the associated data are stored in the leaves or buckets. This structure is called a B+-tree. An example of a B+-tree of order 2 is shown in Figure 2.3, using bucket size 4. B-trees are mainly used as a primary key access method for large databases in secondary memory. To search for a given key, we go down the tree choosing the appropriate branch at each step. The number of disk accesses is equal to the height of the tree.
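The search procedure can be sketched as follows. This is a minimal in-memory illustration under stated assumptions: the Node class, the search function, and the hand-built three-node tree are hypothetical, insertion and rebalancing are not shown, and each visited node stands in for one disk access.

```python
from bisect import bisect_left
from dataclasses import dataclass, field


@dataclass
class Node:
    keys: list                                     # sorted keys k1 < k2 < ...
    children: list = field(default_factory=list)   # empty for a leaf


def search(node, key):
    """Go down the tree choosing one branch per node.
    Returns (found, nodes_visited)."""
    visited = 0
    while node is not None:
        visited += 1
        i = bisect_left(node.keys, key)            # first position with keys[i] >= key
        if i < len(node.keys) and node.keys[i] == key:
            return True, visited
        # Child i holds the keys between keys[i-1] and keys[i].
        node = node.children[i] if node.children else None
    return False, visited


# Hand-built example: one root and three leaves, all leaves at the same depth.
root = Node([20, 40], [Node([5, 10]), Node([25, 30]), Node([45, 50])])

print(search(root, 30))   # (True, 2)
print(search(root, 20))   # (True, 1)
print(search(root, 42))   # (False, 2)
```

An unsuccessful search always visits as many nodes as the tree is high, which matches the disk-access count given above.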

9. Hashing • A hashing function h(x) maps a key x to an integer in a given range (for example, 0 to m - 1). Hashing functions are designed to produce values uniformly distributed in the given range. For a good discussion about choosing hashing functions, see Ullman (1972), Knuth (1973), and Knott (1975). The hashing value is also called a signature. A hashing function is used to map a set of keys to slots in a hashing table. If the hashing function gives the same slot for two different keys, we say that we have a collision. Hashing techniques mainly differ in how collisions are handled. There are two classes of collision resolution schemes: open addressing and overflow addressing.

Contd.. In open addressing (Peterson 1957), the collided key is "rehashed" into the table, by computing a new index value. The most used technique in this class is double hashing, which uses a second hashing function (Bell and Kaman 1970; Guibas and Szemeredi 1978). The main limitation of this technique is that when the table becomes full, some kind of reorganization must be done. Figure 2.4 shows a hashing table of size 13, and the insertion of a key using the hashing function h (x) = x mod 13 (this is only an example, and we do not recommend using this hashing function!).
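Double hashing can be sketched on a table of size 13 with h(x) = x mod 13 as the first function. The second function h2(x) = 1 + (x mod 11), the helper names, and the inserted keys are assumptions made for this example and do not reproduce Figure 2.4; as the text notes, x mod 13 is not a recommended hashing function.

```python
M = 13                                   # table size (prime)


def h1(x: int) -> int:
    return x % M                         # first hashing function, as in the text


def h2(x: int) -> int:
    return 1 + (x % 11)                  # assumed second function; never 0


def insert(table, key):
    """Store key in the first free slot of its probe sequence; return the slot."""
    for i in range(M):
        slot = (h1(key) + i * h2(key)) % M   # the i-th "rehash" of the key
        if table[slot] is None or table[slot] == key:
            table[slot] = key
            return slot
    raise RuntimeError("table full: reorganization needed")


def contains(table, key):
    """Follow the same probe sequence until the key or an empty slot is found."""
    for i in range(M):
        slot = (h1(key) + i * h2(key)) % M
        if table[slot] is None:
            return False
        if table[slot] == key:
            return True
    return False


table = [None] * M
print(insert(table, 5))      # 5: no collision
print(insert(table, 18))     # 0: collides at slot 5, step h2(18) = 8
print(insert(table, 31))     # 2: collides at slot 5, step h2(31) = 10
print(contains(table, 18))   # True
print(contains(table, 44))   # False
```

Because 13 is prime and the step is never 0, each probe sequence visits every slot, so an insertion fails only when the table is really full.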

10. Digital Trees Efficient prefix searching can be done using indices. One of the best indices for prefix searching is a binary digital tree or binary trie constructed from a set of substrings of the text. This data structure is used in several algorithms. Tries are recursive tree structures that use the digital decomposition of strings to represent a set of strings and to direct the searching. Tries were invented by de la Briandais (1959), and the name was suggested by Fredkin (1960), from the middle letters of "information retrieval". If the alphabet is ordered, we have a lexicographically ordered tree. The root of the tree uses the first character, the children of the root use the second character, and so on. If the remaining subtree contains only one string, that string's identity is stored in an external node.

Contd.. Figure 2.5 shows a binary trie (binary alphabet) for the string "01100100010111 . . . " after inserting all the substrings that start from positions 1 through 8. (In this case, the substring's identity is represented by its starting position in the text.) The height of a trie is the number of nodes in the longest path from the root to an external node. The length of any path from the root to an external node is bounded by the height of the trie. On average, the height of a trie is logarithmic for any square-integrable probability distribution (Devroye 1982).
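The construction can be sketched in a few lines. This is an illustrative reading, not the book's code: internal nodes are plain dictionaries, external nodes are the starting positions themselves, only the 14 visible bits of the string are used (the text elides the rest), and the function names are hypothetical.

```python
TEXT = "01100100010111"          # the visible part of the example string


def bit_at(pos: int, depth: int) -> str:
    """Symbol at the given depth of the substring starting at 1-based pos."""
    return TEXT[pos - 1 + depth]


def insert(node, pos: int, depth: int = 0):
    """Insert the substring starting at pos; return the updated subtree."""
    if node is None:
        return pos                               # new external node
    if isinstance(node, int):                    # occupied external node: split it
        node = {bit_at(node, depth): node}
    bit = bit_at(pos, depth)
    node[bit] = insert(node.get(bit), pos, depth + 1)
    return node


def leaves(node):
    if node is None:
        return []
    if isinstance(node, int):
        return [node]
    return sorted(p for child in node.values() for p in leaves(child))


def height(node) -> int:
    """Number of nodes on the longest root-to-external-node path."""
    if node is None:
        return 0
    if isinstance(node, int):
        return 1
    return 1 + max(height(child) for child in node.values())


def prefix_search(root, prefix: str):
    """Indexed positions whose substring starts with prefix."""
    node = root
    for bit in prefix:
        if node is None:
            return []
        if isinstance(node, int):                # reached a single candidate: verify it
            return [node] if TEXT.startswith(prefix, node - 1) else []
        node = node.get(bit)
    return leaves(node)


root = None
for pos in range(1, 9):                          # positions 1 through 8
    root = insert(root, pos)

print(prefix_search(root, "10"))    # [3, 6]
print(prefix_search(root, "001"))   # [4, 8]
print(height(root))                 # 6
```

A prefix search walks at most one node per symbol of the prefix and then reports the external nodes below that point, so its cost does not depend on the length of the text.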

11. Retrieval ALGORITHMS It is hard to classify IR algorithms and to draw a line between each type of application. However, we can identify three main types of algorithms, which are described below. There are other algorithms used in IR that do not fall within our description, for example, user interface algorithms. They are not considered IR algorithms because they are inherent to any computer application. The main class of algorithms in IR is retrieval algorithms, that is, algorithms that extract information from a textual database. We can distinguish two types of retrieval algorithms, according to how much extra memory we need:

Contd.. • Sequential scanning of the text: extra memory is in the worst case a function of the query size, and not of the database size. On the other hand, the running time is at least proportional to the size of the text, for example, string searching (Chapter 10). • Indexed text: an "index" of the text is available, and can be used to speed up the search. The index size is usually proportional to the database size, and the search time is sublinear in the size of the text, for example, inverted files (Chapter 3) and signature files (Chapter 4).
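The trade-off between the two types can be seen in a toy example. The sketch below is illustrative only: the sample text, the function names, and the word-level index built with a dictionary are assumptions, and real inverted files (Chapter 3) are considerably more elaborate.

```python
from collections import defaultdict


def sequential_scan(text: str, pattern: str):
    """Return every start offset of pattern. No index is kept, but the
    whole text is examined on each query."""
    hits = []
    start = text.find(pattern)
    while start != -1:
        hits.append(start)
        start = text.find(pattern, start + 1)
    return hits


def build_inverted_index(text: str):
    """Map each word to the offsets where it starts. The index grows with
    the text, but a query becomes a single dictionary lookup."""
    index = defaultdict(list)
    offset = 0
    for word in text.split(" "):
        if word:
            index[word].append(offset)
        offset += len(word) + 1          # skip the word and the separating space
    return index


text = "to be or not to be"
index = build_inverted_index(text)       # built once, reused for every query

print(sequential_scan(text, "be"))       # [3, 16]
print(index["be"])                       # [3, 16]
print(index["to"])                       # [0, 13]
```

The scan matches any substring while this index answers only whole-word queries, which is one reason the two approaches are treated as distinct classes.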

12. Filtering Algorithms This class of algorithms is such that the text is the input and a processed or filtered version of the text is the output. This is a typical transformation in IR, for example to reduce the size of a text, and/or standardize it to simplify searching. The most common filtering/processing operations are: • Common words removed using a list of stopwords. This operation is discussed in Chapter 7. • Uppercase letters transformed to lowercase letters. • Special symbols removed and sequences of multiple spaces reduced to one space. • Numbers and dates transformed to a standard format (Gonnet 1987). • Spelling variants transformed using Soundex-like methods (Knuth 1973). • Word stemming (removing suffixes and/or prefixes). This is the topic of Chapter 8. • Automatic keyword extraction. • Word ranking.

Contd.. • Unfortunately, these filtering operations also have some disadvantages. Any query, before consulting the database, must be filtered in the same way as the text. In addition, it is not possible to search for common words, special symbols, or uppercase letters, nor to distinguish text fragments that have been mapped to the same internal form.
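Three of the filtering operations (lowercasing, symbol and space clean-up, and stopword removal) can be sketched together. The stopword list, the function name filter_text, and the restriction to ASCII letters and digits are assumptions made for illustration; a real system would use a fuller list and the techniques of Chapters 7 and 8.

```python
import re

STOPWORDS = {"a", "an", "and", "of", "the", "to"}    # tiny illustrative list


def filter_text(text: str, stopwords=STOPWORDS) -> str:
    text = text.lower()                              # uppercase -> lowercase
    text = re.sub(r"[^a-z0-9\s]", " ", text)         # special symbols -> space
    text = re.sub(r"\s+", " ", text).strip()         # runs of spaces -> one space
    return " ".join(w for w in text.split(" ") if w and w not in stopwords)


document = "The  Theory of  Information-Retrieval!"
print(filter_text(document))          # theory information retrieval

# The query must go through the same filter before it is matched.
print(filter_text("the THEORY"))      # theory

# Distinct fragments can map to the same internal form.
print(filter_text("IR-systems") == filter_text("ir systems"))   # True
```

The last line shows the disadvantage noted above: once two fragments share an internal form, the filtered text can no longer tell them apart.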

13. Indexing Algorithms The usual meaning of indexing is to build a data structure that will allow quick searching of the text, as we mentioned previously. There are many classes of indices, based on different retrieval approaches. For example, we have inverted files (Chapter 3), signature files (Chapter 4), tries (Chapter 5), and so on, as we have seen in the previous section. Almost all types of indices are based on some kind of tree or hashing. Perhaps the main exceptions are clustered data structures (this kind of indexing is called clustering), which are covered in Chapter 16, and the Directed Acyclic Word Graph (DAWG) of the text, which represents all possible subwords of the text using a linear amount of space (Blumer et al. 1985) and is based on finite automata theory. Usually, before indexing, the text is filtered. Figure 2.7 shows the complete process for the text.
