
CS 430: Information Discovery



Presentation Transcript


  1. CS 430: Information Discovery Lecture 4 File Structures for Inverted Files

  2. Course Administration • Assignment 1 has been posted on the web site.

  3. Right Threaded Binary Tree Threaded tree: A binary search tree in which each node uses an otherwise-empty left child link to refer to the node's in-order predecessor and an empty right child link to refer to its in-order successor. Right-threaded tree: A variant of a threaded tree in which only the right thread, i.e. link to the successor, of each node is maintained. Knuth vol 1, 2.3.1, page 325.
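
The definition above can be illustrated with a minimal Python sketch of a right-threaded binary search tree. The class and function names (`Node`, `insert`, `in_order`) and the `is_thread` flag are assumptions made for this illustration, not taken from Knuth or the slides. The point of the sketch is that an in-order traversal needs neither a stack nor recursion once successor threads are maintained.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None        # right child, or thread to the in-order successor
        self.is_thread = False   # True when `right` is a thread, not a child


def insert(root, key):
    """Insert key into a right-threaded BST and return the root."""
    new = Node(key)
    if root is None:
        return new
    cur = root
    while True:
        if key < cur.key:
            if cur.left is None:
                # New left leaf: its in-order successor is its parent.
                cur.left = new
                new.right, new.is_thread = cur, True
                return root
            cur = cur.left
        else:
            if cur.right is None or cur.is_thread:
                # New right leaf inherits the parent's old thread (if any).
                new.right, new.is_thread = cur.right, cur.is_thread
                cur.right, cur.is_thread = new, False
                return root
            cur = cur.right


def in_order(root):
    """Yield keys in sorted order without a stack or recursion."""
    cur = root
    while cur is not None and cur.left is not None:
        cur = cur.left
    while cur is not None:
        yield cur.key
        if cur.is_thread or cur.right is None:
            cur = cur.right              # follow the thread, or stop at the last node
        else:
            cur = cur.right              # real right child: descend to its leftmost node
            while cur.left is not None:
                cur = cur.left
```

Usage: build a tree with repeated `root = insert(root, key)` calls starting from `root = None`, then iterate over `in_order(root)`.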

  4. Right Threaded Binary Tree From: Robert F. Rossa

  5. Definitions Keyword: A term that is used to describe the subject matter in a document. It is sometimes called an index term. In full text indexing, every word in the text is treated as a keyword (with the exception of stopwords). Keywords can be extracted automatically from a document or assigned by a human cataloguer or indexer. Controlled vocabulary: A list of words that can be used as keywords. For example, in a retrieval system used for research papers in medicine, the controlled vocabulary might be a list of medical terms.

  6. Restrictions in Building Inverted Files • Underlying character set, e.g., printable ASCII, Unicode, UTF-8. • Whether to use a controlled vocabulary. If so, what words to include. • List of stopwords. • Rules to decide the beginning and end of words, e.g., spaces or punctuation. • Character sequences not to be indexed, e.g., sequences of numbers.
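
As a rough illustration of how these restrictions turn into code, the sketch below extracts keywords from a string. Every concrete choice in it is an assumption made for the example: the tiny `STOPWORDS` set, the rule that a word is a run of ASCII letters or digits, lower-casing, and skipping tokens that consist only of digits. A real system would pick its own rules.

```python
import re

# Illustrative stopword list only; real lists are much longer.
STOPWORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def extract_keywords(text, stopwords=STOPWORDS, vocabulary=None):
    """Return the index terms found in `text`, in order of appearance."""
    # Word boundaries: a word is a maximal run of ASCII letters or digits,
    # so spaces and punctuation both end a word.
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    keywords = []
    for token in tokens:
        term = token.lower()
        if term.isdigit():
            continue                      # sequences of numbers are not indexed
        if term in stopwords:
            continue                      # stopwords are dropped
        if vocabulary is not None and term not in vocabulary:
            continue                      # controlled vocabulary, if one is used
        keywords.append(term)
    return keywords
```

Passing `vocabulary=None` corresponds to full text indexing; passing a set of permitted terms corresponds to indexing with a controlled vocabulary.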

  7. Representation of Inverted Files Index file: Stores list of terms (keywords). Designed for rapid searching and processing range queries. May be held in memory. Postings file: Stores list of postings for each term. Designed for rapid evaluation of Boolean operators. May be stored sequentially. Document file: [Repositories for the storage of document collections are covered in CS 502.]
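
A small in-memory sketch of the index file and postings file follows. It assumes documents are already reduced to lists of terms and identified by integer ids; the function names and the sample data in the usage note are hypothetical. The sorted term list stands in for the index file (binary search supports range queries), and the sorted postings lists allow Boolean AND to be evaluated in one sequential pass.

```python
from bisect import bisect_left, bisect_right


def build_inverted_file(docs):
    """docs: {doc_id: [term, ...]} -> (sorted term list, {term: sorted doc ids})."""
    postings = {}
    for doc_id in sorted(docs):               # ascending ids keep postings sorted
        for term in set(docs[doc_id]):        # one posting per (term, document)
            postings.setdefault(term, []).append(doc_id)
    index = sorted(postings)                  # the "index file": sorted terms
    return index, postings


def boolean_and(p, q):
    """Intersect two sorted postings lists with a single merge pass."""
    i = j = 0
    out = []
    while i < len(p) and j < len(q):
        if p[i] == q[j]:
            out.append(p[i])
            i += 1
            j += 1
        elif p[i] < q[j]:
            i += 1
        else:
            j += 1
    return out


def range_query(index, low, high):
    """Return all terms t with low <= t <= high using binary search."""
    return index[bisect_left(index, low):bisect_right(index, high)]
```

Usage: with `docs = {1: ["free", "text"], 2: ["text", "index"], 3: ["free", "index", "text"]}`, call `index, postings = build_inverted_file(docs)` and then `boolean_and(postings["free"], postings["index"])` to find the documents that contain both terms.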

  8. Sizes of Inverted Files

     Set   Records   Unique terms
     A       2,653          5,123
     B      38,304       c.25,000

     Set A has an average of 14 postings per term and a maximum of over 2,000 postings per term. Set B has an average of 88 postings per record. Examples from Harman and Candela, 1990.

  9. B-trees B-tree of order m: A balanced, multiway search tree:
     • Each node stores many keys.
     • The root has between 2 and 2m keys. All other internal nodes have between m and 2m keys.
     • If k_i is the i-th key in a given internal node, then all keys in the (i-1)-th child are smaller than k_i, and all keys in the i-th child are bigger than k_i.
     • All leaves are at the same depth.
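
The key-ordering property is what makes searching a B-tree fast: at each node a binary search over the keys either finds the target or selects the single child that can contain it. The sketch below shows search only (no insertion or rebalancing). The `BTreeNode` class and `search` function are names assumed for this illustration, and the code uses 0-based indexing, so `children[i]` holds the keys that lie between `keys[i-1]` and `keys[i]`.

```python
from bisect import bisect_left


class BTreeNode:
    def __init__(self, keys, children=None):
        self.keys = keys                  # sorted list of keys in this node
        self.children = children or []    # empty for a leaf, else len(keys) + 1 nodes


def search(node, key):
    """Return True if key is stored in the subtree rooted at node."""
    while node is not None:
        i = bisect_left(node.keys, key)   # first position with keys[i] >= key
        if i < len(node.keys) and node.keys[i] == key:
            return True
        if not node.children:
            return False                  # reached a leaf without finding key
        node = node.children[i]           # the only child that can hold key
    return False
```

Usage: a small hand-built tree such as `BTreeNode([50, 65], [BTreeNode([10, 25]), BTreeNode([55, 59]), BTreeNode([70, 81, 90])])` can be searched with `search(root, 59)`.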

  10. B+-tree B+-tree: • A B-tree is used as an index • Data is stored in the leaves of the tree, known as buckets. Example: B+-tree of order 2, bucket size 4. [Figure: index keys 50, 65 at the root; 10, 25; 55, 59; and 70, 81, 90 at the next level; buckets holding documents such as D9, D51, D54, D66, D81.]

  11. B-tree Discussion For a discussion of B-trees, see Frakes, Section 2.3.1, pages 18-20. • B-trees combine fast retrieval with moderately efficient updating. • Bottom-up updating is usually fast, but may require recursive tree climbing to the root. • The main weakness is poor storage utilization; typically buckets are only 69% full. • Various algorithmic improvements increase storage utilization at the expense of updating performance.

  12. Signature Files Inexact filter: A quick test which discards many of the non-qualifying items. Advantages • Much faster than full text scanning -- 1 or 2 orders of magnitude • Modest space overhead -- 10% to 15% of file • Insertion is straightforward Disadvantages • Sequential searching is impractical for very large files • Some hits are false hits

  13. Signature Files Signature size. Number of bits in a signature, F. Word signature. A bit pattern of size F with m bits set to 1 and the others 0. The word signature is calculated by a hash function. Block. A sequence of text that contains D distinct words. Block signature. The logical OR of all the word signatures in a block of text.
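
These definitions can be sketched in a few lines of Python. The slides do not say which hash function is used, so the use of SHA-256 with a counter to pick bit positions is purely an assumption for illustration, as are the function names. The sketch guarantees exactly m bits set per word signature and builds the block signature as the OR of the word signatures.

```python
import hashlib


def word_signature(word, F=12, m=4):
    """Return an F-bit integer with exactly m bits set, derived from a hash of word."""
    if m > F:
        raise ValueError("m cannot exceed the signature size F")
    sig = 0
    counter = 0
    # Keep hashing (word, counter) until m distinct bit positions are chosen.
    while bin(sig).count("1") < m:
        digest = hashlib.sha256(f"{word}:{counter}".encode("utf-8")).digest()
        position = int.from_bytes(digest[:4], "big") % F
        sig |= 1 << position
        counter += 1
    return sig


def block_signature(words, F=12, m=4):
    """OR together the word signatures of the distinct words in a block."""
    sig = 0
    for word in set(words):
        sig |= word_signature(word, F, m)
    return sig
```

Because the hash function here is an arbitrary choice, the bit patterns it produces will not match any patterns printed in the lecture.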

  14. Signature Files Example

      Word              Signature
      free              001 000 110 010
      text              000 010 101 001
      block signature   001 010 111 011

      F = 12 bits in a signature; m = 4 bits per word; D = 2 words per block.

  15. Signature Files A query term is processed by matching its signature against the block signature. (a) If the term is in the block, its word signature will always match the block signature. (b) A word signature may match the block signature even though the word is not in the block. This is a false hit. The design challenge is to minimize the false drop probability, Fd. Frakes, Section 4.2, page 47, discusses how to minimize Fd. The rest of that chapter discusses enhancements to the basic algorithm.
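
The matching rule can be sketched as a bitwise test: a word signature matches when every bit set in it is also set in the block signature. The example below reuses the "free" and block bit patterns from the example slide; the signatures named `absent` and `other` are hypothetical patterns invented to show a false hit and a rejection. The `false_drop_estimate` function uses a common textbook approximation, Fd ≈ (1 - e^(-mD/F))^m, which is assumed here and not quoted from the slides.

```python
import math


def bits(pattern):
    """Parse a spaced bit pattern such as '001 000 110 010' into an int."""
    return int(pattern.replace(" ", ""), 2)


def matches(word_sig, block_sig):
    """True when every 1 bit of the word signature is also set in the block."""
    return (word_sig & block_sig) == word_sig


def false_drop_estimate(F, m, D):
    """Approximate false drop probability for F bits, m bits/word, D words/block."""
    return (1.0 - math.exp(-m * D / F)) ** m


block = bits("001 010 111 011")      # block signature from the example slide
free = bits("001 000 110 010")       # word that is in the block
absent = bits("001 010 100 001")     # hypothetical word NOT in the block
other = bits("100 000 110 010")      # hypothetical word with a bit outside the block

print(matches(free, block))          # case (a): a true match
print(matches(absent, block))        # case (b): matches anyway, a false hit
print(matches(other, block))         # rejected by the filter
```

Under this approximation the estimate is smallest when about half the bits of a block signature are set, that is when F ln 2 = mD, which gives Fd = 2^(-m).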

  16. Tries Basic concept The text is divided into unique semi-infinite strings, or sistrings. Each sistring has a starting position in the text, and continues to the right until it is unique. The sistrings are stored in (the leaves of) a tree, the suffix tree. Common parts are stored only once. Each sistring can be associated with a location within a document where the sistring occurs. Subtrees below a certain node represent all occurrences of the substring represented by that node. Suffix trees (and similar suffix arrays) have a size of the same order of magnitude as the input documents.

  17. Tries: Suffix Tree Example: suffix tree for the following words: begin, beginning, between, bread, break

      b
        e
          gin
            _
            ning
          tween
        rea
          d
          k

  18. Tries: Sistrings A binary example

      String: 01 100 100 010 111

      Sistrings:
      1   01 100 100 010 111
      2   11 001 000 101 11
      3   10 010 001 011 1
      4   00 100 010 111
      5   01 000 101 11
      6   10 001 011 1
      7   00 010 111
      8   00 101 11

  19. Tries: Lexical Ordering

      7   00 010 111
      4   00 100 010 111
      8   00 101 11
      5   01 000 101 11
      1   01 100 100 010 111
      6   10 001 011 1
      3   10 010 001 011 1
      2   11 001 000 101 11

      Unique remaining subtrie indicated in red
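
The binary example can be reproduced with a short sketch: each sistring is the suffix of the bit string starting at a given position, and the lexical ordering is an ordinary string sort. The function name `sistrings` and the choice to number positions from 1 are assumptions that follow the numbering used in the example.

```python
def sistrings(bits, count):
    """Return {start_position: suffix} for the first `count` positions (1-based)."""
    return {i + 1: bits[i:] for i in range(count)}


text = "01100100010111"                  # the example string with spaces removed
s = sistrings(text, 8)

# Lexical ordering: sort the positions by the sistring that starts there.
order = sorted(s, key=s.get)
print(order)                             # [7, 4, 8, 5, 1, 6, 3, 2]
```

The sorted list of positions is, in effect, a suffix array restricted to the first eight positions of the string.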

  20. Trie: Basic Concept [Figure: binary trie built from the eight sistrings; edges are labelled 0 or 1 and the leaves hold sistring numbers 1 to 8.]
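
One way to build such a trie is sketched below, using nested dictionaries for internal nodes and sistring positions for leaves. Each sistring is pushed down only as far as is needed to separate it from the others. The representation and the function names are assumptions for illustration, and the code assumes that no sistring is a prefix of another (true for the eight example sistrings; in general an end marker would be appended).

```python
def build_trie(sistrings):
    """sistrings: {position: bit string}. Internal nodes are dicts, leaves are positions."""
    root = {}
    for pos, s in sistrings.items():
        node, depth = root, 0
        while True:
            bit = s[depth]
            child = node.get(bit)
            if child is None:
                node[bit] = pos                    # free slot: store the new leaf
                break
            if isinstance(child, dict):
                node, depth = child, depth + 1     # descend into an internal node
                continue
            # Collision with an existing leaf: replace it by an internal node
            # and push the old leaf one level down, then retry from there.
            node[bit] = {}
            node, depth = node[bit], depth + 1
            node[sistrings[child][depth]] = child
    return root


def find(trie, sistrings, prefix):
    """Return the sorted positions of all sistrings that start with prefix."""
    node = trie
    for bit in prefix:
        if not isinstance(node, dict):
            break                                  # reached a leaf early
        node = node.get(bit)
        if node is None:
            return []
    stack, found = [node], []
    while stack:                                   # collect every leaf below node
        n = stack.pop()
        if isinstance(n, dict):
            stack.extend(n.values())
        else:
            found.append(n)
    # A leaf may be reached before the prefix is used up, so verify each hit.
    return sorted(p for p in found if sistrings[p].startswith(prefix))


text = "01100100010111"
s = {i + 1: text[i:] for i in range(8)}            # the eight example sistrings
trie = build_trie(s)
print(find(trie, s, "00"))                         # [4, 7, 8]
```

The search shows the property stated earlier: the subtree below a node represents all occurrences of the substring spelled out on the path to that node.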

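  21. Patricia Tree Single-descendant nodes are eliminated. Each internal node stores the number of the bit it tests. [Figure: Patricia tree for the eight sistrings; internal nodes are labelled with bit numbers 1 to 5, edges with 0 or 1, and the leaves hold sistring numbers 1 to 8.]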
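
A compact way to see the effect of eliminating single-descendant nodes is to build the tree recursively: at each step, find the first bit at which the remaining sistrings differ, record its number, and split on it. The tuple representation `(bit_number, zero_subtree, one_subtree)` and the function names are assumptions for this sketch, and it again assumes no sistring is a prefix of another. Because bits are skipped, a lookup must verify the candidate against the stored sistring at the end.

```python
def build_patricia(positions, sistrings, start=0):
    """Return a leaf position, or (bit_number, zero_subtree, one_subtree)."""
    if len(positions) == 1:
        return positions[0]
    bit = start
    # Skip bits on which all remaining sistrings agree (single-descendant nodes).
    while len({sistrings[p][bit] for p in positions}) == 1:
        bit += 1
    zeros = [p for p in positions if sistrings[p][bit] == "0"]
    ones = [p for p in positions if sistrings[p][bit] == "1"]
    return (bit + 1,                               # bit numbers are 1-based
            build_patricia(zeros, sistrings, bit + 1),
            build_patricia(ones, sistrings, bit + 1))


def lookup(tree, sistrings, query):
    """Return the position whose sistring equals query, or None."""
    node = tree
    while isinstance(node, tuple):
        bit_number, zero, one = node
        if bit_number > len(query):
            return None
        node = one if query[bit_number - 1] == "1" else zero
    # Skipped bits were never compared, so check the whole string here.
    return node if sistrings[node] == query else None


text = "01100100010111"
s = {i + 1: text[i:] for i in range(8)}            # the eight example sistrings
tree = build_patricia(sorted(s), s)
print(tree)
```

For the example sistrings the internal nodes test bits 1, 2, 2, 3, 3, 4 and 5, so the tree has seven internal nodes for eight leaves.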

  22. Oxford English Dictionary
