1 / 26

Data File Structures

10/18/2000. CSI660. 2. Accessing data during retrieval. Scan the entire collection SLOW!Typical in early (batch) retrieval systems Still used today, in hardware form (e.g., Fast Data Finder) Computational and I/O costs are O(characters in collection) Practical for only small" collections Use

kiri
Télécharger la présentation

Data File Structures

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


    1. 10/18/2000 CSI660 1 Data & File Structures Indexing data structures Efficient search techniques

    2. 10/18/2000 CSI660 2 Accessing data during retrieval Scan the entire collection – SLOW! Typical in early (batch) retrieval systems Still used today, in hardware form (e.g., Fast Data Finder) Computational and I/O costs are O(characters in collection) Practical for only “small” collections Use indexes for direct access – FAST! Evaluation time O(query term occurrences in collection) Practical for “large” collections Many opportunities for optimization Hybrids: Use small index, then scan a subset of the collection

    3. 10/18/2000 CSI660 3 What should an Index Contain? Database systems index primary and secondary keys This is the hybrid approach Index provides fast access to a subset of database records Scan subset to find solution set IR Problem: Cannot predict keys that people will use in queries Every word in a document is a potential search term IR Solution: Index by all keys (words)

    4. 10/18/2000 CSI660 4 How is the index accessed? Access by the atoms of a query language “features” or “keys” or “terms” Most common feature types: Words in text, punctuation Manually assigned terms (controlled and uncontrolled vocabulary) Document structure (sentence and paragraph boundaries) Inter- or intra-document links (e.g., citations) Composed features Feature sequences (phrases, names, dates, monetary amounts) Feature sets (e.g., synonym classes)

    5. 10/18/2000 CSI660 5 Indexing Words What is a word? Embedded punctuation (e.g., DC-10, long-term, AT&T) Case folding (e.g., New vs new, Apple vs apple) Stopwords (e.g., the, a, its) Morphology (e.g., computer, computers, computing, computed) Index granularity has a large impact on speed and effectiveness Index stems only? Index surface forms only? Index both?

    6. 10/18/2000 CSI660 6 What else to store in the index? Features depend upon the retrieval model Boolean Statistical (tf, df, ctf, doclen, maxtf) Often about 10% the size of the raw data, compressed Positional information Feature location within document Granularities include word, sentence, paragraph, etc Coarse granularities are less precise, but take less space Word-level granularity about 20-30% the size of the raw data, compressed

    7. 10/18/2000 CSI660 7 Common Index Implementations Bitmaps Signature files Inverted files Hashing n-grams

    8. 10/18/2000 CSI660 8 Common Index Components Dictionary (lexicon) Postings document ids, word positions

    9. 10/18/2000 CSI660 9 Bitmaps Bag-of-words index only For each term, allocate vector with one bit per document if term present in document n, set nth bit to 1, otherwise 0 Retrieval by Boolean operations very fast space efficient for common terms (why?) space inefficient for rare terms (why?) Good compression with run-length encoding Difficult to add/delete documents (why?) Not widely used

    10. 10/18/2000 CSI660 10 Signature Files

More Related