CS 245: Database System Principles Notes 4: Indexing

CS 245: Database System PrinciplesNotes 4: Indexing Hector Garcia-Molina Notes 4

Chapter 4 Indexing & Hashing value record ? value Notes 4

Topics • Conventional indexes • B-trees • Hashing schemes Notes 4

10 30 50 70 90 20 40 60 80 100 Sequential File Notes 4

70 50 30 10 110 90 120 80 60 40 20 100 90 10 70 50 30 80 60 40 20 100 Sequential File Dense Index Notes 4

130 90 50 10 210 170 230 150 110 70 30 190 90 10 70 50 30 80 60 40 20 100 Sequential File Sparse Index Notes 4

210 50 10 130 170 490 90 170 330 10 30 570 70 110 410 150 190 90 230 250 10 50 30 90 70 100 80 60 40 20 Sequential File Sparse 2nd level Notes 4

Comment: {FILE,INDEX} may be contiguous or not (blocks chained) Notes 4

Question: • Can we build a dense, 2nd level index for a dense index? Notes 4

Notes on pointers: (1) Block pointer (sparse index) can be smaller than record pointer BP RP Notes 4

Notes on pointers: (2) If file is contiguous, then we can omit pointers (i.e., compute them) Notes 4

say: 1024 B per block R2 R1 R3 • if we want K3 block: • get it at offset • (3-1)1024 • = 2048 bytes R4 K1 K2 K3 K4 Notes 4

Sparse vs. Dense Tradeoff • Sparse: Less index space per record can keep more of index in memory • Dense: Can tell if any record exists without accessing file (Later: • sparse better for insertions • dense needed for secondary indexes) Notes 4

Terms • Index sequential file • Search key ( primary key) • Primary index (on Sequencing field) • Secondary index • Dense index (all Search Key values in) • Sparse index • Multi-level index Notes 4

Next: • Duplicate keys • Deletion/Insertion • Secondary indexes Notes 4

10 10 20 30 40 10 20 30 30 45 Duplicate keys Notes 4

10 10 10 10 20 20 30 30 40 40 45 10 20 30 10 20 45 30 30 30 Duplicate keys Dense index, one way to implement? 10 10 10 10 10 10 20 20 20 20 30 30 30 30 30 30 Notes 4

30 10 10 20 40 45 30 30 10 20 Duplicate keys Dense index, better way? 10 20 30 40 Notes 4

30 10 10 20 40 45 30 30 10 20 careful if looking for 20 or 30! Duplicate keys Sparse index, one way? 10 10 20 30 Notes 4

10 10 20 30 40 30 20 30 45 10 should this be 40? Duplicate keys Sparse index, another way? • place first new key from block 10 20 30 30 Notes 4

Duplicate values, primary index Summary • Index may point to first instance of each value only File Index a a a . . b Notes 4

10 30 50 70 80 20 40 60 Deletion from sparse index 10 30 50 70 90 110 130 150 Notes 4

70 10 30 50 20 40 60 80 Deletion from sparse index • delete record 40 10 30 50 70 90 110 130 150 Notes 4

10 30 50 70 20 40 60 80 40 40 Deletion from sparse index • delete record 30 10 30 50 70 90 110 130 150 Notes 4

10 30 50 70 20 40 60 80 50 70 Deletion from sparse index • delete records 30 & 40 10 30 50 70 90 110 130 150 Notes 4

10 30 50 70 80 20 40 60 Deletion from dense index 10 20 30 40 50 60 70 80 Notes 4

10 50 70 20 60 80 40 40 Deletion from dense index • delete record 30 10 20 30 30 40 40 50 60 70 80 Notes 4

10 30 40 60 20 50 Insertion, sparse index case 10 30 40 60 Notes 4

10 30 40 60 20 50 34 • our lucky day! • we have free space • where we need it! Insertion, sparse index case • insert record 34 10 30 40 60 Notes 4

10 30 40 60 20 50 15 20 20 30 Insertion, sparse index case • insert record 15 10 30 40 60 • Illustrated: Immediate reorganization • Variation: • insert new block (chained file) • update index Notes 4

10 30 40 60 25 20 50 overflow blocks (reorganize later...) Insertion, sparse index case • insert record 25 10 30 40 60 Notes 4

Insertion, dense index case • Similar • Often more expensive . . . Notes 4

30 20 80 100 90 50 70 40 10 60 Secondary indexes Sequence field Notes 4

100 30 20 80 90 50 70 40 10 60 30 90 ... 20 80 100 does not make sense! Secondary indexes Sequence field • Sparse index Notes 4

90 100 20 30 80 70 40 10 60 50 10 10 50 60 20 50 70 90 30 ... ... 40 sparse high level Secondary indexes Sequence field • Dense index Notes 4

Also: Pointers are record pointers (not block pointers; not computed) With secondary indexes: • Lowest level is dense • Other levels are sparse Notes 4

20 20 10 10 30 10 40 40 40 40 Duplicate values & secondary indexes Notes 4

30 20 20 10 10 10 40 40 40 40 20 10 40 10 40 30 10 40 ... 20 40 Duplicate values & secondary indexes one option... • Problem: • excess overhead! • disk space • search time Notes 4

20 20 10 10 30 40 40 40 40 10 20 40 30 Duplicate values & secondary indexes another option... 10 Problem: variable size records in index! Notes 4

30 20 20 10 10 10 40 40 40 40 50 10 60 20 30 ... 40 Duplicate values & secondary indexes   Another idea (suggested in class):Chain records with same key?   • Problems: • Need to add fields to records • Need to follow chain to know records Notes 4

10 30 20 10 20 10 40 40 40 40 50 10 20 60 ... 30 40 Duplicate values & secondary indexes buckets Notes 4

Why “bucket” idea is useful Indexes Records Name: primary EMP (name,dept,floor,...) Dept: secondary Floor: secondary Notes 4

Dept. index EMP Floor index Toy 2nd Query: Get employees in (Toy Dept) ^ (2nd floor)  Intersect toy bucket and 2nd Floor bucket to get set of matching EMP’s Notes 4

cat dog Inverted lists This idea used in text information retrieval Documents ...the cat is fat ... ...was raining cats and dogs... ...Fido the dog ... Notes 4

IR QUERIES • Find articles with “cat” and “dog” • Find articles with “cat” or “dog” • Find articles with “cat” and not “dog” • Find articles with “cat” in title • Find articles with “cat” and “dog” within 5 words Notes 4

Common technique: more info in inverted list cat position location type d1 Title 5 Author 10 Abstract 57 d2 d3 dog Title 100 Title 12 Notes 4

Posting: an entry in inverted list. Represents occurrence of term in article Size of a list: 1 Rare words or (in postings) miss-spellings 106 Common words Size of a posting: 10-15 bits (compressed) Notes 4

IR DISCUSSION • Stop words • Truncation • Thesaurus • Full text vs. Abstracts • Vector model Notes 4

PRODUCT = 1 + ……. = score Vector space model w1 w2 w3 w4 w5 w6 w7 … DOC = <1 0 0 1 1 0 0 …> Query= <0 0 1 1 0 0 0 …> Notes 4

Tricks to weigh scores + normalize e.g.: Match on common word not as useful as match on rare words... Notes 4

CS 245: Database System Principles Notes 4: Indexing

CS 245: Database System Principles Notes 4: Indexing

Presentation Transcript

CS 245: Database System Principles Notes 4: Indexing

CS 245: Database System Principles Notes 10: More TP

CS 245: Database System Principles Notes 10: More TP

CS 245: Database System Principles

CS 245: Database System Principles Notes 09: Concurrency Control

CS 245: Database System Principles Notes 01: Introduction

CS 245: Database System Principles Notes 09: Concurrency Control

CS 245: Database System Principles Notes 08: Failure Recovery

CS 245: Database System Principles

CS 277: Database System Implementation Notes 4: Indexing

CS 245: Database System Principles Notes 02: Hardware

CS 245: Database System Principles Notes 09: Concurrency Control

CS 245: Database System Principles

CS 245: Database System Principles Notes 7: Query Optimization

CS 245: Database System Principles Notes 01: Introduction

CS 245: Database System Principles Notes 09: Concurrency Control

CS 245: Database System Principles Notes 10: More TP

CS 245: Database System Principles

CS 245: Database System Principles

CS 245: Database System Principles

CS 245: Database System Principles Notes 03: Disk Organization