Index Structures

Index Structures Computer Science CS-257-01 : Spring 2009 San Jose State University, 16th February, 2009

What is an Index? Blocks Holding Records Matching Records Value Index An Index is any Data Structure that takes as input a property of Records - typically the value of one or more fields - and finds the records with that property “quickly”.

Indexes on Sequential Files 10 20 30 40 50 60 70 80 This is a Sequential File in which tuples are sorted by their Primary Key.

Definitions to Study Index Structures • Data File : A Sorted Sequential file • Index File : Consisting of Key-Pointer pairs • Search Key : A Search Key K in the Index file is associated with a pointer to a Data file record that has Search Key K 4

Dense Indexes 10 10 20 20 30 30 40 40 50 50 60 60 70 80 70 80 There is an entry in the Index File for record in the Data File. The index is called "dense" because every key from the data file indexes can be "dense," meaning there is an entry in the index file for every is represented in the index. In comparison, "sparse" indexes, to be discussed in record of the data file, or "sparse," meaning that only some of the data records 5

Sparse Indexes 10 10 20 30 50 30 70 40 90 50 110 60 130 150 70 80 Holds only one Key-Pointer per Data Block. The Key is for the first Record on the Data Block. 6

Multiple Levels of Indexes 10 10 10 20 90 30 170 50 30 250 70 40 330 90 50 410 110 60 490 130 570 150 70 80 Adding second level of Sparse Index 7

Managing Indexes during Data Modification • Create / Delete Overflow blocks • Insert new blocks in the Sequential order • Slide tuples to adjacent blocks. • The correct approach depends on 15-hether the indexes is dense or sparse. And on which of the three strategies enumerated above is used. However, one general principle should be remembered 8

How actions on the Sequential File affect the Index File 9

Secondary Indexes Why do we need Secondary Indexes? SELECT name, address FROM moviestar WHERE birthday = DATE ’01/09/2008’; Need Secondary Indexes on birthday to help with such queries 10

INFORMATION INTEGRATION Sandeep Singh Balouria CS-257 ID- 101

DATA CUBES Data cube is a multi-dimensional structure , it as a data abstraction that allows one to view aggregated data from a number of perspectives. It is surrounded by a collection of sub-cubes/cuboids that represent the aggregation of the base cuboid along one or more dimensions.

DATA CUBE

CUBE OPERATOR In cube operator, we can define an augmented table CUBE(F) that add an additional value, denoted *, to each dimension. The * has the intuitive meaning “any”, and it represents aggregation along the dimension in which it appears.

CUBE OPERATOR

EXAMPLE: Sales(model , color, date, dealer, val , cnt). -In this query we specify conditions on certain attributes of the sales relation and group by some other attributes -In the relation CUBE(sales), we look for those tuples t with the following properties: 1. If the query specifies a value v for attribute a, then tuple t has v in its component for a.

2.If the query groups by an attribute a, then t has any non-* value in its component for a. 3.If the query neither groups by attribute a nor specifies a value for a, then t has * in its component for a. QUERY: SELECT color, AVG(price) FROM Sales WHERE model=‘Gobi’ GROUP BY color; Cube(Sales) Form is (‘Gobi’ ,c ,*,*,v ,n)

CUBE IMPLEMENTED BY MATERIALIZED VIEWS A materialized view is an object that stores the result of select statement. These are mostly used in data warehousing to improve the performance of the select statements that involve group and aggregate functions we use materialized views . EXAMPLE: INSERT INTO salesV1 SELECT model, color, month, city, SUM(val) AS val, SUM(cnt) AS cnt FROM sales JOIN Dealers ON dealer=name GROUP BY model , color , month, city; Sales(model, color, date, dealer, val, cnt)

Query:SELECT model ,SUM(val) FROM sales GROUP BY model; -Can also be written as: SELECT model, SUM(val) FROM sales(v1) GROUP BY model; SELECT model, color, date, SUM(val) FROM sales GROUP BY model ,color , date;

LATTICE OF VIEWS In lattice view we can partition the values of a dimension by grouping according to one or more attributes of its dimensional tables. Having a lattice for each dimension, we can now define a lattice for all the possible materialized views of a data cube that can be formed by grouping according to some partition in each dimension.

CH-13 Index Structures Continued…..

Document Retrieval and Inverted Index • Due advent in WWW keeping documents online and document retrieval become one of the largest database problem • The most easy approach for document retrieval is to create separate index for each word (Problem: wastage of storage space) • The other approach is to use Inverted Index

Inverted Index: • Records is a collection of documents • The inverted index itself consist of set of word-pointer pairs • The inverted index pointers refer to position in the bucket file • Pointers in the bucket file can be: • Pointers to document • Pointers to occurrence of word (may be pair of first block of document and an integer indicating number of word) • When points to word we can include some info in bucket array EX. For document using HTML and XML we can also include marking associated with words so we can distinguish between titles headers etc.

More Information Retrieval Techniques to improve effectiveness • Stemming: Remove suffixes to find stem of each word ( Ex. Plurals can be treated as there singular version. • Stop Words: words such as “the” or “and”. Are excluded from inverted index • Ex. With ref. to next fig. if we want to find the document about the dogs that compare them with cats. • Difficult to solve with out understanding of text • However we could get good hint if we search document that • Mention dogs in the title, and • Mention cats in an anchor <href>

B-Trees • The most commonly used index structure in the commercial systems. • Advantages • B-trees automatically maintains the levels of index according to file size • B-trees mange the space on the blocks so no overflow blocks are needed • The structure of B-trees • The tree is balanced ( All the paths from root to leaf have the same length • Typically three layers: the root, an intermediate layer, and leaves. • The Applications of B-trees • The B-tree is a powerful tool for building indexes. The sequence of pointers to • records at the leaves can play the role of any of the pointer sequences coming • out of an index file

Important rules about what can appear in the blocks of a B-tree: • Keys are distributed among the leaves in sorted order, from left to right • At root there are at least two used pointer. (exception of tree with single record) • At leaf last pointer points to the next leaf block to the right. • At interior node all n+1 pointer can be used (at least n+1/2 are actually used)

Efficiency of B-tree • B-tree allow lookup, insertion, deletion of record using few disk I/O’s • If the number of keys per block is reasonably large then rarely we need to split or merge the blocks. And even if we need this operation this are limited to the leaves and there parents. Thus, we can neglect the cost of B-tree reorganization. • The number of disk I/O to read the record are normally the levels of B-tree plus the one (for lookup) and two (for insert or delete). • Ex. Suppose 340 key pointer pairs could fit in one block, suppose avg. block has occupied between min. and max. i.e. the typical block has 255 pointers. • With root 255 children and 255^2 = 65025 leaves • Suppose among this leaves we have 255^3 or about 16.6 million records • That is, files with up to 16.6 million records can be accommodated by 3 levels of B-tree • Number of disk I/O can reduced further by keeping B-tree in main memory.

CH-20 Information Integration Continued……

What is OLAP? • On-Line Analytic Processing (OLAP) • OLAP is used to query very large amount of data in the data warehouse of company • It involves highly complex queries that use one or more aggregators • OLAP queries are also called as decision support queries • What is OLTP? • Online transaction processing, or OLTP, refers to a class of systems that facilitate and manage transaction-oriented applications, typically for data entry and retrieval transaction processing • Common database operation touch very small amount of data and they are referred as OLTP (online transaction processing) • OLAP queries are considered as long transactions and long transactions locking the entire database would shutdown the ordinary OLTP transactions so OLAP data is stored separately in data warehouse rather then in ordinary database

Application of OLTP • Acomlnon OLXP application uses a warehouse of sales data. Major store chains • will accumulate terabytes of information representing every sale of every item • at every store. • Example of OLAP and OLTP queries • Consider data warehouse of automobile company the schema can be as follows • Sales (serialNo, date, dealer, price) • Autos (serialNo, model, color) • Dealers( name, city, state, phone) • The typical OLAP query can be for finding the average sales price by state • SELECT state, AVG (price) • FROM Sales, Dealers • WHERE Sales.dealer = Dealers.name AND date >= ‘2001-01-04’ • GROUP BY state; • In same example the typical OLTP query can be for finding the price at which the auto with serial number 123 was sold

Multidimensional view of OLTP data : • In typical OLAP application we have a central relation called fact table. • Fact table represents events or objects of interest. • It is helpful to think that objects in the fact table are arranged in the multidimensional space. • consider the earlier example of automobile company the fact table can be build for sales which is the object of interest and is viewed as a 3 dimensional data cube .

Multidimensional View Of OLAP Data Each single point in cube represents sales of single automobile and dimension represents properties of sales.

Star Schemas A star schema consist of schema for the fact table, which links to several other relations called “dimension tables”.

Slicing and Dicing • The row data cube can be partitioned along each dimension at some level of granularity for analysis this partitioning operations are known as slicing and dicing. • In SQL this partitioning is done by “ GROUP BY” clause.

Lets consider the automobile example. suppose car named Gobi is not selling well and we want to find exactly which colors are not doing well • SQL query is as follows: • SELECT color, SUM (Price) • FROM Sales NATURAL JOIN Autos • WHERE model = ‘Gobi’ • GROUP BY color; • This query dice by color and slice by model, Focusing on particular model, the Gobi, and ignores other data.

Data Representation Recovery from Disk Crashes – 13.4 Presented By: Deepti Bhardwaj Roll No. 223_103 SJSU ID: 006521307

Contents • 13.4.5 Recovery from Disk Crashes • 13.4.6 Mirroring as a Redundancy Technique • 13.4.7 Parity Blocks • 13.4.8 An Improving: RAID 5 • 13.4.9 Coping With Multiple Disk Crashers

Recovery from Disk Crashes: Ways to recover data • The most serious mode of failure for disks is “head crash” where data permanently destroyed. • So to reduce the risk of data loss by disk crashes there are number of schemes which are know as RAID (Redundant Arrays of Independent Disks) schemes. • Each of the schemes starts with one or more disks that hold the data and adding one or more disks that hold information that is completely determined by the contents of the data disks called Redundant Disk.

Mirroring as a Redundancy Technique • Mirroring Scheme is referred as RAID level 1 protection against data loss scheme. • In this scheme we mirror each disk. • One of the disk is called as data disk and other redundant disk. • In this case the only way data can be lost is if there is a second disk crash while the first crash is being repaired.

Parity Blocks • A parity block is used by certain RAID levels. Redundancy is achieved by the use of parity blocks. If a single drive in the array fails, data blocks and a parity block from the working drives can be combined to reconstruct the missing data. • RAID level 4 scheme uses only one redundant disk no matter how many data disks there are. • In the redundant disk, the ith block consists of the parity checks for the ith blocks of all the data disks. • It means, the jth bits of all the ith blocks of both data disks and redundant disks, must have an even number of 1’s and redundant disk bit is used to make this condition true.

Parity Blocks – Reading disk Reading data disk is same as reading block from any disk. • We could read blockfrom each of the other disks and compute the block of the disk we want to read by taking the modulo-2 sum. disk 2: 10101010 disk 3: 00111000 disk 4: 01100010 If we take the modulo-2 sum of the bits in each column, we get disk 1: 11110000

Parity Block - Writing • When we write a new block of a data disk, we need to change that block of the redundant disk as well. • One approach to do this is to read all the disks and compute the module-2 sum and write to the redundant disk. But this approach requires n-1 reads of data, write a data block and write of redundant disk block. Total = n+1 disk I/Os

Continue : Parity Block - Writing • Better approach will require only four disk I/Os1. Read the old value of the data block being changed. 2. Read the corresponding block of the redundant disk. 3. Write the new data block. 4. Recalculate and write the block of the redundant disk.

Parity Blocks – Failure Recovery If any of the data disk crashes then we just have to compute the module-2 sum to recover the disk. Suppose that disk 2fails. We need to re compute each block of the replacement disk. We are given the corresponding blocks of the first and third data disks and the redundant disk, so the situation looks like: disk 1: 11110000 disk 2: ???????? disk 3: 00111000 disk 4: 01100010 If we take the modulo-2 sum of each column, we deduce that the missing block of disk 2 is : 10101010

An Improvement: RAID 5 • RAID 4 is effective in preserving data unless there are two simultaneous disk crashes. • Whatever scheme we use for updating the disks, we need to read and write the redundant disk's block. If there are n data disks, then the number of disk writes to the redundant disk will be n times the average number of writes to any one data disk. • However we do not have to treat one disk as the redundant disk and the others as data disks. Rather, we could treat each disk as the redundant disk for some of the blocks. This improvement is often called RAID level 5.

Continue : An Improvement: RAID 5 • For instance, if there are n + 1 disks numbered 0 through n, we could treat the ith cylinder of disk j as redundant if j is the remainder when i is divided by n+1. • For example, n = 3 so there are 4 disks. The first disk, numbered 0, is redundant for its cylinders numbered 4, 8, 12, and so on, because these are the numbers that leave remainder 0 when divided by 4. • The disk numbered 1 is redundant for blocks numbered 1, 5, 9, and so on; disk 2 is redundant for blocks 2, 6. 10,. . ., and disk 3 is redundant for 3, 7, 11,. . . .

Index Structures