Efficient Data Organization: Files, Structure, and Access Methods

Chapter9 C H A P T E R 3 DataBase & DBMS J. Glenn Brookshear 蔡文能 J. Glenn Brookshear

Chapter 9: Database Systems • 9.5 Traditional File Structures • 9.1 Database Fundamentals • 9.2 The Relational Model • 9.3 Object-Oriented Databases • 9.4 Maintaining Database Integrity • 9.6 Data Mining • 9.7 Social Impact of Database Technology

Data Structure vs. File Structure • Data Structure : How to arrange data in memory • File Structure : How to arrange data in Disk and/or any other secondary storage • DataBase and DataBase Management System • Users do NOT have to care about how to store data in a file. DBMS will handle the detail. • Users can use SQL (Structured Query Language) to access the DataBase in an interactive command and/or through a program (embeded SQL) RAM Disk

File Types • Sequential file: one accessed in a serial manner from beginning to end. E.g. audio, video, text, programs. • Text file: sequential file in which each logical record is a single character. ASCII: 1 byte/char Unicode: 2 bytes/char

Sequential Files • Sequential file: A file whose contents can only be read in order • Reader must be able to detect end-of-file (EOF) • Data can be stored in logical records, sorted by a key field • Greatly increases the speed of batch updates

Text Files • Simple file structure. • Extendable to more complex file structures using markup languages (XHTML, HTML). • XHTML, HTML control the display of the file on a monitor. • XML is a standard for markup languages.

Converting data from two’s complement notation into ASCII for storage in a text file

Figure 9.15 A procedure for merging two sequential files

Figure 9.16Applying the merge algorithm (Letters are used to represent entire records. The particular letter indicates the value of the record’s key field.)

XML • Example <note> <to>Tove</to> <from>Jani</from> <heading>Reminder</heading> <body>Don't forget me this weekend!</body> </note> http://www.w3schools.com/xml/xml_whatis.asp

Indexed File key pointer Data item for Brown Data file Index file Index: A list of key values and the location of their associated records

An inverted file

A file with a partial index

Figure 9.17 Opening an indexed file

Hashing • In hashing the index file is replaced by a hash function. • The storage space is divided into buckets. • Each record has a key field. Each record is stored in the bucket corresponding to the hash of its key. • A hash function computes a bucket number for each key value. • Advantage: no index table needed. • Disadvantages: i) hash function needs careful design; ii) unpredictable performance

Terminology • Bucket: section of the data storage area. • Key: identifier for a block of information. • Hash function: takes as input a key and outputs a bucket number. • Collision: two keys yield the same bucket number.

Hash Functions • Hash Function Requirements • Easily and quickly computed. • Values evenly spread over the bucket numbers. • What can go wrong: bucket number computed from 1st and 3rd characters of a name: Brown, Brook, Broom, Broadhead, Biot, Bloom, … • Examples of Hash Functions • Mid square: compute (key x key) and set bucket number = middle digits. • Extraction: select digits from certain positions within the key. • Divide key by number of buckets and use the remainder.

Figure 9.18 Hashing the key field value 25X3Z to one of 41 buckets

Figure 9.19 The rudiments of a hashing system, in which each bucket holds those records that hash to that bucket number

Collisions in Hashing • Collision: The case of two keys hashing to the same bucket • Clustering problem: Poorly designed hashing function can have uneven distribution of keys into buckets • Collision also becomes a problem when there aren’t enough buckets (probability greatly increases as load factor (% of buckets filled) approaches 75%) • Solution: somewhere between 50% and 75% load factor, increase number of buckets and rehash all data

Handling bucket overflow

A large file partitioned into bucketsto be accessed by hashing

The role of an operating system when accessing a file System calls ?

Maintaining a file’s order by means of a File Allocation Table (FAT)

Information Required on a Hard Drive to Load an OS • Startup BIOS (POST , Load MBR) • Master Boot Record (MBR) • Master Boot Program • Partition Table (16 bytes * 4) • OS Boot Record (Boot Sector) • Loads the first program file of the OS • Boot Loader Program • Begins process of loading OS into memory

How Data Is Logically Stored on a Floppy Disk • All floppy and hard disk drives are divided into tracks and sectors • Tracks are concentric circles on a disk • Sector • Always 512 bytes • Physical organization of a disk • BIOS manages disk as sectors • Cluster (file allocation unit) • Group of sectors • Logical organization of a disk • OS views disk as a list of clusters

The Boot Record • Track 0, sector 1 of a floppy disk • Contains basic information about how the disk is organized • Includes bootstrap program, which can be used to boot from the disk • Uniform layout and content of boot record allows any version of DOS or Windows to read any DOS or Windows disk

The File Allocation Table (FAT) • Lists the location of files on disk in a one-column table • Floppy disk FAT is 12 bits wide, called FAT12 • Each entry describes how a cluster on the disk is used • A bad cluster on the disk will be marked in the FAT

The Root Directory • Lists all the files assigned to this table • Contains a fixed number of entries • Some items included are: • Filename and extension • Time and date of creation or last update • File attributes • First cluster number

How a Hard Drive is Logically Organized to Hold Data • Low-level format • Creates tracks and sectors, done at factory • Partition the hard drive (FDISK.EXE) • Creates partition table at the beginning of drive • High-level format • Done by OS for each logical drive • Master Boot Record (MBR) is the first 512 bytes of a hard drive • Master boot program (446 bytes) calls boot program to load OS • Partition table • Description, Location, Size

Partitions and Logical Drives

FAT16 • Supported by DOS and all versions of Windows • Uses 16 bits for each cluster entry • As the size of the logical drive increases, FAT16 cluster size increases dramatically

FAT32 • Became available with Windows 95 OSR2 • Used 32 bits per FAT entry, although only 28 bits were used to hold cluster numbers • More efficient than FAT16 in terms of cluster size

NTFS • Supported by Windows NT/2000/XP • Provides greater security • Used a database called the master file table (MFT) to locate files and directories • Supports large hard drives

Comparing FAT16, FAT32, NTFS

Figure 9.1: A file versus a database organization (1/2)

Figure 9.1: A file versus a database organization (2/2)

Figure 9.2: The conceptual layers of a database

Schemas • Schema: A description of the structure of an entire database, used by database software to maintain the database • Subschema: A description of only that portion of the database pertinent to a particular user’s needs, used to prevent sensitive data from being accessed by unauthorized personnel

Database Management Systems • Database Management System (DBMS): A software layer that manipulates a database in response to requests from applications • Distributed Database: A database stored on multiple machines • DBMS will mask this organizational detail from its users • Data independence: The ability to change the organization of a database without changing the application software that uses it

Database Models • Database model: A conceptual view of a database • Relational database model • Object-oriented database model

Relational Database Model • Relation: A rectangular table • Attribute: A column in the table • Tuple: A row in the table • Relational Design • Avoid multiple concepts within one relation • Can lead to redundant data • Deleting a tuple could also delete necessary but unrelated information

Improving a Relational Design • Decomposition: Dividing the columns of a relation into two or more relations, duplicating those columns necessary to maintain relationships • Lossless or nonloss decomposition: A “correct” decomposition that does not lose any information

Figure 9.3: A relation containing employee information

Figure 9.4: A relation containing redundancy

Figure 9.5 An employee database consisting of three relations

Figure 9.6 Finding the departments in which employee 23Y34 has worked

Figure 9.7: A relation and a proposed decomposition

Relational Operations • Select: Choose rows • Project: Choose columns • Join: Assemble information from two or more relations

Figure 9.8: The SELECT operation

Efficient Data Organization: Files, Structure, and Access Methods

Efficient Data Organization: Files, Structure, and Access Methods

Presentation Transcript

J. Glenn Brookshear

Computer Science — An Overview J. Glenn Brookshear

Computer Science: An Overview Tenth Edition by J. Glenn Brookshear

J. Glenn Brookshear

J. Glenn Brookshear

J. Glenn Brookshear

J. Glenn Brookshear

Computer Science: An Overview Eleventh Edition by J. Glenn Brookshear

Computer Science: An Overview Tenth Edition by J. Glenn Brookshear

Computer Science — An Overview J. Glenn Brookshear

J. Glenn Brookshear

J. Glenn Brookshear

Computer Science: An Overview Tenth Edition by J. Glenn Brookshear

Computer Science: An Overview Tenth Edition by J. Glenn Brookshear

J. Glenn Brookshear

Computer Science: An Overview Eleventh Edition by J. Glenn Brookshear

J. Glenn Brookshear

J. Glenn Brookshear

J. Glenn Brookshear