CHAPTER 8

CHAPTER 8 FILE PROCESSING CONCEPT

CONTENTS • Introduction • Primary Key • Classification of Data Files • By Content • By Mode of Processing • By Organization of files • Serial • Sequential • Index Sequential • Random • Transformation Method • Q&A

Introduction • File processing is a computer programming term • It refers to the use of computer files to store data in persistent (permanent) storage • Variables and arrays hold data only temporarily, while the program is running • File processing is a useful alternative to a database only where the information will be accessed by a single user, where speed of data input is vital and where the amount of data being stored is relatively small

Introduction – Elements of a Computer File • A file is a collection of information stored on magnetic media, optical disks or a pen drive • Data files are similar in concept • Files can be created, updated and processed • A file contains logical records, a record contains fields, and a field contains characters • There are 2 categories of record: logical record and physical record • A logical record is one line of data in a file • A physical record is one or more logical records read into or written from main memory as a unit of information

Introduction – Data Hierarchy • FILE: REC-1, REC-2, …, REC-n • LOGICAL RECORD: Field-1, Field-2, …, Field-n • FIELD: Char-1, Char-2, …, Char-n • CHARACTER

Introduction • The number of characters grouped into a field can vary from field to field in a record • 2 types of record: • Fixed length • Each record has a fixed length, e.g. 90 characters. Fields that are not completely filled are padded with space characters, which wastes space. • Variable length • The size of each field, and so of the record, varies according to the size of the data it contains. • Special characters called separators are used to mark where each field ends and where each record ends.
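The difference between the two record types can be sketched in a few lines of Python. The field widths (10 and 8 characters), the sample data and the "|" separator are assumptions made for illustration only; they are not taken from the slides.

```python
# Hypothetical record layout: name (10 characters), city (8 characters).
FIELD_WIDTHS = (10, 8)

def pack_fixed(fields):
    """Pad (or cut) each field to its fixed width using space characters."""
    return "".join(f.ljust(w)[:w] for f, w in zip(fields, FIELD_WIDTHS))

def unpack_fixed(record):
    """Slice a fixed-length record back into fields and drop the padding."""
    out, pos = [], 0
    for w in FIELD_WIDTHS:
        out.append(record[pos:pos + w].rstrip())
        pos += w
    return out

def pack_variable(fields, sep="|"):
    """Join fields with a separator; the record is only as long as its data."""
    return sep.join(fields)

fixed = pack_fixed(["Sam", "Ipoh"])
variable = pack_variable(["Sam", "Ipoh"])
print(len(fixed), len(variable))  # 18 8
```

The fixed-length record always occupies 18 characters, 11 of them padding here, while the variable-length record needs only the data plus one separator.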

Introduction • The information contained in a file relates to one specific type of detail • Different files are used to store different types of details; different types of details are not mixed in a single file • Records are not usually transferred to and from main memory as single logical records but are grouped together (as a block of logical records). • When read, records are stored temporarily in a buffer. • A file normally ends with an “end of file” marker.
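Grouping logical records into blocks can be pictured as follows. This is only a sketch: the function name and the blocking factor of 3 are hypothetical, and a real system would transfer each block between the storage device and a memory buffer.

```python
def make_blocks(logical_records, blocking_factor):
    """Group logical records into blocks of at most `blocking_factor` records."""
    return [
        logical_records[i:i + blocking_factor]
        for i in range(0, len(logical_records), blocking_factor)
    ]

records = ["REC-1", "REC-2", "REC-3", "REC-4", "REC-5", "REC-6", "REC-7"]
blocks = make_blocks(records, 3)
print(len(blocks))  # 3 transfers instead of 7
```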

Primary Key • A file contains a primary key (a field of the record whose value is unique) to identify a particular record • The primary key is made up of one field or a combination of two or more fields of the record • The primary key allows easier and quicker search and retrieval of a particular record by matching the search key against the primary key.

Classification of Data file • The way data files are used is dependent upon : • the contents, • mode of processing and • organisation of the file

Classification according to content • 6 basic categories: • Master File • Transaction File • Index File • Table File • Archival/History File • Backup File

Master File • Contains permanent information reflecting current status. • Used for basic identification and accumulation of certain statistical data, e.g. Product file, Staff file, Customer file etc. • Transaction File • Contains all the data and activities to be applied to the master file. • Accumulated records are used to update the master file, e.g. invoices, purchase orders etc. • Updating method is batch
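A batch update of a master file from accumulated transactions might look like the sketch below. The product codes, quantities and the function name are invented for illustration; a dictionary stands in for the master file and a list for the transaction file.

```python
# Hypothetical master file: product code -> quantity in stock.
master = {"P001": 40, "P002": 15, "P003": 0}

# Hypothetical transaction file accumulated during the day:
# (product code, change in quantity).
transactions = [("P002", -5), ("P001", 10), ("P002", -3)]

def apply_batch(master, transactions):
    """Apply every accumulated transaction to a copy of the master file."""
    updated = dict(master)
    for key, delta in sorted(transactions):  # process in key order
        if key not in updated:
            raise KeyError(f"no master record for {key}")
        updated[key] += delta
    return updated

new_master = apply_batch(master, transactions)
print(new_master)  # {'P001': 50, 'P002': 7, 'P003': 0}
```

The old master is left untouched, which matches the idea of keeping the previous version as a backup.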

Index File • Index files actually consist of a pair of files: one holding the data and one storing an index to that data. • Used to indicate the location of specific records in other files (usually the master file) using an index key or address. • Table File • Static reference data used during processing, e.g. pay rate table for preparation of payroll
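The pairing of a data file with an index can be sketched as below. An in-memory byte stream stands in for the data file and a dictionary for the index file; the customer keys and names are made up for the example.

```python
import io

data_file = io.BytesIO()   # stands in for the data file on disk
index = {}                 # stands in for the index file: key -> byte offset

# Records are written in arrival order; the index remembers where each starts.
for key, name in [("C3", "Tan"), ("C1", "Ali"), ("C2", "Mei")]:
    index[key] = data_file.tell()
    data_file.write(f"{key},{name}\n".encode())

def fetch(key):
    """Use the index to jump straight to the record instead of scanning."""
    data_file.seek(index[key])
    return data_file.readline().decode().rstrip("\n")

print(fetch("C1"))  # C1,Ali
```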

Archival/History File • Often termed master files. • Contains non-current statistical data, used to create comparative reports, pay commission etc. • Normally updated periodically and involves a large volume of data • Backup File • Non-current files stored in the file library • Used when the current master file is destroyed

Classification according to processing mode • Input • Data is loaded into the CPU and processed, and the output is placed in another file • Output • Data is processed and written onto another file • Overlay • A record is accessed, loaded into the CPU, updated and written back to its original location (overwriting the original value).

Classification according to organization of file • File organization is how the records are stored, processed and accessed • It has 3 functions: • Storage of records. • Maintenance of files (updating, editing, deleting) • Retrieval of required items (searching).

Classification according to organization of file • There are several types of file organization: • Serial • Sequential • Indexed Sequential • Random

Serial File • Simplest form of file organization • Records are not kept in any pre-determined order • Records are positioned one after another • New records are added to the end of the file regardless of what they contain • This technique is normally used for storing records for further processing (e.g. sorting) • Normally applied to storage on magnetic tape • Accessing records is very slow
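A serial file can be modelled as a list that only grows at the end and can only be searched from the front. The sketch below is illustrative; the record layout and ID values are assumptions.

```python
serial_file = []  # stands in for a file on tape

def add_record(record):
    """New records always go to the end, whatever they contain."""
    serial_file.append(record)

def find(key):
    """Read from the start until the key matches; count the reads needed."""
    reads = 0
    for record in serial_file:
        reads += 1
        if record["id"] == key:
            return record, reads
    return None, reads

for rec_id in (42, 7, 19, 3):          # arrival order, not key order
    add_record({"id": rec_id})

print(find(3))   # ({'id': 3}, 4)
print(find(99))  # (None, 4)
```

Because there is no ordering, a missing key is only confirmed after every record has been read.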

Sequential File • More organised than a serial file • Records are kept in some pre-defined order, normally the order of the primary key • E.g. book data is stored alphabetically by author • It is not necessary to search the whole file to find that a record is not present: the search can stop as soon as a key greater than the search key is read • It is still inflexible, because if we are looking for books by authors whose names begin with N, we need to scan from A until we come to N
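Because the records are in key order, a search can give up as soon as it passes the place where the key would be. A minimal sketch, with invented author names and titles:

```python
def find_sequential(records, key):
    """Scan records sorted by key; stop early once the key has been passed.

    Returns (record or None, number of records read).
    """
    reads = 0
    for rec_key, value in records:
        reads += 1
        if rec_key == key:
            return (rec_key, value), reads
        if rec_key > key:      # everything after this is larger still
            break
    return None, reads

# Hypothetical book file, ordered by author.
books = [
    ("Adams", "Title A"),
    ("Clarke", "Title C"),
    ("Nabokov", "Title N"),
    ("Orwell", "Title O"),
    ("Woolf", "Title W"),
]

print(find_sequential(books, "Nabokov"))  # (('Nabokov', 'Title N'), 3)
print(find_sequential(books, "Baker"))    # (None, 2)
```

Finding "Nabokov" still means reading every record before it, but the absence of "Baker" is known after two reads rather than five.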

Sequential File • Data cannot be modified in place without the risk of destroying other data in the file. • E.g. if the name “Sam” needed to be changed to “Shaun”, the old name cannot simply be overwritten. The new record contains more characters than the original one. The characters beyond the ‘a’ in “Shaun” would overwrite the beginning of the next sequential record in the file. • Suitable for storage on magnetic tape • Sequential access is not usually used to update records in place. Instead the entire file is usually rewritten. This requires processing every record in the file to update one record. NOTE: In both files (serial and sequential), an individual record can only be found by reading the file from the beginning until the required key value is located.
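The “Sam” to “Shaun” problem can be reproduced with an in-memory byte stream. The second record ("Lee") and the newline used as a record separator are assumptions added for the demonstration.

```python
import io

# Two variable-length records separated by newlines.
f = io.BytesIO(b"Sam\nLee\n")

# Naive in-place update: write the longer name over the shorter one.
f.seek(0)
f.write(b"Shaun")
print(f.getvalue())  # b'Shaunee\n' : the separator and the 'L' of "Lee" are gone

def rewrite(records, old, new):
    """Safe update: copy every record to a new file, replacing the one changed."""
    out = io.BytesIO()
    for rec in records:
        out.write((new if rec == old else rec) + b"\n")
    return out.getvalue()

print(rewrite([b"Sam", b"Lee"], b"Sam", b"Shaun"))  # b'Shaun\nLee\n'
```

The safe version touches every record to change one, which is the cost the slide describes.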

Indexed Sequential File • Basically a hybrid of sequential and random file organisation techniques (uses both sequential and random access methods) • Often referred to as ISAM (Indexed Sequential Access Method) • Records are maintained in key sequence but have an index structure built on top of the actual data • The index to a (large) file may be split into different index levels – INDEX OF INDEXES • Master Index – the highest-level index, which contains pointers to the lower-level indexes

Indexed Sequential File • A particular record is located by following the index tree from the master index down to the data block containing the target record. • The block is then read to locate the record with the matching key • This organisation may be useful for automated teller machines, i.e. customers access their accounts randomly throughout the day and at the end of the day the bank can update the whole file sequentially • One drawback of this organisation is that several index tables must be stored, which makes for a considerable storage overhead
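The walk from master index to data block can be sketched as below. The keys, block contents and the choice of storing the highest key per block in each index entry are assumptions for illustration; real ISAM implementations differ in detail.

```python
from bisect import bisect_left

# Hypothetical data blocks; records are in key sequence within and across blocks.
blocks = [
    [(1, "rec-1"), (3, "rec-3"), (5, "rec-5")],
    [(7, "rec-7"), (9, "rec-9"), (12, "rec-12")],
    [(15, "rec-15"), (20, "rec-20"), (31, "rec-31")],
    [(40, "rec-40"), (52, "rec-52"), (60, "rec-60")],
]
# Low-level indexes: (highest key in block, block number).
low_level = [
    [(5, 0), (12, 1)],
    [(31, 2), (60, 3)],
]
# Master index: (highest key covered, low-level index number).
master = [(12, 0), (60, 1)]

def search_index(index, key):
    """Return the pointer of the first entry whose highest key >= key."""
    highs = [high for high, _ in index]
    i = bisect_left(highs, key)
    return index[i][1] if i < len(index) else None

def find(key):
    low = search_index(master, key)            # master index -> low-level index
    if low is None:
        return None
    block_no = search_index(low_level[low], key)  # low-level index -> block
    for rec_key, value in blocks[block_no]:    # sequential scan inside the block
        if rec_key == key:
            return value
    return None

print(find(7))   # rec-7
print(find(8))   # None
```

Only two small index lookups and one block read are needed, instead of scanning the file from the first record.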

Indexed Sequential File • [Figure: locating record 7, whose address is 050E, through the INDEX; Blocks 2, 3 and 4 are shown, with the index entry "3 050K"]

Indexed Sequential File – Multi-level structure • [Figure: locating record 100, whose address is 053X; labels shown: Low-level index 2, Index, Block 159]

Random File • Records are normally fixed in length • Records are accessed directly without searching through the preceding records • Data can be inserted in a randomly accessed file without destroying other data in the file. • Data previously stored can also be updated or deleted without rewriting the entire file. • E.g. airline reservation systems, banking systems etc. • Since every record is the same length, the computer can quickly calculate (as a function of the record key) the exact location of a record relative to the beginning of the file.
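With fixed-length records the location is simple arithmetic: offset = record number × record length. The sketch below assumes a 16-byte record and uses the record number directly as the key; both are illustrative choices.

```python
import io

RECORD_SIZE = 16  # hypothetical fixed record length in bytes

def offset(record_number):
    """Location of a record relative to the beginning of the file."""
    return record_number * RECORD_SIZE

def write_record(f, record_number, text):
    data = text.encode().ljust(RECORD_SIZE)[:RECORD_SIZE]  # pad with spaces
    f.seek(offset(record_number))
    f.write(data)

def read_record(f, record_number):
    f.seek(offset(record_number))
    return f.read(RECORD_SIZE).decode().rstrip()

f = io.BytesIO(b" " * RECORD_SIZE * 4)   # a file with 4 empty record slots
write_record(f, 2, "Sam")
write_record(f, 0, "Lee")
write_record(f, 2, "Shaun")              # update in place; neighbours untouched

print(read_record(f, 2))  # Shaun
print(read_record(f, 0))  # Lee
```

Unlike the sequential case, replacing "Sam" with "Shaun" is safe here because each record owns a slot of fixed size.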

Random File • A random file uses a block address calculation algorithm • The algorithm takes the record key as input and returns the block number • The problem is how to store data efficiently so that, given the record key, the storage location can be found. • Keys are unlikely to run sequentially, so the file has clusters and gaps. For example, suppose storage is determined alphabetically by the first letter of the customer name. Some letters are common, e.g. A, B, D, but some are not, e.g. Q, X. • A good algorithm is needed to generate uniformly distributed addresses – a hashing algorithm

Transformation Method • 5 major techniques for hash coding: • Division • Truncation • Extraction • Folding • Randomizing • All techniques aim to generate a set of addresses that maps the keys to the storage area as uniformly as possible. • Best known and most used technique: division • Division is done by dividing the primary key by a positive integer, usually a prime number approximately equal to the number of available addresses, and using the remainder as the address
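The division technique is a one-line calculation. In this sketch the divisor 97 (a prime) and the sample keys are assumptions chosen for illustration.

```python
def division_hash(key, divisor=97):
    """Address = remainder after dividing the key by a prime divisor."""
    return key % divisor

print(division_hash(12345))  # 26
print(division_hash(12442))  # 26, a collision: the keys differ by exactly 97
```

Every address falls between 0 and 96, and two keys that differ by a multiple of the divisor land on the same address, which is why a collision-handling scheme is still needed.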

Transformation Method Here are some relatively simple hash functions that have been used: • The division-remainder method: The number of items in the table is estimated. That number is then used as a divisor into each original value or key to extract a quotient and a remainder. The remainder is the hashed value. (Since this method is liable to produce a number of collisions, any search mechanism would have to be able to recognize a collision and offer an alternate search mechanism.) • Folding: This method divides the original value (digits in this case) into several parts, adds the parts together, and then uses the last four digits (or some other arbitrary number of digits that will work) as the hashed value or key.
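Folding as described above can be sketched as follows. Splitting the key into 4-digit parts is an assumption; the slides only say "several parts".

```python
def folding_hash(key, part_size=4, keep_digits=4):
    """Split the key's digits into parts, add them, keep the last digits."""
    digits = str(key)
    parts = [
        int(digits[i:i + part_size])
        for i in range(0, len(digits), part_size)
    ]
    return sum(parts) % 10 ** keep_digits

# 1234 + 5678 + 9012 = 15924; the last four digits are 5924.
print(folding_hash(123456789012))  # 5924
```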

Transformation Method • Radix transformation: Where the value or key is digital, the number base (or radix) can be changed resulting in a different sequence of digits. (For example, a decimal numbered key could be transformed into a hexadecimal numbered key.) High-order digits could be discarded to fit a hash value of uniform length. • Digit rearrangement: This is simply taking part of the original value or key such as digits in positions 3 through 6, reversing their order, and then using that sequence of digits as the hash value or key.
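Both techniques can be sketched in a few lines. The number of digits kept (3) and the sample keys are assumptions; positions are counted from 1, as in the text.

```python
def radix_hash(key, keep_digits=3):
    """Re-express a decimal key in hexadecimal, then drop high-order digits."""
    hex_digits = format(key, "x")
    return hex_digits[-keep_digits:]

def rearrange_hash(key, start=3, end=6):
    """Take the digits in positions start..end (1-based) and reverse them."""
    digits = str(key)
    return digits[start - 1:end][::-1]

print(radix_hash(123456))        # 240   (123456 is 1e240 in hexadecimal)
print(rearrange_hash(12345678))  # 6543  (digits 3 to 6 are 3456)
```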
