File System Architecture

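File system architecture (figure): a directory tree rooted at / with the entries bin, etc, user, unix and dev; further down the tree are the nodes tty00, tty01, mike, jim, x, y and z.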

File System Layout
Boot block, super block, inode list, data blocks (in that order on disk)
• Boot block: the first sector; contains the bootstrap code that initializes the operating system.
• Super block: describes the file system, such as how many files it can store and where to find free space.
• Inode list: the list of inodes in the file system. Each inode represents a file or a directory.
• Data blocks: the blocks that hold the contents of the files.

The Buffer Cache (figure): file blocks read from disk are held in the buffer cache in memory, and file blocks being written pass through the cache on their way to disk.

Buffer allocation algorithms
• getblk: allocate a buffer in memory.
• brelse: release a buffer.
• bread: read a disk block.
• breada: read a disk block and read another block ahead.
• bwrite: write a disk block.

Structure of the Buffer Pool (figure)

Structure of Buffer Header
• Device num: specifies the file system device number.
• Block num: specifies the block number of the data on the file system.
• Ptr to data area
• Ptr to next buf on hash queue
• Ptr to prev buf on hash queue
• Ptr to next buf on free list
• Ptr to prev buf on free list

Buffer status
• The buffer is currently locked (busy) or unlocked (free).
• The buffer contains valid data.
• Delayed write: the kernel must write the buffer contents back to disk before reassigning the buffer.
• The kernel is currently reading the buffer contents from disk or writing them to disk.
• A process is waiting for the buffer to become free.
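The header fields and status conditions above can be pictured as a single record. The following is only a sketch: the field names, the 1K data area and the helper `can_reassign` are assumptions made for illustration, not the kernel's actual declarations.

```python
from dataclasses import dataclass, field
from typing import Optional

BLOCK_SIZE = 1024  # assumed size of the data area, in bytes


@dataclass
class BufferHeader:
    # Identity of the disk block currently held by this buffer.
    device_num: int
    block_num: int
    # Data area holding the block contents.
    data: bytearray = field(default_factory=lambda: bytearray(BLOCK_SIZE))
    # Status flags, one per condition listed under "Buffer status".
    busy: bool = False            # locked by a process
    valid: bool = False           # data area holds valid data
    delayed_write: bool = False   # must be written to disk before reuse
    io_in_progress: bool = False  # kernel is reading or writing the buffer
    wanted: bool = False          # a process is waiting for this buffer
    # Doubly linked list pointers for the hash queue and the free list.
    hash_next: Optional["BufferHeader"] = None
    hash_prev: Optional["BufferHeader"] = None
    free_next: Optional["BufferHeader"] = None
    free_prev: Optional["BufferHeader"] = None


def can_reassign(buf: BufferHeader) -> bool:
    """Hypothetical helper: a buffer can be handed to another block
    only if it is not busy and has no delayed write pending."""
    return not buf.busy and not buf.delayed_write
```

Keeping the status as separate flags mirrors the slide: several conditions (for example valid and delayed write) can hold at the same time.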

Buffer Functions
• getblk: allocate a buffer for reading or writing.
• brelse: release a buffer when it is no longer needed.

Free list of buffers (figure)

Buffer Allocation: Scenario 1
The kernel finds the block on its hash queue, and its buffer is free.

Buffer Allocation: Scenario 1 (after buffer allocation using getblk)
• The kernel marks the buffer busy; no other process can access it or change its contents while it is busy.
• The kernel may then read data from disk into the buffer, or write the buffer's data to disk.

Buffer Allocation: Scenario 2 (search for block 18, not in cache)
The kernel cannot find the block on its hash queue, so it allocates a buffer from the free list.

Buffer Allocation: Scenario 2
• The kernel removes the first buffer (holding block 3) from the free list.
• The kernel marks the buffer busy.
• It removes the buffer from the hash queue on which it currently resides.
• It reassigns the buffer's device number and block number to those of the requested block.
• It places the buffer on the correct hash queue.
• The buffer can now be used for reading or writing.

Buffer Allocation: Scenario 3 (search for block 18, not in cache)
The kernel cannot find the block on its hash queue and, in attempting to allocate a buffer from the free list, finds a buffer that is marked delayed write. The kernel starts writing that buffer to disk and allocates another buffer.

Buffer Allocation: Scenario 3 (delayed write)
• The kernel takes the buffers holding blocks 3 and 5 off the free list.
• The kernel starts asynchronous writes of blocks 3 and 5.

Buffer Allocation: Scenario 3 (writing)
• The kernel allocates the buffer holding block 4 and removes it from the free list.
• It assigns the device number and block number (18) of the requested block to the buffer.
• It places the buffer on the correct hash queue.

Buffer Allocation: Scenario 3 (writing complete)
When the writes complete, the buffers holding blocks 3 and 5 are placed back on the free list.

Buffer Allocation: Scenario 4 (search for block 18, not in cache)
The kernel does not find block 18 in the cache, and the free list is empty. The process goes to sleep until another process executes brelse, which releases a buffer and wakes up the processes waiting for this event.

Buffer Allocation: Scenario 5 (search for block 99, block is busy)
The kernel searches for a block in the cache and finds it, but the buffer is busy. The process goes to sleep and waits until the buffer becomes available.

Race condition for a free buffer
• Process A: allocate buffer for block b, mark buffer busy, initiate I/O, sleep until done.
• Process B: find block b in hash queue; buffer locked, go to sleep.
• Process A: I/O done, wake up; brelse() wakes up the others.
• Process B: buffer contains block b; lock the buffer.

Race condition (three processes)
• Process A: allocate buffer for block b, mark buffer busy, initiate I/O, sleep until done.
• Process B: find block b in hash queue; buffer locked, go to sleep.
• Process C: sleep waiting for a free buffer.
• Process A: I/O done, wake up; brelse() wakes up the others.
• Process C: get the buffer assigned to block b; reassign the buffer.
• Process B: buffer does not contain block b; start the search again.
A process can sleep, wake up when a buffer becomes free, and then go to sleep again because another process got control of the buffer first.

getblk algorithm
getblk (block no)
{
    while (buffer not found)
    {
        if (block in hash queue)
        {
            if (buffer busy)                          // scenario 5
            {
                sleep (event buffer becomes free)
                continue
            }
            mark buffer busy                          // scenario 1
            remove buffer from free list
            return buffer
        }
        else                                          // block not in hash queue
        {
            if (there are no buffers on free list)    // scenario 4
            {
                sleep (event any buffer becomes free)
                continue
            }
            remove buffer from free list
            if (buffer marked for delayed write)      // scenario 3
            {
                asynchronous write buffer to disk
                continue
            }
            remove buffer from old hash queue         // scenario 2
            put buffer onto new hash queue
            return buffer
        }
    }
}

brelse algorithm
brelse (locked buffer)
{
    wake up all processes waiting for any buffer to become free
    wake up all processes waiting for this buffer to become free
    if (buffer contents valid and buffer not old)
        enqueue buffer at end of free list
    else
        enqueue buffer at beginning of free list
    unlock (buffer)
}
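The five getblk scenarios and the brelse placement rule can be exercised with a small single-threaded simulation. This is a sketch under several assumptions: a dictionary stands in for the hash queues, the `WouldSleep` exception marks the points where the kernel would put the process to sleep, and `complete_writes` is a hypothetical stand-in for the disk interrupt that finishes asynchronous writes. The class and method names other than getblk and brelse are invented for the example.

```python
class WouldSleep(Exception):
    """Raised where the kernel algorithm would put the process to sleep."""


class Buffer:
    def __init__(self, ident):
        self.ident = ident          # label for this buffer, for inspection only
        self.block = None           # block number currently assigned, if any
        self.busy = False
        self.delayed_write = False


class BufferCache:
    def __init__(self, nbuf):
        self.hash = {}              # block number -> Buffer (stands in for hash queues)
        self.free = [Buffer(i) for i in range(nbuf)]  # free list, head at index 0
        self.writing = []           # buffers with an asynchronous write in flight

    def getblk(self, block):
        while True:
            buf = self.hash.get(block)
            if buf is not None:
                if buf.busy:                        # scenario 5
                    raise WouldSleep("buffer busy")
                buf.busy = True                     # scenario 1
                self.free.remove(buf)
                return buf
            if not self.free:                       # scenario 4
                raise WouldSleep("no free buffer")
            buf = self.free.pop(0)
            if buf.delayed_write:                   # scenario 3
                buf.busy = True
                self.writing.append(buf)            # "start" the asynchronous write
                continue
            if buf.block is not None:               # scenario 2
                del self.hash[buf.block]            # leave the old hash queue
            buf.block = block
            buf.busy = True
            self.hash[block] = buf                  # join the new hash queue
            return buf

    def brelse(self, buf, old=False):
        buf.busy = False
        if old:
            self.free.insert(0, buf)                # beginning of free list
        else:
            self.free.append(buf)                   # end of free list

    def complete_writes(self):
        """Finish in-flight writes; written buffers go to the head of the free list."""
        while self.writing:
            buf = self.writing.pop()
            buf.delayed_write = False
            self.brelse(buf, old=True)
```

Raising an exception instead of sleeping keeps the sketch deterministic; a caller that catches `WouldSleep` and retries later plays the role of a process that is woken up and restarts the search.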

bwrite algorithm
bwrite (buffer)
{
    initiate disk write
    if (I/O synchronous)
    {
        sleep (event I/O complete)
        release buffer (brelse)
    }
    else if (buffer marked for delayed write)
        mark buffer to be put at head of free list
}

bread algorithm
bread (block no)
{
    get buffer for block no (getblk)
    if (buffer data valid)
        return buffer
    initiate disk read
    sleep (event disk read complete)
    return buffer
}

breada algorithm
• When a process reads a file sequentially, two disk blocks are read: the block it asked for and the block after it.
• The process asks for the second block to be read ahead using breada.
• If the first block is not in the cache, a disk read is started for it.
• If the second block is not in the cache, an asynchronous read is started for it.
• The process sleeps until the first block has been read, and the buffer is returned.
• The process does not wait for the second block to be read.

breada algorithm
Input: file system block number for immediate read
       file system block number for asynchronous read
breada (first block no, second block no)
{
    if (second block not in cache)
    {
        get buffer for second block (getblk)
        initiate disk read
    }
    if (first block not in cache)
    {
        get buffer for first block (getblk)
        initiate disk read
        sleep (event first buffer contains valid data)
        return buffer
    }
    else                                  // first block in cache
    {
        read first block (bread)
        return buffer
    }
}
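The benefit of read-ahead is easier to see in a toy model. The sketch below assumes a dictionary as the "disk", and a hypothetical `complete_io` method that stands in for the interrupt finishing the asynchronous reads; `ReadAheadCache`, `pending` and `sync_reads` are names invented for the example, and buffer locking is left out.

```python
class ReadAheadCache:
    def __init__(self, disk):
        self.disk = disk        # block number -> bytes (stand-in for the disk)
        self.cache = {}         # blocks whose data is valid in the cache
        self.pending = []       # asynchronous reads started but not finished
        self.sync_reads = 0     # reads the caller had to wait for

    def bread(self, block):
        if block not in self.cache:
            # Caller waits for this read (a real kernel would sleep here).
            self.cache[block] = self.disk[block]
            self.sync_reads += 1
        return self.cache[block]

    def breada(self, block, ahead):
        if ahead not in self.cache and ahead not in self.pending:
            self.pending.append(ahead)   # start the read, do not wait for it
        return self.bread(block)

    def complete_io(self):
        """Finish all asynchronous reads that were started."""
        for block in self.pending:
            self.cache[block] = self.disk[block]
        self.pending.clear()
```

Reading blocks 0, 1, 2 in sequence with `breada(n, n + 1)` makes the caller wait only for block 0, provided the asynchronous reads complete before the next call.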

Advantages and Disadvantages of the Buffer Cache
• Use of the buffer cache can reduce the amount of disk traffic, thereby increasing overall system throughput and decreasing response time.
• The buffer algorithms help ensure system integrity, because they maintain a common, single image of the disk blocks contained in the cache.
• Disadvantage: because of delayed write, a crash might leave the file system in an incorrect state.

Lower Level File System Algorithms
• iget: return a previously allocated inode, possibly reading it from the disk.
• iput: release the inode.
• namei: convert a path name to an inode, using iget, iput and bmap.
• alloc: allocate a free disk block for a file.
• free: free a disk block.
• bmap: map a logical file byte offset to a file system block.
• ialloc: allocate an inode for a file.
• ifree: free the inode of a file.

File System Data Structures
User file descriptor table, file table, inode table
• User file descriptor table: one per process; identifies all open files for that process.
• File table: shared between all processes in the system; records how many bytes have been read or written and the access rights allowed for the file.
• Inode table: access rights and the location of the file's blocks.

Inode list
• Disk blocks: 512 bytes each
• Inode: 64 bytes, so 8 inodes per block
• Block # = (inode# - 1) / #-of-inodes-per-block (integer division, counted from the first block of the inode list)
• Inode offset in block = ((inode# - 1) % #-of-inodes-per-block) x inode-size
• Inode location = block# x block-size + inode offset in block
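The inode location arithmetic can be checked with a few lines of code. This sketch assumes inode numbers start at 1 and that block numbers are counted from the first block of the inode list unless a `start_block` is given; the function name and the `start_block` parameter are made up for the example.

```python
BLOCK_SIZE = 512                              # bytes per disk block
INODE_SIZE = 64                               # bytes per disk inode
INODES_PER_BLOCK = BLOCK_SIZE // INODE_SIZE   # 8


def inode_location(inode_num, start_block=0):
    """Return (block number, offset in block, absolute byte location)."""
    index = inode_num - 1                     # inode numbers start at 1
    block = start_block + index // INODES_PER_BLOCK
    offset = (index % INODES_PER_BLOCK) * INODE_SIZE
    return block, offset, block * BLOCK_SIZE + offset
```

For instance, inode 8 is the last inode of the first block (offset 448), and inode 9 is the first inode of the next block.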

Inode Data Structure (figure): the inode as held in core and as stored on disk.

Active inode hash table and free list (figure): in-core inodes linked on hash queues and on the free list.

iget algorithm
iget (inode no)
{
    while (not done)
    {
        if (inode in inode cache)
        {
            if (inode locked)
            {
                sleep (until inode is unlocked)
                continue
            }
            if (inode on free list)
                remove inode from free list
            increment inode reference count by 1
            return inode
        }
        // inode not in inode cache
        if (no inodes on free list)
            return error
        remove new inode from free list
        remove inode from old hash queue, and place on new one
        read inode from disk (bread)
        increment inode reference count by 1
        return inode
    }
}

iput algorithm
iput (inode)
{
    lock inode
    decrement inode reference count
    if (reference count == 0)
    {
        if (inode link count == 0)
        {
            free disk blocks for file (algorithm free)
            set file type to 0
            free inode (algorithm ifree)
        }
        if (file accessed or modified, or inode modified)
            update disk inode
        put inode on free list
    }
    unlock inode
}
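The interplay of iget and iput is mostly reference counting, which a short single-threaded sketch can show. The assumptions here are loud ones: locking and sleeping are omitted, a dictionary replaces the hash queues, and the "disk" is a dictionary from inode number to link count. `Inode`, `InodeCache` and their attributes are names invented for the example.

```python
class Inode:
    def __init__(self):
        self.number = None      # inode number currently held, if any
        self.ref_count = 0      # active in-core references
        self.link_count = 0     # directory entries naming the file


class InodeCache:
    def __init__(self, size, disk):
        self.disk = disk                    # inode number -> link count
        self.by_number = {}                 # stands in for the hash queues
        self.free = [Inode() for _ in range(size)]

    def iget(self, number):
        inode = self.by_number.get(number)
        if inode is not None:               # inode in inode cache
            if inode in self.free:
                self.free.remove(inode)
            inode.ref_count += 1
            return inode
        if not self.free:                   # no inodes on free list
            return None                     # error
        inode = self.free.pop(0)
        self.by_number.pop(inode.number, None)   # leave old hash queue
        inode.number = number
        inode.link_count = self.disk[number]     # "read inode from disk"
        inode.ref_count = 1
        self.by_number[number] = inode           # join new hash queue
        return inode

    def iput(self, inode):
        inode.ref_count -= 1
        if inode.ref_count == 0:
            if inode.link_count == 0:
                # Stands in for freeing the file's blocks and its disk inode.
                self.disk.pop(inode.number, None)
                self.by_number.pop(inode.number, None)
                inode.number = None
            self.free.append(inode)         # put inode on free list
```

An inode whose reference count drops to zero stays findable in the cache until its slot is reused, so a quick second iget for the same number does not need to go back to disk.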

Allocation of contiguous blocks for a file
Figure: disk regions labelled File A, File B, free and File C, with block addresses 40, 50, 60, 70 and 85 marked, and the inode for file A.
• Simple inode structure: it points to the first and last location.
• Difficult to expand a file if no space is available.
• Inefficient to expand a file (the file must be copied to a new location).
• Fragmentation (garbage collection required).

Block file allocation with a fixed-size inode
• No fragmentation problem.
• Since the inode has a fixed size, it is difficult to increase the file size.

Block size selection
• It is difficult to find a block size that both minimizes fragmentation and maximizes file size.
• Decreasing the block size decreases fragmentation within blocks.
• With a fixed-size inode, decreasing the block size also decreases the maximum file size.

Minimizing fragmentation and varying the file size (figure)

Block layout of a sample file
• Assume the block size is 1K bytes.
• A block number is addressable by 32 bits (4 bytes).
• Block numbers per block = 1024 / 4 = 256.

Figure: block layout of the sample file. The block numbers shown include 4096, 228, 367, 0, 3333, 331 and 9156; the double indirect chain in the figure runs through single indirect block 331 to data block 3333.

Maximum file size (1K bytes per block)
• 10 direct blocks with 1K bytes each: 10K bytes
• 1 indirect block with 256 direct blocks: 256K bytes
• 1 double indirect block with 256 indirect blocks: 64M bytes
• 1 triple indirect block with 256 double indirect blocks: 16G bytes

Process wants to access byte offset 9000
• Block # = 9000 / 1024 = 8, counting from 0 (block # 367).
• Offset within block = 9000 - 8 x 1024 = 808, so this is byte 808 of block # 367.

Process wants to access byte offset 45000
• Block # = 45000 / 1024 = 43, counting from 0. This is beyond the 10 direct blocks, so the block is reached through the single indirect block.
• Entry within the single indirect block = 43 - 10 = 33.
• Offset within block = 45000 - 43 x 1024 = 45000 - 44032 = 968.

bmap algorithm
bmap (offset)    // map logical file offset to physical block #
{
    if (offset < 10K)
        indirection level = 0
    else if (offset < 10K + 256K)
        indirection level = 1
    else if (offset < 10K + 256K + 64M)
        indirection level = 2
    else
        indirection level = 3
    for (l = 1; l <= indirection level; l++)
    {
        calculate indirect block # from file offset
        read indirect block using bread
        release the previous indirect block using brelse
    }
    calculate direct block #
    return (block #)
}
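The index arithmetic behind bmap can be sketched without any disk access. The version below is an illustration, not the kernel routine: it assumes 1K blocks, 10 direct entries and, by default, 256 block numbers per indirect block (4-byte block numbers), and it returns the indexes to follow rather than reading any blocks. The `ptrs_per_block` parameter is an assumption added so other layouts can be tried.

```python
BLOCK_SIZE = 1024   # bytes per block, as in the sample file
NUM_DIRECT = 10     # direct block entries in the inode


def bmap(offset, ptrs_per_block=256):
    """Map a byte offset to (indirection level, indexes to follow, byte in block).

    For level 0 the single index selects a direct entry in the inode. For
    higher levels the indexes select one entry in each block along the
    chain of indirect blocks, outermost block first.
    """
    logical = offset // BLOCK_SIZE
    byte_in_block = offset % BLOCK_SIZE
    if logical < NUM_DIRECT:
        return 0, [logical], byte_in_block

    logical -= NUM_DIRECT
    level = 1
    span = ptrs_per_block          # data blocks reachable at this level
    while logical >= span:
        logical -= span
        span *= ptrs_per_block
        level += 1
        if level > 3:
            raise ValueError("offset beyond maximum file size")

    indexes = []
    for _ in range(level):
        span //= ptrs_per_block    # data blocks covered by one entry
        indexes.append(logical // span)
        logical %= span
    return level, indexes, byte_in_block
```

With the defaults, `bmap(9000)` yields level 0, direct entry 8 and byte 808, and `bmap(45000)` yields level 1, entry 33 of the single indirect block and byte 968.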

Directory structure (figure)

namei algorithm
namei (path name)    // convert path name to inode
{
    if (path name starts from root)              // e.g. /user/cse8343/jim
        working inode = root inode (algorithm iget)
    else                                         // e.g. ./cse8343/jim
        working inode = current directory inode (algorithm iget)
    while (there is more path name)
    {
        read next path name component from input
        verify that working inode is a directory and that permissions are OK
        read directory (working inode) using bmap, bread and brelse
        if (component matches an entry in directory (working inode))
        {
            get inode number for matched component
            release working inode (algorithm iput)
            working inode = inode of matched component (algorithm iget)
        }
        else                                     // component not in directory
            return (no inode)
    }
    return (working inode)
}
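The component-by-component walk in namei can be tried on an in-memory directory table. This is a sketch with several simplifications: directories are plain dictionaries, permission checks and the iget/iput calls are left out, and the root inode number `ROOT_INODE` and all inode numbers in the usage are hypothetical.

```python
ROOT_INODE = 2  # hypothetical inode number for "/"


def namei(path, directories, cwd=ROOT_INODE):
    """Return the inode number named by path, or None if it does not exist.

    directories maps a directory's inode number to its entries
    ({component name: inode number}); inodes absent from it are not
    directories.
    """
    working = ROOT_INODE if path.startswith("/") else cwd
    for component in path.split("/"):
        if component in ("", "."):
            continue                      # skip empty parts and "."
        entries = directories.get(working)
        if entries is None:               # working inode is not a directory
            return None
        if component not in entries:      # component not in directory
            return None
        working = entries[component]
    return working
```

A table such as `{2: {"user": 5}, 5: {"cse8343": 9}, 9: {"jim": 12}}` lets `/user/cse8343/jim` be resolved from the root, and `./cse8343/jim` be resolved relative to inode 5.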

Super Block Fields
• The size of the file system.
• The number of free blocks in the file system.
• A list of free blocks available in the file system.
• The index of the next free block in the free block list.
• The size of the inode list.
• The number of free inodes in the file system.
• A list of free inodes in the file system.
• The index of the next free inode in the free inode list.
• Lock fields for the free block list and the free inode list.
• A flag indicating that the super block has been modified.

Allocation of a new inode (figure): the super block holds a free inode list and a remembered inode (the highest free inode it found before), both of which refer to the inode list on disk.

Allocation of a new inode when the free inode list in the super block is not empty (figure)
• Before: slots 18 and 19 of the free inode list hold inodes 83 and 48; index = 19.
• After: slot 18 holds inode 83 and slot 19 is empty; index = 18.
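The non-empty case of inode allocation amounts to taking the entry at the index and moving the index down by one. The sketch below assumes the free inode list is a fixed array and that the index points at the next free inode; the class and attribute names are invented, and refilling the list from disk when it runs empty is left out.

```python
class SuperBlock:
    def __init__(self, free_inodes, index):
        self.free_inodes = free_inodes  # fixed array of free inode numbers
        self.index = index              # slot of the next free inode, -1 if empty

    def ialloc(self):
        if self.index < 0:
            return None                 # list empty: the kernel would search the disk
        inode_num = self.free_inodes[self.index]
        self.free_inodes[self.index] = None   # slot is now empty
        self.index -= 1
        return inode_num
```

Starting with inodes 83 and 48 in slots 18 and 19 and the index at 19, one call hands out inode 48 and leaves the index at 18.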
