Storage Jeff Chase Duke University
Storage: The Big Issues • Disks are rotational media with mechanical arms. • High access cost: use caching and prefetching. • Cost depends on previous access: use careful block placement and scheduling. • Stored data is hard state. • Stored data persists after a restart. • Data corruption and poor allocations also persist. • Allocate for longevity, and write carefully. • Disks fail. • Plan for failure: use redundancy and replication. • RAID: integrate redundancy with striping across multiple disks for higher throughput.
Rotational Media • Access time = seek time + rotational delay + transfer time • seek time = 5-15 milliseconds to move the disk arm and settle on a cylinder • rotational delay = about 8 milliseconds for a full rotation at 7200 RPM; average delay = about 4 ms • transfer time = about 1 millisecond for an 8KB block at 8 MB/s • Bandwidth utilization is less than 50% for any noncontiguous access at a block grain. [Figure labels: track, sector, arm, cylinder, platter, head]
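The access-time formula on this slide can be checked with a few lines of C. This is a minimal sketch, not part of the original deck: the function names `access_time_ms` and `utilization` are made up for illustration, and the inputs are the slide's round numbers rather than measurements of any real disk.

```c
#include <assert.h>

/* Estimated time in milliseconds to read one block at a random position.
 * seek_ms, rpm, block_kb and mb_per_s are illustrative inputs. */
double access_time_ms(double seek_ms, double rpm, double block_kb, double mb_per_s)
{
    assert(rpm > 0 && mb_per_s > 0);
    double full_rotation_ms = 60000.0 / rpm;         /* one full revolution */
    double avg_rotation_ms = full_rotation_ms / 2.0; /* on average, half a turn */
    double transfer_ms = block_kb / 1024.0 / mb_per_s * 1000.0;
    return seek_ms + avg_rotation_ms + transfer_ms;
}

/* Fraction of the access time spent actually transferring data. */
double utilization(double seek_ms, double rpm, double block_kb, double mb_per_s)
{
    double transfer_ms = block_kb / 1024.0 / mb_per_s * 1000.0;
    return transfer_ms / access_time_ms(seek_ms, rpm, block_kb, mb_per_s);
}
```

Calling `access_time_ms(10.0, 7200.0, 8.0, 8.0)` gives a little over 15 ms, of which only about 1 ms is transfer, so utilization for an isolated 8KB access is far below the 50% bound quoted on the slide.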
Disks and Drivers • Disk hardware and driver software provide foundational support for block devices. • OS views the block devices as a collection of volumes. • A logical volume may be a partition of a single disk or a concatenation of multiple physical disks (e.g., RAID). • volume == LUN • Each volume is an array of fixed-size sectors. • Name sector/block by (volumeID, sector ID). • Read/write operations DMA data to/from physical memory. • Device interrupts OS on I/O completion. • ISR wakes process, updates internal records, etc.
RAID • RAID levels 3 through 5 • Backup and parity drives are shaded in the figure. [Figure: RAID levels 3 through 5]
Filesystems • Files • Sequentially numbered bytes or logical blocks. • Metadata stored in an on-disk data object • e.g., Unix “inode” • Directories • A special kind of file with a set of name mappings. • E.g., name to inode • Pointer to parent in rooted hierarchy: .., / • System calls • Unix: open, close, read, write, stat, seek, sync, link, unlink, symlink, chdir, chroot, mount, chmod, chown.
A Typical Unix File Tree • Each volume is a set of directories and files; a host’s file tree is the set of directories and files visible to processes on a given host. • File trees are built by grafting volumes from different devices or from network servers. • In Unix, the graft operation is the privileged mount system call, and each volume is a filesystem. • mount (coveredDir, volume) • coveredDir: directory pathname • volume: device specifier or network volume • volume root contents become visible at pathname coveredDir [Figure: file tree rooted at /, with nodes bin, etc, tmp, usr, vmunix, ls, sh, project, users, packages, tex, emacs; one node is marked as a mount point (volume root)]
Abstractions • User view: Addressbook, record for Duke CPS • Application: addrfile -> fid, byte range* • File System: fid, bytes -> device, block # • Disk Subsystem: device, block # -> surface, cylinder, sector
Directories • Entries or slots are found by a linear scan. [Figure: a directory with slots wind: 18, 0, 0, snow: 62, rain: 32, hail: 48; other labels: directory inode, sector 32]
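The linear scan over directory slots can be sketched in C as follows. This is an illustrative in-memory model, not the Nachos or Unix implementation: `struct dir_entry`, `dir_lookup`, and the fixed name length are assumptions made for the example, and the table contents are copied from the slide's figure.

```c
#include <assert.h>
#include <string.h>

#define NAME_MAX_LEN 9   /* hypothetical fixed name length */
#define NUM_SLOTS 6

struct dir_entry {
    int in_use;                    /* 0 marks an empty slot */
    char name[NAME_MAX_LEN + 1];
    int sector;                    /* sector number stored with the name */
};

/* Scan every slot in order; return the sector for name, or -1 if absent. */
int dir_lookup(const struct dir_entry *table, int nslots, const char *name)
{
    assert(table != NULL && name != NULL);
    for (int i = 0; i < nslots; i++) {
        if (table[i].in_use && strcmp(table[i].name, name) == 0)
            return table[i].sector;
    }
    return -1;
}

/* The directory from the figure: wind, two empty slots, snow, rain, hail. */
static const struct dir_entry example_dir[NUM_SLOTS] = {
    {1, "wind", 18}, {0, "", 0}, {0, "", 0},
    {1, "snow", 62}, {1, "rain", 32}, {1, "hail", 48},
};
```

Here `dir_lookup(example_dir, NUM_SLOTS, "rain")` returns 32, and a name that is not present returns -1. Lookup cost grows with the number of slots, which is why large directories are a concern for this layout.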
A Filesystem On Disk • This is just an example (Nachos). [Figure: disk layout showing sector 0, sector 1, the allocation bitmap file (11100010 00101101 10111101 10011010 00110001 00010101 00101110 00011001 01000100), the directory file (wind: 18, 0, 0, snow: 62, rain: 32, hail: 48), and file data blocks ("once upo" | "n a time" | "/n in a l" | "and far far away" | ", lived th")]
UNIX File System Calls • Open files are named by an integer file descriptor. • Pathnames may be relative to the process current directory. • Process passes status back to parent on exit, to report success/failure. • Process does not specify current file offset: the system remembers it. • Standard descriptors (0, 1, 2) for input, output, error messages (stdin, stdout, stderr).
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #define BUFSIZE 4096   /* buffer size chosen for illustration */
    int main(void) {
        char buf[BUFSIZE];
        int fd;
        ssize_t n;
        if ((fd = open("../zot", O_TRUNC | O_RDWR)) == -1) {
            perror("open failed");
            exit(1);
        }
        /* Copy standard input (descriptor 0) to the file, writing only the bytes read. */
        while ((n = read(0, buf, BUFSIZE)) > 0) {
            if (write(fd, buf, n) != n) {
                perror("write failed");
                exit(1);
            }
        }
        return 0;
    }
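File Sharing Between Parent/Child (UNIX) [Bach]
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>
    int main(int argc, char *argv[]) {
        char c;
        int fdrd, fdwt;
        if (argc != 3) exit(1);
        if ((fdrd = open(argv[1], O_RDONLY)) == -1) exit(1);
        if ((fdwt = creat(argv[2], 0666)) == -1) exit(1);
        fork();   /* parent and child now share both open files and their offsets */
        for (;;) {
            if (read(fdrd, &c, 1) != 1) exit(0);
            write(fdwt, &c, 1);
        }
    }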
Operations on Directories (UNIX) • Link - make entry pointing to file • Unlink - remove entry pointing to file • Rename • Mkdir - create a directory • Rmdir - remove a directory
Access Control for Files • Access control lists - detailed list attached to file of users allowed (denied) access, including kind of access allowed/denied. • UNIX RWX - owner, group, everyone
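The UNIX RWX scheme can be sketched as a check on the nine low-order mode bits. This is a simplified model for illustration: the function name `may_access` and the `PERM_*` constants are invented here, and the sketch ignores the superuser, access control lists, and any other policy a real kernel applies.

```c
#include <assert.h>

#define PERM_R 4u
#define PERM_W 2u
#define PERM_X 1u

/* mode holds the low 9 permission bits, e.g. 0644.
 * Exactly one class applies: owner first, then group, then everyone else. */
int may_access(unsigned mode, int is_owner, int in_group, unsigned want)
{
    unsigned bits;
    if (is_owner)
        bits = (mode >> 6) & 7u;   /* owner rwx */
    else if (in_group)
        bits = (mode >> 3) & 7u;   /* group rwx */
    else
        bits = mode & 7u;          /* everyone rwx */
    return (bits & want) == want;
}
```

For a file with mode 0644, the owner may write (`may_access(0644, 1, 0, PERM_W)` is 1) while a group member may only read.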
File Systems: The Big Issues • Buffering disk data for access from the processor. • Block I/O (DMA) needs aligned physical buffers. • Block update is a read-modify-write. • Creating/representing/destroying independent files. • Allocating disk blocks and scheduling disk operations to deliver the best performance for the I/O stream. • What are the patterns in the request stream? • Multiple levels of name translation. • Pathname to inode, logical block to physical block • Reliability and the handling of updates.
Representing a File On Disk • The inode holds the file attributes (may include owner, access control list, time of create/modify/access, etc.) and the block map. • Physical block pointers in the block map are sector IDs or physical block numbers. [Figure: an inode whose block map points to logical block 0 ("once upo" | "n a time" | "/nin a l"), logical block 1 ("and far far away" | ",/nlived t"), and logical block 2 ("he wise and sage wizard.")]
Meta-Data • File size • File type • Protection - access control information • History: creation time, last modification, last access. • Location of file - which device • Location of individual blocks of the file on disk. • Owner of file • Group(s) of users associated with file
Representing Large Files (Classical Unix) • Each file system block is a clump of sectors (4KB, 8KB, 16KB). • Inode == 128 bytes, packed into blocks. • Each inode has 68 bytes of attributes and 15 block map entries. • Suppose block size = 8KB: • 12 direct block map entries in the inode can map 96KB of data. • One indirect block (referenced by the inode) can map 16MB of data. • One double indirect block pointer in the inode maps 2K indirect blocks. • Maximum file size is 96KB + 16MB + (2K*16MB) + ... [Figure: inode with direct block map, indirect block, double indirect block]
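The maximum-file-size sum on this slide can be worked through in code. This sketch assumes 4-byte block pointers, which is what the slide's figures imply (an 8KB indirect block mapping 16MB means 2K pointers per block); the function name `max_file_bytes` is made up for the example, and real systems may impose lower limits for other reasons.

```c
#include <assert.h>
#include <stdint.h>

/* Largest file addressable by ndirect direct pointers plus one single,
 * one double, and one triple indirect pointer, in bytes. */
uint64_t max_file_bytes(uint64_t block_size, uint64_t ptr_size, unsigned ndirect)
{
    assert(ptr_size > 0 && block_size >= ptr_size);
    uint64_t ptrs = block_size / ptr_size;   /* pointers per indirect block */
    uint64_t direct = (uint64_t)ndirect * block_size;
    uint64_t single = ptrs * block_size;     /* one indirect block */
    uint64_t dbl = ptrs * single;            /* one double indirect block */
    uint64_t triple = ptrs * dbl;            /* one triple indirect block */
    return direct + single + dbl + triple;
}
```

With `max_file_bytes(8192, 4, 12)` the four terms are 96KB, 16MB, 32GB (2K * 16MB), and 64TB, so the triple indirect pointer dominates the total.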
Unix index blocks • Intuition • Many files are small • Length = 0, length = 1, length < 80, ... • Some files are huge (3 gigabytes) • “Clever heuristic” in Unix FFS inode • 12 (direct) block pointers: 12 * 8 KB = 96 KB • Availability is “free” - you need inode to open() file anyway • 3 indirect block pointers • single, double, triple
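To see which pointer serves a given logical block, the direct/single/double/triple split can be expressed as a range check. This is a sketch under the same assumptions as the slide (12 direct pointers, then one pointer per indirection level); the names `map_level` and `level_for_block` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum map_level { MAP_DIRECT, MAP_SINGLE, MAP_DOUBLE, MAP_TRIPLE, MAP_TOO_BIG };

/* lbn: logical block number within the file.
 * ptrs: block pointers per indirect block (e.g. 2048 for 8KB blocks
 * and 4-byte pointers). ndirect: direct pointers in the inode. */
enum map_level level_for_block(uint64_t lbn, uint64_t ptrs, uint64_t ndirect)
{
    assert(ptrs > 0);
    if (lbn < ndirect) return MAP_DIRECT;
    lbn -= ndirect;
    if (lbn < ptrs) return MAP_SINGLE;
    lbn -= ptrs;
    if (lbn < ptrs * ptrs) return MAP_DOUBLE;
    lbn -= ptrs * ptrs;
    if (lbn < ptrs * ptrs * ptrs) return MAP_TRIPLE;
    return MAP_TOO_BIG;
}
```

Small files never leave `MAP_DIRECT`, so reading them needs no block beyond the inode that open() already fetched; that is the sense in which the direct pointers are "free".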
Links [Figure: directories usr/Lynn and usr/Marty with files foo and bar, annotated with the operations creat foo, creat bar, ln /usr/Lynn/foo bar, ln -s /usr/Marty/bar bar, unlink foo, unlink bar]
Unix File Naming (Hard Links) • A Unix file may have multiple names. • Each directory entry naming the file is called a hard link. • Each inode contains a reference count showing how many hard links name it. • link system call: link (existing name, new name) • create a new name for an existing file • increment inode link count • unlink system call (“remove”): unlink(name) • destroy directory entry • decrement inode link count • if count == 0 and file is not in active use, free blocks (recursively) and on-disk inode • Illustrates: garbage collection by reference counting. [Figure: directory A and directory B with entries wind: 18, rain: 32, hail: 48, and sleet: 48; hail and sleet both name inode 48, whose link count = 2]
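The link-count bookkeeping can be modeled in a few lines. This is an in-memory sketch of the reference-counting idea only, not kernel code: `struct inode_s`, `fs_link`, `fs_unlink`, and `fs_close` are hypothetical names, and "freeing" is reduced to setting a flag.

```c
#include <assert.h>

struct inode_s {
    int link_count;   /* number of directory entries naming this inode */
    int open_count;   /* number of active opens ("in active use") */
    int freed;        /* set once blocks and on-disk inode are released */
};

/* Release the inode only when no name and no open instance refers to it. */
static void maybe_free(struct inode_s *ip)
{
    if (ip->link_count == 0 && ip->open_count == 0)
        ip->freed = 1;
}

void fs_link(struct inode_s *ip)
{
    ip->link_count++;          /* a new directory entry names the file */
}

void fs_unlink(struct inode_s *ip)
{
    assert(ip->link_count > 0);
    ip->link_count--;          /* the directory entry is destroyed */
    maybe_free(ip);
}

void fs_close(struct inode_s *ip)
{
    assert(ip->open_count > 0);
    ip->open_count--;
    maybe_free(ip);
}
```

In this model a file that is unlinked while still open stays allocated until the last close, which matches the slide's "not in active use" condition.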
Unix Symbolic (Soft) Links • A soft link is a file containing a pathname of some other file. • symlink system call: symlink (existing name, new name) • allocate a new file (inode) with type symlink • initialize file contents with existing name • create directory entry for new file with new name • The target of the link may be removed at any time, leaving a dangling reference. • How should the kernel handle recursive soft links? [Figure: directory A and directory B with entries wind: 18, rain: 32, hail: 48, and sleet: 67; hail names inode 48 (link count = 1), and sleet names inode 67, a symlink whose contents are ../A/hail/0]
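One common answer to the recursive soft link question is to bound the number of links followed during a lookup and fail once the bound is exceeded (POSIX defines the ELOOP error for this case). The sketch below models that idea on a flat in-memory name table; `struct node`, `resolve`, the return codes, and the `MAX_HOPS` value are all assumptions made for illustration, not any kernel's actual limit or interface.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_HOPS 8   /* hypothetical limit; real kernels choose their own */

struct node {
    const char *name;
    const char *target;   /* non-NULL: this node is a soft link to target */
};

/* Return the index of the final non-link node, -1 for a dangling
 * reference, or -2 when more than MAX_HOPS links were followed. */
int resolve(const struct node *tab, int n, const char *name)
{
    assert(tab != NULL && name != NULL);
    for (int hops = 0; hops <= MAX_HOPS; hops++) {
        int found = -1;
        for (int i = 0; i < n; i++) {
            if (strcmp(tab[i].name, name) == 0) {
                found = i;
                break;
            }
        }
        if (found < 0)
            return -1;                 /* target was removed or never existed */
        if (tab[found].target == NULL)
            return found;              /* reached a regular file */
        name = tab[found].target;      /* follow the link and try again */
    }
    return -2;                         /* too many links: likely a cycle */
}
```

A hop limit is simpler than true cycle detection and also bounds the work done on long but legitimate link chains.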
Filesystems • Each file volume (filesystem) has a type, determined by its disk layout or the network protocol used to access it. • ufs (ffs), lfs, nfs, rfs, cdfs, etc. • Filesystems are administered independently. • Modern systems also include “logical” pseudo-filesystems in the naming tree, accessible through the file syscalls. • procfs: the /proc filesystem allows access to process internals. • mfs: the memory file system is a memory-based scratch store. • Processes access filesystems through common syscalls
VFS: the Filesystem Switch • Sun Microsystems introduced the virtual file system interface in 1985 to accommodate diverse filesystem types cleanly. • VFS allows diverse specific file systems to coexist in a file tree, isolating all FS-dependencies in pluggable filesystem modules. [Figure: layers: user space; syscall layer (file, uio, etc.); Virtual File System (VFS); filesystem modules FFS, LFS, NFS, *FS, etc.; network protocol stack (TCP/IP); device drivers]
Vnodes • In the VFS framework, every file or directory in active use is represented by a vnode object in kernel memory. • Each vnode has a standard file attributes struct. • Generic vnode points at filesystem-specific struct (e.g., inode, rnode), seen only by the filesystem. • Each specific file system maintains a cache of its resident vnodes. • Vnode operations are macros that vector to filesystem-specific procedures. [Figure: syscall layer above NFS and UFS vnodes, with a list of free vnodes]
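The phrase "macros that vector to filesystem-specific procedures" describes dispatch through a table of function pointers. The following is a toy sketch of that pattern, not the BSD or SunOS definitions: the single operation `vop_getsize`, the macro `VOP_GETSIZE`, and the `toy_inode`/`toy_rnode` structs are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

struct vnode;

/* Per-filesystem operations vector (one hypothetical operation). */
struct vnodeops {
    int (*vop_getsize)(struct vnode *vp);
};

/* Generic vnode: visible to the VFS layer. */
struct vnode {
    const struct vnodeops *v_op;   /* filesystem-specific procedures */
    void *v_data;                  /* private struct, seen only by the filesystem */
    int v_usecount;                /* reference count */
};

/* The macro hides the indirection from callers in the syscall layer. */
#define VOP_GETSIZE(vp) ((vp)->v_op->vop_getsize(vp))

/* Two toy filesystems with different private structs. */
struct toy_inode { int size; };
struct toy_rnode { int cached_size; };

static int ufs_getsize(struct vnode *vp)
{
    return ((struct toy_inode *)vp->v_data)->size;
}

static int nfs_getsize(struct vnode *vp)
{
    return ((struct toy_rnode *)vp->v_data)->cached_size;
}

static const struct vnodeops ufs_ops = { ufs_getsize };
static const struct vnodeops nfs_ops = { nfs_getsize };
```

A caller writes `VOP_GETSIZE(vp)` without knowing whether `vp` belongs to the local or the network filesystem; adding a new filesystem type means supplying another operations table.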
V/Inode Cache • Active vnodes are reference-counted by the structures that hold pointers to them: system open file table, process current directory, file system mount points, etc. • Each specific file system maintains its own hash of vnodes (BSD): the specific FS handles initialization, and the free list is maintained by VFS. • vget(vp): reclaim cached inactive vnode from VFS free list • vref(vp): increment reference count on an active vnode • vrele(vp): release reference count on a vnode • vgone(vp): vnode is no longer valid (file is removed) [Figure: VFS free list head and a hash table indexed by HASH(fsid, fileid)]
Sharing Open File Instances [Figure: parent and child process objects (each with user ID, process ID, process group ID, parent PID, signal state, siblings, children) whose process file descriptors point to the same entry in the system open file table; that entry holds the shared seek offset and points to the shared file (inode or vnode)]
File Buffer Cache • Avoid the disk for as many file operations as possible. • Cache acts as a filter for the requests seen by the disk - reads served best. • Delayed writeback will avoid going to disk at all for temp files. [Figure labels: Proc, Memory, File cache]
Handling Updates in the File Cache 1. Blocks may be modified in memory once they have been brought into the cache. Modified blocks are dirty and must (eventually) be written back. 2. Once a block is modified in memory, the write back to disk may not be immediate (synchronous). Delayed writes absorb many small updates with one disk write. How long should the system hold dirty data in memory? Asynchronous writes allow overlapping of computation and disk update activity (write-behind). Do the write call for block n+1 while transfer of block n is in progress.
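How delayed writes absorb small updates can be shown with a one-block cache model. This is a deliberately tiny sketch, not a buffer cache implementation: `struct cached_block`, `cache_write`, `cache_flush`, and the 16-byte block size are assumptions for the example, and the disk write is represented by a counter.

```c
#include <assert.h>

#define BLOCK_BYTES 16   /* tiny block, for illustration only */

struct cached_block {
    char data[BLOCK_BYTES];
    int dirty;           /* modified in memory, not yet written back */
    int disk_writes;     /* counts simulated disk writes */
};

/* Modify the cached copy only; the disk is not touched. */
void cache_write(struct cached_block *b, int offset, char value)
{
    assert(offset >= 0 && offset < BLOCK_BYTES);
    b->data[offset] = value;
    b->dirty = 1;
}

/* Write the block back if (and only if) it is dirty. */
void cache_flush(struct cached_block *b)
{
    if (b->dirty) {
        b->disk_writes++;   /* stands in for the real disk write */
        b->dirty = 0;
    }
}
```

Five calls to `cache_write` followed by one `cache_flush` cost a single disk write in this model. The price is that all five updates are lost if the system fails before the flush, which is the tradeoff behind the "how long should the system hold dirty data" question.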
Failures, Commits, Atomicity • What guarantees does the system offer about the hard state if the system fails? • Durability • Did my writes commit, i.e., are they on the disk? • Atomicity • Can an operation “partly commit”? • Also, can it interleave with other operations? • Recoverability and Corruption • Is the metadata well-formed on recovery?
Unix Failure/Atomicity • File writes are not guaranteed to commit until close. • A process can force commit with a sync. • The system forces commit every (say) 30 seconds. • Failure could lose an arbitrary set of writes. • Reads/writes to a shared file interleave at the granularity of system calls. • Metadata writes are atomic/synchronous. • Disk writes are carefully ordered. • The disk can become corrupt in well-defined ways. • Restore with a scrub (“fsck”) on restart. • Alternatives: logging, shadowing • Want better reliability? Use a database.
The Problem of Disk Layout • The level of indirection in the file block maps allows flexibility in file layout. • “File system design is 99% block allocation.” [McVoy] • Competing goals for block allocation: • allocation cost • bandwidth for high-volume transfers • stamina/longevity • efficient directory operations • Goal: reduce disk arm movement and seek overhead. • metric of merit: bandwidth utilization [Figure labels: track, sector, arm, cylinder, platter, head]
Bandwidth utilization • Define • b Block size • B Raw disk bandwidth (“spindle speed”) • s Average access (seek+rotation) delay per block I/O • Then • Transfer time per block = b/B • I/O completion time per block = s + (b/B) • Effective disk bandwidth for I/O request stream = b/(s + (b/B)) • Bandwidth wasted per I/O: sB • Effective bandwidth utilization (%): b/(sB + b) • How to get better performance? • Larger b (larger blocks, clustering, extents, etc.) • Smaller s (placement / ordering, sequential access, logging, etc.)
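The two formulas on this slide translate directly into code, which makes it easy to try different block sizes. The function names below are made up for the sketch; b is in bytes, B in bytes per second, and s in seconds, and the sample values in the note are illustrative rather than measured.

```c
#include <assert.h>

/* Effective bandwidth for the request stream: b / (s + b/B). */
double effective_bandwidth(double b, double B, double s)
{
    assert(b > 0 && B > 0 && s >= 0);
    return b / (s + b / B);
}

/* Effective bandwidth utilization as a fraction: b / (sB + b). */
double bw_utilization(double b, double B, double s)
{
    assert(b > 0 && B > 0 && s >= 0);
    return b / (s * B + b);
}
```

With B = 8 MB/s and s = 14 ms, an 8KB block yields about 6.5% utilization, while a 1MB transfer yields about 90%. Dividing `effective_bandwidth` by B gives the same fraction as `bw_utilization`, which confirms that the two expressions agree.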
Know your Workload! • File usage patterns should influence design decisions. Do things differently depending on the answers: • How large are most files? How long-lived? Read vs. write activity. Shared often? • Different levels “see” a different workload. • Feedback loop: file system design and implementation shapes, and is shaped by, the usage patterns observed today.
Generalizations from UNIX Workloads • Standard disclaimers that you can’t generalize…but anyway… • Most files are small (fit into one disk block), although most bytes are transferred from longer files. • Most opens are for read mode, and most bytes transferred are by read operations. • Accesses tend to be sequential and 100% (the whole file is read or written).
More on Access Patterns • There is significant reuse (re-opens): most opens go to files that are opened repeatedly and in quick succession. Directory nodes and executables also exhibit good temporal locality. • Looks good for caching! • Use of temp files is a significant part of file system activity in UNIX: very limited reuse, short lifetimes (less than a minute).
Example: BSD FFS • Fast File System (FFS) [McKusick81] • Clustering enhancements [McVoy91], and improved cluster allocation [McKusick: Smith/Seltzer96] • FFS can also be extended with metadata logging [e.g., Episode]
FFS Cylinder Groups • FFS defines cylinder groups as the unit of disk locality, and it factors locality into allocation choices. • typical: thousands of cylinders, dozens of groups • Strategy: place “related” data blocks in the same cylinder group whenever possible. • seek latency is proportional to seek distance • Smear large files across groups: • Place a run of contiguous blocks in each group. • Reserve inode blocks in each cylinder group. • This allows inodes to be allocated close to their directory entries and close to their data blocks (for small files).
FFS Allocation Policies 1. Allocate file inodes close to their containing directories. For mkdir, select a cylinder group with a more-than-average number of free inodes. For creat, place inode in the same group as the parent. 2. Concentrate related file data blocks in cylinder groups. Most files are read and written sequentially. Place initial blocks of a file in the same group as its inode. How should we handle directory blocks? Place adjacent logical blocks in the same cylinder group. Logical block n+1 goes in the same group as block n. Switch to a different group for each indirect block.
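The mkdir rule in policy 1 can be sketched as a scan over per-group free-inode counts. This is a simplified illustration and not the FFS source: the function name `pick_group_for_mkdir` is hypothetical, and choosing the first group above the average (with a fallback to the fullest-of-free group) is an assumption made here, since the slide only says "more-than-average".

```c
#include <assert.h>

/* free_inodes[i] is the number of free inodes in cylinder group i.
 * Returns the index of the first group whose free-inode count is above
 * the average; if no group is strictly above it (all equal), returns a
 * group with the maximum count. */
int pick_group_for_mkdir(const int *free_inodes, int ngroups)
{
    assert(free_inodes != 0 && ngroups > 0);
    long total = 0;
    for (int i = 0; i < ngroups; i++)
        total += free_inodes[i];
    double avg = (double)total / ngroups;

    int best = 0;
    for (int i = 0; i < ngroups; i++) {
        if (free_inodes[i] > avg)
            return i;
        if (free_inodes[i] > free_inodes[best])
            best = i;
    }
    return best;
}
```

Spreading new directories across lightly used groups leaves room for the files later created in each directory, whose inodes go in the parent's group under the creat rule.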
Allocating a Block (provided for completeness) 1. Try to allocate the rotationally optimal physical block after the previous logical block in the file. Skip rotdelay physical blocks between each logical block. (rotdelay is 0 on track-caching disk controllers.) 2. If not available, find another block at a nearby rotational position in the same cylinder group. We’ll need a short seek, but we won’t wait for the rotation. If not available, pick any other block in the cylinder group. 3. If the cylinder group is full, or we’re crossing to a new indirect block, go find a new cylinder group. Pick a block at the beginning of a run of free blocks.
Clustering in FFS • Clustering improves bandwidth utilization for large files read and written sequentially. • Allocate clumps/clusters/runs of blocks contiguously; read/write the entire clump in one operation with at most one seek. • Typical cluster sizes: 32KB to 128KB. • FFS can allocate contiguous runs of blocks “most of the time” on disks with sufficient free space. • This (usually) occurs as a side effect of setting rotdelay = 0. • Newer versions may relocate to clusters of contiguous storage if the initial allocation did not succeed in placing them well. • Must modify buffer cache to group buffers together and read/write in contiguous clusters. Provided for completeness
Sequential File Write [Figure: physical disk sector vs. time in milliseconds for a sequential file write. Annotations: note sequential block allocation; write; write stall; read; read next block of free space bitmap (??); sync; sync command (typed to shell) pushes indirect blocks to disk]
Sequential Writes: A Closer Look [Figure: physical disk sector vs. time in milliseconds. Annotations: 16 MB in one second (one indirect block worth); write; write stall; longer delay for head movement to push indirect blocks; 140 ms delay for cylinder seek etc. (???)]
The Problem of Metadata Updates • Metadata updates are a second source of FFS seek overhead. • Metadata writes are poorly localized. • E.g., extending a file requires writes to the inode, direct and indirect blocks, cylinder group bit maps and summaries, and the file block itself. • Metadata writes can be delayed, but this incurs a higher risk of file system corruption in a crash. • If you lose your metadata, you are dead in the water. • FFS schedules metadata block writes carefully to limit the kinds of inconsistencies that can occur. • Some metadata updates must be synchronous on controllers that don’t respect order of writes.