Distributed Operating Systems

Computer Engineering Department • Luai M. Malhis, Ph.D. • Textbook: Distributed Systems by Tanenbaum and Van Steen, Prentice Hall • Most of the course materials are taken from http://lass.cs.umass.edu/~shenoy/courses/677

Secondary Storage • Secondary storage typically: • is anything that is outside of “primary memory” • does not permit direct execution of instructions or data retrieval via machine load/store instructions • Characteristics: • it’s large: 30-60GB • it’s cheap: 40GB disk for $100 • 0.25 cents per megabyte (wow!) • it’s persistent: data survives power loss • it’s slow: milliseconds to access • why is this slow??

Memory Hierarchy • Each level acts as a cache of lower levels • CPU registers: 100 bytes, 1 ns • L1 cache: 32KB, 1 ns • L2 cache: 256KB, 4 ns • Primary memory: 1GB, 60 ns • Secondary storage: 100GB, 10+ ms • Tertiary storage: 1-1000TB, 1s-1hr

Disks and the OS • Disks are messy, messy devices • errors, bad blocks, missed seeks, etc. • Job of OS is to hide this mess from higher-level software • low-level device drivers (initiate a disk read, etc.) • higher-level abstractions (files, databases, etc.) • OS may provide different levels of disk access to different clients • physical disk block (surface, cylinder, sector) • disk logical block (disk block #) • file logical (filename, block or record or byte #)

Physical Disk Structure • Disk components • platters • surfaces • tracks • sectors • cylinders • arm • heads

Interacting with Disks • In the old days… • OS would have to specify cylinder #, sector #, surface #, transfer size • I.e., OS needs to know all of the disk parameters • Modern disks are even more complicated • not all sectors are the same size, sectors are remapped, … • disk provides a higher-level interface, e.g. SCSI • exports data as a logical array of blocks [0 … N] • maps logical blocks to cylinder/surface/sector • OS only needs to name logical block #, disk maps this to cylinder/surface/sector • as a result, physical parameters are hidden from OS • both good and bad

Example disk characteristics • IBM Ultrastar 36XP drive • form factor: 3.5” • capacity: 36.4 GB • rotation rate: 7,200 RPM (120 RPS, musical note C3) • platters: 10 • surfaces: 20 • sector size: 512-732 bytes • cylinders: 11,494 • cache: 4MB • transfer rate: 17.9 MB/s (inner) – 28.9 MB/s (outer) • full seek: 14.5 ms • head switch: 0.3 ms

Disk Performance • Performance depends on a number of steps • seek: moving the disk arm to the correct cylinder • depends on how fast disk arm can move • seek times aren’t diminishing very quickly • rotation: waiting for the sector to rotate under head • depends on rotation rate of disk • rates are increasing, but slowly • transfer: transferring data from surface into disk controller, and from there sending it back to host • depends on density of bytes on disk • increasing, and very quickly • When the OS uses the disk, it tries to minimize the cost of all of these steps • particularly seeks and rotation
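The three steps above (seek, rotation, transfer) can be added up to estimate the cost of one request. The following is a minimal sketch, not a model of any real drive: the function name and the 7 ms average seek are assumptions made for illustration, while the rotation rate and inner-track transfer rate are taken from the example drive.

```python
def avg_access_time_ms(avg_seek_ms, rpm, transfer_mb_per_s, request_kb):
    """Estimate the time to service one disk request, in milliseconds."""
    rotation_ms = 60_000.0 / rpm               # time for one full revolution
    rotational_latency_ms = rotation_ms / 2    # on average, wait half a turn
    transfer_ms = (request_kb / 1024.0) / transfer_mb_per_s * 1000.0
    return avg_seek_ms + rotational_latency_ms + transfer_ms

# Hypothetical 4KB read: 7,200 RPM, 17.9 MB/s, assumed 7 ms average seek.
t = avg_access_time_ms(avg_seek_ms=7.0, rpm=7200, transfer_mb_per_s=17.9, request_kb=4)
print(f"{t:.2f} ms per request")
```

Under these assumptions the transfer term is a small fraction of the total, which is why the OS concentrates on reducing seeks and rotational delay.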

File Systems • The concept of a file system is simple • the implementation of the abstraction for secondary storage • abstraction = files • logical organization of files into directories • the directory hierarchy • sharing of data between processes, people and machines • access control, consistency, …

Files • A file is a collection of data with some properties • contents, size, owner, last read/write time, protection … • Files may also have types • understood by file system • device, directory, symbolic link • understood by other parts of OS or by runtime libraries • executable, dll, source code, object code, text file, … • Type can be encoded in the file’s name or contents • windows encodes type in name • .com, .exe, .bat, .dll, .jpg, .mov, .mp3, … • unix has a smattering of both • in content via magic numbers or initial characters (e.g., #!)

Basic operations • Unix • create(name) • open(name, mode) • read(fd, buf, len) • write(fd, buf, len) • sync(fd) • seek(fd, pos) • close(fd) • unlink(name) • rename(old, new)
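As a rough illustration, the operations listed above map onto Python's `os` module, which wraps the underlying system calls. This is a sketch only: the file names are made up, and the list's `sync` and `seek` correspond here to `os.fsync` and `os.lseek`.

```python
import os
import tempfile

workdir = tempfile.mkdtemp()
old_name = os.path.join(workdir, "demo.txt")      # hypothetical file names
new_name = os.path.join(workdir, "renamed.txt")

fd = os.open(old_name, os.O_CREAT | os.O_RDWR, 0o600)  # create + open
os.write(fd, b"hello, disk")                            # write(fd, buf, len)
os.fsync(fd)                                            # flush this file to the device
os.lseek(fd, 7, os.SEEK_SET)                            # seek(fd, pos)
data = os.read(fd, 4)                                   # read(fd, buf, len)
os.close(fd)                                            # close(fd)

os.rename(old_name, new_name)                           # rename(old, new)
exists_after_rename = os.path.exists(new_name)
os.unlink(new_name)                                     # unlink(name)
os.rmdir(workdir)
print(data)
```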

File Access Methods • Some file systems provide different access methods that specify ways the application will access data • sequential access • read bytes one at a time, in order • direct access • random access given a block/byte # • record access • file is array of fixed- or variable-sized records • indexed access • FS contains an index to a particular field of each record in a file • apps can find a record based on the value in that field (similar to DB) • Why do we care about distinguishing sequential from direct access? • what might the FS do differently in these cases?

Directories • Directories provide: • a way for users to organize their files • a convenient file name space for both users and FS’s • Most file systems support multi-level directories • naming hierarchies (/, /usr, /usr/local, /usr/local/bin, …) • Most file systems support the notion of current directory • absolute names: fully-qualified starting from root of FS • bash$ cd /usr/local • relative names: specified with respect to current directory • bash$ cd /usr/local (absolute) • bash$ cd bin (relative, equivalent to cd /usr/local/bin)

Directory Internals • A directory is typically just a file that happens to contain special metadata • directory = list of (name of file, file attributes) • attributes include such things as: • size, protection, location on disk, creation time, access time, … • the directory list is usually unordered (effectively random) • when you type “ls”, the “ls” command sorts the results for you

Path Name Translation • Let’s say you want to open “/one/two/three” • fd = open("/one/two/three", O_RDWR); • What goes on inside the file system? • open directory “/” (well known, can always find) • search the directory for “one”, get location of “one” • open directory “one”, search for “two”, get location of “two” • open directory “two”, search for “three”, get loc. of “three” • open file “three” • (of course, permissions are checked at each step) • FS spends lots of time walking down directory paths • this is why open is separate from read/write (session state) • OS will cache prefix lookups to enhance performance • /a/b, /a/bb, /a/bbb all share the “/a” prefix
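The walk described above can be sketched with a toy in-memory file system. Everything here is hypothetical: the inode numbers, the dictionary layout, and the `lookup` helper are illustrative only, and permission checks and prefix caching are left out.

```python
# Toy file system: inode number -> directory entries or file data.
# Inode 2 as the root is an arbitrary choice for this sketch.
ROOT_INODE = 2
inodes = {
    2: {"type": "dir", "entries": {"one": 5}},
    5: {"type": "dir", "entries": {"two": 9}},
    9: {"type": "dir", "entries": {"three": 12}},
    12: {"type": "file", "data": b"contents of three"},
}

def lookup(path):
    """Walk an absolute path one component at a time; return its inode number."""
    current = ROOT_INODE
    for name in [part for part in path.split("/") if part]:
        node = inodes[current]
        if node["type"] != "dir":
            raise NotADirectoryError(name)
        if name not in node["entries"]:
            raise FileNotFoundError(path)
        current = node["entries"][name]   # one directory search per component
    return current

print(lookup("/one/two/three"))
```

Each loop iteration stands for one directory read, which is why long paths are expensive and why caching shared prefixes pays off.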

Protection Systems • FS must implement some kind of protection system • to control who can access a file (user) • to control how they can access it (e.g., read, write, or exec) • More generally: • generalize files to objects (the “what”) • generalize users to principals (the “who”, user or program) • generalize read/write to actions (the “how”, or operations) • A protection system dictates whether a given action performed by a given subject (principal) on a given object should be allowed • e.g., you can read or write your files, but others cannot • e.g., you can read /etc/motd but you cannot write to it

Model for Representing Protection • Two different ways of thinking about it: • access control lists (ACLs) • for each object, keep list of subjects and subj’s allowed actions • capabilities • for each subject, keep list of objects and subj’s allowed actions • Both can be represented with the following matrix: • rows are subjects, columns are objects, and each cell holds the allowed actions • a row of the matrix is a subject’s capability list • a column of the matrix is an object’s ACL
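The two views of the access matrix can be sketched as follows. The subjects, objects, and helper names are invented for illustration; a real system would store only one of the two views rather than the full matrix.

```python
# Access matrix: (subject, object) -> set of allowed actions.
matrix = {
    ("alice", "notes.txt"): {"read", "write"},
    ("bob", "notes.txt"): {"read"},
    ("alice", "motd"): {"read"},
}

def acl(obj):
    """Per-object view: which subjects may do what to this object."""
    return {s: acts for (s, o), acts in matrix.items() if o == obj}

def capabilities(subject):
    """Per-subject view: which objects this subject may touch, and how."""
    return {o: acts for (s, o), acts in matrix.items() if s == subject}

def allowed(subject, obj, action):
    """The protection decision: is this action permitted?"""
    return action in matrix.get((subject, obj), set())

print(acl("notes.txt"))
print(capabilities("alice"))
```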

ACLs vs. Capabilities • Capabilities are easy to transfer • they are like keys: can hand them off • they make sharing easy • ACLs are easier to manage • object-centric, easy to grant and revoke • to revoke capability, need to keep track of subjects that have it • hard to do, given that subjects can hand off capabilities • ACLs grow large when object is heavily shared • can simplify by using “groups” • put users in groups, put groups in ACLs • additional benefit • changing group membership affects ALL objects that have this group in their ACLs

File System Implementations • We’ve looked at disks and file systems generically • now it’s time to bridge the gap by talking about specific file system implementations • We’ll focus on two: • BSD Unix FFS • what’s at the heart of most UNIX file systems • LFS • a research file system originally from Berkeley

BSD UNIX FFS • FFS = “Fast File System” • original (i.e. 1970’s) file system was very simple and straightforwardly implemented • but had very poor disk bandwidth utilization • why? far too many disk seeks on average • BSD UNIX folks did a redesign in the mid ’80’s • FFS: improved disk utilization, decreased response time • McKusick, Joy, Fabry, and Leffler • basic idea is FFS is aware of disk structure • I.e., place related things on nearby cylinders to reduce seeks

File System Layout • How does the FS use the disk to store files? • FS defines a block size (e.g., 4KB) • disk space allocated in granularity of blocks • A “Master Block” defines the location of root directory • always at a well-known location • usually replicated for reliability • A “free map” lists which blocks are free vs. allocated • usually a bitmap, one bit per block on the disk • also stored on disk, and cached in memory for performance • Remaining disk blocks are used to store files/dirs • how this is done is the essence of FFS

Possible Disk Layout Strategies • Files span multiple disk blocks • how do you find all of the blocks of a file? • option 1: contiguous allocation • like memory • fast, simplifies directory access • inflexible: causes fragmentation, needs compaction • option 2: linked structure • each block points to the next, directory points to first • good for sequential access, bad for all others • option 3: indexed structure • an “index block” contains pointers to many other blocks • handles random workloads better • may need multiple index blocks, linked together

Unix Inodes • In Unix (including in FFS), “inodes” are blocks that implement the index structure for files • directory entries point to file inodes • each inode contains 15 block pointers (numbered 0 to 14) • first 12 are direct blocks (i.e., 4KB blocks of file data) • then, single, double, and triple indirect indexes (pointers 12, 13, and 14)
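The reach of this index structure can be worked out with a little arithmetic. The sketch below assumes 4-byte block pointers, which the slide does not state; with that assumption one 4KB index block holds 1024 pointers. The function names are illustrative.

```python
BLOCK_SIZE = 4096     # 4KB blocks, as above
POINTER_SIZE = 4      # assumption: 4-byte block pointers
DIRECT = 12           # direct pointers in the inode

ptrs_per_block = BLOCK_SIZE // POINTER_SIZE

def max_blocks():
    """Data blocks reachable through direct, single, double, and triple indirection."""
    return DIRECT + ptrs_per_block + ptrs_per_block**2 + ptrs_per_block**3

def levels_of_indirection(block_index):
    """How many index blocks must be read to reach the given file block."""
    if block_index < DIRECT:
        return 0
    block_index -= DIRECT
    for level in (1, 2, 3):
        if block_index < ptrs_per_block**level:
            return level
        block_index -= ptrs_per_block**level
    raise ValueError("block index beyond maximum file size")

print(max_blocks() * BLOCK_SIZE, "bytes maximum file size")
```

Small files need no index blocks at all, which keeps the common case fast.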

Inodes and Path Search • Unix Inodes are NOT directories • they describe where on disk the blocks for a file are placed • directories are just files, so each directory also has an inode that describes where the blocks for the directory are placed • Directory entries map file names to inodes • to open “/one”, use master block to find inode for “/” on disk • open “/”, look for entry for “one” • this gives the disk block number for inode of “one” • read the inode for “one” into memory • this inode says where the first data block is on disk • read that data block into memory to access the data in the file

Data and Inode placement • Original (non-FFS) unix FS had two major problems: • 1. data blocks are allocated randomly in aging file systems • blocks for the same file allocated sequentially when FS is new • as FS “ages” and fills, need to allocate blocks freed up when other files are deleted • problem: deleted files are essentially randomly placed • so, blocks for new files become scattered across the disk! • 2. inodes are allocated far from blocks • all inodes at beginning of disk, far from data • traversing file name paths, manipulating files, directories requires going back and forth from inodes to data blocks • BOTH of these generate many long seeks!

Cylinder groups • FFS addressed these problems using notion of a cylinder group • disk partitioned into groups of cylinders • data blocks from a file all placed in same cylinder group • files in same directory placed in same cylinder group • inode for file in same cylinder group as file’s data • Introduces a free space requirement • to be able to allocate according to cylinder group, the disk must have free space scattered across all cylinders • in FFS, 10% of the disk is reserved just for this purpose! • good insight: keep disk partially free at all times! • this is why it may be possible for df to report >100%

File Buffer Cache (not just for FFS) • Exploit locality by caching file blocks in memory • cache is system wide, shared by all processes • even a small (4MB) cache can be very effective • many FS’s “read-ahead” into buffer cache • Caching writes • some apps assume data is on disk after write • need to “write-through” the buffer cache • or: • “write-behind”: maintain queue of uncommitted blocks, periodically flush. Unreliable! • NVRAM: write into battery-backed RAM. Expensive! • LFS: we’ll talk about this soon! • Buffer cache issues: • competes with VM for physical frames • integrated VM/buffer cache? • need replacement algorithms here • LRU usually
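A minimal sketch of an LRU-managed block cache follows. The class name, the `read_block` callback, and the tiny capacity are hypothetical; writes, read-ahead, and interaction with the VM system are not modeled.

```python
from collections import OrderedDict

class BufferCache:
    """Caches disk blocks in memory and evicts the least recently used one."""

    def __init__(self, capacity, read_block):
        self.capacity = capacity
        self.read_block = read_block     # function: block number -> bytes
        self.blocks = OrderedDict()      # least recently used entry first
        self.misses = 0

    def get(self, block_no):
        if block_no in self.blocks:
            self.blocks.move_to_end(block_no)    # mark as most recently used
            return self.blocks[block_no]
        self.misses += 1
        data = self.read_block(block_no)         # cache miss: go to "disk"
        self.blocks[block_no] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)      # evict least recently used
        return data

cache = BufferCache(capacity=2, read_block=lambda n: bytes([n]) * 4)
for block in (1, 2, 1, 3, 1, 2):
    cache.get(block)
print(cache.misses, list(cache.blocks))
```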

Other FFS innovations • Small blocks (1KB) caused two problems: • low bandwidth utilization • small max file size (function of block size) • FFS fixes by using a larger block (4KB) • allows for very large files (1MB only uses 2 level indirect) • but, introduces internal fragmentation • there are many small files (I.e., <4KB) • fix: introduce “fragments” • 1KB pieces of a block • Old FS was unaware of disk parameters • FFS: parameterize FS according to disk and CPU characteristics • e.g.: account for CPU interrupt and processing time to layout sequential blocks • skip according to rotational rate and CPU latency!

Disk Scheduling • Access time has two major components • Seek time - time to move heads to correct cylinder. • Rotational latency - additional time waiting for disk to rotate to desired sector. • OS is responsible for using hardware efficiently • Minimize seek time • Seek time is roughly proportional to seek distance • Disk bandwidth = total bytes transferred / total time • total time - measured between the first request for service and the completion of the last transfer.

Disk Scheduling (Cont.) • Several algorithms exist to schedule the servicing of disk I/O requests. • We illustrate them with a request queue with cylinder numbers (0-199). {98, 183, 37, 122, 14, 124, 65, 67} Head pointer 53

FCFS First-Come First-Serve: Illustration shows total head movement for FCFS is 640 cylinders.
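The FCFS total can be checked with a few lines of code. This is a small sketch using the request queue and head position from the example; the function name is made up.

```python
def fcfs_head_movement(start, requests):
    """Total cylinders traveled when requests are served in arrival order."""
    total, position = 0, start
    for cylinder in requests:
        total += abs(cylinder - position)
        position = cylinder
    return total

queue = [98, 183, 37, 122, 14, 124, 65, 67]
print(fcfs_head_movement(53, queue))
```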

SSTF • Shortest-seek-time-first: Selects the request with the minimum seek time from the current head position. • SSTF scheduling is a form of SJF scheduling • may cause starvation of some requests. • Illustration shows total head movement of 236 cylinders.
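SSTF can be sketched as a greedy loop that always picks the pending request closest to the head. The helper names are illustrative, and ties are broken by queue order, which is an arbitrary choice in this sketch.

```python
def sstf_order(start, requests):
    """Serve the pending request nearest to the current head position."""
    pending, position, order = list(requests), start, []
    while pending:
        nearest = min(pending, key=lambda c: abs(c - position))
        pending.remove(nearest)
        order.append(nearest)
        position = nearest
    return order

def head_movement(start, order):
    """Total cylinders traveled when serving requests in the given order."""
    total, position = 0, start
    for cylinder in order:
        total += abs(cylinder - position)
        position = cylinder
    return total

queue = [98, 183, 37, 122, 14, 124, 65, 67]
order = sstf_order(53, queue)
print(order, head_movement(53, order))
```

A steady stream of requests near the head would keep a distant request such as 183 waiting indefinitely, which is the starvation risk noted above.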


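SCAN (aka Elevator Algorithm) • The disk arm starts at one end of the disk, and moves toward the other end, servicing requests on the way. • At the other end of the disk the head movement is reversed and servicing continues. • Can be inefficient since cylinders just past the direction change were serviced only moments ago, while requests at the far end have waited longest • Illustration shows total head movement of 208 cylinders, which corresponds to reversing at the lowest request (53 down to 14, then up to 183); sweeping all the way to cylinder 0 first gives 236 cylinders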
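The head movement for a SCAN sweep can be sketched as below, assuming the head first moves toward cylinder 0 (the slide text does not state the initial direction). The function and its `stop_at_last_request` flag are hypothetical; the flag switches between sweeping to the edge of the disk and reversing at the lowest pending request, which is the case that yields the 208-cylinder figure.

```python
def scan_head_movement(start, requests, lowest_cylinder=0, stop_at_last_request=False):
    """Head sweeps downward first, then reverses and sweeps upward."""
    below = [c for c in requests if c < start]
    above = [c for c in requests if c >= start]
    if stop_at_last_request:
        turn = min(below) if below else start   # reverse at the last request
    else:
        turn = lowest_cylinder                  # travel to the edge of the disk
    down = start - turn
    up = (max(above) - turn) if above else 0
    return down + up

queue = [98, 183, 37, 122, 14, 124, 65, 67]
print(scan_head_movement(53, queue))
print(scan_head_movement(53, queue, stop_at_last_request=True))
```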


Circular-SCAN (C-SCAN) • Provides a more uniform wait time than SCAN • The head moves from one end of the disk to the other, servicing requests. When other end reached, it immediately returns to the beginning of the disk, without servicing any requests • Then repeats above • Treats the cylinders as a circular list that wraps around from the last cylinder to the first one.


C-LOOK • Version of C-SCAN • Arm only goes as far as the last request in each direction, then reverses direction immediately, without first going all the way to the end of the disk.
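The service order for C-LOOK can be sketched as one upward sweep followed by a jump back to the lowest pending request. This assumes the head is moving toward higher cylinder numbers; the function names are illustrative, and counting the return jump as head movement is a choice made for this sketch.

```python
def c_look_order(start, requests):
    """Serve requests at or above the head in ascending order, then wrap around."""
    upward = sorted(c for c in requests if c >= start)
    wrapped = sorted(c for c in requests if c < start)
    return upward + wrapped

def head_movement(start, order):
    """Total cylinders traveled, including the jump back to the lowest request."""
    total, position = 0, start
    for cylinder in order:
        total += abs(cylinder - position)
        position = cylinder
    return total

queue = [98, 183, 37, 122, 14, 124, 65, 67]
order = c_look_order(53, queue)
print(order, head_movement(53, order))
```

C-SCAN serves requests in the same order but travels to the last cylinder and back to the first before resuming, so its total head movement is larger.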


Selecting a Disk-Scheduling Algorithm • Performance depends on the number and types of requests. • Requests for disk service can be influenced by the file-allocation method. • The disk-scheduling algorithm should be written as a separate module, allowing it to be replaced. • SSTF is common and has a natural appeal • SCAN and C-SCAN perform better under heavy load • Either SSTF or LOOK is a reasonable choice for the default algorithm.

Disk Management • Low-level formatting, or physical formatting: dividing a disk into sectors that the disk controller can read and write. • To use a disk to hold files, the OS still needs to record its own data structures on the disk: • Partition the disk into one or more groups of cylinders. • Logical formatting, or “making a file system”. • Boot block initializes the system. • The bootstrap is stored in ROM. • Bootstrap loader program. • Methods such as sector sparing are used to handle bad blocks.

Swap-Space Management • Swap-space - Virtual memory uses disk space as an extension of main memory. • Located in file system or a separate disk partition. • Swap-space management • 4.3BSD allocates swap space when process starts; holds text segment and data segment. • Kernel uses swap maps to track use. • Solaris 2 allocates swap space only when a page is forced out of physical memory, not when the virtual memory page is first created.

Disk Reliability • Several improvements involve the use of multiple disks working cooperatively. • Disk striping: a group of disks used as one unit. • RAID (Redundant Array of Independent Disks) improves performance and improves the reliability of the storage system by storing redundant data. • Mirroring or shadowing keeps a duplicate of each disk. • Block interleaved parity uses much less redundancy.

RAID Technology • Problems with disks • Data transfer rate is limited by serial access. • Reliability • Solution to both problems: Redundant arrays of inexpensive disks (RAID) • In the past, RAID (a combination of cheap disks) was an alternative to large and expensive disks. • Today, RAID is used for its higher reliability and higher data-transfer rate. • So the I in RAID now stands for “independent” instead of “inexpensive”. • So RAID stands for Redundant Arrays of Independent Disks. • RAID is arranged into six different levels.

RAID: Improvement of Reliability via Redundancy • The chance that some disk out of N disks fails is much higher than the chance that one specific disk fails. • Without redundancy, each disk failure leads to loss of data • Solution: Redundancy. • Store extra information which can be used in the event of failure. • Simplest: Mirrored disks • Each logical disk consists of two physical disks. • Read from any disk • Write to both disks.
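The reliability argument can be made concrete with a back-of-the-envelope calculation. The sketch below assumes disks fail independently and ignores repair time; the 1% per-disk failure probability is a made-up figure, and the function names are illustrative.

```python
def p_any_failure(p_single, n_disks):
    """Probability that at least one of n independent disks fails."""
    return 1.0 - (1.0 - p_single) ** n_disks

def p_mirrored_pair_loss(p_single):
    """A mirrored pair loses data only if both copies fail (no repair modeled)."""
    return p_single ** 2

p = 0.01  # hypothetical chance that one disk fails in some time window
print(f"any of 100 disks fails: {p_any_failure(p, 100):.3f}")
print(f"mirrored pair loses data: {p_mirrored_pair_loss(p):.6f}")
```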

RAID: Improvement in Performance via Parallelism • The number of reads per unit time can be increased. • With multiple disks we can improve the transfer rate with data striping. • Bit-level data striping: • Splitting the bits of each byte across multiple disks. • If we have an array of 8 disks, we write bit i of every byte to disk i. • The array of eight disks can be treated as a single disk with sectors that are eight times the normal size and with eight times the access rate. • Every disk participates in every read. • But each access can read eight times as much data. • Block-level striping: blocks of a file are striped across multiple disks. • Two goals • Increase the throughput of multiple small accesses. • Reduce the response time of large accesses.
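Block-level striping amounts to a simple address mapping. The sketch below assumes round-robin placement, which is one common layout and not the only possible one; the function name is illustrative.

```python
def locate_block(logical_block, n_disks):
    """Round-robin striping: return (disk index, block offset on that disk)."""
    return logical_block % n_disks, logical_block // n_disks

# Consecutive logical blocks land on different disks, so a large
# sequential read can keep all four disks busy at once.
for block in range(6):
    print(block, locate_block(block, n_disks=4))
```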

RAID levels • RAID level 0: Disk arrays with striping at the level of blocks, but without any redundancy (no mirroring and no parity) • Block-level striping • RAID level 1: Disk mirroring. • Block-level striping • RAID level 2: • Error detection with parity bits. • Error correction stores two or more parity bits. • Data can be striped among disks and parity bits are stored in other disks. • Bit-level striping

RAID levels • RAID level 3: bit-interleaved parity organization • Disk controllers can detect whether a sector has been read correctly. • The bits of failed sector can be recovered by computing the parity of the remaining bits. • RAID Level 3 is similar to RAID level 2 but less expensive • One disk overhead (One disk to store parity) • RAID level 2 is not used in practice. • Advantages of RAID level 3 over level 1 (mirroring) • One parity disk is needed; reducing the storage overhead • Transfer rate is same. • RAID 3 performance problem • Computing and writing parity

RAID levels • RAID level 4: block-interleaved parity organization • Uses block-level striping • Keeps parity block on a separate disk • If one disk fails, the parity block and the other blocks can be used to recover the failed disk. • A block read accesses only one disk • Data transfer rate for each process is slower; however, multiple read requests can be carried out in parallel. • A write results in two writes: the block and the corresponding parity. • RAID level 5: block-interleaved distributed parity • Parity is distributed among all N+1 disks. • For each block one disk stores parity and the others store data.
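Parity-based recovery can be illustrated with XOR, the usual way to compute a parity block. This is a minimal sketch with made-up block contents and helper names; real arrays operate on whole sectors and also handle the placement of parity across disks.

```python
from functools import reduce

def parity(blocks):
    """XOR equal-sized blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

data = [b"AAAA", b"BBBB", b"CCCC"]   # blocks on three data disks
p = parity(data)                     # block on the parity disk

# The disk holding data[1] fails: XOR the survivors with the parity block.
recovered = parity([data[0], data[2], p])

# A small write updates parity from the old data, new data, and old parity,
# so the block and its parity are both rewritten.
new_block = b"ZZZZ"
new_p = parity([p, data[1], new_block])
print(recovered, new_p == parity([data[0], new_block, data[2]]))
```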
