Distributed Operating Systems

Distributed Operating Systems - Introduction Prof. Nalini Venkatasubramanian (includes slides borrowed from Prof. Petru Eles, lecture slides from Coulouris, Dollimore and Kindberg textbook)

What does an OS do? • Process/Thread Management • Scheduling • Communication • Synchronization • Memory Management • Storage Management • FileSystems Management • Protection and Security • Networking

Distributed Operating Systems • Manages a collection of independent computers and makes them appear to the users of the system as a single computer • Multicomputers • Loosely coupled • Private memory • Autonomous • Multiprocessors • Tightly coupled • Shared memory [Figure: Parallel Architecture and Distributed Architecture, showing CPUs, caches, and memories]

Workstation Model • How to find an idle workstation? • How is a process transferred from one workstation to another? • What happens to a remote process if a user logs onto a workstation that was idle, but is no longer idle now? • Other models - processor pool, workstation server... [Figure: workstations (ws1) connected by a communication network]

Distributed Operating System (DOS) Types • Distributed OSs vary based on • System Image • Autonomy • Fault Tolerance Capability • Multiprocessor OS • Looks like a virtual uniprocessor, contains only one copy of the OS, communicates via shared memory, single run queue • Network OS • Does not look like a virtual uniprocessor, contains n copies of the OS, communicates via shared files, n run queues • Distributed OS • Looks like a virtual uniprocessor (more or less), contains n copies of the OS, communicates via messages, n run queues

Design Issues • Transparency • Performance • Scalability • Reliability • Flexibility (Micro-kernel architecture) • IPC mechanisms, memory management, Process management/scheduling, low level I/O • Heterogeneity • Security

Design Issues (cont.) • Transparency • Location transparency • processes, CPUs and other devices, files • Replication transparency (of files) • Concurrency transparency • (user unaware of the existence of others) • Parallelism • User writes serial program, compiler and OS do the rest • Performance • Throughput, response time • Load balancing (static, dynamic) • Communication is slow compared to computation speed • fine grain, coarse grain parallelism

Design Elements • Process Management • Task Partitioning, allocation, load balancing, migration • Communication • Two basic IPC paradigms used in DOS • Message Passing (RPC) and Shared Memory • synchronous, asynchronous • FileSystems • Naming of files/directories • File sharing semantics • Caching/update/replication

Remote Procedure Call • A convenient way to construct a client-server connection without explicitly writing send/receive type programs (helps maintain transparency). • Introduced by Birrell and Nelson in the 1980s. • Basis of 2-tier client/server systems.

Remote Procedure Calls (RPC) • General message passing model for execution of remote functionality. • Provides programmers with a familiar mechanism for building distributed applications/systems • Familiar semantics (similar to LPC) • Simple syntax, well defined interface, ease of use, generality and IPC between processes on same/different machines. • It is generally synchronous • Can be made asynchronous by using multi-threading [Figure: the caller process sends a request message (containing the remote procedure’s parameters) and waits; the server receives the request, executes the procedure, sends the reply message (containing the result of the procedure execution), and waits for the next message; the caller resumes execution]

RPC Needs and challenges • Needs – Syntactic and Semantic Transparency • Resolve differences in data representation • Support a variety of execution semantics • Support multi-threaded programming • Provide good reliability • Provide independence from transport protocols • Ensure high degree of security • Locate required services across networks • Challenges • Unfortunately achieving exactly the same semantics for RPCs and LPCs is close to impossible • Disjoint address spaces • More vulnerable to failure • Consume more time (mostly due to communication delays)

Implementing RPC Mechanism • Uses the concept of stubs, which provide a perfectly normal LPC abstraction by concealing from programs the interface to the underlying RPC system • Involves the following elements • The client • The client stub • The RPC runtime • The server stub • The server

RPC – How it works II [Figure: the client process (client procedure call; client stub: locate, (un)marshal, (de)serialize, send/receive; communication module) and the server process (communication module; dispatcher selects stub; server stub: (un)marshal, (de)serialize, receive/send; server procedure). Source: Wolfgang Gassler, Eva Zangerle]

Remote Procedure Call (cont.) • Client procedure calls the client stub in the normal way • Client stub builds a message and traps to the kernel • Kernel sends the message to the remote kernel • Remote kernel gives the message to the server stub • Server stub unpacks the parameters and calls the server • Server computes the results and returns them to the server stub • Server stub packs the results in a message and traps to the kernel • Remote kernel sends the message to the client kernel • Client kernel gives the message to the client stub • Client stub unpacks the results and returns to the client
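The ten steps above can be sketched in a few lines of Python. This is a minimal in-process illustration, not a real RPC runtime: the `transport` function stands in for both kernels and the network, JSON stands in for the marshalling format, and all names (`add`, `PROCEDURES`, `client_stub_add`) are hypothetical.

```python
import json

# Hypothetical server procedure; any ordinary function would do.
def add(a, b):
    return a + b

# Table the server stub uses to find the procedure named in a request.
PROCEDURES = {"add": add}

def server_stub(request_bytes):
    """Unpack the parameters, call the server procedure, pack the result."""
    request = json.loads(request_bytes.decode("utf-8"))
    procedure = PROCEDURES[request["proc"]]
    result = procedure(*request["args"])
    return json.dumps({"result": result}).encode("utf-8")

def transport(request_bytes):
    """Stand-in for the client kernel, the network, and the remote kernel."""
    return server_stub(request_bytes)

def client_stub_add(a, b):
    """Looks like a local call to the client; hides marshalling and messaging."""
    request_bytes = json.dumps({"proc": "add", "args": [a, b]}).encode("utf-8")
    reply_bytes = transport(request_bytes)
    return json.loads(reply_bytes.decode("utf-8"))["result"]
```

The client simply calls `client_stub_add(2, 3)`; only the stub knows that the arguments were turned into a message and back.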

RPC - binding • Static binding • hard coded stub • Simple, efficient • not flexible • stub recompilation necessary if the location of the server changes • use of redundant servers not possible • Dynamic binding • name and directory server • load balancing • IDL used for binding • flexible • redundant servers possible
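A rough sketch of dynamic binding, assuming a single in-memory name and directory server. The class name `NameServer`, the service name, and the addresses are hypothetical; round-robin selection is just one simple way to get the load balancing and redundancy mentioned above.

```python
class NameServer:
    """Toy name and directory server: maps a service name to server addresses."""

    def __init__(self):
        self._services = {}   # service name -> list of registered addresses
        self._next = {}       # service name -> index of the next address to hand out

    def register(self, name, address):
        """Called by a server (through its stub) when it starts."""
        self._services.setdefault(name, []).append(address)

    def lookup(self, name):
        """Called by a client stub at bind time; rotates over redundant servers."""
        addresses = self._services.get(name)
        if not addresses:
            raise LookupError(f"no server registered for {name!r}")
        i = self._next.get(name, 0)
        self._next[name] = (i + 1) % len(addresses)
        return addresses[i]
```

Because clients ask the name server at bind time, a server can move or be replicated without recompiling any client stub.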

RPC - dynamic binding [Figure: numbered message flow between the client process (client procedure call; client stub: bind, (un)marshal, (de)serialize, find/bind, send, receive; communication module), the name and directory server, and the server process (communication module; dispatcher selects stub; server stub: register, (un)marshal, (de)serialize, receive, send; server procedure). Source: Wolfgang Gassler, Eva Zangerle]

RPC - Extensions • conventional RPC: sequential execution of routines • client is blocked until the server responds • asynchronous RPC: non-blocking • client has two entry points (request and response) • server stores the result in shared memory • client picks it up from there
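One way to approximate the asynchronous style with multi-threading is sketched below, using the standard `concurrent.futures` module. The `Future` object plays the role of the place where the result is stored until the client picks it up; `remote_square` is a hypothetical stand-in for a remote procedure, and the sleep only simulates network and server delay.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def remote_square(x):
    """Hypothetical remote procedure; the sleep simulates communication delay."""
    time.sleep(0.05)
    return x * x

def async_call(executor, procedure, *args):
    """Request entry point: issue the call and return immediately with a Future."""
    return executor.submit(procedure, *args)

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=2) as executor:
        pending = async_call(executor, remote_square, 7)
        # The client is free to do other work here instead of blocking.
        print(pending.result())  # response entry point: collect the result
```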

RPC servers and protocols… • RPC Messages (call and reply messages) • Server Implementation • Stateful servers • Stateless servers • Communication Protocols • Request (R) Protocol • Request/Reply (RR) Protocol • Request/Reply/Ack (RRA) Protocol • RPC Semantics • At most once (default) • Idempotent: at least once, possibly many times • Maybe semantics - no response expected (best effort execution)
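At-most-once semantics is commonly obtained by having the server remember replies and filter duplicate requests. The sketch below assumes each request carries a client id and a request id; the class name `AtMostOnceServer` and its fields are hypothetical, and a real server would also need to discard old cache entries.

```python
class AtMostOnceServer:
    """Executes each (client_id, request_id) at most once, even if retransmitted."""

    def __init__(self, procedure):
        self._procedure = procedure
        self._replies = {}    # (client_id, request_id) -> cached reply
        self.executions = 0   # how many times the procedure actually ran

    def handle(self, client_id, request_id, *args):
        key = (client_id, request_id)
        if key in self._replies:
            # Duplicate request (e.g. the reply was lost): resend the cached
            # reply instead of executing the procedure a second time.
            return self._replies[key]
        self.executions += 1
        reply = self._procedure(*args)
        self._replies[key] = reply
        return reply
```

With an idempotent procedure this cache is unnecessary, which is why at-least-once semantics is acceptable there: re-executing the call gives the same result.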

How Stubs are Generated • Through a compiler • e.g. DCE/CORBA IDL – a purely declarative language • Defines only types and procedure headers with familiar syntax (usually C) • It supports • Interface definition files (.idl) • Attribute configuration files (.acf) • Uses Familiar programming language data typing • Extensions for distributed programming are added

RPC - IDL Compilation - result [Figure: in the development environment, the IDL compiler takes the IDL sources and produces interface headers, a client stub, and a server stub; the client process combines client code, a language specific call interface, and the client stub; the server process combines server code, a language specific call interface, and the server stub. Source: Wolfgang Gassler, Eva Zangerle]

RPC NG: DCOM & CORBA • Object models allow services and functionality to be called from distinct processes • DCOM/COM+ (Win2000) and CORBA IIOP extend this to allow calling services and objects on different machines • More OS features (authentication, resource management, process creation, …) are being moved to distributed objects.

Sample RPC Middleware Products • JaRPC (NC Laboratories) • libraries and development system provides the tools to develop ONC/RPC and extended .rpc Client and Servers in Java • powerRPC (Netbula) • RPC compiler plus a number of library functions. It allows a C/C++ programmer to create powerful ONC RPC compatible client/server and other distributed applications without writing any networking code. • Oscar Workbench (Premier Software Technologies) • An integration tool. OSCAR, the Open Services Catalog and Application Registry is an interface catalog. OSCAR combines tools to blend IT strategies for legacy wrappering with those to exploit new technologies (object oriented, internet). • NobleNet (Rogue Wave) • simplifies the development of business-critical client/server applications, and gives developers all the tools needed to distribute these applications across the enterprise. NobleNet RPC automatically generates client/server network code for all program data structures and application programming interfaces (APIs)— reducing development costs and time to market. • NXTWare TX (eCube Systems) • Allows DCE/RPC-based applications to participate in a service-oriented architecture. Now companies can use J2EE, CORBA (IIOP) and SOAP to securely access data and execute transactions from legacy applications. With this product, organizations can leverage their current investment in existing DCE and RPC applications

Distributed Shared Memory (DSM) • Tightly coupled systems • Use of shared memory for IPC is natural • Loosely coupled distributed-memory processors • Use DSM (distributed shared memory) • A middleware solution that provides a shared-memory abstraction [Figure: nodes 1 … n, each with CPUs (CPU 1 … CPU n), memory, and an MMU, connected by a communication network; the distributed shared memory spans the nodes and exists only virtually]

Issues in designing DSM • Synchronization • Granularity of the block size • Memory Coherence (Consistency models) • Data Location and Access • Replacement Strategies • Thrashing • Heterogeneity

Distributed coordination

Distributed Coordination • An operation that a set of processes collectively perform • Mutual exclusion • Allow at most one process to enter the critical section • Leader election • Exactly one process becomes the leader

Quick Recap… • Liveness and Safety property • Safety: Bad things never happen • Liveness: Good thing eventually happens • Examples:

Distributed Mutual Exclusion • Ensures that concurrent processes have serialized access to shared resources • The critical section (CS) problem • At most one process is allowed to enter CS • Shared variables (semaphores) cannot be used in a distributed system • Mutual exclusion must be based on message passing, in the context of unpredictable delays and incomplete knowledge [Figure: processes competing for a critical section/resource]

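Distributed Mutual Exclusion: requirements • ME1 (safety): at most one process may execute in the critical section/resource at a time (by definition) • ME2 (liveness): requests to enter and exit the critical section eventually succeed (avoids starvation and deadlocks) • ME3 (ordering): if one request to enter the critical section happened-before another, entry is granted in that order (fairness) • Source: Distributed Systems Concepts and Design (Coulouris, Dollimore)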

Distributed Mutual Exclusion • Centralized coordinator/server algorithm • Ring-based algorithm • Ricart and Agrawala’s algorithm • Maekawa’s algorithm

Distributed Mutual Exclusion (Central Coordinator/Server Algorithm) • A central server grants permission (token) to enter CS • To enter CS, a process sends a “request” for the token and waits for the reply from the server • The server grants the token if no process is currently in the CS, else it puts the request in a queue • When a process leaves CS, it “releases” the token and the server grants the token to the oldest process waiting in the queue
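The server side of this algorithm can be sketched as a small state machine. Message passing is replaced by direct method calls here, and the class and method names (`CentralCoordinator`, `request`, `release`) are hypothetical.

```python
from collections import deque

class CentralCoordinator:
    """Toy coordinator: at most one holder of the token, FIFO queue of waiters."""

    def __init__(self):
        self.holder = None       # process currently in the CS, if any
        self.waiting = deque()   # queued requests, oldest first

    def request(self, pid):
        """Handle a 'request' message; True means the token is granted now."""
        if self.holder is None:
            self.holder = pid
            return True
        self.waiting.append(pid)
        return False

    def release(self, pid):
        """Handle a 'release' message; returns the next process granted, or None."""
        if self.holder != pid:
            raise ValueError(f"process {pid} does not hold the token")
        self.holder = self.waiting.popleft() if self.waiting else None
        return self.holder
```

The FIFO queue is what makes the server grant the token to the oldest waiting request.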

Distributed Mutual Exclusion (Central Coordinator/Server Algorithm) • Bandwidth [messages per entry and exit operation] • Two messages to enter (request and grant), one for exit (release) • Client delay [amount of time a process waits to enter when no process is in CS] • One round trip delay to enter (request and then receive grant) • Synchronization delay [time gap between one process exiting and another process entering] • One round trip (exiting process releases token and the next process gets it)

Distributed Mutual Exclusion (Ring-based Algorithm) • Processes are arranged in a logical ring • A token is passed process-to-process in one direction (say clockwise) • Whoever holds the token can enter the CS • Whenever done, passes the token to the next • The algorithm ensures “safety” and “liveness”, but it may not respect “order” • Messages per entry/exit: 1 • Client delay: 0 to N message transmissions • Synchronization delay: 1 to N message transmissions
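A small simulation shows why the ring does not respect request order: entry order depends only on ring position relative to the token. Processes are assumed to be numbered 0 to N-1 in clockwise order; the function names are hypothetical.

```python
def hops_until_entry(n, holder, pid):
    """Token transmissions before process pid receives the token from holder."""
    return (pid - holder) % n

def ring_entry_order(n, holder, wanting):
    """Order in which the processes in `wanting` enter the CS during one
    clockwise circulation of the token, starting at `holder`."""
    order = []
    for hop in range(n):
        pid = (holder + hop) % n
        if pid in wanting:
            order.append(pid)
    return order
```

For a ring of 5 with the token at process 3, `ring_entry_order(5, 3, {1, 4})` gives `[4, 1]` even if process 1 asked first, because the token reaches process 4 after one hop and process 1 after three.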

Distributed Mutual Exclusion (Ricart and Agrawala’s Algorithm) • Ricart and Agrawala [1981] use multicast and Lamport logical clocks • The basic idea: • a process that requires entry to CS multicasts a “request” message, and can enter only when all the other processes have “replied” to this message • The conditions under which a process replies to a request are designed to ensure that conditions ME1–ME3 are met

Distributed Mutual Exclusion (Ricart and Agrawala’s Algorithm) • Process wants to enter CS • sends ‘request’ to all other processes and waits to receive ‘grant’ from each of them • Process receives ‘request’ • Reply with ‘grant’ if • it is not in CS and does not want to enter • or • it wants to enter CS (requests sent out), but the incoming request is earlier than its own request • Else, queue the request • Exit from CS • Send ‘grant’ to all queued requests [Figure: p1 sends request (T1, P1) to the other processes; a process that is already in CS queues (Q) the request]

Distributed Mutual Exclusion (Ricart and Agrawala’s Algorithm) • Process wants to enter CS • sends ‘request’ to all other processes and waits to receive ‘grant’ from each of them • Process receives ‘request’ • Do not “GRANT” (put in a queue) if • it is already in CS • OR • it wants to enter CS (requests sent out) and its own request is earlier than the incoming request • Else, send “GRANT” • Exit from CS • Send ‘grant’ to all queued requests [Figure: processes that are not in CS reply to p1 with grant; the process already in CS keeps p1 in its queue (Q)]

Distributed Mutual Exclusion (Ricart and Agrawala’s Algorithm) • Each process can be in one of three states: • RELEASED (neither in CS nor wants to be) • WANTED (wants to be in CS, request sent and now waiting) • HELD (in CS) • Source: Distributed Systems Concepts and Design (Coulouris, Dollimore)
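The reply rule and the three states can be put together in a short sketch. It models only the per-process decision logic; the multicast itself and the counting of received grants are left out, and the class and method names are hypothetical. Requests are stamped with (Lamport time, process id) pairs, which Python compares in exactly the total order the algorithm needs.

```python
RELEASED, WANTED, HELD = "RELEASED", "WANTED", "HELD"

class Process:
    def __init__(self, pid):
        self.pid = pid
        self.state = RELEASED
        self.clock = 0             # Lamport logical clock
        self.request_stamp = None  # (T, pid) of our own pending request
        self.deferred = []         # processes whose requests we have queued

    def want(self):
        """Start requesting the CS; returns the stamp to multicast."""
        self.clock += 1
        self.state = WANTED
        self.request_stamp = (self.clock, self.pid)
        return self.request_stamp

    def on_request(self, stamp):
        """Handle an incoming request; True means reply with 'grant' now."""
        sender_time, sender = stamp
        self.clock = max(self.clock, sender_time) + 1
        if self.state == HELD or (self.state == WANTED and self.request_stamp < stamp):
            self.deferred.append(sender)  # queue it, reply on exit
            return False
        return True

    def enter(self):
        """Called once 'grant' has arrived from all other processes."""
        self.state = HELD

    def exit(self):
        """Leave the CS; returns the processes that must now be sent 'grant'."""
        self.state = RELEASED
        self.request_stamp = None
        granted, self.deferred = self.deferred, []
        return granted
```

If p1 and p2 request concurrently with stamps (1, 1) and (1, 2), p2 grants p1 because (1, 1) is smaller, while p1 queues p2 until it exits.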

Distributed Mutual Exclusion (Ricart and Agrawala’s Algorithm) • The algorithm is safe • No two processes can enter CS at the same time • P1, with request (T1, P1), and P2, with request (T2, P2), could both be in CS only if they granted each other access • P1 grants P2 only if (T2, P2) < (T1, P1), and P2 grants P1 only if (T1, P1) < (T2, P2) • Both can’t be true, because requests are totally ordered • The algorithm ensures liveness

Distributed Mutual Exclusion (Ricart and Agrawala’s Algorithm) • Message complexity: • To enter: 2(N-1) messages (N if hardware supports multicast: 1 request plus N - 1 replies) • Client delay: • one round trip time (multicast requests followed by receiving replies from all) • Synchronization delay: • one-way message delay (grant from the exiting process) • Less than in the central-server and ring-based algorithms

Distributed Mutual Exclusion (Maekawa’s Algorithm) • A process may not need to receive ‘grant’ from all other processes • A subset of them should suffice • For each process p, there is a voting set, V(p) • A process that wants to enter CS needs to receive ‘grant’ from every member of its voting set • At least one member is common to any two processes’ voting sets • i.e., V(p) ∩ V(q) ≠ ∅ • Voting members VOTE for processes, granting them access to CS

Distributed Mutual Exclusion (Maekawa’s Algorithm) • Constructing voting sets • Path of vertices to the root: voting set size O(log(N))? • Voting set size: O(√N) [Figure: example voting set constructions]
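One simple construction that achieves O(√N) voting sets, offered here as an illustration rather than as the construction on the slide, places the N processes in a √N × √N grid and lets V(p) be the union of p's row and p's column. Each set then has 2√N - 1 members, and any two sets intersect because p's row always crosses q's column. The sketch assumes N is a perfect square; the function name is hypothetical.

```python
import math

def grid_voting_sets(n):
    """Return {p: V(p)} for processes 0..n-1 arranged row by row in a square grid."""
    k = math.isqrt(n)
    if k * k != n:
        raise ValueError("this sketch assumes N is a perfect square")
    sets = {}
    for p in range(n):
        row, col = divmod(p, k)
        same_row = {row * k + c for c in range(k)}
        same_col = {r * k + col for r in range(k)}
        sets[p] = same_row | same_col
    return sets
```

For N = 9, process 4 (the grid centre) gets the voting set {1, 3, 4, 5, 7}.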

Distributed Mutual Exclusion (Maekawa’s Algorithm) • The algorithm can lead to deadlock! • Deadlock can be avoided if requests are queued following the happened-before relation [Sanders 1987] • Performance: • Messages: 2√N (enter), √N (exit) • Client delay: same as Ricart and Agrawala’s algorithm • Synchronization delay: one round trip [Figure: processes P1 and P2 with voting sets V1 and V2 and a VOTE message. Source: Distributed Systems Concepts and Design (Coulouris, Dollimore)]

Distributed Mutual Exclusion: Tolerance to Failures • Central server • Can tolerate the crash failure of a client that neither holds nor has requested the token • Ring-based • Cannot tolerate any process crash failure • Ricart and Agrawala’s algorithm • Can tolerate the crash failure of a process if a crashed process is treated as implicitly granting all requests • Maekawa’s algorithm • Can tolerate crash failures of processes that are not in any voting set

Leader Elections

Leader Elections • An algorithm for choosing a unique process to play a particular role is called an election algorithm • For example some process to become the coordinator in the central server-based mutual exclusion • All processes agree on who is the (current) leader • A process “calls” the election if it takes an action that initiates a run of the election algorithm

Leader Election • It doesn’t matter which process is elected • What is important is that one and only one process is chosen as the leader and all processes agree on this decision. • Assume that each process has a unique number (identifier) • In general, election algorithms attempt to locate the process with the highest identifier • The ‘identifier’ may be any useful value, as long as the identifiers are unique and totally ordered. • For example, the process with the lowest computational load can be elected by having each process use <1/load , i > as its identifier
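The load-based identifier can be illustrated with tuples, which Python orders lexicographically. The sketch assumes every process already knows every other process's load, which a real election algorithm would have to establish through messages; the function name `elect` is hypothetical.

```python
def elect(loads):
    """loads maps process id i to its (positive) load.
    Returns the process whose identifier <1/load, i> is highest."""
    identifiers = [(1.0 / load, pid) for pid, load in loads.items()]
    return max(identifiers)[1]
```

`elect({1: 0.5, 2: 0.2, 3: 0.9})` picks process 2, the least loaded one. Including i in the identifier keeps identifiers unique and totally ordered even when two processes report the same load.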

Leader Elections • An election is typically started after a failure occurs • The detection of a failure (e.g. the crash of the current leader) is normally based on time-outs: a process that gets no response for a period of time suspects a failure and initiates an election • An election is typically performed in two phases: • Select a leader with the highest identifier • Inform all processes about the “winner”

Leader Elections Distributed Systems Concepts and Design (Coulouris, Dollimore)
