DISTRIBUTED COMPUTING

DISTRIBUTED COMPUTING Fall 2005

ROAD MAP: OVERVIEW
• Why are distributed systems interesting?
• Why are they hard?

GOALS OF DISTRIBUTED SYSTEMS
Take advantage of the cost/performance difference between microprocessors and shared-memory multiprocessors.
Build systems:
1. with a single system image
2. with higher performance
3. with higher reliability
4. for less money than uniprocessor systems
In wide-area distributed systems, information and work are physically distributed, implying that computing should be distributed as well. Besides improving response time, this serves political goals such as local control over data.

WHY SO HARD?
A distributed system is one in which each process has imperfect knowledge of the global state.
Reasons: asynchrony and failures.
We discuss the problems that these two features raise and algorithms that address them. Then we discuss implementation issues for real distributed systems.

ANATOMY OF A DISTRIBUTED SYSTEM
A set of asynchronous computing devices connected by a network. Normally, there is no global clock. Communication is through either messages or shared memory. Shared memory is usually harder to implement.

ANATOMY OF A DISTRIBUTED SYSTEM (cont.)
[Figure: two panels. Each processor has its own clock; the processors are connected by an arbitrary network in one panel and by a broadcast medium in the other.]
Special protocols will be possible for the broadcast medium.

COURSE GOALS
1. To help you understand which system assumptions are important.
2. To present some interesting and useful distributed algorithms and methods of analysis, then have you apply them under challenging conditions.
3. To explore the sources of distributed intelligence.

BASIC COMMUNICATION PRIMITIVE: MESSAGE PASSING
Paradigm:
• Send message to destination
• Receive message from origin
Nice property: distribution can be made transparent, since it does not matter whether the destination is on the local computer or on a remote one (except for failures).
Clean framework: “Paradigms for Process Interaction in Distributed Programs,” G. R. Andrews, ACM Computing Surveys 23:1 (March 1991) pp. 49-90.

BLOCKING (SYNCHRONOUS) VS. NON-BLOCKING (ASYNCHRONOUS) COMMUNICATION
For the sender: should the sender wait for the receiver to receive a message or not?
For the receiver: when it arrives at a reception point and no message is waiting, should the receiver wait or proceed?
Blocking receive is normal (i.e., the receiver waits).

[Figure: sender and receiver timelines for a BLOCKING send (no computation at the sender until the ACK arrives) and a NON-BLOCKING send (ACK marked “?”).]
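
The difference between the two kinds of send can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Channel` class and its method names are hypothetical, threads stand in for processes, and the "ACK" is modeled as an event that the receiver sets when it takes the message.

```python
import queue
import threading


class Channel:
    """Hypothetical one-way channel between two threads (illustration only)."""

    def __init__(self):
        self._messages = queue.Queue()

    def send_nonblocking(self, msg):
        # The sender enqueues the message and continues immediately.
        self._messages.put((msg, None))

    def send_blocking(self, msg, timeout=None):
        # The sender does no further computation until the receiver
        # has taken the message and acknowledged it.
        ack = threading.Event()
        self._messages.put((msg, ack))
        return ack.wait(timeout)

    def receive(self, timeout=None):
        # Blocking receive: wait until a message is available.
        msg, ack = self._messages.get(timeout=timeout)
        if ack is not None:
            ack.set()  # acknowledge to the blocked sender
        return msg


def demo():
    channel = Channel()
    received = []

    def receiver():
        for _ in range(2):
            received.append(channel.receive(timeout=5))

    worker = threading.Thread(target=receiver)
    worker.start()
    channel.send_nonblocking("first")               # returns at once
    acked = channel.send_blocking("second", timeout=5)  # waits for the ACK
    worker.join()
    return received, acked
```

Calling `demo()` sends one message each way; the blocking send returns only after the receiver thread has picked the message up.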

REMOTE PROCEDURE CALL
[Figure: the client sends a call to the server; the server sends a return to the client.]
• Client calls the server using: call server (in parameters; out parameters).
• The call can appear anywhere that a normal procedure call can.
• Server returns the result to the client.
• Client blocks while waiting for the response from the server.
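
A minimal sketch of the call/return pattern, assuming JSON as the marshalling format and an ordinary function call as the "network". The names `ClientStub`, `server_dispatch`, and `make_local_transport` are hypothetical and not part of any RPC library.

```python
import json


def server_dispatch(request_text, procedures):
    """Server side: unmarshal the request, run the procedure, marshal the result."""
    request = json.loads(request_text)
    result = procedures[request["proc"]](*request["args"])
    return json.dumps({"result": result})


class ClientStub:
    """Client side: makes a remote call look like a local procedure call."""

    def __init__(self, transport):
        # transport is any callable that carries request text to the
        # server and returns the reply text (assumed, for illustration).
        self._transport = transport

    def call(self, proc, *args):
        request_text = json.dumps({"proc": proc, "args": list(args)})
        reply_text = self._transport(request_text)  # the client blocks here
        return json.loads(reply_text)["result"]


def make_local_transport(procedures):
    # Stand-in for a real network: hands the request straight to the server.
    return lambda request_text: server_dispatch(request_text, procedures)
```

For example, `ClientStub(make_local_transport({"add": lambda a, b: a + b})).call("add", 2, 3)` goes through marshalling on both sides before the sum comes back.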

RENDEZVOUS FACILITY
[Figure: the sender issues send; the receiver issues accept; the sender is told accepted.]
• One process sends a message to another process and blocks at least until that process accepts the message.
• The receiving process blocks when it is waiting to accept a request.
Thus the name: only when both processes are ready for the data transfer do they proceed.
We will see examples of rendezvous interactions in CSP and Ada.
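
The two blocking rules can be sketched with a pair of semaphores. This is an illustrative sketch, not how CSP or Ada implement rendezvous; the `Rendezvous` class is hypothetical and threads stand in for processes.

```python
import threading


class Rendezvous:
    """Hypothetical rendezvous point: send and accept both block."""

    def __init__(self):
        self._one_sender = threading.Lock()      # one sender at a time
        self._offered = threading.Semaphore(0)   # a message is on offer
        self._accepted = threading.Semaphore(0)  # the receiver took it
        self._slot = None

    def send(self, msg):
        with self._one_sender:
            self._slot = msg
            self._offered.release()
            self._accepted.acquire()  # block until the receiver accepts

    def accept(self):
        self._offered.acquire()       # block until a sender arrives
        msg = self._slot
        self._accepted.release()      # let the sender proceed
        return msg


def demo():
    meeting = Rendezvous()
    events = []

    def sender():
        meeting.send("hello")
        events.append("sender resumed")

    worker = threading.Thread(target=sender)
    worker.start()
    msg = meeting.accept()
    worker.join()
    return msg, events
```

Whichever side arrives first waits; the transfer happens only once both are at the rendezvous.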

Beyond send-receive: Conversations
Needed when a continuous connection is more efficient and/or the data arrives only a piece at a time.
Bob and Alice: Bob initiates, Alice responds, then Bob, then Alice, …
But what if Bob wants Alice to send messages as they arrive, without Bob doing more than sending an ack? Send-only or receive-only mode. Others?

SEPARATION OF CONCERNS
Separation of concerns is the software engineering principle that each component should have a single small job to do so that it can do it well.
In distributed systems, there are at least three concerns having to do with remote services: what to request, where to do it, and how to ask for it.

IDEAL SEPARATION
• What to request: the application programmer must figure this out, e.g., access the customer database.
• Where to do it: the application programmer should not need to know where, because this adds complexity, and if the location changes, applications break.
• How to ask for it: we want a uniform interface.

WHERE TO DO IT: ORGANIZATION OF CLIENTS AND SERVERS
[Figure: many clients, a service broker, and many servers.]
A service is a piece of work to do. It will be done by a server; a server is a process.
A client who wants a service sends a message to a service broker for that service. The server gets work from the broker and commonly responds directly to the client.
More basic approach: each server has a port from which it can receive requests.
Difference: in the client-broker-server model, many servers can offer the same service. In the direct client-server approach, the client must request a service from a particular server.
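
A minimal in-process sketch of the client-broker-server model, assuming one work queue per service. `ServiceBroker`, `serve_one`, and the reply-queue convention are hypothetical names chosen for illustration.

```python
import queue


class ServiceBroker:
    """Hypothetical broker: holds one queue of pending work per service."""

    def __init__(self):
        self._work = {}  # service name -> queue of (payload, reply_to)

    def _queue_for(self, service):
        return self._work.setdefault(service, queue.Queue())

    def request(self, service, payload, reply_to):
        # Called by a client: it names the service, never a server.
        self._queue_for(service).put((payload, reply_to))

    def next_job(self, service):
        # Called by any server offering this service.
        return self._queue_for(service).get_nowait()


def serve_one(broker, service, handler, server_name):
    """A server takes one job from the broker and replies directly to the client."""
    payload, reply_to = broker.next_job(service)
    reply_to.put((server_name, handler(payload)))
```

Because clients address the service rather than a server, two different servers can answer two consecutive requests without the client noticing.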

ALTERNATIVE: NAME SERVER
[Figure: clients, service broker, and server.]
A service is a piece of work to do. It will be done by a server.
The name server knows where services are done.
Example: the client requests the address of a server from the name server and then communicates directly with that server.
Difference: client-server communication is direct, so it may be more efficient.
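
The lookup-then-call pattern can be sketched as follows. The `NameServer` class, the address strings, and the dictionary standing in for the network are all assumptions made for this illustration.

```python
class NameServer:
    """Hypothetical name server: maps a service name to a server address."""

    def __init__(self):
        self._addresses = {}

    def register(self, service, address):
        self._addresses[service] = address

    def lookup(self, service):
        return self._addresses[service]


def client_call(name_server, network, service, payload):
    """Ask the name server where the service lives, then call that server directly.

    `network` is a stand-in for real communication: a dict that maps an
    address to the handler function running at that address.
    """
    address = name_server.lookup(service)
    return network[address](payload)
```

If the server moves, only the registration changes; the client code stays the same.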

HOW TO ASK FOR IT: OBJECT-BASED
• Encapsulation of data behind a functional interface.
• Inheritance is optional, but the interface is the contract.
• So we need a technique for both synchronous and asynchronous procedure calls.

REFERENCE EXAMPLE: CORBA OBJECT REQUEST BROKER
• Send an operation to the ORB with its parameters.
• The ORB routes the operation to the proper site for execution.
• It arranges for the response to be sent to you directly or indirectly.
• Operations can be “events,” which allows interrupts from servers to clients.

SUCCESSORS TO CORBA: Microsoft Products
• COM: allows objects to call one another in a centralized setting: classes + objects of those classes. Can create objects and then invoke them.
• DCOM: COM + Object Request Broker.
• ActiveX: DCOM for the Web.

SUCCESSORS TO CORBA: Java RMI
• Remote Method Invocation (RMI): define a service interface in Java.
• Register the server in the RMI repository, i.e., an object request broker.
• The client may access the server through the repository.
• Notion of distributed garbage collection.

SUCCESSORS TO CORBA: Enterprise Java Beans
• Beans are again objects but can be customized at runtime.
• Support the notion of distributed transactions (later) as well as backups.
• So the transaction notion for persistent storage is another concern that it is nice to separate.

REDUCING BUREAUCRACY: automatic registration
• Sun also developed an abstraction known as Jini.
• A new device finds a lookup service (like an ORB), uploads its interface, and then everyone can access it.
• No manual registration is needed.
• Requires a trusted environment.

COOPERATING DISTRIBUTED SYSTEMS: LINDA
[Figure: processes sharing a tuple space.]
• Linda supports a shared data structure called a tuple space.
• Linda tuples, like database system records, consist of strings and integers. We will see that in the matrix example below.

LINDA OPERATIONS
The operations are:
• out (add a tuple to the space);
• in (read and remove a tuple from the space);
• read (read but don’t remove a tuple from the tuple space).
A pattern-matching mechanism is used so that tuples can be extracted selectively by specifying values or data types of some fields.
in (“dennis”, ?x, ?y, ….)
• gets a tuple whose first field contains “dennis” and assigns the values in the second and third fields of the tuple to x and y, respectively.
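
A minimal single-process sketch of the three operations and the pattern matching. It is an illustration only: `TupleSpace` and `Formal` are hypothetical names, `in` is spelled `in_` because `in` is a Python keyword, and a failed match returns `None` rather than blocking (closer to the non-blocking variants than to Linda's blocking `in` and `read`).

```python
class Formal:
    """Stands in for a ?x field: matches any value of the given type."""

    def __init__(self, type_):
        self.type_ = type_


class TupleSpace:
    """Hypothetical in-memory tuple space (no distribution, no blocking)."""

    def __init__(self):
        self._tuples = []

    def out(self, *fields):
        # Add a tuple to the space.
        self._tuples.append(tuple(fields))

    def _find(self, pattern):
        for candidate in self._tuples:
            if len(candidate) == len(pattern) and all(
                isinstance(value, p.type_) if isinstance(p, Formal) else value == p
                for p, value in zip(pattern, candidate)
            ):
                return candidate
        return None

    def read(self, *pattern):
        # Return a matching tuple but leave it in the space.
        return self._find(pattern)

    def in_(self, *pattern):
        # Return a matching tuple and remove it from the space.
        found = self._find(pattern)
        if found is not None:
            self._tuples.remove(found)
        return found
```

With `ts.out("dennis", 3, 4)` in the space, `ts.in_("dennis", Formal(int), Formal(int))` returns the whole tuple, from which the second and third fields can be bound to x and y.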

EXAMPLE: MATRIX MULTIPLICATION
There are two matrices A and B. We store A’s rows and B’s columns as tuples.
(“A”, 1, A’s first row), (“A”, 2, A’s second row) ….
(“B”, 1, B’s first column), (“B”, 2, B’s second column) ….
(“Next”, 15)
There is a global counter called Next in the range 1 .. (number of rows of A) × (number of columns of B). A process performs an “in” on Next, records the value, and performs an “out” on Next + 1, provided Next is still in its range.
Convert Next into the row number i and column number j such that Next = (i − 1) × (total number of columns) + j, counting rows and columns from 1.

ACTUAL MULTIPLICATION
First find i and j.
  in (“Next”, ?temp);
  out (“Next”, temp + 1);
  convert (temp, i, j);
Given i and j, a process just reads the values and outputs the result.
  read (“A”, i, ?row_values)
  read (“B”, j, ?col_values)
  out (“result”, i, j, Dotproduct(row_values, col_values)).
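
The worker loop can be sketched end to end in Python. Several things here are assumptions for illustration: the tiny `TupleSpace` (with `None` as a wildcard field) is a stand-in for Linda, rows and columns are numbered from 1, workers run one after another rather than concurrently, and a worker that finds the counter out of range puts it back unchanged so that other workers can also stop.

```python
class TupleSpace:
    """Hypothetical in-memory tuple space; None in a pattern matches anything."""

    def __init__(self):
        self._tuples = []

    def out(self, *fields):
        self._tuples.append(fields)

    def _match(self, pattern):
        for t in self._tuples:
            if len(t) == len(pattern) and all(
                p is None or p == v for p, v in zip(pattern, t)
            ):
                return t
        return None

    def read(self, *pattern):
        return self._match(pattern)

    def in_(self, *pattern):
        t = self._match(pattern)
        if t is not None:
            self._tuples.remove(t)
        return t


def convert(n, cols):
    """Map a 1-based counter value to 1-based (row, column) indices."""
    return (n - 1) // cols + 1, (n - 1) % cols + 1


def worker(ts, rows, cols):
    while True:
        _, n = ts.in_("Next", None)      # take the counter
        if n > rows * cols:
            ts.out("Next", n)            # out of range: put it back and stop
            return
        ts.out("Next", n + 1)            # hand the next cell to someone else
        i, j = convert(n, cols)
        _, _, row_values = ts.read("A", i, None)
        _, _, col_values = ts.read("B", j, None)
        dot = sum(a * b for a, b in zip(row_values, col_values))
        ts.out("result", i, j, dot)


def multiply(A, B):
    rows, cols = len(A), len(B[0])
    ts = TupleSpace()
    for i, row in enumerate(A, start=1):
        ts.out("A", i, list(row))
    for j, col in enumerate(zip(*B), start=1):
        ts.out("B", j, list(col))
    ts.out("Next", 1)
    worker(ts, rows, cols)
    worker(ts, rows, cols)  # a second worker finds no work left
    return [
        [ts.read("result", i, j, None)[3] for j in range(1, cols + 1)]
        for i in range(1, rows + 1)
    ]
```

The only coordination between workers is the “Next” tuple: taking it with `in_` is what prevents two workers from computing the same cell.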

LINDA IMPLEMENTATION OF SHARED TUPLE SPACE
The implementers assert that the work represented by the tuples is large enough that there is no need for shared memory hardware.
The question is how to implement out, in, and read (as well as inp and readp).

BROADCAST IMPLEMENTATION 1
[Figure: out is broadcast to all sites.]
Implement out by broadcasting the argument of out to all sites. (Use a negative acknowledgement protocol for the broadcast.)
To implement read, perform the read from local memory.
To implement in, perform a local read and then attempt to delete the tuple from all other sites.
If several sites perform an in, only one site should succeed. One approach is to have the site originating the tuple decide which site deletes.
Summary: good for reads and outs, not so good for ins.

BROADCAST IMPLEMENTATION 2
[Figure: in and read are broadcast to all sites.]
Implement out by writing locally.
Implement in and read by a global query. (This may have to be repeated if the data is not present.)
Summary: better for out, worse for read, the same for in.

COMMUNICATION REVIEW
• Basic distributed communication when there is no shared memory: send/receive.
• Location transparency: broker or name server or tuple space.
• Synchrony and asynchrony are both useful (e.g., real-time vs. informational sensors).
• Other mechanisms are possible.

COMMUNICATION BY SHARED MEMORY: beyond locks
Framework: Herlihy, Maurice. “Impossibility and Universality Results for Wait-Free Synchronization,” ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing (PODC), 1988.
In a system that uses mutual exclusion, it is possible that one process may stop while holding a critical resource and hang the entire system.
It is of interest to find “wait-free” primitives, in which no process ever waits for another one.
The primitive operations include test-and-set, fetch-and-add, and fetch-and-cons.
Herlihy shows that certain primitives are strictly more powerful than others for wait-free synchronization.

CAN MAKE ANYTHING WAIT-FREE (at a time price)
Don’t maintain the data structure at all. Instead, just keep a history of the operations.
enq(x)
  put enq(x) on end of history list (fetch-and-cons)
end enq(x)
deq
  put deq on end of history list (fetch-and-cons)
  “replay the history” and figure out what to return
end deq
Not extremely practical: the deq takes O(number of deq’s + number of enq’s) time.
Suggestion is to have certain operations reconstruct the state in an efficient manner.
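
The history-list idea can be sketched sequentially in Python. This is a single-threaded illustration only: `list.append` merely stands in for an atomic fetch-and-cons, and `HistoryQueue` is a hypothetical name. The point is that the queue itself is never stored; each deq recomputes its answer from the log.

```python
from collections import deque


class HistoryQueue:
    """Queue represented only by its operation history (illustrative sketch)."""

    def __init__(self):
        self._history = []  # log of ("enq", x) and ("deq", None) entries

    def enq(self, x):
        # Stand-in for fetch-and-cons: append the operation to the log.
        self._history.append(("enq", x))

    def deq(self):
        self._history.append(("deq", None))
        position = len(self._history) - 1
        return self._replay(position)

    def _replay(self, position):
        # Re-run every operation up to and including our own deq.
        # Cost is proportional to the number of enq's plus deq's so far.
        items = deque()
        result = None
        for op, x in self._history[: position + 1]:
            if op == "enq":
                items.append(x)
            else:
                result = items.popleft() if items else None
        return result
```

Each deq replays the whole prefix of the log, which is exactly the linear cost that makes the construction impractical without some way of summarizing state.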

GENERAL METHOD: COMPARE-AND-SWAP
Compare-and-swap takes two values: v and v’. If the register’s current value is v, it is replaced by v’; otherwise it is left unchanged. The register’s old value is returned.
temp := compare-and-swap (register, 0, i)
  if register = 0 then register := i
  else register is unchanged
Use this primitive to perform atomic updates to a data structure.
In the following figure, what should the compare-and-swap do?
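
As a side illustration of the primitive itself, here is a sketch of compare-and-swap and the usual retry loop built on it. Python has no hardware compare-and-swap, so a lock inside the hypothetical `Register` class stands in for the atomicity that the hardware instruction would provide; the resulting code is therefore not wait-free, it only shows the calling pattern.

```python
import threading


class Register:
    """Hypothetical shared register; the lock simulates hardware atomicity."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def compare_and_swap(self, expected, new):
        # Atomically: if the value equals `expected`, replace it with `new`.
        # Either way, return the old value.
        with self._lock:
            old = self._value
            if old == expected:
                self._value = new
            return old

    def load(self):
        with self._lock:
            return self._value


def atomic_increment(register):
    """Retry until our compare-and-swap is the one that takes effect."""
    while True:
        old = register.load()
        if register.compare_and_swap(old, old + 1) == old:
            return old + 1
```

A failed compare-and-swap means some other process updated the register first, so the caller rereads and tries again; no update is ever lost.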

PERSISTENT DATA STRUCTURES AND WAIT-FREEDOM
[Figure: the original data structure with its current pointer, and the structure after the change with a new current pointer.]
One node added, one node removed. To establish the change, change the current pointer. The old tree would still be available.
Important point: if the process doing the change should abort, then no other process is affected.
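
One way to read the figure is sketched below: build the changed tree off to the side, sharing unchanged nodes with the old one, and then swing the current pointer with a single compare-and-swap. The binary search tree, the `Current` class, and `publish_insert` are assumptions made for illustration, and the compare-and-swap here is a single-threaded stand-in rather than a real atomic instruction.

```python
from collections import namedtuple

# Immutable tree node; left and right default to None.
Node = namedtuple("Node", "key left right", defaults=(None, None))


def insert(node, key):
    """Return a new tree containing key; the old tree is left untouched."""
    if node is None:
        return Node(key)
    if key < node.key:
        return node._replace(left=insert(node.left, key))
    if key > node.key:
        return node._replace(right=insert(node.right, key))
    return node


def keys(node):
    """In-order list of keys, for inspection."""
    if node is None:
        return []
    return keys(node.left) + [node.key] + keys(node.right)


class Current:
    """Holds the pointer to the current version of the tree."""

    def __init__(self, root=None):
        self.root = root

    def compare_and_swap(self, expected, new):
        # Single-threaded stand-in for an atomic pointer compare-and-swap.
        old = self.root
        if old is expected:
            self.root = new
        return old


def publish_insert(current, key):
    """Build the new version privately, then swing the pointer; retry on conflict."""
    while True:
        old = current.root
        new = insert(old, key)
        if current.compare_and_swap(old, new) is old:
            return old  # the previous version is still intact
```

If a process stops after calling `insert` but before the compare-and-swap, the current pointer still refers to the old tree, so no other process is affected.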

LAMPORT “Time, Clocks” paper
• What is the proper notion of time for distributed systems?
• Time is a partial order
• The arrow relation
• Logical clocks
• Ordering all events using a tie-breaking clock
• Achieving mutual exclusion using this clock
• Correctness
• Criticisms
• Need for physical clocks
• Conditions for physical clocks
• Assumptions about clocks and messages
• How do we achieve the physical clock goal?

ROAD MAP: TIME ACCORDING TO LAMPORT
• Languages & Constructs for Synchronization
• How to model time in distributed systems

TIME
Assuming there are no failures, the most important difference between distributed systems and centralized ones is that distributed systems have no natural notion of global time.
• Lamport was the first to build a theory around accepting this fact.
• That theory has proven to be surprisingly useful, since the partial order that Lamport proposed is enough for many applications.

WHAT LAMPORT DOES
1. The paper (reference on next slide) describes a message-based criterion for obtaining a time partial order.
2. It converts this time partial order to a total order.
3. It uses the total order to solve the mutual exclusion problem.
4. It describes a stronger notion of physical time and gives an algorithm that sometimes achieves it (depending on the quality of local clocks and message delivery).

NOTIONS OF TIME IN DISTRIBUTED SYSTEMS
Lamport, L. “Time, Clocks, and the Ordering of Events in a Distributed System,” Communications of the ACM, vol. 21, no. 7 (July 1978).
• A distributed system consists of a collection of distinct processes, which are spatially separated. (Each process has a unique identifier.)
• Processes communicate by exchanging messages.
• Messages arrive in the order they are sent. (This could be achieved by a hand-shaking protocol.)
• Consequence: time is a partial order in distributed systems. Some events may not be ordered.

THE ARROW (partial order) RELATION
We say A happens before B, or A → B, if:
1. A and B are in the same process and A happens before B in that process. (Assume processes are sequential.)
2. A is the sending of a message at one process and B is the receiving of that message at another process.
3. There is a C such that A → C and C → B.
In the jargon, → is an irreflexive partial ordering.

LOGICAL CLOCKS
Clocks are a way of assigning a number to an event. Each process has its own clock.
For now, clocks will have nothing to do with real time, so they can be implemented by counters with no actual timing mechanism.
Clock condition: for any events A and B, if A → B, then C(A) < C(B).

IMPLEMENTING LOGICAL CLOCKS
• Each process increments its local clock between any two successive events.
• Each process puts its local time on each message that it sends.
• Each process changes its clock C to C’ when it receives a message m having timestamp T. Require that C’ > max(C, T).

IMPLEMENTATION OF LOGICAL CLOCKS
[Figure: message exchanges between processes, with clock values 13, 13, 8, 18, 14, 19.]
The receiver’s clock jumps to 14 because of the timestamp on the message received.
The receiver’s clock is unaffected by the timestamp associated with the sent message, because the receiver’s clock is already 18, so greater than the message timestamp.
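
The three rules can be sketched as a small class. The `Process` class and its method names are hypothetical; the receive rule uses max(C, T) + 1, which is one simple way to satisfy the requirement C’ > max(C, T).

```python
class Process:
    """Hypothetical process carrying a Lamport logical clock."""

    def __init__(self, pid, clock=0):
        self.pid = pid
        self.clock = clock

    def local_event(self):
        # Increment between any two successive events.
        self.clock += 1
        return self.clock

    def send(self):
        # Sending is an event; the returned value is the message timestamp.
        self.clock += 1
        return self.clock

    def receive(self, timestamp):
        # Move past both the local clock and the message timestamp.
        self.clock = max(self.clock, timestamp) + 1
        return self.clock
```

A receiver whose clock reads 8 and that gets a message stamped 13 moves to 14, while a receiver already at 18 that gets a message stamped 13 simply moves on to 19.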

ORDERING ALL EVENTS
We want to define a total order ⇒.
Suppose two events occur in the same process; then they are ordered by the first condition.
Suppose A and B occur in different processes, i and j. Use process ids to break ties.
LC(A) = C(A)|i, the clock value of A concatenated with i. LC(B) = C(B)|j.
Then A ⇒ B if LC(A) < LC(B). The total ordering ⇒ is called the Lamport clock.
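
The tie-breaking rule can be sketched with Python tuples, whose lexicographic comparison plays the role of concatenating the clock value with the process id. The function names and the event format (name, clock value, process id) are assumptions made for illustration.

```python
def lamport_key(clock_value, process_id):
    """Clock value first, process id as the tie-breaker."""
    return (clock_value, process_id)


def total_order(events):
    """Sort events given as (name, clock_value, process_id) triples."""
    ordered = sorted(events, key=lambda e: lamport_key(e[1], e[2]))
    return [name for name, _, _ in ordered]
```

Two events with the same clock value in different processes are unrelated by the arrow relation, so ordering them by process id never contradicts it.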

ACHIEVING MUTUAL EXCLUSION USING THIS CLOCK
Goals:
• Only one process can hold the resource at a time.
• Requests must be granted in the order in which they are made.
Assumption: messages arrive in the order they are sent. (Remember, this can be achieved by handshaking.)

ALGORITHM FOR MUTUAL EXCLUSION
[Figure: message timeline among Pk, Pi, and Pj showing REQUEST messages, Acks (one marked “None needed”), “Pi executes,” and RELEASE messages.]
• To request the resource, Pi sends the message “request resource” to all other processes along with Pi’s local Lamport timestamp T. It also puts that message on its own request queue.
• When a process receives such a request, it acknowledges the message (unless it has already sent a message to Pi timestamped later than T).
• Releasing the resource is analogous to requesting, but doesn’t require an acknowledgement.

USING THE RESOURCE
Process Pi starts using the resource when:
• its own request on its local request queue has the earliest Lamport timestamp T (consistent with ⇒); and
• it has received a message (either an acknowledgement or some other message) from every other process with a timestamp larger than T.
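
The request, acknowledge, release, and entry rules can be sketched as a small deterministic simulation. Everything here is an assumption for illustration: `Proc` and `Network` are hypothetical names, one global FIFO queue stands in for ordered message delivery, each process has at most one outstanding request, and every request is acknowledged (the optimization of skipping an unneeded Ack is omitted).

```python
import heapq
from collections import deque


class Proc:
    def __init__(self, pid, n):
        self.pid = pid
        self.clock = 0
        self.queue = []          # request queue of (timestamp, pid), as a heap
        self.my_request = None
        # Largest timestamp seen so far from each other process.
        self.last_seen = {q: 0 for q in range(n) if q != pid}

    def _tick(self, received=0):
        self.clock = max(self.clock, received) + 1
        return self.clock

    def request(self, net):
        self.my_request = (self._tick(), self.pid)
        heapq.heappush(self.queue, self.my_request)
        net.broadcast(self.pid, ("REQUEST", self.my_request[0], self.pid))

    def release(self, net):
        self.queue.remove(self.my_request)
        heapq.heapify(self.queue)
        self.my_request = None
        net.broadcast(self.pid, ("RELEASE", self._tick(), self.pid))

    def on_message(self, msg, net):
        kind, ts, sender = msg
        self._tick(ts)
        self.last_seen[sender] = max(self.last_seen[sender], ts)
        if kind == "REQUEST":
            heapq.heappush(self.queue, (ts, sender))
            net.send(sender, ("ACK", self._tick(), self.pid))
        elif kind == "RELEASE":
            self.queue = [r for r in self.queue if r[1] != sender]
            heapq.heapify(self.queue)

    def may_enter(self):
        # Rule 1: our request is the earliest in the total order.
        if self.my_request is None or self.queue[0] != self.my_request:
            return False
        # Rule 2: a later-stamped message has arrived from everyone else.
        return all(seen > self.my_request[0] for seen in self.last_seen.values())


class Network:
    def __init__(self, n):
        self.procs = [Proc(i, n) for i in range(n)]
        self.in_flight = deque()  # one global FIFO keeps per-pair order

    def send(self, dst, msg):
        self.in_flight.append((dst, msg))

    def broadcast(self, src, msg):
        for p in self.procs:
            if p.pid != src:
                self.send(p.pid, msg)

    def deliver_all(self):
        while self.in_flight:
            dst, msg = self.in_flight.popleft()
            self.procs[dst].on_message(msg, self)
```

With three processes, if P0 and P2 request at the same logical time, both requests carry timestamp 1 and the process id breaks the tie in favor of P0; P2 gets its turn only after P0’s RELEASE is delivered.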

CORRECTNESS
Theorem: Mutual exclusion and first-requested, first-served are achieved.
Proof:
Mutual exclusion. Suppose Pi and Pj are both using the resource at the same time and have timestamps Ti and Tj. Suppose Ti < Tj. Then Pj must have received Pi’s request, since it has received at least one message with a timestamp greater than Tj from Pi and since messages arrive in the order they are sent. But then Pj would not execute its request. Contradiction.
First-requested, first-served. If Pi requests the resource before Pj (in the → sense), then Ti < Tj, so Pi will win.

CRITICISMS
• Many messages. Even if only one process is using the resource, it still must send messages to many other processes.
• If one process stops, then all processes hang (no wait-freedom; could we achieve it?).
