
Distributed Transaction Management


Presentation Transcript


  1. Distributed Transaction Management Jyrki Nummenmaa jyrki.nummenmaa@cs.uta.fi

  2. How are distributed txns born? • Distributed database management systems: a client contacts a local server, which takes care of transparent distribution – the client does not need to know about the other sites. • Alternative: the client contacts all local resource managers directly. • Alternative: separate client programs are started on all respective sites. • Typically the client executing on the first site acts as a coordinator communicating with the other clients.

  3. Vector timestamps (continuing from last lecture)

  4. Assigning vector timestamps • Initially, assign [0,...,0] to myVT. • The id of the local site is referred to as ”self”. • If event e is the receipt of message m, then: For i=1,...,M, assign max(m.VT[i],myVT[i]) to myVT[i]. Add 1 to myVT[self]. Assign myVT to e.VT. • If event e is the sending of m, then: Add 1 to myVT[self]. Assign myVT to both e.VT and m.VT.
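A minimal Java sketch of these update rules (the class and method names here are hypothetical, not part of the example software discussed later):

    // Minimal sketch of the vector timestamp rules above (hypothetical class).
    public class VectorClock {
        private final int[] myVT;   // myVT[i] = latest known clock value of site i
        private final int self;     // index of the local site

        public VectorClock(int numberOfSites, int self) {
            this.myVT = new int[numberOfSites];   // initially [0,...,0]
            this.self = self;
        }

        // Event e is the sending of message m: returns the timestamp for e.VT and m.VT.
        public synchronized int[] onSend() {
            myVT[self]++;                         // add 1 to myVT[self]
            return myVT.clone();                  // assign myVT to e.VT and m.VT
        }

        // Event e is the receipt of message m carrying timestamp msgVT: returns e.VT.
        public synchronized int[] onReceive(int[] msgVT) {
            for (int i = 0; i < myVT.length; i++) {
                myVT[i] = Math.max(msgVT[i], myVT[i]);   // component-wise maximum
            }
            myVT[self]++;                         // add 1 to myVT[self]
            return myVT.clone();                  // assign myVT to e.VT
        }
    }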

  5. Vector timestamps • [Figure: four sites T1–T4 exchanging messages m1–m9; each event is labelled with its vector timestamp, e.g. [1,0,0,0], [1,0,0,1], ..., [3,3,3,9].]

  6. Vector timestamp order • e.VT ≤V e'.VT, if and only if e.VT[i] ≤ e'.VT[i] for all i, 1 ≤ i ≤ M. (M = number of sites) • e.VT <V e'.VT, if and only if e.VT ≤V e'.VT and e.VT ≠ e'.VT. • [0,1,2,3] ≤V [0,1,2,3] • [0,1,2,2] <V [0,1,2,3] • The order of [1,1,2,3] and [0,1,2,4] is not defined; they are concurrent.
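A small Java sketch of this partial order (hypothetical helper class):

    // Sketch of the vector timestamp order defined above; M = a.length = b.length.
    public final class VectorOrder {
        public enum Relation { EQUAL, BEFORE, AFTER, CONCURRENT }

        public static Relation compare(int[] a, int[] b) {
            boolean aLeqB = true, bLeqA = true;          // a <=_V b and b <=_V a
            for (int i = 0; i < a.length; i++) {
                if (a[i] > b[i]) aLeqB = false;
                if (b[i] > a[i]) bLeqA = false;
            }
            if (aLeqB && bLeqA) return Relation.EQUAL;   // equal timestamps
            if (aLeqB) return Relation.BEFORE;           // a <_V b
            if (bLeqA) return Relation.AFTER;            // b <_V a
            return Relation.CONCURRENT;                  // order not defined
        }
    }

For example, compare(new int[]{0,1,2,2}, new int[]{0,1,2,3}) returns BEFORE, while compare(new int[]{1,1,2,3}, new int[]{0,1,2,4}) returns CONCURRENT.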

  7. Vector timestamps - properties • Vector timestamps guarantee that if e <H e', then e.VT <V e'.VT – this follows from the definition of the happens-before relation by observing the path of events from e to e'. • Vector timestamps also guarantee that if e.VT <V e'.VT, then e <H e' (why?).

  8. Example Lock Manager Implementation

  9. Example lock manager • Software to manage locks (there are no actual data items). The software is not created for any application, just to demo concepts. • (Non-existing) items are identified by numbers. • Users may start and stop transactions and take and release locks using an interactive client. • From the user’s point of view, the interactive client passes the commands to the lock managers. • The number of managers to consult for a shared lock is given at startup, and the number needed for an exclusive lock is computed from it (see the sketch below). • Lock managers queue lock requests and grant locks once they are available.
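The slides do not spell out how the exclusive quorum is derived; a common choice (an assumption here, not necessarily what the demo software does) is to pick the exclusive quorum x so that x + s > n and 2x > n, where n is the number of lock managers and s the shared quorum, which guarantees that an exclusive lock conflicts with every shared and every other exclusive lock:

    // Hedged sketch: one possible way to derive the exclusive-lock quorum from the
    // shared-lock quorum given at startup (an assumption for illustration only).
    public final class Quorum {
        public static int exclusiveQuorum(int managers, int sharedQuorum) {
            // x + sharedQuorum > managers  (conflicts with every shared lock)
            // 2x > managers                (conflicts with every other exclusive lock)
            return Math.max(managers - sharedQuorum + 1, managers / 2 + 1);
        }

        public static void main(String[] args) {
            System.out.println(exclusiveQuorum(5, 2));   // 5 managers, shared quorum 2 -> 4
        }
    }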

  10. Lock manager classes • See javadocs. • LockManager manages locks. • A LMMultiServerThread is created for each client. • LockManagerProtocol talks with a client and calls LockManager methods. • StartLockManager just starts a LockManager. • Lock – a single lock object. • Transaction – a single transaction object. • Waitsfor – a single lock waiting for another.

  11. Lock client classes • StartLockClient – starts a lock client. • LockClientController – the textual user interface; this is separate from the LockClient because it uses blocking input for the user. It just reads standard input and sends commands to a LockClient. • LockClient – a client talking to a LockManager. The client opens a socket to talk to a lock manager, and reads input from both the Lock Manager and the Client Controller.

  12. LockClient implementation • In addition to the communication data, the LockClient keeps track of its txn id and its lock requests (pending and granted). • Output from LockManagers is just printed out, with the exception that if it contains a new txn-id, it will be stored. • It keeps track of lock requests from the client controller: • When a new lock request comes in, it will send it to the appropriate LockManagers. • When an unlock request comes in, it will check that there is a matching lock request and send the unlock to the LockManagers. • NewTxn requests are only sent to one LockManager.

  13. More on input/output • Each LMMultiServerThread talks with only one client. • Each client talks to several servers (LMMultiServerThreads). • An interactive client talks and also listens to a client controller, but it does not read standard input itself: reading would block, and if you just check input with .ready(), user input is not echoed on the screen :( • A client controller only reads standard input and sends the messages to an interactive client.

  14. LockManager data structures • A Vector for currently locked objects. • A Vector for queued lock requests. • A Vector for locks granted from the queue and waiting to be notified to the clients. • A Vector for transactions.

  15. LockManager locking • The lock queue is checked. If there are xlocks preventing locking, the lock request is put into the queue; otherwise the lock is added. • When locks are released, the queue is checked and the first queued lock that can be granted is given. If it is possible to give other locks as well, they are also given. Those locks are added into the notify queue. • As an LMMultiServerThread communicates with the client, it also regularly calls the LockManager’s checkNotify() method to see if there are locks given from the queue that need to be notified (a sketch of this grant/queue logic follows below).
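A hedged sketch of the grant/queue/notify behaviour described above, for a single item and with simplified data structures (hypothetical code, not the actual LockManager classes):

    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Queue;

    // Sketch only: lock state for one item, with a wait queue and a notify list.
    class ItemLockState {
        static class LockRequest {
            final long txn; final boolean exclusive;
            LockRequest(long txn, boolean exclusive) { this.txn = txn; this.exclusive = exclusive; }
        }

        private final List<Long> sharedHolders = new ArrayList<>();    // txns holding slocks
        private Long exclusiveHolder = null;                           // txn holding the xlock, if any
        private final Queue<LockRequest> queue = new ArrayDeque<>();   // waiting requests
        private final List<LockRequest> toNotify = new ArrayList<>();  // granted from the queue, not yet notified

        synchronized boolean lock(long txn, boolean exclusive) {
            if (canGrant(exclusive)) { grant(new LockRequest(txn, exclusive)); return true; }
            queue.add(new LockRequest(txn, exclusive));                 // conflicting lock: queue it
            return false;
        }

        synchronized void unlock(long txn) {
            sharedHolders.remove(Long.valueOf(txn));
            if (exclusiveHolder != null && exclusiveHolder == txn) exclusiveHolder = null;
            // Give the first queued lock that can be granted, and any further compatible ones.
            while (!queue.isEmpty() && canGrant(queue.peek().exclusive)) {
                LockRequest r = queue.poll();
                grant(r);
                toNotify.add(r);          // picked up later via checkNotify()
            }
        }

        synchronized List<LockRequest> checkNotify() {
            List<LockRequest> out = new ArrayList<>(toNotify);
            toNotify.clear();
            return out;
        }

        private boolean canGrant(boolean exclusive) {
            if (exclusive) return exclusiveHolder == null && sharedHolders.isEmpty();
            return exclusiveHolder == null;      // shared locks only conflict with xlocks
        }

        private void grant(LockRequest r) {
            if (r.exclusive) exclusiveHolder = r.txn; else sharedHolders.add(r.txn);
        }
    }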

  16. LockManagerProtocol • A LMMultiServerThread uses a LockManagerProtocol to manage communication with a client. • The LockManagerProtocol parses the input from the client (and should do some more error processing than it does at present). • The LockManagerProtocol calls the LockManager’s services for the parsed commands.

  17. Example lock management implementation - concurrency • A lock manager uses threads to serve incoming connections. • As threads access the lock manager’s services concurrently, those methods are qualified as synchronized. This may actually be overkill, as it seems to be enough to synchronize the public methods (xlock/slock/unlock). • An interactive client uses a single thread, but this is just to access the sleep method. There is no concurrent data access and consequently no synchronized methods.

  18. Server efficiency • It would be more efficient to pick threads from a pool of idle threads rather than always creating a new one. • The data structures are not optimized. As there is typically a large number of data items and only an unpredictable subset of them is locked at any one moment, a hashing data structure would be suitable for the lock table (like java.util.Hashtable; a hash function must exist for the key type, e.g. Integer is ok). • It would also be reasonable to attach a lock request queue (e.g. a linked list) to each locked item in the lock table with a direct reference (see the sketch below).
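A sketch of the suggested hash-based lock table with a request queue attached directly to each entry (hypothetical code, exclusive locks only, HashMap used for brevity):

    import java.util.HashMap;
    import java.util.LinkedList;
    import java.util.Map;
    import java.util.Queue;

    // Hedged sketch: lock table keyed by item id, each locked item carrying its own
    // request queue via a direct reference, as suggested above (exclusive locks only).
    class LockTable {
        static class ItemEntry {
            long holder;                                     // txn currently holding the lock
            final Queue<Long> waiters = new LinkedList<>();  // queued txn ids
            ItemEntry(long holder) { this.holder = holder; }
        }

        // Integer provides hashCode/equals, so it works as a key type.
        private final Map<Integer, ItemEntry> table = new HashMap<>();

        synchronized boolean xlock(int item, long txn) {
            ItemEntry e = table.get(item);
            if (e == null) {                  // item not locked: grant immediately
                table.put(item, new ItemEntry(txn));
                return true;
            }
            e.waiters.add(txn);               // otherwise queue behind the current holder
            return false;
        }

        synchronized void unlock(int item) {
            ItemEntry e = table.get(item);
            if (e == null) return;
            Long next = e.waiters.poll();
            if (next == null) table.remove(item);   // nobody waiting: drop the entry
            else e.holder = next;                   // hand the lock to the next waiter
        }
    }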

  19. Example lock management implementation – txn ids • Currently long integers are used. • Giving global txn ids sequentially (over all sites) is either complicated or needs a centralised server, so it is better to construct the txn ids from a (site-id, local-txn-id) pair. If the sites are known in advance, they can be numbered. • We only care about txns accessing some resource manager’s services, and there may be a fixed set of those managers, so a txn can ask the first server it talks to for an id of the form (server-id, local-txn-id). A sketch of such an id follows.
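A minimal sketch of such a composite transaction id (hypothetical class):

    // Hedged sketch of a (server-id, local-txn-id) transaction identifier.
    public final class TxnId {
        private final int serverId;     // id of the server/site that issued the id
        private final long localId;     // sequential counter local to that server

        public TxnId(int serverId, long localId) {
            this.serverId = serverId;
            this.localId = localId;
        }

        @Override public boolean equals(Object o) {
            if (!(o instanceof TxnId)) return false;
            TxnId t = (TxnId) o;
            return t.serverId == serverId && t.localId == localId;
        }

        @Override public int hashCode() { return 31 * serverId + Long.hashCode(localId); }

        @Override public String toString() { return serverId + "-" + localId; }
    }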

  20. Final notes on the software • This is demo software. • It is not bug free. If/once you find bugs, I am glad to hear about them. Even better if you fix them. • It is not robust (error management is far from perfect). • However, it may be able to run for some time without errors. • If you for any reason (and I cannot think of any) need to use the software, I can give you an official license to use it.

  21. Distributed Deadlock Management

  22. Deadlock - Introduction • Centralised example: • T1 locks X at time t1. • T2 locks Y at time t2 > t1. • T1 attempts to lock Y at time t3 > t2 and gets blocked. • T2 attempts to lock X at time t4 > t3 and gets blocked.

  23. Deadlock (continued) • Deadlock can occur in centralised systems. For example: • At the operating system level there can be resource contention between processes. • At the transaction processing level there can be data contention between transactions. • In a poorly designed multithreaded program, there can be deadlock between threads of the same process.

  24. Distributed deadlock ”management” approaches • The approach taken by distributed systems designers to the problem of deadlock depends on the frequency with which it occurs. • Possible strategies: • Ignore it. • Detection (and recovery). • Prevention and Avoidance (by statically making deadlock structurally impossible and by allocating resources carefully). • Detect local deadlock and ignore global deadlock.

  25. Ignore deadlocks? • If the system ignores deadlocks, then the application programmers have to write their applications in such a way that a timeout will force the transaction to abort and possibly re-start. • The same approach is sometimes used in the centralised world.

  26. Distributed Deadlock Prevention and Avoidance • Some proposed techniques are not feasible in practice, like making a process request all of its resources at the start of execution. • For transaction processing systems with timestamps, the following scheme can be implemented (as in the centralised world): • When a process blocks, its timestamp is compared to the timestamp of the blocking process. • The blocked process is only allowed to wait if it has a higher timestamp. • This avoids any cyclic dependency (a sketch of the rule follows below).
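A minimal sketch of the waiting rule (hypothetical names; timestamps grow over time, so a higher timestamp means a younger process):

    // Hedged sketch: a blocked process may wait only if its timestamp is higher
    // (i.e. it is younger) than that of the process holding the resource.
    final class TimestampWaitRule {
        /** Returns true if the requester is allowed to block and wait for the holder. */
        static boolean mayWait(long requesterTimestamp, long holderTimestamp) {
            // Waits only ever go from younger to older, so no cycle can form.
            return requesterTimestamp > holderTimestamp;
        }
        // If mayWait(...) is false, the requester is aborted (and later restarted).
    }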

  27. Wound-wait and wait-die • In the wound-wait approach, if an older process requests a lock on an item held by a younger process, it wounds the younger process and effectively kills it. • In the wait-die approach, if a younger process requests a lock on an item held by an older process, the younger process commits suicide. • Both of these approaches kill transactions blindly: there does not need to be an actual deadlock. (A sketch contrasting the two rules follows.)
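A minimal sketch contrasting the two rules (hypothetical names; a lower timestamp means an older transaction):

    // Hedged sketch of the two rules above; lower timestamp = older transaction.
    final class DeadlockPrevention {
        enum Action { WAIT, ABORT_REQUESTER, ABORT_HOLDER }

        // Wound-wait: an older requester wounds (aborts) the younger holder,
        // while a younger requester simply waits.
        static Action woundWait(long requesterTs, long holderTs) {
            return requesterTs < holderTs ? Action.ABORT_HOLDER : Action.WAIT;
        }

        // Wait-die: an older requester waits, while a younger requester dies
        // (aborts itself) and is later restarted with its original timestamp.
        static Action waitDie(long requesterTs, long holderTs) {
            return requesterTs < holderTs ? Action.WAIT : Action.ABORT_REQUESTER;
        }
    }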

  28. Further considerations for wound-wait and wait-die • To reduce the number of unnecessarily aborted transactions, it is possible to use the cautious waiting rule: • ”You are always allowed to wait for a process which is not itself waiting for another process.” • The aborted processes are re-started with their original timestamps, to guarantee liveness. • Otherwise, a transaction might never make progress if it gets aborted over and over again.

  29. Local waits-for graphs • Each resource manager can maintain its local ”waits-for” graph. • A coordinator maintains a global waits-for graph. It can be computed on request, or automatically from time to time. • When the coordinator detects a deadlock, it selects a victim process and kills it (thereby causing its resource locks to be released), breaking the deadlock.

  30. Waits-for graph • Centralised: a cycle in the waits-for graph means a deadlock. [Figure: waits-for graph with a cycle over transactions T1, T2 and T3.]

  31. Local vs Global • Distributed: collect the local graphs and create a global waits-for graph. [Figure: local waits-for graphs over T1, T2 and T3 at Site 1 and Site 2, combined into a global waits-for graph.]
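A minimal sketch of how a coordinator could merge the local edges and look for a cycle in the global graph (hypothetical code; the edge distribution used in main is only an illustrative assumption, not read off the figure):

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // Hedged sketch: merge local waits-for edges into one graph and detect a cycle by DFS.
    final class WaitsForGraph {
        private final Map<String, Set<String>> edges = new HashMap<>();  // txn -> txns it waits for

        void addEdge(String waiter, String holder) {
            edges.computeIfAbsent(waiter, k -> new HashSet<>()).add(holder);
        }

        boolean hasCycle() {
            Set<String> done = new HashSet<>();
            for (String t : edges.keySet()) {
                if (dfs(t, new HashSet<>(), done)) return true;
            }
            return false;
        }

        private boolean dfs(String t, Set<String> onPath, Set<String> done) {
            if (onPath.contains(t)) return true;      // reached a node already on the path: cycle
            if (done.contains(t)) return false;       // already fully explored
            onPath.add(t);
            for (String next : edges.getOrDefault(t, new HashSet<>())) {
                if (dfs(next, onPath, done)) return true;
            }
            onPath.remove(t);
            done.add(t);
            return false;
        }

        public static void main(String[] args) {
            WaitsForGraph g = new WaitsForGraph();
            g.addEdge("T1", "T2");   // e.g. contributed by one site
            g.addEdge("T2", "T3");   // e.g. contributed by another site
            g.addEdge("T3", "T1");
            System.out.println(g.hasCycle());   // true: a global deadlock
        }
    }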

  32. Local waits-for graphs – problems • It is impossible to collect all information at the same time, thus reflecting a ”global state” of locking. • Since information is collected from ”different times”, it may be that locks are released in between and the global waits-for graph is wrong. • Therefore, the information about changes must be transmitted to the coordinator. • The coordinator knows nothing about change information that is in transit. • In practice, many of the deadlocks that it thinks it has detected will be what they call in the trade ”phantom deadlocks”.

  33. Global states and snapshot computation

  34. Global state • We assume that between each pair (Si,Sj) of sites there is a reliable order-preserving communication channel C{i,j}, whose content is a list of messages L{i,j} = m1(i,j), m2(i,j), ..., mk(i,j). • Let L = {L{i,j}} be the collection of all message lists and let K be the collection of all local states. We say that the pair G = (K,L) is the global state of the system.

  35. Consistent cut • We say that a global state G = (K,L) is a consistent cut, if for each event e in G, G contains all events e’ such that e’ <H e. • That is, if e is in G, then no event that happened before e is missing from G.

  36. Which lines cut a consistent cut? • [Figure: four timelines T, T’, T’’ and T’’’ exchanging messages m1–m9, with candidate cut lines drawn across them.]

  37. Distributed snapshots • A state of a site at one time is called a snapshot. • There is no way we can take the snapshots simultaneously. If we could, that would solve the deadlock detection. • Therefore, we want to create a snapshot that reflects a consistent cut.

  38. Computing a distributed snapshot – assumptions • If we form a graph with sites as nodes and communication channels as edges, we assume that the graph is connected. • Neither channels nor processes fail. • Any site may initiate snapshot computation. • There may be several simultaneous snapshot computations.

  39. Computing a distributed snapshot / 1 • When a site Sj initiates snapshot collection, it records its state and sends a snapshot token to all sites it communicates with.

  40. Computing a distributed snapshot / 2 • If a site Si, i ≠ j, receives a snapshot token for the first time, and it receives it from Sk, it does the following: 1. Stops processing messages. 2. Makes L(k,i) the empty list. 3. Records its state. 4. Sends a snapshot token to all sites it communicates with. 5. Continues to process messages.

  41. Computing a distributed snapshot / 3 • If a site Si receives a snapshot token from Sk and it has already received a snapshot token earlier (or i = j), then the list L(k,i) is the list of messages Si has received from Sk after recording its state (and before receiving this token from Sk).
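A minimal sketch of one site's bookkeeping for the three rules above (hypothetical class; actual message and token delivery over the channels is left abstract):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    // Hedged sketch of one site's part of the snapshot algorithm described above.
    abstract class SnapshotSite {
        final int self;                                                  // this site's id
        boolean recorded = false;                                        // state recorded yet?
        Object recordedState = null;
        final Map<Integer, List<Object>> channelLog = new HashMap<>();   // L(k, self) per neighbour k
        final Set<Integer> tokenSeenFrom = new HashSet<>();              // channels already closed by a token

        SnapshotSite(int self) { this.self = self; }

        // Rule /1: this site initiates the snapshot computation.
        void initiateSnapshot() {
            recordLocalState();
            sendTokenToAllNeighbours();
        }

        // Rules /2 and /3: a snapshot token arrives from neighbour Sk.
        void onToken(int k) {
            tokenSeenFrom.add(k);                 // stop logging messages arriving from Sk
            if (!recorded) {                      // first token: L(k, self) stays empty
                channelLog.put(k, new ArrayList<>());
                recordLocalState();
                sendTokenToAllNeighbours();
            }
        }

        // An ordinary message m arrives from neighbour Sk.
        void onMessage(int k, Object m) {
            // Messages received after recording the state and before Sk's token belong to L(k, self).
            if (recorded && !tokenSeenFrom.contains(k)) {
                channelLog.computeIfAbsent(k, x -> new ArrayList<>()).add(m);
            }
            process(m);
        }

        private void recordLocalState() {
            recordedState = captureState();
            recorded = true;
        }

        abstract Object captureState();            // e.g. the local waits-for graph (slide 47)
        abstract void sendTokenToAllNeighbours();
        abstract void process(Object m);
    }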

  42. Distributed snapshot example • Assume the connection set { (S1,S2), (S2,S3), (S3,S4) }. [Figure: sites S1–S4 exchanging messages m1–m5 and snap tokens.] • In the example, L(1,2) = { m1 }, L(3,2) = {}, L(4,3) = { m3, m4 }. • The effects of m2 are included in the state of S4. • Message m5 takes place entirely after the snapshot computation.

  43. Termination of the snapshot algorithm • Termination is, in fact, straightforward to see. • Since the graph of sites is connected, and the local processing and message delivery between any two sites takes only a finite time, all sites will be reached and processed within a finite time.

  44. The snapshot algorithm selects a consistent cut • Proof. Let ei and ej be events which occurred on sites Si and Sj, with ej <H ei. Assume that ei is in the cut produced by the snapshot algorithm. We want to show that ej is also in the cut. If Si = Sj, this is clear. Assume that Si ≠ Sj. Assume now, for contradiction, that ej is not in the cut. Consider the sequence of messages <m1, m2, …, mk> by which ej <H ei. Then Sj recorded its state (and sent snapshot tokens) before ej, and since the channels preserve message order, a snapshot token travels ahead of each of m1, m2, …, mk. Hence Si receives a snapshot token and records its state before ei, so ei is not in the cut. This is a contradiction, and we are done.

  45. A picture relating to the theorem • [Figure: sites Sj and Si, with ej on Sj preceding ei on Si through the message chain m1, m2, …, mk (ej <H ei). Assume for contradiction that ej is not in the cut. Conclusion: Si must have recorded its state before ei, so ei cannot be in the cut – contradiction.]

  46. Several simultaneous snapshot computations • The snapshot markers of different computations have to be distinguishable, and each site simply manages the separate snapshot computations independently of each other.

  47. Snapshots and deadlock detection • In this case, the state to be recorded is the waits-for graph, and the channel contents are the lock request / lock release messages in transit. • After the snapshot computation, the waits-for graphs are collected at one site, say the initiator of the snapshot computation. • For this, the snap messages should include the identity of the initiator of the snapshot computation.
