Scalable Distributed Data Structures & High-Performance Computing
Witold Litwin, Fethi Bennour
CERIA, University Paris 9 Dauphine
http://ceria.dauphine.fr/
Plan • Multicomputers for HPC • What are SDDSs ? • Overview of LH* • Implementation under SDDS-2000 • Conclusion
Multicomputers • A collection of loosely coupled computers • mass-produced and/or preexisting hardware • shared-nothing architecture • Best for HPC because of scalability • message passing through a high-speed net (≥ 100 Mb/s) • Network multicomputers • use general-purpose nets & PCs • LANs: Fast Ethernet, Token Ring, SCI, FDDI, Myrinet, ATM… • NCSA cluster: 1024 NTs on Myrinet by the end of 1999 • Switched multicomputers • use a bus or a switch • IBM-SP2, Parsytec...
Why Multicomputers ? • Unbeatable price-performance ratio for HPC • Cheaper and more powerful than supercomputers • especially the network multicomputers • Available everywhere • Computing power • file size, access and processing times, throughput... • For more pros & cons: • IBM SP2 and GPFS literature • Tanenbaum: "Distributed Operating Systems", Prentice Hall, 1995 • NOW project (UC Berkeley) • Bill Gates at Microsoft Scalability Day, May 1997 • www.microsoft.com White Papers from Business Syst. Div. • Report to the President, President's Inf. Techn. Adv. Comm., Aug 98
Typical Network Multicomputer (diagram: clients and servers interconnected by a network)
Why SDDSs • Multicomputers need data structures and file systems • Trivial extensions of traditional structures are not best • hot-spots • scalability • parallel queries • distributed and autonomous clients • distributed RAM & distance to data • For a CPU, data on a disk are as far as the Moon is for a human (J. Gray, ACM Turing Award 1998)
What is an SDDS ? • Data are structured • records with keys; objects with OIDs • more semantics than in the Unix flat-file model • abstraction most popular with applications • parallel scans & function shipping • Data are on servers • waiting for access • Overflowing servers split into new servers • appended to the file without informing the clients • Queries come from multiple autonomous clients • Access initiators • Not supporting synchronous updates • Not using any centralized directory for access computations
What is an SDDS ? • Clients can make addressing errors • Clients have a more or less adequate image of the actual file structure • Servers are able to forward the queries to the correct address • perhaps in several messages • Servers may send Image Adjustment Messages (IAMs) • Clients do not make the same error twice • Servers support parallel scans • Sent out by multicast or unicast • With deterministic or probabilistic termination • See the SDDS talk & papers for more • ceria.dauphine.fr/witold.html • Or the LH* ACM-TODS paper (Dec. 96)
High-Availability SDDS • A server can be unavailable for access without service interruption • Data are reconstructed from other servers • Data and parity servers • Up to k ≥ 1 servers can fail • At a parity overhead cost of about 1/k • Factor k can itself scale with the file • Scalable availability SDDSs
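For intuition, the k = 1 case can be sketched with plain XOR parity: one parity bucket per group of data buckets lets any single unavailable bucket be rebuilt from the survivors. This is a minimal sketch with hypothetical names; the actual LH*-based schemes (LH*SA, LH*RS) group records rather than whole buckets and use erasure-correcting codes for k > 1.

```python
# Sketch only: k = 1 availability via one XOR parity bucket per group.
from functools import reduce

def xor_parity(blocks):
    """Bytewise XOR over equal-length bucket images."""
    return reduce(lambda x, y: bytes(p ^ q for p, q in zip(x, y)), blocks)

group = [b"bucket_0", b"bucket_1", b"bucket_2"]  # one group of 3 data buckets
parity = xor_parity(group)                       # 1 extra bucket per group

# Bucket 1 becomes unavailable: rebuild it from the survivors + parity.
rebuilt = xor_parity([group[0], group[2], parity])
assert rebuilt == group[1]
```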
An SDDS growth through splits under inserts (animation: as clients insert, an overflowing server splits and a new server is appended to the file without informing the clients)
An SDDS: Client Access (animation: a client's query reaching a wrong server is forwarded to the correct one, which replies to the client with an IAM)
Known SDDSs
• DS: Classics → SDDS (1993)
• Hash: LH*, DDH (Breitbart & al) • high availability: LH*m, LH*g, LH*SA • security: LH*s • s-availability: LH*RS
• 1-d tree: RP*, Kroll & Widmayer, dPi-tree, Breitbart & Vingralek
• m-d trees: k-RP*, Nardelli-tree
• Disk: SDLSA
http://192.134.119.81/SDDS-bibliograhie.html
LH* (A classic) • Scalable distributed hash partitioning • generalizes the LH addressing schema • variants used in Netscape products, LH-Server, Unify, Frontpage, IIS, MsExchange... • Typical load factor 70 - 90 % • In practice, at most 2 forwarding messages • regardless of the size of the file • In general, 1 message per insert and 2 messages per search on average • 4 messages in the worst case
LH* bucket servers • For every record c, its correct address a results from the LH addressing rule:
a ← h_i(c); if n = 0 then exit; else if a < n then a ← h_{i+1}(c); end
• (i, n) = the file state, known only to the LH*-coordinator • Each server a keeps track only of the function h_j used to access it: j = i or j = i + 1
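The addressing rule above is directly executable. A minimal Python sketch, assuming the usual LH hash family h_i(c) = c mod 2^i for a file that started with one bucket (function names are illustrative):

```python
def h(i, c):
    """LH hash family: h_i(c) = c mod 2**i, for a file started with 1 bucket."""
    return c % (2 ** i)

def lh_address(c, i, n):
    """LH addressing rule: a <- h_i(c); if a < n then a <- h_{i+1}(c).
    When n = 0 no bucket has split in this round, so a is final at once."""
    a = h(i, c)
    if a < n:              # bucket a has already split: use the next-level hash
        a = h(i + 1, c)
    return a

# E.g. for file state (i, n) = (3, 3):
assert lh_address(15, 3, 3) == 7   # h_3(15) = 7, bucket 7 not yet split
assert lh_address(9, 3, 3) == 9    # h_3(9) = 1 < 3, so h_4(9) = 9
```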
LH* clients • Each client uses the LH addressing rule for address computation, but with its own image (i', n') of the file state. • Initially, for a new client, (i', n') = (0, 0).
LH* Server Address Verification and Forwarding • Server a getting key c (in particular, a = m, the address the client computed) verifies:
a' := h_j(c); if a' = a then accept c; else a'' := h_{j-1}(c); if a'' > a and a'' < a' then a' := a''; send c to bucket a';
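The verification rule above needs only the receiving bucket's own level j, never the global file state. A minimal sketch (same hypothetical h_i(c) = c mod 2^i family as before):

```python
def h(i, c):
    """LH hash family: h_i(c) = c mod 2**i."""
    return c % (2 ** i)

def server_address(a, j, c):
    """Address verification at bucket a of level j: returns a if c is
    accepted locally, else the bucket number to forward c to."""
    a1 = h(j, c)
    if a1 == a:
        return a                  # correct server: accept c
    a2 = h(j - 1, c)
    if a < a2 < a1:
        a1 = a2                   # lower-level image is the safer next guess
    return a1                     # forward c to bucket a1

# Key 15 sent by an outdated client to bucket 0 (level j = 4):
assert server_address(0, 4, 15) == 7   # one forwarding hop to bucket 7 ...
assert server_address(7, 3, 15) == 7   # ... where bucket 7 (j = 3) accepts
```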
Client Image Adjustment • The IAM consists of the address a where the client sent c and of j(a):
if j > i' then i' := j - 1, n' := a + 1;
if n' ≥ 2^i' then n' := 0, i' := i' + 1;
• The rule guarantees that the client image is within the file • Provided there are no file contractions (merges)
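The adjustment rule above can be sketched directly (a minimal Python transcription; the function name is illustrative):

```python
def adjust_image(i_c, n_c, a, j):
    """Apply an IAM carrying (a, j) to a client image (i', n')."""
    if j > i_c:
        i_c, n_c = j - 1, a + 1
        if n_c >= 2 ** i_c:       # image points past the file: wrap to next level
            n_c, i_c = 0, i_c + 1
    return i_c, n_c

# A new client (i' = 0, n' = 0) receives the IAM (a = 7, j = 3):
assert adjust_image(0, 0, 7, 3) == (3, 0)
# The same IAM a second time changes nothing: the client never
# makes the same addressing error twice.
assert adjust_image(3, 0, 7, 3) == (3, 0)
```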
LH* : file structure (diagram). Buckets 0, 1, 2, 7, 8, 9 with levels j = 4, 4, 3, 3, 4, 4; coordinator state n = 2, i = 3; two clients with images (i' = 0, n' = 0) and (i' = 2, n' = 3).
LH* : split (animation). Under inserts, bucket n = 2 splits: its records are rehashed with h_4 between bucket 2 and the new bucket 10, both now at j = 4; the coordinator state becomes n = 3, i = 3. Client images are unchanged.
LH* : addressing (animation). The client with image (i' = 0, n' = 0) sends key 15 to bucket 0; the server forwards it to bucket 7 (j = 3), which accepts it and returns the IAM (a = 7, j = 3), adjusting the client image to (i' = 3, n' = 0). A later key 9 is forwarded to bucket 9 (j = 4), whose IAM (a = 9, j = 4) adjusts the image again.
Result • The distributed file can grow to span even the whole Internet, so that: • every insert and search is done in at most four messages (IAM included) • in general, an insert is done in one message and a search in two messages
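The bound of at most 2 forwarding hops can be checked by simulation. A self-contained sketch combining the three rules above (h_i(c) = c mod 2^i is an assumption; the file state n = 3, i = 3 mirrors the figures): even a client whose image is completely outdated, so that it sends every key to bucket 0, reaches the correct bucket in at most 2 hops.

```python
def h(i, c):
    return c % (2 ** i)

def client_address(c, i_c, n_c):        # LH rule with the client image (i', n')
    a = h(i_c, c)
    return h(i_c + 1, c) if a < n_c else a

def server_address(a, j, c):            # server-side verification / forwarding
    a1 = h(j, c)
    if a1 == a:
        return a
    a2 = h(j - 1, c)
    return a2 if a < a2 < a1 else a1

# File state n = 3, i = 3: buckets 0..2 and 8..10 at level 4, 3..7 at level 3.
level = {b: 4 if b < 3 or b > 7 else 3 for b in range(11)}

def lookup(c, i_c, n_c):
    """Route key c from a client with image (i', n'); return (bucket, hops)."""
    a, hops = client_address(c, i_c, n_c), 0
    while True:
        nxt = server_address(a, level[a], c)
        if nxt == a:
            return a, hops
        a, hops = nxt, hops + 1

for c in range(1000):
    correct = h(4, c) if h(3, c) < 3 else h(3, c)   # the true LH* address
    a, hops = lookup(c, 0, 0)   # fully outdated image: everything to bucket 0
    assert a == correct and hops <= 2
```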
SDDS-2000 Prototype: Implementation of LH* and of RP* on a Wintel multicomputer • Client/Server architecture • TCP/IP communication (UDP and TCP) with Windows Sockets • Multithreaded control • Process synchronization (mutexes, critical sections, events, timeouts, etc.) • Queuing system • Optional flow control for UDP messaging
SDDS-2000 : Client Architecture (diagram) • Send Request • Receive Response • Return Response • Client Image process. Behind the Applications-SDDS interface, a queuing system matches each response to its pending request (Id_Req, Id_App); requests and responses travel through network sockets to the servers; the client image (i, n) and the server address file are updated on each response.
SDDS-2000 : Server Architecture (diagram) • Listen Thread • Queuing system • Work Threads • Local process • Forward • Response. A listen thread receives client requests from the network socket into a queuing system; work threads (e.g. W.Thread 1 … W.Thread 4) analyse each request and either process it locally on the SDDS bucket (insertion, search, update, delete), forward it, or return the response to the client.
LH*LH: RAM buckets (diagram). An LH* bucket is implemented as an LH file in RAM: a dynamic array of LH buckets, each entry chaining its records through next-indices (a -1 link terminates a chain).
Measuring conditions • LAN of 4 computers interconnected by a 100 Mb/s Ethernet • F.S, Fast Server: Pentium II 350 MHz & 128 MB RAM • F.C, Fast Client: Pentium II 350 MHz & 128 MB RAM • S.C, Slow Client: Pentium 90 MHz & 48 MB RAM • S.S, Slow Server: Pentium 90 MHz & 48 MB RAM • The measurements result from 10,000 records & more • UDP protocol for insertions and searches • TCP protocol for splitting
Best performance of an F.S: configuration (diagram). Three slow clients S.C (1), S.C (2), S.C (3) send over the 100 Mb/s LAN, via UDP, to one fast server F.S holding bucket 0 (j = 0).
Fast Server: Average Insert Time • Inserts without ack • With 3 clients, messages start getting lost • Best time: 0.44 ms
Fast Server: Average Search Time • The measured time includes the search process + the response return • With more than 3 clients, many messages are lost • Whatever the bucket capacity (1000, 5000, …, 20000 records), 0.66 ms is the best time