gluepy: A Simple Distributed Python Programming Framework for Complex Grid Environments
8/1/08
Ken Hironaka, Hideo Saito, Kei Takahashi, Kenjiro Taura
The University of Tokyo
www.logos.ic.i.u-tokyo.ac.jp/~kenny/gluepy
Barriers of Grid Environments
• Grid = multiple clusters (LAN/WAN)
• Complex environment
  • Dynamic node joins
  • Resource removal/failure (network and nodes)
  • Connectivity (NAT/firewall)
• Grid-enabled frameworks are crucial to facilitate computing in these environments
[Figure labels: firewall, join, leave]
What type of applications?
• Typical usage
  • Standalone jobs
  • No interaction among nodes
• Parallel and distributed applications
  • Orchestrate nodes for a single application
  • Map an existing application onto the Grid
  • Requires complex interaction
⇒ Frameworks must make it simple and manageable
Common Approaches (1)
• Programming-less
  • Batch scheduler
  • Task placement (inter-cluster)
  • Transparent retries on failure
• Enables minimal interaction
  • Pass data via files/raw sockets
  • Embarrassingly parallel tasks
• Very limited in application
[Figure labels: submit, execute, redo]
Common Approaches (2)
• Incorporate some user programming
  • e.g., Master-Worker framework
  • Program the master/worker(s)
    • Job distribution
    • Handling worker join/leave
    • Error handling
• Enables simple interaction
• Still limited in application
• For more complex interaction (a larger problem set), frameworks must allow more flexible, general programming
[Figure labels: doJob(), error(), join()]
The most flexible approach
• Parallel programming languages
  • Extend existing languages: retains flexibility
  • Countless past examples (MultiLisp [Halstead ‘85], JavaRMI, ProActive [Huet et al. ‘04], …)
• Problem: not designed in the context of the Grid
  • Node joins/leaves?
  • Resolve connectivity with NAT/firewall?
  • Coding becomes complex/overwhelming
• Can we not complement this?
Our Contribution
• Grid-enabled distributed object-oriented framework
  • Focus on coping with a complex environment: joins, failures, connectivity
  • Simple programming and minimal configuration
• A simple tool to act as a glue for the Grid
• Implemented parallel applications on a Grid environment with 900 cores (9 clusters)
Agenda • Introduction • Related Work • Proposal • Evaluation • Conclusion
Programming-less frameworks
• Condor/DAGMan [Thain et al. ‘05]
  • Batch scheduler
  • Transparent retries / handles multiple clusters
• Extremely limited interaction among nodes
  • Tasks with DAG dependencies
  • Pass on data using intermediate/scratch files
[Figure labels: central manager assigns tasks to cluster nodes; busy nodes; task interaction using files]
“Restricted” Programming frameworks
• Master-Worker model: Jojo2 [Aoki et al. ‘06], OmniRPC [Sato et al. ‘01], Ninf-C [Nakata et al. ‘04], NetSolve [Casanova et al. ‘96]
  • Event-driven master code: handle join/leave
• Map-Reduce [Dean et al. ‘05]
  • Define 2 functions: map(), reduce()
  • Partial retries when nodes fail
• Ibis – Satin [Wrzesinska et al. ‘06]
  • Distributed divide-and-conquer
  • Random work stealing: accommodates join/leave
• Effective for specialized problem sets
  • Specializing on a problem/model makes mapping/programming easy
  • For “unexpected models”, users have to resort to out-of-band/ad-hoc means
[Figure labels: join handler, failure handler; map()/reduce() over input data; fib(n) divided into fib(n-1)]
Distributed Object-Oriented frameworks
• ABCL [Yonezawa ‘90], JavaRMI, Manta [Maassen et al. ‘99], ProActive [Huet et al. ‘04]
• Distributed object-oriented
  • Disperse objects among resources
  • Load delegation/distribution
• Method invocations
  • RMI (Remote Method Invocation)
  • Async. RMIs for parallelism
• RMI: a good abstraction
• Extension of a general language: allows flexible coding
[Figure labels: foo.doJob(args), RMI, async. RMI, compute]
Hurdles for DOO on the Grid
• Race conditions
  • Simultaneous RMIs on 1 object
  • Active Objects: 1 object = 1 thread
  • Deadlocks, e.g., recursive calls
• Handling asynchronous events
  • e.g., handling node joins
  • Why not event-driven? The flow of the program is segmented and hard to follow
• Handling joins/failures
  • Difficult to handle them transparently in a reasonable manner
[Figure labels: deadlock between a.g() and b.f(); event handlers (if join: add, if done: give more, …); on failure: checkpoint? automatic retry?]
Hurdles for Implementation
• Connectivity with NAT/firewall
  • Solution: build an overlay
• Existing implementations
  • ProActive [Huet et al. ‘04]
    • Tree-topology overlay
    • User must hand-write connectable points
  • Jojo2 [Aoki et al. ‘06]
    • 2-level hierarchical topology
    • SSH / UDP broadcast
    • Assumes network topology/setting
    • Out of user control
• Requirements
  • Minimal user burden
[Figure labels: NAT, firewall, connection configuration file, configure each link]
Summary of the Problems
• Distributed object-oriented programming on the Grid
  • Thread race conditions
  • Event handling
  • Node join/leave
  • Underlying connectivity
Proposal: gluepy
• Grid-enabled distributed object-oriented framework
  • As a Python library
  • Glue together Grid resources via simple and flexible coding
• Resolve the issues in an object-oriented paradigm
  • SerialObjects
    • Define “ownership” for objects
    • Blocking operations unblock on events
  • Constructs for handling node join/leave
    • Resolve the “first reference” problem
    • Failures are abstracted as exceptions
  • Connectivity (NAT/firewall)
    • Peers automatically construct an overlay
The Basic Programming Model
• RemoteObjects
  • Created in/mapped to a process
  • Accessible from other processes (RMI)
  • Passive objects: threads are not bound to objects
• Thread
  • Simply to gain parallelism
  • RMIs and async. invocations implicitly spawn a thread
• Future
  • Returned for an async. invocation
  • Placeholder for the result
  • An uncaught exception is stored and re-raised at collection
[Figure labels: processes A and B; a.f() spawns a thread for the RMI; F = a.f() async spawns a thread and stores the result in F]
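The future semantics described on this slide (a placeholder returned for an asynchronous call, with an uncaught exception stored and re-raised when the result is collected) can be illustrated with the Python 3 standard library. This is only a local analogy using `concurrent.futures`, not gluepy's API; the `Peer` class and `invoke_all` helper are hypothetical names, and no remote processes are involved.

```python
from concurrent.futures import ThreadPoolExecutor, wait


class Peer:
    """Hypothetical stand-in for a remote object; it lives in the local process."""

    def run(self, arg):
        if arg < 0:
            raise ValueError("negative input")
        return arg * arg


def invoke_all(peers, args):
    """Start one asynchronous call per peer and return the futures."""
    pool = ThreadPoolExecutor(max_workers=len(peers))
    futures = [pool.submit(p.run, a) for p, a in zip(peers, args)]
    wait(futures)      # plays the role of waitall(futures)
    pool.shutdown()
    return futures


futures = invoke_all([Peer(), Peer(), Peer()], [2, 3, -1])
results = []
for f in futures:
    try:
        # result() re-raises the exception stored in the future, if any
        results.append(f.result())
    except ValueError as e:
        results.append("failed: %s" % e)
```

The caller never creates a thread explicitly, which is the property the slide emphasizes: parallelism comes from the asynchronous invocation itself.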
Programming in gluepy
• Basics: RemoteObject
  • Inherit the base class
  • Externally referenceable
• Async. invocation with futures
  • No explicit threads
  • Easier to maintain sequential flow
• Mutual exclusion? Events? ⇒ SerialObjects

# inherit RemoteObject
class Peer(RemoteObject):
    def run(self, arg):
        # work here…
        return result

# async. RMI run() on all peers
futures = []
for p in peers:
    f = p.run.future(arg)
    futures.append(f)
# wait for all results
waitall(futures)
# read all results
for f in futures:
    print f.get()
“Ownership” with SerialObjects
• SerialObjects
  • Objects with mutual exclusion
  • RemoteObject sub-class
  • No explicit locks
• Ownership for each object
  • call ⇒ acquire
  • return ⇒ release
• Method execution by only 1 thread: the “owner thread”
• Owner releases ownership on blocking operations
  • e.g., waitall(), RMI to another SerialObject
  • Pending threads contest for ownership
  • An arbitrary thread is scheduled
• Eliminates deadlocks for recursive calls
[Figure labels: owner thread and waiting threads on an object; block ⇒ give up ownership ⇒ new owner thread; unblock ⇒ re-contest for ownership]
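The ownership rule above (acquire on call, release on return, and give up ownership while blocked on another object) can be sketched with an ordinary lock. The classes `Owned`, `A`, and `B` and the method `blocking_call` are hypothetical names for illustration; this is a minimal local model of the idea, not how gluepy implements SerialObjects.

```python
import threading


class Owned:
    """Hypothetical stand-in for a SerialObject: one owner thread at a time."""

    def __init__(self):
        self._lock = threading.Lock()   # non-reentrant on purpose

    def call(self, method, *args):
        # call => acquire ownership, return => release ownership
        with self._lock:
            return method(*args)

    def blocking_call(self, other, method, *args):
        # Give up ownership while blocked on another object,
        # then re-contest for it once the call returns.
        self._lock.release()
        try:
            return other.call(method, *args)
        finally:
            self._lock.acquire()


class A(Owned):
    def __init__(self):
        super().__init__()
        self.log = []

    def f(self, b):
        self.log.append("f start")
        value = self.blocking_call(b, b.g, self)
        self.log.append("f end")
        return value

    def h(self):
        self.log.append("h")
        return 42


class B(Owned):
    def g(self, a):
        return a.call(a.h)   # recursive call back into A


a, b = A(), B()
result = []
t = threading.Thread(target=lambda: result.append(a.call(a.f, b)), daemon=True)
t.start()
t.join(timeout=5)
```

If `f` held A's lock across the call to `b.g`, the call back into `a.h` would wait forever; releasing ownership on the blocking operation is what lets the recursive call complete.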
Signals to SerialObjects
• We don’t want event-driven loops!
• Events → “signals”
  • Blocking ops unblock on a signal
• Signals to objects
  • Unblock a thread blocking in the object’s context
  • If none, unblock the next thread to block
  • The unblocked thread can handle the signal (event)
[Figure labels: SIGNAL sent to object; blocked thread unblocks and handles it]
SerialObjects in gluepy
• e.g., a queue
  • pop(): blocks on an empty queue
  • add(): calls signal() to unblock a waiter
• Atomic section
  • Between blocking ops in a method
  • Can update object attributes and invoke methods on non-Serial objects

class DistQueue(SerialObject):
    def __init__(self):
        self.queue = []

    def add(self, x):
        self.queue.append(x)
        if len(self.queue) == 1:
            self.signal()      # signal & wake

    def pop(self):
        while len(self.queue) == 0:
            wait([])           # block until signal
        x = self.queue.pop(0)
        return x
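For readers without gluepy at hand, the same blocking-queue pattern can be approximated in a single process with `threading.Condition` from the Python 3 standard library. Here `notify()` plays the role of `signal()` and `Condition.wait()` plays the role of `wait([])`, since it also releases the lock while blocked. `LocalQueue` is a hypothetical name, and this is an analogy rather than gluepy's mechanism.

```python
import threading


class LocalQueue:
    """Single-process analogue of the DistQueue example."""

    def __init__(self):
        self._cond = threading.Condition()
        self._items = []

    def add(self, x):
        with self._cond:
            self._items.append(x)
            self._cond.notify()       # comparable to self.signal()

    def pop(self):
        with self._cond:
            while not self._items:
                self._cond.wait()     # releases the lock while blocked
            return self._items.pop(0)


q = LocalQueue()
received = []
consumer = threading.Thread(
    target=lambda: received.extend(q.pop() for _ in range(3)))
consumer.start()
for item in ("a", "b", "c"):
    q.add(item)
consumer.join(timeout=5)
```

As in the slide's version, the emptiness check sits in a `while` loop, so a woken thread re-checks the queue before popping.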
Managing dynamic resources
• Node join: a Python process starts
• Node leave: the process terminates
• Constructs for node joins/leaves
  • Node join ⇒ “first reference” problem
    • Object lookup: obtain a reference to existing objects in the computation
  • Node leave ⇒ RMI exception
    • Catch to handle the failure
[Figure labels: joining node looks up objects in computation; RMI to object on failed node raises an exception]
e.g.: Master-worker in gluepy (1/3)
• Handles join/leave
• Code for join:
  • A join invokes signal
  • The signal unblocks the main master thread

class Master(SerialObject):
    ...
    def nodeJoin(self, node):
        self.nodes.append(node)
        self.signal()               # signal for join

    def run(self):
        assigned = {}
        while True:
            while len(self.nodes) > 0 and len(self.jobs) > 0:
                ASYNC. RMIS TO IDLE WORKERS
            readys = wait(futures)  # block & handle join
            if readys is None:
                continue
            for f in readys:
                HANDLE RESULTS
e.g.: Master-worker in gluepy (2/3)
• Failure handling
  • Exception on collection
  • Handle the exception to resubmit the task

for f in readys:
    node, job = assigned.pop(f)
    try:
        print "done:", f.get()
        self.nodes.append(node)
    except RemoteException, e:
        # failure handling: resubmit the job
        self.jobs.append(job)
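The resubmission logic can be tried out locally with a small simulation. The sketch below uses `concurrent.futures` from Python 3 in place of RMIs; `WorkerFailure`, `FlakyWorker`, and `run_master` are hypothetical names, and a raised exception stands in for a node that has left. As in the slide's code, a worker whose call fails is not returned to the idle list, and its job goes back on the pending queue.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


class WorkerFailure(Exception):
    """Hypothetical stand-in for the exception raised on RMI failure."""


class FlakyWorker:
    def __init__(self, name, fail_on=()):
        self.name = name
        self.fail_on = set(fail_on)

    def do_job(self, job):
        if job in self.fail_on:
            raise WorkerFailure("%s lost job %r" % (self.name, job))
        return job * 2


def run_master(workers, jobs):
    idle, pending = list(workers), list(jobs)
    assigned, done = {}, {}
    with ThreadPoolExecutor(max_workers=len(workers)) as pool:
        while pending or assigned:
            # hand pending jobs to idle workers
            while idle and pending:
                node, job = idle.pop(), pending.pop(0)
                assigned[pool.submit(node.do_job, job)] = (node, job)
            if not assigned:
                raise RuntimeError("no workers left")
            ready, _ = wait(list(assigned), return_when=FIRST_COMPLETED)
            for f in ready:
                node, job = assigned.pop(f)
                try:
                    done[job] = f.result()
                    idle.append(node)      # worker is available again
                except WorkerFailure:
                    pending.append(job)    # resubmit; failed worker is dropped
    return done


done = run_master([FlakyWorker("w1", fail_on={3}), FlakyWorker("w2")],
                  range(1, 6))
```

Whichever worker happens to receive job 3, every job ends up in `done`, because a failed job is always re-queued for a surviving worker.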
e.g.: Master-worker in gluepy (3/3)
• Deployment
  • Master exports its object
  • Workers get a reference and do an RMI to join

# Master init
master = Master()
master.register("master")
master.run()

# Worker init
worker = Worker()
master = RemoteRef("master")   # lookup on join
master.nodeJoin(worker)
while True:
    sleep(1)
Automatic Overlay Construction (1)
• Solution for connectivity
  • Automatically construct an overlay
• TCP overlay
  • On boot, acquire other peers’ info
  • Each node connects to a small number of peers
  • Establish a connected connection graph
[Figure labels: NAT, firewall, global IP, attempted connections, established connections]
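To get a feel for why a small number of connection attempts per peer can still yield a connected graph, the scheme can be simulated in a few lines of Python 3. The model below is a deliberate simplification and an assumption of this sketch, not the authors' simulator: every peer picks random targets, and an attempt succeeds only if the target has a global IP. The function names are hypothetical.

```python
import random
from collections import deque


def build_overlay(is_global, attempts, rng):
    """Each peer tries `attempts` random peers; only global peers accept."""
    n = len(is_global)
    adj = {i: set() for i in range(n)}
    for src in range(n):
        others = [p for p in range(n) if p != src]
        for dst in rng.sample(others, min(attempts, len(others))):
            if is_global[dst]:       # simplifying assumption of this sketch
                adj[src].add(dst)
                adj[dst].add(src)
    return adj


def is_connected(adj):
    """Breadth-first search from peer 0 over established connections."""
    seen, todo = {0}, deque([0])
    while todo:
        for nxt in adj[todo.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return len(seen) == len(adj)


def success_rate(n_global, n_private, attempts, trials=200, seed=0):
    """Fraction of trials in which the overlay comes out connected."""
    rng = random.Random(seed)
    peers = [True] * n_global + [False] * n_private
    ok = sum(is_connected(build_overlay(peers, attempts, rng))
             for _ in range(trials))
    return ok / trials
```

Calling, for example, `success_rate(28, 238, attempts=10)` gives an estimate under this toy model; the number will not necessarily match the figures reported in the talk, since the real scheme and network differ.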
Automatic Overlay Construction (2)
• Firewalled clusters
  • Automatic port-forwarding
  • User configures SSH info
• Transparent routing
  • Peer-to-peer communication is routed (AODV [Perkins ‘97])

# config file
use src_pat dst_pat, prot=ssh, user=kenny

[Figure labels: firewall traversal over SSH; P-to-P communication]
RMI failure detection on Overlay
• Problem with an overlay
  • A route consists of a number of connections
  • RMI failure ⇒ failure of any intermediate connection
• Path pointers
  • Recorded on each forwarding node
  • The RMI reply returns along the path it came
• Failure of an intermediate connection
  • The preceding forwarding node back-propagates the failure
[Figure labels: RMI invoker, RMI handler, path pointer, back-propagated failure]
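The path-pointer idea can be illustrated with a toy model in Python 3. In this sketch each node records, per request, the hop the request arrived from; a reply or a failure notice then walks those pointers back to the invoker. The `Node` class and both functions are hypothetical, the broken link is detected at forwarding time for simplicity, and no real networking is involved.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.path_pointers = {}   # request id -> previous hop


def back_propagate(node, req_id, message):
    """Follow the recorded path pointers back toward the invoker."""
    hops = [node.name]
    while req_id in node.path_pointers:
        node = node.path_pointers.pop(req_id)
        hops.append(node.name)
    return message, hops


def send_rmi(route, req_id, broken_link=None):
    """Forward a request along `route`; return what the invoker receives."""
    for here, nxt in zip(route, route[1:]):
        if broken_link == (here.name, nxt.name):
            # the node preceding the broken connection reports the failure
            return back_propagate(here, req_id, ("failure", here.name))
        nxt.path_pointers[req_id] = here   # record the path pointer
    return back_propagate(route[-1], req_id, ("reply", route[-1].name))


route = [Node(n) for n in "ABCD"]
ok = send_rmi(route, req_id=1)
failed = send_rmi(route, req_id=2, broken_link=("C", "D"))
```

In both cases the message reaches node A by retracing the forward path, which is why the invoker can learn of a failure on a connection it is not directly attached to.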
Agenda • Introduction • Related Work • Proposal • Evaluation • Conclusion
Experimental Environment
• InTrigger Grid platform in Japan
• Max. scale: 9 clusters, over 900 cores
• Clusters shown: istbs: 316, tsubame: 64, mirai: 48, okubo: 28, hongo: 98, hiro: 88, chiba: 186, kyoto: 70, suzuk: 72, imade: 60, kototoi: 88
[Figure labels: global IPs; private IPs; firewall; all packets dropped; requires SSH forwarding]
Necessary Configuration
• Configuration necessary for the overlay
  • 2 clusters (tsubame, istbs) require SSH port-forwarding to other clusters ⇒ 2 lines of configuration
  • Connection instructions are added by regular expression

# istbs cluster uses SSH for inter-cluster conn.
use 133\.11\.23\. (?!133\.11\.23\.), prot=ssh, user=kenny
# tsubame cluster gateway uses SSH for inter-cluster conn.
use 131.112.3.1 (?!172\.17\.), prot=ssh, user=kenny
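The first rule pairs a source pattern with a destination pattern that uses a negative lookahead, so it selects connections that leave the 133.11.23. range. How gluepy actually evaluates these patterns is not shown in the slides; the sketch below assumes prefix matching with Python's `re.match`, and the helper `rule_applies` as well as the sample addresses are made up for illustration.

```python
import re


def rule_applies(src_pat, dst_pat, src_ip, dst_ip):
    """True if both patterns match at the start of the respective address."""
    return (re.match(src_pat, src_ip) is not None
            and re.match(dst_pat, dst_ip) is not None)


# patterns copied from the istbs rule in the configuration above
ISTBS_SRC = r"133\.11\.23\."
ISTBS_DST = r"(?!133\.11\.23\.)"

# made-up addresses: one inside the 133.11.23. range talking to an outsider
outbound = rule_applies(ISTBS_SRC, ISTBS_DST, "133.11.23.10", "130.54.1.2")
# both endpoints inside the range: the lookahead rejects the destination
internal = rule_applies(ISTBS_SRC, ISTBS_DST, "133.11.23.10", "133.11.23.11")
# source outside the range: the source pattern does not match
foreign = rule_applies(ISTBS_SRC, ISTBS_DST, "130.54.1.2", "133.11.23.10")
```

Under this reading, only the first case would use SSH, which matches the comment that the cluster uses SSH for inter-cluster connections.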
Overlay Construction Simulation
• Evaluate the overlay construction scheme
• For different cluster configurations, varied the number of attempted connections per peer
• 1000 trials for each cluster/attempted-connection configuration
[Figure labels: case of 28 global / 238 private peers; 95 %]
Dynamic Master-Worker
• Master object distributes work to Worker objects
  • 10,000 tasks as RMIs
• Workers repeatedly join/leave
  • Tasks for failed nodes are redistributed
• No tasks were lost during the experiment
A Real-life Application
• A combinatorial optimization problem
  • Permutation Flow Shop Problem
  • Parallel branch-and-bound
    • Master-Worker like
    • Requires periodic exchange of bounds
• Code
  • 250 lines of Python code as glue code
  • Worker node starts up sequential C++ code
  • Communicates with local Python through pipes
Master-Worker interaction
• Master does RMI to worker
• Worker: periodic RMI to master
• Not your typical master-worker
  • Requires a flexible framework like ours
[Figure labels: Master → Worker doJob(); Worker → Master exchange_bound()]
Performance
• Work rate
  • ci: total comp. time per core
  • N: num. of cores
  • T: completion time
• Slight drop with 950 cores
  • Due to the master node becoming overloaded
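The formula for the work rate did not survive in this transcript. One definition that is consistent with the three symbols listed, and that is offered here only as an assumption, is the fraction of the available core time spent on computation:

```latex
\text{work rate} \;=\; \frac{\sum_{i=1}^{N} c_i}{N \cdot T}
```

Under this reading, a value of 1 would mean that every core computed for the entire run, and an overloaded master would lower the value by leaving workers idle.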
Troubleshoot Search Engine
• Ever stuck debugging or troubleshooting?
• Re-rank query results obtained from Google
  • Use results from machine learning web-forums
  • Perform natural language processing on page contents at query time
• Use a Grid backend
  • Computationally intensive
  • Requires good response time: in 10s of seconds
[Figure labels: query “vmware kernel panic” → search engine → Grid backend nodes computing]
Troubleshoot Search Engine Overview
• Leveraged sync/async RMIs to seamlessly integrate parallelism into a sequential program
• Merged CGIs with the Grid backend
[Figure labels: Python CGI, doSearch(), async. doQuery(), async. doWork(), parsing, graph extraction, rescoring]
Agenda • Introduction • Related Work • Proposal • Evaluation • Conclusion
Conclusion
• gluepy: Grid-enabled distributed object-oriented framework
  • Supports simple and flexible coding for complex Grids
    • SerialObjects
    • Signal semantics
    • Object lookup / exception on RMI failure
    • Automatic overlay construction
  • A tool to glue together Grid resources simply and flexibly
• Implemented and evaluated applications on the Grid
  • Max. scale: 900 cores (9 clusters)
  • NAT/firewall, with runtime joins/leaves
• Parallelized real-life applications
  • Take full advantage of gluepy constructs for seamless programming
Questions?
• gluepy is available from its homepage: www.logos.ic.i.u-tokyo.ac.jp/~kenny/gluepy