
gluepy : A Framework for Flexible Programming in Complex Grid Environments

Ken Hironaka, Hideo Saito, Kei Takahashi, Kenjiro Taura (the University of Tokyo) {kenny, h_saito, kay, tau}@logos.ic.i.u-tokyo.ac.jp


Presentation Transcript


  1. Package available from Home Page: www.logos.ic.i.u-tokyo.ac.jp/~kenny/gluepy

Overview
• Grid-enabled distributed object-oriented programming model
  • Distributed objects with implicit synchronization
  • Model that allows join/failure of nodes
  • Incorporates NAT/firewalled clusters via an overlay
• gluepy: "glue Python"
  • Distributed object library extension for Python
  • Implements our proposed programming model
• Real Grid applications on real Grid environments
  • Over 900 real nodes across 9 clusters
  • Heterogeneous network settings (including NAT, firewalls)

Programming Model
• Asynchronous RMIs (Remote Method Invocations) with Futures
  • Any invocation may be made asynchronous
  • An asynchronous invocation returns a future, a placeholder for the result
• Serialization Semantics (Synchronization)
  • At most 1 running thread per object: at any given time, at most 1 thread, the owner thread, can execute an object's method
    • Eliminates race conditions
  • If a thread blocks while in the method's scope, other threads are permitted to execute methods on the object
    • Eliminates deadlocks for common usage
  • Signals to Object: any thread blocking in the object's context will unblock and return None
• Runtime Node Joins
  • Joining nodes need to obtain references to existing objects
  • A fully decentralized remote object lookup scheme
• Node failure (RMI failure) detection
  • RMI failures are returned as Exceptions
[Figure: waiting threads and the owner thread of an object. When the owner thread blocks, it gives up ownership and a new owner thread runs; when it unblocks, it re-contests ownership.]

Automatic Overlay Construction on Grid
• Construction scheme: steps for each peer
  • Obtain endpoint information for other peers
  • Attempt TCP connections to a selected few peers
• Firewalled-cluster peers
  • Automatic SSH port forwarding
• Adaptive routing on overlay [Perkins et al. 1997]
[Figure: peers with global IPs, peers behind NAT, and peers behind a firewall; attempted connections and SSH links.]

Failure Detection on Overlay
• A communication path is maintained for each RMI
• Intermediate peers remember the next peer: Path Pointer
• On failure of a connection, an error is returned along the path
[Figure: RMI handler and path pointers; on failure, the error is returned along the path.]

Related Works
• Grid-enabled Programming Models
  • Satin [Wrzesinska et al. 2006], Jojo [Nakada et al. 2004], Jojo2 [Aoki et al. 2006]
• Distributed Objects on the Grid
  • ProActive [Huet et al. 2004], Ibis RMI [van Nieuwpoort et al. 2005]
• Wide-area Connection Management
  • SmartSockets [Maassen et al. 2007], MC-MPI [Saito et al. 2007]

Example Master-Worker Excerpt

    # wait() and RemoteException come from the gluepy library
    # (the import is not shown on the poster).
    class Master:
        def __init__(self):
            self.nodes = []
            self.jobs = []

        def nodeJoin(self, node):
            self.nodes.append(node)
            self.signal()  # signal the thread blocking in the master object

        def run(self):
            assigned = {}
            while True:
                # Atomic section: async. RMI, doJob() to idle workers
                while len(self.nodes) > 0 and len(self.jobs) > 0:
                    node = self.nodes.pop()
                    job = self.jobs.pop()
                    f = node.doJob.future(job)
                    assigned[f] = (node, job)
                # Block and wait for some results
                readys = wait(assigned.keys())
                if readys is None:  # None returns when unblocked by signal
                    continue
                # Atomic section: collect results, redistribute failed jobs
                for f in readys:
                    node, job = assigned.pop(f)
                    try:
                        print("done:", f.get())
                        self.nodes.append(node)
                    except RemoteException as e:  # exception raised on failure
                        self.jobs.append(job)

Evaluation Results

Experimental Environment
• Total of 9 clusters
• Over 900 CPU cores
[Figure: clusters used for experiments; the numbers denote CPU core counts: A (98), B (186), C (72), D (28), E (60), F (70), G (88), H (316), I (64). Labels: Global IPs, Private IPs, Firewall, All packets dropped.]

Overlay Connectivity Simulation
• 3 cluster combinations
  • Global peers: 384, private peers: 218
  • Global peers: 100, private peers: 218
  • Global peers: 28, private peers: 218
• Achieves a high probability of overlay connectivity with ~20 connections per peer
[Figure: x-axis: connect attempts per peer.]

Master-Worker application with node joins/failures
• Task-distributing Master-Worker (10000 tasks)
  • New tasks are given to new workers via async. RMIs
  • Tasks given to failed workers are redistributed by handling RMI failure exceptions
• The master adapts to new nodes immediately, and completes all tasks in the face of worker failures

Application: Parallel Permutation Flowshop Solver
• A combinatorial optimization problem
  • Given a sequence of n jobs that use m machines, find a permutation of jobs with the shortest makespan
• Finds the optimal solution by parallel branch and bound
  • Master divides the search space into sub-tasks
  • Workers periodically exchange latest bounds with the master
[Figure: Master and Worker, connected by exchange_bound() and doJob().]

Efficiency
• Efficiency is defined in terms of:
  • num. of cores
  • completion time
  • calc. time per core
• 90% efficiency with ~950 cores

Application: "Troubleshooting" Search Engine
• Ever stuck debugging, or troubleshooting?
• Re-ranks Google queries and gives weight to pages from web forums and solutions
• Natural language processing and machine learning
• Parallel computing backend
  • On-line web-page parsing/analysis
  • Real-time response for hundreds of ranked pages
[Figure: the query "vmware kernel panic" goes to the search engine, the backend computes in parallel, and the results are retrieved.]

Future Work
• Application to a much wider range of applications
• Development of library package
  • A prototype package is available at Home Page!!
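
The serialization semantics listed under the programming model (one owner thread per object, ownership given up while a thread blocks, and a signal that unblocks waiting threads with None) can be approximated on a single machine with Python's standard threading.Condition. The following is a minimal sketch under that assumption, not gluepy's implementation: the class SerializedObject and the methods node_join and wait_for_signal are hypothetical names chosen for illustration, and no remote invocation is involved.

```python
import threading


class SerializedObject:
    """Hypothetical local stand-in for an object with one owner thread at a time."""

    def __init__(self):
        # Holding the condition's lock models being the owner thread.
        self._cond = threading.Condition()
        self._signals = 0
        self.nodes = []

    def node_join(self, node):
        with self._cond:                 # become the owner thread
            self.nodes.append(node)
            self._signals += 1           # signal the object ...
            self._cond.notify_all()      # ... waking every blocked thread

    def wait_for_signal(self, timeout=None):
        with self._cond:                 # become the owner thread
            seen = self._signals
            # wait_for() releases the lock while blocked, so another thread
            # can become the owner and run node_join() in the meantime.
            self._cond.wait_for(lambda: self._signals != seen, timeout)
            return None                  # mirrors "unblock and return None"


def demo():
    master = SerializedObject()
    # A second thread joins a node shortly after the main thread blocks.
    joiner = threading.Timer(0.1, master.node_join, args=("worker-1",))
    joiner.start()
    result = master.wait_for_signal(timeout=5.0)
    joiner.join()
    return result, list(master.nodes)


if __name__ == "__main__":
    print(demo())
```

Releasing ownership inside the blocking call is what lets node_join run while another thread is waiting, which is the property the poster credits with eliminating deadlocks for common usage.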
