Atlas: An Infrastructure for Global Computing

People • Eric Baldeschwieler (UC Berkeley) • Bobby Blumofe (UT Austin) • Eric Brewer (UC Berkeley)

Outline • Introduction • Programming model • Architecture • Examples • Discussion • Limitations & Conclusion

Introduction Properties of an Internet computing infrastructure • Scalability: to 10^6 nodes • Heterogeneity: of machines & OSs • Fault tolerance: completion probability comparable to that of a sequential program • Adaptive parallelism: dynamic set of resources

Properties ... • Safety: Hosts must be secure • Anonymity: Protect the privacy of the client's data & program • Hierarchy: Locality of communication (local bandwidth is typically higher) • Ease of use: Minimize “costs” of participating. • Reasonable performance: Low overhead, so that even a small set of machines yields a benefit.

Introduction ... • Atlas combines mechanisms from: • Cilk • Java • with new mechanisms. • Java “ensures”: • heterogeneity • safety

Introduction ... Atlas: • extends Cilk’s work-stealing scheduler to a hierarchical Internet setting • uses Cilk-NOW’s mechanisms for: • adaptive parallelism • fault tolerance

Programming Model • Applications are written in Java • When a native library is used, heterogeneity is limited to platforms that support it. • Programming model is: • a Java-based implementation of Cilk: • Non-blocking, explicit continuation passing threads • a Unix-like URL-based file system & local caching with coherence.
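The "non-blocking, explicit continuation passing threads" bullet can be illustrated with a minimal single-worker sketch. Nothing here is Atlas's actual API: the class `CpsFib`, the `Sum` successor closure, and the `ready` queue are hypothetical names chosen for illustration. The point is only that a thread never waits for its children; it spawns them, hands them a continuation, and returns.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.IntConsumer;

/** Hypothetical sketch of Cilk-style continuation passing; not Atlas code. */
final class CpsFib {
    // Ready threads; each one runs to completion without blocking.
    private final Deque<Runnable> ready = new ArrayDeque<>();

    /** Successor closure: becomes ready once both argument slots are filled. */
    private final class Sum {
        private final IntConsumer cont;
        private int missing = 2;
        private int total = 0;

        Sum(IntConsumer cont) { this.cont = cont; }

        void send(int value) {
            total += value;
            if (--missing == 0) {
                final int result = total;
                ready.push(() -> cont.accept(result));
            }
        }
    }

    /** A fib thread either sends its result onward or spawns two children. */
    private void fib(int n, IntConsumer cont) {
        if (n < 2) {
            cont.accept(n);
            return;
        }
        Sum sum = new Sum(cont);
        ready.push(() -> fib(n - 1, sum::send));
        ready.push(() -> fib(n - 2, sum::send));
    }

    /** Scheduler loop for one worker: run ready threads until none remain. */
    static int compute(int n) {
        CpsFib run = new CpsFib();
        int[] answer = new int[1];
        run.ready.push(() -> run.fib(n, v -> answer[0] = v));
        while (!run.ready.isEmpty()) {
            run.ready.pop().run();
        }
        return answer[0];
    }
}
```

Calling `CpsFib.compute(24)` evaluates the same Fib (24) instance named in the Examples slide. In a real work-stealing runtime, entries in `ready` are what an idle server would steal.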

Architecture Basic architecture [diagram: Client, Manager, and Compute Servers; software stack: Application (Java), Runtime library, Java interpreter, Native libraries (C or C++)]

Architecture ... • Client is a Java application • connects to compute servers on machines other than its manager’s. • Idle servers steal work from busy ones.

Architecture • Compute server: • relinquishes control when there is non-Atlas work (a screensaver?) • Runs as a daemon that is either: • working, or • pinging its manager & siblings for work to steal

Architecture: Porting Atlas • A Java runtime system • Port: • natively written URL-based file system • some support routines.

Hierarchical Work Stealing [diagram: a hierarchy of Managers with Compute Servers beneath them]

Hierarchical Work Stealing ... • Manager keeps track of when its subtree is idle • If manager’s subtree is idle, manager steals work from its siblings • If a subtree has “too much” work, it “allows” work stealing from above. What is the definition & implementation of “too much”?
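One possible reading of these rules, sketched under stated assumptions: the slides do not define “too much”, so the `tooMuch` threshold below is a placeholder policy, and `StealNode` and all its methods are hypothetical names, not Atlas's. A thief tries its own siblings first. Only when the enclosing subtree has nothing to offer does the request move up a level, and a remote subtree gives up work only if its load exceeds the threshold.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical manager/server tree; the threshold policy is an assumption. */
final class StealNode {
    final String name;
    final StealNode parent;
    final List<StealNode> children = new ArrayList<>();
    // In this sketch only leaves (compute servers) hold queued work.
    final Deque<String> work = new ArrayDeque<>();

    StealNode(String name, StealNode parent) {
        this.name = name;
        this.parent = parent;
        if (parent != null) parent.children.add(this);
    }

    /** Total queued work in this subtree. */
    int load() {
        int total = work.size();
        for (StealNode c : children) total += c.load();
        return total;
    }

    /** A manager's subtree is idle when no node in it has queued work. */
    boolean idle() { return load() == 0; }

    /** Removes the oldest unit of work found in this subtree, or null. */
    String give() {
        if (!work.isEmpty()) return work.pollFirst();
        for (StealNode c : children) {
            String w = c.give();
            if (w != null) return w;
        }
        return null;
    }

    /** Steal from local siblings first, then from over-threshold remote subtrees. */
    String steal(int tooMuch) {
        for (StealNode node = this; node.parent != null; node = node.parent) {
            boolean local = (node == this);
            for (StealNode sib : node.parent.children) {
                if (sib == node) continue;
                // Local siblings share freely; remote subtrees export work
                // only when they hold more than the threshold.
                if (sib.load() > (local ? 0 : tooMuch)) return sib.give();
            }
        }
        return null;
    }
}
```

With a threshold of 3, an idle server under one manager can take work from a sibling manager's subtree that holds 5 units. With a threshold of 10, the same request returns null and the work stays local to its subtree, which is the communication-locality goal.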

Hierarchical Work Stealing • The authors claim that proven properties of Cilk hold in this hierarchical setting. • Goals: • Localize communication • Sub-trees map to domain hierarchy Administrators can control thread migration: • Outflow: Privacy • Inflow: Host security

Examples • Fib: fine-grained threads • POV-Ray: coarse-grained threads

             Base     1 Node   3 Nodes     8 Nodes
Fib (24)     1.3      80       40 (2.0)    31 (2.6)
POV-Ray      20700    21000    -           2700 (7.8)

Numbers in ( ) are speedups over 1-node case.
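The parenthesized speedups can be rechecked from the timings above. The helper below is a hypothetical sketch, not part of Atlas. The slide gives no time units, and the `slowdown` method assumes the Base column is the run time without Atlas.

```java
/** Hypothetical helper for reading the timing table; not part of Atlas. */
final class SpeedupTable {
    /** Speedup of a P-node run relative to the 1-node run. */
    static double speedup(double oneNodeTime, double pNodeTime) {
        return oneNodeTime / pNodeTime;
    }

    /** Parallel efficiency: speedup divided by the node count. */
    static double efficiency(double oneNodeTime, double pNodeTime, int nodes) {
        return speedup(oneNodeTime, pNodeTime) / nodes;
    }

    /** Slowdown of the 1-node run relative to the Base column (assumed non-Atlas). */
    static double slowdown(double baseTime, double oneNodeTime) {
        return oneNodeTime / baseTime;
    }
}
```

With the slide's values, `speedup(80, 31)` is about 2.58 and `speedup(21000, 2700)` is about 7.78, matching the rounded 2.6 and 7.8. Under the Base assumption, `slowdown(1.3, 80)` is about 61.5 for Fib, while `slowdown(20700, 21000)` is about 1.01 for POV-Ray. That gap is consistent with the fine-grained versus coarse-grained labels.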

Examples ... • POV-Ray is not written in Java • Partitioning is done in Java • 8 nodes: only 2% overhead. • What about larger P?

Discussion • Scalable: Yes. • Heterogeneity: Incomplete until Atlas divorces itself from all native libraries. • Safety: • Java: OK. • Native libraries: ?

Discussion ... • Fault tolerance: A timed-out thread is recomputed from a checkpoint maintained by its subtree (manager?) • What is the effect of checkpointing on performance? The subtree rooted at a thread is its subcomputation.

Fault Tolerance ... Subcomputations are transactions: • Authors claim: side effects can be undone • How does this relate to hierarchical work stealing?
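A toy model of recomputing from a checkpoint, offered only as a sketch. The slides do not describe Atlas's checkpoint contents or timeout handling, so `CheckpointDemo`, its `State` class, and the simulated failure are all assumptions. Progress made after the last checkpoint is simply discarded, which is the sense in which the subcomputation behaves like a transaction here.

```java
/** Hypothetical sketch; Atlas's real checkpoint format is not given in the slides. */
final class CheckpointDemo {
    /** Saved state: the next index to process and the partial sum so far. */
    static final class State {
        final int next;
        final long partial;

        State(int next, long partial) {
            this.next = next;
            this.partial = partial;
        }
    }

    /**
     * Sums 1..n from a saved state, checkpointing every `interval` steps.
     * If the worker "fails" at step failAt, the last saved state is returned
     * and any progress since that checkpoint is lost.
     */
    static State run(State start, int n, int interval, int failAt) {
        State saved = start;
        long sum = start.partial;
        for (int i = start.next; i <= n; i++) {
            if (i == failAt) return saved; // simulated crash or timeout
            sum += i;
            if (i % interval == 0 || i == n) saved = new State(i + 1, sum);
        }
        return saved;
    }

    /** Manager side: rerun from the last checkpoint until the work completes. */
    static long sumWithRecovery(int n, int interval, int failAt) {
        State cp = run(new State(1, 0), n, interval, failAt);
        while (cp.next <= n) {
            cp = run(cp, n, interval, -1); // replacement worker, no failure
        }
        return cp.partial;
    }
}
```

For example, `CheckpointDemo.sumWithRecovery(100, 10, 57)` loses steps 51 to 56, restarts at 51, and still produces the fault-free sum. The checkpoint interval trades recomputation after a failure against checkpointing overhead, which is the performance question the slide raises.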

Discussion ... • Anonymity: A host executing a stolen subtree cannot determine client. • Managers are assumed to be trustworthy • Hierarchy: Yes, via manager hierarchy. • Ease of use: Interface incomplete. • clients submit jobs via a special “shell”

Discussion ... • Adaptive parallelism: • “Owner” (?) of compute server sets a policy that defines when server is idle. • How? • When a compute server becomes unavailable for Atlas work, all its subcomputations are moved to another compute server.

Adaptive Parallelism ... • Moving a subcomputation requires updating information linking subcomputation to its: • parent • children • How long does it take to retreat? • Is sub-computation restarted? From checkpoint?
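The link-updating step can be sketched as follows. All names are hypothetical. The sketch assumes each subcomputation caches the host of its parent and of each child, which the slides do not confirm. Moving a subcomputation then costs one update message to the parent plus one per child.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of relinking a migrated subcomputation; not Atlas code. */
final class SubcompLinks {
    final String id;
    String host;
    final SubcompLinks parent;
    final List<SubcompLinks> children = new ArrayList<>();
    // Hosts this subcomputation believes its neighbours are running on.
    String parentHostView;
    final Map<String, String> childHostView = new HashMap<>();

    SubcompLinks(String id, String host, SubcompLinks parent) {
        this.id = id;
        this.host = host;
        this.parent = parent;
        if (parent != null) {
            parent.children.add(this);
            parent.childHostView.put(id, host);
            parentHostView = parent.host;
        }
    }

    /** Moves this subcomputation; returns the number of link updates sent. */
    int migrate(String newHost) {
        host = newHost;
        int updates = 0;
        if (parent != null) {
            parent.childHostView.put(id, newHost); // parent must find us again
            updates++;
        }
        for (SubcompLinks c : children) {
            c.parentHostView = newHost; // children send results to the new host
            updates++;
        }
        return updates;
    }
}
```

Under this model the retreat cost grows with the number of children a subcomputation has, which bears on the "how long does it take to retreat?" question.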

Limitations • Atlas inherits tree-structured program limitation from Cilk. • But this is still a rich set! • Generalizing to non-tree-structured programs seems hard. • No shared variables among threads. • Global file system is read-only.

Conclusion • Jicos design goals = those for Atlas. • Use JXTA to give Jicos a “file system” • Then, Jicos becomes Atlas’s heir.
