
The Difficulties of Distributed Data

Presentation Transcript


  1. The Difficulties of Distributed Data
  Douglas Thain
  thain@cs.wisc.edu
  Condor Project, University of Wisconsin
  http://www.cs.wisc.edu/condor

  2. The Condor Project
  • Established in 1985.
  • Software for high-throughput cluster computing on sites ranging from 10 to 1000s of nodes.
  • Example installations:
    • 643 CPUs at UW-Madison in the CS building: computer architecture simulations.
    • 264 CPUs at INFN, all across Italy: CMS simulations.
  • Serves two communities: production software and computer science research.

  3. No Repository Here!
  • No master copy of anyone's data at UW-CS Condor!
  • But, a large amount of buffer space: 128 * 10 GB + 64 * 30 GB.
  • The ultimate store is at other sites:
    • NCSA mass store
    • CERN LHC repositories
  • We concentrate on software for loading, buffering, caching, and producing output efficiently.

  4. The Challenges of Large-Scale Data Access are…
  • 1 - Correctness!
    • Single stage: crashed machines, lost connections, missing libraries, wrong permissions, expired proxies…
    • End-to-end: a job is not "complete" until the output has been verified and written to disk.
  • 2 - Heterogeneity
    • By design: aggregated clusters.
    • By situation: disk layout, buffer capacity, network load.
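To make the end-to-end point concrete, here is a minimal sketch (in Python, not part of Condor) of a post-job check that refuses to call a job complete until its output exists, has the expected size, and matches a known checksum; the expected values are assumed to be supplied from the submit side.

import hashlib
import os

# Hypothetical post-job check (not a Condor feature): a job is only treated as
# "complete" once its output exists, has the expected size, and matches a
# known checksum. Anything else is reported back as a failure.
def output_is_complete(path: str, expected_size: int, expected_sha256: str) -> bool:
    try:
        if os.path.getsize(path) != expected_size:
            return False                        # truncated or partial transfer
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                digest.update(chunk)
        return digest.hexdigest() == expected_sha256
    except OSError:
        return False                            # missing file, bad permissions, dead filesystem, ...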

  5. Your Comments
  • Jobs need scripts that check the readiness of the system before execution. (Tim Smith)
  • Single node failures are not worth investigating: reboot, reimage, replace. (Steve DuChene)
  • "A cluster is a large error amplifier." (Chuck Boeheim)
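Tim Smith's suggestion can be sketched as a pre-flight script run before the real job starts; the specific checks, paths, and submit-host address below are invented for illustration and are not a Condor feature.

import os
import shutil
import socket
import sys

# Illustrative readiness check run before the real job. All thresholds,
# paths, and hostnames here are made-up examples.
REQUIRED_SCRATCH_BYTES = 10 * 1024**3           # need 10 GB of local scratch
REQUIRED_FILES = ["/usr/lib/libexample.so"]     # hypothetical required library
SUBMIT_HOST = ("submit.example.edu", 9618)      # hypothetical submit-site address

def ready() -> bool:
    if shutil.disk_usage("/tmp").free < REQUIRED_SCRATCH_BYTES:
        return False                            # not enough scratch space
    if any(not os.path.exists(p) for p in REQUIRED_FILES):
        return False                            # missing library or input
    try:
        socket.create_connection(SUBMIT_HOST, timeout=5).close()
    except OSError:
        return False                            # network or DNS problem
    return True

if __name__ == "__main__":
    sys.exit(0 if ready() else 1)               # nonzero exit means "do not run here"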

  6. Data Management in Condor
  • Production -> Research:
    • Remote I/O
    • DAGMan
    • Kangaroo
  • Common denominators:
    • Hide errors from jobs -- they cannot deal with "connection refused" or "network down."
    • Propagate failures first to the scheduler, and perhaps later to the user.
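One way to picture the "hide errors from jobs" principle is an I/O wrapper that absorbs transient network faults with retries and, when it cannot recover, reports to the scheduler instead of handing the job a "connection refused." This is a generic sketch under those assumptions, not Condor's actual remote I/O layer.

import time

class SchedulerNotified(Exception):
    """Raised after the scheduler has been told about an unrecoverable failure."""

# Generic sketch (not Condor's I/O layer): transient faults are retried
# transparently; unrecoverable ones go to the scheduler, not to the job.
def reliable_io(operation, notify_scheduler, retries=5, delay=10.0):
    for attempt in range(retries):
        try:
            return operation()                  # e.g. a remote read or write
        except (ConnectionRefusedError, ConnectionResetError, TimeoutError):
            time.sleep(delay)                   # transient network fault: wait and retry
    notify_scheduler("remote I/O failed after %d attempts" % retries)
    raise SchedulerNotified()                   # the job is stopped, never handed the raw error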

  7. Remote I/O
  • Relink the job with the Condor C library.
  • I/O is performed along a TCP connection to the submit site: either fine-grained RPCs or whole-file staging.
  • Some failures: NFS down, DNS down, node rebooting, missing input.
  • On any failure:
    1 - Kill -9 the job
    2 - Log the event
    3 - Email the user?
    4 - Reschedule
  [Diagram: a submit site connected to many execution sites, each running a job over remote I/O.]
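The on-failure policy from the slide, written as a small supervisor sketch; kill_job, email_user, and reschedule are hypothetical placeholders for scheduler actions, not real Condor calls.

import logging

log = logging.getLogger("remote_io")

# Hypothetical supervisor sketch of the slide's failure policy. The callables
# passed in stand in for scheduler actions; they are not Condor APIs.
def handle_io_failure(job, error, kill_job, email_user, reschedule, notify_user=False):
    kill_job(job, signal=9)                                 # 1 - kill -9 the job
    log.error("job %s failed remote I/O: %s", job, error)   # 2 - log the event
    if notify_user:
        email_user(job, error)                              # 3 - email the user (optional)
    reschedule(job)                                         # 4 - let the scheduler try again elsewhere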

  8. DAGMan (Directed Acyclic Graph Manager)
  • A persistent 'make' for distributed computing.
  • Handles dependencies and failures in multi-job tasks, including CPU and data movement.
  [Diagram: Begin DAG -> Stage Input -> Run Remote Job -> Stage Output -> Check Output -> DAG Complete. If a transfer fails, retry it up to 5 times; if the results are bogus, retry the job up to 10 times.]
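The same pipeline can be sketched as a loop over retried stages; the stage names and retry counts come from the diagram, while run_stage and run_dag are illustrative helpers, not DAGMan's real input format.

# Illustrative sketch of the DAG on the slide, not DAGMan itself:
# each stage is retried a bounded number of times before the DAG fails.
def run_stage(action, retries):
    for _ in range(retries):
        if action():
            return True
    return False

def run_dag(stage_input, run_job, stage_output, check_output):
    # Transfers retry up to 5 times; a job whose results are bogus reruns up to 10 times.
    if not run_stage(stage_input, retries=5):
        return "failed: stage input"
    for _ in range(10):
        if not run_job():
            continue
        if not run_stage(stage_output, retries=5):
            return "failed: stage output"
        if check_output():
            return "DAG complete"
    return "failed: output never verified"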

  9. Kangaroo
  • Simple idea: use all available network, memory, and disk to buffer data, and "hop" it to the destination.
  • A background process, not the job, is responsible for handling both faults and variations.
  • Allows overlap of CPU and I/O.
  [Diagram: an application at the execution site writes to a local Kangaroo server, which hops the data through intermediate Kangaroo servers and disks to the storage site.]
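A minimal store-and-forward sketch of the "hop" idea, assuming a send callable for the next hop: the application's write returns as soon as the block is buffered locally, and a background mover pushes blocks onward, retrying on failure. This is not Kangaroo's actual protocol.

import queue
import threading
import time

# Minimal store-and-forward sketch (not Kangaroo's real protocol): writes
# complete as soon as data is buffered locally; a background mover forwards
# blocks toward the destination and absorbs transient failures.
class Hop:
    def __init__(self, send_to_next_hop):
        self._buffer = queue.Queue()            # stands in for memory plus spill-to-disk
        self._send = send_to_next_hop           # assumed callable: bytes -> bool
        threading.Thread(target=self._mover, daemon=True).start()

    def write(self, block: bytes) -> None:
        self._buffer.put(block)                 # returns immediately: CPU and I/O overlap

    def _mover(self) -> None:
        while True:
            block = self._buffer.get()
            while not self._send(block):        # network down? keep the block and retry
                time.sleep(5)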

  10. I/O Models
  [Diagram: timeline comparison of two output models. Kangaroo output: input, then CPU phases with output pushed to the destination in the background, overlapping CPU and I/O. Stage output: input, then CPU phases, with each output transferred only after the computation finishes.]
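The benefit of overlapping CPU and output can be shown with back-of-envelope numbers (all figures below are invented): if each task needs C seconds of CPU and W seconds of output transfer, staging costs about C + W per task, while background pushing is limited by the larger of the two phases.

# Invented illustrative numbers: per-task CPU time and output-transfer time.
CPU_SECONDS = 600.0
OUTPUT_SECONDS = 200.0
TASKS = 100

# Stage output: each task computes, then transfers, so the phases add up.
staged_total = TASKS * (CPU_SECONDS + OUTPUT_SECONDS)

# Background push: the transfer of task i overlaps the computation of task i+1,
# so the pipeline is limited by the slower phase, plus one final drain.
overlapped_total = TASKS * max(CPU_SECONDS, OUTPUT_SECONDS) + min(CPU_SECONDS, OUTPUT_SECONDS)

print(f"staged:     {staged_total / 3600:.1f} hours")       # 22.2 hours
print(f"overlapped: {overlapped_total / 3600:.1f} hours")   # 16.7 hours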

  11. In Summary…
  • Correctness is a major obstacle to high-throughput cluster computing.
  • Jobs must be protected from all of the possible errors in data access.
  • Handle failures in two ways:
    • Abort, and inform the scheduler (not the user).
    • Fall back to an alternate resource.
  • Pleasant side effect: higher throughput!
