
New Models for Data Intensive Scientific Computing

This presentation argues that large-scale scientific workloads should be described to the system through high-level abstractions rather than raw filesystem and batch-system operations. Using biometrics research at Notre Dame as the driving application, it presents the BXGrid data repository and the All-Pairs, Some-Pairs, Wavefront, and directed-graph abstractions, and shows how exposing the structure of both the data and the computation improves scalability, performance, and fault tolerance.



Presentation Transcript


  1. Getting Beyond the Filesystem: New Models for Data Intensive Scientific Computing
  Douglas Thain, University of Notre Dame
  HEC FSIO Workshop, 6 August 2009

  2. The Standard Model
  (1) Upload data to the file system. (2) Submit 1M jobs to the batch system. (3) The jobs access the data from the compute nodes.
  [Diagram: racks of disks behind the file system and racks of CPUs behind the batch system, with files F flowing to programs P.]
  The user has to guess at what the filesystem is good at. The FS has to guess what the user is going to do next.

  3. Understand what the user is trying to accomplish… not what they are struggling to run on this machine right now.

  4. The Context: Biometrics Research
  • Computer Vision Research Lab at Notre Dame.
  • Kinds of research questions:
    • How does human variation affect matching?
    • What’s the reliability of 3D versus 2D?
    • Can we reliably extract a good iris image from an imperfect video, and use it for matching?
  • A great context for systems research:
    • Committed collaborators, lots of data, problems of unbounded size, some time sensitivity, real impact.
    • Acquiring terabytes of data per month; sustained 90% CPU utilization.
    • Always composing workloads that break something.

  5. BXGrid Schema
  • General metadata: fileid = 24305, size = 300K, type = jpg, sum = abc123…
  • Scientific metadata: domain-specific fields attached to each file.
  • Immutable replicas: replicaid=423 state=ok; replicaid=105 state=ok; replicaid=293 state=creating; replicaid=102 state=deleting.
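A minimal sketch of how the schema on this slide might be represented. The field names (fileid, size, type, sum, replicaid, state) come from the slide; the Python classes themselves are illustrative assumptions, not BXGrid's actual implementation.

    from dataclasses import dataclass, field
    from typing import Dict, List

    @dataclass
    class Replica:
        replicaid: int   # e.g. 423
        state: str       # "ok", "creating", or "deleting"

    @dataclass
    class FileRecord:
        fileid: int                     # general metadata, e.g. 24305
        size: str                       # e.g. "300K"
        type: str                       # e.g. "jpg"
        checksum: str                   # "sum" on the slide, e.g. "abc123..."
        science: Dict[str, str] = field(default_factory=dict)   # scientific metadata
        replicas: List[Replica] = field(default_factory=list)   # immutable replicas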

  6. BXGrid Abstractions
  S = Select( color=“brown” )
  B = Transform( S, G )
  M = AllPairs( A, B, F )
  [Diagram: selected eye records (eye, color) are transformed by G into templates B1…Bn, which are compared all-to-all by F to produce an ROC curve.]
  H. Bui et al., “BXGrid: A Repository and Experimental Abstraction,” Special Issue on e-Science, Journal of Cluster Computing, 2009.
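Illustrative stand-ins for the Select and Transform abstractions named on this slide; the one-line definitions below are assumptions about their semantics, shown only to make the composition concrete.

    def Select(records, **criteria):
        # keep only records whose scientific metadata matches, e.g. color="brown"
        return [r for r in records if all(r.get(k) == v for k, v in criteria.items())]

    def Transform(S, G):
        # apply a per-item transformation G (e.g. iris template extraction) to each selection
        return [G(s) for s in S]

    # Composed as on the slide:
    #   S = Select(records, color="brown")
    #   B = Transform(S, G)
    #   M = AllPairs(A, B, F)   # see the AllPairs sketch under slide 8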

  7. An abstraction is a declarative specification of both the data and the computation in a batch workload. (Think SQL for clusters.)

  8. AllPairs( set A, set B, func F )
  [Diagram: the full |A| × |B| result matrix, with entry (i, j) = F(A[i], B[j]).]
  Moretti, et al., “All-Pairs: An Abstraction for Data Intensive Computing on Campus Grids,” TPDS, to appear in 2009.
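A sequential sketch of what AllPairs computes, to pin down the semantics shown in the matrix above. The distributed implementation described in the paper is not reproduced here; the comparison function name in the example is hypothetical.

    def all_pairs(A, B, F):
        # AllPairs(set A, set B, func F): the full len(A) x len(B) matrix
        # with M[i][j] = F(A[i], B[j]). The real system partitions this
        # matrix across a cluster; this sketch only states what is computed.
        return [[F(a, b) for b in B] for a in A]

    # Example use (hypothetical comparison function):
    #   M = all_pairs(iris_set_a, iris_set_b, compare_templates)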

  9. > AllPairs IrisA IrisB MyComp.exe

  10. How can All-Pairs do better?
  [Chart: turnaround time versus number of CPUs.]
  • No demand paging!
  • Distribute the data via spanning tree.
  • Metadata lookup can be cached forever.
  • File locations can be cached optimistically.
  • Compute the right number of CPUs to use.
  • Check for data integrity, but on failure, don’t abort, just migrate.
  • On a multicore CPU, use multiple threads and a cache-aware access algorithm (a cache-blocking sketch follows this slide).
  • Decide at runtime whether to run here on 4 cores, or run on the entire cluster.
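A hypothetical sketch of the “cache-aware access” bullet: walk the comparison matrix in tile-by-tile blocks so a small slice of A and B is reused many times while it is still resident in cache. The tile size and the code are assumptions, not the paper’s implementation.

    def all_pairs_blocked(A, B, F, tile=64):
        # process the len(A) x len(B) matrix in tile x tile blocks
        M = [[None] * len(B) for _ in A]
        for i0 in range(0, len(A), tile):
            for j0 in range(0, len(B), tile):
                for i in range(i0, min(i0 + tile, len(A))):
                    for j in range(j0, min(j0 + tile, len(B))):
                        M[i][j] = F(A[i], B[j])
        return M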

  11. What’s the Upshot?
  • The end user gets scalable performance, always at the “sweet spot” of the system.
  • It works despite many different failure modes, including uncooperative admins.
  • A 60K × 60K All-Pairs ran on 100-500 CPUs over one week, the largest known experiment on public data in the field.
  • The All-Pairs abstraction is 4X more efficient than the standard model.

  12. AllPairs( set A, set B, func F )
  [Repeat of slide 8: the full |A| × |B| result matrix.]
  Moretti, et al., “All-Pairs: An Abstraction for Data Intensive Computing on Campus Grids,” TPDS, to appear in 2009.

  13. SomePairs( set A, set B, pairs P, func F )
  [Diagram: only the listed index pairs are evaluated, e.g. P = (0,1) (0,2) (1,1) (1,2) (2,0) (2,2) (2,3).]
  Moretti, et al., “Scalable Modular Genome Assembly on Campus Grids,” under review in 2009.
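A sequential sketch of the SomePairs semantics, for contrast with AllPairs above: only the explicitly requested index pairs are evaluated.

    def some_pairs(A, B, P, F):
        # SomePairs(set A, set B, pairs P, func F): evaluate F only on the
        # listed pairs, e.g. P = [(0,1), (0,2), (1,1), (1,2), (2,0), (2,2), (2,3)]
        return {(i, j): F(A[i], B[j]) for (i, j) in P}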

  14. Wavefront( R[x,0], R[0,y], F(x,y,d) )
  [Diagram: a recurrence over a grid R[0,0]…R[4,4]. Given the boundary row and column, each interior cell is produced by F from its x, y, and diagonal (d) neighbors, so cells along each anti-diagonal are independent and can be computed in parallel.]
  Yu et al., “Harnessing Parallelism in Multicore Clusters with the All-Pairs and Wavefront Abstractions,” IEEE HPDC 2009.
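A sequential sketch of the Wavefront recurrence. The argument order of F is an assumption based on the slide’s x / y / d (diagonal) labels.

    def wavefront(Rx0, R0y, F):
        # Wavefront( R[x,0], R[0,y], F(x,y,d) ): given the boundary column
        # R[x][0] and boundary row R[0][y], fill each interior cell from its
        # three already-computed neighbors.
        n, m = len(Rx0), len(R0y)
        R = [[None] * m for _ in range(n)]
        for x in range(n):
            R[x][0] = Rx0[x]
        for y in range(m):
            R[0][y] = R0y[y]
        for x in range(1, n):
            for y in range(1, m):
                R[x][y] = F(R[x - 1][y], R[x][y - 1], R[x - 1][y - 1])
        return R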

  15. Directed Graph
  part1 part2 part3: input.data split.py
      ./split.py input.data
  out1: part1 mysim.exe
      ./mysim.exe part1 >out1
  out2: part2 mysim.exe
      ./mysim.exe part2 >out2
  out3: part3 mysim.exe
      ./mysim.exe part3 >out3
  result: out1 out2 out3 join.py
      ./join.py out1 out2 out3 > result
  Thain and Moretti, “Abstractions for Cloud Computing with Condor,” in Handbook of Cloud Computing Services, in press.
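The slide’s rules read like Make: each rule names its outputs, its inputs, and the command that produces the outputs. Below is a sketch of the same graph as data, plus a tiny in-order runner; the runner is an assumption that only illustrates the dependency semantics, not the real engine, which dispatches ready rules to a cluster (out1, out2, and out3 could run in parallel).

    import subprocess

    # (outputs, inputs, command) tuples, copied from the slide
    rules = [
        (["part1", "part2", "part3"], ["input.data", "split.py"], "./split.py input.data"),
        (["out1"], ["part1", "mysim.exe"], "./mysim.exe part1 >out1"),
        (["out2"], ["part2", "mysim.exe"], "./mysim.exe part2 >out2"),
        (["out3"], ["part3", "mysim.exe"], "./mysim.exe part3 >out3"),
        (["result"], ["out1", "out2", "out3", "join.py"], "./join.py out1 out2 out3 > result"),
    ]

    def run(rules):
        produced = {o for outs, _, _ in rules for o in outs}   # files some rule will create
        done, pending = set(), list(rules)
        while pending:
            for rule in pending:
                outputs, inputs, command = rule
                # a rule is ready when every input is a pre-existing file or already built
                if all(i in done or i not in produced for i in inputs):
                    subprocess.run(command, shell=True, check=True)
                    done.update(outputs)
                    pending.remove(rule)
                    break
            else:
                raise RuntimeError("circular or unsatisfiable dependencies")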

  16. Contributions
  • Repository for Scientific Data:
    • Provide a better match with end user goals.
    • Simplify scalability, management, recovery…
  • Abstractions for Large Scale Computing:
    • Expose the actual data needs of the computation.
    • Simplify scalability, consistency, recovery…
    • Enable transparent portability from single CPU to multicore to cluster to grid.
  • Are there other useful abstractions?

  17. HECURA Summary
  • Impact on real science:
    • Production-quality data repository used to produce datasets accepted by NIST and published to the community. “Indispensable.”
    • Abstractions used to scale up a student’s daily “scratch” work to 500 CPUs; increased data coverage of published results by two orders of magnitude.
    • Planning to adapt the prototype to a Civil Engineering CDI project.
  • Impact on computer science:
    • Open source tools that run on Condor, SGE, and multicore, recently with independent users.
    • 1 chapter, 2 journal papers, 4 conference papers.
  • Broader impact:
    • Integration into courses; used by other projects at ND.
    • Undergrad co-authors from ND and elsewhere.

  18. Participants
  • Faculty: Douglas Thain and Patrick Flynn
  • Graduate Students:
    • Christopher Moretti (Abstractions)
    • Hoang Bui (Repository)
    • Li Yu (Multicore)
  • Undergraduates:
    • Jared Bulosan
    • Michael Kelly
    • Christopher Lyon (North Carolina A&T Univ.)
    • Mark Pasquier
    • Kameron Srimoungchanh (Clemson Univ.)
    • Rachel Witty

  19. For more information:
  • The Cooperative Computing Lab: http://www.cse.nd.edu/~ccl
  • Cloud Computing and Applications, October 20-21, Chicago IL: http://www.cca09.org
  This work was supported by the National Science Foundation via grant CCF-06-21434.
