The DryadLINQ Approach to Distributed Data-Parallel Computing

The DryadLINQ Approach to Distributed Data-Parallel Computing Yuan Yu Microsoft Research Silicon Valley

Distributed Data-Parallel Computing • Dryad talk: the execution layer • How to reliably and efficiently execute distributed data-parallel programs on a compute cluster? • This talk: the programming model • How to write distributed data-parallel programs for a compute cluster?

The Programming Model • Sequential, single machine programming abstraction • Same program runs on single-core, multi-core, or cluster • Preserve the existing programming environments • Modern programming languages (C# and Java) are very good • Expressive language and data model • Strong static typing, GC, generics, … • Modern IDEs (Visual Studio and Eclipse) are very good • Great debugging and library support • Legacy code could be easily reused

Dryad and DryadLINQ DryadLINQ provides automatic query plan generationDryad provides automatic distributed execution

Outline • Programming model • DryadLINQ • Applications • Discussions and conclusion

LINQ • Microsoft’s Language INtegrated Query • Available in .NET3.5 and Visual Studio 2008 • A set of operators to manipulate datasets in .NET • Support traditional relational operators • Select, Join, GroupBy, Aggregate, etc. • Integrated into .NET programming languages • Programs can invoke operators • Operators can invoke arbitrary .NET functions • Data model • Data elements are strongly typed .NET objects • Much more expressive than relational tables • For example, nested data structures

LINQ Framework Scalability Local machine Execution engines .Netprogram (C#, VB, F#, etc) DryadLINQ Cluster Query PLINQ Multi-core LINQ provider interface LINQ-to-SQL Objects LINQ-to-Obj Single-core

A Simple LINQ Query IEnumerable<BabyInfo> babies = ...; varresults = from baby in babies where baby.Name == queryName && baby.State == queryState && baby.Year >= yearStart && baby.Year <= yearEnd orderbybaby.Yearascending select baby;

A Simple PLINQ Query IEnumerable<BabyInfo> babies = ...; varresults = from baby in babies.AsParallel() where baby.Name == queryName && baby.State == queryState && baby.Year >= yearStart && baby.Year <= yearEnd orderbybaby.Yearascending select baby;

A Simple DryadLINQ Query PartitionedTable<BabyInfo> babies = PartitionedTable.Get<BabyInfo>(“BabyInfo.pt”); varresults = from baby in babies where baby.Name == queryName && baby.State == queryState && baby.Year >= yearStart && baby.Year <= yearEnd orderbybaby.Yearascending select baby;

DryadLINQ Data Model .Net objects Partition Partitioned Table • Partitioned table exposes metadata information • type, partition, compression scheme, serialization, etc.

Demo • It is just programming • The same familiar programming languages, development tools, libraries, etc.

K-means Execution Graph C0 ac cc C1 ac ac P1 P2 ac cc C2 P3 ac ac ac cc C3 ac ac

K-means in DryadLINQ public class Vector { public double[] entries; [Associative] public static Vector operator +(Vector v1, Vector v2) { … } public static Vector operator -(Vector v1, Vector v2) { … } public double Norm2() { …} } public static VectorNearestCenter(Vector v, IEnumerable<Vector> centers) { return centers.Aggregate((r, c) => (r - v).Norm2() < (c - v).Norm2() ? r : c); } public static IQueryable<Vector> Step(IQueryable<Vector> vectors, IQueryable<Vector> centers) { return vectors.GroupBy(v => NearestCenter(v, centers)) .Select(group => group.Aggregate((x,y) => x + y) / group.Count()); } var vectors = PartitionedTable.Get<Vector>("dfs://vectors.pt"); var centers = vectors.Take(100); for (inti = 0; i < 10; i++) { centers = Step(vectors, centers); } centers.ToPartitionedTable<Vector>(“dfs://centers.pt”);

PageRank Execution Graph cc N01 ae D N11 cc ae D N12 N02 ae D cc N13 N03 cc E1 ae D N21 cc ae D N22 E2 ae D cc N23 E3 cc ae D N31 cc ae D N32 ae D cc N33

PageRank in DryadLINQ public static IQueryable<Rank> Step(IQueryable<Page> pages, IQueryable<Rank> ranks) { // join pages with ranks, and disperse updates varupdates =frompageinpages joinrankinranksonpage.nameequalsrank.name selectpage.Disperse(rank); // re-accumulate. return fromlistinupdates fromrankinlist grouprank.rankbyrank.nameintog select new Rank(g.Key, g.Sum()); } public struct Page { public UInt64 name; public Int64 degree; public UInt64[] links; public Page(UInt64 n, Int64 d, UInt64[] l) { name = n; degree = d; links = l; } public Rank[] Disperse(Rank rank) { Rank[] ranks = new Rank[links.Length]; double score = rank.rank / this.degree; for (int i = 0; i < ranks.Length; i++) { ranks[i] = new Rank(this.links[i], score); } return ranks; } } public struct Rank { public UInt64 name; public double rank; public Rank(UInt64 n, double r) { name = n; rank = r; } } var pages = PartitionedTable.Get<Page>(“dfs://pages.pt”); var ranks = pages.Select(page=>new Rank(page.name, 1.0)); // repeat the iterative computation several times for (intiter = 0; iter < n; iter++) { ranks = Step(pages, ranks); } ranks.ToPartitionedTable<Rank>(“dfs://ranks.pt”);

MapReduce in DryadLINQ MapReduce(source, // sequence of Ts mapper, // T -> Ms keySelector, // M -> K reducer) // (K, Ms) -> Rs { var map = source.SelectMany(mapper); var group = map.GroupBy(keySelector); var result = group.SelectMany(reducer); return result; // sequence of Rs } // Ex: Count the frequencies of words in a book MapReduce(book, line => line.Split(' '), w => w.ToLower(), g => g.Count())

DryadLINQ System Architecture Client machine Cluster Dryad DryadLINQ .NET program Distributedquery plan Invoke Query Expr Query plan Vertexcode Input Tables LINQ query Dryad Execution Output Table .Net Objects Results Output Tables foreach (11)

DryadLINQ • Distributed execution plan generation • Static optimizations: pipelining, eager aggregation, etc. • Dynamic optimizations: data-dependent partitioning, dynamic aggregation, etc. • Vertex runtime • Single machine (multi-core) implementation of LINQ • Vertex code that runs on vertices • Data serialization code • Callback code for runtime dynamic optimizations • Automatically distributed to cluster machines

A Simple Example • Count the frequencies of words in a book var map = book.SelectMany(line => line.Split(' ')); var group = map.GroupBy(w => w.ToLower()); var result = group.Select(g => g.Count());

Naïve Execution Plan line.Split(' ') Map ..... M M M map Distribute D D D Merge MG MG MG g.Count() GroupBy … G G G reduce Reduce R R R Consumer X X X

Execution Plan Using Partial Aggregation Map M M M GroupBy g.Count() G1 G1 G1 map InitialReduce IR IR IR Distribute D D D g.Sum() Merge MG MG GroupBy aggregation tree G2 G2 Combine C C Merge MG MG GroupBy G3 G3 g.Sum() reduce FinalReduce F F Consumer X X

Challenge: Program Analysis Support • The main sources of difficulty • Complicated data model • User-defined functions all over the places • Requires sophisticated static program analysis at byte-code level • Mainly some form of flow analysis • Possible with modern programming languages and runtimes, such as C#/CLR

Inferring Dataset Property • Useful for query optimizations Hash(y => y.a+y.b) Select(x => new { A=x.a+x.b, B=f(x.c) } Hash(z => z.A) GroupBy(x => x.A) No data partitioning at GroupBy

Inferring Dataset Property • Useful for query optimizations Hash(y => y.a+y.b) Select(x => Foo(x) } Need IL-level data-flow analysis of Foo Hash(x => x.A)? GroupBy(x => x.A)

Caching Query Results • A cluster-wide caching service to support • Reuse of common subqueries • Incremental computations • Cache: [Key Value] • Key is <q, d>, Value is the result of q(d) • Key requires a pretty good reachability analysis • For correctness, Key must include “everything” reachable from q • For performance, Key should only contain things q depends on

More Static Analysis • Purity checking • All the functions called in DryadLINQ queries must be side-effect free • DryadLINQ (and PLINQ) doesn’t enforce it • Metadata validity checking • Partitioned table’s metadata contains the record type, partition scheme, serialization functions, … • Need to determine if the metadata is valid • DryadLINQ doesn’t fully enforce it • Static enforcement of program properties for security and privacy mechanisms

Examples of DryadLINQ Applications • Data mining • Analysis of service logs for network security • Analysis of Windows Watson/SQM data • Cluster monitoring and performance analysis • Graph analysis • Accelerated Page-Rank computation • Road network shortest-path preprocessing • Image processing • Image indexing • Decision tree training • Epitome computation • Simulation • light flow simulations for next-generation display research • Monte-Carlo simulations for mobile data • eScience • Machine learning platform for health solutions • Astrophysics simulation

Decision Tree TrainingMihai Budiu, Jamie Shotton et al Learn a decision tree to classify pixels in a large set of images label image Decision Tree Machinelearning 1M images x 10,000 pixels x 2,000 features x 221 tree nodes Complexity >1020 objects

Sample Execution plan Initial empty tree Image inputs, partitioned Read, preprocess Redistribute Histogram Regroup histograms on node Compute new tree layer Compute new tree Broadcast new tree Final tree

Application Details • Workflow = 37 DryadLINQ jobs • 12 hours running time on 235 machines • More than 100,000 processes • More than 100 days of CPU time • Recovers from several failures daily • 34,000 lines of .NET code

Windows SQM Data Analysis Michal Strehovsky, Sivarudrappa Mahesh et al

The Language Integration Approach • Single unified programming environment • Unified data model and programming language • Direct access to IDE and libraries • Different from SQL, HIVE, Pig Latin • Multiple layers of languages and data models • Works out very well, but requires good programming language supports • LINQ extensibility: custom operators/providers • .NET reflection, dynamic code generation, …

Combining with PLINQ Query DryadLINQ subquery PLINQ The combination of PLINQ and DryadLINQ delivers computation to every core in the cluster

Acyclic Dataflow Graph • Acyclic dataflow graph provides a very powerful computation model • Easy target for higher-level programming abstractions such as DryadLINQ • Easy expression of many data-parallel optimizations • We designed Dryad to be general and flexible • Programmability is less of a concern • Used primarily to support higher-level programming abstractions • No major changes made to Dryad in order to support DryadLINQ

Expectation Maximization (Gaussians) • Generated by DryadLINQ • 3 iterations shown

Decoupling of Dryad and DryadLINQ • Separation of concerns • Dryad layer concerns scheduling and fault-tolerance • DryadLINQ layer concerns the programming model and the parallelization of programs • Result: powerful and expressive execution engine and programming model • Different from the MapReduce/Hadoop approach • A single abstraction for both programming model and execution engine • Result: very simple, but very restricted execution engine and language

Software Stack …… Graph Analysis eScience MachineLearning Image Processing DataMining Applications DryadLINQ Dryad CIFS/NTFS SQL Servers Azure DFS Cosmos DFS Cluster Services (Azure, HPC, or Cosmos) Windows Server Windows Server Windows Server Windows Server

Conclusion • Single unified programming environment • Unified data model and programming language • Direct access to IDE and libraries • An open and extensible system • Many LINQ providers out there • Existing ones: LINQ-to-XML, LINQ-to-SQL, PLINQ, … • Very easy to write one for your app domain • Dryad/DryadLINQ scales out all of them!

Availability • Freely available for academic use • http://connect.microsoft.com/DryadLINQ • DryadLINQ source, Dryad binaries, documentation, samples, blog, discussion group, etc. • Will be available soon for commercial use • Free, but no product support

The DryadLINQ Approach to Distributed Data-Parallel Computing