
Large Scale Data Processing with DryadLINQ



  1. Large Scale Data Processing with DryadLINQ Dennis Fetterly Microsoft Research, Silicon Valley Workshop on Data-Intensive Scientific Computing Using DryadLINQ

  2. Outline • Brief introduction to TidyFS • Preparing/loading data onto a cluster • Desirable properties in a Dryad cluster • Detailed description of several IR algorithms

  3. TidyFS Goals
  • A simple distributed filesystem that provides the abstractions necessary for data-parallel computations
  • A high-performance, reliable, scalable service
  • Workload: high-throughput, sequential, write-once IO from cluster machines working in parallel
  • Terasort example: 240 machines reading at 240 MB/s ≈ 56 GB/s aggregate; 240 machines writing at 160 MB/s ≈ 37 GB/s aggregate

  4. TidyFS Names
  • Stream: a sequence of partitions
  • e.g. tidyfs://dryadlinqusers/fetterly/clueweb09-English
  • Streams can have leases, for temp files or for cleanup after crashes
  • Partition:
  • Immutable, named by a 64-bit identifier
  • Can be a member of multiple streams
  • Stored as an NTFS file on cluster machines
  • Multiple replicas of each partition can be stored
  (Diagram: Stream-1 → Part 1, Part 2, Part 3, Part 4)
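The stream/partition model above can be captured in a minimal sketch. These are hypothetical illustrative types, not the actual TidyFS API; the real service keeps this metadata in a central server, not in client objects.

```csharp
using System.Collections.Generic;

// Hypothetical sketch of the TidyFS data model described above.
class Partition
{
    public ulong Id;                                    // 64-bit partition identifier
    public List<string> Replicas = new List<string>();  // machines holding an NTFS file copy
}

class Stream
{
    public string Name;  // e.g. "tidyfs://dryadlinqusers/fetterly/clueweb09-English"
    public List<Partition> Parts = new List<Partition>();  // ordered sequence of partitions
}
```

Because partitions are immutable, the same Partition object can safely appear in the Parts list of several streams, which is what makes cheap stream concatenation and sharing possible.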

  5. Preparation of Data • Often substantially harder than it appears • Issues: • Data format • Distribution of data • Network bandwidth • Generating synthetic datasets is sometimes useful

  6. Data Prep – Format • Text records are simplest • Caveat – information that is not in the line • e.g. - if a line number encodes information • Binary records often require custom code to load to cluster • Serialization/de-serialization code generated by DryadLINQ uses C# Reflection

  7. Custom Deserialization Code

  public class UrlDocIdScoreQuery
  {
      public string queryId;
      public string url;
      public string docId;
      public string queryString;
      public double score;

      public static UrlDocIdScoreQuery Read(DryadBinaryReader reader)
      {
          UrlDocIdScoreQuery rec = new UrlDocIdScoreQuery();
          rec.queryId = ReadAnyString(reader);
          rec.queryString = ReadAnyString(reader);
          rec.url = ReadAnyString(reader);
          rec.docId = ReadAnyString(reader);
          rec.score = reader.ReadDouble();
          return rec;
      }

      public static string ReadAnyString(DryadBinaryReader dbr) { … }
  }

  8. Data Prep - Loading • DryadLINQ job • Often needs a dummy input anchor • Custom program • Write records to TidyFS partitions • “SneakerNet” often a good option

  9. Data Loading - DryadLINQ
  • Need input “anchor” to run on cluster
  • Generate or use existing stream
  • Sample:

  IEnumerable<Entry> GenerateEntries(Random x, int numItems)
  {
      for (int i = 0; i < numItems; i++)
      {
          // code to generate records
          yield return record;
      }
  }

  10. Data Gen - DryadLINQ
  • Need input “anchor” to run on cluster
  • Generate or use existing stream
  • Sample:

  IEnumerable<Entry> GenerateEntries(Random x, int numItems)
  {
      for (int i = 0; i < numItems; i++)
      {
          // code to generate records
          yield return record;
      }
  }

  11. DryadLINQ Job

  var streamname = "tidyfs://datasets/anchor";
  var os = @"tidyfs://msri/teamname/data?compression=" + CompressionScheme.GZipFast;
  var r = PartitionedTable.Get<int>(streamname)
      .Take(1)
      .SelectMany(x => Enumerable.Range(0, partitions))
      .HashPartition(x => x, partitions)
      .Select(x => new Random(x))
      .SelectMany(x => GenerateEntries(x, numItems))
      .ToPartitionedTable(os);

  12. Data Loading - Databases
  • Bulk copy into files
  • Use queries to produce multiple files
  • Perform queries within a DryadLINQ UDF

  IEnumerable<Entry> PerformQuery(string queryArg)
  {
      var results = "select * from …";  // pseudocode: execute this query against the database
      foreach (var record in results)
      {
          yield return record;
      }
  }

  13. Building a Cluster
  • Overall goal – a high-throughput system
  • Not latency sensitive
  • More slower computers are often better than fewer faster computers
  • Multiple cores matter more than clock frequency
  • Multiple disks – increase throughput
  • Sufficient RAM

  14. Networking a Cluster • Network topology – medium to large clusters • Attempt to maximize cross rack bandwidth • Two tier topology • Rack switches and core switches • Port aggregation • Bond multiple connections together • 1 GbE or 10 GbE
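As a rough illustration of why the two-tier topology and port aggregation matter, the rack's oversubscription ratio can be computed from the link counts. The numbers below are illustrative assumptions, not figures from the talk:

```csharp
using System;

class RackMath
{
    static void Main()
    {
        // Illustrative values, not from the talk.
        int serversPerRack = 40;   // servers attached to one rack switch
        double nicGbps = 1.0;      // 1 GbE per server
        int uplinks = 2;           // rack-switch uplinks bonded via port aggregation
        double uplinkGbps = 10.0;  // 10 GbE uplinks to the core switches

        double demand = serversPerRack * nicGbps;  // worst-case cross-rack demand
        double supply = uplinks * uplinkGbps;      // aggregated uplink capacity

        // 40 Gbps of demand over 20 Gbps of uplink = 2.0:1 oversubscription
        Console.WriteLine($"Oversubscription: {demand / supply:F1}:1");
    }
}
```

Bonding more uplinks (or moving to 10 GbE at the servers with proportionally fatter uplinks) drives this ratio toward 1:1, which is what "maximize cross-rack bandwidth" means in practice.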

  15. Cluster Software • Runs on Windows HPC Server 2008 • Academic Release • For non-commercial use • Commercial License

  16. DryadLINQ IR Toolkit • Library that uses DryadLINQ • Source code for a number of IR algorithms • Text retrieval - BM25/BM25F • Link based ranking - PageRank/SALSA-SETR • Text processing - Shingle based duplicate detection • Designed to work well with ClueWeb09 collection • Including preprocessing the data to load the cluster • Available from http://research.microsoft.com/dryadlinqir/

  17. ClueWeb09 Collection • Collected/Distributed by CMU • 1 billion web pages crawled in Jan/Feb 2009 • 10 different languages • en, zh, es, ja, de, fr, ko, it, pt, ar • 5 TB, compressed - 25 TB, uncompressed • Available to research community • Dataset available for your projects • Web graph, 503m English web pages

  18. Example: Term Frequencies

  Count term frequencies in a set of documents:

  var docs = PartitionedTable.Get<Doc>("tidyfs://dennis/docs");
  var words = docs.SelectMany(doc => doc.words);
  var groups = words.GroupBy(word => word);
  var counts = groups.Select(g => new WordCount(g.Key, g.Count()));
  counts.ToPartitionedTable("tidyfs://dennis/counts.txt");

  (Diagram: IN → SelectMany (doc => doc.words) → GroupBy (word => word) → Select (g => new …) → OUT, with stream metadata at each end)

  19. Distributed Execution of Term Freq
  (Diagram: DryadLINQ translates the LINQ expression IN → SelectMany → GroupBy → Select → OUT into a Dryad execution graph)

  20. Execution Plan for Term Frequency
  (Diagram, stage (1): SelectMany → Sort → GroupBy → Count, pipelined; then Distribute; stage (2): Mergesort → GroupBy → Sum, pipelined)

  21. Execution Plan for Term Frequency
  (Diagram: the same plan with four parallel instances of stage (1) (SelectMany, Sort, GroupBy, Count), each distributing to four instances of stage (2) (Mergesort, GroupBy, Sum))

  22. BM25 “Grep”
  • For batch evaluation of queries, calculating BM25 is just a select operation

  string queryTermDocFreqURLLocal = @"E:\TREC\query-doc-freqs.txt";
  Dictionary<string, int> dfs = GetDocFreqs(queryTermDocFreqURLLocal);
  PartitionedTable<InitialWordRecord> initialWords =
      PartitionedTable.Get<InitialWordRecord>(initialWordsURL);
  var BM25s = from doc in initialWords
              select ComputeDocBM25(queries, doc, dfs);
  BM25s.ToPartitionedTable("tidyfs://dennis/scoredDocs");

  23. PageRank
  • Ranks web pages by propagating scores along the hyperlink structure
  • Each iteration as an SQL query:
  • Join edges with ranks
  • Distribute rank on edges
  • GroupBy edge destination
  • Aggregate into ranks
  • Repeat

  24. One PageRank Step in DryadLINQ

  // one step of pagerank: dispersing and re-accumulating rank
  public static IQueryable<Rank> PRStep(IQueryable<Page> pages,
                                        IQueryable<Rank> ranks)
  {
      // join pages with ranks, and disperse updates
      var updates = from page in pages
                    join rank in ranks on page.name equals rank.name
                    select page.Disperse(rank);

      // re-accumulate
      return from list in updates
             from rank in list
             group rank.rank by rank.name into g
             select new Rank(g.Key, g.Sum());
  }

  25. A Complete DryadLINQ Program

  public static IQueryable<Rank> PRStep(IQueryable<Page> pages,
                                        IQueryable<Rank> ranks)
  {
      // join pages with ranks, and disperse updates
      var updates = from page in pages
                    join rank in ranks on page.name equals rank.name
                    select page.Disperse(rank);

      // re-accumulate
      return from list in updates
             from rank in list
             group rank.rank by rank.name into g
             select new Rank(g.Key, g.Sum());
  }

  public struct Page
  {
      public UInt64 name;
      public Int64 degree;
      public UInt64[] links;

      public Page(UInt64 n, Int64 d, UInt64[] l)
      { name = n; degree = d; links = l; }

      public Rank[] Disperse(Rank rank)
      {
          Rank[] ranks = new Rank[links.Length];
          double score = rank.rank / this.degree;
          for (int i = 0; i < ranks.Length; i++)
          {
              ranks[i] = new Rank(this.links[i], score);
          }
          return ranks;
      }
  }

  public struct Rank
  {
      public UInt64 name;
      public double rank;
      public Rank(UInt64 n, double r) { name = n; rank = r; }
  }

  var pages = DryadLinq.GetTable<Page>("tidyfs://pages.txt");
  // repeat the iterative computation several times
  var ranks = pages.Select(page => new Rank(page.name, 1.0));
  for (int iter = 0; iter < iterations; iter++)
  {
      ranks = PRStep(pages, ranks);
  }
  ranks.ToDryadTable<Rank>("outputranks.txt");

  26. PageRank Optimizations
  • Benchmark: PageRank on a 954M-page graph
  • Naïve approach: 10 iterations, ~3.5 hours, 1.2 TB
  • Apply several optimizations:
  • Change the data distribution
  • Pre-group pages by host
  • Rename host groups with dense names
  • Cull out leaf nodes
  • Pre-aggregate ranks for each host
  • Final version: 10 iterations, 11.5 minutes, 116 GB
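The "pre-aggregate ranks for each host" idea can be sketched in plain LINQ-to-Objects. This is an illustrative sketch, not the talk's actual plan: the HostOf helper and the packed page ids are hypothetical, and in the real job the local aggregation happens per partition before data crosses the network.

```csharp
using System;
using System.Linq;

class PreAggregate
{
    // Hypothetical: a page id packs its (dense, renamed) host group
    // in the high 32 bits, so the host is recoverable with a shift.
    static ulong HostOf(ulong page) => page >> 32;

    static void Main()
    {
        // Rank contributions flowing along edges (page id, rank share).
        var updates = new[] { (page: 0x100000001UL, rank: 0.2),
                              (page: 0x100000002UL, rank: 0.3),
                              (page: 0x200000001UL, rank: 0.5) };

        // Combine contributions destined for the same host before the
        // global aggregation; most links are intra-host, so this shrinks
        // the data that must be shuffled across the network.
        var perHost = updates
            .GroupBy(u => HostOf(u.page))
            .Select(g => (host: g.Key, rank: g.Sum(u => u.rank)));

        foreach (var h in perHost)
            Console.WriteLine($"host {h.host}: {h.rank:F1}");
    }
}
```

Grouping by host works because hyperlink locality makes host-level sums much smaller than the raw edge-level update stream, which is a large part of the 1.2 TB → 116 GB reduction reported above.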

  27. Tactics for Improving Performance • Loop unrolling • Reduce data movement • Improve data locality • Choose what to Group
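In DryadLINQ, "loop unrolling" means composing several iterations into one query before materializing, so the optimizer can pipeline across iteration boundaries. A fragment-level sketch, assuming the PRStep method and the pages/ranks variables from slide 25 (unrollFactor and tempUri are illustrative names):

```csharp
// Sketch: unroll several PageRank iterations into one DryadLINQ query,
// materializing once per unrolled group instead of once per iteration.
const int unrollFactor = 4;  // illustrative choice
for (int iter = 0; iter < iterations; iter += unrollFactor)
{
    // Each PRStep call only extends the lazy query expression.
    for (int u = 0; u < unrollFactor && iter + u < iterations; u++)
        ranks = PRStep(pages, ranks);

    // One job executes the fused 4-iteration plan and writes a checkpoint.
    ranks = ranks.ToPartitionedTable(tempUri);
}
```

The trade-off: larger unrolled plans reduce per-job startup and intermediate IO, but lengthen the recovery unit if a job fails.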

  28. Gotchas
  • Non-deterministic output
  • e.g. an RNG in a user-defined function
  • Writing to shared state
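A common fix for the RNG gotcha is to derive the seed deterministically from each record (or partition), so that a vertex re-executed after a failure produces byte-identical output. A minimal self-contained sketch:

```csharp
using System;
using System.Linq;

class DeterministicRng
{
    static void Main()
    {
        var keys = new[] { 7, 42, 99 };

        // Bad: new Random() seeds from the clock, so a re-executed vertex
        // emits different records than the original attempt did.
        // Good: seed from the record key, so every rerun agrees.
        var noisy = keys.Select(k => (key: k,
                                      noise: new Random(k).NextDouble()));

        foreach (var r in noisy)
            Console.WriteLine($"{r.key}: {r.noise:F4}");
    }
}
```

The same discipline applies to the shared-state gotcha: user-defined functions should be pure functions of their inputs, since Dryad may run a vertex more than once.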

  29. Schedule for Today • 9:30 – 10:00 Meet with team, finalize project • 10:30-12:00 Work on projects, discuss approach with a speaker

  30. Backup Slides

  31. Cluster Configuration
  (Diagram: a Head Node, TidyFS servers, and cluster machines that run both tasks and the TidyFS storage service)

  32. How a Dryad job reads from TidyFS
  (Diagram: the Job Manager asks the TidyFS Service to list the partitions in the stream and receives “Part 1, Machine 1; Part 2, Machine 2”. It schedules the vertex for Part 1 on Machine 1 and the vertex for Part 2 on Machine 2; each vertex calls GetReadPath to resolve its partition to a local file such as D:\tidyfs\0001.data or D:\tidyfs\0002.data.)

  33. How a Dryad job writes to TidyFS
  (Diagram: the Job Manager creates one temporary stream per output vertex, Str1_v1 and Str1_v2. It schedules Vertex 1 on Machine 1, which writes Part 1 of Str1_v1, and Vertex 2 on Machine 2, which writes Part 2 of Str1_v2.)

  34. How a Dryad job writes to TidyFS
  (Diagram, continued: each vertex calls GetWritePath to obtain a local path such as D:\tidyfs\0001.data, writes its data, and calls AddPartitionInfo with the partition, machine, size, fingerprint, etc. When both vertices complete, the Job Manager creates stream Str1, calls ConcatenateStreams(str1, str1_v1, str1_v2), and deletes the temporary streams str1_v1 and str1_v2.)
