This presentation introduces Delite, a dynamic parallel runtime that accelerates machine learning applications through implicit parallelism and domain-specific optimizations. Applications are written in OptiML, a MATLAB-like DSL embedded in Scala, and Delite schedules the resulting task graph on heterogeneous hardware (CPUs and GPUs), integrating task and data parallelism in a single environment. Experiments on three ML kernels (Gaussian Discriminant Analysis, Naïve Bayes, Weighted Linear Regression) and on Deep Belief Networks report speedups of up to 18.7x on the kernels and 22.3x on DBNs.
Accelerating Machine Learning Applications using Delite • Anand Atreya, Kevin Brown, George Rossin • Stanford University, CS315A • June 1, 2010
What is Machine Learning? • Learning patterns from data • Regression • Inference (e.g. Loopy Belief Propagation) • Adaptive control (e.g. Reinforcement Learning) • Neural networks (e.g. Restricted Boltzmann Machine) • A good domain for studying parallelism • Both throughput and latency are important • Many applications exhibit both data and task parallelism • Often at varying granularities • At the core of many emerging applications (speech recognition, robotics, data mining, etc.) • Many optimizations specific to the domain • e.g., Sacrificing accuracy for performance
Domain Specific Languages • A language or library that exploits domain knowledge for productivity and efficiency • Widely used in many application areas • MATLAB, Verilog, OpenGL • Raises the level of abstraction above that of general-purpose languages • The programmer describes what to compute rather than how to compute it • Allows for an implicitly parallel environment
OptiML: A DSL for ML • Provides a familiar (MATLAB-like) language and API for writing ML applications • Embedded in Scala • Encodes common ML kernels as implicitly parallel operations • Matrix multiply, dot product, etc.
What is Delite? • A dynamic parallel runtime • Domain Extracted Locality Informed Task Execution • Executes a task graph on parallel, heterogeneous hardware • CPUs, GPUs, etc. • Performs both static and dynamic scheduling • Integrates task and data parallelism in a single environment • Can apply dynamic domain-specific optimizations provided by a Domain-Specific Language
Delite Execution Model • Application calls Matrix DSL methods • DSL defers OP execution to Delite • Delite applies generic and domain transformations and generates mapping
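The deferred-execution idea on this slide can be illustrated with a minimal Python sketch. It is not Delite's or OptiML's actual API: `Op`, `Runtime`, `Vector`, `submit`, and `force` are hypothetical names, and the real system is embedded in Scala and also applies transformations and hardware mapping that are omitted here. The sketch only shows DSL methods recording OPs in a task graph instead of computing results immediately.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Op:
    """One task-graph node: a deferred operation and the OPs it depends on."""
    name: str
    fn: Callable[..., object]
    deps: List["Op"] = field(default_factory=list)
    result: Optional[object] = None
    done: bool = False


class Runtime:
    """Hypothetical stand-in for the runtime: records OPs, runs them on demand."""

    def __init__(self) -> None:
        self.graph: List[Op] = []

    def submit(self, name: str, fn: Callable[..., object], *deps: Op) -> Op:
        op = Op(name, fn, list(deps))
        self.graph.append(op)  # recorded only; nothing executes yet
        return op

    def force(self, op: Op) -> object:
        # Evaluate dependencies first (depth-first), then the OP itself.
        if not op.done:
            args = [self.force(d) for d in op.deps]
            op.result = op.fn(*args)
            op.done = True
        return op.result


class Vector:
    """Hypothetical DSL type whose methods defer work to the runtime."""

    def __init__(self, rt: Runtime, op: Op) -> None:
        self.rt, self.op = rt, op

    @staticmethod
    def literal(rt: Runtime, values) -> "Vector":
        data = list(values)
        return Vector(rt, rt.submit("literal", lambda: data))

    def plus(self, other: "Vector") -> "Vector":
        add = lambda a, b: [x + y for x, y in zip(a, b)]
        return Vector(self.rt, self.rt.submit("plus", add, self.op, other.op))

    def dot(self, other: "Vector") -> Op:
        dot = lambda a, b: sum(x * y for x, y in zip(a, b))
        return self.rt.submit("dot", dot, self.op, other.op)


rt = Runtime()
a = Vector.literal(rt, [1, 2, 3])
b = Vector.literal(rt, [4, 5, 6])
pending = a.plus(b).dot(b)        # builds four OPs, computes nothing
computed_before = pending.done    # False: execution was deferred
graph_size = len(rt.graph)
value = rt.force(pending)         # the graph is evaluated only here
```

Because the whole graph exists before anything runs, a runtime in this style has the opportunity to reorder, fuse, or remap OPs before execution, which is the role the slide assigns to Delite.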
Scheduling • An NP-hard problem in general • Very simple local clustering algorithm for general-purpose scheduling • Checks each OP for dependencies on the previous M OPs to minimize communication • Control flow hints • Allow an efficient parallel schedule for loops with independent iterations, without an explicit parallelFor construct • Data-parallel operations • Each OP is split into N chunks for N threads
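The two scheduling ideas on this slide can be sketched in Python under stated assumptions. The slide does not give the actual clustering algorithm, so `cluster_schedule` is only one plausible reading: an OP is placed on the thread of its most recent producer if that producer is among the previous M OPs (`window`), and otherwise assigned round-robin. `split_chunks` and `parallel_map` show an OP being split into N chunks for N threads. All function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Sequence, Tuple


def cluster_schedule(ops: Sequence[Tuple[str, List[str]]],
                     num_threads: int, window: int) -> Dict[str, int]:
    """Assign each OP to a thread, preferring the thread of a recent dependency.

    ops holds (name, dependency names) pairs in submission order. Only the
    previous `window` OPs are inspected (the slide's "previous M OPs").
    """
    assignment: Dict[str, int] = {}
    recent: List[str] = []
    next_thread = 0
    for name, deps in ops:
        local = [d for d in recent[-window:] if d in deps]
        if local:
            # Co-locate with the most recent producer to avoid communication.
            assignment[name] = assignment[local[-1]]
        else:
            # No nearby dependency: spread work round-robin (an assumption).
            assignment[name] = next_thread
            next_thread = (next_thread + 1) % num_threads
        recent.append(name)
    return assignment


def split_chunks(n_items: int, n_chunks: int) -> List[range]:
    """Split indices 0..n_items-1 into n_chunks contiguous, near-equal ranges."""
    base, extra = divmod(n_items, n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        size = base + (1 if i < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks


def parallel_map(fn: Callable, data: Sequence, n_threads: int) -> list:
    """Data-parallel OP: one chunk per thread, results concatenated in order."""
    chunks = split_chunks(len(data), n_threads)
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        parts = list(pool.map(lambda r: [fn(data[i]) for i in r], chunks))
    return [y for part in parts for y in part]


ops = [("a", []), ("b", []), ("c", ["a"]), ("d", ["b"]), ("e", ["c", "d"])]
plan = cluster_schedule(ops, num_threads=2, window=4)
chunk_sizes = [len(r) for r in split_chunks(10, 4)]
squares = parallel_map(lambda x: x * x, list(range(10)), n_threads=4)
```

In CPython, threads do not speed up CPU-bound work because of the global interpreter lock, so this illustrates only the partitioning and placement logic, not the performance behavior of a real runtime.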
Integrating the GPU(s) • Portion of the task graph to be executed on the GPU is sent to a dedicated GPU scheduler • GPU scheduler identifies OP and sends appropriate CUDA kernel to GPU • Manages the GPU memory • Shipped data remains on GPU for fast re-use until memory overflows or CPU requests data
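The memory-management rule on this slide (data stays on the GPU until memory overflows or the CPU requests it) can be modeled with a small Python sketch. No real GPU calls are made: transfers are only counted. `DeviceCache` and its methods are hypothetical, and both the least-recently-used eviction order and the release of the device copy on a CPU request are assumptions, since the slide does not say which data is evicted on overflow.

```python
from collections import OrderedDict


class DeviceCache:
    """Toy model of GPU-resident data; transfers are counted, not performed."""

    def __init__(self, capacity_bytes: int) -> None:
        self.capacity = capacity_bytes
        self.resident = OrderedDict()  # name -> size, least recently used first
        self.used = 0
        self.uploads = 0
        self.downloads = 0

    def use_on_device(self, name: str, size: int) -> None:
        """Make sure `name` is on the device before a kernel runs."""
        if name in self.resident:
            self.resident.move_to_end(name)  # fast re-use, no transfer
            return
        if size > self.capacity:
            raise ValueError(f"{name} does not fit in device memory")
        # Overflow: evict least recently used data until the new data fits
        # (LRU is an assumed policy).
        while self.used + size > self.capacity:
            _, freed = self.resident.popitem(last=False)
            self.used -= freed
            self.downloads += 1  # assume evicted data is copied back to host
        self.resident[name] = size
        self.used += size
        self.uploads += 1

    def request_on_host(self, name: str) -> None:
        """CPU asks for data: copy it back and release the device copy."""
        if name in self.resident:
            self.used -= self.resident.pop(name)
            self.downloads += 1


cache = DeviceCache(capacity_bytes=100)
cache.use_on_device("X", 60)   # upload
cache.use_on_device("X", 60)   # already resident, no transfer
cache.use_on_device("W", 30)   # upload
cache.use_on_device("Y", 50)   # overflow: X is evicted, then Y is uploaded
cache.request_on_host("W")     # CPU request moves W back to the host
```

In this run only three uploads occur for four device uses, which is the re-use benefit the slide describes.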
Experimental Results • Performed using ML applications written in OptiML and running on Delite • The application and the Delite scheduler run in a single thread, plus either N CPU worker threads or 1 GPU
ML Kernel Tests • 3 application kernels: Gaussian Discriminant Analysis, Naïve Bayes, Weighted Linear Regression • System 1 (multi-core CPU and GPU tests): Intel Nehalem, 2 sockets, 8 cores, 16 threads; 24 GB DRAM; NVIDIA GTX 275 GPU • System 2 (scalability tests): Sun Niagara T2+, 4 sockets, 32 cores, 256 threads; 128 GB DRAM
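Gaussian Discriminant Analysis • [Chart: speedup by configuration; bar labels 2.4x, 2.6x, 3.4x, 3.9x, 13.1x, 18.7x] • *Normalized to execution time for 1 CPU
Naïve Bayes • [Chart: speedup by configuration; bar labels 2.2x, 3.5x, 5.6x, 7.6x]
Weighted Linear Regression • [Chart: speedup by configuration; bar labels 1.1x, 2.5x, 3.3x, 3.9x, 4.3x, 5.5x]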
Deep Belief Networks (DBNs) • Very promising algorithms • Learns complex features • Shows great potential in solving difficult problems • Researched by Andrew Ng • Research is limited by compute power • Computation scales quadratically • Algorithm dominated by serial matrix multiplications
DBN Current Results • [Chart: speedup by configuration; bar labels 3.1x, 22.3x]
Conclusions • Domain knowledge facilitates implicit coarse-grained parallelism • Delite targets heterogeneous hardware automatically • Hits the sweet spot of ease-of-programming and scalable performance
Future Work • Hardware scheduling acceleration • Dataflow processing could become more feasible due to the natural expression of coarse-grained tasks in Delite • Static analysis of task graph • Allows intelligent scheduling before runtime • Task graph optimizations
Thank You! • Questions? • Thanks to Hassan Chafi, Arvind Sujeeth, HyoukJoong Lee, Nathan Bronson, and Kunle Olukotun