
Large-Scale Deep Learning With TensorFlow






Presentation Transcript


  1. Large-Scale Deep Learning With TensorFlow. Jeff Dean, Google Brain team, g.co/brain. In collaboration with many other people at Google.

  2. What is the Google Brain Team? • Research team focused on long-term artificial intelligence research • Mix of computer systems and machine learning research expertise • Pure ML research, and research in context of emerging ML application areas: • robotics, language understanding, healthcare, ... • g.co/brain

  3. We Disseminate Our Work in Many Ways • By publishing our work • See papers at research.google.com/pubs/BrainTeam.html • By releasing TensorFlow, our core machine learning research system, as an open-source project • By releasing implementations of our research models in TensorFlow • By collaborating with product teams at Google to get our research into real products

  4. What Do We Really Want? • Build artificial intelligence algorithms and systems that learn from experience • Use those to solve difficult problems that benefit humanity

  5. What do I mean by understanding?

  6. What do I mean by understanding?

  7. What do I mean by understanding?

  8. What do I mean by understanding? Query: [ car parts for sale ]

  9. What do I mean by understanding? Query: [ car parts for sale ] Document 1: "… car parking available for a small fee. … parts of our floor model inventory for sale." Document 2: "Selling all kinds of automobile and pickup truck parts, engines, and transmissions."

  10. Example Needs of the Future • Which of these eye images shows symptoms of diabetic retinopathy? • Find me all rooftops in North America • Describe this video in Spanish • Find me all documents relevant to reinforcement learning for robotics and summarize them in German • Find a free time for everyone in the Smart Calendar project to meet and set up a videoconference • Robot, please fetch me a cup of tea from the snack kitchen

  11. Growing Use of Deep Learning at Google [chart: # of directories containing model description files, over time] Across many products/areas: Android, Apps, drug discovery, Gmail, image understanding, Maps, natural language understanding, Photos, robotics research, Speech, Translation, YouTube, … many others …

  12. Important Property of Neural Networks: Results get better with more data + bigger models + more computation. (Better algorithms, new insights and improved techniques always help, too!)

  13. Aside: Many of the techniques that are successful now were developed 20-30 years ago. What changed? We now have: sufficient computational resources, and large enough interesting datasets. Use of large-scale parallelism lets us look ahead many generations of hardware improvements, as well.

  14. What do you want in a machine learning system? • Ease of expression: for lots of crazy ML ideas/algorithms • Scalability: can run experiments quickly • Portability: can run on a wide variety of platforms • Reproducibility: easy to share and reproduce research • Production readiness: go from research to real products

  15. Open, standard software for general machine learning. Great for Deep Learning in particular. First released Nov 2015, Apache 2.0 license. http://tensorflow.org/ and https://github.com/tensorflow/tensorflow

  16. http://tensorflow.org/whitepaper2015.pdf

  17. Preprint: arxiv.org/abs/1605.08695. Updated version will appear in OSDI 2016.

  18. Strong External Adoption [chart comparing ML framework repositories by GitHub launch date: Nov. 2015, Sep. 2013, Jan. 2012, Jan. 2008] 50,000+ binary installs in 72 hours, 500,000+ since November 2015

  19. Strong External Adoption [same chart] 50,000+ binary installs in 72 hours, 500,000+ since November 2015. Most forked new repo on GitHub in 2015 (despite only being available in Nov. '15)

  20. Motivations • DistBelief (our 1st system) was the first scalable deep learning system, but not as flexible as we wanted for research purposes • Better understanding of the problem space allowed us to make some dramatic simplifications • Define the industry standard for machine learning • Short-circuit the MapReduce/Hadoop inefficiency

  21. TensorFlow: Expressing High-Level ML Computations • Core in C++ • Very low overhead • Different front ends for specifying/driving the computation • Python and C++ today, easy to add more [diagram: Python front end / C++ front end / … → Core TensorFlow Execution System → CPU, GPU, Android, iOS, …]

  22. Computation is a dataflow graph. Graph of Nodes, also called Operations or ops. [graph: examples and weights feed MatMul; its output and biases feed Add; then ReLU; then Xent together with labels]

  23. Computation is a dataflow graph with tensors. Edges are N-dimensional arrays: Tensors. [same graph: examples, weights → MatMul → Add (biases) → ReLU → Xent (labels)]

  24. Example TensorFlow fragment • Build a graph computing a neural net inference.

  import tensorflow as tf
  from tensorflow.examples.tutorials.mnist import input_data
  mnist = input_data.read_data_sets('MNIST_data', one_hot=True)

  x = tf.placeholder("float", shape=[None, 784])
  W = tf.Variable(tf.zeros([784, 10]))
  b = tf.Variable(tf.zeros([10]))
  y = tf.nn.softmax(tf.matmul(x, W) + b)

  25. Computation is a dataflow graph with state. 'Biases' is a variable. Some ops compute gradients. −= updates biases. [graph: biases feed Add; gradient ops and the learning rate feed Mul; a −= op applies the update back to biases]
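  As a concrete sketch of the '−=' update this diagram depicts (assuming the 0.x-era TensorFlow API used throughout these slides; the gradient here is a stand-in constant, not a computed one):

  import tensorflow as tf

  biases = tf.Variable(tf.zeros([10]))  # stateful node in the graph
  learning_rate = 0.01
  grad = tf.ones([10])                  # stand-in for a computed gradient

  # tf.assign_sub adds a '-=' op; running it mutates the variable in place
  update_biases = tf.assign_sub(biases, learning_rate * grad)

  sess = tf.Session()
  sess.run(tf.initialize_all_variables())
  sess.run(update_biases)               # every entry of biases is now -0.01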

  26. Symbolic Differentiation • Automatically add ops to calculate symbolic gradients of variables w.r.t. the loss function. • Apply these gradients with an optimization algorithm.

  y_ = tf.placeholder(tf.float32, [None, 10])
  cross_entropy = -tf.reduce_sum(y_ * tf.log(y))
  opt = tf.train.GradientDescentOptimizer(0.01)
  train_op = opt.minimize(cross_entropy)
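  Under the hood, opt.minimize() combines gradient computation with an update step; the graph-level primitive is tf.gradients, which adds the symbolic-gradient ops directly. A one-line sketch, reusing W, b, and cross_entropy from above:

  # Adds ops computing d(cross_entropy)/dW and d(cross_entropy)/db
  grad_W, grad_b = tf.gradients(cross_entropy, [W, b])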

  27. Define graph and then execute it repeatedly • Launch the graph and run the training ops in a loop.

  init = tf.initialize_all_variables()
  sess = tf.Session()
  sess.run(init)
  for i in range(1000):
      batch_xs, batch_ys = mnist.train.next_batch(100)
      sess.run(train_op, feed_dict={x: batch_xs, y_: batch_ys})
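  The same feed-dict mechanism evaluates the trained model; a short sketch following the standard MNIST tutorial pattern:

  # Fraction of test images whose predicted class matches the label
  correct = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
  accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
  print(sess.run(accuracy, feed_dict={x: mnist.test.images,
                                      y_: mnist.test.labels}))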

  28. Computation is a dataflow graph: distributed. [graph partitioned across devices GPU0 and CPU: biases, Assign, Sub, Add, Mul, learning rate]

  29. Assign Devices to Ops • TensorFlow inserts Send/Recv ops to transport tensors across devices • Recv ops pull data from Send ops [diagram: the same graph split across GPU0 and CPU, with a Send/Recv pair bridging the devices]

  30. Assign Devices to Ops • TensorFlow inserts Send/Recv ops to transport tensors across devices • Recv ops pull data from Send ops [diagram: the same graph, now with a Send/Recv pair inserted on every edge that crosses the GPU0/CPU boundary]
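  Placement can also be requested explicitly; TensorFlow then inserts the Send/Recv pairs itself wherever a tensor crosses a device boundary. A minimal sketch (shapes illustrative):

  import tensorflow as tf

  with tf.device("/cpu:0"):
      biases = tf.Variable(tf.zeros([256]))   # parameters pinned to the CPU

  with tf.device("/gpu:0"):
      x = tf.placeholder(tf.float32, [None, 256])
      # 'biases' crosses the CPU->GPU boundary here; the Send/Recv pair
      # is inserted by TensorFlow, not written by the user
      activations = tf.nn.relu(x + biases)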

  31. Experiment Turnaround Time and Research Productivity • Minutes, hours: Interactive research! Instant gratification! • 1-4 days: Tolerable. Interactivity replaced by running many experiments in parallel • 1-4 weeks: High-value experiments only. Progress stalls • >1 month: Don't even try

  32.-39. Data Parallelism [animation across slides 32-39: many Model Replicas, each reading a shard of the Data, share state through Parameter Servers. Each replica fetches the current parameters p, computes a gradient ∆p on its shard, and sends it to the parameter servers, which apply the update p' = p + ∆p; the replicas then fetch p', compute ∆p', the servers apply p'' = p' + ∆p', and so on.]
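  In TensorFlow this pattern needs no dedicated parameter-server subsystem: variables are placed on "ps" tasks and the replicated model on "worker" tasks. A minimal sketch of between-graph replication (host names and task counts are hypothetical):

  import tensorflow as tf

  # Hypothetical cluster: two parameter-server tasks, two worker tasks
  cluster = tf.train.ClusterSpec({
      "ps":     ["ps0:2222", "ps1:2222"],
      "worker": ["worker0:2222", "worker1:2222"],
  })

  # Each worker process runs this; task_index identifies the replica
  server = tf.train.Server(cluster, job_name="worker", task_index=0)

  # replica_device_setter pins variables to /job:ps (round-robin) and
  # everything else to the local worker
  with tf.device(tf.train.replica_device_setter(cluster=cluster)):
      W = tf.Variable(tf.zeros([784, 10]))    # lives on a ps task
      b = tf.Variable(tf.zeros([10]))
      x = tf.placeholder(tf.float32, [None, 784])
      y = tf.nn.softmax(tf.matmul(x, W) + b)  # computed on the worker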

  40. Distributed training mechanisms: Graph structure and low-level graph primitives (queues) allow us to play with synchronous vs. asynchronous update algorithms.
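  Asynchronous updates are the default: each replica applies its gradients independently. Synchronous aggregation can be layered on with a wrapper optimizer built from those queue primitives; a sketch using tf.train.SyncReplicasOptimizer (API as in later TensorFlow releases; the counts are illustrative):

  # Gradients from replicas accumulate in queues and are applied once
  # replicas_to_aggregate of them have arrived; total_num_replicas >
  # replicas_to_aggregate means the extras act as backup workers.
  opt = tf.train.GradientDescentOptimizer(0.01)
  sync_opt = tf.train.SyncReplicasOptimizer(
      opt,
      replicas_to_aggregate=50,   # gradients averaged per step
      total_num_replicas=55)      # 5 backup workers absorb stragglers
  train_op = sync_opt.minimize(cross_entropy)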

  41. Cross-process communication is the same! • Communication across machines over the network is abstracted identically to cross-device communication. [diagram: the same Send/Recv graph, now split between /job:worker/cpu:0 and /job:ps/gpu:0] No specialized parameter server subsystem!

  42. Image Model Training Time [chart: hours to train vs. number of GPUs, for 1 GPU, 10 GPUs, and 50 GPUs]

  43. Image Model Training Time [same chart, with callout] 2.6 hours vs. 79.3 hours (30.5X)

  44. Sync converges faster (time to accuracy) Synchronous updates (with backup workers) train to higher accuracy faster. Better scaling to more workers (less loss of accuracy). Revisiting Distributed Synchronous SGD, Jianmin Chen, Rajat Monga, Samy Bengio, Rafal Jozefowicz, ICLR Workshop 2016, arxiv.org/abs/1604.00981

  45. Sync converges faster (time to accuracy) 40 hours vs. 50 hours. Synchronous updates (with backup workers) train to higher accuracy faster. Better scaling to more workers (less loss of accuracy). Revisiting Distributed Synchronous SGD, Jianmin Chen, Rajat Monga, Samy Bengio, Rafal Jozefowicz, ICLR Workshop 2016, arxiv.org/abs/1604.00981

  46. General Computations Although we originally built TensorFlow for our uses around deep neural networks, it's actually quite flexible. A wide variety of machine learning and other kinds of numeric computations are easily expressible in the computation graph model.
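  For instance, nothing in the graph model is neural-network-specific; a Monte Carlo estimate of π uses the same ops and Session machinery (a toy sketch):

  import tensorflow as tf

  n = 1000000
  pts = tf.random_uniform([n, 2])    # points uniform in the unit square
  # Fraction landing inside the quarter circle of radius 1, times 4
  inside = tf.cast(tf.reduce_sum(tf.square(pts), 1) < 1.0, tf.float32)
  pi_estimate = 4.0 * tf.reduce_mean(inside)

  with tf.Session() as sess:
      print(sess.run(pi_estimate))   # ~3.14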

  47. Runs on a Variety of Platforms: phones, single machines (CPU and/or GPUs), distributed systems of 100s of machines and/or GPU cards, custom ML hardware

  48. Trend: Much More Heterogeneous Hardware General-purpose CPU performance scaling has slowed significantly. Specialization of hardware for certain workloads will become more important.

  49. Tensor Processing Unit Custom machine learning ASIC. In production use for >16 months: used on every search query, used for the AlphaGo match, ... See the Google Cloud Platform blog: "Google supercharges machine learning tasks with TPU custom chip," by Norm Jouppi, May 2016
