
MapReduce: Simplified Data Processing on Large Clusters






Presentation Transcript


  1. MapReduce: Simplified Data Processing on Large Clusters
  Google's Experience and Large Scale Indexing
  Presented by Chris Moore

  2. Contents
  • Abstract
  • Introduction
  • Programming Model
  • Implementation
  • Refinements
  • Performance
  • Experience
    • Google's Experience with MapReduce
    • Improvements with MapReduce to Search Indexing
  • Related Work
  • Conclusion

  3. Abstract
  • MapReduce: a programming model for processing large data sets
  • Map: takes a key/value pair and produces intermediate key/value pairs
  • Reduce: merges all intermediate values associated with the same intermediate key (a word-count sketch follows below)
  • Makes parallel and distributed computing easy to use
  • Hundreds of MapReduce programs implemented; thousands of jobs run on Google's clusters every day
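The paper's canonical example of this model is word count. The following is a minimal single-process C++ sketch, not Google's MapReduce library: the function names and the in-memory grouping step are illustrative assumptions. Map emits an intermediate (word, 1) pair per word, the grouping step collects values by intermediate key, and reduce sums them.

    #include <iostream>
    #include <map>
    #include <sstream>
    #include <string>
    #include <utility>
    #include <vector>

    // Map: for every word in a document, emit an intermediate (word, 1) pair.
    std::vector<std::pair<std::string, int>> map_fn(const std::string& doc) {
        std::vector<std::pair<std::string, int>> out;
        std::istringstream in(doc);
        std::string word;
        while (in >> word) out.emplace_back(word, 1);
        return out;
    }

    // Reduce: sum every intermediate value that shares the same key (word).
    int reduce_fn(const std::string& /*word*/, const std::vector<int>& counts) {
        int total = 0;
        for (int c : counts) total += c;
        return total;
    }

    int main() {
        std::vector<std::string> documents = {"the quick brown fox",
                                              "the lazy dog jumps over the fox"};

        // Group step: collect intermediate values by intermediate key.
        std::map<std::string, std::vector<int>> groups;
        for (const auto& doc : documents)
            for (const auto& [word, count] : map_fn(doc))
                groups[word].push_back(count);

        // Reduce step: one reduce call per distinct intermediate key.
        for (const auto& [word, counts] : groups)
            std::cout << word << " " << reduce_fn(word, counts) << "\n";
    }

In the real system the grouping ("shuffle") step happens across machines, but the user still only writes the two functions shown here.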

  4. Introduction
  • Issue: how to handle very large data sets?
  • Computation must be parallelized, data distributed, and machine failures handled
  • MapReduce makes programming easier while the library takes care of these issues (a thread-per-split sketch follows below)
  • Simple, powerful interface
  • Automatic parallelization and distribution of large computations
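As a rough illustration of how a runtime can parallelize the map phase without changing the user's map function, here is a thread-per-split sketch. It is single-process and purely illustrative, not the actual library: a real MapReduce cluster schedules map tasks across many machines and re-executes them on failure.

    #include <cstddef>
    #include <iostream>
    #include <map>
    #include <mutex>
    #include <sstream>
    #include <string>
    #include <thread>
    #include <utility>
    #include <vector>

    // Same user-written map function as before: emit (word, 1) per word.
    std::vector<std::pair<std::string, int>> map_fn(const std::string& doc) {
        std::vector<std::pair<std::string, int>> out;
        std::istringstream in(doc);
        std::string word;
        while (in >> word) out.emplace_back(word, 1);
        return out;
    }

    int main() {
        // Each string stands in for one input split of the overall data set.
        std::vector<std::string> splits = {"to be or not to be",
                                           "that is the question"};

        std::map<std::string, std::vector<int>> groups;
        std::mutex mu;  // guards the shared intermediate store
        std::vector<std::thread> workers;

        // "Library" side: run one map task per split in parallel. The user
        // code above is untouched; only the runtime decides how tasks run.
        for (std::size_t i = 0; i < splits.size(); ++i) {
            workers.emplace_back([&groups, &mu, &splits, i] {
                for (const auto& [word, count] : map_fn(splits[i])) {
                    std::lock_guard<std::mutex> lock(mu);
                    groups[word].push_back(count);
                }
            });
        }
        for (auto& w : workers) w.join();

        // Reduce phase, kept sequential here: sum the counts per word.
        for (const auto& [word, counts] : groups) {
            int total = 0;
            for (int c : counts) total += c;
            std::cout << word << " " << total << "\n";
        }
    }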

  5. Google's Experience with MapReduce
  • Extracting data about popular queries (e.g., Google Zeitgeist)
  • Extracting properties of web pages
    • Geographical locations of web pages for localized search
  • Clustering problems for Google News and Shopping
  • Large-scale machine learning problems and graph computations

  6. Google Search: Large-Scale Indexing
  • The production indexing system produces the data structures used for search
  • Completely rewritten with MapReduce
  • What it does:
    • The crawler gathers approximately 20 TB of documents
    • The indexing process runs as a sequence of five to ten MapReduce operations (a chained-pass sketch follows below)
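To give a feel for what "a sequence of MapReduce operations" means, here is a small single-process C++ analogy, not Google's indexing code: run_pass, the string-only record type, and the document-frequency second pass are assumptions made for illustration. The point is only that the output records of one pass become the input records of the next.

    #include <functional>
    #include <iostream>
    #include <map>
    #include <set>
    #include <sstream>
    #include <string>
    #include <utility>
    #include <vector>

    // One in-memory stand-in for a MapReduce pass: apply map to every input
    // record, group the intermediate pairs by key, then reduce once per key.
    using Record = std::pair<std::string, std::string>;
    using MapFn = std::function<std::vector<Record>(const Record&)>;
    using ReduceFn = std::function<std::string(const std::string&,
                                               const std::vector<std::string>&)>;

    std::vector<Record> run_pass(const std::vector<Record>& input,
                                 const MapFn& map_fn, const ReduceFn& reduce_fn) {
        std::map<std::string, std::vector<std::string>> groups;
        for (const auto& rec : input)
            for (const auto& [key, value] : map_fn(rec)) groups[key].push_back(value);
        std::vector<Record> output;
        for (const auto& [key, values] : groups)
            output.emplace_back(key, reduce_fn(key, values));
        return output;
    }

    int main() {
        std::vector<Record> docs = {{"doc1", "apple banana apple"},
                                    {"doc2", "banana cherry"}};

        // Pass 1: build postings lists, i.e. term -> space-separated doc ids.
        auto postings = run_pass(
            docs,
            [](const Record& doc) {
                std::vector<Record> out;
                std::istringstream in(doc.second);
                std::string term;
                while (in >> term) out.emplace_back(term, doc.first);
                return out;
            },
            [](const std::string&, const std::vector<std::string>& doc_ids) {
                std::set<std::string> unique(doc_ids.begin(), doc_ids.end());
                std::string joined;
                for (const auto& id : unique) joined += id + " ";
                return joined;
            });

        // Pass 2: consume pass 1's output, compute document frequency per term.
        auto doc_freq = run_pass(
            postings,
            [](const Record& posting) { return std::vector<Record>{posting}; },
            [](const std::string&, const std::vector<std::string>& lists) {
                std::istringstream in(lists.front());
                std::string id;
                int n = 0;
                while (in >> id) ++n;
                return std::to_string(n);
            });

        for (const auto& [term, df] : doc_freq)
            std::cout << term << " appears in " << df << " document(s)\n";
    }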

  7. Improvements on the Indexing System
  • Indexing code is simpler
    • Approximately 3,800 lines of C++ reduced to roughly 700 with MapReduce
  • Improved performance
    • Separates unrelated computations
    • Avoids extra passes over the data
  • Easier to operate
    • MapReduce handles machine failures, slow machines, and networking hiccups without operator intervention

  8. Conclusion of Part Six
  • Google has seen many benefits from, and improvements to, MapReduce since 2003
  • MapReduce has completely changed the way Google handles its search indexing
  • Source: "MapReduce: Simplified Data Processing on Large Clusters", Jeffrey Dean and Sanjay Ghemawat
