MapReduce: Simplified Data Processing on Large Clusters

MapReduce: Simplified Data Processing on Large Clusters • Google’s Experience and Large-Scale Indexing • Presented by Chris Moore

Contents • Abstract • Introduction • Programming Model • Implementation • Refinements • Performance • Experience • Google’s Experience with MapReduce • Improvements with MapReduce To Search Indexing • Related Work • Conclusion

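Abstract • MapReduce: a programming model for processing and generating large data sets • Map: processes a key/value pair to generate a set of intermediate key/value pairs • Reduce: merges all intermediate values associated with the same intermediate key • Programs written in this style are automatically parallelized and run on a large cluster, which makes parallel and distributed computing easy to use • Hundreds of MapReduce programs implemented; upwards of a thousand jobs run on Google’s clusters every day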
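The Map and Reduce roles described above can be illustrated with the classic word-count example. This is a minimal single-process sketch, not Google's library: `run_mapreduce`, `map_fn`, and `reduce_fn` are hypothetical names chosen for illustration, and the grouping step stands in for what a real system does across many machines.

```python
from collections import defaultdict

def map_fn(key, value):
    """Map: (document name, document text) -> list of (word, 1) pairs."""
    return [(word, 1) for word in value.split()]

def reduce_fn(key, values):
    """Reduce: merge all intermediate counts for one word into a total."""
    return sum(values)

def run_mapreduce(inputs, map_fn, reduce_fn):
    """Single-process stand-in for the library: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in inputs:
        for ikey, ivalue in map_fn(key, value):
            groups[ikey].append(ivalue)  # collect values sharing an intermediate key
    return {ikey: reduce_fn(ikey, ivalues) for ikey, ivalues in groups.items()}

docs = [("doc1", "the quick fox"), ("doc2", "the lazy dog the end")]
counts = run_mapreduce(docs, map_fn, reduce_fn)
```

The user writes only `map_fn` and `reduce_fn`; everything else belongs to the (here, toy) library.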

Introduction • Issue: how to handle very large data sets? • Must parallelize the computation, distribute the data, and handle failures • MapReduce allows for easier programming while a library handles the above issues • Simple, powerful interface • Automatic parallelization and distribution of large-scale computations
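To make the "library handles the above issues" point concrete, here is a hedged sketch in which the user-supplied functions stay simple while a small runner executes map and reduce tasks concurrently. It assumes a thread pool as a stand-in for a cluster of machines; `parallel_mapreduce`, `count_words`, and `total` are hypothetical names, and real data distribution and fault handling are not modeled.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def parallel_mapreduce(inputs, map_fn, reduce_fn, num_workers=4):
    """Run map and reduce tasks on a thread pool (threads stand in for machines)."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # Map phase: one task per input record, executed concurrently.
        map_outputs = list(pool.map(lambda kv: map_fn(*kv), inputs))
        # Shuffle: group intermediate values by intermediate key.
        groups = defaultdict(list)
        for pairs in map_outputs:
            for ikey, ivalue in pairs:
                groups[ikey].append(ivalue)
        # Reduce phase: one task per intermediate key, executed concurrently.
        keys = list(groups)
        results = list(pool.map(lambda k: reduce_fn(k, groups[k]), keys))
    return dict(zip(keys, results))

def count_words(name, text):
    # User code: no threads, no scheduling, just the per-record logic.
    return [(word, 1) for word in text.split()]

def total(word, counts):
    return sum(counts)

pages = [("a", "x y x"), ("b", "y z"), ("c", "x")]
result = parallel_mapreduce(pages, count_words, total)
```

Note that `count_words` and `total` contain no concurrency code at all; swapping the runner changes how the work is executed without touching them.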

Google’s Experience With MapReduce • Extraction of data for popular queries • Google Zeitgeist • Extracting properties of web pages • Geographical locations of web pages for localized search • Clustering problems for Google News and Shopping • Large-scale machine learning problems and graph computations

Google Search – Large-Scale Indexing • Production indexing system • Produces the data structures used for Google web search • Completely rewritten with MapReduce • What it does: • Input: documents retrieved by the crawler, more than 20 TB of raw content • Indexing process: a sequence of 5 to 10 MapReduce operations
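Running indexing as a sequence of MapReduce operations means the output of one operation becomes the input of the next. The sketch below shows that chaining with two invented stages; the source does not describe what the real stages do, so `index_map`, `index_reduce`, `freq_map`, `freq_reduce`, and `run_pipeline` are all hypothetical.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Toy single-process MapReduce that returns (key, value) records."""
    groups = defaultdict(list)
    for key, value in records:
        for ikey, ivalue in map_fn(key, value):
            groups[ikey].append(ivalue)
    return [(ikey, reduce_fn(ikey, ivalues)) for ikey, ivalues in groups.items()]

# Stage 1 (hypothetical): build an inverted index, word -> sorted document ids.
def index_map(doc_id, text):
    return [(word, doc_id) for word in set(text.split())]

def index_reduce(word, doc_ids):
    return sorted(doc_ids)

# Stage 2 (hypothetical): group words by how many documents contain them.
def freq_map(word, doc_ids):
    return [(len(doc_ids), word)]

def freq_reduce(doc_count, words):
    return sorted(words)

def run_pipeline(records, stages):
    """Feed each stage's output records into the next stage."""
    for map_fn, reduce_fn in stages:
        records = run_mapreduce(records, map_fn, reduce_fn)
    return dict(records)

crawl = [("d1", "map reduce"), ("d2", "map index"), ("d3", "map reduce index")]
by_frequency = run_pipeline(crawl, [(index_map, index_reduce), (freq_map, freq_reduce)])
```

Because each stage is a separate pair of functions, a stage can be changed or replaced without rewriting the others.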

Improvements on the Indexing System • Indexing code is simpler • One phase of the computation went from approx. 3800 lines of C++ to approx. 700 with MapReduce • Performance • Fast enough that conceptually unrelated computations can stay separate instead of being mixed together to avoid extra passes over the data • Easier to operate • MapReduce handles issues without operator intervention • Machine failures, slow machines, networking hiccups
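One way a library can absorb machine failures without an operator is to re-execute a failed task. The following is a simplified sketch of that idea only, assuming a simulated failure; `WorkerFailure`, `run_with_reexecution`, and `make_flaky_task` are hypothetical names and do not reflect the actual scheduler.

```python
class WorkerFailure(Exception):
    """Simulated machine failure (hypothetical, for illustration only)."""

def run_with_reexecution(task, max_attempts=3):
    """Re-run a failed task, as a scheduler might reassign it to another worker."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(), attempt
        except WorkerFailure:
            continue  # no operator involved: simply schedule the task again
    raise RuntimeError(f"task failed after {max_attempts} attempts")

def make_flaky_task(failures_before_success, result):
    """Build a task that fails a fixed number of times before succeeding."""
    state = {"calls": 0}
    def task():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise WorkerFailure("worker died")
        return result
    return task

value, attempts = run_with_reexecution(make_flaky_task(2, "index shard 7"))
```

Re-execution is safe here because the task is deterministic and has no side effects, which is the same property that makes map and reduce functions easy to retry.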

Conclusion of Part Six • Google has seen many benefits and improvements from MapReduce since 2003 • MapReduce has completely changed the way Google handles its search indexing • Source: MapReduce: Simplified Data Processing on Large Clusters, by Jeffrey Dean and Sanjay Ghemawat
