MapReduce

michel.bruley@teradata.com. Extract from various presentations: Sudarshan, Chungnam, Teradata Aster, … April 2012.

Presentation Transcript


  1. MapReduce michel.bruley@teradata.com Extract from various presentations: Sudarshan, Chungnam, Teradata Aster, … April 2012

  2. What is MapReduce?
     • Restricted parallel programming model meant for large clusters
       • User implements Map() and Reduce() functions
     • Parallel computing framework
       • Libraries take care of EVERYTHING else
         • Parallelization
         • Fault Tolerance
         • Data Distribution
         • Load Balancing
     • Useful model for many practical tasks

  3. Map and Reduce
     • The idea of Map and Reduce is 40+ years old
       • Present in all functional programming languages
       • See, e.g., APL, Lisp and ML
     • Alternate name for Map: Apply-All
     • Higher-order functions
       • take function definitions as arguments, or
       • return a function as output
     • Map and Reduce are higher-order functions
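
To make the higher-order-function idea on this slide concrete, here is a minimal Python sketch using the built-in `map` and `functools.reduce`. The helper names (`apply_all`, `make_adder`, `square`) and the sample data are hypothetical, chosen only for illustration; they do not come from the presentation.

```python
from functools import reduce


def square(x):
    return x * x


def apply_all(fn, items):
    """Map ("apply-all"): takes a function as an argument and applies it to every element."""
    return list(map(fn, items))


def make_adder(n):
    """A higher-order function of the second kind: it returns a function as output."""
    def add(x):
        return x + n
    return add


numbers = [1, 2, 3, 4]
squares = apply_all(square, numbers)                 # each element handled independently
total = reduce(lambda acc, x: acc + x, squares, 0)   # fold the list into a single value
shifted = apply_all(make_adder(10), numbers)
```

Because each `map` application is independent of the others, the work can in principle be spread across machines, which is the property the MapReduce model relies on.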

  4. Map and Reduce Functions
     • Functions borrowed from functional programming languages (e.g. Lisp)
     • Map()
       • Process a key/value pair to generate intermediate key/value pairs
     • Reduce()
       • Merge all intermediate values associated with the same key

  5. Example: Counting Words
     • Map()
       • Input <filename, file text>
       • Parses file and emits <word, count> pairs
       • e.g. <"hello", 1>
     • Reduce()
       • Sums all values for the same key and emits <word, TotalCount>
       • e.g. <"hello", (3 5 2 7)> => <"hello", 17>
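
The word-count example can be sketched in a few lines of Python. This is a single-process illustration under stated assumptions, not the code of any real framework: `map_fn`, `reduce_fn`, the driver `run_word_count`, and the sample file contents are all hypothetical names and data. The driver stands in for the grouping work that a MapReduce library would normally do.

```python
from collections import defaultdict


def map_fn(filename, text):
    """Map: take <filename, file text> and emit a <word, 1> pair per word."""
    for word in text.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: sum all values for the same key and emit <word, TotalCount>."""
    return word, sum(counts)


def run_word_count(files):
    """Hypothetical driver: group intermediate values by key, then reduce each group."""
    grouped = defaultdict(list)
    for filename, text in files.items():
        for word, count in map_fn(filename, text):
            grouped[word].append(count)
    return dict(reduce_fn(word, counts) for word, counts in sorted(grouped.items()))


files = {"a.txt": "hello world hello", "b.txt": "hello MapReduce"}
result = run_word_count(files)
```

Applied to the slide's own example, `reduce_fn("hello", [3, 5, 2, 7])` returns `("hello", 17)`.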

  6. Execution on Clusters
     • Input files split (M splits)
     • Assign Master & Workers
     • Map tasks
     • Writing intermediate data to disk (R regions)
     • Intermediate data read & sort
     • Reduce tasks
     • Return

  7. Map/Reduce Cluster Implementation
     [Diagram: input files (split 0 to split 4) → M map tasks → intermediate files → R reduce tasks → output files (Output 0, Output 1)]
     • Several map or reduce tasks can run on a single computer
     • Each intermediate file is divided into R partitions, by a partitioning function
     • Each reduce task corresponds to one partition
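
The partitioning step can be sketched as a small in-memory simulation. The slide does not say which partitioning function is used, so the hash-modulo-R function below is an assumption made for illustration. The function names, the value R = 2, and the three sample splits are likewise hypothetical. `zlib.crc32` is used instead of Python's built-in `hash` because the latter is randomized per process for strings.

```python
import zlib
from collections import defaultdict

R = 2  # number of reduce tasks (assumed value for this example)


def partition(key, num_partitions=R):
    """Assumed partitioning function: stable hash of the key, modulo R."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def run_map_task(split):
    """One map task: emit <word, 1> pairs and bucket them into R regions."""
    regions = [[] for _ in range(R)]
    for word in split.split():
        regions[partition(word)].append((word, 1))
    return regions


def run_reduce_task(r, all_regions):
    """Reduce task r: read region r of every map output, sort by key, sum the counts."""
    pairs = sorted(pair for regions in all_regions for pair in regions[r])
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


splits = ["a b a", "b c", "a c c"]                    # M = 3 input splits
intermediate = [run_map_task(s) for s in splits]      # one intermediate "file" per map task
outputs = [run_reduce_task(r, intermediate) for r in range(R)]  # one output per reduce task
```

Since every occurrence of a key hashes to the same partition, each key ends up in exactly one reduce task's output, so the R outputs can be concatenated without further merging.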

  8. Map Reduce vs. Parallel Databases
     • Map Reduce widely used for parallel processing
       • Google, Yahoo, and 100s of other companies
       • Example uses: compute PageRank, build keyword indices, do data analysis of web click logs, …
     • Database people say:
       • but parallel databases have been doing this for decades
     • Map Reduce people say:
       • we operate at scales of 1000s of machines
       • we handle failures seamlessly
       • we allow procedural code in map and reduce and allow data of any type

  9. Typical MapReduce Cluster

  10. Map Reduce Implementations
      • Google
        • Not available outside Google
      • Hadoop
        • An open-source implementation in Java
        • Uses HDFS for stable storage
        • Download: http://lucene.apache.org/hadoop/
      • Teradata Aster
        • Cluster-optimized SQL Database that also implements MapReduce
        • IITB alumnus among founders
      • And several others, such as Cassandra at Facebook, etc.

  11. MapReduce v. Hadoop

  12. Solutions Stack for Teradata Aster Data Integration / ETL Business Intelligence Tools Query Tools Analytics Specialists Aster Data Ecosystem Systems Management Security Aster Data nCluster Operating System Aster Data Platform Infrastructure Servers Cloud Infrastructure Storage

  13. Teradata Aster Platform Infrastructure
      For physical infrastructure (non-cloud) deployments
      • Aster Data Analytic Platform: Aster Data nCluster packaged software
      • nCluster Operating System: Certified Linux operating system
      • Server Hardware: Certified commodity (x86) server hardware with internal storage

  14. Teradata Aster Infrastructure
      For cloud deployments
      • Aster Data Analytic Platform: Aster Data nCluster packaged software
      • nCluster Operating System: Linux operating system
      • Compute Instance: Compute instance from cloud provider (e.g. Amazon Web Services EC2)
        [Diagram labels: CC, xLarge]
      • Storage: Storage connected to cloud computing capacity
        [Diagram labels: EBS, Ephemeral]

  15. Teradata Aster Architecture for Analytics
      • Your Analytics & Advanced Reporting Applications
        • Support for in-database processing of custom applications written in a broad variety of languages
        • Integration with third-party packaged software via ODBC/JDBC or in-database integration
        [Diagram labels: App, App, App, App]
      • Aster Data nCluster Analytic Functions and Frameworks
        • Rich libraries of MapReduce analytics from Aster Data and partners
        • Visual development environment: develop in hours
        • Standard SQL interface
        • MapReduce processing integrated with SQL via SQL-MapReduce interface
        [Diagram labels: Unified Interface, SQL-MapReduce, SQL]
      • Analytics Processing Engines
        • Optimized SQL engine
        • Fully-integrated in-database MapReduce
        [Diagram labels: SQL, MapReduce, …]
      • Massively Parallel Data Stores
        • Hybrid row/column DBMS
        • Linear, incremental scalability
        • Commodity hardware

  16. Teradata Aster Ecosystem *Oracle BIEE certification currently in process
