
Big Data and Map Reduce

Big Data and Map Reduce. Paula Ta-Shma IBM Haifa Research Storage Systems 1/5/2013. Outline. Historical Context behind Map Reduce What is Big Data ? The Map Reduce Framework Connections with Storage Cloud. Relational Database Management Systems (RDBMS)


Presentation Transcript


  1. Big Data and Map Reduce Paula Ta-Shma IBM Haifa Research Storage Systems 1/5/2013

  2. Outline • Historical Context behind Map Reduce • What is Big Data? • The Map Reduce Framework • Connections with Storage Cloud

  3. Historical Context: Relational Database Management Systems (RDBMS) • Researched in 70s, products in 80s and beyond • Relational (tabular) data model • Query Language: SQL • Efficient Query Processing: Indexing, Query Evaluation Strategies • Transactions, Consistency • Concurrency Control • Security and Authorization • Can be implemented on top of file systems • Provide higher level of abstraction and functionality than file systems • Example Use Cases • Banking, Stock trading, Personnel Management, Inventory Management, Manufacturing Data, etc. • The list is very long • Example query on an Accounts table: SELECT Name FROM Accounts GROUP BY Name HAVING SUM(Balance) < 0
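The query on this slide can be tried out with a minimal sketch using Python's built-in sqlite3 module. The Accounts schema (Name, Balance) is inferred from the query itself, the sample rows are invented for illustration, and the ORDER BY clause is added here only to make the output order predictable.

```python
import sqlite3


def overdrawn_customers(rows):
    """Return the names whose total balance across all accounts is negative."""
    conn = sqlite3.connect(":memory:")
    # Assumed schema: the slide only shows the query, not the table definition.
    conn.execute("CREATE TABLE Accounts (Name TEXT, Balance REAL)")
    conn.executemany("INSERT INTO Accounts VALUES (?, ?)", rows)
    cur = conn.execute(
        "SELECT Name FROM Accounts "
        "GROUP BY Name HAVING SUM(Balance) < 0 "
        "ORDER BY Name"  # added for deterministic output; not on the slide
    )
    names = [row[0] for row in cur.fetchall()]
    conn.close()
    return names


# Invented sample data: alice has two accounts that sum to -150.
sample = [("alice", 100.0), ("alice", -250.0), ("bob", 40.0), ("carol", -5.0)]
print(overdrawn_customers(sample))
```

GROUP BY collects all accounts of one customer, and HAVING filters on the aggregate, so a customer with one positive and one negative account is reported only when the sum is below zero.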

  4. Historical Context Cont. • Business Intelligence • Extract value from large amounts of data • Banking use case example • Identify and actively retain and pursue profitable customers • Analyze the performance of sales personnel, tellers and account managers • etc. • Massive query processing to analyze data across multiple dimensions • Requires read access to large amounts of data • Typically long running queries, can interfere with transactions • Work on a snapshot of data • Deployed as physically separate Data Warehousing systems • Mission critical • Data warehousing products in early 90s

  5. New Requirements in Internet Era • Massive amounts of data • Unstructured (e.g. text) and semi-structured data (e.g. XML) • Analysis capabilities beyond what is possible in SQL • LOW COST

  6. Map Reduce • Invented by Google • Inspired by the map and reduce functions of functional programming languages • Seminal paper: Dean, Jeffrey & Ghemawat, Sanjay (OSDI 2004), "MapReduce: Simplified Data Processing on Large Clusters" • Used at Google to completely regenerate Google's index of the World Wide Web • It replaced the old ad hoc programs that updated the index and ran the various analyses • Uses: • distributed pattern-based searching, distributed sorting, web link-graph reversal, term-vector per host, web access log stats, inverted index construction, document clustering, machine learning, statistical machine translation • Hadoop: • Open source implementation modeled on the design published by Google
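To make the map and reduce idea concrete, here is a minimal single-process sketch of a word count in Python. It is not Google's or Hadoop's implementation: the function names (map_phase, shuffle, reduce_phase, word_count) and the sample documents are hypothetical, and a real framework would run the phases in parallel across many machines.

```python
from collections import defaultdict
from functools import reduce


def map_phase(doc_id, text):
    """Map: emit an intermediate (word, 1) pair for every word in a document."""
    return [(word.lower(), 1) for word in text.split()]


def shuffle(pairs):
    """Group intermediate values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: fold all values for one key into a single result."""
    return key, reduce(lambda a, b: a + b, values)


def word_count(docs):
    intermediate = []
    for doc_id, text in docs.items():
        intermediate.extend(map_phase(doc_id, text))
    grouped = shuffle(intermediate)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


# Invented sample input: document id -> contents.
docs = {"d1": "big data map reduce", "d2": "map reduce at Google"}
counts = word_count(docs)
print(counts)
```

The user supplies only the map and reduce functions; grouping by key is the framework's job, which is why the same skeleton also covers uses such as inverted index construction.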

  7. Source: IBM InfoSphere BigInsights slides, by Bruce Brown https://www-950.ibm.com/events/wwe/grp/grp004.nsf/vLookupPDFs/Bruce%20Brown%20-%20BigInsights-1-16-12-external/$file/Bruce%20Brown%20-%20BigInsights-1-16-12-external.pdf

  8. Source: IBM InfoSphere BigInsights slides, by Bruce Brown https://www-950.ibm.com/events/wwe/grp/grp004.nsf/vLookupPDFs/Bruce%20Brown%20-%20BigInsights-1-16-12-external/$file/Bruce%20Brown%20-%20BigInsights-1-16-12-external.pdf

  9. Map Reduce In Detail • Map Reduce material taken from Distributed Systems Course, MapReduce lecture by Paul Krzyzanowski • http://www.seas.gwu.edu/~gparmer/courses/f12_3411/distrib-5-mapreduce.pdf

  10. HDFS Architecture Source http://hadoop.apache.org/docs/r1.0.4/hdfs_design.html

  11. Integrating Hadoop with Object Storage • Implement Hadoop FileSystem API • Leave MapReduce framework unchanged • => no changes needed for user applications • => work with Hadoop based technologies • Hive, Pig Latin, HBase, Jaql, and others • Diagram (top to bottom): Application (HBase, Jaql, …); Hadoop Map Reduce; invokes Hadoop FileSystem API (create, open, close, read, write, seek, get block locations, …); implemented by Hadoop Distributed File System (HDFS), CDMI FileSystem, S3FileSystem
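The layering on this slide can be illustrated with a small Python sketch. This is an analogy only: the real Hadoop FileSystem API is a Java class with many more operations, and every name below (FileSystem, LocalFileSystem, ObjectStoreFileSystem, count_lines) is hypothetical. The point it shows is that a job written against the abstract interface runs unchanged when the storage backend is swapped.

```python
import os
import tempfile
from abc import ABC, abstractmethod


class FileSystem(ABC):
    """Hypothetical stand-in for the storage interface the framework calls."""

    @abstractmethod
    def write(self, path: str, data: bytes) -> None: ...

    @abstractmethod
    def read(self, path: str) -> bytes: ...


class LocalFileSystem(FileSystem):
    """Backend that stores files under a directory on local disk."""

    def __init__(self, root: str) -> None:
        self.root = root

    def write(self, path: str, data: bytes) -> None:
        with open(os.path.join(self.root, path), "wb") as f:
            f.write(data)

    def read(self, path: str) -> bytes:
        with open(os.path.join(self.root, path), "rb") as f:
            return f.read()


class ObjectStoreFileSystem(FileSystem):
    """Backend that mimics an object store with an in-memory dict."""

    def __init__(self) -> None:
        self._objects = {}

    def write(self, path: str, data: bytes) -> None:
        self._objects[path] = data

    def read(self, path: str) -> bytes:
        return self._objects[path]


def count_lines(fs: FileSystem, path: str) -> int:
    """The 'application': it sees only the interface, never the backend."""
    return len(fs.read(path).splitlines())


payload = b"alpha\nbeta\ngamma\n"
results = []
with tempfile.TemporaryDirectory() as root:
    for fs in (LocalFileSystem(root), ObjectStoreFileSystem()):
        fs.write("input.txt", payload)
        results.append(count_lines(fs, "input.txt"))
print(results)
```

Because count_lines depends only on the abstract class, adding another backend requires no change to it, which mirrors the "no changes needed for user applications" bullet.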

  12. Amazon Elastic Map Reduce Source: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-what-is-emr.html

  13. The End
