MapReduce: Simplified Data Processing on Large Clusters


Presentation Transcript


  1. MapReduce: Simplified Data Processing on Large Clusters Yunxing Dai, Huan Feng EECS 584, Fall 2011

  2.–3. Real world problem • Count the number of occurrences of each word in a huge collection of word lists. • Sample input: the seven books of Harry Potter

  4. Possible solution • Hash table • each entry is a key-value pair: (word, occurrence) • scan all the files, putting each word into the hash table
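
A single-machine version of this hash-table approach is only a few lines of Python (a sketch; the function name and file handling are illustrative):

    def count_words(paths):
        # One hash table: each entry is a (word, occurrence-count) pair.
        counts = {}
        for path in paths:                      # scan all the files
            with open(path) as f:
                for word in f.read().split():
                    counts[word] = counts.get(word, 0) + 1
        return counts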

  5. Real world problem: follow-up • What if you are given a huge set of files and access to a large set of machines? • Problems with the hash table: • low concurrency • hard to scale • if one node fails, all work must be restarted • The MapReduce solution

  6. Map primitive • Idea from functional languages • Given a function, apply it to every element of a list INDIVIDUALLY and combine the results into a new list • e.g., increment each element of a list by 1

  7. Reduce primitive • Idea from functional languages • Apply a function to all elements of a list, combining them into a single result • e.g., calculate the sum of a list
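
Both primitives exist as Python built-ins, which makes the functional-language idea from slides 6 and 7 easy to demonstrate (a minimal sketch):

    from functools import reduce

    nums = [1, 2, 3, 4]

    # Map: apply a function to every element INDIVIDUALLY,
    # combining the results into a new list.
    plus_one = list(map(lambda x: x + 1, nums))      # [2, 3, 4, 5]

    # Reduce: combine all elements of a list into a single result.
    total = reduce(lambda acc, x: acc + x, nums, 0)  # 10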

  8. Map reduce solution: single node • Map each single word into a (key, value) pair: "Good" -> ("Good", 1) • Group together all the pairs that share the same key, feed those pairs to a reduce step, and add the values together: [("Good", 1), ("Good", 1), ("Good", 1)] -> 3
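
The steps on this slide translate directly into a single-node Python sketch (map_fn, reduce_fn, and word_count are illustrative names, not the paper's API):

    from collections import defaultdict

    def map_fn(document):
        # Emit ("word", 1) for every word, e.g. "Good" -> ("Good", 1).
        return [(word, 1) for word in document.split()]

    def reduce_fn(key, values):
        # Add together all the values collected for one word.
        return key, sum(values)

    def word_count(documents):
        # Group intermediate pairs by key, then reduce each group.
        groups = defaultdict(list)
        for doc in documents:
            for key, value in map_fn(doc):
                groups[key].append(value)
        return [reduce_fn(key, values) for key, values in groups.items()]

    print(word_count(["the good the bad", "the good"]))
    # [('the', 3), ('good', 2), ('bad', 1)]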

  9.–11. Map reduce solution [Diagram, built up across slides 9–11 as animation steps: the input is split, each split goes through Map, the intermediate pairs are merged and sorted, and several Reduce tasks produce the output]

  12. Map reduce solution • What if we are now given a huge number of files • and a large number of machines?

  13. It can be scalable! • Map can be applied to different parts of the input in parallel. • If some map tasks fail, only those tasks need to be restarted rather than the whole job.

  14. Map reduce solution: scalable version • Map: split the input files into several parts and apply the map function to each part. • Shuffle: distribute the intermediate results into different buckets according to the hash value of the key, and assign the buckets to several reducers. Each reducer sorts its pairs by key. • Reduce: apply the reduce function to all elements that share the same key and produce the result.
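
A toy, single-process version of the shuffle step, assuming Python's built-in hash as the partitioning function and R reducers (a sketch of the idea, not the paper's implementation):

    from collections import defaultdict

    R = 4  # number of buckets, one per reducer

    def shuffle(intermediate_pairs):
        # Route each (key, value) pair to a bucket by the hash of its key,
        # so every occurrence of a given key lands at the same reducer.
        buckets = [defaultdict(list) for _ in range(R)]
        for key, value in intermediate_pairs:
            buckets[hash(key) % R][key].append(value)
        return buckets

    def reduce_bucket(bucket):
        # Each reducer sorts its keys, then reduces every group of values.
        return [(key, sum(bucket[key])) for key in sorted(bucket)]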

  15. MapReduce • Generalized software framework • Users are only responsible for providing two functions: map and reduce • Easy to scale to a large number of machines

  16. Split the input files into several pieces

  17. Each piece is assigned to one worker (mapper)

  18. Before sorting, the key-value pairs are hashed by key into R buckets

  19.–21. [Diagram annotation, built up across slides 19–21: one bucket holds pairs such as (The, 1), (Good, 1), (The, 1), (Never, 1); another holds (Is, 1), (Is, 1), (Tie, 1), (Work, 1)] Each bucket is read by one worker (reducer), which then sorts the pairs and produces the results

  22. Master program: controls the whole process and assigns work to the workers

  23. Fault tolerance • worker fails: simply reassign its tasks to another worker • master fails: restart the whole job
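
The worker-failure rule can be made concrete with a small master-side sketch; Worker, TIMEOUT, and the heartbeat bookkeeping are assumptions for illustration, not details from the slides:

    import time
    from dataclasses import dataclass, field

    TIMEOUT = 60.0  # seconds without a heartbeat before a worker is presumed dead

    @dataclass
    class Worker:
        last_heartbeat: float
        in_progress: list = field(default_factory=list)  # tasks assigned to this worker

    def reassign_dead_workers(workers, task_queue, now=None):
        # Worker failure: put the dead worker's tasks back on the queue so
        # another worker re-executes them; nothing else has to restart.
        if now is None:
            now = time.time()
        for w in workers:
            if now - w.last_heartbeat > TIMEOUT:
                task_queue.extend(w.in_progress)
                w.in_progress.clear()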

  24. Implementation details • Locality • take the location of the input files into account • assign each map task to a machine close to (or holding) its input data
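
A sketch of that scheduling preference, assuming the master knows which hosts hold a replica of each input split (the data structures here are illustrative):

    def pick_worker(replica_hosts, idle_workers):
        # Prefer an idle worker that already stores a replica of the split,
        # so the map task reads its input from local disk, not the network.
        for host in replica_hosts:
            if host in idle_workers:
                return host
        # Fall back to any idle worker (assumes at least one is available).
        return next(iter(idle_workers))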

  25.–28. Implementation details • Backup tasks • a few abnormal (straggler) machines can lengthen the total completion time • when the job is almost finished, schedule duplicates of the remaining in-progress tasks as backup tasks • whenever either the primary or the backup execution completes, mark the task as finished [Diagram animation across slides 25–28: TASK1, TASK2, TASK3 progress bars; a backup copy of TASK1 is launched while the original is still in progress, and the task is marked completed as soon as either copy finishes]
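
The backup-task idea can be mimicked with Python's concurrent.futures: run a straggling task twice and take whichever copy finishes first (a sketch of the mechanism, not of Google's scheduler):

    from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

    def run_with_backup(task, pool):
        # Schedule the task twice; whichever copy finishes first wins, and
        # the task counts as completed as soon as either copy returns.
        futures = [pool.submit(task), pool.submit(task)]
        done, not_done = wait(futures, return_when=FIRST_COMPLETED)
        for f in not_done:
            f.cancel()  # best effort; a still-running duplicate is wasted work
        return next(iter(done)).result()

    with ThreadPoolExecutor(max_workers=4) as pool:
        print(run_with_backup(lambda: sum(range(1_000_000)), pool))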

  29. Useful Extensions • Partitioning functions: hash-based or range-based, or a self-defined partition function • Combiner function (similar to the reduce function): resolves significant repetition in intermediate outputs, e.g. long runs of pairs like <any, 1>, <a, 1>, <any, 1>, <any, 1> • Skipping bad records: when errors or bugs affect only a few records and it is acceptable to ignore them • Local execution: helps facilitate debugging, profiling, and testing
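
For instance, a word-count combiner pre-sums counts inside the mapper, so long runs of repeated pairs collapse into a single pair per word before the shuffle (a minimal sketch):

    from collections import Counter

    def map_with_combiner(document):
        # Without a combiner the mapper emits one (word, 1) pair per
        # occurrence; the combiner sums counts locally so far fewer
        # intermediate pairs cross the network during the shuffle.
        combined = Counter()
        for word, n in ((word, 1) for word in document.split()):
            combined[word] += n
        return list(combined.items())

    print(map_with_combiner("the quick the lazy the"))
    # [('the', 3), ('quick', 1), ('lazy', 1)]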

  30. Performance & Evaluation • Cluster configuration • 1800 nodes • 2×2 GHz CPUs, 4 GB memory, 2×160 GB IDE disks, gigabit Ethernet link per node • 2-level tree-shaped switched network • Grep • through 10^10 100-byte records • M = 15000, R = 1 • takes ~150 seconds

  31. Performance & Evaluation (Sort) • Sort • 10^10 100-byte records • M = 15000, R = 4000 • Three runs: normal, no backup tasks, and 200 tasks killed • A few things to note • the input rate is higher than the shuffle and output rates • with no backup tasks, the execution flow is similar except for a long tail • when tasks are killed, they are restarted and the rate briefly drops to zero

  32. [Figure: data transfer rates over time for the normal, no-backup, and 200-tasks-killed sort executions]

  33. Application of MapReduce • Broadly applicable • large-scale machine learning problems • clustering problems for Google News • extraction of data & properties • graph computations • Large-scale indexing • the indexing code is simpler and smaller (from ~3800 lines to ~700 lines) • the indexing process is much easier to operate and easy to speed up

  34. MapReduce & Parallel DBMS • MapReduce is not novel at all • is it an entirely new paradigm? • MapReduce is a step backwards • no schema • no high-level access language • MapReduce is a poor implementation • no indexes • overlooks skew • lots of network traffic among nodes in the shuffle phase • Missing features • indexes, updates, transactions • Not compatible with DBMS tools

  35. MapReduce & Parallel DBMS

  36. MapReduce & Parallel DBMS • Parallel databases • have a significant performance advantage • take a lot of time to tune and set up • are not general enough (UDFs, UDTs) • SQL is not always easy and straightforward • MapReduce • easy to set up and easy to program • scalable and fault-tolerant • a brute-force solution

  37. What is MapReduce • A parallel programming model / data processing paradigm rather than a complete DBMS • Does not target everything a DBMS targets • It's simple, but it works • Works for those who • have a lot of data (of some specific type) • find UDTs and UDFs too complex to tune • would rather program in a sequential language than in SQL • have no need to index data because the data changes all the time • do not want to pay for a DBMS

  38. Questions?
