Presentation Transcript


  1. COP5725 Advanced Database Systems • MapReduce • Spring 2014

  2. What is MapReduce? • Programming model • expressing distributed computations at a massive scale • “…the computation takes a set of input key/value pairs, and produces a set of output key/value pairs. The user of the MapReduce library expresses the computation as two functions: map and reduce.” – Jeff Dean and Sanjay Ghemawat [OSDI’04] • Execution framework • organizing and performing data-intensive computations • processing parallelizable problems across huge datasets using a large number of computers (nodes) • Open-source implementation: Hadoop and others

  3. Why does MapReduce Matter? • We are now in the so-called Big Data era • “Big data exceeds the reach of commonly used hardware environments and software tools to capture, manage, and process it within a tolerable elapsed time for its user population.” – Teradata Magazine, 2011 • “Big data refers to data sets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze.” – The McKinsey Global Institute, 2011 • In short: the volume, variety and velocity of the data make it difficult to manage with traditional data management technology

  4. How Much Data? • Google processes 100 PB (1 PB = 10^15 bytes) a day (2013) • Facebook has 300 PB of user data + 500 TB/day (2013) • YouTube stores about 1000 PB of video (2013) • CERN’s LHC (Large Hadron Collider) generates 15 PB a year (2013) • “640K ought to be enough for anybody”

  5. Who Cares? • Organizations and companies that can leverage large-scale consumer-generated data • Consumer markets (hotels, airlines, Amazon, Netflix) • Social media (Facebook, Twitter, LinkedIn, YouTube) • Search providers (Google, Microsoft) • Other enterprises are slowly getting into it • Healthcare • Financial institutions • ……

  6. Why not RDBMS? • Types of data • Structured data or transactions, text data, semi-structured data, unstructured data, streaming data, …… • Ways to cook data • Aggregation and statistics • Indexing, searching and querying • Knowledge discovery • Limitations • Very difficult to scale out (though easier to scale up) • Physically limited by CPU, memory and disk capacity • Requires data structured as tables with rows and columns • Table schemas must be pre-defined

  7. What We Need… • A Distributed System • Scalable • Fault-tolerant • Easy to program • Applicable to many real-world Big Data problems • …… Here comes MapReduce

  8. General Idea • Divide & Conquer (diagram: the “work” is partitioned into units w1, w2, w3; each unit is handled by a “worker” producing a partial result r1, r2, r3; the partial results are combined into the final “result”)

  9. Scale Out Over Many Machines • Challenges • Workload partitioning: how do we assign work units to workers? • Load balancing: what if we have more work units than workers? • Synchronization: what if workers need to share partial results? • Aggregation: how do we aggregate partial results? • Termination: how do we know all the workers have finished? • Fault tolerance: what if workers die? • Common theme • Communication between workers (e.g., to exchange states) • Access to shared resources (e.g., data)

  10. Existing Methods • Programming models • Shared memory (pthreads) • Message passing (MPI) • Design Patterns • Master-slaves • Producer-consumer flows • Shared work queues (diagrams: shared memory with processes P1–P5 over a common memory; message passing among processes P1–P5; a master dispatching a work queue to slaves; producer-consumer flows)

  11. Problem with Current Solutions • Lots of programming work • communication and coordination • work partitioning • status reporting • optimization • locality • Repeat for every problem you want to solve • Stuff breaks • One server may stay up three years (1,000 days) • If you have 10,000 servers, expect to lose 10 a day

  12. MapReduce: General Ideas • Typical procedure: • Iterate over a large number of records • Extract something of interest from each • Shuffle and sort intermediate results • Aggregate intermediate results • Generate final output • Key idea: provide a functional abstraction for the two key operations, Map and Reduce • map (k, v) → <k’, v’>* • reduce (k’, <v’>*) → <k’’, v’’>* • All values with the same key are sent to the same reducer • The execution framework handles everything else…
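
To make the abstraction concrete, the following is a minimal single-process sketch of this dataflow in plain Python (not Hadoop code, and not from the original deck): apply map to every record, group the intermediate pairs by key, and apply reduce to each group. All names here are illustrative.

    from itertools import groupby
    from operator import itemgetter

    def run_mapreduce(records, map_fn, reduce_fn):
        # Map phase: each input (k, v) may emit any number of (k', v') pairs.
        intermediate = []
        for k, v in records:
            intermediate.extend(map_fn(k, v))
        # "Shuffle and sort": bring all values for the same key together.
        intermediate.sort(key=itemgetter(0))
        output = []
        for key, group in groupby(intermediate, key=itemgetter(0)):
            output.extend(reduce_fn(key, [v for _, v in group]))
        return output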

  13. General Ideas (dataflow diagram: input pairs <k1,v1> … <k6,v6> go through four map tasks, which emit intermediate pairs such as a:1, b:2, c:3, c:6, a:5, c:2, b:7, c:8; the shuffle-and-sort step aggregates values by key, giving a:{1,5}, b:{2,7}, c:{2,3,6,8}; three reduce tasks then produce the outputs <r1,s1>, <r2,s2>, <r3,s3>)

  14. Two More Functions • Apart from Map and Reduce, the execution framework handles everything else… • Not quite…usually, programmers can also specify: • partition (k’, number of partitions) → partition for k’ • Divides up key space for parallel reduce operations • Often a simple hash of the key, e.g., hash(k’) mod n • combine(k’, v’) → <k’, v’>* • Mini-reducers that run in memory after the map phase • Used as an optimization to reduce network traffic
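
A minimal sketch of these two hooks in the same plain-Python style (illustrative names, not the Hadoop API): partition mirrors the default hash(k') mod n rule, and combine acts as a mini-reducer over one mapper's output.

    def partition(key, num_partitions):
        # All occurrences of a key must land in the same partition; Python's
        # hash() is stable within a single run, which suffices for a sketch.
        return hash(key) % num_partitions

    def combine(key, values):
        # Local, in-memory aggregation of one mapper's output (here:
        # pre-summing counts) to reduce the data shipped over the network.
        yield (key, sum(values))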

  15. (dataflow diagram extending slide 13: after each map task a combiner locally aggregates its output, e.g. c:3 and c:6 from one mapper collapse to c:9; a partitioner then assigns each key to a reducer before the shuffle-and-sort, which yields a:{1,5}, b:{2,7}, c:{2,9,8} at the three reduce tasks, producing <r1,s1>, <r2,s2>, <r3,s3>)

  16. Importance of Local Aggregation • Ideal scaling characteristics: • Twice the data, twice the running time • Twice the resources, half the running time • Why can’t we achieve this? • Synchronization requires communication • Communication kills performance • Thus… avoid communication! • Reduce intermediate data via local aggregation • Combiners can help

  17. Example: Word Count v1.0 • Input: {<document-id, document-contents>} • Output: <word, num-occurrences-in-web>. e.g. <“obama”, 1000>
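
The Word Count v1.0 code on the original slide is an image and is not captured in this transcript. A minimal sketch of the usual baseline, as plain-Python map and reduce functions compatible with the driver sketched earlier:

    def wc_map(doc_id, contents):
        # Emit (word, 1) for every token in the document.
        for word in contents.split():
            yield (word, 1)

    def wc_reduce(word, counts):
        # After the shuffle, all counts for one word arrive together.
        yield (word, sum(counts))

For example, run_mapreduce([("doc1", "obama is the president")], wc_map, wc_reduce) yields [("is", 1), ("obama", 1), ("president", 1), ("the", 1)], matching the trace on the next slide.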

  18. Word Count v1.0 example • Map input: <doc1, “obama is the president”>, <doc2, “hennesy is the president of stanford”>, …, <docn, “this is an example”> • Map output: <“obama”, 1>, <“is”, 1>, <“the”, 1>, <“president”, 1>, <“hennesy”, 1>, <“is”, 1>, <“the”, 1>, …, <“this”, 1>, <“is”, 1>, <“an”, 1>, <“example”, 1> • Group by reduce key: <“obama”, {1}>, <“is”, {1, 1, 1}>, <“the”, {1, 1}>, … • Reduce output: <“obama”, 1>, <“is”, 3>, <“the”, 2>, …

  19. Word Count v2.0
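
The v2.0 code is likewise an image on the original slide and is not in the transcript. A plausible refinement, assuming (per slides 16 and 20) that v2.0 adds local aggregation, is to pre-sum counts inside the mapper so far fewer (word, 1) pairs cross the network; the reducer stays unchanged.

    from collections import Counter

    def wc_map_v2(doc_id, contents):
        # Aggregate within the document before emitting anything.
        for word, count in Counter(contents.split()).items():
            yield (word, count)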

  20. Combiner Design • Combiners and reducers share the same method signatures • Sometimes, reducers can serve as combiners • Often, not… • Remember: combiners are optional optimizations • They should not affect algorithm correctness • They may be run 0, 1, or multiple times • Example: find the average of all integers associated with the same key

  21. Computing the Mean v1.0 • Why can’t we use the reducer as a combiner?
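
The v1.0 code is an image on the original slide; the sketch below shows the obvious version the question refers to. The reducer cannot double as a combiner because a mean of partial means is not the overall mean: mean(mean(1, 2), mean(3, 4, 5)) = mean(1.5, 4.0) = 2.75, but mean(1, 2, 3, 4, 5) = 3.0.

    def mean_map_v1(key, value):
        # Identity mapper: pass each (key, number) pair through unchanged.
        yield (key, value)

    def mean_reduce_v1(key, values):
        # Correct only if it sees *all* values for the key, i.e. only as a reducer.
        values = list(values)
        yield (key, sum(values) / len(values))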

  22. Computing the Mean v2.0 • Why doesn’t this work? • combiners must have the same input and output key-value type, which also must be the same as the mapper output type and the reducer input type
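
A sketch of the flaw being described, assuming v2.0 keeps the identity mapper of v1.0 but adds a combiner that emits (sum, count) pairs. Because a combiner may run zero or more times, the reducer would receive plain numbers when no combiner ran and (sum, count) pairs when one did, so its input type is not well defined; this is exactly the contract violation named above.

    def mean_combine_v2(key, values):
        # Emits (sum, count) pairs -- a different type than the mapper's output.
        values = list(values)
        yield (key, (sum(values), len(values)))

    def mean_reduce_v2(key, pairs):
        # Assumes (sum, count) pairs; breaks whenever the combiner did not run.
        total = sum(s for s, _ in pairs)
        count = sum(c for _, c in pairs)
        yield (key, total / count)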

  23. Computing the Mean v3.0
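
The v3.0 code is also an image on the original slide. The standard fix, sketched here, is to make the mapper itself emit (sum, count) pairs, so the combiner's input and output types match the mapper's output and the reducer's input, and pair-wise summation stays correct no matter how many times the combiner runs.

    def mean_map_v3(key, value):
        # Every value becomes a (sum, count) pair.
        yield (key, (value, 1))

    def mean_combine_v3(key, pairs):
        sums, counts = zip(*pairs)
        yield (key, (sum(sums), sum(counts)))

    def mean_reduce_v3(key, pairs):
        sums, counts = zip(*pairs)
        yield (key, sum(sums) / sum(counts))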

  24. MapReduce Runtime • Handles scheduling • Assigns workers to map and reduce tasks • Handles “data distribution” • Moves processes to data • Handles synchronization • Gathers, sorts, and shuffles intermediate data • Handles errors and faults • Detects worker failures and restarts • Everything happens on top of a distributed FS

  25. Execution (diagram of a MapReduce job: (1) the user program submits the job to the master; (2) the master schedules map tasks and reduce tasks onto workers; (3) map workers read their input splits (split 0 – split 4); (4) they write intermediate data to local disk; (5) reduce workers remote-read that intermediate data; (6) they write the final output files (output file 0, output file 1). Phases: input files → map phase → intermediate files on local disk → reduce phase → output files)

  26. Implementation • Google has a proprietary implementation in C++ • Bindings in Java, Python • Hadoop is an open-source implementation in Java • Development led by Yahoo, used in production • Now an Apache project • Rapidly expanding software ecosystem • Lots of custom research implementations • For GPUs, cell processors, etc.

  27. Distributed File System • Don’t move data to workers… move workers to the data! • Store data on the local disks of nodes in the cluster • Start up the workers on the node that has the data local • Why? • Not enough RAM to hold all the data in memory • Disk access is slow, but disk throughput (data transfer rate) is reasonable • A distributed file system is the answer • GFS (Google File System) for Google’s MapReduce • HDFS (Hadoop Distributed File System) for Hadoop

  28. GFS • Commodity hardware over “exotic” hardware • Scale “out”, not “up” • Scale out (horizontally): add more nodes to a system • Scale up (vertically): add resources to a single node in a system • High component failure rates • Inexpensive commodity components fail all the time • “Modest” number of huge files • Multi-gigabyte files are common, if not encouraged • Files are write-once, mostly appended to • Perhaps concurrently • Large streaming reads over random access • High sustained throughput over low latency

  29. Seeks vs. Scans • Consider a 1 TB database with 100-byte records • We want to update 1 percent of the records • Scenario 1: random access • Each update takes ~30 ms (seek, read, write) • 10^8 updates = ~35 days • Scenario 2: rewrite all records • Assume 100 MB/s throughput • Time = 5.6 hours(!) • Lesson: avoid random seeks!
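
A back-of-the-envelope check of the slide's numbers (plain Python; the 5.6-hour figure appears to assume reading and then rewriting the full terabyte, which is the assumption used below):

    records     = 10**12 // 100            # 1 TB of 100-byte records = 10^10 records
    updates     = records // 100           # updating 1% of them = 10^8 updates
    random_days = updates * 0.030 / 86400  # ~30 ms per seek+read+write
                                           # ~34.7 days, i.e. the slide's "~35 days"

    throughput  = 100 * 10**6              # 100 MB/s sequential throughput
    rewrite_hrs = 2 * 10**12 / throughput / 3600
                                           # read + rewrite 1 TB ~ 5.6 hours
    print(round(random_days, 1), round(rewrite_hrs, 1))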

  30. GFS • Files stored as chunks • Fixed size (64MB) • Reliability through replication • Each chunk replicated across 3+ chunk servers • Single master to coordinate access, keep metadata • Simple centralized management • No data caching • Little benefit due to large datasets, streaming reads • Simplify the API • Push some of the issues onto the client (e.g., data layout)

  31. Relational Databases vs. MapReduce • Relational databases: • Multipurpose: analysis and transactions; batch and interactive • Data integrity via ACID transactions • Lots of tools in software ecosystem (for ingesting, reporting, etc.) • Supports SQL (and SQL integration, e.g., JDBC) • Automatic SQL query optimization • MapReduce (Hadoop): • Designed for large clusters, fault tolerant • Data is accessed in “native format” • Supports many query languages • Programmers retain control over performance • Open source

  32. Workloads • OLTP (online transaction processing) • Typical applications: e-commerce, banking, airline reservations • User facing: real-time, low latency, highly-concurrent • Tasks: relatively small set of “standard” transactional queries • Data access pattern: random reads, updates, writes (involving relatively small amounts of data) • OLAP (online analytical processing) • Typical applications: business intelligence, data mining • Back-end processing: batch workloads, less concurrency • Tasks: complex analytical queries, often ad hoc • Data access pattern: table scans, large amounts of data involved per query

  33. Relational Algebra in MapReduce • Projection • Map over tuples, emit new tuples with appropriate attributes • No reducers, unless for regrouping or resorting tuples • Alternatively: perform in reducer, after some other processing • Selection • Map over tuples, emit only tuples that meet criteria • No reducers, unless for regrouping or resorting tuples • Alternatively: perform in reducer, after some other processing
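
Map-only sketches of both operators in the same plain-Python style, with tuples represented as dicts; the attribute names and the predicate are illustrative only.

    def project_map(key, tup):
        # Projection: keep only the attributes of interest.
        yield (key, {attr: tup[attr] for attr in ("url", "time")})

    def select_map(key, tup):
        # Selection: emit the tuple only if it satisfies the predicate.
        if tup["time"] > 60:
            yield (key, tup)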

  34. Relational Algebra in MapReduce • Group by • Example: What is the average time spent per URL? • In SQL: • SELECT url, AVG(time) FROM visits GROUP BY url • In MapReduce: • Map over tuples, emit time, keyed by url • Framework automatically groups values by keys • Compute average in reducer • Optimize with combiners
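
A sketch of this query in the same style, reusing the (sum, count) trick from the mean example so a combiner can safely be applied; the layout of a visit record is an assumption.

    def visits_map(visit_id, visit):
        # visit is assumed to be a dict with "url" and "time" fields.
        yield (visit["url"], (visit["time"], 1))

    def visits_combine(url, pairs):
        sums, counts = zip(*pairs)
        yield (url, (sum(sums), sum(counts)))

    def visits_reduce(url, pairs):
        sums, counts = zip(*pairs)
        yield (url, sum(sums) / sum(counts))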

  35. Join in MapReduce • Reduce-side Join: group by join key • Map over both sets of tuples • Emit tuple as value with join key as the intermediate key • Execution framework brings together tuples sharing the same key • Perform actual join in reducer • Similar to a “sort-merge join” in database terminology
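
A sketch of the reduce-side join in the same plain-Python style: tag each tuple with its source relation, key it by the join attribute, and pair up R and S tuples inside the reducer. The dict-based tuple representation, the tag convention, and the attribute name are illustrative.

    def rs_join_map(relation, tup, join_attr="k"):
        # relation is "R" or "S"; the tag travels with the tuple through the shuffle.
        yield (tup[join_attr], (relation, tup))

    def rs_join_reduce(key, tagged_tuples):
        # No guarantee whether R or S tuples arrive first (see the next slide),
        # so buffer them by source before forming the cross product.
        r_side, s_side = [], []
        for tag, tup in tagged_tuples:
            (r_side if tag == "R" else s_side).append(tup)
        for r in r_side:
            for s in s_side:
                yield (key, {**r, **s})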

  36. Reduce-side Join: Example (diagram: the map phase emits tuples R1 and R4 from R and S2 and S3 from S, each keyed by its join key; the shuffle groups tuples that share a join key, e.g. R1 together with S2 and S3, with R4 grouped under its own key) • Note: there is no guarantee whether an R tuple or an S tuple arrives first

  37. Join in MapReduce • Map-side Join: parallel scans (diagram: relations R1–R4 and S1–S4, both sorted by the join key, are scanned side by side) • Assume the two datasets are sorted by the join key • A sequential scan through both datasets performs the join (called a “merge join” in database terminology)

  38. Join in MapReduce • Map-side Join • If the datasets are sorted by the join key, the join can be accomplished by a scan over both datasets • How can we accomplish this in parallel? • Partition and sort both datasets in the same manner • In MapReduce: • Map over one dataset, reading the corresponding partition of the other • No reducers necessary (unless to repartition or resort)
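
A sketch of the map-side join as a single-pass merge over one pair of co-partitioned, sorted inputs, which is what one mapper would see; tuples are dicts and the join attribute name is illustrative.

    def merge_join(r_part, s_part, join_attr="k"):
        # r_part and s_part are lists of dicts, both sorted on join_attr.
        i = j = 0
        while i < len(r_part) and j < len(s_part):
            rk, sk = r_part[i][join_attr], s_part[j][join_attr]
            if rk < sk:
                i += 1
            elif rk > sk:
                j += 1
            else:
                # Emit every S tuple sharing this key, then advance R.
                j2 = j
                while j2 < len(s_part) and s_part[j2][join_attr] == rk:
                    yield {**r_part[i], **s_part[j2]}
                    j2 += 1
                i += 1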

  39. Join in MapReduce • In-memory Join • Basic idea: load one dataset into memory, stream over other dataset • Works if R << S and R fits into memory • Called a “hash join” in database terminology • MapReduce implementation • Distribute R to all nodes • Map over S, each mapper loads R in memory, hashed by join key • For every tuple in S, look up join key in R • No reducers, unless for regrouping or resorting tuples
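
A sketch of the in-memory ("hash") join: the small relation R is assumed to fit in each mapper's memory, so it is loaded into a hash table once and the large relation S is streamed past it, with no reduce phase at all. Names are illustrative.

    from collections import defaultdict

    def build_r_table(r_tuples, join_attr="k"):
        # Runs once per mapper, before any S tuples are processed.
        table = defaultdict(list)
        for r in r_tuples:
            table[r[join_attr]].append(r)
        return table

    def hash_join_map(s_tup, r_table, join_attr="k"):
        # Called for every S tuple streamed through the mapper.
        for r in r_table.get(s_tup[join_attr], []):
            yield {**r, **s_tup}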

  40. Which Join Algorithm to Use? • In-memory join > map-side join > reduce-side join • Why? • Limitations of each? • In-memory join: memory • Map-side join: sort order and partitioning • Reduce-side join: general purpose

  41. Processing Relational Data: Summary • MapReduce algorithms for processing relational data: • Group by, sorting, partitioning are handled automatically by shuffle/sort in MapReduce • Selection, projection, and other computations (e.g., aggregation), are performed either in mapper or reducer • Multiple strategies for relational joins • Complex operations require multiple MapReduce jobs • Example: top ten URLs in terms of average time spent • Opportunities for automatic optimization
