A Comparison of Approaches to Large-Scale Data Analysis

A Comparison of Approaches to Large-Scale Data Analysis. Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt, Samuel Madden, Michael Stonebraker. SIGMOD 2009. 2009-10-09. Summarized by Jaeseok Myung, Intelligent Database Systems Lab.

Presentation Transcript


  1. A Comparison of Approaches to Large-Scale Data Analysis. Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt, Samuel Madden, Michael Stonebraker. SIGMOD 2009. 2009-10-09. Summarized by Jaeseok Myung, Intelligent Database Systems Lab, School of Computer Science & Engineering, Seoul National University, Seoul, Korea

  2. MapReduce vs. Parallel DBMS Center for E-Business Technology

  3. MapReduce 한재선, SearchDay2008, http://nexr.tistory.com

  4. Architectural Differences

  5. Benchmark Environment (1/2)
     • Systems
       • Hadoop: the most popular open-source MR implementation
       • DBMS-X: a parallel DBMS that stores data in a row-based format
       • Vertica: a column-based parallel DBMS
       • All three systems were deployed on a 100-node cluster
     • Analytical Tasks
       • Data Loading
       • Selection Task
       • Aggregation Task
       • Join Task
       • UDF Aggregation Task

  6. Benchmark Environment (2/2)
     • Dataset
       • Documents: 600,000 unique documents for each node
       • 155 million UserVisits records (20GB/node)
       • 18 million Rankings records (1GB/node)

  7. 1. Data Loading [Figure annotations: reorganization; loading time]

  8. 2. Selection Task
     • The selection task is a lightweight filter that finds the pageURLs in the Rankings table (1GB/node) with a pageRank above a user-defined threshold
     • Query
       • SELECT pageURL, pageRank FROM Rankings WHERE pageRank > x;
       • x = 10, which yields approximately 36,000 records per data file on each node
     • For MR, the same task was implemented in Java
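The MR side of the selection task can be pictured as a map-only filter. The following is a minimal Python sketch of that idea, not the benchmark's Java code: the `pageURL|pageRank` line layout, the function name `selection_map`, and the sample rows are all assumptions made for illustration.

```python
from typing import Iterable, Iterator, Tuple


def selection_map(lines: Iterable[str], threshold: int) -> Iterator[Tuple[str, int]]:
    """Map-only filter: emit (pageURL, pageRank) for rows with pageRank > threshold."""
    for line in lines:
        # Assumed layout: "pageURL|pageRank" (hypothetical, for illustration only).
        page_url, page_rank = line.rstrip("\n").split("|")[:2]
        rank = int(page_rank)
        if rank > threshold:
            yield page_url, rank


# Hypothetical sample rows standing in for one Rankings data file.
sample_rankings = ["a.example/1|5", "b.example/2|42", "c.example/3|10", "d.example/4|11"]
selected = list(selection_map(sample_rankings, threshold=10))
```

No reduce step is needed for a pure filter; each mapper writes its own output, so producing one combined result file takes a separate pass.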

  9. 2. Selection Task - Result [Figure annotations: time for combining the output into a single file (additional MR job); processing time]

  10. 3. Aggregation Task
      • The aggregation task calculates the total adRevenue generated for each sourceIP in the UserVisits table (20GB/node), grouped by the sourceIP column
      • Query
        • SELECT sourceIP, SUM(adRevenue) FROM UserVisits GROUP BY sourceIP;
        • This task always produces 2.5 million records
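The GROUP BY query above corresponds to the classic map, shuffle, reduce pattern. Below is a small single-process Python sketch of that pattern, assuming records are dictionaries with `sourceIP` and `adRevenue` keys; the helper names (`agg_map`, `shuffle`, `agg_reduce`) and the sample data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def agg_map(record: dict) -> Iterator[Tuple[str, float]]:
    """Map: emit (sourceIP, adRevenue) for each UserVisits record."""
    yield record["sourceIP"], record["adRevenue"]


def shuffle(pairs: Iterable[Tuple[str, float]]) -> Dict[str, List[float]]:
    """Shuffle: group all emitted values by key, as the framework would."""
    groups: Dict[str, List[float]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def agg_reduce(key: str, values: List[float]) -> Tuple[str, float]:
    """Reduce: sum the revenue values collected for one sourceIP."""
    return key, sum(values)


# Hypothetical sample records standing in for UserVisits.
visits = [
    {"sourceIP": "1.1.1.1", "adRevenue": 1.5},
    {"sourceIP": "2.2.2.2", "adRevenue": 2.0},
    {"sourceIP": "1.1.1.1", "adRevenue": 2.5},
]
emitted = (pair for record in visits for pair in agg_map(record))
revenue_by_ip = dict(agg_reduce(k, v) for k, v in shuffle(emitted).items())
```

Because summation is associative, the same reduce function could also run as a combiner on each node to shrink the data sent over the network.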

  11. 3. Aggregation Task - Result

  12. 4. Join Task
      • The join task consists of two sub-tasks that perform a complex calculation on two data sets
      • In the first part of the task, each system must find the sourceIP that generated the most revenue within a particular date range
      • Once these intermediate records are generated, the system must then calculate the average pageRank of all the pages visited during this interval
      • Query
        • SELECT INTO Temp sourceIP, AVG(pageRank) as avgPageRank, SUM(adRevenue) as totalRevenue FROM Rankings AS R, UserVisits AS UV WHERE R.pageURL = UV.destURL AND UV.visitDate BETWEEN Date('2000-01-15') AND Date('2000-01-22') GROUP BY UV.sourceIP;
        • SELECT sourceIP, totalRevenue, avgPageRank FROM Temp ORDER BY totalRevenue DESC LIMIT 1;
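Without a query optimizer, the two SQL statements above have to be spelled out as explicit phases. The Python sketch below walks through one possible decomposition in a single process (filter and join, aggregate per sourceIP, pick the top row). It is an illustration under assumptions, not the benchmark's MR program: the tuple layouts, the in-memory hash join, and the function name `join_task` are all hypothetical.

```python
from collections import defaultdict
from datetime import date


def join_task(rankings, user_visits, start, end):
    """Return (sourceIP, totalRevenue, avgPageRank) for the top-revenue sourceIP."""
    # Phase 1: filter visits by date range and join with Rankings on the URL.
    rank_by_url = {url: rank for url, rank in rankings}
    per_ip = defaultdict(lambda: [0.0, 0, 0])  # [revenue sum, pageRank sum, row count]
    for source_ip, dest_url, visit_date, ad_revenue in user_visits:
        if start <= visit_date <= end and dest_url in rank_by_url:
            acc = per_ip[source_ip]
            acc[0] += ad_revenue
            acc[1] += rank_by_url[dest_url]
            acc[2] += 1
    # Phase 2: compute totalRevenue and avgPageRank per sourceIP (the Temp table).
    temp = [(ip, rev, rank_sum / n) for ip, (rev, rank_sum, n) in per_ip.items()]
    # Phase 3: ORDER BY totalRevenue DESC LIMIT 1.
    return max(temp, key=lambda row: row[1]) if temp else None


# Hypothetical sample data: (pageURL, pageRank) and
# (sourceIP, destURL, visitDate, adRevenue).
rankings = [("a", 10), ("b", 30), ("c", 50)]
user_visits = [
    ("1.1.1.1", "a", date(2000, 1, 16), 1.0),
    ("1.1.1.1", "b", date(2000, 1, 20), 2.0),
    ("2.2.2.2", "c", date(2000, 1, 17), 2.5),
    ("2.2.2.2", "c", date(2000, 2, 1), 9.0),  # outside the date range, ignored
]
top = join_task(rankings, user_visits, date(2000, 1, 15), date(2000, 1, 22))
```

In a distributed setting the hash table in phase 1 would not fit this simple form, since both inputs are partitioned across nodes; the sketch only shows the logical steps.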

  13. 4. Join Task - Result

  14. 5. UDF Aggregation Task
      • The final task is to compute the inlink count for each document in the dataset
      • Query
        • SELECT INTO Temp F(contents) FROM Documents;
        • F: a user-defined function that parses the contents of each record in the Documents table and emits URLs into the database
        • With this function F, we populate a temporary table with a list of URLs and can then execute a simple query to calculate the inlink count
        • SELECT url, SUM(value) FROM Temp GROUP BY url;
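In MR terms, the function F plays the role of the map step and the SUM ... GROUP BY query plays the role of the reduce step. The Python sketch below illustrates this under assumptions: the URL regular expression, the function names `extract_urls` and `inlink_counts`, and the sample documents are hypothetical, and the pattern is far simpler than a real HTML link parser.

```python
import re
from collections import Counter
from typing import Iterable, Iterator, Tuple

# Simplified URL pattern, assumed for illustration only.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def extract_urls(contents: str) -> Iterator[Tuple[str, int]]:
    """Stand-in for F: emit (url, 1) for every URL found in a document."""
    for url in URL_PATTERN.findall(contents):
        yield url, 1


def inlink_counts(documents: Iterable[str]) -> Counter:
    """Equivalent of SELECT url, SUM(value) FROM Temp GROUP BY url."""
    counts: Counter = Counter()
    for contents in documents:
        for url, value in extract_urls(contents):
            counts[url] += value
    return counts


# Hypothetical sample documents.
documents = [
    "see http://a.example/x and http://b.example/y",
    "link http://a.example/x",
    "no links here",
]
counts = inlink_counts(documents)
```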

  15. 5. UDF Aggregation Task - Result

  16. Conclusion: MapReduce < Parallel DBMS

  17. HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads. Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel Abadi, Avi Silberschatz, Alexander Rasin. VLDB 2009. 2009-10-09. Summarized by Jaeseok Myung, Intelligent Database Systems Lab, School of Computer Science & Engineering, Seoul National University, Seoul, Korea

  18. HadoopDB
      • The Basic Idea (An Architectural Hybrid of MR & DBMS)
        • Use MR as the communication layer above multiple nodes running single-node DBMS instances
        • Queries are expressed in SQL and translated into MR by extending existing tools, and as much work as possible is pushed into the higher-performing single-node databases
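The push-down idea can be illustrated with a toy example: each "node" runs the aggregation locally in its own database, and a reduce-style step only merges the partial results. The Python sketch below uses in-memory `sqlite3` databases purely as self-contained stand-ins for the single-node DBMS instances; the helper names (`make_node`, `pushed_down_map`, `merge_reduce`) and the sample rows are hypothetical and do not reflect HadoopDB's actual components.

```python
import sqlite3
from collections import defaultdict


def make_node(rows):
    """Create an in-memory database standing in for one single-node DBMS."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE UserVisits (sourceIP TEXT, adRevenue REAL)")
    conn.executemany("INSERT INTO UserVisits VALUES (?, ?)", rows)
    return conn


def pushed_down_map(conn):
    """Map step: the node's database does the local GROUP BY itself."""
    query = "SELECT sourceIP, SUM(adRevenue) FROM UserVisits GROUP BY sourceIP"
    return conn.execute(query).fetchall()


def merge_reduce(partials):
    """Reduce step: merge per-node partial sums into global totals."""
    totals = defaultdict(float)
    for partial in partials:
        for source_ip, partial_sum in partial:
            totals[source_ip] += partial_sum
    return dict(totals)


# Two hypothetical nodes, each holding one partition of UserVisits.
nodes = [
    make_node([("1.1.1.1", 1.0), ("2.2.2.2", 2.0)]),
    make_node([("1.1.1.1", 0.5), ("3.3.3.3", 4.0)]),
]
totals = merge_reduce(pushed_down_map(conn) for conn in nodes)
```

The MR layer here moves only one row per sourceIP per node rather than every raw record, which is the benefit the slide attributes to pushing work into the databases.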

  19. The Architecture of HadoopDB

  20. HadoopDB – Join Task
