HadoopDB

HadoopDB. Presenters: Serva Rashidyan, Somaie Shahrokhi, Aida Parbale. Spring 2012. HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads. ABSTRACT. The production environment for analytical data management applications is rapidly changing.


Presentation Transcript


  1. HadoopDB Presenters: Serva Rashidyan, Somaie Shahrokhi, Aida Parbale. Spring 2012. Azad University of Sanandaj

  2. HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads

  3. ABSTRACT The production environment for analytical data management applications is rapidly changing. The amount of data that needs to be analyzed is exploding, requiring hundreds to thousands of machines to work in parallel to perform the analysis.

  4. ABSTRACT There tend to be two schools of thought regarding what technology to use for data analysis in such an environment: • Parallel databases • MapReduce-based systems

  5. ABSTRACT Given the exploding data problem, all but three of the above-mentioned analytical database start-ups deploy their DBMS on a shared-nothing architecture.

  6. DESIRED PROPERTIES The desired properties of a system designed for performing data analysis: • Performance • Fault tolerance • Ability to run in a heterogeneous environment • Flexible query interface

  7. Parallel DBMSs Parallel database systems stem from research performed in the late 1980s, and most current systems are designed similarly to the early parallel DBMS research projects.

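  8. MapReduce MapReduce was introduced by Dean et al. in 2004. MapReduce processes data distributed (and replicated) across many nodes in a shared-nothing cluster via three basic operations: Map tasks run in parallel on each node, the intermediate data is repartitioned across the nodes, and Reduce tasks run in parallel on the partitions each node receives.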
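A minimal single-process sketch of the three MapReduce operations, using word counting as the job. The function names (`map_phase`, `repartition`, `reduce_phase`) and the sample input are hypothetical and chosen only for illustration; a real cluster runs these steps in parallel across many nodes, which this sketch does not attempt to model.

```python
from collections import defaultdict


def map_phase(records, map_fn):
    """Apply map_fn to every input record, yielding (key, value) pairs."""
    for record in records:
        yield from map_fn(record)


def repartition(pairs):
    """Group values by key; stands in for the shuffle between nodes."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reduce_fn):
    """Apply reduce_fn to each key and the list of values collected for it."""
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def word_count_map(line):
    # Emit (word, 1) for every word in the line.
    for word in line.split():
        yield word.lower(), 1


def word_count_reduce(key, values):
    # Sum the partial counts for one word.
    return sum(values)


lines = ["Hadoop runs MapReduce", "HadoopDB runs on Hadoop"]
pairs = map_phase(lines, word_count_map)
counts = reduce_phase(repartition(pairs), word_count_reduce)
print(counts)
```

In this sketch the grouping step is an in-memory dictionary; in a shared-nothing cluster the same grouping is what forces data to move over the network between the Map and Reduce steps.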

  9. HADOOPDB The goal of this design is to achieve all of the desired properties described earlier. The basic idea behind HadoopDB is to connect multiple single-node database systems using Hadoop as the task coordinator and network communication layer.

  10. HadoopDB's Components • Database Connector • Catalog • Data Loader • SQL to MapReduce to SQL (SMS) Planner

  11. Consider the following query: SELECT YEAR(saleDate) AS Years, SUM(revenue) AS Sum FROM sales GROUP BY Years;
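One way to picture how such a query could be split between single-node databases and a coordinating layer is sketched below. It is an illustration under stated assumptions, not HadoopDB's actual code: two in-memory SQLite databases stand in for the single-node DBMSs, a plain Python loop stands in for Hadoop, the sample rows are made up, and `strftime('%Y', ...)` replaces `YEAR(...)` because SQLite has no `YEAR` function. The alias `total` is used in place of `Sum` for clarity.

```python
import sqlite3
from collections import defaultdict


def make_node(rows):
    """Create one hypothetical single-node database holding a slice of sales."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (saleDate TEXT, revenue REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn


# Two nodes, each with its own partition of the sales table (made-up data).
nodes = [
    make_node([("2008-03-01", 100.0), ("2009-07-15", 50.0)]),
    make_node([("2008-11-20", 25.0), ("2009-01-05", 75.0)]),
]

# SQL pushed down to every node: a local, partial aggregation per year.
LOCAL_SQL = """
    SELECT strftime('%Y', saleDate) AS Years, SUM(revenue) AS total
    FROM sales
    GROUP BY Years
"""

# "Map" side: each node runs the SQL against its own data.
partials = []
for conn in nodes:
    partials.extend(conn.execute(LOCAL_SQL).fetchall())

# "Reduce" side: merge the per-node partial sums into final totals.
totals = defaultdict(float)
for year, partial_sum in partials:
    totals[year] += partial_sum

result = dict(sorted(totals.items()))
print(result)
```

The merge step is needed here because rows for the same year live on both nodes. If the data were partitioned by year, each node's local answer would already be final for its years.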

  12. Hive processes the above SQL query in a series of phases: • Parser • Semantic analyzer • Logical plan generator • Optimizer • Physical plan generator • XML plan

  13. BENCHMARKS

  14. Benchmarked Systems • Hadoop • HadoopDB • Vertica • DBMS-X

  15. Performance and Scalability Benchmarks • Data Loading • Grep Task • Selection Task • Aggregation Task • Join Task • UDF Aggregation Task

  16. Data Loading • Load Grep • Load UserVisits

  17. Grep Task SELECT * FROM Data WHERE field LIKE '%XYZ%';

  18. Selection Task SELECT pageURL, pageRank FROM Rankings WHERE pageRank > 10;

  19. Join Task The join task involves finding the average pageRank of the set of pages visited from the sourceIP that generated the most revenue. The key difference between this task and the previous tasks is that it must read in two different data sets and join them together (pageRank information is found in the Rankings table and revenue information is found in the UserVisits table).
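The shape of this join can be sketched on a single SQLite database, taking the sourceIP of interest to be the one with the highest total revenue. The column names `destURL` and `adRevenue`, the sample rows, and the single-node setup are assumptions made for the sketch, and any date filter on visits is left out. It shows only the logic of the query, not how any of the benchmarked systems execute it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Rankings (pageURL TEXT PRIMARY KEY, pageRank INTEGER);
    CREATE TABLE UserVisits (sourceIP TEXT, destURL TEXT, adRevenue REAL);
""")

# Made-up sample data: three ranked pages and three visits from two sources.
conn.executemany(
    "INSERT INTO Rankings VALUES (?, ?)",
    [("a.com", 10), ("b.com", 30), ("c.com", 50)],
)
conn.executemany(
    "INSERT INTO UserVisits VALUES (?, ?, ?)",
    [
        ("1.1.1.1", "a.com", 1.0),
        ("1.1.1.1", "b.com", 2.0),
        ("2.2.2.2", "c.com", 2.5),
    ],
)

# Join the two tables on the visited URL, aggregate per sourceIP,
# and keep the sourceIP with the highest total revenue.
JOIN_SQL = """
    SELECT UV.sourceIP,
           AVG(R.pageRank)   AS avgPageRank,
           SUM(UV.adRevenue) AS totalRevenue
    FROM Rankings AS R
    JOIN UserVisits AS UV ON R.pageURL = UV.destURL
    GROUP BY UV.sourceIP
    ORDER BY totalRevenue DESC
    LIMIT 1
"""

top = conn.execute(JOIN_SQL).fetchone()
print(top)
```

Unlike the grep and selection queries, this one cannot be answered from a single table, which is why the task has to read both data sets.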

  20. Summary of the results of this benchmark task

  21. UDF Aggregation Task The final task computes, for each document, the number of inward links from other documents in the Documents table. HadoopDB was able to store each document separately in the Documents table using the TEXT data type. DBMS-X processed each HTML document file separately.

  22. This overhead is not included

  23. Summary of Results Thus Far In the absence of failures or background processes, HadoopDB is able to approach the performance of the parallel database systems.

  24. Fault Tolerance and Heterogeneous Environment As described in Section 3, in large deployments of shared-nothing machines, individual nodes may experience high rates of failure or slowdown. For parallel databases, query processing time is usually determined by the time it takes for the slowest node to complete its task.

  25. The results of the experiments are shown in Fig

  26. Discussion It should be pointed out that although Vertica's percentage slowdown was larger than that of Hadoop and HadoopDB, its total query time (even with the failure or the slow node) was still lower than that of Hadoop or HadoopDB.

  27. Conclusion Our experiments show that HadoopDB is able to approach the performance of parallel database systems while achieving similar scores on fault tolerance, an ability to operate in heterogeneous environments, and software license cost as Hadoop.
