Ricardo: Integrating R and Hadoop

Presentation Transcript

  1. Ricardo: Integrating R and Hadoop • Angel Trifonov, Yun Lu, Ying Wang

  2. Contents • Introduction • Motivating Examples • Preliminaries • Ricardo Design • Experimental Study • Conclusion

  3. Introduction

  4. Data collection • Enterprise datasets • Why are these datasets important? • Statistical analysis on datasets • Data analyst workflow • Explore/summarize data • Build a model • Use it to improve business practices • Need a statistical package

  5. R and DMS • R design • Single server • Main memory • Large data → FAIL! • Problem for analysts – they work with large datasets • Vertical scalability • Subsets • Neither is ideal! • Large-scale data management systems (DMS) • Example: Hadoop • Aggregation processing

  6. Ricardo • Overview • Scalable platform for deep analytics • Part of the eXtreme Analytics Platform (XAP) project • Named after economist David Ricardo • Facilitates trading between R and Hadoop • Previous work on MapReduce • Small data – combined approach success • Several advantages

  7. Ricardo advantages • Familiar working environment – work within a statistical environment • Data attraction – Hadoop’s flexible data store together with the Jaql query language • Integration of data processing into the analytical workflow – handle large data by preprocessing and reducing it • Reliability and community support – built from open-source projects • Improved user code – facilitates better code • Deep analytics – can handle many kinds of advanced statistical analyses • No re-inventing of wheels – combine existing statistical and DMS technology

  8. Motivating examples

  9. Example 1: Simple trading • Analyst workflow: exploration • Graph shows movie perception over time • How does an analyst get this data visualization? • R is good for the job, BUT… • Ricardo can help!

  10. Example 2: Simple trading • Analyst workflow: evaluation – already have a model • Analysis must be on all the data • Ricardo can help once again • What did we see? • Simple trading • First case → pass to R • Second case → pass to Hadoop • More complicated analyses? No problem!

  11. Example 3: Complex trading • Analyst workflow: modeling • How? • Simple-trading scheme → no good • Losing information • Ricardo permits complex trading • Data needs decomposition • Small parts → handled by R • Large parts → handled by Hadoop • Consider an example • Latent-factor model • Each piece of data must be taken into account • Simple trading won’t work

  12. Latent-factor model

  13. Preliminaries

  14. The R project • Developed at the University of Auckland, New Zealand • Open-source language and statistical environment • Small maintenance team, but big popularity • Example of functionality: fit <- lm(df$mean ~ df$year); plot(df$year, df$mean); abline(fit) • Data frame: R’s equivalent of a relational table

  15. Large-scale DMS • Enterprise data warehouses – dominant type of DMS • Designed for clean/structured data – not good • Analysts want their data dirty • What to do? Use Hadoop! • Hadoop method • Hadoop Distributed File System (HDFS) • Operates on raw data files • Processes data according to MapReduce • Map phase results are fed to the reducers • Used successfully on large-scale datasets • Appealing alternative
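
The map-then-reduce flow named on this slide can be illustrated with a small in-memory sketch. This is not Hadoop code: the helper names (`map_phase`, `shuffle`, `reduce_phase`) and the sample input are hypothetical, and Python is used only for illustration. It shows how key/value pairs emitted by the map phase are grouped by key and then handed to a reducer.

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to every record and yield its (key, value) pairs."""
    for record in records:
        yield from mapper(record)

def shuffle(pairs):
    """Group all values by key, as happens between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Apply the reducer once per key to the list of values for that key."""
    return {key: reducer(key, values) for key, values in groups.items()}

# Hypothetical input: count word occurrences across raw text lines.
lines = ["r hadoop jaql", "hadoop jaql", "hadoop"]
pairs = map_phase(lines, lambda line: ((word, 1) for word in line.split()))
counts = reduce_phase(shuffle(pairs), lambda key, values: sum(values))
```

In a real cluster the map and reduce calls run on different worker nodes and the grouping step moves data across the network; the sketch only mirrors the data flow.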

  16. Jaql: A JSON Query Language • Hadoop drawback – programming interface • Several attempts to address this • Ricardo uses Jaql • Open-source dataflow language • Jaql scripts are automatically compiled • Operates directly on data files • JSON view: [{ customer: "Michael", movie: { name: "About Schmidt", year: 2002 }, rating: 5 }, ...] • Jaql query: read("ratings") -> group by year = $.movie.year into { year, mean: avg($[*].rating) } -> sort by [ $.year ]
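
To make the query on this slide concrete, here is a rough Python equivalent of what it computes: group the ratings by movie year, average the ratings in each group, and sort by year. Only the first record comes from the slide; the other records and the function name `mean_rating_by_year` are made up for illustration, and this is not how Jaql executes the query.

```python
from collections import defaultdict

# First record is from the slide; the remaining records are hypothetical.
ratings = [
    {"customer": "Michael", "movie": {"name": "About Schmidt", "year": 2002}, "rating": 5},
    {"customer": "Anna", "movie": {"name": "About Schmidt", "year": 2002}, "rating": 3},
    {"customer": "Anna", "movie": {"name": "Movie B", "year": 1999}, "rating": 4},
]

def mean_rating_by_year(records):
    """Group records by movie year and return the mean rating per year, sorted by year."""
    totals = defaultdict(lambda: [0.0, 0])  # year -> [sum of ratings, count]
    for record in records:
        acc = totals[record["movie"]["year"]]
        acc[0] += record["rating"]
        acc[1] += 1
    return [{"year": year, "mean": total / count}
            for year, (total, count) in sorted(totals.items())]

result = mean_rating_by_year(ratings)
```

The output has one small record per year, which is why this kind of aggregate is cheap to ship back to R even when the ratings file itself is very large.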

  17. Ricardo design

  18. Problem Statement • How to bridge between them? • Hadoop – Advantage: large-scale processing • Disadvantage: insufficient analytical functionality • R – Advantages: statistical software, data analysis • Disadvantages: operates in main memory, limited data

  19. Ricardo Design

  20. Ricardo Design • R driver: • Not memory-resident • Does R need memory to store some data? • Hadoop: • Performance operations • Stores data in HDFS • R-Jaql bridge: • Connects the R driver and the Hadoop cluster • Executes queries (what kind of query?) • Sends the results back to R as data frames • Allows Jaql queries to spawn R processes on Hadoop worker nodes

  21. R-Jaql Bridge • Components: • An R package (Jaql R) and a Jaql module (R Jaql) • [Diagram: data flow between R and Hadoop]

  22. Ricardo Workflow • Analyst’s typical workflow • Data exploration • Preliminary observation • Simple trading • Model building • Deep analytics • Complex trading • Model evaluation • Quality of models • Simple trading • Why is model building complex trading?

  23. Review Example • Movie recommendation • Data exploration – simple trading: linear regression • Model building – complex trading: latent-factor model

  24. Simple Trading – Linear Regression • Get data from Hadoop • Fit data
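
The "fit data" step can be sketched as an ordinary least-squares line through the per-year mean ratings, which is what `lm(df$mean ~ df$year)` does in the R example on slide 14. The sketch below is in Python for illustration only; the function name `fit_line` and the sample numbers are hypothetical, and it assumes the aggregated data has already come back from Hadoop as two small lists.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical aggregated data: one (year, mean rating) pair per year.
years = [2000, 2001, 2002, 2003]
means = [3.0, 3.5, 4.0, 4.5]
intercept, slope = fit_line(years, means)
```

Because the input is one point per year, the fit runs comfortably in main memory; the expensive pass over the raw ratings stays on the Hadoop side.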

  25. Simple Trading – Evaluate Model • Fit data • Select top 10 outliers
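
One way to picture the outlier selection is to score every record by its absolute residual under the fitted line and keep the largest few. The sketch below assumes a line with given `intercept` and `slope` and hypothetical (year, rating) records; the name `top_outliers` is made up, and in Ricardo the scoring would run over all the data on the Hadoop side rather than in a single process.

```python
import heapq

def top_outliers(records, intercept, slope, k=10):
    """Return the k (year, rating) records with the largest absolute residual."""
    def abs_residual(record):
        year, rating = record
        return abs(rating - (intercept + slope * year))
    return heapq.nlargest(k, records, key=abs_residual)

# Hypothetical records and model parameters.
records = [(2000, 3.0), (2001, 1.0), (2002, 4.0), (2003, 5.0)]
outliers = top_outliers(records, intercept=-997.0, slope=0.5, k=2)
```

Only the k selected records need to travel back to R, so the trade is still "simple": a large scan on one side, a small result on the other.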

  26. Complex Trading • Model building • Objectives

  27. Model Building • Randomly pick initial p and q • Set up the optimization method • Compute • Squared error (e) • The derivative of e with respect to p • The derivative of e with respect to q • Update p and q • Repeat until convergence
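
The loop on this slide can be sketched as plain gradient descent. The sketch makes several assumptions that the slides do not state: each customer i and each movie j has a single scalar factor (p[i] and q[j]), the predicted rating is their product, the step size is fixed, and the loop stops when the error no longer improves. The function names and sample ratings are hypothetical; Ricardo itself hands the optimization to an R routine and evaluates the error and gradient on Hadoop.

```python
import random

def fit_latent_factors(ratings, n_users, n_items, step=0.01,
                       max_iters=5000, tol=1e-12, seed=0):
    """Gradient descent on e = sum over ratings of (r - p[i] * q[j]) ** 2."""
    rng = random.Random(seed)
    # Randomly pick the starting factors.
    p = [rng.uniform(0.5, 1.5) for _ in range(n_users)]
    q = [rng.uniform(0.5, 1.5) for _ in range(n_items)]
    prev_error = float("inf")
    for _ in range(max_iters):
        error = 0.0
        grad_p = [0.0] * n_users
        grad_q = [0.0] * n_items
        for i, j, r in ratings:
            err = r - p[i] * q[j]
            error += err * err
            grad_p[i] += -2.0 * err * q[j]  # derivative of e with respect to p[i]
            grad_q[j] += -2.0 * err * p[i]  # derivative of e with respect to q[j]
        if prev_error - error < tol:  # no meaningful improvement: stop
            break
        prev_error = error
        p = [pi - step * g for pi, g in zip(p, grad_p)]
        q = [qj - step * g for qj, g in zip(q, grad_q)]
    return p, q

def squared_error(ratings, p, q):
    """Sum of squared errors of the current factors over all ratings."""
    return sum((r - p[i] * q[j]) ** 2 for i, j, r in ratings)

# Hypothetical ratings as (customer index, movie index, rating).
sample = [(0, 0, 2.0), (0, 1, 3.0), (1, 0, 4.0), (1, 1, 6.0)]
p, q = fit_latent_factors(sample, n_users=2, n_items=2)
final_error = squared_error(sample, p, q)
```

Every pass touches every rating, which is why this step cannot be reduced to a small aggregate up front and needs complex trading.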

  28. Model Building • Table r: stores ratings • Tables p and q: store latent factors • [Figure: tables q, r, and p]

  29. Details • Compute the sum of squared errors • Compute the gradient
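
The formulas from this slide did not survive in the transcript. As a hedged reconstruction, assuming one scalar factor per customer ($p_i$) and per movie ($q_j$), a predicted rating of $p_i q_j$, and sums running over the observed ratings $r_{ij}$ only, the two quantities would be:

```latex
% Sum of squared errors over the observed ratings
e(p, q) = \sum_{(i,j)} \left( r_{ij} - p_i q_j \right)^2

% Gradient components with respect to each factor
\frac{\partial e}{\partial p_i} = -2 \sum_{j} \left( r_{ij} - p_i q_j \right) q_j
\qquad
\frac{\partial e}{\partial q_j} = -2 \sum_{i} \left( r_{ij} - p_i q_j \right) p_i
```

Each sum is an aggregate over a join of the ratings table with the factor tables, which suits a large-scale DMS, while the optimizer that consumes these values only needs the much smaller vectors $p$ and $q$.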

  30. Other Models • Principal component analysis (PCA) • Compute eigenvectors and eigenvalues • Eigenvectors are mutually perpendicular • Generalized linear models (GLM) • Compute response variable • Expressed as a nonlinear function • …

  31. Implementations • Java Native Interface (JNI) as the bridge between C and Java • How to transfer data across JNI? • Naïve way • Better solution • Jaql wrapper handles data-representation incompatibilities • This is in the bridge • What is the component in the R-Jaql bridge right now?

  32. Experimental study

  33. Experimental study

  34. Experimental study

  35. Experimental study

  36. Related work • Scaling out R • Low-level message passing type • Task- and data-parallel computing systems • Automatic parallelization of high-level • Deepening a DMS

  37. Conclusion • Ricardo combines the data management capabilities of Hadoop and Jaql with the statistical functionality provided by R. • Future work • Identifying and integrating additional statistical analyses that are amenable to the Ricardo approach.

  38. References • S. Das, Y. Sismanis, K. S. Beyer, R. Gemulla, P. J. Haas, and J. McPherson. Ricardo: Integrating R and Hadoop. In SIGMOD, pages 987-998, 2010.