R with Distributed Systems


Presentation Transcript


  1. R with Distributed Systems

  2. R with Distributed Systems
  • RHIPE - R and Hadoop Integrated Processing Environment
  • http://www.stat.purdue.edu/~sguha/rhipe/
  • Ricardo: Integrating R and Hadoop, SIGMOD 2010
  • Segue
  • http://code.google.com/p/segue/
  • Hadoop InteractiVE
  • https://r-forge.r-project.org/projects/rhadoop/
  • Big Data Analysis with Revolution R Enterprise
  • Revolution R Enterprise: http://www.revolutionanalytics.com/
  • The RevoScaleR package provides a mechanism for scaling the R language to handle very large data sets.
  • Elastic-R
  • https://www.elastic-r.org
  • Biopara
  • http://hedwig.mgh.harvard.edu/biostatistics/node/20
  • http://hedwig.mgh.harvard.edu/biostatistics/files/biopara/biopara.html
  • RIOT: I/O-Efficient Numerical Computing without SQL, CIDR 2009
  • RIOT extends R with a relational database as the backend, rather than Hadoop.

  3. Ricardo: Integrating R and Hadoop Sudipto Das*, Yannis Sismanis**, Kevin S. Beyer**, Rainer Gemulla**, Peter J. Haas**, John McPherson** * UC Santa Barbara ** IBM Almaden Research Center SIGMOD 2010

  4. Deep Analytics on Big Data • Enterprises collect huge amounts of data • Amazon, eBay, Netflix, iTunes, Yahoo, Google, VISA, … • User interaction data and history • Click and Transaction logs • Deep analysis critical for competitive edge • Understanding/Modeling data • Recommendations to users • Ad placement • Challenge: Enable Deep Analysis and Understanding over massive data volumes • Exploiting data to its full potential

  5. Motivating Examples • Data Exploration / Model Evaluation / Outlier Detection • Personalized Recommendations • For each individual customer/product • Many applications: Netflix, Amazon, eBay, iTunes, … • Difficulty: discerning particular customer preferences • Sampling loses the competitive advantage • Application Scenario: Movie Recommendations, Netflix • Millions of customers • Hundreds of thousands of movies • Billions of movie ratings

  6. Big Data and Deep Analytics – The Gap • R, SPSS, SAS – a statistician’s toolbox • Rich statistical, modeling, and visualization functionality • Operate on small amounts of data, entirely in memory • Extensions for data handling are cumbersome • Hadoop – scalable data management systems • Scalable, fault-tolerant, elastic, … • “Magnetic”: easy to store data • Limited deep analytics: mostly descriptive analytics

  7. Filling the Gap: Existing Approaches • Reducing data size by sampling • Approximations might result in losing the competitive advantage • Loses important features of the long tail of data distributions [Cohen et al., VLDB 2009] • Scaling out R • Efforts from the statistics community toward parallel and distributed variants [SNOW, Rmpi] (see the sketch below) • Main-memory based in most cases • Re-implements DBMS and distributed-processing functionality • Deep analysis within a DBMS • Port statistical functionality into a DBMS [Cohen et al., VLDB 2009], [Apache Mahout] • Not sustainable: misses out on R’s community development and rich libraries
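
  As an illustration of the scaling-out-R approach (not from the original slides): a minimal snow session that parallelizes a function over a socket cluster. As the slide notes, the data lives in the workers' main memory.

    library(snow)
    cl <- makeCluster(4, type = "SOCK")               # 4 local worker processes
    squares <- parSapply(cl, 1:100, function(x) x^2)  # scatter, compute in parallel, gather
    stopCluster(cl)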

  8. Ricardo: Bridging the Gap • David Ricardo, famous economist from the 19th century • “Comparative Advantage” • Deep analytics decomposable into a “large part” and a “small part” [Chu et al., NIPS ‘06] • Linear/logistic regression, k-means clustering, Naïve Bayes, SVMs, PCA • Recommender systems / latent factorization [in the paper] • The large part includes joins, group-bys, distributive aggregations • Hadoop + Jaql: excellent scalability for large-scale data management • The small part includes matrix/vector operations • R: excellent support for numerically stable matrix inversions, factorizations, optimizations, eigenvector decompositions, etc. • Ricardo establishes “trade” between R and Hadoop/Jaql

  9. R in a Nutshell • R supports rich statistical functionality
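
  The slide's own examples were shown as images; a small hypothetical R session illustrates the kind of functionality meant here (toy data, nothing from the original deck):

    x <- rnorm(100)                # toy data
    y <- 2 * x + rnorm(100)
    fit <- lm(y ~ x)               # linear regression in one line
    summary(fit)                   # coefficients, R^2, significance tests
    hist(residuals(fit))           # quick visual check of the residuals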

  10. Jaql in a Nutshell • Scalable descriptive analysis using Hadoop • Jaql is a representative declarative interface • JSON view of the data • Jaql example (see the aggregation query on slide 16)

  11. Ricardo: The Trading Architecture • Complexity of Trade between R and Hadoop • Simple Trading: Data Exploration • Complex Trading: Data Modeling

  12. Simple Trading: Exploratory Analytics • Gain insights about the data • Example: top-k outliers for a model • Identify the data items on which the model performed most poorly • Helpful for improving the accuracy of the model • The trade: • Build complex statistical models using rich R functionality • Parallelize processing over the entire data using Hadoop/Jaql (a sketch follows)
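
  A single-machine sketch of the top-k-outliers idea (the model, data, and k are made up; in Ricardo, the residual computation would run as a parallel Hadoop/Jaql job over the full data set):

    k <- 10
    d <- data.frame(x = rnorm(1000))
    d$y <- 3 * d$x + rnorm(1000)
    model <- lm(y ~ x, data = d)                    # the "small part", fit in R
    d$residual <- abs(d$y - predict(model, d))      # the "large part": score every record
    outliers <- head(d[order(-d$residual), ], k)    # keep the k records the model fits worst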

  13. Complex Trading: Latent Factors • SVD-like matrix factorization • Minimize the squared error: e = Σ_{i,j} (p_i·q_j − r_ij)² • The trade: • Use complex statistical models in R • Parallelize aggregate computations using Hadoop/Jaql • (Figure: the ratings matrix factored into a customer vector p and a movie vector q)

  14. Latent Factor Models with Ricardo
  • Goal: minimize the squared error e = Σ_{i,j} (p_i·q_j − r_ij)²
  • Numerical methods needed (large, sparse matrix)
  • Pseudocode:
    1. Start with an initial guess of the parameters p_i and q_j.
    2. Compute the error and gradient, e.g. ∂e/∂p_i = Σ_j 2·q_j·(p_i·q_j − r_ij)
       (data intensive, but parallelizable).
    3. Update the parameters; R implements many different optimization algorithms.
    4. Repeat steps 2 and 3 until convergence.
  • R code: optim( c(p,q), fe, fde, method="L-BFGS-B" )
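
  A self-contained, single-machine sketch of this loop on a toy dense ratings matrix, defining the fe and fde passed to optim above (in Ricardo, the error and gradient sums are the data-intensive part computed by Hadoop/Jaql; all data here is made up):

    set.seed(42)
    n_cust <- 5; n_mov <- 4
    ratings <- matrix(sample(1:5, n_cust * n_mov, replace = TRUE), n_cust, n_mov)

    fe <- function(theta) {            # squared error e = sum_ij (p_i q_j - r_ij)^2
      p <- theta[1:n_cust]; q <- theta[-(1:n_cust)]
      sum((outer(p, q) - ratings)^2)
    }
    fde <- function(theta) {           # gradient of e with respect to (p, q)
      p <- theta[1:n_cust]; q <- theta[-(1:n_cust)]
      D <- outer(p, q) - ratings       # D[i,j] = p_i q_j - r_ij
      c(2 * D %*% q, 2 * t(D) %*% p)   # de/dp_i = 2 sum_j D[i,j] q_j ; de/dq_j = 2 sum_i D[i,j] p_i
    }

    fit <- optim(runif(n_cust + n_mov), fe, fde, method = "L-BFGS-B")
    round(outer(fit$par[1:n_cust], fit$par[-(1:n_cust)]), 2)   # fitted rank-1 approximation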

  15. Computing the Model • e = Σ_{i,j} (p_i·q_j − r_ij)² • 3-way join to match r_ij, p_i, and q_j, then aggregate • Similarly compute the gradients • (Figure: the movie ratings table joined with the customer parameters p_i and the movie parameters q_j)

  16. Aggregation in Jaql/Hadoop

  res = jaqlTable(channel, "
    ratings
    -> hashJoin( fn(r) r.j, moviePars, fn(m) m.j, fn(r, m) { r.*, m.q } )
    -> hashJoin( fn(r) r.i, custPars,  fn(c) c.i, fn(r, c) { r.*, c.p } )
    -> transform { $.*, diff: $.rating - $.p * $.q }
    -> expand [ { value: pow($.diff, 2.0) },
                { $.i, value: -2.0 * $.diff * $.p },
                { $.j, value: -2.0 * $.diff * $.q } ]
    -> group by g = { $.i, $.j }
         into { g.*, gradient: sum($[*].value) }
  ")

  Result in R:

     i     j    gradient
   ----  ----   --------
   null  null     325235
      1  null         21
      2  null        357
      …
   null     1          9
   null     2         64
      …

  17. Experimental Evaluation • 50 nodes at EC2 • Each node: 8 cores, 7GB memory, 320GB disk • Total: 400 cores, 350GB memory, 16TB disk space

  18. Result • Leveraging Hadoop’s scalability • Leveraging R’s rich functionality • optim( c(p,q), fe, fde, method="CG" ) • optim( c(p,q), fe, fde, method="L-BFGS-B" )

  19. Extending the Trade: R – Jaql – R • Invoking R through Jaql: distributed statistical computation • Example: augment the model with customer preferences that change over time • A time-series model for each customer is incorporated into the global model

  20. Conclusion • Scaled latent factor models to terabytes of data • Provided a bridge: other algorithms with a summation form can be mapped and scaled the same way • Many algorithms have summation form • Decompose into a “large part” and a “small part” • [Chu et al., NIPS ‘06]: LWLR, Naïve Bayes, GDA, k-means, logistic regression, neural network, PCA, ICA, EM, SVM • Future & current work • Tighter language integration • More algorithms • Performance tuning

  21. RHIPE - R and Hadoop Integrated Processing Environment Saptarshi Guha

  22. RHIPE • R package • INSTALL • Set an environment variable $HADOOP that points to the Hadoop installation directory • It is expected that $HADOOP/bin contains the Hadoop shell executable hadoop • RHIPE needs to be installed on all the computers: the one where you run your R environment and all the task computers • Using RHIPE is much easier if your filesystem layout (i.e., location of R, Hadoop, libraries, etc.) is identical across all computers
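
  A hedged sketch of that setup from inside R (the path is a made-up example; in practice $HADOOP is usually exported in the shell profile so it is set for every session):

    Sys.setenv(HADOOP = "/usr/local/hadoop")   # hypothetical path; must contain bin/hadoop
    library(Rhipe)
    rhinit()                                   # initialize RHIPE's connection to Hadoop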

  23. Tests • In R (the snippets themselves were shown as images; a reconstruction follows): • loading and initializing RHIPE should work successfully • rhwrite should successfully write the list to the HDFS • rhread should return a list of length 3, each element a list of 2 objects
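
  A hedged reconstruction of those tests (the function names are RHIPE's; the path and the data are made up):

    library(Rhipe)
    rhinit()                                   # should work successfully

    kv <- list(list(1, "a"), list(2, "b"), list(3, "c"))   # three key-value pairs
    rhwrite(kv, "/tmp/rhipe-test")             # should write the list to the HDFS

    x <- rhread("/tmp/rhipe-test")             # should return a list of length 3,
    length(x)                                  # each element a list of 2 objects (key, value)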

  24. Tests (cont’d) • A quick run of a minimal MapReduce job should also work, as in the sketch below
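
  A hedged reconstruction of such a quick run: an identity job that copies the key-value pairs written above to a new HDFS directory. The rhmr arguments (ifolder/ofolder/inout) follow the RHIPE releases of this era and may differ in other versions.

    map <- expression({
      # map.keys, map.values and rhcollect() are provided by RHIPE inside the job
      lapply(seq_along(map.values), function(i) rhcollect(map.keys[[i]], map.values[[i]]))
    })
    job <- rhmr(map = map, ifolder = "/tmp/rhipe-test", ofolder = "/tmp/rhipe-test-out",
                inout = c("sequence", "sequence"))
    rhex(job)   # launch on the cluster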

  25. R and Hadoop Integrated Programming Environment • The R and Hadoop Integrated Programming Environment is an R package to: • compute across massive data sets • create subsets • apply routines to subsets • produce displays on subsets across a cluster of computers • using the Hadoop DFS and the Hadoop MapReduce framework • Uses Hadoop Streaming • Users can write MapReduce programs in other languages, e.g. Python, Ruby, Perl, which are then deployed over the cluster • Hadoop Streaming then transfers the input data from Hadoop to the user program and vice versa

  26. R and Hadoop Integrated Programming Environment
  • RHIPE is just that.
  • RHIPE consists of several functions to interact with the HDFS
  • e.g. save data sets, read data created by RHIPE MapReduce, delete files
  • Commands in R (see the sketch below)
  • Compose and launch MapReduce jobs from R using the commands rhmr and rhex
  • Monitor the status using rhstatus, which returns an R object
  • Stop jobs using rhkill
  • Compute side-effect files
  • The output of parallel computations may include the creation of PDF files, R data sets, CSV files, etc.
  • These will be copied by RHIPE to a central location on the HDFS, removing the need for the user to copy them from the compute nodes or to set up a network file system.
  • Data sets that are created by RHIPE can be read using other languages such as Java, Perl, Python and C.
  • The serialization format used by RHIPE (converting R objects to binary data) uses Google's Protocol Buffers, which is very fast and creates compact representations for R objects: ideal for massive data sets.
  • Data sets created using RHIPE are key-value pairs: a key is mapped to a value, and a MapReduce computation iterates over the key-value pairs in parallel.
  • If the output of a RHIPE job creates unique keys, the output can be treated as an external-memory associative dictionary. RHIPE can thus be used as a medium-scale (millions of keys) disk-based dictionary, which is useful for loading R objects into R.
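
  A sketch of that job-control flow, assuming map and reduce expressions are already defined (the folders are hypothetical, and the exact signatures of rhmr/rhstatus/rhkill vary across RHIPE versions):

    job <- rhmr(map = map, reduce = reduce,
                ifolder = "/airline/data", ofolder = "/airline/out",
                inout = c("text", "sequence"))   # compose the MapReduce job
    rhex(job)        # launch it on the cluster
    rhstatus(job)    # monitor progress; returns an R object
    rhkill(job)      # stop the job if needed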

  27. Example: Airline Dataset • Copying the Data to the HDFS
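
  The copy command was shown as an image; a hedged sketch with RHIPE's HDFS helpers (both paths are made up):

    rhput("/local/data/airline/1987.csv", "/airline/data/1987.csv")  # local file -> HDFS
    rhls("/airline/data")                                            # list the directory to confirm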

  28. Example: Airline Dataset (cont’d) • rhstatus

  29. Example: Airline Dataset (cont’d) • Job

  30. Example: Airline Dataset (cont’d) • Demonstration of using Hadoop as a Queryable Database

  31. Demonstration of using Hadoop as a Queryable Database • Top 20 cities by total volume of flights.
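
  The query was shown as an image; a hedged sketch of its likely shape: count flights per origin airport in a MapReduce job, then take the top 20 in R. The column index, paths, and rhmr argument names are assumptions.

    map <- expression({
      lapply(map.values, function(line) {
        fields <- strsplit(line, ",")[[1]]
        rhcollect(fields[17], 1)         # column 17 = origin airport (assumption)
      })
    })
    reduce <- expression(
      pre    = { total <- 0 },
      reduce = { total <- total + sum(unlist(reduce.values)) },
      post   = { rhcollect(reduce.key, total) }
    )
    job <- rhmr(map = map, reduce = reduce, ifolder = "/airline/data",
                ofolder = "/airline/volume", inout = c("text", "sequence"))
    rhex(job)

    counts <- rhread("/airline/volume")   # list of (city, count) pairs
    df <- data.frame(city    = sapply(counts, function(kv) kv[[1]]),
                     flights = sapply(counts, function(kv) as.numeric(kv[[2]])))
    head(df[order(-df$flights), ], 20)    # top 20 cities by total volume of flights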

  32. Example: Transforming Text Data • Text data • The carrier name is column 9 • The Southwest carrier code is WN, Delta is DL • Only those rows with column 9 equal to WN or DL will be saved (a sketch of the map follows)
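
  The map shown on the slide was an image; a hedged reconstruction (the comma-separated format, column 9, and the WN/DL codes come from the slide; everything else is assumed):

    map <- expression({
      lapply(map.values, function(line) {
        fields <- strsplit(line, ",")[[1]]
        if (fields[9] %in% c("WN", "DL")) rhcollect(NULL, line)   # keep WN and DL rows only
      })
    })
    job <- rhmr(map = map, ifolder = "/airline/data", ofolder = "/airline/wn-dl",
                inout = c("text", "text"))   # text in, text out: rows stay plain text
    rhex(job)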

  33. Example: Transforming Text Data (cont’d) • The output (the slide showed two sample output files)
