R/L/BFL - Overview - Performance Measurement - Limitations - Recommendation

TIP D&NA Data Management

iTab Performance Test Time (sec)

The History: R, S and S-plus
• S: an interactive environment for data analysis, developed at Bell Laboratories since 1976
  • 1988 - S2: RA Becker, JM Chambers, A Wilks
  • 1992 - S3: JM Chambers, TJ Hastie
  • 1998 - S4: JM Chambers
  • Won the ACM Software System Award in 1998 (other recipients include Unix, WWW, SPIN, Java, etc.)
  • Exclusively licensed by AT&T/Lucent to Insightful Corporation, Seattle, WA. Product name: S-plus.
  • Implemented in C and Fortran.
  • Homepage: http://cm.bell-labs.com/cm/ms/departments/sia/S/history.html
• R:
  • Open-source S under the GNU GPL
  • Initially written by Ross Ihaka and Robert Gentleman at the Department of Statistics, University of Auckland, New Zealand, during the 1990s.
  • Since 1997: an international R core team of ca. 15 people with access to a common CVS archive.

What is R?
• R is "GNU S": a language and environment for data manipulation, calculation and graphical display.
• R is similar to the award-winning S system, which was developed at Bell Laboratories by John Chambers et al.
• A suite of operators for calculations on arrays, in particular matrices
• A large, coherent, integrated collection of intermediate tools for interactive data analysis
• Graphical facilities for data analysis and display, either directly at the computer or on hardcopy
• A well-developed programming language which includes conditionals, loops, user-defined recursive functions, and input and output facilities
• The core of R is an interpreted computer language.
• It allows branching and looping as well as modular programming using functions.
• Most of the user-visible functions in R are written in R, calling upon a smaller set of internal primitives.
• For efficiency, the user can interface to procedures written in C/C++ or FORTRAN, and can also write additional primitives.

The R grammar file (gram.y): everything is an expression!

The R Project
• Aims at building an open-source version of S (under the GNU GPL)
• Project home: http://www.r-project.org/
• Available on Windows, Linux, and Mac OS
• Latest version: 2.12.1 (dated 16/12/2010)
• Now has a core team of about 19 people
• Support for multiple languages
• CRAN
  • a network of ftp and web servers around the world that store identical, up-to-date versions of code and documentation for R
• The R Journal
  • a refereed journal of the R project for statistical computing
• Some well-known weaknesses
  • Not particularly efficient in handling large data sets
  • Rather slow in executing a large number of for loops
  • The learning curve is somewhat steep compared to point-and-click software

R integration with NewDB
[Architecture diagram. Components: R Scripts, NewDB Calculation Engine, R Server, R-Client, R operator library (operators to be executed on NewDB), data process, shared memory, iTabs, data exchange]
• Phase I: Data exchange via shared memory. Column data is transformed into array/vector/matrix data structures for more efficient statistical computation by R.
• Phase II: An R operator library is established on NewDB; execution of R operators is parallelized.

R integration with NewDB
[Architecture diagram. Components: R Scripts, NewDB Calculation Engine, R Server, R-Engine, R-Client, R operator library (operators to be executed on NewDB), data process, shared memory, iTabs, data exchange]
• Phase III: SAP's own R interpreter/compiler, evaluator, inter-operator parallelism, optimizer
• Phase III: The R operator library is extended to cover all frequently used operators on NewDB.
• Data exchange via shared memory. Column data is transformed into array/vector/matrix data structures for more efficient statistical computation by R.
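The column-to-matrix transformation mentioned on this slide can be pictured with a small sketch. This is only an illustration under stated assumptions: the column names, the dictionary layout, and the helper `columns_to_matrix` are hypothetical, and the sketch does not model the shared-memory exchange or any NewDB or R interface.

```python
def columns_to_matrix(columns, names):
    """Turn column-oriented data into a row-major matrix (list of rows).

    `columns` maps a column name to a list of values (hypothetical layout);
    `names` fixes the column order of the resulting matrix.
    """
    lengths = {len(columns[name]) for name in names}
    if len(lengths) != 1:
        raise ValueError("all selected columns must have the same length")
    # zip(*...) walks the selected columns in parallel, yielding one row at a time.
    return [list(row) for row in zip(*(columns[name] for name in names))]


# Hypothetical column store with two numeric columns.
table = {"duration": [1.0, 2.0, 3.0], "cost": [0.1, 0.2, 0.3]}
matrix = columns_to_matrix(table, ["duration", "cost"])
```

Each row of `matrix` then holds one record, which is the shape most statistical routines expect as input.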

R Performance with 1M Records
[Bar chart; bar labels not recoverable. Values shown: 280s, 156s, 35s, 3.2s, 0.61s, 2.5s. One annotation reads "Even more".]
• 1M records means 1M rows, 100 columns
• K-means performs 5 iterations, into 20 clusters
• The "R Parallelism" and "R Wrap C++" variants are implemented in R external packages.
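For readers unfamiliar with the benchmark workload, here is a minimal sketch of k-means with a fixed number of iterations, as in the "5 iterations, 20 clusters" setting above. It is plain Python for illustration only and is not any of the measured implementations. The initialisation (evenly spaced input points) and all function names are assumptions.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest_centroid(point, centroids):
    """Index of the centroid closest to `point` (first one wins on ties)."""
    return min(range(len(centroids)),
               key=lambda c: squared_distance(point, centroids[c]))


def kmeans(points, k, iterations):
    """Run a fixed number of k-means (Lloyd) iterations.

    Assumption: initial centroids are evenly spaced picks from the input,
    which requires len(points) >= k.
    """
    step = len(points) // k
    centroids = [points[i * step] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for point in points:
            clusters[nearest_centroid(point, centroids)].append(point)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    labels = [nearest_centroid(point, centroids) for point in points]
    return centroids, labels


# Two small, well-separated groups of 2-D points (made-up data).
data = [(0, 0), (0, 1), (1, 0), (1, 1),
        (10, 10), (10, 11), (11, 10), (11, 11)]
centroids, labels = kmeans(data, k=2, iterations=5)
```

The inner loop over all points and all centroids is the part that parallel or compiled implementations speed up.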

Comparison Among Different Implementations

2M records, 60 columns, 2 iterations, 20 clusters

Graph Clustering on Mobile Calling Data

The Problem: identifying communities from large-scale mobile calling data in real time
In this case, the distance measure does not fulfill the triangle inequality, so we cannot use distance-based clustering methods such as k-means. Graph clustering is a good complement to k-means.
[Figure: calling graph over nodes a to i, and the result of graph clustering with three possible clusters (communities)]
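To make the triangle-inequality point concrete, the sketch below checks a dissimilarity derived from call counts. The slide does not say which measure is used, so the measure here (one divided by the number of calls) and the call counts are purely hypothetical.

```python
from itertools import permutations


def triangle_violations(nodes, dist):
    """Return all triples (x, y, z) with dist(x, z) > dist(x, y) + dist(y, z)."""
    return [(x, y, z) for x, y, z in permutations(nodes, 3)
            if dist(x, z) > dist(x, y) + dist(y, z)]


# Hypothetical call counts: a and b talk often, b and c talk often,
# but a and c have called each other only once.
calls = {("a", "b"): 10, ("b", "c"): 10, ("a", "c"): 1}


def call_distance(x, y):
    """Assumed dissimilarity: 1 / number of calls between x and y."""
    count = calls.get((x, y)) or calls.get((y, x))
    return 1.0 / count


violations = triangle_violations("abc", call_distance)
```

Here going from a to c "through" b is shorter than the direct a-to-c dissimilarity, so `violations` is non-empty and the measure is not a metric.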

Graph Clustering with Eigenvector Indicator
[Figure: the calling graph over nodes a to i, shown at the start, after step 1, and after step 2]
• Compute the smallest 3 eigenvalues and the corresponding eigenvectors.
• Groups are identified simply by separating negative from positive eigenvector values (step 1, step 2).
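A minimal sketch of the sign-based split, for a single two-way cut only. It uses the eigenvector of the second-smallest eigenvalue of the graph Laplacian (the Fiedler vector), whereas the slide uses three eigenvectors to obtain three communities. The eigenvector is found here by shifted power iteration in plain Python. That is a swapped-in technique chosen to keep the example dependency-free and is not the Lapack or Arpack solver discussed in this deck. The example graph and all names are assumptions.

```python
import math


def fiedler_vector(n, edges, iterations=500):
    """Approximate the Laplacian eigenvector of the second-smallest eigenvalue.

    Power iteration on (shift * I - L), restricted to vectors orthogonal
    to the constant vector, converges to that eigenvector.
    """
    lap = [[0.0] * n for _ in range(n)]
    for u, v in edges:  # Laplacian L = D - A of an undirected graph
        lap[u][u] += 1.0
        lap[v][v] += 1.0
        lap[u][v] -= 1.0
        lap[v][u] -= 1.0
    # Twice the maximum degree bounds the largest eigenvalue of L.
    shift = 2.0 * max(lap[i][i] for i in range(n))
    vec = [float(i) for i in range(n)]  # deterministic start vector
    for _ in range(iterations):
        mean = sum(vec) / n
        vec = [x - mean for x in vec]  # remove the constant component
        vec = [shift * vec[i] - sum(lap[i][j] * vec[j] for j in range(n))
               for i in range(n)]
        norm = math.sqrt(sum(x * x for x in vec))
        vec = [x / norm for x in vec]
    return vec


def split_by_sign(vec):
    """Group node indices by the sign of their eigenvector entry."""
    negative = {i for i, x in enumerate(vec) if x < 0}
    positive = set(range(len(vec))) - negative
    return negative, positive


# Made-up graph: two triangles (0,1,2) and (3,4,5) joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
groups = split_by_sign(fiedler_vector(6, edges))
```

On this graph the two sign groups coincide with the two triangles. Which group comes out negative depends on the start vector, so only the grouping itself is meaningful.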

Efficient algorithms and packages for spectral clustering
Solving the eigenvalue problem for a dense matrix takes O(n^3) operations; Lapack (BSD license) supports this routine. For sparse matrices there is a solver called the Lanczos method, which takes only about O(n^(3/2)) operations. Arpack (the ARnoldi PACKage) supports this method. Arpack is a numerical software library written in FORTRAN 77 for solving large-scale eigenvalue problems, with a BSD-new style license.
Here is an example of the performance comparison, using a synthesized dataset with 300,000 calling records and 10,000 users, on a Z600. Both solvers run single-threaded, and the results they generate are the same.

Storing Calling Data with Adjacency List
[Figure: adjacency list for the calling graph over nodes a to i]
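As a rough illustration of the adjacency-list idea, the sketch below builds one from call records and then flattens it into two-column rows, which is one way such a list could be laid out in a table. The records are made up (the edges of the slide's figure are not recoverable), calls are treated as undirected, and the function name is hypothetical.

```python
from collections import defaultdict


def build_adjacency_list(call_records):
    """Map each user to the sorted list of users they exchanged calls with.

    Assumptions: calls are undirected, repeated calls collapse into one
    edge, and self-calls are ignored.
    """
    neighbours = defaultdict(set)
    for caller, callee in call_records:
        if caller == callee:
            continue
        neighbours[caller].add(callee)
        neighbours[callee].add(caller)
    return {user: sorted(peers) for user, peers in sorted(neighbours.items())}


# Made-up call records (caller, callee); the pair a-b appears twice.
records = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("a", "b")]
adjacency = build_adjacency_list(records)

# Flattened form: one (user, neighbour) row per adjacency entry.
rows = [(user, peer) for user, peers in adjacency.items() for peer in peers]
```

Only existing edges are stored, so memory grows with the number of distinct caller/callee pairs rather than with the square of the number of users.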

Implementing Graph Clustering as a NewDB BFL Function
• Representing the adjacency list as an iTab
• Some performance results on a Z600 (with Intel Xeon X5550 2.67GHz CPU and 24GB memory), for 1,000,000 users:
  • Memory required: 3.7 TB, reduced to 1.1 GB by sparsifying
  • Processing time (1 core): ~14 hours, reduced to 4.5 minutes by sparsifying
• In the demo of MobileMiner [sigmod2009], they show a partial clustering among 400 users.

BFL Wiki BFL Wiki Page

BFL For Forecasting Algorithms
• Algorithms to support
Find more details from here

Original Solution
• Customized Operator
• Advantage:
  • It is part of the Calculation Engine, so it may use CalcEngine's parallelization mechanism.
• Technical difficulties:
  • Cannot be exposed to ordinary users, because there is no SQL interface for customized operators besides the R operator and the L operator
  • No loop support.

Current Solution
• BFL functions can be called from SQL by any application.

BFL Evolution
[Architecture diagram. Client Tier: App Excel Client, App Web Client, Other Clients. Application Services Tier: Application Services (Session System, Audit System, Log System). Database Tier: BFL Pure Calculation Engine, BFL Meta Manager, Business Function Library, IO Wrapper (with Parallelization), iTable (Data/Log).]

BFL Evolution
[Architecture diagram. Client Tier: App Excel Client, App Web Client, Other Clients. Database Service Tier: Admin Services (Session System, Audit System, Log System), BFL Universal SDK, BFL Meta Manager, Business Function Library, IO Wrapper (with Parallelization), Internal Table.]

New Features beyond SWP
• Table that stores all available functions/parameters/versions for app clients to query
• Log system, so that applications can easily check log information and run-time information
  • Function name / input parameter values (for easy reproduction)
  • BFL function start time / end time / total run time
  • Log level (Info/Warning/Error...)
  • Log info (concrete message)
  • Necessary intermediate results for troubleshooting and for understanding how BFL algorithms run
  • Statistical info on reads/writes/records involved
• Session context
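The log fields listed above could be captured along the following lines. This is a hedged sketch in plain Python, not the BFL log system: `LogEntry`, `LOG`, the `logged` decorator and the example function `moving_average` are all hypothetical names, and a real implementation would write to a log table rather than an in-memory list.

```python
import functools
import time
from dataclasses import dataclass


@dataclass
class LogEntry:
    function_name: str
    parameters: dict   # input parameter values, kept for reproduction
    start_time: float
    end_time: float
    level: str         # "Info", "Warning" or "Error"
    message: str

    @property
    def total_run_time(self):
        return self.end_time - self.start_time


LOG = []  # stand-in for a log table


def logged(func):
    """Record name, parameters, timing, level and message of each call.

    The wrapper accepts keyword arguments only, so every parameter is
    stored under its name.
    """
    @functools.wraps(func)
    def wrapper(**params):
        start = time.time()
        level, message = "Error", "interrupted"
        try:
            result = func(**params)
            level, message = "Info", "completed"
            return result
        except Exception as exc:
            level, message = "Error", str(exc)
            raise
        finally:
            LOG.append(LogEntry(func.__name__, dict(params),
                                start, time.time(), level, message))
    return wrapper


@logged
def moving_average(values, window):
    """Hypothetical forecasting-style function used only for the demo."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


averages = moving_average(values=[1, 2, 3, 4], window=2)
```

After the call, `LOG[-1]` holds the function name, the exact inputs, the timestamps and an "Info" level; a failing call is logged with level "Error" before the exception propagates.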

Limitations of the L wrapper
• On the SQLScript side, the parameters for L wrapper functions can only be internal tables.
• Each registered function must be aware of the table type (both for input parameters and for the result). So to apply the same BFL function to different tables, many functions (one per table type) must be created.
• It is difficult to combine key columns with result values and return them to the application.

Recommendation
• Recommendation
  • For applications that focus on statistical calculation, R is better
  • For applications that need interaction with data, R is better
  • If application requirements are fixed, BFL is better because of its high performance
  • BFL can be ported to be an external package of R
• Current Adoption
  • SWP (BFL)
  • Raptor (R/BFL)
  • SBC-Spend Analysis (R)
  • BI-IP (BFL)
  • Pioneer (BFL)
  • BPC(NW) (BFL)
  • DSR (BFL)
  • Saleforecasting (R)
