Enhancing I/O Efficiency in R for Numerical Computing with RIOT

RIOT: I/O-Efficient Numerical Computing in R
Yi Zhang, Herodotos Herodotou, Jun Yang

What is R?
• R: an open-source language/environment
  • Statistical computing, graphics
  • Comprehensive R Archive Network (CRAN)
    • 1639 packages as of Dec 08
• Interpretive execution
• High-level constructs
  • Arrays, matrices
  • Common to languages for numerical/statistical computing
• Code example:
    a <- 1:100
    …
    d <- a+b^2+c

Big-Data Challenge
• R assumes all data in main memory
• If not, the virtual memory (VM) system starts swapping data from/to disk
  • Excessive I/O, poor performance
• Example: n points (x, y), a start point S(xs,ys), and an end point E(xe,ye)
    # n points with coordinates stored in x[1:n], y[1:n]
    d <- sqrt((x-xs)^2+(y-ys)^2)+sqrt((x-xe)^2+(y-ye)^2)
    s <- sample(n, 100)   # draw 100 samples from 1:n
    z <- d[s]             # extract elements of d whose indices are in s
[Figure: intermediate results such as x-xs, (x-xs)^2, y-ye, (x-xe)^2, and the first sqrt compete with x and y for memory and spill to the swap/paging file]
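To make the cost concrete, the sketch below spells out the example one operation at a time and adds up the size of the full-length temporaries. The value of n, the endpoint coordinates, and the random data are assumptions for illustration; R itself may reuse some buffers internally, so treat the total as an upper-bound picture rather than a measurement of the interpreter.

```r
# Illustrative data: n, the endpoints, and the coordinates are assumptions.
set.seed(1)
n  <- 1000
x  <- runif(n); y <- runif(n)
xs <- 0; ys <- 0      # start point S
xe <- 1; ye <- 1      # end point E

# Eager evaluation, one operation per assignment.
# Every t* below is a full-length vector of n doubles.
t1 <- x - xs;  t2 <- t1^2
t3 <- y - ys;  t4 <- t3^2
t5 <- sqrt(t2 + t4)            # t2 + t4 is yet another unnamed temporary
t6 <- x - xe;  t7 <- t6^2
t8 <- y - ye;  t9 <- t8^2
t10 <- sqrt(t7 + t9)
d <- t5 + t10

s <- sample(n, 100)            # draw 100 indices from 1:n
z <- d[s]                      # only these 100 values are ever used

# Total bytes held by the named temporaries.
temporaries <- list(t1, t2, t3, t4, t5, t6, t7, t8, t9, t10)
bytes_temp  <- sum(sapply(temporaries, function(v) as.numeric(object.size(v))))
```

Ten vectors of length n are built to produce 100 useful numbers; once n no longer fits in memory, each of them is a candidate for paging.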

Opportunities
• Avoiding intermediate results
  • Multiple large intermediate results are generated
  • Can we avoid them without hand-coding loops?
      for (i in 1:n) { d[i] <- sqrt((x[i]-xs)^2+…)+… }
• Deferred and selective evaluation
  • Each expression is evaluated in full immediately
  • Can we defer evaluation until really necessary?
  • Just compute the 100 elements from d picked by s
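The selective-evaluation idea can be checked by hand: subsetting the inputs before computing gives the same z while every temporary has length 100 instead of n. This is a manual rewrite for illustration, not what RIOT does internally; the data and constants are assumptions.

```r
# Illustrative data: n, the endpoints, and the coordinates are assumptions.
set.seed(42)
n  <- 1000
x  <- runif(n); y <- runif(n)
xs <- 0; ys <- 0; xe <- 1; ye <- 1

s <- sample(n, 100)

# Eager: build all n elements of d, then keep 100 of them.
d <- sqrt((x - xs)^2 + (y - ys)^2) + sqrt((x - xe)^2 + (y - ye)^2)
z_eager <- d[s]

# Selective: subset the inputs first, so every temporary has length 100.
xsub <- x[s]; ysub <- y[s]
z_selective <- sqrt((xsub - xs)^2 + (ysub - ys)^2) +
               sqrt((xsub - xe)^2 + (ysub - ye)^2)
```

The point of the slides is that a user should not have to perform this rewrite by hand; the system should derive it from the unmodified program.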

Existing Solutions
• Rewrite and hand-optimize code
  • Tedious, not quite reusable
• Use I/O-efficient libraries
  • SOLAR [Toledo’96], DRA [Nieplocha’96], etc.
  • But efficient individual operations are not enough
• Build/extend a DB
  • RasDaMan [Baumann’99], AML [Marathe’02], ASAP [Stonebraker’07], …
  • Must rewrite using a new language (often SQL)
  • Explicit boundary between DB and host language

R with I/O Transparency
• Attain I/O efficiency without explicit user intervention
  • Run legacy code with no or minimal modification
  • No need to learn new languages/libraries
  • No boundary between host language and backend processing

RIOT
• Implemented as an R package
• New types, same interfaces: dbvector, dbmatrix, …
• Uses R’s generics mechanism for transparency
  • New class definition:
      setClass("dbvector", representation(size="numeric",…))
  • Implementation:
      SEXP add_dbvectors(SEXP e1, SEXP e2){ … }
  • Method overloading:
      setMethod("+",signature(e1="dbvector",e2="dbvector"),
        function(e1,e2) { .Call("add_dbvectors",e1,e2) } )
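The dispatch mechanism can be tried without any C code. In the sketch below the `data` slot, the `dbvector()` constructor, and the pure-R body of `add_dbvectors` are assumptions standing in for RIOT's backend handle and its C routine; only the class name, the `size` slot, and the `setMethod` pattern come from the slide.

```r
library(methods)

# Hypothetical stand-in for RIOT's dbvector: the values sit in an ordinary
# numeric slot here, whereas RIOT would hold a reference to backend storage.
setClass("dbvector", representation(size = "numeric", data = "numeric"))

# Hypothetical convenience constructor.
dbvector <- function(v) new("dbvector", size = length(v), data = as.numeric(v))

# Pure-R placeholder for the C routine add_dbvectors named on the slide.
add_dbvectors <- function(e1, e2) {
  stopifnot(e1@size == e2@size)
  dbvector(e1@data + e2@data)
}

# Overload "+" for two dbvector arguments.
setMethod("+", signature(e1 = "dbvector", e2 = "dbvector"),
          function(e1, e2) add_dbvectors(e1, e2))

a <- dbvector(1:3)
b <- dbvector(c(10, 20, 30))
r <- a + b   # dispatches to the dbvector method; the calling code is unchanged
```

Because `a + b` is written exactly as it would be for ordinary vectors, existing scripts only need their inputs to be of the new class.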

RIOT-DB: Hidden DB Backend
• A strawman solution: Map large arrays to DB tables
  • e.g. vector: V(i,v); matrix: M(i,j,v)
  • Computation → query: a+b → SELECT A.I,A.V+B.V FROM A,B WHERE A.I=B.I
  • Leverages power of DB only at intra-operation level!
• Key: Translate operations to view definitions
  • Build up larger and larger views a step at a time
  • Evaluate only when needed → deferred evaluation
  • Query optimization → selective evaluation + more
  • Iterator-style execution → no intermediate results
• Example: d<-sqrt((x-xs)^2+(y-ys)^2)+…
    CREATE VIEW T1(I,V) AS SELECT X.I,X.V-xs FROM X;
    CREATE VIEW T2(I,V) AS SELECT T1.I, POW(T1.V,2) FROM T1;
    …
    CREATE VIEW D(I,V) AS SELECT T6.I, T6.V+T12.V FROM T6,T12 WHERE T6.I=T12.I;
• … z <- d[s]
    CREATE VIEW Z(I,V) AS SELECT S.I, D.V FROM D,S WHERE D.I=S.V;
• Query evaluated once the views are expanded:
    SELECT S.I, SQRT(POW(X.V-xs,2)+POW(Y.V-ys,2))
              + SQRT(POW(X.V-xe,2)+POW(Y.V-ye,2))
    FROM X,Y,S WHERE X.I=Y.I AND X.I=S.V
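The view idea can be mimicked in plain R with closures, as a rough analogy rather than RIOT-DB's actual mechanism. Each "view" below is a length plus a function that computes requested positions on demand; `lazy`, `lz_map`, `lz_index`, `force_view`, `counted`, and `dist_to` are hypothetical helper names, and the `touched` counter is added only to show how many base elements get read.

```r
# A "view": a length and a function that computes requested positions on demand.
lazy <- function(n, get) list(n = n, get = get)

# Element-wise operation over one or two views (analogue of CREATE VIEW).
lz_map <- function(f, a, b = NULL) {
  if (is.null(b)) lazy(a$n, function(i) f(a$get(i)))
  else            lazy(a$n, function(i) f(a$get(i), b$get(i)))
}

# Subscripting: d[s] becomes a view that asks d only for the positions in s.
lz_index <- function(a, s) lazy(length(s), function(i) a$get(s[i]))

# Evaluation happens only here.
force_view <- function(a) a$get(seq_len(a$n))

# Base "table" that counts how many elements are read from it.
touched <- 0
counted <- function(v) lazy(length(v), function(i) {
  touched <<- touched + length(i)
  v[i]
})

# Illustrative data (assumptions).
set.seed(7)
n <- 1000; x <- runif(n); y <- runif(n)
xs <- 0; ys <- 0; xe <- 1; ye <- 1
X <- counted(x); Y <- counted(y)

# Distance from every point to (px, py), built purely from views.
dist_to <- function(P, Q, px, py) {
  dx2 <- lz_map(function(v) (v - px)^2, P)
  dy2 <- lz_map(function(v) (v - py)^2, Q)
  lz_map(sqrt, lz_map(`+`, dx2, dy2))
}

D <- lz_map(`+`, dist_to(X, Y, xs, ys), dist_to(X, Y, xe, ye))
s <- sample(n, 100)
Z <- lz_index(D, s)

touched_before <- touched   # nothing has been computed yet
z <- force_view(Z)          # reads only the sampled positions of x and y
```

Defining D and Z reads no data at all; forcing Z reads 100 positions from each base vector per use instead of all n, which mirrors the selection on S being pushed into the expanded query.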

RIOT-DB Demo
• RIOT-DB built using MySQL with the MyISAM engine

Performance of RIOT-DB
• Plain R
• RIOT-DB variants
  • RIOT-DB/Strawman: use DB to store arrays and execute individual ops; no use of views to defer evaluation
  • RIOT-DB/MatNamed: use views, but compute/materialize every named object
  • RIOT-DB: full version; defer/optimize across statements

Lessons Learned
• DB-style inter-operation optimization is really the key!
• Can we do better?
  • DB arrays carry too much overhead (ASAP [Stonebraker’07])
    • Extra columns in V(i, v), M(i, j, v), …; more for higher dims
  • SQL & relational algebra may not be the right abstraction
    • Advanced data layouts and complex ops are awkward
• RIOT: The Next Generation
  • A new expression algebra closer to numerical computation
  • Flexible array storage/layout options
  • Optimizations better tailored for numerical computation
  • … and more

RIOT Expression Algebra
• Analogous to the view mechanism, but more flexible
• Operators
  • +, -, *, /, [, …
  • A[idxRange] <- newVals: turn updates into functional ops
    • Instead of in-place updates, log them & define Anew over (Aold, log)
  • X %*% Y (matrix multiply) etc.: built-in, for high-level opt.
    • E.g. matrix chain multiplication: (XY)Z or X(YZ)?
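Why the multiplication order matters can be seen from a simple operation count. The sketch below uses the standard cost of p·q·r scalar multiplications for a (p × q) by (q × r) product; the matrix dimensions and the `mult_cost` helper are assumptions chosen for illustration.

```r
# Scalar multiplications needed to multiply a (p x q) matrix by a (q x r) matrix.
mult_cost <- function(p, q, r) p * q * r

# Hypothetical shapes: X is 1000 x 10, Y is 10 x 1000, Z is 1000 x 10.
dx <- c(1000, 10); dy <- c(10, 1000); dz <- c(1000, 10)

cost_left  <- mult_cost(dx[1], dx[2], dy[2]) +   # XY gives a 1000 x 1000 result
              mult_cost(dx[1], dy[2], dz[2])     # then (XY)Z
cost_right <- mult_cost(dy[1], dy[2], dz[2]) +   # YZ gives a 10 x 10 result
              mult_cost(dx[1], dx[2], dz[2])     # then X(YZ)

# Both orders give the same matrix, up to floating-point rounding.
set.seed(3)
X <- matrix(runif(prod(dx)), dx[1], dx[2])
Y <- matrix(runif(prod(dy)), dy[1], dy[2])
Z <- matrix(runif(prod(dz)), dz[1], dz[2])
same_result <- isTRUE(all.equal((X %*% Y) %*% Z, X %*% (Y %*% Z)))
```

For these shapes, (XY)Z costs 2e7 multiplications and materializes a 1000 × 1000 intermediate, while X(YZ) costs 2e5 and keeps only a 10 × 10 intermediate. An optimizer can only make this choice if %*% is visible to it as an operator.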

Processing/Layout Optimization
• Matrix multiplication T = A × B, where A is $n_1 \times n_2$ and B is $n_2 \times n_3$, with fixed memory size M
• R: plain algorithm
    For each row i of A:
      For each column j of B:
        T[i,j] <- sum(A[i,] * B[,j])
• RIOT-DB: hash join, sort, aggregate
• BNLJ (block nested-loop join)-inspired algorithm
    Read as many rows of A as possible:
      Use one block to scan B in column-major order:
        Update elements in T
• Blocked algorithm
    Divide memory into 3 equal parts
    Divide each matrix into square blocks
    For each chunk (i,j) in T:
      For k=1…p:
        Read chunk (i,k) from A and chunk (k,j) from B
        T(i,j) += A(i,k) %*% B(k,j)
      Write chunk T(i,j)
  • Optimal I/O cost: $n_1 n_2 n_3 / (B \sqrt{M})$
[Figure: three diagrams of T = A × B, one per algorithm]
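The blocked algorithm can be sketched in R as follows. Everything here is in memory, so the block reads and writes are only subscripting; in an out-of-core setting each would be a disk transfer. The function name `blocked_matmul` and the block side `bs` are assumptions (the slide derives the block size from splitting memory three ways).

```r
# Blocked multiplication of A (n1 x n2) by B (n2 x n3) using square blocks of
# side bs. Edge blocks may be smaller when bs does not divide the dimensions.
blocked_matmul <- function(A, B, bs) {
  n1 <- nrow(A); n2 <- ncol(A); n3 <- ncol(B)
  stopifnot(nrow(B) == n2)
  res <- matrix(0, n1, n3)
  for (i0 in seq(1, n1, by = bs)) {
    I <- i0:min(i0 + bs - 1, n1)
    for (j0 in seq(1, n3, by = bs)) {
      J <- j0:min(j0 + bs - 1, n3)
      acc <- matrix(0, length(I), length(J))   # block T(i,j), kept in memory
      for (k0 in seq(1, n2, by = bs)) {
        K <- k0:min(k0 + bs - 1, n2)
        # "Read" block (i,k) of A and block (k,j) of B, then accumulate.
        acc <- acc + A[I, K, drop = FALSE] %*% B[K, J, drop = FALSE]
      }
      res[I, J] <- acc                         # "write" block T(i,j) once
    }
  }
  res
}

# Small check against R's built-in product (sizes are arbitrary).
set.seed(11)
A <- matrix(runif(7 * 5), 7, 5)
B <- matrix(runif(5 * 6), 5, 6)
T_blocked <- blocked_matmul(A, B, bs = 3)
```

Each output block is written once, and each pair of input blocks is read once per output block, which is where the reduction in I/O over the row-by-column loop comes from.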

Conclusion
• I/O efficiency can be added transparently
  • Ditch SQL at user level for broader impact!
• DB-style inter-operation optimization is critical
  • Need to go beyond developing I/O-efficient algorithms and libraries
• Integration of DB and programming languages
  • Lots of interesting analogies and new opportunities

Q&A
Thank you!
RIOT photos by Zack Gold (www.zackgold.com)
