This paper presents RIOT, an R package that addresses I/O inefficiency in numerical computing. R is powerful, but it struggles with datasets that do not fit in main memory because of excessive memory use and disk swapping. RIOT hides a database backend behind R's familiar syntax, so users do not have to learn a new language. It defers evaluation and optimizes the resulting queries across operations, which improves performance on large datasets.
RIOT: I/O-Efficient Numerical Computing in R
Yi Zhang, Herodotos Herodotou, Jun Yang
What is R?
• R: an open-source language/environment
  • Statistical computing, graphics
  • Comprehensive R Archive Network (CRAN)
    • 1639 packages as of Dec 08
• Interpreted execution
• High-level constructs
  • Arrays, matrices
  • Code example:
      a <- 1:100
      …
      d <- a+b^2+c
• Common to languages for numerical/statistical computing
Big-Data Challenge
• R assumes all data in main memory
• If not, the virtual memory system (VM) starts swapping data from/to disk
  • Excessive I/O, poor performance
• Example:
    # n points with coordinates stored in x[1:n], y[1:n]
    d <- sqrt((x-xs)^2+(y-ys)^2)+sqrt((x-xe)^2+(y-ye)^2)
    s <- sample(n, 100)  # draw 100 samples from 1:n
    z <- d[s]            # extract elements of d whose indices are in s
[Figure: a point (x,y) together with the points S(xs,ys) and E(xe,ye)]
[Figure: x, y and intermediate results such as x-xs, (x-xs)^2, y-ye, (x-xe)^2 and the 1st sqrt spill from memory into the swap/paging file]
Opportunities
• Avoiding intermediate results
  • Multiple large intermediate results are generated
  • Can we avoid them without hand-coding loops?
      for (i in 1:n) { d[i] <- sqrt((x[i]-xs)^2+…)+… }
• Deferred and selective evaluation
  • Each expression is evaluated in full immediately
  • Can we defer evaluation until really necessary?
  • Just compute the 100 elements from d picked by s
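The two opportunities on this slide can be illustrated with a minimal Python sketch. It is not RIOT's implementation: the helper `route_length`, the sample data and the chosen S and E coordinates are all assumptions made for illustration. The idea is that d is described once as a per-element function, and only the sampled elements are ever computed, so no full-length intermediate vector exists. Note that Python indices start at 0, unlike R's 1-based indices.

```python
import math
import random


def route_length(x, y, start, end):
    """Return a function that computes d[i] on demand (hypothetical helper)."""
    xs, ys = start
    xe, ye = end

    def d(i):
        # Same formula as the slide, applied to a single element.
        return (math.sqrt((x[i] - xs) ** 2 + (y[i] - ys) ** 2)
                + math.sqrt((x[i] - xe) ** 2 + (y[i] - ye) ** 2))

    return d


n = 1000
rng = random.Random(0)                 # fixed seed so the run is repeatable
x = [rng.random() for _ in range(n)]   # made-up coordinates
y = [rng.random() for _ in range(n)]

d = route_length(x, y, start=(0.0, 0.0), end=(1.0, 1.0))  # nothing computed yet
s = rng.sample(range(n), 100)          # 100 sampled indices
z = [d(i) for i in s]                  # only these 100 elements are evaluated
```

The cost of `z` depends on the 100 samples rather than on n, which is the effect the slide asks for without hand-coding a loop at every call site.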
Existing Solutions
• Rewrite and hand-optimize code
  • Tedious, not quite reusable
• Use I/O-efficient libraries
  • SOLAR [Toledo’96], DRA [Nieplocha’96], etc.
  • But efficient individual operations are not enough
• Build/extend a DB
  • RasDaMan [Baumann’99], AML [Marathe’02], ASAP [Stonebraker’07], …
  • Must rewrite using a new language (often SQL)
  • Explicit boundary between DB and host language
R with I/O Transparency
• Attain I/O efficiency without explicit user intervention
• Run legacy code with no or minimal modification
• No need to learn new languages/libraries
• No boundary between host language and backend processing
RIOT
• Implemented as an R package
• New types, same interfaces: dbvector, dbmatrix, …
• Uses R's generics mechanism for transparency
  • New class definition:
      setClass("dbvector", representation(size="numeric", …))
  • Implementation:
      SEXP add_dbvectors(SEXP e1, SEXP e2) { … }
  • Method overloading:
      setMethod("+", signature(e1="dbvector", e2="dbvector"),
                function(e1, e2) { .Call("add_dbvectors", e1, e2) })
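The "new types, same interfaces" idea can be sketched outside R as well. The Python class below is a loose analogue, not RIOT's code: `LazyVector` and its methods are hypothetical names, and Python operator overloading stands in for R's `setMethod`. Each overloaded operator returns a new object that remembers how to compute an element, so user code such as `a + b**2 + c` looks unchanged while nothing is computed until elements are requested. R's vector recycling rules are deliberately left out.

```python
import operator


class LazyVector:
    """Vector-like object that records operations instead of running them."""

    def __init__(self, size, element_at):
        self.size = size
        self._element_at = element_at  # function: index -> value

    @classmethod
    def from_list(cls, values):
        return cls(len(values), values.__getitem__)

    def _combine(self, other, op):
        if isinstance(other, LazyVector):
            if other.size != self.size:
                raise ValueError("size mismatch")
            return LazyVector(
                self.size,
                lambda i: op(self._element_at(i), other._element_at(i)))
        # Scalar operand: apply it to every element on demand.
        return LazyVector(self.size, lambda i: op(self._element_at(i), other))

    def __add__(self, other):
        return self._combine(other, operator.add)

    def __sub__(self, other):
        return self._combine(other, operator.sub)

    def __pow__(self, other):
        return self._combine(other, operator.pow)

    def __getitem__(self, indices):
        # Evaluation happens here, and only for the requested indices.
        return [self._element_at(i) for i in indices]


a = LazyVector.from_list([1, 2, 3, 4])
b = LazyVector.from_list([1, 1, 2, 2])
c = LazyVector.from_list([10, 20, 30, 40])

d = a + b ** 2 + c      # builds a description of d; no arithmetic yet
print(d[[0, 3]])        # [12, 48]
```

The user-facing expression is the same as with ordinary numbers, which is the transparency the slide attributes to R's generics.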
RIOT-DB: Hidden DB Backend
• A strawman solution: map large arrays to DB tables
  • e.g. vector: V(i,v); matrix: M(i,j,v)
  • Each computation becomes a query, e.g. a+b:
      SELECT A.I, A.V+B.V FROM A,B WHERE A.I=B.I
  • Leverages the power of the DB only at the intra-operation level!
• Key: translate operations to view definitions
  • Build up larger and larger views, a step at a time
  • Evaluate only when needed → deferred evaluation
  • Query optimization → selective evaluation + more
  • Iterator-style execution → no intermediate results
• d <- sqrt((x-xs)^2+(y-ys)^2)+… defines a chain of views:
    CREATE VIEW T1(I,V) AS SELECT X.I, X.V-xs FROM X;
    CREATE VIEW T2(I,V) AS SELECT T1.I, POW(T1.V,2) FROM T1;
    …
    CREATE VIEW D(I,V) AS SELECT T6.I, T6.V+T12.V FROM T6,T12 WHERE T6.I=T12.I;
• … z <- d[s] adds one more view:
    CREATE VIEW Z(I,V) AS SELECT S.I, D.V FROM D,S WHERE D.I=S.V;
• When z is finally needed, the views expand into a single query:
    SELECT S.I, SQRT(POW(X.V-xs,2)+POW(Y.V-ys,2))
              + SQRT(POW(X.V-xe,2)+POW(Y.V-ye,2))
    FROM X,Y,S WHERE X.I=Y.I AND X.I=S.V
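The view-building idea can be tried with Python's built-in `sqlite3` module. This is only a sketch under stated assumptions: the slides use a different database, the table contents are made up, and the expression is the simpler `d <- a+b^2+c` from the "What is R?" slide so that no SQRT or POW function is needed (b^2 is written as `B.V * B.V`). Each R statement only defines a view; the database does real work once, when z is read.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Vectors stored as V(i, v) tables, as on the slide. Contents are made up.
conn.executescript("""
    CREATE TABLE A(I INTEGER PRIMARY KEY, V REAL);
    CREATE TABLE B(I INTEGER PRIMARY KEY, V REAL);
    CREATE TABLE C(I INTEGER PRIMARY KEY, V REAL);
    CREATE TABLE S(I INTEGER PRIMARY KEY, V INTEGER);
""")
n = 10
conn.executemany("INSERT INTO A VALUES (?, ?)",
                 [(i, float(i)) for i in range(1, n + 1)])
conn.executemany("INSERT INTO B VALUES (?, ?)",
                 [(i, 2.0) for i in range(1, n + 1)])
conn.executemany("INSERT INTO C VALUES (?, ?)",
                 [(i, 0.5) for i in range(1, n + 1)])
conn.executemany("INSERT INTO S VALUES (?, ?)", [(1, 7), (2, 3)])  # s <- c(7, 3)

# d <- a + b^2 + c and z <- d[s]: only view definitions, no computation yet.
conn.executescript("""
    CREATE VIEW T1(I, V) AS SELECT B.I, B.V * B.V FROM B;
    CREATE VIEW T2(I, V) AS SELECT A.I, A.V + T1.V FROM A, T1 WHERE A.I = T1.I;
    CREATE VIEW D(I, V)  AS SELECT T2.I, T2.V + C.V FROM T2, C WHERE T2.I = C.I;
    CREATE VIEW Z(I, V)  AS SELECT S.I, D.V FROM D, S WHERE D.I = S.V;
""")

# Reading z is the first point at which the database evaluates anything.
z = conn.execute("SELECT I, V FROM Z ORDER BY I").fetchall()
print(z)  # [(1, 11.5), (2, 7.5)]
```

Whether the engine actually restricts the work to the rows selected by S depends on its optimizer, which is the point of the "query optimization → selective evaluation" bullet.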
RIOT-DB Demo
• RIOT-DB built using MySQL with the MyISAM engine
Performance of RIOT-DB
• Plain R
• RIOT-DB variants
  • RIOT-DB/Strawman: use DB to store arrays and execute individual ops; no use of views to defer evaluation
  • RIOT-DB/MatNamed: use views, but compute/materialize every named object
  • RIOT-DB: full version; defer/optimize across statements
Lessons Learned
• DB-style inter-operation optimization is really the key!
• Can we do better?
  • DB arrays carry too much overhead (ASAP [Stonebraker’07])
    • Extra columns in V(i, v), M(i, j, v), …; more for higher dims
  • SQL & relational algebra may not be the right abstraction
    • Advanced data layouts and complex ops are awkward
• RIOT: The Next Generation
  • A new expression algebra closer to numerical computation
  • Flexible array storage/layout options
  • Optimizations better tailored for numerical computation
  • … and more
RIOT Expression Algebra
• Analogous to the view mechanism, but more flexible
• Operators
  • +, -, *, /, [, …
  • A[idxRange] <- newVals: turn updates into functional ops
    • Instead of in-place updates, log them & define Anew over (Aold, log)
  • X %*% Y (matrix multiply) etc.: built-in, for high-level opt.
    • E.g. matrix chain multiplication: (XY)Z or X(YZ)?
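The matrix chain question, (XY)Z or X(YZ), comes down to comparing operation counts once the shapes are known. The short Python sketch below counts scalar multiplications for both orders using the standard rows × inner × cols cost of a dense product. The function names and the example shapes are assumptions chosen for illustration; RIOT's own cost model is not described on the slide.

```python
def multiply_cost(rows, inner, cols):
    """Scalar multiplications for a dense (rows x inner) by (inner x cols) product."""
    return rows * inner * cols


def chain_costs(x_shape, y_shape, z_shape):
    """Return (cost of (XY)Z, cost of X(YZ)) for three conformable matrices."""
    (p, q), (q2, r), (r2, s) = x_shape, y_shape, z_shape
    if q != q2 or r != r2:
        raise ValueError("shapes are not conformable")
    left_first = multiply_cost(p, q, r) + multiply_cost(p, r, s)    # (XY)Z
    right_first = multiply_cost(q, r, s) + multiply_cost(p, q, s)   # X(YZ)
    return left_first, right_first


# Hypothetical shapes: X is 1000x10, Y is 10x1000, Z is 1000x10.
left, right = chain_costs((1000, 10), (10, 1000), (1000, 10))
print(left, right)  # 20000000 200000
```

With these shapes, X(YZ) needs 100 times fewer multiplications, which is why keeping `%*%` as a built-in operator lets an optimizer pick the order.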
Processing/Layout Optimization
• Matrix multiplication T = A (n1×n2) × B (n2×n3), with fixed memory size M
• R: plain algorithm
    For each row i of A:
      For each column j of B:
        T[i,j] <- sum(A[i,] * B[,j])
• RIOT-DB: hash join, sort, aggregate
• BNLJ-inspired (block nested-loop join) algorithm
    Read as many rows of A as possible:
      Use one block to scan B in column-major order:
        Update elements in T
• Blocked algorithm
    Divide memory into 3 equal parts
    Divide each matrix into square blocks
    For each chunk (i,j) in T:
      For k=1…p:
        Read chunk (i,k) from A and chunk (k,j) from B
        chunk T(i,j) += A(i,k) %*% B(k,j)
      Write chunk T(i,j)
  • Optimal I/O cost: n1·n2·n3 / (B·M^(1/2)), where B in the denominator is the disk block size, not the matrix B
[Figures: access patterns over A, B and T (T = A × B) for each algorithm]
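The blocked algorithm on this slide can be sketched in plain Python. This is an in-memory illustration only: the matrices are nested lists, `block` is a hypothetical chunk edge length, and the `chunk_reads` counter merely stands in for the disk reads that a real out-of-core implementation would issue.

```python
def blocked_matmul(A, B, block):
    """Multiply A (n1 x n2) by B (n2 x n3) chunk by chunk.

    Returns the product and the number of input chunks "read".
    """
    n1, n2, n3 = len(A), len(B), len(B[0])
    if any(len(row) != n2 for row in A):
        raise ValueError("inner dimensions do not match")
    T = [[0.0] * n3 for _ in range(n1)]
    chunk_reads = 0
    for i0 in range(0, n1, block):            # chunk row of T
        for j0 in range(0, n3, block):        # chunk column of T
            # Chunk T(i,j) stays resident while k sweeps the inner dimension.
            for k0 in range(0, n2, block):
                chunk_reads += 2              # chunk (i,k) of A and (k,j) of B
                for i in range(i0, min(i0 + block, n1)):
                    for k in range(k0, min(k0 + block, n2)):
                        a = A[i][k]
                        for j in range(j0, min(j0 + block, n3)):
                            T[i][j] += a * B[k][j]
    return T, chunk_reads


A = [[1, 2, 3],
     [4, 5, 6]]
B = [[7, 8],
     [9, 10],
     [11, 12]]
T, reads = blocked_matmul(A, B, block=2)
print(T)      # [[58.0, 64.0], [139.0, 154.0]]
print(reads)  # 4
```

Each chunk of T is written once, while chunks of A and B are re-read once per matching chunk of T; a larger `block` (that is, more memory) reduces the total number of reads.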
Conclusion
• I/O efficiency can be added transparently
  • Ditch SQL at user level for broader impact!
• DB-style inter-operation optimization is critical
  • Need to go beyond developing I/O-efficient algorithms and libraries
• Integration of DB and programming languages
  • Lots of interesting analogies and new opportunities
Q&A Thank you! RIOT photos by Zack Gold (www.zackgold.com)