Data Mining Algorithms as a Service in the Cloud Exploiting Relational Database Systems. Carlos Ordonez, Javier Garcia-Garcia, Carlos Garcia-Alvarado, Wellington Cabrera, Veera Baladandayuthapani, Shoaib Quraishi
Data Mining Algorithms as a Service in the Cloud Exploiting Relational Database Systems • Carlos Ordonez, Javier Garcia-Garcia, Carlos Garcia-Alvarado, Wellington Cabrera, Veera Baladandayuthapani, Shoaib Quraishi • Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Motivation • Relational databases are a natural repository of data. • Enterprise Systems • But analytical tasks are often done outside the DBMS • Drawbacks • External data mining software • Data exporting • Privacy issues
Our proposal • Provide analytical algorithms as a service in the cloud, exploiting the processing power of DBMSs • DBMSs present both in the cloud and on the client side • No external packages required • Standard SQL queries, UDFs and aggregate UDFs • A set of off-the-shelf algorithms is provided
Challenges • Large volume of data to be transmitted • Matrix computations • Processing power requirements of number crunching • Data redundancy • Minimize I/O • All data in relational format • Avoid exporting tasks
Advantages • Cloud system can: • Reduce workload on the local system • Accelerate analytical processing • Enforce data security • Simplify multiple model management • No data mining software needs to be installed, either on the local system or in the cloud • Everything stored in relational tables
System attributes • Smart local processing: exploits CPU/RAM of the local DBMS • Integrated: local DBMS and cloud DBMS are tightly integrated • Fast: one pass over the input table for most algorithms; parallel • Simple: algorithms are called through a stored procedure with default parameters • Relational: relational tables store models and job parameters
System Components • Cloud DBMS • Stored procedures, UDFs • Cloud management server • Handling data mining job requests • Monitoring job progress • Cost estimation for 3 alternative processing modes • Managing jobs • Local DBMS • Stored procedures, UDFs • Web application • User can post jobs using a web interface
Models • PCA • K-Means • Linear Regression • Variable Selection • Naïve Bayes
Remarks • Hybrid Mode: • Sufficient statistics calculated in local DBMS • Takes advantage of local processing power and RAM • Cloud DBMS receives a summarization • Transmitting the entire data set is avoided • Model computation in cloud DBMS • Cloud Mode: • Summarization step • Occurs in cloud • Large data sets: Sampling • Local Mode: • Preferred for small data sets • Summarization/Sampling
Job Scheduler • FIFO job scheduling by default • If the wait time for an individual job exceeds a threshold ψ, the system switches to SJF (shortest job first) • If most jobs take a long time to compute and the wait time exceeds ψ, the system switches to Round Robin (RR) • As the load decreases, the system backtracks to SJF, then FIFO
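The policy switching described on this slide can be sketched as a small stateless function. This is only an illustration, not the system's scheduler: the `Job` fields, the `long_job_fraction` input, and the 0.5 cutoff standing in for "most jobs" are all assumptions introduced here.

```python
from dataclasses import dataclass

FIFO, SJF, RR = "FIFO", "SJF", "RR"

@dataclass
class Job:
    name: str
    arrival: float    # submission time (hypothetical field)
    est_cost: float   # estimated run time (hypothetical field)

def choose_policy(max_wait, psi, long_job_fraction, majority=0.5):
    """Pick a scheduling policy from the current load.

    max_wait: longest wait among queued jobs
    psi: wait-time threshold from the slide
    long_job_fraction: share of queued jobs considered long (assumed metric)
    """
    if max_wait <= psi:
        return FIFO                      # light load: default policy
    if long_job_fraction > majority:
        return RR                        # waits too long and most jobs are long
    return SJF                           # waits too long, jobs mostly short

def next_job(queue, policy):
    """Select the next job to run under the given policy."""
    if policy == SJF:
        return min(queue, key=lambda j: j.est_cost)
    # FIFO and RR both start with the oldest job; a real RR scheduler
    # would additionally preempt it after a time slice.
    return min(queue, key=lambda j: j.arrival)
```

Because the function is re-evaluated from the current load each time, the backtracking from RR to SJF to FIFO happens without extra state once waits drop below ψ.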
Algorithm Optimizations • Sufficient statistics are exploited to accelerate data mining algorithms • Previous work [1] shows that Linear Regression, PCA, Naïve Bayes and K-means can be computed efficiently from the sufficient statistics n, L, Q • Sufficient statistics can be computed • On samples • On the whole data set
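As a rough illustration of why sufficient statistics help, the sketch below accumulates n, L (the sum of the points) and Q (the sum of their outer products) in a single pass, then derives the mean and covariance without touching the data again. It is plain Python for readability; the function names are hypothetical, and the system itself is described as doing this inside the DBMS with aggregate UDFs.

```python
def summarize(rows):
    """One pass over rows (each a list of d numbers); returns n, L, Q."""
    n, L, Q = 0, None, None
    for x in rows:
        if L is None:
            d = len(x)
            L = [0.0] * d                       # linear sum of points
            Q = [[0.0] * d for _ in range(d)]   # sum of outer products
        n += 1
        for a in range(d):
            L[a] += x[a]
            for b in range(d):
                Q[a][b] += x[a] * x[b]
    return n, L, Q

def mean_and_covariance(n, L, Q):
    """Population mean and covariance from the summary alone."""
    d = len(L)
    mu = [L[a] / n for a in range(d)]
    cov = [[Q[a][b] / n - mu[a] * mu[b] for b in range(d)] for a in range(d)]
    return mu, cov

n, L, Q = summarize([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
mu, cov = mean_and_covariance(n, L, Q)
```

The covariance matrix obtained this way is the input PCA needs, so the model step works on a d x d summary rather than on the n input rows.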
Sufficient Statistics: nLQ/Γ • Consider a data set X with n points • The sufficient statistics are generalized as: n = |X|, Z = [1, X, Y]
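The formulas on this slide did not survive extraction. Following the definitions in the cited references, and assuming X is a d × n matrix whose columns are the points x_i, Y is a 1 × n row of target values, and each column of Z is z_i = [1, x_i, y_i], the statistics can be written as:

```latex
\begin{align*}
n &= |X|, \qquad
L = \sum_{i=1}^{n} x_i, \qquad
Q = X X^{T} = \sum_{i=1}^{n} x_i x_i^{T}, \\
Z &= [\mathbf{1}, X, Y], \qquad
\Gamma = Z Z^{T} =
\begin{bmatrix}
n & L^{T} & \mathbf{1}^{T} Y^{T} \\
L & Q & X Y^{T} \\
Y \mathbf{1} & Y X^{T} & Y Y^{T}
\end{bmatrix}
\end{align*}
```

Under these assumptions Γ is a (d+2) × (d+2) matrix that contains n, L and Q as sub-blocks, which is why a single Γ can stand in for the separate n, L, Q summary.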
Sufficient Statistics: nLQ/Γ • One set of sufficient statistics per class/cluster is needed for: • Naïve Bayes • K-means • One matrix Γ is enough for: • PCA • Linear Regression • Variable Selection
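A minimal sketch of the per-class case, assuming a Gaussian Naïve Bayes model for which only the diagonal of Q (the per-dimension sum of squares) is kept. The function names and the dictionary layout are hypothetical, chosen for illustration only.

```python
def summarize_by_class(rows, labels):
    """One pass: keep n, L and diagonal Q separately for each class."""
    stats = {}
    for x, c in zip(rows, labels):
        if c not in stats:
            stats[c] = {"n": 0, "L": [0.0] * len(x), "Q": [0.0] * len(x)}
        s = stats[c]
        s["n"] += 1
        for a, v in enumerate(x):
            s["L"][a] += v        # per-dimension sum
            s["Q"][a] += v * v    # per-dimension sum of squares
    return stats

def class_model(s):
    """Per-dimension mean and population variance for one class."""
    n = s["n"]
    mean = [l / n for l in s["L"]]
    var = [q / n - m * m for q, m in zip(s["Q"], mean)]
    return mean, var

stats = summarize_by_class(
    [[1.0, 2.0], [3.0, 4.0], [10.0, 10.0]],
    ["a", "a", "b"],
)
mean_a, var_a = class_model(stats["a"])
```

The same grouping applies to K-means, with the cluster assignment of each point playing the role of the class label.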
Data transfer comparison • Data set: Physical Activity (n = 2.88M, d = 42) • Full data set: 880.00 MB • nLQ/Γ: 0.02 MB • Roughly 50,000 times smaller!
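The summary size can be sanity-checked with a short calculation. The sketch assumes Γ is stored as a dense (d+2) × (d+2) matrix of 8-byte floating point values and that 1 MB is 10^6 bytes; neither detail is stated on the slide.

```python
BYTES_PER_DOUBLE = 8  # assumed storage type

def gamma_bytes(d):
    """Size of a dense (d+2) x (d+2) Gamma matrix in bytes."""
    side = d + 2
    return side * side * BYTES_PER_DOUBLE

size = gamma_bytes(42)       # 44 * 44 * 8 bytes
size_mb = size / 1e6         # about 0.015 MB, which rounds to 0.02 MB
ratio = 880.0e6 / size       # full data set size over summary size
```

Under these assumptions the ratio lands between 50,000 and 60,000, in line with the figure quoted on the slide. The summary size also depends only on d, so it stays the same as n grows.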
Optimizations • Sufficient statistics • Calculated in one parallel scan • Aggregate UDFs • Multithreaded, RAM • Matrix computations in RAM • LAPACK integration • Fast, accurate, stable
Summary • Sufficient statistics transmitted to the cloud • Hybrid processing is best • Job policy: FIFO -> SJF -> RR • Parallel summarization, parallel scan • Model computation in RAM in the cloud • Complicated number crunching in the cloud • Job and model history in the cloud • All data is in relational tables: they can be queried and stored securely
References • [1] C. Ordonez. Statistical model computation with UDFs. IEEE Transactions on Knowledge and Data Engineering (TKDE), 2010 • [2] C. Ordonez, Y. Zhang, W. Cabrera. The Gamma Operator for Big Data Summarization on an Array DBMS (BigMine 2014). JMLR W&CP 36:88-103, 2014 • [3] C. Ordonez, C. Garcia-Alvarado, V. Baladandayuthapani. Bayesian Variable Selection in Linear Regression in One Pass for Large Data Sets. ACM Transactions on Knowledge Discovery from Data (TKDD), 2015