Data Mining Algorithms as a Service in the Cloud Exploiting Relational Database Systems

Carlos Ordonez, Javier Garcia-Garcia, Carlos Garcia-Alvarado, Wellington Cabrera, Veera Baladandayuthapani, Shoaib Quraishi. Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data

Motivation • Relational databases are a natural repository of data. • Enterprise Systems • But analytical tasks are often done outside the DBMS • Drawbacks • External data mining software • Data exporting • Privacy issues

Our proposal • Provide analytical algorithms as a service in the cloud, exploiting the processing power of DBMSs • DBMSs present both in the cloud and on the client side • No external packages required • Standard SQL queries, UDFs and aggregate UDFs • A set of off-the-shelf algorithms is provided

Challenges • Large volume of data to be transmitted • Matrix computations • Processing power requirements of number crunching • Data redundancy • Minimize I/O • All data in relational format • Avoid exporting tasks

Advantages • Cloud system can: • Reduce workload on the local system • Accelerate analytical processing • Enforce data security • Simplify management of multiple models • No data mining software needs to be installed, either on the local system or in the cloud • Everything is stored in relational tables

System attributes • Smart local processing: exploits CPU/RAM of the local DBMS • Integrated: local DBMS and cloud DBMS are tightly integrated • Fast: one pass over the input table for most algorithms; parallel • Simple: algorithms are called through a stored procedure with default parameters • Relational: relational tables store models and job parameters

System Components • Cloud DBMS • Stored procedures, UDFs • Cloud management server • Handling data mining job requests • Monitoring job progress • Cost estimation for 3 alternative processing modes • Managing jobs • Local DBMS • Stored procedures, UDFs • Web application • Users can post jobs through a web interface

Models • PCA • K-Means • Linear Regression • Variable Selection • Naïve Bayes

Job processing

Remarks • Hybrid Mode: • Sufficient statistics calculated in local DBMS • Takes advantage of local processing power, RAM • Cloud DBMS receives a summarization • Transmitting the entire dataset is avoided • Model computation in cloud DBMS • Cloud Mode: • Summarization step • Occurs in cloud • Large data sets: Sampling • Local Mode: • Preferred for small datasets • Summarization/Sampling

Job Scheduler • FIFO job scheduling by default • If the wait time for an individual job goes beyond a threshold ψ, the system switches to SJF • If most jobs take a long time to compute and the waiting time is beyond ψ, the system switches to Round Robin (RR) • As the load decreases, the system backtracks to SJF, then FIFO
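The policy switching described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Job` fields, the `long_runtime` cutoff, and the reading of "most jobs" as "more than half of the queue" are all hypothetical and do not come from the slides.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    est_runtime: float   # estimated runtime in seconds (hypothetical field)
    waited: float = 0.0  # time already spent in the queue (hypothetical field)


def choose_policy(queue, psi, long_runtime):
    """Pick a policy from the queue state: FIFO by default, SJF or RR under load.

    Assumption: "most jobs take a long time" is read as more than half of the
    queued jobs having est_runtime >= long_runtime.
    """
    # No job has waited longer than psi: stay on (or fall back to) FIFO.
    if not queue or max(job.waited for job in queue) <= psi:
        return "FIFO"
    long_jobs = sum(1 for job in queue if job.est_runtime >= long_runtime)
    if long_jobs > len(queue) / 2:
        return "RR"
    return "SJF"


def next_job(queue, policy):
    """Return the job to serve next; under RR the head job gets one time slice."""
    if policy == "SJF":
        return min(queue, key=lambda job: job.est_runtime)
    return queue[0]
```

Because the policy is recomputed from the current queue, the backtracking to SJF and FIFO as load decreases falls out of the same function rather than needing separate state.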

Job queue

Algorithm Optimizations • Sufficient statistics are exploited to accelerate data mining algorithms • Previous work [1] shows that Linear Regression, PCA, Naïve Bayes and K-means are efficiently computed from the sufficient statistics n, L, Q • Sufficient statistics can be computed • On samples • On the whole dataset

Sufficient Statistics: nLQ/Γ • Considering a dataset X with n points • The sufficient statistics are generalized as: n = |X|, Z = [1, X, Y]
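As a rough illustration, the sketch below accumulates Γ in one pass over the rows. It assumes the definitions used in the cited work on these statistics: L is the sum of the points, Q is the sum of their outer products, and Γ = Z Zᵀ, where each column of Z is z = [1, x, y]. The function names and the toy rows are hypothetical, and plain Python lists stand in for the aggregate UDFs.

```python
def gamma_one_pass(rows):
    """Accumulate Gamma as the sum of z z^T over all rows, z = [1, x_1..x_d, y]."""
    gamma = None
    for x, y in rows:
        z = [1.0] + list(x) + [y]
        if gamma is None:
            # Gamma is (d+2) x (d+2); its size does not depend on n.
            gamma = [[0.0] * len(z) for _ in z]
        for a, za in enumerate(z):
            for b, zb in enumerate(z):
                gamma[a][b] += za * zb
    return gamma


def split_nlq(gamma):
    """Read n, L and Q back out of the leading blocks of Gamma."""
    d = len(gamma) - 2
    n = gamma[0][0]
    L = gamma[0][1:d + 1]
    Q = [row[1:d + 1] for row in gamma[1:d + 1]]
    return n, L, Q


# Toy data: three points with d = 2 and one dependent variable y.
rows = [((1.0, 2.0), 3.0), ((3.0, 4.0), 7.0), ((5.0, 6.0), 11.0)]
gamma = gamma_one_pass(rows)
n, L, Q = split_nlq(gamma)
```

Under these assumptions n, L and Q sit in the top-left blocks of Γ, and the last row and column carry the sums involving Y.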

Sufficient Statistics: nLQ/Γ • One set of sufficient statistics per class/cluster is necessary for: • Naïve Bayes • K-means • One matrix Γ is enough for • PCA • Linear Regression • Variable Selection
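A minimal sketch of the per-class case follows. It assumes a Gaussian Naïve Bayes model with independent dimensions, so only the diagonal of Q (the per-dimension sums of squares) is kept for each class. The dictionary layout and function names are hypothetical.

```python
def class_statistics(rows):
    """One pass: per class, accumulate n, L (sums) and diag(Q) (sums of squares)."""
    stats = {}
    for x, label in rows:
        if label not in stats:
            stats[label] = {"n": 0, "L": [0.0] * len(x), "Qdiag": [0.0] * len(x)}
        s = stats[label]
        s["n"] += 1
        for j, value in enumerate(x):
            s["L"][j] += value
            s["Qdiag"][j] += value * value
    return stats


def gaussian_parameters(s):
    """Derive per-dimension mean and (population) variance from n, L, diag(Q)."""
    n = s["n"]
    mean = [total / n for total in s["L"]]
    var = [sq / n - m * m for sq, m in zip(s["Qdiag"], mean)]
    return mean, var


# Toy labeled data: two points in class "a", one in class "b".
rows = [((1.0, 2.0), "a"), ((3.0, 6.0), "a"), ((10.0, 0.0), "b")]
stats = class_statistics(rows)
mean_a, var_a = gaussian_parameters(stats["a"])
```

The same per-group accumulation would apply to K-means, with the cluster assignment taking the place of the class label.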

Data transfer comparison • Data set: Physical Activity (n = 2.88M, d = 42) • Dataset: 880.00 MB • nLQ/Γ: 0.02 MB • 50,000 times smaller!
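The size gap can be checked with back-of-the-envelope arithmetic. The sketch below assumes dense storage with 8-byte floating-point values and one dependent variable, so Γ has (d + 2) × (d + 2) entries. The slide's figures are measured sizes, so the numbers computed here will not match them exactly. The point is only that the summary size depends on d and not on n.

```python
BYTES_PER_VALUE = 8  # assumption: 8-byte floating-point values


def gamma_bytes(d, bytes_per_value=BYTES_PER_VALUE):
    """Size of the (d+2) x (d+2) matrix Gamma; independent of n."""
    side = d + 2
    return side * side * bytes_per_value


def dataset_bytes(n, d, bytes_per_value=BYTES_PER_VALUE):
    """Size of the raw n x d data matrix under the same storage assumption."""
    return n * d * bytes_per_value


n, d = 2_880_000, 42
summary = gamma_bytes(d)    # 44 * 44 * 8 bytes
raw = dataset_bytes(n, d)   # grows linearly with n
ratio = raw / summary
```

With these assumptions Γ occupies about 15 KB, consistent with the 0.02 MB on the slide after rounding, while the raw matrix is several hundred megabytes.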

Optimizations • Sufficient Statistics • Calculated in one parallel scan • Aggregate UDFs • Multithreaded, RAM • Matrix computations in RAM • LAPACK integration • Fast, accurate, stable

Summary • Sufficient statistics transmitted to cloud • Hybrid processing is best • Job policy: FIFO->SJF->RR • Parallel summarization, parallel scan • Model computation in RAM in the cloud • Complicated number crunching in the cloud • Job and model history in the cloud • All data is in relational tables: they can be queried and stored securely

References • C. Ordonez. Statistical Model Computation with UDFs. IEEE Transactions on Knowledge and Data Engineering (TKDE), 2010 • C. Ordonez, Y. Zhang, W. Cabrera. The Gamma Operator for Big Data Summarization on an Array DBMS (BigMine 2014). JMLR W&CP 36:88-103, 2014 • C. Ordonez, C. Garcia-Alvarado, V. Baladandayuthapani. Bayesian Variable Selection in Linear Regression in One Pass for Large Data Sets. ACM Transactions on Knowledge Discovery from Data (TKDD), 2015
