Data Mining Algorithms as a Service in the Cloud Exploiting Relational Database Systems

Carlos Ordonez, Javier Garcia-Garcia, Carlos Garcia-Alvarado, Wellington Cabrera, Veera Baladandayuthapani, Shoaib Quraishi

Presentation Transcript


  1. Data Mining Algorithms as a Service in the Cloud Exploiting Relational Database Systems • Carlos Ordonez, Javier Garcia-Garcia, Carlos Garcia-Alvarado, Wellington Cabrera, Veera Baladandayuthapani, Shoaib Quraishi • Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data

  2. Motivation • Relational databases are a natural repository of data. • Enterprise Systems • But analytical tasks are often done outside the DBMS • Drawbacks • External data mining software • Data exporting • Privacy issues

  3. Our proposal • Provide analytical algorithms as a service in the cloud, exploiting the processing power of DBMSs • DBMSs present both in the cloud and on the client side • No external packages required • Standard SQL queries, UDFs and aggregate UDFs • A set of off-the-shelf algorithms is provided

  4. Challenges • Large volume of data to be transmitted • Matrix computations • Processing power requirements of number crunching • Data redundancy • Minimize I/O • All data in relational format • Avoid exporting tasks

  5. Advantages • Cloud system can: • Reduce the workload on the local system • Accelerate analytical processing • Enforce data security • Simplify multiple model management • No data mining software needs to be installed, either on the local system or in the cloud • Everything is stored in relational tables

  6. System attributes • Smart local processing: exploit CPU/RAM of local DBMS • Integrated: local DBMS and cloud DBMS are tightly integrated • Fast: one pass over the input table for most algorithms; parallel • Simple: algorithms are called through a stored procedure with default parameters • Relational: relational tables store models and job parameters

  7. System Components • Cloud DBMS • Stored procedures, UDFs • Cloud management server • Handling data mining job requests • Monitoring job progress • Cost estimation for 3 alternative processing modes • Managing jobs • Local DBMS • Stored procedures, UDFs • Web application • User can post jobs using a web interface

  8. Models • PCA • K-Means • Linear Regression • Variable Selection • Naïve Bayes

  9. Job processing

  10. Remarks • Hybrid Mode: • Sufficient statistics calculated in local DBMS • Takes advantage of local processing power and RAM • Cloud DBMS receives a summarization • Transmitting the entire data set is avoided • Model computation in cloud DBMS • Cloud Mode: • Summarization step • Occurs in cloud • Large data sets: sampling • Local Mode: • Preferred for small data sets • Summarization/sampling
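
The hybrid mode described on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a single predictor x and a response y, uses JSON as a stand-in for whatever transfer format the system uses, and the function names `summarize_locally` and `fit_in_cloud` are hypothetical. The local side makes one pass over the rows and sends only five numbers; the cloud side fits a line from those numbers alone.

```python
import json

def summarize_locally(rows):
    """One pass over (x, y) rows; returns the sums needed for simple linear regression."""
    s = {"n": 0, "sx": 0.0, "sy": 0.0, "sxx": 0.0, "sxy": 0.0}
    for x, y in rows:
        s["n"] += 1
        s["sx"] += x
        s["sy"] += y
        s["sxx"] += x * x
        s["sxy"] += x * y
    return s

def fit_in_cloud(payload):
    """Hypothetical cloud-side step: fit y = b0 + b1 * x from the summary only."""
    s = json.loads(payload)
    n = s["n"]
    denom = n * s["sxx"] - s["sx"] ** 2
    b1 = (n * s["sxy"] - s["sx"] * s["sy"]) / denom
    b0 = (s["sy"] - b1 * s["sx"]) / n
    return b0, b1

# Synthetic data lying exactly on y = 3 + 2x (illustrative only).
rows = [(float(i), 3.0 + 2.0 * i) for i in range(1000)]
payload = json.dumps(summarize_locally(rows))   # this string is all that is transmitted
b0, b1 = fit_in_cloud(payload)
```

The payload size stays the same no matter how many rows are scanned locally, which is the point of transmitting a summarization instead of the data set.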

  11. Job Scheduler • FIFO job scheduling by default • If the wait time for an individual job goes beyond a threshold ψ, the system switches to SJF • If most jobs take a long time to compute and the waiting time is beyond ψ, the system switches to Round Robin (RR) • As the load decreases, the system backtracks to SJF, then FIFO
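
The policy switching on this slide might be sketched like this. Several details are assumptions, since the slide does not specify them: "most jobs" is read as a strict majority, the cutoff `long_job` for what counts as a long job is hypothetical, and the function is stateless, so it re-evaluates the queue each time instead of enforcing a stepwise RR to SJF to FIFO backtrack.

```python
def choose_policy(wait_times, est_runtimes, psi, long_job):
    """Pick a scheduling policy for the current queue.

    wait_times   -- seconds each queued job has waited so far
    est_runtimes -- estimated run time of each queued job, in seconds
    psi          -- wait-time threshold (the slide's psi)
    long_job     -- hypothetical cutoff above which a job counts as long
    """
    # Default: nobody has waited longer than psi, so plain FIFO is fine.
    if not wait_times or max(wait_times) <= psi:
        return "FIFO"
    # Waiting time exceeds psi: check whether long jobs dominate the queue.
    n_long = sum(1 for t in est_runtimes if t > long_job)
    if n_long > len(est_runtimes) / 2:   # assumption: "most" means a strict majority
        return "RR"
    return "SJF"

light = choose_policy([10, 20], [5, 500], psi=60, long_job=300)
congested = choose_policy([10, 90], [5, 500], psi=60, long_job=300)
heavy = choose_policy([90, 70, 10], [500, 400, 5], psi=60, long_job=300)
```

Because the decision depends only on the current queue, the policy falls back toward FIFO on its own once wait times drop below ψ.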

  12. Job queue

  13. Job queue

  14. Algorithm Optimizations • Sufficient statistics are exploited to accelerate data mining algorithms • Previous work [1] shows that Linear Regression, PCA, Naïve Bayes and K-means are efficiently computed by using the sufficient statistics n, L, Q • Sufficient statistics can be computed • On samples • On the whole data set

  15. Sufficient Statistics: nLQ/Γ • Consider a data set X with n points • The sufficient statistics are generalized as: n = |X|, Z = [1, X, Y]
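
The formulas for L, Q and Γ do not appear in the transcript of this slide. The block below writes them out as they are commonly defined in the work listed under References; treat it as an assumption about what the slide showed, not a quotation. Here x_i is the i-th point of X (a d-dimensional column vector), y_i its response value, and z_i = (1, x_i, y_i) the corresponding column of Z.

```latex
n = |X|, \qquad
L = \sum_{i=1}^{n} x_i, \qquad
Q = X X^{T} = \sum_{i=1}^{n} x_i x_i^{T}

Z = [\mathbf{1}, X, Y], \qquad
\Gamma = Z Z^{T} = \sum_{i=1}^{n} z_i z_i^{T} =
\begin{bmatrix}
n                   & L^{T}                     & \sum_{i} y_i \\
L                   & Q                         & \sum_{i} x_i y_i \\
\sum_{i} y_i        & \sum_{i} y_i x_i^{T}      & \sum_{i} y_i^{2}
\end{bmatrix}
```

Under these definitions Γ contains n, L and Q as sub-blocks, which is why the slide title treats nLQ and Γ as interchangeable.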

  16. Sufficient Statistics: nLQ/Γ • One set of sufficient statistics per class/cluster is needed for: • Naïve Bayes • K-means • One matrix Γ is enough for: • PCA • Linear Regression • Variable Selection
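
As an illustration of the per-class case, the sketch below keeps one (n, L, Q) set per class label in a single pass and then derives the per-dimension mean and variance that a Gaussian Naïve Bayes model would need. It assumes a diagonal Q (only the sums of squares per dimension) and uses the population variance; the function names and the toy rows are hypothetical.

```python
from collections import defaultdict

def class_stats(rows):
    """One pass over (label, features) rows; keeps n, L and diagonal Q per class."""
    stats = defaultdict(lambda: {"n": 0, "L": None, "Q": None})
    for label, x in rows:
        s = stats[label]
        if s["L"] is None:              # first row of this class: size the vectors
            s["L"] = [0.0] * len(x)
            s["Q"] = [0.0] * len(x)
        s["n"] += 1
        for j, v in enumerate(x):
            s["L"][j] += v              # linear sum
            s["Q"][j] += v * v          # sum of squares (diagonal of Q)
    return stats

def gaussian_params(s):
    """Per-dimension mean and population variance computed from n, L, Q only."""
    n = s["n"]
    mean = [l / n for l in s["L"]]
    var = [q / n - m * m for q, m in zip(s["Q"], mean)]
    return mean, var

rows = [("a", [1.0, 10.0]), ("a", [3.0, 10.0]),
        ("b", [5.0, 0.0]), ("b", [9.0, 4.0])]
stats = class_stats(rows)
mean_a, var_a = gaussian_params(stats["a"])
mean_b, var_b = gaussian_params(stats["b"])
```

The same per-group bookkeeping applies to K-means, with the cluster assignment playing the role of the class label.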

  17. Data transfer comparison • Data set: Physical Activity (n = 2.88M, d = 42) • Full data set: 880.00 MB • nLQ/Γ: 0.02 MB • Roughly 50,000 times smaller
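
A back-of-the-envelope check of these figures, under two assumptions that the slide does not state: values are stored as 8-byte floating-point numbers, and Y is a single column, so Γ is a (d + 2) × (d + 2) matrix. The helper name `gamma_size_mb` is hypothetical.

```python
def gamma_size_mb(d, bytes_per_value=8):
    """Size of a (d+2) x (d+2) matrix of fixed-width numbers, in MB (10^6 bytes)."""
    side = d + 2            # one row for the constant 1, d rows for X, one for Y
    return side * side * bytes_per_value / 1e6

size_mb = gamma_size_mb(42)     # 44 * 44 * 8 bytes, independent of n
ratio = 880.00 / 0.02           # ratio of the two sizes reported on the slide
```

Under these assumptions Γ takes about 0.015 MB, which rounds to the 0.02 MB on the slide, and the reported sizes give a ratio of about 44,000, the same order of magnitude as the slide's figure. The summary size depends only on d, not on the number of points n.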

  18. Optimizations • Sufficient statistics • Calculated in one parallel scan • Aggregate UDFs • Multithreaded, RAM • Matrix computations in RAM • LAPACK integration • Fast, accurate, stable
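
The reason one parallel scan suffices is that n, L and Q are plain sums, so partial results from separate partitions can be added together. The sketch below mimics the initialize / accumulate / merge / terminate phases that aggregate UDF interfaces commonly expose; the class and method names are hypothetical, and the partitions are scanned sequentially here where a DBMS would use separate threads.

```python
class NLQAggregate:
    """Mimics the phases of an aggregate UDF that computes n, L and Q."""

    def __init__(self, d):
        # Initialize: empty summary for d dimensions.
        self.n = 0
        self.L = [0.0] * d
        self.Q = [[0.0] * d for _ in range(d)]

    def accumulate(self, x):
        # Called once per row within one partition.
        self.n += 1
        for a in range(len(x)):
            self.L[a] += x[a]
            for b in range(len(x)):
                self.Q[a][b] += x[a] * x[b]

    def merge(self, other):
        # Combine the partial summary of another partition into this one.
        self.n += other.n
        for a in range(len(self.L)):
            self.L[a] += other.L[a]
            for b in range(len(self.L)):
                self.Q[a][b] += other.Q[a][b]

    def terminate(self):
        # Return the final summary.
        return self.n, self.L, self.Q

def scan(partition, d):
    """Scan one partition and return its partial aggregate."""
    agg = NLQAggregate(d)
    for x in partition:
        agg.accumulate(x)
    return agg

data = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
parts = [data[:2], data[2:]]                 # two partitions of the input table
partials = [scan(p, 2) for p in parts]       # each scan could run in its own thread
total = partials[0]
total.merge(partials[1])
n, L, Q = total.terminate()
```

Merging the two partial summaries gives the same result as a single scan over all rows, so the scan can be split across as many threads or nodes as are available.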

  19. Summary • Sufficient statistics transmitted to cloud • Hybrid processing is best • Job policy: FIFO → SJF → RR • Parallel summarization, parallel scan • Model computation in RAM in the cloud • Complicated number crunching in the cloud • Job and model history in the cloud • All data is stored in relational tables, which can be queried and stored securely

  20. References • C. Ordonez. Statistical model computation with UDFs. IEEE Transactions on Knowledge and Data Engineering (TKDE), 2010. • C. Ordonez, Y. Zhang, W. Cabrera. The Gamma Operator for Big Data Summarization on an Array DBMS (BigMine 2014). JMLR W&CP 36:88-103, 2014. • C. Ordonez, C. Garcia-Alvarado, V. Baladandayuthapani. Bayesian Variable Selection in Linear Regression in One Pass for Large Data Sets. ACM Transactions on Knowledge Discovery from Data (TKDD), 2015.
