
A K-Means Based Bayesian Classifier Inside a DBMS Using SQL & UDFs

Ph.D. Showcase, Dept. of Computer Science. Sasi Kumar Pitchaimalai, Ph.D. Candidate, Database Systems Group, Department of Computer Science, University of Houston. Advisor: Dr. Carlos Ordonez.


Presentation Transcript


  1. A K-Means Based Bayesian Classifier Inside a DBMS Using SQL & UDFs
     Ph.D. Showcase, Dept. of Computer Science
     Sasi Kumar Pitchaimalai, Ph.D. Candidate
     Database Systems Group, Department of Computer Science, University of Houston
     Advisor: Dr. Carlos Ordonez

  2. Motivation
     • Naïve Bayes Classifier (NB)
       • One of the most popular and important classifiers in Machine Learning
       • Robust, powerful, fast to compute, and easy to understand
     • Programming Inside a DBMS
       • SQL can easily handle complex computations
       • UDFs can use arrays and are processed in memory

  3. Data Mining Inside a DBMS
     • Avoids exporting the data outside the DBMS
       • Major overhead
       • Data security
     • Scales linearly with large data sets
     • Exploits parallelism provided by a DBMS
     • Uses optimized queries with simple database operations
     • Objective: Push computations involving large data sets inside the DBMS

  4. Bayesian Classifier Based On K-Means (BKM)
     • A Generalization of Naïve Bayes (NB)
     • The Algorithm
       • Initialization: Randomly initialize k clusters per class from the data set.
       • E-Step: Compute the Euclidean distance from each point to the cluster centers, find the nearest cluster, and then compute sufficient statistics.
       • M-Step: Re-compute cluster centers and radii. Check convergence.
       • The E-Step and M-Step are repeated until the model converges, i.e., until the clusters no longer move.
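
The E-step/M-step loop on this slide can be sketched in plain Python. This is a minimal illustration, not the authors' in-DBMS implementation: the function names, the toy data, and the choice to classify a point by its nearest centroid across all classes are assumptions made here. Radii are taken to be per-dimension variances computed from the sufficient statistics N (count), L (linear sum), and Q (quadratic sum).

```python
import random

def sq_dist(x, c):
    """Squared Euclidean distance between point x and center c."""
    return sum((a - b) ** 2 for a, b in zip(x, c))

def kmeans_one_class(points, k, max_iter=100, seed=0):
    """K-means on the rows of one class; returns (centers, radii, weights)."""
    rng = random.Random(seed)
    d = len(points[0])
    # Initialization: pick k rows of this class as starting centers.
    centers = [list(p) for p in rng.sample(points, k)]
    radii, N = [], []
    for _ in range(max_iter):
        # E-step: assign each point to its nearest center and accumulate
        # sufficient statistics: count N, linear sum L, quadratic sum Q.
        N = [0] * k
        L = [[0.0] * d for _ in range(k)]
        Q = [[0.0] * d for _ in range(k)]
        for x in points:
            j = min(range(k), key=lambda c: sq_dist(x, centers[c]))
            N[j] += 1
            for h in range(d):
                L[j][h] += x[h]
                Q[j][h] += x[h] * x[h]
        # M-step: center = L/N, radius = per-dimension variance Q/N - (L/N)^2.
        new_centers, radii = [], []
        for j in range(k):
            if N[j] == 0:  # empty cluster: keep the old center
                new_centers.append(centers[j])
                radii.append([0.0] * d)
                continue
            c = [L[j][h] / N[j] for h in range(d)]
            new_centers.append(c)
            radii.append([max(Q[j][h] / N[j] - c[h] ** 2, 0.0) for h in range(d)])
        converged = new_centers == centers  # clusters did not move
        centers = new_centers
        if converged:
            break
    weights = [n / len(points) for n in N]
    return centers, radii, weights

def train_bkm(X, y, k):
    """Fit k clusters per class; the model maps class label -> clusters."""
    model = {}
    for g in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == g]
        model[g] = kmeans_one_class(rows, k)
    return model

def predict(model, x):
    """Assumed decision rule: class of the nearest centroid over all classes."""
    return min((sq_dist(x, c), g) for g, (centers, _, _) in model.items() for c in centers)[1]

# Toy data (hypothetical): the two classes interleave along the first dimension.
X = [(float(v), 0.0) for v in (0, 1, 2, 10, 11, 12, 5, 6, 7, 15, 16, 17)]
y = [0] * 6 + [1] * 6
model = train_bkm(X, y, k=2)
```

On this toy data each class splits into two groups, so a single mean per class would place both class means between the groups, while two centroids per class separate them cleanly.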

  5. BKM: Finding the clusters per class

  6. Database Optimizations
     • Five different query optimization techniques for distance computation were introduced.
     • User-Defined Functions (UDFs): computing the distance and the nearest cluster in a single UDF.
     • Using a CASE statement instead of aggregations.
     • Computing the sufficient statistics of the clusters in a single table scan.
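
As a rough illustration of three of these ideas, the sketch below uses Python's built-in sqlite3 module as a stand-in DBMS (the slides do not name the engine used). The table `X`, its columns, the UDF name `nearest`, and the fixed centers are all hypothetical. One query finds the nearest cluster with a scalar UDF, the other with a CASE expression, and both compute the per-cluster sufficient statistics (count, linear sums, quadratic sums) while reading `X` once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE X (i INTEGER PRIMARY KEY, x1 REAL, x2 REAL)")
conn.executemany(
    "INSERT INTO X VALUES (?, ?, ?)",
    [(1, 0, 0), (2, 1, 1), (3, 0, 1), (4, 10, 10), (5, 11, 10), (6, 10, 11)],
)

# Current cluster centers (k = 2, d = 2); hypothetical values.
CENTERS = [(0.0, 0.0), (10.0, 10.0)]

def nearest(x1, x2):
    """Scalar UDF: compute all k distances and return the nearest cluster id."""
    dists = [(x1 - c1) ** 2 + (x2 - c2) ** 2 for c1, c2 in CENTERS]
    return dists.index(min(dists)) + 1

conn.create_function("nearest", 2, nearest)

# Sufficient statistics per cluster: N, L (linear sums), Q (quadratic sums).
STATS = "COUNT(*), SUM(x1), SUM(x2), SUM(x1 * x1), SUM(x2 * x2)"

# Variant 1: distance and nearest cluster inside a single UDF call per row.
udf_query = f"""
    SELECT j, {STATS}
    FROM (SELECT nearest(x1, x2) AS j, x1, x2 FROM X)
    GROUP BY j ORDER BY j
"""

# Variant 2: nearest cluster chosen by a CASE expression on inline distances.
case_query = f"""
    SELECT j, {STATS}
    FROM (
        SELECT x1, x2,
               CASE WHEN (x1 - :a1) * (x1 - :a1) + (x2 - :a2) * (x2 - :a2)
                      <= (x1 - :b1) * (x1 - :b1) + (x2 - :b2) * (x2 - :b2)
                    THEN 1 ELSE 2 END AS j
        FROM X
    )
    GROUP BY j ORDER BY j
"""
params = {"a1": 0.0, "a2": 0.0, "b1": 10.0, "b2": 10.0}

udf_rows = conn.execute(udf_query).fetchall()
case_rows = conn.execute(case_query, params).fetchall()
```

Both variants return one row per cluster, from which the M-step can derive centers and radii without touching `X` again.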

  7. Comparing Accuracy: NB vs. BKM vs. DT
     • Global Accuracy: BKM is better than NB and worse than DT (Decision Tree) in most cases.
     • Class Breakdown Accuracy: BKM is better than NB in all but 2 cases, indicating that class decomposition is a positive step towards increasing NB accuracy. DT performs poorly here, and considerably worse in the case of bscale.
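
The slide does not define the two metrics. Assuming that global accuracy is the fraction of all rows labeled correctly and that class breakdown accuracy is the same fraction computed separately for each true class, the difference can be sketched as follows (function names and labels are hypothetical):

```python
from collections import Counter

def global_accuracy(y_true, y_pred):
    """Fraction of all rows whose predicted class matches the true class."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def class_breakdown_accuracy(y_true, y_pred):
    """Per-class accuracy: correct rows of class g divided by rows of class g."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {g: hits[g] / totals[g] for g in totals}

# Hypothetical labels: a classifier that always predicts the majority class.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10
overall = global_accuracy(y_true, y_pred)             # 0.8
per_class = class_breakdown_accuracy(y_true, y_pred)  # {0: 1.0, 1: 0.0}
```

A classifier can score well globally while failing an entire minority class, which is why the two views can rank the same models differently.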

  8. BKM Scalability: Varying n, d, k
     Times per iteration. Defaults: d = 4, k = 4, n = 100k

  9. Comparing DBMS with MapReduce
     MapReduce: A distributed, non-transactional, high-performance, data-intensive processing framework.

  10. Incremental Mining
      • A UDF performing incremental data mining, exploiting data parallelism
      • Minimizes the number of scans (1-3) on the data set
      • Provides an approximation of the model before the complete data set has been scanned
      • Requires thread-safe sharing of the model without affecting performance
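
One way to picture these requirements, outside any DBMS, is a shared model that worker threads update while each scans its own partition. The sketch below is an assumption-laden illustration, not the UDF from the papers: the class `SharedModel`, the batching scheme, and the data are hypothetical. Workers accumulate sufficient statistics locally and merge them under a lock in small batches, which keeps locking rare, and `snapshot()` can be called at any time to read an approximate model before the scan finishes.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class SharedModel:
    """Sufficient statistics (n, L, Q) shared by all scanning threads."""

    def __init__(self, d):
        self._lock = threading.Lock()
        self.d = d
        self.n = 0
        self.L = [0.0] * d  # linear sums per dimension
        self.Q = [0.0] * d  # quadratic sums per dimension

    def merge(self, n, L, Q):
        """Fold one locally accumulated batch into the shared statistics."""
        with self._lock:
            self.n += n
            for h in range(self.d):
                self.L[h] += L[h]
                self.Q[h] += Q[h]

    def snapshot(self):
        """Current (n, mean, variance); usable before the scan completes."""
        with self._lock:
            if self.n == 0:
                return None
            mean = [l / self.n for l in self.L]
            var = [q / self.n - m * m for q, m in zip(self.Q, mean)]
            return self.n, mean, var

def scan_partition(rows, model, batch_size=4):
    """Scan one partition once, merging into the model every batch_size rows."""
    n, L, Q = 0, [0.0] * model.d, [0.0] * model.d
    for x in rows:
        n += 1
        for h in range(model.d):
            L[h] += x[h]
            Q[h] += x[h] * x[h]
        if n == batch_size:
            model.merge(n, L, Q)
            n, L, Q = 0, [0.0] * model.d, [0.0] * model.d
    if n:  # flush the last partial batch
        model.merge(n, L, Q)

def incremental_scan(partitions, d):
    """Scan all partitions in parallel threads against one shared model."""
    model = SharedModel(d)
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        list(pool.map(lambda part: scan_partition(part, model), partitions))
    return model

# Hypothetical data set split into three partitions.
rows = [(float(i), float(2 * i)) for i in range(12)]
partitions = [rows[0:4], rows[4:8], rows[8:12]]
final = incremental_scan(partitions, d=2).snapshot()
```

After the single pass, `final` holds the row count, per-dimension means, and per-dimension variances for the whole data set.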

  11. Papers
      • Carlos Ordonez, Sasi K. Pitchaimalai: One-pass data mining algorithms in a DBMS with UDFs. SIGMOD Conference 2011: 1217-1220
      • Sasi K. Pitchaimalai, Carlos Ordonez, Carlos Garcia Alvarado: Comparing SQL and MapReduce to compute Naïve Bayes in a Single Table Scan. CloudDB, CIKM 2010
      • Carlos Ordonez, Sasi K. Pitchaimalai: Fast UDFs to compute sufficient statistics on large data sets exploiting caching and sampling. DKE 2010
      • Carlos Ordonez, Sasi K. Pitchaimalai: Bayesian Classifiers Programmed in SQL. TKDE 2008
      • Sasi K. Pitchaimalai, Carlos Ordonez, Carlos Garcia Alvarado: Efficient Distance Computation Using SQL Queries and UDFs. ICDM 2008
