This paper discusses the implementation of a model-free classifier, LOCUS, for large and dynamic datasets. It covers motivation, parallel execution, experimental evaluation, and future directions for improving scalability and accuracy. LOCUS is compared with Nearest Neighbors and Decision Trees: it makes fast decisions on large, complex datasets, whereas Nearest Neighbors suffers from the curse of dimensionality. LOCUS follows a lazy approach based on simple SQL queries and converges to the optimal Bayes Classifier as the dataset grows. Experimental results and a parallel execution scheme are presented, together with a discussion of optimizing the method for dynamic datasets.
University of Athens ADBIS 2007 Database Implementation of a Model-Free Classifier Konstantinos Morfonios
Introduction Motivation LOCUS Parallel Execution Experimental Evaluation Conclusions & Future Work
Introduction: Classification. Given a feature vector x = <x1, x2, …, xD>, predict its class ω = f(x), where ω ∈ {ω1, ω2}.
Introduction: Training set of tuples <x1,1, x1,2, …, x1,D, ω1>, <x2,1, x2,2, …, x2,D, ω2>, <x3,1, x3,2, …, x3,D, ω1>, <x4,1, x4,2, …, x4,D, ω1>, …; new points x = <x1, x2, …, xD> must be classified.
Two families of classifiers:
• “Lazy” (e.g., Nearest Neighbors)
• “Eager” (e.g., Decision Trees): (+) faster decisions, (-) large/complex datasets, (-) dynamic datasets, (-) dynamic models
Introduction Motivation LOCUS Parallel Execution Experimental Evaluation Conclusions & Future Work
Motivation
• Large/complex datasets
• Dynamic datasets
• Dynamic models
⇒ Lazy (model-free), disk-based classification, in the spirit of Nearest Neighbors
Motivation: Nearest Neighbors suffers from the “curse of dimensionality”
• Not reliable [Beyer et al., ICDT 1999]
• Not indexable [Shaft et al., ICDT 2005]
⇒ LOCUS (Lazy Optimal Classifier of Unlimited Scalability)
Motivation: LOCUS (Lazy Optimal Classifier of Unlimited Scalability)
• Category? Lazy
• Scaling? Based on simple SQL queries
• Accuracy? Converges to the optimal Bayes Classifier
• Other features? Parallelizable
Introduction Motivation LOCUS Parallel Execution Experimental Evaluation Conclusions & Future Work
LOCUS Example: points x = <f1, f2>, with f1 ∈ [0, 20] and f2 ∈ [0, 10], belonging to one of two classes ω1, ω2.
LOCUS: Ideally the feature space is dense, and ω(<7, 4>) is decided directly by the training tuples located at the query point itself.
In reality, many features and large domains make the space sparse: no training tuple coincides with <7, 4>, so a point lookup cannot answer ω(<7, 4>).
3-NN: among the 3 nearest neighbors of <7, 4>, the vote is ω1: 2, ω2: 1, so ω(<7, 4>) = ω1.
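The 3-NN decision above can be sketched in a few lines. This is a minimal illustration, not the paper's code; the training tuples are hypothetical stand-ins chosen so that the vote near <7, 4> comes out ω1: 2, ω2: 1 as on the slide (classes written "w1"/"w2").

```python
from collections import Counter
import math

# Hypothetical 2-D training tuples (f1, f2, class) mirroring the slide's example.
train = [(6, 4, "w1"), (8, 5, "w1"), (7, 6, "w2"), (2, 9, "w2"), (15, 1, "w1")]

def knn_classify(x, data, k=3):
    """Classify point x by majority vote among its k nearest neighbors."""
    nearest = sorted(data, key=lambda t: math.dist(x, t[:2]))[:k]
    votes = Counter(label for _, _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_classify((7, 4), train))  # → "w1" (vote w1: 2, w2: 1)
```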
LOCUS: count the training tuples of each class inside a rectangle centered at <7, 4>; the counts are ω1: 7, ω2: 3, so ω(<7, 4>) = ω1.
LOCUS admits a simple disk-based implementation.
LOCUS as a single SQL query over R(f1, f2, ω), counting classes in the 2δ1 × 2δ2 box around the query point <x1, x2>:

SELECT ω, count(*)
FROM R
WHERE f1 ≥ x1-δ1 AND f1 ≤ x1+δ1
AND f2 ≥ x2-δ2 AND f2 ≤ x2+δ2
GROUP BY ω

Result: ω1: 7, ω2: 3 ⇒ ω(<7, 4>) = ω1
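The box-counting query runs on any relational engine. Below is a minimal sketch using an in-memory SQLite database; the table R(f1, f2, w) and its rows are hypothetical (SQLite column names cannot be ω, so the class column is named w), and the majority step assumes the box is non-empty.

```python
import sqlite3

# Hypothetical training relation R(f1, f2, w); rows are made-up examples.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (f1 REAL, f2 REAL, w TEXT)")
rows = [(6, 4, "w1"), (8, 5, "w1"), (7, 3, "w1"), (7, 6, "w2"), (9, 4, "w2")]
conn.executemany("INSERT INTO R VALUES (?, ?, ?)", rows)

def locus_classify(x1, x2, d1, d2):
    """Count each class inside the 2*d1 x 2*d2 box around <x1, x2>."""
    counts = conn.execute(
        "SELECT w, count(*) FROM R "
        "WHERE f1 >= ? AND f1 <= ? AND f2 >= ? AND f2 <= ? "
        "GROUP BY w",
        (x1 - d1, x1 + d1, x2 - d2, x2 + d2),
    ).fetchall()
    return max(counts, key=lambda c: c[1])[0]  # majority class wins

print(locus_classify(7, 4, 2, 2))  # → "w1"
```

The whole classification step is one aggregate query, which is what lets the DBMS optimizer (indexes, sorting) do the heavy lifting.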
What if R is large? Classical optimization techniques for this well-known type of aggregate query apply:
• Indexing
• Materialized views
• Presorting
Is the method reliable? LOCUS converges to the optimal Bayes classifier as the size of the dataset increases (proof in the paper).
What if a feature, say f2, is categorical (e.g., sex)? Its range predicate becomes an equality:

SELECT ω, count(*)
FROM R
WHERE f1 ≥ x1-δ1 AND f1 ≤ x1+δ1
AND f2 = x2
GROUP BY ω

This is not a problem, since in practice:
• datasets contain combinations of categorical and numeric features
• categorical features have small domains
Hence, categorical features do not contribute to sparsity.
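The mixed numeric/categorical case only changes one predicate. A minimal SQLite sketch under the same hypothetical-table assumptions as before (class column named w, f2 now a categorical code):

```python
import sqlite3

# Hypothetical relation with a numeric feature f1 and a categorical feature f2.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (f1 REAL, f2 TEXT, w TEXT)")
conn.executemany("INSERT INTO R VALUES (?, ?, ?)",
                 [(6, "F", "w1"), (8, "F", "w1"), (7, "M", "w2"), (9, "F", "w2")])

def locus_classify_mixed(x1, d1, x2):
    """Range predicate on the numeric feature, equality on the categorical one."""
    counts = conn.execute(
        "SELECT w, count(*) FROM R "
        "WHERE f1 >= ? AND f1 <= ? AND f2 = ? GROUP BY w",
        (x1 - d1, x1 + d1, x2),
    ).fetchall()
    return max(counts, key=lambda c: c[1])[0]

print(locus_classify_mixed(7, 2, "F"))  # → "w1"
```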
Introduction Motivation LOCUS Parallel Execution Experimental Evaluation Conclusions & Future Work
Parallel Execution: partition R = R1 ∪ R2 ∪ R3 ∪ R4 and run the same SELECT on each fragment.
Parallel Execution: count is a distributive aggregate function, so partial counts merge by addition. The four fragments return ω1: 5/ω2: 2, ω1: 7/ω2: 1, ω1: 6/ω2: 0, and ω1: 5/ω2: 1; the global result is ω1: 23, ω2: 4.
Benefits:
• Small network traffic (only (class, count) pairs are shipped)
• Load balancing
• Lightweight operations on the main server
Introduction Motivation LOCUS Parallel Execution Experimental Evaluation Conclusions & Future Work