Improving Data Mining Utility with Projective Sampling
Mark Last
Department of Information Systems Engineering, Ben-Gurion University of the Negev, Beer-Sheva, Israel
E-mail: mlast@bgu.ac.il
Home Page: http://www.bgu.ac.il/~mlast/
Agenda
• Introduction
• Learning Curves and Progressive Sampling
• The Projective Sampling Strategy
• Empirical Results
• Conclusions and Future Research
Motivation: Data is not “born” free
• Training data is often scarce and costly
• Real-world examples
  • A limited number of patient records stored by a hospital
  • Results of a costly engineering experiment
  • Seasonal records in an agricultural database
• Even when the raw data is free, its preparation may still be labor-intensive!
• Critical question
  • Should we spend our resources (time and/or money) on acquiring more examples?
Total Cost of the Classification Process (based on Weiss and Tian, 2008)
• Total Cost = n·Ctr + err(n)·|S|·Cerr + CPU(n)·Ctime
  • Ctr - cost of acquiring and labeling each new training example
  • Cerr - cost of each misclassified example from the score set
  • Ctime - cost per unit of CPU time
  • n - number of training set examples used to induce the model
  • S - the score set of future examples to be classified by the model
  • err(n) - the model error rate measured on the score set
  • CPU(n) - CPU time required to induce the model
• Training set: used to induce the classification model
• Score set: future examples to be classified by the model
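The cost formula above can be written as a small function. This is a minimal sketch: the function name, parameter names, and all numbers in the example are hypothetical and do not come from the slides.

```python
def total_cost(n, err_rate, score_set_size, c_tr, c_err, cpu_time=0.0, c_time=0.0):
    """Total Cost = n*Ctr + err(n)*|S|*Cerr + CPU(n)*Ctime."""
    acquisition = n * c_tr                                   # n * Ctr
    misclassification = err_rate * score_set_size * c_err    # err(n) * |S| * Cerr
    induction = cpu_time * c_time                            # CPU(n) * Ctime
    return acquisition + misclassification + induction


# Hypothetical scenario: does doubling the training set pay off
# when CPU costs are ignored (cpu_time and c_time default to 0)?
cost_small = total_cost(n=1000, err_rate=0.20, score_set_size=10_000,
                        c_tr=1.0, c_err=5.0)
cost_large = total_cost(n=2000, err_rate=0.17, score_set_size=10_000,
                        c_tr=1.0, c_err=5.0)
```

With these made-up numbers the larger training set is cheaper overall, because the saving in misclassification cost outweighs the extra acquisition cost.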
What is this research about?
• Problem Statement
  • Find the best training set size n* that is expected to maximize the overall utility (minimize the Total Cost)
• Basic Idea - Projective Sampling
  • Estimate the optimal training set size using learning and run-time curves projected from a small subset of potentially available data
• Research Objectives
  • Calculate the optimal training set size for a variety of learning curve equations (with and without CPU costs)
  • Improve the utility of the data mining process using the best-fitting curves for a given dataset and algorithm
Some Learning Curves for a Decision-Tree Algorithm
[Figure: learning curves annotated with the phases Slow rise, Rapid rise, Rapid rise with oscillations, and Plateau]
The Best Fit for a Learning Curve
• Frey and Fisher (1999)
  • The power law is the best fit for modeling the C4.5 error rates
• Last (2007)
  • The power law is the best fit for modeling the error rates of an oblivious decision-tree algorithm (Information Network)
• Singh (2005)
  • The power law is only second best to the logarithmic regression for ID3, k-Nearest Neighbors, Support Vector Machines, and Artificial Neural Networks
Progressive Sampling Strategy (Provost et al., 1999; Weiss and Tian, 2008)
• General strategy
  • Start with some initial amount of training data n0
  • Iteratively increase the training set until there is an increase in total cost
• Popular schedules
  • Uniform (arithmetic) sampling
    • n0, n0 + Δ, n0 + 2Δ, …
  • Geometric sampling
    • n0, a∙n0, a^2∙n0, …
Limitations of Progressive Sampling
• Overfitting to local perturbations in the error rate
  • Progressive sampling costs may exceed the optimal ones by 10%-200% (Weiss and Tian, 2008)
• Potential overhead associated with purchasing and pre-processing each sampling increment (especially with uniform sampling)
• Our expectation
  • The projective sampling strategy should reduce data mining costs by estimating the optimal training set size from a small subset of potentially available data
The Projective Sampling Strategy
• Set a fixed sampling increment
  • Each acquired sample = one data point
• Do
  • Acquire a new data point
  • Compute Pearson's correlation coefficient for each candidate fitting function (given at least three data points)
    • Dependent variable: err(n)
    • Independent variable: training set size n
  • Find the function with the minimal correlation coefficient Best_Corr
    • Why minimal? The error rate is expected to fall as n grows, so the best-fitting curve is the one with the most negative correlation
• While ((Best_Corr ≥ 0) and (n < nmax))
• Estimate the regression coefficients of the selected function
• Estimate the optimal training set size n*
• Induce the classification model M(n*) from n* examples
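The do-while loop above can be sketched in code. This is a rough illustration under stated assumptions, not the author's implementation: `measure_error` is a hypothetical callback that returns the observed error rate for a training set of size n, error rates are assumed to be strictly positive (two transforms take logarithms), and each candidate curve is compared through its linearized form. The sketch covers only the curve-selection loop, not the later regression and n* steps.

```python
import math

# Linearizing transforms (x, y) for the four candidate learning curves.
TRANSFORMS = {
    "logarithmic": lambda n, e: (math.log(n), e),
    "weiss_tian": lambda n, e: (n / (n + 1), e),
    "power_law": lambda n, e: (math.log(n), math.log(e)),
    "exponential": lambda n, e: (n, math.log(e)),
}


def pearson(xs, ys):
    """Pearson's correlation coefficient of two equal-length sequences."""
    k = len(xs)
    mean_x, mean_y = sum(xs) / k, sum(ys) / k
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: treat a flat series as "no trend"
    return sxy / math.sqrt(sxx * syy)


def select_curve(points):
    """Return (name, r) of the candidate with the most negative correlation."""
    best_name, best_r = None, float("inf")
    for name, transform in TRANSFORMS.items():
        xs, ys = zip(*(transform(n, e) for n, e in points))
        r = pearson(xs, ys)
        if r < best_r:
            best_name, best_r = name, r
    return best_name, best_r


def projective_sampling(measure_error, increment, n_max):
    """Acquire data points until some candidate shows a falling trend."""
    points, n = [], 0
    best_name, best_corr = None, 0.0
    while True:
        n += increment
        points.append((n, measure_error(n)))   # acquire a new data point
        if len(points) >= 3:                    # need at least three points
            best_name, best_corr = select_curve(points)
        if not (best_corr >= 0 and n < n_max):  # the do-while condition
            break
    return best_name, best_corr, points
```

For example, feeding the loop noise-free errors generated by a hypothetical power law, `projective_sampling(lambda n: 0.5 * n ** -0.3, 10, 1000)`, stops after three data points and selects the power-law candidate.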
Candidate Fitting Functions
• Learning curves
  • Logarithmic: errLog(n) = a + b·log n
  • Weiss and Tian: errWT(n) = a + b·n / (n + 1)
  • Power law: errPL(n) = a·n^b
  • Exponential: errExp(n) = a·b^n
• Run-time curves
  • Linear: CPUL(n) = d·n
  • Power law: CPUPL(n) = c·n^d
Converting Learning Curves into the Linear Form y = a’ + b’x
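The conversion table on this slide did not survive extraction. The linear forms below are a reconstruction derived directly from the four candidate learning curves, not the author's original table; reading log as the natural logarithm is an assumption.

```latex
\begin{array}{llll}
\text{Curve} & y & x & \text{Linear form } y = a' + b'x \\
\hline
\text{Logarithmic}    & err(n)      & \log n          & a' = a,\quad b' = b \\
\text{Weiss and Tian} & err(n)      & \dfrac{n}{n+1}  & a' = a,\quad b' = b \\
\text{Power law}      & \log err(n) & \log n          & a' = \log a,\quad b' = b \\
\text{Exponential}    & \log err(n) & n               & a' = \log a,\quad b' = \log b
\end{array}
```

For the power law and exponential curves, the original coefficients are recovered after the fit by exponentiating: a = exp(a’) and, for the exponential curve, b = exp(b’).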
Pearson's Correlation Coefficient
• k - number of data points
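The formula on this slide was an image and is missing from the transcript. The standard textbook definition of Pearson's correlation coefficient over k data points (x_i, y_i) is given here as a stand-in; the slide's exact notation is unknown.

```latex
r = \frac{k\sum_{i=1}^{k} x_i y_i - \sum_{i=1}^{k} x_i \sum_{i=1}^{k} y_i}
         {\sqrt{\left[k\sum_{i=1}^{k} x_i^{2} - \left(\sum_{i=1}^{k} x_i\right)^{2}\right]
                \left[k\sum_{i=1}^{k} y_i^{2} - \left(\sum_{i=1}^{k} y_i\right)^{2}\right]}}
```

Here x and y are the transformed variables of the linearized learning curve, so a falling error rate gives r < 0.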
Linear Regression Coefficients y = a + bx
• The least squares estimate of the slope:
• The least squares estimate of the intercept:
• k - number of data points
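The slope and intercept formulas were images and are missing here. The sketch below uses the standard least-squares estimates (stated in the docstring) and applies them to a linearized power law; the function name and the sample data are hypothetical.

```python
import math


def least_squares(xs, ys):
    """Fit y = a + b*x by ordinary least squares and return (a, b).

    Standard estimates over k data points:
        b = (k*sum(x*y) - sum(x)*sum(y)) / (k*sum(x^2) - sum(x)^2)
        a = (sum(y) - b*sum(x)) / k
    """
    k = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    b = (k * sum_xy - sum_x * sum_y) / (k * sum_xx - sum_x ** 2)
    a = (sum_y - b * sum_x) / k
    return a, b


# Hypothetical noise-free power-law data: err(n) = 0.5 * n^(-0.3)
sizes = [100, 200, 300, 400, 500]
errors = [0.5 * n ** -0.3 for n in sizes]

# The power law linearizes as log err = log a + b * log n.
a_lin, b_lin = least_squares([math.log(n) for n in sizes],
                             [math.log(e) for e in errors])
a_hat, b_hat = math.exp(a_lin), b_lin  # back-transform the intercept
```

On noise-free data the fit recovers the generating coefficients up to floating-point error; on real learning curves the estimates depend on how many data points have been acquired.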
Total Cost Functions
• Total CostLog(n) = n·Ctr + d∙n∙Ctime + |S|·Cerr·(a + b·log n)
• Total CostWT(n) = n·Ctr + d∙n∙Ctime + |S|·Cerr·(a + b·n / (n + 1))
• Total CostPL(n) = n·Ctr + d∙n∙Ctime + |S|·Cerr·a·n^b
• Total CostExp(n) = n·Ctr + d∙n∙Ctime + |S|·Cerr·a·b^n
Optimizing the Training Set Size
• Let
  • R = Cerr / Ctr
  • Ctr = 1
  • CPUL(n) = d·n
• Logarithmic:
  • Total CostLog(n) = n + d∙n∙Ctime + |S|·R·(a + b·log n)
• Weiss and Tian:
  • Total CostWT(n) = n + d∙n∙Ctime + |S|·R·(a + b·n / (n + 1))
• Power Law:
  • Total CostPL(n) = n + d∙n∙Ctime + |S|·R·a·n^b
• Exponential:
  • Total CostExp(n) = n + d∙n∙Ctime + |S|·R·a·b^n
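Setting the derivative of each total cost function with respect to n to zero gives a closed-form candidate for n*. The derivation below is a sketch that does not appear on the slides. It assumes natural logarithms, the linear run-time curve CPUL(n) = d·n, and a falling learning curve (a > 0, with b < 0 for the first three curves and 0 < b < 1 for the exponential one). The shorthand c = 1 + d·Ctime, the marginal cost of one more training example, is introduced here for brevity.

```latex
\begin{aligned}
\text{Logarithmic:}\quad
  & c + \frac{|S|\,R\,b}{n} = 0
  &&\Rightarrow\quad n^{*} = -\frac{|S|\,R\,b}{c} \\[6pt]
\text{Weiss and Tian:}\quad
  & c + \frac{|S|\,R\,b}{(n+1)^{2}} = 0
  &&\Rightarrow\quad n^{*} = \sqrt{-\frac{|S|\,R\,b}{c}} - 1 \\[6pt]
\text{Power law:}\quad
  & c + |S|\,R\,a\,b\,n^{\,b-1} = 0
  &&\Rightarrow\quad n^{*} = \left(-\frac{c}{|S|\,R\,a\,b}\right)^{\frac{1}{b-1}} \\[6pt]
\text{Exponential:}\quad
  & c + |S|\,R\,a\,b^{\,n}\ln b = 0
  &&\Rightarrow\quad n^{*} = \frac{1}{\ln b}\,\ln\!\left(-\frac{c}{|S|\,R\,a\,\ln b}\right)
\end{aligned}
```

Under these sign assumptions the second derivative of each cost function is positive, so each stationary point is a minimum. In practice the result would still need to be clipped to the range of training set sizes that are actually available.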
Experimental Settings
• Ten benchmark datasets (see next slide)
• Each dataset was randomly partitioned into 25%-50% of test examples and 50%-75% of examples potentially available for training
• The sampling increment was set to 1% of the maximum possible training set size
• The error rate of each increment was averaged over 10 random partitions of the training set
• Sampling schedules: Uniform, Geometric (a = 2), Straw Man, Projective, Optimal
• Cost Ratios (R): 1 to 50,000
• CPU Factors: 0 and 1 (per one millisecond of CPU time)
Datasets Description
Projected Fitting Functions
Projected and Actual Learning Curves – Small Datasets
Projected and Actual Learning Curves – Medium and Large Datasets
Comparison of Sampling Schedules
• R = Cerr / Ctr
Detailed Sampling Schedules without Induction Costs: Small Datasets
[Figure: sampling schedule curves; series labels: Uniform; Geometric, Straw Man, Projected, Optimal]
Detailed Sampling Schedules without Induction Costs: Medium and Large Datasets
[Figure: sampling schedule curves in two panels; series labels: Geometric; Uniform, Straw Man, Projected, Optimal; and Geometric; Optimal; Uniform, Straw Man, Projected]
Conclusions
• The projective sampling strategy estimates the optimal training set size by fitting an analytical function to a partial learning curve
• The proposed methodology was evaluated on 10 benchmark datasets of variable size using a decision-tree algorithm
• The results show that under negligible induction costs and high data acquisition costs, projective sampling outperforms, on average, the alternative progressive sampling techniques
Future Research
• Further optimization of projective sampling schedules, especially under substantial CPU costs
• Improving utility of cost-sensitive data mining algorithms
• Modeling learning curves for non-random (“active”) sampling and labeling techniques
Thank you! Merci Beaucoup!