USING SYNTHETIC DATA SAFELY IN CLASSIFICATION
Jean Nonnemaker
10 January 2009
Motivation
• Trainable classifier technologies require large, representative training sets
• Acquiring such training sets is difficult and costly
  • Time and labor to gather existing training data
  • Time and labor to label new training data with ground truth
• Training sets may not be representative, or may be imbalanced
One Solution
Amplify the training data, that is, increase it artificially by generating more. We will call such generated data synthetic, in contrast to real data collected in the field.
Sample Space
• The set of all samples that exist, e.g. images of the letter 'e' just as they are found in nature (here, for example, as found in a real book).
Feature Space
• Features are measurable characteristics of the sample, e.g. the width and height of an image.
• Data can be thought of as points in a multi-dimensional vector space.
[Figure: a sample plotted in feature space, with width and height as axes]
Parameter Space
• Parameters may be used to generate the data, e.g. typesetting parameters and noise parameters.
Ways of Using Sample, Parameter and Feature Spaces
We can create synthetic data in:
• Parameter space – e.g. change the generating parameters and generate new samples
• Sample space – e.g. add noise to the sample
• Feature space – e.g. adjust feature values
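As a rough illustration of the three options (a sketch, not the author's actual pipeline; render_sample, the noise levels, and the parameter layout are all hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def render_sample(params):
    # Hypothetical stand-in for a real renderer such as Metafont;
    # here we simply treat the parameter vector as the "image".
    return params

def amplify_parameter_space(params_a, params_b, t):
    """Parameter space: blend two generating parameter vectors,
    then render a new sample from the blended parameters."""
    blended = (1 - t) * params_a + t * params_b  # convex combination
    return render_sample(blended)

def amplify_sample_space(image, noise_std=0.05):
    """Sample space: perturb an existing sample directly,
    e.g. by adding pixel noise."""
    return image + rng.normal(0.0, noise_std, size=image.shape)

def amplify_feature_space(features, jitter=0.01):
    """Feature space: perturb the extracted feature vector
    instead of the underlying sample."""
    return features + rng.normal(0.0, jitter, size=features.shape)
```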
Supporting Technology - Knuth
• TeX's Metafont system synthetically generates typefaces.
• 62 parameters are sufficient to define a typeface.
• Examples: width, height, darkness and slant.
Synthesizing Typefaces
The letters 'e' and 'c' were generated using Knuth's Metafont:
• CMR (Computer Modern Roman)
• CMFF (Computer Modern Funny)
• Nine interpolations between CMR and CMFF
Interpolation is by convex combinations in the 62-dimensional parameter space.
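In code, the convex-combination step might look like this (a minimal NumPy sketch; the parameter values are invented, since the real 62 Metafont parameters come from the fonts' definition files):

```python
import numpy as np

# Hypothetical 62-dimensional Metafont parameter vectors for the two
# pure typefaces (real values would come from their .mf definitions).
p_cmr  = np.random.default_rng(1).uniform(0.0, 1.0, 62)
p_cmff = np.random.default_rng(2).uniform(0.0, 1.0, 62)

# Nine interpolated typefaces: convex combinations
# (1 - t) * p_cmr + t * p_cmff with t strictly between 0 and 1.
ts = np.linspace(0.0, 1.0, 11)[1:-1]  # nine interior points
interpolated = [(1 - t) * p_cmr + t * p_cmff for t in ts]
```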
Pure Typefaces
CMR and CMFF are well-known, standard typefaces which are widely used. We refer to CMR and CMFF as pure typefaces. These are real samples of pure fonts that can be collected.
[Figure: pure typefaces at either end, with interpolated samples between them]
Interpolated Typefaces
Synthesized typefaces are created by interpolating between the parameters that define the pure typefaces. They may never have been used, but they are legible and should be recognized. These are interpolated samples and so must be synthesized.
Description of Experiment
• Train two classifiers:
  • First on pure data only
  • Second on a mixture of pure and interpolated data (synthetically amplified)
• Ask:
  • Is this safe? Does the amplified classifier continue to work well on pure test data?
  • Is it better? Does the amplified classifier work better on interpolated data?
Details of Experiment
• A = pure data (CMR and CMFF fonts)
• B = interpolated data (interpolated fonts)

              Test on A   Test on B
  Train on A     A/A         A/B
  Train on B     B/A         B/B

Hypothesis 1 – Error rates on A/B are better than on B/B. We hope to reject the null hypothesis.
Hypothesis 2 – Error rates on A/A and B/A are the same. We hope not to reject the null hypothesis.
Two Hypotheses
Hypothesis 1:
• A/B is trained on pure data and tested on interpolated data. B/B is trained on pure and interpolated data and tested on interpolated data.
• Our null hypothesis is that A/B performs at least as well as B/B.
• If the experiment rejects the null hypothesis, then synthetic data is better.
Hypothesis 2:
• A/A is trained and tested on pure data. B/A is trained on mixed pure and interpolated data and tested on pure data.
• Our null hypothesis is that B/A and A/A perform equally.
• If the experiment does not reject the null hypothesis, then synthetic data is safe.
Details of Experiment
• A kNN classifier was trained on 800 samples each of the letters 'e' and 'c' in CMR, and 800 samples each of the letters 'e' and 'c' in CMFF.
• A second kNN classifier was trained using 800 samples of the letter 'e' and 800 of the letter 'c' created by interpolating between CMR and CMFF.
• Each classifier was tested on the same 400 samples of CMR and CMFF 'e's and 'c's.
• Each classifier was tested on 400 samples of 'e's and 'c's obtained by interpolating between CMR and CMFF.
Note that we tested on the frequently confused letter pairs e/c and i/j.
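A minimal sketch of this train/test protocol using scikit-learn's kNN (the feature extraction, the value of k, and all variable names are assumptions, not taken from the paper):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def error_count(train_X, train_y, test_X, test_y, k=5):
    """Fit a kNN classifier and count its errors on a test set."""
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(train_X, train_y)
    return int(np.sum(clf.predict(test_X) != test_y))

# Hypothetical usage, given feature matrices built elsewhere:
# err_AA = error_count(pure_X, pure_y, pure_test_X, pure_test_y)
# err_BA = error_count(amplified_X, amplified_y, pure_test_X, pure_test_y)
# err_AB = error_count(pure_X, pure_y, interp_test_X, interp_test_y)
# err_BB = error_count(amplified_X, amplified_y, interp_test_X, interp_test_y)
```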
CMR/CMFF/CMSSI Experiments
(Computer Modern Roman / Computer Modern Funny Font / Computer Modern Sans Serif Italics)
Test Samples: CMR/CMFF/CMSSI
Error Counts: CMR/CMFF/CMSSI
• Since χ² = 17.20 > 3.84, we can reject the null hypothesis and conclude, with confidence ≥ 95%, that the amplified classifier is better on interpolated data.
• Since χ² = 2.19 < 3.84, we cannot reject the null hypothesis, and we therefore conclude that interpolated data is safe (at the 95% level).
(3.84 is the critical value of χ² with one degree of freedom at the 95% confidence level.)
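The test appears to be a chi-squared comparison of 2×2 error counts; a hedged sketch with invented counts (the real counts are in the slide's table, which is not reproduced here):

```python
from scipy.stats import chi2_contingency

# Hypothetical 2x2 contingency table: rows are the two classifiers,
# columns are (errors, correct) on the same interpolated test set.
table = [[40, 360],   # trained on pure data only
         [12, 388]]   # trained on amplified (pure + interpolated) data

chi2, p, dof, expected = chi2_contingency(table, correction=False)
if chi2 > 3.84:  # chi-squared critical value for df = 1 at the 95% level
    print(f"chi2 = {chi2:.2f}: reject the null hypothesis")
else:
    print(f"chi2 = {chi2:.2f}: cannot reject the null hypothesis")
```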
Summary of Many Experiments
Summary of Many Experiments: I and J
Conclusions
• Systematic family of experiments over a wide range of image qualities
• Experiments show that amplifying training sets with synthetic data generated by interpolation in parameter space:
  • Never worsened accuracy on pure data
  • Often improved accuracy on interpolated data
Conclusions
• Improvement seems to be greater when the pure fonts are most dissimilar.
• Improvement is greater when fonts are more blurred but with little variance.
• Three-way interpolation showed the most significant results.
• These results hold:
  • When image quality is normal
  • When image quality is poor
Typeface Interpolation
This seems to be the first time that typeface generation has been used together with image-quality generation to produce synthetic training data. Legibility seems to be convex in the typeface and image-quality parameter space; that is to say, any font interpolated between two legible fonts is still legible.
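One way to state this convexity observation formally (our notation, not the paper's): writing F(p) for the glyph rendered from parameter vector p,

```latex
\[
  \mathrm{legible}\bigl(F(p_1)\bigr) \;\wedge\; \mathrm{legible}\bigl(F(p_2)\bigr)
  \;\Longrightarrow\;
  \mathrm{legible}\bigl(F(\lambda p_1 + (1-\lambda)\, p_2)\bigr)
  \qquad \text{for all } \lambda \in [0,1].
\]
```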
Directions of Future Research
• Can we devise a method for training on synthetic data that is guaranteed never to increase confusion between any two categories?
• What are the conditions for the generation of synthetic data that improve classification? When is no more improvement possible, and when is worsening likely?
• Can we generate exactly as many new samples as are needed to force a certain reduction in error rate?
Directions of Future Research
• Can we consistently generate data that is misclassified? We might throw such data into a boosting algorithm so that it attempts to accommodate the failure and thus adapts the decision boundary.
• Which methods are best suited for operating in the three spaces: parameter space, sample space, and feature space?
• Can we generalize convex combinations to non-convex combinations which are bounded and controlled, e.g. extrapolation? Can these also be made safe?
Questions?