
Statistical Methods for Computer Science



1. Statistical Methods for Computer Science Marie desJardins (mariedj@cs.umbc.edu) CMSC 601 April 9, 2012 Material adapted from slides by Tom Dietterich, with permission

  2. Statistical Analysis of Data • Given a set of measurements of a value, how certain can we be of the value? • Given a set of measurements of two values, how certain can we be that the two values are different? • Given a measured outcome, along with several condition or treatment values, how can we remove the effect of unwanted conditions or treatments on the outcome?

3. Measuring CPU Time • Here are 37 measurements of the CPU time required to compute C(10000, 500): • 0.27 0.25 0.23 0.24 0.26 0.24 0.26 0.25 0.24 0.25 0.25 0.24 0.25 0.24 0.25 0.26 0.24 0.25 0.25 0.25 0.25 0.25 0.24 0.25 0.24 0.25 0.25 0.24 0.25 0.25 0.24 0.25 0.24 0.24 0.25 0.25 0.26 • What is the “true” CPU cost of this computation? • Before doing any calculations, always visualize your data!

  4. CPU Data Distribution

  5. Kernel Density Estimate • Kernel density: place a small Gaussian distribution (“kernel”) around each data point, and sum them • Useful for visualization; also often used as a regression technique
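As an illustration (not part of the original slides), here is a minimal Python sketch of a kernel density estimate over the 37 CPU measurements from slide 3, using scipy's gaussian_kde:

```python
# Kernel density estimate: one Gaussian kernel per data point, summed.
# A sketch using scipy; the original slides don't specify a tool.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

times = np.array([0.27, 0.25, 0.23, 0.24, 0.26, 0.24, 0.26, 0.25, 0.24, 0.25,
                  0.25, 0.24, 0.25, 0.24, 0.25, 0.26, 0.24, 0.25, 0.25, 0.25,
                  0.25, 0.25, 0.24, 0.25, 0.24, 0.25, 0.25, 0.24, 0.25, 0.25,
                  0.24, 0.25, 0.24, 0.24, 0.25, 0.25, 0.26])

kde = gaussian_kde(times)          # bandwidth chosen automatically
xs = np.linspace(0.22, 0.28, 200)
plt.plot(xs, kde(xs))
plt.xlabel("CPU time (s)")
plt.ylabel("estimated density")
plt.show()
```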

6. Sample Mean • The data seems to be reasonably close to a normal (Gaussian or bell-curve) distribution • Given this assumption, we can compute a sample mean: μ' = (1/n) Σ xᵢ ≈ 0.248 • How certain can we be that this is the true value? • Confidence interval [min, max]: • Suppose we drew many random samples of size n=37 and computed the sample means • 95% of the time, this sample mean would lie between min and max

7. Confidence Intervals via Resampling • We can simulate this process algorithmically, as sketched below • Draw 1000 random subsamples of size 37 (with replacement) from the original 37 points • This process makes no assumption of a Gaussian distribution! • Sort the means of these subsamples • Choose the 26th and 975th values as the min and max of a 95% confidence interval (these bounds include 95% of the sample means!) • Result: the resampled confidence interval is [0.245, 0.251]
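A minimal Python sketch of this resampling loop (numpy here; the slides don't prescribe a tool, and the seed below is arbitrary):

```python
# Bootstrap 95% confidence interval for the mean CPU time.
import numpy as np

times = np.array([0.27, 0.25, 0.23, 0.24, 0.26, 0.24, 0.26, 0.25, 0.24, 0.25,
                  0.25, 0.24, 0.25, 0.24, 0.25, 0.26, 0.24, 0.25, 0.25, 0.25,
                  0.25, 0.25, 0.24, 0.25, 0.24, 0.25, 0.25, 0.24, 0.25, 0.25,
                  0.24, 0.25, 0.24, 0.24, 0.25, 0.25, 0.26])

rng = np.random.default_rng(0)     # arbitrary seed, for reproducibility
means = np.sort([rng.choice(times, size=times.size, replace=True).mean()
                 for _ in range(1000)])
print(means[25], means[974])       # 26th and 975th of the 1000 sorted means
```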

8. Confidence Intervals via Distributional Theory • The Central Limit Theorem says that the distribution of the sample means is approximately normal • If the original data is normally distributed with mean μ and standard deviation σ, then the sample means will be normally distributed with mean μ and standard deviation σ' = σ/√n (but we don't know the original μ and σ...) • Note that it isn't important to remember this formula, since MATLAB, R, etc. will compute it for you. But it is very important to understand why you are computing it!

9. t Distribution • Instead of assuming a normal distribution, we can use a t distribution (sometimes called a “Student's t distribution”), which has three parameters: μ, σ, and the degrees of freedom (d.f. = n-1) • The probability density function looks much like a normal distribution, but with a lower peak and heavier tails; as n increases, it approaches the normal • Applied via the central limit theorem, this distribution yields just slightly tighter confidence limits than the bootstrap:

10. Distributional Confidence Intervals • We can use the mathematical formula for the t distribution to compute a p (typically, p=0.95) confidence interval: • The 0.025 t-value, t0.025, is the value such that the probability that μ - μ' < t0.025 is 0.975 • The 95% confidence interval is then [μ' - t0.025, μ' + t0.025] • For the CPU example, t0.025 is about 0.0026, so the distributional confidence interval is [0.245, 0.250] – just slightly tighter than the bootstrapped CI of [0.245, 0.251]
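A sketch of the same interval computed with scipy (the slides mention MATLAB and R; this is a scipy equivalent, not the slides' own code):

```python
# t-distribution 95% confidence interval for the mean.
import numpy as np
from scipy import stats

times = np.array([0.27, 0.25, 0.23, 0.24, 0.26, 0.24, 0.26, 0.25, 0.24, 0.25,
                  0.25, 0.24, 0.25, 0.24, 0.25, 0.26, 0.24, 0.25, 0.25, 0.25,
                  0.25, 0.25, 0.24, 0.25, 0.24, 0.25, 0.25, 0.24, 0.25, 0.25,
                  0.24, 0.25, 0.24, 0.24, 0.25, 0.25, 0.26])

m = times.mean()
se = stats.sem(times)              # standard error of the mean, s/sqrt(n)
lo, hi = stats.t.interval(0.95, df=times.size - 1, loc=m, scale=se)
print(lo, hi)                      # approximately (0.245, 0.250)
```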

11. Bootstrap Computations of Other Statistics • The bootstrap method can be used to compute other sample statistics for which distributional methods aren't appropriate: • median • mode • variance • Because the tails and outlying values may not be well represented in a sample, the bootstrap method is not as useful for statistics involving the “ends” of the distribution: • minimum • maximum
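For example, a sketch of the bootstrap for the median, which has no convenient distributional formula; the recipe is identical to the mean version above, with only the statistic changed:

```python
# Bootstrap 95% confidence interval for the median CPU time.
import numpy as np

times = np.array([0.27, 0.25, 0.23, 0.24, 0.26, 0.24, 0.26, 0.25, 0.24, 0.25,
                  0.25, 0.24, 0.25, 0.24, 0.25, 0.26, 0.24, 0.25, 0.25, 0.25,
                  0.25, 0.25, 0.24, 0.25, 0.24, 0.25, 0.25, 0.24, 0.25, 0.25,
                  0.24, 0.25, 0.24, 0.24, 0.25, 0.25, 0.26])

rng = np.random.default_rng(0)
medians = np.sort([np.median(rng.choice(times, times.size, replace=True))
                   for _ in range(1000)])
print(medians[25], medians[974])   # 95% CI on the median
```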

  12. Measuring the Number of Occurrences of Events • In CS, we often want to know how often something occurs: • How many times does a process complete successfully? • How many times do we correctly predict membership in a class? • How many times do we find the top search result? • Again, the sample rate θ’ is what we have observed, but we would like to know the “true” rate θ

  13. Bootstrap Confidence Intervals for Rates • Suppose we have observed 100 predictions of a decision tree, and 88 of them were correct • Draw many (say, 1000) samples of size n, with replacement, from the n observed predictions (here, n=100), and compute the sample classification rate • Sort the sample rates θi in increasing order • Choose the 26th and 975th values as the ends of the confidence interval: here, the confidence interval is [0.81, 0.94]
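A sketch of this procedure in Python, with the 88-of-100 correct predictions encoded as an array of 0/1 outcomes (the ordering within the array doesn't matter):

```python
# Bootstrap 95% confidence interval for a classification rate.
import numpy as np

outcomes = np.array([1] * 88 + [0] * 12)   # 88 correct out of 100 predictions
rng = np.random.default_rng(0)

rates = np.sort([rng.choice(outcomes, outcomes.size, replace=True).mean()
                 for _ in range(1000)])
print(rates[25], rates[974])               # roughly 0.81, 0.94
```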

  14. Binomial Distributional Confidence • If we assume that the classifier is a “biased coin” with probability θ of coming up heads, then we can use the binomial distribution to analytically compute the confidence interval • This requires a small correction because the binomial distribution is actually discrete, but we want a continuous estimate
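As a sketch, here is the normal approximation to the binomial with a 1/(2n) continuity correction; this is one common form of the "small correction" for discreteness, though the slide does not specify which correction it uses:

```python
# Binomial confidence interval via the normal approximation,
# with a 1/(2n) continuity correction for the discrete distribution.
from scipy import stats

k, n = 88, 100                             # correct predictions, total
p_hat = k / n
z = stats.norm.ppf(0.975)                  # about 1.96 for a 95% interval
half = z * (p_hat * (1 - p_hat) / n) ** 0.5 + 1 / (2 * n)
print(p_hat - half, p_hat + half)          # about (0.81, 0.95)
```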

15. Comparing Two Measurements • Consider the CPU measurements of the earlier example, and suppose we have performed the same computation on a different machine, yielding these CPU times: • 0.21 0.20 0.20 0.19 0.20 0.19 0.18 0.20 0.19 0.19 0.19 0.19 0.20 0.18 0.19 0.20 0.22 0.20 0.20 0.20 0.19 0.20 0.18 0.19 0.19 0.20 0.20 0.22 0.18 0.29 0.21 0.23 0.20 • These times certainly seem faster than the first machine's, which yielded a distributional confidence interval of [0.245, 0.250] – but how can we be sure?

  16. Kernel Density Comparison • Visually, the second machine (Shark3) is much faster than the first (Darwin):

17. Difference Estimation • Bootstrap testing: • Repeat many times: draw a bootstrap sample from each of the machines, and compute the two sample means • If Shark3 is faster than Darwin in more than 95% of the resamples, we can be 95% confident that it really is faster • We can also compute a 95% bootstrap confidence interval on the difference between the means – this turns out to be [0.0461, 0.0553] • If the samples are drawn from t distributions, then the difference between the sample means also has a t distribution • Confidence interval on this difference: [0.0463, 0.0555]
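A sketch of the bootstrap comparison in Python; darwin and shark3 hold the measurements from slides 3 and 15:

```python
# Bootstrap test and confidence interval for the difference in means.
import numpy as np

darwin = np.array([0.27, 0.25, 0.23, 0.24, 0.26, 0.24, 0.26, 0.25, 0.24, 0.25,
                   0.25, 0.24, 0.25, 0.24, 0.25, 0.26, 0.24, 0.25, 0.25, 0.25,
                   0.25, 0.25, 0.24, 0.25, 0.24, 0.25, 0.25, 0.24, 0.25, 0.25,
                   0.24, 0.25, 0.24, 0.24, 0.25, 0.25, 0.26])
shark3 = np.array([0.21, 0.20, 0.20, 0.19, 0.20, 0.19, 0.18, 0.20, 0.19, 0.19,
                   0.19, 0.19, 0.20, 0.18, 0.19, 0.20, 0.22, 0.20, 0.20, 0.20,
                   0.19, 0.20, 0.18, 0.19, 0.19, 0.20, 0.20, 0.22, 0.18, 0.29,
                   0.21, 0.23, 0.20])

rng = np.random.default_rng(0)
diffs = np.sort([rng.choice(darwin, darwin.size, replace=True).mean()
                 - rng.choice(shark3, shark3.size, replace=True).mean()
                 for _ in range(1000)])
print((diffs > 0).mean())          # fraction of resamples where Shark3 is faster
print(diffs[25], diffs[974])       # 95% bootstrap CI on the difference
```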

18. Hypothesis Testing • Is the true difference zero, or more than zero? • Use classical statistical rejection testing • Null hypothesis: the two machines have the same speed (i.e., μ, the difference in mean CPU time, is equal to zero) • Can we reject this hypothesis, based on the observed data? • If the null hypothesis were true, what is the probability that we would have observed this data? • We can measure this probability using the t distribution • In this case, the computed t value = (μ1 – μ2) / σ' = 21.69 • The probability of seeing this t value if μ were actually zero is vanishingly small: the 99.999% confidence interval (under the null hypothesis) is [-4.59, 4.59], so the probability of this t value is (much) less than 0.00001
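The same comparison as a classical two-sample t test in scipy (shown here in Welch's unequal-variance form; the slide doesn't say which variant it used):

```python
# Two-sample t test of the null hypothesis "same mean speed".
import numpy as np
from scipy import stats

darwin = np.array([0.27, 0.25, 0.23, 0.24, 0.26, 0.24, 0.26, 0.25, 0.24, 0.25,
                   0.25, 0.24, 0.25, 0.24, 0.25, 0.26, 0.24, 0.25, 0.25, 0.25,
                   0.25, 0.25, 0.24, 0.25, 0.24, 0.25, 0.25, 0.24, 0.25, 0.25,
                   0.24, 0.25, 0.24, 0.24, 0.25, 0.25, 0.26])
shark3 = np.array([0.21, 0.20, 0.20, 0.19, 0.20, 0.19, 0.18, 0.20, 0.19, 0.19,
                   0.19, 0.19, 0.20, 0.18, 0.19, 0.20, 0.22, 0.20, 0.20, 0.20,
                   0.19, 0.20, 0.18, 0.19, 0.19, 0.20, 0.20, 0.22, 0.18, 0.29,
                   0.21, 0.23, 0.20])

t, p = stats.ttest_ind(darwin, shark3, equal_var=False)
print(t, p)                        # very large t; p far below 0.00001
```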

19. Paired Differences • Suppose we had a set of 10 different benchmark programs that we ran on the two machines, yielding these CPU times: [table of per-program CPU times shown on the original slide] • Obviously, we don't want to just compare the means, since the programs have such different running times

20. Kernel Density Visualization • CPU1 seems to be systematically faster (offset to the left) than CPU2

  21. Scatterplot Visualization • CPU1 is always faster than CPU2 (i.e., above the diagonal line that corresponds to equal speed)

22. Sequential Visualization • The correlation of program “difficulty” across the two machines (and the consistently faster speed of CPU1) is even more obvious in this line plot, ordered by program number:

23. Distribution Analysis I • If the differences are in the same “units,” we can subtract the CPU times for the “paired” tests and assume a t distribution on these differences • The probability of observing a sample mean difference as large as 0.2779, given the null hypothesis that the machines have the same speed, is 0.0000466 – so we can reject the null hypothesis • If we have no prior belief about which machine is faster, we should use a “two-tailed test” • The probability of observing a sample mean difference this large in either direction is 0.0000932 – slightly larger, but still sufficiently improbable that we can be confident the machines have different speeds • Note that we can also use a bootstrap analysis on this type of paired data
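A sketch of the paired analysis in scipy; the cpu1/cpu2 values below are hypothetical stand-ins, since this transcript does not reproduce the benchmark table from slide 19:

```python
# Paired t test on per-benchmark differences (hypothetical data;
# the real 10-benchmark timings appear only on the original slide).
import numpy as np
from scipy import stats

cpu1 = np.array([0.8, 1.9, 3.1, 0.5, 2.4, 1.1, 4.0, 0.9, 1.5, 2.8])  # illustrative
cpu2 = np.array([1.0, 2.3, 3.6, 0.7, 2.9, 1.4, 4.7, 1.2, 1.9, 3.3])  # illustrative

t, p_two = stats.ttest_rel(cpu1, cpu2)     # two-tailed p-value
p_one = p_two / 2                          # one-tailed, if the direction was predicted
print(t, p_one, p_two)
```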

  24. Paired vs. Non-Paired • If we don’t pair the data (just compare the overall mean, not the differences for paired tests): • Distributional analysis doesn’t let us reject the null hypothesis • Bootstrap analysis doesn’t let us reject the null hypothesis

  25. Sign Tests • I mentioned before that the paired t-test is appropriate if the measurements are in the same “units” • If the magnitude of the difference is not important, or not meaningful, we still can compare performance • Look at the sign of the difference (here, CPU1 is faster 10 out of 10 times; but in another case, it might only be faster 9 out of 10 times) • Use the binomial distribution (flip a coin to get the sign) to compute a confidence interval for the probability that CPU1 is faster
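A sketch of the sign test via scipy's exact binomial test: under the null hypothesis, each benchmark is a fair coin flip, so we ask how likely 10 "CPU1 faster" outcomes in 10 trials would be.

```python
# Sign test: probability of CPU1 winning 10 of 10 benchmarks
# if each machine were equally likely to be faster on any benchmark.
from scipy import stats

result = stats.binomtest(10, n=10, p=0.5, alternative='greater')
print(result.pvalue)                       # 0.5**10, about 0.001
```

With 9 wins out of 10, the same call gives p = 11/1024 ≈ 0.011, still strong evidence at the usual 0.05 level.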

  26. Other Important Topics • Regression analysis • Cross-validation • Human subjects analysis and user study design • Analysis of Variance (ANOVA) • For your particular investigation, you need to know which of these topics are relevant, and to learn about them!

27. Statistically Valid Experimental Design • Make sure you understand the nuances before you design your experiments... • ...and definitely before you analyze your experimental data! • Designing the statistical methods (and hypotheses) after the fact is not valid! • You can often find a hypothesis and an associated statistical method ex post facto – i.e., design an experiment to fit the data instead of the other way around • In the worst case, doing this is downright unethical • In the best case, it shows a lack of clear research objectives, and the results may not be reproducible or meaningful
