Estimation – More Applications
So far …
• We have defined appropriate confidence interval estimates for a single population mean, $\mu$.
• Confidence interval estimators are valuable because they:
  • indicate the width of the central $100(1-\alpha)\%$ of the sampling distribution of the estimator
  • provide an idea of how much the estimate might differ if another study were done.
Next step:
• Extend the principles of confidence interval estimation to develop CI estimates for other parameters.
• The important things to keep track of FOR EACH PARAMETER:
  • What is the appropriate probability distribution to describe the spread of the point estimator of the parameter?
  • What underlying assumptions about the data are necessary?
  • How is the confidence interval calculated?
The confidence interval estimates of interest are:
• The difference between two means, $\mu_1 - \mu_2$, for comparing two independent groups.
• The mean difference, $\mu_d$, for paired data.
• The population variance, $\sigma^2$, when the underlying distribution is Normal. We will introduce the $\chi^2$ (chi-square) distribution.
• The ratio of two variances, for comparing the variances of 2 independent groups. We will introduce the F-distribution.
• A population proportion, $p$, using the Normal approximation for a Binomial proportion.
• The difference between two proportions, $p_1 - p_2$, for two independent groups.
1. Confidence Interval calculation for the difference between two means, $\mu_1 - \mu_2$, for Two Independent groups
• We are often interested in comparing two groups:
  • What is the difference in mean blood pressure between males and females?
  • What is the difference in body mass index (BMI) between breast cancer cases and non-cancer patients?
  • How different is the length of stay (LOS) for CABG patients at hospital A compared to hospital B?
• We are interested in the similarity of the two groups.
Statistically, we focus on the difference between the means of the two groups.
• Similar groups will have small differences, or no difference, between means.
• Thus, we focus on estimating the difference in means: $\mu_1 - \mu_2$
• An obvious point estimator is the difference between sample means, $\bar{x}_1 - \bar{x}_2$
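To compute a confidence interval for this difference, we need to know the standard error of the difference. Suppose we take independent random samples of sizes $n_1$ and $n_2$ from two different groups. We know the sampling distribution of the mean for each group:
$$\bar{x}_1 \sim N\!\left(\mu_1,\ \frac{\sigma_1^2}{n_1}\right), \qquad \bar{x}_2 \sim N\!\left(\mu_2,\ \frac{\sigma_2^2}{n_2}\right)$$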
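What is the distribution of the difference between the sample means, $\bar{x}_1 - \bar{x}_2$?
• The sum (or difference) of normal RVs will be normal. What will be the mean and variance?
• This is a linear combination of two independent random variables, so its mean and variance follow from the rule for linear combinations.
In general, for any constants $a$ and $b$:
$$a\bar{x}_1 + b\bar{x}_2 \sim N\!\left(a\mu_1 + b\mu_2,\ a^2\frac{\sigma_1^2}{n_1} + b^2\frac{\sigma_2^2}{n_2}\right)$$
• That is, for the distribution of the sum of $a\bar{x}_1$ and $b\bar{x}_2$:
  • The mean is the sum of $a\mu_1$ and $b\mu_2$
  • The variance is the sum of $a^2$ times the variance of the sampling distribution of $\bar{x}_1$ and $b^2$ times the variance of the sampling distribution of $\bar{x}_2$
Letting $a = 1$ and $b = -1$, we have:
$$\bar{x}_1 - \bar{x}_2 \sim N\!\left(\mu_1 - \mu_2,\ \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}\right)$$
Thus, the standard error of the difference between means is:
$$SE(\bar{x}_1 - \bar{x}_2) = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$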
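Once we have
• a point estimate
• its standard error,
we know how to compute a confidence interval estimate:
Confidence Interval Estimate = Point Estimate ± (Confidence Coefficient × Std Error)
• Point estimate: $\bar{x}_1 - \bar{x}_2$
• Confidence coefficient: a percentile from $N(0,1)$ when $\sigma^2$ is known; a percentile from $t_{df}$ when $\sigma^2$ is estimated from the samples
• Std error: the exact standard error when $\sigma^2$ is known; an estimate of the std error when $\sigma^2$ is estimated from the samples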
Example: Data are available on the weight gain of weanling rats fed either of two diets. The weight gain in grams was recorded for each rat, and the mean for each group computed:

| Diet Group #1 | Diet Group #2 |
|---|---|
| $n_1 = 12$ rats | $n_2 = 7$ rats |
| $\bar{x}_1 = 120$ gms | $\bar{x}_2 = 101$ gms |

What is the difference in weight gain between rats fed on the 2 diets, and what is a 99% CI for the difference?
We will assume
• a. The rats were selected via independent simple random samples from two populations.
• b. The variance of weight gain of weanling rats is known, and is the same for both diet groups: $\sigma_1^2 = \sigma_2^2 = 400$
Construct a 99% confidence interval estimate of the difference in mean weight gain, $\mu_1 - \mu_2$
1. Point Estimate: $\bar{x}_1 - \bar{x}_2 = 120 - 101 = 19$ gms
2. Std error of point estimate:
$$SE = \sqrt{\frac{400}{12} + \frac{400}{7}} = \sqrt{90.48} = 9.51 \text{ gms}$$
3. With known variance, use a percentile of $N(0,1)$:
• For $(1 - \alpha) = .99$, $z_{.995} = 2.576$
• The 99% CI for $\mu_1 - \mu_2$ is:
$$19 \pm 2.576\,(9.51) = 19 \pm 24.5 = (-5.5,\ 43.5) \text{ gms}$$
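As a rough check of the arithmetic in this example, the three steps (point estimate, standard error, Normal percentile) can be sketched in Python using only the standard library. The function name `z_interval_diff_means` and its arguments are illustrative choices, not something defined in the slides.

```python
from math import sqrt
from statistics import NormalDist


def z_interval_diff_means(xbar1, xbar2, var1, var2, n1, n2, conf=0.99):
    """CI for mu1 - mu2 from two independent samples with KNOWN variances."""
    point = xbar1 - xbar2                          # point estimate of mu1 - mu2
    se = sqrt(var1 / n1 + var2 / n2)               # standard error of the difference
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)   # e.g. z_.995 for a 99% CI
    return point - z * se, point + z * se


# Weanling rat example: known variance of 400 in both diet groups
lower, upper = z_interval_diff_means(120, 101, 400, 400, n1=12, n2=7, conf=0.99)
print(f"99% CI for mu1 - mu2: ({lower:.1f}, {upper:.1f}) gms")
```

Passing `conf=0.95` instead would use $z_{.975}$ and give a narrower interval from the same data.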
How do we interpret this interval, (-5.5, 43.5) gms?
• With different samples, we will have different estimates of the true difference in gains. The width of the confidence interval indicates how widely those estimates are expected to spread 99% of the time.
• Alternatively, if we repeatedly selected samples and computed a CI each time, then 99% of the intervals computed would include the true difference in gains.
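The repeated-sampling interpretation can be illustrated with a small simulation. The sketch below assumes, purely for illustration, that the populations are Normal with true means 120 and 101 and known standard deviation 20 (variance 400); these "true" values, the function name `coverage`, and the seed are hypothetical and not part of the original example.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def coverage(mu1, mu2, sigma, n1, n2, conf=0.99, reps=2000, seed=1):
    """Share of simulated known-variance CIs that contain the true mu1 - mu2."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    se = sigma * sqrt(1 / n1 + 1 / n2)     # same known sigma in both groups
    hits = 0
    for _ in range(reps):
        xbar1 = fmean([rng.gauss(mu1, sigma) for _ in range(n1)])
        xbar2 = fmean([rng.gauss(mu2, sigma) for _ in range(n2)])
        diff = xbar1 - xbar2
        if diff - z * se <= mu1 - mu2 <= diff + z * se:
            hits += 1
    return hits / reps


# Hypothetical true values chosen to resemble the rat example
rate = coverage(mu1=120, mu2=101, sigma=20, n1=12, n2=7)
print(f"Share of intervals covering the true difference: {rate:.3f}")
```

The printed share should land close to 0.99, although it will vary a little with the seed and the number of repetitions.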
How do we compute a confidence interval when we don't "know" the population variance(s), but must estimate them from our samples?
If $\sigma_1^2$ and $\sigma_2^2$ are UNknown:
• Is it reasonable to assume that the variances of the two groups are the same? That is, is it OK to assume that the unknown $\sigma_1^2 = \sigma_2^2$?
• Questions to consider:
  • Do the data arise from the same measurement process?
  • Have we sampled from the same population?
  • Does the difference in groups lead us to expect different variability as well as different mean levels?
If OK to assume variances equal: $\sigma_1^2 = \sigma_2^2 = \sigma^2$
• We have 2 estimates of the same parameter, $\sigma^2$, one from each sample: $s_1^2$ and $s_2^2$
• We can create a pooled estimate, $s_p^2$:
$$s_p^2 = \frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}$$
• This is a weighted average of the 2 estimates of the variance, weighting by $(n_i - 1)$ for the $i$th sample
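The standard error of the difference in means, $\bar{x}_1 - \bar{x}_2$, is then:
$$SE(\bar{x}_1 - \bar{x}_2) = \sqrt{\frac{s_p^2}{n_1} + \frac{s_p^2}{n_2}} = s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$
That is, $s_p^2$ is used as the estimator of the variance for both $\bar{x}_1$ and $\bar{x}_2$, rather than the two separate sample estimates.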
Use the t-distribution to compute percentiles:
• We are estimating the variance from the samples
• One degree of freedom is lost for each sample mean we estimated, resulting in $df = (n_1 - 1) + (n_2 - 1) = n_1 + n_2 - 2$
Thus, our confidence interval estimator when the variance is unknown, but assumed equal for the two groups, is:
$$(\bar{x}_1 - \bar{x}_2) \pm t_{n_1+n_2-2;\,1-\alpha/2}\; s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$
Example: Weanling Rats Revisited

| Diet Group #1 | Diet Group #2 |
|---|---|
| $n_1 = 12$ rats | $n_2 = 7$ rats |
| $\bar{x}_1 = 120$ gms | $\bar{x}_2 = 101$ gms |
| $s_1^2 = 457.25\ \text{g}^2$ | $s_2^2 = 425.33\ \text{g}^2$ |

Is there a difference in mean weight gain among rats fed on the 2 diets?
• Use sample estimates of the variance
• Assume that the variances are equal, since the rats in each group
  • come from the same breed
  • were fed the same number of calories on their different diets
  • were weighed on the same scale.
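Assuming $\sigma_1^2 = \sigma_2^2$, equal but unknown, construct a 99% CI for the difference in means, $\mu_1 - \mu_2$.
• Point Estimate: $\bar{x}_1 - \bar{x}_2 = 120 - 101 = 19$ gms
• Std error of point estimate:
  Step 1: $s_p^2$
$$s_p^2 = \frac{(11)(457.25) + (6)(425.33)}{17} = \frac{7581.73}{17} = 445.98$$
  Step 2: SE of point estimate:
$$SE = \sqrt{445.98\left(\frac{1}{12} + \frac{1}{7}\right)} = \sqrt{100.88} = 10.04$$
• Confidence Coefficient for 99% CI:
  • $df = n_1 + n_2 - 2 = 12 + 7 - 2 = 17$
  • $1 - \alpha = .99$, $\alpha/2 = .005$, $1 - \alpha/2 = .995$
  • $t_{17;\,.995} = 2.898$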
Confidence interval estimate of $(\mu_1 - \mu_2)$:
$$19 \pm 2.898\,(10.04) = 19 \pm 29.1 = (-10.1,\ 48.1)$$
• Again, if we repeated the study many times, the interval width indicates how widely the sample mean differences would be spread out 99% of the time.
• Notice that the interval is wider than when we know the variance.
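The pooled-variance calculation can be sketched as follows. The t percentile is passed in as a number read from a t-table (here 2.898 for 17 df), since the Python standard library has no t-distribution; the function name `pooled_t_interval` is an illustrative choice.

```python
from math import sqrt


def pooled_t_interval(xbar1, xbar2, s1_sq, s2_sq, n1, n2, t_crit):
    """CI for mu1 - mu2 with unknown but equal variances (pooled estimate)."""
    df = n1 + n2 - 2
    # Weighted average of the two sample variances, weights (n_i - 1)
    sp_sq = ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / df
    se = sqrt(sp_sq * (1 / n1 + 1 / n2))
    point = xbar1 - xbar2
    return sp_sq, se, (point - t_crit * se, point + t_crit * se)


# Weanling rat example; t_crit = t_{17;.995} taken from a t-table
sp_sq, se, (lower, upper) = pooled_t_interval(
    120, 101, 457.25, 425.33, n1=12, n2=7, t_crit=2.898
)
print(f"pooled variance = {sp_sq:.2f}, SE = {se:.2f}")
print(f"99% CI: ({lower:.1f}, {upper:.1f}) gms")
```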
What if it is not reasonable to assume that the variances of the two groups are the same?
• When it seems likely that $\sigma_1^2 \neq \sigma_2^2$. For example:
  • we have used a different measuring process
  • we have other reasons to believe both the mean level and variability are different between the two populations
• Then:
  • Use separate estimates of the variance from each sample, $s_1^2$ and $s_2^2$
  • Compute Satterthwaite's df and the appropriate t-value
Satterthwaite's Formula for Degrees of freedom:
$$f = \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}}$$
• horrible … avoid computing by hand!
• Note: it is a function of both the sample sizes and the variance estimates.
• When the variances and sample sizes are in fact similar, the df will be similar to the pooled variance df.
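Putting it all together yields the CI estimator when $\sigma_1^2 \neq \sigma_2^2$ are UNknown:
$$(\bar{x}_1 - \bar{x}_2) \pm t_{f;\,1-\alpha/2}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$
Note: use separate estimates of the standard error of the sample mean for each sample.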
Example: Weanling Rats Once Again!
• Assume that the population variances are not equal: $\sigma_1^2 \neq \sigma_2^2$.

| Diet Group #1 | Diet Group #2 |
|---|---|
| $n_1 = 12$ rats | $n_2 = 7$ rats |
| $\bar{x}_1 = 120$ gms | $\bar{x}_2 = 101$ gms |
| $s_1^2 = 457.25\ \text{g}^2$ | $s_2^2 = 425.33\ \text{g}^2$ |

• Is there a difference in mean weight gain among rats fed on the 2 diets?
• Compute a 99% CI for the difference in the group means, assuming $\sigma_1^2 \neq \sigma_2^2$.
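• Point Estimate: $\bar{x}_1 - \bar{x}_2 = 120 - 101 = 19$ gms
• Std error of point estimate:
$$SE = \sqrt{\frac{457.25}{12} + \frac{425.33}{7}} = \sqrt{98.87} = 9.943$$
• Confidence Coefficient for 99% CI:
  • df = f = … = 13.08; use 13
  • $1 - \alpha = .99$, $\alpha/2 = .005$, $1 - \alpha/2 = .995$
  • $t_{13;\,.995} = 3.012$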
Confidence interval estimate of $(\mu_1 - \mu_2)$:
$$19 \pm 3.012\,(9.943) = (-10.95,\ 48.95) \approx (-11.0,\ 49.0)$$
• The interpretation of this confidence interval is the same.
• Notice that it is somewhat wider than the previous two intervals. Compared with the pooled interval, the extra width comes from the smaller degrees of freedom (13 rather than 17), which gives a larger t percentile.
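Since Satterthwaite's df is tedious by hand, a short sketch of the computation may help. The t percentile is again supplied from a t-table (3.012 for 13 df), and the function names are illustrative rather than taken from the slides.

```python
from math import sqrt


def satterthwaite_df(s1_sq, s2_sq, n1, n2):
    """Approximate df for the unequal-variance t interval."""
    v1, v2 = s1_sq / n1, s2_sq / n2      # estimated variances of the two sample means
    return (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))


def unequal_variance_interval(xbar1, xbar2, s1_sq, s2_sq, n1, n2, t_crit):
    """CI for mu1 - mu2 using separate variance estimates for each group."""
    se = sqrt(s1_sq / n1 + s2_sq / n2)
    point = xbar1 - xbar2
    return point - t_crit * se, point + t_crit * se


# Weanling rat example; t_crit = t_{13;.995} taken from a t-table
df = satterthwaite_df(457.25, 425.33, 12, 7)
lower, upper = unequal_variance_interval(
    120, 101, 457.25, 425.33, n1=12, n2=7, t_crit=3.012
)
print(f"Satterthwaite df = {df:.2f}")
print(f"99% CI: ({lower:.2f}, {upper:.2f}) gms")
```

Rounding the df down to a whole number before looking up the t percentile, as done in the example, errs on the side of a slightly wider interval.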
Since the unit is common to the two measures, we expect
• the two responses to the unit to be similar in some respects
• the 1st and 2nd responses within a unit to be related.
Studies use this design to reduce the effects of subject-to-subject variability.
• This variability can be reduced by subtracting the common part out.
• We do this by taking the difference between the 2 measures on the same subject or unit.
Analysis of Paired Data focuses on:
• difference = Response 2 - Response 1, for each subject or paired unit.
• Work with the differences, as if we never saw the individual paired responses and see only the differences as our data set.
• The data set of differences has been reduced to a one-sample set of data.
• We already know how to work with this.
| Pair | 1st Response | 2nd Response | Difference (2nd - 1st) |
|---|---|---|---|
| 1 | $x_1 = 10$ | $y_1 = 12$ | $d_1 = 12 - 10 = 2$ |
| … | $x_i$ | $y_i$ | $d_i = y_i - x_i$ |
| n | $x_n = 14$ | $y_n = 11$ | $d_n = 11 - 14 = -3$ |

Note:
• The order in which you take differences is arbitrary, but it must be consistent. If you choose $y_i - x_i$, then compute that way for all pairs.
• Direction is important. Keep track of positive and negative differences.
Confidence Interval Calculations for the mean difference, $\mu_d$
Preliminaries:
• Compute the sample of differences, $d_1, \ldots, d_n$, where n = # of paired measures.
• Obtain the sample mean and sample variance of the differences.
• Treat like any other 1-sample case for estimating a mean, $\mu$ (here a mean difference).
Example: Reaction times in seconds to 2 different stimuli are given below for 8 individuals. Estimate the average difference in reaction time, with a 95% CI. Does there appear to be a difference in reaction time to the 2 stimuli?

| Subject | $X_1$ | $X_2$ | Difference ($X_2 - X_1$) |
|---|---|---|---|
| 1 | 1 | 4 | 3 |
| 2 | 3 | 2 | -1 |
| 3 | 2 | 3 | 1 |
| 4 | 1 | 3 | 2 |
| 5 | 2 | 1 | -1 |
| 6 | 1 | 2 | 1 |
| 7 | 3 | 3 | 0 |
| 8 | 2 | 3 | 1 |
We have paired data:
• each subject was measured for each stimulus
• we focus on the within-subject difference.
Since I have subtracted in the direction $X_2 - X_1$:
• a positive difference means a longer reaction time for stimulus 2
• a negative difference means a longer reaction time for stimulus 1.
We can compute the mean and standard deviation of the differences:
• $\bar{d} = .75$ and $s_d = 1.39$
For a 95% confidence interval, using my sample estimate of the standard error, use the t-distribution.
The confidence interval is:
$$\bar{d} \pm t_{n-1;\,1-\alpha/2}\,\frac{s_d}{\sqrt{n}} = .75 \pm t_{7;\,.975}\,\frac{s_d}{\sqrt{8}} = .75 \pm 2.36\,\frac{1.39}{\sqrt{8}}$$
• 95% CI is (-0.41, 1.91)
• The results indicate that repeating the study may produce an estimate quite different from that observed, possibly even a negative one.
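The paired calculation reduces to a one-sample interval on the differences, which the following sketch reproduces from the raw reaction times. The t percentile 2.365 (for 7 df) is entered by hand from a t-table; the variable names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

# Reaction times (seconds) for the 8 subjects in the example
x1 = [1, 3, 2, 1, 2, 1, 3, 2]   # stimulus 1
x2 = [4, 2, 3, 3, 1, 2, 3, 3]   # stimulus 2

# Differences taken consistently in the direction X2 - X1
diffs = [b - a for a, b in zip(x1, x2)]

d_bar = mean(diffs)              # sample mean of the differences
s_d = stdev(diffs)               # sample SD of the differences (n - 1 divisor)
t_crit = 2.365                   # t_{7;.975}, from a t-table
half_width = t_crit * s_d / sqrt(len(diffs))

lower, upper = d_bar - half_width, d_bar + half_width
print(f"d_bar = {d_bar:.2f}, s_d = {s_d:.2f}")
print(f"95% CI: ({lower:.2f}, {upper:.2f}) seconds")
```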
Notes:
• It is a common error to fail to recognize paired data, and therefore to fail to compute the appropriate confidence interval.
• The mean difference $\mu_d$ is equal to the difference in means, $\mu_2 - \mu_1$, so if we ignore the pairing the point estimate will still be correct.
• However, the variance of the mean difference does NOT equal the variance of the difference in means of two independent samples, so the confidence interval will not be correctly estimated if you neglect to use a paired data approach:
$$\frac{s_d^2}{n} = \frac{s_1^2}{n} + \frac{s_2^2}{n} - \frac{2\,\mathrm{Cov}}{n}$$
where Cov is the sample covariance between the paired responses.
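The variance identity in the last note can be checked numerically on the reaction-time data. This is only a sketch: `sample_cov` is a hypothetical helper written out by hand so that the example does not depend on a particular Python version.

```python
from statistics import mean, variance

x1 = [1, 3, 2, 1, 2, 1, 3, 2]   # stimulus 1
x2 = [4, 2, 3, 3, 1, 2, 3, 3]   # stimulus 2
diffs = [b - a for a, b in zip(x1, x2)]


def sample_cov(a, b):
    """Sample covariance with the (n - 1) divisor."""
    ma, mb = mean(a), mean(b)
    return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / (len(a) - 1)


var_d = variance(diffs)                                   # variance of the differences
identity = variance(x1) + variance(x2) - 2 * sample_cov(x1, x2)
unpaired = variance(x1) + variance(x2)                    # what ignoring the pairing would use

print(f"s_d^2                   = {var_d:.4f}")
print(f"s_1^2 + s_2^2 - 2 Cov   = {identity:.4f}")
print(f"s_1^2 + s_2^2 (no Cov)  = {unpaired:.4f}")
```

The first two printed values agree, while the third differs whenever the sample covariance is nonzero, which is why the unpaired standard error is not appropriate for paired data.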
Confidence Interval Estimation of the Variance, $\sigma^2$, the Standard Deviation, $\sigma$, and the Ratio of Variances of 2 groups
3. Confidence Interval for the variance, $\sigma^2$: Introducing the $\chi^2$ Distribution
• What if our interest lies in estimation of the variance, $\sigma^2$?
• Some common examples are:
  • Standardization of equipment: repeated measurement of a standard should have small variability
  • Evaluation of technicians: are the results from person i "too variable"?
  • Comparison of measurement techniques: is a new method more variable than a standard method?
We have an obvious point estimator of $\sigma^2$: $s^2$, which we have shown earlier is an unbiased estimator (when using simple random sampling with replacement).
• How do we get a confidence interval?
• We will define a new standardized variable, based upon the way in which $s^2$ is computed:
$$\chi^2 = \frac{(n-1)\,s^2}{\sigma^2} \sim \chi^2_{n-1}$$
• That is, $(n-1)s^2/\sigma^2$ follows a chi-square distribution with $n-1$ degrees of freedom.
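A quick and dirty derivation:
• We defined the sample variance as:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
• Multiplying each side by $(n-1)$:
$$(n-1)\,s^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2$$
The left side is the numerator of the $\chi^2$ variable. The right side is the sum of squared deviations from the mean.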
Recall, for $X \sim N(\mu, \sigma^2)$:
• We can standardize as:
$$Z = \frac{X - \mu}{\sigma} \sim N(0, 1)$$
• If we square this, we have a squared standard normal variable:
$$Z^2 = \left(\frac{X - \mu}{\sigma}\right)^2 \sim \chi^2_1$$
That is, a squared standard normal variable follows a chi-square distribution with 1 degree of freedom. This is the definition of a chi-square with df = 1.
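If we sum $n$ such independent random variables, we define a chi-square distribution with $n$ degrees of freedom:
$$\sum_{i=1}^{n} \left(\frac{X_i - \mu}{\sigma}\right)^2 \sim \chi^2_n$$
However, if we first estimate $\mu$ from the data by $\bar{x}$, we reduce the degrees of freedom:
$$\sum_{i=1}^{n} \left(\frac{X_i - \bar{x}}{\sigma}\right)^2 = \frac{(n-1)\,s^2}{\sigma^2} \sim \chi^2_{n-1}$$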
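The loss of one degree of freedom can be seen in a small simulation, using the fact that a chi-square variable has mean equal to its df. The sketch below draws Normal samples with hypothetical parameters (n = 5, mean 0, SD 2, fixed seed) and averages the simulated values of $(n-1)s^2/\sigma^2$.

```python
import random
from statistics import fmean


def simulate_scaled_variance(n, mu, sigma, reps=20000, seed=1):
    """Simulated values of sum((x_i - xbar)^2) / sigma^2 = (n - 1) s^2 / sigma^2."""
    rng = random.Random(seed)
    values = []
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = sum(sample) / n                       # mean estimated from the data
        ss = sum((x - xbar) ** 2 for x in sample)    # sum of squared deviations
        values.append(ss / sigma ** 2)
    return values


# Hypothetical settings chosen only for illustration
vals = simulate_scaled_variance(n=5, mu=0.0, sigma=2.0)
avg = fmean(vals)
print(f"average of (n-1)s^2/sigma^2 = {avg:.2f} (n - 1 = 4)")
```

The average should come out near $n - 1 = 4$ rather than $n = 5$, and every simulated value is non-negative.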
Features of the Chi Square Distribution
• Chi-squared variables are sums of squared Normally distributed variables.
• Chi-squared random variables are always positive. (Why? A square is always positive.)
• The distribution is NOT symmetric.
[Figure: typical shape of a chi-square density, with the horizontal axis starting at 0]
• Each degree of freedom defines a different distribution.
• The shape is less skewed as the degrees of freedom increase.
[Figure: chi-square densities for df = 1, 2, 4, 6, 10 and 100]
How to Use the Chi Square Table (Table 6, Rosner)
The format is the same as for the Student t-tables:

| d | $\chi^2_{.005}$ | $\chi^2_{.01}$ | $\chi^2_{.025}$ | … | $\chi^2_{.995}$ |
|---|---|---|---|---|---|
| 1 | | | | | 7.88 |
| 2 | | | | | 10.60 |
| … | | | | | |
| 5 | | | | | 16.75 |

• Each row gives information for a separate chi-square distribution, defined by the degrees of freedom.
• The column heading tells you which percentile is given in the body of the table.
• The body of the table contains the values of the percentiles.
[Figure: $\chi^2$ distribution with 5 df; the area to the left of 16.750 is .995]
$$\Pr[\chi^2_5 \le 16.750] = .995$$
• Note: Because the $\chi^2$ distribution is not symmetric, we will often need to look up both upper and lower percentiles of the distribution.
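Confidence Interval for $\sigma^2$
For $X^2 = (n-1)s^2/\sigma^2 \sim \chi^2_{n-1}$: to obtain a $(1-\alpha)$ confidence interval, we want to find percentiles of the $\chi^2$ distribution so that:
$$\Pr\left[\chi^2_{n-1;\,\alpha/2} \le X^2 \le \chi^2_{n-1;\,1-\alpha/2}\right] = 1 - \alpha$$
[Figure: $\chi^2$ density with area $\alpha/2$ below $\chi^2_{\alpha/2}$, area $\alpha/2$ above $\chi^2_{1-\alpha/2}$, and area $(1-\alpha)$ between them]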
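Substitute for $X^2$ in the middle of the inequality:
$$\Pr\left[\chi^2_{n-1;\,\alpha/2} \le \frac{(n-1)\,s^2}{\sigma^2} \le \chi^2_{n-1;\,1-\alpha/2}\right] = 1 - \alpha$$
A little algebra yields the confidence interval formula:
$$\frac{(n-1)\,s^2}{\chi^2_{n-1;\,1-\alpha/2}} \le \sigma^2 \le \frac{(n-1)\,s^2}{\chi^2_{n-1;\,\alpha/2}}$$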
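A minimal sketch of the resulting formula follows. The data are hypothetical (n = 6 observations with sample variance 4.0, not from the slides), and the two percentiles for 5 df are entered by hand from a chi-square table because the Python standard library has no chi-square distribution.

```python
def variance_interval(s_sq, n, chi2_lower, chi2_upper):
    """CI for sigma^2 given the sample variance and two chi-square percentiles.

    chi2_lower is the alpha/2 percentile and chi2_upper the 1 - alpha/2
    percentile of the chi-square distribution with n - 1 df.
    """
    ss = (n - 1) * s_sq                    # sum of squared deviations
    # Note the swap: the UPPER percentile gives the LOWER limit
    return ss / chi2_upper, ss / chi2_lower


# Hypothetical example: n = 6, s^2 = 4.0, 99% CI with 5 df
# Table values: chi2_{5;.005} = 0.412, chi2_{5;.995} = 16.750
lower, upper = variance_interval(4.0, n=6, chi2_lower=0.412, chi2_upper=16.750)
print(f"99% CI for sigma^2: ({lower:.2f}, {upper:.2f})")
```

Because the distribution is not symmetric, the interval is not centered on $s^2$: the upper limit sits much farther from 4.0 than the lower limit does.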