This module explores the foundational concepts of random samples, statistics, and estimation in statistics. It covers the process of obtaining random samples from a population, the significance of sample averages in estimating population means, and the assessment of variability through sampling distributions. Key topics also include estimating variance and standard deviation, as well as making decisions based on comparisons of means and variances through confidence intervals and hypothesis tests. This rich overview provides essential insights into statistical analysis.
Characterizing Variability and Comparing Patterns from Data “Statistics” Module 3
Outline • random samples • notion of a statistic • estimating the mean - sample average • assessing the impact of variation on estimates - sampling distribution • estimating variance - sample variance and standard deviation • making decisions - comparisons of means, variances using confidence intervals, hypothesis tests J. McLellan
Random Samples
Scenario:
• we have an underlying pattern of variability for a process that we would like to characterize: the population
• we perform a series of experiments on the process in such a way that the results are independent, i.e., the outcome of one experiment has no influence on any other experiment
• the underlying distribution in place during each experimental run is identical to that of the population
• when we run each experiment, we are collecting a value of the random variable $X_i$, which has uncertainty
• $X_i$ represents the i-th act of sampling and is referred to as a sample random variable
Definition - Random Sample
A random sample of size n of a population random variable X is a collection of random variables $X_1, \ldots, X_n$ such that
• the $X_i$'s are independent
• the $X_i$'s have distributions identical to that of X, i.e., $F_{X_i}(x) = F_X(x)$
Each $X_i$ represents a snapshot of the process. The $X_i$'s are referred to as sample random variables. What do we do with these sample values?
Sample Average
• used to estimate the mean
• given n samples $X_1, \ldots, X_n$, compute $$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$$
• interpretation - a rule for computing the sample average, involving sampling
• $\bar{X}$ is a random variable
• observed value: $$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
Lower case is used to denote observed values of the sample random variables and average.
Statistics
• Sample average is an example of a "statistic"
Definition: A statistic is a function of sample random variables that is used to estimate the value of a parameter, and does not depend on any unknown parameters.
• e.g., the sample average $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ estimates the mean and doesn't depend on unknown parameters
Sampling Distribution
A statistic is a random variable, with its own probability distribution
• distribution arises from probability distribution of underlying population, via the sample random variables
• distribution of the statistic is called the sampling distribution
• characteristics of the sampling distribution depend on:
  • the form of the statistic - e.g., linear function of the sample random variables
  • the distribution of the underlying population
Sampling Distribution for the Sample Average
• determine the mean and variance of the sample average
Mean
$$E\{\bar{X}\} = E\left\{\frac{1}{n}\sum_{i=1}^{n} X_i\right\} = \frac{1}{n}E\left\{\sum_{i=1}^{n} X_i\right\} = \frac{1}{n}\sum_{i=1}^{n} E\{X_i\} = \frac{1}{n}\sum_{i=1}^{n}\mu = \frac{n\mu}{n} = \mu$$
The value expected on average of the sample average is the true mean of the process - the sample average is an UNBIASED estimator for the mean. This result uses only the linearity of expectation and the fact that every $X_i$ has mean $\mu$; independence of the sample random variables is needed for the variance, not for the mean.
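Sampling Distribution for the Sample Average
Variance
$$\mathrm{Var}(\bar{X}) = \mathrm{Var}\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{1}{n^2}\mathrm{Var}\left(\sum_{i=1}^{n} X_i\right) = \frac{1}{n^2}\sum_{i=1}^{n}\mathrm{Var}(X_i) = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}$$
The variance of the sum equals the sum of the variances because the sample random variables are independent.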
Aside - Variance
If we have a sum of independent random variables, X and Y, with a and b constants, then
$$\mathrm{Var}(aX + bY) = a^2\,\mathrm{Var}(X) + b^2\,\mathrm{Var}(Y)$$
Variance of Sample Average
Interpretation
• variance of sample average is $\sigma^2/n$
• as n becomes larger, variance of sample average becomes smaller
• as more data is used, estimate becomes more precise
• sample average represents a concentration of information
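The $\sigma^2/n$ result can be checked numerically. The following is a minimal sketch, not part of the original slides: it assumes a Normal population with illustrative values for the mean and standard deviation (`MU`, `SIGMA`) and estimates the variance of the sample average by repeated simulated sampling.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

MU = 10.0           # assumed population mean (illustrative value)
SIGMA = 2.0         # assumed population standard deviation (illustrative value)
REPLICATES = 4000   # number of simulated experiments per sample size


def variance_of_sample_average(n):
    """Estimate Var(X-bar) by simulating many random samples of size n."""
    averages = [
        statistics.fmean(random.gauss(MU, SIGMA) for _ in range(n))
        for _ in range(REPLICATES)
    ]
    return statistics.variance(averages)


for n in (5, 20, 80):
    simulated = variance_of_sample_average(n)
    theory = SIGMA**2 / n
    print(f"n={n:3d}  simulated={simulated:.4f}  sigma^2/n={theory:.4f}")
```

The simulated variances should track $\sigma^2/n$ closely and shrink by roughly a factor of four each time n is quadrupled.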
Distribution of the Sample Average
• in preceding slides, no assumption was made about distribution of population (e.g., normal, exponential)
• Central Limit Theorem implies that distribution of sample average approaches a Normal distribution when number of samples becomes large
  • even if underlying population is non-Normal
• important consequences for comparing values - hypothesis tests and confidence limits
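One way to see the Central Limit Theorem at work is to sample from a clearly non-Normal population. The sketch below is an illustration under assumed settings (an exponential population with rate `RATE`, chosen only for the example): it standardizes each simulated sample average and counts how often it falls beyond -1.96 or +1.96. For a Normal distribution each tail would hold about 2.5%.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

RATE = 0.5               # assumed exponential rate (illustrative)
POP_MEAN = 1.0 / RATE    # mean of an exponential population
POP_SD = 1.0 / RATE      # standard deviation of an exponential population


def tail_fractions(n, replicates=4000):
    """Fractions of standardized sample averages below -1.96 and above +1.96."""
    std_err = POP_SD / math.sqrt(n)
    lower = upper = 0
    for _ in range(replicates):
        xbar = statistics.fmean(random.expovariate(RATE) for _ in range(n))
        z = (xbar - POP_MEAN) / std_err
        if z < -1.96:
            lower += 1
        elif z > 1.96:
            upper += 1
    return lower / replicates, upper / replicates


for n in (2, 20, 200):
    low_tail, high_tail = tail_fractions(n)
    print(f"n={n:3d}  lower tail={low_tail:.3f}  upper tail={high_tail:.3f}")
```

For small n the tails are lopsided, reflecting the skew of the exponential population; as n grows, both tails move toward the Normal value of 0.025.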
Sample Variance
… is estimated using the following statistic:
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2$$
Observed value:
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$
Mean of the sample variance:
$$E\{s^2\} = \sigma^2$$
Sample variance is an UNBIASED estimator of variance.
Sample Standard Deviation
… is simply the square root of the sample variance BUT
• sample standard deviation is a biased estimator of population standard deviation
• value on average does not tend to population value: $E\{s\} \neq \sigma$
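The difference between the unbiased sample variance and the biased sample standard deviation shows up clearly in simulation. This is a rough sketch with assumed values (Normal population, `SIGMA = 2`, sample size 5), not a result from the slides.

```python
import math
import random
import statistics

random.seed(2)  # fixed seed so the sketch is repeatable

SIGMA = 2.0         # assumed population standard deviation (illustrative)
N = 5               # small sample size, where the bias in s is easiest to see
REPLICATES = 10000  # number of simulated samples

variances = []
std_devs = []
for _ in range(REPLICATES):
    sample = [random.gauss(0.0, SIGMA) for _ in range(N)]
    s2 = statistics.variance(sample)  # sample variance, divides by n-1
    variances.append(s2)
    std_devs.append(math.sqrt(s2))

mean_s2 = statistics.fmean(variances)
mean_s = statistics.fmean(std_devs)
print(f"average of s^2 = {mean_s2:.3f}  (population variance = {SIGMA**2:.3f})")
print(f"average of s   = {mean_s:.3f}  (population std dev  = {SIGMA:.3f})")
```

The average of $s^2$ should land near $\sigma^2 = 4$, while the average of $s$ should fall noticeably below $\sigma = 2$.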
Confidence Intervals
Consider the sample average:
$$\bar{X} \sim N(\mu_X, \sigma_X^2/n)$$
where "~" reads "is distributed as" and $N(\cdot,\cdot)$ reads "Normally distributed with mean and variance".
We can standardize this to have zero mean and unit variance:
$$Z = \frac{\bar{X} - \mu_X}{\sigma_X/\sqrt{n}}$$
Confidence Intervals
Distribution for standard normal:
$$P(-1.96 < Z < 1.96) = 0.95$$
Start with this statement and substitute for Z:
$$P\left(-1.96 < \frac{\bar{X} - \mu_X}{\sigma_X/\sqrt{n}} < 1.96\right) = 0.95$$
$$\Leftrightarrow \quad P\left(\mu_X - 1.96\,\sigma_X/\sqrt{n} < \bar{X} < \mu_X + 1.96\,\sigma_X/\sqrt{n}\right) = 0.95$$
Confidence Intervals
Rearrange this last statement to obtain:
$$P\left(\bar{X} - 1.96\,\sigma_X/\sqrt{n} < \mu_X < \bar{X} + 1.96\,\sigma_X/\sqrt{n}\right) = 0.95$$
Here the two endpoints are RANDOM, while the mean $\mu_X$ in the middle is NOT random.
Interpretation
• limits of interval have uncertainty - if we repeated the sequence of estimating the average and computing the limits, the endpoints would change somewhat, BUT 95% of the time the interval would contain the true value of the mean
Confidence Intervals
• this interval DOES NOT imply that the mean is uncertain
[Picture: sequence of intervals associated with repeated experimentation, drawn against the true value of the mean]
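The picture of repeated intervals can be mimicked by simulation. The sketch below assumes illustrative values for the true mean, the known standard deviation, and the sample size; it repeats the experiment many times and counts how often the 95% interval contains the fixed true mean.

```python
import math
import random
import statistics

random.seed(3)  # fixed seed so the sketch is repeatable

TRUE_MEAN = 70.0   # assumed true process mean (illustrative)
SIGMA = 2.1        # assumed known population standard deviation (illustrative)
N = 10             # data points per experiment
REPEATS = 2000     # number of repeated experiments


def interval_covers_mean():
    """Run one experiment and report whether its 95% interval contains the mean."""
    sample = [random.gauss(TRUE_MEAN, SIGMA) for _ in range(N)]
    xbar = statistics.fmean(sample)
    half_width = 1.96 * SIGMA / math.sqrt(N)
    return xbar - half_width < TRUE_MEAN < xbar + half_width


coverage = sum(interval_covers_mean() for _ in range(REPEATS)) / REPEATS
print(f"fraction of intervals containing the true mean: {coverage:.3f}")
```

The endpoints move from experiment to experiment while the true mean stays fixed, and the fraction of intervals that contain it should be close to 0.95.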
Confidence Intervals
General result for mean - the $100(1-\alpha)\%$ confidence interval is given by:
$$\bar{X} - z_{\alpha/2}\,\sigma_X/\sqrt{n} < \mu_X < \bar{X} + z_{\alpha/2}\,\sigma_X/\sqrt{n}$$
where
• $z_{\alpha/2}$ is the "fence" - the value for which $P(Z > z_{\alpha/2}) = \alpha/2$
• value obtained from tables
  • 95% - value is 1.96 - approximately 2
  • 99% - value is 2.576
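Instead of reading the fence from tables, it can be computed from the inverse of the standard Normal distribution function. A small sketch using `statistics.NormalDist` from the Python standard library; the helper name `z_fence` is introduced here only for illustration.

```python
from statistics import NormalDist


def z_fence(confidence):
    """Return z_{alpha/2}, the standard Normal value with upper tail area alpha/2."""
    alpha = 1.0 - confidence
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence: z = {z_fence(level):.3f}")
```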
Confidence Intervals
General Approach
• form a quantity with a known distribution that depends on the parameter of interest: $$Z = \frac{\bar{X} - \mu_X}{\sigma_X/\sqrt{n}}$$
• form a probability statement - choose fences (limits) with a known probability: $$P\left(-1.96 < \frac{\bar{X} - \mu_X}{\sigma_X/\sqrt{n}} < 1.96\right) = 0.95$$
• re-arrange statement to obtain an interval specifying a range of values for the parameter of interest: $$P\left(\bar{X} - 1.96\,\sigma_X/\sqrt{n} < \mu_X < \bar{X} + 1.96\,\sigma_X/\sqrt{n}\right) = 0.95$$
Confidence Intervals for Mean
When population variance is "known", the $100(1-\alpha)\%$ confidence interval is
$$\bar{X} - z_{\alpha/2}\,\sigma_X/\sqrt{n} < \mu_X < \bar{X} + z_{\alpha/2}\,\sigma_X/\sqrt{n}$$
Known variance
• knowledge of variance when process has been operating steadily for long period of time
• on basis of extensive operating experience
• "large number of data points"
Confidence Intervals for Mean
What if variance is unknown?
• Estimate using sample variance $s^2$
Follow previous approach by forming standardized quantity:
$$\frac{\bar{X} - \mu_X}{s_X/\sqrt{n}}$$
• issue - $s^2$ is a statistic itself, and is a random variable
• this quantity no longer has a standard Normal distribution
Solution
• what is the probability distribution of this quantity, when data are Normally distributed?
Student's t Distribution
When the data are Normally distributed,
$$\frac{\bar{X} - \mu_X}{s_X/\sqrt{n}}$$
follows a Student's t distribution with n-1 degrees of freedom
Degrees of freedom
• number of statistically independent pieces of information used to compute sample variance
• recall that in $s^2$, we divide by n-1 where n is the number of data points
Student's t Distribution
… has a shape similar to that of Normal distribution
• symmetric
• values are available in tables
• extra parameter in tables - degrees of freedom
[Figure: Student's t density with 3 degrees of freedom]
Confidence Intervals for Mean
Variance Unknown
• estimated using sample variance
• $100(1-\alpha)\%$ case: $$\bar{X} - t_{\nu,\alpha/2}\,s_X/\sqrt{n} < \mu_X < \bar{X} + t_{\nu,\alpha/2}\,s_X/\sqrt{n}$$
• $\nu$ is the number of degrees of freedom (n-1), where n is number of data points used to compute sample variance (and average)
• obtained following identical argument used in the known variance case
Example #1
Conversion in a chemical reactor using new catalyst preparation
• data collected, average conversion computed using 10 data points is 76.1%
• prior operating history indicates that variance of conversion is 4.41 %²
• determine 95% confidence interval for mean conversion under new preparation, and use this to determine whether new conversion is significantly different from current conversion, which is known to be 70%
Example #1
• Confidence interval - 95%
• upper tail area is 2.5%
• standard deviation = sqrt(4.41) = 2.1
• confidence interval: $$76.1 - (1.96)(2.1)/\sqrt{10} < \mu < 76.1 + (1.96)(2.1)/\sqrt{10} \quad \Rightarrow \quad 74.8 < \mu < 77.4$$
• conclusion - interval doesn't contain current conversion of 70%, so new preparation is providing a significant change (increase) in conversion
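The arithmetic in Example #1 can be reproduced in a few lines. This sketch uses the numbers from the example; the function name `mean_interval_known_variance` is a hypothetical helper, and the fence comes from `statistics.NormalDist` rather than a table.

```python
import math
from statistics import NormalDist


def mean_interval_known_variance(xbar, variance, n, confidence=0.95):
    """Confidence interval for the mean when the population variance is known."""
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)  # fence z_{alpha/2}
    half_width = z * math.sqrt(variance) / math.sqrt(n)
    return xbar - half_width, xbar + half_width


# Values from Example #1: average 76.1%, known variance 4.41 %^2, 10 data points.
low, high = mean_interval_known_variance(xbar=76.1, variance=4.41, n=10)
current_conversion = 70.0

print(f"95% interval: {low:.1f} < mu < {high:.1f}")
print("contains current conversion:", low < current_conversion < high)
```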
Example #2
Conversion in a chemical reactor using new catalyst preparation
• data collected, average conversion computed using 10 data points is 76.1%
• current data set of 10 points used to estimate sample variance, which is 5.3 %²
• determine 95% confidence interval for mean conversion under new preparation, and use this to determine whether new conversion is significantly different from current conversion, which is known to be 70%
Example #2
• Confidence interval - 95%
• variance UNKNOWN - need to use Student's t distribution - degrees of freedom = 10-1 = 9
• upper tail area is 2.5%
• standard deviation = sqrt(5.3) = 2.3
• confidence interval: $$76.1 - (2.262)(2.3)/\sqrt{10} < \mu < 76.1 + (2.262)(2.3)/\sqrt{10} \quad \Rightarrow \quad 74.5 < \mu < 77.7$$
• conclusion - interval doesn't contain current conversion of 70%, so new preparation is providing a significant change (increase) in conversion
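Example #2 follows the same pattern with the t fence in place of the Normal one. In this sketch the tabulated value 2.262 from the example is entered as a constant, since the Python standard library has no Student's t quantile function; the function name is a hypothetical helper.

```python
import math

# Tabulated Student's t value from the example:
# 9 degrees of freedom, upper tail area 0.025.
T_9_0025 = 2.262


def mean_interval_unknown_variance(xbar, sample_variance, n, t_value):
    """Confidence interval for the mean using the sample variance and a t fence."""
    half_width = t_value * math.sqrt(sample_variance) / math.sqrt(n)
    return xbar - half_width, xbar + half_width


# Values from Example #2: average 76.1%, sample variance 5.3 %^2, 10 data points.
low, high = mean_interval_unknown_variance(76.1, 5.3, 10, T_9_0025)
current_conversion = 70.0

print(f"95% interval: {low:.1f} < mu < {high:.1f}")
print("contains current conversion:", low < current_conversion < high)
```

The interval is wider than in Example #1 both because the t fence (2.262) exceeds the Normal fence (1.96) and because the estimated variance is larger.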
Confidence Intervals for Variance
First, we need to know the sampling distribution of the sample variance:
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2$$
• when data are Normally distributed, sample variance is the sum of squared Normal random variables
• squaring "folds over" the negative values of the Normal random variable and makes them positive - asymmetry
Chi-squared distribution
• is the distribution of a squared standard Normal random variable: $Z^2 \sim \chi^2_1$
  • Chi-squared random variable with 1 degree of freedom
• degrees of freedom = number of independent standard Normal random variables being squared
• e.g., 3 degrees of freedom: $Z_1^2 + Z_2^2 + Z_3^2 \sim \chi^2_3$
[Figure: Chi-squared density with 3 degrees of freedom]
Sampling distribution - sample variance
Sample variance
• is the sum of n squared Normal random variables BUT these are squared deviations from the sample average
• given value of sample average introduces constraint - given $\bar{X}$, we only have n-1 independent random variables (the n-th can be computed from the average)
• sample variance contains n-1 independent Normal random variables, so degrees of freedom for Chi-squared distribution is n-1
$$s^2 \sim \frac{\sigma^2}{n-1}\,\chi^2_{n-1}$$
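A quick simulation can support the claim that $(n-1)s^2/\sigma^2$ behaves like a Chi-squared random variable with n-1 degrees of freedom. The sketch below relies on the standard facts that a Chi-squared variable with $\nu$ degrees of freedom has mean $\nu$ and variance $2\nu$; the population standard deviation and sample size are assumed values for illustration.

```python
import random
import statistics

random.seed(4)  # fixed seed so the sketch is repeatable

SIGMA = 1.5         # assumed population standard deviation (illustrative)
N = 10              # data points per sample, giving n-1 = 9 degrees of freedom
REPLICATES = 10000  # number of simulated samples

scaled = []
for _ in range(REPLICATES):
    sample = [random.gauss(0.0, SIGMA) for _ in range(N)]
    # (n-1) s^2 / sigma^2 should follow a Chi-squared distribution with n-1 dof
    scaled.append((N - 1) * statistics.variance(sample) / SIGMA**2)

mean_scaled = statistics.fmean(scaled)
var_scaled = statistics.variance(scaled)
print(f"mean     = {mean_scaled:.2f}  (Chi-squared with 9 dof: 9)")
print(f"variance = {var_scaled:.2f}  (Chi-squared with 9 dof: 18)")
```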
Confidence Intervals - Sample Variance
• Form probability statement: $$P\left(\chi^2_{n-1,\,1-\alpha/2} < \frac{(n-1)s^2}{\sigma^2} < \chi^2_{n-1,\,\alpha/2}\right) = 1-\alpha$$
• Re-arrange statement: $$P\left(\frac{(n-1)s^2}{\chi^2_{n-1,\,\alpha/2}} < \sigma^2 < \frac{(n-1)s^2}{\chi^2_{n-1,\,1-\alpha/2}}\right) = 1-\alpha$$
• $100(1-\alpha)\%$ interval is $$\frac{(n-1)s^2}{\chi^2_{n-1,\,\alpha/2}} < \sigma^2 < \frac{(n-1)s^2}{\chi^2_{n-1,\,1-\alpha/2}}$$
Confidence Limits for Variance
Notes
1) the tail areas are equal
• symmetric tail areas, however the interval can be asymmetric
• consequence of asymmetry of Chi-squared distribution
2) $\chi^2_{n-1,\,1-\alpha/2}$ is the value of the Chi-squared random variable with upper tail area of $1-\alpha/2$ and n-1 degrees of freedom
[Figure: Chi-squared density with equal tail areas marked]
Variance Confidence Intervals - Example
Temperature controller has been implemented on a polymer reactor
• variance under previous operation was 4.7 °C²
• under new operation, we have collected 10 data points and computed a sample variance of 3.2 °C²
• is the variance under the new control operation significantly better?
  • i.e., is variance under new operation significantly lower?
Variance Confidence Intervals - Example
Use confidence interval for variance
• n-1 = 10-1 = 9 degrees of freedom
• form 95% confidence interval ($\alpha = 0.05$)
• from tables: $\chi^2_{9,\,1-0.025} = 2.7$ and $\chi^2_{9,\,0.025} = 19.0$
• interval for variance: $1.52 < \sigma^2 < 10.67$
• conclusion - the interval contains the previous variance of 4.7, so the variance reduction isn't significant after background variation in sample variance computation is taken into account
• note that interval isn't symmetric
Variance Confidence Intervals - Example
Comment
• variance is sensitive to degrees of freedom
• need larger number of data points to obtain precise estimate
• e.g., if variance estimate was 3.2 °C² with 30 degrees of freedom (31 data points), the interval would be: $2.04 < \sigma^2 < 5.71$
• cf. previous interval with 10 data points: $1.52 < \sigma^2 < 10.67$
Conclusion still doesn't change, however: the previous variance of 4.7 lies inside both intervals.
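Both variance intervals can be reproduced with a short calculation. In this sketch the Chi-squared fences are entered as constants because the Python standard library has no Chi-squared quantile function: 19.0 and 2.7 are the table values quoted in the example for 9 degrees of freedom, while 47.0 and 16.8 for 30 degrees of freedom are an assumption, taken as standard table values rounded the same way. The function name is a hypothetical helper.

```python
def variance_interval(sample_variance, dof, chi2_upper, chi2_lower):
    """Confidence interval for the variance.

    chi2_upper: Chi-squared value with upper tail area alpha/2
    chi2_lower: Chi-squared value with upper tail area 1 - alpha/2
    """
    scaled = dof * sample_variance
    return scaled / chi2_upper, scaled / chi2_lower


previous_variance = 4.7  # variance under previous operation

# 10 data points (9 degrees of freedom); fences quoted in the example.
low9, high9 = variance_interval(3.2, dof=9, chi2_upper=19.0, chi2_lower=2.7)

# 31 data points (30 degrees of freedom); fences assumed from standard tables.
low30, high30 = variance_interval(3.2, dof=30, chi2_upper=47.0, chi2_lower=16.8)

for label, low, high in (("9 dof", low9, high9), ("30 dof", low30, high30)):
    inside = low < previous_variance < high
    print(f"{label}: {low:.2f} < sigma^2 < {high:.2f}  contains 4.7: {inside}")
```

The interval narrows considerably with more data, yet the previous variance remains inside it in both cases.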