Business Statistics for Managerial Decision

Business Statistics for Managerial Decision: Inference for Proportions

Inference for Proportions • Some statistical studies concern variables measured on a scale of equal units, such as dollars or grams. • We have discussed inference about the mean of variables like these in our previous lectures. • Other studies record categorical variables, such as the race or occupation of a person, the make of a car, or the type of complaint received from a customer. • When we record categorical variables, our data consist of counts or of percents obtained from counts.

Inference for Proportions • The parameters we want to do inference about in these settings are population proportions. • Just as in the case of inference about population means, we may be concerned with a single population or with comparing two populations. • Inference about one or two proportions is very similar to inference about means and it is based on sampling distributions that are approximately Normal.

Example: Work stress and personal life • The human resources manager of a restaurant chain is concerned that work stress may be affecting the chain’s employees. She asks a random sample of 100 employees to respond Yes or No to the question “Does work stress have a negative impact on your personal life?” Of these, 68 say “Yes.”

Example: Work stress and personal life • The parameter of interest is the proportion of the chain’s employees who would answer “Yes” if asked. • This is the population proportion, which we call $p$. • The statistic used to estimate the unknown parameter is the sample proportion $\hat{p} = X/n$, where $X$ is the count of “Yes” responses; here $\hat{p} = 68/100 = 0.68$.

Inference for a Single Proportion • The sample proportion $\hat{p}$ is a discrete random variable that can take the values 0, 1/100, 2/100, …, 99/100, or 1. • The probability model for $\hat{p}$ can be based on the Binomial distributions for counts. • If the sample size n is very small, we must base tests and confidence intervals for $p$ on the discrete distribution of $\hat{p}$. • We can approximate the distribution of $\hat{p}$ by a Normal distribution when the sample size is large.

Sampling Distribution of a Sample Proportion • Choose an SRS of size n from a large population that contains population proportion $p$ of “successes.” Let $\hat{p} = X/n$ be the sample proportion of successes, where $X$ is the count of successes in the sample. • Then: • As the sample size increases, the sampling distribution of $\hat{p}$ becomes approximately Normal. • The mean of the sampling distribution is $p$. • The standard deviation of the sampling distribution is $\sqrt{p(1-p)/n}$.

Sampling Distribution of a Sample Proportion • The sampling distribution of the sample proportion of successes has approximately a Normal distribution.
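The mean and standard deviation stated above can be checked numerically. The sketch below assumes the count of successes follows a Binomial(n, p) distribution; the function name `phat_mean_sd` and the values n = 100, p = 0.68 are illustrative choices, not part of the lecture.

```python
from math import comb, sqrt


def phat_mean_sd(n, p):
    """Exact mean and standard deviation of p-hat = X/n when X ~ Binomial(n, p)."""
    # Binomial probabilities P(X = k) for k = 0, 1, ..., n
    pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    mean = sum((k / n) * w for k, w in enumerate(pmf))
    var = sum((k / n - mean) ** 2 * w for k, w in enumerate(pmf))
    return mean, sqrt(var)


# Hypothetical population proportion and sample size, chosen for illustration only
n, p = 100, 0.68
mean, sd = phat_mean_sd(n, p)
formula_sd = sqrt(p * (1 - p) / n)  # the slide's formula sqrt(p(1-p)/n)
print(f"mean of p-hat = {mean:.4f}, sd of p-hat = {sd:.4f}, formula sd = {formula_sd:.4f}")
```

The exact Binomial calculation and the closed-form expression should agree up to floating-point rounding, which is what the formulas on the slide assert.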

Confidence Interval for a Single Proportion • The sample proportion $\hat{p}$ is the natural estimator of the population proportion $p$. • The traditional confidence interval for $p$ is based on the Normal approximation to the distribution of $\hat{p}$. • Unfortunately, confidence intervals based on this statistic can be quite inaccurate, even for large samples. • We can do better by moving the sample proportion slightly away from 0 and 1. • The following simple adjustment works very well in practice.

Confidence Interval for a Single Proportion • Wilson estimate: • Assume we have 4 additional observations, 2 of which are successes and 2 of which are failures. • The new sample size is n + 4 and the count of successes is X + 2. • The estimator of the population proportion is $\tilde{p} = (X + 2)/(n + 4)$.

Confidence Interval for a Single Proportion • We base a confidence interval on the z statistic obtained by standardizing the Wilson estimate $\tilde{p}$. • The distribution of $\tilde{p}$ is close to the Normal distribution with mean $p$ and standard deviation $\sqrt{p(1-p)/(n+4)}$.

Confidence Interval for a Single Proportion • Choose an SRS of size n from a large population with unknown proportion $p$ of successes. The Wilson estimate of the population proportion is $\tilde{p} = (X + 2)/(n + 4)$. • The standard error of $\tilde{p}$ is $\mathrm{SE}_{\tilde{p}} = \sqrt{\tilde{p}(1-\tilde{p})/(n+4)}$. • An approximate level C confidence interval for $p$ is $\tilde{p} \pm z^* \mathrm{SE}_{\tilde{p}}$, • where $z^*$ is the value for the standard Normal density curve with area C between $-z^*$ and $z^*$. • Use this interval when the sample size is at least n = 5 and the confidence level is 90% or more.
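The interval above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `wilson_plus4_ci` is hypothetical, and the critical value $z^*$ comes from the standard library's `statistics.NormalDist` instead of a printed table. It is applied to the work-stress survey counts (68 “Yes” out of 100).

```python
from math import sqrt
from statistics import NormalDist


def wilson_plus4_ci(x, n, conf=0.95):
    """Approximate level-`conf` interval using the slides' Wilson estimate.

    Adds 2 successes and 2 failures before computing the proportion.
    """
    z_star = NormalDist().inv_cdf((1 + conf) / 2)  # about 1.96 for 95%
    p_tilde = (x + 2) / (n + 4)                    # Wilson estimate
    se = sqrt(p_tilde * (1 - p_tilde) / (n + 4))   # standard error of p-tilde
    return p_tilde, se, (p_tilde - z_star * se, p_tilde + z_star * se)


# Work-stress survey: X = 68 successes in a sample of n = 100
p_tilde, se, (lo, hi) = wilson_plus4_ci(68, 100)
print(f"p-tilde = {p_tilde:.3f}, SE = {se:.4f}, 95% CI = ({lo:.3f}, {hi:.3f})")
```

Because the code does not round the intermediate values, its results may differ from a hand calculation in the last digit.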

Example: Estimating the effect of work stress • The sample survey in the previous example found that 68 out of 100 employees agreed that work stress had a negative impact on their personal lives. The sample size is n = 100 and the count of successes is X = 68. The Wilson estimate of the proportion of all employees affected by work stress is $\tilde{p} = (68 + 2)/(100 + 4) = 70/104 = 0.673$. • The standard error is $\mathrm{SE}_{\tilde{p}} = \sqrt{0.673 \times 0.327/104} = 0.0460$.

Example: Estimating the effect of work stress • The z critical value for 95% confidence is $z^* = 1.96$, so the confidence interval is $\tilde{p} \pm z^* \mathrm{SE}_{\tilde{p}} = 0.673 \pm 1.96 \times 0.0460 = 0.673 \pm 0.090 = (0.583, 0.763)$. • We are 95% confident that between 58.3% and 76.3% of the restaurant chain’s employees feel that work stress is damaging their personal lives.

Significance Test for a Single Proportion • The sample proportion $\hat{p}$ is approximately Normal with mean $\mu_{\hat{p}} = p$ and standard deviation $\sigma_{\hat{p}} = \sqrt{p(1-p)/n}$. • For confidence intervals we used the Wilson estimate and estimated the standard deviation from the data. • When performing a significance test, the null hypothesis specifies a value for $p$, which we call $p_0$. • We assume the hypothesized value were actually true, substitute $p_0$ for $p$ in the expression for $\sigma_{\hat{p}}$, and then standardize $\hat{p}$.

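Significance Test for a Single Proportion • To test the hypothesis $H_0: p = p_0$, compute the z statistic $z = (\hat{p} - p_0)/\sqrt{p_0(1-p_0)/n}$. • Under $H_0$, this statistic has approximately the standard Normal distribution when the sample is large.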

Example: Work stress • A national survey of restaurant employees found that 75% said that work stress had a negative impact on their personal lives. A sample of 100 employees of a restaurant chain found that 68 answered “Yes” when asked, “Does work stress have a negative impact on your personal life?” Is this good reason to think that the proportion of all employees of this chain who say “Yes” differs from the national proportion $p_0 = 0.75$?

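Example: Work stress • To answer this question, we test $H_0: p = 0.75$ against $H_a: p \neq 0.75$. • The expected numbers of “Yes” and “No” responses are • 100 × 0.75 = 75 and 100 × 0.25 = 25. • Both are greater than 10, so we can use the z test. • The test statistic is $z = (\hat{p} - p_0)/\sqrt{p_0(1-p_0)/n} = (0.68 - 0.75)/\sqrt{0.75 \times 0.25/100} = -1.62$.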

Example: Work stress • From Table A we find $P(Z \le -1.62) = 0.0526$. • The P-value is the area in both tails: • $P = 2 \times 0.0526 = 0.1052$. • We conclude that the restaurant chain’s data are compatible with the national survey results.
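The test in this example can be reproduced with a short sketch. The function name `one_prop_z_test` is hypothetical, and the two-sided P-value is computed from `statistics.NormalDist` rather than read from Table A.

```python
from math import sqrt
from statistics import NormalDist


def one_prop_z_test(x, n, p0):
    """Two-sided z test of H0: p = p0 based on the sample proportion x/n."""
    p_hat = x / n
    sd0 = sqrt(p0 * (1 - p0) / n)  # standard deviation of p-hat if H0 is true
    z = (p_hat - p0) / sd0
    p_value = 2 * NormalDist().cdf(-abs(z))  # area in both tails
    return z, p_value


# Work-stress example: 68 "Yes" out of 100, null value p0 = 0.75
z, p_value = one_prop_z_test(68, 100, 0.75)
print(f"z = {z:.2f}, two-sided P-value = {p_value:.4f}")
```

Since the code keeps z unrounded (about -1.617), its P-value comes out near 0.106, slightly different from the table-based 0.1052 obtained with z rounded to -1.62. The conclusion is the same.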

Choosing a Sample Size • We want to see how to choose the sample size n to obtain a confidence interval with a specified margin of error m for a population proportion. • The margin of error for the confidence interval for a population proportion is $m = z^* \mathrm{SE}_{\tilde{p}} = z^* \sqrt{\tilde{p}(1-\tilde{p})/(n+4)}$. • Choosing a confidence level C fixes the critical value $z^*$.

Choosing a Sample Size • The margin of error also depends on the value of $\tilde{p}$ and the sample size n. • We don’t know the value of $\tilde{p}$ until we gather data, so we must guess a value to use in the calculations. • Let’s call the guessed value $p^*$. There are two ways to get $p^*$: • Use a sample estimate from a pilot study or from similar studies done earlier. • Use $p^* = 0.5$. Because the margin of error is largest when $\tilde{p} = 0.5$, this choice gives a sample size that is somewhat larger than we really need for the confidence level we choose. It is a safe choice no matter what the data later show.

Choosing a Sample Size • The level C confidence interval for a proportion $p$ will have a margin of error approximately equal to a specified value m when the sample size satisfies $n + 4 = (z^*/m)^2 \, p^*(1 - p^*)$. • Here $z^*$ is the critical value for confidence C, and $p^*$ is a guessed value for the proportion of successes in the future sample. • The margin of error will be less than or equal to m if $p^*$ is chosen to be 0.5. The sample size required is then given by $n + 4 = (z^*/(2m))^2$.

Example: Planning a sample of customers • Your company has received complaints about its customer support service. You intend to hire a consulting company to carry out a sample survey of customers. Before contacting the consultant, you want some idea of the sample size you will have to pay for. One critical question is the degree of satisfaction with your customer service, measured on a five-point scale. You want to estimate the proportion $p$ of your customers who are satisfied (that is, who choose either “satisfied” or “very satisfied,” the two highest levels on the five-point scale).

Example: Planning a sample of customers • You want to estimate $p$ with 95% confidence and a margin of error less than or equal to 3%. For planning purposes, you are willing to use $p^* = 0.5$. The sample size required is $n + 4 = (z^*/(2m))^2 = (1.96/(2 \times 0.03))^2 = 1067.1$. • Round up to get n + 4 = 1068, or n = 1064. (Always round up. Rounding down would give a margin of error slightly greater than 0.03.) • Similarly, for a 2.5% margin of error we have $n + 4 = (1.96/(2 \times 0.025))^2 = 1536.6$, which gives (after rounding up) n + 4 = 1537, or n = 1533.
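The planning calculation can be sketched as a small helper. The name `plus4_sample_size` is hypothetical; the function solves $n + 4 = (z^*/m)^2 p^*(1-p^*)$ for n and always rounds up, as the slide recommends. The critical value is taken from `statistics.NormalDist` rather than the rounded 1.96.

```python
from math import ceil
from statistics import NormalDist


def plus4_sample_size(m, conf=0.95, p_star=0.5):
    """Smallest n whose level-`conf` interval has margin of error at most m.

    p_star is the guessed proportion; 0.5 is the conservative choice.
    """
    z_star = NormalDist().inv_cdf((1 + conf) / 2)
    n_plus_4 = ceil((z_star / m) ** 2 * p_star * (1 - p_star))  # always round up
    return n_plus_4 - 4


n_3pct = plus4_sample_size(0.03)    # 3% margin of error
n_25pct = plus4_sample_size(0.025)  # 2.5% margin of error
print(n_3pct, n_25pct)
```

Passing a pilot-study estimate as `p_star` instead of 0.5 would give a smaller required sample whenever that estimate is away from 0.5.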

Comparing Two Proportions • We often want to compare the proportions of two groups (such as men and women) that have some characteristic. • We call the two groups being compared Population 1 and Population 2. • We call the two population proportions of “successes” $p_1$ and $p_2$. • The data consist of two independent SRSs. • The sample sizes are $n_1$ from Population 1 and $n_2$ from Population 2.

Comparing Two Proportions • The proportion of successes in each sample estimates the corresponding population proportion. • Here is the notation we will use:
Population | Population proportion | Sample size | Count of successes | Sample proportion
1 | $p_1$ | $n_1$ | $X_1$ | $\hat{p}_1 = X_1/n_1$
2 | $p_2$ | $n_2$ | $X_2$ | $\hat{p}_2 = X_2/n_2$

Sampling Distribution of $\hat{p}_1 - \hat{p}_2$ • Choose independent SRSs of sizes $n_1$ and $n_2$ from two populations with proportions $p_1$ and $p_2$ of successes. • Let $D = \hat{p}_1 - \hat{p}_2$ be the difference between the two sample proportions of successes. • Then as both sample sizes increase, the sampling distribution of D becomes approximately Normal. • The mean of the sampling distribution is $p_1 - p_2$. • The standard deviation of the sampling distribution is $\sigma_D = \sqrt{p_1(1-p_1)/n_1 + p_2(1-p_2)/n_2}$.

Sampling Distribution of $\hat{p}_1 - \hat{p}_2$ • The sampling distribution of the difference of two sample proportions is approximately Normal. • The mean and standard deviation are found from the two population proportions of successes, $p_1$ and $p_2$.

Confidence Interval • Just as in the case of estimating a single proportion, a small modification of the sample proportions greatly improves the accuracy of confidence intervals. • The Wilson estimates of the two population proportions are $\tilde{p}_1 = (X_1 + 1)/(n_1 + 2)$ and $\tilde{p}_2 = (X_2 + 1)/(n_2 + 2)$.

Confidence Interval • The standard deviation of $\tilde{D} = \tilde{p}_1 - \tilde{p}_2$ is approximately $\sigma_{\tilde{D}} = \sqrt{p_1(1-p_1)/(n_1+2) + p_2(1-p_2)/(n_2+2)}$. • To obtain a confidence interval for $p_1 - p_2$, we replace the unknown parameters in the standard deviation by estimates to obtain an estimated standard deviation, or standard error.

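Confidence Interval for Comparing Two Proportions • The standard error of $\tilde{D} = \tilde{p}_1 - \tilde{p}_2$ is $\mathrm{SE}_{\tilde{D}} = \sqrt{\tilde{p}_1(1-\tilde{p}_1)/(n_1+2) + \tilde{p}_2(1-\tilde{p}_2)/(n_2+2)}$. • An approximate level C confidence interval for $p_1 - p_2$ is $(\tilde{p}_1 - \tilde{p}_2) \pm z^* \mathrm{SE}_{\tilde{D}}$.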

Example: “No Sweat” Garment Labels • Following complaints about the working conditions in some apparel factories both in the United States and abroad, a joint government and industry commission recommended in 1998 that companies that monitor and enforce proper standards be allowed to display a “No Sweat” label on their products. A survey of U.S. residents aged 18 or older asked a series of questions about how likely they would be to purchase a garment under various conditions.

Example: “No Sweat” Garment Labels • For some conditions, it was stated that the garment had a “No Sweat” label; for others, there was no mention of such a label. On the basis of the responses, each person was classified as a “label user” or a “label nonuser.” About 16.5% of those surveyed were label users. One purpose of the study was to describe the demographic characteristics of users and nonusers.

Example: “No Sweat” Garment Labels • The study suggested that there is a gender difference in the proportion of label users. Here is a summary of the data. Let X denote the number of label users.
Population | n | X | $\hat{p} = X/n$ | $\tilde{p} = (X+1)/(n+2)$
1 (women) | 296 | 63 | 0.213 | 0.215
2 (men) | 251 | 27 | 0.108 | 0.111

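Example: “No Sweat” Garment Labels • First calculate the standard error of the observed difference: $\mathrm{SE}_{\tilde{D}} = \sqrt{0.215 \times 0.785/298 + 0.111 \times 0.889/253} = 0.0309$. • The 95% confidence interval is $(\tilde{p}_1 - \tilde{p}_2) \pm z^* \mathrm{SE}_{\tilde{D}} = (0.215 - 0.111) \pm 1.96 \times 0.0309 = 0.104 \pm 0.061$.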

Example: “No Sweat” Garment Labels • With 95% confidence we can say that the difference in the proportions is between 0.04 and 0.16. • Alternatively, we can report that the proportion of label users is about 10 percentage points higher among women than among men, with a 95% margin of error of 6 percentage points. • In this example we chose women to be the first population. Had we chosen men as the first population, the estimate of the difference would be negative (-0.104). • Because it is easier to discuss positive numbers, we generally choose the first population to be the one with the higher proportion. • The choice does not affect the substance of the analysis.
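The two-sample interval for the garment-label data can be sketched as follows. The function name `two_prop_plus4_ci` is hypothetical; it adds one success and one failure to each sample, as in the Wilson estimates above, and takes $z^*$ from `statistics.NormalDist`.

```python
from math import sqrt
from statistics import NormalDist


def two_prop_plus4_ci(x1, n1, x2, n2, conf=0.95):
    """Approximate level-`conf` interval for p1 - p2 using the slides' Wilson estimates."""
    z_star = NormalDist().inv_cdf((1 + conf) / 2)
    p1 = (x1 + 1) / (n1 + 2)  # Wilson estimate for population 1
    p2 = (x2 + 1) / (n2 + 2)  # Wilson estimate for population 2
    se = sqrt(p1 * (1 - p1) / (n1 + 2) + p2 * (1 - p2) / (n2 + 2))
    diff = p1 - p2
    return diff, se, (diff - z_star * se, diff + z_star * se)


# Garment-label survey: 63 of 296 women and 27 of 251 men are label users
diff, se, (lo, hi) = two_prop_plus4_ci(63, 296, 27, 251)
print(f"difference = {diff:.3f}, SE = {se:.4f}, 95% CI = ({lo:.3f}, {hi:.3f})")
```

Swapping the two samples in the call flips the sign of the estimate and of both endpoints, which matches the remark that the choice of first population does not affect the substance of the analysis.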

Significance Tests • It is sometimes useful to test the null hypothesis that the two population proportions are the same. • We standardize $D = \hat{p}_1 - \hat{p}_2$ by subtracting its mean $p_1 - p_2$ and then dividing by its standard deviation $\sigma_D = \sqrt{p_1(1-p_1)/n_1 + p_2(1-p_2)/n_2}$. • If $n_1$ and $n_2$ are large, the standardized difference is approximately N(0, 1). • To estimate $\sigma_D$ we take into account the null hypothesis that $p_1 = p_2$.

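Significance Tests • If these two proportions are equal, we can view all of the data as coming from a single population. • Let $p$ denote the common value of $p_1$ and $p_2$. The standard deviation of $D = \hat{p}_1 - \hat{p}_2$ is then $\sigma_{Dp} = \sqrt{p(1-p)/n_1 + p(1-p)/n_2} = \sqrt{p(1-p)(1/n_1 + 1/n_2)}$.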

Significance Tests • We estimate the common value of $p$ by the overall proportion of successes in the two samples, $\hat{p} = (X_1 + X_2)/(n_1 + n_2)$. • This estimate of $p$ is called the pooled estimate. • To estimate the standard deviation of D, substitute $\hat{p}$ for $p$ in the expression for $\sigma_{Dp}$. • The result is a standard error for D under the condition that the null hypothesis $H_0: p_1 = p_2$ is true. • The test statistic uses this standard error to standardize the difference between the two sample proportions.

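Significance Tests for Comparing Two Proportions • To test the hypothesis $H_0: p_1 = p_2$, compute the z statistic $z = (\hat{p}_1 - \hat{p}_2)/\mathrm{SE}_{Dp}$, where the pooled standard error is $\mathrm{SE}_{Dp} = \sqrt{\hat{p}(1-\hat{p})(1/n_1 + 1/n_2)}$ and $\hat{p} = (X_1 + X_2)/(n_1 + n_2)$ is the pooled estimate.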

Example: Men, women, and garment labels • The previous example presented the survey data on whether consumers are “label users” who pay attention to label details when buying a shirt. Are men and women equally likely to be label users? • Here is the data summary:
Population | n | X | $\hat{p} = X/n$
1 (women) | 296 | 63 | 0.213
2 (men) | 251 | 27 | 0.108

Example: Men, women, and garment labels • We compare the proportions of label users in the two populations (women and men) by testing the hypotheses $H_0: p_1 = p_2$ against $H_a: p_1 \neq p_2$. • The pooled estimate of the common value of $p$ is $\hat{p} = (63 + 27)/(296 + 251) = 90/547 = 0.1645$. • This is the proportion of label users in the entire sample.

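Example: Men, women, and garment labels • The test statistic is calculated as follows: $\mathrm{SE}_{Dp} = \sqrt{0.1645 \times 0.8355 \times (1/296 + 1/251)} = 0.0318$ and $z = (\hat{p}_1 - \hat{p}_2)/\mathrm{SE}_{Dp} = (0.213 - 0.108)/0.0318 = 3.30$. • The observed difference is more than 3 standard deviations away from zero.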

Example: Men, women, and garment labels • The P-value is $P = 2P(Z \ge 3.30) = 2 \times 0.0005 = 0.0010$. • Conclusion: • 21% of women are label users versus only 11% of men; the difference is statistically significant.
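The pooled two-sample test from this example can be sketched in the same style. The function name `two_prop_pooled_z_test` is hypothetical, and the two-sided P-value is computed from `statistics.NormalDist` instead of Table A.

```python
from math import sqrt
from statistics import NormalDist


def two_prop_pooled_z_test(x1, n1, x2, n2):
    """Two-sided z test of H0: p1 = p2 using the pooled estimate of p."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    p_pooled = (x1 + x2) / (n1 + n2)  # overall proportion of successes
    se = sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    z = (p1_hat - p2_hat) / se
    p_value = 2 * NormalDist().cdf(-abs(z))  # area in both tails
    return p_pooled, z, p_value


# Garment-label survey: 63 of 296 women and 27 of 251 men are label users
p_pooled, z, p_value = two_prop_pooled_z_test(63, 296, 27, 251)
print(f"pooled p-hat = {p_pooled:.4f}, z = {z:.2f}, P-value = {p_value:.4f}")
```

With unrounded intermediate values the statistic comes out near 3.31 rather than the hand-calculated 3.30, and the P-value is just under 0.001; the conclusion is unchanged.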
