
Testing Hypotheses with Experiment Data: Rationalizing Poor Performance

Learn how to analyze experiment data to test hypotheses about differences in treatment effects using the example of rationalizing poor performance.



Presentation Transcript


  1. Chapter 14 Learning from Experiment Data Created by Kathy Fritz

  2. Variability and Random Assignment

  3. What might you learn using data from an experiment? • We could learn whether there is a significant difference in the means (or proportions) of different treatment groups. • Even if all the individuals or objects receive the same treatment, we would NOT expect the groups to have exactly the same means. • An observed difference could be due to variability in the response or to the random assignment to treatments. • Often the individuals or objects receiving the treatments are not selected at random from some larger population. • This means that you cannot generalize the results of the experiment to a larger population.

  4. Testing Hypotheses About the Differences In Treatment Effects

  5. Two-Sample t Test for a Difference in Treatment Means Appropriate when the following conditions are met: • Individuals or objects are randomly assigned to treatments. • The number of individuals or objects in each of the treatment groups is large (n1 ≥ 30 and n2 ≥ 30), or the treatment response distributions are approximately normal. The treatment response distribution is the distribution of response values that would result if the treatment were applied to a VERY large number of individuals or objects.

  6. Two-Sample t Test for a Difference in Treatment Means Test statistic: t = ((x̄1 − x̄2) − hypothesized value) / √(s1²/n1 + s2²/n2), where the hypothesized value is the value of μ1 − μ2 from the null hypothesis (often this will be 0). When the conditions are met and the null hypothesis is true, the t test statistic has a t distribution with df = (V1 + V2)² / (V1²/(n1 − 1) + V2²/(n2 − 1)), where V1 = s1²/n1 and V2 = s2²/n2. The computed value of df should be truncated to obtain an integer value.
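The test statistic and truncated df described on this slide can be sketched in a few lines of Python. This is an illustrative helper, not part of the chapter; the summary statistics in the example call are hypothetical, not data from the chapter's studies.

```python
from math import sqrt

def welch_t_and_df(m1, s1, n1, m2, s2, n2, hypothesized=0.0):
    """Two-sample t statistic and df from summary statistics."""
    v1 = s1**2 / n1          # estimated variance of x-bar 1
    v2 = s2**2 / n2          # estimated variance of x-bar 2
    t = ((m1 - m2) - hypothesized) / sqrt(v1 + v2)
    df = (v1 + v2)**2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t, int(df)        # df truncated to an integer, as the slide says

# Hypothetical summary statistics for illustration:
t, df = welch_t_and_df(5.1, 1.2, 40, 5.8, 1.4, 45)
```

Note that both group sizes here exceed 30, so the large-sample condition from the previous slide is met without checking normality.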

  7. Two-Sample t Test for a Difference in Treatment Means Form of the null hypothesis: H0: μ1 − μ2 = hypothesized value. The P-value depends on the form of Ha: • Ha: μ1 − μ2 < hypothesized value: area under the t curve to the left of the calculated value of the test statistic • Ha: μ1 − μ2 > hypothesized value: area under the t curve to the right of the calculated value of the test statistic • Ha: μ1 − μ2 ≠ hypothesized value: 2·(area to the right of t) if t is positive, or 2·(area to the left of t) if t is negative
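The three P-value rules above can be sketched as one small function. Since the Python standard library has no t-curve CDF, this sketch uses the standard normal curve as a large-sample stand-in for the t curve (reasonable when df is large, as in this chapter's examples); the function name and labels are my own, not the book's.

```python
from statistics import NormalDist

def p_value(t, alternative):
    """P-value for the three forms of Ha, using the standard normal
    curve as a large-sample approximation of the t curve."""
    z = NormalDist()
    if alternative == "less":        # Ha: mu1 - mu2 < hypothesized value
        return z.cdf(t)              # area to the left of t
    if alternative == "greater":     # Ha: mu1 - mu2 > hypothesized value
        return 1 - z.cdf(t)          # area to the right of t
    # Ha: mu1 - mu2 != hypothesized value
    return 2 * (1 - z.cdf(abs(t)))   # double the tail area beyond |t|
```

For example, a test statistic of t = −2.0 with Ha of the "less than" form gives a P-value of about 0.023.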

  8. An experiment was performed to investigate the extent to which people rationalize poor performance. In this study, 246 college undergraduates were assigned at random to one of two groups – a negative feedback group or a positive feedback group. Each participant took a test in which they were asked to guess the emotions displayed in photographs of faces. The researchers hypothesized that people who receive negative feedback would tend to rationalize their poor performance by rating the validity of the test and the importance of being a good face reader lower than the people who receive positive feedback. At the end of the test, those in the negative feedback group were told that they had correctly answered 21 of the 40 items and were assigned a “grade” of D. Those in the positive feedback group were told that they had answered 35 of the 40 correctly and were assigned an A grade. After a brief time, participants were asked to answer two sets of questions. One set of questions asked about the validity of the test, and the other set asked about the importance of being able to read faces.

  9. Rationalize Poor Performance Continued . . . Do the data from this experiment support the researchers’ hypotheses? Let’s test the relevant hypotheses using a significance level of 0.01, beginning with test validity. Step 1 (Hypotheses): The two treatment means are defined as μ1 = mean test validity rating for the negative feedback treatment and μ2 = mean test validity rating for the positive feedback treatment, so μ1 − μ2 = difference in treatment means. H0: μ1 − μ2 = 0 versus Ha: μ1 − μ2 < 0

  10. Rationalize Poor Performance Continued . . . Step 2 (Method): Considering the four key questions (QSTN), this situation can be described as 1) hypothesis testing, 2) experiment data, 3) one numerical variable (validity rating), and 4) two treatments (negative feedback and positive feedback). This combination suggests a two-sample t test. A significance level of 0.01 was specified for this test. Step 3 (Check): • Participants were assigned at random to one of the two treatments. • Both treatment groups are large enough (n1 = n2 = 123 ≥ 30).

  11. Rationalize Poor Performance Continued . . .

  12. Rationalize Poor Performance Continued . . . Step 5 (Communicate Results): Because the P-value is less than the selected significance level (0.01), the null hypothesis is rejected. The mean validity rating for the negative feedback treatment is significantly lower than the mean for the positive feedback treatment.

  13. Rationalize Poor Performance Continued . . . The researchers also hypothesized that negative feedback results in a lower rating of the importance of being able to read faces. Step 1 (Hypotheses): The two treatment means are defined as μ1 = mean importance rating for the negative feedback treatment and μ2 = mean importance rating for the positive feedback treatment. H0: μ1 − μ2 = 0 versus Ha: μ1 − μ2 < 0. Steps 2 and 3 are the same as before.

  14. Rationalize Poor Performance Continued . . . Step 4 (Calculate): Using Minitab to do the computations results in the following output: Step 5 (Communicate Results): Because the P-value is less than the selected significance level (0.01), the null hypothesis is rejected. The mean importance rating for the negative feedback treatment is significantly lower than the mean for the positive feedback treatment.

  15. A Large-Sample Test for a Difference in Two Population Proportions Appropriate when the following conditions are met: • Individuals or objects are randomly assigned to treatments. • Both treatment group sizes are large: n1p̂1 ≥ 10, n1(1 − p̂1) ≥ 10, n2p̂2 ≥ 10, and n2(1 − p̂2) ≥ 10.

  16. A Large-Sample Test for a Difference in Two Population Proportions Continued . . . Test statistic: z = (p̂1 − p̂2) / √(p̂c(1 − p̂c)/n1 + p̂c(1 − p̂c)/n2), where p̂c is the combined (pooled) proportion of successes for the two treatment groups.

  17. A Large-Sample Test for a Difference in Two Population Proportions Continued . . . Null hypothesis: H0: p1 − p2 = 0. The P-value depends on the form of Ha: • Ha: p1 − p2 < 0: area under the z curve to the left of the calculated value of the test statistic • Ha: p1 − p2 > 0: area under the z curve to the right of the calculated value of the test statistic • Ha: p1 − p2 ≠ 0: 2·(area to the right of z) if z is positive, or 2·(area to the left of z) if z is negative

  18. Some people believe that you can fix anything with duct tape. Investigators at Madigan Army Medical Center tested using duct tape to remove warts. Patients with warts were randomly assigned to either the duct tape treatment or to the more traditional freezing treatment. Those in the duct tape group wore duct tape over the wart for 6 days, then removed the tape, soaked the area in water, and used an emery board to scrape the area. This process was repeated for a maximum of 2 months or until the wart was gone. The data follows: Do these data suggest that freezing is less successful than duct tape in removing warts?

  19. Duct Tape Continued . . . Step 1 (Hypotheses): The two treatment proportions are defined as p1 = proportion of warts successfully removed by duct tape p2 = proportion of warts successfully removed by freezing p1 – p2 = difference in treatment proportions H0: p1 – p2 = 0 Ha: p1 – p2 > 0 Step 2 (Method): Considering the four key questions (QSTN), this situation can be described as 1) hypothesis testing, 2) experiment data, 3) one categorical variable (with two categories – wart removed and wart not removed), and 4) two treatments (duct tape and freezing). This combination suggests a large-sample z test for difference in treatment proportions. For purposes of this example, a significance level of 0.05 will be used.

  20. P-value: P(z ≥ 4.03) ≈ 0

  21. Duct Tape Continued . . . Step 5 (Communicate Results): Because the P-value is less than the selected significance level (0.05), the null hypothesis is rejected. The proportion of warts removed by the duct tape treatment is greater than the proportion of warts removed by the freezing treatment.
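The pooled z statistic used in the duct tape example can be sketched as follows. The data table from the slide did not survive in this transcript, so the counts in the example call are illustrative, not the study's actual data, and the function name is my own.

```python
from math import sqrt

def two_prop_z(x1, n1, x2, n2):
    """Large-sample z statistic for H0: p1 - p2 = 0,
    using the combined (pooled) success proportion."""
    p1, p2 = x1 / n1, x2 / n2
    pc = (x1 + x2) / (n1 + n2)                 # pooled proportion
    se = sqrt(pc * (1 - pc) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Illustrative counts: successes/trials for duct tape, then freezing
z = two_prop_z(22, 26, 15, 25)
```

The one-sided P-value for Ha: p1 − p2 > 0 would then be the area under the z curve to the right of the computed z, as on slide 17.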

  22. Estimating the Difference in Treatment Effects

  23. The Two-Sample t Confidence Interval for a Difference in Treatment Means Appropriate when the following conditions are met: • Individuals or objects are randomly assigned to treatments. • The number of individuals or objects in each of the treatment groups is large (30 or more) or the treatment response distributions are approximately normal.

  24. The Two-Sample t Confidence Interval for a Difference in Treatment Means When the conditions are met, the confidence interval is (x̄1 − x̄2) ± (t critical value)·√(s1²/n1 + s2²/n2), with df computed as for the two-sample t test. The computed value of df should be truncated to obtain an integer value. The desired confidence level determines which t critical value is used.

  25. The Two-Sample t Confidence Interval for a Difference in Treatment Means Interpretation of Confidence Interval You can be confident that the actual value of the difference in treatment means is included in the computed interval. This statement should be worded in context. Interpretation of Confidence Level The confidence level specifies the long-run proportion of the time that this method is successful in capturing the actual difference in treatment means.

  26. Does talking elevate blood pressure, contributing to the tendency for blood pressure to be higher when measured in a doctor’s office than when measured in a less stressful environment? (This well-documented effect is called the “white coat effect.”) In a study, patients with high blood pressure were randomly assigned to one of two groups. Those in the first group (the talking group) were asked questions about their medical history and about sources of stress in their lives in the minutes before their blood pressure was measured. Those in the second group (the counting group) were asked to count aloud from 1 to 100 four times before their blood pressure was measured. The researchers were interested in estimating the difference in mean blood pressure for the two treatments (talking and counting). The following data values for diastolic blood pressure (in mm Hg) are consistent with summary quantities appearing in the paper.

  27. Step 1 (Estimate): You want to estimate μ1 − μ2 = difference in mean blood pressure, where μ1 = mean blood pressure for the talking treatment and μ2 = mean blood pressure for the counting treatment. Step 2 (Method): The answers to the four key questions are 1) estimation, 2) experiment data, 3) one numerical variable, and 4) two treatments. These answers lead you to consider a two-sample t confidence interval for the difference in treatment means. For this example, a 95% confidence level will be used. Step 3 (Check): • The patients were randomly assigned to one of two treatment groups. • Boxplots were constructed using the data from the two treatment groups. There are no outliers in either data set, and the boxplots are reasonably symmetric, suggesting that the assumption of approximate normality is reasonable.

  28. Step 4 (Calculate):

  29. Step 5 (Communicate Results): Confidence Interval: You can be 95% confident that the actual difference in mean blood pressure for the two treatments is somewhere between 1.02 mm Hg and 11.98 mm Hg. Because 0 is not included in this interval, you would conclude that the mean blood pressure is higher for the talking treatment than for the counting treatment by somewhere between 1.02 and 11.98 mm Hg. Confidence Level: The method used to construct this interval estimate is successful in capturing the actual difference in treatment means about 95% of the time.

  30. A Large-Sample Confidence Interval for a Difference in Treatment Proportions Appropriate when the following conditions are met: • Individuals or objects are randomly assigned to treatments. When these conditions are met, a confidence interval for the difference in treatment proportions is (p̂1 − p̂2) ± (z critical value)·√(p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2). The desired confidence level determines which z critical value is used. The three most common confidence levels use the following critical values: 90%: 1.645, 95%: 1.96, 99%: 2.58.

  31. Confidence Intervals Continued . . . Interpretation of Confidence Interval You can be confident that the actual value of the difference in treatment proportions is included in the computed interval. In a given problem, this statement should be in context. Interpretation of Confidence Level The confidence level specifies the long-run percentage of the time that this method will be successful in capturing the actual difference in treatment proportions.

  32. An article described an experiment to investigate the possible effects of prayer. The following two paragraphs are from the article: Bypass patients who consented to take part in the experiment were divided randomly into three groups. Some patients received prayers but were not informed of that. In the second group the patients got no prayers, and also were not informed one way or the other. The third group got prayers and were told so. There was virtually no difference in complication rates between the patients in the first two groups. But the third group, in which patients knew they were receiving prayers, had a complication rate of 59 percent – significantly more than the rate of 52 percent in the no-prayer group. The article also states that a total of 1800 people participated in the experiment, with 600 being assigned at random to each treatment group.

  33. Step 1 (Estimate): You want to use the given information to estimate the difference between the proportion of patients with complications for the no prayer treatment, p1, and the proportion of patients with complications for the treatment where people knew someone was praying for them, p2. Step 2 (Method): Because this is an estimation problem, the data are from an experiment, the one response variable (complications or no complications) is categorical, and two treatments are being compared, a method to consider is a large-sample z confidence interval for the difference in treatment proportions. For this example, a 90% confidence level will be used.

  34. Step 5 (Communicate Results): You can be 90% confident that the difference in the proportion of patients with complications for the no-prayer treatment and the treatment where patients knew that someone was praying for them is between -0.118 and -0.022. Because both endpoints of the interval are negative, you would conclude that the proportion with complications is higher for the treatment where patients knew that someone was praying for them than for the no-prayer treatment, by somewhere between 0.022 and 0.118. The method used to construct this estimate captures the true difference in treatment proportions about 90% of the time.
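The prayer-experiment interval can be reproduced from the numbers reported in the transcript (0.52 and 0.59 with 600 patients per group, z* = 1.645 for 90% confidence). Because the slide's proportions are themselves rounded, the endpoints below may differ from the slide's in the last decimal place.

```python
from math import sqrt

# Values as reported in the transcript:
p1, n1 = 0.52, 600    # no-prayer group complication proportion
p2, n2 = 0.59, 600    # knew-they-were-prayed-for group
z_crit = 1.645        # z critical value for 90% confidence

# (p1hat - p2hat) +/- z* sqrt(p1hat(1-p1hat)/n1 + p2hat(1-p2hat)/n2)
se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
lo = (p1 - p2) - z_crit * se
hi = (p1 - p2) + z_crit * se
```

Both endpoints come out negative, matching the conclusion in Step 5 that the complication proportion is higher when patients know they are being prayed for.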

  35. Avoid These Common Mistakes

  36. Avoid These Common Mistakes • Random assignment to treatments is critical. If the design of the experiment does not include random assignment to treatments, it is not appropriate to use a hypothesis test or a confidence interval to draw conclusions about treatment differences.

  37. Avoid These Common Mistakes • Remember that it is not reasonable to generalize conclusions from experiment data to a larger population unless the subjects in the experiment were selected at random from the population or a convincing argument can be made that the group of volunteers is representative of the population. And even if subjects are selected at random from a population, it is still important that there be random assignment to treatments.

  38. Avoid These Common Mistakes • Remember that a hypothesis test can never show strong support for the null hypothesis. In the context of using experiment data to test hypotheses, this means you cannot say that data from an experiment provide convincing evidence that there is no difference between treatments.

  39. Avoid These Common Mistakes • Even when the data used in a hypothesis test are from an experiment, there is still a difference between statistical significance and practical significance. It is possible, especially in experiments with large numbers of subjects in each experimental group, to be convinced that two treatment means or treatment proportions are not equal even when the actual difference between them is small. It may be useful to look at a confidence interval estimate of the difference to get a sense of practical significance.
