Tests of Significance in Statistical Inference

Topic 5 Statistical Inference: Tests of Significance

The Reasoning of Tests of Significance • Suppose a player claims that he makes 80% of his basketball free throws. How can his claim be tested? A natural approach is to ask him to shoot, say, 50 free throws. If he makes • 10 (20%), should his claim be rejected? • 20 (40%), should his claim be rejected? • 30 (60%), should his claim be rejected? • 40 (80%), should his claim be rejected? • 45 (90%), should his claim be rejected?

The Reasoning of Tests of Significance The Survey of Study Habits and Attitudes (SSHA) is a psychological test that measures students’ study habits and attitudes toward school. Scores range from 0 to 200. Suppose we know that scores for college students in the population are normally distributed with mean 115 and standard deviation σ = 30. A teacher suspects that the mean score for older students is higher than 115. She gives the SSHA to an SRS of 25 students who are at least 30 years old. By the CLT, the sampling distribution of the sample mean under the claim µ = 115 is N(115, 6). (Why? The standard deviation of the sample mean is σ/√n = 30/√25 = 6.) Sketch the density curve of this distribution and mark the axis with the cutoffs specified in the 68-95-99.7 rule. Which of the following sample means would be good evidence against the claim µ = 115? (a) 118.6 (b) 120.3 (c) 128.2

Steps of Conducting Tests of Significance • Step 1: Identify the null and alternative hypotheses. • The null hypothesis (H0) states the claim we are seeking evidence against (usually a claim of “no effect” or “no difference”). • The alternative hypothesis (Ha) is the claim about the population that we are trying to find evidence for. • Step 2: Choose a test statistic, which measures the distance between the parameter value stated in the null hypothesis and an appropriate estimate of the parameter from the data. • Step 3: Report a P-value based on the sampling distribution of the test statistic under the null hypothesis. • Step 4 (optional): Make your decision by comparing the P-value with a significance level (if given).

Z test for a Population Mean µ • Draw an SRS of size n from a Normal population that has unknown mean µ but known standard deviation σ. To test the null hypothesis H0: µ = µ0, calculate the one-sample z test statistic z = (x̄ − µ0) / (σ/√n), where x̄ is the sample mean. • It can be shown that, when H0 is true, z has the standard normal distribution N(0,1). The P-value for a test of H0 against Ha: µ > µ0 is the right tail area under the standard normal density curve beyond the z statistic value; against Ha: µ < µ0, it is the left tail area beyond the z statistic value; against Ha: µ ≠ µ0, it is twice the right tail area beyond |z|, the absolute value of the z statistic.
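The z test described above can be sketched in R as follows. This is a minimal illustration rather than part of the original slides: the helper name `z_test` and its arguments are hypothetical, and it assumes a Normal population with known σ. The usage lines apply it to the three candidate SSHA sample means (µ0 = 115, σ = 30, n = 25) against Ha: µ > 115.

```r
# Hypothetical helper (not from the slides): one-sample z test with known sigma.
z_test <- function(xbar, mu0, sigma, n,
                   alternative = c("greater", "less", "two.sided")) {
  alternative <- match.arg(alternative)
  z <- (xbar - mu0) / (sigma / sqrt(n))      # one-sample z statistic
  p <- switch(alternative,
    greater   = pnorm(z, lower.tail = FALSE),            # right tail area
    less      = pnorm(z),                                # left tail area
    two.sided = 2 * pnorm(abs(z), lower.tail = FALSE)    # twice the tail beyond |z|
  )
  list(z = z, p.value = p)
}

# SSHA example: H0: mu = 115 against Ha: mu > 115, with sigma = 30 and n = 25.
ssha <- lapply(c(118.6, 120.3, 128.2), z_test,
               mu0 = 115, sigma = 30, n = 25, alternative = "greater")
ssha_z <- sapply(ssha, function(r) r$z)
ssha_p <- sapply(ssha, function(r) r$p.value)
print(round(ssha_z, 2))
print(round(ssha_p, 4))
```

Of the three candidate means, only 128.2 lies more than two standard deviations (of the sample mean) above 115, which is why it is the only one with a right-tail P-value below 0.05.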

Interpretation of P-values The P-value of a test of H0 is the probability, computed assuming that H0 is true, that the test statistic would take a value as extreme as or more extreme than that actually observed. Usually, the P-value is compared with a threshold value called the significance level (denoted α). A P-value smaller than α leads to rejection of H0; the result is then said to be statistically significant at level α.

Example: (Plasma Aldosterone in Dogs) Aldosterone is a hormone involved in maintaining fluid balance in the body. In a veterinary study, 8 dogs with heart failure were treated with the drug Captopril, and plasma concentrations of aldosterone were measured before and after the treatment. Suppose that the before-after change (before – after) in concentration has a normal distribution with standard deviation 15. Test the claim that the drug Captopril has an effect of reducing plasma concentrations of aldosterone. Interpret the P-value.

Example: Left-sided Test Here are the IQ scores of 31 seventh-grade girls randomly chosen from a school district: 114 100 104 89 102 91 114 114 103 105 108 130 98 122 111 118 108 116 86 72 111 103 74 112 107 103 98 96 112 112 93 The sample mean is 105.8387. Suppose that the standard deviation of IQ scores in the population is known to be 15. IQ scores in a broad population are supposed to have mean µ = 110. Is there evidence that the mean in this district is less than 110?

Example: Two-sided Test Here are the IQ scores of 31 seventh-grade girls randomly chosen from a school district: 114 100 104 89 102 91 114 114 103 105 108 130 120 132 111 128 118 119 86 72 111 103 74 112 107 103 98 96 112 112 93 Suppose that the standard deviation of IQ scores in the population is known to be 15. IQ scores in a broad population are supposed to have mean µ = 110. Is there evidence that the mean in this district differs from 110?

Comparison between Confidence Intervals and Tests of Significance A confidence interval for a parameter gives a set of plausible values of the parameter at a given confidence level. A test of significance makes a decision about whether a claimed value is a plausible value of the parameter considered.

Test of Significance Using Confidence Intervals A level α two-sided significance test rejects a null hypothesis H0: µ = µ0 exactly when the value µ0 falls outside a level 1 – α CI for µ. A level α one-sided significance test rejects a null hypothesis H0: µ = µ0 exactly when the value µ0 falls outside a level 1 – 2α CI for µ on the side specified by Ha.
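The two-sided statement can be checked numerically. The sketch below reuses the SSHA setting (µ0 = 115, σ = 30, n = 25) with a known-σ z interval; the helper `check_duality` and the choice α = 0.05 are illustrative assumptions, not part of the slides.

```r
# Illustrative check: a two-sided z test at level alpha rejects H0 exactly
# when mu0 lies outside the level (1 - alpha) z confidence interval.
mu0 <- 115; sigma <- 30; n <- 25; alpha <- 0.05
se <- sigma / sqrt(n)                        # standard deviation of the sample mean

check_duality <- function(xbar) {
  z <- (xbar - mu0) / se
  p_two_sided <- 2 * pnorm(abs(z), lower.tail = FALSE)
  half_width <- qnorm(1 - alpha / 2) * se    # margin of error of the CI
  ci <- c(xbar - half_width, xbar + half_width)
  c(reject = p_two_sided < alpha,
    mu0_outside_ci = mu0 < ci[1] || mu0 > ci[2])
}

print(check_duality(128.2))   # both entries agree
print(check_duality(120.3))   # both entries agree
```

For each sample mean the two entries agree: the test rejects H0 exactly when 115 lies outside the interval.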

The 1-Sample t Test • Draw an SRS of size n from a large population. To test the hypothesis H0: µ = µ0, compute the 1-sample t statistic t = (x̄ − µ0) / (s/√n), where x̄ is the sample mean and s is the sample standard deviation. • The P-value for a test of H0 against Ha: µ > µ0 is the right tail area under the t(n-1) density curve beyond the t statistic value; against Ha: µ < µ0, it is the left tail area beyond the t statistic value; against Ha: µ ≠ µ0, it is twice the right tail area beyond |t|, the absolute value of the t statistic. • The t test and the t confidence interval do not change very much when the conditions for use of the procedure are mildly violated (e.g., the distribution of the population is slightly skewed). Both procedures are said to be robust.

Example: The 1-Sample t Test • The 1-sample t statistic for testing H0: µ = 10 versus Ha: µ > 10 from a sample of n = 15 observations has the value t = 1.82. • What are the degrees of freedom for this statistic? • Find the P-value. Is the value t = 1.82 significant at the 5% level of significance? Is it significant at the 1% level? (Answer: P-value is between 0.025 and 0.05)
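As a quick check of the stated answer, the right-tail area can be computed in R with `pt`. This is only a sketch of the calculation; the variable names are illustrative.

```r
# One-sided P-value for t = 1.82 with n = 15 observations (Ha: mu > 10).
t_stat <- 1.82
n <- 15
df <- n - 1                                   # degrees of freedom
p_value <- pt(t_stat, df = df, lower.tail = FALSE)   # right tail area under t(df)

print(df)
print(p_value)
sig_5 <- p_value < 0.05    # significant at the 5% level?
sig_1 <- p_value < 0.01    # significant at the 1% level?
print(c(sig_5 = sig_5, sig_1 = sig_1))
```

The computed P-value falls between 0.025 and 0.05, so the result is significant at the 5% level but not at the 1% level.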

Example: The 1-Sample t Test We wish to see if the dial indicating the oven temperature for a certain model oven is properly calibrated. Four ovens of this model are selected at random. The dial on each is set to 300ºF, and, after one hour, the actual temperature of each is measured. The temperatures measured are 305º, 310º, 300º, and 305º. Assuming that the actual temperatures for this model when the dial is set for 300º are normally distributed with mean µ, we test whether the dial is properly calibrated by testing the hypotheses H0: µ = 300 versus Ha: µ ≠ 300. Find the P-value for this test.
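One way to carry out this calculation is with R's built-in `t.test`. The sketch below takes the hypothesized mean to be 300, the dial setting, which is an assumption wherever the slide leaves µ0 unspecified.

```r
# Two-sided one-sample t test for the oven temperatures.
# Assumption: the hypothesized mean is the dial setting, 300 degrees F.
temps <- c(305, 310, 300, 305)
oven <- t.test(temps, mu = 300, alternative = "two.sided")

print(oven$statistic)   # t statistic
print(oven$parameter)   # degrees of freedom, n - 1 = 3
print(oven$p.value)     # two-sided P-value
```

With only 3 degrees of freedom, the two-sided P-value lands between 0.05 and 0.10, so the evidence of miscalibration is suggestive but not significant at the 5% level.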

Example: The 1-Sample t Test Do students tend to improve their SAT mathematics (SAT-M) score the second time they take the test? A random sample of four students who took the test twice received the following scores.

Student        1    2    3    4
First score    450  520  720  600
Second score   440  600  720  630

Assuming that the change in SAT-M score (second score - first score) for the population of all students taking the test twice is normally distributed with mean µ, are we convinced that retaking the test improves scores? Find the P-value for an appropriate test. This example shows what is termed the matched pairs t procedure. The design is called a matched pairs design, in which subjects are matched in pairs and each treatment is given to one subject in each pair.
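The matched pairs procedure reduces to a one-sample t test on the differences. A minimal R sketch, assuming the appropriate hypotheses are H0: µ = 0 against Ha: µ > 0 for the mean change (second score minus first score):

```r
# Matched pairs t test on the SAT-M changes (second score - first score).
# Assumed hypotheses: H0: mu = 0 versus Ha: mu > 0.
first  <- c(450, 520, 720, 600)
second <- c(440, 600, 720, 630)
change <- second - first                      # -10, 80, 0, 30

sat <- t.test(change, mu = 0, alternative = "greater")
print(mean(change))     # mean improvement
print(sat$statistic)    # t statistic on n - 1 = 3 degrees of freedom
print(sat$p.value)      # one-sided P-value
```

The same result is obtained from `t.test(second, first, paired = TRUE, alternative = "greater")`. The P-value falls between 0.15 and 0.20, so these four students provide little evidence that retaking the test improves scores.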

Two Sample t Procedure: Test of Significance • Draw an SRS of size n1 from a normal population with unknown mean µ1, and draw an independent SRS of size n2 from another normal population with unknown mean µ2. To test the hypothesis H0: µ1 = µ2, calculate the 2-sample t statistic t = (x̄1 − x̄2) / √(s1²/n1 + s2²/n2), where x̄1, x̄2 are the sample means and s1, s2 are the sample standard deviations. The P-value is computed under the t density curve with degrees of freedom equal to the smaller of n1 – 1 and n2 – 1. For a test of H0 against Ha: µ1 > µ2, it is the right tail area beyond the t statistic value; against Ha: µ1 < µ2, it is the left tail area beyond the t statistic value; against Ha: µ1 ≠ µ2, it is twice the right tail area beyond |t|, the absolute value of the t statistic. • Warning: Data from paired designs cannot be analyzed using this procedure.

Example: the 2-Sample t Test Procedure A researcher wished to compare the effect of two stepping heights (low and high) on heart rate in a step-aerobics workout. A collection of fifty adult volunteers was randomly divided into two groups of twenty-five subjects each. Group 1 did a standard step-aerobics workout at the low height. The mean heart rate at the end of the workout for the subjects in Group 1 was x̄1 = 90.00 beats per minute with a standard deviation s1 = 9 beats per minute. Group 2 did the same workout but at the high step height. The mean heart rate at the end of the workout for the subjects in Group 2 was x̄2 = 95.08 beats per minute with a standard deviation s2 = 12 beats per minute. Assume that the two groups are independent and the data are approximately normal. Let µ1 and µ2 represent the mean heart rates we would observe for the entire population represented by the volunteers, if all members of this population did the workout using the low or high step height, respectively. Suppose the researcher had wished to test the hypotheses H0: µ1 = µ2 against Ha: µ1 < µ2. The P-value for the test is (use the conservative value for the degrees of freedom) A. larger than 0.10. B. between 0.10 and 0.05. C. between 0.05 and 0.01.
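Because only summary statistics are given, the 2-sample t statistic can be computed directly. The R sketch below uses the reported group means (90.00 and 95.08), standard deviations (9 and 12), and group sizes (25 each), with the conservative degrees of freedom; the variable names are illustrative.

```r
# Two-sample t test from summary statistics, Ha: mu1 < mu2.
xbar1 <- 90.00; s1 <- 9;  n1 <- 25    # Group 1 (low step height)
xbar2 <- 95.08; s2 <- 12; n2 <- 25    # Group 2 (high step height)

se <- sqrt(s1^2 / n1 + s2^2 / n2)     # standard error of xbar1 - xbar2
t_stat <- (xbar1 - xbar2) / se        # 2-sample t statistic
df_conservative <- min(n1 - 1, n2 - 1)
p_value <- pt(t_stat, df = df_conservative)   # left tail area for Ha: mu1 < mu2

print(c(se = se, t = t_stat, df = df_conservative, p = p_value))
```

The standard error is 3, the t statistic is about -1.69 on 24 degrees of freedom, and the left-tail P-value falls between 0.05 and 0.10 (choice B).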

Wilcoxon-Mann-Whitney Test • The W-M-W test is used to compare two independent samples. It is a competitor to the t test, but unlike the t test, the W-M-W test is valid even if the population distribution is not normal. It is therefore called a distribution-free test. In addition, because this test does not focus on any particular parameter, it is also called a nonparametric test. • The W-M-W test is also known as the Wilcoxon rank-sum test.

Example: Soil Respiration • Soil respiration is a measure of microbial activity in soil, which affects plant growth. In one study, soil cores were taken from two locations in a forest: (1) under an opening in the forest canopy (the “gap” location) and (2) at a nearby area under heavy tree growth (the “growth” location). The amount of carbon dioxide given off by each soil core was measured (in mol CO2/g soil/hr). Here are the data: Growth: 17, 20, 170, 315, 22, 190, 64 Gap: 22, 29, 13, 16, 15, 18, 14, 6 • Test H0: the gap and growth areas do not differ with respect to soil respiration against Ha: soil respiration tends to be greater in the growth area than in the gap area.

R code
  # X: growth-area measurements, Y: gap-area measurements
  X = c(17, 20, 170, 315, 22, 190, 64)
  Y = c(22, 29, 13, 16, 15, 18, 14, 6)
  # One-sided rank-sum test without continuity correction
  wilcox.test(X, Y, alternative = "greater", correct = FALSE)
Results:
  Wilcoxon rank sum test
  data: X and Y
  W = 49.5, p-value = 0.00638
  alternative hypothesis: true location shift is greater than 0

The (Wilcoxon) Signed Rank Test for Two Paired Samples • The signed rank test is a nonparametric test that can be used to compare two paired samples. • Not particularly powerful but very flexible and simple to use.

Example: Skin Grafts • Skin from cadavers can be used to provide temporary skin grafts for severely burned patients. The longer such a graft survives before its inevitable rejection by the immune system, the more the patient benefits. A medical team investigated the usefulness of matching graft to patient with respect to the HL-A antigen system. Each patient received two grafts, one with close HL-A compatibility and the other with poor compatibility. The survival times (in days) of the skin grafts are shown here: Notice that a t test could not be applied here because two of the observations are incomplete: patient 3 died with a graft still surviving and observation on patient 10 was incomplete for an unspecified reason. Carry out a sign test to compare the survival times of the two sets of skin grafts. The null hypothesis is H0: The survival time distribution is the same for close compatibility as it is for poor compatibility against the directional alternative Ha: Skin grafts tend to last longer when the HL-A compatibility is close.

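R code
  # X: close HL-A compatibility, Y: poor HL-A compatibility (survival times in days)
  X = c(37, 19, 57, 93, 16, 23, 20, 63, 29, 60, 18)
  Y = c(29, 13, 15, 26, 11, 18, 26, 43, 18, 42, 19)
  # One-sided signed rank test on the paired differences, no continuity correction
  wilcox.test(X, Y, paired = TRUE, alternative = "greater", correct = FALSE)
Results:
  Wilcoxon signed rank test
  data: X and Y
  V = 60.5, p-value = 0.007193
  alternative hypothesis: true location shift is greater than 0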

Skip Remaining Slides? • You read them.

Significance Tests for a Proportion • Draw an SRS of size n from a large population with unknown proportion p. To test the hypothesis H0: p = p0, calculate the 1-sample z statistic z = (p̂ − p0) / √(p0(1 − p0)/n), where p̂ is the sample proportion. The P-value is computed under the standard normal density curve. For a test of H0 against Ha: p > p0, it is the right tail area beyond the z statistic value; against Ha: p < p0, it is the left tail area beyond the z statistic value; against Ha: p ≠ p0, it is twice the right tail area beyond |z|, the absolute value of the z statistic.

Examples: Significance Tests for a Proportion • A Gallup Poll asked a sample of Canadian adults if they thought the law should allow doctors to end the life of a patient who is in great pain and near death if the patient makes a request in writing. The poll included 270 people in Quebec, 221 of whom agreed that doctor-assisted suicide should be allowed. Is the poll evidence that the majority of people in Quebec favor doctor-assisted suicide? • Flip a coin 25 times and the heads side appears 13 times. Is the coin balanced?
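Both examples can be worked with the standard large-sample z statistic for a proportion, z = (p̂ − p0) / √(p0(1 − p0)/n). The R sketch below is illustrative only: the helper `prop_z_test` is hypothetical, "majority" is taken to mean Ha: p > 0.5, and "balanced" is taken to mean H0: p = 0.5 against a two-sided alternative.

```r
# Hypothetical helper (not from the slides): large-sample z test for a proportion.
prop_z_test <- function(x, n, p0,
                        alternative = c("greater", "less", "two.sided")) {
  alternative <- match.arg(alternative)
  p_hat <- x / n                                   # sample proportion
  z <- (p_hat - p0) / sqrt(p0 * (1 - p0) / n)      # standard error uses p0
  p <- switch(alternative,
    greater   = pnorm(z, lower.tail = FALSE),
    less      = pnorm(z),
    two.sided = 2 * pnorm(abs(z), lower.tail = FALSE)
  )
  list(p_hat = p_hat, z = z, p.value = p)
}

# Quebec poll: 221 of 270 agree; assumed Ha: p > 0.5 (a majority).
quebec <- prop_z_test(221, 270, 0.5, "greater")
# Coin: 13 heads in 25 flips; assumed Ha: p != 0.5 (not balanced).
coin <- prop_z_test(13, 25, 0.5, "two.sided")

print(unlist(quebec))
print(unlist(coin))
```

The poll gives a z statistic above 10, so the evidence for a majority is overwhelming, whereas the coin gives z = 0.2 and a large P-value, so there is no evidence that it is unbalanced. With only 25 flips the normal approximation is rough.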

How Small a P-value is Convincing? The P-value quantifies the degree of evidence provided by the sample against the null hypothesis. The smaller the P-value, the stronger the evidence. How small is small? Answers vary. Reporting the P-value allows each of us to decide individually whether the evidence is sufficiently strong. When we say that the evidence provided by the sample is sufficiently strong (indicated by a very small P-value), we mean that the result is significant and the null hypothesis should be rejected.

Sample Size Affects Statistical Significance Large samples can capture even tiny deviations from the null hypothesis; that is, large samples tend to produce significant results. On the other hand, small samples can miss even large deviations from the null hypothesis; that is, small samples tend to produce non-significant results.
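The effect of sample size can be seen by holding the observed deviation fixed and varying n. The numbers in this R sketch are hypothetical (an observed deviation of 2 units from the null value, with known σ = 15) and are not taken from the slides.

```r
# Hypothetical illustration: the same observed deviation from the null value,
# tested with a two-sided z test at several sample sizes.
deviation <- 2                       # xbar - mu0 (made-up value)
sigma <- 15                          # assumed known population sd
n <- c(25, 100, 400, 2500)

z <- deviation / (sigma / sqrt(n))   # z statistic grows like sqrt(n)
p <- 2 * pnorm(abs(z), lower.tail = FALSE)

print(data.frame(n = n, z = round(z, 2), p.value = signif(p, 3)))
```

The same deviation is far from significant at n = 25 and n = 100 but clearly significant at n = 400 and beyond.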

Statistical Significance and Practical Significance When a null hypothesis is rejected at a significance level of, say, α = 0.05, there is good evidence that an effect is present. But that effect may be so small that it can be ignored in practice; it may have been detected only because the sample size is very large. Statistical significance does not tell us whether an effect is large enough to be important. That is, statistical significance and practical significance are not the same. The author of the textbook suggests that confidence intervals be used more often than tests of significance, because a confidence interval estimates the size of an effect, while a test only answers whether the effect is too large to occur by chance alone.

Multiple Analyses Running one test and reaching the 5% level of significance is reasonably good evidence that you have found something, but running 20 tests and reaching the 5% level of significance only once is NOT. Even when every null hypothesis is true, we expect about 1 of 20 tests to be significant at the 5% level by chance alone (1 = 20 × 5%). A similar argument applies to confidence intervals: a single 95% confidence interval has probability 0.95 of capturing the true parameter each time you use it, but the probability that all of 20 confidence intervals capture their parameters is much less than 0.95.
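The arithmetic behind this argument is short enough to check in R. The sketch assumes the 20 tests (or 20 intervals) are independent and that every null hypothesis is true; the slide does not state independence, so treat it as an added assumption.

```r
# Multiple analyses: 20 tests at the 5% level, all null hypotheses true.
# Assumption (not stated on the slide): the tests are independent.
alpha <- 0.05
k <- 20

expected_false_positives <- k * alpha     # expected number of "significant" tests
p_at_least_one <- 1 - (1 - alpha)^k       # chance of at least one false positive
p_all_ci_cover <- 0.95^k                  # chance all 20 95% CIs capture their parameters

print(expected_false_positives)
print(round(p_at_least_one, 4))
print(round(p_all_ci_cover, 4))
```

Under these assumptions, at least one spurious significant result turns up almost two times in three, and all 20 intervals capture their parameters only a little more than one time in three.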

Example of Multiple Analyses • Suppose that 60 independent studies have been conducted to investigate the association between smoking and lung cancer. Two of the studies indicated that such an association exists (P-value < 0.05). • Is it proper to conclude from these two studies that the association really exists? – No. Even if there were no association, we would expect about 3 of the 60 studies to have a P-value of 0.05 or less by chance alone (3 = 60 × 0.05). • What should the researchers who conducted the two studies now do to test whether the association does exist? – Collect new data to repeat the study.
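To see how unremarkable two significant studies out of 60 would be, one can model the number of significant studies as binomial. This R sketch assumes the studies are independent and that there is truly no association, so each study has probability 0.05 of a P-value below 0.05.

```r
# Number of significant results among 60 independent studies when H0 is true:
# Binomial(60, 0.05).
n_studies <- 60
alpha <- 0.05

expected_significant <- n_studies * alpha            # about 3 studies
# P(2 or more significant studies) = P(X > 1)
p_two_or_more <- pbinom(1, size = n_studies, prob = alpha, lower.tail = FALSE)

print(expected_significant)
print(round(p_two_or_more, 3))
```

Under this model, seeing two or more significant studies is the typical outcome, so the two findings alone carry little weight.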
