
Statistical Analysis and Data Interpretation for Athletes, Statisticians, and Team Doctors

This article discusses the significance of statistics in health, exercise, and sports research. It covers topics such as different types of statistics, making inferences, monitoring individual athletes, and the importance of statistics in decision-making.



Presentation Transcript


  1. Will Hopkins, will@clear.net.nz, sportsci.org/will, Victoria University, Melbourne, Australia. Statistical Analysis and Data Interpretation: what is important for the athlete, the statistician and the team doctor in health, exercise and sport research? What is a Statistic? • Simple, effect, and inferential statistics. Making Inferences • Sampling variation; true effects; confidence limits; p values; magnitude-based inference; individual differences. Important Magnitudes of Effects • Means; correlations; slopes or gradients; proportions or risks; odds; hazards; counts. Monitoring Individual Athletes • Subjective and objective assessments.

  2. What is a Statistic? • Definition: a number summarizing an aspect of many numbers. • Examples: mean, correlation, confidence limit… • If the set of many numbers all represent different values of the same kind of thing, we call the numbers values of a numeric variable. • Example: 57, 73, 61, 60 kg are values of the variable body mass. • Values of a variable all have the same units. • A nominal, naming or grouping variable has levels or labels rather than numeric values. • Example: union, league, touch… are levels of the variable rugby. • Statistics are useful! • A statistic usually represents the big picture or some other important aspect of the original numbers. • The aspect is often not obvious in the original numbers. • Most people hate numbers, so one number is better than many.

  3. Simple statistic: an aspect of a set of values of one variable. • Sample size (n): the number of values. • Mean: the average value or center of the values. • Standard deviation (SD): the average scatter around the mean. • Used to evaluate magnitudes of differences in means. • Standard error of the mean (SD/√n): the expected variation in the mean with resampling. • A tricky statistical dinosaur. Do not use it! • Quantiles (median, tertiles, quartiles, quintiles…): values that divide the ranked set into 2, 3, 4, or 5 equal-sized subsets. • Use them when the data are skewed by large values (e.g., salaries). • Or use them to compare subgroups. Example: blood pressure in the quintile of lowest physical activity vs each quintile of higher activity. • Proportion or risk: the number of "events" (e.g., injured players) divided by the number of "trials" (total number of players). • Often expressed as a percent (= proportion×100). • Odds and hazard: statistics related to proportions.
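As a quick illustration, the simple statistics above can all be computed with Python's standard library. The body-mass values come from slide 2; the injury counts are assumed for the example:

```python
import statistics

# Body-mass values from slide 2; injury counts below are assumed
masses = [57, 73, 61, 60]           # kg

n = len(masses)                     # sample size
mean = statistics.mean(masses)      # centre of the values
sd = statistics.stdev(masses)       # average scatter around the mean
median = statistics.median(masses)  # middle of the ranked values

# Proportion or risk: "events" divided by "trials"
injured, players = 9, 30
risk = injured / players            # 0.3, often reported as 30%
```

Note that `statistics.stdev` gives the sample SD (n−1 denominator), which is the one used for evaluating magnitudes of differences in means.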

  4. Effect statistic: a relationship between a predictor or independent variable and a dependent or outcome variable. • Difference (or change) in mean: the predictor is a grouping variable and the dependent is numeric. • Slope (or gradient): the difference or change in the mean per difference in a simple linear numeric predictor. • Correlation coefficient: another form of the slope. • Ratio of proportions, risks, odds or hazards: statistics for comparing the occurrence (presence or absence) of something in two groups. • Ratio of counts: statistics for comparing counts of something in two groups. • Other variables can be included in the analysis as covariates. • Moderators or modifiers are interacted with the predictor to estimate how the effect differs between subjects. • Example: the modifying effect of athlete status on caffeine’s effect. • Mediators are added to try to explain the mechanism of an effect. • Example: how much of the effect of activity on health is due to age?

  5. Inferential statistic: an aspect of the "true" value of a simple or effect statistic derived from a sample. • Confidence interval or limits: the likely range of the true value. • P value: provides evidence about the zero or null value of an effect. • Chance of benefit, risk of harm: provide evidence about the true value for making clinical decisions. • Chances of substantially positive, substantially negative, and trivial: provide evidence about the true value for making non-clinical decisions. • T, F, chi-squared statistics: these "test" statistics are used to derive the other inferential statistics. • Only the statistician needs to know about these. • Their values do not have any meaning in the real world. • Do not show them in publications.

  6. Making Inferences (Decisions or Conclusions) • Every sample gives a different value for a statistic, owing to sampling variation. • So, the value of a sample statistic is only an estimate of the true (right, real, actual, very-large-sample, or population) value. • But people want to make an inference about the true value. • The best inferential statistic for this purpose is the confidence interval: the range within which the true value is likely to fall. • "Likely" is usually 95%, so there is a 95% chance the true value is included in the confidence interval (and a 5% chance it is not). • Confidence limits are the lower and upper ends of the interval. • The limits represent how small and how large the effect "could" be. • All effects should be shown with a confidence interval or limits. • Example: the dietary treatment produced an average weight loss of 3.2 kg (95% confidence interval 1.6 to 4.8 kg). • The confidence interval is NOT a range of individual responses! • But confidence limits alone don't provide a clinical inference.
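The slide's weight-loss example can be reproduced with a normal-approximation confidence interval. The standard error here is an assumed value chosen to match the quoted limits:

```python
# Slide's example: mean weight loss 3.2 kg; the standard error of
# ~0.82 kg is an assumed value chosen to reproduce the quoted interval
mean_effect = 3.2
standard_error = 0.816

z = 1.96  # standard-normal quantile for a 95% confidence interval
lower = mean_effect - z * standard_error
upper = mean_effect + z * standard_error
print(f"weight loss {mean_effect} kg (95% CI {lower:.1f} to {upper:.1f} kg)")
```

With a small sample a t quantile would replace 1.96, but the structure (estimate ± quantile × standard error) is the same.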

  7. Statistical significance is the traditional way to make inferences about the zero or null value. • Also known as the null-hypothesis significance test. • The inference is all about whether the effect could be zero or "null". • If the 95% confidence interval includes zero, the effect "could be zero". The effect is "statistically non-significant (at the 5% level)". • If the confidence interval does not include zero, the effect "couldn't be zero". The effect is "statistically significant (at the 5% level)". • Statistical procedures calculate a probability or p value for deciding whether an effect is significant. • p>0.05 means non-significant; p<0.05 means significant. • Researchers using p values should show exact values. [Figure: 95% confidence intervals plotted against the value of the effect statistic (e.g., change in weight), from negative to positive: an interval spanning zero is statistically non-significant (p=0.31); intervals entirely clear of zero are statistically significant (p=0.02, p=0.003).]

  8. The exact definition of the p value is hard to understand. • I will not explain, because I do not want you to use p values. • People usually interpret non-significant as "no real effect" and significant as "a real effect". • These interpretations apply only if the study was done with the right sample size. • But even then, they do not properly represent the uncertainty. • And you often do not know if the sample size is right. • Attempts to address this problem with post-hoc power calculations are rare, generally wrong, and too hard to understand. • So the only safe interpretation is whether the effect could be zero. • But the issue for the practitioner is not whether the effect could be zero, but whether the effect could be important. • Important has two meanings: beneficial and harmful. • The confidence interval addresses this issue, when clinically important values for benefit and harm are taken into account.

  9. Clinical or practical inferences with the confidence interval • The smallest clinically or practically important effects define values of the effect that are beneficial, harmful and trivial. • Smallest effects for benefit and harm are equal and opposite. • Infer (decide) the outcome from the confidence interval. [Figure: confidence intervals plotted against the value of the effect statistic (e.g., change in weight) on a scale running from harmful through trivial to beneficial, with the smallest clinically harmful and smallest clinically beneficial effects marked. Intervals entirely in the beneficial region: clear, use it (even when p>0.05, where p values fail). An interval spanning trivial and beneficial values: clear, but whether to use it depends. Intervals entirely in the trivial or harmful region: clear, don't use it (even when p<0.05). An interval spanning both benefit and harm: unclear, more data needed.]

  10. This approach eliminates statistical significance. • The only issue is what level to make the confidence interval. • To be careful about avoiding harm, you can make a conservative 99% confidence interval on the harm side. • And to use effects only when there is a reasonable chance of benefit, you can make a 50% interval on the benefit side. • But that's hard to understand. Consider this equivalent approach… • Inferences with probabilities of benefit and harm. • The uncertainty in an effect can be expressed as the chance that the true effect is beneficial and the risk that it is actually harmful. • You would use an effect with a reasonable chance of benefit, if it has a sufficiently low risk of harm. • I have opted for possibly beneficial (>25% chance of benefit) and most unlikely harmful (<0.5% chance of harm). • An effect with >25% chance of benefit and >0.5% risk of harm is therefore unclear. You would like to use it, but you dare not. • Everything else is either clearly useful or clearly not worth using.
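The chances of benefit and harm described above can be sketched with a normal approximation for the sampling distribution of the effect. The effect size, standard error and smallest important value below are all assumed example numbers:

```python
import math

def norm_cdf(x):
    """Cumulative distribution of the standard normal, via erf."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

# Assumed example: observed effect 1.0 units, standard error 0.5,
# smallest important effect +/-0.5 units
effect, se, smallest = 1.0, 0.5, 0.5

p_benefit = 1 - norm_cdf((smallest - effect) / se)  # chance true effect > +0.5
p_harm = norm_cdf((-smallest - effect) / se)        # risk true effect < -0.5

# Decision rule from the slide: usable if >25% chance of benefit
# and <0.5% risk of harm
usable = p_benefit > 0.25 and p_harm < 0.005
```

With these numbers the effect is possibly beneficial (~84% chance) and most unlikely harmful (~0.1% risk), so it would be used.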

  11. If the chance of benefit is high (e.g., 80%), you could accept a higher risk of harm (e.g., 5%). • I have formalized this less conservative approach by using a threshold odds ratio of 66 (odds of benefit to odds of harm). • When an effect has no obvious benefit or harm (e.g., a comparison of males and females), the inference is only about whether the effect could be substantially positive or negative. • For such non-clinical inferences, use a symmetrical confidence interval, usually 90% or 99%, to decide whether the effect is clear. • Equivalently, one of the chances of being substantially positive or negative has to be <5% for the effect to be clear ("a clear non-clinical effect can't be substantially positive and negative"). • Ways to report inferences for clear effects: possibly small benefit, likely moderately harmful, a large difference (clear at the 99% level), a trivial-moderate increase [the lower and upper confidence limits]… • In summary, make such magnitude-based inferences by showing confidence limits and interpreting the uncertainty in a (clinically) relevant way readers can understand.

  12. A caution about making an inference… • Whatever method you use, the inference is about the one and only mean effect in the population. • The confidence interval represents the uncertainty in the true effect, not a range of individual differences or individual responses. • For example, with a large sample size, a treatment could be clearly beneficial (mean is beneficial, with a narrow confidence interval)… • But the treatment could be harmful for a substantial proportion of the population. • Individual differences between groups and individual responses to a treatment are best summarized with a standard deviation, in addition to the mean effect. • The mean effect and the SD both need confidence limits. • Individual differences and individual responses may be accounted for by including subject characteristics as modifying covariates in the analysis. • Researchers generally neglect this important issue.

  13. Important Magnitudes of Effects • Researchers need the smallest important magnitude of an effect statistic to estimate sample size for a study. • For those who use the null-hypothesis significance test, the right sample size has 80% power (80% chance of statistical significance, p<0.05) if the true effect has the smallest important value. • For those who use clinical magnitude-based inference, the right sample size gives a 0.5% risk of harm and a 25% chance of benefit if the true effect has the smallest important beneficial value. • Practitioners need to know about important magnitudes to monitor their athletes or patients. • Researchers and practitioners need to know about important magnitudes to interpret research findings. • So the next slides are all about values for various magnitudes of various effect statistics.

  14. Differences or Changes in the Mean • This is the most common effect statistic for numbers with decimals (continuous variables). • Difference when comparing different groups, e.g., patients vs healthy. [Figure: strength in patients vs healthy; data are means & SD.] • In population-health studies, groups are often subdivided into quartiles or quintiles (e.g., of age). • Change when tracking the same subjects. [Figure: strength at pre, post1 and post2 trials; data are means & SD.] • Difference in the changes in controlled trials. Standardization for Effects on Means • The between-subject standard deviation provides default thresholds for important differences and changes. • You think about the effect (mean) in terms of a fraction or multiple of the SD (mean/SD). • The effect is said to be standardized.

  15. Example: the effect of a treatment on strength. [Figure: pre-post strength plots illustrating a trivial effect (0.1× SD) and a very large effect (3.0× SD).] • Interpretation of a standardized difference or change in means, Cohen vs Hopkins: trivial <0.2 (both); small 0.2-0.5 vs 0.2-0.6; moderate 0.5-0.8 vs 0.6-1.2; large >0.8 vs 1.2-2.0; very large (no Cohen value) vs 2.0-4.0; extremely large (no Cohen value) vs >4.0. • Complete scale thresholds: 0.2, 0.6, 1.2, 2.0, 4.0 for trivial/small, small/moderate, moderate/large, large/very large and very large/extremely large.
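The Hopkins thresholds for standardized differences can be turned into a small classifier. The strength gain and baseline SD in the example are assumed values:

```python
def hopkins_magnitude(standardized):
    """Classify a standardized difference or change in means
    (mean/SD) on the Hopkins scale: 0.2, 0.6, 1.2, 2.0, 4.0."""
    d = abs(standardized)
    for threshold, label in [(0.2, "trivial"), (0.6, "small"),
                             (1.2, "moderate"), (2.0, "large"),
                             (4.0, "very large")]:
        if d < threshold:
            return label
    return "extremely large"

# Assumed example: a 6 kg gain in strength against a baseline
# between-subject SD of 10 kg is a standardized effect of 0.6
print(hopkins_magnitude(6 / 10))  # moderate
```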

  16. Cautions with standardizing • Beware of authors who show standard errors of the mean (SEM) rather than standard deviations (SD). • SEM = SD/√(sample size), so SEM bars on graphs make effects look a lot bigger than they really are. • Standardizing works only when the SD comes from a sample that is representative of a well-defined population. • The resulting magnitude applies only to that population. • In a controlled trial, use the baseline (pre-test) SD. • Standardization may not be best for effects on means of some special variables: visual-analog scales, Likert scales, athletic performance…

  17. Visual-analog scales • The respondents indicate a perception on a line: Rate your pain by placing a mark on this scale [a line anchored at "none" and "unbearable"]. • Score the response as percent of the length of the line. • Magnitude thresholds: 10%, 30%, 50%, 70%, 90% for small, moderate, large, very large, extremely large differences or changes. Likert scales • Example: How easy or hard was the training session today? very easy / easy / moderate / hard / very hard. • Most Likert-type questions have four to seven choices. • Code and analyze them as integers (1, 2, 3, 4, 5…). • Then either rescale them to range from 0-100 and use the same thresholds as for visual-analog scales, • or use equivalent thresholds with the original scale. For example, thresholds for a 6-pt scale (range = 5): 0.5, 1.5, 2.5, 3.5 and 4.5.
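The rescaling of Likert codes to a 0-100 range is a one-line linear map; the sample responses below are assumed:

```python
def rescale_likert(score, n_points):
    """Rescale a Likert response coded 1..n_points to 0-100, so the
    visual-analog thresholds (10, 30, 50, 70, 90) can be applied."""
    return 100 * (score - 1) / (n_points - 1)

# Assumed responses on the 5-point session question:
# "hard" (4) becomes 75 on the 0-100 scale
responses = [1, 3, 4, 5]
rescaled = [rescale_likert(s, 5) for s in responses]  # [0.0, 50.0, 75.0, 100.0]
```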

  18. Measures of Athletic Performance • Fitness tests and performance indicators of team-sport athletes: • Until you know how changes in tests or indicators of individual athletes affect chances of winning, standardize the scores with the SD of players in each on-field position. • Competitions or matches between top or evenly matched athletes or teams: • A small, moderate, large, very large or extremely large enhancement produces 1, 3, 5, 7 or 9 extra medals or wins for every 10 events or matches. • For matches, analyze the extra wins directly–see later. • For competitions, if there is little interaction between athletes, winning extra medals needs improvements in time or distance. • The within-athlete variability that athletes show from competition to competition determines the improvements. Here's why… • Owing to this variability, each of the top athletes has a good chance of winning at each competition…

  19. [Figure: finishing times of the same top athletes across Race 1, Race 2 and Race 3, showing within-athlete variability from competition to competition.] • Your athlete needs an enhancement that overcomes this variability to give her or him a bigger chance of a medal. • Simulations show that an enhancement of 0.3× the variability gives one extra medal every 10 competitions. • (In some early publications I mistakenly stated ~0.5× the variability!) • Example: if the variability is an SD (or CV) of 1%, the smallest important enhancement is 0.3%. • Similarly, 0.9, 1.6, 2.5 and 4.0× the variability give 3, 5, 7 and 9 extra medals every 10 competitions. • Hence this scale for enhancements as factors of the variability: 0.3, 0.9, 1.6, 2.5, 4.0 for trivial/small, small/moderate, moderate/large, large/very large and very large/extremely large.

  20. Beware: smallest effects on athletic performance in performance tests depend on the method of measurement, because… • A percent change in an athlete's ability to output power results in different percent changes in performance in different tests. • These differences are due to the power-duration relationship for performance and the power-speed relationship for different modes of exercise. • Example: a 1% change in endurance power output produces the following changes… • 1% in running time-trial speed or time; • ~0.4% in road-cycling time-trial time; • 0.3% in rowing-ergometer time-trial time; • ~15% in time to exhaustion in a constant-power test. • A hard-to-interpret change in any test following a fatiguing pre-load. (But such tests can be interpreted for cycling road races: see Bonetti and Hopkins, Sportscience 14, 63-70, 2010.)

  21. Slope (or Gradient) • Used when the predictor and dependent are both numeric and a straight line fits the trend. [Figure: scatterplot of physical activity vs age with a fitted line, and 2 SD of age marked on the predictor axis.] • The unit of the predictor is arbitrary. • Example: a 2% per year decline in activity seems trivial… yet 20% per decade seems large. • So it's best to express a slope as the difference in the dependent per two SDs of the predictor. • It gives the difference in the dependent (physical activity) between a typically low and a typically high subject. • The SD for standardizing the resulting effect is the standard error of the estimate (the scatter about the line).
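Expressing a slope per two SDs of the predictor can be sketched with a small least-squares fit; the age and activity values are assumed example data:

```python
import statistics

# Assumed example data: age (years) vs physical activity (h/week)
age = [25, 35, 45, 55, 65]
activity = [8.0, 7.2, 6.1, 5.5, 4.7]

# Ordinary least-squares slope (change in activity per year of age)
mean_x, mean_y = statistics.mean(age), statistics.mean(activity)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(age, activity))
         / sum((x - mean_x) ** 2 for x in age))

# Re-express per 2 SD of the predictor: the difference in activity
# between a typically low and a typically high subject
sd_age = statistics.stdev(age)
effect_per_2sd = slope * 2 * sd_age
```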

  22. Correlation Coefficient • Closely related to the slope, this represents the overall linearity in a scatterplot. [Figure: example scatterplots for r = 0.00, 0.10, 0.30, 0.50, 0.70, 0.90 and 1.00.] • Negative values represent negative slopes. • The value is unaffected by the scaling of the two variables. • And it's much easier to calculate than a slope. • But a properly calculated slope is easier to interpret clinically. • And correlations for athletic performance are hard to interpret. • Smallest important correlation is ±0.1. Complete scale: 0.1, 0.3, 0.5, 0.7, 0.9 for trivial/low, low/moderate, moderate/high, high/very high and very high/extremely high. • My scale for standardized differences in means, and my use of two SDs to evaluate slopes, are both derived from this scale.
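The correlation scale can likewise be sketched in code; the paired data below are assumed example values:

```python
import math
import statistics

def magnitude_of_r(r):
    """Hopkins scale for correlations: 0.1, 0.3, 0.5, 0.7, 0.9."""
    r = abs(r)
    for threshold, label in [(0.1, "trivial"), (0.3, "low"),
                             (0.5, "moderate"), (0.7, "high"),
                             (0.9, "very high")]:
        if r < threshold:
            return label
    return "extremely high"

# Assumed paired data; Pearson r computed from first principles
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]
mx, my = statistics.mean(x), statistics.mean(y)
r = (sum((a - mx) * (b - my) for a, b in zip(x, y))
     / math.sqrt(sum((a - mx) ** 2 for a in x)
                 * sum((b - my) ** 2 for b in y)))
print(round(r, 2), magnitude_of_r(r))
```

(Python 3.10+ also offers `statistics.correlation`; the explicit formula is used here to stay version-independent.)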

  23. Differences and Ratios of Proportions, Risks, Odds, Hazards • Example: the effect of sex (female, male) on risk of injury in football. • Express the injuries as a proportion of all players. [Figure: bar chart of proportion injured (%) by sex: males a = 75%, females b = 36%.] • Risk difference or proportion difference • A common measure. Example: a−b = 75%−36% = 39%. • Problem: the sense of magnitude of a given difference depends on how big the proportions are. • Example: for a 10% difference, 90% vs 80% does not seem big, but… 11% vs 1% can be interpreted as a huge "difference" (11× the risk).

  24. • Another problem: the proportion difference is no good for time-dependent proportions (e.g., injuries). • For very short monitoring periods the proportions in both groups are ~0%, so the proportion difference is ~0%. • Similarly, for very long monitoring periods the proportions in both groups are ~100%, so the proportion difference is ~0%. • So there is no scale of magnitudes for a risk or proportion difference. • Exception: effects on winning a close match can be expressed as a proportion difference: 55% vs 45% is a 10% difference or 1 extra match in every 10 matches; 65% vs 35% is 3 extra, and so on. • Hence this scale for extra matches won or lost per 10 matches: 1, 3, 5, 7, 9 for small, moderate, large, very large, extremely large. • But the analyses (models) don't work properly with proportions. • We have to analyze hazards or odds instead of proportions. • I will explain shortly.

  25. Risk ratio (relative risk) or proportion ratio • Another common measure. Example: a/b = 75/36 = 2.1, which means males are "2.1 times more likely" to be injured, or "a 110% increase in risk" of injury for males. [Figure: bar chart of proportion injured (%) by sex: males a = 75%, females b = 36%.] • Problem: if it's a time-dependent measure, and you wait long enough, everyone gets affected, so risk ratio = 1.00. • But it works for rare time-dependent risks and for time-independent classifications (e.g., proportion playing a sport). • Magnitude thresholds? Small, moderate, large, very large and extremely large risk ratios occur when, for every 10 males injured, the number of females injured is 9, 7, 5, 3 or 1. • So the ratios are 10/9, 10/7, 10/5, 10/3 and 10/1. • Hence this complete scale for proportion ratio and low-risk ratio: 1.11, 1.43, 2.0, 3.3, 10, • and the inverses for reductions in proportions: 0.9, 0.7, 0.5, 0.3, 0.1.
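The slide's ratio scale can be applied programmatically, using the 75% vs 36% injury proportions from the example:

```python
# Risk (proportion) ratio from the slide's example
a, b = 0.75, 0.36
risk_ratio = a / b   # ~2.1: males "2.1 times more likely" to be injured

def magnitude_of_ratio(ratio):
    """Hopkins thresholds for proportion/hazard/count ratios:
    1.11, 1.43, 2.0, 3.3, 10 (inverses 0.9, 0.7, 0.5, 0.3, 0.1)."""
    if ratio < 1:
        ratio = 1 / ratio   # evaluate reductions on the same scale
    for threshold, label in [(1.11, "trivial"), (1.43, "small"),
                             (2.0, "moderate"), (3.3, "large"),
                             (10, "very large")]:
        if ratio < threshold:
            return label
    return "extremely large"

print(round(risk_ratio, 1), magnitude_of_ratio(risk_ratio))
```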

  26. Hazard ratio for time-dependent events. • To understand hazards, consider the increase in proportions with time. [Figure: proportion injured (%) vs time (months) for males (a) and females (b), both curves rising toward 100%.] • Over a very short period, the risk in both groups is tiny, and the risk ratio is independent of time. • Example: risk for males = a = 0.28% per 1 d = 0.56% per 2 d; risk for females = b = 0.11% per 1 d = 0.22% per 2 d. So risk ratio = a/b = 0.28/0.11 = 0.56/0.22 = 2.5. That is, males are 2.5× more likely to get injured per unit time, whatever the (small) unit of time. • The risk per unit time is called a hazard or incidence rate. • Hence hazard ratio, incidence-rate ratio or “right-now” risk ratio. • Magnitude thresholds are the same as for the proportion ratio: 1.11, 1.43, 2.0, 3.3, 10, • and the inverses 0.9, 0.7, 0.5, 0.3, 0.1.

  27. Odds ratio for time-independent proportions or classifications. • Odds are the awkward but only way to analyze classifications. • Example: proportion of males and females playing a school sport. [Figure: stacked bar chart of proportion playing (%) by sex: males a = 75% playing, c = 25% not playing; females b = 36% playing, d = 64% not playing.] • Odds of a male playing = a/c = 75/25. • Odds of a female playing = b/d = 36/64. • Odds ratio = (75/25)/(36/64) = 5.3. • The odds ratio can be interpreted as "…times more likely" only when the proportions in both groups are small (<10%). • The odds ratio is then approximately equal to the proportion ratio. • When one or both proportions are >10%, you must convert the odds ratio and its confidence limits into a proportion ratio to interpret the magnitude.
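The odds-ratio calculation and its conversion back to a proportion ratio can be sketched with the slide's numbers. The conversion formula is the standard algebraic one (not stated on the slide):

```python
# Slide's example: 75% of males vs 36% of females playing a sport
p_male, p_female = 0.75, 0.36

odds_male = p_male / (1 - p_male)        # 75/25 = 3.0
odds_female = p_female / (1 - p_female)  # 36/64 = 0.5625
odds_ratio = odds_male / odds_female     # ~5.3

def odds_to_proportion_ratio(odds_ratio, p_ref):
    """Convert an odds ratio to a proportion ratio, given the
    proportion in the reference (comparison) group."""
    return odds_ratio / (1 - p_ref + p_ref * odds_ratio)

# Recovers the proportion ratio 75/36 = ~2.1, much smaller than 5.3
proportion_ratio = odds_to_proportion_ratio(odds_ratio, p_female)
```

This illustrates the slide's warning: with common outcomes (>10%), the odds ratio (5.3) badly overstates the "times more likely" interpretation (2.1).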

  28. Ratio of Counts • Example: 93 vs 69 injuries per 1000 player-hours of match play in sport A vs sport B. • The effect is expressed as a ratio: 93/69 = 1.35× more injuries. • It can also be expressed as 35% more injuries. • The scale of magnitudes is the same as for the ratio of proportions: 1.11, 1.43, 2.0, 3.3, 10, • and the inverses 0.9, 0.7, 0.5, 0.3, 0.1. • Effects of numeric linear predictors (slopes) for ratio outcomes are analyzed as hazard, odds or count ratios per unit of the predictor and evaluated as hazard, proportion or count ratios per 2 SD. • When each individual has a count or proportion of something, you can use standardization to define the magnitude thresholds. • Example: counts of total tackles or proportions of successful tackles of the players in football matches.

  29. Modeling (Analyzing) Effects • Estimates and inferential statistics for mean effects and slopes come from various kinds of general linear model… • t tests, simple and multiple linear regression, ANOVA… • Use mixed linear models for repeated measures and clustering. • Testing for normality is pointless; uniformity (of error) is the real issue. • Many effects are more uniform when estimated as percents or ratios via analysis of the log-transformed dependent variable. • Bootstrapping of confidence limits works with difficult data. • Ratios of odds, hazards and counts need various kinds of generalized linear model… • All include log transformation to estimate ratios. • Logistic (log-odds) regression for odds, log-hazard and Cox regression for hazards, Poisson (log-count) regression for counts.

  30. Monitoring an Individual (Athlete) • It’s usually about any change since the last assessment. • The subjective assessments (perceptions) of the athlete, coach, and support personnel provide important evidence. • Their assessments of change can have high validity. • Objective assessments of change with an instrument or test are contaminated with error or "noise". • The noise is represented by the standard deviation of repeated measurements: the standard (or typical) error of measurement. • Think of ± the error as the equivalent of confidence limits for the athlete's true change. • Take into account clinically or practically important changes. • “Wow, you've made a moderate improvement!” • “No real change today.” [A very reliable test is needed for this.] • “Hmmm… I can’t say whether you’re getting better or worse.” [This is the more usual scenario!]
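The three monitoring verdicts above can be sketched as a decision rule. The ± typical-error band here uses z = 1.0 (roughly 68% limits) as an assumed simplification, and all the numbers in the example are assumed:

```python
def interpret_change(change, typical_error, smallest_important, z=1.0):
    """Interpret an athlete's observed test change against the noise.
    The likely range for the true change is taken as roughly
    change +/- z * typical_error (z=1.0, an assumed simplification)."""
    lower = change - z * typical_error
    upper = change + z * typical_error
    if lower > smallest_important:
        return "substantial improvement"
    if upper < -smallest_important:
        return "substantial decline"
    if -smallest_important < lower and upper < smallest_important:
        return "no real change"          # needs a very reliable test
    return "unclear: can't say better or worse"  # the usual scenario!

# Assumed example: observed change +2.0%, typical error 0.8%,
# smallest important change 1.0%
print(interpret_change(2.0, 0.8, 1.0))
```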

  31. Summary • Inferential statistics are used to make conclusions about the true value of a simple or effect statistic derived from a sample. • The inference from a null-hypothesis significance test is about whether the true value of an effect statistic could be null (zero). • Magnitude-based inference addresses the issue of whether the true value could be important (beneficial and harmful, or substantial). • Effect magnitudes have key roles in research and practice. • Effects for continuous dependents are mean differences, slopes (expressed per 2 SD of the predictor), and correlations. • Thresholds for small, moderate, large, very large and extremely large standardized mean differences: 0.20, 0.60, 1.2, 2.0, 4.0. • Thresholds for correlations: 0.10, 0.30, 0.50, 0.70, 0.90. • Magnitude thresholds for ratios of proportions, hazards, counts: 1.11, 1.43, 2.0, 3.3, 10 and their inverses 0.9, 0.7, 0.5, 0.3, 0.1. • Take noise and thresholds into account when monitoring athletes.
