
Hypothesis Testing






Presentation Transcript


  1. Part 3 Hypothesis Testing

  2. Course Outline
Day 1
• Part 0: Student Introduction
  • Paper Helicopter - Pt 0: Use what you know
• Part 1: DOE Introduction
  • What is a Designed Experiment?
• Part 2: Planning
  • Understand the test item’s process from start to finish
  • Identify test objective – screen, characterize, optimize, compare
  • Response variables
  • Identify key factors affecting performance
  • Paper Helicopter - Pt 1: Planning
• Part 3: Hypothesis Testing
  • Random variables
  • Understanding hypothesis testing
  • Demonstrate a hypothesis test
  • Sample size, risk, and constraints
  • Seatwork Exercise 1 – Hang Time Measurements
Day 2
• Part 4: Design and Execution
  • Understanding a test matrix
  • Choose the test space – set levels
  • Factorials and fractional factorials
  • Execution – randomization and blocking
  • In-Class F-18 LEX Study – Design Build
  • Paper Helicopter - Pt 2: Design for Power
• Part 5: Analysis
  • Regression model building
  • ANOVA
  • Interpreting results – assess results, redesign and plan further tests
  • Optimization
  • In-Class F-18 LEX Study – Analysis
  • Seatwork Exercise 2 – NASCAR
  • Paper Helicopter – Pt 3: Execute and Analyze
  • Paper Helicopter – Pt 4: Multiple Response Optimization
• Part 6: Case Studies

  3. Science of Test – Metrics of Note
• Plan sequentially for discovery – factors, responses, and levels
• Design with confidence and power to span the battlespace – n, α, power, test matrices
• Execute to control uncertainty – randomize, block, replicate
• Analyze statistically to model performance – model, predictions, bounds
Together, these four steps form the DOE cycle.

  4. Random Variables: The Central Problem of Test
• The central problem of test is to determine the nature of the real world with a limited sample
• We want to draw a sample and make educated guesses about the population based on that sample
• Two things are in play:
  • The conclusions we draw (based on the sample)
  • The reality of the population (regardless of the sample)
• Ideally, we would like the two to match, or at least minimize the risk of making incorrect conclusions
• Our goal: structure the test in such a way that it leads to correct conclusions with minimal risk of error

  5. Random Variables • Our test data points are realizations of random variables. • Def: “random variable”: real-valued function of the outcome of an experiment • Examples: • The measured thrust from an engine test stand • Miss distance in meters of a dumb bomb drop • Time needed to transmit a message • Percent of time during a pass that an aircraft radar correctly identifies a threat • Time of use until encountering an error in computer software

  6. Response Variables (MOPs) Types of Measures of Performance (MOPs): • Categorical: • Discrete, category but no order (e.g. missile types, ECM types) • Ordinal: • Discrete, category but order determines value (e.g. survey data) • Interval: • Continuous, but zero and ratios have no real world meaning (e.g. temperature, dates) • Ratio: • Continuous, zero has a real meaning, ratios are relevant (e.g. miss distance) Continuous data are better!

  7. Statistics Based on Data
• Responses are gathered and transformed into test statistics
• Test statistic: a quantity calculated from a sample of data
• Examples, with quick definitions (computed in the sketch below):
  • Mean – average
  • Median – rank-order halfway point
  • Sample standard deviation – spread of the data (variability or dispersion)
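
As a concrete illustration, the three statistics above can be computed directly; the miss-distance values below are invented for this sketch, not taken from the slides:

```python
# Computing the three test statistics named above for a small
# hypothetical sample of miss distances (values are illustrative).
import numpy as np

miss_distances = np.array([1.3, 1.7, 1.5, 2.0, 1.4, 1.6])  # meters, made up

mean = np.mean(miss_distances)        # average
median = np.median(miss_distances)    # rank-order halfway point
std = np.std(miss_distances, ddof=1)  # sample std dev (n - 1 divisor)

print(f"mean={mean:.3f}, median={median:.3f}, sample std={std:.3f}")
```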

  8. Reference Distributions
• We make conclusions about our data by comparing a test statistic (some composition of many random variables) to its reference distribution
• Sometimes we know or can assume a reference distribution:
  • Historical data
  • Classical distributions (normal, etc.)
• Sometimes we don’t – what then? Resampling (sketched below)
• How “unusual” our data are under the reference distribution determines how much evidence we have for conclusions about our test
• In many cases, distributions may be assumed to be normal; small departures from normality are not an issue
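
When no classical reference distribution is available, resampling builds one empirically. A minimal sketch of one such method, the bootstrap, reusing the invented miss-distance sample from the earlier example:

```python
# Bootstrap resampling: build an empirical reference distribution for
# the sample mean without assuming a classical distribution.
import numpy as np

rng = np.random.default_rng(1)
sample = np.array([1.3, 1.7, 1.5, 2.0, 1.4, 1.6])  # illustrative data

boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(10_000)
])

# How "unusual" would a mean of 1.9 be under this reference distribution?
print(f"fraction of bootstrap means >= 1.9: {np.mean(boot_means >= 1.9):.3f}")
```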

  9. Dispersion
• The value of the standard deviation of a population describes its variability or spread
[Figure: two normal curves sharing the same mean μ – one wide and flat (large σ), one narrow and peaked (small σ)]

  10. Bounds on Dispersion
• The probability that an observation lies within given limits is the area under the density curve f(y) between those limits
• Limits here are expressed as multiples of the standard deviation
[Figure: a density curve f(y) over y, centered at μ, with the area between the limits shaded]
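
As a concrete anchor (a standard normal-theory result, not shown on the slide), for a normal population with limits at μ ± kσ this area has a closed form:

\[
P(\mu - k\sigma \le Y \le \mu + k\sigma) = \Phi(k) - \Phi(-k),
\]

which gives roughly 68% for k = 1, 95% for k = 2, and 99.7% for k = 3 (the empirical rule).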

  11. The Normal Distribution – Central Limit Theorem
• Consider the plot of many (say 50) means of samples of n = 10 observations from a strain gage force transducer
• Arrange the sample means in a histogram: the “number of occurrences” arranged in “bins”
• If we keep increasing the sample size, the distribution of means will approach the normal distribution
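
A minimal simulation of the same demonstration; the transducer data itself is not available, so a deliberately non-normal (uniform) stand-in population is assumed:

```python
# CLT sketch mirroring the slide: 50 sample means of n = 10 observations
# each, drawn from a non-normal population; the histogram of the means
# already looks roughly bell-shaped.
import numpy as np

rng = np.random.default_rng(7)

sample_means = np.array([
    rng.uniform(0.0, 10.0, size=10).mean()  # one sample of n = 10
    for _ in range(50)                      # repeated 50 times
])

# Bin the means into a text histogram, as the slide's plot does.
counts, bin_edges = np.histogram(sample_means, bins=8)
for count, left_edge in zip(counts, bin_edges):
    print(f"{left_edge:5.2f} | {'*' * count}")
```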

  12. Hypothesis Testing
• Hypothesis test for a simple comparative experiment:
• The test is simplified into two competing claims, or hypotheses
  • H0: the null hypothesis
  • H1: the alternate hypothesis
• (Examples of each follow on the next slides)
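
For the Maverick comparison developed below, the hypotheses can be written in terms of the mean miss distance μ; this formulation matches the miss-distance test stated on the “Visualizing Errors” slide:

\[
H_0: \mu \le 1.5 \;\; (\text{performance unchanged}) \qquad
H_1: \mu > 1.5 \;\; (\text{performance degraded})
\]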

  13. Case Study: The Maverick H/K
• The current B (EO) and D/G (IIR) seekers are degrading with age
• Replace the EO/IIR versions with the H and K variants
• Critical Operational Issue: does the Maverick still perform at least as well as previously?
• Shoot several of each type – same day, same conditions, same target
• What is the chance of getting it right?

  14. Example: Maverick H/K • The Air Force wants to field a new air-to-ground missile. Early testing has shown that the missile often lands either well short or long of the target. • The new Maverick H/K variant must perform as well as or better than the legacy Maverick model. The Maverick H/K supply is limited.

  15. Sample Size
• How many tests?
• A function of several influences: test time, resources, and budget
• More is always better, but may not be necessary
• Right-size based on limitations, confidence, and power – the drivers are n, α, β, σ, and δ

  16. Maverick H/K: Choosing the Best Response MOP
• Categorical response: hit or miss – a distribution of proportions based on a 0/1 outcome; very large spread, difficult to estimate
• Numeric response: miss distance – a distribution of a real-valued metric; dispersion around a central tendency
• The comparison holds n and σ constant
[Figure: target with HIT/MISS regions along a distance axis Y, with the two response distributions compared]
Continuous data are better!
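
One standard way to see why continuous data are better (an argument the slide implies but does not spell out): with the same number of trials n, the mean of a continuous response and a hit/miss proportion have variances

\[
\operatorname{Var}(\bar{Y}) = \frac{\sigma^2}{n}, \qquad
\operatorname{Var}(\hat{p}) = \frac{p(1-p)}{n},
\]

and the binary outcome discards all information about how close each miss was, so many more trials are needed to estimate performance to the same precision.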

  17. Maverick Test: How Many Trials ? • Let’s draw a sample of _n_ shots • How many is enough to get it right? • 3 – because that’s our budget • 8 – because 8 sounds about right • 10 – ah, just slide the decimal • 30 – because something good happens at 30!

  18. Distributions of Observations
• Suppose we could make many missile trial shots and then plot their miss distances in a dot plot (~histogram); each dot is one launch
• Distributions are often described by their shape, center, and spread
[Figure: dot plot of miss distances with axis ticks at 1.2, 1.5, and 1.8]

  19. Characterizing Distributions
• We often refer to distributions by their center and their spread
• The spread can be gauged from the curve’s inflection points
• Eyeballing the results, the center is about 1.5 and the spread is about 0.4
[Figure: distribution with center, inflection point, and spread labeled; axis ticks at 0.5, 1.5, and 2.5]

  20. Distributions of Observations vs. Means
• The distributions of means are used in hypothesis testing and determination of risk (error)
[Figure: distribution of individual shot miss distances (n = 1) alongside distributions of mean miss distances for n = 5, 20, and 40, which narrow as n grows]
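
The narrowing follows from the standard error of the mean: averaging n independent shots, each with standard deviation σ, gives

\[
\sigma_{\bar{Y}} = \frac{\sigma}{\sqrt{n}},
\]

so quadrupling the sample size halves the spread of the distribution of means.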

  21. Type I and II Error
• Weapon testing can be framed in terms of a hypothesis
• There are risks associated with the verdict; either conclusion has some probability of error
• Hypothesis test:
  • H0: Maverick H/K performs the same as the older version (null hypothesis)
  • H1: Maverick H/K performance is degraded (alternative hypothesis)

  22. Visualizing Errors
• Type I and II errors are probabilities determined prior to test, based on our knowledge of the sample size and the null vs. alternative hypotheses
• Risks or errors:
  • Type I error = probability α = false positive: reject H0 when H0 is true
  • Type II error = probability β = false negative: “accept” H0 when H1 is true
• Example: a hypothesis test to determine if the Maverick H/K miss distance is equal to or less than 1.5 (legacy performance)
  • Type I error: truth is miss distance ≤ 1.5, but we conclude miss distance > 1.5
  • Type II error: truth is miss distance > 1.5, but we conclude miss distance ≤ 1.5

  23. Delta, Power, and Sample Size
• Example: Maverick H/K
• Recall, we wish to test whether the new version is at least as good as the old one; the old version had a mean miss distance of 1.5
• What is the error associated with the new missile if its true mean miss distance is 2.5, and how does sample size influence the magnitude of this error? (see the sketch below)
• The difference in means we wish to detect is δ (delta)
• Larger deltas increase power (it is easier to detect a large difference)
• Small deltas typically require a large sample size and/or small run-to-run variation (i.e., a small standard deviation)
[Figure: H0 distribution centered at 1.5 and H1 distribution centered at 2.5, separated by δ = 1]
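
A minimal sketch of how power grows with sample size for this one-sided comparison; σ = 1.0 and α = 0.05 are illustrative assumptions, since the slides do not state them:

```python
# Power of a one-sided z-test of H0: mu = 1.5 vs. H1: mu = 2.5
# (delta = 1.0) as a function of sample size n.
from scipy.stats import norm

def power(n, delta=1.0, sigma=1.0, alpha=0.05):
    """Probability of detecting a true shift of `delta` with n shots."""
    z_crit = norm.ppf(1 - alpha)  # critical value under H0
    return 1 - norm.cdf(z_crit - delta * n**0.5 / sigma)

for n in (3, 10, 20):
    print(f"n = {n:2d}: power = {power(n):.3f}")
```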

  24. Sample Size
• How many tests should be performed?
• A function of several influences: test time, resources, and budget
• More is always better, but may be excessive
• Right-size based on limitations, confidence, and power – the drivers are n, α, β, σ, and δ

  25. Errors in Conclusion, n = 3, Fixed α, σ, δ
• Decision rule: declare changed if miss distance > 2.7
• Distribution associated with H0 (centered at 1.5): performs as expected – unchanged
• Distribution associated with H1 (centered at 2.5): change in performance – degraded
• α = probability we declare changed when it has not changed – picked a priori
• β = probability we declare unchanged when it has changed
• σ and δ stay the same for the next two slides
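
A minimal sketch of the α and β areas for decision rules like the one above; σ = 1.0 is an illustrative assumption (the slides do not give it), so the printed values will not exactly match the slides’ figures:

```python
# Alpha and beta for a fixed decision threshold on the sample mean,
# under H0 (mu = 1.5) and H1 (mu = 2.5).
from scipy.stats import norm

def alpha_beta(n, threshold, mu0=1.5, mu1=2.5, sigma=1.0):
    se = sigma / n**0.5  # std dev of the sample mean
    alpha = 1 - norm.cdf(threshold, loc=mu0, scale=se)  # declare changed, H0 true
    beta = norm.cdf(threshold, loc=mu1, scale=se)       # declare unchanged, H1 true
    return alpha, beta

# Thresholds from this slide and the next two.
for n, c in [(3, 2.7), (10, 2.2), (20, 1.9)]:
    a, b = alpha_beta(n, c)
    print(f"n = {n:2d}, threshold = {c}: alpha = {a:.3f}, beta = {b:.3f}")
```

Note how β shrinks as n grows, matching the next two slides.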

  26. Errors in Conclusion, n = 10, Fixed α, σ, δ
• Decision rule: declare changed if miss distance > 2.2 (more precision vs. n = 3)
• H0 (centered at 1.5): performs as expected – unchanged
• H1 (centered at 2.5): change in performance – degraded
• σ and δ stay the same
• Note that the area associated with β is reduced versus the n = 3 case

  27. Errors in Conclusion, n = 20, Fixed α, σ, δ
• Decision rule: declare changed if miss distance > 1.9 (more precision vs. n = 10)
• H0 (centered at 1.5): performs as expected – unchanged
• H1 (centered at 2.5): change in performance – degraded
• σ and δ stay the same
• Note that the area associated with β is reduced again vs. the n = 10 case

  28. Effect of Sample Size and Dispersion
• Sample size and dispersion are influenced by proper planning, which aids in seeing the true underlying factor effects
[Figure: distributions of average miss distance for the Maverick H/K – for a given σ, curves for n = 1, 10, and 25; for a given n, curves for σ = 2.0, 1.0, and 0.5; each distribution narrows as n grows or σ shrinks]

  29. Power Calculation
• Russ Lenth offers an applet to help with power calculations: http://www.stat.uiowa.edu/~rlenth/Power/
• Consider the cement example again
• What is the power to detect a difference in means of δ = 0.5 m, using 8 samples per weapon and a standard deviation of 0.25 m, with 95% confidence? Power from the program: 1 - β = 0.96
• What if δ = 0.25? 1 - β = 0.46
• How about δ = 0.25 and β = 0.05 – what n is required? n = 27
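
These numbers can be reproduced approximately in Python. The sketch assumes a two-sample t-test with 8 samples per group, which is consistent with the applet values quoted above, though the slide does not state the test type explicitly:

```python
# Reproducing the quoted power numbers with statsmodels, assuming a
# two-sided, two-sample t-test with equal group sizes.
from statsmodels.stats.power import TTestIndPower

solver = TTestIndPower()

# delta = 0.5 m, sigma = 0.25 m -> Cohen's effect size = 0.5 / 0.25 = 2.0
print(solver.solve_power(effect_size=2.0, nobs1=8, alpha=0.05))    # ~0.96

# delta = 0.25 m -> effect size = 1.0, same n
print(solver.solve_power(effect_size=1.0, nobs1=8, alpha=0.05))    # ~0.46

# Required n per group for delta = 0.25 m and power = 1 - beta = 0.95
print(solver.solve_power(effect_size=1.0, alpha=0.05, power=0.95)) # ~27
```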

  30. Importance of the t-Test
• Provides an objective framework for simple comparative experiments
• Could be used to test all relevant hypotheses in a two-level factorial design, because all of these hypotheses involve the mean response at one “side” of the cube versus the mean response at the opposite “side” of the cube
• We will take a slightly different approach
[Figure: two-level factorial cube with factors A, B, and C at + and − levels]

  31. Session 3 Summary: Hypothesis Testing and Risk
• Random variables
• Probability distributions – the normal distribution
• Understanding hypothesis testing
• Demonstrate a hypothesis test
• Sample size, confidence, power, and constraints

  32. Seatwork Do Seatwork Exercise 1 – Hang time Measurements
