
Statistical Methods in Computer Science



Presentation Transcript


  1. The Basis for Experiment Design Ido Dagan Statistical Methods in Computer Science

  2. Experimental Lifecycle A cycle linking Model/Theory, Hypothesis, Experiment, and Analysis, with the analysis feeding back into the model/theory.

  3. Proving a Theory? We've discussed 4 methods of proving a proposition: everyone knows it; someone specific says it; an experiment supports it; we can mathematically prove it. Some propositions cannot be verified empirically: “This compiler has linear run-time”. Infinite possible inputs --> cannot prove empirically. But they may still be disproved: e.g., code that causes the compiler to run non-linearly.
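
  As an illustration of the last point, a minimal Python sketch (the input sizes and the use of Python's built-in compile() as a stand-in compiler are assumptions for the example, not part of the lecture):

    import time

    def time_per_unit(compile_fn, sizes):
        ratios = []
        for n in sizes:
            source = "x = 0\n" * n                    # toy input of size n
            start = time.perf_counter()
            compile_fn(source)
            ratios.append((time.perf_counter() - start) / n)
        return ratios

    # Python's built-in compile() serves as a stand-in compiler here:
    print(time_per_unit(lambda s: compile(s, "<string>", "exec"),
                        [1_000, 2_000, 4_000, 8_000]))
    # Roughly flat ratios are consistent with linearity but prove nothing;
    # clearly growing ratios falsify the linear-run-time claim for this compiler.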

  4. Karl Popper's Philosophy of Science Popper advanced a particular philosophy of science: falsifiability. For a theory to be considered scientific, it must be falsifiable: there must be some way to refute it, in principle. Not falsifiable <==> Not scientific. Examples: “All crows are black” is falsifiable by finding a white crow; “compiles in linear time” is falsifiable by non-linear performance. A theory is tested on its predictions.

  5. Proving by disproving... Platt (“Strong Inference”, 1964) offers a specific method: (1) Devise alternative hypotheses for the observations. (2) Devise experiment(s) allowing elimination of hypotheses. (3) Carry out experiments to obtain a clean result. (4) Go to 1. The idea is to eliminate (falsify) hypotheses.

  6. Forming Hypotheses So, to support theory X, we construct falsification hypotheses X1, ..., Xn, ... and systematically experiment to disprove X by trying to prove each Xi. If all falsification hypotheses are eliminated, this lends support to the theory. Note that future falsification hypotheses may be formed: the theory must continue to hold against “attacks”. Popper: scientific evolution, “survival of the fittest theory” (e.g., Newton's theory). How does this view hold in computer science?

  7. Forming Hypotheses in CS Carefully identify the theoretical object we are studying: e.g., “the relation between input-size and run-time is linear” e.g., “the display improves user performance” Identify falsification hypothesis (null hypothesis) H0 e.g., “there is an input-size for which run-time is non-linear” e.g., “the display will have no effect on user performance” Now, experiment to eliminate H0
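
  A minimal sketch of what “experiment to eliminate H0” can look like for the display example, using made-up task times and a simple permutation test (hypothesis testing itself is covered later in the course):

    import random

    with_display    = [12.1, 11.4, 10.9, 12.8, 11.0]   # seconds per task (made up)
    without_display = [13.5, 12.9, 14.2, 13.1, 12.6]

    def mean(xs):
        return sum(xs) / len(xs)

    observed = mean(without_display) - mean(with_display)
    pooled = with_display + without_display
    extreme = 0
    trials = 10_000
    for _ in range(trials):
        random.shuffle(pooled)
        a, b = pooled[:5], pooled[5:]
        if mean(b) - mean(a) >= observed:
            extreme += 1
    print("p-value:", extreme / trials)   # a small value is evidence against H0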

  8. The Basics of Experiment Design Experiments identify a relation between variables X, Y, ... Simple experiments provide an indication of a relation: better/worse, linear or non-linear, .... Advanced experiments help identify causes and interactions: e.g., linear in input size, but the constant factor depends on the type of data.

  9. Types of Experiments and Variables Manipulation experiments: manipulate (= set the value of) independent variables (input size); observe (measure the value of) dependent variables (run time). Observation experiments: observe predictor variables (person height); observe response variables (running speed); also running time, if observing a system in actual use. Other variables: Endogenous: on the causal path between independent and dependent. Exogenous: other variables influencing the dependent variables.

  10. An example of an observation experiment Theory: Gender affects score performance. Falsifying hypothesis: Gender does not affect performance, i.e., men and women perform the same. Cannot use manipulation experiments: we cannot control gender. Must use observation experiments.

  11. An example observation experiment (à la “Empirical methods in AI”, Cohen 1995) Two example records: Child A: Gender: Male, # Siblings: 2, Mother: artist, Height: 145cm, Teacher's attitude, Child confidence, Test score: 650. Child B: Gender: Female, # Siblings: 3, Mother: Doctor, Height: 135cm, Teacher's attitude, Child confidence, Test score: 720. [The slide highlights the Independent (Predictor) Variables.]

  12. An example observation experiment (à la “Empirical methods in AI”, Cohen 1995) The same two records; the slide highlights the Dependent (Response) Variables.

  13. An example observation experiment (à la “Empirical methods in AI”, Cohen 1995) The same two records; the slide highlights the Endogenous Variables.

  14. An example observation experiment (à la “Empirical methods in AI”, Cohen 1995) The same two records; the slide highlights the Exogenous Variables.

  15. Experiment Design: Introduction Different experiment types explore different hypotheses. For instance, a very simple design: the treatment experiment, sometimes known as a lesion study. With variables V0 V1 V2 ... Vn:
      treatment condition:  Ind1       Ex1 Ex2 .... Exn  ->  Dep1
      control condition:    Not(Ind1)  Ex1 Ex2 .... Exn  ->  Dep2
      Treatment condition: independent variable set to “with treatment”. Control condition: independent variable set to “no treatment”. The dependent variable is measured in both conditions.
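
  A minimal sketch of a treatment experiment in code; run_trial() and the caching flag are illustrative placeholders, not the slide's notation:

    import random
    import statistics

    def run_trial(cache_enabled, workload_size):
        # stand-in for a real measurement, e.g. run time in milliseconds
        base = 0.5 * workload_size
        noise = random.gauss(0, 2)
        return base * (0.7 if cache_enabled else 1.0) + noise

    workload = 100                       # exogenous variables held fixed
    treatment = [run_trial(True, workload) for _ in range(30)]
    control   = [run_trial(False, workload) for _ in range(30)]

    print("treatment mean:", statistics.mean(treatment))
    print("control mean:  ", statistics.mean(control))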

  16. Single-Factor Treatment Experiments A generalization of treatment experiments: allows comparison of different conditions. With a categorical variable V0, other variables V1 V2 ..., and the dependent variable Vn:
      treatment1:  Ind1      Ex1 Ex2 .... Exn  ->  Dep1
      treatment2:  Ind2      Ex1 Ex2 .... Exn  ->  Dep2
      [control:    Not(Ind)  Ex1 Ex2 .... Exn  ->  Dep3 ]
      Compare performance of algorithm A to B to C .... The control condition is optional (e.g., to establish a baseline). The goal is to determine the relation of the categorical variable V0 and the dependent variable Vn.
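
  A sketch of a single-factor comparison of algorithm conditions on shared inputs; the sorting algorithms and input sizes are placeholders for whatever systems are actually compared:

    import random
    import statistics
    import time

    def insertion_sort(xs):
        xs = list(xs)
        for i in range(1, len(xs)):
            j = i
            while j > 0 and xs[j - 1] > xs[j]:
                xs[j - 1], xs[j] = xs[j], xs[j - 1]
                j -= 1
        return xs

    conditions = {"builtin": sorted, "insertion": insertion_sort}
    inputs = [[random.random() for _ in range(500)] for _ in range(20)]

    for name, algorithm in conditions.items():
        times = []
        for data in inputs:                       # same inputs for every condition
            start = time.perf_counter()
            algorithm(data)
            times.append(time.perf_counter() - start)
        print(name, statistics.mean(times))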

  17. Careful ! An effect on the dependent variable may not be as expected. Example: an experiment. Hypothesis: the fly's ear is on its wings. Fly with two wings: make a loud noise, observe flight. Fly with one wing: make a loud noise, no flight. Conclusion: a fly with only one wing cannot hear! What's going on here? First, interpretation by the experimenter. But also, lack of sufficient falsifiability: there are other possible explanations for why the fly wouldn't fly.

  18. Controlling for other factors Often, we cannot manipulate all exogenous variables. Then, we need to make sure they are sampled randomly: randomization averages out their effect. This can be difficult. E.g., suppose we are trying to relate gender and math. We control for the effect of # of siblings by random sampling. But # of siblings may be related to gender: parents continue to have children hoping for a boy (Beal 1994). Thus # of siblings is tied with gender. Must separate results based on # of siblings.
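
  A small sketch of the last point: report the comparison separately for each value of the possibly confounded variable. The records are fabricated; only the bookkeeping pattern matters:

    from collections import defaultdict
    import statistics

    records = [
        {"gender": "F", "siblings": 1, "score": 710},
        {"gender": "F", "siblings": 3, "score": 640},
        {"gender": "M", "siblings": 1, "score": 650},
        {"gender": "M", "siblings": 3, "score": 600},
        {"gender": "F", "siblings": 1, "score": 690},
        {"gender": "M", "siblings": 3, "score": 610},
    ]

    by_group = defaultdict(list)
    for record in records:
        by_group[(record["gender"], record["siblings"])].append(record["score"])

    # Report the gender comparison separately for each # of siblings,
    # instead of pooling over a variable that may be tied to gender.
    for (gender, siblings), scores in sorted(by_group.items()):
        print(f"gender={gender} siblings={siblings} mean score={statistics.mean(scores)}")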

  19. Factorial Experiment Designs • Every combination of factor values is sampled • Hope is to exclude or reveal interactions • This creates a combinatorial number of experiments • N factors, k values each = k^N combinations • Strategies for eliminating values: • Merge values, categories. Skip values. • Focus on extremes, to get a general trend • But this may hide behavior at intermediate values
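
  A sketch of enumerating a full factorial design (factor names and values are illustrative):

    from itertools import product

    factors = {
        "input_size": [1_000, 10_000, 100_000],
        "data_type":  ["sorted", "random", "reversed"],
        "algorithm":  ["A", "B"],
    }

    configurations = list(product(*factors.values()))
    print(len(configurations), "configurations")   # 3 * 3 * 2 = 18, i.e. k^N-style growth
    for combination in configurations:
        settings = dict(zip(factors.keys(), combination))
        # run_and_measure(settings)   # placeholder for the actual experiment
        print(settings)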

  20. Tips for Factorial Experiments For “numerical” variables, 2 value ranges are not enough: they don't give a good sense of the function relating the variables. Measure, measure, measure. Piggybacking measurements on planned experiments is cheaper than re-running experiments. Simplify comparisons: use the same number of data points (trials) for all configurations.

  21. Experiment Validity Types of validity: internal and external validity. Internal validity: the experiment shows the claimed relationship (the independent variable causes the dependent). External validity: the degree to which results generalize to other conditions. Threats: uncontrolled conditions threatening validity.

  22. Internal validity threats: Examples Order effects: practice effects in human or animal test subjects, e.g., user performance improves over successive user-interface tasks; solution: randomize the order of presentation to subjects. A bug or side-effect in the testing system leaves the system “unclean” for the next trial; need to “clean” the system between experiments. If treatment/control are given in two different orders (e.g., run with/without the new algorithm operating, for the same users), the order may be good for treatment and bad for control (or vice versa); solution: counter-balancing (all possible orders). Demand effects: the experimenter influences the subject, e.g., by guiding subjects. Confounding effects: variable relations aren't clear; see “fly with no wings cannot hear”.
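
  A sketch of counter-balancing the presentation order across subjects (condition and subject names are placeholders):

    from itertools import permutations

    conditions = ["without_new_algorithm", "with_new_algorithm"]
    subjects = ["s1", "s2", "s3", "s4"]

    orders = list(permutations(conditions))      # every possible presentation order
    for i, subject in enumerate(subjects):
        order = orders[i % len(orders)]          # spread the orders evenly over subjects
        print(subject, "->", order)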

  23. External threats to validity Sampling bias: non-representative samples, e.g., non-representative external factors. Floor and ceiling effects: the problems tested are too hard or too easy. Regression effects: results have no way to go but up or down. Solution approach: run pilot experiments.

  24. Sampling Bias The setting prefers measuring specific values over others. For instance: “random” selection of mice from a cage for an experiment selects for specific values: slow, doesn't bite (not aggressive), …. Or: including only results that were found by some deadline. Solution: detect, and remove; e.g., by visualization, looking for non-normal distributions; e.g., a surprising distribution of the dependent data for different values of the independent variable.
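
  A quick sketch of the detection idea: a text histogram of the dependent variable for each value of the independent variable, using made-up data:

    from collections import Counter

    samples = {
        "condition_A": [12, 13, 13, 14, 14, 14, 15, 15, 16],
        "condition_B": [12, 12, 12, 12, 13, 21, 22, 22, 23],   # suspicious: bimodal
    }

    for condition, values in samples.items():
        print(condition)
        counts = Counter(values)
        for value in sorted(counts):
            print(f"  {value:3d} | " + "#" * counts[value])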

  25. Baselines: Floor and Ceiling Effects How do we know A is good? Bad? Maybe the problems are too simple? Too hard? For example: a new machine learning algorithm has 95% accuracy. Is this good? Controlling for floor/ceiling: establish baselines; show whether a “silly” approach achieves a close result. Comparison to a strawman (easy) or an ironman (hard) may be misleading if not chosen appropriately.
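
  A sketch of the baseline check for the 95%-accuracy example, with a made-up label distribution:

    from collections import Counter

    test_labels = ["ok"] * 930 + ["fault"] * 70     # 93% of test examples are "ok"

    majority_label, majority_count = Counter(test_labels).most_common(1)[0]
    baseline_accuracy = majority_count / len(test_labels)
    print("majority-class baseline accuracy:", baseline_accuracy)   # 0.93
    # A reported 95% is barely above this floor, so it is weaker evidence
    # than the raw number suggests.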

  26. Regression Effects General phenomenon: “regression toward the mean”: repeated measurements converge toward mean values. Example threat: run a program on 100 different inputs; problems 6, 14, 15 get a very low score. We now fix the problem that affected only these inputs, and want to re-test. If chance has anything to do with scoring, then we must re-run all inputs. Why? The scores on 6, 14, 15 have nowhere to go but up, so re-running only these problems will show improvement by chance. Solution: re-run the complete tests, or sample conditions uniformly.
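
  A small simulation of regression toward the mean, with purely synthetic scores:

    import random
    import statistics

    random.seed(1)
    first_run  = [random.gauss(70, 10) for _ in range(100)]
    second_run = [random.gauss(70, 10) for _ in range(100)]

    # The three worst-scoring inputs on the first run (the analogue of
    # "problems 6, 14, 15"):
    worst = sorted(range(100), key=lambda i: first_run[i])[:3]

    print("worst inputs, first run: ", statistics.mean(first_run[i] for i in worst))
    print("same inputs, second run:", statistics.mean(second_run[i] for i in worst))
    # The second number is almost certainly higher, purely by chance.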

  27. Summary Defensive thinking: if I were trying to disprove the claim, what would I do? Then think of ways to counter any possible attack on the claim. Strong Inference, Popper's falsification ideas: science moves by disproving theories (empirically). Experiment design: ideal independent variables are easy to manipulate; ideal dependent variables are measurable, sensitive, and meaningful. Carefully think through threats. Next week: hypothesis testing.
