
Statistical Methods in Computer Science



Presentation Transcript


  1. The Basis for Experiment Design Ido Dagan Statistical Methods in Computer Science

  2. Experimental Lifecycle A cycle linking Model/Theory, Hypothesis, Experiment, and Analysis, with the analysis feeding back into the model/theory.

  3. Proving a Theory? We've discussed 4 methods of proving a proposition: everyone knows it; someone specific says it; an experiment supports it; we can mathematically prove it. Some propositions cannot be verified empirically: “This compiler has linear run-time”. Infinite possible inputs --> cannot prove empirically. But they may still be disproved: e.g., code that causes the compiler to run non-linearly.
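
  As an illustration of the last point, a minimal Python sketch (the input sizes and the use of Python's built-in compile() as a stand-in compiler are assumptions for the example, not part of the lecture):

    import time

    def time_per_unit(compile_fn, sizes):
        ratios = []
        for n in sizes:
            source = "x = 0\n" * n                    # toy input of size n
            start = time.perf_counter()
            compile_fn(source)
            ratios.append((time.perf_counter() - start) / n)
        return ratios

    # Python's built-in compile() serves as a stand-in compiler here:
    print(time_per_unit(lambda s: compile(s, "<string>", "exec"),
                        [1_000, 2_000, 4_000, 8_000]))
    # Roughly flat ratios are consistent with linearity but prove nothing;
    # clearly growing ratios falsify the linear-run-time claim for this compiler.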

  4. Karl Popper's Philosophy of Science Popper advanced a particular philosophy of science: falsifiability. For a theory to be considered scientific, it must be falsifiable: there must be some way to refute it, in principle. Not falsifiable <==> Not scientific. Examples: “All crows are black” is falsifiable by finding a white crow; “compiles in linear time” is falsifiable by non-linear performance. A theory is tested on its predictions.

  5. Proving by disproving... Platt (“Strong Inference”, 1964) offers a specific method: (1) Devise alternative hypotheses for the observations. (2) Devise experiment(s) allowing elimination of hypotheses. (3) Carry out experiments to obtain a clean result. (4) Go to 1. The idea is to eliminate (falsify) hypotheses.

  6. Forming Hypotheses So, to support theory X, we construct falsification hypotheses X1, ..., Xn, ... and systematically experiment to disprove X by trying to prove each Xi. If all falsification hypotheses are eliminated, this lends support to the theory. Note that future falsification hypotheses may be formed: the theory must continue to hold against “attacks”. Popper: scientific evolution, “survival of the fittest theory” (e.g., Newton's theory). How does this view hold in computer science?

  7. Forming Hypotheses in CS Carefully identify the theoretical object we are studying: e.g., “the relation between input-size and run-time is linear” e.g., “the display improves user performance” Identify falsification hypothesis (null hypothesis) H0 e.g., “there is an input-size for which run-time is non-linear” e.g., “the display will have no effect on user performance” Now, experiment to eliminate H0
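
  A minimal sketch of what “experiment to eliminate H0” can look like for the display example, using made-up task times and a simple permutation test (hypothesis testing itself is covered later in the course):

    import random

    with_display    = [12.1, 11.4, 10.9, 12.8, 11.0]   # seconds per task (made up)
    without_display = [13.5, 12.9, 14.2, 13.1, 12.6]

    def mean(xs):
        return sum(xs) / len(xs)

    observed = mean(without_display) - mean(with_display)
    pooled = with_display + without_display
    extreme = 0
    trials = 10_000
    for _ in range(trials):
        random.shuffle(pooled)
        a, b = pooled[:5], pooled[5:]
        if mean(b) - mean(a) >= observed:
            extreme += 1
    print("p-value:", extreme / trials)   # a small value is evidence against H0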

  8. The Basics of Experiment Design Experiments identify a relation between variables X, Y, ... Simple experiments provide an indication of a relation: better/worse, linear or non-linear, .... Advanced experiments help identify causes and interactions: e.g., linear in input size, but the constant factor depends on the type of data.

  9. Types of Experiments and Variables Manipulation experiments: manipulate (= set the value of) independent variables (input size); observe (measure the value of) dependent variables (run time). Observation experiments: observe predictor variables (person height); observe response variables (running speed); also running time, if observing a system in actual use. Other variables: Endogenous: on the causal path between independent and dependent. Exogenous: other variables influencing the dependent variables.

  10. An example of an observation experiment Theory: Gender affects score performance. Falsifying hypothesis: Gender does not affect performance, i.e., men and women perform the same. Cannot use manipulation experiments: we cannot control gender. Must use observation experiments.

  11. An example observation experiment (à la “Empirical methods in AI”, Cohen 1995) Two example records: Child A: Gender: Male, # Siblings: 2, Mother: artist, Height: 145cm, Teacher's attitude, Child confidence, Test score: 650. Child B: Gender: Female, # Siblings: 3, Mother: Doctor, Height: 135cm, Teacher's attitude, Child confidence, Test score: 720. [The slide highlights the Independent (Predictor) Variables.]

  12. An example observation experiment (à la “Empirical methods in AI”, Cohen 1995) The same two records; the slide highlights the Dependent (Response) Variables.

  13. An example observation experiment (à la “Empirical methods in AI”, Cohen 1995) The same two records; the slide highlights the Endogenous Variables.

  14. An example observation experiment (à la “Empirical methods in AI”, Cohen 1995) The same two records; the slide highlights the Exogenous Variables.

  15. Experiment Design: Introduction Different experiment types explore different hypotheses. For instance, a very simple design: the treatment experiment, sometimes known as a lesion study. With variables V0 V1 V2 ... Vn:
      treatment condition:  Ind1       Ex1 Ex2 .... Exn  ->  Dep1
      control condition:    Not(Ind1)  Ex1 Ex2 .... Exn  ->  Dep2
      Treatment condition: independent variable set to “with treatment”. Control condition: independent variable set to “no treatment”. The dependent variable is measured in both conditions.
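
  A minimal sketch of a treatment experiment in code; run_trial() and the caching flag are illustrative placeholders, not the slide's notation:

    import random
    import statistics

    def run_trial(cache_enabled, workload_size):
        # stand-in for a real measurement, e.g. run time in milliseconds
        base = 0.5 * workload_size
        noise = random.gauss(0, 2)
        return base * (0.7 if cache_enabled else 1.0) + noise

    workload = 100                       # exogenous variables held fixed
    treatment = [run_trial(True, workload) for _ in range(30)]
    control   = [run_trial(False, workload) for _ in range(30)]

    print("treatment mean:", statistics.mean(treatment))
    print("control mean:  ", statistics.mean(control))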

  16. Single-Factor Treatment Experiments A generalization of treatment experiments: allows comparison of different conditions. With a categorical variable V0, other variables V1 V2 ..., and the dependent variable Vn:
      treatment1:  Ind1      Ex1 Ex2 .... Exn  ->  Dep1
      treatment2:  Ind2      Ex1 Ex2 .... Exn  ->  Dep2
      [control:    Not(Ind)  Ex1 Ex2 .... Exn  ->  Dep3 ]
      Compare performance of algorithm A to B to C .... The control condition is optional (e.g., to establish a baseline). The goal is to determine the relation of the categorical variable V0 and the dependent variable Vn.
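
  A sketch of a single-factor comparison of algorithm conditions on shared inputs; the sorting algorithms and input sizes are placeholders for whatever systems are actually compared:

    import random
    import statistics
    import time

    def insertion_sort(xs):
        xs = list(xs)
        for i in range(1, len(xs)):
            j = i
            while j > 0 and xs[j - 1] > xs[j]:
                xs[j - 1], xs[j] = xs[j], xs[j - 1]
                j -= 1
        return xs

    conditions = {"builtin": sorted, "insertion": insertion_sort}
    inputs = [[random.random() for _ in range(500)] for _ in range(20)]

    for name, algorithm in conditions.items():
        times = []
        for data in inputs:                       # same inputs for every condition
            start = time.perf_counter()
            algorithm(data)
            times.append(time.perf_counter() - start)
        print(name, statistics.mean(times))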

  17. Careful ! An effect on the dependent variable may not be as expected. Example: an experiment. Hypothesis: the fly's ear is on its wings. Fly with two wings: make a loud noise, observe flight. Fly with one wing: make a loud noise, no flight. Conclusion: a fly with only one wing cannot hear! What's going on here? First, interpretation by the experimenter. But also, lack of sufficient falsifiability: there are other possible explanations for why the fly wouldn't fly.

  18. Controlling for other factors Often, we cannot manipulate all exogenous variables. Then, we need to make sure they are sampled randomly: randomization averages out their effect. This can be difficult. E.g., suppose we are trying to relate gender and math. We control for the effect of # of siblings by random sampling. But # of siblings may be related to gender: parents continue to have children hoping for a boy (Beal 1994). Thus # of siblings is tied with gender. Must separate results based on # of siblings.
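
  A small sketch of the last point: report the comparison separately for each value of the possibly confounded variable. The records are fabricated; only the bookkeeping pattern matters:

    from collections import defaultdict
    import statistics

    records = [
        {"gender": "F", "siblings": 1, "score": 710},
        {"gender": "F", "siblings": 3, "score": 640},
        {"gender": "M", "siblings": 1, "score": 650},
        {"gender": "M", "siblings": 3, "score": 600},
        {"gender": "F", "siblings": 1, "score": 690},
        {"gender": "M", "siblings": 3, "score": 610},
    ]

    by_group = defaultdict(list)
    for record in records:
        by_group[(record["gender"], record["siblings"])].append(record["score"])

    # Report the gender comparison separately for each # of siblings,
    # instead of pooling over a variable that may be tied to gender.
    for (gender, siblings), scores in sorted(by_group.items()):
        print(f"gender={gender} siblings={siblings} mean score={statistics.mean(scores)}")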

  19. Factorial Experiment Designs • Every combination of factor values is sampled • Hope is to exclude or reveal interactions • This creates a combinatorial number of experiments • N factors, k values each = k^N combinations • Strategies for eliminating values: • Merge values, categories. Skip values. • Focus on extremes, to get a general trend • But this may hide behavior at intermediate values
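
  A sketch of enumerating a full factorial design (factor names and values are illustrative):

    from itertools import product

    factors = {
        "input_size": [1_000, 10_000, 100_000],
        "data_type":  ["sorted", "random", "reversed"],
        "algorithm":  ["A", "B"],
    }

    configurations = list(product(*factors.values()))
    print(len(configurations), "configurations")   # 3 * 3 * 2 = 18, i.e. k^N-style growth
    for combination in configurations:
        settings = dict(zip(factors.keys(), combination))
        # run_and_measure(settings)   # placeholder for the actual experiment
        print(settings)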

  20. Tips for Factorial Experiments For “numerical” variables, 2 value ranges are not enough: they don't give a good sense of the function relating the variables. Measure, measure, measure. Piggybacking measurements on planned experiments is cheaper than re-running experiments. Simplify comparisons: use the same number of data points (trials) for all configurations.

  21. Experiment Validity Types of validity: internal and external validity. Internal validity: the experiment shows the claimed relationship (the independent variable causes the dependent). External validity: the degree to which results generalize to other conditions. Threats: uncontrolled conditions threatening validity.

  22. Internal validity threats: Examples Order effects: practice effects in human or animal test subjects, e.g., user performance improves over successive user-interface tasks; solution: randomize the order of presentation to subjects. A bug or side-effect in the testing system leaves the system “unclean” for the next trial; need to “clean” the system between experiments. If treatment/control are given in two different orders (e.g., run with/without the new algorithm operating, for the same users), the order may be good for treatment and bad for control (or vice versa); solution: counter-balancing (all possible orders). Demand effects: the experimenter influences the subject, e.g., by guiding subjects. Confounding effects: variable relations aren't clear; see “fly with no wings cannot hear”.
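
  A sketch of counter-balancing the presentation order across subjects (condition and subject names are placeholders):

    from itertools import permutations

    conditions = ["without_new_algorithm", "with_new_algorithm"]
    subjects = ["s1", "s2", "s3", "s4"]

    orders = list(permutations(conditions))      # every possible presentation order
    for i, subject in enumerate(subjects):
        order = orders[i % len(orders)]          # spread the orders evenly over subjects
        print(subject, "->", order)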

  23. External threats to validity Sampling bias: non-representative samples, e.g., non-representative external factors. Floor and ceiling effects: the problems tested are too hard or too easy. Regression effects: results have no way to go but up or down. Solution approach: run pilot experiments.

  24. Sampling Bias The setting prefers measuring specific values over others. For instance: “random” selection of mice from a cage for an experiment selects for specific values: slow, doesn't bite (not aggressive), …. Or: including only results that were found by some deadline. Solution: detect, and remove; e.g., by visualization, looking for non-normal distributions; e.g., a surprising distribution of the dependent data for different values of the independent variable.
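
  A quick sketch of the detection idea: a text histogram of the dependent variable for each value of the independent variable, using made-up data:

    from collections import Counter

    samples = {
        "condition_A": [12, 13, 13, 14, 14, 14, 15, 15, 16],
        "condition_B": [12, 12, 12, 12, 13, 21, 22, 22, 23],   # suspicious: bimodal
    }

    for condition, values in samples.items():
        print(condition)
        counts = Counter(values)
        for value in sorted(counts):
            print(f"  {value:3d} | " + "#" * counts[value])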

  25. Baselines: Floor and Ceiling Effects How do we know A is good? Bad? Maybe the problems are too simple? Too hard? For example: a new machine learning algorithm has 95% accuracy. Is this good? Controlling for floor/ceiling: establish baselines; show whether a “silly” approach achieves a close result. Comparison to a strawman (easy) or an ironman (hard) may be misleading if not chosen appropriately.
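
  A sketch of the baseline check for the 95%-accuracy example, with a made-up label distribution:

    from collections import Counter

    test_labels = ["ok"] * 930 + ["fault"] * 70     # 93% of test examples are "ok"

    majority_label, majority_count = Counter(test_labels).most_common(1)[0]
    baseline_accuracy = majority_count / len(test_labels)
    print("majority-class baseline accuracy:", baseline_accuracy)   # 0.93
    # A reported 95% is barely above this floor, so it is weaker evidence
    # than the raw number suggests.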

  26. Regression Effects General phenomenon: “regression toward the mean”: repeated measurements converge toward mean values. Example threat: run a program on 100 different inputs; problems 6, 14, 15 get a very low score. We now fix the problem that affected only these inputs, and want to re-test. If chance has anything to do with scoring, then we must re-run all inputs. Why? The scores on 6, 14, 15 have nowhere to go but up, so re-running only these problems will show improvement by chance. Solution: re-run the complete tests, or sample conditions uniformly.
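
  A small simulation of regression toward the mean, with purely synthetic scores:

    import random
    import statistics

    random.seed(1)
    first_run  = [random.gauss(70, 10) for _ in range(100)]
    second_run = [random.gauss(70, 10) for _ in range(100)]

    # The three worst-scoring inputs on the first run (the analogue of
    # "problems 6, 14, 15"):
    worst = sorted(range(100), key=lambda i: first_run[i])[:3]

    print("worst inputs, first run: ", statistics.mean(first_run[i] for i in worst))
    print("same inputs, second run:", statistics.mean(second_run[i] for i in worst))
    # The second number is almost certainly higher, purely by chance.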

  27. Summary Defensive thinking: if I were trying to disprove the claim, what would I do? Then think of ways to counter any possible attack on the claim. Strong Inference, Popper's falsification ideas: science moves by disproving theories (empirically). Experiment design: ideal independent variables are easy to manipulate; ideal dependent variables are measurable, sensitive, and meaningful. Carefully think through threats. Next week: hypothesis testing.
