
Statistical Methods in Computer Science



Presentation Transcript


  1. Hypothesis Life-cycle • Ido Dagan • Statistical Methods in Computer Science

  2. Why experiment? • W. Tichy, “Should Computer Scientists Experiment More?” (on course web page) • System/Model/theory testing • Identify incorrectness, incompleteness in your “theory”/assumptions • This can save money and lives! • E.g. underlying assumptions that are violated by reality • Can lead to revising model and/or system • Exploration • Find new phenomena • E.g. unknown user behaviors in using systems

  3. Empirical Research Cycle • Established methodology, with very long tradition • Natural sciences, social sciences • Cycle: • Form theory/model • E.g. search engine ranking function • Hypothesize based on theory • E.g. more relevant pages are ranked higher than less relevant ones • Experiment (when possible) • E.g. ask people to judge relevance (binary, score, relative, …) • Observe results • Find discrepancies between hypothesized predictions and results • Revise theory (and publish results) • This course mainly covers the steps from hypothesis to discrepancy • Heavy use of statistics and analytical skills (a bit of art)

  4. Common Practice • Vague idea • No preliminary investigation • No articulation of precise hypothesis • Bad experimental design • No iterations

  5. Lots of Ways to Attack Experimentation • Not general – only applies to the “system/setting under test” • E.g. general claims on user behavior true only for one system • Not forward-looking: motivations and observations based on the past • Lack of representative comparison • Inadequate benchmarks (users are happy with my system…) • Difficult/costly to implement comparisons • Not enabling independent replication of experiments • Real data can be messy – difficult to choose which data to gather • E.g. which aspects of user behavior (speed, satisfaction, success, …)

  6. Experimental Lifecycle (diagram) • Diagram labels: Vague idea; Initial observations; “groping around” experiences; Model/Theory; Data, analysis, interpretation; Hypothesis; Results & final presentation; Experiment • Step 1: Understand the problem, frame the questions, articulate the goals. A problem well-stated is half-solved.

  7. A Systematic Approach • Understand the problem, frame the questions, articulate the goals. A problem well-stated is half-solved. • Be able to answer “why” as well as “what” • E.g. why do people search? Find website? / Find information? • Select metrics that will help answer the questions • E.g. rank of correct website / percentage of relevant pages in top 10 • Identify the parameters that affect behavior • System parameters (e.g. HW config, search speed) • Workload parameters (e.g. user request patterns) • Data parameters (e.g. long/short documents) • Decide which parameters to study (vary in experiment)
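
The two example metrics named on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: relevance judgments are taken to be binary, the function names `precision_at_k` and `rank_of_first_relevant` are hypothetical, and the judgments are made-up data, not results from the course.

```python
def precision_at_k(relevance, k=10):
    """Fraction of the top-k results judged relevant (binary judgments).

    Assumption: if fewer than k results were returned, divide by the
    number actually returned rather than by k.
    """
    top = relevance[:k]
    if not top:
        return 0.0
    return sum(1 for r in top if r) / len(top)


def rank_of_first_relevant(relevance):
    """1-based rank of the first relevant result, or None if there is none."""
    for position, r in enumerate(relevance, start=1):
        if r:
            return position
    return None


# Made-up judgments for one query: True means "judged relevant".
judgments = [False, True, True, False, False, True,
             False, False, False, False, True]

p10 = precision_at_k(judgments, k=10)
first = rank_of_first_relevant(judgments)
print(f"precision@10 = {p10:.2f}, first relevant result at rank {first}")
```

Which of the two is the better metric depends on the "why" question: rank of the correct website fits a find-the-website task, while the share of relevant pages in the top 10 fits a find-information task.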

  8. Experimental Lifecycle (diagram) • Diagram labels: Vague idea; Initial observations; “groping around” experiences; Model/Theory; Data, analysis, interpretation; Hypothesis; Results & final presentation; Experiment • Step 2: Select metrics that will help answer the questions. • Step 3: Identify the parameters that affect behavior.

  9. Behavior Parameters/Variables • Example: software performance • Hardware parameters • CPU model and organization, cache organization, latencies in the system (these will affect running time) • System parameters • Memory availability, usage • CPU running time (sometimes approximated by wall-clock time) • Communication bandwidth, usage • Program characteristics • Requires floating-point, heavy disk usage, integer math, graphics

  10. Now build a model (theory) • Mathematically precise • Memory = 2*sizeof(input) + 3 • Runtime = 500 + 30*sizeof(input) + 20 • Asymptotically correct • Memory = O(sizeof(input)) in worst case • Runtime = O(log(sizeof(input))) in best case • Accuracy is proportional to run-time • Qualitative • User performance is increased with reduced cognitive load • Number of bugs discovered is monotonically decreasing if the same programmer is used, otherwise it increases
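
A mathematically precise model such as Memory = 2*sizeof(input) + 3 can be checked directly against measurements. The sketch below assumes ordinary least squares is an acceptable way to compare a linear model with data. The helper names (`fit_line`, `model_memory`) and the measurements are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def model_memory(input_size):
    """The slide's example model: Memory = 2*sizeof(input) + 3."""
    return 2 * input_size + 3


# Made-up measurements (input size, observed memory use), not real data.
sizes = [10, 20, 40, 80, 160]
measured = [24, 42, 85, 161, 325]

slope, intercept = fit_line(sizes, measured)
residuals = [m - model_memory(s) for s, m in zip(sizes, measured)]

print(f"fitted: memory = {slope:.3f} * size + {intercept:.3f}")
print("residuals against the model:", residuals)
```

A fitted slope and intercept close to the model's coefficients, with small residuals that show no trend, are consistent with the model. A systematic drift in the residuals would be the kind of discrepancy that leads to revising it.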

  11. Now form hypothesis • Translate qualitative into quantitative • Use of new system will (these are different hypotheses): • Increase operator accuracy (compared to not using it) by X • Decrease failures by Y • Decrease performance time by Z • Introducing link information to relevance score will increase ranking quality by 10% • ...... • Operationalize the hypothesis
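
One way to operationalize the "increase ranking quality by 10%" hypothesis is to fix a metric and a threshold before running the experiment. The sketch below assumes "by 10%" means a relative improvement in a single quality score. The function names and both scores are hypothetical.

```python
def relative_improvement(baseline_score, new_score):
    """Relative change of new_score over baseline_score (0.10 means +10%)."""
    if baseline_score == 0:
        raise ValueError("baseline score must be non-zero")
    return (new_score - baseline_score) / baseline_score


def hypothesis_supported(baseline_score, new_score, threshold=0.10):
    """True if the observed relative improvement reaches the threshold."""
    return relative_improvement(baseline_score, new_score) >= threshold


# Made-up ranking-quality scores (for example, mean precision in the top 10).
baseline_quality = 0.50     # ranking without link information
with_links_quality = 0.56   # ranking with link information

gain = relative_improvement(baseline_quality, with_links_quality)
supported = hypothesis_supported(baseline_quality, with_links_quality)
print(f"relative improvement: {gain:.1%}, hypothesis supported: {supported}")
```

This only compares two point estimates. Whether the observed gain is statistically significant is a separate question that needs repeated trials.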

  12. What can go wrong at this stage? • Wrong metrics (they don’t address the questions at hand) • E.g. ad click-through, rather than purchase • Bad metrics: too difficult to measure, too costly • Overlooking significant parameters that affect the system • Not clear about where the “system under test” boundaries are • E.g. poor ad content rather than poor ad matching • Unrepresentative test setting, not predictive of real usage • Just what everyone else uses (adopted blindly) • NOT what anyone else uses (no comparison possible)

  13. Experimental Lifecycle (diagram) • Diagram labels: Vague idea; Initial observations; “groping around” experiences; Model/Theory; Data, analysis, interpretation; Hypothesis; Results & final presentation; Experiment • Steps: Decide which parameters to vary; Select technique; Select measurements

  14. A Systematic Approach • Decide which parameters to study (vary) • Select measurement technique: • Can we directly measure what we want? • Intrusive (invasive) versus unobtrusive measurement • How invasive? Can we quantify interference of monitoring? • E.g. should the user mark relevance, or do we just follow clicks? • Simulation – how detailed? Validated against what? • Benchmarks • Repeatability • Experiment design • Lesion studies / ablation tests (with and without component) • Iron-man (e.g. human performance), straw-man • Baseline, ceilings and floors • Factorial design
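
A full factorial design and an ablation comparison can both be generated mechanically once the factors are listed. The sketch below is illustrative only: the factor names and levels are hypothetical choices loosely based on examples elsewhere in the slides, and `full_factorial` and `ablation_pairs` are made-up helper names.

```python
from itertools import product

# Hypothetical factors and levels to vary in the experiment.
factors = {
    "use_link_info": [False, True],              # component to ablate
    "document_length": ["short", "long"],        # data parameter
    "user_age_group": ["18-30", "31-50", "51+"],  # workload parameter
}


def full_factorial(factors):
    """Every combination of factor levels, one dict per experimental condition."""
    names = list(factors)
    level_lists = [factors[name] for name in names]
    return [dict(zip(names, levels)) for levels in product(*level_lists)]


def ablation_pairs(conditions, component):
    """Pair each condition that has the component with its match without it."""
    def rest(condition):
        # Everything except the ablated component identifies the matching pair.
        return tuple(sorted((k, v) for k, v in condition.items() if k != component))

    without = {rest(c): c for c in conditions if not c[component]}
    return [(without[rest(c)], c) for c in conditions if c[component]]


conditions = full_factorial(factors)
pairs = ablation_pairs(conditions, "use_link_info")

print(len(conditions), "conditions,", len(pairs), "ablation pairs")
for off, on in pairs[:2]:
    print("without:", off, "| with:", on)
```

The number of conditions is the product of the level counts (here 2 * 2 * 3), which is why the number of factors and levels that can be studied is limited in practice.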

  15. Experimental Lifecycle (diagram) • Diagram labels: Vague idea; Initial observations; “groping around” experiences; Hypothesis; Data, analysis, interpretation; Model; Results & final presentation; Experiment • Steps: Run experiments; Analyze and interpret data; Data presentation

  16. A Systematic Approach • Run experiments • How many trials? • How many combinations of parameter settings? (e.g. user age groups) • Practically limited • Analyze and interpret data • Descriptive statistics • Dealing with variability, outliers • Hypothesis testing: sample vs. population • Potentially infinite population (e.g. software runs) • Claims about variable values in the population, based on sample variables • Statistical significance • Data presentation
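
The analysis steps on this slide can be sketched with the Python standard library. The example computes descriptive statistics for two samples of run times and then applies a two-sided permutation test. That test is one possible significance test, chosen here because it needs few assumptions, and is not necessarily the method used in the course. The data and the function names are made up.

```python
import random
import statistics


def describe(sample):
    """Basic descriptive statistics for one sample."""
    return {
        "n": len(sample),
        "mean": statistics.fmean(sample),
        "median": statistics.median(sample),
        "stdev": statistics.stdev(sample),
    }


def permutation_test(sample_a, sample_b, n_permutations=5000, seed=0):
    """Two-sided permutation test on the difference of means.

    Returns an estimated p-value: the share of random relabelings whose
    mean difference is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(statistics.fmean(sample_a) - statistics.fmean(sample_b))
    pooled = list(sample_a) + list(sample_b)
    n_a = len(sample_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Made-up run times in seconds for two system variants (not real measurements).
baseline = [12.1, 11.8, 12.5, 12.0, 12.3, 11.9, 12.4, 12.2]
new_system = [11.2, 11.0, 11.5, 10.9, 11.3, 11.1, 11.4, 11.6]

print("baseline:  ", describe(baseline))
print("new system:", describe(new_system))
p_value = permutation_test(baseline, new_system)
print("estimated p-value:", p_value)
```

The samples stand in for a potentially infinite population of software runs. A small p-value suggests that the difference in sample means is unlikely to come from random variation alone.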

  17. A Systematic Approach • Run experiments • How many trials? • How many combinations of parameter settings? • Sensitivity analysis on other parameter values • Analyze and interpret data • Statistics, dealing with variability, outliers • Data presentation • Where does it lead us next? • New hypotheses, new questions, a new round of experiments

  18. Experimental Lifecycle (diagram) • Diagram labels: Vague idea; Initial observations; “groping around” experiences; Model/Theory; Data, analysis, interpretation; Hypothesis; Results & final presentation; Experiment
