
Statistical Concepts and Methodologies for Data Analyses

Benilton Carvalho, Computational Biology and Statistics Group, Department of Oncology, University of Cambridge.



  1. Statistical Concepts and Methodologies for Data Analyses Benilton Carvalho Computational Biology and Statistics Group Department of Oncology University of Cambridge

  2. From Random Variables to Hypothesis Testing

  3. Random Variables • Function that associates probability to: • Countable items (discrete random variable); • Tumor vs. Normal; Yes vs. No; Head vs. Tail; • Uncountable items (continuous random variable): • Log-expression; weight; height; • Characterized by a distribution function: • Bernoulli; Binomial; Geometric; Negative-Binomial; Poisson; • Normal; Student’s t; Gamma;

  4. Examples – Discrete Distributions

  5. Examples – Continuous Distributions

  6. Common Uses of Different Distributions • Bernoulli: probability of 1 success in a single trial; • Binomial: probability of K successes in N trials; • Geometric: probability of K failures before 1st success; • Negative-Binomial: probability of K failures before R successes; • Poisson: probability of K rare events;
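The probabilities listed above can be computed from the standard probability mass functions. The following is a minimal Python sketch; the function names and the parameters `p` (per-trial success probability), `n` (number of trials), `r` (number of successes) and `lam` (Poisson rate) are illustrative choices, not taken from the slides.

```python
from math import comb, exp, factorial

def bernoulli_pmf(k, p):
    """P(X = k) for a single trial, with k = 1 (success) or k = 0 (failure)."""
    return p if k == 1 else 1 - p

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def geometric_pmf(k, p):
    """P(exactly k failures before the first success)."""
    return (1 - p)**k * p

def negative_binomial_pmf(k, r, p):
    """P(exactly k failures before the r-th success)."""
    return comb(k + r - 1, k) * (1 - p)**k * p**r

def poisson_pmf(k, lam):
    """P(exactly k events) when events occur at average rate lam."""
    return exp(-lam) * lam**k / factorial(k)
```

Under these definitions the Geometric distribution is the Negative-Binomial with `r = 1`, and the Bernoulli is the Binomial with `n = 1`.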

  7. The Questions • Investigation of populations or groups within a population leads to questions: • How does BRCA1 behave across groups? • Can genotype predict drug response? • Does transcript abundance change as a function of time?

  8. The Experiment • A procedure used to answer the questions; • Comprised of multiple items: • Population; • Sample; • Hypotheses; • Test statistic; • Rejection criteria;

  9. Population • Superset of subjects of interest; • Ideally, every subject in the population is surveyed; • Issues with the “census approach”;

  10. Sample • Select some subjects from the population; • We refer to this subset as the sample; • A subject in a sample can be called a replicate; • Replicate: technical vs. biological;

  11. Hypotheses • Sets that define the “underlying truth”; • Null Hypothesis (H0): default situation. • Cannot be proven; • Reject (in favor of H1) vs. fail to reject; • Alternative Hypothesis (H1): alternative (duh!) • Complements H0 in the parameter space; • Assists in the definition of the rejection criteria.

  12. Examples of Hypotheses – P1 • Comparing expression: Tumor vs. Normal: • H0: Expression on tumor is at most as high as on normal; • H1: Expression on tumor is higher than on normal;

  13. Examples of Hypotheses – P2 • Comparing expression: Tumor vs. Normal: • H0: Expression on tumor is at least as high as on normal; • H1: Expression on tumor is lower than on normal;

  14. Examples of Hypotheses – P3 • Comparing expression: Tumor vs. Normal: • H0: Expressions on tumor and normal are the same; • H1: Expressions on tumor and normal are different;
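Written in symbols, the three pairs of hypotheses could look as follows. The notation is an assumption introduced here for illustration and does not appear on the slides: $\mu_T$ and $\mu_N$ denote the mean expression in tumor and normal tissue.

```latex
\begin{align*}
\text{P1:}\quad & H_0: \mu_T \le \mu_N \quad\text{vs.}\quad H_1: \mu_T > \mu_N \\
\text{P2:}\quad & H_0: \mu_T \ge \mu_N \quad\text{vs.}\quad H_1: \mu_T < \mu_N \\
\text{P3:}\quad & H_0: \mu_T = \mu_N \quad\text{vs.}\quad H_1: \mu_T \ne \mu_N
\end{align*}
```

P1 and P2 lead to one-sided tests; P3 leads to a two-sided test.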

  15. Test Statistic • Summary of the data; • Built “under H0”; • Independent of unknown parameters; • Known distributions; • Compatibility between data and H0;

  16. Test Statistic • What the statistician sees…

  17. Rejection Criteria • Function of three factors: • Test statistic; • Hypotheses; • Type I Error (False Positive), α; • Determines thresholds used to reject H0: • One threshold: one-sided tests; • Two thresholds: two-sided tests; • Defines what is “extreme” for the experiment;
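As a rough illustration of how α and the form of H1 determine the thresholds, the sketch below assumes the test statistic follows a standard Normal distribution under H0. That assumption, the function name `rejection_thresholds` and the labels for the alternatives are illustrative, not from the slides.

```python
from statistics import NormalDist

def rejection_thresholds(alpha, alternative="two-sided"):
    """Return the threshold(s) beyond which H0 is rejected.

    Assumes the test statistic is N(0, 1) under H0.
    alternative: "greater" or "less" (one-sided), or "two-sided".
    """
    z = NormalDist()  # standard Normal: mean 0, sd 1
    if alternative == "greater":
        # Reject when the statistic is above a single upper threshold.
        return (z.inv_cdf(1 - alpha),)
    if alternative == "less":
        # Reject when the statistic is below a single lower threshold.
        return (z.inv_cdf(alpha),)
    if alternative == "two-sided":
        # Split alpha across both tails: two thresholds.
        return (z.inv_cdf(alpha / 2), z.inv_cdf(1 - alpha / 2))
    raise ValueError("alternative must be 'greater', 'less' or 'two-sided'")
```

With α = 0.05 this gives a single threshold near 1.645 for an upper one-sided test and a pair near ±1.96 for a two-sided test.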

  18. Rejection Criteria

  19. From Rejection Criteria to P-value!

  20. Rejection Criteria

  21. From Rejection Criteria to P-value!

  22. Rejection Criteria

  23. From Rejection Criteria to P-value!
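A p-value is the probability, under H0, of a test statistic at least as extreme as the one observed, where "extreme" depends on whether the test is one-sided or two-sided. The sketch below again assumes a standard Normal test statistic under H0; `p_value` and its arguments are hypothetical names used only for illustration.

```python
from statistics import NormalDist

def p_value(z_obs, alternative="two-sided"):
    """P-value for an observed statistic z_obs that is N(0, 1) under H0."""
    cdf = NormalDist().cdf
    if alternative == "greater":
        return 1 - cdf(z_obs)             # upper tail only
    if alternative == "less":
        return cdf(z_obs)                 # lower tail only
    if alternative == "two-sided":
        return 2 * (1 - cdf(abs(z_obs)))  # both tails
    raise ValueError("alternative must be 'greater', 'less' or 'two-sided'")
```

Rejecting H0 when the p-value is below α is equivalent to rejecting when the statistic falls beyond the corresponding threshold.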

  24. Sampling and testing • Box: 10% red balls and 90% blue balls; • Random sample of 10 balls from the box; • Discrete observations: #red = 3; • When do I think that I am not sampling from this box anymore? • How many reds could I expect to get just by chance alone?

  25. Sample • Null hypothesis (about the population that is being sampled): 10% red balls and 90% blue balls; • Random sample of 10 balls from the box; • Discrete observations: #red = 3; • Test statistic; • Rejection criteria (based on your observed sample, do you have evidence to reject the hypothesis that you sampled from the null population?);
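For the box example, the number of red balls in the sample can be treated as Binomial(10, 0.10) under the null hypothesis, assuming the box is large enough that draws are effectively independent. That modeling assumption and the function name below are illustrative. The sketch computes the chance of seeing 3 or more reds by chance alone.

```python
from math import comb

def binomial_upper_tail(k_obs, n, p):
    """P(X >= k_obs) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(k_obs, n + 1))

expected_reds = 10 * 0.10                     # reds expected under H0
p_val = binomial_upper_tail(3, 10, 0.10)      # one-sided p-value for #red = 3
print(expected_reds, round(p_val, 4))         # prints: 1.0 0.0702
```

Under these assumptions, observing 3 reds has a one-sided p-value of about 0.07, so H0 would not be rejected at α = 0.05.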

  26. Sample • Continuous observations: 4, 2.3, 5.2, 4.7, 2.1, 3.5, …; mean = 3, sd = 0.6; • Test statistic; • Rejection criteria (based on your observed sample, do you have evidence to reject the hypothesis that you sampled from the null population?); • Null hypothesis (about the population that is being sampled);

  27. Summary of the Experiment • 1) hypotheses; • 2) sample; • 3) test statistic; • 4) decision;

  28. Useful Facts • The Law of Large Numbers guarantees that, as the sample size grows, the sample average converges to the population mean; • The Normality assumption isn’t that important with a large sample size; • The Central Limit Theorem states that the sample average is asymptotically Normal (for independent observations with finite variance);
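Both facts can be checked with a small simulation. The sketch below draws from an Exponential(1) distribution (mean 1, sd 1), chosen here only because it is clearly non-Normal; the seed, sample sizes and variable names are arbitrary illustrative choices.

```python
import random
from math import sqrt
from statistics import mean, stdev

rng = random.Random(0)  # fixed seed so the run is reproducible

def sample_mean(n):
    """Average of n draws from Exponential(1), whose true mean is 1."""
    return mean(rng.expovariate(1.0) for _ in range(n))

# Law of Large Numbers: the error of the sample average tends to shrink with n.
errors = {n: abs(sample_mean(n) - 1.0) for n in (10, 1_000, 100_000)}

# Central Limit Theorem: averages of n = 50 skewed draws are roughly Normal,
# centered at 1 with standard deviation close to 1 / sqrt(50).
means = [sample_mean(50) for _ in range(2_000)]
print(errors)
print(round(mean(means), 3), round(stdev(means), 3), round(1 / sqrt(50), 3))
```

A histogram of `means` would look approximately bell-shaped even though the individual draws are strongly skewed.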

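  29. Useful Facts • The Z-score depends on the precise knowledge of the variance term: $Z = (\bar{X} - \mu_0) / (\sigma / \sqrt{n}) \sim N(0, 1)$; • Estimating the variance changes the distribution of the test statistic: $T = (\bar{X} - \mu_0) / (S / \sqrt{n}) \sim t_{n-1}$;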
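The difference between the two statistics is only in the denominator: the Z-score uses a known population standard deviation, while the t statistic replaces it with the sample estimate. A minimal sketch, in which `mu0` (the mean under H0), `sigma` (the known sd) and the sample values are all hypothetical:

```python
from math import sqrt
from statistics import mean, stdev

def z_score(sample, mu0, sigma):
    """Z statistic: requires the true standard deviation sigma to be known."""
    return (mean(sample) - mu0) / (sigma / sqrt(len(sample)))

def t_statistic(sample, mu0):
    """t statistic: sigma is replaced by the sample standard deviation."""
    return (mean(sample) - mu0) / (stdev(sample) / sqrt(len(sample)))

sample = [3.4, 2.9, 3.8, 3.1, 2.6, 3.3]  # hypothetical measurements
print(z_score(sample, mu0=3.0, sigma=0.6))
print(t_statistic(sample, mu0=3.0))
```

Under Normal data, the first is compared against the standard Normal distribution and the second against a Student's t distribution with n - 1 degrees of freedom.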

  30. Useful Facts • The Student’s t distribution is similar to the Normal distribution, but has heavier tails; • Larger sample size, more d.f.; • More d.f., closer to Normal;
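The heavier tails can be seen by comparing densities far from zero. The sketch below implements the standard Student's t density with only the Python standard library; the evaluation point x = 3 and the degrees of freedom shown are arbitrary choices for illustration.

```python
from math import exp, lgamma, pi, sqrt

def t_pdf(x, df):
    """Density of Student's t with df degrees of freedom."""
    # lgamma avoids overflow of the gamma function for large df.
    log_c = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * (log_df_pi := 0.0)
    c = exp(lgamma((df + 1) / 2) - lgamma(df / 2)) / sqrt(df * pi)
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def normal_pdf(x):
    """Density of the standard Normal distribution."""
    return exp(-x * x / 2) / sqrt(2 * pi)

for df in (3, 30, 1000):
    print(df, t_pdf(3.0, df))
print("Normal", normal_pdf(3.0))
```

At x = 3 the t density with 3 d.f. is several times the Normal density, and the gap closes as the d.f. increase.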

  31. Multiple Testing • We are doing high-throughput experiments; • Comparing thousands of units simultaneously; • At this scale, we can observe several instances of rare events just by chance: • Event A: 1 in 1000 chance of happening; • Event B: 999 in 1000 chance of happening; • And the experiment is tried 20,000 times; • We expect 20 occurrences of Event A to be observed, although Event B is much more likely;
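The arithmetic behind this example is short enough to verify directly. The function names below are illustrative; the numbers (a 1 in 1000 event, 20,000 tries) are the ones from the slide, and the tries are assumed to be independent.

```python
def expected_occurrences(p, trials):
    """Expected number of times an event with probability p occurs."""
    return p * trials

def prob_at_least_one(p, trials):
    """Probability the event occurs at least once in independent trials."""
    return 1 - (1 - p) ** trials

print(expected_occurrences(1 / 1000, 20_000))  # about 20
print(prob_at_least_one(1 / 1000, 20_000))     # very close to 1
```

So an event that is rare in any single try becomes practically certain to show up somewhere once enough tries are made.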

  32. Multiple Testing • Similar scenario, for example, with differential expression (DE); • Most genes are not differentially expressed; • High-throughput experiments; • Differential expression is tested for 20K genes; • Need to protect against false positives; • Suggestion: use non-specific filtering;
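Non-specific filtering is usually understood as removing genes before testing using a criterion that ignores the group labels, for example overall variance across samples. The sketch below shows that one reading; the variance criterion, the `keep_fraction` parameter and the gene names are assumptions made for illustration, not details given on the slide.

```python
from statistics import variance

def expected_false_positives(alpha, n_tests):
    """Expected false positives if every H0 is true and each test uses alpha."""
    return alpha * n_tests

def nonspecific_filter(expression, keep_fraction=0.5):
    """Keep the most variable genes, without looking at group labels.

    expression: dict mapping gene name -> list of log-expression values
                across all samples (tumor and normal pooled together).
    """
    ranked = sorted(expression, key=lambda g: variance(expression[g]),
                    reverse=True)
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:n_keep]

# Hypothetical toy data: four genes measured on four samples.
toy = {
    "geneA": [5.0, 5.1, 4.9, 5.0],
    "geneB": [3.0, 7.0, 2.0, 8.0],
    "geneC": [6.0, 6.5, 5.5, 6.2],
    "geneD": [1.0, 1.0, 1.0, 1.01],
}
print(expected_false_positives(0.05, 20_000))  # about 1000
print(nonspecific_filter(toy))                 # ['geneB', 'geneC']
```

Fewer tests mean fewer chances for false positives, and because the filter never uses the labels it does not favor either group.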

  33. Data Modeling

  34. What is a model?

  35. Statistical Models • There is no “correct model”; • Models are approximations of the truth; • There is a “useful model”; • Understand the mechanisms of the system for better choices of model alternatives;

  36. Revisiting Microarrays • Scanned images; • Fluorescence intensities; • Proportional to target abundances; • Restricted dynamic range; • Asymmetrical distribution; • Log-Intensities behave better;

  37. Revisiting Microarrays

  38. Intensities

  39. Log-Intensities
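The effect of the log transform on asymmetry can be illustrated with simulated data. The intensities below are generated, not real microarray measurements, and the generating model (2 raised to a Normal variable) and its parameters are assumptions chosen only to mimic a right-skewed intensity distribution.

```python
import math
import random
from statistics import mean, pstdev

def skewness(values):
    """Sample skewness: 0 for symmetric data, positive for a long right tail."""
    m, s = mean(values), pstdev(values)
    return mean(((v - m) / s) ** 3 for v in values)

rng = random.Random(1)  # fixed seed for reproducibility
# Simulated raw intensities with a long right tail.
intensities = [2 ** rng.gauss(8, 1.5) for _ in range(5_000)]
log_intensities = [math.log2(v) for v in intensities]

print(skewness(intensities))      # strongly positive
print(skewness(log_intensities))  # close to 0
```

On the log2 scale the values are roughly symmetric, which is one sense in which log-intensities "behave better".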

  40. Back to Data Modeling: Linear Regression / ANOVA • Nature of the data: continuous; • Linear regression often used; • For subject i, known factors/covariates are candidates to predict log-intensities of a gene: $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$; • Residuals $\varepsilon_i$ expected to be Normal;

  41. Interpreting Coefficients • Statisticians indicate that a parameter is estimated by using a “hat” on top of it: $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$; • Assuming that X = 0 for normal tissue: $\hat{Y}_i = \hat{\beta}_0$; • Assuming that X = 1 for tumor tissue: $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1$;

  42. Interpreting Coefficients • $\hat{\beta}_0$: average log-intensity for normal tissue; • $\hat{\beta}_1$: change in average log-intensity associated with the tumor tissue; • $\hat{\beta}_0 + \hat{\beta}_1$: average log-intensity for tumor tissue;
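This interpretation can be verified numerically: with a 0/1 covariate, the least-squares intercept equals the normal-tissue average and the slope equals the tumor-minus-normal difference. The sketch uses the closed-form simple linear regression estimates; the data values and function name are hypothetical.

```python
from statistics import mean

def fit_simple_regression(x, y):
    """Least-squares estimates (intercept, slope) for y = b0 + b1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical log-intensities of one gene: X = 0 normal, X = 1 tumor.
x = [0, 0, 0, 1, 1, 1]
y = [7.1, 6.9, 7.0, 8.2, 8.0, 8.4]

beta0_hat, beta1_hat = fit_simple_regression(x, y)
print(beta0_hat)              # average of the normal samples
print(beta1_hat)              # tumor average minus normal average
print(beta0_hat + beta1_hat)  # average of the tumor samples
```

Testing whether the slope is zero is therefore the same question as testing whether the two group means differ.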

  43. GLM • Generalized Linear Models; • Generic framework; • Accommodates different types of data; • Special cases: Linear regressions and ANOVAs;

  44. Example – GLM Binomial Family • Responses: yes/no; dead/alive; sick/healthy; • Predictors: Gene expression / genotype / age; • Example: • Response: Cytogenetic abnormalities (Yes/No); • Predictors: Log-expression of probeset 1059_at;
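A binomial-family GLM with the logit link is logistic regression. The sketch below fits one with a single predictor by Newton-Raphson, using only the standard library. It is an illustration under stated assumptions, not the method used in the presentation: the data are invented (they are not measurements of probeset 1059_at), and the function names and iteration count are arbitrary.

```python
import math

def sigmoid(z):
    """Inverse logit: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(x, y, iterations=25):
    """Maximum-likelihood fit of P(y = 1) = sigmoid(b0 + b1 * x)."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = sigmoid(b0 + b1 * xi)
            w = p * (1 - p)
            g0 += yi - p            # gradient of the log-likelihood
            g1 += (yi - p) * xi
            h00 += w                # negative Hessian entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (negative Hessian) * step = gradient.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical data: log-expression and abnormality status (1 = yes, 0 = no).
log_expression = [5.1, 5.8, 6.0, 6.3, 6.9, 7.2, 7.5, 8.0, 8.4, 9.0]
abnormal = [0, 0, 1, 0, 0, 1, 1, 0, 1, 1]

b0, b1 = fit_logistic(log_expression, abnormal)
print(b0, b1)  # a positive b1 means higher expression goes with higher odds
```

In practice a statistics package would be used for this fit; the loop is shown only to make the estimation step concrete.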

  45. Log-Expression vs. Abnormalities

  46. Modeling a Binary Response • Response in the previous example: • Observed cytogenetic abnormalities; • Did not observe cytogenetic abnormalities; • Linear regression does not work:

  47. Modeling a Binary Response • Instead of modeling the actual response, we model the probability of that response; • Linear regression still fails: fitted values can fall outside the range of valid results, [0, 1];

  48. Logistic Regression - Rationale • Probability is restricted to the [0, 1] interval; • Linear regression isn’t; • Need to transform probability;

  49. Logistic Regression - Rationale • Instead of probability, model the odds: $p / (1 - p)$; • Odds range from 0 to Infinity; • A linear regression approach would still fail;

  50. Logistic Regression - Rationale • Instead of odds, model the log-odds: $\log(p / (1 - p))$; • Log-odds range from -Infinity to Infinity; • An approach like linear regression, using the log-odds scale, would work fine;
