Suppose we conduct a t-test of the difference between two means and obtain a p-value < .05. Does this mean:
• There is less than a 5% chance that the results are due to chance.
• If there really is no difference between the population means, there is less than a 5% chance of obtaining a difference this large or larger.
• There is a 95% chance that if the study is repeated, the result will be replicated.
• There is a 95% chance that there is a real difference between the two population means.
Adapted from: Wulff HR, Andersen B, Brandenhoff P, Guttler F (1987): What do doctors know about statistics? Statistics in Medicine 6:3-10
What is a p-value?
The probability of obtaining a test statistic (data) that departs as much as or more than the observed test statistic (data) if the null hypothesis were true.
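The definition above can be made concrete with a small sketch. It assumes a null hypothesis under which the group labels are exchangeable, so every relabeling of the pooled data is equally likely, and it uses the absolute difference in means as the test statistic. The function name permutation_p_value and the toy data are illustrative assumptions, not part of the original slides.

```python
from itertools import combinations

def permutation_p_value(x, y):
    """Exact two-sided permutation p-value for a difference in means.

    Enumerates every way of assigning len(x) of the pooled values to the
    first group and counts how often the absolute difference in means is
    at least as large as the observed one.
    """
    pooled = list(x) + list(y)
    n_x = len(x)
    n_y = len(pooled) - n_x
    total = sum(pooled)
    observed = abs(sum(x) / n_x - sum(y) / n_y)

    extreme = 0
    count = 0
    for idx in combinations(range(len(pooled)), n_x):
        sum_x = sum(pooled[i] for i in idx)
        diff = abs(sum_x / n_x - (total - sum_x) / n_y)
        # Small tolerance so floating-point ties count as "as extreme".
        if diff >= observed - 1e-12:
            extreme += 1
        count += 1
    return extreme / count

# Toy data (hypothetical): two groups of three observations each.
p = permutation_p_value([1, 2, 3], [4, 5, 6])
print(p)
```

Here the p-value is the share of relabelings that depart from the null at least as much as the observed data, which is a statement about the data given the null, not about the probability that the null is true.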
Which Null Hypotheses are Meaningful and Testable?
Those that precisely specify a probability model for the data.
A Perspective
• We study: Samples (Data)
• We wish to obtain knowledge about: Populations (Nature)
Gene Family-Based Hypothesis Testing
Sketch of Typical (outmoded and inappropriate) Approach:
• For Genes 1 to K, define a vector, R, of length K that contains the values of a categorical variable denoting group membership.
• For Genes 1 to K, define a vector, C, of length K that contains the values of a binary variable denoting whether or not the gene was ‘significant’ or ‘interesting’ by some standard.
• Conduct some frequentist significance test for an association between R and C.
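To make the critique easier to follow, the typical approach sketched above can be written out as a minimal example. It assumes R is reduced to a binary indicator (in the family or not) and uses a one-sided Fisher exact test as one possible choice of "some frequentist significance test"; the function name enrichment_p_value and the toy vectors are hypothetical. The test treats genes as independent units, which is one of the assumptions questioned later in the deck.

```python
from math import comb

def enrichment_p_value(in_family, significant):
    """One-sided Fisher exact p-value for over-representation.

    in_family:   binary vector R, 1 if the gene belongs to the family.
    significant: binary vector C, 1 if the gene was flagged 'significant'.
    Returns P(overlap >= observed overlap) under the hypergeometric null.
    """
    K = len(in_family)            # total number of genes
    m = sum(in_family)            # genes in the family
    n = sum(significant)          # genes flagged significant
    k = sum(1 for r, c in zip(in_family, significant) if r and c)

    denom = comb(K, n)
    upper = min(m, n)
    tail = sum(comb(m, i) * comb(K - m, n - i) for i in range(k, upper + 1))
    return tail / denom

# Toy data (hypothetical): 10 genes, 4 in the family, 3 flagged significant.
R = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
C = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
print(enrichment_p_value(R, C))
```

Note that C discards everything about each gene except whether it crossed a threshold, so the test cannot distinguish a gene that barely missed the cutoff from one with no signal at all.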
Gene Family-Based Hypothesis Testing
Which Null Hypothesis is Being Tested?
• None of the genes in family c are differentially expressed (associated, methylated, etc.).
• The proportion of genes in family c that are differentially expressed is equal to the proportion of genes in the remainder of the genome that are differentially expressed (beware of ‘anti-Bayesian’ element).
• The proportion of genes in family c that are differentially expressed to an extent greater than is equal to the proportion of genes in the remainder of the genome that are differentially expressed.
Note: These can all be subsumed under the general:
H0:
Union-Intersection vs Intersection-Union Tests
Union-Intersection
• The compound hypothesis is rejected if any one of the individual hypotheses is rejected.
• A multiplicity adjustment procedure is required to control the type I error rate.
• The rejection region for this test is the union of the rejection regions corresponding to the individual tests.
Intersection-Union
• The compound hypothesis is rejected only if all of the individual hypotheses are rejected.
• The overall type I error rate of α is maintained without multiplicity adjustment.
• The rejection region for this test is the intersection of the rejection regions corresponding to the individual tests.
When P << N, methods are well established (e.g., multiple regression). When P >> N, optimal methods are not yet clear and methods are not yet well established. Bayesian methods involving posterior probabilities in place of p-values may be especially useful.
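The two decision rules can be contrasted in a few lines. This is a sketch under stated assumptions: the individual tests are summarized by their p-values, and Bonferroni is used as the multiplicity adjustment for the union-intersection rule (one simple choice among many, not something the slides specify). The function names are hypothetical.

```python
def union_intersection_reject(p_values, alpha=0.05):
    """Reject the compound null if ANY individual test rejects.

    Bonferroni adjustment (an assumed choice): each individual test is
    run at level alpha / k, so it suffices to check the smallest p-value.
    """
    return min(p_values) <= alpha / len(p_values)

def intersection_union_reject(p_values, alpha=0.05):
    """Reject the compound null only if ALL individual tests reject.

    Each individual test is run at the unadjusted level alpha, so it
    suffices to check the largest p-value.
    """
    return max(p_values) <= alpha

# Toy p-values (hypothetical): one strong signal among three tests.
p = [0.001, 0.20, 0.60]
print(union_intersection_reject(p))   # True: the smallest p-value passes alpha / 3
print(intersection_union_reject(p))   # False: not every test rejects
```

The example shows why the two constructions answer different questions: a single strong component is enough for the union-intersection rule, while the intersection-union rule requires evidence in every component.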
What assumptions are being made?
• Normality?
• Exchangeability?
• Independence?
• Other?
• Non-Parametric: Non-Panacea (Cohen, J.)
• Asymptotic vs Exact
Major Issues to Ask About in Selecting a Method for Gene Family or Pathway Testing
• What is the null?
• Does the method assume that all components (e.g., SNPs or gene expression levels) are independent?
• Is the method ‘anti-Bayesian’?
• Does the method use the continuous information in the data (not simply whether each component is significant or not)?