The Problem of Detecting Differentially Expressed Genes

[Figure: gene expression data matrix with rows Gene 1, Gene 2, ..., Gene N and columns Sample 1, Sample 2, ..., Sample M; the samples are divided into Class 1 and Class 2.]

Fold Change is the Simplest Method
Calculate the log ratio between the two classes for each gene, and consider all genes that differ by more than an arbitrary cutoff value to be differentially expressed. A two-fold difference is often chosen. Fold change is not a statistical test.
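The fold-change rule can be sketched in a few lines. This is a minimal illustration, assuming the data are already on a log2 scale and stored as genes-by-samples arrays; the function name `fold_change_calls` and the toy values are hypothetical, not from the slides.

```python
import numpy as np

def fold_change_calls(class1, class2, cutoff=2.0):
    """Flag genes whose mean expression differs by more than `cutoff`-fold.

    class1, class2: arrays of shape (genes, samples), assumed to hold
    log2 expression values, so a difference of means is a log ratio.
    """
    log_ratio = np.mean(class1, axis=1) - np.mean(class2, axis=1)
    # A two-fold change corresponds to |log2 ratio| > 1.
    return log_ratio, np.abs(log_ratio) > np.log2(cutoff)

# Toy data: gene 1 differs by about 2 log2 units, gene 2 hardly at all.
class1 = np.array([[8.0, 8.2, 7.9], [5.0, 5.1, 4.9]])
class2 = np.array([[6.0, 6.1, 5.8], [5.2, 5.0, 5.1]])
log_ratio, called = fold_change_calls(class1, class2)
```

Note that the rule uses only the class means, so it ignores the variability of each gene, which is why it does not constitute a statistical test.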

Test of a Single Hypothesis
(1) For gene j, consider the null hypothesis of no association between its expression level and its class membership.
(2) Decide on a level of significance (commonly 5%).
(3) Perform a test (e.g. Student’s t-test) for each gene.
(4) Obtain the P-value corresponding to that test statistic.
(5) Compare the P-value with the significance level, then either reject or retain the null hypothesis.

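Two-Sample t-Statistic
Student’s t-statistic for gene j:
$$t_j = \frac{\bar{y}_{1j} - \bar{y}_{2j}}{\sqrt{s_{1j}^2/n_1 + s_{2j}^2/n_2}}$$
where $\bar{y}_{ij}$ and $s_{ij}^2$ are the sample mean and sample variance of gene j in class i, and $n_i$ is the number of samples in class i.

Two-Sample t-Statistic
Pooled form of the Student’s t-statistic, assuming a common variance in the two classes:
$$t_j = \frac{\bar{y}_{1j} - \bar{y}_{2j}}{s_j\sqrt{1/n_1 + 1/n_2}}, \qquad s_j^2 = \frac{(n_1 - 1)s_{1j}^2 + (n_2 - 1)s_{2j}^2}{n_1 + n_2 - 2}$$

Two-Sample t-Statistic
Modified t-statistic of Tusher et al. (2001):
$$t_j = \frac{\bar{y}_{1j} - \bar{y}_{2j}}{s_j\sqrt{1/n_1 + 1/n_2} + a_0}$$
where $a_0$ is a small positive constant that keeps genes with very small variance from producing very large statistics.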
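As a rough sketch, the three statistics can be computed for a single gene as follows. The formulas are assumed to take their usual textbook forms (unpooled, pooled, and pooled with a constant `a0` added to the denominator); the function name and the toy data are illustrative only.

```python
import numpy as np

def t_statistics(y1, y2, a0=0.0):
    """Return (unpooled, pooled, modified) two-sample t-statistics.

    y1, y2: 1-D arrays of expression values for one gene in each class.
    a0: constant added to the pooled standard error (assumed form of
        the modified statistic of Tusher et al.).
    """
    n1, n2 = len(y1), len(y2)
    diff = np.mean(y1) - np.mean(y2)
    v1, v2 = np.var(y1, ddof=1), np.var(y2, ddof=1)

    unpooled = diff / np.sqrt(v1 / n1 + v2 / n2)

    # Pooled variance under the common-variance assumption.
    s2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    se = np.sqrt(s2 * (1 / n1 + 1 / n2))
    pooled = diff / se
    modified = diff / (se + a0)
    return unpooled, pooled, modified

y1 = np.array([1.0, 2.0, 3.0])
y2 = np.array([4.0, 5.0, 6.0])
unpooled, pooled, modified = t_statistics(y1, y2, a0=0.5)
```

With equal class sizes and equal sample variances, as in this toy input, the unpooled and pooled forms coincide, and any positive `a0` shrinks the modified statistic towards zero.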

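Types of Errors in Hypothesis Testing
[Table: predicted outcome against true state. Rejecting a true null hypothesis is a Type I error (false positive); retaining a false null hypothesis is a Type II error (false negative).]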

Multiplicity Problem
When many hypotheses are tested, the probability of at least one false positive increases sharply with the number of hypotheses. Further, genes are co-regulated, and consequently there is correlation between the test statistics.

Example
Suppose we measure the expression of 10,000 genes in a microarray experiment. If none of the 10,000 genes were differentially expressed, then we would expect:
• with P = 0.05 for each test, 500 false positives;
• with P = 0.05/10,000 for each test, 0.05 false positives.
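The arithmetic behind these two figures is simply the number of tests times the per-test threshold. The short sketch below reproduces it; the last line additionally assumes independent tests, which is an assumption made here for illustration and not a claim from the slide.

```python
n_genes = 10_000
alpha = 0.05

# Expected number of false positives when every null hypothesis is true.
expected_fp_unadjusted = n_genes * alpha               # 500
expected_fp_bonferroni = n_genes * (alpha / n_genes)   # 0.05

# Chance of at least one false positive at the unadjusted threshold,
# assuming (hypothetically) that the tests are independent.
prob_at_least_one = 1 - (1 - alpha) ** n_genes
```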

Controlling the Error Rate
• Methods for controlling false positives, e.g. Bonferroni, are too strict for microarray analyses
• Use the False Discovery Rate (FDR) instead (Benjamini and Hochberg, 1995)

Methods for dealing with the Multiplicity Problem
• The Bonferroni Method
  • controls the family-wise error rate (FWER), i.e. the probability that at least one false positive error will be made
  • too strict for gene expression data: it tries to make it unlikely that even one false rejection of the null is made, which may lead to missed findings
• The False Discovery Rate (FDR)
  • emphasizes the proportion of false positives among the identified differentially expressed genes
  • good for gene expression data: it says something about the chosen genes

False Discovery Rate
Benjamini and Hochberg (1995)
The FDR is essentially the expectation of the proportion of false positives among the identified differentially expressed genes.

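Possible Outcomes for N Hypothesis Tests

                   Retain null   Reject null   Total
Null true               A             B        A + B
Non-null true           C             D        C + D
Total                 A + C         B + D        N

where B is the number of false positives, C is the number of false negatives, and B + D is the number of genes declared differentially expressed.

Positive FDR
$$\mathrm{FDR} = E\left[\frac{B}{B + D}\right] \ \text{(with the ratio taken as 0 when } B + D = 0\text{)}, \qquad \mathrm{pFDR} = E\left[\frac{B}{B + D} \,\middle|\, B + D > 0\right]$$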

Lindsay, Kettenring, and Siegmund (2004). A Report on the Future of Statistics. Statist. Sci. 19.

Controlling FDR
Benjamini and Hochberg (1995)
Key papers on controlling the FDR
• Genovese and Wasserman (2002)
• Storey (2002, 2003)
• Storey and Tibshirani (2003a, 2003b)
• Storey, Taylor and Siegmund (2004)
• Black (2004)
• Cox and Wong (2004)

Benjamini-Hochberg (BH) Procedure
Controls the FDR at level $\alpha$ when the P-values following the null distribution are independent and uniformly distributed.
(1) Let $P_{(1)} \le \dots \le P_{(N)}$ be the ordered observed P-values.
(2) Calculate $\hat{k} = \max\{k : P_{(k)} \le k\alpha/N\}$.
(3) If $\hat{k}$ exists, then reject the null hypotheses corresponding to $P_{(1)}, \dots, P_{(\hat{k})}$. Otherwise, reject nothing.

Example: Bonferroni and BH Tests
Suppose that 10 independent hypothesis tests are carried out, leading to the following ordered P-values:
0.00017 0.00448 0.00671 0.00907 0.01220 0.33626 0.39341 0.53882 0.58125 0.98617
(a) With $\alpha = 0.05$, the Bonferroni test rejects any hypothesis whose P-value is less than $\alpha/10 = 0.005$. Thus only the first two hypotheses are rejected.
(b) For the BH test, we find the largest k such that $P_{(k)} \le k\alpha/N$. Here k = 5, since $P_{(5)} = 0.01220 \le 5(0.05)/10 = 0.025$ whereas $P_{(6)} = 0.33626 > 0.03$; thus we reject the first five hypotheses.
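The two rules in this example can be checked numerically. The sketch below is a minimal NumPy version, not a library implementation; the function names are hypothetical, and the BH comparison uses "less than or equal", which gives the same answer as a strict inequality on these data.

```python
import numpy as np

def bonferroni_reject(pvalues, alpha=0.05):
    """Reject hypotheses whose P-value is below alpha / N."""
    p = np.asarray(pvalues, dtype=float)
    return p < alpha / p.size

def bh_reject(pvalues, alpha=0.05):
    """Benjamini-Hochberg step-up rule: reject the k smallest P-values,
    where k is the largest index with P_(k) <= k * alpha / N."""
    p = np.asarray(pvalues, dtype=float)
    n = p.size
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, n + 1) / n
    below = p[order] <= thresholds
    reject = np.zeros(n, dtype=bool)
    if below.any():
        k_hat = np.max(np.nonzero(below)[0])  # 0-based position of largest k
        reject[order[: k_hat + 1]] = True
    return reject

pvals = [0.00017, 0.00448, 0.00671, 0.00907, 0.01220,
         0.33626, 0.39341, 0.53882, 0.58125, 0.98617]
n_bonferroni = int(bonferroni_reject(pvals).sum())
n_bh = int(bh_reject(pvals).sum())
```

On the ten P-values above, `n_bonferroni` is 2 and `n_bh` is 5, matching parts (a) and (b).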

q-VALUE
The q-value of gene j is the expected proportion of false positives incurred when calling that gene significant. The P-value is the probability, under the null hypothesis, of obtaining a value of the test statistic as extreme as or more extreme than its observed value. The q-value for an observed test statistic can be viewed as the expected proportion of false positives among all genes whose test statistics are as extreme as or more extreme than the observed value.

LIST OF SIGNIFICANT GENES
Call all genes significant if $p_j < 0.05$, or call all genes significant if $q_j < 0.05$. The latter produces a set of significant genes of which a proportion less than 0.05 is expected to be false positives (at least for a large number of genes, not necessarily independent).

BRCA1 versus BRCA2-mutation positive tumours (Hedenfalk et al., 2001)
BRCA1 (7 samples) versus BRCA2 (8 samples) mutation-positive tumours, N = 3226 genes.
A P-value threshold of 0.001 gave 51 genes differentially expressed.
A P-value threshold of 0.0001 gave 9-11 genes.

Using q < 0.05, 160 genes are taken to be significant. This means that approximately 8 of these 160 genes (0.05 × 160) are expected to be false positives. Also, it is estimated that 33% of the genes are differentially expressed.

Null Distribution of the Test Statistic
Permutation Method
The null distribution has a resolution on the order of the number of permutations. If we perform B permutations, then the P-value will be estimated with a resolution of 1/B. If we assume that each gene has the same null distribution and combine the permutations, then the resolution will be 1/(NB) for the pooled null distribution.

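Using just the B permutations of the class labels for the gene-specific statistic $W_j$, the P-value for $W_j = w_j$ is assessed as:
$$p_j = \frac{\#\{b : |w_{0j}^{(b)}| \ge |w_j|\}}{B}$$
where $w_{0j}^{(b)}$ is the null version of $w_j$ after the bth permutation of the class labels.

If we pool over all N genes, then:
$$p_j = \sum_{i=1}^{N} \frac{\#\{b : |w_{0i}^{(b)}| \ge |w_j|\}}{NB}$$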
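A small sketch of both P-value estimates follows. It assumes the statistic is the pooled two-sample t-statistic, that the P-value counts null values at least as large in absolute value as the observed one, and that the B permutations are drawn at random instead of being enumerated; the function name and simulated data are illustrative only.

```python
import numpy as np

def permutation_pvalues(x, labels, n_perm=200, seed=0):
    """Gene-specific and pooled permutation P-values.

    x: (genes, samples) expression matrix.
    labels: boolean array, True for Class 1 samples.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels, dtype=bool)

    def pooled_t(data, lab):
        y1, y2 = data[:, lab], data[:, ~lab]
        n1, n2 = y1.shape[1], y2.shape[1]
        s2 = ((n1 - 1) * y1.var(axis=1, ddof=1)
              + (n2 - 1) * y2.var(axis=1, ddof=1)) / (n1 + n2 - 2)
        return (y1.mean(axis=1) - y2.mean(axis=1)) / np.sqrt(s2 * (1 / n1 + 1 / n2))

    w = np.abs(pooled_t(x, labels))
    # Null statistics: one row per permutation, one column per gene.
    null = np.abs(np.array([pooled_t(x, rng.permutation(labels))
                            for _ in range(n_perm)]))
    p_gene = (null >= w).mean(axis=0)                          # resolution 1/B
    p_pooled = (null.ravel()[None, :] >= w[:, None]).mean(axis=1)  # 1/(NB)
    return p_gene, p_pooled

rng = np.random.default_rng(1)
x = rng.normal(size=(20, 15))            # 20 null genes, 15 samples
labels = np.array([True] * 7 + [False] * 8)
p_gene, p_pooled = permutation_pvalues(x, labels, n_perm=50)
```

The gene-specific estimates can only take values that are multiples of 1/B, whereas the pooled estimates are multiples of 1/(NB), which is the gain in resolution described above.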

Null Distribution of the Test Statistic: Example

          Class 1              Class 2
Gene 1    A1(1) A2(1) A3(1)    B4(1) B5(1) B6(1)
Gene 2    A1(2) A2(2) A3(2)    B4(2) B5(2) B6(2)

Suppose we have two classes of tissue samples, with three samples from each class. Consider the expressions of two genes, Gene 1 and Gene 2.

          Class 1              Class 2
Gene 1    A1(1) A2(1) A3(1)    B4(1) B5(1) B6(1)
Gene 2    A1(2) A2(2) A3(2)    B4(2) B5(2) B6(2)

To find the null distribution of the test statistic for Gene 1, we proceed under the assumption that there is no difference between the classes (for Gene 1), so that:

Gene 1    A1(1) A2(1) A3(1)    A4(1) A5(1) A6(1)

and permute the class labels:

Perm. 1   A1(1) A2(1) A4(1)    A3(1) A5(1) A6(1)
...

There are 10 distinct permutations: the 20 ways of choosing three of the six samples for Class 1 come in mirror-image pairs that swap the two classes, and such a swap changes only the sign of the test statistic.

Ten Permutations of Gene 1

A1(1) A2(1) A3(1)    A4(1) A5(1) A6(1)
A1(1) A2(1) A4(1)    A3(1) A5(1) A6(1)
A1(1) A2(1) A5(1)    A3(1) A4(1) A6(1)
A1(1) A2(1) A6(1)    A3(1) A4(1) A5(1)
A1(1) A3(1) A4(1)    A2(1) A5(1) A6(1)
A1(1) A3(1) A5(1)    A2(1) A4(1) A6(1)
A1(1) A3(1) A6(1)    A2(1) A4(1) A5(1)
A1(1) A4(1) A5(1)    A2(1) A3(1) A6(1)
A1(1) A4(1) A6(1)    A2(1) A3(1) A5(1)
A1(1) A5(1) A6(1)    A2(1) A3(1) A4(1)
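The ten splits listed here can be generated mechanically. The sketch below enumerates the ways of choosing three of the six samples for the first group and keeps only those containing A1, so that each mirror-image pair is counted once; the sample names are shortened (the gene index is dropped) for illustration.

```python
from itertools import combinations

samples = ["A1", "A2", "A3", "A4", "A5", "A6"]

splits = []
for group1 in combinations(samples, 3):
    if "A1" not in group1:
        continue  # mirror image of a split that already has A1 in group 1
    group2 = tuple(s for s in samples if s not in group1)
    splits.append((group1, group2))

for group1, group2 in splits:
    print(" ".join(group1), "  ", " ".join(group2))
```

Because `combinations` yields tuples in lexicographic order, the splits come out in the same order as the rows listed above.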

As there are only 10 distinct permutations here, the null distribution based on these permutations is too granular. Hence consideration is given to permuting the labels of each of the other genes and estimating the null distribution of a gene from the pooled permutations so obtained. But there is a problem with this method: the null values of the test statistic for each gene do not necessarily have the theoretical null distribution that we are trying to estimate.

Suppose we were to use Gene 2 as well to estimate the null distribution of Gene 1. If Gene 2 is differentially expressed, then the null values of the test statistic for Gene 2 will have a mixed distribution.

          Class 1              Class 2
Gene 1    A1(1) A2(1) A3(1)    B4(1) B5(1) B6(1)
Gene 2    A1(2) A2(2) A3(2)    B4(2) B5(2) B6(2)

Gene 2    A1(2) A2(2) A3(2)    B4(2) B5(2) B6(2)

Permute the class labels:

Perm. 1   A1(2) A2(2) B4(2)    A3(2) B5(2) B6(2)
...

[Figure] Example of a null case: 7 N(0,1) points and 8 N(0,1) points; histogram of the pooled two-sample t-statistic under 1000 permutations of the class labels, with the t13 density superimposed.

[Figure] Example of a non-null case: 7 N(0,1) points and 8 N(10,9) points; histogram of the pooled two-sample t-statistic under 1000 permutations of the class labels, with the t13 density superimposed.

The SAM Method
Use the permutation method to calculate the null distribution of the modified t-statistic (Tusher et al., 2001). The order statistics t(1), ... , t(N) are plotted against their null expectations. This is a good test in situations where more genes are over-expressed than under-expressed, or vice versa.

The FDR and other error rates
[Table: PREDICTED versus TRUE status of the genes, with margin R and the error rates FDR, FNDR and FNR.]

Toy Example with 10,000 Genes
FDR = B / (B + D) = 475 / 875 = 54%
FNDR = C / (A + C) = 100 / 9125 = 1%
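The two rates can be recomputed from the four cell counts. Only B = 475, C = 100 and the totals 875 and 9125 are quoted in the toy example; the values A = 9025 and D = 400 below are inferred from those totals and are stated here as an assumption.

```python
# Cell counts: A and D are inferred from the quoted totals
# (A + C = 9125 and B + D = 875), not given explicitly.
A = 9025   # true nulls retained
B = 475    # true nulls rejected (false positives)
C = 100    # non-nulls retained (false negatives)
D = 400    # non-nulls rejected (true positives)

fdr = B / (B + D)    # proportion of false positives among rejected genes
fndr = C / (A + C)   # proportion of false negatives among retained genes

print(f"FDR  = {fdr:.0%}")
print(f"FNDR = {fndr:.0%}")
```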

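Two-component mixture model
$$f(w_j) = \pi_0 f_0(w_j) + \pi_1 f_1(w_j)$$
where $\pi_0$ is the proportion of genes that are not differentially expressed, and $\pi_1 = 1 - \pi_0$ is the proportion that are.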

Use of the P-Value as a Summary Statistic (Allison et al., 2002)
Instead of using the pooled form of the t-statistic, we can work with the value $p_j$, which is the P-value associated with $t_j$ in the test of the null hypothesis of no difference in expression between the two classes. The distribution of the P-value is modelled by the h-component mixture model
$$f(p_j) = \sum_{i=1}^{h} \pi_i \, \beta(p_j; a_{i1}, a_{i2}),$$
where $a_{11} = a_{12} = 1$.

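Use of the P-Value as a Summary Statistic
Under the null hypothesis of no difference in expression for the jth gene, $p_j$ will have a uniform distribution on the unit interval, i.e. the $\beta(1, 1)$ distribution. The $\beta(a_1, a_2)$ density is given by
$$\beta(p; a_1, a_2) = \frac{p^{a_1 - 1}(1 - p)^{a_2 - 1}}{B(a_1, a_2)}, \qquad 0 \le p \le 1,$$
where $B(a_1, a_2) = \Gamma(a_1)\Gamma(a_2)/\Gamma(a_1 + a_2)$.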
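The beta density and a mixture of beta components can be written out directly. This is a minimal sketch using the standard beta density; the function names, the mixing weights and the second component's parameters are hypothetical values chosen only to illustrate the form of the model.

```python
from math import gamma

def beta_density(p, a1, a2):
    """Standard beta(a1, a2) density at p in (0, 1)."""
    norm = gamma(a1) * gamma(a2) / gamma(a1 + a2)
    return p ** (a1 - 1) * (1 - p) ** (a2 - 1) / norm

def mixture_density(p, weights, params):
    """Mixture of beta components; weights are assumed to sum to 1."""
    return sum(w * beta_density(p, a1, a2)
               for w, (a1, a2) in zip(weights, params))

# First component is beta(1, 1), i.e. uniform (the null P-values).
# The second component and the weights are made up for illustration.
weights = [0.7, 0.3]
params = [(1, 1), (1, 3)]
value = mixture_density(0.5, weights, params)
```

With a first component of beta(1, 1), the mixture density never drops below the first weight, which reflects the uniform spread of null P-values over the unit interval.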

• Efron B, Tibshirani R, Storey JD, Tusher V (2001) Empirical Bayes analysis of a microarray experiment. JASA 96, 1151-1160.
• Efron B (2004) Large-scale simultaneous hypothesis testing: the choice of a null hypothesis. JASA 99, 96-104.
• Efron B (2004) Selection and Estimation for Large-Scale Simultaneous Inference.
• Efron B (2005) Local False Discovery Rates.
• Efron B (2006) Correlation and Large-Scale Simultaneous Significance Testing.

• McLachlan GJ, Bean RW, Ben-Tovim Jones L, Zhu JX. Using mixture models to detect differentially expressed genes. Australian Journal of Experimental Agriculture 45, 859-866.
• McLachlan GJ, Bean RW, Ben-Tovim Jones L. A simple implementation of a normal mixture approach to differential gene expression in multiclass microarrays. Bioinformatics 26. To appear.

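Two component mixture model
$\pi_0$ is the proportion of genes that are not differentially expressed. The two-component mixture model is:
$$f(w_j) = \pi_0 f_0(w_j) + \pi_1 f_1(w_j)$$
where $f_0$ and $f_1$ are the densities of the statistic $w_j$ for genes that are not, and are, differentially expressed, and $\pi_1 = 1 - \pi_0$.
Using Bayes’ Theorem, we calculate the posterior probability that gene j is not differentially expressed:
$$\tau_0(w_j) = \frac{\pi_0 f_0(w_j)}{f(w_j)}$$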
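The Bayes' Theorem step can be sketched numerically. The code assumes the usual mixture form, with density pi0 * f0 + (1 - pi0) * f1, and uses normal densities for f0 and f1 purely as an illustration; the parameter values (pi0 = 0.9, null N(0,1), non-null N(3,1)) are hypothetical and would in practice be estimated from the data.

```python
import numpy as np

def normal_pdf(w, mean, sd):
    """Density of a normal distribution with the given mean and sd."""
    return np.exp(-0.5 * ((w - mean) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

def f0(w):
    return normal_pdf(w, 0.0, 1.0)   # assumed null density

def f1(w):
    return normal_pdf(w, 3.0, 1.0)   # assumed non-null density

def posterior_null(w, pi0):
    """Posterior probability that a gene with statistic w is null."""
    numerator = pi0 * f0(w)
    return numerator / (numerator + (1 - pi0) * f1(w))

w = np.array([0.0, 1.5, 4.0])
tau0 = posterior_null(w, pi0=0.9)
```

At w = 1.5 the two assumed densities are equal, so the posterior probability there is just the prior proportion 0.9; it rises towards 1 near the null mean and falls towards 0 for large statistics.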
