Differentially expressed genes

Differentially expressed genes 09/19/07

Identify differentially expressed genes

Fold Change
• Based on the expression index, select genes with a high fold change (e.g., R/G > 3).
• Advantages: it is intuitive, and a larger fold change may indicate greater biological impact.
• Drawback: reliable estimates are difficult to get.
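The selection rule above can be sketched in a few lines. This is only an illustration: the intensities are made up, the `floor` parameter is an assumption added to avoid taking the log of zero, and the rule is applied symmetrically (R/G > 3 or G/R > 3), which the slide does not state explicitly.

```python
import math

def log2_fold_change(red, green, floor=1.0):
    """Log2 ratio of two-channel intensities; values are floored to avoid log(0)."""
    return math.log2(max(red, floor) / max(green, floor))

def select_by_fold_change(genes, threshold=3.0):
    """Return names of genes whose R/G ratio exceeds `threshold` in either direction."""
    cut = math.log2(threshold)
    return [name for name, (r, g) in genes.items()
            if abs(log2_fold_change(r, g)) > cut]

# Hypothetical (R, G) intensities, for illustration only.
genes = {
    "geneA": (900.0, 200.0),   # ratio 4.5, selected
    "geneB": (150.0, 140.0),   # ratio about 1.07, not selected
    "geneC": (30.0, 120.0),    # ratio 0.25 (4-fold down), selected
}
selected = select_by_fold_change(genes)
```

Working on the log scale treats up- and down-regulation symmetrically; it does nothing about the noise problem at low intensity.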

Fold change is noisy
[Figure: scatter plot of log-transformed expression in replicate #1 vs. replicate #2.]
Noise is very high at low intensity.

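SAM
• Significance Analysis of Microarrays (SAM) considers a signal-to-noise ratio, $d(i) = \frac{\bar{x}_I(i) - \bar{x}_U(i)}{s(i) + s_0}$, where $\bar{x}_I(i)$ and $\bar{x}_U(i)$ are the mean expression levels of gene i in conditions I and U, $s(i)$ is the standard deviation of the repeated measurements, and $s_0$ is a small positive constant.
• d(i) can be large if either the signal is large or the noise is low. Therefore, it is different from fold change.
• Genes are ranked by d(i). The top candidate genes correspond to the most positive or most negative d(i). (Tusher et al. 2001)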
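A minimal sketch of the SAM statistic for a single gene, following the definition of the gene-specific scatter in Tusher et al. (2001). The default `s0=0.1` is an arbitrary placeholder; SAM itself chooses this constant from the data, which is not shown here. The example values are made up.

```python
import math

def sam_d(group_i, group_u, s0=0.1):
    """Relative difference d = (mean_I - mean_U) / (s + s0) for one gene."""
    n1, n2 = len(group_i), len(group_u)
    mean_i = sum(group_i) / n1
    mean_u = sum(group_u) / n2
    # Pooled sum of squared deviations over both conditions.
    ss = (sum((x - mean_i) ** 2 for x in group_i)
          + sum((x - mean_u) ** 2 for x in group_u))
    # Gene-specific scatter (standard error of the mean difference).
    s = math.sqrt((1.0 / n1 + 1.0 / n2) * ss / (n1 + n2 - 2))
    return (mean_i - mean_u) / (s + s0)

# Same mean difference (2.0), different noise levels.
d_quiet = sam_d([5.0, 5.2, 4.8], [3.0, 3.1, 2.9])
d_noisy = sam_d([6.0, 5.0, 4.0], [4.0, 3.0, 2.0])
```

Both genes have the same fold change on this scale, but the quieter gene gets a much larger d, which is the sense in which d(i) differs from fold change.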

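Permutation test
[Figure: arrays 1-6, originally labeled I (arrays 1-3) and U (arrays 4-6), are reordered as 2, 5, 3, 4, 6, 1 and relabeled “I” (first three) and “U” (last three).]
If a gene is expressed at the same level in the I and U conditions, then relabeling the arrays should not systematically change the value of d.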

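SAM
• To test for statistical significance, the array labels are randomly permuted.
• For each permutation p, compute the statistics and rank them to obtain $d_p(i)$, the i-th ranked value.
• Calculate the expected value $d_E(i) = \frac{1}{P}\sum_{p=1}^{P} d_p(i)$, where P is the number of permutations.
• The idea is that for truly differentially expressed genes, d(i) should be larger in magnitude than $d_E(i)$.
• Select those d(i) that differ from $d_E(i)$ by more than a threshold level Δ.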
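The permutation step can be sketched as follows. This is a simplified illustration, not the SAM software: it uses random label shuffles, a fixed placeholder `s0`, and the plain rule |d(i) − d_E(i)| > Δ from the slide, whereas the published method derives asymmetric cutoffs from Δ. All function names and the simulated data are hypothetical.

```python
import math
import random

def d_stat(values, labels, s0=0.1):
    """SAM-style relative difference for one gene; s0 is an assumed constant."""
    g1 = [v for v, lab in zip(values, labels) if lab == "I"]
    g2 = [v for v, lab in zip(values, labels) if lab == "U"]
    n1, n2 = len(g1), len(g2)
    m1, m2 = sum(g1) / n1, sum(g2) / n2
    ss = sum((v - m1) ** 2 for v in g1) + sum((v - m2) ** 2 for v in g2)
    s = math.sqrt((1.0 / n1 + 1.0 / n2) * ss / (n1 + n2 - 2))
    return (m1 - m2) / (s + s0)

def sam_expected(data, labels, n_perm=200, seed=0):
    """Return (ranked observed d, expected d_E), both in ascending order."""
    rng = random.Random(seed)
    observed = sorted(d_stat(row, labels) for row in data)
    totals = [0.0] * len(data)
    perm = list(labels)
    for _ in range(n_perm):
        rng.shuffle(perm)                      # relabel the arrays
        ranked = sorted(d_stat(row, perm) for row in data)
        for k, d in enumerate(ranked):
            totals[k] += d                     # accumulate the k-th ranked value
    expected = [t / n_perm for t in totals]
    return observed, expected

def ranks_beyond_threshold(observed, expected, delta):
    """Ranks (not gene indices) where observed and expected differ by more than delta."""
    return [k for k, (d, e) in enumerate(zip(observed, expected)) if abs(d - e) > delta]

# Simulated data: 20 genes x 6 arrays of noise, with gene 0 made strongly differential.
rng = random.Random(1)
labels = ["I", "I", "I", "U", "U", "U"]
data = [[rng.gauss(0.0, 1.0) for _ in labels] for _ in range(20)]
data[0] = [8.0, 8.3, 7.9, 0.1, -0.2, 0.0]
observed, expected = sam_expected(data, labels)
flagged = ranks_beyond_threshold(observed, expected, delta=5.0)
```

In this toy data the top-ranked observed value lies far above its expected counterpart, so the highest rank is flagged.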

Statistical hypothesis testing
• Null hypothesis H0: there is no association between the expression levels and the sample groups.
• Alternative hypothesis H1: there is an association.
• Differentially expressed genes correspond to rejection of the null hypothesis.
• Genes are selected regardless of fold change.

Single Hypothesis Testing
• Calculate the value of a test statistic.
• IF the value is very unlikely under the null hypothesis H0, THEN
• H0 is rejected and H1 is accepted.
• The gene is called differentially expressed.
• ELSE
• H0 is not rejected.
• The gene is not called differentially expressed.

Rejection Region
[Figure: density vs. t-value.]

Two types of errors
[Figure: density vs. t-value.]

p-value
The p-value is the probability, under H0, of obtaining a test statistic at least as extreme as the one observed. It is also the smallest significance level at which H0 would be rejected.

Choice of test statistic
Standard t-test
• Assume that the $y_{ij}$ are Gaussian distributed; then, under H0, $t_i$ follows Student's t-distribution.
• A p-value is calculated from the t-distribution with 2n − 2 degrees of freedom.
Issues:
• When n is small, the denominator is an unreliable estimate of the variance.
• The assumption that the $y_{ij}$ are Gaussian is often violated in real data.
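As an illustration, here is a sketch of the t-statistic for one gene, assuming the usual pooled-variance two-sample form, which matches the 2n − 2 degrees of freedom quoted above when both groups have n arrays. The function name and the example values are hypothetical.

```python
import math
from statistics import mean, variance

def pooled_t(group1, group2):
    """Two-sample t-statistic with pooled variance; returns (t, degrees of freedom)."""
    n1, n2 = len(group1), len(group2)
    df = n1 + n2 - 2
    # Pooled estimate of the common variance (statistics.variance is the sample variance).
    pooled_var = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / df
    # Denominator: estimated standard error of the difference in means.
    se = math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    return (mean(group1) - mean(group2)) / se, df

t_value, df = pooled_t([5.0, 5.2, 4.8], [3.0, 3.1, 2.9])
```

With n = 3 per group the variance estimate rests on only 4 degrees of freedom, which is why the denominator is unreliable for small n. Turning `t_value` into a p-value needs the t-distribution's tail probability, which the standard library does not provide.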

Variance shrinkage
• Basic idea: the variances of different genes should be correlated. If the data are noisy, they are likely to be noisy everywhere. Thus one can use the information from other genes to estimate the variance of a given gene.

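Variance shrinkage (Smyth 2004)
Assume $s_i^2 \mid \sigma_i^2 \sim \frac{\sigma_i^2}{d_i}\chi^2_{d_i}$ and $\frac{1}{\sigma_i^2} \sim \frac{1}{d_0 s_0^2}\chi^2_{d_0}$, where $d_0$ and $s_0^2$ correspond to the pooled data.
Then the posterior estimate of the variance is $\tilde{s}_i^2 = \frac{d_0 s_0^2 + d_i s_i^2}{d_0 + d_i}$.
Modify the t-statistic by replacing $s_i^2$ with $\tilde{s}_i^2$. The new statistic obeys a t-distribution with $d_0 + d_i$ degrees of freedom.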
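A small numeric sketch of the shrinkage step, assuming the prior parameters d0 and s0² are already known (Smyth's method estimates them from all genes, which is not shown). It also assumes two groups of equal size, so the standard-error factor is 2/n. Function names and numbers are illustrative only.

```python
import math

def shrunken_variance(s2, d, s0_sq, d0):
    """Weighted average of the gene's own variance and the prior variance."""
    return (d0 * s0_sq + d * s2) / (d0 + d)

def moderated_t(mean_diff, s2, d, s0_sq, d0, n_per_group):
    """Moderated t-statistic and its degrees of freedom (d0 + d)."""
    s2_tilde = shrunken_variance(s2, d, s0_sq, d0)
    t = mean_diff / math.sqrt(s2_tilde * 2.0 / n_per_group)
    return t, d0 + d

# A gene with a small mean difference and an accidentally tiny variance.
mean_diff, s2, d, n = 0.2, 0.001, 4, 3
ordinary_t = mean_diff / math.sqrt(s2 * 2.0 / n)
mod_t, mod_df = moderated_t(mean_diff, s2, d, s0_sq=0.05, d0=4, n_per_group=n)
```

The ordinary t-statistic is inflated by the tiny variance estimate, while the moderated version pulls the variance toward the prior value and gives a far more modest statistic.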

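Permutation test
[Figure: arrays 1-6, originally labeled normal (arrays 1-3) and cancer (arrays 4-6), are reordered as 2, 5, 3, 4, 6, 1 and relabeled “normal” (first three) and “cancer” (last three).]
If H0 is correct, then relabeling the arrays will not affect the distribution of the test statistic.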

Permutation p-value
• Permutation test: for the b-th permutation, b = 1, …, B:
• Permute the n columns (array labels) of the data matrix X.
• Compute the test statistics $t_{1,b}, \dots, t_{m,b}$, one for each hypothesis $H_i$ (that the i-th gene is not differentially expressed), i = 1, …, m.
• The permutation distribution of the test statistic $T_i$ for hypothesis $H_i$ is given by $t_{i,1}, \dots, t_{i,B}$.
• For two-sided alternative hypotheses, the permutation p-value for hypothesis $H_i$ is $p_i = \frac{1}{B}\sum_{b=1}^{B} I(|t_{i,b}| \ge |t_i|)$, where $t_i$ is the observed statistic and I(·) is the indicator function, equal to 1 if the condition in parentheses is true and 0 otherwise.
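The procedure above can be sketched as follows, using a pooled-variance t-statistic as the test statistic (an assumption; any statistic could be substituted). Labels are shuffled at random rather than enumerated, the data are made up, and the sketch does not guard against a group with zero variance.

```python
import math
import random
from statistics import mean, variance

def t_stat(values, labels):
    """Pooled-variance two-sample t-statistic for one gene (labels are 0 or 1)."""
    g1 = [v for v, lab in zip(values, labels) if lab == 0]
    g2 = [v for v, lab in zip(values, labels) if lab == 1]
    n1, n2 = len(g1), len(g2)
    pooled = ((n1 - 1) * variance(g1) + (n2 - 1) * variance(g2)) / (n1 + n2 - 2)
    return (mean(g1) - mean(g2)) / math.sqrt(pooled * (1.0 / n1 + 1.0 / n2))

def permutation_p_values(data, labels, n_perm=1000, seed=0):
    """Two-sided permutation p-value for each row (gene) of `data`."""
    rng = random.Random(seed)
    observed = [t_stat(row, labels) for row in data]
    counts = [0] * len(data)
    perm = list(labels)
    for _ in range(n_perm):
        rng.shuffle(perm)                          # permute the array labels
        for i, row in enumerate(data):
            if abs(t_stat(row, perm)) >= abs(observed[i]):
                counts[i] += 1                     # indicator I(|t_ib| >= |t_i|)
    return [c / n_perm for c in counts]

labels = [0, 0, 0, 0, 1, 1, 1, 1]
data = [
    [5.1, 4.9, 5.3, 5.0, 1.0, 1.2, 0.9, 1.1],   # clearly different between groups
    [2.0, 3.1, 2.4, 2.9, 2.5, 2.2, 3.0, 2.6],   # no visible group difference
]
p_values = permutation_p_values(data, labels)
```

With 4 arrays per group there are only 70 distinct relabelings, so the smallest attainable two-sided p-value is about 2/70; small experiments put a floor on permutation p-values.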

Permutation p-value
[Figure: permutation distribution compared with the scaled t-distribution; cases labeled “H0 is rejected” and “H0 is correct”.]

Multiple hypothesis testing
• Microarray experiments measure the expression levels of thousands of genes.
• The hypothesis testing procedure is applied once for each gene.
• A large number of false positives may result.
Example: with a cutoff of p = 0.05 for 6000 genes, about 6000 × 0.05 = 300 genes are falsely rejected. If the number of real targets is ~100, then most rejected genes are false targets.
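The arithmetic in the example can be checked with a quick simulation. The sketch assumes every one of the 6000 genes is truly null, so that p-values are uniform, and that all 100 real targets would be detected; both are simplifications.

```python
import random

def count_false_positives(m=6000, alpha=0.05, seed=0):
    """Draw m null p-values (uniform on [0, 1)) and count those below alpha."""
    rng = random.Random(seed)
    return sum(rng.random() < alpha for _ in range(m))

simulated_false = count_false_positives()       # close to 300 by chance
expected_false = 6000 * 0.05                     # 300 expected false rejections
real_targets = 100
# Fraction of rejected genes that are false, if every real target is found.
fraction_false = expected_false / (expected_false + real_targets)
```

Under these assumptions about three quarters of the rejected genes are false targets.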

Bonferroni correction
• Let m be the total number of tests. Reject each hypothesis at level α/m instead of α.
• Strong control of the family-wise error rate (FWER).
• Often too conservative.

Adjusted p-value
• The adjusted p-value for a single hypothesis $H_j$ is the nominal level of the entire test procedure at which $H_j$ would just be rejected, given the values of all test statistics involved.
• Example: $p_i$ = 0.001. If rejecting all hypotheses with cutoff p ≤ $p_i$ leads to FDR = 0.2, then the adjusted p-value is 0.2.
• The adjusted p-value depends on the specific test procedure.

Adjusted p-value
The adjusted p-value for the Bonferroni correction is $\tilde{p}_i = \min(m\,p_i, 1)$.
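For the Bonferroni correction the adjustment is a one-liner: multiply each p-value by the number of tests m and cap the result at 1. The sketch below uses made-up p-values.

```python
def bonferroni_adjust(p_values):
    """Bonferroni-adjusted p-values: min(m * p, 1) for each raw p-value."""
    m = len(p_values)
    return [min(1.0, m * p) for p in p_values]

raw = [0.001, 0.01, 0.04, 0.5]
adjusted = bonferroni_adjust(raw)
# Rejecting where adjusted <= alpha is the same as rejecting where raw <= alpha / m.
rejected = [i for i, p in enumerate(adjusted) if p <= 0.05]
```

Only the first two hypotheses survive at α = 0.05, although three raw p-values are below 0.05.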

False Discovery Rate
• Controlling the FWER aims at allowing no false positives at all. This is often too stringent in practice.
• The false discovery rate (FDR) was proposed by Benjamini and Hochberg (1995). The idea is to allow a few false positives in exchange for greater power.

Control of FDR, BH-procedure
• Find the ordered observed p-values, $p_{(1)} \le p_{(2)} \le \dots \le p_{(m)}$, and the corresponding hypotheses $H_{(1)}, \dots, H_{(m)}$.
• Let k be the largest i for which $p_{(i)} \le \frac{i}{m}\alpha$.
• Reject all $H_{(1)}, \dots, H_{(k)}$.
• Strongly controls FDR.
• Also weakly controls FWER.
(Benjamini and Hochberg, 1995)
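The step-up rule can be sketched as below. The function name and the example p-values are illustrative; the logic follows the three steps of the procedure: sort, find the largest rank i whose p-value is at most iα/m, and reject everything up to that rank.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return the (sorted) indices of hypotheses rejected by the BH procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])   # indices by ascending p
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            k = rank                                      # keep the largest passing rank
    return sorted(order[:k])

# Thresholds for m = 4, alpha = 0.05: 0.0125, 0.025, 0.0375, 0.05.
rejected = benjamini_hochberg([0.001, 0.035, 0.036, 0.5])
```

In this example the second p-value (0.035) misses its own threshold (0.025), but it is still rejected because the third p-value passes its threshold and k is the largest passing rank.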

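Positive false discovery rate (pFDR)
• pFDR = E[V/R | R > 0], where R is the number of rejected hypotheses and V the number of false positives among them.
• Better power than the FDR procedure.
• Estimating it requires an estimate of π0, the proportion of true null hypotheses.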

Estimation of π0(t)
Under the null hypothesis, the p-value is uniformly distributed on [0, 1].

Estimation of π0(t)
Procedure:
• Choose 0 < λ < 1.
• Assume the p-values $p_i$ are uniformly distributed for p > λ.
• Then estimate π0 as $\hat{\pi}_0(\lambda) = \frac{\#\{p_i > \lambda\}}{m(1 - \lambda)}$.
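A sketch of this estimator, assuming the standard form in which the count of p-values above λ is divided by m(1 − λ). The cap at 1 is an added safeguard, not something stated in the procedure, and the p-values are made up.

```python
def estimate_pi0(p_values, lam=0.5):
    """Estimate the proportion of true nulls from the p-values above `lam`."""
    if not 0.0 < lam < 1.0:
        raise ValueError("lam must lie strictly between 0 and 1")
    m = len(p_values)
    tail = sum(p > lam for p in p_values)        # mostly true nulls if lam is large enough
    return min(1.0, tail / (m * (1.0 - lam)))    # capped at 1 (assumption)

# Four small p-values (likely real effects) and six that look uniform.
p_values = [0.001, 0.002, 0.01, 0.02, 0.3, 0.45, 0.6, 0.7, 0.8, 0.95]
pi0_hat = estimate_pi0(p_values, lam=0.5)
```

Here four of the ten p-values exceed 0.5, so the estimate is 4 / (10 × 0.5) = 0.8. A larger λ reduces bias from non-null p-values but leaves fewer points to count, so the estimate becomes noisier.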

(Streinsland)

Estimation of FDR in SAM
• R ≈ #(genes called significant)
• V ≈ #(genes called significant in permutation tests, averaged over permutations)
• FDR ≈ V/R
• The power of SAM is better than that of fold-change criteria.

Data: Apo AI experiment
• 8 mice in the treatment group (apo AI knockout); 8 mice in the control group (normal).
• 16 arrays, each with Cy5 for mRNA from a treatment or control mouse and Cy3 for mRNA from pooled control mice.
• 6356 genes.
• Goal: detect genes differentially expressed between treatment and control mice.

Cutoff value vs top genes
• Each metric can be viewed as a monotonic transformation of another.
• The methods differ only in their cutoff values.
• All statistical hypothesis testing methods are therefore equivalent in terms of selecting the top k genes, for a fixed k.
