Analysis of gene expression data (Nominal explanatory variables)

Shyamal D. Peddada, Biostatistics Branch, National Institute of Environmental Health Sciences (NIH), Research Triangle Park, NC

Outline of the talk • Two types of explanatory variables (“experimental conditions”) • Some scientific questions of interest • A brief discussion on false discovery rate (FDR) analysis • Some existing statistical methods for analyzing microarray data

Types of explanatory variables

Types of explanatory variables (“experimental conditions”) • Nominal variables: • No intrinsic order among the levels of the explanatory variable(s). • No loss of information if we permuted the labels of the conditions. • E.g. Comparison of gene expression of samples from “normal” tissue with those from “tumor” tissue.

Types of explanatory variables (“experimental conditions”) • Ordinal/interval variables: • Levels of the explanatory variables are ordered. • E.g. comparison of gene expression of samples from different stages of severity of lesions such as “normal”, “hyperplasia”, “adenoma” and “carcinoma” (categorically ordered). • E.g. time-course/dose-response experiments (numerically ordered).

Focus of this talk: Nominal explanatory variables

Types of microarray data • Independent samples • E.g. comparison of gene expression of independent samples drawn from normal patients versus independent samples from tumor patients. • Dependent samples • E.g. comparison of gene expression of samples drawn from normal tissues and tumor tissues from the same patient.

Possible questions of interest • Identify significant “up/down” regulated genes for a given “condition” relative to another “condition” (adjusted for other covariates). • Identify genes that discriminate between various “conditions” and predict the “class/condition” of a future observation. • Cluster genes according to patterns of expression over “conditions”. • Other questions?

Challenges • Small sample size but a large number of genes. • Multiple testing – Since each microarray has thousands of genes/probes, several thousand hypotheses are being tested. This impacts the overall Type I error rates. • Complex dependence structure between genes and possibly among samples. • Difficult to model and/or account for the underlying dependence structures among genes.

Multiple Testing: Type I Errors - False Discovery Rates …

The Decision Table The only observable values
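The table itself is not reproduced in this text. A sketch of the usual layout, assuming the notation of Benjamini and Hochberg (1995), in which V is the number of false rejections and R the total number of rejections among m hypotheses (the labels U, T, S, m_0 and m_1 are assumptions):

```latex
\begin{array}{l|cc|c}
                  & \text{Not rejected} & \text{Rejected} & \text{Total} \\ \hline
\text{True null}  & U                   & V               & m_0 \\
\text{False null} & T                   & S               & m_1 \\ \hline
\text{Total}      & m - R               & R               & m
\end{array}
```

In this layout only m and R (and hence m - R) are observed; U, V, T, S, m_0 and m_1 are unobservable.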

Strong and weak control of type I error rates • Strong control: control the type I error rate under any combination of true and false null hypotheses. • Weak control: control the type I error rate only when all null hypotheses are true. Since we do not know a priori which hypotheses are true, we will focus on strong control of the type I error rate.

Consequences of multiple testing • Suppose we test each hypothesis at the 5% level of significance. • Suppose n = 10 independent tests are performed and all null hypotheses are true. Then the probability of declaring at least 1 of the 10 tests significant is 1 - 0.95^10 ≈ 0.401. • If 50,000 independent tests are performed, as in Affymetrix microarray data, and all null hypotheses are true, then you should expect 50,000 × 0.05 = 2500 false positives!
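The arithmetic on this slide can be checked with a few lines of Python. This is a minimal sketch that assumes independent tests and that every null hypothesis is true; the function names are illustrative only.

```python
def prob_at_least_one_rejection(n_tests, alpha=0.05):
    """P(at least one test is significant) for independent tests, all nulls true."""
    return 1.0 - (1.0 - alpha) ** n_tests


def expected_false_positives(n_tests, alpha=0.05):
    """Expected number of false rejections when all nulls are true."""
    return n_tests * alpha


if __name__ == "__main__":
    print(prob_at_least_one_rejection(10))   # ten tests at the 5% level
    print(expected_false_positives(50_000))  # array-scale number of tests
```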

Types of errors in the context of multiple testing Let V denote the number of false rejections (true null hypotheses that are rejected) among the m hypotheses. • Per-Family Error “Rate” (PFER): E(V). Expected number of false rejections among all m hypotheses. • Per-Comparison Error Rate (PCER): E(V)/m. Expected proportion of false rejections among all m hypotheses. • Family-Wise Error Rate (FWER): P(V > 0). Probability of at least one false rejection among all m hypotheses.

Types of errors in the context of multiple testing • False Discovery Rate (FDR): • Expected proportion of Type I errors among all rejected hypotheses, E(V/R), where R is the total number of rejected hypotheses. • Benjamini-Hochberg (BH): Set V/R = 0 if R = 0. • Storey: Only interested in the case R > 0, i.e. E(V/R | R > 0). (Positive FDR, pFDR)

Some useful inequalities
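The inequalities themselves are not reproduced in this text. The standard relations among the error rates defined earlier, which agree with the conclusions drawn on the next slide, can be sketched as follows (using V, R and m as defined above; the derivation steps are the usual textbook ones, not taken from the slides):

```latex
\begin{aligned}
\mathrm{PCER} &= \frac{E(V)}{m} \;\le\; \mathrm{FDR} \;\le\; \mathrm{FWER} = P(V > 0) \;\le\; \mathrm{PFER} = E(V), \\
\mathrm{FDR}  &= E\!\left(\frac{V}{R} \,\Big|\, R > 0\right) P(R > 0)
               = \mathrm{pFDR} \cdot P(R > 0) \;\le\; \mathrm{pFDR}.
\end{aligned}
```

The first chain uses V/m ≤ V/R ≤ 1{V > 0} ≤ V (with V/R set to 0 when R = 0); the second uses P(R > 0) ≤ 1.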

Conclusion • It is conservative to control FWER rather than FDR! • It is conservative to control pFDR rather than FDR!

Some useful inequalities

Some useful inequalities However, in most applications such as microarrays, one expects … In general, there is no proof of the statement …

Some popular Type I error controlling procedures • Let $p_{(1)} \le p_{(2)} \le \dots \le p_{(m)}$ denote the ordered p-values for the m tests that are being performed. • Let $\alpha_1 \le \alpha_2 \le \dots \le \alpha_m$ denote the ordered levels of significance used for testing the m null hypotheses, respectively.

Some popular controlling procedures • Step-down procedure:

Some popular controlling procedures • Step-up procedure:

Some popular controlling procedures • Single-step procedure: A stepwise procedure that uses the same critical constant for all m hypotheses.
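The definitions of the step-down and step-up rules are not reproduced in this text. A minimal sketch of the usual rules, assuming the i-th smallest p-value is compared with the i-th critical level: step-down starts from the smallest p-value and stops at the first one that exceeds its level; step-up starts from the largest p-value and rejects everything at or below the first one that meets its level. The function names are illustrative.

```python
def step_down(pvalues, alphas):
    """Reject in order of increasing p-value until the first p_(i) > alpha_i."""
    order = sorted(range(len(pvalues)), key=lambda i: pvalues[i])
    rejected = [False] * len(pvalues)
    for rank, idx in enumerate(order):
        if pvalues[idx] > alphas[rank]:
            break  # this and all larger p-values are not rejected
        rejected[idx] = True
    return rejected


def step_up(pvalues, alphas):
    """Find the largest i with p_(i) <= alpha_i and reject H_(1), ..., H_(i)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    rejected = [False] * m
    for rank in range(m - 1, -1, -1):
        if pvalues[order[rank]] <= alphas[rank]:
            for r in range(rank + 1):
                rejected[order[r]] = True
            break
    return rejected


if __name__ == "__main__":
    p = [0.01, 0.03, 0.035, 0.5]
    levels = [0.0125, 0.025, 0.0375, 0.05]  # illustrative critical levels
    print(step_down(p, levels))
    print(step_up(p, levels))
```

With the same critical levels, a step-up rule rejects at least as many hypotheses as the step-down rule; a single-step rule is the special case in which all levels are equal.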

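Some typical stepwise procedures: FWER controlling procedures (here $\alpha_i$ is the critical level for the i-th smallest p-value and $\alpha$ is the target FWER) • Bonferroni: A single-step procedure with $\alpha_i = \alpha/m$. • Sidak: A single-step procedure with $\alpha_i = 1 - (1-\alpha)^{1/m}$. • Holm: A step-down procedure with $\alpha_i = \alpha/(m-i+1)$. • Hochberg: A step-up procedure with $\alpha_i = \alpha/(m-i+1)$. • minP method: A resampling-based single-step procedure with $\alpha_i = c_\alpha$, where $c_\alpha$ is the α quantile of the distribution of the minimum p-value.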
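A minimal sketch of the four non-resampling procedures listed above, assuming the standard critical levels α/m (Bonferroni), 1 - (1 - α)^(1/m) (Sidak) and α/(m - i + 1) (Holm and Hochberg). The function names are illustrative, and the minP method is omitted because it needs resampled data.

```python
def bonferroni(pvalues, alpha=0.05):
    """Single-step: reject if p <= alpha / m."""
    m = len(pvalues)
    return [p <= alpha / m for p in pvalues]


def sidak(pvalues, alpha=0.05):
    """Single-step: reject if p <= 1 - (1 - alpha)^(1/m)."""
    m = len(pvalues)
    cutoff = 1.0 - (1.0 - alpha) ** (1.0 / m)
    return [p <= cutoff for p in pvalues]


def holm(pvalues, alpha=0.05):
    """Step-down with levels alpha / (m - i + 1), i = 1, ..., m."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    rejected = [False] * m
    for rank, idx in enumerate(order):       # rank 0 is the smallest p-value
        if pvalues[idx] > alpha / (m - rank):
            break
        rejected[idx] = True
    return rejected


def hochberg(pvalues, alpha=0.05):
    """Step-up with the same levels as Holm."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    rejected = [False] * m
    for rank in range(m - 1, -1, -1):        # start from the largest p-value
        if pvalues[order[rank]] <= alpha / (m - rank):
            for r in range(rank + 1):
                rejected[order[r]] = True
            break
    return rejected


if __name__ == "__main__":
    p = [0.001, 0.02, 0.03, 0.04]
    for procedure in (bonferroni, sidak, holm, hochberg):
        print(procedure.__name__, procedure(p))
```

On this toy input the step-up rule rejects all four hypotheses because the largest p-value is already below α, while the other three procedures reject only the smallest.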

Comments on the methods • Bonferroni: Very general but can be too conservative for a large number of hypotheses. • Sidak: More powerful than Bonferroni, but applicable only when the test statistics are independent or have certain types of positive dependence.

Comments on the methods • Holm: More powerful than Bonferroni and applicable for any type of dependence structure between test statistics. • Hochberg: More powerful than Holm’s procedure, but the test statistics should either be independent or have the MTP2 property.

Comments on the methods • Multivariate Total Positivity of Order 2 (MTP2)
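The definition is not reproduced in this text. The usual statement of the MTP2 property (Karlin and Rinott, 1980), given here as a sketch with assumed notation, is that the joint density f of the test statistics satisfies:

```latex
f(\mathbf{x})\, f(\mathbf{y}) \;\le\; f(\mathbf{x} \wedge \mathbf{y})\, f(\mathbf{x} \vee \mathbf{y})
\qquad \text{for all } \mathbf{x}, \mathbf{y} \in \mathbb{R}^m ,
```

where $\mathbf{x} \wedge \mathbf{y}$ and $\mathbf{x} \vee \mathbf{y}$ denote the componentwise minimum and maximum.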

Some typical stepwise procedures: FDR controlling procedure • Benjamini-Hochberg: A step-up procedure with $\alpha_i = i\alpha/m$ for the i-th smallest p-value.
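A minimal sketch of the Benjamini-Hochberg step-up rule, assuming the standard critical levels iα/m for the i-th smallest p-value. The function name and the toy p-values are illustrative.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Reject H_(1), ..., H_(k), where k is the largest i with p_(i) <= i * alpha / m."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank * alpha / m:
            k = rank                          # keep the largest rank that passes
    rejected = [False] * m
    for idx in order[:k]:
        rejected[idx] = True
    return rejected


if __name__ == "__main__":
    print(benjamini_hochberg([0.01, 0.03, 0.035, 0.5]))
```

Note that the second smallest p-value (0.03) exceeds its own level (0.025) but is still rejected, because a larger p-value further up the list meets its level.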

An Illustration • Lobenhofer et al. (2002) data: • Expose breast cancer cells to estradiol for 1 hour or (12, 24, 36 hours). • Number of genes on the cDNA 2-spot array: 1900. • Number of samples per time point: 8. • Compare 1 hour with (12, 24 and 36 hours) using a two-sided bootstrap t-test.

Some Popular Methods of Analysis

1. Fold-change

1. Fold-change in gene expression • For gene “g” compute the fold change between two conditions (e.g. treatment and control):

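1. Fold-change in gene expression • Two pre-defined constants serve as the upper and lower thresholds. • Fold change above the upper threshold: gene “g” is “up-regulated”. • Fold change below the lower threshold: gene “g” is “down-regulated”.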
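A minimal sketch of a fold-change call for one gene. It assumes the fold change is the ratio of mean expression under treatment to mean expression under control on the raw (positive) intensity scale, and it uses hypothetical thresholds of 2 and 0.5; neither the exact formula nor the constants are given in the text.

```python
from statistics import mean


def fold_change_call(treatment, control, upper=2.0, lower=0.5):
    """Return (fold change, call) for one gene; thresholds are illustrative."""
    fc = mean(treatment) / mean(control)
    if fc >= upper:
        return fc, "up"
    if fc <= lower:
        return fc, "down"
    return fc, "unchanged"


if __name__ == "__main__":
    print(fold_change_call([200.0, 220.0, 180.0], [100.0, 90.0, 110.0]))
```

The call depends only on the two means, which is why the within-group variability plays no role in the decision.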

1. Fold-change in gene expression • Strengths: • Simple to implement. • Biologists find it very easy to interpret. • It is widely used. • Drawbacks: • Ignores variability in mean gene expression. • Genes with subtle changes in expression can be overlooked, i.e. potentially high false negative rates. • Conversely, high false positive rates are also possible.

2. t-test type procedures

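2.1 Permutation t-test For each gene “g” compute the standard two-sample t-statistic: $t_g = \dfrac{\bar{X}_{g1} - \bar{X}_{g2}}{s_g \sqrt{1/n_1 + 1/n_2}}$, where $\bar{X}_{g1}$ and $\bar{X}_{g2}$ are the sample means of gene “g” under the two conditions, $n_1$ and $n_2$ are the corresponding sample sizes, and $s_g$ is the pooled sample standard deviation.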

2.1 Permutation t-test Statistical significance of a gene is determined by computing the null distribution of its t-statistic using either a permutation or a bootstrap procedure.
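A minimal sketch of a two-sided permutation test for a single gene, assuming the pooled-variance two-sample t-statistic and random relabelling of the samples. The function names, the number of permutations and the add-one p-value correction are illustrative choices, not taken from the slides.

```python
import math
import random
from statistics import mean, variance


def pooled_t(x, y):
    """Two-sample t-statistic with pooled variance (assumes the pooled variance is > 0)."""
    n1, n2 = len(x), len(y)
    sp2 = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    return (mean(x) - mean(y)) / math.sqrt(sp2 * (1.0 / n1 + 1.0 / n2))


def permutation_pvalue(x, y, n_perm=2000, seed=0):
    """Two-sided p-value from random permutations of the group labels."""
    rng = random.Random(seed)
    observed = abs(pooled_t(x, y))
    pooled = list(x) + list(y)
    n1 = len(x)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)                   # relabel the samples at random
        if abs(pooled_t(pooled[:n1], pooled[n1:])) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)          # add-one correction avoids p = 0


if __name__ == "__main__":
    normal = [5.1, 5.3, 4.9, 5.2, 5.0]
    tumor = [7.0, 7.2, 6.9, 7.1, 7.3]
    print(permutation_pvalue(normal, tumor))
```

In a microarray setting this would be repeated for every gene, and the resulting p-values passed to one of the multiple-testing procedures.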

2.1 Permutation t-test • Strengths: • Simple to implement. • Biologists find it very easy to interpret. • It is widely used. • Drawback: • Potentially, for some genes the pooled sample standard deviation could be very small and hence it may result in inflated Type I errors and inflated false discovery rates.

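2.2 SAM procedure (Significance Analysis of Microarrays) (Tusher et al., PNAS 2001) For each gene “g” modify the standard two-sample t-statistic as: $d_g = \dfrac{\bar{X}_{g1} - \bar{X}_{g2}}{s_g \sqrt{1/n_1 + 1/n_2} + s_0}$, where the numerator is the difference in sample means, $s_g$ is the pooled sample standard deviation, and $s_0 > 0$ is a “fudge” factor common to all genes. The “fudge” factor $s_0$ is chosen so that the coefficient of variation of the above test statistic is minimized.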
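A minimal sketch of a SAM-type statistic for one gene, assuming the usual form in which a constant s0 is added to the standard error in the denominator of the t-statistic. Here s0 is supplied by the caller; the data-driven choice of s0 (minimizing the coefficient of variation across genes) is not implemented, and the function name is illustrative.

```python
import math
from statistics import mean, variance


def sam_statistic(x, y, s0):
    """(mean difference) / (pooled standard error + s0); s0 = 0 gives the ordinary t."""
    n1, n2 = len(x), len(y)
    sp2 = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    std_err = math.sqrt(sp2 * (1.0 / n1 + 1.0 / n2))
    return (mean(x) - mean(y)) / (std_err + s0)


if __name__ == "__main__":
    # A gene with a tiny difference but an even tinier variance.
    a = [1.00, 1.01, 1.02]
    b = [1.05, 1.06, 1.07]
    print(sam_statistic(a, b, s0=0.0))   # ordinary t-statistic
    print(sam_statistic(a, b, s0=0.1))   # damped by the fudge factor
```

The added constant keeps genes with very small standard deviations from producing extreme statistics, which is the drawback of the plain t-test noted earlier.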

3. F-test and its variations for more than 2 nominal conditions • Usual F-test, with P-values obtained by a suitable permutation procedure. • Regularized F-test: Generalization of the Baldi and Long methodology to multiple groups. • It controls the false discovery rate better than the usual F-test, with comparable power. • Cui and Churchill (2003) is a good review paper.

4. Linear fixed effects models • Effects: • Array (A) - sample • Dye (D) • Variety (V) – test groups • Genes (G) • Expression (Y)

4. Linear fixed effects models (Kerr, Martin, and Churchill, 2000) • Linear fixed effects model:
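The model equation is not reproduced in this text. A sketch of the ANOVA model as it is commonly stated for Kerr, Martin, and Churchill (2000), using the effect labels listed above; the exact set of interaction terms shown here is an assumption:

```latex
\log(y_{ijkg}) = \mu + A_i + D_j + V_k + G_g + (AG)_{ig} + (VG)_{kg} + \varepsilon_{ijkg}
```

Here $y_{ijkg}$ is the expression of gene g measured on array i with dye j for variety k, and the variety-by-gene interactions $(VG)_{kg}$ capture differential expression across test groups.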

4. Linear fixed effects models • All effects are assumed to be fixed effects. • Main drawback – all genes are assumed to have the same variance!

5. Linear mixed effects models (Wolfinger et al. 2001) • Stage 1 (Global normalization model) • Stage 2 (Gene specific model)

5. Linear mixed effects models • Assumptions:

5. Linear mixed effects models (Wolfinger et al. 2001) • Perform inferences on the interaction term
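The two model equations and the assumptions are not reproduced in this text. A sketch of the two-stage model as it is commonly stated for Wolfinger et al. (2001), with assumed notation: $y_{gij}$ is the log intensity of gene g under treatment i on array j, and $r_{gij}$ is the residual from Stage 1.

```latex
\begin{aligned}
\text{Stage 1:}\quad y_{gij} &= \mu + T_i + A_j + (TA)_{ij} + \varepsilon_{gij}, \\
\text{Stage 2:}\quad r_{gij} &= G_g + (GT)_{gi} + (GA)_{gj} + \gamma_{gij}.
\end{aligned}
```

Under this formulation the array-related effects $A_j$, $(TA)_{ij}$, $(GA)_{gj}$ and the errors $\varepsilon_{gij}$, $\gamma_{gij}$ are usually taken to be independent, zero-mean normal random effects, with gene-specific variances in Stage 2, and the interaction term of interest is the gene-by-treatment effect $(GT)_{gi}$.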
