
Bayesian mixture models for analysing gene expression data


Presentation Transcript


  1. Bayesian mixture models for analysing gene expression data
Natalia Bochkina, in collaboration with Alex Lewin, Sylvia Richardson and the BAIR Consortium
Imperial College London, UK

  2. Introduction
We use a fully Bayesian approach to model the data, with MCMC for parameter estimation.
• All parameters are modelled simultaneously.
• Prior information can be included in the model.
• Variances are automatically adjusted, avoiding unstable estimates when the number of observations is small.
• Inference is based on the posterior distribution of all parameters.
• The posterior mean is used as the estimate of each parameter.

  3. Differential expression
[Figure: distributions of the expression index for gene g under conditions 1 and 2, and the distribution of the differential expression parameter.]

  4. Bayesian Model
Two conditions, with R_1 and R_2 replicates:
y_{g1r} ~ N(α_g − ½ d_g, σ²_{g1}), r = 1, …, R_1
y_{g2r} ~ N(α_g + ½ d_g, σ²_{g2}), r = 1, …, R_2
where α_g is the mean expression and d_g the difference (log fold change).
Prior model: σ²_{gk} ~ IG(a_k, b_k), k = 1, 2, which gives
E(σ²_{gk} | s²_{gk}) = [(R_k − 1) s²_{gk} + 2 b_k] / (R_k − 1 + 2 a_k)
Non-informative priors on α_g, a_k, b_k. Which prior distribution on d_g?
(The data are assumed to be background-corrected, log-transformed and normalised.)
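As a quick numerical illustration (our own Python sketch, not part of the original slides), the shrinkage formula above can be computed directly; the input values below are invented:

```python
def posterior_mean_variance(s2_gk, R_k, a_k, b_k):
    """Posterior mean of the gene variance sigma^2_{gk} under the conjugate
    IG(a_k, b_k) prior, as given on this slide:
    E(sigma^2 | s^2) = [(R_k - 1) s^2 + 2 b_k] / (R_k - 1 + 2 a_k)."""
    return ((R_k - 1) * s2_gk + 2 * b_k) / (R_k - 1 + 2 * a_k)

# Illustrative values: 8 replicates, prior IG(1.5, 0.05) (the values of slide 7).
# With few replicates the estimate is pulled towards the prior, which is the
# "automatic adjustment" mentioned on slide 2.
print(posterior_mean_variance(s2_gk=0.04, R_k=8, a_k=1.5, b_k=0.05))  # 0.038
```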

  5. Modelling differential expression
Prior information/assumption: genes are either differentially expressed or not (of interest or not). This can be included in the model by giving the difference a mixture prior:
d_g ~ (1 − p) δ_0(d_g) + p H(d_g | θ_g)
where δ_0 is a point mass at zero (H_0) and H is the alternative distribution (H_1). How should H be chosen?
Advantages:
• A threshold is selected automatically, as opposed to specifying constants as in the non-informative prior model for the differences.
• Interpretable: Bayesian classification can be used to select differentially expressed genes, declaring gene g differentially expressed when P{g in H_1 | data} > P{g in H_0 | data}.
• False discovery and non-discovery rates can be estimated (Newton et al 2004).
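To make the classification rule concrete, here is a small hypothetical sketch: given the mixture weight p and the marginal likelihoods of a gene's data under each component (both of which would come from the fitted model), the posterior allocation probability follows from Bayes' rule. The function name and input values are our own.

```python
def posterior_h1_prob(p, lik_h0, lik_h1):
    """Posterior probability that gene g belongs to the non-zero component H1,
    from the mixture weight p and the marginal likelihoods of the gene's data
    under H0 (point mass at 0) and H1 (hypothetical inputs)."""
    num = p * lik_h1
    return num / (num + (1 - p) * lik_h0)

# Declare gene g differentially expressed when
# P(g in H1 | data) > P(g in H0 | data), i.e. when the probability exceeds 0.5:
prob = posterior_h1_prob(p=0.1, lik_h0=0.02, lik_h1=0.30)
print(prob, prob > 0.5)
```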

  6. Considered mixture models
We consider several distributions for the non-zero part H of the mixture prior on d_g in a fully Bayesian context: a double gamma, a Student t, the conjugate model of Lonnstedt and Speed (2002), and a uniform distribution.
• LS model: H is normal with variance proportional to the variance of the data: d_g ~ (1 − p) δ_0 + p N(0, c σ²_g), where σ²_g = σ²_{g1}/R_1 + σ²_{g2}/R_2.
• Gamma model: H is a double gamma distribution.
• T model: H is a Student t distribution: d_g ~ (1 − p) δ_0 + p T(ν, μ, τ).
• Uniform model: H is a uniform distribution: d_g ~ (1 − p) δ_0 + p U(−m_1, m_2), where (−m_1, m_2) is a slightly widened range of the observed differences.
Priors on the hyperparameters are either non-informative, or weakly informative G(1, 1) for parameters supported on the positive half-line.
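For intuition, a sketch of drawing d_g from each of the four mixture priors; all parameter values (p, c, σ²_g, the gamma shape and scale, ν, μ, τ, m_1, m_2) are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_dg(p, draw_h, size):
    """Draw d_g from (1 - p) delta_0 + p H: exactly zero with probability
    1 - p, otherwise a draw from the alternative distribution H."""
    z = rng.random(size) < p      # z_g = 1 marks genes allocated to H1
    d = np.zeros(size)
    d[z] = draw_h(z.sum())
    return d

sigma2_g, c = 0.05, 2.0  # invented; sigma2_g plays the role of s2_g1/R1 + s2_g2/R2
d_ls   = sample_dg(0.1, lambda n: rng.normal(0.0, np.sqrt(c * sigma2_g), n), 1000)
d_gam  = sample_dg(0.1, lambda n: rng.choice([-1.0, 1.0], n) * rng.gamma(2.0, 0.5, n), 1000)  # double gamma
d_t    = sample_dg(0.1, lambda n: 0.0 + 0.5 * rng.standard_t(4, n), 1000)  # T(nu=4, mu=0, tau=0.5)
d_unif = sample_dg(0.1, lambda n: rng.uniform(-3.0, 3.0, n), 1000)         # U(-m1, m2)
```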

  7. Simulated data
We compare the performance of the four models on simulated data. For simplicity we consider a one-group model (or, equivalently, a paired two-group model). We simulate a data set with 1000 variables and 8 replicates. The variance hyperparameters a = 1.5, b = 0.05 are chosen to be close to the Bayesian estimates obtained from a real data set.
[Figure: plot of the simulated data set, variance against difference.]
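A hypothetical sketch of how such a data set could be generated, assuming a one-group model, mixture weight p = 0.2 (matching the 200 of 1000 differentially expressed genes on slide 10), and N(0, 1) for the non-zero differences:

```python
import numpy as np

rng = np.random.default_rng(1)
G, R, p = 1000, 8, 0.2   # genes, replicates, mixture weight
a, b = 1.5, 0.05         # variance hyperparameters from this slide

sigma2 = 1.0 / rng.gamma(a, 1.0 / b, G)         # sigma^2_g ~ IG(a, b)
z = rng.random(G) < p                           # differentially expressed?
d = np.where(z, rng.normal(0.0, 1.0, G), 0.0)   # d_g; N(0, 1) is our assumption
y = d[:, None] + rng.normal(0.0, np.sqrt(sigma2)[:, None], (G, R))  # replicates
```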

  8. Differences
Mixture estimates vs true values:
• The Gamma, T and LS models estimate the differences well.
• The Uniform model shrinks values towards zero.
• Compared to empirical Bayes, the posterior estimates in the fully Bayesian approach do not shrink large values of the differences.
[Figure: posterior mean vs true difference for each of the four models.]

  9. Bayesian estimates of variance
Blue: variance estimate based on the Bayesian model with a non-informative prior on the differences.
• The T and Gamma models give very similar variance estimates.
• The Uniform model produces similar estimates for small values and higher estimates for larger values, compared with the T and Gamma models.
• The LS model shows more perturbation at both higher and lower values than the T and Gamma models.
• The mixture estimate of the variance can be larger than the sample variance.
[Figure: E(σ²|y) vs sample variance for the Gamma, Uniform, T and LS models.]

  10. Classification
• The T, LS and Gamma models perform similarly.
• The Uniform model has fewer false positives but also fewer true positives: the uniform prior is more conservative.
[Figure: classification results for the 200 differentially expressed and the 800 non-differentially expressed genes.]

  11. Wrongly classified by mixture
[Figure: wrongly classified genes, marked as truly differentially expressed and truly not differentially expressed.]
Classification errors lie on the borderline: confusion between the size of the fold change and the biological variability.

  12. Another simulation
2628 data points, with many points added on the borderline; classification errors are shown in red. Can we improve the estimation of the within-condition biological variability?

  13. DAG for the mixture model
[Figure: directed acyclic graph, g = 1:G. The mixture weight p determines the allocations z_g and hence d_g; the contrast y_{g1·} − y_{g2·} informs d_g, and the average ½(y_{g1·} + y_{g2·}) informs α_g; the hyperparameters (a_1, b_1) and (a_2, b_2) govern the variances σ²_{g1} and σ²_{g2}, which are linked to the sample variances s²_{g1} and s²_{g2}.]
The variance estimates are influenced by the mixture parameters. Could we use only partial information from the replicates to estimate the σ²_{gs} and feed it forward into the mixture?
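One way to read the "cut" idea in code, as a rough sketch under our own assumptions (not the authors' sampler): the cut estimate of σ²_{gs} sees only the within-condition sample variance, whereas the full-model estimate sees residuals around fitted means that depend on the mixture through d_g.

```python
import numpy as np

def sigma2_cut(s2, R, a, b):
    """'Cut' variance estimate: the slide-4 posterior mean, computed from the
    within-condition sample variance s^2_gs only, with no mixture feedback."""
    return ((R - 1) * s2 + 2 * b) / (R - 1 + 2 * a)

def sigma2_full(y, mu_fit, a, b):
    """Full-model analogue (our sketch, same shrinkage form as slide 4):
    residuals are taken around fitted means mu_fit, which depend on d_g,
    so the mixture feeds back into the variance estimate."""
    R = y.shape[-1]
    rss = ((y - mu_fit[..., None]) ** 2).sum(axis=-1)  # residual sum of squares
    return (rss + 2 * b) / (R + 2 * a)
```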

  14. Estimation
Estimation of all parameters combines information from biological replicates and between-condition contrasts:
• s²_{gs} = (1/R_s) Σ_r (y_{gsr} − y_{gs·})², s = 1, 2 (within-condition biological variability)
• y_{gs·} = (1/R_s) Σ_r y_{gsr} (average expression over replicates)
• ½(y_{g1·} + y_{g2·}) (average expression over conditions)
• ½(y_{g1·} − y_{g2·}) (between-conditions contrast)
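These statistics in code, as a minimal sketch (the array shapes are our assumption):

```python
import numpy as np

def summary_stats(y1, y2):
    """Per-gene summary statistics from this slide; y1 and y2 are (G, R_s)
    arrays of log expression for conditions 1 and 2."""
    ybar1, ybar2 = y1.mean(axis=1), y2.mean(axis=1)   # y_{gs.}: averages over replicates
    s2_1 = ((y1 - ybar1[:, None]) ** 2).mean(axis=1)  # s^2_{g1}, 1/R_s scaling as on the slide
    s2_2 = ((y2 - ybar2[:, None]) ** 2).mean(axis=1)  # s^2_{g2}
    avg_expr = 0.5 * (ybar1 + ybar2)                  # average expression over conditions
    contrast = 0.5 * (ybar1 - ybar2)                  # between-conditions contrast
    return s2_1, s2_2, avg_expr, contrast
```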

  15. Mixture, full vs partial
Classification is altered for 57 points when the feedback from the mixture is cut: 46 data points gain an improved classification, while 11 data points change to a new but still incorrect classification. Work in progress.

  16. Difference: cut and no cut
[Figure: variance, posterior probability, and sample st. dev. vs difference for the cut and full models; differently classified genes are marked as truly differentially expressed and truly not differentially expressed.]

  17. Microarray data
[Figure: variance, posterior probability, and pooled sample st. dev. vs sample difference for the cut and full models.]
Genes classified differently by the full model and the model with the feedback cut follow a curve.

  18. Since the variance is overestimated in the full mixture model compared to the mixture model with the cut, the number of false negatives is lower for the model with the cut than for the full model.

  19. LS model: empirical vs fully Bayesian
We compare the Lonnstedt and Speed (LS) model fitted as a fully Bayesian (FB) model and as an empirical Bayes (EB) model.
• If the parameter p is specified correctly, the empirical and fully Bayesian models do not differ.
• If p is misspecified, the estimate of the parameter c changes, which leads to misclassification.
[Figure: estimated parameters and classification, FB vs EB.]

  20. Small p (p = 0.01)
[Figure: results for a small mixture weight p, cut vs no cut.]

  21. Bayesian Estimate of FDR
• Step 1: choose a gene-specific parameter (e.g. d_g) or a gene statistic.
• Step 2: model its prior distribution using a mixture model, with one component modelling the unaffected genes (the null hypothesis), e.g. a point mass at 0 for d_g, and other components modelling the alternative (flexibly).
• Step 3: calculate the posterior probability p_{g0} of each gene belonging to the unmodified component, given the data.
• Step 4: evaluate the FDR (and FNR) for any list, assuming the gene classifications are independent (Broët et al 2004):
Bayes FDR(list) | data = (1 / card(list)) Σ_{g ∈ list} p_{g0}
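The Step 4 formula in code, as a minimal sketch; the FNR counterpart is our assumption (averaging 1 − p_{g0} over the genes left out of the list):

```python
import numpy as np

def bayes_fdr(p_g0, gene_list):
    """Bayes FDR of a gene list: the average posterior null probability
    p_{g0} over the listed genes (Broet et al 2004)."""
    return p_g0[gene_list].mean()

def bayes_fnr(p_g0, gene_list):
    """Bayes FNR (our assumed counterpart): average 1 - p_{g0} over the
    genes not included in the list."""
    mask = np.ones(len(p_g0), dtype=bool)
    mask[gene_list] = False
    return (1.0 - p_g0[mask]).mean()

# Example: list all genes with P(g in H1 | data) = 1 - p_g0 above 0.5
p_g0 = np.array([0.05, 0.9, 0.2, 0.7, 0.01])
glist = np.where(1.0 - p_g0 > 0.5)[0]
print(bayes_fdr(p_g0, glist), bayes_fnr(p_g0, glist))
```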

  22. Multiple Testing Problem
• Gene lists can be built by computing a criterion separately for each gene and ranking.
• Thousands of genes are considered simultaneously.
• How can the performance of such lists be assessed?
Statistical challenge: select interesting genes without including too many false positives in the gene list. A gene is a false positive if it is included in the list when it is truly unmodified under the experimental set-up. We want an evaluation of the expected false discovery rate (FDR).

  23. Bayes rule
Observed and estimated FDR/FNR correspond well.
[Figure: FDR (black) and FNR (blue) as a function of the cut-off on Post Prob(g ∈ H_1) = 1 − p_{g0}.]

  24. Summary
• The mixture models estimate the differences and hyperparameters well on simulated data.
• The variance is overestimated for some genes.
• The mixture model with the uniform alternative distribution is more conservative in classifying genes than the structured models.
• The Lonnstedt and Speed model performs better in the fully Bayesian framework, because the parameter p is then estimated from the data.
• The estimates of the false discovery and non-discovery rates are close to the true values.
