Model checks for complex hierarchical models

Model checks for complex hierarchical models Alex Lewin and Sylvia Richardson Imperial College Centre for Biostatistics

Background and Aims • Many complex models used in bioinformatics • Classification/clustering can be greatly affected by choice of distributions • Our approach: exploit the structure of the model to perform predictive checks • hierarchical models generally involve exchangeability assumptions • mixture models are partially exchangeable

Outline of Talk • Mixture model for gene expression data • Model checks for mixture model • distribution for gene-specific variances • different mixture priors • Future work: model checks for a clustering and variable selection model (Tadesse et al. 2005)

ηj wj μ,τ g σg ybarg Sg variance for each gene differential effect for gene g Data: paired log differences between 2 conditions Hierarchical mixture model for gene expression data w ~ Dirichlet(1,…,1), various priors for δg, g δg | η~ Σwjhj(ηj), g2 | μ,τ f(μ,τ) ygr | δg, g  N(δg, g2) g = gene r = replicate j = mixture component

Mixture model for gene expression data • Many mixture models have been proposed for gene expression data • Set-up is similar to variable selection prior: point mass + alternative distribution • Particular choices for alternative: • Normal (Lönnstedt and Speed) • Uniform (Parmigiani et al) • many others …

Mixture model for gene expression data Allow for asymmetry in over-and under-expressed genes  3-component mixture model δg | η~ w1h1(η1) + w2h2(η2) + w3h3(η3) 6 knock-out and 5 wildtype mice MAS5.0 processed data

Mixture model for gene expression data Classify each gene into mixture components using posterior probabilities

Choice of mixture prior affects classification results

Outline of Talk • Mixture model for gene expression data • Models checks for mixture model • distribution for gene-specific variances • different mixture priors • Future work: model checks for a clustering and variable selection model (Tadesse et al. 2005)

Predictive model checks • Predict new data from the model • Use posterior predictive distribution • Condition on hyperparameters (‘mixed predictive’ *  not very conservative) • Get Bayesian p-value for each gene/marker/sample • Use all p-values together (100’s or 1000’s) to assess model fit * Gelman, Meng and Stern 1995; Marshall and Spiegelhalter 2003

μ,τ Sgobs posterior Smpred g σg σpred post. pred. Sgppred mixed pred. Smpred ybarg Sgobs Checking distribution for gene variances Bayesian p-value for gene g: pg = Prob( Smpred > Sgobs | data ) All genes are exchangeable  histogram of p-values for all genes together

‘Mixed’ v. ‘posterior’ predictive • Predictive p-values for data simulated from the model • Histograms should be Uniform • Mixed predictive distribution much less conservative than posterior predictive Using gene-specific distributions Using global distribution

Checking different variance models g2 | μ,τ Gam(μ,τ), μ fixed g2 = 2 for all genes Model differential expression between 3 transgenic and 3 wildtype mice g2 | μ,τ Gam(μ,τ) g2 | μ,τ logNorm(μ,τ)

μ,τ g σg σpred mixed pred. Smpred ybarg Sgobs Implementation (MCMC) niter = no. MCMC iterations m = (no. replicates – 1)/2 pg= 0 for t = 1,…,niter { σtpred  f(μt,τt) Stmpred  Gam( m, m(σtpred)-2 ) pg pg + I[ Stmpred > Sgobs ] } pg pg / niter Just two extra parameters predicted at each iteration

Checking mixture prior δg | η~ w1h1(η1) + w2h2(η2) + w3h3(η3) OR δg | η, zg = j~ hj(ηj) j = 1,…,3 P(zg= j) = wj Model checking: focus on separate mixture components

Issues for mixture model checking δg | η, zg = j~ hj(ηj) j = 1,…,3 Think about MCMC iterations … • Mixture component is estimated from genes currently assigned to that component • Can only define p-value for given gene and mix. component when the gene is assigned to that component (i.e. condition on zg in p-value) • So check each component using only the genes currently assigned (i.e. condition on zg in histogram)

μ,τ ηj wj σg g jpred ybargjmpred Sg ybarg Predictive checks for mixture model Bayesian p-value for gene g and mix. component j: pgj = Prob( ybargjmpred > ybargobs | data, zg=j ) • Genes assigned to the same mix. component are exchangeable • histogram of p-values for each mix. component separately • histogram for component j made only from genes with large P(zg= j)

Condition on classification to check separate components Predictive p-values for data simulated from the model All genes with P(zg = j) > 0 Only genes with P(zg = j) > 0.5 Effectively we condition on a best classification

Checking different mixture distributions • Outer mix. components skewed too much away from zero • Null component too narrow w1Unif(-η-,0) + w2δ(0) + w3Unif(0,η+)

Checking different mixture distributions • Outer components skewed opposite • Null still too narrow? w1Gam-(1.5,η-) + w2 δ(0) + w3Gam+(1.5,η+)

Checking different mixture distributions • Better fit for all components w1Gam-(1.5,η-) + w2N(0,ε) + w3Gam+(1.5,η+)

wj μ,τ ηj σg g jpred ybargjmpred ybarg Sg Implementation pgj= 0 for t = 1,…,niter { • δjtpred~ hjt(ηjt) j = 1,…,3 • ybargtmpred N( δjtpred , g2/nrep ) for j = zgt • pgj pgj + I[ ybargtmpred > ybargobs ] for j = zgt } pgj pgj / niter(zg=j) Need ≈ngenes extra parameters at each iteration

Summary of model checking procedure • Find part of model where individuals are assumed to be exchangeable (so information is shared) • Choose test statistic T (eg. sample mean or variance) • Predict Tpred from distribution for exchangeable individuals (whole posterior for Tpred) • Compare observed Ti for each individual i to distribution of Tpred • For checking mixture components, condition on the best classification

Clustering and variable selection (Tadesse et al. 2005) • yivector of gene expression for each sample i = 1,…,n • Multi-variate mixture model for clustering samples: • yi| zi= j  MVN(ζj, Λj) j = 1,…,J • P(zi= j) = wj • No. of mix. components (J) is estimated in the model • Aim to select genes which are informative for clustering the samples

Clustering and variable selection (Tadesse et al. 2005) Likelihood conditional on allocation to mixture: γ’ = vector of indices of variables not used to cluster samples γ = vector of indices of selected variables Conjugate priors on multivariate means and covariance matrices P(γg= 1) = φ i = sample g = gene j = mix. component

J μj(γ) , Σj(γ) wj η(γ), Ω(γ) φ yi y(γ)jpred Clustering and variable selection (Tadesse et al. 2005) Model checking: want to check the distribution for each mixture component separately (conditional on J) In addition, need to condition on a given variable selection Clearly impossible computationally i = sample g = gene j = mix. component

Computing predictive p-values • Run model with no prediction • Find the best configuration: • set of selected variables (γ) • no. mixture components J • allocation of samples to mixture components zi • Re-run model, with (γ), J and zi fixed, calculated predictive p-values pij = Prob( Tjpred > Tiobs | data, zi=j, J, (γ) ) where T = |y|2 (for example)

Conclusions • Choice of model distributions can greatly influence results of clustering and classification • For models where information is shared across individuals, predictive checks can be used as an alternative to cross-validation • Should be possible to do this even for quite complex models (if you can fit the model, you can check it)

Acknowledgements Collaborators on BBSRC Exploiting Genomics Grant Natalia Bochkina,Clare Marshall Peter Green Meeting on model checking in Cambridge David Spiegelhalter Shaun Seaman BBSRC Exploiting Genomics Grant Paper and software at http://www.bgx.org.uk/

Model checks for complex hierarchical models