Presentation Transcript


  1. Bayesian inference of genome structure and application to base composition variation. Nick Smith and Paul Fearnhead, University of Lancaster

  2. Plan of talk • Examples of genome structure • Patterns of base composition in vertebrate genomes • Recursive segmentation • Bayesian approach to segmentation inference • Application to simulated data • Application to real data • Conclusions and future directions

  3. Examples of genome structure • Variation in base composition (GC%) in the human genome

  4. Variation in substitution rate, polymorphism, repeat element density, recombination rate across human chromosome 22

  5. General research goals • How can we analyse structure in the genome? Could partition genome into uniform windows, but better to partition genome according to structure in the data. • Base composition: Partition genome according to GC content, testing differences in GC structure among chromosomes and species

  6. Base composition in the vertebrate genome • 1970s: Physical analysis of large regions of DNA (~100 kb) revealed striking heterogeneity in GC content, especially in warm-blooded vertebrates; idea of isochores (>300 kb) • Full genome sequences: Sliding window analyses show significant variation in GC at a variety of scales. Long-range autocorrelations up to many Mb.
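
As a concrete illustration of the windowed GC calculations behind such analyses (not code from the talk; the window and step sizes are arbitrary choices):

```python
# Illustrative sketch of a windowed GC calculation. step < window gives a
# sliding-window profile; step == window gives the non-overlapping windows
# used as the data later in the talk.

def gc_in_windows(seq, window=5000, step=5000):
    """Return (start, GC count) for each full window of the sequence."""
    seq = seq.upper()
    out = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        out.append((start, chunk.count("G") + chunk.count("C")))
    return out

if __name__ == "__main__":
    import random
    random.seed(1)
    toy_seq = "".join(random.choice("ACGT") for _ in range(50_000))
    print(gc_in_windows(toy_seq)[:3])   # ~2500 G+C per 5 kb window for a random sequence
```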

  7. Recursive segmentation for “isochore” identification • Current standard method is recursive segmentation, e.g. Isofinder, which generates the isochore maps shown in the UCSC genome browser. • Isofinder: Apply “coarse graining” to filter out short-scale variation (exons, repetitive elements, CpG islands), choose a potential cutting point as the maximum t-statistic between right and left means, test the significance of the t-test by Monte Carlo methods, and make the cut if the p-value is below a threshold. Repeat the process for the resultant sub-sequences, working from scales of tens of Mb down to 1-10 kb.
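
A rough sketch of the recursive segmentation idea, without Isofinder's coarse-graining and filtering steps and with invented function names and defaults; it illustrates the general approach rather than the published implementation:

```python
# Split at the maximum two-sample t-statistic, accept the cut if a Monte Carlo
# permutation test gives a small enough p-value, then recurse on both halves.
import numpy as np

def max_t_split(y):
    """Best cut point and its |t|-statistic for splitting y into left/right parts."""
    y = np.asarray(y, dtype=float)
    best_t, best_i = 0.0, None
    for i in range(2, len(y) - 1):                 # at least 2 points on each side
        left, right = y[:i], y[i:]
        se = np.sqrt(left.var(ddof=1) / len(left) + right.var(ddof=1) / len(right))
        if se > 0:
            t = abs(left.mean() - right.mean()) / se
            if t > best_t:
                best_t, best_i = t, i
    return best_i, best_t

def segment(y, p_threshold=0.01, n_perm=99, rng=None, offset=0):
    """Recursively cut y at significant changepoints; returns changepoint positions."""
    rng = rng if rng is not None else np.random.default_rng(0)
    y = np.asarray(y, dtype=float)
    cut, t_obs = max_t_split(y)
    if cut is None:
        return []
    # Monte Carlo significance: how often does a permuted sequence beat t_obs?
    exceed = sum(max_t_split(rng.permutation(y))[1] >= t_obs for _ in range(n_perm))
    if (exceed + 1) / (n_perm + 1) > p_threshold:
        return []                                  # cut not significant: stop here
    return (segment(y[:cut], p_threshold, n_perm, rng, offset)
            + [offset + cut]
            + segment(y[cut:], p_threshold, n_perm, rng, offset + cut))
```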

  8. Statistical methods for segmentation • Problems with recursive segmentation: (1) depends on significance threshold, (2) no idea of uncertainty in segmentation, (3) possible loss of information. • Various methods for DNA segmentation have been considered in the statistics literature. • Have adapted a Bayesian model for simultaneously inferring multiple changepoints which divide the data into segments. Model uses a combination of perfect simulation with recursions and MCMC.

  9. Bayesian model • Apply coarse graining to get the data (y1:n), the number of GC bases in non-overlapping windows • 2-level model • (1) Model data within segments as N(μi, σ²), where μi is the segment mean and σ² gives the variance within segments. • (2) Model segment means as N(μ, τ²σ²), where μ is the grand mean and τ² is the ratio of variance among segment means to variance within segments. • Can calculate P(t,s) = Pr(yt:s | t,...,s in same segment). • Model the distribution of segment lengths as negative binomial, g(r,k), where r is the inverse of the mean segment length and k is the shape parameter. • Define recursions with Q(t) = Pr(yt:n | changepoint at t−1) using the P, Q and g functions. Fundamental assumption: independence of segments. Proceed from Q(n) to Q(1) = Pr(y1:n). • Posterior distributions of changepoints are given by formulae involving the P, Q and g functions; other information easily follows.
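
To make the recursion concrete, here is a simplified sketch (my own illustration, not the authors' code), assuming the k = 1 geometric case and known hyperparameters μ, σ, τ and r; the function names are invented, and the full method additionally infers the hyperparameters and uses perfect simulation with MCMC:

```python
# Backward recursion Q(t) = Pr(y_{t:n} | changepoint at t-1), geometric segment lengths.
import numpy as np

def log_P(y, t, s, mu, sigma, tau):
    """log Pr(y[t:s+1] | t..s in one segment), integrating out the segment mean."""
    seg = y[t:s + 1]
    m = len(seg)
    ybar = seg.mean()
    ss = ((seg - ybar) ** 2).sum()
    quad = (ss + m * (ybar - mu) ** 2 / (1 + m * tau ** 2)) / sigma ** 2
    return (-0.5 * m * np.log(2 * np.pi) - m * np.log(sigma)
            - 0.5 * np.log(1 + m * tau ** 2) - 0.5 * quad)

def recursion_Q(y, mu, sigma, tau, r):
    """Backward pass from Q(n) to Q(1); returns log Q as a 0-indexed array."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    logQ = np.empty(n)
    log_g = lambda l: np.log(r) + (l - 1) * np.log(1 - r)   # geometric pmf, l >= 1
    log_surv = lambda l: (l - 1) * np.log(1 - r)            # Pr(segment length >= l)
    for t in range(n - 1, -1, -1):
        terms = [log_P(y, t, s, mu, sigma, tau) + logQ[s + 1] + log_g(s - t + 1)
                 for s in range(t, n - 1)]
        # final term: the segment starting at t runs to the end of the data
        terms.append(log_P(y, t, n - 1, mu, sigma, tau) + log_surv(n - t))
        logQ[t] = np.logaddexp.reduce(terms)
    return logQ   # logQ[0] = log Pr(y_{1:n}); changepoint posteriors follow from P, Q and g
```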

  10. Tuning priors and MCMC algorithm • Problem is we don’t know the correct values for the underlying priors. • Priors tuned by maximising the product of Q(1) and the hyperpriors. • MCMC algorithm designed to overcome dependence on priors and possible mis-specifications of the model. • Start with priors at tuned values (reference priors). • Store N simulations of changepoints based on the reference priors. • Three updates at each new step of the MCMC algorithm (N steps): (1) Metropolis-Hastings update of changepoints dependent on current values of the priors (2) Update σ, r and the μi's given the changepoints, μ and τ (Gibbs) (3) Update μ and τ given the μi's (Gibbs)
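
For concreteness, a hedged sketch of the conjugate Gibbs draws for the segment means μi and the grand mean μ, conditioning on a fixed set of changepoints and treating σ and τ as known within the draw; the real step (2) also updates σ and r, and step (1) updates the changepoints by Metropolis-Hastings. All names below are illustrative:

```python
# Normal-Normal conjugate updates for segment means and the grand mean.
import numpy as np

def gibbs_segment_means(y, changepoints, mu, sigma, tau, rng):
    """Draw each segment mean mu_i from its Normal full conditional."""
    bounds = [0] + list(changepoints) + [len(y)]
    means = []
    for a, b in zip(bounds[:-1], bounds[1:]):
        seg = y[a:b]
        m = len(seg)
        prec = m / sigma**2 + 1 / (tau**2 * sigma**2)              # posterior precision
        mean = (seg.sum() / sigma**2 + mu / (tau**2 * sigma**2)) / prec
        means.append(rng.normal(mean, 1 / np.sqrt(prec)))
    return np.array(means)

def gibbs_grand_mean(seg_means, sigma, tau, rng):
    """Draw the grand mean mu given the segment means (flat prior assumed)."""
    K = len(seg_means)
    return rng.normal(seg_means.mean(), tau * sigma / np.sqrt(K))
```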

  11. Simulated data 1: Idealised simulations • Idealised simulations: strong segmentation signal and correct model. • Simulated discretised data (5000 data points, 5 kb windows) using r = 0.01, k = 2, σ = 140, τ = 2, μ = 2000. • For 100 simulations, compared true values of parameters from the simulations with estimated parameter values (mean of posterior distribution): highly accurate inference of r (mean estimate 0.0104, mean error 0.0005), σ (140.03, 0.23) and τ (2.02, 0.043). • Inference of k = 2 was supported by Q(1) values: average 2Δlog Q(1) was 3 relative to k = 1 and 10 relative to k = 3. • Compared Bayesian and Isofinder results for a single example. • True number of changepoints 58. Bayesian: mean r implied 59.9, best set of changepoints had 53. Isofinder: number of changepoints varied with the threshold probability, from 96 with p = 0.05 to 64 with p = 0.01 to 52 with p = 0.001.
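
A sketch of how such a dataset can be generated from the two-level model with these parameter values; the shifted negative binomial parameterisation and the truncation of the final segment are my assumptions, not details given in the talk:

```python
# Simulate one idealised dataset: negative binomial segment lengths (mean 1/r,
# shape k), segment means N(mu, tau^2 sigma^2), observations N(mu_i, sigma^2).
import numpy as np

def simulate(n=5000, r=0.01, k=2, sigma=140.0, tau=2.0, mu=2000.0, seed=0):
    rng = np.random.default_rng(seed)
    L = 1 / r                                   # target mean segment length (in windows)
    p = k / (k + L - 1)                         # so that 1 + NB(k, p) has mean L
    y, changepoints, pos = [], [], 0
    while pos < n:
        length = 1 + rng.negative_binomial(k, p)
        length = min(length, n - pos)           # truncate the final segment
        mu_i = rng.normal(mu, tau * sigma)      # segment mean
        y.extend(rng.normal(mu_i, sigma, size=length))
        pos += length
        if pos < n:
            changepoints.append(pos)
    return np.array(y), changepoints
```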

  12. Idealised simulation example

  13. Simulated data 2: Correlations between segments • In real DNA data there appear to be correlations between isochores, so it is important to test the sensitivity of the method given the fundamental assumption of the recursions (independence of segments). • Simulated data in which there are correlations between segments: positive 1-lag autocorrelations in segment GC and length (r = 0.30), negative correlation between GC and length (r = -0.30), other simulation details as before. • Inference of parameters remains good: r (mean estimate 0.0096, mean true value 0.0098), σ (140.16, 140.04) and τ (1.93, 1.94). • Correlations between segments are underestimated: GC (0.203, 0.248), length (0.114, 0.270), GC-length (-0.297, -0.378).
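
The slide does not say how the correlated segments were constructed; one plausible, purely illustrative scheme drives both segment GC and expected segment length from a single AR(1) series, which induces the three target correlations at least qualitatively:

```python
# Correlated segment structure via an AR(1) series of standardised segment levels.
import numpy as np

def correlated_segments(n_seg=60, rho=0.30, mu=2000.0, sigma=140.0, tau=2.0,
                        base_len=100, seed=0):
    rng = np.random.default_rng(seed)
    z = np.empty(n_seg)
    z[0] = rng.normal()
    for i in range(1, n_seg):                        # AR(1): 1-lag autocorrelation rho
        z[i] = rho * z[i - 1] + np.sqrt(1 - rho**2) * rng.normal()
    seg_means = mu + tau * sigma * z                 # correlated segment GC levels
    # GC-rich segments get shorter expected lengths -> negative GC-length correlation;
    # the 0.3 coefficient is an arbitrary illustrative choice
    seg_lens = 1 + rng.poisson(base_len * np.exp(-0.3 * z))
    # observations within each segment are then drawn as N(seg_mean, sigma^2), as before
    return seg_means, seg_lens
```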

  14. Simulated data 3: Multimodal distribution • The model treats segment means as a single normal distribution. • Simulated data according to a model of four isochore classes, with σ reduced to 70 and τ increased to 3.5. Parameters chosen so that the distribution of segment means is multimodal. • Inference according to the Bayesian model recreated the multimodal structure of the segment means, but the acceptance probability was very low.
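
The four isochore classes are not specified; a purely illustrative way to draw multimodal segment means is a four-component normal mixture, with class positions, weights and within-class spread guessed for the example:

```python
# Segment means drawn from a 4-component normal mixture (illustrative only).
import numpy as np

def multimodal_segment_means(n_seg, mu=2000.0, sigma=70.0, tau=3.5, seed=0):
    rng = np.random.default_rng(seed)
    spread = tau * sigma                               # among-segment scale from the slide
    class_centres = mu + spread * np.array([-1.5, -0.5, 0.5, 1.5])   # guessed positions
    classes = rng.choice(4, size=n_seg)                # equal class weights (assumed)
    return rng.normal(class_centres[classes], 0.3 * spread)          # guessed class spread
```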

  15. Non-parametric test of changepoint inference • Want some non-parametric way of evaluating the method; the most fundamental issue is whether changepoints are inferred correctly. • Compare two sets of changepoints to get the mean minimum distance between the two sets (inferred to nearest true and vice versa). • For each set of inferred changepoints, compare to the true changepoints to get the inferred-true distance. For randomised sets of changepoints, get the random-true distance. • Ratio of inferred-true/random-true gives performance: idealised ratio = 0.238, correlations ratio = 0.256, multimodal ratio = 0.263. Performance robust to model mis-specification. • Comparison of methods for the idealised simulation example: Isofinder with p = 0.005, ratio = 0.279; Bayesian best set of changepoints, ratio = 0.116.
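
A sketch of this evaluation as I read the slide: distances are averaged over both directions, and the random baseline uses changepoint sets of the same size as the inferred set, placed uniformly at random over window positions (both are assumptions about details the slide leaves open):

```python
# Mean minimum distance between changepoint sets, and the inferred/random ratio.
import numpy as np

def mean_min_distance(a, b):
    """Average, over both directions, of each changepoint's distance to the
    nearest changepoint in the other set."""
    a, b = np.asarray(a), np.asarray(b)
    d_ab = np.abs(a[:, None] - b[None, :]).min(axis=1).mean()
    d_ba = np.abs(b[:, None] - a[None, :]).min(axis=1).mean()
    return 0.5 * (d_ab + d_ba)

def performance_ratio(inferred, true_cp, n, n_random=100, seed=0):
    """Inferred-true distance over the average random-true distance (smaller is better)."""
    rng = np.random.default_rng(seed)
    d_inf = mean_min_distance(inferred, true_cp)
    d_rand = np.mean([mean_min_distance(rng.choice(n, size=len(inferred), replace=False),
                                        true_cp)
                      for _ in range(n_random)])
    return d_inf / d_rand
```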

  16. Real data: human chromosome 1 • 36.8 Mb, 5 kb windows, 7358 data points, mean 2068, standard deviation 277. • Q(1) analysis supports k = 1, i.e. a geometric distribution of segment lengths. • Means and 95% CIs from posterior distributions: r 0.062 (0.055-0.069), τ 1.91 (1.77-2.07), σ 138 (135-140). • Mean segment length 80 kb, lots of short segments. • Strong neighbouring-segment correlations: GC 0.21, length 0.11, GC-length -0.29

  17. Real data: human chromosome 1

  18. Human chromosome 1: levels of structure Higher-level structure: • Can enforce higher k and lower r; k = 5 and r = 0.02 gives a mean segment size of 185 kb. • Significant partial autocorrelations between segment means up to lag 9. • Applying the Bayesian model to the inferred segment-mean output gives three super-changepoints at 10 Mb (2008 in the 5 kb data), 15 Mb (3018) and 17 Mb (3419). Effect of “coarse graining” window size: • Segment size positively correlated with window size (3 kb windows - 60 kb segments; 5 kb - 80 kb; 10 kb - 115 kb). • Same process for Isofinder, with p = 0.95: 3 kb windows give 85 kb segments and 5 kb windows give 115 kb segments. • The three super-changepoints at 10, 15 and 17 Mb are robust to window size.

  19. Real data: mouse chromosome 1 • 23.89 Mb, 5 kb windows, 4777 data points, mean 2050, standard deviation 180. • Means and 95% CIs from posterior distributions: r 0.042 (0.034-0.047), τ 1.71 (1.57-2.00), σ 103 (101-105). • Less variation in base composition in the mouse genome than in the human genome: due to longer segments (r 0.042 vs. 0.062, 120 kb vs. 80 kb), less variation within segments (σ 103 vs. 138) and less variation among segments (τσ 176 vs. 264).

  20. Preliminary conclusions on GC analyses • Methods which partition the genome according to the data are preferable to sliding-window methods. • Bayesian method does not require the threshold parameter of recursive segmentation methods, allows improved statistical analyses, and seems to perform better than recursive segmentation. • Bayesian method appears to be robust to model mis-specification.

  21. Future work - correlated genome structure • One exciting possibility of the Bayesian method is extension to multiple genomic features, e.g. the issue of GC-substitution rate (K) covariation. • Accounting for uncertainty in the inference of K. • Do GC and K changepoints tend to be close together, and do the changes between segments covary? • Higher level: What about super-changepoints in GC and K? • Lower level: Do GC and K covary within segments?
