School of Computer Science
Nonparametric Bayesian Methods for Genetic Inference
Eric P. Xing, MLD, LTI and CSD, School of Computer Science, Carnegie Mellon University
Genome Polymorphisms
[Figure: chromosome genealogies, drawn from ancestors through time to the present]
Single Nucleotide Polymorphisms
The ancestral chromosome:
TCGAGGTATTAAC
The SNPs (variable sites marked with *):
  *   ** * *
TCGAGGTATTAAC
TCTAGGTATTAAC
TCGAGGCATTAAC
TCTAGGTGTTAAC
TCGAGGTATTAGC
TCTAGGTATCAAC
• Each DNA site is called a "locus"
• Each variant is called an "allele"
The haplotypes (█ marks the SNP sites in each of the six chromosomes above):
TC█AGG██T█A█C
• Useful markers for studying disease association or genome evolution: landmarks, indicators, covariates, causes …
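The SNP sites on this slide can be recovered mechanically from the six aligned sequences. The following is a minimal sketch; the function names `snp_positions` and `haplotypes_at` are illustrative and do not come from the slides.

```python
# Aligned chromosomes copied from the slide.
chromosomes = [
    "TCGAGGTATTAAC",
    "TCTAGGTATTAAC",
    "TCGAGGCATTAAC",
    "TCTAGGTGTTAAC",
    "TCGAGGTATTAGC",
    "TCTAGGTATCAAC",
]

def snp_positions(seqs):
    """Return 0-based indices of alignment columns with more than one allele."""
    return [i for i, column in enumerate(zip(*seqs)) if len(set(column)) > 1]

def haplotypes_at(seqs, positions):
    """Keep only the alleles at the given loci for each sequence."""
    return ["".join(s[i] for i in positions) for s in seqs]

positions = snp_positions(chromosomes)
haps = haplotypes_at(chromosomes, positions)
print(positions)
print(haps)
```

The polymorphic columns found this way coincide with the starred and masked sites on the slide.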
Genetic Inference
• Determine genetic markers: haplotype inference
• Reveal genome inheritance events: recombination hotspot identification
• Deconvolve population structure: ancestral spectrum analysis
Outline
• Haplotype inference
  • Dirichlet process for phasing a single population
  • Hierarchical DP for phasing multiple populations
• Linkage-disequilibrium analysis
  • Hidden Markov DP for identifying recombination hotspots
• Population structure analysis
  • Admixture model
  • HMDP models
Haplotype Ambiguity
[Figure: sequencing a heterozygous diploid individual with chromosomes Cp and Cm]
• Genotype g: pairs of alleles, whose associations to chromosomes are unknown
• Haplotype h ≡ (h1, h2): possible associations of alleles to chromosomes
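To make the ambiguity concrete, the sketch below enumerates every unordered haplotype pair consistent with a genotype. The three-locus genotype is hypothetical (it is not read off the slide's figure), and `compatible_haplotype_pairs` is an illustrative name.

```python
from itertools import product

def compatible_haplotype_pairs(genotype):
    """Enumerate unordered haplotype pairs (h1, h2) consistent with a genotype.

    `genotype` is a list of per-locus allele pairs, e.g. [("T", "C"), ("G", "G")].
    """
    pairs = set()
    # At each locus, either keep the pair as (a, b) or flip it to (b, a).
    for flips in product([False, True], repeat=len(genotype)):
        h1 = "".join(b if flip else a for (a, b), flip in zip(genotype, flips))
        h2 = "".join(a if flip else b for (a, b), flip in zip(genotype, flips))
        pairs.add(tuple(sorted((h1, h2))))  # (h1, h2) and (h2, h1) are the same phasing
    return sorted(pairs)

# Hypothetical genotype: heterozygous at the first two loci, homozygous at the third.
genotype = [("T", "C"), ("G", "A"), ("A", "A")]
pairs = compatible_haplotype_pairs(genotype)
print(pairs)
```

A genotype with k ≥ 1 heterozygous loci admits 2^(k-1) distinct phasings, which is why additional information, such as haplotype sharing across a population, is needed to choose among them.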
Haplotype Inference
The Rationale: parsimony
• Many haplotypes are shared in a population
• Data for many individuals allows phasing
[Figure: SNP genotypes (1/0, 0/1, 1/1, …) with two candidate resolutions into binary haplotypes; one solution seems "better" since it uses fewer haplotypes]
A Finite (Mixture of) Allele Model
• The probability of a genotype g: a sum over the population haplotype pool of a haplotype model times a genotyping model
• Standard settings:
  • |H| = K << 2^J: fixed-sized population haplotype pool
  • p(h1, h2) = p(h1) p(h2) = f1 f2: Hardy-Weinberg equilibrium
• Problem: K? H?
[Figure: graphical model with nodes Hn1, Hn2, and Gn]
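The genotype probability itself did not survive extraction. Based on the three labels that remain (population haplotype pool, haplotype model, genotyping model), a plausible reconstruction is the standard finite-mixture form below. This is an assumption, not the slide's verbatim equation.

```latex
p(g) \;=\; \sum_{h_1, h_2 \in H}
  \underbrace{p(h_1, h_2)}_{\text{haplotype model}}\;
  \underbrace{p(g \mid h_1, h_2)}_{\text{genotyping model}},
\qquad
p(h_1, h_2) = p(h_1)\, p(h_2) = f_1 f_2 .
```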
The PAC Model
• The joint probability of all haplotypes h1, h2, …, hn:
• Problem:
  • Ordering?
  • Ancestor?
[Figure: graphical model with haplotype pairs (H1,1, H1,2), (H2,1, H2,2), …, (Hn,1, Hn,2) and genotype nodes Gn]
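An Infinite (Mixture of) Allele Model
• How?
• Via a nonparametric hierarchical Bayesian formalism!
[Figure: graphical model with infinitely many (∞) ancestor components (Ak, θk) and N individuals with nodes Hn1, Hn2, and Gn]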
Dirichlet Process
• A CDF, G, on possible worlds of random partitions follows a Dirichlet Process if, for any measurable finite partition (φ1, φ2, …, φm):
  (G(φ1), G(φ2), …, G(φm)) ~ Dirichlet(αG0(φ1), …, αG0(φm))
  where G0 is the base measure and α is the scale parameter
• Thus a Dirichlet Process G defines a distribution over distributions
[Figure: possible worlds of partitions]
DP: a Pólya Urn Process
[Equations: joint and marginal distributions of the urn draws]
• "Infinite"
• Self-reinforcing property
• Exchangeable partition of samples
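The slide's equations were lost in extraction. For reference, the standard Pólya urn (Blackwell-MacQueen) predictive rule obtained by marginalizing out G is given below. The symbols x_i for the draws, x*_k for the K distinct values seen so far, and n_k for their counts are notation introduced here, not taken from the slide.

```latex
x_n \mid x_1, \dots, x_{n-1}, \alpha, G_0 \;\sim\;
\sum_{k=1}^{K} \frac{n_k}{n - 1 + \alpha}\, \delta_{x^*_k}
\;+\; \frac{\alpha}{n - 1 + \alpha}\, G_0
```

The first term is the self-reinforcing part (popular values are drawn again more often), and the second term always leaves room for a new value, which is the sense in which the mixture is "infinite".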
Clustering and DP Mixture
• We can associate ancestors (i.e., mixture components) with the colors in the Pólya urn and thereby define an infinite clustering of the haplotypes (i.e., balls)
[Figure: balls 1, 2, 3, 4, 5, 6, … grouped by color into clusters {a1, θ1}, {a2, θ2}, {a3, θ3}]
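The clustering induced by the urn can be simulated in a few lines. This is a minimal sketch using only the standard library; `polya_urn_partition` and its parameters are illustrative names, and the colors are represented simply as integer labels rather than as ancestors drawn from G0.

```python
import random

def polya_urn_partition(n, alpha, seed=0):
    """Assign n balls to colors (clusters) with the Pólya urn scheme.

    An existing color k is chosen with weight counts[k]; a brand-new color
    is chosen with weight alpha (alpha must be positive).
    """
    rng = random.Random(seed)
    counts = []       # counts[k] = number of balls that have color k
    assignments = []  # assignments[i] = color of ball i
    for _ in range(n):
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)   # open a new color
        else:
            counts[k] += 1     # reinforce an existing color
        assignments.append(k)
    return assignments, counts

assignments, counts = polya_urn_partition(n=20, alpha=1.0, seed=1)
print(assignments)
print(counts)
```

Larger values of `alpha` tend to produce more colors, so the number of clusters is not fixed in advance but grows with the data.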
Dirichlet Process Mixture of Haplotypes (Xing et al. ICML 2004)
• A Hierarchical Bayesian Infinite Allele model
[Figure: graphical model. DP part: α, G0, G, and infinitely many (∞) mixture components (Ak, θk) for the population haplotypes. Likelihood model: Hn1, Hn2, and Gn for the individual haplotypes and genotypes]
Population Genetic Basis of IAM
• Kingman coalescent process with fixed (large) population size
• New population haplotype alleles emerge along all branches of the coalescent tree at rate α/2 per unit length
• Ewens Sampling Formula: an exchangeable random partition of individuals
• Coalescent with mutation ⇒ Dirichlet Process Mixture
[Figure: infinite mixtures and "star" genealogies over N individuals]
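The Ewens Sampling Formula gives the probability of each partition shape directly. The sketch below evaluates the standard formula P(a_1, …, a_n) = n! / (α(α+1)…(α+n−1)) · ∏_j α^{a_j} / (j^{a_j} a_j!), where a_j is the number of blocks of size j. The function name and the example sizes are illustrative.

```python
from math import factorial, prod

def ewens_probability(block_sizes, alpha):
    """Probability that n individuals split into blocks of the given sizes."""
    n = sum(block_sizes)
    rising = prod(alpha + i for i in range(n))  # alpha (alpha+1) ... (alpha+n-1)
    multiplicity = {}                           # a_j: number of blocks of size j
    for size in block_sizes:
        multiplicity[size] = multiplicity.get(size, 0) + 1
    p = factorial(n) / rising
    for j, a_j in multiplicity.items():
        p *= alpha ** a_j / (j ** a_j * factorial(a_j))
    return p

# The three possible partition shapes of n = 3 individuals.
shapes = [[3], [2, 1], [1, 1, 1]]
probs = [ewens_probability(s, alpha=2.5) for s in shapes]
print(probs, sum(probs))
```

The probabilities over all partition shapes sum to one, and the result depends only on the block sizes, which reflects the exchangeability stated on the slide.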
Inheritance and Observation Models
• Single-locus mutation model: ancestral pool → haplotypes
• Noisy observation model: haplotypes → genotype
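The slide's formulas are not recoverable from the transcript. One common form of a single-locus mutation model, used here purely as an assumption, is that a haplotype allele copies the ancestral allele with probability `theta` and otherwise mutates uniformly to one of the other alleles, independently at each locus. The function below is a hypothetical sketch of that form.

```python
from itertools import product

def mutation_likelihood(haplotype, ancestor, theta, n_alleles=2):
    """P(haplotype | ancestor) under an assumed per-locus copy-or-mutate model."""
    p = 1.0
    for h, a in zip(haplotype, ancestor):
        # Copy the ancestral allele with probability theta; otherwise spread
        # the remaining mass evenly over the other n_alleles - 1 alleles.
        p *= theta if h == a else (1.0 - theta) / (n_alleles - 1)
    return p

ancestor = "010"
exact = mutation_likelihood("010", ancestor, theta=0.9)
one_mismatch = mutation_likelihood("011", ancestor, theta=0.9)
# Sanity check: the likelihood sums to one over all binary haplotypes.
total = sum(mutation_likelihood("".join(h), ancestor, 0.9)
            for h in product("01", repeat=3))
print(exact, one_mismatch, total)
```

A noisy observation model linking a haplotype pair to the observed genotype could be written in the same copy-or-error style, again as an assumption rather than the slide's definition.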
Pólya urn MCMC for Haplotype Inference
• Posterior ∝ Prior × Likelihood
• Gibbs sampling for exploring the posterior distribution under the proposed model
• Integrate out some of the model parameters and sample the remaining variables
• Gibbs sampling algorithm: draw samples of each random variable to be sampled given the values of all the remaining variables
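A single Gibbs step for one haplotype's ancestor assignment combines the urn prior with the likelihood. The sketch below shows the generic collapsed-Gibbs conditional for a DP mixture, not necessarily the exact sampler of the talk; all names and numbers are hypothetical.

```python
def assignment_probabilities(counts, alpha, likelihoods, new_likelihood):
    """Conditional distribution over ancestors for one haplotype.

    counts[k]       number of other haplotypes currently assigned to ancestor k
    likelihoods[k]  likelihood of this haplotype under ancestor k
    new_likelihood  marginal likelihood under a fresh ancestor drawn from G0
    The last entry of the result is the probability of opening a new ancestor.
    """
    weights = [n_k * lik for n_k, lik in zip(counts, likelihoods)]  # prior x likelihood
    weights.append(alpha * new_likelihood)
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical numbers: two existing ancestors plus the new-ancestor option.
probs = assignment_probabilities(counts=[3, 1], alpha=1.0,
                                 likelihoods=[0.081, 0.729], new_likelihood=0.125)
print(probs)
```

In this example the second ancestor wins despite its smaller count because the likelihood term outweighs the prior; a full sampler would draw from `probs` and then repeat the step for every haplotype in turn.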
Results - HapMap Data
• DP vs. Finite Mixture via EM
[Figure: comparison of DP and EM]
Multi-population Genetic Demography • Inference done separately, or jointly?
Multi-population Genetic Demography
• Pool everything together and solve 1 hap problem? (ignores population structure)
• Solve 4 hap problems separately? (data fragmentation)
• Co-clustering … solve 4 coupled hap problems jointly
Hierarchical DP Mixture (Xing et al. ICML 2006)
[Figure: top-level DP with parameters γ and H, and population-specific measures G1, G2, G3, G4 for populations 1 to 4]
A Hierarchical Pólya Urn Sampler
• Two-level Pólya urn scheme
• Draws from the stock urn define a Dirichlet Process DP(γ, H)
• Conditioning on DP(γ, H), the mj-th draw from the j-th bottom-level urn also forms a Dirichlet measure
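The two-level scheme can be sketched as follows. This is an illustrative simulation of a generic two-level urn, written under the assumption that each population-level urn falls back to the shared stock urn with weight `alpha`, and that the stock urn creates a new color with weight `gamma`. The function and parameter names are hypothetical.

```python
import random

def hierarchical_polya_urn(draws_per_population, alpha, gamma, seed=0):
    """Simulate colors (ancestors) drawn by several population-level urns."""
    rng = random.Random(seed)
    stock = []  # stock[k] = number of times color k was handed out by the stock urn
    urns = []   # urns[j][k] = number of balls of color k in population j's urn
    for n_j in draws_per_population:
        urn = {}
        for _ in range(n_j):
            colors = list(urn)
            weights = [urn[k] for k in colors] + [alpha]
            pick = rng.choices(range(len(weights)), weights=weights)[0]
            if pick < len(colors):
                k = colors[pick]  # reuse a color already in this population's urn
            else:
                # Consult the stock urn shared by all populations.
                stock_weights = stock + [gamma]
                k = rng.choices(range(len(stock_weights)), weights=stock_weights)[0]
                if k == len(stock):
                    stock.append(0)  # a brand-new color, i.e. a fresh draw from H
                stock[k] += 1
            urn[k] = urn.get(k, 0) + 1
        urns.append(urn)
    return urns, stock

urns, stock = hierarchical_polya_urn([20, 20, 20, 20], alpha=1.0, gamma=1.0, seed=3)
print(urns)
print(stock)
```

Because new colors can only enter through the stock urn, the same ancestor label can appear in several populations, which is what couples the per-population haplotype problems.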
Results - Simulated Data
• 5 populations with 20 individuals each (two kinds of mutation rates)
• The 5 populations share parts of their ancestral haplotypes
• Sequence length = 10
[Figure: haplotype error]
Results on Simulated Data Estimation of K
Results - International HapMap DB • Different sample sizes, and different # of sub-populations
Modeling Haplotype Structure
[Figure: underlying haplotypes split into blocks under an inhomogeneous transition model, with per-block counts 10, 12, 8, 11, 11, 8, 14, 5, 10, 1]
• Open issues:
  • Where are the boundaries? How many haplotypes per block?
  • Genetically unreasonable to assume different numbers of haplotypes for different blocks
Inheritance Model
• Each individual haplotype is a mosaic of ancestral haplotypes
[Figure: ancestral chromosomes (K = 5) and individual chromosomes, with mutations marked by x]
The Hidden Markov Model
• Transition process: recombination (a K × K transition matrix from state Ct to state Ct+1, each ranging over ancestors 1, 2, …, K)
• Emission process: mutation
• How many recombining ancestors?
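The mosaic view can be illustrated with a small simulation in which the hidden state is the ancestor being copied, transitions model recombination, and emissions model mutation. This is a simplified sketch with a fixed K and uniform jumps, which are assumptions made here; the ancestral sequences and parameter values are hypothetical.

```python
import random

def simulate_mosaic(ancestors, recomb_prob, mutation_prob, alphabet="ACGT", seed=0):
    """Generate one haplotype as a mosaic of K ancestral haplotypes."""
    rng = random.Random(seed)
    n_ancestors, length = len(ancestors), len(ancestors[0])
    c = rng.randrange(n_ancestors)  # initial hidden state
    path, haplotype = [], []
    for t in range(length):
        # Transition: with probability recomb_prob, jump to a uniformly chosen ancestor.
        if t > 0 and rng.random() < recomb_prob:
            c = rng.randrange(n_ancestors)
        # Emission: copy the ancestral allele, mutating it with probability mutation_prob.
        allele = ancestors[c][t]
        if rng.random() < mutation_prob:
            allele = rng.choice([b for b in alphabet if b != allele])
        path.append(c)
        haplotype.append(allele)
    return path, "".join(haplotype)

ancestors = ["AAAAAAAAAA", "CCCCCCCCCC", "GGGGGGGGGG"]  # hypothetical, K = 3
path, hap = simulate_mosaic(ancestors, recomb_prob=0.2, mutation_prob=0.05, seed=4)
print(path)
print(hap)
```

Here K is fixed by the length of `ancestors`; the question on the slide is precisely how to avoid fixing it in advance.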
Recall DP Mixture (Xing et al. ICML 2004, 2006)
• A Hierarchical Bayesian Infinite Allele model
[Figure: graphical model. DP part: α, G0, G, and infinitely many (∞) mixture components (Ak, θk) for the population haplotypes. Likelihood model: Hn1, Hn2, and Gn for the individual haplotypes and genotypes]
Hidden Markov Dirichlet Process (Xing and Sohn, Bayesian Analysis, 2007; Sohn and Xing, ISMB 2007)
• Hidden Markov Dirichlet process mixtures
  • Extension of the HMM to an infinite ancestral space
  • Infinite-dimensional transition matrix
  • Each row of the transition matrix (from Ct to Ct+1) is modeled with a DP
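HMDP as a Graphical Model
[Figure: graphical model with nodes A, C1, C2, C3, …, CN, and H, with infinite (∞) plates; annotated with the tasks it supports: ancestor allele reconstruction, inferring population structure, and inferring recombination hotspots]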
MCMC Inference • Gibbs sampling • Block sampler: • Pólya urn sampler: posterior transition probability
Recombination Analysis
• Recombination hotspot detection
[Figure: recombination profile with the threshold for hotspot detection]
Recombination Analysis
[Figure panels: CEU, YRI, HCB+JPT, HapMap4]
Genetic Population Structure
• How to display population structure?
• Structure
[Figure: ancestral proportions, from "Genetic Structure of Human Populations" (Rosenberg et al. 2002)]
The Admixture Model
• Admixture of "ancestral frequency profiles (AP)"
• Ancestral populations represented as allele frequency profiles
• No distinction between ancestral and current alleles
• Does not model mutation and chromosomal recombination
[Figure: Structure 2.1 example with values 0.4, 0.2, 0.3, 0.3, 0.7, 0.2, 0.3, 0.1, 0.5 and 0.02, 0.08, 0.90]
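The generative process behind the admixture model can be sketched as follows: at each locus, pick an ancestral population according to the individual's ancestral proportions, then draw an allele from that population's frequency profile. The function name and the frequency table are hypothetical. The proportions reuse three numbers that appear on the slide, and reading them as one individual's ancestral proportions is an assumption.

```python
import random

def sample_admixed_alleles(proportions, allele_freqs, seed=0):
    """Sample one allele per locus for an admixed individual.

    proportions[k]      ancestral proportion of population k for this individual
    allele_freqs[k][l]  frequency of allele 1 at locus l in ancestral population k
    """
    rng = random.Random(seed)
    n_loci = len(allele_freqs[0])
    origins, alleles = [], []
    for locus in range(n_loci):
        # Choose the ancestral population this locus is inherited from.
        z = rng.choices(range(len(proportions)), weights=proportions)[0]
        # Draw the allele from that population's frequency profile.
        x = 1 if rng.random() < allele_freqs[z][locus] else 0
        origins.append(z)
        alleles.append(x)
    return origins, alleles

proportions = [0.02, 0.08, 0.90]   # assumed reading of the slide's numbers
allele_freqs = [                   # hypothetical profiles for 3 populations, 5 loci
    [0.9, 0.1, 0.5, 0.3, 0.7],
    [0.2, 0.8, 0.4, 0.6, 0.1],
    [0.5, 0.5, 0.9, 0.2, 0.3],
]
origins, alleles = sample_admixed_alleles(proportions, allele_freqs, seed=5)
print(origins)
print(alleles)
```

Note that the sampled allele is drawn directly from the ancestral frequency profile, which is the sense in which the model makes no distinction between ancestral and current alleles.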
From Structure to mStruct (Shringarpure and Xing, ICML 2008)
• From admixture of APs to admixture of MIMs
• MIM: population-specific Mixture of Inheritance Models
• The inheritance model:
  • Microsatellite:
  • SNPs:
Variational Inference
• The joint:
• We can sample z, c, and θ as in Structure, but this is slow
• Alternatively, we approximate p(z, c, θ | x) by q(z, c, θ) = q(z) q(c) q(θ)
• Minimizing KL(q ‖ p):
• Fixed-point iteration …
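The slide's own equations are missing from the transcript. For orientation, the generic mean-field identities that this kind of factorized approximation relies on are given below. They are standard results rather than the mStruct-specific updates, and the symbol \mathcal{L}(q) for the lower bound is notation introduced here.

```latex
\begin{align}
\log p(x) &= \mathcal{L}(q) + \mathrm{KL}\big(q(z, c, \theta) \,\|\, p(z, c, \theta \mid x)\big), \\
\mathcal{L}(q) &= \mathbb{E}_q\big[\log p(x, z, c, \theta)\big] - \mathbb{E}_q\big[\log q(z, c, \theta)\big], \\
q^*(z) &\propto \exp\Big\{ \mathbb{E}_{q(c)\, q(\theta)}\big[\log p(x, z, c, \theta)\big] \Big\}
\end{align}
```

Since log p(x) does not depend on q, minimizing the KL term is equivalent to maximizing the lower bound. The last line is the fixed-point update for one factor with the other two held fixed, with analogous updates for q(c) and q(θ); cycling through the three updates gives the fixed-point iteration.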