Reconstructing Genealogies: a Bayesian approach Dario Gasbarra Matti Pirinen Mikko Sillanpää Elja Arjas 4.1.2006 Department of Mathematics and Statistics
We all are related … but to different degrees … • Consider a population evolving in time • Inverse problem: • Suppose the current state of the process is known • individuals alive at the moment • What was the path leading to this state? • family structures (pedigree) • inheritance patterns
Pedigrees • Specify relationship categories • Parent-offspring, full siblings / half siblings, first cousins etc. • In graphs • Circles for females, squares for males • Black nodes represent nuclear families • Time runs downwards
Gene flow • Alleles (i.e. different variants of the same gene) flow through the pedigree • Gene flow gives us a means to quantify the degree of relatedness between individuals • How much of their genome do two individuals share? • At what loci do they have identical alleles? (Figure labels: chromosome, allele, DNA)
Gene flow • Two alleles may be identical • by state (IBS) • They have the same DNA sequence • by descent (IBD) • They descend from the same ancestral allele within a given reference frame • Here the children share allele 1 IBS, but not IBD (w.r.t. their parents’ generation).
Meiosis • When gametes are formed, the paternal and the maternal chromosomes (haplotypes) may cross over and recombine
Haldane’s model of recombination • The recombination fraction θ between two loci on the same chromosome is the proportion of meioses in which a recombination event (i.e., an odd number of crossovers) takes place between the loci • Haldane’s model assumes that crossovers occur independently along each chromosome • a Poisson process model follows (Figure: chromosome annotated with recombination fractions 17%, 9% and 9.5%)
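The independence assumption has a simple consequence for adjacent intervals, sketched below. It assumes (this is a reading of the lost figure, not something the slide states) that 9% and 9.5% label two adjacent intervals and 17% the interval spanning both. The function name is hypothetical.

```python
def combine_recombination_fractions(theta_ab: float, theta_bc: float) -> float:
    """Recombination fraction between outer loci A and C when recombination
    in the intervals A-B and B-C is independent (no interference)."""
    # A and C are recombinant only if exactly one of the two intervals recombines.
    return theta_ab * (1.0 - theta_bc) + theta_bc * (1.0 - theta_ab)


theta_ac = combine_recombination_fractions(0.09, 0.095)
print(round(theta_ac, 4))  # 0.1679, i.e. about 17%
```

Note that the fractions do not simply add (9% + 9.5% = 18.5%), because double recombinants between A and C go unobserved.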
The frame for study • From now on we assume that we have fixed • A population whose size we know for T-1 (non-overlapping) generations backwards in time • N sampled individuals from the current generation • Marker map with M markers and known recombination fractions • Allele frequencies at the population level for each of the markers
A (prior) model for a possible history • A configuration C consists of • a pedigree • allelic paths • Specify probabilities for • Pedigree graph, Pg(C) • Recombination events, Pr(C) • Founder alleles, Pa(C) • The total probability for C is P(C) = Pg(C) x Pr(C) x Pa(C)
A probability model for pedigrees • For fixed • number of generations, T-1, backwards in time • population size in each generation (number of ♂ and ♀) • sample of size N from the current generation • mating parameters α and β • To simulate a pedigree from this distribution: • Proceed generation by generation from 0,…,T-1. • Let children choose parents according to a Pólya urn scheme, where α affects the correlation of the choices of fathers and β affects the correlation of the choices of mothers given the choices of fathers. • Gasbarra D, Sillanpää M, Arjas E (2005) Backward Simulation of Ancestors of Sampled Individuals. Theor Pop Biol 67:75-83.
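To make the urn idea concrete, here is a minimal sketch of a textbook Pólya urn for the choice of fathers only. It is an illustration under assumptions: the exact parametrisation by α and β in the cited paper may differ, and the function name and arguments are hypothetical.

```python
import random


def polya_urn_choices(n_children, n_parents, alpha, rng):
    """Each child picks one parent in turn; parent j is chosen with probability
    proportional to alpha + (number of children already assigned to j).
    This is a generic urn, not necessarily the scheme of the cited paper."""
    counts = [0] * n_parents
    choices = []
    for _ in range(n_children):
        weights = [alpha + c for c in counts]
        parent = rng.choices(range(n_parents), weights=weights, k=1)[0]
        counts[parent] += 1
        choices.append(parent)
    return choices


rng = random.Random(1)
few_dominant = polya_urn_choices(39, 20, alpha=0.05, rng=rng)   # strong reinforcement
near_random = polya_urn_choices(39, 20, alpha=50.0, rng=rng)    # close to uniform choice
print(len(set(few_dominant)), len(set(near_random)))
```

In this sketch a small `alpha` makes earlier choices dominate, so a few fathers collect most children, while a large `alpha` approaches random mating.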
Examples with different parameters • Left: a few dominant males + monogamy • Middle: a few dominant males • Right: Random mating
Probability for allelic paths • For each non-founder haplotype in the pedigree form the expression • Take the product of these over all haplotypes to obtain Pr(C) • Consider all founder alleles and take the product of the corresponding population allele frequencies to get Pa(C)
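The per-haplotype expression itself is not reproduced in this text. Under Haldane's model one natural candidate, sketched here purely as an assumption, is a factor 1/2 for the parental chromosome copied at the first marker, times θ for every interval where the copied chromosome switches and 1-θ where it does not. The names `origins`, `thetas` and `freqs` are hypothetical.

```python
from math import prod


def haplotype_transmission_prob(origins, thetas):
    """origins[m] in {0, 1}: which parental chromosome marker m was copied from.
    thetas[m]: recombination fraction between markers m and m+1.
    Assumed form, not taken from the slides."""
    if len(thetas) != len(origins) - 1:
        raise ValueError("need one recombination fraction per marker interval")
    p = 0.5  # either parental chromosome is equally likely at the first marker
    for left, right, theta in zip(origins, origins[1:], thetas):
        p *= theta if left != right else 1.0 - theta
    return p


def founder_allele_prob(founder_alleles, freqs):
    """founder_alleles: list of (marker_index, allele) pairs;
    freqs[marker_index][allele]: population allele frequency."""
    return prod(freqs[m][a] for m, a in founder_alleles)


# One recombination among three intervals, each with theta = 0.05
print(haplotype_transmission_prob([0, 0, 1, 1], [0.05, 0.05, 0.05]))
# Two founder alleles at marker 0, frequencies 0.1 each
print(founder_allele_prob([(0, "a"), (0, "b")], [{"a": 0.1, "b": 0.1}]))
```

Pr(C) would then be the product of the first function over all non-founder haplotypes, and Pa(C) the value of the second over all founder alleles.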
Data • Assume that we also have • Genotype data of the sampled individuals on M markers • The (posterior) probability in our model is π(C) ∝ Pg(C) x Pr(C) x Pa(C) x 1(C consistent with the data) • We are able to sample efficiently from the prior but not from the posterior
Markov chain Monte Carlo sampling • We generate a Markov chain whose state space consists of all configurations consistent with the data and whose stationary distribution is our posterior (Metropolis-Hastings algorithm) • If this chain is irreducible then the expected values of functions defined on the space of configurations can be approximated with sample averages • haplotype configurations • IBD-sharing between individuals
Metropolis-Hastings algorithm • The M-H algorithm produces a chain of configurations, where at each step a new configuration is proposed (from some proposal distribution) and is then either accepted or rejected, according to a rule that depends on the posterior π and the proposal distribution. • Good proposals that are accepted reasonably often are needed so that the chain moves around the state space within a reasonable amount of time.
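The accept/reject step can be illustrated on a toy problem. The sketch below is not the pedigree sampler: it assumes a symmetric proposal on five states arranged in a ring, so the acceptance probability reduces to min(1, π(new)/π(old)). All names and the toy target are hypothetical.

```python
import random


def metropolis_hastings(weight, propose, x0, n_steps, rng):
    """Generic M-H sampler for a symmetric proposal.
    weight(x) is the unnormalised target; x0 must have positive weight."""
    x = x0
    chain = [x]
    for _ in range(n_steps):
        y = propose(x, rng)
        # Symmetric proposal: the ratio involves only the target weights.
        ratio = weight(y) / weight(x)
        if rng.random() < min(1.0, ratio):
            x = y            # accept the proposed state
        chain.append(x)      # on rejection the current state is repeated
    return chain


# Toy target on states 0..4 (unnormalised; the weights sum to 10)
weights = [1.0, 2.0, 4.0, 2.0, 1.0]


def weight(x):
    return weights[x]


def propose(x, rng):
    # Step to a neighbouring state on a ring of five states.
    return (x + rng.choice([-1, 1])) % len(weights)


rng = random.Random(0)
chain = metropolis_hastings(weight, propose, x0=0, n_steps=20000, rng=rng)
freq_state_2 = chain.count(2) / len(chain)
print(freq_state_2)  # should be close to 4/10
```

A state with weight zero (the analogue of a configuration inconsistent with the data) is never accepted, so a chain started in an admissible state stays in the admissible set.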
Proposals • Highly dependent variables (close relatives and linked markers) require large block updates • Different versions of proposals • A (randomly chosen) group of children chooses (possibly new) parents transmitting their alleles to these parents • All children of a particular father/mother choose a (possibly new) mother/father transmitting their alleles to her/him • One child at a time chooses new parent(s) transmitting alleles to them • All children within the group jointly choose new parents and transmit alleles • Pedigree is not changed but new allele paths are proposed
An example • Simulated pedigree • 10 generations • Youngest generation • 39 individuals divided into • 13 nuclear families • Population • 200 founders • growing exponentially (factor 1.2)
Example continues… • Simulated gene flow on the pedigree • 20 markers • 10 equally frequent alleles at each locus in the founder generation • Haldane’s model of recombination (no interference) • Spacing between adjacent markers 5.3 cM (i.e. recombination fraction 0.05)
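The correspondence between 5.3 cM and a recombination fraction of 0.05 follows from Haldane's map function, θ = (1 - exp(-2d))/2 with d in Morgans. A small check, with hypothetical function names:

```python
import math


def haldane_theta(distance_cM: float) -> float:
    """Recombination fraction for a map distance in centimorgans (Haldane)."""
    d = distance_cM / 100.0  # convert cM to Morgans
    return 0.5 * (1.0 - math.exp(-2.0 * d))


def haldane_distance_cM(theta: float) -> float:
    """Inverse map function: map distance in cM for a fraction theta < 0.5."""
    return -50.0 * math.log(1.0 - 2.0 * theta)


print(round(haldane_theta(5.3), 3))         # 0.05
print(round(haldane_distance_cM(0.05), 1))  # 5.3
```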
Reconstruction • We gave the algorithm • The genotype data on the youngest generation • The (correct) marker map • The (correct) allele frequencies • The population structure • The algorithm was run for 500,000 iterations
Reconstructing the haplotypes • Each individual (in a diploid species) carries two copies of each chromosome • One is inherited from the father (mother) and is called the paternal (maternal) haplotype • Genotyping does not (usually) determine which multilocus allelic combination is inherited from the same parent • e.g. the lab reports genotypes {1,2} x {4,3} at two loci • the true haplotypes may be either (13,24) or (14,23) • There exist two kinds of haplotyping methods • Pedigree based (SimWalk2, Merlin, Genehunter) • Population based (PHASE, HAPLOFREQ)
Reconstructing the haplotypes • The accuracy of the haplotype reconstruction can be measured with the switch distance (SD) • The SD between two pairs of haplotypes is the number of phase relations between neighboring loci that need to be changed in order to turn the first pair of haplotypes into the other • If the correct haplotypes were (111111,222222) then • (111222,222111) has SD=1 • (112211,221122) has SD=2 • (121212,212121) has SD=5
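The three SD values above can be reproduced with a short function. This is a sketch with one added assumption: homozygous loci carry no phase information and are skipped, a case the slide's examples do not cover. The function name is hypothetical.

```python
def switch_distance(true_pair, est_pair):
    """Number of phase switches between neighbouring heterozygous loci needed
    to turn est_pair into true_pair. Each pair is two equal-length sequences."""
    t1, t2 = true_pair
    e1, e2 = est_pair
    phases = []
    for a1, a2, b1, b2 in zip(t1, t2, e1, e2):
        if sorted((a1, a2)) != sorted((b1, b2)):
            raise ValueError("the two haplotype pairs imply different genotypes")
        if a1 == a2:
            continue  # homozygous locus: phase is undefined, skip it (assumption)
        phases.append(b1 == a1)  # True if the estimate is in the same phase here
    # Count the places where the phase flips between consecutive loci.
    return sum(p != q for p, q in zip(phases, phases[1:]))


truth = ("111111", "222222")
print(switch_distance(truth, ("111222", "222111")))  # 1
print(switch_distance(truth, ("112211", "221122")))  # 2
print(switch_distance(truth, ("121212", "212121")))  # 5
```

Swapping the order of the two haplotypes in a pair leaves the distance unchanged, since only changes of phase between loci are counted.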
Reconstructing the haplotypes • The SDs between the reconstructed and the true haplotype pairs of the youngest generation (sum over all 39 individuals)
Reconstructing the IBD sharing • We consider those alleles IBD (identical by descent) that trace back to a common ancestral allele at the founder level (9 generations backwards in time) • It is possible to calculate a single quantity that measures the proportion of the genome that two individuals share (coefficient of relatedness r) • It is also possible to compare the IBD sharing more accurately along the chromosome
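One way to turn IBD sharing into a single number is sketched below. It assumes each allele is labelled by the founder allele it descends from, and it takes r as the average number of IBD-shared alleles per marker divided by two. This is a common convention, but the talk does not state its exact formula, so treat the definition and the names as assumptions.

```python
from collections import Counter


def ibd_shared(pair_a, pair_b):
    """Number of founder alleles (0, 1 or 2) that two individuals share
    at one marker, given their founder-allele labels."""
    return sum((Counter(pair_a) & Counter(pair_b)).values())


def relatedness(labels_a, labels_b):
    """Average IBD sharing over markers, scaled to [0, 1] (assumed definition)."""
    shared = [ibd_shared(a, b) for a, b in zip(labels_a, labels_b)]
    return sum(shared) / (2 * len(shared))


# Founder-allele labels at four markers for two individuals
ind_a = [(1, 2), (1, 2), (1, 2), (1, 2)]
ind_b = [(1, 3), (1, 2), (4, 3), (2, 4)]
print(relatedness(ind_a, ind_b))  # 0.5
```

In an MCMC run this quantity would be averaged over the sampled configurations; the per-marker values from `ibd_shared` give the finer comparison along the chromosome.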
Reconstructing IBD • The reconstructed relatedness coefficients of each of the 741 pairs of the individuals belonging to the youngest generation were compared with the true values (sum of squared errors shown)
Comparison with IBS-based estimators • Distribution of L_2 errors (741 values) • Sums: 1.93, 3.25, 3.27, 3.51
Another example of pedigree reconstruction • Population with 200 individuals, 50 markers / 9 alleles
Future work • Possibility of fixing some parts of the pedigree • Extending partially known genotype data to the known pedigree in accordance with the Mendelian rules of inheritance is in general an NP-complete problem (Figure: pedigree annotated with genotypes such as a/b, c/d, b/e)
Future work with the reconstruction algorithm • Adding a QTL (quantitative trait locus) model to the algorithm • Does phenotype correlate with IBD sharing in some chromosomal region(s)? • Running many chains in parallel "at different temperatures"
Thanks Dario Gasbarra, Mikko Sillanpää and Matti Pirinen