The distribution of the IBD sharing and applications. Shai Carmi Itsik Pe’er’s lab Department of Computer Science Columbia University. Tel Aviv University July 23, 2012. Outline. IBD: introduction Coalescent theory of IBD Distribution of pairwise sharing. The variance.
The distribution of the IBD sharing and applications Shai Carmi Itsik Pe’er’s lab Department of Computer Science Columbia University Tel Aviv University July 23, 2012
Outline • IBD: introduction • Coalescent theory of IBD • Distribution of pairwise sharing. • The variance. • The variance of the cohort-averaged sharing. • Applications • Imputation by IBD • Sequencing study design. • Siblings. • Demographic inference. • Jewish genetics. • Summary
Identity-by-descent (IBD) • In small, isolated populations, all individuals share recent common ancestors. • As a result, long haplotypes that are IBD are abundant. [Figure: chromosomes A and B carrying a shared segment. Image credit: L. Macmillan, UNC]
IBD detection • Until the last decade, IBD was usually defined for single markers. • Genome-wide SNP arrays enable detection of long segments. • GERMLINE (Gusev et al., Genome Res., 2009): a fast algorithm for detecting IBD segments in large cohorts. • Divide the chromosomes into small windows. • For each window, hash the genotypes of each individual and search for perfect matches. • Extend seeds as long as the match is good enough. • Record matches longer than a cutoff m. • Other methods exist.
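The window, hash, extend, and cutoff steps listed above can be sketched in a few lines. This is a toy illustration, not GERMLINE's actual implementation: it assumes phased haplotypes given as equal-length strings, measures the cutoff m in markers rather than genetic distance, and uses a crude mismatch budget (`max_mismatch`, a hypothetical parameter) as a stand-in for "good enough". The function names are made up for this example.

```python
from collections import defaultdict


def extend(x, y, start, end, max_mismatch):
    """Grow the match [start, end) in both directions within a mismatch budget."""
    mismatches = 0
    # Extend to the right.
    while end < len(x):
        if x[end] != y[end]:
            if mismatches == max_mismatch:
                break
            mismatches += 1
        end += 1
    # Extend to the left.
    while start > 0:
        if x[start - 1] != y[start - 1]:
            if mismatches == max_mismatch:
                break
            mismatches += 1
        start -= 1
    return start, end


def find_shared_segments(haplotypes, window, min_len, max_mismatch=0):
    """Toy seed-and-extend search.

    haplotypes: dict mapping a name to a string of alleles, all equal length.
    window: seed window size, in markers.
    min_len: cutoff m, in markers (an assumption; real tools use genetic length).
    Returns a set of (name_a, name_b, start, end) tuples, end exclusive.
    """
    names = sorted(haplotypes)
    n_markers = len(haplotypes[names[0]])
    results = set()
    for w_start in range(0, n_markers - window + 1, window):
        # Hash each window: identical windows fall into the same bucket.
        buckets = defaultdict(list)
        for name in names:
            buckets[haplotypes[name][w_start:w_start + window]].append(name)
        # Every pair inside a bucket is a perfect-match seed.
        for group in buckets.values():
            for i in range(len(group)):
                for j in range(i + 1, len(group)):
                    a, b = group[i], group[j]
                    start, end = extend(haplotypes[a], haplotypes[b],
                                        w_start, w_start + window, max_mismatch)
                    if end - start >= min_len:
                        results.add((a, b, start, end))
    return results
```

Hashing each window means that only individuals with identical windows are ever compared, which is what keeps the search from examining all pairs at every position.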
Questions • How much IBD is expected in model populations? • Consider the fraction of genome shared between all possible pairs. • Mean? • Variance? • Distribution? • Applications • Demographic inference • Study design • Positive selection detection • Phasing and imputation • Pedigree reconstruction
Sequencing study design • A large genotyped cohort. • A subset is selected for sequencing. • Look for IBD segments between sequenced and non-sequenced individuals. • Impute variants along IBD segments. • To maximize utility, select the individuals with the most sharing (Gusev et al., Genetics, 2012 (INFOSTIP)).
Sequencing study design • A large genotyped cohort. • A subset is selected for sequencing. • Look for IBD segments between sequenced and non-sequenced individuals. • Is the strategy useful? Is it worth prioritizing? • How is the average sharing of each individual with the rest of the cohort distributed?
Wright-Fisher model • Non-overlapping, discrete generations. • A population of constant size of N haploid individuals. • Ignore mutations (when studying IBD). • Recombination is a Poisson process. • Each pair of individuals (lineages) has probability 1/N of coalescing in the previous generation. • In the continuous-time limit with large population size, the model is approximated by the coalescent. • The (scaled) time to the MRCA is exponential with rate 1. [Figure: example genealogy with N=10]
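As a quick check of the last two bullets, the pairwise coalescence time can be simulated directly. This is a minimal sketch that assumes only the 1/N per-generation coalescence probability stated above; the function names and the choice of N and replicate count are illustrative.

```python
import random


def sample_pair_tmrca(n_haploid, rng):
    """Generations back until two lineages pick the same parent (prob. 1/N each generation)."""
    g = 1
    while rng.random() >= 1.0 / n_haploid:
        g += 1
    return g


def mean_scaled_tmrca(n_haploid, n_reps, seed=0):
    """Average time to the MRCA in units of N generations, over n_reps replicates."""
    rng = random.Random(seed)
    total = sum(sample_pair_tmrca(n_haploid, rng) for _ in range(n_reps))
    return total / (n_reps * n_haploid)
```

The number of generations is geometric with mean N, so after dividing by N the average should be close to 1, the mean of an exponential with rate 1.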
Mosaic of segments • [Figure: chromosomes A and B along the coordinate from 0 to L, divided into segments ℓ1, …, ℓ11, with the length cutoff m marked; ℓT=ℓ1+ℓ5+ℓ9.] • Consider two (unrelated) chromosomes. • The total sharing fT: the fraction of the chromosome in shared segments of length ≥m. • Observation: all sites are in shared segments, but a segment can be short because its common ancestor is ancient.
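The definition of the total sharing can be written out directly. This is a small sketch assuming the chromosome is given as a list of consecutive segment lengths that together cover it; the function name is hypothetical.

```python
def total_sharing(segment_lengths, m):
    """Fraction of the chromosome covered by segments of length >= m.

    segment_lengths: lengths of the consecutive segments that tile the chromosome.
    m: minimal segment length, in the same units as the lengths.
    """
    chrom_len = sum(segment_lengths)
    shared = sum(length for length in segment_lengths if length >= m)
    return shared / chrom_len
```

For example, with lengths 3.0, 0.5, 1.0, 0.2, 0.3 and m = 1.0, only the first and third segments pass the cutoff.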
Mosaic of segments • [Figure: chromosomes A and B along the coordinate from 0 to L, divided into segments ℓ1, …, ℓ11; ℓT=ℓ1+ℓ5+ℓ9. Inset: lineages of A and B coalescing at time t.] • Assume the (scaled) coalescence time at a site is t. • A segment of length ℓ is shared if there was no recombination event in the history of the two lineages. • Number of meioses: 2Nt.
Mosaic of segments • [Figure: chromosomes A and B along the coordinate from 0 to L, divided into segments ℓ1, …, ℓ11; ℓT=ℓ1+ℓ5+ℓ9.] • Li and Durbin (Nature, 2011) found that at the end of a segment, [equation] • Therefore, [equation]
Renewal theory • [Figure: the mosaic redrawn on a time axis from 0 to T, with waiting times τ1, …, τ11; tS=τ1+τ5+τ9.] • Distribution of waiting times: [equation]
Renewal theory: solution • Laplace transform T→s, tS→u
Mean IBD sharing • Can be derived in many ways. • (1) [equation] • (2) The average number of segments of length ≥m is 2NL·P(ℓ≥m). • (3) Palamara, …, Pe’er, AJHG, 2012. • At the end of the talk (time permitting).
Varying population size • Use the results of Li and Durbin (Nature, 2011), and then proceed as before. • The mean IBD sharing: [equation]
The variance of the IBD sharing • Can also be calculated in a number of ways. • (1) [equation] • (2) Define I(s), the indicator, with probability π (=<fT>), that site s is in a shared segment between two given chromosomes. • Define the number of sites as M. • The variance requires calculating two-site probabilities. • Almost-exact solution at the end of the talk (time permitting).
The variance: simplified • (3) Idea: • Two distant sites will always be on a shared segment if there was no recombination event in their history. • If there was, treat the sites as independent. • Neglect some small terms. • The probability of no recombination: [equation] • The variance: [equation] (d≥m) • For the human genome, [value]
The cohort-averaged sharing • The distribution is close to normal. • But its variance approaches a constant even for a large sample size n. Why? • The variance scales as 1/n for small n. • It approaches a constant for large samples. • For the human genome, [value]
The tail of the cohort-averaged sharing: ‘hyper sharing’ • Even for large cohorts, the distribution of the cohort-averaged sharing retains a constant width. • Some individuals will be in the tails of this distribution: ‘hyper sharing’. • This can be taken advantage of in sequencing studies.
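One way to see why the width does not shrink is the general identity for the variance of an average of correlated quantities. The block below is a sketch under an exchangeability assumption, not the explicit expression from the talk: f_{ij} is the sharing between individuals i and j, σ² is the pairwise variance, and c is the covariance between the sharing of one individual with two different partners (the symbols σ² and c are introduced here for illustration).

```latex
\bar{f}_i = \frac{1}{n-1}\sum_{j \neq i} f_{ij},
\qquad
\operatorname{Var}\!\left[\bar{f}_i\right]
  = \frac{\sigma^2}{n-1} + \frac{n-2}{n-1}\, c,
\qquad
\sigma^2 = \operatorname{Var}[f_{ij}],\quad
c = \operatorname{Cov}[f_{ij}, f_{ik}] \;\; (j \neq k).
```

The first term decays as 1/n, while the second tends to c, so any positive covariance between an individual's sharing with different partners leaves a constant floor on the variance.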
Imputation by IBD • Our results can be used to calculate the expected imputation power when sequencing a subset of a cohort. • Assume a cohort of size n, ns of which are sequenced. • Random selection of individuals: [equation] • Selection of highest-sharing individuals: [equation] • where [equation]
Increase in association power • The imputed genomes can be thought of as increasing the effective number of sequences. • A simple model (Shen et al., Bioinformatics, 2011): • Variant appears in cases only. • Carrier frequency in cases equals β. • Dominant effect. • Association is detected if the P-value is below a threshold. • For a fixed budget, there is a trade-off in the number of cases and controls to sequence.
Siblings • Siblings share, on average, 50% of their genomes. • What is the variance? • A classic problem (Visscher et al., PLoS Genet., 2006). • They used the variance to estimate heritability from sibling studies. • Genome-wide SD 5.5%. • But what if the parents are inbred? • Assume shared segments are either from the parents or are more remote.
Estimator of population size • Given one genome, estimate the population size N. • Calculate the total sharing fT. We know that [equation] • Invert to suggest an estimator: [equation] • Not very useful: the estimator is biased [equation] • and has SD [equation] • Compared to [equation] for Watterson’s estimator (based on the number of het sites).
Ashkenazi Jews • In recent years, shown to be a genetically distinct group. • Close to Middle Easterners and Europeans (particularly Italians and Adygei) (Atzmon et al., Am. J. Hum. Genet., 2010). • Very large amounts of IBD (Gusev et al., Mol. Biol. Evol., 2011), likely due to a recent, severe bottleneck.
IBD in Ashkenazi Jews • 2,600 Ashkenazi Jews, 1M SNP array (Guha et al., Genome Biol., 2012). • Use GERMLINE to detect IBD segments. • Compare the total sharing to simulations of the inferred demography, based on mean IBD in different length ranges (Palamara et al., AJHG, 2012). [Figure: excess of ‘hyper sharing’ in AJ]
Admixture in AJ • Most plausible explanation: admixture. • After correcting for it by considering only individuals with close to median AJ ancestry, most of the unexplained variance disappears. • The AJ component was calculated in comparison to CEU.
Summary • We calculated the distribution of the total IBD sharing in the Wright-Fisher model using renewal theory. • We obtained explicit expressions for the variance of both the pairwise sharing and the cohort-averaged sharing. • We calculated the expected gain in imputation and association power if individuals at the tail of the cohort-averaged sharing distribution are selected for sequencing. • The variance/distribution of IBD has many applications, some of which we presented, some are left for future work. • In the AJ population, individuals differ in cohort-averaged sharing by up to 30%. Admixture explains some of the variance.
The end • Thanks to: • Pier Francesco Palamara • Vladimir Vacic • Itsik Pe’er • Todd Lencz, Ariel Darvasi (for AJ genotypes) • Human Frontier Science Program Cross-Disciplinary Fellowship.
Identity-by-descent (IBD) • [Figure: founder chromosomes and contemporary chromosomes.]
Mean IBD (Palamara et al.) • See (Palamara et al., AJHG, 2012). • Assume shared segments must have length at least m. • Define I(s): the indicator, with probability π, that site s is in a shared segment between two given chromosomes. • Define fT: the fraction of the chromosome found in shared segments, or the total sharing. • Given g, the number of generations to the MRCA: [equation] • In the coalescent, g→Nt: [equation] • Then, <fT>=π.
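The steps on this slide can be filled in with a standard calculation. The block below is a reconstruction under stated assumptions rather than the slide's own equations: m is measured in Morgans, recombination over the 2g meioses separating the two chromosomes is a Poisson process, so the distance from the site to the segment boundary on each side is exponential with rate 2g, and the scaled coalescence time t is exponential with rate 1.

```latex
% Segment length around the site: sum of two Exp(2g) distances, i.e. Gamma(2, 2g)
P(\ell \ge m \mid g) = (1 + 2mg)\, e^{-2mg}

% Coalescent limit, g \to Nt with t \sim \mathrm{Exp}(1)
\pi = \int_0^\infty e^{-t}\,(1 + 2mNt)\, e^{-2mNt}\, dt
    = \frac{1}{1 + 2mN} + \frac{2mN}{(1 + 2mN)^2}
    = \frac{1 + 4mN}{(1 + 2mN)^2}

% Averaging the indicator over sites
\langle f_T \rangle = \pi
```

Under these assumptions the mean sharing depends on the population size and the cutoff only through the product mN.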
The variance of the total sharing (1) • The variance requires calculating two-site probabilities. • Idea: • For one site, the PDF of the coalescence time is Φ(t)~Exp(1). • For two sites, calculate the joint PDF Φ(t1,t2). • Φ(t1,t2) takes into account the interaction between the sites. • Given t1, t2, calculate π2 as if the sites are independent.
The variance of the total sharing (2) • Express π2 in terms of the Laplace transform of Φ(t1,t2). • π2 = [equation] • Use the coalescent with recombination to find [equation], where A-E are defined in terms of q1, q2, and the scaled recombination rate ρ.
IBD in AJ • Are ‘hyper-sharing’ individuals sharing more with everyone else, or just with other ‘hyper-sharing’ individuals? • Each curve represents the average of 1/7 of the individuals, in order of their cohort-averaged sharing. [Figure: curves ordered from highest sharing to lowest sharing]