Lecture 10: Linkage Analysis III

Lecture 10: Linkage Analysis III Date: 9/26/02 • Revisit segregation ratio distortion. • Haplotype coding. • Three-point analysis. • Multipoint analysis.

Additive Segregation Ratio Distortion • Systematic genotype classification error occurs. • Power and estimates of recombination fraction are unaffected by additive distortion in the backcross configuration. • Estimates of recombination fraction are not affected for F2, but the false positive rate increases.

Additive Segregation - Backcross • Suppose the frequency of genotype Aa is increased because a fraction u of aa genotypes are misclassified. • Similarly, assume the frequency of genotype Bb is independently increased by fraction v. • We need to recalculate the expected frequencies under the new model with additional parameters u and v.

Additive Segregation – Backcross (contd)

Additive Segregation – Backcross (contd) • The number of unknown parameters equals the number of degrees of freedom. • Use Bailey’s method to find the MLEs of the parameters (θ, u, v).

Bailey’s Method • Set the expected frequencies equal to the observed proportions and solve the system of equations for the unknown parameters. These are the MLEs. • Example: Suppose you observe 5 successes from a Binomial(10, p) distribution. Then the MLE of p is 5/10 = 0.5.
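The binomial example can be checked numerically. The sketch below is only an illustration: it equates the expected proportion to the observed one, then confirms by a grid search that the same value maximizes the log-likelihood. The function name and the grid resolution are assumptions made for this example.

```python
import math

def binomial_log_likelihood(p, successes, trials):
    """Log-likelihood of p for a binomial sample (constant term dropped)."""
    return successes * math.log(p) + (trials - successes) * math.log(1.0 - p)

successes, trials = 5, 10

# Bailey's method: set the expected proportion p equal to the observed one.
p_bailey = successes / trials

# Cross-check: maximize the log-likelihood over a fine grid of p values.
grid = [i / 1000 for i in range(1, 1000)]
p_grid = max(grid, key=lambda p: binomial_log_likelihood(p, successes, trials))

print(p_bailey, p_grid)
```

Equating expected and observed frequencies gives the MLE directly here because the model has as many free parameters as degrees of freedom, which is the situation described for (u, v) and the recombination fraction.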

Additive Segregation – Backcross (contd) • What do you notice about the MLE for the recombination fraction? • Is the MLE for the recombination fraction biased?

Additive Segregation – F2-CC

Penetrance Distortion - Backcross • Selection, penetrance, and linkage to selected markers can all result in penetrance distortion, so it is quite common. • Suppose (100 × u)% of the genotype aa is misclassified as Aa. Similarly, and independently, assume that (100 × v)% of bb is misclassified as Bb.

Penetrance Distortion - Backcross

Penetrance Distortion - Backcross • Is the estimate of the recombination fraction biased? • The power to detect linkage is decreased.

Cost of Assuming Non-Distortion Model • The estimate of the recombination fraction is biased. By how much?

Overall Impact of Segregation Distortion

First Project • This slide marks the end of the material that will be needed to complete the first project.

Linkage Analysis for Multiple Loci • The haplotype is the sequence of alleles along one of the chromosomes in an individual. • In multipoint linkage analysis we are not concerned with the allele at each locus, but rather with its parental origin.

Recoding Haplotypes • Suppose there are k loci. Recode each haplotype as a string of k-1 0’s and 1’s. • If the ith position is 0, the (i+1)th locus is not recombinant with respect to the ith locus. • If the ith position is 1, the (i+1)th locus is recombinant with respect to the ith locus.
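A minimal sketch of this recoding, assuming each haplotype is supplied as a string of parental-origin labels with one character per locus. The labels "M" and "P" and the function name are hypothetical choices for illustration.

```python
def recode_haplotype(origins):
    """Recode a haplotype given as parental origins into k-1 recombination flags.

    `origins` has one entry per locus (for example "MPP"), where each entry
    labels the parental chromosome the allele came from. Position i of the
    result is 1 if locus i+1 is recombinant with respect to locus i, else 0.
    """
    return "".join(
        "1" if left != right else "0"
        for left, right in zip(origins, origins[1:])
    )

print(recode_haplotype("MMM"))  # 00
print(recode_haplotype("MPP"))  # 10
print(recode_haplotype("MMP"))  # 01
print(recode_haplotype("MPM"))  # 11
```

Note that "MPP" and "PMM" map to the same code, which is why only parental origin changes between adjacent loci matter and not the alleles themselves.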

Recoding Haplotypes (contd)

Recoded Haplotypes and Recombination Fractions

Sample Problem • Calculate the probabilities of the four haplotype classes (i.e. g00, g10, g01, g11) when θAB = 0.1, θBC = 0.2, and θAC is unknown. Assume the Sturt map function with L = 1.

Plan of Attack • Transform the recombination fractions to genetic map units using the inverse map function. • Sum the genetic map units to obtain the length of the AC interval. • Calculate the recombination fraction between A and C using the map function. • Solve the set of simultaneous equations for the haplotype frequencies.

Step 1

Step 2

Step 3

Step 4
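The four steps can be sketched numerically for the sample problem (recombination fractions 0.1 for AB and 0.2 for BC). This is an illustration under a stated assumption: the Sturt map function is taken to have the form theta = 0.5 * (1 - (1 - m/L) * exp(-m(2L - 1)/L)), which should be checked against the course text. That form has no closed-form inverse, so the sketch inverts it by bisection, and all function names are hypothetical.

```python
import math

def sturt_theta(m, L=1.0):
    """Assumed Sturt form: map distance m (Morgans) -> recombination fraction."""
    return 0.5 * (1.0 - (1.0 - m / L) * math.exp(-m * (2.0 * L - 1.0) / L))

def sturt_inverse(theta, L=1.0, tol=1e-12):
    """Invert the map function by bisection on 0 <= m <= L.

    Relies on theta increasing in m over this range, which holds for L = 1.
    """
    lo, hi = 0.0, L
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if sturt_theta(mid, L) < theta:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

theta_ab, theta_bc = 0.1, 0.2

# Step 1: recombination fractions -> map distances (Morgans).
m_ab = sturt_inverse(theta_ab)
m_bc = sturt_inverse(theta_bc)

# Step 2: map distances are additive.
m_ac = m_ab + m_bc

# Step 3: map distance -> recombination fraction for the AC interval.
theta_ac = sturt_theta(m_ac)

# Step 4: solve theta_AB = g10 + g11, theta_BC = g01 + g11,
# theta_AC = g10 + g01, and g00 + g10 + g01 + g11 = 1.
g11 = (theta_ab + theta_bc - theta_ac) / 2.0
g10 = (theta_ab - theta_bc + theta_ac) / 2.0
g01 = (-theta_ab + theta_bc + theta_ac) / 2.0
g00 = 1.0 - g10 - g01 - g11

print(m_ab, m_bc, theta_ac)
print(g00, g10, g01, g11)
```

Only step 4 is independent of the map function: swapping in a different function changes the AC recombination fraction and therefore the double-recombinant class g11.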

Phase Known Three Point Analysis • When all gametes in the sample are fully informative, the likelihood is simple. • How would you test for interference?
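One possible approach to the interference question, sketched here as an assumption rather than as the lecture's own answer: with phase-known counts of the four recoded classes, compare the multinomial fit with free class frequencies against the fit in which recombination in the two intervals is independent (no interference). The likelihood ratio statistic then has 1 degree of freedom. The counts below are hypothetical.

```python
import math

# Hypothetical gamete counts for the recoded classes 00, 10, 01, 11.
counts = {"00": 710, "10": 90, "01": 190, "11": 10}
n = sum(counts.values())

# MLEs of the recombination fractions are the observed marginal proportions.
theta_ab = (counts["10"] + counts["11"]) / n
theta_bc = (counts["01"] + counts["11"]) / n

# Coefficient of coincidence: observed double recombinants relative to the
# proportion expected if the two intervals recombined independently.
coincidence = (counts["11"] / n) / (theta_ab * theta_bc)

# Expected counts under no interference (independent intervals).
expected = {
    "00": n * (1 - theta_ab) * (1 - theta_bc),
    "10": n * theta_ab * (1 - theta_bc),
    "01": n * (1 - theta_ab) * theta_bc,
    "11": n * theta_ab * theta_bc,
}

# Likelihood ratio statistic, compared with chi-square on 1 degree of freedom.
g_stat = 2.0 * sum(
    c * math.log(c / expected[key]) for key, c in counts.items() if c > 0
)
# Upper tail of chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(g_stat / 2.0))

print(theta_ab, theta_bc, coincidence, g_stat, p_value)
```

A coincidence below 1 suggests positive interference (fewer double recombinants than independence predicts); the statistic indicates whether the departure is larger than sampling noise would explain.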

Multipoint Analysis – A Difficulty • Suppose there are k loci. • How many haplotypes are possible? • How many recombination fractions are there?

Recombination Value • Definition: The recombination value of a set of intervals is the probability of an odd number of crossovers occurring in the intervals. • How many sets of intervals are there?

Sample Problem – Four Point Analysis • Suppose loci A, B, C, and D are in syntenic order and θAB = 0.1, θBC = 0.2, and θCD = 0.3. • What are the probabilities of the haplotype classes given the Kosambi map function?

The Linear Equations

Multipoint Likelihood • Can be written in terms of the 2^(k-1) - 1 recombination values or haplotype frequencies. • Can be reparameterized as k-1 recombination fractions and 2^(k-1) - k interference parameters. • Then tests for interference are possible. • An alternative is to assume a map function, possibly with unknown parameters, which constrains the gamete probabilities as functions of the k-1 recombination fractions.

Multilocus-Infeasible Map Functions • Kosambi, Carter-Falconer, and Felsenstein map functions are multilocus-infeasible because they can produce negative gametic frequencies. • The Morgan, Haldane, Sturt and generalized map functions are multilocus-feasible. • Haldane is most often used for its simplicity except when linkage is tight, e.g. m << 0.5.
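For reference, the two most familiar of these map functions can be written in a few lines. The sketch below uses the standard Haldane and Kosambi forms with distance in Morgans and compares the recombination fraction they imply across two adjacent intervals with recombination fractions 0.1 and 0.2. The function names are assumptions made for this example.

```python
import math

def haldane_theta(m):
    """Haldane: map distance m (Morgans) -> recombination fraction."""
    return 0.5 * (1.0 - math.exp(-2.0 * m))

def haldane_distance(theta):
    """Inverse Haldane: recombination fraction -> map distance (Morgans)."""
    return -0.5 * math.log(1.0 - 2.0 * theta)

def kosambi_theta(m):
    """Kosambi: map distance m (Morgans) -> recombination fraction."""
    return 0.5 * math.tanh(2.0 * m)

def kosambi_distance(theta):
    """Inverse Kosambi: recombination fraction -> map distance (Morgans)."""
    return 0.25 * math.log((1.0 + 2.0 * theta) / (1.0 - 2.0 * theta))

theta_ab, theta_bc = 0.1, 0.2

# Convert to distances, add them, and convert back for the outer interval.
theta_ac_haldane = haldane_theta(
    haldane_distance(theta_ab) + haldane_distance(theta_bc)
)
theta_ac_kosambi = kosambi_theta(
    kosambi_distance(theta_ab) + kosambi_distance(theta_bc)
)

print(theta_ac_haldane)  # no interference: 0.1 + 0.2 - 2 * 0.1 * 0.2
print(theta_ac_kosambi)  # Kosambi addition: (0.1 + 0.2) / (1 + 4 * 0.1 * 0.2)
```

The Kosambi value is larger because that function builds in positive interference, leaving fewer double recombinants than Haldane's independence assumption.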

Map Building • How many possible orders are there for k loci? • 10 loci can be ordered in over 1 million ways. • The solution is to generate a small number of probable orders and then analyze these few in depth.
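The count can be checked directly, assuming an order and its reverse are treated as the same map, so that k loci have k!/2 distinct orders. The function name is a hypothetical choice.

```python
import math

def number_of_orders(k):
    """Distinct orders of k loci, counting an order and its reverse once."""
    return math.factorial(k) // 2

for k in (3, 5, 10):
    print(k, number_of_orders(k))
```

For k = 10 this gives 1,814,400 orders, consistent with the "over 1 million" figure.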

Stepwise Approximate Ordering • Use likelihood analysis to order a few markers, say l. • Add each additional marker one at a time by considering all l+1 positions for it. Choose the location that results in the highest likelihood. • Number of likelihood evaluations, starting from l = 3: 3 + 4 + 5 + ... + k = (k-2)(k+3)/2.
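A sketch of the stepwise procedure, under clearly labeled assumptions: the `score` argument stands in for the multilocus log-likelihood, and the toy score used here (negative sum of adjacent map distances, computed from hypothetical known positions) is a substitute chosen only so the example runs without genotype data.

```python
from itertools import permutations

def stepwise_order(loci, score, n_start=3):
    """Greedy ordering: exhaustively order n_start loci, then insert the rest.

    `score(order)` is user supplied, larger is better (a stand-in for the
    multilocus log-likelihood). Returns the order and the evaluation count.
    """
    evaluations = 0
    # An order and its reverse are equivalent, so keep one of each pair.
    starts = [p for p in permutations(loci[:n_start]) if p[0] < p[-1]]
    best_order, best_score = None, float("-inf")
    for cand in starts:
        evaluations += 1
        s = score(cand)
        if s > best_score:
            best_order, best_score = list(cand), s
    # Insert each remaining marker at the best of the len(order) + 1 positions.
    for marker in loci[n_start:]:
        best_trial, best_score = None, float("-inf")
        for pos in range(len(best_order) + 1):
            trial = best_order[:pos] + [marker] + best_order[pos:]
            evaluations += 1
            s = score(trial)
            if s > best_score:
                best_trial, best_score = trial, s
        best_order = best_trial
    return best_order, evaluations

# Hypothetical marker positions in cM, used only by the toy score.
positions_cm = {"A": 0, "B": 10, "C": 25, "D": 40, "E": 60, "F": 75}

def toy_score(order):
    """Toy stand-in for a log-likelihood: shorter total map length is better."""
    return -sum(abs(positions_cm[a] - positions_cm[b])
                for a, b in zip(order, order[1:]))

order, evaluations = stepwise_order(["D", "A", "F", "C", "B", "E"], toy_score)
print(order, evaluations)
```

With k = 6 markers the procedure makes 3 + 4 + 5 + 6 = 18 evaluations, matching (k-2)(k+3)/2, compared with 360 for an exhaustive search over all orders.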

Pairwise Approximate Ordering • Two point linkage analysis on all pairs of loci to obtain a recombination fraction estimate. • Multidimensional scaling analyses (multivariate exploratory analysis) to find approximate orders.

Final Step – Perfecting Order • Test the likelihood of various reorderings of neighboring groups of loci. • If a tested order has a higher likelihood, keep it. • etc...

Disease Mapping • Condition on an ordering of all markers except the disease locus. • Calculate a multilocus log-likelihood for each possible position x of the disease locus; call this l_x. • Calculate the location score 2(l_x - l_∞) at point x, where l_∞ is the log-likelihood with the disease locus unlinked to the other markers.

Disease Mapping • Can also calculate multipoint LOD scores by dividing location scores by 2ln(10). • Plot the location score or multipoint LOD score against position x. The peak is the likely position of the disease locus, and if the peak exceeds some cutoff criterion, linkage to that region is significant.
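The location score and its conversion to a multipoint LOD score can be sketched as follows. The log-likelihood values and map positions are hypothetical numbers invented for illustration, and the function names are assumptions.

```python
import math

def location_score(loglik_at_x, loglik_unlinked):
    """Twice the natural-log likelihood difference between position x and unlinked."""
    return 2.0 * (loglik_at_x - loglik_unlinked)

def multipoint_lod(loc_score):
    """Convert a location score to a base-10 LOD score."""
    return loc_score / (2.0 * math.log(10.0))

# Hypothetical natural-log likelihoods at candidate positions (cM).
logliks = {0: -104.2, 5: -101.0, 10: -98.3, 15: -99.9, 20: -103.1}
loglik_unlinked = -105.2

lods = {
    x: multipoint_lod(location_score(ll, loglik_unlinked))
    for x, ll in logliks.items()
}
peak = max(lods, key=lods.get)

for x in sorted(lods):
    print(x, round(lods[x], 3))
print("peak at", peak, "cM")
```

Dividing by 2ln(10) simply undoes the factor of 2 and changes the logarithm from base e to base 10, so the peak position is the same on either scale.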

Multipoint vs. Single Point Disease Mapping • Multipoint analysis uses information from every sampled individual, even those who are homozygous at a single marker. • A single marker can only provide information about crossovers on one side of the disease gene. • The more markers, the sharper the peak. • The disease gene is ultimately mapped to the smallest interval in which there is no observed crossover between marker and disease gene in the entire sample.

Sample Size • Assuming no interference, crossovers occur along the chromosome at a mean rate of 1 per Morgan, so the distance to the next crossover is exponentially distributed. • With n sampled individuals, the pooled mean rate is n per Morgan. • Therefore, the expected distance to the nearest crossover on either side of the disease locus is 1/n. • The length of the interval containing the disease gene follows a gamma distribution with mean 2/n. • Example: You want to localize the disease gene to 1 cM = 1/100 M. Setting 2/n ≤ 1/100 shows that you need n ≥ 200.
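The sample size argument can be checked with a short calculation and a small simulation. This is a sketch under the slide's own assumptions (no interference, one informative meiosis per individual); the function names, number of trials, and seed are arbitrary choices made for this example.

```python
import math
import random

def required_sample_size(target_cm):
    """Smallest n whose expected interval length 2/n Morgans is <= target_cm."""
    # 2/n Morgans = 200/n cM, so n >= 200 / target_cm.
    return math.ceil(200.0 / target_cm)

def simulate_mean_interval(n, trials=20000, seed=0):
    """Average length (Morgans) of the crossover-free interval around the locus."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Nearest crossover on each side among n meioses: the minimum of n
        # exponential(1) distances, which is exponential with rate n.
        left = rng.expovariate(n)
        right = rng.expovariate(n)
        total += left + right
    return total / trials

n = required_sample_size(1.0)
mean_interval = simulate_mean_interval(n)
print(n, mean_interval)
```

The simulated mean should be close to 2/n = 0.01 Morgans. Because this is an expected value, any single study can end up with a noticeably wider or narrower interval.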

Summary • Modeling of segregation distortion and its impact on linkage analysis. • Haplotype coding. • The use of map functions. • Overview of likelihood formulation for multipoint analysis.
