Yufeng Wu UC Davis RECOMB 2007

Association Mapping of Complex Diseases with Ancestral Recombination Graphs: Models and Efficient Algorithms

Association Mapping of Diseases • Input: cases and controls typed at SNPs (alleles coded 0/1) • Diploid: two sequences per individual • Problem: Where are the (unobserved) disease mutations? • This talk: a genealogy-based approach

Genealogy: Evolutionary History of Genomic Sequences • Tells how individuals in a population are related • Helps to explain diseases: disease mutations occur on branches, and all descendants carry the mutations • Problem: How to determine the genealogy for “unrelated” individuals? • Not easy with recombination [Figure: genealogy with a disease mutation on one branch; the leaves are individuals in the current population, marked diseased (case) or healthy (control)]

Recombination • One of the principal genetic forces shaping sequence variation within species • Two equal-length sequences generate a third, new sequence of the same length in the genealogy • Example: 110001111111001 and 000110000001111 recombine at a breakpoint, joining the prefix 11000 of the first to the suffix 0000001111 of the second
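A minimal sketch of the single-breakpoint recombination on this slide. The function name `recombine` and the breakpoint position 5 are illustrative assumptions; the position is read off the 5-character prefix 11000 shown in the figure.

```python
def recombine(prefix_parent: str, suffix_parent: str, cut: int) -> str:
    """Join the first `cut` sites of one parent to the remaining sites of the other."""
    if len(prefix_parent) != len(suffix_parent):
        raise ValueError("parents must have equal length")
    if not 0 <= cut <= len(prefix_parent):
        raise ValueError("breakpoint out of range")
    return prefix_parent[:cut] + suffix_parent[cut:]


# The two parent sequences from the slide; cut=5 is inferred from the prefix 11000.
child = recombine("110001111111001", "000110000001111", 5)
print(child)  # 110000000001111
```

The child has the same length as its parents, which is the "equal length" property stated in the bullet.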

Ancestral Recombination Graph (ARG) • Events in the graph: mutations and recombinations • Assumption: at most one mutation per site • Sequence sets shown: S1 = 00, S2 = 01, S3 = 10, S4 = 11; and S1 = 00, S2 = 01, S3 = 10, S4 = 10 [Figure: ARG over two sites with nodes labeled 00, 01, 10, 11]
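One way to see why the assumption of at most one mutation per site matters: under it, a pair of sites that shows all four combinations 00, 01, 10, 11 cannot be explained by a tree alone, so at least one recombination is needed. This check is commonly called the four-gamete test; it is not named on the slide, and the function name below is hypothetical.

```python
from itertools import combinations


def incompatible_site_pairs(seqs):
    """Return the site pairs (i, j) at which all four gametes 00, 01, 10, 11 appear."""
    n_sites = len(seqs[0])
    pairs = []
    for i, j in combinations(range(n_sites), 2):
        gametes = {(s[i], s[j]) for s in seqs}
        if len(gametes) == 4:  # binary data: four distinct pairs means all four gametes
            pairs.append((i, j))
    return pairs


# The two sequence sets listed on the slide.
print(incompatible_site_pairs(["00", "01", "10", "11"]))  # [(0, 1)]
print(incompatible_site_pairs(["00", "01", "10", "10"]))  # []
```

With S4 = 11 the two sites are incompatible and a recombination is required; with S4 = 10 a tree with one mutation per site suffices.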

Mapping Disease Genes with Inferred Genealogy • “... the best information that we could possibly get about association is to know the full coalescent genealogy ...” – Zollner and Pritchard, 2005 • But we do not know the true ARG! • Goal: infer ARGs from sequences for association mapping • Not easy; approximation is often used (e.g., Zollner and Pritchard)

The ARG Approaches • First practical ARG association mapping method (Minichiello and Durbin, 2006): uses plausible ARGs (a heuristic) • My work: generate ARGs with a provable property, and work with a well-defined complex disease model • minARGs: most parsimonious ARGs, which use the minimum number of recombinations • Uniform sampling of minARGs: generate one minARG from the space of all minARGs with equal probability (sampling is a scheme often used in genealogy-based approaches)

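Counting minARGs by Dynamic Programming (This paper) • Assume only input sequences are generated. • Input matrix: 00000, 01000, 01100, 01101, 11100, 00010, 11011, 00011 • Choice 1: derive 11011 last (factor 1); the reduced matrix 00000, 01000, 01100, 01101, 11100, 00010, 00011 has N1 = 124 minARGs • Choice 2: derive 01101 last (factor 2); the reduced matrix 00000, 01000, 01100, 11100, 00010, 11011, 00011 has N2 = 32 minARGs • It turns out no other row choices contribute to the minARG space. • Recursion: N = 124*1 + 32*2 = 188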

Idea: Use the minARG counts to select the order in which sequences are generated. • Full matrix (188 minARGs): 00000, 01000, 01100, 01101, 11100, 00010, 11011, 00011 • Select 11011 with prob = 124/188 = 0.66, and 01101 with prob = 32*2/188 = 0.34 • 1. Draw a random value: Rnd = 0.3 < 0.66 • 2. Pick 11011 as the last row to derive • 3. Move to the reduced matrix (N1 = 124) and repeat • Can be easily extended to weighted sampling, e.g., generate less frequent sequences later.
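The selection step can be sketched as follows, using only the numbers on the slide. The function name `pick_last_row` and the dictionary layout are assumptions for illustration; the counts 124 and 32 and the factors 1 and 2 are taken from the slide as given, without deriving them.

```python
def pick_last_row(weights, rnd):
    """Choose the last row to derive, with probability proportional to its minARG count.

    weights: dict mapping a candidate row to the number of minARGs that derive it last.
    rnd: a uniform random value in [0, 1).
    """
    total = sum(weights.values())
    cumulative = 0
    for row, weight in weights.items():
        cumulative += weight
        if rnd < cumulative / total:
            return row
    return row  # guard against floating-point rounding when rnd is very close to 1


# Counts from the slide: 124*1 for row 11011 and 32*2 for row 01101.
weights = {"11011": 124 * 1, "01101": 32 * 2}
print(sum(weights.values()))        # 188
print(pick_last_row(weights, 0.3))  # 11011, because 0.3 < 124/188
```

Repeating this choice on each reduced matrix, with counts from the dynamic program, yields one minARG drawn uniformly from all 188.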

ARGs Represent a Set of Marginal Trees • Clear separation of cases/controls: NOT expected for complex diseases! [Figure: marginal tree with a possible disease mutation on one branch; leaves labeled case or control]

Realities of Mapping Complex Diseases • Multiple disease mutations! • Incomplete penetrance • Diploid: two sequences per individual • Trying to find one tree branch that clearly separates cases and controls may not work for complex diseases! • Solution: inference on a well-defined disease model. [Figure: cases and controls typed at SNPs, with two disease mutations labeled 1 and 2]

Complex Disease Model: How a Disease Affects a Population (Zollner & Pritchard, 2005) • A formal model of the complex disease is needed to assess the significance of a chosen marginal tree for real data. • Disease mutations: Poisson process • Two alleles: wild-type and mutant • Each branch carries a probability that disease mutations occur on it (computed from mutation rate and branch length) [Figure: tree with branch probabilities 0.02, 0.1, 0.05, 0.08, 0.03, 0.01, 0.06, 0.07]
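The slide does not give the formula for the branch probabilities. As a hedged illustration, for a Poisson process with a constant rate, the probability of at least one event on a branch is 1 - exp(-rate * length). The function name and the numeric inputs below are assumptions, and the model of Zollner and Pritchard may parameterize this differently.

```python
import math


def prob_mutation_on_branch(rate: float, length: float) -> float:
    """Probability of at least one Poisson event on a branch of the given length."""
    if rate < 0 or length < 0:
        raise ValueError("rate and length must be non-negative")
    return 1.0 - math.exp(-rate * length)


# Hypothetical values, not taken from the slide.
p = prob_mutation_on_branch(rate=0.1, length=0.5)
print(round(p, 4))
```

For small rate * length the result is close to rate * length itself, which is why short branches get small probabilities.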

Disease Penetrance (Zollner & Pritchard) • Subscripts: A = cAse, C = Control • PA,1: probability that a mutant sequence becomes a case; PC,1 = 1.0 - PA,1 • PA,0: probability that a wild-type sequence becomes a case; PC,0 = 1.0 - PA,0 • Example: PA,1 = 0.8, PC,1 = 0.2; PA,0 = 0.1, PC,0 = 0.9 [Figure: tree with branch probabilities 0.02, 0.1, 0.05, 0.08, 0.03, 0.01, 0.06, 0.07]

Phenotype Likelihood: How Likely Are Phenotypes Generated on a Marginal Tree? (Zollner and Pritchard) • The disease model specifies a probabilistic way of assigning phenotypes for a given tree. • But we have many trees: on which tree do the disease mutations occur? • Given a tree T and the case/control phenotypes of its leaves, what is the probability of observing these phenotypes on T? • High phenotype likelihood: disease mutations may occur in T • Computable in linear time and adopted in this work
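A linear-time computation of this kind can be sketched with a bottom-up pass over the tree. This is a sketch under simplifying assumptions that are mine, not necessarily those of Zollner and Pritchard: the root is wild-type, each branch independently turns wild-type into mutant with its branch probability, mutants never revert, and each leaf's phenotype is drawn from the haploid penetrances PA,0 = 0.1 and PA,1 = 0.8 shown earlier. The tree encoding (a leaf is the string "case" or "control"; an internal node is a list of (branch probability, child) pairs) is hypothetical.

```python
PA = {0: 0.1, 1: 0.8}  # probability that a wild-type (0) or mutant (1) sequence is a case


def partials(node):
    """Return (L0, L1): probability of the phenotypes below node, given its allele is 0 or 1."""
    if isinstance(node, str):  # leaf: "case" or "control"
        if node == "case":
            return PA[0], PA[1]
        return 1.0 - PA[0], 1.0 - PA[1]
    l0 = l1 = 1.0
    for p_mut, child in node:
        c0, c1 = partials(child)
        l0 *= (1.0 - p_mut) * c0 + p_mut * c1  # wild-type parent: child may mutate
        l1 *= c1                               # mutant parent: child stays mutant
    return l0, l1


def phenotype_likelihood(tree):
    """Likelihood of the leaf phenotypes, assuming a wild-type root."""
    return partials(tree)[0]


# Two leaves with hypothetical branch probabilities 0.1 and 0.05.
tree = [(0.1, "case"), (0.05, "control")]
print(phenotype_likelihood(tree))
```

Each node is visited once and returns both conditional values together, so the running time is linear in the number of nodes.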

This Paper: Expected Phenotype Likelihood • We need to assess the statistical significance of the computed phenotype likelihood. • Null model: randomly permute the case/control status of the leaves in the given tree. • P-value by permutation tests: computational bottleneck! • My result: an O(n^3) algorithm computing the expected value (and variance) of the phenotype likelihood under this null model. • Exact, fully deterministic method. • But computing the P-value precisely and efficiently remains open.
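For contrast with the exact method, the permutation test that the slide calls a bottleneck can be sketched as below. This is the generic Monte Carlo scheme, not the O(n^3) algorithm of the paper. The names `permutation_p_value` and `statistic` are hypothetical; `statistic` stands in for the phenotype likelihood on a fixed tree.

```python
import random


def permutation_p_value(statistic, labels, n_perm=1000, seed=0):
    """Estimate P(statistic under shuffled labels >= observed statistic)."""
    rng = random.Random(seed)
    observed = statistic(labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # null model: permute case/control status of the leaves
        if statistic(shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction keeps the estimate above zero


# Toy statistic (an assumption): number of cases among the first three leaves.
labels = ["case"] * 3 + ["control"] * 3
p_value = permutation_p_value(lambda ls: ls[:3].count("case"), labels)
print(p_value)
```

Every permutation re-evaluates the statistic, so the cost grows with the number of permutations times the cost of one likelihood computation, repeated for every tree and position scanned.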

This Paper: Diploid Penetrance Is Hard • Diploid (e.g., humans): two sequences per individual • Diploid penetrance: PA,00: probability that an individual with two wild-type sequences becomes a case; PA,01: probability that an individual with one wild-type and one mutant sequence becomes a case; PA,11: … • Efficient computation of the phenotype likelihood: stated but unresolved in Zollner and Pritchard • My result: computing the phenotype likelihood with diploid penetrance is NP-hard

Simulation Results • Average mapping error for 50 simulated datasets from Zollner and Pritchard • Average over 50 genealogies • Date: January 2007 • Comparison: TMARG (my program), LATAG (Zollner and Pritchard), MARGARITA (Minichiello and Durbin) • TMARG and MARGARITA are much faster (20 times or more) than LATAG, which is important for whole-genome scans.

Acknowledgement • Software available at: http://wwwcsif.cs.ucdavis.edu/~wuyu • I want to thank • Dan Gusfield • Dan Brown • Chuck Langley • Yun S. Song
