
Yufeng Wu Dept. of Computer Science and Engineering University of Connecticut

Inference of Complex Genealogical Histories in Populations and Application in Mapping Complex Traits.


Presentation Transcript


  1. Inference of Complex Genealogical Histories in Populations and Application in Mapping Complex Traits. Yufeng Wu, Dept. of Computer Science and Engineering, University of Connecticut. DIMACS 2008.

  2. Genealogy: Evolutionary History of Genomic Sequences
     • Tells how sequences in a population are related.
     • Helps to explain diseases: a disease mutation occurs on a branch, and all descendants of that branch carry the mutation.
     • The genealogy is unknown; only SNP haplotypes (binary sequences) are observed.
     • Problem: infer the genealogy of “unrelated” haplotypes.
     • Not easy, partly due to recombination.
     [Figure labels: Disease mutation; Diseased (case); Healthy (control); Sequences in current population]

  3. Recombination
     • One of the principal genetic forces shaping sequence variation within species.
     • Two equal-length sequences generate a third, new sequence of the same length in the genealogy.
     • Spatial order is important: different parts of the genome are inherited from different ancestors.
     [Figure labels: Prefix; Suffix; Breakpoint. Sequences shown: 110001111111001, 1100 00000001111, 000110000001111]
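
The prefix/suffix picture on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `recombine` and the parameter `cut` are hypothetical, and the breakpoint position 5 is inferred from the three sequences shown on the slide rather than stated there.

```python
def recombine(prefix_parent: str, suffix_parent: str, cut: int) -> str:
    """Build a recombinant: the first `cut` sites come from prefix_parent,
    the remaining sites come from suffix_parent (hypothetical helper)."""
    if len(prefix_parent) != len(suffix_parent):
        raise ValueError("parents must have equal length")
    if not 0 <= cut <= len(prefix_parent):
        raise ValueError("breakpoint out of range")
    return prefix_parent[:cut] + suffix_parent[cut:]


# Sequences from the slide; cut=5 is an inferred value, not given in the source.
child = recombine("110001111111001", "000110000001111", 5)
print(child)
```

The recombinant has the same length as its parents, and the site order decides which ancestor each site is inherited from.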

  4. Ancestral Recombination Graph (ARG)
     • Assumption: at most one mutation per site.
     [Figure labels: Mutations; Recombination. Sequence sets shown: S1 = 00, S2 = 01, S3 = 10, S4 = 11 and S1 = 00, S2 = 01, S3 = 10, S4 = 10]

  5. What is the Use of an ARG?
     • One may look at the ARG directly.
     • For noisy data, there is another way of using ARGs: an ARG represents a set of local trees!
     • Local trees: evolutionary histories for different genomic regions between recombination breakpoints.
     [Figure labels: Data; Local tree near site 3. Sequences shown include 0000, 0101, 0110, 1110, 1010, 0100, 0010]

  6. At Which Local Tree Did Disease Mutations Occur?
     • A clear separation of cases and controls is not expected for complex diseases.
     [Figure labels: Possible disease mutation; Case; Control]

  7. How to Infer ARGs?
     • But we do not know the true ARG!
     • Goal: infer ARGs from haplotypes.
     • First practical ARG association mapping method (Minichiello and Durbin, 2006):
       • Uses plausible ARGs (heuristic).
       • Less complex disease model: implicitly assumes one disease mutation with major effects.
     • My results (Wu, RECOMB 2007):
       • Generates ARGs with a provable property, and works on a well-defined complex disease model.
       • Focuses on parsimonious histories.

  8. Simulation Results (Wu, 2007)
     • TMARG/MARGARITA: sample ARGs, decompose them into local trees, and look for association signals.
     • LATAG: infers local trees at focal points.
     • Average mapping error for 50 simulated datasets from Zollner and Pritchard.
     • Comparison: TMARG (minARGs), TMARG (near minARGs), LATAG (Z. P.), MARGARITA (M. D.).
     • TMARG (my program) and MARGARITA are much faster than LATAG.

  9. Preliminary Results: GAW16 Data
     • GAW16 data from the North American Rheumatoid Arthritis Consortium (NARAC): 868 cases and 1194 controls. Chromosome 1: 40929 SNPs.
     • Running TMARG on large-scale data:
       • Break the data into non-overlapping windows.
       • Run fastPHASE (Scheet and Stephens, 2006) to obtain haplotypes.
       • Run TMARG in chi-square mode.
     • Caution: more investigation needed.
     [Figure annotation: SNP rs2476601, reported in Begovich et al., 2004 and Carlton et al., 2005?]
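
The slide does not say how the chi-square mode scores a local tree. As a rough illustration only, and not TMARG's actual procedure, the sketch below computes the standard Pearson chi-square statistic for the 2x2 table obtained when one tree branch splits the samples into "inside the clade" and "outside the clade". The names `chi_square_2x2` and `clade_statistic` and the sample labels are hypothetical.

```python
def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:  # an empty row or column carries no association signal
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def clade_statistic(clade: set, cases: set, samples: set) -> float:
    """Score how strongly membership in `clade` tracks case status."""
    a = len(clade & cases)            # cases inside the clade
    b = len(clade - cases)            # controls inside the clade
    c = len(cases - clade)            # cases outside the clade
    d = len(samples - clade - cases)  # controls outside the clade
    return chi_square_2x2(a, b, c, d)


# Toy data (assumed for illustration): 4 cases, 4 controls.
samples = {f"s{i}" for i in range(1, 9)}
cases = {"s1", "s2", "s3", "s4"}
score = clade_statistic({"s1", "s2", "s3", "s5"}, cases, samples)
print(score)
```

A clade that separates cases from controls perfectly gives the largest value for a table of this size, which, as slide 6 notes, is not expected for complex diseases.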

  10. A Related Problem: Inference of Local Tree Topologies Directly (Wu, 2008, submitted)

  11. Inference of Local Tree Topologies
      • Recall: an ARG represents a set of local trees.
      • Question: given SNP haplotypes, infer local tree topologies (one tree for each SNP site, ignoring branch lengths).
      • Hein (1990, 1993)
      • Song and Hein (2003, 2005): enumerate all possible tree topologies at each site
      • Parsimony-based

  12. Local Tree Topologies
      • Key technical difficulty: enumerating all tree topologies.
      • Brute-force enumeration of local tree topologies is not feasible when the number of sequences exceeds 9.
      • Trivial solution: for each SNP, create a tree containing the single split induced by that SNP.
        • Always correct (assuming one mutation per site).
        • But not very informative: more refined trees are needed!
      [Figure: SNP alleles A: 0, B: 0, C: 1, D: 0, E: 1, F: 0, G: 1, H: 0; leaf order shown: A C B E D F G H]
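
To make the "single split induced by the SNP" concrete, here is a minimal sketch using the allele values from this slide. The function name `snp_split` is hypothetical, and representing a split as a pair of frozensets is an assumption made for illustration.

```python
def snp_split(alleles: dict) -> tuple:
    """Partition sequence names by their allele (0 or 1) at one SNP site."""
    zeros = frozenset(name for name, allele in alleles.items() if allele == 0)
    ones = frozenset(name for name, allele in alleles.items() if allele == 1)
    return zeros, ones


# Allele values copied from the slide.
site = {"A": 0, "B": 0, "C": 1, "D": 0, "E": 1, "F": 0, "G": 1, "H": 0}
zeros, ones = snp_split(site)
print(sorted(zeros), sorted(ones))
```

The trivial tree has exactly one internal edge, separating C, E, and G from the other five sequences. It says nothing about how the sequences within each side are related, which is why more refined trees are needed.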

  13. How to Do Better? Neighboring Local Trees are Similar!
      • Nearby SNP sites provide hints!
      • Nearby local trees are often topologically similar.
      • Recombination often alters only small parts of the trees.
      • Key idea: reconstruct local trees by combining information from multiple nearby SNPs.

  14. RENT: REfining Neighboring Trees
      • Maintain for each SNP site a (possibly non-binary) tree topology.
      • Initialize each tree to one containing the split induced by the SNP.
      • Gradually refine the trees by adding new splits.
        • Splits are found by a set of rules (described later).
        • Splits added early may be more reliable.
      • Stop when the trees are binary or enough information has been recovered.

  15. A Little Background: Compatibility
      Matrix M (rows are sequences a to e, columns are sites 1 to 3):
             1 2 3
           a 0 0 0
           b 1 0 0
           c 0 0 1
           d 1 0 1
           e 0 1 1
      Sites 1 and 2 are compatible, but sites 1 and 3 are incompatible.
      • Two sites (columns) p and q are incompatible if columns p and q together contain all four ordered pairs (gametes): 00, 01, 10, 11. Otherwise, p and q are compatible.
      • This is easily extended to splits.
      • A split s is incompatible with a tree T if s is incompatible with any one split in T. Two trees are compatible if their splits are pairwise compatible.
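
The four-gamete condition defined on this slide can be checked directly. The sketch below encodes the matrix M row by row (a = 000, b = 100, c = 001, d = 101, e = 011) and uses 0-based column indices, so site 1 is column 0. The function names are hypothetical.

```python
def gametes(rows: list, p: int, q: int) -> set:
    """Distinct ordered pairs observed at columns p and q."""
    return {(row[p], row[q]) for row in rows}


def sites_compatible(rows: list, p: int, q: int) -> bool:
    """Binary sites are incompatible exactly when all four gametes appear."""
    return len(gametes(rows, p, q)) < 4


# Rows a..e of matrix M from the slide, one string per sequence.
M = ["000", "100", "001", "101", "011"]

print(sites_compatible(M, 0, 1))  # sites 1 and 2
print(sites_compatible(M, 0, 2))  # sites 1 and 3
```

For sites 1 and 3 the rows contribute the pairs 00, 10, 01, 11, 01, so all four gametes are present and the sites are incompatible, as the slide states.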

  16. Fully-Compatible Region: Simple Case
      • A region of consecutive SNP sites in which the SNPs are pairwise compatible.
      • May indicate that no topology-altering recombination occurred within the region.
      • Rule: for a site s in such a region, add the split of any SNP in the region to the tree at s.
      • Compatibility is a very strong property and is unlikely to arise by chance.
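
One way to find a fully-compatible region around a site is to grow an interval greedily, accepting a neighboring site only if it is compatible with every site already in the interval. This is a sketch under assumptions: the slide does not specify how regions are chosen, the greedy right-then-left order is arbitrary, and the names `compatible` and `compatible_region` are hypothetical. The columns are those of matrix M from slide 15.

```python
def compatible(col_p: str, col_q: str) -> bool:
    """Four-gamete test on two binary columns given as strings."""
    return len(set(zip(col_p, col_q))) < 4


def compatible_region(columns: list, s: int) -> tuple:
    """Return (lo, hi), an interval of pairwise-compatible sites containing s."""
    lo = hi = s
    grew = True
    while grew:
        grew = False
        # Try to extend one site to the right.
        if hi + 1 < len(columns) and all(
            compatible(columns[hi + 1], columns[k]) for k in range(lo, hi + 1)
        ):
            hi += 1
            grew = True
        # Try to extend one site to the left.
        if lo - 1 >= 0 and all(
            compatible(columns[lo - 1], columns[k]) for k in range(lo, hi + 1)
        ):
            lo -= 1
            grew = True
    return lo, hi


# Columns of M (sites 1..3), each listing sequences a..e top to bottom.
columns = ["01010", "00001", "00111"]
print(compatible_region(columns, 0), compatible_region(columns, 1))
```

Such a region need not be unique: for the middle site, both sites 1 to 2 and sites 2 to 3 are fully compatible, and this sketch returns the second only because it tries the right side first.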

  17. Split Propagation: More General Rule
      • Three consecutive sites 1, 2, and 3. Sites 1 and 2 are incompatible. Does site 3 matter for the tree at site 1?
      • The trees at sites 1 and 2 are different.
      • Suppose site 3 is compatible with both sites 1 and 2. Then?
      • Site 3 may indicate a subtree shared by the trees at sites 1 and 2.
      • Rule: a split propagates in both directions until it reaches an incompatible tree.
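
The propagation rule can be sketched with each local tree stored as a set of splits and each split stored as one side of its bipartition. This is an illustrative reading of the rule, not the RENT implementation: the names `splits_compatible`, `tree_accepts`, and `propagate` are hypothetical, and the example reuses the matrix from slide 15, taking the sequences that carry allele 1 at each site as the split.

```python
def splits_compatible(a: frozenset, b: frozenset, taxa: frozenset) -> bool:
    """Two splits are compatible if some pair of their sides is disjoint."""
    a_rest, b_rest = taxa - a, taxa - b
    return any(not (x & y) for x in (a, a_rest) for y in (b, b_rest))


def tree_accepts(tree: set, split: frozenset, taxa: frozenset) -> bool:
    """A split fits a tree if it is compatible with every split in the tree."""
    return all(splits_compatible(split, s, taxa) for s in tree)


def propagate(trees: list, origin: int, split: frozenset, taxa: frozenset) -> None:
    """Add `split` to neighboring trees, leftward and rightward from `origin`,
    stopping in each direction at the first incompatible tree."""
    for step in (-1, 1):
        i = origin + step
        while 0 <= i < len(trees) and tree_accepts(trees[i], split, taxa):
            trees[i].add(split)
            i += step


taxa = frozenset("abcde")
# Initial trees: one split per site, taken from the columns of M on slide 15.
trees = [{frozenset("bd")}, {frozenset("e")}, {frozenset("cde")}]
propagate(trees, 1, frozenset("e"), taxa)   # middle split reaches both neighbors
propagate(trees, 0, frozenset("bd"), taxa)  # stops at the third tree
print([sorted("".join(sorted(s)) for s in t) for t in trees])
```

In this toy data the middle site is the one compatible with both of its neighbors, so its split ends up in all three trees, while the first site's split is blocked by the incompatible split at the third site.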

  18. One Subtree-Prune-Regraft (SPR) Event
      • Recombination is simulated by an SPR.
      • The rest of the two trees (without the pruned subtrees) remains the same.
      • Rule: find a compatible subtree Ts in neighboring trees T1 and T2 such that the rest of T1 and T2 (with Ts removed) are compatible. Then jointly refine T1 - Ts and T2 - Ts before adding Ts back.
      • More complex rules are possible.
      [Figure label: Subtree to prune]

  19. Simulation
      • Hudson’s program MS (with known coalescent local tree topologies): 100 datasets for each setting.
      • Handles much larger data than Song and Hein’s method, and performs better than or similarly to it on small data.
      • Local tree topology recovery is scored by Song and Hein’s shared-split measure.
      [Figure panels: parameter values 15 and 50; the parameter symbol did not survive extraction]

  20. Acknowledgement
      • More information is available at: http://www.engr.uconn.edu/~ywu
      • I want to thank:
        • Dan Gusfield
        • Yun S. Song
        • Charles Langley
        • Dan Brown
      • And the National Science Foundation and the UConn Research Foundation
