New Multiple Sequence Alignment Methods - PowerPoint PPT Presentation

new multiple sequence alignment methods n.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
New Multiple Sequence Alignment Methods PowerPoint Presentation
Download Presentation
New Multiple Sequence Alignment Methods

play fullscreen
1 / 73
New Multiple Sequence Alignment Methods
72 Views
Download Presentation
arty
Download Presentation

New Multiple Sequence Alignment Methods

- - - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

  1. New Multiple Sequence Alignment Methods Tandy Warnow The Department of Computer Science The University of Texas at Austin

  2. The “Tree of Life”

  3. Avian Phylogenomics Project Erich Jarvis, HHMI G Zhang, BGI MTP Gilbert, Copenhagen T. Warnow UT-Austin S. Mirarab Md. S. Bayzid UT-Austin UT-Austin Plus many many other people… • Approx. 50 species, whole genomes • 8000+ genes, UCEs • Gene sequence alignments and trees computed using SATé Challenges: Maximum likelihood tree estimation on multi-million-site sequence alignments Massive gene tree incongruence

  4. 1kp: Thousand Transcriptome Project T. Warnow, S. Mirarab, N. Nguyen, Md. S.Bayzid UT-Austin UT-Austin UT-Austin UT-Austin N. Matasci iPlant J. Leebens-Mack U Georgia N. Wickett Northwestern G. Ka-Shu Wong U Alberta Plus many many other people… • Plant Tree of Life based on transcriptomes of ~1200 species • More than 13,000 gene families (most not single copy) Gene Tree Incongruence Challenge: Alignment of datasets with > 100,000 sequences Gene tree incongruence

  5. Multiple Sequence Alignment (MSA): another grand challenge1 S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC … Sn = TCACGACCGACA S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT-------GACCGC-- … Sn = -------TCAC--GACCGACA Novel techniques needed for scalability and accuracy NP-hard problems and large datasets Current methods do not provide good accuracy Few methods can analyze even moderately large datasets Many important applications besides phylogenetic estimation 1 Frontiers in Massive Data Analysis, National Academies Press, 2013

  6. -3 mil yrs AAGACTT AAGACTT -2 mil yrs AAGGCCT AAGGCCT AAGGCCT AAGGCCT TGGACTT TGGACTT TGGACTT TGGACTT -1 mil yrs AGGGCAT AGGGCAT AGGGCAT TAGCCCT TAGCCCT TAGCCCT AGCACTT AGCACTT AGCACTT today AGGGCAT TAGCCCA TAGACTT AGCACAA AGCGCTT AGGGCAT TAGCCCA TAGACTT AGCACAA AGCGCTT DNA Sequence Evolution

  7. Phylogeny Problem U V W X Y AGGGCAT TAGCCCA TAGACTT TGCACAA TGCGCTT X U Y V W

  8. The “real” problem U V W X Y TAGACTT TGCACAA TGCGCTT AGGGCATGA AGAT X U Y V W

  9. Indels (insertions and deletions) Deletion Mutation …ACGGTGCAGTTACCA… …ACCAGTCACCA…

  10. Deletion Substitution …ACGGTGCAGTTACCA… • The true multiple alignment • Reflects historical substitution, insertion, and deletion events • Defined using transitive closure of pairwise alignments computed on edges of the true tree Insertion …ACGGTGCAGTTACC-A… …AC----CAGTCACCTA… …ACCAGTCACCTA…

  11. Input: unaligned sequences S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA

  12. Phase 1: Alignment S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT-------GACCGC-- S4 = -------TCAC--GACCGACA

  13. Phase 2: Construct tree S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT-------GACCGC-- S4 = -------TCAC--GACCGACA S1 S2 S4 S3

  14. S1 S1 S2 S4 S4 S2 S3 S3 Simulation Studies S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA Unaligned Sequences S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT-------GACCGC-- S4 = -------TCAC--GACCGACA S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-C--T-----GACCGC-- S4 = T---C-A-CGACCGA----CA Compare True tree and alignment Estimated tree and alignment

  15. Quantifying Error FN FN: false negative (missing edge) FP: false positive (incorrect edge) 50% error rate FP

  16. Phylogeny methods Bayesian MCMC Maximum parsimony Maximum likelihood Neighbor joining FastME UPGMA Quartet puzzling Etc. Two-phase estimation Alignment methods • Clustal • POY (and POY*) • Probcons (and Probtree) • Probalign • MAFFT • Muscle • Di-align • T-Coffee • Prank (PNAS 2005, Science 2008) • Opal (ISMB and Bioinf. 2007) • FSA (PLoS Comp. Bio. 2009) • Infernal (Bioinf. 2009) • Etc. RAxML: heuristic for large-scale ML optimization

  17. 1000-taxon models, ordered by difficulty (Liu et al., 2009)

  18. Problems with the two-phase approach • Current alignment methods fail to return reasonable alignments on large datasets with high rates of indels and substitutions. • Manual alignment is time consuming and subjective. • Systematists discard potentially useful markers if they are difficult to align. This issues seriously impact large-scale phylogeny estimation (and Tree of Life projects)

  19. SATé SATé (Simultaneous Alignment and Tree Estimation) • Liu et al., Science 2009 • Liu et al., Systematic Biology 2012 • Public distribution (open source software) and user-friendly GUI

  20. 1000-taxon models, ordered by difficulty (Liu et al., 2009)

  21. A C B D Re-aligning on a tree A B Decompose dataset C D Align subproblems A B C D Estimate ML tree on merged alignment ABCD Merge sub-alignments

  22. Obtain initial alignment and estimated ML tree Tree SATéAlgorithm

  23. Obtain initial alignment and estimated ML tree Tree Use tree to compute new alignment Alignment SATé Algorithm

  24. Obtain initial alignment and estimated ML tree Tree Use tree to compute new alignment Estimate ML tree on new alignment Alignment SATé Algorithm

  25. Obtain initial alignment and estimated ML tree Tree Use tree to compute new alignment Estimate ML tree on new alignment Alignment SATé Algorithm If new alignment/tree pair has worse ML score, realign using a different decomposition Repeat until termination condition (typically, 24 hours)

  26. 1000-taxon models, ordered by difficulty (Liu et al., 2009)

  27. 1000 taxon models, ordered by difficulty 24 hour SATé analysis, on desktop machines (Similar improvements for biological datasets)

  28. 1000 taxon models ranked by difficulty

  29. A C B D Limitations A B Decompose dataset C D Align subproblems A B C D Estimate ML tree on merged alignment ABCD Merge sub-alignments

  30. A C B D Limitations A B Decompose dataset C D Align subproblems A B C D Estimate ML tree on merged alignment ABCD Merge sub-alignments

  31. Obtain initial alignment and estimated ML tree Tree Estimate ML tree on new alignment New Re-Alignment Technique Alignment PASTA Algorithm (Improving SATé) If new alignment/tree pair has worse ML score, realign using a different decomposition Repeat until termination condition (typically, 24 hours)

  32. A C B D Each PASTA iteration A B Decompose dataset C D Align subproblems A B C D Estimate ML tree on merged alignment New technique for merging alignments ABCD

  33. PASTA vs. SATé: better alignments and trees, better scalability Analyses of Gutell’s 16S datasets with curated structural alignments and reference trees using maximum likelihood with bootstrapping. PASTA can analyze datasets with up to 200,000 sequences efficiently. SATé maxes out around 50,000 sequences.

  34. 1KP dataset: Large and Highly Fragmentary! >100,000 AA sequences Transcriptome data Cytochrome dataset sequence length distribution

  35. UPP: basic idea Input: set S of unaligned sequences Output: alignment on S • Select random subset X of S • Estimate “backbone” alignment A and tree T on X • Independently align each sequence in S-X to A • Use transitivity to produce multiple sequence alignment A* for entire set S

  36. Input: Unaligned Sequences S1 = AGGCTATCACCTGACCTCCAAT S2 = TAGCTATCACGACCGCGCT S3 = TAGCTGACCGCGCT S4 = TACTCACGACCGACAGCT S5 = TAGGTACAACCTAGATC S6 = AGATACGTCGACATATC

  37. Step 1: Pick random subset(backbone) S1 = AGGCTATCACCTGACCTCCAAT S2 = TAGCTATCACGACCGCGCT S3 = TAGCTGACCGCGCT S4 = TACTCACGACCGACAGCT S5 = TAGGTACAACCTAGATC S6 = AGATACGTCGACATATC

  38. Step 2: Compute backbone alignment S1 = -AGGCTATCACCTGACCTCCA-AT S2 = TAG-CTATCAC--GACCGC--GCT S3 = TAG-CT-------GACCGC--GCT S4 = TAC----TCAC—-GACCGACAGCT S5 = TAGGTAAAACCTAGATC S6 = AGATAAAACTACATATC

  39. Step 3: Align each remaining sequence to backbone First we add S5 to the backbone alignment S1 = -AGGCTATCACCTGACCTCCA-AT- S2 = TAG-CTATCAC--GACCGC--GCT- S3 = TAG-CT-------GACCGC—-GCT- S4 = TAC----TCAC--GACCGACAGCT- S5 = TAGG---T-A—CAA-CCTA--GATC

  40. Step 3: Align each remaining sequence to backbone Then we add S6 to the backbone alignment S1 = -AGGCTATCACCTGACCTCCA-AT- S2 = TAG-CTATCAC--GACCGC--GCT- S3 = TAG-CT-------GACCGC--GCT- S4 = TAC----TCAC—-GACCGACAGCT- S6 = -AG---AT-A-CGTC--GACATATC

  41. Step 4: Use transitivity to obtain MSA on entire set S1 = -AGGCTATCACCTGACCTCCA-AT-- S2 = TAG-CTATCAC--GACCGC--GCT-- S3 = TAG-CT-------GACCGC--GCT-- S4 = TAC----TCAC--GACCGACAGCT-- S5 = TAGG---T-A—CAA-CCTA--GATC- S6 = -AG---AT-A-CGTC--GACATAT-C

  42. UPP: details Input: set S of unaligned sequences Output: alignment on S • Select random subset X of S • Estimate “backbone” alignment A and tree T on X • Independently align each sequence in S-X to A • Use transitivity to produce multiple sequence alignment A* for entire set S

  43. UPP: details Input: set S of unaligned sequences Output: alignment on S • Select random subset X of S • Estimate “backbone” alignment A and tree T on X • Independently align each sequence in S-X to A • Use transitivity to produce multiple sequence alignment A* for entire set S

  44. How to align sequences to a backbone alignment? Standard machine learning technique: Build HMM (Hidden Markov Model) for backbone alignment, and use it to align remaining sequences We use HMMER (Eddy, HHMI) for this purpose

  45. 1: Build Hidden Markov Model (HMM) for backbone MSA 2: Use HMM to align remaining sequences (one by one) 3: Use transitivity to infer MSA on entire dataset

  46. Using HMMER Using HMMER works well… • …except when the dataset has a high evolutionary diameter.

  47. Using HMMER Using HMMER works well…except when the dataset is big!

  48. One Hidden Markov Model for the entire alignment?

  49. Or 2 HMMs?

  50. Or 4 HMMs?(or 8 or 16 or…)