Whole Genome Phylogenetic Analysis

Presentation Transcript

  1. Whole Genome Phylogenetic Analysis Yifeng Liu and Reihaneh Rabbanyk Khorasgani April 8th, 2009

  2. Agenda • Introduction • Our method proposals • Datasets and experiments • Results • Discussion • Future work • Conclusion

  3. Whole Genome Phylogeny: Motivations • Currently the dominant method for phylogenetic analysis is based on a single gene or protein. • However, different genes tell different stories. • Recently, more genomic sequences have become available. • We hope to resolve this inconsistency by using the entire genome (or proteome) to reconstruct the phylogenetic tree.

  4. Whole Genome Phylogeny: Methods • Major categories of methods are based on: • Shared gene (ortholog) content • Nucleotide and amino acid (string) composition • Genome Compression • Gene order • In our study, we focus on string composition and compression methods

  5. Complete Composition Vector (CCV) • The observed occurrence probability for a k-string: • The estimated background occurrence probability based on the Markov assumption is:
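The formulas on this slide were images and did not survive extraction. As a hedged reconstruction, assuming the slide follows the usual composition vector definitions, the two quantities can be written as below. The symbols f (occurrence count), n (sequence length), p and q are notation chosen here and may differ from the slide.

```latex
% Observed occurrence probability of a k-string in a sequence of length n,
% where f(.) is the number of times the k-string occurs
p(\alpha_1 \alpha_2 \cdots \alpha_k) = \frac{f(\alpha_1 \alpha_2 \cdots \alpha_k)}{n - k + 1}

% Estimated background probability under the Markov assumption:
% a k-string is predicted from its two (k-1)-substrings and their (k-2)-overlap
q(\alpha_1 \alpha_2 \cdots \alpha_k) =
  \frac{p(\alpha_1 \cdots \alpha_{k-1}) \, p(\alpha_2 \cdots \alpha_k)}
       {p(\alpha_2 \cdots \alpha_{k-1})}
```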

  6. Complete Composition Vector (CCV) • The occurrence probability due to selective pressure: • The k-th composition vector: • The Complete Composition Vector (CCV):
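A minimal Python sketch of these definitions, assuming the selection score takes the usual form S = (p - q) / q, where p is the observed probability of a k-string and q is its Markov background estimate. The function names are hypothetical. As simplifications, only k-strings that actually occur are stored, and the sketch needs k >= 2 because the Markov estimate uses (k-2)-strings. How the original work handled k = 1 is not stated in the slides.

```python
from collections import Counter


def kmer_probs(seq, k):
    """Observed probability of each k-string: count / (len(seq) - k + 1)."""
    total = len(seq) - k + 1
    if total <= 0:
        return {}
    counts = Counter(seq[i:i + k] for i in range(total))
    return {s: c / total for s, c in counts.items()}


def composition_vector(seq, k):
    """Map each observed k-string (k >= 2) to S = (p - q) / q."""
    if k < 2:
        raise ValueError("this sketch needs k >= 2")
    p_k = kmer_probs(seq, k)
    p_k1 = kmer_probs(seq, k - 1)
    p_k2 = kmer_probs(seq, k - 2)  # for k == 2 this is {"": 1.0}
    vec = {}
    for s, p in p_k.items():
        # Markov estimate: q = p(prefix) * p(suffix) / p(middle).
        # All three substrings occur whenever s occurs, so q > 0.
        q = p_k1[s[:-1]] * p_k1[s[1:]] / p_k2[s[1:-1]]
        vec[s] = (p - q) / q
    return vec


def complete_composition_vector(seq, k_min, k_max):
    """Concatenate the composition vectors for k = k_min .. k_max."""
    ccv = {}
    for k in range(k_min, k_max + 1):
        ccv.update(composition_vector(seq, k))  # keys of different length never clash
    return ccv
```

Storing the vector as a dictionary keeps it sparse, which matters once k grows and most possible k-strings never occur.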

  7. Compression Methods • Kolmogorov Complexity • Lempel-Ziv complexity
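Kolmogorov complexity is not computable, so compression-based methods approximate it with the output size of a real compressor. The sketch below illustrates the idea with the normalized compression distance and Python's `zlib` as a stand-in compressor. Both choices are assumptions made for illustration: the slides do not give the distance formula, and the experiments used GenCompress and an LZ program rather than `zlib`. The helper `random_dna` is hypothetical and only produces demo input.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed data (level 9)."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two sequences."""
    bx, by = x.encode("ascii"), y.encode("ascii")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)


def random_dna(n: int, seed: int) -> str:
    """Hypothetical demo helper: a reproducible random DNA string."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(n))


a = random_dna(2000, seed=1)
b = random_dna(2000, seed=2)
print("identical:", ncd(a, a))
print("unrelated:", ncd(a, b))
```

Note that `zlib` only looks back 32 KB, so it cannot find shared content between two long genomes. This is one concrete way a general-purpose compressor falls short on large sequences.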

  8. Agenda • Introduction • Our method proposals • Datasets and experiments • Results • Discussion • Future work • Conclusion

  9. A new term weighting scheme • CCV uses S(•) to weight each k-string, which • Utilizes only local information available within a single sequence • Estimates the random background based on a Markov model • Can we have a measure that uses both local and global information without making the Markov assumption?

  10. Term and Document Frequency • Genomes are documents written in a language with a four-letter alphabet {A,T,C,G}; similarly, proteomes are documents written in a language with a twenty-letter alphabet. • Each k-string can be viewed as a word within a genome (or proteome) document. • The collection of all genomes in the dataset is therefore a corpus.

  11. Term and Document Frequency • In statistical Natural Language Processing, a well-known term weighting scheme TF-IDF combines both term frequency and document frequency into a single weight.
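For reference, one common textbook form of the TF-IDF weight is given below. The slide does not say which variant was used, so the exact normalization here is an assumption. N is the number of genomes in the corpus, df_t is the number of genomes that contain k-string t, and tf_{t,d} is the relative frequency of t in genome d.

```latex
\mathrm{tfidf}_{t,d} = \mathrm{tf}_{t,d} \times \log \frac{N}{\mathrm{df}_t}
```

A k-string that occurs in every genome has df_t = N and therefore weight zero, so it contributes nothing to the distance between genomes.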

  12. CCV meets Document Frequency • We can also combine the occurrence probability due to selection S(•) with the inverse document frequency into a single weight called CCV-IDF. • S(•) provides local information and df_i provides global information.
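The CCV-IDF formula itself did not survive extraction. As a sketch of one plausible reading, assuming the local weight is simply multiplied by log(N / df) as in TF-IDF, the global step could look like the following. The function name `apply_idf` is hypothetical, and the input is one dictionary per genome mapping each k-string to its local weight (S for CCV-IDF, term frequency for TF-IDF).

```python
import math
from collections import Counter


def apply_idf(vectors):
    """Scale each local weight by log(N / df) of its k-string.

    vectors: list of dicts (k-string -> local weight), one per genome.
    This multiplicative form is an assumption, not the slide's formula.
    """
    n = len(vectors)
    df = Counter()
    for vec in vectors:
        df.update(vec.keys())  # each genome counts once per k-string
    return [
        {s: w * math.log(n / df[s]) for s, w in vec.items()}
        for vec in vectors
    ]


local = [{"AC": 2.0, "GT": 1.0}, {"AC": 1.0}]
print(apply_idf(local))
```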

  13. Ensemble Measures • Normalizing distances to the same range • Combining distance matrices • These parameters should be adjusted
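A minimal sketch of the two steps, assuming min-max scaling for the normalization and a weighted average for the combination. The slide's actual formula and parameters are not recoverable, so the function names and the `weights` argument are illustrative only.

```python
def normalize(matrix):
    """Min-max scale the off-diagonal entries of a distance matrix into [0, 1]."""
    n = len(matrix)
    off = [matrix[i][j] for i in range(n) for j in range(n) if i != j]
    lo, hi = min(off), max(off)
    span = (hi - lo) or 1.0  # avoid dividing by zero when all distances are equal
    return [
        [0.0 if i == j else (matrix[i][j] - lo) / span for j in range(n)]
        for i in range(n)
    ]


def ensemble(matrices, weights):
    """Weighted average of normalized distance matrices over the same taxa."""
    norm = [normalize(m) for m in matrices]
    total = sum(weights)
    n = len(matrices[0])
    return [
        [sum(w * m[i][j] for w, m in zip(weights, norm)) / total for j in range(n)]
        for i in range(n)
    ]


m1 = [[0, 2, 4], [2, 0, 6], [4, 6, 0]]
m2 = [[0, 10, 30], [10, 0, 20], [30, 20, 0]]
print(ensemble([m1, m2], weights=[1.0, 1.0]))
```

Min-max scaling maps the smallest distance in each matrix to zero. Dividing by the maximum alone is an alternative that keeps the ratios between distances.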

  14. Tree Evaluation • We propose a new method for evaluating phylogenetic trees • A numeric measure • Shows how compatible the tree is with the given taxonomy

  15. Tree Evaluation (Cont.) • Labeling the inner nodes in the tree • For each species • A path in the tree → sequence of inner node labels • A taxonomy description → taxonomy sequence • There should be a many-to-many alignment between these two sequences

  16. Tree Evaluation (Cont.) • Finding alignments between these sequences for all the species • Using a Bayesian network • Finding the most probable alignments • Measuring the log-likelihood of these alignments • How probable is this tree given this taxonomy?

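  17. Tree Evaluation (Example) • Phylogenetic tree [figure: inner nodes 1, 2, 3; leaves A, B, C, D] • Taxonomy • T1;T2; A • T1;T3; B • T1;T3; C • T1;T3; D • Paths and alignments • P1: path 1, alignment <T1;T2, 1> • P2: path 1;2, alignment <T1, 1> <T3, 2> • P3: path 1;2;3, alignment <T1, 1> <T3, 2;3> • P4: path 1;2;3, alignment <T1, 1> <T3, 2;3>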

  18. Agenda • Introduction • Our method proposals • Datasets and experiments • Results • Discussion • Future work • Conclusion

  19. Dataset: influenza virus • Influenza virus genomes (flu) • 44 influenza A genomes (3 for H1-H13, 2 for H16) • 3 influenza B genomes • 1 influenza C genome (outgroup) • Coding gene sequences only • Collected and joined from individual gene sequences in the following order: HA, NA, NP, M, NS, PA, PB1, PB2

  20. Dataset: Prokaryotes • Prokaryote genomes (bac) • 88 bacterial genomes • 11 archaeal genomes • Uses Nanoarchaeum equitans as the outgroup. • Collected from NCBI according to the accession numbers provided in the CCV paper. • Genomic DNA sequences including intergenic regions.

  21. Dataset: Mammal mitochondria • Mammal mitochondria (mito) • 425 mammal mitochondria • 1 Arabidopsis mitochondrion (outgroup) • Collected from the Organelle Genome Megasequencing Program website. • Converted from NCBI format to FASTA format. • Contains many duplicated entries for: • Bos taurus (cattle) • Sus scrofa (wild boar) • Mus musculus (mouse) • Rattus norvegicus (rat)

  22. Experiments • We built a multiple sequence alignment tree for flu • We ran CCV, TF-IDF and CCV-IDF on all three datasets with the following k-string lengths (we fixed K1 = 1 and varied only K2, so L = K2 - K1 + 1 = K2): • Flu: L = 7 and L = 15 • Bac and mito: L = 7 and L = 9 • Each run generates a pairwise distance matrix.
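The result slides label their trees "cos", which suggests a cosine-based distance between weight vectors. The sketch below builds a pairwise distance matrix that way, assuming the form D = (1 - C) / 2, where C is the cosine similarity. The exact form used in the experiments is not stated, and the function names are hypothetical.

```python
import math


def cosine_distance(u, v):
    """Distance (1 - cos) / 2 between two sparse vectors stored as dicts."""
    dot = sum(w * v[s] for s, w in u.items() if s in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # convention chosen here for empty vectors
    return (1.0 - dot / (norm_u * norm_v)) / 2.0


def pairwise_matrix(vectors):
    """Symmetric n x n distance matrix for a list of sparse vectors."""
    n = len(vectors)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = cosine_distance(vectors[i], vectors[j])
            matrix[i][j] = matrix[j][i] = d
    return matrix


vectors = [{"AC": 1.0}, {"AC": 2.0}, {"GT": 1.0}]
print(pairwise_matrix(vectors))
```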

  23. Experiments • We ran the GenCompress and LZ compression programs on flu and mito and calculated pairwise distances • We tried ensembling different measures [Reihaneh]

  24. Experiments • We converted pairwise distance matrices into phylogenetic trees using the Neighbor-Joining program in PHYLIP • We visualized resulting trees using DRAWGRAM and TreeView.
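PHYLIP's distance programs read a plain-text matrix: a first line with the number of taxa, then one row per taxon with a name padded to 10 characters followed by the distances. A minimal sketch of a writer for the square form is shown below. The function name `to_phylip` is hypothetical, and truncating names to 10 characters is a simplification that the function guards against.

```python
def to_phylip(names, matrix):
    """Format a square distance matrix in PHYLIP's distance-matrix layout."""
    labels = [name[:10].ljust(10) for name in names]
    if len(set(labels)) != len(labels):
        raise ValueError("taxon names are not unique after truncation to 10 chars")
    lines = [f"{len(names):5d}"]
    for label, row in zip(labels, matrix):
        lines.append(label + " ".join(f"{d:.6f}" for d in row))
    return "\n".join(lines) + "\n"


names = ["FluA_H1", "FluA_H5", "FluB"]
matrix = [[0.0, 0.12, 0.40], [0.12, 0.0, 0.38], [0.40, 0.38, 0.0]]
print(to_phylip(names, matrix), end="")
```

The output can be saved as the input file for the Neighbor-Joining program. The taxon names in the usage lines are made up for the example.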

  25. Agenda • Introduction • Our method proposals • Datasets and experiments • Results • Discussion • Future work • Conclusion

  26. MSA tree versus HA tree • MSA tree clades: H1, 2, 3 | H5, 6, 9 | H4, 15, 16, 13 | H7, 10, 12, 8 | B • HA tree by Suzuki et al. [figure]

  27. MSA versus Compression • MSA tree clades: 1, 2, 3 | 5, 6, 9 | 4, 15, 16, 13 | 7, 10, 12, 8 | B • GenCompress tree clades: 1, 2, 3 | 10, 12, 8 | 7 | 13, 16 | 15, 4, 5, 6, 9 | B

  28. MSA versus CCV • CCV L15 cos tree clades: 1, 2, 3 | 7, 8, 10, 12 | 13, 16 | 15 | 4, 5, 6, 9 | B • MSA tree clades: H1, 2, 3 | H5, 6, 9 | H4, 15, 16, 13 | 7, 8, 10, 12 | B

  29. MSA vs TF-IDF • TF-IDF L15 cos tree clades: 7, 8, 10, 12, 11 | 4, 5, 6, 9 | 15 | 13, 16 | H1, 2, 3 | B • MSA tree clades: H1, 2, 3 | H4, 15, 16, 13 | H5, 6, 9 | H7, 10, 12, 8 | B

  30. MSA vs CCV-IDF • CCV-IDF L15 cos tree clades: H1, 2, 3 | 13, 16 | 8, 10, 12 | 7 | 4, 5, 6, 9 | 15 | B • MSA tree clades: H1, 2, 3 | H4, 15, 16, 13 | H5, 6, 9 | 7, 8, 10, 12 | B

  31. CCV vs TF-IDF • CCV L15 cos tree clades: 1, 2, 3 | 7, 8, 10, 12 | 13, 16 | 15 | 4, 5, 6, 9 | B • TF-IDF L15 cos tree clades: 7, 8, 10, 12, 11 | 4, 5, 6, 9 | 15 | 13, 16 | H1, 2, 3 | B

  32. CCV vs CCV-IDF • CCV L15 cos tree clades: 1, 2, 3 | 7, 8, 10, 12 | 13, 16 | 15 | 4, 5, 6, 9 | B • CCV-IDF L15 cos tree clades: 1, 2, 3 | 8, 10, 12 | 7 | 4, 5, 6, 9 | 15 | 13, 16 | B

  33. Observations • All methods (MSA, CCV, GenCompress, TF-IDF, CCV-IDF) generate similar results. • Our results are significantly different from previous studies. • Most clades are intact while some are scattered around. • Most clades are pure while some are mixed with species from nearby clades. • CCV and CCV-IDF results are highly similar.

  34. AA versus DNA • CCV k1=3, k2=7 protein [figure] • CCV k1=1, k2=7 DNA [figure]

  35. CCV L=7 and L=9 • CCV k1=1, k2=7 DNA [figure] • CCV k1=1, k2=9 DNA [figure]

  36. Observations • Most clades are intact. • For similar CCV length, the DNA tree is worse than the protein tree and is unable to recognize Archaea as a distinct clade. • CCV trees are similar for length 7 and length 9. • Similarly, the L7, L15 and L21 trees for flu are almost identical.

  37. Mito results • For the mito dataset, we have similar observations. • All methods failed to resolve the fine branches of the tree, mixing in distant species.

  38. Mito: primates • TF-IDF L9 cos [figure] • CCV L9 cos [figure]

  39. Agenda • Introduction • Our method proposals • Datasets and experiments • Results • Discussion • Future work • Conclusion

  40. DNA versus AA Sequence • There are more possible k-strings for protein sequences than for DNA sequences of the same length. • We need longer k-strings for DNA to achieve the same resolution as amino acid (AA) sequences. • Due to the redundant nature of the genetic code, different DNA k-strings may correspond to the same AA k-string. • AA k-strings can share information even though their DNA sequences might be different. • DNA sequences may contain intergenic regions which do not respond to selection pressure. • Intergenic regions may not contribute much to the resolution of the tree; they might even reduce it.
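The first two points can be checked with simple arithmetic: an alphabet of size a has a^k possible k-strings. The short sketch below compares the two alphabets and finds the DNA length needed to match the number of distinct AA k-strings. The function names are hypothetical, and equating "resolution" with the count of possible k-strings is a simplifying assumption.

```python
def num_kstrings(alphabet_size, k):
    """Number of distinct k-strings over an alphabet of the given size."""
    return alphabet_size ** k


def dna_length_for(aa_k):
    """Smallest DNA k whose k-string space is at least as large as 20**aa_k."""
    target = num_kstrings(20, aa_k)
    k = 1
    while num_kstrings(4, k) < target:
        k += 1
    return k


for k in (3, 5, 7):
    print(k, num_kstrings(4, k), num_kstrings(20, k), dna_length_for(k))
```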

  41. Thoughts on Document Frequency • We did not observe a significant performance difference from adding document frequency information. • For longer genomes (e.g. bac), we need longer k-strings to see the effect of DF. • All bac genomes share 87.9% of 9-strings and only 0.8% of 11-strings.
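A k-string shared by every genome has document frequency N and so gets an IDF of zero, which is why short k-strings give DF little to work with. The sketch below measures that sharing, assuming the percentage is taken over all 4^k possible k-strings. The slide does not state the denominator, and the function name is hypothetical.

```python
def shared_fraction(genomes, k, alphabet_size=4):
    """Fraction of all possible k-strings that occur in every genome."""
    sets = [
        {g[i:i + k] for i in range(len(g) - k + 1)}
        for g in genomes
    ]
    shared = set.intersection(*sets)
    return len(shared) / alphabet_size ** k


print(shared_fraction(["ACGT", "CGTA"], k=2))
```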

  42. Compression programs • Current compression programs are problematic • LZ could not handle large datasets • Kolmogorov is not applicable to large sequences • These methods should be reimplemented

  43. Agenda • Introduction • Our method proposals • Datasets and experiments • Results • Discussion • Future work • Conclusion

  44. Future work • Run the same experiments on protein sequences • To investigate the effect of using AA versus DNA sequences. • We expect to see better results with protein sequences • New results may reveal subtle differences between the methods.

  45. Future work • Speed up the implementation of TF-IDF and run it on longer k-strings • Computational complexity is the bottleneck for achieving high resolution in a reasonable amount of time. • Initially the calculations for TF and IDF were separate: slow • We achieved a significant speedup by integrating the calculation of TF and IDF into a two-pass algorithm • We may drop k-strings with low TF-IDF values to further speed up the program.
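The slides do not describe the two-pass algorithm in detail. One way such a scheme could be organized, offered only as a sketch, is to count document frequencies in a first pass and then compute term frequencies and final weights together in a second pass, dropping low weights on the fly. The function names and the `min_weight` threshold are hypothetical.

```python
import math
from collections import Counter


def kstrings(seq, k):
    """Yield every overlapping k-string of seq."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def tfidf_two_pass(genomes, k, min_weight=0.0):
    """Return one sparse TF-IDF vector (dict) per genome."""
    # Pass 1: document frequency of every k-string.
    df = Counter()
    for seq in genomes:
        df.update(set(kstrings(seq, k)))
    n = len(genomes)

    # Pass 2: term frequency and final weight, one genome at a time.
    vectors = []
    for seq in genomes:
        tf = Counter(kstrings(seq, k))
        total = sum(tf.values())
        vec = {}
        for s, count in tf.items():
            weight = (count / total) * math.log(n / df[s])
            if weight > min_weight:  # prune low-weight k-strings
                vec[s] = weight
        vectors.append(vec)
    return vectors


print(tfidf_two_pass(["AAAC", "AAAG"], k=2))
```

With the default threshold, k-strings present in every genome are dropped because their weight is exactly zero. This does not change a cosine comparison of the vectors.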

  46. Future work • Perform bootstrapping analysis • We were unable to perform bootstrapping analysis due to time and computational resource constraints

  47. Future work • Our proposed evaluation method needs a many-to-many alignment, which is not a trivial task • This problem is well studied in machine translation and natural language processing, and those techniques could help here • This measure could also be used as a measure of similarity between trees

  48. Agenda • Introduction • Our method proposals • Datasets and experiments • Results • Discussion • Future work • Conclusion

  49. Conclusion • All string composition methods (CCV, TF-IDF, CCV-IDF) group most similar species together to some extent and produce consistent results. • However, they all failed to resolve big branches as well as fine branches. • We did not observe a significant improvement from adding document frequency. • But we will need further experiments (with longer k-strings on AA sequences) to fully understand the effect.

  50. Major Contributions • We proposed a novel term weighting scheme which achieves performance similar to CCV in our experiments • We proposed the notion of adding global information in the form of document frequency • We discovered that using protein sequences may significantly improve performance for all methods • We proposed a novel evaluation method for phylogenetic trees