
RNAsim/CRIMSON Algorithm Benchmark Suite

RNAsim/CRIMSON Algorithm Benchmark Suite. U Penn: Junhyong Kim, Sampath Kannan, Susan Davidson, Steve Fisher, Sheng Guo. U Texas: David Hillis, Lauren Meyers, Tracey Heath, Derrick Zwickl. NC State: Spencer Muse. Florida State: Mark Holder. Yale: Paul Turner.





Presentation Transcript


  1. RNAsim/CRIMSON Algorithm Benchmark Suite. U Penn: Junhyong Kim, Sampath Kannan, Susan Davidson, Steve Fisher, Sheng Guo. U Texas: David Hillis, Lauren Meyers, Tracey Heath, Derrick Zwickl. NC State: Spencer Muse. Florida State: Mark Holder. Yale: Paul Turner.

  2. Goal: Develop validated datasets of sufficient complexity and scale to realistically benchmark the latest tree-inference algorithms.

  3. Benchmark Infrastructure (diagram). Labels shown: Model Characterization • Simulators: Character Evolution Simulators, Tree Topology Simulators, Others (Tree/Char Combined, Experimental Evolution, Virtual Cell, etc.) • Database • Model Sampling • Taxon Sampling • Data Subset with Associated Subtree • Format Translators • PAUP*, etc. The diagram also carries the labels RNAsim and CRIMSON.

  4. Benchmark Scheme • Generate a very large dataset (>10^6 positions) over a very large tree (>10^6 taxa) using various models of evolution • Store the data in a database • Retrieve subsets of the data by various sampling schemes
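The store-then-sample idea in the scheme above can be sketched in a few lines. This is only an illustration under stated assumptions: it uses Python's built-in SQLite module as a stand-in for the shared database, a toy table of 1,000 hypothetical taxa rather than 10^6, and an invented table layout (`alignment`, `taxon`, `sequence`).

```python
import sqlite3


def build_store(records):
    """Load (taxon, sequence) pairs into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE alignment (taxon TEXT PRIMARY KEY, sequence TEXT NOT NULL)"
    )
    conn.executemany("INSERT INTO alignment VALUES (?, ?)", records)
    conn.commit()
    return conn


def random_subset(conn, k):
    """Retrieve k rows chosen at random (one possible sampling scheme)."""
    cur = conn.execute(
        "SELECT taxon, sequence FROM alignment ORDER BY RANDOM() LIMIT ?", (k,)
    )
    return cur.fetchall()


# Toy stand-in for the simulated data; names and sequences are hypothetical.
records = [(f"taxon_{i}", "ACGU" * 5) for i in range(1000)]
conn = build_store(records)
subset = random_subset(conn, 10)
print(len(subset), "taxa retrieved")
```

Keeping the full dataset in one store and drawing subsets by query means the same simulated history can serve many benchmark problems of different sizes.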

  5. RNA macro-evolution simulation (Sheng Guo, Lisan Wang) • Incorporates secondary-structure constraints and indels, using a simulator based on edit mutations. A set of edit operators is implemented (for example, stem edits), each of which operates on the evolving strings with a characteristic wait time. The ancestral molecule is based on a known rRNA gene with a putative known secondary structure. Evolution of the secondary structure is tracked. Figure: edit operators applied between ancestor (anc) and descendant (desc): delete stem pair, change base, initiate new stem, insert base, delete base, add stem pair.
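A minimal sketch of how edit operators with characteristic wait times might be simulated, assuming exponentially distributed waits (a Gillespie-style scheme that the slides do not name) and hypothetical rates. It covers only the three base-level operators and ignores stem operators, secondary-structure tracking, and selection, so it should not be read as RNAsim itself.

```python
import random

BASES = "ACGU"

# Hypothetical per-sequence rates (events per unit time); not from the slides.
RATES = {"change base": 1.0, "insert base": 0.2, "delete base": 0.2}


def apply_edit(seq, op, rng):
    """Apply one edit operator at a uniformly chosen position."""
    pos = rng.randrange(len(seq))
    if op == "change base":
        new = rng.choice([b for b in BASES if b != seq[pos]])
        return seq[:pos] + new + seq[pos + 1:]
    if op == "insert base":
        return seq[:pos] + rng.choice(BASES) + seq[pos:]
    if op == "delete base" and len(seq) > 1:
        return seq[:pos] + seq[pos + 1:]
    return seq  # never delete the last remaining base


def evolve(ancestor, t_max, seed=0):
    """Evolve a sequence along one branch of length t_max."""
    rng = random.Random(seed)
    ops = list(RATES)
    weights = [RATES[o] for o in ops]
    total_rate = sum(weights)
    seq, t, history = ancestor, 0.0, []
    while True:
        # Wait time to the next event of any kind.
        t += rng.expovariate(total_rate)
        if t > t_max:
            return seq, history
        # Choose which operator fired, in proportion to its rate.
        op = rng.choices(ops, weights=weights)[0]
        seq = apply_edit(seq, op, rng)
        history.append((t, op))


descendant, events = evolve("GGGAAACCC", t_max=5.0, seed=1)
print(descendant, len(events))
```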

  6. Fixation probability as a function of fitness. Parameters: Ne: effective population size; neutral mutation rate (symbol lost in the transcript); s: fitness change. Cases: Neutral; Advantageous (s>0) / Deleterious (s<0); Compensatory Mutation.
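The slide does not say which formula is used. As an illustration only, the sketch below assumes Kimura's classical diffusion approximation for a new mutation in a diploid population, P = (1 - e^(-2s)) / (1 - e^(-4 Ne s)), with the neutral limit 1/(2 Ne). The function name and the numeric guards are choices made for this example.

```python
import math


def fixation_probability(s, ne):
    """Kimura's approximation for a single new mutant (initial freq 1/(2*ne))."""
    neutral = 1.0 / (2.0 * ne)
    if abs(s) < 1e-12:
        return neutral  # neutral limit of the formula
    exponent = -4.0 * ne * s
    if exponent > 700.0:
        return 0.0  # strongly deleterious: avoids float overflow
    # (1 - e^(-2s)) / (1 - e^(-4*ne*s)), written with expm1 for accuracy
    return math.expm1(-2.0 * s) / math.expm1(exponent)


for s in (-0.01, 0.0, 0.01):
    print(s, fixation_probability(s, ne=1000))
```

Under this assumption an advantageous change fixes more often than a neutral one and a deleterious change less often, which matches the three cases the slide distinguishes.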

  7. One-step mutation ensemble of an RNA

  8. Weaker Selection

  9. Calibration on Empirical Data (figure panels: Simulated RNA; 100 Eukaryotic ssRNA)

  10. Example: Pairwise Similarity of 1000 locally optimal ML trees (MDS plot; panels: Empirical Data, RNAsim, ROSE, SeqGen)

  11. CPU Time to reach local optimum (PAUP* ML, TBR)

  12. 1 Million Leaves (Tracey Heath; Birth-Death Model with variable rates). 20 data replicate partitions simulated and stored at SDSC.

  13. Crimson (Stephen Fisher, Susan Davidson, Junhyong Kim) • Facilitates the extraction of sub-trees from very large phylogenetic trees • Trees loaded into a shared database (Oracle or MySQL) • Extensive tree sampling options • Save query output to NEXUS or PHYLIP files • Include PAUP commands in query output files • Comprehensive graphical dialogs • Command line interface allowing Python-like scripting • Display trees with Walrus 3D Viewer

  14. Query Options • Species Selection: Select All; Random Selection; Select By Temporal Depth (same number of samples per sub-tree; weight sampling of sub-trees by number of leaves); Select By Species Level (same number of samples per sub-tree; weight sampling of sub-trees by number of leaves); Manual Selection • Sequence Selection: Select All; Random Selection; Manual Selection
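The two per-sub-tree options listed above differ only in how many leaves each sub-tree contributes. The sketch below illustrates that difference under assumptions of its own: sub-trees are given as plain lists of leaf names, the weighted mode uses largest-remainder rounding so quotas sum exactly to the request, and all function names are hypothetical rather than part of Crimson.

```python
import random


def proportional_quota(sizes, n):
    """Split n samples across sub-trees in proportion to their leaf counts."""
    total = sum(sizes)
    raw = [n * size / total for size in sizes]
    quota = [int(r) for r in raw]
    leftover = n - sum(quota)
    # Hand the remaining samples to the largest fractional remainders.
    order = sorted(range(len(sizes)), key=lambda i: raw[i] - quota[i], reverse=True)
    for i in order[:leftover]:
        quota[i] += 1
    return quota


def sample_by_subtree(subtrees, n, weighted, seed=0):
    """Draw leaves without replacement from each sub-tree."""
    rng = random.Random(seed)
    sizes = [len(leaves) for leaves in subtrees]
    if weighted:
        quota = proportional_quota(sizes, n)
    else:
        quota = [n // len(subtrees)] * len(subtrees)  # same number per sub-tree
    picked = []
    for leaves, q in zip(subtrees, quota):
        picked.extend(rng.sample(leaves, min(q, len(leaves))))
    return picked


# Three toy sub-trees with 10, 30, and 60 leaves.
subtrees = [[f"{name}{i}" for i in range(size)]
            for name, size in (("a", 10), ("b", 30), ("c", 60))]
equal = sample_by_subtree(subtrees, 9, weighted=False)
weighted = sample_by_subtree(subtrees, 10, weighted=True)
print(equal)
print(weighted)
```

Equal quotas over-represent small clades relative to their size, while weighted quotas keep the sample's composition close to that of the full tree.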

  15. Depth Threshold Distribution (figure; levels L-1 through L-8)

  16. Crimson Interface

  17. Current Benchmarking Effort • Sample #1: 10 leaves per sampled tree; taxon sampling repeated 40 times per replicate data partition • Sample #2: 100 leaves per sampled tree; taxon sampling repeated 30 times per replicate data partition • Sample #3: 1,000 leaves per sampled tree; taxon sampling repeated 20 times per replicate data partition • Sample #4: 10,000 leaves per sampled tree; taxon sampling repeated 10 times per replicate data partition
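The size of this effort follows from simple arithmetic. The sketch below assumes that every sampling scheme is applied to each of the 20 replicate data partitions mentioned on slide 12; the slides do not state this explicitly.

```python
# (leaves per sampled tree, taxon samplings per replicate data partition)
SAMPLES = [(10, 40), (100, 30), (1_000, 20), (10_000, 10)]
PARTITIONS = 20  # replicate data partitions (slide 12); assumed to apply to all

trees_per_partition = sum(reps for _, reps in SAMPLES)
total_trees = trees_per_partition * PARTITIONS
total_leaves = PARTITIONS * sum(leaves * reps for leaves, reps in SAMPLES)

print(trees_per_partition)  # 100 sampled trees per partition
print(total_trees)          # 2000 benchmark problems in all
print(total_leaves)         # 2468000 sampled leaves in all
```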

  18. Algorithms (to be expanded) • Neighbor Joining (paup): breakties=random • Parsimony (paup): set maxtrees=200 increase=no; hsearch timelimit=432000; contree all /strict=no majrule=yes • RAxML (raxmlHPC): -f a; -# 100; -m GTRGAMMA
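For the RAxML settings, a batch driver could assemble one command line per sampled dataset. The sketch below only builds the argument list and does not run anything. The flags `-f a`, `-# 100`, and `-m GTRGAMMA` come from the slide; `-s` (alignment file), `-n` (run name), `-p`, and `-x` (random seeds, which the rapid-bootstrap mode expects) are standard raxmlHPC options added here as an assumption, and the file names and seed value are hypothetical.

```python
import shlex


def raxml_command(alignment, run_name, seed=12345):
    """Build a raxmlHPC argument list for one sampled dataset."""
    return [
        "raxmlHPC",
        "-f", "a",          # rapid bootstrap plus ML search (from the slide)
        "-#", "100",        # number of bootstrap replicates (from the slide)
        "-m", "GTRGAMMA",   # substitution model (from the slide)
        "-p", str(seed),    # parsimony seed (assumed; not on the slide)
        "-x", str(seed),    # rapid-bootstrap seed (assumed; not on the slide)
        "-s", alignment,    # input alignment (hypothetical file name)
        "-n", run_name,     # output suffix (hypothetical run name)
    ]


cmd = raxml_command("sample_0001.phy", "sample_0001")
print(shlex.join(cmd))
```

Passing the list form to a process launcher avoids shell-quoting problems with the `-#` flag.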

  19. Benchmarking Stats

  20. Distribution of False Positive Edges

  21. Computational Difficulty of Dataset Versus Accuracy (figure; time axes in sec and hr)

  22. RAxML Computation Time (Heuristic) Over 30 Random 100-Taxon Tree Replicates

  23. Thanks to: Susan Davidson, Steve Fisher, Sheng Guo, David Hillis, Tracey Heath, Lisan Wang, Yifeng Zhang, Derrick Zwickl. Please ask and talk to: Steve Fisher, Sheng Guo, Lisan Wang. Please see the CRIMSON demo by Steve Fisher.
