1 / 46

A search engine for phylogenetic tree databases

A search engine for phylogenetic tree databases. David Fernández-Baca Joint work with Mukul Bansal, Duhong Chen (Computer Science, ISU) and J. Gordon Burleigh (NESCent). PhyloFinder. http://pilin.cs.iastate.edu/phylofinder/. Outline. Introduction PhyloFinder queries Implementation

keelty
Télécharger la présentation

A search engine for phylogenetic tree databases

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. A search engine for phylogenetic tree databases David Fernández-Baca Joint work with Mukul Bansal, Duhong Chen (Computer Science, ISU) and J. Gordon Burleigh (NESCent)

  2. PhyloFinder http://pilin.cs.iastate.edu/phylofinder/

  3. Outline • Introduction • PhyloFinder queries • Implementation • Future directions • Acknowledgements

  4. Outline • Introduction • PhyloFinder queries • Implementation • Future directions • Acknowledgements

  5. Issues in Phylogenetic Databases • Taxonomic consistency • Species may appear in multiple trees by different but synonymous names. • Homonyms • Misspellings • Querying capability • Storage/representation • Exploiting classification trees (e.g., NCBI) • Clustering capabilities • Distance measures • Aggregation (synthesis) capabilities • Supertrees • Visualization

  6. Classification trees and phylogenies

  7. Exploiting taxonomic classifications • The leaves in phylogenetic trees may represent different taxonomic levels • A classification tree can allows us to locate trees that contain a taxon, as well as its descendants or ancestors. • E.g., a “Pinaceae" query would identify trees that contain “Pinus thunbergii” or “Abies alba”

  8. TreeBASE (Piel, Donoghue, & Sanderson, 1996)

  9. TreeBASE capabilities • Search by • taxon • author • citation • study accession number • matrix accession number • structure (topology) • Tree surfing

  10. TreeBASE limitations • Taxonomic name consistency • Querying • Few options • Does not exploit classification • Can’t identify ancestors/descendants • Visualization • Clustering and aggregation (supertrees)

  11. PhyloFinder • A search engine for tree databases • Not a database • Allows powerful phylogenetic queries • Handles synonymous taxonomic names (via TBMap) • Handles misspellings. • Exploits taxonomic classification • Offers precise options for identifying different types of subtrees and metrics for identifying similar trees. • Provides a visualization tool with links to GenBank and TBMap. • Fast • Efficient storage and filtering

  12. PhyloFinder Design • Uses simple but powerful techniques • Inverted index for filtering • Nested-set representation of trees • Least common ancestor queries directly on database • Off-the-shelf spell-checking technology • Can be used with anyphylogenetic database • E.g., PhyLoTA browser • However, set-up is not (yet) automatic

  13. Outline • Introduction • Queries • Storage and querying • Acknowledgements

  14. PhyloFinder Queries • Taxonomic queries involve a single taxon or set of taxa. • Phylogenetic queries take as input a phylogenetic tree • Locate trees that match it in some specified way.

  15. Taxonomic Queries • Contains: Given a list of taxa, return all trees that contain all or any of these names. • Similar to Boolean “AND” and “OR” searches. • Automatically searches for synonymous taxa • Related: Given a taxon, find all trees involving it or any of its descendants in the NCBI taxonomy. • E.g., if the query taxon is “birds", identify all trees that contain bird taxa. • Pathlength: Given a pair of taxa, return all trees containing them, along with the distance between them in each tree.

  16. Taxonomic Queries: Contains

  17. Phylogenetic Queries • Tree mining: Given a query tree Q, find the database trees that exhibit Qin some way. Options: • Return the trees that have Qas an embeddedsubtree. • Return the trees that refine Q. • Similarity: Given a query tree Q and a specified similarity measure, return trees in database ranked by decreasing similarity from Q. • Requires at least 3 taxon overlap

  18. Phylogenetic queries: Notation 1 • T(A) is the minimal subtree of T that contains the leaves in A. a b c d e f g

  19. Phylogenetic queries: Notation 1 • T(A) is the minimal subtree of T that contains the leaves in A. a b c d e f g

  20. Phylogenetic queries: Notation 1 • T(A) is the minimal subtree of T that contains the leaves in A. a b c d e f g

  21. Phylogenetic queries: Notation 2 • T|A is obtained from T(A) by suppressing all internal nodes that have only one child. a b c d e f g

  22. Phylogenetic queries • Let Q be a query tree with leaf set A. • Q is an embedded subtree of T if and only if it is identical to T|A. • Q is refined by T (T refines Q) if T|A is a refinement of Q.

  23. Phylogenetic queries

  24. Phylogenetic queries: Embedded

  25. Phylogenetic queries: Refined by

  26. Phylogenetic queries: Refined by Q embedded in T  Q refined by T

  27. Phylogenetic queries: Embedded

  28. Similarity queries • Return trees ranked by a similarity score • Score is a percentage between 0 and 100% reflecting how similar query tree is to candidate tree. • PhyloFinder’s similarity measures: • Robinson-Foulds (RF) similarity • Least common ancestor (LCA) similarity • Score takes degree of taxon overlap into account.

  29. Outline • Introduction • PhyloFinder queries • Implementation • Future directions • Acknowledgements

  30. System architecture

  31. Least Common Ancestors (LCAs) a b c d e f g

  32. Least Common Ancestors (LCAs) a b c d e f g LCA(b,e)

  33. Storage: Nested intervals (4,4) (5,5) (6,6) (8,8) (9,9) (10,10) b c e a d f • Ancestor/descendant relationship is easy to determine • The between predicate defines subtrees • LCAs are easily computed • Find common ancestor with largest Node_ID (Node_ID,RMD_ID) (3,5) (7,9) (2,9) (1,10)

  34. Cornus 2 4 8 16 32 64 128 Spigelia 1 2 3 5 8 13 21 34 Hedera 13 16 Storage: Inverted index • For each taxon, store a list of all trees that contain it. • Easy to find trees containing any or all elements in a list of taxa • Used as a filter

  35. Building the inverted index • Input trees:1: (((man,pan),gorilla),pongo),2. (((human, coprinus),cryptomonas),zea_mays), . . . ,N: (((dogs,homo_sapiens),pig),lambs) • Convert trees into lists of taxa:man pan gorilla pongo, human coprinus . . .  . . . • Synonymy preprocessing: Replace names by TBMap name clusters: tc1 tc2 tc3 tc4 , tc1 tc5 . . .  . . . • Build index consisting of (i) dictionary (mapping of taxon names to name clusters) and (ii) postings (lists of tree IDs). tc1 1 2 3 4 tc2 1

  36. Schema

  37. Query Processing: Outline Candidate trees: Q: Consult inverted index Results: Compare against Q using LCA queries

  38. Implementing Phylogenetic Queries • Idea: Use LCA queries to compare ancestor-descendant relationships in Q with those in T. • M(x) and M(y) have the same relationship in T as x and y have in Q  Q can be embedded in T. • Advantage: Database trees need not be read into main memory.

  39. Implementing Taxonomic Queries • Use Boolean (union/intersection) operations on the inverted index • Example: Querying for “birds" • Find all bird species in the database trees using the NCBI taxonomy tree. • Use inverted index to retrieves the tree ID lists for each bird species. • Return the union of these lists.

  40. Tree visualization

  41. Tree visualization • Other tree visualization tools are available: • Hillis, Heath, & St. John 2005. Syst. Biol. 54: 471-482. • Sanderson 2006. Bioinformatics 22: 1004-1006. • Zmasek & Eddy. 2001. Bioinformatics 17: 383-384. • We developed our own to • avoid plug-ins, and • easily highlight query results and provide outlinks to GenBank and TBMap.

  42. Spelling • Suggestions come from TreeBASE and NCBI • Uses GNU Aspell • Modified to handle special characters (`-', `&', '.') and compound words.

  43. Outline • Introduction • PhyloFinder queries • Implementation • Future directions • Acknowledgements

  44. Under construction • Unrooted trees • Supertree methods • MRP, MRF, MMC • Desktop version • Automatic update • Suggestions?

  45. Outline • Introduction • PhyloFinder queries • Implementation • Future directions • Acknowledgements

  46. Thanks to • Rod Page for TBMap • Bill Piel for TreeBASE data • Mike Sanderson • Oliver Eulenstein • National Science Foundation (grant EF-0334832)

More Related