
Jakub Koperwas Krzysztof Walczak Institute of Computer Science Warsaw University of Technology

Phylogenetic trees dissimilarity measure based on strict frequent splits set and its application for clustering.



  1. Phylogenetic trees dissimilarity measure based on strict frequent splits set and its application for clustering. Jakub Koperwas, Krzysztof Walczak. Institute of Computer Science, Warsaw University of Technology

  2. Agenda • Background • Introduction to phylogenetic trees • Split representation and consensus methods • Frequent split set representation • Motivation • Definition • Interpretation • Frequent split set based dissimilarity measure • Clustering • Motivation • Results

  3. Phylogenetic Tree [figure: a tree with leaves species1 to species6; internal nodes are labelled "ancestor" and "ancestor?"]

  4. Tree Representation [figure: a tree with leaves a to f] Splits: a|bcdef, b|acdef, c|abdef, d|abcef, e|abcdf, f|abcde, ab|cdef, abc|def, abcd|ef

  5. Robinson-Foulds Distance [figure: trees T1 and T2 on leaves a to f] Splits for tree T1: a|bcdef, b|acdef, c|abdef, d|abcef, e|abcdf, f|abcde, ab|cdef, abc|def, abcd|ef. Splits for tree T2: a|bcdef, b|acdef, c|abdef, d|abcef, e|abcdf, f|abcde, ab|cdef, abc|def, abce|fd. Uncommon splits: abcd|ef, abce|fd
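
The comparison on this slide can be sketched in a few lines of Python. This is only an illustration: the helper names (`parse_split`, `parse_splits`, `robinson_foulds`) are hypothetical, leaves are assumed to be single characters, and the distance is taken as the plain count of splits that occur in exactly one tree (some authors halve this value).

```python
def parse_split(text):
    """Turn 'ab|cdef' into an unordered pair of leaf sets (single-character leaves assumed)."""
    left, right = text.split("|")
    return frozenset((frozenset(left), frozenset(right)))


def parse_splits(text):
    """Parse a comma-separated list of splits into a set of splits."""
    return {parse_split(part.strip()) for part in text.split(",") if part.strip()}


def robinson_foulds(splits_a, splits_b):
    """Count the splits that occur in exactly one of the two trees."""
    return len(splits_a ^ splits_b)


# Split sets copied from the slide.
T1 = parse_splits("a|bcdef, b|acdef, c|abdef, d|abcef, e|abcdf, f|abcde, ab|cdef, abc|def, abcd|ef")
T2 = parse_splits("a|bcdef, b|acdef, c|abdef, d|abcef, e|abcdf, f|abcde, ab|cdef, abc|def, abce|fd")

print(robinson_foulds(T1, T2))  # 2: abcd|ef and abce|fd
```

Storing each split as an unordered pair means ab|cdef and cdef|ab compare equal, which is why abce|fd needs no special handling.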

  6. Common Information Extraction • Consensus Methods • Strict Consensus Tree • Majority-rule Consensus Tree • Many others (Aho, Adams, …) • Maximum Agreement Subtree • Maximum Compatible Tree • Many others

  7. Strict Consensus Tree [figure: trees T1 and T2 and their strict consensus tree Tc(T1,T2) on leaves a to f] Splits for tree T1: a|bcdef, b|acdef, c|abdef, d|abcef, e|abcdf, f|abcde, ab|cdef, abc|def, abcd|ef. Splits for tree T2: a|bcdef, b|acdef, c|abdef, d|abcef, e|abcdf, f|abcde, ab|cdef, abc|def, abce|fd. The common splits: a|bcdef, b|acdef, c|abdef, d|abcef, e|abcdf, f|abcde, ab|cdef, abc|def
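
In split terms, the strict consensus on this slide is the intersection of the trees' split sets. A minimal sketch, assuming single-character leaves and a non-empty profile; the function names are illustrative and not taken from the talk:

```python
def parse_splits(text):
    """Parse 'a|bcdef, ab|cdef, ...' into a set of unordered splits."""
    result = set()
    for part in text.split(","):
        left, right = part.strip().split("|")
        result.add(frozenset((frozenset(left), frozenset(right))))
    return result


def strict_consensus_splits(profile):
    """Keep only the splits shared by every tree in a non-empty profile."""
    return set.intersection(*profile)


# Split sets copied from the slide.
T1 = parse_splits("a|bcdef, b|acdef, c|abdef, d|abcef, e|abcdf, f|abcde, ab|cdef, abc|def, abcd|ef")
T2 = parse_splits("a|bcdef, b|acdef, c|abdef, d|abcef, e|abcdf, f|abcde, ab|cdef, abc|def, abce|fd")

common = strict_consensus_splits([T1, T2])
print(len(common))  # 8 common splits; abcd|ef and abce|fd are dropped
```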

  8. Why is the consensus tree (CT) not good enough? abc|defgi ?=? abc|defgj

  9. Subsplit • abc|defg ⊆ abc|defgx (is a subsplit of), because abc ⊆ abc and defg ⊆ defgx • abc|defg ⊆ abc|defgy (is a subsplit of), because abc ⊆ abc and defg ⊆ defgy
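
The subsplit relation can be written as two subset tests. The sketch below assumes single-character leaves and, as an assumption beyond the slide, treats a split as unordered, so both orientations are checked; `parse_split` and `is_subsplit` are hypothetical names.

```python
def parse_split(text):
    """Turn 'abc|defg' into a pair of leaf sets."""
    left, right = text.split("|")
    return frozenset(left), frozenset(right)


def is_subsplit(small, big):
    """True if each side of `small` is contained in a distinct side of `big`."""
    a, b = small
    c, d = big
    # Assumption: the sides of a split are unordered, so try both orientations.
    return (a <= c and b <= d) or (a <= d and b <= c)


print(is_subsplit(parse_split("abc|defg"), parse_split("abc|defgx")))   # True
print(is_subsplit(parse_split("abc|defgi"), parse_split("abc|defgj")))  # False
```

The second call mirrors the previous slide: abc|defgi and abc|defgj are not subsplits of each other, yet both contain abc|defg.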

  10. Frequent Subsplit • A frequent subsplit in a profile of trees is a split that is a subsplit of at least one split in at least minsup of the trees. Minsup=100% T1: cd|abefghi, bcd|aefghi, abcd|efghi, hi|abcdefg, ghi|abcdef, fghi|abcde. T2: bc|adefghj, abc|defghj, abcd|efghj, hj|abcdefg, ghj|abcdef, fghj|abcde. Frequent subsplit: bc|aefgh, but also bc|aefg, bc|aef, bc|ae, …
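
The definition above can be checked directly: count the trees that contain at least one supersplit of the candidate and compare the fraction with minsup. This is a sketch of the definition only, not the authors' generation algorithm; the names `support` and `is_frequent`, single-character leaves, and unordered split sides are assumptions.

```python
def parse_split(text):
    """Turn 'bc|aefgh' into a pair of leaf sets."""
    left, right = text.split("|")
    return frozenset(left), frozenset(right)


def parse_tree(text):
    """Parse a comma-separated list of splits into a list of splits."""
    return [parse_split(part.strip()) for part in text.split(",") if part.strip()]


def is_subsplit(small, big):
    """Each side of `small` is contained in a distinct side of `big`."""
    a, b = small
    c, d = big
    return (a <= c and b <= d) or (a <= d and b <= c)


def support(candidate, profile):
    """Fraction of trees that have at least one split containing `candidate`."""
    hits = sum(any(is_subsplit(candidate, s) for s in tree) for tree in profile)
    return hits / len(profile)


def is_frequent(candidate, profile, minsup=1.0):
    """Apply the slide's definition with minsup given as a fraction (1.0 = 100%)."""
    return support(candidate, profile) >= minsup


# Profile copied from the slide.
T1 = parse_tree("cd|abefghi, bcd|aefghi, abcd|efghi, hi|abcdefg, ghi|abcdef, fghi|abcde")
T2 = parse_tree("bc|adefghj, abc|defghj, abcd|efghj, hj|abcdefg, ghj|abcdef, fghj|abcde")
profile = [T1, T2]

print(is_frequent(parse_split("bc|aefgh"), profile))  # True
print(support(parse_split("abcd|efghi"), profile))    # 0.5, leaf i is missing from T2
```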

  11. Representative splitset • Representative splitset: the set of maximal frequent subsplits, i.e. those frequent subsplits s for which there is no other frequent subsplit s2 that is also a supersplit of s. Minsup=100% (strict) T1: cd|abefghi, bcd|aefghi, abcd|efghi, hi|abcdefg, ghi|abcdef, fghi|abcde. T2: bc|adefghj, abc|defghj, abcd|efghj, hj|abcdefg, ghj|abcdef, fghj|abcde. RS: abcd|efgh, gh|abcdef, fgh|abcde, bc|aefgh
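
The maximality condition can be illustrated on a hand-picked list of frequent subsplits. The sketch below only filters a given candidate set; it does not enumerate all frequent subsplits, so it is not a full construction of RS. The function names are hypothetical, and single-character leaves with unordered split sides are assumed.

```python
def parse_split(text):
    """Turn 'bc|aefgh' into an unordered pair of leaf sets."""
    left, right = text.split("|")
    return frozenset((frozenset(left), frozenset(right)))


def is_subsplit(small, big):
    """Each side of `small` is contained in a distinct side of `big`."""
    a, b = small
    c, d = big
    return (a <= c and b <= d) or (a <= d and b <= c)


def maximal_splits(candidates):
    """Drop every candidate that is a proper subsplit of another candidate."""
    return {
        s for s in candidates
        if not any(s != t and is_subsplit(s, t) for t in candidates)
    }


# Frequent subsplits named on the slides (a hand-picked subset, not the full set).
candidates = {parse_split(t) for t in ["bc|aefgh", "bc|aefg", "bc|aef", "bc|ae", "abcd|efgh"]}

kept = maximal_splits(candidates)
print(len(kept))  # 2: bc|aefgh and abcd|efgh survive
```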

  12. Frequent split-set interpretation • Property 1. For each distinct leafset z from the frequent splitset (FS) with a support greater than 50%, a tree can be built. The tree is built on those splits from FS whose leafset is a superset of z. Therefore the frequent splitset (minsup>50%) can be represented as a set of trees. In particular, this applies to the strict and majority-rule frequent sets. • Property 2. Each split from the frequent splitset discussed above will occur in at least one tree, in a restricted form.

  13. Frequent split-set interpretation • Property 3. Properties 1 and 2 are also true for a tree based on the intersection of all the distinct leafsets from the frequent split-set. • Property 4. The set of trees resulting from the frequent splitset will also contain a consensus tree, provided that the input trees were all built on the same leafset.

  14. Example

  15. Clustering motivation • Phylogenetic tree reconstruction methods may produce many candidate trees • It is hard to apply consensus methods to obtain one tree from a profile of hundreds of trees • Clustering helps to designate a small number of candidate trees from a large number of trees

  16. Dissimilarity Measure Example

  17. Information in Tree [figure: trees on leaves a to e; the layout is not recoverable from the transcript]

  18. Information Loss • Cluster Information Loss – the amount of information that will be lost while replacing the cluster of trees with one representative tree • Clustering Information Loss – the amount of information that will be lost while replacing the input profile of trees with k representative trees

  19. Agglomerative clustering • Typical Merging Strategies: • Single linkage • Complete linkage • Average Linkage • Our Merging Strategy: minimize information loss after merging • For SFS as Representative Tree:

  20. Results (camp)

  21. Results (DT)

  22. Agg-inf vs others (camp)

  23. Future Work • More efficient FS generation algorithm • Frequency-based clustering algorithm

  24. Thank You
