
mosaic

mosaic: exploring reticulate protein family evolution. UQ, COMBIO AU, Brisbane, 02-03-09. Maetschke/Kassahn.


Presentation Transcript


  1. mosaic: exploring reticulate protein family evolution. UQ, COMBIO AU, Brisbane, 02-03-09. Maetschke/Kassahn

  2. the problem
     motivation
     • evolution is complex (horizontal gene transfer, hybridization, genetic recombination, ...)
     • describing reticulate (non-tree like) phylogenetic relationships as trees may be an oversimplification
     traditional methods
     • phylogenetic tree inference gets increasingly complex and is not suitable for reticulate families
     • phylogenetic networks are even more complex and visualization is difficult
     mosaic
     • fast method to analyze and visualize (phylogenetic) sequence relationships
     • applied to identify and study non-tree like protein families
     • aim to perform whole proteome scans for reticulate proteins

  3. n-grams & dot plots
     • "alignment free" methods
     • split the sequence into overlapping subsequences of length n
     • phylogenetics: alignment is the cornerstone
     • classical alignment may fail for reticulate proteins
     [Figure: sequence S1 with segments A, B and sequence S2 with segments B, A; the 4-grams MSKR, SKRR, KRRM, RRMS, ... of the sequence MSKRRMSVGQQTW...; an n-gram dot plot]
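The n-gram splitting and the dot plot described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `ngrams` and `dot_plot` are hypothetical, and the dot plot is represented simply as a set of (i, j) index pairs where the n-gram starting at position i of the first sequence equals the n-gram starting at position j of the second.

```python
def ngrams(seq, n=4):
    """Return the overlapping n-grams of seq, ordered by start position."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


def dot_plot(seq1, seq2, n=4):
    """Return the set of (i, j) pairs whose n-grams are identical.

    i indexes n-gram start positions in seq1, j those in seq2.
    (Hypothetical helper; a real tool would draw these points.)
    """
    # Index the n-grams of seq2 by content so each lookup is O(1).
    positions = {}
    for j, gram in enumerate(ngrams(seq2, n)):
        positions.setdefault(gram, []).append(j)
    return {(i, j)
            for i, gram in enumerate(ngrams(seq1, n))
            for j in positions.get(gram, [])}


print(ngrams("MSKRRMS"))
# Two sequences with swapped segments still share n-grams,
# which show up as off-diagonal dots.
print(sorted(dot_plot("MSKRQVTW", "QVTWMSKR")))
```

Swapped segments appear as short diagonals away from the main diagonal, which is the situation where a classical alignment may struggle.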

  4. some real n-gram dot plots
     • 4-grams are "unique" for a sequence
     • we talk about '4' later...
     [Figure: three dot plots with panel labels c=10, n=4; c=10, n=4; c=2, n=1]
     >AR_Pt MEVQLGLGRVYPRPPSKTYRGAFQNLFQSVREVIQNPGPRHPEAASAAPPGASLLLQQQQQQQQQQQQQQQQQQQQQQETSPRQQQQQGEDGSPQAHRRGPTGYLVLDEEQQPSQPQSAPECHPERGCVPEPGAAVAASKGLPQQLPAPPDEDDSAAPSTLSLLGPTFPGLSSCSADLKDILSEASTMQLLQQQQQEAVSEGSSSGRAREASGAPTSSKDNYLGGTSTISDSAKELCKAV...

  5. another n-gram dot plot: nuclear receptors
     • DBD: DNA binding, two zinc finger motifs
     • LBD: Ligand binding domain
     • AF-1/AF-2: Transcriptional activation domains
     [Figure: n-gram dot plot with the DBD and LBD regions labelled]

  6. n-gram sequence similarity s
     given two sequences and their n-gram sets S1 and S2:
     S = set of n-grams, e.g. {AAGR, AGRK, GRKQ, ...}
     number of shared n-grams, e.g. {AAG, AGQ, GQQ} ∩ {GQQ, QQQ} = {GQQ}
     max: global alignment, min: local alignment
     s ∈ [0...1]
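The formula itself did not survive in the transcript. Judging from the slide's labels ("number of shared n-grams", "max: global alignment", "min: local alignment", s in [0...1]), a plausible reading is s = |S1 ∩ S2| / max(|S1|, |S2|) for the global variant and the same with min for the local variant. The sketch below implements that assumed reading; the function names and the `mode` parameter are hypothetical.

```python
def ngram_set(seq, n=4):
    """Return the set of overlapping n-grams of seq."""
    return {seq[i:i + n] for i in range(len(seq) - n + 1)}


def ngram_similarity(seq1, seq2, n=4, mode="max"):
    """Shared n-grams divided by the larger ("max") or smaller ("min") set size.

    Assumed formula, reconstructed from the slide labels:
    "max" behaves like a global comparison, "min" like a local one.
    """
    s1, s2 = ngram_set(seq1, n), ngram_set(seq2, n)
    if not s1 or not s2:
        return 0.0  # a sequence shorter than n has no n-grams
    size = max(len(s1), len(s2)) if mode == "max" else min(len(s1), len(s2))
    return len(s1 & s2) / size


# The slide's 3-gram example: {AAG, AGQ, GQQ} and {GQQ, QQQ} share {GQQ}.
print(ngram_similarity("AAGQQ", "GQQQ", n=3, mode="max"))
print(ngram_similarity("AAGQQ", "GQQQ", n=3, mode="min"))
```

Under this reading, the "min" variant reaches 1 whenever the shorter sequence's n-grams are all contained in the longer one, which is why it resembles a local alignment score.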

  7. n-gram similarity
     • fast: linear wrt. size of n-gram sets (classical alignment is quadratic wrt. sequence length)
     • easy to interpret (0.5 = half of the n-grams are shared)
     • no parameters (gap penalty, gap extension penalty, ...)
     • can deal with shuffling of conserved segments and other "strange" cases (Are they actually strange?)
     • better or worse than BLAST/FASTA? Who knows? (Hoehl 2008: alignment free can be as good as classical alignment for inference of phylogeny; Edgar 2004: MUSCLE, an n-gram based alignment method)

  8. why 4 and not 42
     • Hoehl 2008: n = 3...5
     • correlation between n-gram sequence similarity and species divergence times
     • standard deviation of sequence similarities
     • maximum AUC when distinguishing related and randomly shuffled sequences
     [Figure: plot labelled MR, r=0.93]

  9. phylogenetic networks
     • different node and edge types
     • identification of reticulate events (e.g. recombination) is error prone
     • computationally expensive
     • larger networks become messy
     [Figure: example networks from T-Rex (Makarenkov et al. 2001), NeighborNet/SplitsTree (Bryant et al. 2004, Huson et al. 1998) and Newick (Cardona et al. 2008)]

  10. larger networks - example (Huson et al. 2005, Bryant et al. 2004)

  11. spring layout graph = ridiculogram
      nuclear receptors (GR, MR, PR, AR)
      • layout dependent
      • distorted distances
      • random initialization
      • local minima
      • slow

  12. mosaic plot
      • point size is similarity
      • no distortions
      • no random initialization
      • preserve full information
      • automatic clustering (spectral rearrangement)
      • no hard decision about number of clusters

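  13. spectral clustering
      Affinity matrix A; s_ij: n-gram similarity between sequences i and j; σ: defines neighborhood radius
      "Degree" matrix D
      Laplacian matrix L
      eigenvector decomposition; e: eigenvalues, v: eigenvectors
      v2: eigenvector for 2nd smallest eigenvalue (Fiedler vector), indicates clusters and how well they are separated

          from numpy import exp, diag
          from numpy.linalg import eigh
          # S: matrix of pairwise n-gram similarities (numpy array), sig: the σ above
          A = exp(-(1-S)**2/sig)     # affinity matrix
          D = diag(A.sum(axis=0))    # "degree" matrix
          L = D-A                    # Laplacian matrix
          e,v = eigh(L)              # eigenvalues (ascending) and eigenvectors (columns of v)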
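The four NumPy lines on the spectral clustering slide stop at the eigendecomposition. A minimal sketch of how the Fiedler vector could then be used to reorder a similarity matrix is given below. The helper name `fiedler_order`, the default `sig=0.5` and the toy matrix are assumptions for illustration only; the slides do not show how mosaic performs the rearrangement in detail.

```python
import numpy as np


def fiedler_order(S, sig=0.5):
    """Order sequences by their entry in the Fiedler vector.

    S   : symmetric matrix of pairwise similarities in [0, 1]
    sig : neighborhood radius (0.5 is an arbitrary illustrative value)
    """
    S = np.asarray(S, dtype=float)
    A = np.exp(-(1.0 - S) ** 2 / sig)   # affinity matrix
    D = np.diag(A.sum(axis=0))          # "degree" matrix
    L = D - A                           # Laplacian matrix
    e, v = np.linalg.eigh(L)            # eigenvalues in ascending order
    fiedler = v[:, 1]                   # eigenvector of 2nd smallest eigenvalue
    return np.argsort(fiedler), fiedler


# Toy similarity matrix: sequences 0 and 2 are similar, as are 1 and 3.
S = np.array([[1.0, 0.1, 0.9, 0.1],
              [0.1, 1.0, 0.1, 0.8],
              [0.9, 0.1, 1.0, 0.1],
              [0.1, 0.8, 0.1, 1.0]])
order, fiedler = fiedler_order(S)
S_rearranged = S[np.ix_(order, order)]  # similar sequences become adjacent
print(order)
print(S_rearranged)
```

Sorting by the Fiedler vector places sequences from the same cluster next to each other, so the rearranged matrix shows blocks along the diagonal. The sign of an eigenvector is arbitrary, so the order may come out reversed between runs on different platforms.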

  14. spectral rearrangement

  15. recursive spectral rearrangement

  16. spectral clustering
      • takes "global" properties into account
      • fast and scales well
      • no random initialization => single run
      • global minimum => single, unique solution
      • few parameters: L, σ (σ <= mean of distance matrix)
      • "better" than k-means (works for non-spherical clusters) or single linkage hierarchical clustering (no chaining problem)
      • clustering is NP-hard and spectral clustering is "just another approximation"
      • recursive spectral clustering to improve cluster quality

  17. mosaic - demo

  18. the end
      • fast technique to visualize/analyze reticulate protein family evolution
      • matrix representation
      • spectral clustering
      • n-gram similarity
      • many other applications
      Perl free!

  19. questions?

  20. SCOP • SCOP • five families • randomly selected

  21. Nuclear receptors: ligand binding domain, N-terminal section, zinc-finger domain

  22. mosaic - examples

  23. Full length sequence: MrBayes v3.1.2, 10^6 generations, 4 chains, 240 CPU-hrs [tree labels: GR, MR, PR, AR]

  24. Zinc finger domain: MrBayes v3.1.2, 10^6 generations, 4 chains, 9 CPU-hrs [tree labels: AR, GR, MR, PR]

  25. Ligand-binding domain: MrBayes v3.1.2, 10^6 generations, 4 chains, 27 CPU-hrs [tree labels: PR, AR, MR, GR]

  26. Upstream region: MrBayes v3.1.2, 10^6 generations, 4 chains, 87 CPU-hrs [tree: ?]

  27. quality q
      given two sequences and their n-gram dot plot:
      diag = set of dot sums along diagonals
      n = length of sequence
      max: global alignment, min: local alignment
      q ∈ [0...1]

  28. q over s

  29. q-spectrum

  30. n-gram dot plots
