mosaic n.
mosaic PowerPoint Presentation

Presentation Transcript

  1. mosaic: exploring reticulate protein family evolution. UQ, COMBIOAU, Brisbane, 02-03-09. Maetschke/Kassahn

  2. motivation. The problem: • evolution is complex (horizontal gene transfer, hybridization, genetic recombination, ...) • describing reticulate (non-tree-like) phylogenetic relationships as trees may be an oversimplification. Traditional methods: • phylogenetic tree inference gets increasingly complex and is not suitable • phylogenetic networks are even more complex and visualization is difficult. mosaic: • fast method to analyze and visualize (phylogenetic) sequence relationships • applied to identify and study non-tree-like protein families • aim: perform whole-proteome scans for reticulate proteins

  3. n-grams & dot plots • phylogenetics: alignment is the cornerstone • classical alignment may fail for reticulate proteins • "alignment free" methods • split the sequence into overlapping subsequences of length n, e.g. the 4-grams of MSKRRMSVGQQTW... are MSKR, SKRR, KRRM, RRMS, ... [Figure: sequences S1 (segments A, B) and S2 (segments B, A); 4-grams; n-gram dot plot]
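The splitting step and the dot plot on this slide can be sketched as follows. This is a minimal illustration, not the mosaic implementation: the function names `ngrams` and `dot_plot` and the toy sequences with two swapped segments are assumptions made for the example.

```python
def ngrams(seq, n=4):
    """Return the overlapping n-grams of seq, in order of occurrence."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


def dot_plot(seq1, seq2, n=4):
    """Return (i, j) pairs where the n-gram at position i of seq1
    equals the n-gram at position j of seq2 (the dots of the plot)."""
    # Index the n-gram positions of seq2 once, so seq1 is scanned only once.
    positions = {}
    for j, gram in enumerate(ngrams(seq2, n)):
        positions.setdefault(gram, []).append(j)
    return [(i, j)
            for i, gram in enumerate(ngrams(seq1, n))
            for j in positions.get(gram, [])]


print(ngrams("MSKRRMSVGQQTW")[:4])  # ['MSKR', 'SKRR', 'KRRM', 'RRMS']

# Hypothetical toy pair: the same two segments in swapped order.
a = "MSKRRM" + "QQVTQW"
b = "QQVTQW" + "MSKRRM"
print(dot_plot(a, b))  # two short diagonals, one per shared segment
```

The swapped segments show up as two separate diagonals, which is the case where a single classical alignment would have to give up one of the two matches.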

  4. some real n-gram dot plots • 4-grams are "unique" for a sequence • we talk about '4' later... [Figure panels: c=10, n=4; c=10, n=4; c=2, n=1] Example sequence: >AR_Pt MEVQLGLGRVYPRPPSKTYRGAFQNLFQSVREVIQNPGPRHPEAASAAPPGASLLLQQQQQQQQQQQQQQQQQQQQQQETSPRQQQQQGEDGSPQAHRRGPTGYLVLDEEQQPSQPQSAPECHPERGCVPEPGAAVAASKGLPQQLPAPPDEDDSAAPSTLSLLGPTFPGLSSCSADLKDILSEASTMQLLQQQQQEAVSEGSSSGRAREASGAPTSSKDNYLGGTSTISDSAKELCKAV...

  5. another n-gram dot plot: nuclear receptors • DBD: DNA-binding domain, two zinc finger motifs • LBD: ligand-binding domain • AF-1/AF-2: transcriptional activation domains [Figure: dot plot with DBD and LBD marked]

  6. n-gram sequence similarity s • S = set of n-grams, e.g. {AAGR, AGRK, GRKQ, ...} • given two sequences and their n-gram sets S1 and S2: s = |S1 ∩ S2| / m(|S1|, |S2|), where |S1 ∩ S2| is the number of shared n-grams and the normalizer m is max (global alignment) or min (local alignment) • s ∈ [0...1] • example: {AAG, AGQ, GQQ} ∩ {GQQ, QQQ} = {GQQ}
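A minimal sketch of this similarity, assuming s is the number of shared n-grams divided by either the larger or the smaller of the two set sizes (which is how the slide's "max: global alignment / min: local alignment" labels read). The names `ngram_set` and `ngram_similarity` and the `norm` parameter are illustrative, not taken from mosaic.

```python
def ngram_set(seq, n=4):
    """Return the set of overlapping n-grams of seq."""
    return {seq[i:i + n] for i in range(len(seq) - n + 1)}


def ngram_similarity(seq1, seq2, n=4, norm=max):
    """Shared n-grams divided by norm(|S1|, |S2|).

    norm=max penalises unmatched sequence (global-alignment-like),
    norm=min rewards a well-matched part (local-alignment-like).
    """
    s1, s2 = ngram_set(seq1, n), ngram_set(seq2, n)
    if not s1 or not s2:
        return 0.0  # a sequence shorter than n has no n-grams
    return len(s1 & s2) / norm(len(s1), len(s2))


# The slide's 3-gram example: {AAG, AGQ, GQQ} and {GQQ, QQQ} share {GQQ}.
print(ngram_similarity("AAGQQ", "GQQQ", n=3, norm=max))  # 1/3
print(ngram_similarity("AAGQQ", "GQQQ", n=3, norm=min))  # 0.5
```

Because both sets are hashed, the cost is linear in the set sizes, and swapping two conserved segments changes only the few n-grams that span the junction.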

  7. n-gram similarity • fast: linear wrt. size of n-gram sets (classical alignment is quadratic wrt. sequence length) • easy to interpret (0.5 = half of the n-grams are shared) • no parameters (gap penalty, gap extension penalty, ...) • can deal with shuffling of conserved segments and other "strange" cases (are they actually strange?) • better or worse than BLAST/FASTA? Who knows? (Hoehl 2008: alignment-free methods can be as good as classical alignment for inferring phylogeny; Edgar 2004: MUSCLE, an n-gram-based alignment method)

  8. why 4 and not 42 • Hoehl 2008: n = 3...5 • correlation between n-gram sequence similarity and species divergence times • standard deviation of sequence similarities • maximum AUC when distinguishing related from randomly shuffled sequences [Figure: MR, r=0.93; n = 4]

  9. phylogenetic networks • different node and edge types • identification of reticulate events (e.g. recombination) is error-prone • computationally expensive • larger networks become messy [Figure examples: T-Rex (Makarenkov et al. 2001); NeighborNet/SplitsTree (Bryant et al. 2004, Huson et al. 1998); Newick (Cardona et al. 2008)]

  10. larger networks - example [Figures: Huson et al. 2005; Bryant et al. 2004]

  11. spring layout graph = ridiculogram • layout dependent • distorted distances • random initialization • local minima • slow [Figure: nuclear receptors GR, MR, PR, AR]

  12. mosaic plot • point size is similarity • no distortions • no random initialization • preserves full information • automatic clustering (spectral rearrangement) • no hard decision about number of clusters

  13. spectral clustering • Affinity matrix: A = exp(-(1-S)**2/sig), with sij: n-gram similarity between sequences i and j, σ: defines the neighborhood radius • "Degree" matrix: D = diag(A.sum(axis=0)) • Laplacian matrix: L = D-A • eigenvector decomposition: e,v = eigh(L), with e: eigenvalues, v: eigenvectors • v2: eigenvector for the 2nd smallest eigenvalue (Fiedler vector), indicates clusters and how well they are separated
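The four lines of slide code can be made runnable roughly as below. This is a sketch under assumptions: `eigh` is taken to be NumPy's `numpy.linalg.eigh`, the wrapper `fiedler_order` is a name made up here, and the 4x4 similarity matrix and `sig=0.25` are toy values, not data from the talk.

```python
import numpy as np


def fiedler_order(S, sig):
    """Return (order, v2): sequence indices sorted by the Fiedler vector.

    S   : symmetric matrix of pairwise n-gram similarities in [0, 1]
    sig : neighborhood radius (sigma on the slide)
    """
    S = np.asarray(S, dtype=float)
    A = np.exp(-(1.0 - S) ** 2 / sig)  # affinity matrix
    D = np.diag(A.sum(axis=0))         # "degree" matrix
    L = D - A                          # graph Laplacian
    e, v = np.linalg.eigh(L)           # eigenvalues come back in ascending order
    v2 = v[:, 1]                       # eigenvector of the 2nd smallest eigenvalue
    return np.argsort(v2), v2


# Toy similarities: sequences 0 and 2 are alike, as are 1 and 3.
S_demo = [[1.0, 0.1, 0.9, 0.1],
          [0.1, 1.0, 0.1, 0.8],
          [0.9, 0.1, 1.0, 0.1],
          [0.1, 0.8, 0.1, 1.0]]
order, v2 = fiedler_order(S_demo, sig=0.25)
print(order)  # the pairs {0, 2} and {1, 3} end up next to each other
print(v2)     # the sign of each entry indicates the cluster
```

Reordering the rows and columns of S by `order` is one way to read "spectral rearrangement": similar sequences become neighbors in the matrix plot without fixing the number of clusters in advance.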

  14. spectral rearrangement

  15. recursive spectral rearrangement

  16. spectral clustering • takes "global" properties into account • fast and scales well • no random initialization => single run • global minimum => single, unique solution • few parameters: L, σ (σ <= mean of distance matrix) • "better" than k-means (works for non-spherical clusters) or single-linkage hierarchical clustering (no chaining problem) • clustering is NP-hard and spectral clustering is "just another approximation" • recursive spectral clustering to improve cluster quality
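The slides do not spell out how the recursion works. One plausible reading, shown here purely as an assumption, is to split the sequences by the sign of the Fiedler vector and then reorder each half in the same way. The names `fiedler_vector` and `recursive_order`, the `min_size` cutoff, and the toy affinity matrix are all hypothetical.

```python
import numpy as np


def fiedler_vector(A):
    """Eigenvector of the 2nd smallest eigenvalue of the Laplacian of A."""
    L = np.diag(A.sum(axis=0)) - A
    _, v = np.linalg.eigh(L)  # eigenvalues in ascending order
    return v[:, 1]


def recursive_order(A, idx=None, min_size=3):
    """Order indices by recursively bisecting on the sign of the Fiedler vector.

    Assumed scheme, not taken from the talk: each group is split in two
    and both parts are reordered by the same procedure.
    """
    A = np.asarray(A, dtype=float)
    if idx is None:
        idx = np.arange(A.shape[0])
    if len(idx) < min_size:
        return [int(i) for i in idx]
    v2 = fiedler_vector(A[np.ix_(idx, idx)])
    left, right = idx[v2 < 0], idx[v2 >= 0]
    if len(left) == 0 or len(right) == 0:
        # No split possible: fall back to a plain sort by v2.
        return [int(i) for i in idx[np.argsort(v2)]]
    return recursive_order(A, left, min_size) + recursive_order(A, right, min_size)


# Toy affinity matrix: two interleaved groups, {0, 2, 4} and {1, 3, 5}.
A_demo = np.full((6, 6), 0.05)
for group in ([0, 2, 4], [1, 3, 5]):
    A_demo[np.ix_(group, group)] = 0.9
np.fill_diagonal(A_demo, 1.0)
print(recursive_order(A_demo))  # each group ends up contiguous
```

Each level only needs the submatrix of the current group, so the recursion reuses the same affinity matrix and stays deterministic, in line with the "no random initialization" point above.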

  17. mosaic - demo

  18. the end • fast technique to visualize/analyze reticulate protein family evolution • matrix representation • spectral clustering • n-gram similarity • many other applications Perl free!

  19. questions?

  20. SCOP • five families • randomly selected

  21. Nuclear receptors [Figure panels: ligand-binding domain; N-terminal section; zinc-finger domain]

  22. mosaic - examples

  23. Full-length sequence: MrBayes v3.1.2, 10^6 generations, 4 chains, 240 CPU-hrs [Tree labels: GR, MR, PR, AR]

  24. Zinc finger domain: MrBayes v3.1.2, 10^6 generations, 4 chains, 9 CPU-hrs [Tree labels: AR, GR, MR, PR]

  25. Ligand-binding domain: MrBayes v3.1.2, 10^6 generations, 4 chains, 27 CPU-hrs [Tree labels: PR, AR, MR, GR]

  26. Upstream region: MrBayes v3.1.2, 10^6 generations, 4 chains, 87 CPU-hrs [Tree: ?]

  27. quality q • given two sequences and their n-gram dot plot • diag = set of dot sums along diagonals • n = length of sequence • max: global alignment, min: local alignment • q ∈ [0...1]
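The formula for q itself is not preserved in this transcript, so the sketch below only covers the part that is stated: the dot sums along the diagonals of an n-gram dot plot. The function name `diagonal_sums` and the toy sequences are assumptions, and how these sums are normalised into q is left open.

```python
from collections import Counter


def diagonal_sums(seq1, seq2, n=4):
    """Count the dots on each diagonal of the n-gram dot plot.

    A dot at (i, j) means the n-gram at position i of seq1 equals the
    n-gram at position j of seq2; its diagonal is the offset j - i.
    Returns a dict mapping offset -> number of dots on that diagonal.
    """
    positions = {}
    for j in range(len(seq2) - n + 1):
        positions.setdefault(seq2[j:j + n], []).append(j)
    sums = Counter()
    for i in range(len(seq1) - n + 1):
        for j in positions.get(seq1[i:i + n], []):
            sums[j - i] += 1
    return dict(sums)


# Hypothetical toy pair with two swapped segments: two diagonals of 3 dots each.
print(diagonal_sums("MSKRRMQQVTQW", "QQVTQWMSKRRM"))
# Identical sequences put every dot on the main diagonal (offset 0).
print(diagonal_sums("MSKRRM", "MSKRRM"))
```

A single dominant diagonal suggests that one alignment explains most of the shared n-grams, while several diagonals of similar weight point to rearranged segments.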

  28. q over s

  29. q-spectrum

  30. n-gram dot plots