mosaic: exploring reticulate protein family evolution. UQ, COMBIO AU, Brisbane, 02-03-09. Maetschke/Kassahn.
traditional methods
• phylogenetic tree inference gets increasingly complex and is not suitable
• phylogenetic networks are even more complex, and visualization is difficult

mosaic
• fast method to analyze and visualize (phylogenetic) sequence relationships
• applied to identify and study non-tree-like protein families
• aim: perform whole-proteome scans for reticulate proteins

motivation
• evolution is complex (horizontal gene transfer, hybridization, genetic recombination, ...)
• describing reticulate (non-tree-like) phylogenetic relationships as trees may be an oversimplification

the problem
n-grams & dot plots
• "alignment-free" methods
• split the sequence into overlapping subsequences of length n
• phylogenetics: alignment is the cornerstone
• classical alignment may fail for reticulate proteins
[Figure: sequence S1 with segments A, B and sequence S2 with segments B, A; 4-grams of MSKRRMSVGQQTW...: MSKR, SKRR, KRRM, RRMS, ...; n-gram dot plot]
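The n-gram split and the dot plot described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `ngrams`, `dot_plot` and `render` are hypothetical, and the plot is kept as a plain 0/1 matrix with one row per n-gram of the first sequence and one column per n-gram of the second.

```python
def ngrams(seq, n=4):
    """Return the overlapping n-grams of seq, in order of occurrence."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


def dot_plot(seq1, seq2, n=4):
    """Return a 0/1 matrix: entry (i, j) is 1 if n-gram i of seq1
    equals n-gram j of seq2 (hypothetical helper, not from the slides)."""
    grams1, grams2 = ngrams(seq1, n), ngrams(seq2, n)
    return [[int(g1 == g2) for g2 in grams2] for g1 in grams1]


def render(plot):
    """Draw the dot plot as text: '#' for a shared n-gram, '.' otherwise."""
    return "\n".join("".join("#" if cell else "." for cell in row)
                     for row in plot)


if __name__ == "__main__":
    print(ngrams("MSKRRMS"))
    # Two toy sequences whose segments appear in swapped order.
    print(render(dot_plot("MSKRRMSVGQ", "VGQQMSKRRM")))
```

Conserved segments show up as diagonal runs of dots, and a segment that moved to a different position simply produces a diagonal at a different offset, which is why no alignment is needed.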
some real n-gram dot plots
• 4-grams are "unique" for a sequence
• we talk about '4' later...
[Figure: dot plot panels labeled c=10, n=4; c=10, n=4; c=2, n=1]
>AR_Pt
MEVQLGLGRVYPRPPSKTYRGAFQNLFQSVREVIQNPGPRHPEAASAAPPGASLLLQQQQQQQQQQQQQQQQQQQQQQETSPRQQQQQGEDGSPQAHRRGPTGYLVLDEEQQPSQPQSAPECHPERGCVPEPGAAVAASKGLPQQLPAPPDEDDSAAPSTLSLLGPTFPGLSSCSADLKDILSEASTMQLLQQQQQEAVSEGSSSGRAREASGAPTSSKDNYLGGTSTISDSAKELCKAV...
nuclear receptors
• DBD: DNA binding, two zinc finger motifs
• LBD: ligand-binding domain
• AF-1/AF-2: transcriptional activation domains
[Figure: domain layout with DBD and LBD]

another n-gram dot plot
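n-gram sequence similarity s
given two sequences and their n-gram sets S1 and S2:

$s = \frac{|S_1 \cap S_2|}{\max(|S_1|, |S_2|)}, \quad s \in [0, 1]$

• S = set of n-grams, e.g. {AAGR, AGRK, GRKQ, ...}
• numerator: number of shared n-grams, e.g. {AAG, AGQ, GQQ} ∩ {GQQ, QQQ} = {GQQ}
• denominator: max corresponds to global alignment, min (in place of max) to local alignment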
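A small sketch of this similarity, assuming it is the number of shared n-grams divided by the size of the larger set (max, global-alignment-like) or of the smaller set (min, local-alignment-like). The names `ngram_set` and `ngram_similarity` and the `norm` parameter are hypothetical, chosen for illustration only.

```python
def ngram_set(seq, n=4):
    """Set of overlapping n-grams of seq."""
    return {seq[i:i + n] for i in range(len(seq) - n + 1)}


def ngram_similarity(seq1, seq2, n=4, norm=max):
    """Shared n-grams divided by norm(|S1|, |S2|).

    norm=max gives the global-alignment-like score,
    norm=min the local-alignment-like score (assumed reading of the slide).
    """
    s1, s2 = ngram_set(seq1, n), ngram_set(seq2, n)
    denom = norm(len(s1), len(s2))
    if denom == 0:  # a sequence shorter than n has no n-grams
        return 0.0
    return len(s1 & s2) / denom


if __name__ == "__main__":
    # The slide's example sets: {AAG, AGQ, GQQ} and {GQQ, QQQ} share {GQQ}.
    print(ngram_similarity("AAGQQ", "GQQQ", n=3, norm=max))
    print(ngram_similarity("AAGQQ", "GQQQ", n=3, norm=min))
```

Because both n-gram sets are built in one pass and intersected with hash lookups, the cost grows linearly with the set sizes rather than quadratically with sequence length.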
n-gram similarity
• fast: linear w.r.t. size of n-gram sets (classical alignment is quadratic w.r.t. sequence length)
• easy to interpret (0.5 = half of the n-grams are shared)
• no parameters (gap penalty, gap extension penalty, ...)
• can deal with shuffling of conserved segments and other "strange" cases (Are they actually strange?)
• better or worse than BLAST/FASTA? Who knows? (Hoehl 2008: alignment-free can be as good as classical alignment for inference of phylogeny; Edgar 2004: MUSCLE, an n-gram-based alignment method)
why 4 and not 42
• Hoehl 2008: n = 3...5
• correlation between n-gram sequence similarity and species divergence times
• standard deviation of sequence similarities
• maximum AUC when distinguishing related from randomly shuffled sequences
[Figure: MR, r=0.93]
phylogenetic networks
• different node and edge types
• identification of reticulate events (e.g. recombination) is error-prone
• computationally expensive
• larger networks become messy
[Figure: example networks: T-Rex (Makarenkov et al. 2001), NeighborNet/SplitsTree (Bryant et al. 2004, Huson et al. 1998), Newick (Cardona et al. 2008)]
larger networks - example Huson et al. 2005 Bryant et al. 2004
spring layout graph = ridiculogram
• layout dependent
• distorted distances
• random initialization
• local minima
• slow
[Figure: spring layout graph of nuclear receptors: GR, MR, PR, AR]
mosaic plot
• point size is similarity
• no distortions
• no random initialization
• preserves full information
• automatic clustering (spectral rearrangement)
• no hard decision about number of clusters
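spectral clustering
• affinity matrix A: $a_{ij} = \exp(-(1 - s_{ij})^2 / \sigma)$
  s_ij: n-gram similarity between sequences i and j
  σ: defines neighborhood radius
• "degree" matrix D: $d_{ii} = \sum_j a_{ij}$
• Laplacian matrix: $L = D - A$
• eigenvector decomposition of L
  e: eigenvalues
  v: eigenvectors
  v2: eigenvector for 2nd smallest eigenvalue (Fiedler vector); indicates clusters and how well they are separated

    from numpy import exp, diag     # S: similarity matrix (NumPy array), sig: sigma
    from numpy.linalg import eigh
    A = exp(-(1-S)**2/sig)          # affinity matrix
    D = diag(A.sum(axis=0))         # degree matrix
    L = D-A                         # Laplacian matrix
    e,v = eigh(L)                   # eigenvalues (ascending) and eigenvectors (columns)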
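The four lines above can be wrapped into a runnable sketch that also derives a sequence ordering from the Fiedler vector. It assumes NumPy; the function name `fiedler_order`, the toy similarity matrix and the choice sig = 0.1 are illustrative assumptions and do not come from the slides.

```python
import numpy as np
from numpy.linalg import eigh


def fiedler_order(S, sig):
    """Order sequences by their Fiedler vector entry.

    S   : symmetric matrix of pairwise n-gram similarities in [0, 1]
    sig : neighborhood radius (sigma)
    """
    S = np.asarray(S, dtype=float)
    A = np.exp(-(1.0 - S) ** 2 / sig)   # affinity matrix
    D = np.diag(A.sum(axis=0))          # degree matrix
    L = D - A                           # (unnormalized) graph Laplacian
    e, v = eigh(L)                      # eigh returns eigenvalues in ascending order
    fiedler = v[:, 1]                   # eigenvector of the 2nd smallest eigenvalue
    return np.argsort(fiedler), fiedler


if __name__ == "__main__":
    # Toy data: sequences 0 and 2 are similar, as are 1 and 3.
    S = [[1.0, 0.1, 0.9, 0.1],
         [0.1, 1.0, 0.1, 0.9],
         [0.9, 0.1, 1.0, 0.1],
         [0.1, 0.9, 0.1, 1.0]]
    order, fiedler = fiedler_order(S, sig=0.1)
    print(order)
    print(fiedler)
```

Sorting rows and columns of the similarity matrix by this order places similar sequences next to each other; the sign of each Fiedler entry gives a two-way split, and the gap between entries hints at how well the groups are separated.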
spectral clustering
• takes "global" properties into account
• fast and scales well
• no random initialization => single run
• global minimum => single, unique solution
• few parameters: L, σ (σ <= mean of distance matrix)
• "better" than k-means (works for non-spherical clusters) or single-linkage hierarchical clustering (no chaining problem)
• clustering is NP-hard and spectral clustering is "just another approximation"
• recursive spectral clustering to improve cluster quality
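One way to read "recursive spectral clustering" is repeated bipartition by the sign of the Fiedler vector. The sketch below assumes NumPy and that reading; the stopping rule (a maximum `depth` and a minimum cluster size `min_size`) is a hypothetical choice made only to keep the example finite, and the slides do not say which criterion was actually used.

```python
import numpy as np
from numpy.linalg import eigh


def fiedler_vector(S, sig):
    """Fiedler vector of the Laplacian built from similarity matrix S."""
    A = np.exp(-(1.0 - S) ** 2 / sig)
    L = np.diag(A.sum(axis=0)) - A
    e, v = eigh(L)                      # eigenvalues in ascending order
    return v[:, 1]


def recursive_bipartition(S, sig, depth=2, min_size=2, indices=None):
    """Split sequences by the sign of the Fiedler vector, then recurse.

    depth and min_size are assumed stopping criteria (not from the slides).
    Returns a list of clusters, each a sorted list of sequence indices.
    """
    S = np.asarray(S, dtype=float)
    if indices is None:
        indices = np.arange(S.shape[0])
    if depth == 0 or len(indices) < 2 * min_size:
        return [sorted(indices.tolist())]
    sub = S[np.ix_(indices, indices)]   # similarities within this cluster
    f = fiedler_vector(sub, sig)
    left, right = indices[f < 0], indices[f >= 0]
    if len(left) < min_size or len(right) < min_size:
        return [sorted(indices.tolist())]   # refuse a degenerate split
    return (recursive_bipartition(S, sig, depth - 1, min_size, left)
            + recursive_bipartition(S, sig, depth - 1, min_size, right))


if __name__ == "__main__":
    S = [[1.0, 0.1, 0.9, 0.1],
         [0.1, 1.0, 0.1, 0.9],
         [0.9, 0.1, 1.0, 0.1],
         [0.1, 0.9, 0.1, 1.0]]
    print(recursive_bipartition(S, sig=0.1))
```

Each recursion recomputes the Laplacian on the sub-matrix only, so a later split is not influenced by sequences that were already separated at a higher level.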
the end
• fast technique to visualize/analyze reticulate protein family evolution
• matrix representation
• spectral clustering
• n-gram similarity
• many other applications
Perl free!
questions?
SCOP
• five families
• randomly selected

Nuclear receptors
[Figure: domain labels: ligand-binding domain, N-terminal section, zinc-finger domain]
Full-length sequence
• MrBayes v3.1.2, 10^6 generations, 4 chains
• 240 CPU-hrs
[Figure: tree with labels GR, MR, PR, AR]

Zinc finger domain
• MrBayes v3.1.2, 10^6 generations, 4 chains
• 9 CPU-hrs
[Figure: tree with labels AR, GR, MR, PR]

Ligand-binding domain
• MrBayes v3.1.2, 10^6 generations, 4 chains
• 27 CPU-hrs
[Figure: tree with labels PR, AR, MR, GR]

Upstream region
• MrBayes v3.1.2, 10^6 generations, 4 chains
• 87 CPU-hrs
[Figure: tree labeled "?"]
quality q
given two sequences and their n-gram dot plot:
• diag = set of dot sums along diagonals
• n = length of sequence
• max: global alignment, min: local alignment
• q ∈ [0...1]