EMBOSS – an application suite for Bioinformatics
• Shahid Manzoor
• Adnan Niazi
SLU Global Bioinformatics Centre
E – European
M – Molecular
B – Biology
O – Open
S – Software
S – Suite
All Information
• EMBOSS info at http://emboss.sourceforge.net/
• wEMBOSS info at http://wemboss.sourceforge.net/
• E-mail martin.norling@slu.se to get a username and password for wEMBOSS at http://ebiokit.hgen.slu.se/
What is EMBOSS
• Open source molecular biology analysis package.
• Handles a variety of common file formats.
• Provides libraries for easy development.
• Licensed under the GPL (applications) and LGPL (libraries).
• EMBOSS was developed by Peter Rice, Ian Longden, and Alan Bleasby; the wEMBOSS web interface was developed by Martin Sarachu and Marc Colet.
• Available at http://emboss.sourceforge.net
Features of EMBOSS
• A comprehensive set of sequence analysis programs.
• All sequence formats and many alignment and structural formats are handled.
• It runs on practically every UNIX you can think of (and likely some that you can't), plus Windows and OS X.
• Each application has the same style of interface, so master one and you've mastered them all.
Uses for EMBOSS
• Sequence alignment.
• Protein motif identification (including domain analysis).
• Nucleotide sequence pattern analysis (for example, to identify CpG islands or repeats).
• Presentation tools for publications.
Programs in EMBOSS
• More than 140 programs, small and large, in the package.
• All programs share a common look and feel.
• Easy to run from the command line.
• Retrieval of sequence data from the web.
The One Argument
• -help
• The -help argument displays short help for any EMBOSS program.
The One Command
• wossname
• wossname searches the short descriptions of the other programs for keywords.
Large collection of gene and protein analysis tools
• Translation
• Protein domain searching
• Sequence retrieval
• Alignments
• Primer design
• Restriction mapping
[Diagram: analysis workflow. Inputs: DNA Sequence 1, DNA Sequence 2. Translation yields protein Sequence 1, protein Sequence 2. Analyses shown: dotplot, protein local/global alignment, multiple sequence alignment, motif and domain searching, physico-chemical properties.]
Dotplots
>SEQ1.fasta
AGTGGTCGTGAAGAGAATGCTCCTCCTTTGGAATCTTAA
>SEQ2.fasta
AGTGCTCCTCCCTTAGAATCTTAG
For an exact match:
Unix% dottup SEQ1.fasta SEQ2.fasta -wordsize 10 &
For a similarity match:
Unix% dotmatcher SEQ1.fasta SEQ2.fasta -windowsize 10 -threshold 17 &
Dotplots …
Identity Matrix
     A    T    G    C
A    5   -4   -4   -4
T   -4    5   -4   -4
G   -4   -4    5   -4
C   -4   -4   -4    5
Window Size is the number of bases in a sliding window that is moved along each sequence and compared to generate a single data point on the plot. Some dot plot programs require an odd window size; dotmatcher does not (the example above uses 10).
Mismatch Limit determines how similar the two sequences in a window must be to "match". For example, if the window size is 9 and the mismatch limit is 2, then up to 2 mismatches in a 9-base window will still be classified as a match.
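The mismatch-limit rule can be sketched in a few lines of Python. This is only an illustration of the idea, not EMBOSS code; the function name `window_matches` and the two 9-base windows are chosen here for the example.

```python
def window_matches(a: str, b: str, mismatch_limit: int) -> bool:
    """Return True if two equal-length windows differ at no more than
    mismatch_limit positions (hypothetical helper, not an EMBOSS function)."""
    if len(a) != len(b):
        raise ValueError("windows must have the same length")
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return mismatches <= mismatch_limit


# Two 9-base windows that differ at exactly two positions.
print(window_matches("CTCCTTTGG", "CTCCCTTAG", 2))  # True: 2 mismatches allowed
print(window_matches("CTCCTTTGG", "CTCCCTTAG", 1))  # False: only 1 allowed
```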
Dotplots …
Scoring two 10-base windows with the matrix (match = 5, mismatch = -4):
CCTCCTTTGG
CCTCCTTTGG    Score = 50 (10 matches)
CCTCCTTTGG
CCTCCCTTAG    Score = 32 (8 matches, 2 mismatches)
[Figure labels: Pro, Leu]
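The window scores of 50 and 32 can be reproduced with a short Python sketch. It assumes the 5 / -4 matrix shown on the slides and assumes that a dot is drawn when the window score reaches the threshold; the exact comparison rule inside dotmatcher is not confirmed here, and the function names are hypothetical.

```python
MATCH, MISMATCH = 5, -4  # values from the matrix on the slide


def window_score(a: str, b: str) -> int:
    """Sum the per-position scores of two equal-length windows."""
    return sum(MATCH if x == y else MISMATCH for x, y in zip(a, b))


def similarity_dots(seq1: str, seq2: str, window: int, threshold: int):
    """Return (i, j) window start positions whose score is >= threshold.

    Assumption: 'score >= threshold' is used as the dot criterion.
    """
    dots = []
    for i in range(len(seq1) - window + 1):
        for j in range(len(seq2) - window + 1):
            if window_score(seq1[i:i + window], seq2[j:j + window]) >= threshold:
                dots.append((i, j))
    return dots


print(window_score("CCTCCTTTGG", "CCTCCTTTGG"))  # 50
print(window_score("CCTCCTTTGG", "CCTCCCTTAG"))  # 32

SEQ1 = "AGTGGTCGTGAAGAGAATGCTCCTCCTTTGGAATCTTAA"
SEQ2 = "AGTGCTCCTCCCTTAGAATCTTAG"
dots = similarity_dots(SEQ1, SEQ2, window=10, threshold=17)
```

With a threshold of 17, the imperfect window pair (score 32) still produces a dot, which is why a similarity plot shows more of the diagonal than an exact-match plot.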
Dotplots
• A dot plot is a simple graphical representation of identical residues between two sequences.
• The X axis represents the first sequence (PHO5).
• The Y axis represents the second sequence (PHO3).
• A dot is plotted for each match between two residues of the sequences.
• Diagonal lines reveal regions of identity between the two sequences.
Dotplots …
• The dot plot can be adapted to display only word matches, which correspond to a diagonal of dots in the letter-based dot plot.
• Example: alignment of PHO5 and PHO3 coding sequences, with different word sizes.
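A word-based dot plot can be sketched as follows. This is a minimal Python illustration of exact word matching, not the dottup implementation; `word_matches` is a hypothetical name, and the two sequences are the SEQ1/SEQ2 examples from the earlier dotplot slide rather than PHO5 and PHO3.

```python
from collections import defaultdict


def word_matches(seq1: str, seq2: str, wordsize: int):
    """Return (i, j) pairs where the word starting at seq1[i] is identical
    to the word starting at seq2[j]."""
    # Index every word of seq2 by its start positions.
    index = defaultdict(list)
    for j in range(len(seq2) - wordsize + 1):
        index[seq2[j:j + wordsize]].append(j)
    # Look up every word of seq1 in that index.
    dots = []
    for i in range(len(seq1) - wordsize + 1):
        for j in index.get(seq1[i:i + wordsize], []):
            dots.append((i, j))
    return dots


SEQ1 = "AGTGGTCGTGAAGAGAATGCTCCTCCTTTGGAATCTTAA"
SEQ2 = "AGTGCTCCTCCCTTAGAATCTTAG"
for size in (4, 6, 8):
    # Larger words give fewer, more specific dots.
    print(size, len(word_matches(SEQ1, SEQ2, size)))
```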
Detecting repeats with a dot plot
• Sequence repeats are easily detected in a dot plot when a sequence is compared to itself.
• The main diagonal is completely marked (by definition, since the sequence is identical to itself).
• Repeats appear as segments of lines parallel to the diagonal.
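The self-comparison idea can be sketched in Python by counting word matches away from the main diagonal. The function name `repeat_offsets` and the toy sequence are made up for this illustration; a repeat shows up as several matches sharing the same offset, which corresponds to a line parallel to the diagonal.

```python
def repeat_offsets(seq: str, wordsize: int) -> dict:
    """Compare a sequence with itself and count off-diagonal word matches
    per offset (j - i). The main diagonal (offset 0) is skipped."""
    n = len(seq) - wordsize + 1
    counts = {}
    for i in range(n):
        for j in range(i + 1, n):
            if seq[i:i + wordsize] == seq[j:j + wordsize]:
                counts[j - i] = counts.get(j - i, 0) + 1
    return counts


# Toy sequence (invented): GATTACA occurs twice, 9 bases apart.
toy = "TTGATTACACCGATTACAGG"
print(repeat_offsets(toy, 5))  # {9: 3}
```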
Plotorf
>SEQ1.fasta
ATGGGTCGTGAAGAGAATGCTCCTCCTTTGGAATCTTAA
>SEQ2.fasta
ATGGCTCCTCCCTTAGAATCTTAG
Unix% plotorf SEQ1.fasta -stop TAA,TAG -out GA.plot &
Unix% getorf SEQ1.fasta -minsize 5 -table 0 -find 1 -out GA.getorf &
[Figure: plotorf output with one panel per reading frame: Frame 1, Frame 2, Frame 3, Frame -1, Frame -2, Frame -3]
ATGGGTCGTGAAGAGAATGCTCCTCCTTTGGAATCTTAA
TACCCAGCACTTCTCTTACGAGGAGGAAACCTTAGAATT
Start and stop codons are located according to the instructions given to the program, and the area between a start codon and a stop codon is marked as an open reading frame.
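The two strands and the six reading frames can be sketched in Python. This is an illustration only: the frame numbering used here (negative frames read from the start of the reverse complement) is a simple convention assumed for the example and may not match how each EMBOSS program numbers its negative frames.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(seq: str) -> str:
    """Base-by-base complement, written in the same left-to-right order."""
    return seq.translate(COMPLEMENT)


def reverse_complement(seq: str) -> str:
    """The opposite strand read in its own 5' to 3' direction."""
    return complement(seq)[::-1]


def six_frames(seq: str) -> dict:
    """Map frame number (1, 2, 3, -1, -2, -3) to the bases read in that frame.

    Assumption: frame -k starts at offset k-1 of the reverse complement.
    """
    rc = reverse_complement(seq)
    frames = {}
    for k in range(3):
        frames[k + 1] = seq[k:]
        frames[-(k + 1)] = rc[k:]
    return frames


SEQ1 = "ATGGGTCGTGAAGAGAATGCTCCTCCTTTGGAATCTTAA"
print(complement(SEQ1))  # the lower strand shown on the slide
```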
Indication of full coding sequence? Alternative splice form?
Using getorf:
>_1 [17 - 37]
MLLLWNL
>_2 [1 - 36]
MGREENAPPLES*
[Figure annotations: start methionine (M), stop codon (*)]
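The two reported ORFs can be reproduced with a small Python sketch. It is not the getorf algorithm: it assumes the standard genetic code, scans only the three forward frames, starts an ORF at ATG, and ends it at a stop codon or at the last complete codon of the sequence. The names `translate` and `find_orfs` are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna: str) -> str:
    """Translate complete codons of an upper-case ACGT string."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def find_orfs(seq: str, min_nt: int = 5):
    """Return (start, end, protein) with 1-based inclusive coordinates.

    Assumptions: forward strand only; the stop codon is excluded from the
    reported range; an ORF still open at the sequence end is reported.
    """
    found = []
    for frame in range(3):
        frame_end = frame + 3 * ((len(seq) - frame) // 3)
        start = None
        for pos in range(frame, frame_end, 3):
            aa = CODON_TABLE[seq[pos:pos + 3]]
            if start is None and aa == "M":
                start = pos
            elif start is not None and aa == "*":
                found.append((start, pos))
                start = None
        if start is not None:
            found.append((start, frame_end))
    return [(s + 1, e, translate(seq[s:e])) for s, e in found if e - s >= min_nt]


SEQ1 = "ATGGGTCGTGAAGAGAATGCTCCTCCTTTGGAATCTTAA"
for orf in find_orfs(SEQ1):
    print(orf)
```

Translating bases 4 to 36 of SEQ1 (`translate(SEQ1[3:36])`) gives GREENAPPLES, the eleven codons that follow the start methionine.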
Unix% transeq SEQ1.fasta -frame 1 -table 0 -sbegin 4 -send 36 -out GA.fasta &
>GA.fasta
GREENAPPLES
Alignments
>GA.fasta
GREENAPPLES
>A.fasta
APPLES
For a global alignment:
Unix% needle GA.fasta A.fasta -gapopen 10 -gapextend 0.5 -datafile EPAM250 &
For a local alignment:
Unix% water GA.fasta A.fasta -gapopen 10 -gapextend 0.5 -datafile EPAM250 &
Alignments …
Purpose: to align two or more sequences in a biologically significant way.
Gap penalty = 10; Extension penalty = 0.5
Global (needle):
GREENAPPLES
-----APPLES
Local (water):
APPLES
APPLES
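The difference between global and local alignment can be sketched with textbook dynamic programming in Python. This is a simplified illustration, not what needle and water compute: it assumes match = 1, mismatch = -1 and a linear gap cost of -2 instead of the EPAM250 matrix and the open/extend gap penalties used on the slides.

```python
def _sub(x, y, match, mismatch):
    return match if x == y else mismatch


def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch with a linear gap cost: (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s[i][0] = i * gap
    for j in range(1, m + 1):
        s[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(s[i - 1][j - 1] + _sub(a[i - 1], b[j - 1], match, mismatch),
                          s[i - 1][j] + gap, s[i][j - 1] + gap)
    top, bottom, i, j = [], [], n, m
    while i > 0 or j > 0:  # trace back from the bottom-right corner
        if i > 0 and j > 0 and s[i][j] == s[i - 1][j - 1] + _sub(a[i - 1], b[j - 1], match, mismatch):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and s[i][j] == s[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return s[n][m], "".join(reversed(top)), "".join(reversed(bottom))


def local_align(a, b, match=1, mismatch=-1, gap=-2):
    """Smith-Waterman with a linear gap cost: (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    best, bi, bj = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(0, s[i - 1][j - 1] + _sub(a[i - 1], b[j - 1], match, mismatch),
                          s[i - 1][j] + gap, s[i][j - 1] + gap)
            if s[i][j] > best:
                best, bi, bj = s[i][j], i, j
    top, bottom, i, j = [], [], bi, bj
    while i > 0 and j > 0 and s[i][j] > 0:  # trace back from the best cell
        if s[i][j] == s[i - 1][j - 1] + _sub(a[i - 1], b[j - 1], match, mismatch):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif s[i][j] == s[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(global_align("GREENAPPLES", "APPLES"))  # (-4, 'GREENAPPLES', '-----APPLES')
print(local_align("GREENAPPLES", "APPLES"))   # (6, 'APPLES', 'APPLES')
```

The global alignment pays for the five unmatched residues of GREEN, while the local alignment simply reports the best-scoring shared segment.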
[Diagram: APPLES aligned within GREENAPPLES, leading on to pattern searching and physicochemical properties]
It looks like the "apples" motif may be part of a larger domain.
Physico-chemical properties
Isoelectric point:
Unix% iep GA.fasta -plot -step 0.5 -out GA.IEP &
General properties:
Unix% pepinfo GA.fasta -hwindow 8 -generalplot -hydropathyplot &
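A sliding-window hydropathy profile of the kind plotted with -hwindow 8 can be sketched in Python. The sketch assumes the Kyte-Doolittle scale and a plain window average; whether pepinfo averages in exactly this way is not confirmed here, and `hydropathy_profile` is a hypothetical name.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein: str, window: int = 8):
    """Average hydropathy of every full window along the protein."""
    return [
        sum(KYTE_DOOLITTLE[aa] for aa in protein[i:i + window]) / window
        for i in range(len(protein) - window + 1)
    ]


profile = hydropathy_profile("GREENAPPLES", window=8)
for start, value in enumerate(profile, start=1):
    print(start, round(value, 3))
```

For an 11-residue peptide and a window of 8 there are only four data points, so the plot for GREENAPPLES is necessarily very short.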
Physico-chemical properties
[Figure: Venn diagram of amino acid properties. Sets: Polar, Positive, Charged, Small, Tiny, Hydrophobic, Aromatic, Aliphatic. Residues: P G A S C V I N T D L Q E M K Y R H F W]
The pepinfo graph of properties is based on this diagram.
Physico-chemical properties
[Figure annotations: non-polar region with small residues; polar region to one side of non-charged region]
Pattern searching
>GL.fasta
GREENLEAVES
>GA.fasta
GREENAPPLES
>RL.fasta
REDLEAVES
>RA.fasta
REDAPPLES
Alignment:
GREENAPPL---ES
-RE-DAPPL---ES
GREEN---LEAVES
-RE-D---LEAVES
Derived pattern:
[G](0,1)-R-[E](1,2)-[ND]-X(3)-L-X(3)-E-S
Pattern searching
pattern.fruit:
[G](0,1)-[R]-[E](1,2)-[ND]-x(3)-[L]-x(3)-[E]-[S]
Search a protein database:
Unix% fuzzpro sptr:* pattern.fruit -mismatch 0 -out GA.fuzzpro &
Nothing resembling this pattern is found in the database, but we could try scanning PRINTS (pscan) and PROSITE (patmatmotifs) with one of our sequences.
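To see what such a pattern accepts, it can be converted to a regular expression. The Python sketch below handles only the small subset of PROSITE-style syntax used on these slides and does not reproduce fuzzpro's own parser; `prosite_to_regex` is a hypothetical helper. Read literally, x(3) requires exactly three residues, so the pattern as written matches none of the four toy sequences; the gapped alignment suggests x(0,3), which is tried here as an assumption.

```python
import re

ELEMENT = re.compile(r"(\[[A-Z]+\]|[A-Zx])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern: str) -> str:
    """Convert a simple PROSITE-style pattern to a Python regex.

    Supported (assumed subset): single residues, [..] residue sets,
    x or X as wildcard, and (n) or (n,m) repeat counts.
    """
    parts = []
    for element in pattern.replace(" ", "").split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, low, high = m.groups()
        core = "." if residue in ("x", "X") else residue
        if low is None:
            repeat = ""
        elif high is None:
            repeat = "{" + low + "}"
        else:
            repeat = "{" + low + "," + high + "}"
        parts.append(core + repeat)
    return "".join(parts)


SEQS = ["GREENLEAVES", "GREENAPPLES", "REDLEAVES", "REDAPPLES"]
strict = prosite_to_regex("[G](0,1)-[R]-[E](1,2)-[ND]-x(3)-[L]-x(3)-[E]-[S]")
relaxed = prosite_to_regex("[G](0,1)-[R]-[E](1,2)-[ND]-x(0,3)-[L]-x(0,3)-[E]-[S]")

for seq in SEQS:
    print(seq, bool(re.search(strict, seq)), bool(re.fullmatch(relaxed, seq)))
```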
Some Programs
Some Programs …
More Information