Integrated packages of sequence-analysis applications
• Formerly GCG: since 1985
• Move to a commercial product
• 1989_2013: development of a package with the same functionality:
• same interface and options
• access to the sequence databanks
• graphical output
• open source
• possibility to develop and contribute
• EMBOSS (version 6.5: one release per year, in July)
• Complemented by several specialised packages (PHYLIP, Vienna …)
• Many graphical interfaces: PISE, Explorer, Mobyle …
• Other packages: UGENE (Russian), recent (integrates NGS)
Sequence formats

STADEN FORMAT (PLAIN TEXT)
$ more zfmtsec
SESLRIIFAGTPDFAARHLDALLSSGHNVVGVFTQPDRPAGRGKKLMPSPVKVLAEEKGL
PVFQPVSLRPQENQQLVAELQADVMVVVAYGLILPKAVLEMPRLGCINVHGSLLPRWRGA
APIQRSLWAGDAETGVTIMQMDVGLDTGDMLYKLSCPITAEDTSGTLYDKLAELGPQGLI
TTLKQLADGTAKPEVQDETLVTYAEKLSKEEARIDWSLSAAQLERCIRAFNPWPMSWLEI
EGQPVKVWKASVIDTATNAAPGTILEANKQGIQVATGDGILNLLSLQPAGKKAMSAQDLL
NSRREWFVPGNRLV

FASTA FORMAT
>em|U03177|FL03177 Feline leukemia virus clone FeLV-69TTU3-16.
AGATACAAGGAAGTTAGAGGCTAAAACAGGATATCTGTGGTTAAGCACCTG
TGAGGCCAAGAACAGTTAAACCCCGGATATAGCTGAAACAGCAGAAGTTTC
GCCAGCAGTCTCCAGGCTCCCCA
>header of sequence 2.
sequence 2

Alignment formats
Databank formats: GenBank, EMBL, UniProt ..
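The FASTA layout shown above (a '>' header line followed by one or more sequence lines) can be read with a few lines of code. The following is a minimal sketch, not the EMBOSS reader: the function name parse_fasta is hypothetical, and it assumes every record starts with '>' and that sequence lines carry no numbering.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines are concatenated
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>em|U03177|FL03177 Feline leukemia virus clone FeLV-69TTU3-16.
AGATACAAGGAAGTTAGAGGCTAAAACAGGATATCTGTGGTTAAGCACCTG
TGAGGCCAAGAACAGTTAAACCCCGGATATAGCTGAAACAGCAGAAGTTTC
GCCAGCAGTCTCCAGGCTCCCCA
"""
records = parse_fasta(example)
for header, sequence in records:
    print(header.split()[0], len(sequence))
```

A STADEN plain-text file has no header at all, so the same loop would simply concatenate every line into one sequence.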
ORF: sequence between two stop codons. CDS: sequence from an ATG to a stop codon.
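The difference between the two definitions can be illustrated on a single reading frame. This is a minimal sketch under stated assumptions: the function names are hypothetical, only the standard stop codons TAA, TAG and TGA are used, only the forward strand is scanned, and regions that run into the end of the sequence are not reported. It is not how any EMBOSS program is implemented.

```python
STOPS = {"TAA", "TAG", "TGA"}  # standard genetic code, DNA alphabet


def codons(seq, frame=0):
    """Split a sequence into complete codons for the given frame (0, 1 or 2)."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


def orfs_between_stops(seq, frame=0):
    """ORF in the slide's sense: the region between two stop codons."""
    regions, current, seen_stop = [], [], False
    for codon in codons(seq, frame):
        if codon in STOPS:
            if seen_stop and current:
                regions.append("".join(current))
            seen_stop, current = True, []
        elif seen_stop:
            current.append(codon)
    return regions


def cds_from_atg(seq, frame=0):
    """CDS in the slide's sense: from an ATG up to the next stop codon."""
    regions, current = [], None
    for codon in codons(seq, frame):
        if current is None:
            if codon == "ATG":
                current = [codon]
        elif codon in STOPS:
            regions.append("".join(current))
            current = None
        else:
            current.append(codon)
    return regions


seq = "TAACCCATGAAAGGGTGAATGCCC"
print(orfs_between_stops(seq))  # ['CCCATGAAAGGG']
print(cds_from_atg(seq))        # ['ATGAAAGGG']
```

On this toy sequence the ORF includes the codon upstream of the ATG, while the CDS starts at the ATG itself.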
Alternative: use a list of IDs: seqret @list
prophecy: builds a profile matrix from a multiple alignment • prophet: searches a set of sequences with a profile
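To give an idea of what a profile matrix contains, here is a minimal sketch that builds per-column residue frequencies from a toy alignment and slides the result along a sequence without gaps. It is an illustration only: the function names and the toy data are hypothetical, and the weighted, gap-aware profiles and alignments used by prophecy and prophet are not reproduced.

```python
from collections import Counter


def frequency_profile(alignment):
    """Return one {residue: frequency} dict per alignment column (gaps skipped)."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for col in range(length):
        counts = Counter(row[col] for row in alignment if row[col] != "-")
        total = sum(counts.values())
        profile.append({res: n / total for res, n in counts.items()} if total else {})
    return profile


def best_window(profile, sequence):
    """Score every ungapped window; return (score, start) of the best one, or None."""
    width = len(profile)
    best = None
    for start in range(len(sequence) - width + 1):
        score = sum(profile[i].get(sequence[start + i], 0.0) for i in range(width))
        if best is None or score > best[0]:
            best = (score, start)
    return best


alignment = ["ACGT", "ACGA", "TCGT"]  # toy alignment, made up for the example
profile = frequency_profile(alignment)
hit = best_window(profile, "GGACGTGG")
print(hit)  # best-scoring window starts at position 2 (0-based)
```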
Equivalence elements (nucleic acids or amino acids) in pattern searches
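For nucleic acids, the equivalences are the IUPAC ambiguity codes (R for A or G, Y for C or T, and so on). The sketch below expands such codes into a regular-expression character class. The function name iupac_to_regex is hypothetical, and U is treated as a plain literal rather than as an equivalent of T, which is a simplifying assumption.

```python
import re

# IUPAC nucleotide codes and the bases each one stands for.
IUPAC_NUCLEOTIDES = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def iupac_to_regex(pattern):
    """Expand each IUPAC symbol into a literal base or a [..] character class."""
    parts = []
    for symbol in pattern.upper():
        bases = IUPAC_NUCLEOTIDES[symbol]  # KeyError on a non-IUPAC symbol
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)


regex = iupac_to_regex("GRY")
print(regex)                               # G[AG][CT]
print(re.search(regex, "TTGACAA").start()) # 2
```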
fuzznuc
• Pattern specification
• Patterns for fuzznuc are based on the pattern format used in the PROSITE database, except that the terminating dot '.' and the hyphens '-' between the characters are optional.
• The PROSITE pattern definition from the PROSITE documentation (amended to refer to nucleic acid sequences, not proteins) follows.
• The standard IUPAC one-letter codes for the nucleotides are used.
• The symbol 'n' is used for a position where any nucleotide is accepted.
• Ambiguities are indicated by listing the acceptable nucleotides for a given position between square brackets '[ ]'. For example: [ACG] stands for A or C or G.
• Ambiguities are also indicated by listing, between a pair of curly brackets '{ }', the nucleotides that are not accepted at a given position. For example: {AG} stands for any nucleotide except A and G.
• Each element in a pattern is separated from its neighbor by a '-'. (Optional in fuzznuc.)
• Repetition of an element of the pattern can be indicated by following that element with a numerical value or a numerical range between parentheses. Examples: N(3) corresponds to N-N-N, N(2,4) corresponds to N-N or N-N-N or N-N-N-N.
• When a pattern is restricted to the 5' or 3' end of a sequence, that pattern starts with a '<' symbol or ends with a '>' symbol, respectively.
• A period ends the pattern. (Optional in fuzznuc.)
• All other characters, including spaces, are not allowed.
• For example, in the EMBL entry J01636 you can look for the pattern:
• [CG](5)TG{A}N(1,5)C
• This searches for "C or G" 5 times, followed by T and G, then anything except A, then any base (1 to 5 times) before a C.
• You can use ambiguity codes for nucleic acid searches, but not within [] or {}, because they are expanded to their bracketed counterparts. For example, "s" is expanded to "[GC]", so [S] would be expanded to [[GC]], which is illegal.
• Note that X is reserved for proteins. You must use N for nucleic acids to refer to any base.
• The search is case-independent, so 'AAA' matches 'aaa'.
• Another example: (RY){5,10}, an alternation of purine and pyrimidine repeated at least 5 times
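The rules above map almost one to one onto ordinary regular expressions, which makes them easy to try out. The following sketch translates a PROSITE-style nucleotide pattern into a Python regex. It is not the fuzznuc implementation: the function name prosite_to_regex is hypothetical, only N is treated as a wildcard (other ambiguity codes are left as literals), and mismatches are not supported. The test sequence is made up.

```python
import re


def prosite_to_regex(pattern, wildcard="N"):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    p = pattern.strip().upper().rstrip(".").replace("-", "")
    wildcard = wildcard.upper()
    out, i = [], 0
    while i < len(p):
        ch = p[i]
        if ch == "<":
            out.append("^")                       # anchored at the start
        elif ch == ">":
            out.append("$")                       # anchored at the end
        elif ch == "[":
            j = p.index("]", i)
            out.append("[" + p[i + 1:j] + "]")    # accepted symbols
            i = j
        elif ch == "{":
            j = p.index("}", i)
            out.append("[^" + p[i + 1:j] + "]")   # rejected symbols
            i = j
        elif ch == "(":
            j = p.index(")", i)
            out.append("{" + p[i + 1:j] + "}")    # repeat count or range
            i = j
        elif ch == wildcard:
            out.append(".")
        elif ch.isalpha():
            out.append(ch)
        else:
            raise ValueError("character not allowed in pattern: " + repr(ch))
        i += 1
    return "".join(out)


regex = prosite_to_regex("[CG](5)TG{A}N(1,5)C")
print(regex)  # [CG]{5}TG[^A].{1,5}C
match = re.search(regex, "AACGCGCTGTAAC")
print(match.start(), match.group(0))  # 2 CGCGCTGTAAC
```

Passing wildcard="X" would give the protein variant of the same translation.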
fuzzpro
• Protein pattern specification
• Patterns for fuzzpro are based on the pattern format used in the PROSITE database, except that the terminating dot '.' and the hyphens '-' between the characters are optional. The PROSITE pattern definition from the PROSITE documentation follows.
• The standard IUPAC one-letter codes for the amino acids are used.
• The symbol 'x' is used for a position where any amino acid is accepted.
• Ambiguities are indicated by listing the acceptable amino acids for a given position between square brackets '[ ]'. For example: [ALT] stands for Ala or Leu or Thr.
• Ambiguities are also indicated by listing, between a pair of curly brackets '{ }', the amino acids that are not accepted at a given position. For example: {AM} stands for any amino acid except Ala and Met.
• Each element in a pattern is separated from its neighbor by a '-'. (Optional in fuzzpro.)
• Repetition of an element of the pattern can be indicated by following that element with a numerical value or a numerical range between parentheses. Examples: x(3) corresponds to x-x-x, x(2,4) corresponds to x-x or x-x-x or x-x-x-x.
• When a pattern is restricted to the N- or C-terminal end of a sequence, that pattern starts with a '<' symbol or ends with a '>' symbol, respectively.
• A period ends the pattern. (Optional in fuzzpro.)
• All other characters, including spaces, are not allowed.
• For example, in SWISSPROT entry 100K_RAT you can look for the pattern:
• [DE](2)HS{P}X(2)PX(2,4)C
• This means: two Asps or Glus in any order, followed by His, Ser, any residue other than Pro, then two of any residue, followed by Pro, followed by two to four of any residue, followed by Cys.
• The search is case-independent, so 'AAA' matches 'aaa'.
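Read element by element, the example pattern corresponds to the regular expression used below. This is a hand translation for illustration, run against a made-up peptide rather than the 100K_RAT entry; the IGNORECASE flag mimics the case-independent search.

```python
import re

# [DE](2) HS {P} X(2) P X(2,4) C, translated by hand:
#   [DE]{2}  two Asp or Glu, in any order
#   HS       His then Ser
#   [^P]     any residue except Pro
#   .{2}     two residues of any kind
#   P        Pro
#   .{2,4}   two to four residues of any kind
#   C        Cys
PATTERN = re.compile(r"[DE]{2}HS[^P].{2}P.{2,4}C", re.IGNORECASE)

peptide = "MKDEHSAGGPLLLCW"  # hypothetical sequence, not a real entry
match = PATTERN.search(peptide)
print(match.start(), match.group(0))  # 2 DEHSAGGPLLLC
```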
Regular-expression search (Unix style): dreg, preg
• Notes on dreg
• A regular expression is a way of specifying an ambiguous pattern to search for. Regular expressions are commonly used in some computer programming languages and may be more familiar to some users than to others.
• The following is a short guide to regular expressions in EMBOSS:
• ^ use this at the start of a pattern to insist that the pattern can only match at the start of a sequence (eg. '^AUG' matches a start codon at the start of the sequence)
• $ use this at the end of a pattern to insist that the pattern can only match at the end of a sequence (eg. 'A+$' matches a poly-A sequence at the end of the sequence)
• () groups a pattern. This is commonly used with '|' (eg. '(AUG)|(ATG)' matches either the RNA or the DNA form of the initiation codon)
• | is the OR operator: it enables a match to be made to either one pattern OR another. There is no AND operator in this version of regular expressions.
• The following quantifier characters specify the number of times that the character before (in this case 'x') matches:
• x? matches 0 or 1 times (ie, '' or 'x')
• x* matches 0 or more times (ie, '' or 'x' or 'xx' or 'xxx', etc)
• x+ matches 1 or more times (ie, 'x' or 'xx' or 'xxx', etc)
• {min,max} braces can enclose the specification of the minimum and maximum number of matches. A match of 'x' of between 3 and 6 times is: 'x{3,6}'
• Quantifiers can follow any of the following types of character specification:
• x any character (eg 'A')
• \x the character after the backslash is used instead of its normal regular expression meaning. This is commonly used to turn off the special meaning of the characters '^$()|?*+[]-.'. It may be especially useful when searching for gap characters in a sequence (eg '\.' matches only a dot character '.')
• [xy] matches one of the characters 'x' or 'y'. You may have one or more characters in this set.
• [x-z] matches any one of the set of characters starting with 'x' and ending with 'z' in ASCII order (eg '[A-G]' matches any one of: 'A', 'B', 'C', 'D', 'E', 'F', 'G')
• [^x-z] matches anything except any one of the group of characters in ASCII order (eg '[^A-G]' matches anything EXCEPT any one of: 'A', 'B', 'C', 'D', 'E', 'F', 'G')
• . the dot character matches any character (eg: 'A.G' matches 'AAG', 'AaG', 'AZG', 'A-G', 'A G', etc.)
• Combining some of these features gives the example: '([AGC]+GGG)|(TTTGGG)', which matches one or more of any one of 'A' or 'G' or 'C' followed by three 'G's, or matches just 'TTTGGG'.
• Regular expressions are case-sensitive: the pattern 'AAAA' will not match the sequence 'aaaa'. For this reason, both your pattern and the input sequences are converted to upper-case.
Warning: some servers do not allow special characters; in that case, use EMBOSS locally.
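The examples from the guide above can be tried with any regular-expression engine. The sketch below uses Python's re module as a stand-in for dreg, which is an assumption: the two engines may differ in details. The helper name find_all is hypothetical. Only the sequence is upper-cased here, and the patterns are written in upper case, because upper-casing a pattern blindly would also alter backslash escapes.

```python
import re


def find_all(pattern, sequence):
    """Return (start, matched_text) for every match in the upper-cased sequence."""
    seq = sequence.upper()
    return [(m.start(), m.group(0)) for m in re.finditer(pattern, seq)]


print(find_all(r"^AUG", "augcccaug"))                # [(0, 'AUG')]
print(find_all(r"A+$", "ccgtaaaa"))                  # [(4, 'AAAA')]
print(find_all(r"(AUG)|(ATG)", "ccATGggAUG"))        # [(2, 'ATG'), (7, 'AUG')]
print(find_all(r"([AGC]+GGG)|(TTTGGG)", "TTTGGGT"))  # [(0, 'TTTGGG')]
print(find_all(r"\.", "AC.GT"))                      # [(2, '.')]
```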
Graphical implementations
• mEMBOSS (Windows executable)
• Jemboss (graphical interface)
• Mobyle: new universal interface (developed at the Institut Pasteur), in particular for sequence-analysis programs. Simple to use.
All EMBOSS outputs are structured: they are easy to parse and to include in scripts for recurring analyses on sets of sequences.
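As an illustration of how little code such parsing takes, the sketch below reads a report that is assumed to have been written in GFF, that is, nine tab-separated columns per feature. The function name parse_gff is hypothetical and the sample lines are invented for the example; they are not real program output.

```python
GFF_FIELDS = ("seqid", "source", "type", "start", "end",
              "score", "strand", "phase", "attributes")


def parse_gff(text):
    """Turn GFF text into a list of dicts, skipping comments and blank lines."""
    features = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != len(GFF_FIELDS):
            raise ValueError("expected 9 tab-separated columns: " + repr(line))
        record = dict(zip(GFF_FIELDS, cols))
        record["start"] = int(record["start"])
        record["end"] = int(record["end"])
        features.append(record)
    return features


# Invented sample data, for illustration only.
sample = "\n".join([
    "##gff-version 3",
    "\t".join(["seq1", "fuzznuc", "sequence_feature", "12", "22", "11", "+", ".", "note=hypothetical"]),
    "\t".join(["seq2", "fuzznuc", "sequence_feature", "40", "51", "12", "-", ".", "note=hypothetical"]),
])
features = parse_gff(sample)
for f in features:
    print(f["seqid"], f["start"], f["end"], f["strand"])
```

The same loop can then be run over the reports of many sequences to collect hits in a single table.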