
Sequence Similarity Searches


aizza

Presentation Transcript


  1. Sequence Similarity Searches

  2. Three key questions • Query? • Purpose? • Database?

  3. BLAST

  4. >gi|77630012|ref|ZP_00792598.1| COG0442: Prolyl-tRNA synthetase [Yersinia pseudotuberculosis IP 31758]
     Length=572
     Score = 1013 bits (2619), Expect = 0.0, Method: Composition-based stats.
     Identities = 498/572 (87%), Positives = 537/572 (93%), Gaps = 0/572 (0%)

     Query  1    MRTSQYMLSTLKETPADAEVISHQLMLRAGMIRKLASGLYTWLPTGLRVLRKVENIVREE  60
                 MRTSQY+LST KETPADAEVISHQLMLRAGMIRKLASGLYTWLPTG+RVL+KVENIVREE
     Sbjct  1    MRTSQYLLSTQKETPADAEVISHQLMLRAGMIRKLASGLYTWLPTGVRVLKKVENIVREE  60

     Query  61   MNNAGAIEVSMPVVQPADLWVESGRWDQYGPELLRFVDRGERPFVLGPTHEEVITDLIRN  120
                 MNNAGAIEVSMPVVQPADLW ESGRW+QYGPELLRFVDRGERPFVLGPTHEEVITDLIR
     Sbjct  61   MNNAGAIEVSMPVVQPADLWQESGRWEQYGPELLRFVDRGERPFVLGPTHEEVITDLIRG  120

     Query  121  EVSSYKQLPLNFFQIQTKFRDEVRPRFGVMRSREFLMKDAYSFHTSQESLQATYDTMYAA  180
                 E++SYKQLPLNFFQIQTKFRDEVRPRFGVMR+REFLMKDAYSFHT+QESLQ TYD MY A
     Sbjct  121  EINSYKQLPLNFFQIQTKFRDEVRPRFGVMRAREFLMKDAYSFHTTQESLQETYDAMYTA  180

     ………………………….

     Query  481  MNMHKSFRVKEVAEDIYQQLRAKGIEVLLDDRKERPGVMFADMELIGVPHTIVIGDRNLD  540
                 MNMHKSFRVKE+AE++Y  LR+ GI+V+LDDRKERPGVMFADMELIGVPH IVIGDRNLD
     Sbjct  481  MNMHKSFRVKELAEELYTTLRSHGIDVILDDRKERPGVMFADMELIGVPHNIVIGDRNLD  540

     Query  541  SEEIEYKNRRVGEKQMIKTSEIIDFLLANIIR  572
                 SEE+EYKNRRVGEKQMIKTSEI++FLL+ I R
     Sbjct  541  SEEVEYKNRRVGEKQMIKTSEIVEFLLSQIKR  572

  5. Global Alignment vs. Local Alignment
     • Global methods find the best alignment of both sequences in their entirety.
     • Local methods find the best alignable subsections of both sequences.
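The difference between the two approaches can be illustrated with a small dynamic-programming sketch. It assumes a toy scoring scheme (match +1, mismatch -1, linear gap -2) chosen only for illustration; the function name `align_score` is hypothetical and this is not how BLAST itself is implemented. The global variant follows the Needleman-Wunsch recurrence and the local variant the Smith-Waterman recurrence, which resets negative cells to zero.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Best global (Needleman-Wunsch) or local (Smith-Waterman) alignment score."""
    # First row: a global alignment pays for leading gaps, a local one does not.
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, prev[j] + gap, curr[j - 1] + gap)
            if local:
                cell = max(cell, 0)      # a local alignment may restart anywhere
                best = max(best, cell)   # and may end anywhere
            curr.append(cell)
        prev = curr
    # Global: score of aligning both sequences end to end.
    return best if local else prev[-1]


# Two toy sequences that share only the core "ACG".
print(align_score("TTTACGTTT", "GGGACGGGG", local=False))
print(align_score("TTTACGTTT", "GGGACGGGG", local=True))
```

The global score is dragged down by the unrelated flanks, while the local score reflects only the shared core, which is why a local method suits database searches where only part of a sequence may be related.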

  6. Sequence Similarity Searches using BLAST
     BLAST: Basic Local Alignment Search Tool
     Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. J Mol Biol. 1990 Oct 5;215(3):403-10.
     Statistical basis: Karlin, S. and Altschul, S.F. (1990) "Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes," Proceedings of the National Academy of Sciences USA 87, 2264-2268.

  7. Comparing a Genome to Other Genes and Genomes
     BLAST = Basic Local Alignment Search Tool
     • BLASTN: DNA sequence vs. DNA sequence db
     • BLASTP: protein sequence vs. protein sequence db
     • BLASTX: DNA sequence translated in 6 reading frames vs. protein sequence db
     • tBLASTX: DNA sequence translated in 6 reading frames vs. DNA sequence db translated in 6 frames
     • PSI-BLAST: iterative search
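The "translated in 6 reading frames" step used by BLASTX and tBLASTX can be sketched as follows. This is a minimal illustration, assuming the standard genetic code and unambiguous A/C/G/T input; the helper names (`reverse_complement`, `translate`, `six_frames`) are hypothetical and are not part of any BLAST distribution.

```python
# Standard genetic code, built from the conventional T, C, A, G codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna):
    """Return the reverse complement of an uppercase A/C/G/T string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons only; 'X' marks codons with unknown bases."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Return translations for frames +1, +2, +3, then -1, -2, -3."""
    dna = dna.upper()
    frames = []
    for strand in (dna, reverse_complement(dna)):
        for offset in range(3):
            frames.append(translate(strand[offset:]))
    return frames


for frame in six_frames("ATGGCCTAA"):
    print(frame)
```

Each of the six protein strings would then be compared against the protein database (BLASTX) or against the six-frame translation of each database sequence (tBLASTX).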

  8. Comparing a Genome to Other Genes and Genomes
     BLAST = Basic Local Alignment Search Tool
     1. Find a potential match in the database by finding a little seed (or seeds) of a match.
     2. Extend that seed and score the resulting alignment based on co-occurrence of amino acids (nucleotides) in “known” alignments.
     3. Determine whether the possible alignment looks better than you might expect by chance alone.
     4. Decide whether the match tells you anything about biology.
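The first two steps (seed, then extend) can be sketched in a few lines. This is a deliberately simplified illustration, not BLAST's actual procedure: it assumes exact word matches of length 3 as seeds, a +1/-1 match/mismatch score, and ungapped extension to the right only, stopping once the score drops a fixed amount below the best seen. Real BLASTP uses a substitution matrix, neighborhood words, and extension in both directions. The sequences and function names are hypothetical.

```python
from collections import defaultdict


def find_seeds(query, subject, k=3):
    """Return (query_pos, subject_pos) pairs where a k-letter word matches exactly."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    seeds = []
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], []):
            seeds.append((i, j))
    return seeds


def extend_right(query, subject, i, j, k=3, match=1, mismatch=-1, drop=3):
    """Extend a seed rightwards without gaps; stop when `drop` below the best score."""
    score = best = k * match
    best_len = k
    qi, sj = i + k, j + k
    while qi < len(query) and sj < len(subject) and best - score < drop:
        score += match if query[qi] == subject[sj] else mismatch
        qi += 1
        sj += 1
        if score > best:
            best, best_len = score, qi - i
    return best, query[i:i + best_len]


query, subject = "NYALLPWMTAY", "QNELLPWRNVQ"   # illustrative fragments
seeds = find_seeds(query, subject)
print(seeds)
print(extend_right(query, subject, *seeds[0]))
```

Indexing the query once and scanning the database for shared words is what makes the search fast: only database regions that contain a seed are ever extended and scored.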

  9. Find a potential match in the database by finding a little seed (or seeds) of a match.
     [Figure: a short query shown next to a much larger db]
     Your query is small relative to the universe of known sequences.

  10. Step 2: Extend the seed and score the resulting alignment based on co-occurrence of amino acids (nucleotides) in “known” alignments.
      [Figure: sequence fragments in which the word LLPW occurs twice; residues shown: N Y A L L P W M T A Y E N V Y L A V D V F Q N E L L P W R N V Q D N V A F G]

  11. How does BLASTP score an alignment?
      Substitution matrix based on co-occurrence in related proteins
      BLOSUM = BLOcks SUbstitution Matrix
      • Identify gap-free protein alignments in the BLOCKS database.
      • BLOSUM# corresponds to % identity for inclusion.
      • Count co-occurrence of amino acids.
      • Calculate log-odds.
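The "count co-occurrence, then calculate log-odds" steps can be sketched as below. This is a toy version under stated assumptions: it counts every residue pair in every column of gap-free blocks, estimates background frequencies from those same pairs, and reports scores in half-bit units (2 × log2 of observed over expected). It omits the identity-based clustering and weighting that gives each BLOSUM matrix its number, and the function name and input block are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def log_odds_scores(blocks):
    """Toy BLOSUM-style log-odds scores (half-bit units) from gap-free blocks."""
    # 1. Count how often each unordered residue pair shares an alignment column.
    pair_counts = Counter()
    for block in blocks:
        for column in zip(*block):
            for a, b in combinations(column, 2):
                pair_counts[tuple(sorted((a, b)))] += 1
    total = sum(pair_counts.values())
    observed = {pair: n / total for pair, n in pair_counts.items()}

    # 2. Background frequency of each residue, derived from the pair frequencies.
    background = Counter()
    for (a, b), freq in observed.items():
        if a == b:
            background[a] += freq
        else:
            background[a] += freq / 2
            background[b] += freq / 2

    # 3. Log-odds: observed pair frequency vs. the frequency expected by chance.
    scores = {}
    for (a, b), freq in observed.items():
        if a == b:
            expected = background[a] ** 2
        else:
            expected = 2 * background[a] * background[b]
        scores[(a, b)] = round(2 * math.log2(freq / expected))
    return scores


# One tiny made-up block of three aligned two-residue sequences.
print(log_odds_scores([["AC", "AC", "AD"]]))
```

A positive score means the pair is seen together in related proteins more often than chance predicts; a negative score means it is seen less often.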

  12. How does BLASTP score an alignment?
      Substitution matrix based on co-occurrence in related proteins
      In BLOSUM62, the 62 means that contributions from proteins more than 62% identical are weighted to sum to one. Other matrices are available for comparisons of more or less divergent proteins.

  13. How does BLASTP score an alignment?
      Walk through the alignment and add up the score:
      Query: AFGECDA
             AF  C+A
      Sbjct: AFAFCEA
      4+6+0+(-3)+9+2+4 = 22
      Normalize → bit score
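The walkthrough in slide 13 can be reproduced with a short sketch. The dictionary below holds only the six BLOSUM62 entries needed for this one example (a real search would load the full 20 × 20 matrix), and the names `BLOSUM62_SUBSET`, `pair_score`, and `raw_score` are hypothetical.

```python
# Only the BLOSUM62 entries this example needs, keyed by sorted residue pair.
BLOSUM62_SUBSET = {
    ("A", "A"): 4,
    ("F", "F"): 6,
    ("A", "G"): 0,
    ("E", "F"): -3,
    ("C", "C"): 9,
    ("D", "E"): 2,
}


def pair_score(a, b):
    """Look up the substitution score for one aligned residue pair."""
    return BLOSUM62_SUBSET[tuple(sorted((a, b)))]


def raw_score(query, subject):
    """Sum the substitution scores over an ungapped alignment."""
    return sum(pair_score(a, b) for a, b in zip(query, subject))


print(raw_score("AFGECDA", "AFAFCEA"))  # 4+6+0+(-3)+9+2+4 = 22
```

The result is the raw score S; converting it to a bit score requires the statistical parameters of the scoring system, which is the normalization step the slide refers to.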

  14. Statistics of BLAST when no gaps are allowed
      • E is the number of matches expected to occur with a score as good as S just by random chance when you search a sequence the size of your query against a database as large as the one you chose (lengths m and n). It tends to follow an extreme value distribution with parameters K and lambda.
      • Simulation is used to estimate K and lambda for gapped BLAST.
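In the Karlin-Altschul framework, these quantities are related by E = K·m·n·e^(−λS), and the bit score is S' = (λS − ln K) / ln 2. The sketch below assumes λ ≈ 0.267 and K ≈ 0.041, values commonly reported for gapped BLASTP with BLOSUM62 and gap costs 11/1; treat them as assumptions rather than values taken from the slides. The query and database lengths are made up, and real BLAST additionally applies edge-effect corrections to m and n that are ignored here.

```python
import math

# Assumed parameters (gapped BLOSUM62, gap open 11, gap extend 1).
LAMBDA = 0.267
K = 0.041


def bit_score(raw):
    """Normalize a raw score S into a bit score S'."""
    return (LAMBDA * raw - math.log(K)) / math.log(2)


def expect(raw, query_len, db_len):
    """Expected number of chance hits scoring at least `raw`."""
    return K * query_len * db_len * math.exp(-LAMBDA * raw)


# Raw score 2619 from the prolyl-tRNA synthetase hit shown in the slides.
print(bit_score(2619))

# Hypothetical search: the same raw score against databases of two sizes.
print(expect(60, query_len=572, db_len=1_000_000))
print(expect(60, query_len=572, db_len=2_000_000))
```

Under these assumptions the bit score for raw score 2619 comes out close to the 1013 bits reported in the slides. Because E scales linearly with the search space m·n, the same alignment receives a larger E value in a larger database.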

  15. How good is your BLAST hit?
      • The number of matches (E) expected to occur with a score as good as S just by random chance

      >gi|77630012|ref|ZP_00792598.1| COG0442: Prolyl-tRNA synthetase [Yersinia pseudotuberculosis IP 31758]
      Length=572
      Score = 1013 bits (2619), Expect = 0.0, Method: Composition-based stats.
      Identities = 498/572 (87%), Positives = 537/572 (93%), Gaps = 0/572 (0%)

      Query  1    MRTSQYMLSTLKETPADAEVISHQLMLRAGMIRKLASGLYTWLPTGLRVLRKVENIVREE  60
                  MRTSQY+LST KETPADAEVISHQLMLRAGMIRKLASGLYTWLPTG+RVL+KVENIVREE
      Sbjct  1    MRTSQYLLSTQKETPADAEVISHQLMLRAGMIRKLASGLYTWLPTGVRVLKKVENIVREE  60

  16. Search one protein against a given database and most of the E values are zero

  17. Search one protein against a given database and most of the E values are zero Search the protein encoded by the gene next to it in the genome against the same database and all the E values are much higher.

  18. Search the same protein against two different databases and the E value is different for the same hit.

  19. InterPro

  20. InterPro release 16.0 contains

      Database      All signatures   Integrated signatures
      PANTHER       30128            2061
      Pfam          8957             8957
      PIRSF         1748             1499
      PRINTS        1900             1898
      ProDom        3538             1041
      PROSITE       1319             1319
      SMART         724              721
      TIGRFAMs      2949             2933
      Gene3D        2147             783
      SUPERFAMILY   1538             463

      15045 entries:
      Active sites    34
      Binding sites   22
      Domains         4676
      Families        10060
      PTMs            18
      Repeats         235

  21. InterPro: www interface

  22. Sample InterPro Family

  23. Pfam
      • Database of protein domains and families, available as multiple alignments and HMMs
      • Pfam-A is curated. Pfam-B is automated.

  24. A sample Pfam: MCPsignal

  25. Domain Structure of Members

  26. Pfam: Seed Alignment

  27. HMM-Logo Plots

  28. Pfam – scoring members
      • Trusted cut-off: bit score for the lowest-scoring match included in the full alignment
      • Noise cut-off: bit score for the highest-scoring match not included in the full alignment
      • Gathering cut-off
