
Welcome to Introduction to Bioinformatics



Presentation Transcript


  1. Welcome to Introduction to Bioinformatics • I. Scenario 4: Sequence alignment • Bring up course web site • Go to Scenario 4 • Open the first sequence alignment notes

  2. Scenario 3: Our Story You: Our first defense at CDC Outbreak: . . . Anthrax? Samples: • Confirm agent • Identify strain

  3. Toxin gene-specific primers Scenario 3: Our Story

  4. PCR Scenario 3: Our Story If DNA from bacterium with toxin gene If DNA NOT from bacterium with toxin gene?

  5. PCR Scenario 3: Our Story If DNA from bacterium with toxin gene If DNA NOT from bacterium with toxin gene? (no product)

  6. Scenario 3: Our Story DG47
     AATATTGACGCTTTACTACATCAGTCCATCGGAAGTACGTTGTATAATAAAATATATCTGTATG
     >gi|16031490|emb|AJ413935.1|BAN413935 Bacillus anthracis partial lef gene, isolate Microsoft-6259
     Length = 2417
     Score = 155 bits (78), Expect = 2e-35
     Identities = 138/158 (87%)
     Strand = Plus / Plus
     Query: 1    aatattgacgctttactacatcagtccatcggaagtacgttgtataataaaatatatctg 60
                 ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
     Sbjct: 1267 aatattgacgctttactacatcagtccatcggaagtacgttgtataataaaatatatctg 1326
     Query: 61   tatgaaaacatgaatataaataacttaacagcaacgttaggtgccgatttagtagattcc 120
                 ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
     Sbjct: 1327 tatgaaaacatgaatataaataacctaacagcaacgttaggtgccgatttagtagattcc 1386
     Query: 121  acagataatacaaaaattaatcgaggtatattcaatga 158
                 ||||||||||||||||||||||||||||||||||||||
     Sbjct: 1387 acagataatacaaaaattaatcgaggtatattcaatga 1424

  7. PCR Toxin gene present Scenario 3: Our Story

  8. Scenario 3: Our Story DG47: AATATTGACGCTTTACTACATCAGTCCATCGGAAGTACGTTGTATAATAAAATATATCTGTATG Do it!

  9. Scenario 3: Our Story Maybe it’s not from the toxin gene??

  10. Scenario 3: Our Story DG47: AATATTGACGCTTTACTACATCAGTCCATCGGAAGTACGTTGTATAATAAAATATATCTGTATG Translate: NIDALLHQSIGSTLYNKIYLYENMNINNLTATLGADLVDSTDNTKINRGIFNEFKKNFKYSIS Do it!

  11. DG47 nucleotide sequence: matches nothing in GenBank. DG47 amino acid sequence: 100% match to toxin gene.
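The contrast on this slide can be checked with a few lines of Python. This is a minimal sketch, not part of the course materials: `translate` and `CODON_TABLE` are hypothetical names, and it assumes the standard genetic code read in frame 1. It translates the first 60 bases of DG47 and the corresponding lef segment (positions 1831-1890 in the hand alignment on slide 13). The two nucleotide strings differ, yet they encode the same 20 amino acids.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: frame 1).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA string codon by codon; unknown codons become 'X'."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3)
    )


# First 60 bases of each sequence, copied from the hand alignment (slide 13).
DG47_START = "AATATTGACGCTTTACTACATCAGTCCATCGGAAGTACGTTGTATAATAAAATATATCTG"
LEF_START = "AATATTGATGCTTTATTACATCAATCCATTGGAAGTACCTTGTACAATAAAATTTATTTG"

if __name__ == "__main__":
    print(translate(DG47_START))
    print(translate(LEF_START))
```

Both calls print `NIDALLHQSIGSTLYNKIYL`, which is also how the translation shown on slide 10 begins.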

  12. Scenario 3: Our Story Compare nucleotide sequences by hand: DG47 vs lef. Do it!

  13. Scenario 3: Our Story Compare nucleotide sequences by hand
      DG47        1  AATATTGACGCTTTACTACATCAGTCCATCGGAAGTACGTTGTATAATAAAATATATCTG
                     |||||||| |||||| ||||||| ||||| |||||||| ||||| |||||||| ||| ||
      lef gene 1831  AATATTGATGCTTTATTACATCAATCCATTGGAAGTACCTTGTACAATAAAATTTATTTG
      DG47       61  TATGAAAACATGAATATAAATAACTTAACAGCAACGTTAGGTGCCGATTTAGTAGATTCC
                     |||||||| |||||||| |||||| | ||||||||  ||||||| |||||||| ||||||
      lef gene 1891  TATGAAAATATGAATATCAATAACCTTACAGCAACCCTAGGTGCGGATTTAGTTGATTCC
      DG47      121  ACAGATAATACAAAAATTAATCGAGGTATATTCAATGAGTTCAAAAAAAATTTCAAATAC
                     || |||||||| ||||||||| ||||||| |||||||| |||||||||||||||||||||
      lef gene 1951  ACTGATAATACTAAAATTAATAGAGGTATTTTCAATGAATTCAAAAAAAATTTCAAATAT
      DG47      181  AGTATTTCTA
                     ||||||||||
      lef gene 2011  AGTATTTCTA
      89% identical!
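The by-hand comparison amounts to counting matching columns in an ungapped alignment. The sketch below is an illustration only (`match_line` and `percent_identity` are hypothetical helper names). It uses just the first 60-base block of the alignment, so its result is for that block and not the 89% quoted for the full 190 bases.

```python
def match_line(seq_a, seq_b):
    """Return the '|' / ' ' line drawn between two equal-length sequences."""
    return "".join("|" if a == b else " " for a, b in zip(seq_a, seq_b))


def percent_identity(seq_a, seq_b):
    """Return (matching columns, percent identity) for an ungapped alignment."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return matches, 100.0 * matches / len(seq_a)


# First block of the hand alignment: DG47 1-60 against lef 1831-1890.
DG47_BLOCK = "AATATTGACGCTTTACTACATCAGTCCATCGGAAGTACGTTGTATAATAAAATATATCTG"
LEF_BLOCK = "AATATTGATGCTTTATTACATCAATCCATTGGAAGTACCTTGTACAATAAAATTTATTTG"

if __name__ == "__main__":
    matches, pct = percent_identity(DG47_BLOCK, LEF_BLOCK)
    print(DG47_BLOCK)
    print(match_line(DG47_BLOCK, LEF_BLOCK))
    print(LEF_BLOCK)
    print(f"{matches} of {len(DG47_BLOCK)} columns match ({pct:.1f}%)")
```

For this block, 52 of the 60 columns match (86.7%).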

  14. Scenario 3: Our Story Compare nucleotide sequences by hand: DG47 + lef gene
      AATATTGACGCTTTACTACATCAGTCCATCGGAAGTACGTTGTATAATAAAATATATCTGTATG
      Sequence 1 lcl|PCR Product DG47 Length 190
      Sequence 2 lcl|M29081: Bacillus anthracis lethal factor (lef) gene, 1831-2020. Length 190
      No significant similarity was found

  15. Scenario 3: Our Story
      DG47        1  AATATTGACGCTTTACTACATCAGTCCATCGGAAGTACGTTGTATAATAAAATATATCTG
                     |||||||| |||||| ||||||| ||||| |||||||| ||||| |||||||| ||| ||
      lef gene 1831  AATATTGATGCTTTATTACATCAATCCATTGGAAGTACCTTGTACAATAAAATTTATTTG
      89% identical!
      Sequence 1 lcl|PCR Product DG47 Length 190
      Sequence 2 lcl|M29081: Bacillus anthracis lethal factor (lef) gene, 1831-2020. Length 190
      No significant similarity was found
      Why can’t Blast figure out what you can plainly see?

  16. Scenario 3: How does Blast work? Clearly we need to understand more about how sequence alignment really works! • Theory behind nucleotide vs nucleotide Blast • Working BlastN program • Theory behind protein-protein Blast • How to get Blast to do what you want

  17. “Flavours” of sequence alignment. Global Alignment (Needleman-Wunsch algorithm) • Compares two sequences across their whole length • Mostly useful only when you already know the sequences might be similar • Not useful for comparing a short query to an entire genome • Not discussed further in this class. Local Alignment • Allows alignment of subsequences of the target and the query • Usually what we want; the query can be searched against entire genomes or large databases.

  18. Crude Local Alignment Methods: The “Dot Matrix” method (Gibbs and McIntyre, 1970). Represents the query and target sequences as a matrix (a two-dimensional array), using a sliding window of similarity. The human eye can powerfully distinguish the identity line from the noise.

  19. The “Dot Matrix” method (Gibbs and McIntyre, 1970). Normally a “window size” and a “stringency” are specified; e.g., if the window size is 8 and the stringency is 6, a dot is placed only if at least 6 of the current 8 positions in the query match the target.
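The window and stringency rule can be sketched in Python as follows. This is an illustration under stated assumptions, not the course's own program: `dot_matrix` and `render` are hypothetical names, and a dot is recorded at the start position of each window that passes the threshold.

```python
def dot_matrix(query, target, window=8, stringency=6):
    """Return (query_pos, target_pos) pairs where a dot should be placed.

    A dot is placed when at least `stringency` of the `window` positions
    starting at query_pos and target_pos are identical.
    """
    dots = []
    for i in range(len(query) - window + 1):
        for j in range(len(target) - window + 1):
            matches = sum(query[i + k] == target[j + k] for k in range(window))
            if matches >= stringency:
                dots.append((i, j))
    return dots


def render(query, target, dots):
    """Draw the matrix as text: query across the top, target down the side."""
    placed = set(dots)
    lines = ["  " + " ".join(query)]
    for j, base in enumerate(target):
        row = ["*" if (i, j) in placed else "." for i in range(len(query))]
        lines.append(base + " " + " ".join(row))
    return "\n".join(lines)


if __name__ == "__main__":
    seq = "GGTAATA"
    print(render(seq, seq, dot_matrix(seq, seq, window=2, stringency=2)))
```

With window = 2 and stringency = 2, comparing GGTAATA with itself gives the identity diagonal plus two off-diagonal dots where the repeated word TA matches itself. Every window pair is examined, so the work grows with the product of the two sequence lengths.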

  20. The “Dot Matrix” method (Gibbs and McIntyre, 1970). [Figure: dot matrix of GGTAATA against GGTAATA; window = 2, stringency = 2]

  21. Problems with the Dot Matrix method • Requires human supervision! • A memory and processor time pig (a complete m*n matrix is calculated each time) • No explicit handling of gaps • No good quantitative score of alignment quality

  22. The Smith-Waterman Algorithm (no gaps version). Match Extension = +1; NoMatch Penalty = -2. Negative values are reset to zero!! [Figure: scoring matrix with GGTAATA across the top and GGTACTA down the side] Download SmithWaterman1.py
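The no-gaps rule can be written out in a few lines. The sketch below is an independent illustration and does not reproduce SmithWaterman1.py, whose contents are not shown here. `sw_no_gaps` is a hypothetical name, and placing the query across the columns and the target down the rows is an assumption. Each cell extends the diagonal neighbour by +1 for a match or -2 for a mismatch, and negative values are reset to zero.

```python
def sw_no_gaps(query, target, match=1, mismatch=-2):
    """Fill a Smith-Waterman matrix that only allows diagonal moves.

    Returns (matrix, best_score, best_position). Row 0 and column 0 are a
    zero border, so matrix[i][j] scores target[i-1] against query[j-1].
    """
    rows, cols = len(target) + 1, len(query) + 1
    matrix = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if target[i - 1] == query[j - 1] else mismatch
            # Extend the diagonal run; negative values are reset to zero.
            matrix[i][j] = max(0, matrix[i - 1][j - 1] + step)
            if matrix[i][j] > best:
                best, best_pos = matrix[i][j], (i, j)
    return matrix, best, best_pos


if __name__ == "__main__":
    matrix, best, pos = sw_no_gaps("GGTAATA", "GGTACTA")
    for row in matrix[1:]:
        print(" ".join(str(v) for v in row[1:]))
    print("best score", best, "at", pos)
```

For GGTAATA against GGTACTA the main diagonal climbs 1, 2, 3, 4, drops to 2 at the A/C mismatch, then recovers to 3 and 4, so the best score is 4.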

  23. Smith Waterman – Dynamic Programming An optimal alignment can be found starting from the highest scoring box and working backwards. Dynamic Programming is a method for recording the solutions to subproblems, then working backwards to find an overall solution. If we incorporate gaps, we must start keeping track of this “traceback” pathway.

  24. The Smith-Waterman Algorithm (with gaps). Match Extension = +1; NoMatch Penalty = -2; Gap Penalty = -3. Take the max of: 0; adding Query Gap; adding Target Gap; Match/No match. [Figure: scoring matrix with GGTAATA across the top and GGTACTA down the side] Download SmithWaterman2.py
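A gapped version, with the traceback mentioned on the previous slide, might look like the sketch below. It is an independent illustration and not the contents of SmithWaterman2.py. `smith_waterman` is a hypothetical name, the gap penalty is assumed to be a flat -3 per gapped position, and ties in the traceback are broken in favour of the diagonal move.

```python
def smith_waterman(query, target, match=1, mismatch=-2, gap=-3):
    """Local alignment with gaps; returns (score, aligned_query, aligned_target)."""
    rows, cols = len(target) + 1, len(query) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if target[i - 1] == query[j - 1] else mismatch
            H[i][j] = max(
                0,                      # start a fresh local alignment
                H[i - 1][j - 1] + step, # match / no match
                H[i - 1][j] + gap,      # gap in the query
                H[i][j - 1] + gap,      # gap in the target
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Traceback: start at the highest-scoring box and work backwards
    # until a zero is reached.
    i, j = best_pos
    aligned_q, aligned_t = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        step = match if target[i - 1] == query[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + step:
            aligned_q.append(query[j - 1])
            aligned_t.append(target[i - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            aligned_q.append("-")
            aligned_t.append(target[i - 1])
            i -= 1
        else:
            aligned_q.append(query[j - 1])
            aligned_t.append("-")
            j -= 1
    return best, "".join(reversed(aligned_q)), "".join(reversed(aligned_t))


if __name__ == "__main__":
    score, q, t = smith_waterman("TGCATCGGAA", "TGCATACGGAA")
    print(score)
    print(q)
    print(t)
```

In the usage example the target carries one extra base, so the best local alignment pays one gap penalty to join two runs of five matches (5 - 3 + 5 = 7). The full matrix is still filled in, which is the cost the next slide complains about.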

  25. Problems with Smith-Waterman: Still a pig! Memory and processor time requirements are huge when the query and/or the database gets large (a complete m*n matrix is still calculated each time!). Do we really need to calculate the whole matrix?

  26. BlastN – “word” based heuristics. Notice that in a typical S-W matrix, most of the boxes are empty! What if we find exact matches of some seed words, then work only in the area surrounding these seeds, trying to extend the alignment? This is exactly the heuristic that Blast employs to avoid calculating the whole matrix! (see figure on page 6 of Alignment notes)

  27. BlastN Procedure • Filter the query sequence for repetitive “low complexity” sequences • Identify the subsequences of size “word” in the query • Find the exact matches in the target of all the words • Use a modified S-W to extend the hits around the seed words • Score and report on the best matches. More on scoring next class!
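The seed-and-extend idea can be sketched as below. This is a simplified illustration and not BLAST's actual implementation. The low-complexity filter is omitted. The extension is ungapped and stops once the running score falls `drop` below the best score seen, which stands in for the modified S-W step. `find_seeds`, `extend_seed`, `search`, the `drop` parameter and the default word size of 11 are all assumptions made for this sketch.

```python
from collections import defaultdict


def find_seeds(query, target, word=11):
    """Return (query_pos, target_pos) pairs where a word matches exactly."""
    index = defaultdict(list)
    for j in range(len(target) - word + 1):
        index[target[j:j + word]].append(j)
    seeds = []
    for i in range(len(query) - word + 1):
        for j in index.get(query[i:i + word], []):
            seeds.append((i, j))
    return seeds


def extend_seed(query, target, i, j, word, match=1, mismatch=-2, drop=4):
    """Extend a seed without gaps in both directions.

    Returns (score, query_start, query_end, target_start); query_end is
    exclusive. Extension stops when the score falls `drop` below the best.
    """
    score = best = word * match
    q_end = i + word
    q, t = i + word, j + word
    while q < len(query) and t < len(target) and best - score < drop:
        score += match if query[q] == target[t] else mismatch
        q, t = q + 1, t + 1
        if score > best:
            best, q_end = score, q

    score = best  # continue leftwards from the trimmed right-hand extension
    q_start = i
    q, t = i - 1, j - 1
    while q >= 0 and t >= 0 and best - score < drop:
        score += match if query[q] == target[t] else mismatch
        if score > best:
            best, q_start = score, q
        q, t = q - 1, t - 1
    return best, q_start, q_end, j - (i - q_start)


def search(query, target, word=11):
    """Seed, extend, and report distinct hits, best score first."""
    hits = {
        extend_seed(query, target, i, j, word)
        for i, j in find_seeds(query, target, word)
    }
    return sorted(hits, reverse=True)


if __name__ == "__main__":
    print(search("ACGTTGCA", "TTACGTTGCATT", word=4))
```

Only the diagonals that contain a seed are ever scored, so most of the m*n matrix is never computed. The price is that two sequences sharing no exact word of the chosen size produce no seeds, and therefore no hits, however similar they look to the eye.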
