CS 5263 Bioinformatics

caron

  1. CS 5263 Bioinformatics Lecture 3: Dynamic Programming and Sequence Alignment

  2. Roadmap • Review of last lecture • Biology • Dynamic programming • Sequence alignment

  3. Protein zoom-in • Composed of a chain of amino acids • Chain: H2N (amino group, N-terminal), side chains R R R R R R …, COOH (carboxyl group, C-terminal) • Single amino acid: H2N-CH(R)-COOH, where R is the side chain

  4. Genome, Chromosome, Gene

  5. DNA Replication • The process of copying a double-stranded DNA molecule • Semi-conservative • 5’-ACATGATAA-3’ / 3’-TGTACTATT-5’ → two copies, each 5’-ACATGATAA-3’ / 3’-TGTACTATT-5’

  6. Transcription (where genetic information is stored) • DNA-RNA pair: • A=U, C=G • T=A, G=C (for making mRNA) Coding strand: 5’-ACGTAGACGTATAGAGCCTAG-3’ Template strand: 3’-TGCATCTGCATATCTCGGATC-5’ mRNA: 5’-ACGUAGACGUAUAGAGCCUAG-3’ Coding strand and mRNA have the same sequence, except that T’s in DNA are replaced by U’s in mRNA.
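The pairing rules on this slide can be sketched in a few lines of Python. The function names `template_strand` and `transcribe` are illustrative and do not come from the lecture; the sequences are the ones shown on the slide.

```python
# Watson-Crick pairing used to derive the template strand from the coding strand.
DNA_COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def template_strand(coding: str) -> str:
    """Position-by-position complement, i.e. the template strand read 3'->5'."""
    return "".join(DNA_COMPLEMENT[base] for base in coding)

def transcribe(coding: str) -> str:
    """The mRNA equals the coding strand with every T replaced by U."""
    return coding.replace("T", "U")

coding = "ACGTAGACGTATAGAGCCTAG"
print(template_strand(coding))  # TGCATCTGCATATCTCGGATC
print(transcribe(coding))       # ACGUAGACGUAUAGAGCCUAG
```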

  7. The Genetic Code (figure: codon table; surviving axis label: Third letter)

  8. Translation • The sequence of codons is translated to a sequence of amino acids • Gene: -GCT TGT TTA CGA ATT- • mRNA: -GCU UGU UUA CGA AUU- • Peptide: -Ala-Cys-Leu-Arg-Ile- • Start codon: AUG (also codes for Met) • Stop codons: UGA, UAA, UAG
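As a rough sketch of the translation step, the Python below reads an mRNA three letters at a time. `CODON_TABLE` is deliberately partial: it holds only the codons mentioned on this slide plus the three standard stop codons, and `translate` is an illustrative name.

```python
# Partial codon table: only the codons used in this example.
CODON_TABLE = {
    "GCU": "Ala", "UGU": "Cys", "UUA": "Leu", "CGA": "Arg", "AUU": "Ile",
    "AUG": "Met",
    "UGA": "Stop", "UAA": "Stop", "UAG": "Stop",
}

def translate(mrna: str) -> list:
    """Translate codon by codon, stopping at the first stop codon."""
    peptide = []
    usable = len(mrna) - len(mrna) % 3  # ignore a trailing partial codon
    for i in range(0, usable, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "Stop":
            break
        peptide.append(amino_acid)
    return peptide

print("-".join(translate("GCUUGUUUACGAAUU")))  # Ala-Cys-Leu-Arg-Ile
```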

  9. Dynamic programming • What is dynamic programming? • Solve an optimization problem by tabulating sub-problem solutions (memoization) rather than re-computing them

  10. Elements of dynamic programming • Optimal sub-structures • Optimal solutions to the original problem contain optimal solutions to sub-problems • Solutions to sub-problems are independent • Overlapping sub-problems • Some sub-problems appear in many solutions • We should not solve any sub-problem more than once • Memoization and reuse • Carefully choose the order in which sub-problems are solved • Tabulate the solutions • Bottom-up

  11. Example • Find the shortest path in a grid 2 3 1 (0,0) s 1 5 1 1 3 3 2 3 3 2 2 2 1 1 2 1 2 3 4 g (3,3)

  12. Optimal substructure • If a path P(s,g) is optimal, any sub-path P(s,x), where x is on P(s,g), is also optimal • Proof by contradiction • Suppose P(s,x) is not the shortest path from s to x, i.e., there is a path P’(s,x) with P’(s,x) < P(s,x) • Construct a new path P’(s,g) = P’(s,x) + P(x,g) • Then P’(s,g) < P(s,g), so P(s,g) is not the shortest • Contradiction

  13. Overlapping sub-problems • Some sub-problems are used by many paths (0,0) -> (2,0) used by 3 paths

  14. Memoization and reuse • Easy to tabulate and reuse • Number of sub-problems ~ number of nodes • P(s, x), for x in all nodes except s and g • Find an order such that no sub-problems need to be recomputed • First compute the smallest sub-problems • Use solutions of small sub-problems to solve large sub-problems
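The tabulation order described here can be sketched in Python. This sketch assumes paths move only right or down. The edge weights below are hypothetical and are not the ones from the lecture figure. `h[i][j]` is the weight of the edge from node (i, j) to (i, j+1), and `v[i][j]` the weight from (i, j) to (i+1, j); these names and `grid_shortest_path` are illustrative.

```python
def grid_shortest_path(h, v):
    """Fill a table of shortest distances from the top-left node, row by row."""
    rows, cols = len(h), len(v[0])
    dist = [[0] * cols for _ in range(rows)]
    # Smallest sub-problems first: nodes reachable in only one way.
    for j in range(1, cols):
        dist[0][j] = dist[0][j - 1] + h[0][j - 1]
    for i in range(1, rows):
        dist[i][0] = dist[i - 1][0] + v[i - 1][0]
    # Every other node reuses the already tabulated left and upper neighbours.
    for i in range(1, rows):
        for j in range(1, cols):
            from_left = dist[i][j - 1] + h[i][j - 1]
            from_above = dist[i - 1][j] + v[i - 1][j]
            dist[i][j] = min(from_left, from_above)
    return dist

# Hypothetical 3 x 3 grid of nodes.
h = [[2, 3], [1, 1], [2, 1]]   # horizontal edge weights, one list per row
v = [[1, 5, 1], [3, 3, 2]]     # vertical edge weights between consecutive rows
for row in grid_shortest_path(h, v):
    print(row)
```

Each node is computed once from two stored values, which is where the O(n^2) cost of the DP comes from.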

  15-26. Example: shortest path (animation frames) • The table of shortest distances from s is filled in one node at a time: first the first row and first column, then the remaining nodes row by row • Final distances, read from the last frame: row 0: 0 2 5 6 • row 1: 1 2 3 6 • row 2: 4 4 6 8 • row 3: 5 5 7 10 • Shortest distance from s to g: 10

  27. Analysis • For an n x n grid • Enumeration: • Number of paths = (2n)! / (n!)^2 • Each path has 2n steps • Total operations: 2n * (2n)! / (n!)^2, which is exponential in n (roughly 2^(2n)) • Recursive call: O(2^(2n)) • DP: O(n^2)
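To get a feel for these growth rates, the sketch below compares the enumeration cost with the size of the DP table. It reads "n x n grid" as n moves right and n moves down, so the table has (n+1)^2 nodes; that reading and the function names are assumptions.

```python
from math import comb

def num_paths(n: int) -> int:
    """Number of right/down paths: (2n)! / (n!)^2, i.e. C(2n, n)."""
    return comb(2 * n, n)

def enumeration_ops(n: int) -> int:
    """Every path has 2n steps, so enumeration touches 2n * C(2n, n) edges."""
    return 2 * n * num_paths(n)

def dp_table_size(n: int) -> int:
    """The DP stores one value per node."""
    return (n + 1) ** 2

for n in (3, 10, 20):
    print(n, num_paths(n), enumeration_ops(n), dp_table_size(n))
```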

  28. Example: Fibonacci Seq • F(n) = F(n-1) + F(n-2), F(0) = F(1) = 1 Function fib(n) if (n == 0 or n == 1) return 1; else return fib(n-1) + fib(n-2);

  29. Time complexity: O(1.62^n)

  30. Example: Fibonacci Seq function fib(n) F[0] = 1; F[1] = 1; for i = 2 to n F[i] = F[i-1] + F[i-2]; end return F[n];

  31. Time: O(n), space: O(n)
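A small Python sketch makes the contrast between the two versions concrete. It uses the lecture's convention F(0) = F(1) = 1; the call counter and the function names are additions for illustration.

```python
def fib_naive(n: int, counter: list) -> int:
    """Plain recursion; counter[0] records how many calls were made."""
    counter[0] += 1
    if n < 2:
        return 1
    return fib_naive(n - 1, counter) + fib_naive(n - 2, counter)

def fib_table(n: int) -> int:
    """Bottom-up tabulation: O(n) time and O(n) space."""
    F = [1] * (n + 1)          # F[0] = F[1] = 1
    for i in range(2, n + 1):
        F[i] = F[i - 1] + F[i - 2]
    return F[n]

calls = [0]
print(fib_naive(10, calls), calls[0])  # 89 177
print(fib_table(10))                   # 89
```

The naive version makes 177 calls for n = 10, while the table version performs 9 additions.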

  32. What if it is not so easy to figure out an order to fill in the table? • Exercise
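One common answer, offered here as a sketch and not as the lecture's own solution to the exercise, is top-down memoization: write the recursion directly and cache each result, so sub-problems are solved on demand in whatever order the recursion reaches them. The example uses Python's `functools.lru_cache` on the Fibonacci recurrence with F(0) = F(1) = 1.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n: int) -> int:
    """Recursive definition; the cache ensures each n is computed only once."""
    if n < 2:
        return 1
    return fib(n - 1) + fib(n - 2)

print(fib(30))  # 1346269
```

No fill order has to be chosen by hand, at the cost of recursion overhead and a recursion-depth limit for large n.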

  33. Today’s lecture • Sequence alignment • Global alignment

  34. Why seq alignment? • Similar sequences often have similar origin or function • Two genes are said to be homologous if they share a common evolutionary history • Evolutionary history can tell us a lot about properties of a given gene • Homology can be inferred from similarity between the genes • New protein sequences are routinely compared to sequence databases to search for proteins with the same or similar functions • Among the most widely used computational tools in biology

  35. Evolution at the DNA level C …ACGGTGCAGTCACCA… …ACGTTGC-GTCCACCA… Sequence edits: Mutation, deletion, insertion

  36. Evolutionary Rates (figure labels: next generation; OK, OK, OK, X, X; Still OK?)

  37. Sequence conservation implies function

  38. Sequence Alignment • S: AGGCTATCACCTGACCTCCAGGCCGATGCCC • T: TAGCTATCACGACCGCGGTCGATTTGCCCGAC • S’: -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- • T’: TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC • Definition • An alignment of two strings S, T is a pair of strings S’, T’ (with spaces, shown as -) s.t. • |S’| = |T’| (where |S| denotes the length of S), and • removing all spaces in S’, T’ leaves S, T
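The definition translates directly into a check. The sketch below assumes spaces are written as `-`, as on the slide; `is_alignment` is an illustrative name.

```python
def is_alignment(s: str, t: str, s_aln: str, t_aln: str, gap: str = "-") -> bool:
    """True if (s_aln, t_aln) is an alignment of (s, t) under the slide's definition."""
    same_length = len(s_aln) == len(t_aln)
    restores_s = s_aln.replace(gap, "") == s
    restores_t = t_aln.replace(gap, "") == t
    return same_length and restores_s and restores_t

S = "AGGCTATCACCTGACCTCCAGGCCGATGCCC"
T = "TAGCTATCACGACCGCGGTCGATTTGCCCGAC"
S_ALN = "-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---"
T_ALN = "TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC"
print(is_alignment(S, T, S_ALN, T_ALN))  # True
```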

  39. What is a good alignment? Alignment: The “best” way to match the letters of one sequence with those of the other How do we define “best”?

  40. S’: -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- T’: TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC • The score of aligning (characters or spaces) x and y is σ(x,y) • Score of an alignment: Σ_{i=1}^{|S’|} σ(S’[i], T’[i]) • An optimal alignment: one with max score

  41. Scoring Function • Sequence edits: AGGCCTC • Mutations AGGACTC • Insertions AGGGCCTC • Deletions AGG-CTC Scoring Function: Match: +m ~~~AAC~~~ Mismatch: -s ~~~A-A~~~ Gap (indel): -d

  42. More complex scoring function • Substitution matrix • Similarity score of matching two letters a, b should reflect the probability of a, b derived from same ancestor • It is usually defined by log likelihood ratio (Durbin book) • Active research area. Especially for proteins. • Commonly used: PAM, BLOSUM

  43. An example substitution matrix

  44. Match = 2, mismatch = -1, gap = -1 • Score = 3 x 2 – 2 x 1 – 1 x 1 = 3
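The arithmetic on this slide can be reproduced with a short scoring function. The aligned pair below is hypothetical, because the alignment shown on the original slide did not survive in this transcript. It was chosen to have 3 matches, 2 mismatches, and 1 gap.

```python
def alignment_score(s_aln: str, t_aln: str,
                    match: int = 2, mismatch: int = -1, gap: int = -1) -> int:
    """Sum the column scores of two equal-length aligned strings ('-' is a space)."""
    score = 0
    for a, b in zip(s_aln, t_aln):
        if a == "-" or b == "-":
            score += gap
        elif a == b:
            score += match
        else:
            score += mismatch
    return score

# Hypothetical alignment: columns A/A, C/C, T/T match; G/C, T/G mismatch; A/- is a gap.
print(alignment_score("ACGTTA", "ACCTG-"))  # 3 = 3*2 - 2*1 - 1*1
```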

  45. How to find it? • A naïve algorithm: • for all subsequences A of S, B of T s.t. |A| = |B| do • align A[i] with B[i], 1 ≤ i ≤ |A| • align all other chars to spaces • compute its value • retain the max • end • output the retained alignment • Example: S = abcd, A = cd; T = wxyz, B = xz; two corresponding alignments: -abc-d / w--xyz and a-bc-d / -w-xyz
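A direct Python sketch of this naïve enumeration follows. It returns only the best score, not the alignment itself, and it assumes the simple scoring from the earlier example (match 2, mismatch -1, gap -1). `naive_best_score` is an illustrative name.

```python
from itertools import combinations

def naive_best_score(s: str, t: str,
                     match: int = 2, mismatch: int = -1, gap: int = -1) -> int:
    """Try every pair of equal-length subsequences (chosen by index) and keep the max."""
    best = gap * (len(s) + len(t))  # k = 0: every character faces a space
    for k in range(1, min(len(s), len(t)) + 1):
        for a in combinations(range(len(s)), k):
            for b in combinations(range(len(t)), k):
                # A[i] is aligned with B[i]; all other characters face spaces.
                paired = sum(match if s[i] == t[j] else mismatch
                             for i, j in zip(a, b))
                unpaired = gap * ((len(s) - k) + (len(t) - k))
                best = max(best, paired + unpaired)
    return best

print(naive_best_score("abcd", "abd"))  # 5
```

This is usable only for very short strings, since the number of subsequence pairs grows exponentially with the string length.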

  46. Analysis • Assume |S| = |T| = n • Cost of evaluating one alignment: ≥ n • How many alignments are there: C(2n, n) • pick n chars of S,T together • say k of them are in S • match these k to the k unpicked chars of T • Total time: ≥ n * C(2n, n) > 2^(2n), for n > 3 • E.g., for n = 20, time is > 2^40 > 10^12 operations

  47. Dynamic Programming • We will now describe a dynamic programming algorithm • Suppose we wish to align x1…xM and y1…yN • Let F(i,j) = optimal score of aligning x1…xi and y1…yj

  48. Dynamic Programming (cont’d) Notice three possible cases: 1. xM aligns to yN (~~~~~~~ xM over ~~~~~~~ yN): F(M,N) = F(M-1, N-1) + m if xM = yN, or F(M-1, N-1) - s if not 2. xM aligns to a gap (~~~~~~~ xM over ~~~~~~~ -): F(M,N) = F(M-1, N) - d 3. yN aligns to a gap (~~~~~~~ - over ~~~~~~~ yN): F(M,N) = F(M, N-1) - d
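The three cases can be combined into a table-filling procedure. The sketch below takes the maximum over the three cases and initializes the borders as F(i,0) = -i*d and F(0,j) = -j*d. Both steps are the standard completion of this recurrence and are assumed here, since the transcript ends before the lecture states them. The default values m = 2, s = 1, d = 1 are also illustrative.

```python
def global_alignment_score(x: str, y: str, m: int = 2, s: int = 1, d: int = 1) -> int:
    """Fill F(i, j), the optimal score of aligning x[:i] with y[:j]; return F(M, N)."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        F[i][0] = -i * d              # x[:i] aligned entirely to gaps
    for j in range(1, N + 1):
        F[0][j] = -j * d              # y[:j] aligned entirely to gaps
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            diag = F[i - 1][j - 1] + (m if x[i - 1] == y[j - 1] else -s)
            up = F[i - 1][j] - d      # x_i aligns to a gap
            left = F[i][j - 1] - d    # y_j aligns to a gap
            F[i][j] = max(diag, up, left)
    return F[M][N]

print(global_alignment_score("abcd", "abd"))  # 5
```

The table has (M+1)(N+1) entries and each takes constant time, so the cost is O(MN) instead of exponential.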
