Computational Molecular Biology

Presentation Transcript


  1. Computational Molecular Biology Multiple Sequence Alignment

  2. Sequence Alignment • Problem Definition: • Given: 2 DNA or protein sequences • Find: Best match between them • What is an Alignment: • Given: 2 strings S and S’ • Goal: Make the lengths of S and S’ equal by inserting spaces (written --, sometimes denoted ∆) into these strings My T. Thai mythai@cise.ufl.edu

  3. Matches, Mismatches and Indels • Match: two aligned, identical characters in an alignment • Mismatch: two aligned, unequal characters • Indel: a character aligned with a space • Example:
       A  A  C  T  A  C  T  -- C  C  T  A  A  C  A  C  T  -- --
       -- -- C  T  C  C  T  A  C  C  T  -- -- T  A  C  T  T  T
     10 matches, 2 mismatches, 7 indels
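
The counts on this slide can be checked mechanically. The following is a minimal sketch, assuming each space is written as a single `-` character; the function name `classify_columns` is hypothetical and not part of the slides.

```python
def classify_columns(row1, row2, gap="-"):
    """Count (matches, mismatches, indels) in a pairwise alignment."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    matches = mismatches = indels = 0
    for a, b in zip(row1, row2):
        if a == gap and b == gap:
            raise ValueError("a column may not contain two spaces")
        if a == gap or b == gap:
            indels += 1          # a character aligned with a space
        elif a == b:
            matches += 1         # two aligned, identical characters
        else:
            mismatches += 1      # two aligned, unequal characters
    return matches, mismatches, indels


# The two rows of the slide's example, with "--" written as "-".
row1 = "AACTACT-CCTAACACT--"
row2 = "--CTCCTACCT--TACTTT"
print(classify_columns(row1, row2))  # (10, 2, 7)
```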

  4. Basic Algorithmic Problem • Find the alignment of the two strings that: • maximizes m, where m = (# matches − # mismatches − # indels) • or minimizes m, where m is the SP-score of the alignment • m defines the similarity of the two strings; this problem is also called Optimal Global Alignment • Biologically: a mismatch represents a mutation, whereas an indel represents a historical insertion or deletion of a single character
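
The slide states the objective but not an algorithm. One standard way to compute the optimum for two strings is a Needleman-Wunsch style dynamic program; the sketch below is an illustration under that assumption, and the name `global_alignment_score` is hypothetical. It returns only the best value of m = (# matches − # mismatches − # indels), not the alignment itself.

```python
def global_alignment_score(s, t):
    """Maximum of (#matches - #mismatches - #indels) over all alignments of s and t."""
    n, m = len(s), len(t)
    # best[i][j] is the best score for aligning the prefixes s[:i] and t[:j].
    best = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0] = -i                      # s[:i] aligned against spaces only
    for j in range(1, m + 1):
        best[0][j] = -j                      # t[:j] aligned against spaces only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = 1 if s[i - 1] == t[j - 1] else -1
            best[i][j] = max(
                best[i - 1][j - 1] + pair,   # match or mismatch
                best[i - 1][j] - 1,          # s[i-1] aligned with a space
                best[i][j - 1] - 1,          # t[j-1] aligned with a space
            )
    return best[n][m]


print(global_alignment_score("ACGT", "AGT"))  # 2: three matches and one indel
```

The table has (|S|+1)(|S’|+1) entries, so the running time and space are quadratic in the string lengths.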

  5. Multiple Sequence Alignment • Problem Definition: • Similar to the sequence alignment problem, but the input has more than 2 strings • Challenges: • NP-hard and MAX-SNP-hard • Approximation guarantee: factor 2 − 2/k, where k is the number of input sequences • More work is needed to reduce the time and space complexity

  6. Sum of Pairs Score (SP-Score) • Given a finite alphabet Σ, extended with a symbol ∆ that denotes a space • Consider k sequences over Σ that we want to align. After an alignment, each sequence has length l • A score d(a, b) is assigned to each pair of letters a, b

  7. SP-Score • The SP-Score of an alignment A is defined as follows: • Consider a matrix of l columns and k rows, where the rows represent the sequences and the columns represent the letters • The SP-Score is the sum of the scores of all columns: • The score of each column is the sum of the scores of all distinct unordered pairs of letters in the column • Equivalently, it is the sum of the pairwise sequence alignment values • Find an (optimal) alignment that minimizes the SP-Score
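
The two views of the SP-Score described on this slide (sum over columns, or sum over pairs of rows) can be compared directly. This is a minimal sketch; the pair score `unit_cost` is a hypothetical stand-in for d (0 for identical symbols, 1 otherwise), since the slide's actual score table is not preserved in the transcript.

```python
from itertools import combinations


def unit_cost(a, b):
    """Hypothetical pair score d: 0 for identical symbols, 1 otherwise."""
    return 0 if a == b else 1


def sp_score(rows, d=unit_cost):
    """Sum, over all columns, of d over all unordered pairs of symbols in the column."""
    if len({len(r) for r in rows}) > 1:
        raise ValueError("all aligned rows must have the same length")
    return sum(d(a, b) for column in zip(*rows) for a, b in combinations(column, 2))


def sp_score_pairwise(rows, d=unit_cost):
    """The same value, computed as a sum of pairwise alignment scores."""
    return sum(
        sum(d(a, b) for a, b in zip(r1, r2))
        for r1, r2 in combinations(rows, 2)
    )


rows = ["AC-T", "A-GT", "ACGT"]   # "-" plays the role of the space symbol
print(sp_score(rows), sp_score_pairwise(rows))  # 4 4
```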

  8. Proving that MSA with a Metric SP-Score is NP-hard

  9. Some Notations

  10. Some Basic Properties • Lemma 1: Let s1, s2 be two sequences over Σ such that l1 = |s1|, l2 = |s2|, l2 ≥ l1, and there are m symbols of s1 that are not in s2. Then every alignment of the set {s1, s2} has at least m + l2 − l1 mismatches

  12. The construction • Reduce Vertex Cover (also called Node Cover) to MSA. • Vertex Cover: • Instance: A graph G = (V, E) and an integer k ≤ |V| • Question: Is there a vertex cover V1 of G of size k or less? • MSA: • Instance: A set S = {s1, …, sn} of finite sequences over a fixed alphabet Σ, an SP-score and an integer C • Question: Is there a multiple alignment of the sequences in S that has value C or less?
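
To make the Vertex Cover decision problem concrete, here is a brute-force check on small graphs. It is only an illustration of the problem statement (exponential time, as expected for an NP-complete problem), not part of the reduction; the function name `has_vertex_cover` is hypothetical.

```python
from itertools import combinations


def has_vertex_cover(vertices, edges, k):
    """Return True if some set of at most k vertices touches every edge."""
    for size in range(min(k, len(vertices)) + 1):
        for candidate in combinations(vertices, size):
            chosen = set(candidate)
            if all(u in chosen or v in chosen for u, v in edges):
                return True
    return False


triangle = [(0, 1), (1, 2), (0, 2)]
print(has_vertex_cover([0, 1, 2], triangle, 1))  # False: one vertex misses an edge
print(has_vertex_cover([0, 1, 2], triangle, 2))  # True: any two vertices suffice
```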

  13. SP-Score (alphabet of size 6)

  14. The Reduction • So, we have […], where T is a set of C2 sequences t and X contains C1 sequences x(k); C1 and C2 will be determined later

  15. An Example

  16. Intuition • By the above construction, an optimal alignment A of S is obtained when A satisfies certain properties (such an alignment is called a standard alignment) • The value of a standard alignment is bounded by a given threshold C only when G has a vertex cover of size k • How to obtain this: • Force the d’s of the test sequences to be aligned with b’s of the edge sequences • Only one b of each edge sequence can be aligned to a d • The number of such alignments determines the value of the alignment

  17. Standard Alignment

  22. Let U_S and U_{S,X} denote the upper bounds of D(A_S) and D(A_{S,X}), respectively • By Corollary 8 and Lemma 9, the standard alignment has value at most D_SD + U_S + U_{S,X} • where D_SD = D(A_X) + D(A_T) + D(A_{X,T}) + D(A_{S,T}) over a standard alignment A • Now, choosing C1 > U_S and C2 > U_S + U_{S,X}, we can prove that an optimal alignment must be a standard one

  25. • Show that MSA is NP-hard for every scoring matrix in a broad class M • Show that there is a scoring matrix M0 such that MSA for M0 is MAX-SNP-hard

  26. Interesting Observation • Brute-force computation shows that optimal MSAs contain very few gaps • This suggests studying gap limitations: • Place an upper bound on the number of gaps one can insert during the alignment • Special cases: • Gap-0: No gaps are allowed, but the strings can be shifted for an alignment (gaps are inserted only at the beginning or at the end of a string) • Gap-0-1: a gap-0 alignment in which the gap at the beginning or at the end of each string is exactly one space

  27. Problem Definition • Given a finite alphabet with letters a_i • Scoring matrix s • For i, j > 0, s_{i,j} represents the penalty for aligning a_i with a_j • For i > 0, s_{0,i} and s_{i,0} are called indel penalties • Gap opening penalties (in addition to the indel penalties) apply when a_i is aligned with the first or last ∆ in a string of ∆’s

  28. Generic Scoring Matrix • Here Σ = {A, T}, x, y, z are fixed nonnegative numbers, and u > max{0, v_A, v_T} holds • Let M2 be the class of all scoring matrices that contain a generic submatrix M • Let M1 be the class of all scoring matrices that contain a submatrix isomorphic to a generic matrix M with z > v_T • Let M be the class of all scoring matrices that contain a submatrix isomorphic to a generic matrix M with y > u and z > v_T • Theorem 1: • (a) The gap-0-1 multiple alignment problem is NP-hard for every scoring matrix M in M2 • (b) The gap-0 multiple alignment problem is NP-hard for every M in M1 • (c) The multiple alignment problem is NP-hard for every M in M • Note that the class M is quite broad and covers most scoring schemes used in biological applications.

  29. Reduction • Reduce from MAX-CUT-B: • Given G = (V, E), where k = |V| and each vertex has degree at most B • Find a partition of V into two disjoint sets that maximizes the number of edges crossing between the two sets • Given a graph G = (V, E) with k vertices v_0, …, v_{k−1} and l edges e_0, …, e_{l−1}, we will construct a set of k^2 sequences t_0, …, t_{k^2−1} as follows:

  30. Reduction • For each vertex v_i, construct a sequence t_i such that • for each edge e_m = {v_h, v_i} incident at v_i, with h < i and n < k^5, set […], where t_{i,j} represents the character at the jth position in t_i • For other j, let t_{i,j} = T • For i ≥ k, set t_i = T T T … T with length k^{12} l

  31. An Example

  32. Proof of Theorem 1(a) • We will show that a gap-0-1 alignment partitions V into two disjoint subsets V0 and V1: • V0: all vertices v_i such that t_i remains in place (a space is appended at the end) • V1: all vertices v_i such that t_i shifts to the right • Thus, based on the alignment, we can find the cut. Conversely, based on the cut, we can find the alignment • What remains is to prove that if k is sufficiently large, the optimal gap-0-1 alignment yields a partition of V with maximum edge cut.
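
The correspondence between gap-0-1 alignments and partitions can be illustrated by enumerating all 2^k shift choices for k equal-length strings. This is a toy sketch only: the penalties in `toy_penalty` are hypothetical values (not the generic scoring matrix of the slides), and the input strings are not the sequences t_i of the reduction.

```python
from itertools import combinations, product

GAP = "-"


def toy_penalty(a, b):
    """Hypothetical penalties: 0 for equal symbols, 1 for an indel, 2 for a mismatch."""
    if a == b:
        return 0
    if a == GAP or b == GAP:
        return 1
    return 2


def sp_score(rows, d):
    """Sum of d over all unordered pairs of symbols in every column."""
    return sum(d(a, b) for column in zip(*rows) for a, b in combinations(column, 2))


def gap01_alignments(seqs):
    """Yield (shifts, rows); shift 0 appends one space, shift 1 prepends one."""
    for shifts in product((0, 1), repeat=len(seqs)):
        rows = [GAP + s if shift else s + GAP for s, shift in zip(seqs, shifts)]
        yield shifts, rows


def best_gap01(seqs, d=toy_penalty):
    """Return (score, shifts) of a minimum-penalty gap-0-1 alignment."""
    return min((sp_score(rows, d), shifts) for shifts, rows in gap01_alignments(seqs))


score, shifts = best_gap01(["ATTT", "TATT"])
v0 = [i for i, s in enumerate(shifts) if s == 0]   # strings that stay in place
v1 = [i for i, s in enumerate(shifts) if s == 1]   # strings shifted to the right
print(score, v0, v1)
```

Here shifting the first string creates an A-A match and removes two A-T mismatches, which is the effect the proof counts per cut edge.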

  33. Proof of Theorem 1(a) • Let c denote the size of the cut based on the alignment A • Consider all the sequences t_i after the alignment A: • The total indel penalty is of order O(k^4) (indels appear only in the first and last columns of the alignment matrix) • The total number of mismatches before the alignment is 3k^5 l(k^2 − 1) • To maximally reduce this number: • 1 A-A match removes 2 A-T mismatches • For each edge (v_h, v_i), if its endpoints are in different subsets of the partition, then a total of k^5 A-A matches between sequences t_h and t_i are created • No other A-T mismatches can be eliminated • Thus the SP-score is: • k^{12} l v_T k^2(k^2 − 1)/2 + 3k^5 l(u − v_T)(k^2 − 1) − c k^5(2u − v_A − v_T) + O(k^4)

  34. Theorem 2 • Consider the following scoring matrix M0 for the alphabet Σ0 = {A, T, C}. • The gap-0-1 MSA problem is MAX-SNP-hard • The gap-0 MSA problem is MAX-SNP-hard • The MSA problem is MAX-SNP-hard

  35. MAX-SNP-hard Proof • To prove that a problem A’ is MAX-SNP-hard, we need to L-reduce a problem A that is known to be MAX-SNP-hard to A’ • L-reduction: • There are two polynomial-time algorithms f, g and constants a, b > 0 such that for each instance I of A: • f produces an instance I’ = f(I) of A’ such that OPT(I’) ≤ a·OPT(I) • Given any solution of I’ with cost c’, g produces a solution of I with cost c such that |c − OPT(I)| ≤ b·|c’ − OPT(I’)|

  36. Proof of Theorem 2 • To prove that MSA (with M0, the scoring matrix mentioned before) is MAX-SNP-hard: • L-reduce MAX-CUT-B to another optimization problem, called A’, which in turn L-reduces to a scaled version of MSA • Problem A’: • Given a graph G = (V, E) with bounded degree B. For every partition P = {V0, V1}, let c_P be the size of the cut determined by P. • Find the partition P of V that minimizes d_P = 3|E| − 2c_P

  37. Show A’ is MAX-SNP-hard • Let f and g be identity functions • Set a = 3B and b = 2; the two properties of the L-reduction then follow easily, since: • c_P ≥ |E|/B for an optimal partition P, and d_P = 3|E| − 2c_P ≤ 3|E| • Increasing c_P by 1 decreases d_P by 2
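
The relationship between MAX-CUT-B and Problem A’ can be checked by brute force on a small graph. The sketch below enumerates every partition, computes c_P and d_P = 3|E| − 2c_P, and confirms the two facts used above. The example graph and the function names are hypothetical; the enumeration is exponential and meant only for tiny inputs.

```python
from itertools import product


def cut_size(edges, sides):
    """Number of edges whose endpoints lie on different sides of the partition."""
    return sum(1 for u, v in edges if sides[u] != sides[v])


def best_cut_and_cost(num_vertices, edges):
    """Return (max c_P, min d_P) over all partitions, with d_P = 3|E| - 2*c_P."""
    cuts = [cut_size(edges, sides) for sides in product((0, 1), repeat=num_vertices)]
    costs = [3 * len(edges) - 2 * c for c in cuts]
    return max(cuts), min(costs)


# A 4-cycle with one chord; the maximum degree is B = 3.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
B = 3
c_opt, d_opt = best_cut_and_cost(4, edges)
print(c_opt, d_opt)                    # 4 7
print(c_opt >= len(edges) / B)         # True: the optimal cut has at least |E|/B edges
print(d_opt <= 3 * B * c_opt)          # True: OPT(I') <= a * OPT(I) with a = 3B
```

Because d_P is an affine function of c_P with slope −2, the partition that maximizes the cut is exactly the one that minimizes d_P.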

  38. Show that A’ L-reduces to scaled MSA • Similar to the above construction, we have:

  39. Similar to the proof of Theorem 1, we have the optimal SP-score: […] where […] • If the SP-score is scaled by a factor of k^{−5/2} for an MSA of k sequences, then A’ L-reduces to MSA.

  40. GENETIC ALGORITHMS

  41. How do GAs work? • Create a population of random solutions • Use natural selection: • crossover and mutation to improve the solutions • Stop when a criterion such as one of the following is met: • No improvement in the fitness function • The improvement is less than a given threshold • The number of iterations exceeds a given threshold

  42. Terms and Definitions • Chromosomes: potential solutions • Population: collection of chromosomes • Generations: successive populations

  43. Terms and Definitions • Crossover: exchange of genes between two chromosomes • Mutation: random change of one or more genes in a chromosome • Elitism: copying the best solutions unchanged, without crossover or mutation

  44. Terms and Definitions • Offspring: new chromosome created by crossover between two parent chromosomes • Fitness function: measures how “good” a chromosome is • Encoding scheme: how each chromosome/gene is represented (binary, combination, syntax trees)

  45. Why are GAs attractive? • No need for a problem-specific algorithm to solve the given problem • Only a fitness function is required to evaluate the quality of the solutions • Implicitly parallel, so they can be implemented efficiently on powerful parallel computers for demanding large-scale problems

  46. Basic Outline of a GA • Start with an initial population of random chromosomes, called the first generation • Evaluate the fitness of each chromosome in the population • Create a new population: • Select two parent chromosomes from the population according to their fitness • Cross over the parents (with some probability) to form a new offspring • Mutate the new offspring (with some probability) • Place the new offspring in the new population • Repeat the process until a satisfactory solution evolves
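
The outline above can be sketched in a few lines. This is an illustrative sketch under several assumptions that the slides do not specify: chromosomes are binary lists, parents are picked by 3-way tournament selection, crossover is single-point, one elite chromosome is copied per generation, and the loop stops after a fixed number of generations. The name `run_ga` and all default parameter values are hypothetical. It requires a chromosome length of at least 2 and a population of at least 3.

```python
import random


def run_ga(fitness, chromosome_length, population_size=20, generations=50,
           crossover_rate=0.9, mutation_rate=0.02, seed=0):
    """Maximize `fitness` over binary chromosomes; return (best, best-fitness history)."""
    rng = random.Random(seed)
    # First generation: random chromosomes.
    population = [[rng.randint(0, 1) for _ in range(chromosome_length)]
                  for _ in range(population_size)]
    history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        new_population = [population[0][:]]           # elitism: keep the best unchanged
        while len(new_population) < population_size:
            # Selection according to fitness (tournament of 3, an assumption).
            parent1 = max(rng.sample(population, 3), key=fitness)
            parent2 = max(rng.sample(population, 3), key=fitness)
            child = parent1[:]
            if rng.random() < crossover_rate:         # single-point crossover
                cut = rng.randrange(1, chromosome_length)
                child = parent1[:cut] + parent2[cut:]
            # Mutation: flip each gene with a small probability.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            new_population.append(child)
        population = new_population
    best = max(population, key=fitness)
    history.append(fitness(best))
    return best, history


# Toy fitness: the number of 1-genes in the chromosome.
best, history = run_ga(sum, chromosome_length=16)
print(history[0], history[-1])
```

Because the elite chromosome is always carried over, the recorded best fitness never decreases from one generation to the next.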

  47. Operations • Mutation Operation: • Modify a single parent • Try to avoid local minima

  48. Let's see some running examples • Minimum of a function: http://cs.felk.cvut.cz/~xobitko/ga/example_f.html • Elitism: http://cs.felk.cvut.cz/~xobitko/ga/params.html • The travelling salesman problem: http://cs.felk.cvut.cz/~xobitko/ga/tspexample.html

  49. Multiple Sequence Alignment • A fitness function (also called the cost function) is used to compare the different alignments • It is based on the number of matching symbols and the number and size of gaps • It may use different weights for different types of matches, plus gap costs • It can be simple and count the total matching symbols, or complicated and consider the type of matching symbols, location in the sequence, neighboring symbols, etc.
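
A simple fitness function of the kind described here might reward matching symbols and penalize gaps by number and size. The sketch below is one hypothetical choice, not the function used in the lecture: it counts matching non-gap pairs per column and charges each run of gaps an opening cost plus an extension cost. The weights `match_weight`, `gap_open` and `gap_extend` are assumed values.

```python
from itertools import combinations

GAP = "-"


def gap_runs(row):
    """Lengths of the maximal runs of gap symbols in one aligned row."""
    runs, current = [], 0
    for ch in row:
        if ch == GAP:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


def alignment_fitness(rows, match_weight=1.0, gap_open=2.0, gap_extend=0.5):
    """Higher is better: weighted matches minus a cost per gap run and per gap length."""
    matches = sum(
        1
        for column in zip(*rows)
        for a, b in combinations(column, 2)
        if a == b and a != GAP
    )
    penalty = sum(
        gap_open + gap_extend * (length - 1)
        for row in rows
        for length in gap_runs(row)
    )
    return match_weight * matches - penalty


rows = ["AC-GT", "ACGGT", "A--GT"]
print(alignment_fitness(rows))  # 5.5: 10 matching pairs minus gap costs of 2.0 and 2.5
```

Such a function could be passed to a genetic algorithm as the quantity to maximize, with each chromosome encoding the gap positions of an alignment.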
