Bioinformatics Algorithms and Data Structures

Bioinformatics Algorithms and Data Structures Chapter 11.8: Gaps Lecturer: Dr. Rose Slides by: Dr. Rose February 6, 2003

Gaps • Our investigation of alignment has focused on: • Matches • Mismatches • Spaces • An important concept is that of gaps. • Defn. A gap is a maximal consecutive run of spaces in a single string of a given alignment. • Q: Can a single space be a gap? • A: Yes, if there are no adjacent spaces.

Gaps • Gaps can occur: • Before the first character of a string • After the last character of a string • Inside a string • Example:
c t g c g g g - - - g g t a a a t
- - g c g g - a g a g g - a a a -
• Q: How many gaps are there? • A: 5
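
The gap count in the example can be checked mechanically. The following is a minimal sketch, assuming each row of the alignment is stored as a Python string with "-" for a space; the helper name count_gaps is illustrative and not from the textbook.

```python
from itertools import groupby

def count_gaps(row: str) -> int:
    """Count maximal consecutive runs of '-' in one row of an alignment."""
    return sum(1 for is_space, _ in groupby(row, key=lambda ch: ch == "-") if is_space)

# The two rows of the example alignment, 17 columns each.
s1 = "ctgcggg---ggtaaat"
s2 = "--gcgg-agagg-aaa-"

total_gaps = count_gaps(s1) + count_gaps(s2)
print(total_gaps)  # 5: one gap in the first row, four in the second
```

A run of length one still counts as a gap, which matches the definition: a single space with no adjacent spaces is a gap.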

Gaps • Q: Other than our recognition of gaps, did the preceding example show anything new? • A: No. • Q: Then what motivates the introduction of this concept? • A: We can include a gap term in the objective function for computing alignment. • So??? • So we can influence the distribution of gaps.

Gaps • Analogy: specifying the location of the hole is critical to donut making, otherwise you’ll end up with a berliner. • Example: • In this objective fcn, each gap contributes the constant weight Wg irrespective of the gap length. • The variable k indicates the number of gaps in the alignment.

Gaps • Recall, a space in an alignment corresponds to an insertion or deletion in the edit transcript. • A gap corresponds to an atomic insertion or deletion of an entire substring. • Biologically, mutations are such atomic events. • A single mutation can create a gap • The size of the gap can vary over a large range with equal likelihood.

Gaps Sources of mutation mentioned by textbook: • Unequal cross-over in meiosis → insertion in one string and corresponding deletion in the other. • http://www4.ncsu.edu:8030/unity/users/b/bnchorle/www/ • DNA slippage: slippage in the replication procedure resulting in the repetition of a substring • Retrovirus insertions • Translocations of DNA between chromosomes

Gaps • Common gaps in aligned strings can be used to deduce evolutionary history • Mutations at the single character level are frequent. • Does anybody know what these are called? → This makes it difficult to determine evolutionary relationships at the DNA sequence level. • Large gaps occur less frequently. → Gap features can be used to recognize similarity over long periods of time. • See Figure 11.6 for an example of a gap as an alignment feature

Gaps • Consider: • An alignment should reflect the cost of mutational events transforming one string into the other. • A single mutation can produce a gap of more than one space • Consequently: • Distribution of spaces into gaps should follow a plausible model • Gap weights should be modeled to reflect biological meaning

Motivation: cDNA Matching • Preliminaries: • A single gene is composed of exons and introns • Exons are the coding part of the gene • Introns are the noncoding parts between exons • Gene expression: • RNA is transcribed from DNA • DNA:A → RNA:U (uracil) • DNA:C → RNA:G • DNA:G → RNA:C • DNA:T → RNA:A

Motivation: cDNA Matching • Gene expression continued: • RNA is transformed into mRNA (messenger RNA) • The introns are excised • The remaining exons are concatenated • The resulting mRNA leaves the cell nucleus • A ribosome: • Translates the mRNA into the corresponding protein by • parsing the mRNA into codons • assembling amino acids in the order specified by the codons. • The resulting sequence of amino acids is the protein
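
The splicing and codon-parsing steps can be sketched in a few lines. This is only an illustration under stated assumptions: the transcript, the exon coordinates, and the function names splice and codons are all made up for the example, and the translation of codons into amino acids is left out.

```python
def splice(pre_mrna: str, exons: list[tuple[int, int]]) -> str:
    """Excise introns by concatenating the exon substrings.

    Each exon is a half-open (start, end) index pair into pre_mrna.
    """
    return "".join(pre_mrna[start:end] for start, end in exons)

def codons(mrna: str) -> list[str]:
    """Parse mRNA into consecutive triplets, dropping any incomplete tail."""
    usable = len(mrna) - len(mrna) % 3
    return [mrna[i:i + 3] for i in range(0, usable, 3)]

# Hypothetical transcript: exon, intron (lowercase for visibility), exon.
pre_mrna = "AUGGCC" + "guaaguuuag" + "UUUUAA"
exon_coords = [(0, 6), (16, 22)]  # assumed coordinates for this toy example

mrna = splice(pre_mrna, exon_coords)
print(mrna)          # AUGGCCUUUUAA
print(codons(mrna))  # ['AUG', 'GCC', 'UUU', 'UAA']
```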

Motivation: cDNA Matching • Imagine that you have the mRNA for a protein and want to find the corresponding gene. • (Wet biology) Take the mRNA and create complementary DNA (cDNA). • Map mRNA:U → cDNA:A • Note: cDNA differs from DNA in several respects • cDNA does not contain intron substrings • The nucleotides in cDNA complement the nucleotides in the corresponding DNA, i.e., A↔T and C↔G

Motivation: cDNA Matching • (Wet biology) Hybridize the cDNA with the DNA • In hybridization: complementary nucleotides try to match up, i.e., A↔T and C↔G • Sections of the cDNA will hybridize with the corresponding sections of DNA. • The non-hybridizing segments are gaps • Possibly corresponding to introns

Motivation: cDNA Matching • Now imagine that you have the mRNA sequence for a protein and want to find the corresponding gene without doing wet biology. • Take the mRNA and create complementary DNA (cDNA). • Map mRNA:U → cDNA:A with a computer • While we are at it, compile a library of each cDNA string that we create for future use.
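
The computational mapping from mRNA to cDNA can be sketched as a character translation. The slides give only the U → A rule explicitly; the table below extends it to the other three bases by ordinary base pairing. The names mrna_to_cdna, cdna_library and add_to_library are hypothetical, and the sketch complements base by base without reversing the strand, which is an assumption about how the slides intend the strings to be compared.

```python
# Base-by-base complement: mRNA A, C, G, U map to cDNA T, G, C, A.
MRNA_TO_CDNA = str.maketrans("ACGU", "TGCA")

def mrna_to_cdna(mrna: str) -> str:
    """Return the complementary DNA string for an mRNA string."""
    return mrna.upper().translate(MRNA_TO_CDNA)

# A simple in-memory library of the cDNA strings created so far.
cdna_library: dict[str, str] = {}

def add_to_library(name: str, mrna: str) -> str:
    """Convert an mRNA to cDNA and store it under the given name."""
    cdna = mrna_to_cdna(mrna)
    cdna_library[name] = cdna
    return cdna

print(add_to_library("toy_gene", "AUGGCCUUUUAA"))  # TACCGGAAAATT
```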

Motivation: cDNA Matching • Align (hybridize) the cDNA with the DNA • We assume that the relevant genome has been sequenced. • We have a short string (cDNA) and a very long string, the genome. • Align complementary nucleotides in the two strings, i.e., A↔T and C↔G • Sections of the cDNA will align (hybridize) with the corresponding sections of genome. • The non-aligning (non-hybridizing) segments are gaps • Possibly corresponding to introns

Motivation: cDNA Matching • Q: What kind of objective fcn do we need to align cDNA with DNA? • Features: • Small penalties for spaces • Q: Why does this matter? • A: Large penalties would force the cDNA to bunch up, not allowing gaps for introns

Motivation: cDNA Matching • Features: continued • Large penalties for mismatches • Some mismatches are unavoidable (sequencing error) • Long sequences of mismatches must be avoided • Positive values for matches • We want to reward exon matches • Gap penalties

Motivation: cDNA Matching • Gap penalties • Q: Assume a positive match weight (+), a large mismatch penalty (--), and a small space penalty (-). What happens if there is no gap penalty? • A: The alignment would be the longest common subsequence. → Match of ALL characters in the cDNA string • Match of cDNA with noncoding DNA → Tells us nothing about the position of the exons

Motivation: cDNA Matching • Gap penalties continued • Soln: augment objective fcn with a gap term • Complication: pseudogenes

Motivation: Pseudogenes • Pseudogenes • Nonworking inexact copy of a gene • Conceptually: • a trial gene not ready for prime time or • a failed gene mutation • The pseudogene may be very far from the actual gene

Motivation: Pseudogenes • Pseudogenes: processed pseudogenes • contain only exon substrings • introns have been removed & exons concatenated • Theory: mRNA that is transcribed back into DNA and inserted into a random position. • Problem: • Assume the DNA might contain the pseudogene & the working gene • How can processed pseudogenes be located?

Gap Weights • Q: What types of gap weight can we choose from? • A: The textbook lists four general types: • Constant gap weight • Affine gap weight • Convex gap weight • Arbitrary gap weight

Gap Weights • Constant gap weight: simplest • No cost for individual space • Gaps are assigned a constant weight Wg • Operator-weight objective fcn: Wm(#matches) – Wms(#mismatches) – Wg(#gaps), where Wm = match weight & Wms = mismatch weight • Alphabet-weight objective fcn: $\sum_{i=1}^{l} s(S'_1(i), S'_2(i))$ - Wg(#gaps). Here s(x,_) = s(_,x) = 0 for every letter x in the alphabet.

Gap Weights • Affine gap weight • Extend the constant gap weight with a charge for each space, Ws. • Wg is the gap initiation charge • Ws is the gap extension charge • Gap weight is given by the affine function Wg+ qWs, where q is the number of spaces in the gap. • Operator-weight objective fcn: Wm(#matches) – Wms(#mismatches) – Wg(#gaps) - Ws(#spaces)
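
The operator-weight objective can be evaluated directly on a given alignment. The sketch below is illustrative only: the function name affine_score and the weight values are assumptions, and the rows are the example alignment from the earlier Gaps slide. Setting ws to 0 gives the constant gap weight model.

```python
from itertools import groupby

def affine_score(row1: str, row2: str, wm: float, wms: float, wg: float, ws: float) -> float:
    """Evaluate Wm(#matches) - Wms(#mismatches) - Wg(#gaps) - Ws(#spaces)."""
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have equal length")
    pairs = list(zip(row1, row2))
    matches = sum(1 for a, b in pairs if a == b and a != "-")
    mismatches = sum(1 for a, b in pairs if a != b and "-" not in (a, b))
    spaces = row1.count("-") + row2.count("-")
    gaps = sum(
        1
        for row in (row1, row2)
        for is_space, _ in groupby(row, key=lambda ch: ch == "-")
        if is_space
    )
    return wm * matches - wms * mismatches - wg * gaps - ws * spaces

s1 = "ctgcggg---ggtaaat"
s2 = "--gcgg-agagg-aaa-"

# 9 matches, 0 mismatches, 5 gaps, 8 spaces.
print(affine_score(s1, s2, wm=1, wms=1, wg=2, ws=1))  # 9 - 0 - 10 - 8 = -9
print(affine_score(s1, s2, wm=1, wms=1, wg=2, ws=0))  # constant model: 9 - 10 = -1
```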

Gap Weights • Affine alphabet-weight objective fcn: $\sum_{i=1}^{l} s(S'_1(i), S'_2(i))$ - Wg(#gaps) - Ws(#spaces). Here s(x,_) = s(_,x) = 0 for every letter x in the alphabet. • An important question is what the values of Wg and Ws should be. • Obviously, this is related to the similarity matrix, s(). • Textbook says FASTA uses Wg = 10 and Ws = 2 for protein sequences

Gap Weights • Convex gap weight • Idea: each additional space contributes less than the one before • Example: $W_g + \log_e q$, where q is the number of spaces in the gap • Longer gaps are still penalized, but only mildly

Arbitrary Gap Weight • Arbitrary gap weight • The gap weight is an arbitrary function, w(q), of the gap length q. • Obviously, the preceding weight types are special cases of the arbitrary gap weight model
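
To make the relationship between the models concrete, each one can be written as a function w(q) of the gap length. This is a sketch, not the textbook's code: the factory names are made up, and the parameter values Wg = 10, Ws = 2 are simply the FASTA values quoted above, reused here for illustration.

```python
import math
from typing import Callable

GapWeight = Callable[[int], float]  # w(q): weight of a gap of q >= 1 spaces

def constant_gap(wg: float) -> GapWeight:
    """Every gap costs Wg, whatever its length."""
    return lambda q: wg

def affine_gap(wg: float, ws: float) -> GapWeight:
    """Gap initiation charge Wg plus Ws per space."""
    return lambda q: wg + q * ws

def convex_gap(wg: float) -> GapWeight:
    """The slides' example Wg + ln(q): each extra space adds less than the last."""
    return lambda q: wg + math.log(q)

models = {
    "constant": constant_gap(10),
    "affine": affine_gap(10, 2),
    "convex": convex_gap(10),
}
for name, w in models.items():
    print(name, [round(w(q), 2) for q in (1, 2, 10, 100)])
```

Any of these callables can be passed wherever an arbitrary w(q) is expected, which is the sense in which the first three models are special cases of the fourth.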

Arbitrary Gap Weight Arbitrary gap weight recurrence: Three types of alignments for S1[1..i] and S2[1..j] • S1(i) aligns to the left of S2(j) → S1 ends with a gap. • Let E(i, j) be the maximal value for alignment case 1. • S1(i) aligns to the right of S2(j) → S2 ends with a gap. • Let F(i, j) be the maximal value for alignment case 2. • S1(i) coaligns with S2(j). • Let G(i, j) be the maximal value for alignment case 3. • Let V(i, j) be the maximal value of E(i, j), F(i, j), & G(i, j).

Arbitrary Gap Weight We have the following recurrences: • V(i, j) = max[E(i, j), F(i, j), G(i, j)], • G(i, j) = V(i - 1, j - 1) + s(S1(i), S2(j)), where S1(i), S2(j) are co-aligned. • $E(i, j) = \max_{0 \le k \le j-1}[V(i, k) - w(j - k)]$ → S1 ends with a gap. • $F(i, j) = \max_{0 \le l \le i-1}[V(l, j) - w(i - l)]$ → S2 ends with a gap.

Arbitrary Gap Weight The base conditions are: V(i, 0) = -w(i), V(0, j) = -w(j), E(i, 0) = -w(i), F(0, j) = -w(j), G(0, 0) = 0, but G(i, j) is undefined if exactly one of i or j is 0. If end spaces are free then end gaps are free and: V(i, 0) = 0, V(0, j) = 0
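
The recurrences and base conditions translate fairly directly into a table-filling routine. The following is a minimal sketch that returns only the optimal value V(n, m), with end gaps charged; the function name and the toy scoring function are assumptions. Because E, F and G at cell (i, j) depend only on V values, the sketch keeps a single V table and computes the other three as local variables.

```python
from typing import Callable

def align_arbitrary_gap(
    s1: str,
    s2: str,
    score: Callable[[str, str], float],
    w: Callable[[int], float],
) -> float:
    """Optimal alignment value under an arbitrary gap weight w(q)."""
    n, m = len(s1), len(s2)
    V = [[0.0] * (m + 1) for _ in range(n + 1)]
    # Base conditions: a prefix aligned against nothing is one gap.
    for i in range(1, n + 1):
        V[i][0] = -w(i)
    for j in range(1, m + 1):
        V[0][j] = -w(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # G: S1(i) and S2(j) are co-aligned.
            g = V[i - 1][j - 1] + score(s1[i - 1], s2[j - 1])
            # E: S1 ends with a gap of length j - k; scan row i.
            e = max(V[i][k] - w(j - k) for k in range(j))
            # F: S2 ends with a gap of length i - l; scan column j.
            f = max(V[l][j] - w(i - l) for l in range(i))
            V[i][j] = max(e, f, g)
    return V[n][m]

def unit_score(a: str, b: str) -> float:
    """Toy scoring: +1 for a match, -1 for a mismatch (assumed values)."""
    return 1 if a == b else -1

# Best alignment pairs a/a and t/t around one gap of length 2: 1 + 1 - w(2) = -1.
print(align_arbitrary_gap("acgt", "at", unit_score, lambda q: 1 + q))
```

The two inner max scans over a row and a column are what make this version more expensive than the O(nm) recurrences seen so far.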

Arbitrary Gap Weight Up until this point all dynamic programming examples have had complexity O(nm). Q: What is the complexity of computing V(i, j)? A: $O(nm^2 + n^2m)$. Q: Why does the consideration of gaps require $O(nm^2 + n^2m)$? A: Previous computations depended only on the 3 adjacent cells. Considering gaps entails considering all preceding cells in the row and column.

Arbitrary Gap Weight Thm. If |S1| = n and |S2| = m, the recurrences can be solved in $O(nm^2 + n^2m)$ time. Proof. (n+1) * (m+1) cells in the table are filled. To fill cell (i, j): • E(i, j) examines j cells of row i, $\max_{0 \le k \le j-1}[V(i, k) - w(j - k)]$. Evaluating E for every cell of one row entails m(m+1)/2 = $O(m^2)$ work. • F(i, j) examines i cells of column j, $\max_{0 \le l \le i-1}[V(l, j) - w(i - l)]$. Evaluating F for every cell of one column entails n(n+1)/2 = $O(n^2)$ work. • G(i, j) examines one other cell. Since there are n rows and m columns, the total is $O(nm^2 + n^2m)$.

Affine Gap Weight • $O(nm^2 + n^2m)$ is expensive. • The affine gap weight model supports O(nm) computation. • Recall, we want to maximize the operator objective fcn: Wm(#matches) – Wms(#mismatches) – Wg(#gaps) - Ws(#spaces) As before, three types of alignments: • S1(i) aligns to the left of S2(j) → S1 ends with a gap. • S1(i) aligns to the right of S2(j) → S2 ends with a gap. • S1(i) coaligns with S2(j). We will use E(i, j), F(i, j), G(i, j) & V(i, j), but we will modify the gap weight.

Affine Gap Weight Q: How can the cost be reduced from $O(nm^2 + n^2m)$ to O(nm)? A: The affine model sets a constant cost per space. Q: How does this help? A: It is not necessary to do row ($O(m^2)$) & column ($O(n^2)$) searches. → It doesn’t matter where a gap began, only whether the current space extends an open gap (cost Ws) or starts a new one (cost Wg + Ws).

Affine Gap Weight The base conditions where end gaps are included are: V(i, 0) = E(i, 0) = -Wg - iWs, V(0, j) = F(0, j) = -Wg - jWs. If end spaces are free then end gaps are free and: V(i, 0) = V(0, j) = 0

Affine Gap Weight We have the following recurrences: • V(i, j) = max[E(i, j), F(i, j), G(i, j)], • G(i, j) = V(i - 1, j - 1) + Wm, if S1(i) = S2(j) • G(i, j) = V(i - 1, j - 1) - Wms, if S1(i) ≠ S2(j) • E(i, j) = max[E(i, j - 1), V(i, j - 1) – Wg] - Ws → S1 ends with a gap. • F(i, j) = max[F(i - 1, j), V(i - 1, j) – Wg] - Ws → S2 ends with a gap. Notice that each recurrence examines only a constant number of cells.

Affine Gap Weight The textbook explains E(i, j) in detail. Let’s consider F(i, j) = max[F(i - 1, j), V(i - 1, j) – Wg] - Ws • F(i, j) is the case where S2 ends with a gap. The recurrence considers two cases: • S2(j) is exactly one place to the left of S1(i): a new gap of one space opens opposite S1(i), so F(i, j) = V(i - 1, j) – Wg - Ws • S2(j) is to the left of S1(i - 1): the same gap that is aligned with S1(i - 1) extends to S1(i), so F(i, j) = F(i - 1, j) - Ws
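
The affine recurrences can be sketched as an O(nm) table fill. This is an illustrative sketch that returns only the optimal value, with end gaps charged; the function name and the weights used in the example are assumptions. One deliberate departure from the base conditions as stated on the slides: here E(i, 0) and F(0, j) are set to negative infinity rather than to V(i, 0) and V(0, j), on the reading that column 0 holds a gap in S2 and row 0 a gap in S1, so that a gap in one string followed directly by a gap in the other is charged Wg twice.

```python
NEG_INF = float("-inf")

def align_affine_gap(s1: str, s2: str, wm: float, wms: float, wg: float, ws: float) -> float:
    """Optimal value of Wm(#matches) - Wms(#mismatches) - Wg(#gaps) - Ws(#spaces)."""
    n, m = len(s1), len(s2)
    V = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    E = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # S1 ends with a gap
    F = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # S2 ends with a gap
    V[0][0] = 0.0
    for i in range(1, n + 1):
        V[i][0] = F[i][0] = -wg - i * ws  # S1[1..i] opposite one gap in S2
    for j in range(1, m + 1):
        V[0][j] = E[0][j] = -wg - j * ws  # S2[1..j] opposite one gap in S1
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            g = V[i - 1][j - 1] + (wm if s1[i - 1] == s2[j - 1] else -wms)
            # Extend the gap already open in S1, or open a new one.
            E[i][j] = max(E[i][j - 1], V[i][j - 1] - wg) - ws
            # Extend the gap already open in S2, or open a new one.
            F[i][j] = max(F[i - 1][j], V[i - 1][j] - wg) - ws
            V[i][j] = max(E[i][j], F[i][j], g)
    return V[n][m]

# Two matches around one gap of two spaces: 2 - Wg - 2*Ws = -1.
print(align_affine_gap("acgt", "at", wm=1, wms=1, wg=1, ws=1))
```

Each cell reads a constant number of neighbours, so the nested loops do O(nm) work in total. The full tables are kept here for readability; only the previous row is actually needed to compute the value.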
