A new approach to sequence comparison:

A new approach to sequence comparison: Normalized sequence alignment • B87506029 李建鴻 • R88725032 朱漢農 • D87631003 饒瑞佶 • D90921014 吳明龍

Abstract • Local sequence alignment • Input: A, B, S (score table) • Output: i1, j1, i2, j2 such that A[i1..j1] and B[i2..j2] have the best alignment score • Ex: 0 1 1 0 1 vs. 1 0 0 1 0

The Smith-Waterman algorithm • Definition: sequence alignment using dynamic programming. • One of the most important techniques in computational molecular biology. • Discards poorly conserved initial and terminal segments.

The Smith-Waterman algorithm (cont.) • Flaws: • Does not discard poorly conserved intermediate segments. • Mosaic effect • Cannot answer the question: do two sequences share a fragment with more than 70% similarity? • Normalized local alignment: reports the maximum degree of similarity.

Introduction • Gene prediction: • Human Genome Project: • Gene prediction in the human genome often amounts to using related proteins from other species as clues for finding exon-intron structure. • Similarity: exons ≈ 85%, introns ≈ 35% • Use local alignment (Smith-Waterman algorithm) to find the most similar segments

Shadow effect • The more biologically important "short" (high-similarity) alignment will not be detected if there is a long alignment with a higher score (but lower similarity).

Mosaic effect The local alignment sometimes produces a mosaic of well-conserved fragments artificially connected by poorly-conserved or even unrelated fragments.

Fixing it • Goad and Kanehisa: • Introduced alignment with minimal mismatch density. • Did not lead to successful algorithms. • Webb Miller: fix this problem at the post-processing stage. • Zhang et al.: • Decompose a local alignment into sub-alignments that avoid the mosaic effect. • The approach may miss the alignments with the best degree of similarity if the Smith-Waterman algorithm missed them.

Fixing it (cont.) • X-drop: a region within an alignment that scores below X. • X-alignments: alignments that contain no X-drops. • Expensive to compute in practice.

Another problem • The Smith-Waterman algorithm cannot correctly find the most biologically adequate relative in a benchmark sample of different protein families. • The algorithm does not take the length of the alignment into account. • Remedy: normalize the alignment score by its length.

Normalized local alignment problem • For substrings I and J of the two sequences, let s(I,J) be their alignment score • Maximize $s(I,J)/(|I|+|J|)$ subject to $|I|+|J| \ge T$, where T is a threshold for the minimal overall length • With no restriction on the overall length, fractional programming yields fast algorithms, but the result is not biologically meaningful.

Normalized local alignment problem (cont.) • A slightly different formulation: maximize $s(I,J)/(|I|+|J|+L)$ for a given parameter L. • Varying L controls the degree of normalization. • Fractional programming techniques can still be used for fast computation.

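Parameter L • If L = 0: • a = A, b = A: NLA*1 = 1/2 • a = ACG..ACGT, b = ACG..ACGT, |a| = |b| = 100: NLA*2 = 100/200 = 1/2 • If L = 100: • NLA*1 = 1/(2+100) = 1/102 ≈ 0.01 • NLA*2 = 100/(200+100) = 100/300 ≈ 0.33 • L cannot be too large: as L grows, the ranking approaches that of the unnormalized score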
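The arithmetic on this slide can be reproduced with a few lines of Python. This is only an illustrative sketch: the helper name `normalized_score` is not from the slides, and the score of each example is taken to be its number of matches.

```python
def normalized_score(score, len_a, len_b, L):
    """Alignment score divided by the combined segment length plus the constant L."""
    return score / (len_a + len_b + L)


for L in (0, 100):
    short = normalized_score(1, 1, 1, L)          # a = b = "A"
    long_ = normalized_score(100, 100, 100, L)    # 100 matching symbols
    print(f"L={L}: short={short:.2f}, long={long_:.2f}")
```

With L = 0 both alignments tie at 0.50; with L = 100 the long alignment is clearly preferred, which is the effect the parameter is meant to control.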

Outline of this paper • Formal definition • Dinkelbach's and Megiddo's methods as used in the algorithms • Description of the algorithms • Discussion of implementation • Conclusions

Normalized Local Alignment • Formulate the alignment problems first: • Let $a = a_1 a_2 \dots a_n$ and $b = b_1 b_2 \dots b_m$ be two sequences with $n \ge m$.

Alignment Graph $G_{a,b}$ • Represents all possible alignments between a and b • Directed acyclic graph • $(n+1)\times(m+1)$ lattice points (u, v) as vertices, for $0 \le u \le n$ and $0 \le v \le m$

[Figure: example (labels: path, term, score, vector)]

4 types of arcs in $G_{a,b}$: 1. Horizontal arcs: $\{((u,v-1),(u,v)) \mid 0 \le u \le n,\ 0 < v \le m\}$ 2. Vertical arcs: $\{((u-1,v),(u,v)) \mid 0 < u \le n,\ 0 \le v \le m\}$ 3. Matching diagonal arcs: $\{((u-1,v-1),(u,v)) \mid a_u = b_v,\ 0 < u \le n,\ 0 < v \le m\}$ 4. Mismatching diagonal arcs: $\{((u-1,v-1),(u,v)) \mid a_u \ne b_v,\ 0 < u \le n,\ 0 < v \le m\}$

Alignment path: • By performing the corresponding edit operations on $a_i \dots a_k$ along the path, we obtain $b_j \dots b_l$ • Horizontal arc ((u,v-1),(u,v)): insert $b_v$ after $a_u$ • Vertical arc ((u-1,v),(u,v)): delete $a_u$ • Mismatching diagonal arc ((u-1,v-1),(u,v)): substitute $b_v$ for $a_u$

Ex: a = ATTGT, b = AGGACAT (arcs listed from the end of the path back to its start) • ((4,6),(5,7)) → ATTGT • ((4,5),(4,6)) → ATTGAT • ((4,4),(4,5)) → ATTGCAT • ((4,3),(4,4)) → ATTGACAT • ((3,2),(4,3)) → ATTGACAT • ((2,1),(3,2)) → ATGGACAT • ((1,1),(2,1)) → AGGACAT • ((0,0),(1,1)) → AGGACAT

Arc classes • indel: insertions (horizontal arcs) + deletions (vertical arcs) • match: matching diagonal arcs • mismatch: mismatching diagonal arcs

Assumption of scoring: • Match: 1 • Mismatch: $-\delta$ • Indel: $-\mu$, where $\delta$ and $\mu$ are positive reals.

Alignment vector: • (x, y, z) is an alignment vector for $a_i \dots a_k$ and $b_j \dots b_l$ if there is an alignment path between the vertices (i-1, j-1) and (k, l) in $G_{a,b}$ with x matches, y mismatches, and z indels. • We denote the set of all such alignment vectors by $AV_{i,j,k,l}(a,b) = \{(x,y,z) \mid (x,y,z) \text{ is an alignment vector for } a_i \dots a_k \text{ and } b_j \dots b_l\}$
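To make the definition concrete, the sketch below counts matches, mismatches and indels along a path given as a list of lattice points. The function name `alignment_vector` is hypothetical; the path is the one from the ATTGT / AGGACAT example, written from (0,0) to (5,7).

```python
def alignment_vector(a, b, path):
    """Return (matches, mismatches, indels) for a path of lattice points (u, v)."""
    x = y = z = 0
    for (u0, v0), (u1, v1) in zip(path, path[1:]):
        step = (u1 - u0, v1 - v0)
        if step == (1, 1):                      # diagonal arc
            if a[u1 - 1] == b[v1 - 1]:
                x += 1                          # matching diagonal arc
            else:
                y += 1                          # mismatching diagonal arc
        elif step in ((0, 1), (1, 0)):          # horizontal or vertical arc
            z += 1                              # insertion or deletion
        else:
            raise ValueError(f"not an arc of the alignment graph: {step}")
    return x, y, z


a, b = "ATTGT", "AGGACAT"
path = [(0, 0), (1, 1), (2, 1), (3, 2), (4, 3), (4, 4), (4, 5), (4, 6), (5, 7)]
print(alignment_vector(a, b, path))   # (3, 1, 4)
```

The result has 3 matches, 1 mismatch and 4 indels, and 2·3 + 2·1 + 4 = 12 equals the combined length 5 + 7 of the two sequences.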

Next, we define AV(a,b) as the set of all alignment vectors, i.e. $AV(a,b) = \bigcup_{1 \le i \le k \le n,\ 1 \le j \le l \le m} AV_{i,j,k,l}(a,b)$ (1)

Depending on the score table, we have $SCORE(x,y,z) = x - \delta y - \mu z$ (2) • Then, we denote the maximum score between $a_i \dots a_k$ and $b_j \dots b_l$ by $S_{\delta,\mu}(a_i \dots a_k, b_j \dots b_l) = \max\{SCORE(x,y,z) \mid (x,y,z) \in AV_{i,j,k,l}(a,b)\}$ (3)

The local alignment (LA) problem seeks two segments with the highest similarity score LA*: $LA^*_{\delta,\mu}(a,b) = \max_{i \le k,\ j \le l} S_{\delta,\mu}(a_i \dots a_k, b_j \dots b_l) = \max_{i \le k,\ j \le l} \max\{SCORE(x,y,z) \mid (x,y,z) \in AV_{i,j,k,l}(a,b)\}$

By equation (1), we have $LA^*_{\delta,\mu}(a,b) = \max\{SCORE(x,y,z) \mid (x,y,z) \in AV(a,b)\}$ (4)
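The LA optimum is what the Smith-Waterman recurrence computes. Below is a minimal sketch with match score 1, mismatch penalty δ (`delta`) and indel penalty μ (`mu`), keeping only two rows of the score matrix. The function name is illustrative, and the floor at 0 means the sketch returns 0 when the sequences share no symbol.

```python
def local_alignment_score(a, b, delta, mu):
    """Smith-Waterman optimum: match +1, mismatch -delta, indel -mu."""
    best = 0.0
    prev = [0.0] * (len(b) + 1)                 # row u-1 of the score matrix
    for u in range(1, len(a) + 1):
        cur = [0.0] * (len(b) + 1)              # row u
        for v in range(1, len(b) + 1):
            diag = prev[v - 1] + (1.0 if a[u - 1] == b[v - 1] else -delta)
            cur[v] = max(0.0,                   # start a new alignment here
                         diag,                  # match or mismatch
                         prev[v] - mu,          # vertical arc (deletion)
                         cur[v - 1] - mu)       # horizontal arc (insertion)
            best = max(best, cur[v])
        prev = cur
    return best


print(local_alignment_score("ACGT", "ACGT", 1.0, 1.0))   # 4.0
```

Only the previous and current rows are stored, so memory grows with the length of one sequence rather than with the full matrix.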

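Normalized score ($NS_L$): $NS_{\delta,\mu,L}(a_i \dots a_k, b_j \dots b_l) = S_{\delta,\mu}(a_i \dots a_k, b_j \dots b_l) / LENGTH_L(a_i \dots a_k, b_j \dots b_l)$ (5), where $LENGTH_L(a_i \dots a_k, b_j \dots b_l) = (k-i+1)+(l-j+1)+L$

Normalized Local Alignment (NLA) problem: $NLA^*_{\delta,\mu,L}(a,b) = \max_{i \le k,\ j \le l} \{NS_{\delta,\mu,L}(a_i \dots a_k, b_j \dots b_l)\}$

Observe: If (x, y, z) is an alignment vector for $a_i \dots a_k$ and $b_j \dots b_l$, then $(k-i+1) + (l-j+1) = 2x + 2y + z$. So $LENGTH_L(x,y,z) = 2x + 2y + z + L$ (6)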
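The identity follows from counting how many symbols each arc consumes: a diagonal arc uses one symbol from each sequence, an indel arc uses one symbol in total. A short worked check, using the ATTGT / AGGACAT example, whose path has 3 matches, 1 mismatch and 4 indels (counted by hand here, not stated on the slides):

```latex
\begin{align*}
(k-i+1) + (l-j+1)
  &= \underbrace{2x}_{\text{matches}}
   + \underbrace{2y}_{\text{mismatches}}
   + \underbrace{z}_{\text{indels}} \\
5 + 7 &= 2\cdot 3 + 2\cdot 1 + 4 = 12
\end{align*}
```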

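By (1), (3), (5), and (6), we can define the objective of the NLA problem as $NLA^*_{\delta,\mu,L}(a,b) = \max\{SCORE(x,y,z)/LENGTH_L(x,y,z) \mid (x,y,z) \in AV(a,b)\}$ (7)

Algorithms • Both alignment problems are optimization problems built from linear functions of (x, y, z): LA maximizes a linear function, NLA a ratio of two linear functions.

By equations (2) and (6), and definitions (4) and (7): • $LA_{\delta,\mu}(a,b)$: maximize $x - \delta y - \mu z$ s.t. $(x,y,z) \in AV(a,b)$ • $NLA_{\delta,\mu,L}(a,b)$: maximize $(x - \delta y - \mu z)/(2x + 2y + z + L)$ s.t. $(x,y,z) \in AV(a,b)$

Parametric local alignment problem • For a given $\lambda$, we define the problem $LA_{\delta,\mu,L}(\lambda)(a,b)$: maximize $x - \delta y - \mu z - \lambda(2x + 2y + z + L)$ s.t. $(x,y,z) \in AV(a,b)$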

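Proposition 1. For any normalized score $\lambda < 1/2$, $LA^*(\lambda)$ can be formulated in terms of $LA^*$.

Proof of Proposition 1: $LA^*(\lambda) = \max\{(1-2\lambda)x - (\delta+2\lambda)y - (\mu+\lambda)z - \lambda L\} = (1-2\lambda)\max\{x - \delta' y - \mu' z\} - \lambda L = (1-2\lambda)\,LA^*_{\delta',\mu'}(a,b) - \lambda L$ (8), where $\delta' = (\delta+2\lambda)/(1-2\lambda)$ and $\mu' = (\mu+\lambda)/(1-2\lambda)$, with all maxima taken over $(x,y,z) \in AV(a,b)$.

Thus, computing $LA^*(\lambda)$ involves solving $LA_{\delta',\mu'}(a,b)$ • Since $\delta$, $\mu$ and L are positive, for any alignment vector (x', y', z'), $(x' - \delta y' - \mu z')/(2x' + 2y' + z' + L) \le x'/(2x' + L) < 1/2$, so every normalized score satisfies $\lambda < 1/2$ and the factor $1 - 2\lambda$ is positive.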
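The rewriting in Proposition 1 can be checked numerically on a single alignment vector. In this sketch δ is `delta`, μ is `mu` and λ is `lam`; the function names and the sample numbers (δ = μ = 1, λ = 1/4, L = 2, vector (3, 1, 4)) are illustrative choices, not values from the slides. Exact fractions are used so the two sides can be compared with `==`.

```python
from fractions import Fraction


def transformed_penalties(delta, mu, lam):
    """Penalties (delta', mu') of the ordinary LA problem equivalent to LA(lam)."""
    if not lam < Fraction(1, 2):
        raise ValueError("Proposition 1 needs lam < 1/2")
    scale = 1 - 2 * lam
    return (delta + 2 * lam) / scale, (mu + lam) / scale


def parametric_objective(vec, delta, mu, lam, L):
    """x - delta*y - mu*z - lam*(2x + 2y + z + L)."""
    x, y, z = vec
    return x - delta * y - mu * z - lam * (2 * x + 2 * y + z + L)


def rescaled_objective(vec, delta, mu, lam, L):
    """(1 - 2*lam) * (x - delta'*y - mu'*z) - lam*L, the form used in the proof."""
    x, y, z = vec
    d2, m2 = transformed_penalties(delta, mu, lam)
    return (1 - 2 * lam) * (x - d2 * y - m2 * z) - lam * L


delta, mu, lam, L = Fraction(1), Fraction(1), Fraction(1, 4), 2
print(transformed_penalties(delta, mu, lam))              # delta' = 3, mu' = 5/2
print(parametric_objective((3, 1, 4), delta, mu, lam, L),
      rescaled_objective((3, 1, 4), delta, mu, lam, L))   # both -11/2
```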

Dinkelbach’s algorithm • Dinkelbach (1967) has developed a general algorithm which uses the parametric method of an optimization technique known as fractional programming.

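NLA* can be obtained via a series of $LA^*(\lambda)$ computations for different $\lambda$. • $\lambda = NLA^*$ iff $LA^*(\lambda) = 0$

Dinkelbach's algorithm (pseudocode) • Pick an arbitrary a.v. $(x,y,z) \in AV(a,b)$; $\lambda^* \leftarrow (x - \delta y - \mu z)/(2x + 2y + z + L)$ • do { $\lambda \leftarrow \lambda^*$; using Prop. 1, solve $LA(\lambda)$ and obtain an optimal a.v. (x,y,z); $\lambda^* \leftarrow (x - \delta y - \mu z)/(2x + 2y + z + L)$ } while ($\lambda^* \ne \lambda$) • return($\lambda^*$)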
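A minimal Python sketch of this iteration follows. Several details are assumptions made for illustration: the helper `best_vector` is a Smith-Waterman variant that carries the alignment vector (matches, mismatches, indels) alongside each score, the starting vector is the ordinary local alignment optimum rather than an arbitrary one, exact `Fraction` arithmetic is used so the stopping test is reliable, and L is required to be positive.

```python
from fractions import Fraction


def best_vector(a, b, delta, mu):
    """Smith-Waterman returning (score, (x, y, z)) of one optimal local alignment."""
    zero = (Fraction(0), (0, 0, 0))
    best = zero
    prev = [zero] * (len(b) + 1)
    for u in range(1, len(a) + 1):
        cur = [zero] * (len(b) + 1)
        for v in range(1, len(b) + 1):
            s, (x, y, z) = prev[v - 1]
            if a[u - 1] == b[v - 1]:
                diag = (s + 1, (x + 1, y, z))
            else:
                diag = (s - delta, (x, y + 1, z))
            s, (x, y, z) = prev[v]
            up = (s - mu, (x, y, z + 1))        # vertical arc
            s, (x, y, z) = cur[v - 1]
            left = (s - mu, (x, y, z + 1))      # horizontal arc
            cur[v] = max(zero, diag, up, left, key=lambda t: t[0])
            best = max(best, cur[v], key=lambda t: t[0])
        prev = cur
    return best


def dinkelbach_nla(a, b, delta, mu, L):
    """Iterate lam <- normalized score of an optimal vector of LA(lam) until stable."""
    delta, mu, L = Fraction(delta), Fraction(mu), Fraction(L)
    if L <= 0:
        raise ValueError("this sketch assumes L > 0")

    def normalized(vec):
        x, y, z = vec
        return (x - delta * y - mu * z) / (2 * x + 2 * y + z + L)

    _, vec = best_vector(a, b, delta, mu)       # assumption: start from the LA optimum
    if vec == (0, 0, 0):
        raise ValueError("sequences share no symbol")
    lam_star = normalized(vec)
    while True:
        lam = lam_star
        scale = 1 - 2 * lam                     # positive because lam < 1/2
        _, vec = best_vector(a, b, (delta + 2 * lam) / scale, (mu + lam) / scale)
        lam_star = normalized(vec)
        if lam_star == lam:
            return lam_star, vec


print(dinkelbach_nla("AACAA", "AAGAA", 1, 1, 1))
# (Fraction(2, 5), (2, 0, 0))
```

In the example, ordinary local alignment prefers the full-length alignment with one mismatch (score 3, normalized 3/11), while the iteration settles on the two-symbol exact match with normalized score 2/5.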

Time complexity: the product of the number of iterations and the time complexity of S-W algorithm • Space complexity: O(m)

The position of an optimal alignment may also be desired. • It can be obtained by extending the S-W algorithm so that each entry of the score matrix also stores information about the alignment path ending at that node, including the starting node position of the path.

RationalNLA • Objective: better time complexity

Introduction • Megiddo (1979): the parametric scores are linear in $\lambda$ • match: $1 - 2\lambda$ • mismatch: $-(\delta + 2\lambda)$ • indel: $-(\mu + \lambda)$ • $\lambda$ is kept as a variable; its candidate values can be precomputed

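Precomputed Method • Binary search + criteria: • If $LA^*(\lambda) = 0$, then $\lambda = NLA^*$, and an optimal alignment vector of $LA(\lambda)$ is also an optimal solution of NLA. • If $LA^*(\lambda) > 0$, try a larger $\lambda$ • If $LA^*(\lambda) < 0$, try a smaller $\lambda$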

Observation • Any two distinct candidate values for NLA* are not arbitrarily close to each other if the scores are rational

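Proof (I) • Let Q(a, b) be the set of possible values of NLA*: • $Q(a,b) = \{(x - \delta y - \mu z)/(2x + 2y + z + L) \mid (x,y,z) \in AV(a,b)\}$ • PROPOSITION 2 • Let $\sigma = \min\{|q_1 - q_2| : q_1, q_2 \in Q(a,b),\ q_1 \ne q_2\}$ • Suppose $\delta = p/q$ and $\mu = r/s$ are rational • Then $\sigma \ge 1/(qs(m+n+L)^2)$

Proof (II) • Let $q_1, q_2 \in Q(a,b)$ be the normalized scores obtained from alignment vectors $(x_1,y_1,z_1)$ and $(x_2,y_2,z_2)$, with $q_2 < q_1$ • $q_1 - q_2 = (x_1 - \delta y_1 - \mu z_1)/(2x_1 + 2y_1 + z_1 + L) - (x_2 - \delta y_2 - \mu z_2)/(2x_2 + 2y_2 + z_2 + L)$ • Two rationals with integer numerators and positive integer denominators satisfy $c_1/d_1 > c_2/d_2 \Rightarrow c_1/d_1 - c_2/d_2 \ge 1/(d_1 d_2)$, and for any alignment vector $(x,y,z) \in AV(a,b)$, $2x + 2y + z \le m + n$ • Therefore $q_1 - q_2 = \frac{1}{qs}\left[(qs\,x_1 - ps\,y_1 - qr\,z_1)/(2x_1 + 2y_1 + z_1 + L) - (qs\,x_2 - ps\,y_2 - qr\,z_2)/(2x_2 + 2y_2 + z_2 + L)\right] \ge 1/(qs(m+n+L)^2)$, hence $\sigma \ge 1/(qs(m+n+L)^2)$

Algorithm • Compute $\sigma$; there is an interval [e, f] such that NLA* lies in $[e\sigma, f\sigma]$ • Initially e = 0 and $f = \tfrac{1}{2}\sigma^{-1}$, since NLA* is in [0, ½) • Let k = (e+f)/2 and iteratively solve the parametric local alignment problem with parameter $k\sigma$ • The interval is updated according to the sign of the optimum value of the parametric problem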
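The binary search can be sketched in Python as follows. The slides leave several details open, so these are assumptions of the sketch, not the authors' stated procedure: e and f are kept as integers with an integer midpoint, f starts at the ceiling of 1/(2σ), L is a positive integer, the lower bound from Proposition 2 is used as σ, and once f − e ≤ 1 a single parametric solve at eσ recovers the optimal vector. `best_vector` is a hypothetical Smith-Waterman helper that tracks the alignment vector.

```python
from fractions import Fraction
from math import ceil


def best_vector(a, b, delta, mu):
    """Smith-Waterman returning (score, (x, y, z)) of one optimal local alignment."""
    zero = (Fraction(0), (0, 0, 0))
    best = zero
    prev = [zero] * (len(b) + 1)
    for u in range(1, len(a) + 1):
        cur = [zero] * (len(b) + 1)
        for v in range(1, len(b) + 1):
            s, (x, y, z) = prev[v - 1]
            hit = a[u - 1] == b[v - 1]
            diag = (s + 1, (x + 1, y, z)) if hit else (s - delta, (x, y + 1, z))
            s, (x, y, z) = prev[v]
            up = (s - mu, (x, y, z + 1))
            s, (x, y, z) = cur[v - 1]
            left = (s - mu, (x, y, z + 1))
            cur[v] = max(zero, diag, up, left, key=lambda t: t[0])
            best = max(best, cur[v], key=lambda t: t[0])
        prev = cur
    return best


def rational_nla(a, b, delta, mu, L):
    """Binary search for NLA* over multiples of sigma (L: positive integer)."""
    delta, mu = Fraction(delta), Fraction(mu)

    def solve(lam):
        # Proposition 1: LA*(lam) = (1 - 2 lam) * LA*_{delta', mu'} - lam * L
        scale = 1 - 2 * lam
        s, vec = best_vector(a, b, (delta + 2 * lam) / scale, (mu + lam) / scale)
        return scale * s - lam * L, vec

    # Lower bound of Proposition 2 on the gap between candidate values.
    sigma = Fraction(1, delta.denominator * mu.denominator * (len(a) + len(b) + L) ** 2)
    e, f = 0, ceil(1 / (2 * sigma))
    while f - e > 1:
        k = (e + f) // 2
        value, vec = solve(k * sigma)
        if value == 0:
            return k * sigma, vec               # k * sigma is exactly NLA*
        if value > 0:
            e = k                               # NLA* lies above k * sigma
        else:
            f = k                               # NLA* lies below k * sigma
    _, (x, y, z) = solve(e * sigma)
    return (x - delta * y - mu * z) / (2 * x + 2 * y + z + L), (x, y, z)


print(rational_nla("AACAA", "AAGAA", 1, 1, 1))
# (Fraction(2, 5), (2, 0, 0))
```

For this example σ = 1/121, so the search needs only a handful of parametric solves before the interval is narrower than the minimum gap between candidate values.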
