Extending Alignments

Material based on Chapter 13 of: Dan Gusfield, Algorithms on Strings, Trees and Sequences, Cambridge University Press

Parametric alignment with the use of scoring matrices • Definition: For any alignment A of two strings, let smt_A and sms_A denote, respectively, the total score (obtained from the scoring matrix) for the specific matches in A and the total score for the specific mismatches in A. As before, id_A and gp_A denote the number of indels and gaps contained in A. • Using scoring matrices, the parametric value of alignment A is α × smt_A + β × sms_A − γ × id_A − δ × gp_A.
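As a small illustration of the formula above, the following Python sketch evaluates the parametric value of an alignment from its four summary quantities. The names `AlignmentStats` and `parametric_value` are hypothetical and introduced only for this example, and the sketch assumes that both indels and gaps are penalized, so their terms are subtracted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignmentStats:
    """Summary of one alignment A (hypothetical container for this sketch)."""
    smt: float  # total scoring-matrix score of the matches in A
    sms: float  # total scoring-matrix score of the mismatches in A (often negative)
    ids: int    # number of indels in A
    gps: int    # number of gaps in A


def parametric_value(a, alpha, beta, gamma, delta):
    """Value of alignment a at the parameter point (alpha, beta, gamma, delta).

    Assumption: indels and gaps are penalties, so both terms are subtracted.
    """
    return alpha * a.smt + beta * a.sms - gamma * a.ids - delta * a.gps


example = AlignmentStats(smt=10, sms=-4, ids=3, gps=2)
print(parametric_value(example, alpha=1, beta=1, gamma=2, delta=3))
```

Because the value is linear in each parameter, fixing α and β leaves a function that is linear in (γ, δ), which is what makes the polygonal decomposition of the parameter space possible.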

Efficient algorithms for computing a polygonal decomposition • Ray-search problem: Given an alignment A, a point p where A is optimal, and a ray h in (γ, δ) space starting at p, find the furthest point (call it r*) from p on ray h where A remains optimal. If A remains optimal until h reaches a border of the parameter space, then r* is that border point on h. It is also possible that r* = p.

Newton’s ray-search algorithm
Set r to the (γ, δ) point where h intersects a border of the parameter space.
While A is not an optimal alignment at point r do
begin
  Find an optimal alignment A* at point r.
  Set r to the unique point on h where the value of A equals the value of A*.
end
Set r* to r.
• Lemma: 1) Newton’s ray-search algorithm finds r* exactly. 2) Unless A is optimal at the initial setting of r, the last computed alignment A* is cooptimal with A at r* and yet is also optimal on h for some nonzero distance beyond r*. 3) When Newton’s ray-search algorithm computes an alignment at a point r on h, none of the alignments computed previously (in this execution of Newton’s algorithm) are optimal at r. • It follows that if r* = p, then Newton’s method discovers this and returns an alignment A* that is optimal at p and also optimal for some nonzero distance along h. • For any polygon P intersected by h, a single ray-search computes alignments at no more than two points of P.
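The ray search above can be sketched in Python under several simplifying assumptions that are not part of the slides. Each alignment is reduced to a hypothetical triple `(base, ids, gps)`, where `base` stands for the fixed part α × smt + β × sms. A finite `pool` of candidate alignments replaces the dynamic-programming computation of an optimal alignment at a point. The ray is written as p + t × direction for 0 ≤ t ≤ t_max, and A is assumed to be optimal at p. Exact rational arithmetic is used so that the equality test at r* is reliable.

```python
from fractions import Fraction


def value(a, point):
    """Value of alignment a = (base, ids, gps) at point = (gamma, delta)."""
    base, ids, gps = a
    gamma, delta = point
    return base - gamma * ids - delta * gps


def newton_ray_search(a, pool, p, direction, t_max):
    """Return (t*, r*, last A*), where r* = p + t* * direction.

    Assumptions of this sketch: a is optimal at p, and max() over the finite
    pool stands in for computing an optimal alignment at a point.
    """
    def at(t):
        return (p[0] + t * direction[0], p[1] + t * direction[1])

    def slope(x):
        # Rate of change of the value of x per unit step along the ray.
        return -(direction[0] * x[1] + direction[1] * x[2])

    t = Fraction(t_max)           # start where the ray meets the border
    witness = None                # last alignment A* that beat a
    while True:
        r = at(t)
        a_star = max(pool, key=lambda x: value(x, r))
        if value(a, r) >= value(a_star, r):
            return t, r, witness  # a is optimal at r, so r is r*
        witness = a_star
        # Both values are linear in t; move to the point where they are equal.
        t = Fraction(value(a_star, p) - value(a, p)) / (slope(a) - slope(a_star))


A1, A2, A3 = (10, 4, 2), (6, 1, 1), (9, 2, 2)
t_star, r_star, last = newton_ray_search(A1, [A1, A2, A3], (0, 0), (1, 1), 10)
print(t_star, r_star, last)
```

In this toy run the search visits t = 10, then t = 1, then t = 1/2, where A1 and A3 are cooptimal. Each step jumps directly to a crossing point of two value lines, which is why no alignment computed earlier can be optimal at the new point.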

Uses for parametric alignment • Sensitivity analysis: check how sensitive the alignment is to changes in the parameters • Efficient computation of all cooptimal alignments

Computing suboptimal alignments • Optimal alignment, even with a wide range of models and parameter choices, does not always identify the biological phenomena that it is intended to reflect. • The available objective functions might not reflect the full range of biological forces that cause differences between strings. • The objective functions might not induce the optimal alignment to form the desired shape. • The data might contain errors that confound the algorithms. • There may be ties for the optimal alignment. • There may be many nearly optimal alignments that are biologically more significant than any optimal one.

δ-near-optimal alignments • Theorem: For any s-to-t path R, δ(R) = Σ_{e ∈ R} ε(e). • Corollary: Consider a path R’ from s to u and let δ denote Σ_{e ∈ R’} ε(e). Then the s-to-t path R consisting of path R’ followed by the longest u-to-t path is a δ-near-optimal path. • Proof: By definition of ε(e), ε(e) = 0 for any edge e on the longest u-to-t path. Hence δ(R) = δ by the previous theorem.

Counting and enumerating near-optimal paths - How to count • Definition: Let N(v, δ) be the number of δ-near-optimal s-to-t paths that go through node v. • For a given value Δ, the number of s-to-t paths whose deviation from R* is at most Δ is • We compute that sum by evaluating the following recurrence for each node v and for each “needed” value of δ:
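One way to carry out such a count is sketched below in Python. It is not claimed to be the recurrence from the slide. The sketch assumes a weighted acyclic graph given as a hypothetical dictionary `edges` mapping each node to a list of `(successor, weight)` pairs, with t a sink. It also assumes that the deviation of an edge (u, v) is ε(u, v) = ℓ(u) − (w(u, v) + ℓ(v)), where ℓ(x) is the length of the longest x-to-t path, so that the deviation of a path is the sum of its edge deviations.

```python
from functools import lru_cache

NEG_INF = float("-inf")


def count_near_optimal(edges, s, t, max_dev):
    """Count s-to-t paths whose total deviation from the longest path is <= max_dev.

    edges: dict node -> list of (successor, weight); assumed acyclic, t a sink.
    """
    @lru_cache(maxsize=None)
    def longest(v):
        # Length of the longest v-to-t path (NEG_INF if t is unreachable).
        if v == t:
            return 0
        return max((w + longest(u) for u, w in edges.get(v, ())), default=NEG_INF)

    @lru_cache(maxsize=None)
    def paths(v, budget):
        # Number of v-to-t paths whose edge deviations sum to at most budget.
        if v == t:
            return 1
        total = 0
        for u, w in edges.get(v, ()):
            if longest(u) == NEG_INF:
                continue  # this edge cannot lead to t
            eps = longest(v) - (w + longest(u))  # assumed edge deviation
            if eps <= budget:
                total += paths(u, budget - eps)
        return total

    return paths(s, max_dev)


graph = {
    "s": [("a", 3), ("b", 2)],
    "a": [("t", 2), ("b", 1)],
    "b": [("t", 2)],
}
print([count_near_optimal(graph, "s", "t", d) for d in (0, 1, 2)])
```

In the example graph the three s-to-t paths have lengths 6, 5 and 4, so the counts for Δ = 0, 1, 2 are 1, 2 and 3. Memoizing on the pair (node, remaining budget) means each pair is evaluated once, which mirrors the idea of evaluating a recurrence only for the "needed" values of δ.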

Counting and enumerating near-optimal paths - Enumeration • The δ-near-optimal paths can be enumerated in order of increasing δ, and the enumeration can be terminated when δ = Δ or when some fixed number of paths have been found. • A tree enumerating partial paths is maintained.
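A possible realization of this enumeration is sketched below in Python. It is a best-first search, chosen here for illustration and not necessarily the data structure used in the book: the priority queue holds the frontier of the tree of partial paths, keyed by accumulated deviation. The sketch assumes an acyclic graph given as a hypothetical dictionary `edges` of `(successor, weight)` lists, and assumes the edge deviation ε(u, v) = ℓ(u) − (w(u, v) + ℓ(v)), with ℓ(x) the length of the longest x-to-t path. Since every ε is nonnegative, complete paths leave the queue in order of increasing δ.

```python
import heapq
from itertools import count


def enumerate_near_optimal(edges, s, t, max_dev, limit=None):
    """Return [(deviation, path), ...] for s-to-t paths, in increasing deviation.

    Stops when deviations would exceed max_dev or after `limit` paths.
    edges: dict node -> list of (successor, weight); assumed acyclic.
    """
    neg_inf = float("-inf")
    memo = {}

    def longest(v):
        # Length of the longest v-to-t path, memoized.
        if v not in memo:
            memo[v] = 0 if v == t else max(
                (w + longest(u) for u, w in edges.get(v, ())), default=neg_inf)
        return memo[v]

    tie = count()                      # tie-breaker so paths are never compared
    heap = [(0, next(tie), (s,))]      # (deviation so far, tie, partial path)
    found = []
    while heap and (limit is None or len(found) < limit):
        dev, _, path = heapq.heappop(heap)
        v = path[-1]
        if v == t:
            found.append((dev, path))  # complete path, emitted in order of dev
            continue
        for u, w in edges.get(v, ()):
            if longest(u) == neg_inf:
                continue               # dead end, cannot reach t
            new_dev = dev + longest(v) - (w + longest(u))
            if new_dev <= max_dev:
                heapq.heappush(heap, (new_dev, next(tie), path + (u,)))
    return found


graph = {
    "s": [("a", 3), ("b", 2)],
    "a": [("t", 2), ("b", 1)],
    "b": [("t", 2)],
}
for dev, path in enumerate_near_optimal(graph, "s", "t", max_dev=2):
    print(dev, "-".join(path))
```

The accumulated deviation of a partial path is a lower bound on the deviation of any completion, and it is attained by following a longest path to t, so nothing with a smaller δ can appear after a path has been emitted.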

A one-dimensional chaining problem • Consider a set of r (possibly) overlapping intervals drawn on the line R, where each interval j has some associated value v(j). The problem is to select a subset of nonoverlapping intervals whose values sum to as large a number as possible.

One-dimensional algorithm • Let I be a list of all the 2r numbers representing the locations of the endpoints of the r intervals. Sort the numbers in I, and annotate each entry in I with the name of the interval it belongs to and whether it is a left or a right endpoint. For convenience, let I be a one-dimensional array. • Set max to zero. • For i from 1 to 2r do • begin • If I[i] represents the left end of an interval, say interval j, then set V[j] to v(j) + max. • If I[i] represents the right end of interval j, then set max to the maximum of max and V[j]. • end.
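The sweep above translates almost line for line into Python. In this sketch the function name `best_chain_1d` and the `(left, right, value)` tuple layout are choices made for the example. One tie-breaking rule is an assumption: when a right endpoint and a left endpoint share the same coordinate, the left endpoint is processed first, so intervals that merely touch are treated as overlapping.

```python
def best_chain_1d(intervals):
    """Maximum total value of a set of pairwise nonoverlapping intervals.

    intervals: list of (left, right, value) tuples.
    """
    LEFT, RIGHT = 0, 1
    events = []
    for j, (left, right, _) in enumerate(intervals):
        events.append((left, LEFT, j))
        events.append((right, RIGHT, j))
    # Sort by coordinate; at equal coordinates LEFT (0) sorts before RIGHT (1).
    events.sort()

    best_ending = [0] * len(intervals)  # V[j]: best chain value ending with j
    max_so_far = 0                      # best chain lying entirely to the left
    for _, kind, j in events:
        if kind == LEFT:
            best_ending[j] = intervals[j][2] + max_so_far
        else:
            max_so_far = max(max_so_far, best_ending[j])
    return max_so_far


print(best_chain_1d([(1, 4, 3), (3, 6, 4), (5, 8, 2), (7, 9, 5)]))
```

In the example the best selection is the pair (3, 6) and (7, 9) with total value 9. After the sort, which costs O(r log r), the sweep itself is linear in the number of endpoints.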


The two-dimensional chain problem • Definition: A subset of the rectangles is called a chain if no horizontal or vertical line intersects more than one rectangle in the subset and if the rectangles can be ordered so that each one is below and to the right of its predecessor. The value of a chain is the sum of the values of the rectangles in the chain. • The Chain Problem: Find a chain with maximum value over all chains.

Two-dimensional chain algorithm
List L begins empty.
For i from 1 to 2r do
begin
  If I[i] is the left end of a rectangle, say rectangle k, then
  begin
    Search L for the last triple where l_j is greater than h_k. That is, find the closest (in the y dimension) rectangle j with a triple in L whose lowest point is strictly above the highest point of rectangle k.
    Set V(k) to v(k) + V(j).
  end
  Else if I[i] is the right end of rectangle k, then
  begin
    Search L for the first triple where l_j is less than or equal to l_k.
    If l_j < l_k, or l_j = l_k and V(k) > V(j), then insert the triple (l_k, V(k), k) into L, in the proper location to keep the triples sorted by their l values.
    Delete from L the triple for every rectangle j’ where l_j’ <= l_k and V(k) > V(j’).
  end
end.
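A left-to-right sweep of this kind can be sketched in Python as follows. Several points are assumptions of the sketch, not statements from the slides: rectangles are hypothetical tuples `(x_left, x_right, y_low, y_high, value)`, the array I is taken to be the sorted x-coordinates of the rectangle ends, and left ends are processed before right ends at equal x. The insertion test is also written differently from the steps above, as an explicit dominance check: rectangle k is stored only if no stored rectangle with an equal or higher lowest edge already has a value of at least V(k). The list is kept in increasing order of l, so stored values strictly decrease along it.

```python
from bisect import bisect_left, bisect_right


def best_chain_2d(rects):
    """Maximum value of a chain of rectangles (each below and right of the last).

    rects: list of (x_left, x_right, y_low, y_high, value) tuples.
    """
    LEFT, RIGHT = 0, 1
    events = []
    for k, (x_left, x_right, _, _, _) in enumerate(rects):
        events.append((x_left, LEFT, k))
        events.append((x_right, RIGHT, k))
    events.sort()  # by x; LEFT before RIGHT at equal x

    lows, vals = [], []      # stored rectangles: y_low ascending, V descending
    best = [0] * len(rects)  # V(k): best chain value ending with rectangle k
    for _, kind, k in events:
        _, _, y_low, y_high, v = rects[k]
        if kind == LEFT:
            # Closest stored rectangle whose lowest edge is strictly above y_high.
            i = bisect_right(lows, y_high)
            best[k] = v + (vals[i] if i < len(lows) else 0)
        else:
            i = bisect_left(lows, y_low)
            if i < len(lows) and vals[i] >= best[k]:
                continue  # k is dominated by a stored rectangle, skip it
            # Drop stored rectangles that k dominates (lower or equal l, V <= V(k)).
            start = i
            while start > 0 and vals[start - 1] <= best[k]:
                start -= 1
            end = i + 1 if i < len(lows) and lows[i] == y_low else i
            lows[start:end] = [y_low]
            vals[start:end] = [best[k]]
    return max(best, default=0)


rects = [(0, 2, 8, 9, 2), (4, 6, 5, 6, 3), (8, 10, 1, 2, 4), (3, 12, 0, 10, 8)]
print(best_chain_2d(rects))
```

In the example the first three rectangles form a chain of value 9, which beats the single tall rectangle of value 8. Python lists make each insertion and deletion linear in the worst case, so this sketch does not by itself reach an O(r log r) bound; a balanced search tree in place of the two lists would be needed for that.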

Two-dimensional chain algorithm • Theorem: An optimal chain can be found in O(r log r) time.
