Multiple-Alignment

Multiple-Alignment • 藍永倫、陳婕妤、文國煒、劉宗灝、戴邦炘

Reference • Michael Brudno, Chuong B. Do, Gregory M. Cooper, Michael F. Kim, Eugene Davydov, NISC Comparative Sequencing Program, Eric D. Green, Arend Sidow, and Serafim Batzoglou. LAGAN and Multi-LAGAN: Efficient Tools for Large-Scale Multiple Alignment of Genomic DNA. Genome Res., Apr 2003; 13: 721-731; doi:10.1101/gr.926603 • Michael Brudno, Alexander Poliakov, Asaf Salamov, Gregory M. Cooper, Arend Sidow, Edward M. Rubin, Victor Solovyev, Serafim Batzoglou, and Inna Dubchak. Automated Whole-Genome Multiple Alignment of Rat, Mouse, and Human. Genome Res., Apr 2004; 14: 685-692; doi:10.1101/gr.2067704

Introduction • Multiple Sequence Alignment • AAA----CTGCAC----AG • A--CTG-CT--ACTG---G • ---CTGACTGC----TTA- • NP-complete

LAGAN toolkit • http://lagan.stanford.edu/ • LAGAN • Multi-LAGAN • Shuffle-LAGAN

LAGAN (Overview) • Performs a global alignment of two sequences at a time • Find local alignments (seeds). • Compute a rough global map. • Restricted DP.

Multi-LAGAN (Overview) • K sequences in total • Similar to K-clustering • Assumes the phylogenetic tree is known • At each step the two closest sequences are merged, so LAGAN is run K-1 times in total.

Phylogenetic Tree

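Performance Evaluation • Test data • ROSETTA set • 129 genes • Human and mouse • 10 Kbp on average • CFTR region of the 12 species above • 1 Mbp on average • Human annotated exons were first aligned to the other 11 species, and the result was used as the “gold standard”.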

ROSETTA

CFTR

MLAGAN

Finding evidence of misalignment

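Discussion • Multiple alignment handles more distant species better than pairwise alignment. • Local alignment vs. global alignment • MLAGAN is slower, but it has the highest accuracy. • LAGAN / MLAGAN are suited not only to closely related sequences; they also perform well on highly divergent ones.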

LAGAN

Three main steps 1. Generation of local alignments. 2. Construction of a rough global map. 3. Computation of the final global alignment.

1. Generation of Local Alignment • LAGAN uses CHAOS to find local homologies between two sequences. • Michael Brudno, Michael Chapman, Berthold Gottgens, Serafim Batzoglou, and Burkhard Morgenstern. Fast and sensitive multiple alignment of large genomic sequences. BMC Bioinformatics, 4:66, 2003. • CHAOS works by chaining short words, the seeds, which match between the two sequences. • Anchor: a chain of seeds, i.e. a local alignment.

1. Generation of Local Alignment • k: word length, c: degeneracy • A (k, c)-seed is a pair of k-long words that match with at most c differences between the two sequences. • d: maximum distance, s: maximum shift. • Two seeds are x letters apart in one sequence and y letters apart in the other. They can be chained together if: • x <= d and y <= d • |x - y| <= s
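
The seed and chaining conditions above can be sketched in a few lines of Python. This is only an illustration, not the CHAOS implementation: the function names `is_seed` and `can_chain` are hypothetical, and measuring x and y from the end of the previous seed to the start of the next one is an assumption (the slide does not say where the distances are measured from).

```python
def is_seed(word1: str, word2: str, c: int) -> bool:
    """Return True if two equal-length words differ in at most c positions."""
    if len(word1) != len(word2):
        return False
    mismatches = sum(a != b for a, b in zip(word1, word2))
    return mismatches <= c


def can_chain(prev, cur, k: int, d: int, s: int) -> bool:
    """Check the chaining rule for two k-long seeds.

    prev and cur are (pos_in_seq1, pos_in_seq2) start positions.
    Assumption: x and y count the letters between the end of the
    previous seed and the start of the current one.
    """
    x = cur[0] - (prev[0] + k)
    y = cur[1] - (prev[1] + k)
    if x < 0 or y < 0:
        return False  # the current seed must come after the previous one
    return x <= d and y <= d and abs(x - y) <= s


if __name__ == "__main__":
    print(is_seed("ACGT", "ACGA", c=1))
    print(can_chain((0, 0), (10, 12), k=4, d=10, s=3))
```

A larger c or a larger d and s admit more seeds and more chains, which is the speed versus sensitivity tradeoff the slides return to later.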

1. Generation of Local Alignment • Find seeds at the current location in seq1 • Find the previous seeds that fall into the search box • Do a range query: seeds are indexed by their diagonal. • Pick the previous seed that maximizes the score of the chain • Time O(n log n), where n is the number of seeds. • (Figure: search box behind the current seed, bounded by the distance cutoff and the gap cutoff; axes are location in seq1 and seq2, with the range of the search marked.)

1. Generation of Local Alignment • Scoring of Chains • Match scores and mismatch penalties for each pair of chained seeds. • Gap penalties proportional to |x - y| for each pair of chained seeds. • Chains are discarded if they score under a threshold t. • Rapid rescoring • Applied to the chains that score above t. • Rescore them by performing ungapped extensions in both directions from each seed, finding the optimal location to insert exactly one gap of size |x - y|.

2. Construction of a Rough Global Map • (b, e, b’, e’, s) represents a local alignment (anchor). • It runs from (b, b’) to (e, e’) • s is the score of the alignment • A1 < A2 iff e1 < b2 and e1’ < b2’ • A1 = (b1, e1, b1’, e1’, s1) • A2 = (b2, e2, b2’, e2’, s2) • A chain of local alignments A1 < A2 < … < Ak has score s1 + s2 + … + sk. • The optimal rough global map is the highest-scoring chain. • Computed using sparse dynamic programming (LIS-style) in time O(n log n), where n is the total number of local alignments.
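
As a rough illustration of the highest-scoring chain, the sketch below uses a plain quadratic dynamic program rather than the O(n log n) sparse DP named on the slide; it finds the same optimal chain but more slowly. The names `precedes` and `best_chain` are hypothetical, and anchors are assumed to satisfy b <= e and b’ <= e’.

```python
from typing import List, Tuple

# An anchor is (b, e, b2, e2, s): it runs from (b, b2) to (e, e2) with score s.
Anchor = Tuple[int, int, int, int, float]


def precedes(a1: Anchor, a2: Anchor) -> bool:
    """A1 < A2 iff A1 ends before A2 starts in both sequences."""
    return a1[1] < a2[0] and a1[3] < a2[2]


def best_chain(anchors: List[Anchor]) -> Tuple[float, List[Anchor]]:
    """Return (score, chain) of the highest-scoring chain (O(n^2) DP)."""
    if not anchors:
        return 0.0, []
    # Sorting by start in seq1 guarantees every predecessor comes earlier.
    order = sorted(anchors, key=lambda a: a[0])
    best = [a[4] for a in order]   # best chain score ending at each anchor
    back = [-1] * len(order)       # predecessor index for traceback
    for j in range(len(order)):
        for i in range(j):
            cand = best[i] + order[j][4]
            if precedes(order[i], order[j]) and cand > best[j]:
                best[j] = cand
                back[j] = i
    j = max(range(len(order)), key=best.__getitem__)
    total = best[j]
    chain = []
    while j != -1:
        chain.append(order[j])
        j = back[j]
    chain.reverse()
    return total, chain


if __name__ == "__main__":
    demo = [(0, 10, 0, 10, 5.0), (5, 15, 20, 30, 4.0),
            (12, 20, 12, 22, 6.0), (25, 30, 25, 35, 3.0)]
    print(best_chain(demo))
```

In the demo, the second anchor overlaps the first in seq1 and cannot be placed before the last one in seq2, so it is left out of the chain.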

2. Construction of a Rough Global Map • Recursive anchoring • The choice of the parameters k (length of seeds), c (maximum degeneracy of seeds), and t (score threshold) is a tradeoff between speed and sensitivity. • Speed: higher k, lower c. • Sensitivity: lower k, higher c. • To achieve a combination of speed and sensitivity, LAGAN first calls CHAOS with a restrictive set of parameters, then calls it again with more sensitive parameters in the regions between adjacent anchors (local alignments) of the global map.

3. Computation of Global Alignment • Limits the search area for each anchor, which runs from (i, j) to (i’, j’) in the M × N matrix • The rectangle (0, 0) to (i+r, j+r). • The rectangle (i’-r, j’-r) to (M, N). • The band enclosed by the two diagonals • (i-r, j+r) to (i’-r, j’+r) • (i+r, j-r) to (i’+r, j’-r) • r is a parameter, typically 15.

3. Computation of Global Alignment • Run Needleman-Wunsch dynamic programming on this limited area. • In this sense the anchors in LAGAN are more flexible than the anchors in MUMmer, AVID, and GLASS. • LAGAN's anchors provide only approximate locations through which the alignment should pass.
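
To make the idea of restricted dynamic programming concrete, here is a minimal banded Needleman-Wunsch sketch. It is a simplification, not LAGAN's procedure: it uses a single band of radius r around the main diagonal instead of rectangles plus bands around anchors, a linear gap penalty instead of affine gaps, and it returns only the score. The function name and the scoring values are assumptions chosen for illustration.

```python
def banded_nw_score(a: str, b: str, r: int,
                    match: int = 1, mismatch: int = -1, gap: int = -2) -> float:
    """Global alignment score using only cells with |i - j| <= r."""
    m, n = len(a), len(b)
    if abs(m - n) > r:
        raise ValueError("band too narrow to reach the final cell")
    neg_inf = float("-inf")
    # Cells outside the band stay at -inf and are never used as sources.
    score = [[neg_inf] * (n + 1) for _ in range(m + 1)]
    score[0][0] = 0
    for i in range(m + 1):
        for j in range(max(0, i - r), min(n, i + r) + 1):
            if i == 0 and j == 0:
                continue
            best = neg_inf
            if i > 0 and j > 0:
                sub = match if a[i - 1] == b[j - 1] else mismatch
                best = max(best, score[i - 1][j - 1] + sub)
            if i > 0:
                best = max(best, score[i - 1][j] + gap)  # gap in b
            if j > 0:
                best = max(best, score[i][j - 1] + gap)  # gap in a
            score[i][j] = best
    return score[m][n]


if __name__ == "__main__":
    print(banded_nw_score("ACGT", "ACGT", r=2))
    print(banded_nw_score("ACGT", "AGT", r=2))
```

Only about (2r + 1) cells per row are filled, so the work grows linearly with sequence length for a fixed r. For clarity the sketch still allocates the full matrix; a real implementation would store only the band.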

Memory-efficient Implementation • LAGAN performs the entire computation with memory proportional to the size of the largest rectangle. • LAGAN achieves this memory efficiency as follows: • 1. Allocate working memory for one rectangle and the neck that follows it, and compute the Needleman-Wunsch matrix. • 2. Trace back all optimal alignments ending in the cells of the rightmost column of the neck; in practice these soon converge to a single optimal alignment. • 3. Deallocate all working memory except the memory needed to keep the traced-back alignments. • 4. Repeat steps 1 to 3 for the next rectangle and neck.

LAGAN Running Time Analysis • The running time of LAGAN is dominated by the “rectangles”. • The running time of the “necks” is O(r(M+N)), which is linear in the sequence lengths. • Suppose there are n anchors, and let (x0, y0), …, (xn, yn) be the dimensions of the n+1 rectangles. Let $L_1 = \sum_{i=0}^{n} x_i$ and $L_2 = \sum_{i=0}^{n} y_i$ denote the total length of the inter-anchor segments in each sequence. We can assume the anchors are aligned in linear time and therefore ignore their length.

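LAGAN Running Time Analysis • The total number of cells in these rectangles is $\sum_{i=0}^{n} x_i y_i = \frac{L_1 L_2}{n+1} + \sum_{i=0}^{n} (x_i - \bar{x})(y_i - \bar{y})$, where $\bar{x} = L_1/(n+1)$ and $\bar{y} = L_2/(n+1)$ are the mean rectangle dimensions. • The first term, roughly L1L2/n, depends only on the effective lengths of the sequences and the total number of anchors. • If we assume a lower bound on acceptable anchor density, then L1L2/n behaves linearly in sequence length, because L1/n and L2/n are O(1).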

LAGAN Running Time Analysis • The total number of cells in these rectangles is $\sum_{i=0}^{n} x_i y_i = \frac{L_1 L_2}{n+1} + \sum_{i=0}^{n} (x_i - \bar{x})(y_i - \bar{y})$, where $\bar{x} = L_1/(n+1)$ and $\bar{y} = L_2/(n+1)$ are the mean rectangle dimensions. • The second term is at most (n+1)σxσy, roughly nσxσy, where σ denotes the standard deviation. • Assuming constant anchor density (a reasonable assumption for a fixed pair of organisms), this is linear in sequence length provided the standard deviations are constant. • If the anchors are spaced evenly and with a constant density, the running time will be linear in sequence length.
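
The decomposition used in this analysis can be checked numerically. The sketch below is an illustration under stated assumptions: `rectangle_cells` is a hypothetical helper, the rectangle dimensions are made-up numbers, and the means are taken over the actual number of rectangles (n+1 on the slides), for which the identity is exact.

```python
from statistics import pstdev


def rectangle_cells(xs, ys):
    """Split the total cell count into the mean term and the spread term.

    Returns (total, first, second, bound), where
    total  = sum of x_i * y_i,
    first  = L1 * L2 / m          (m = number of rectangles),
    second = total - first        (sum of products of deviations),
    bound  = m * sigma_x * sigma_y (Cauchy-Schwarz upper bound on |second|).
    """
    m = len(xs)
    l1, l2 = sum(xs), sum(ys)
    total = sum(x * y for x, y in zip(xs, ys))
    first = l1 * l2 / m
    second = total - first
    bound = m * pstdev(xs) * pstdev(ys)
    return total, first, second, bound


if __name__ == "__main__":
    # Unevenly spaced anchors: the spread term is positive.
    print(rectangle_cells([10, 20, 30], [12, 18, 30]))
    # Evenly spaced anchors: the spread term vanishes.
    print(rectangle_cells([10, 10, 10], [7, 7, 7]))
```

When every rectangle has the same dimensions, the second term is zero and the total is exactly L1·L2 divided by the number of rectangles, which matches the evenly spaced case described on the slide.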

References • LAGAN online • http://genome.lbl.gov/cgi-bin/VistaInput?align_pgm=lagan&num_seqs=2 • http://ai.stanford.edu/~serafim/CS262_2005/index.html • LAGAN • http://lagan.stanford.edu/lagan_web/citing.shtml • “Algorithms for Alignment of Genomic Sequences”, Michael Brudno, Department of Computer Science, Stanford University, PGA Workshop, 07/16/2004

LAGAN and Multi-LAGAN : Efficient Tools for Large-Scale Multiple Alignment of Genomic DNA

Outline • LAGAN • Multi-LAGAN • Performance Evaluation

Multi-LAGAN

Multiple Alignment • A natural extension of 2-sequence comparisons • More difficult than pairwise: the running time scales as the product of the lengths of all sequences • NP-complete problem (need heuristic approaches)

“Progressive Alignment” • the most widely used heuristic approach • Successive applications of a pairwise alignment algorithm • CLUSTALW (best-known) and MLAGAN

MLAGAN (Multi-LAGAN) • A multiple aligner based on progressive alignment with LAGAN • 2 main phases : • (1) Progressive alignment with LAGAN • (2) (optional) Iterative improvement 1. successively remove each sequence 2. realign it

Algorithm MLAGAN • Input : • K sequences X1,…,XK • A phylogenetic binary tree between them

Algorithm MLAGAN (cont.) 3 main steps: • (1): Generation of rough global maps. Find the rough global map between each pair of sequences. (step 1, 2 of LAGAN)

Algorithm MLAGAN (cont.) • (2): Progressive multiple alignment with anchors. 2.1 Perform a global alignment between the 2 closest sequences according to the phylogenetic tree using step 3 of LAGAN.

Algorithm MLAGAN (cont.) 2.2 Find the rough global maps of the new multi-sequence to all other multi-sequences. (Details and scoring metric come later.) 2.3 Iterate steps 2.1 and 2.2 (K-1 times), repeating until a single multiple alignment of all sequences is left.

Algorithm MLAGAN (cont.) • (3): (Optional) Iterative refinement with anchors. For each sequence Xi in the multiple alignment: 3.1 Find anchors between Xi & the other sequences that align better than a given cutoff.

Algorithm MLAGAN (cont.) 3.2 Align Xi to the multiple alignment of the other sequences with LAGAN. (Details come later.)

Align 2 Multi-sequences • In the order of the given phylogenetic tree. E.g. 1. (human, chimpanzee) 2. (mouse, rat) 3. (human/chimpanzee, mouse/rat) 4. (human/chimpanzee/mouse/rat, chicken)
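
The merge order in this example can be derived from the tree with a post-order traversal. The sketch below is an illustration only: the nested-tuple tree encoding and the name `merge_order` are assumptions, and a post-order walk gives one valid bottom-up order, not necessarily the exact order MLAGAN picks when several pairs are equally close.

```python
def merge_order(tree):
    """List the pairwise merges implied by a binary guide tree.

    tree is a leaf name (str) or a 2-tuple of subtrees. Each merge is
    reported as (left_group, right_group), with members joined by "/".
    """
    merges = []

    def visit(node):
        if isinstance(node, str):
            return [node]
        left, right = node
        left_members = visit(left)
        right_members = visit(right)
        merges.append(("/".join(left_members), "/".join(right_members)))
        return left_members + right_members

    visit(tree)
    return merges


if __name__ == "__main__":
    guide = ((("human", "chimpanzee"), ("mouse", "rat")), "chicken")
    for step, pair in enumerate(merge_order(guide), start=1):
        print(step, pair)
```

For K leaves the traversal always yields K-1 merges, one per internal node of the tree, which is why MLAGAN needs K-1 pairwise alignments.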

Align 2 Multi-sequences (cont.) • Step 2.2 of MLAGAN E.g. Compute the rough global map of 2-sequence X/Y and 1-sequence Z • (1) Anchors in the rough global maps between X & Z, Y & Z. • (2) Reweigh overlapped anchors : (s1+s2)*I/U

Align 2 Multi-sequences (cont.) I: length of intersection U: length of union • (3) The highest weight chain, by LIS.
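
A small sketch of the reweighting rule (s1+s2)*I/U follows. It makes assumptions the slides do not state: each anchor is reduced to a half-open interval [start, end) on the shared sequence Z, the overlap is measured in Z's coordinates, and `reweigh` is a hypothetical name.

```python
def reweigh(anchor1, anchor2):
    """Combine two overlapping anchors into one weight.

    Each anchor is (start, end, score) on the shared sequence Z.
    Returns (s1 + s2) * I / U, or None if the anchors do not overlap.
    """
    b1, e1, s1 = anchor1
    b2, e2, s2 = anchor2
    intersection = min(e1, e2) - max(b1, b2)
    if intersection <= 0:
        return None
    union = max(e1, e2) - min(b1, b2)
    return (s1 + s2) * intersection / union


if __name__ == "__main__":
    # X-Z anchor covers Z[100:200], Y-Z anchor covers Z[150:250].
    print(reweigh((100, 200, 50.0), (150, 250, 30.0)))
```

Anchors that cover the same stretch of Z keep their full combined score, while anchors that barely overlap are scaled down sharply, so agreement between X and Y on where Z aligns is rewarded.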

Scoring with Affine Gaps • An open research area (T-COFFEE) • 2 classical models : (1) sum-of-pairs model (2) consensus model

Scoring with Affine Gaps (cont.) • sum-of-pairs model :Sum of scores of all pairwise alignments • consensus model : • Create a “consensus string” by a majority vote at each position. • Sum of pairwise scores between the consensus and each individual sequence
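
The two classical models can be compared on a toy alignment. This sketch is a simplification: it uses a linear gap penalty rather than affine gaps, the scoring values and function names are assumptions, and ties in the majority vote are broken by whichever symbol appears first in the column.

```python
from collections import Counter
from itertools import combinations


def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score one aligned pair of symbols; gap against gap scores 0."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sum_of_pairs(rows):
    """Sum of the scores of all pairwise alignments induced by the rows."""
    return sum(pair_score(a, b)
               for r1, r2 in combinations(rows, 2)
               for a, b in zip(r1, r2))


def consensus(rows):
    """Majority vote at each column of the alignment."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def consensus_score(rows):
    """Sum of pairwise scores between the consensus and each row."""
    cons = consensus(rows)
    return sum(pair_score(a, b) for row in rows for a, b in zip(cons, row))


if __name__ == "__main__":
    alignment = ["ACGT", "ACGA", "A-GT"]
    print(consensus(alignment))
    print(sum_of_pairs(alignment), consensus_score(alignment))
```

Sum-of-pairs compares every row with every other row, so its cost grows quadratically with the number of sequences, while the consensus model compares each row only with the consensus string.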

Scoring with Affine Gaps (cont.) • Each scoring scheme has advantages & disadvantages. E.g. consensus • We use a “combination” of both : • sum-of-pairs => substitutions. • consensus => gaps. p.s. Most similar to CLUSTALW: ※ heuristically weighted per-sequence penalties => gaps

Scoring with Affine Gaps (cont.) • Stacking effect (consensus affine-gap) : Because gap-open penalties are large compared to match & mismatch scores, often it is favorable to artificially open additional gaps in order to stack the gap openings. • Solution: use “gap-end” penalty (== “gap-open” penalty)

Scoring with Affine Gaps (cont.) • consensus string : ATCTGT---CAG

Scoring with Affine Gaps (cont.) • Define : (Aij): K × L alignment matrix Aij belongs to {A, C, G, T, -} (Bij): K × L alignment matrix Bij belongs to {N, O, G, C}

Scoring with Affine Gaps (cont.) Bij = ‘O’ (gap-open): the ones opening a gap. ‘G’ (gap-continue): Aij=‘-’ except gap-open. ‘C’ (gap-close): the ones closing a gap. ‘N’ (nucleotide): Aij≠‘-’ except gap-close.
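
One way to read these definitions is sketched below for a single row of A. The interpretation is an assumption drawn from the slide wording: 'O' marks the first gap character of a run, 'G' the remaining gap characters, 'C' the first nucleotide after a gap run, and 'N' every other nucleotide. The function name `annotate_row` is hypothetical.

```python
def annotate_row(row: str) -> str:
    """Map one alignment row over {A, C, G, T, -} to labels over {N, O, G, C}."""
    labels = []
    prev_is_gap = False
    for symbol in row:
        is_gap = symbol == "-"
        if is_gap:
            # First gap character opens the gap, the rest continue it.
            labels.append("G" if prev_is_gap else "O")
        else:
            # First nucleotide after a gap run closes the gap.
            labels.append("C" if prev_is_gap else "N")
        prev_is_gap = is_gap
    return "".join(labels)


if __name__ == "__main__":
    print(annotate_row("ATCTGT---CAG"))
```

Applying `annotate_row` to each of the K rows of A gives the K × L matrix B, which makes gap-open and gap-close events explicit so that they can be penalized separately.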
