Alignment of Genomic Sequences

Alignment of Genomic Sequences Wen-Hsiung Li Ecology & Evolution Univ. of Chicago

Sequence Alignment (1) pairs of matched bases(2) pairs of mismatched bases(3) pairs consisting of a base from one sequence and a gap (null base) from the other sequence

Alignment as an Evolutionary Hypothesis TCAGA** *TC-GT

A: TCAGACGATTGLA = 11B: TCGGAGCTGLB = 9

Alignment ITCAG-ACG-ATTG|| | | | | |TC-GGA-GC-T-GMatches = 7Gaps = 6

Alignment IITCAGACGATTG|| || TCGGAGCTG--Matches = 4Gaps = 1

Alignment IIITCAG-ACGATTG|| | | | | TC-GGA-GCTG-Matches = 6Gaps = 4

Which alignment is best?

Gap and Mismatch Penalties • Gap penalty - a factor by which gap values are multiplied to make the gaps equivalent to mismatches • Mismatch penalty - an assessment of how frequently substitutions occur

Similarity Index S = x - wkzk X : number of matches Zk : number of gaps of length k wk : positive number representing penalty for gaps of length k

Distance (Dissimilarity) Index D = y + w'kzk y : number of mismatches zk : number of gaps of length k w'k: positive number representing penalty for gaps of length k

Gap penalty systems • Fixed - no gap extension penalty • Affine or Linear - has two componenets gap opening penalty and gap extension penalty • Logarithmic - also has two components but the cost increases more slowly allowing longer gaps than the latter system

Gap penalty systems Linear Logarithmic Fixed Gap penalty Gap length

BEST Gap opening cost = 2 Gap opening cost = 3 Gap extension cost = 6 Gap extension cost = 0 TCAG-ACG-ATTG|| | | | | | S = -5 S = -11 TC-GGA-GC-T-G TCAGACGATTG|| || S = -4 S = 1 TCGGAGCTG-- TCAG-ACGATTG|| | | | | S = -2 S = -6TC-GGA-GCTG-

Dynamic programming • Large searches are divided into succession of small stages: • solution of the initial search stage is trivial • each partial solution in a later stage can be calculated by reference to only a small number of solutions of the earlier stage • the final stage contains overall solution

A T G C G A 1 0 0 0 0 T 0 2 1 1 1 C 0 1 2 3 2 C 0 1 2 3 3 G 0 1 3 2 4 C 0 1 2 4 3 Pointer values and paths connecting the pointers

Traceback A T G C G A 1 0 0 0 0 T 0 2 1 1 1 C 0 1 2 3 2 C 0 1 2 3 3 G 0 1 3 2 4 C 0 1 2 4 3 ATGCG- || || ATCCGC AT--GCG || || ATCCGC-

Similarity Index S = x - wkzk x - number of matches zk - number of gaps of length k wk - a positive number representing penalty for gaps of length k

TCAGACGAGTG x = 6 (I) | | | | | | a gap of 2 bp TCGGA - - GCTG S = 6 - (a + 2b) TCAGACGAGTGx = 7 (II) | | | | | | | 2 gaps of 1 bp TCGGA -GC - TG S = 7 - 2(a + b) TCAGACGAGTG x = 7 (III) | | | | | | | 2 gaps of 1 bp TCGGA -G - CTG S = 7 - 2(a + b) TCAGACGAG - TG x = 8 (IV) | | | | | | | | 2 1-bp gaps; 1 2-bp gaps TC - G - - GAGCTGS = 8 - 2(a + b) - (a + 2b)

How to align twolong genomic sequences?

Traditional Seq. Alignment • The seqs. are usually known (coding or non-coding) and are homologous • They are not very long, usually < 10,000 base pairs (bp) • They contain no inversions • Relies on dynamic programming: The time and space required are O(N2), where N is the sequence length.

The Human Genome • Genome size: ~3.2 billion bp • Only ~1.5% is coding. • Contains numerous repetitive elements (more than 4 million). • Introns are usually longer than exons. • Non-coding regions evolve fast and are not well conserved.

Genomic Seq. Alignment • The seqs. can be > one million bp (Mb); e.g., the genome size of Mycobacteriumtuberculosis is about 4 Mb. Long time to align. Large computer memory. • May contain inversions and many tandem repeats. • May containnon-alignable (too divergent) segments.

Genomic Seq. Alignment Strategy: Search for anchors that can divide the sequences into subregions. The gaps between anchors can then be aligned by a local alignment algorithm.

The System of Delcher et al. (1999) • Three ideas: (1) Suffix trees; (2) the Longest Increasing Subsequence (LIS); and (3) the local alignment method of Smith and Waterman (1981) • Two closely homologous long sequences or genomes (A and B).

Step 1: Perform a Maximum Unique Match (MUM) decomposition of the two sequences A MUM is a subsequence that occurs once in sequence A and once in sequence B, and is not contained in any longer such sequence.

Max. Unique Matches (MUMs) MUM1 Seq. A tcgatcaAGCTCACTGATatgtaccat Seq. B cgagcgAGCTCACTGATcctgcatca MUM2 -acgctgaATCGACGTAGTCCATGtactgta agtgc-agATCGACGTAGTCCATGatgaat

Suffix Trees A suffix is a subseq. that begins at any position in the seq. & extends to the seq. end. g a a c c g a c c t 1 2 3 4 5 6 7 8 9 10 A suffix: c c g a c c t A suffix tree is a compact representation that stores all possible suffixes of a seq.

g a a c c g a c c t 1 2 3 4 5 6 7 8 9 10 Root o a t c ga 1 1 2 10 gacct accgacct accgacct cc t cct c 2 3 5 2 9 1 6 gacct t gacct t 3 7 4 8

g a a c c g a c c t# g a a c c t a c c t* 1 2 3 4 5 6 7 8 9 10 Root o a t c ga 1 1 2 10 accgacct# gacct# acc cc t cct c 5 3 2 5 2 9 6 t gacct# gacct# gacct# t# 4 tacct* 3 7 4 8 1 1 7

Step 2: Sort the MUMs After finding the MUMs, we sort them according to their positions in genome A. See figure. Longest Increasing Sequence (LIS): If the order of B positions is given by the sequence [1,2,10,4,5,8,6,7,9,3], the LIS is [1,2,4,5,6,7,9]. The LIS gives a global MUM-alignment.

3 5 6 1 2 4 7 Genome A: 3 6 5 Genome B: 1 2 4 7 6 1 2 4 7 Genome A: Genome B: 6 1 2 4 7

Step 3: Close the gaps between MUMs Use the Smith-Waterman algorithm to close the gaps between MUMs. Some regions may be very difficult to align. These regions are ignored and considered as non-alignable parts. Default: If the gap between 2 MUMs is 10 kb, no local alignment is attempted.

Alignment of Genomic Sequences