Sequence Analysis

Determining how similar two (or more) gene/protein sequences are to each other is a “staple” function in bioinformatics. This information is used to:
1) Identify genes/proteins
2) Infer gene/protein function
3) Measure genetic distance
The entire exercise relies on the comparison between two (or more) sequences, and is independent of any functional content within the sequence(s).

In “Pair Wise” analysis and “Multiple Sequence Alignments”, two (or more) sequences are compared to each other and a similarity measurement is derived. This process is completely computational and there is no need for a database query. From this process we can:
1) Identify common regions of sequence identity (infer function).
2) Rank order multiple sequences to identify the sequences that are most similar (measure genetic distance).

In “Sequence Identification”, we compare our sequence(s) of interest to an entire database of (known) sequences, and identify those sequences that are most similar to our sequence of interest.

Theoretical Basis of Pairwise Sequence Analysis
Needleman-Wunsch Algorithm: Global Alignment (entire sequence contributes to alignment)
Fundamental Principle: calculate the alignment score across two sequences. All possible pairs of positions are represented by a two-dimensional array, and all possible alignments are represented by pathways through the array.
Represents Dynamic Programming: solving a series of smaller sub-problems, and reusing their stored results, to solve the entire problem (“Divide and Conquer”, with no sub-problem solved twice).
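As a minimal sketch of the array-filling idea described above, the following Python function computes a global alignment score. The scoring values (match +1, mismatch -1, gap -2 per position) and the function name are assumptions chosen for illustration; they are not taken from the slides.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the optimal global alignment score of strings a and b.

    Scoring values are illustrative assumptions (linear gap penalty).
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Initialization: aligning a prefix against nothing costs one gap per letter.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Fill: each cell is the best of three ways to extend a shorter alignment.
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            diagonal = score[i - 1][j - 1] + pair   # align a[i-1] with b[j-1]
            up = score[i - 1][j] + gap              # a[i-1] against a gap
            left = score[i][j - 1] + gap            # b[j-1] against a gap
            score[i][j] = max(diagonal, up, left)

    # The bottom-right cell covers both sequences in full (global alignment).
    return score[-1][-1]
```

Calling `needleman_wunsch_score("OFFICEUNIVERSITY", "COFFEEICEVARSITY")` scores the two demo words used in these slides; the result depends entirely on the assumed scoring values.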

DYNAMIC PROGRAMMING and SEQUENCE ALIGNMENTS
'Dynamic programming' is an efficient programming technique for solving certain combinatorial problems. It is particularly important in bioinformatics as it is the basis of sequence alignment algorithms for comparing protein and DNA sequences. In the bioinformatics application, dynamic programming gives a spectacular efficiency gain over a purely recursive algorithm.

Don't expect much enlightenment from the etymology of the term 'dynamic programming', though. Dynamic programming was formalized in the early 1950s by mathematician Richard Bellman, who was working at RAND Corporation on optimal decision processes. He wanted to concoct an impressive name that would shield his work from US Secretary of Defense Charles Wilson, a man known to be hostile to mathematics research. His work involved time series and planning, thus 'dynamic' and 'programming' (note, nothing particularly to do with computer programming). Bellman especially liked 'dynamic' because "it's impossible to use the word dynamic in a derogatory sense"; he figured dynamic programming was "something not even a Congressman could object to."
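To make the efficiency gain over a purely recursive algorithm concrete, the sketch below scores the same alignment twice: once with plain recursion and once with stored sub-results (memoization via `functools.lru_cache`). Both functions also report how many times the scoring step ran. The function names and the scoring values (match +1, mismatch -1, gap -2) are illustrative assumptions.

```python
from functools import lru_cache


def align_score_recursive(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score by plain recursion; returns (score, call count)."""
    calls = 0

    def best(i, j):
        nonlocal calls
        calls += 1
        if i == 0:
            return j * gap
        if j == 0:
            return i * gap
        pair = match if a[i - 1] == b[j - 1] else mismatch
        return max(best(i - 1, j - 1) + pair,
                   best(i - 1, j) + gap,
                   best(i, j - 1) + gap)

    return best(len(a), len(b)), calls


def align_score_memoized(a, b, match=1, mismatch=-1, gap=-2):
    """Same recursion, but each (i, j) sub-problem is computed only once."""
    calls = 0

    @lru_cache(maxsize=None)
    def best(i, j):
        nonlocal calls
        calls += 1  # counts only cache misses, i.e. real computations
        if i == 0:
            return j * gap
        if j == 0:
            return i * gap
        pair = match if a[i - 1] == b[j - 1] else mismatch
        return max(best(i - 1, j - 1) + pair,
                   best(i - 1, j) + gap,
                   best(i, j - 1) + gap)

    return best(len(a), len(b)), calls
```

For two 6-letter words the memoized version performs one computation per array cell (7 x 7 = 49), while the plain recursion revisits the same cells many times over; the gap widens rapidly with sequence length.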

Alignment of 2 “Sequences” (words for demo purposes)

OFFICEUNIVERSITY
COFFEEICEVARSITY

“Ungapped Alignment”

OFFICEUNIVERSITY
  |  |   | |||||
COFFEEICEVARSITY

-OFFICEUNIVERSITY
 |||
COFFEEICEVARSITY
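The ungapped comparisons above can be reproduced with a short sketch that counts identical letters when one word is slid along the other. The function name and the `offset` parameter are hypothetical, introduced only for this illustration.

```python
def ungapped_matches(a, b, offset=0):
    """Count identical positions when `a` is shifted right by `offset`
    letters relative to `b` (a negative offset shifts it left).
    No gaps are inserted inside either sequence.
    """
    matches = 0
    for i, letter in enumerate(a):
        j = i + offset  # position in b that lines up with a[i]
        if 0 <= j < len(b) and letter == b[j]:
            matches += 1
    return matches


top = "OFFICEUNIVERSITY"
bottom = "COFFEEICEVARSITY"
print(ungapped_matches(top, bottom, offset=0))  # 8 identical positions
print(ungapped_matches(top, bottom, offset=1))  # 3 identical positions (OFF)
```

Sliding the top word by one letter lines up "OFF" but loses the "RSITY" matches, which is why a single ungapped offset cannot capture both regions at once.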

Alignment of 2 “Sequences” (words for demo purposes)

OFFICEUNIVERSITY
COFFEEICEVARSITY

“Gapped Alignment”

-OFF--ICEUNIVERSITY
 |||  |||   | |||||
COFFEEICE---VARSITY

If gaps at any position (and of any length) are allowed, the process becomes computationally expensive, and in many cases the alignment does not provide meaningful information. Hence gaps must be limited to a useful and manageable number.
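To give a feel for why unrestricted gaps are computationally expensive, the sketch below counts how many distinct gapped alignments exist for two sequences of given lengths. It assumes that every alignment column is one of three kinds (letter/letter, letter/gap, gap/letter) and that alignments differing only in the order of adjacent gaps are counted separately; the function name is hypothetical.

```python
def count_alignments(n, m):
    """Number of gapped alignments of sequences of length n and m,
    counted as pathways through the (n+1) x (m+1) alignment array.
    """
    # Only one way to align anything against an empty sequence: all gaps.
    paths = [[1] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            paths[i][j] = (paths[i - 1][j - 1]   # last column: letter/letter
                           + paths[i - 1][j]     # last column: letter/gap
                           + paths[i][j - 1])    # last column: gap/letter
    return paths[n][m]


for length in (1, 2, 3, 16):
    print(length, count_alignments(length, length))
```

Two 3-letter words already have 63 possible alignments, and the count grows roughly exponentially with length, so enumerating every alignment of the 16-letter demo words one by one is impractical.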

Dynamic Programming (Initialization Step)


Gap Penalties:
1) Reduce number of gaps in the alignment
2) Ensure a more meaningful alignment
3) Opening a gap is costly
4) Extending a gap is cheap

Gap opening penalty: should be 2 to 3 times larger than the most negative value in the substitution matrix that is being used.
Gap extension penalty: should be 0.1 to 0.3 times the value of the gap opening penalty.
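The rules of thumb above can be turned into a small calculation. The sketch below derives the suggested penalty ranges from a substitution matrix's most negative value, and prices a gap under one common convention: the first gap position costs the opening penalty and each further position costs the extension penalty. That convention, and both function names, are assumptions for illustration; alignment programs differ in how they define the two penalties.

```python
def gap_penalty_ranges(most_negative_score):
    """Suggested (opening, extension) penalty ranges, as positive costs."""
    worst = abs(most_negative_score)
    open_low, open_high = 2 * worst, 3 * worst
    # Extension: 0.1 to 0.3 times the opening penalty.
    extend_low, extend_high = 0.1 * open_low, 0.3 * open_high
    return (open_low, open_high), (extend_low, extend_high)


def affine_gap_cost(length, gap_open, gap_extend):
    """Cost of one gap of `length` positions (assumed convention:
    opening penalty for the first position, extension for the rest)."""
    if length <= 0:
        return 0
    return gap_open + (length - 1) * gap_extend


# Example: a matrix whose most negative entry is -4.
print(gap_penalty_ranges(-4))
# One gap of length 3 versus three separate gaps of length 1:
print(affine_gap_cost(3, gap_open=10, gap_extend=1))      # 12
print(3 * affine_gap_cost(1, gap_open=10, gap_extend=1))  # 30
```

Because opening is costly and extending is cheap, one long gap is penalized far less than several scattered gaps of the same total length, which pushes the alignment toward fewer, more meaningful gaps.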


-OFF--ICEUNIVERSITY
 |||  |||   | |||||
COFFEEICE---VARSITY

-OFF--ICE
 |||  |||
COFFEEICE

VERSITY
| |||||
VARSITY

Theoretical Basis of Pairwise Sequence Analysis
Smith-Waterman Algorithm: Local Alignment
Fundamental Principle: based on Needleman-Wunsch, but compares segments of all possible lengths and chooses whichever optimizes the similarity measure. This allows the user to search for conserved/functional domains within sequences.
Functionally, a global alignment starts its traceback at the far corner of the alignment matrix and traces back through the entire length of both sequences, whereas a local alignment traces back only from the highest-scoring cell and reports only the regions that align.
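A minimal sketch of local alignment follows. It differs from the global method in two ways: no cell score is allowed to drop below zero, and the traceback starts at the highest-scoring cell and stops at the first zero. The scoring values (match +2, mismatch -1, gap -2 per position) and the function name are illustrative assumptions.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned segment of a, aligned segment of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]  # first row/column stay 0
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(0,                           # start afresh
                              score[i - 1][j - 1] + pair,  # letter/letter
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Traceback from the best cell until the score falls to zero.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        pair = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + pair:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1

    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("XXACGTXX", "YYACGTYY"))  # (8, 'ACGT', 'ACGT')
```

Only the shared "ACGT" segment is reported; the unrelated flanking letters, which a global alignment would be forced to include, are simply left out.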

Pair Wise Alignment
Process: Compares 2 sequences
Objective: Find common sequence motifs
Application: http://www.ncbi.nlm.nih.gov/blast/bl2seq/wblast2.cgi

Multiple Alignments
Process: Compares 3 or more sequences
Objective: Find common sequence motifs, rank based on alignment scores
Application: http://www.ebi.ac.uk/clustalw/

Sequence Searching
Process: Compares 1 sequence against thousands
Objective: Sequence identification, comparative genomics
Application: http://www.ncbi.nlm.nih.gov/BLAST/

BLAST (Basic Local Alignment Search Tool)
• Why is BLAST so fast?
• By pre-indexing all the possible 11-letter words in the database records.
• EXAMPLE: “AGTGTCGATCG”
• Steps:
  1) Find all the 11-letter words in your query sequence, plus a few variations.
  2) Look these up in the 11-letter-word index.
  3) Retrieve all sequences containing those words.
  4) Use a rigorous algorithm (e.g. Smith-Waterman) to extend the match in both directions.
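Steps 1 to 3 above can be sketched as a word index built with a Python dictionary. This is an illustration of the indexing idea only, not BLAST's actual implementation: it looks up exact words (the "few variations" are omitted), it leaves out the extension step, and the function names and the toy database are hypothetical.

```python
from collections import defaultdict


def build_word_index(database, k=11):
    """Map every k-letter word to the (sequence id, position) pairs
    where it occurs. `database` is a dict of id -> sequence string."""
    index = defaultdict(list)
    for seq_id, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((seq_id, pos))
    return index


def find_seed_hits(query, index, k=11):
    """Look up each k-letter word of the query in the index and return
    (sequence id, query position, database position) for every hit."""
    hits = []
    for qpos in range(len(query) - k + 1):
        word = query[qpos:qpos + k]
        for seq_id, dpos in index.get(word, []):
            hits.append((seq_id, qpos, dpos))
    return hits


# Toy database; k=3 keeps the example small (the slides describe k=11).
database = {"seq1": "AGTGTCGATCG", "seq2": "TTTTTTTT"}
index = build_word_index(database, k=3)
print(find_seed_hits("GTCGA", index, k=3))
```

The index is built once, ahead of time, so each query word costs a single dictionary lookup instead of a scan through every database sequence. Each hit would then serve as the starting point for the extension in step 4.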
