Bioinformatics Sequences Comparison Guide

Sequence similarity

Sequence Comparison
Much of bioinformatics involves sequences:
• DNA sequences
• RNA sequences
• Protein sequences
We can think of these sequences as strings of letters:
• DNA & RNA: alphabet of 4 letters
• Protein: alphabet of 20 letters

Sequence Comparison - Motivation
• Nucleotide
  • Learn about evolutionary relationships
  • Find genes, domains, signals …
• Protein
  • Learn about evolutionary relationships
  • Classify protein families (function, structure)
  • Identify common domains (function, structure)

Calculation of an alignment score

How do we align two sequences?

  ATTGCAGTGATCG
  ATTGCGTCGATCG

Solution 1 (10 matches, 3 mismatches):

  ATTGCAGTGATCG
  |||||   |||||
  ATTGCGTCGATCG

Solution 2 (12 matches, 2 gaps):

  ATTGCAGT-GATCG
  ||||| || |||||
  ATTGC-GTCGATCG

We will use a scoring scheme (two alternatives):

                 Scheme 1   Scheme 2
  Match             +1         +1
  Mismatch          -1          0
  Indel (gap)       -2         -2

Solution 1 (10 matches, 3 mismatches):
  Scheme 1: 10×1 + 3×(-1) = 7
  Scheme 2: 10×1 + 3×0 = 10
Solution 2 (12 matches, 2 gaps):
  Scheme 1: 12×1 + 2×(-2) = 8
  Scheme 2: 12×1 + 2×(-2) = 8

Which alignment is better? The answer depends on the scheme: with a mismatch score of -1, Solution 2 wins (8 vs. 7); with a mismatch score of 0, Solution 1 wins (10 vs. 8).
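The arithmetic above can be checked with a few lines of Python. This is a minimal sketch, not part of the original slides: the function name `score_alignment` and the use of `-` as the gap character are assumptions made for illustration.

```python
def score_alignment(row_a, row_b, match=1, mismatch=-1, gap=-2):
    """Score two equal-length aligned rows column by column ('-' marks a gap)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            total += gap        # indel column
        elif x == y:
            total += match      # identical letters
        else:
            total += mismatch   # different letters
    return total


# The two candidate alignments from the slides.
solution_1 = ("ATTGCAGTGATCG", "ATTGCGTCGATCG")
solution_2 = ("ATTGCAGT-GATCG", "ATTGC-GTCGATCG")

for name, (a, b) in (("Solution 1", solution_1), ("Solution 2", solution_2)):
    with_penalty = score_alignment(a, b, mismatch=-1)  # mismatch = -1
    no_penalty = score_alignment(a, b, mismatch=0)     # mismatch = 0
    print(name, with_penalty, no_penalty)
```

Changing only the mismatch score is enough to flip which alignment ranks first, which is why the choice of scoring scheme matters.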

Scoring Alignments - intuition
• Similar sequences evolved from a common ancestor
• Evolution changed the sequences from this ancestral sequence by mutations:
  • Replacement: one letter replaced by another
  • Deletion: deletion of a letter
  • Insertion: insertion of a letter
• Scoring of sequence similarity should examine how many operations took place
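One way to make "how many operations took place" concrete is to count the smallest number of replacements, insertions and deletions that turn one sequence into the other. The sketch below uses the standard edit (Levenshtein) distance, which the slides do not name; treating all three operations as equally costly is an assumption made here for simplicity.

```python
def edit_distance(a, b):
    """Minimum number of replacements, insertions and deletions turning a into b."""
    # prev[j] = distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j - 1] + (ca != cb),  # replacement (free if letters match)
                prev[j] + 1,               # deletion of a letter from a
                curr[j - 1] + 1,           # insertion of a letter into a
            ))
        prev = curr
    return prev[-1]


print(edit_distance("ATTGCAGTGATCG", "ATTGCGTCGATCG"))
```

For the two 13-letter sequences aligned earlier, the minimum corresponds to the gapped alignment (one deletion plus one insertion) rather than the three replacements of the ungapped one.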

Causes for sequence (dis)similarity
• mutation: a nucleotide at a certain location is replaced by another nucleotide (e.g.: ATA → AGA)
• insertion: at a certain location one new nucleotide is inserted between two existing nucleotides (e.g.: AA → AGA)
• deletion: at a certain location one existing nucleotide is deleted (e.g.: ACTG → AC-G)
• indel: an insertion or a deletion

Gaps
• Positions at which a letter is paired with a null are called gaps.
• Gap scores are typically negative.
• Since a single mutational event may cause the insertion or deletion of more than one residue, the presence of a gap is ascribed more significance than the length of the gap.

Gap Opening
• The gap-opening penalty defines the cost for opening a gap in one of the sequences.
• If you raise the gap-opening penalty above the default, local alignments that contain gaps may be split into several shorter alignments.

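Affine Gap Penalties
• In nature, a series of indels often comes as a single event rather than a series of single-nucleotide events:

  This is more likely (one gap of length 2):
    ATA__GC
    ATATTGC

  This is less likely (two gaps of length 1):
    ATAG_GC
    AT_GTGC

• Normal scoring (a fixed penalty per gap position) would give the same score for both alignments, since each contains two gap positions.
• An affine penalty separates the cost of opening a gap from the cost of extending it:
  Gap = Gapopen + Len * Gapextend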
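The effect of the formula Gap = Gapopen + Len * Gapextend can be seen by comparing one gap of length 2 with two gaps of length 1. This is a minimal sketch; the penalty values (5 for opening, 1 for extension) are hypothetical and chosen only for illustration, and penalties are written as positive costs.

```python
def affine_gap_penalty(length, gap_open, gap_extend):
    """Cost of a single gap of the given length: Gapopen + Len * Gapextend."""
    return gap_open + length * gap_extend


def total_gap_penalty(gap_lengths, gap_open, gap_extend):
    """Sum the affine cost over every separate gap in an alignment."""
    return sum(affine_gap_penalty(n, gap_open, gap_extend) for n in gap_lengths)


GAP_OPEN, GAP_EXTEND = 5, 1  # hypothetical values, not from the slides

one_long_gap = total_gap_penalty([2], GAP_OPEN, GAP_EXTEND)       # ATA__GC / ATATTGC
two_short_gaps = total_gap_penalty([1, 1], GAP_OPEN, GAP_EXTEND)  # ATAG_GC / AT_GTGC

# With no opening cost the two alignments are indistinguishable.
linear_long = total_gap_penalty([2], 0, 2)
linear_short = total_gap_penalty([1, 1], 0, 2)

print(one_long_gap, two_short_gaps, linear_long, linear_short)
```

With an opening cost, the single two-residue gap is cheaper than two separate gaps, matching the intuition that one indel event is more likely than two.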

Gap penalties lead to:
• Increasing the penalties for gap opening and extension
  • The alignment will contain fewer gaps and more mismatches
• Decreasing the penalties for gap opening and extension
  • The alignment will contain more gaps (of varied lengths) and fewer mismatches
• Keeping the gap-opening penalty fixed while increasing the gap-extension penalty
  • Very long gaps will not be tolerated; they will be replaced with additional gaps of medium length and with mismatches.


Global alignment
A global alignment between two sequences is an alignment in which all the characters in both sequences participate in the alignment. Because sequences that are related over their whole length are also easily identified by local alignment methods, global alignment is now somewhat deprecated as a technique.
[Figure: schematic of global vs. local alignment]

Local alignment
Local alignment methods find related regions within sequences; the aligned regions can consist of a subset of the characters within each sequence. For example, positions 20-40 of sequence A might be aligned with positions 50-70 of sequence B. This is a more flexible technique than global alignment and has the advantage that related regions which appear in a different order in the two proteins can be identified as being related.
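The difference between the two modes shows up clearly in the dynamic-programming recurrences usually used to compute them. The sketch below follows the standard Needleman-Wunsch (global) and Smith-Waterman (local) schemes, which the slides do not name; it returns only the optimal score, uses a simple per-position gap penalty, and the example sequences are hypothetical.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Optimal alignment score of a and b (score only, no traceback).

    local=False: every character of both sequences must take part (global).
    local=True:  the best-scoring pair of substrings is scored (local).
    """
    # prev[j] = best score for the processed prefix of a against b[:j]
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # a[i-1] aligned to a gap
            left = curr[j - 1] + gap  # b[j-1] aligned to a gap
            cell = max(diag, up, left)
            if local:
                cell = max(cell, 0)   # a local alignment may restart anywhere
                best = max(best, cell)
            curr.append(cell)
        prev = curr
    return best if local else prev[-1]


short, long_seq = "ACGT", "TTTTACGTTTTT"  # hypothetical example sequences
print(alignment_score(short, long_seq))              # global
print(alignment_score(short, long_seq, local=True))  # local
```

For a short motif embedded in a longer sequence, the global score is dragged down by the gaps needed to cover the flanks, while the local score reflects only the matching region.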

Global vs. Local
[Figure: global and local alignment compared; Jack Leunissen]

Global vs. Local:
• Use global alignment if
  • You expect, based on some biological information, that your sequences will match over the entire length.
  • Your sequences are of similar length.
• Use local alignment if
  • You expect that only certain parts of the two sequences will match (as in the case of a conserved segment that can be found in many different proteins).
  • Your sequences are very different in length.
  • You want to search a sequence database (we will talk about it in detail later).

If two proteins share more than one common region, for example one has a single copy of a particular domain while the other has two copies, a local alignment tool that presents only the best-scoring alignment may "miss" one of the two copies. Emboss [best solution] vs. Lalign (Embnet) [several solutions]

Comparing nucleotides
• Every match got the same score
• Every mismatch got the same score
• Gaps: we chose the penalties ourselves, but the defaults are usually good.

However, in the case of amino acids (aa):
• Not all matches are the same
• Different mismatches get different scores

Amino acid properties
Aspartic acid (D) and Glutamic acid (E) have similar properties
Serine (S) and Threonine (T) have similar physicochemical properties
=> Substitution of S/T or E/D occurs relatively often during evolution
=> Substitution of S/T or E/D should result in scores that are only moderately lower than identities

Each aa is characterized by a combination of features (size, charge, etc.). The relative importance of each feature may vary according to the role of the aa in the 3-D structure and function of the protein. So how can we score matches and mismatches?

Amino Acid Substitution Matrices
The PAM and BLOSUM substitution matrices describe the likelihood that two residue types would mutate to each other. These matrices are based on biological sequence information: the substitutions observed in structural (BLOSUM) or evolutionary (PAM) alignments of well-studied protein families. These scoring systems have a probabilistic foundation.
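The probabilistic foundation is usually expressed as a log-odds score: the observed frequency of a residue pair in trusted alignments is compared with the frequency expected if the two residues were paired by chance. The sketch below illustrates that idea; the half-bit scaling (factor 2 with log base 2) is the convention commonly quoted for BLOSUM and is an assumption here, and all frequencies are hypothetical numbers chosen to make the arithmetic easy to follow.

```python
import math


def log_odds_score(q_ab, p_a, p_b, scale=2.0):
    """Rounded log-odds score: scale * log2(observed pair freq / chance pair freq)."""
    return round(scale * math.log2(q_ab / (p_a * p_b)))


# Hypothetical background frequencies for two residues a and b.
P_A = P_B = 0.125  # chance pairing frequency = 0.125 * 0.125 = 0.015625

print(log_odds_score(0.0625, P_A, P_B))     # pair seen 4x more often than chance
print(log_odds_score(0.015625, P_A, P_B))   # pair seen exactly as often as chance
print(log_odds_score(0.0078125, P_A, P_B))  # pair seen half as often as chance
```

Pairs observed more often than chance get positive scores, pairs observed less often get negative scores, and a score of zero means the alignment data say nothing either way.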

PAM series - Percent Accepted Mutation (Accepted by natural selection)
• All the PAM data come from alignments of closely related proteins (>85% amino acid identity) from 71 protein families (total of 1572 protein sequences).
• PAM matrices are based on global sequence alignments; these include both highly conserved and highly mutable regions.
Some of the protein families are: Ig kappa chain, Kappa casein, Lactalbumin, Hemoglobin a, Myoglobin, Insulin, Histone H4, Ubiquitin

Various degrees of conservation
PAM1 is the matrix calculated from comparisons of sequences with no more than 1% divergence. At an evolutionary interval of PAM1, an average of one change has occurred over a length of 100 amino acids. Other PAM matrices are extrapolated from PAM1. For PAM250, an average of 250 changes have occurred for two proteins over a length of 100 amino acids (the same position may change more than once). All the PAM data come from closely related proteins (>85% amino acid identity).
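Extrapolation from PAM1 is conventionally done by raising the PAM1 mutation probability matrix to a power, so that PAM250 corresponds to 250 successive PAM1 steps. The sketch below shows that mechanism on a toy two-letter alphabet; the 2×2 matrix is hypothetical (the real PAM1 matrix is 20×20), and the final conversion of probabilities into log-odds scores is left out.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, n):
    """Raise a square matrix to the n-th power by repeated squaring."""
    size = len(m)
    result = [[float(i == j) for j in range(size)] for i in range(size)]  # identity
    base = m
    while n > 0:
        if n & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        n >>= 1
    return result


# Hypothetical "PAM1" for a two-letter alphabet: each letter has a 1% chance
# of being replaced in one step. Row i gives the fate of letter i.
toy_pam1 = [[0.99, 0.01],
            [0.01, 0.99]]
toy_pam250 = mat_pow(toy_pam1, 250)

for row in toy_pam250:
    print([round(p, 4) for p in row])
```

After 250 steps the probability of still seeing the original letter is far below 0.99 but not zero, because later changes can revert earlier ones. This is also why errors in PAM1 are magnified in the higher-numbered matrices.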

PAM series: varying degrees of conservation
[Figure not reproduced]

The BLOSUM Family of Matrices
Blocks Substitution Matrices - Henikoff and Henikoff, 1992
• Blocks are short conserved patterns, 3-60 aa long.
• Proteins can be divided into families by common blocks.
• Different BLOSUM matrices emerge by looking at sequences with different identity percentages.
Example: BLOSUM62 is derived from blocks in which sequences sharing 62% identity or more are clustered and counted as one, so the matrix reflects substitutions between sequences that are less than 62% identical.
[Figure: blocks A, B, C, D]

The Blocks Database: gapless alignment blocks

Blosum62 scoring matrix
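The full 20×20 matrix is not reproduced here. As a minimal sketch, the handful of entries below are taken from the standard published BLOSUM62 matrix and cover only the residues discussed earlier (S/T and D/E) plus tryptophan for contrast; the dictionary layout and function names are assumptions made for illustration.

```python
# A small excerpt of BLOSUM62 (symmetric, so each pair is stored once).
BLOSUM62_EXCERPT = {
    ("S", "S"): 4, ("T", "T"): 5, ("S", "T"): 1,
    ("D", "D"): 6, ("E", "E"): 5, ("D", "E"): 2,
    ("W", "W"): 11, ("D", "W"): -4,
}


def pair_score(a, b, matrix=BLOSUM62_EXCERPT):
    """Look up a symmetric substitution score; KeyError if outside the excerpt."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]


def ungapped_score(seq_a, seq_b, matrix=BLOSUM62_EXCERPT):
    """Sum substitution scores over two equal-length, ungapped sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(x, y, matrix) for x, y in zip(seq_a, seq_b))


print(pair_score("S", "T"), pair_score("D", "E"), pair_score("D", "W"))
print(ungapped_score("DEST", "DEST"), ungapped_score("DEST", "EDTS"))
```

Note that identities do not all score the same (W/W is far higher than S/S), and that the conservative substitutions S/T and D/E still score positively while D/W is penalized.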

PAM versus BLOSUM
PAM:
• Based on an explicit evolutionary model
• Derived from small, closely related proteins with ~15% divergence
• Higher PAM numbers to detect more remote sequence similarities
• Errors in PAM 1 are scaled 250X in PAM 250
BLOSUM:
• Based on empirical frequencies
• Uses much larger, more diverse set of protein sequences (30-90% ID)
• Lower BLOSUM numbers to detect more remote sequence similarities
• Errors in BLOSUM arise from errors in alignment

Guidelines
• Lower PAMs and higher BLOSUMs find short local alignments of highly similar sequences
• Higher PAMs and lower BLOSUMs find longer, weaker local alignments
• No single matrix answers all questions

Guidelines
• BLOSUM is generally better than PAM for local alignments.
• The default matrix is often the identity matrix for DNA and BLOSUM62 for proteins.
• When using BLOSUM80 instead of BLOSUM45, local alignments tend to be shorter.
• Low PAMs have the same effect as high BLOSUMs. The BLOSUM number indicates percent identity, while the PAM number is proportional to the percentage of accepted mutations.
