This article explores the core concepts of pairwise alignments in bioinformatics, including the definitions of similarity, homology, and identity. It covers essential elements such as gap penalties, scoring matrices, and alignment algorithms. The Needleman-Wunsch algorithm for global alignments and the Smith-Waterman algorithm for local alignments are discussed, along with their computational cost. It also covers database searching with BLAST and FASTA, showing how these tools find significant protein similarities by extending short initial matches into alignments.
Bioinformatics Part 3: Pairwise Alignments and Database Searches • Similarity, homology and identity • Gap penalties and scoring matrices in pairwise alignments • Alignment algorithms • Database searching: BLAST and FASTA
Similarity, Homology and Identity • If similar proteins share a common ancestor, they are said to be homologous • Homology can be inferred, but not confirmed, from similarity • Similarity (but not homology) between proteins can be expressed in terms of relative identity • Proteins can be similar without being homologous; homologous proteins usually retain some similarity, although in distant relatives it may be too weak to detect from sequence alone
Examples of Simple Pairwise Alignments

Ungapped alignment (Score: 8, Identity: 53%)
Sequence 1  VLKAHLIDGGSKLTS
                  ||||| |||
Sequence 2   VLKAHIDGGSRLTS

Gapped alignment (Score: 13, Identity: 86.7%)
Sequence 1  VLKAHLIDGGSKLTS
            ||||| ||||| |||
Sequence 2  VLKAH-IDGGSRLTS
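The scores and identities on this slide can be checked with a few lines of Python. This is a minimal sketch that assumes the score is simply the number of identical columns and that identity is that count divided by the alignment length (15 columns). It also assumes the ungapped example offsets sequence 2 by one residue, since that placement is what reproduces the stated score of 8. The function name `alignment_stats` is hypothetical.

```python
def alignment_stats(row1, row2, gap="-"):
    """Return (identical columns, percent identity) for two equal-length alignment rows."""
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have the same length")
    # A column counts as a match only if both residues are equal and not a gap.
    matches = sum(1 for a, b in zip(row1, row2) if a == b and a != gap)
    return matches, 100.0 * matches / len(row1)


# Ungapped case: sequence 2 is shifted one position right (assumed), padded with "-".
ungapped = alignment_stats("VLKAHLIDGGSKLTS", "-VLKAHIDGGSRLTS")
# Gapped case: a single internal gap lets both ends align at once.
gapped = alignment_stats("VLKAHLIDGGSKLTS", "VLKAH-IDGGSRLTS")

print("ungapped: %d matches, %.1f%% identity" % ungapped)
print("gapped:   %d matches, %.1f%% identity" % gapped)
```

Introducing one gap raises the number of identical columns from 8 to 13, which is why gapped alignments need a penalty scheme to stay meaningful.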
Gap Penalties in Pairwise Alignments • Penalties are imposed to prevent the unrestricted insertion of gaps • Gap opening penalty: a penalty for introducing a new gap • Gap extension penalty: a penalty for lengthening an existing gap by one position • In protein evolution, it is more likely that an existing gap would be extended than a new gap introduced • Consequently, the gap opening penalty is set larger than the gap extension penalty
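The two penalties combine into what is usually called an affine gap cost. The sketch below assumes one common convention, in which the first gap position is charged the opening penalty and each further position the extension penalty. The values 10 and 0.5 are illustrative assumptions, not values taken from the slides.

```python
def affine_gap_cost(length, gap_open=10.0, gap_extend=0.5):
    """Cost of one gap of `length` positions under an affine penalty scheme."""
    if length <= 0:
        return 0.0
    # The first position pays the opening penalty; each later one pays the extension.
    return gap_open + gap_extend * (length - 1)


one_long_gap = affine_gap_cost(3)          # a single gap spanning three positions
three_short_gaps = 3 * affine_gap_cost(1)  # three separate one-position gaps

print(one_long_gap, three_short_gaps)
```

With these assumed values, one three-residue gap costs 11 while three separate single-residue gaps cost 30, so the alignment is steered toward fewer, longer gaps.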
Dot Matrix Analysis and Dot Plots • Compares two sequences in the form of a matrix, with each sequence lying along one axis • A match between residues is indicated by a dot • A sliding window is used to cut down “noise” and produce clearer results • Dot plot reveals diagonal lines where there is sufficient similarity between the sequences
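A text-only dot plot shows the idea. This sketch assumes a simple identity comparison and a window of three residues that must all match before a dot is drawn. The function name `dot_plot` and both parameter defaults are illustrative choices, not part of any particular tool.

```python
def dot_plot(seq_x, seq_y, window=3, min_matches=3):
    """Return text rows of a dot plot; '*' marks windows with enough identical residues."""
    rows = []
    for i in range(len(seq_y) - window + 1):        # seq_y runs down the vertical axis
        row = ""
        for j in range(len(seq_x) - window + 1):    # seq_x runs along the horizontal axis
            matches = sum(seq_y[i + k] == seq_x[j + k] for k in range(window))
            row += "*" if matches >= min_matches else "."
        rows.append(row)
    return rows


plot = dot_plot("VLKAHLIDGGSKLTS", "VLKAHIDGGSRLTS")
for line in plot:
    print(line)
```

For these two sequences the output contains short diagonal runs that are displaced from one another by one column, which is the visual signature of a single-residue insertion or deletion.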
Scoring Matrices in Pairwise Alignments • A scoring matrix takes into account the significance of matches and mismatches between aligned amino acids • In theory, a scoring matrix could be based on the different chemical and physical properties of amino acids • In practice, scoring matrices are based on observed differences between proteins (or parts of proteins)
PAM Scoring Matrices • Based on the analysis of 1,572 changes in 71 groups of closely related proteins (>85% identity) • Mutation probabilities were determined for each amino acid based on a substitution rate of 1% • These were used to construct the PAM 1 (point [or percent] accepted mutation) matrix • Matrices for greater evolutionary distances are extrapolated by multiplying the PAM 1 matrix by itself repeatedly • The PAM 250 matrix (often used as a default in pairwise alignments) provides scores equivalent to about 20% matches remaining between two sequences
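Higher-numbered PAM matrices are obtained by applying the PAM 1 mutation probabilities repeatedly, which amounts to raising the matrix to a power. The sketch below uses a made-up three-letter alphabet in which each letter changes with probability 1% per step. The numbers are toy values chosen for illustration and are not Dayhoff's data, and the final conversion of probabilities into log-odds scores is left out.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(matrix, power):
    """Raise a square matrix to a non-negative integer power by repeated multiplication."""
    n = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, matrix)
    return result


# Toy "PAM 1": each letter stays the same with probability 0.99 (assumed values).
toy_pam1 = [[0.990, 0.005, 0.005],
            [0.005, 0.990, 0.005],
            [0.005, 0.005, 0.990]]

toy_pam250 = mat_pow(toy_pam1, 250)
# Average probability that a position holds its original letter after 250 steps.
unchanged = sum(toy_pam250[i][i] for i in range(3)) / 3
print("fraction unchanged after 250 steps: %.3f" % unchanged)
```

In this toy model the unchanged fraction settles slightly above one third, the floor for a three-letter alphabet. The roughly 20% figure quoted for PAM 250 comes from the real 20-letter data.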
BLOSUM Scoring Matrices • Based on amino acid substitutions in a large set of conserved, ungapped amino acid patterns called blocks, derived from several hundred groups of related proteins • BLOSUM matrices take distant but significant relationships between proteins into account, because only these conserved segments are considered • Over-representation of amino acid substitutions in closely related protein segments was reduced by combining those segments into one sequence • Example: segments showing 62% or more identity were grouped to produce the BLOSUM62 matrix
Alignments and Dynamic Programming • Complete search of all possible alignments is computationally demanding and frequently impossible • Algorithms that use dynamic programming have been developed to obtain alignments between sequences • Algorithms may produce either global or local alignments
Global Alignment: Needleman-Wunsch • A matrix is constructed that shows matches between the two sequences • Moving from the top left of the matrix, a process of summation is carried out taking penalties into account • For any given cell in the matrix, the maximum score for that cell is entered • Needleman-Wunsch attempts to align all residues in the two sequences, and is therefore a global alignment algorithm
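The matrix fill and traceback can be sketched in plain Python. This version assumes a simple scoring scheme (match +1, mismatch -1) and a linear gap penalty of -2 per position rather than a substitution matrix with separate opening and extension penalties. Those values and the function name are illustrative assumptions.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment with a linear gap penalty; returns (score, row_a, row_b)."""
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap          # a[:i] aligned against nothing but gaps
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # align two residues
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


nw_score, nw_a, nw_b = needleman_wunsch("VLKAHLIDGGSKLTS", "VLKAHIDGGSRLTS")
print(nw_score)
print(nw_a)
print(nw_b)
```

Applied to the two example sequences, the traceback recovers the gapped alignment shown earlier: 13 matches, one mismatch and one gap, for a score of 10 under these assumed parameters.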
Local Alignment: Smith-Waterman • Takes into account that two relatively dissimilar sequences may exhibit short regions of local similarity • Smith-Waterman is a local alignment algorithm designed to detect these similarities • Each cell in the matrix is considered as the end point of a potential alignment • A value for each cell is calculated using a similarity score, taking matches, mismatches and gaps into account; negative values are reset to zero so that a new alignment can start anywhere • A backtracking procedure from the highest scoring cell is then used to trace the alignment through the matrix until a cell with a score of zero is reached
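A compact sketch of the local version follows. It differs from a global alignment in two places: cell values are never allowed to drop below zero, and the traceback starts at the highest-scoring cell and stops at the first zero. The scoring values (match +2, mismatch -1, linear gap -2), the function name and the two test sequences are assumptions made for illustration.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment with a linear gap penalty; returns (score, row_a, row_b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_i, best_j = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # The 0 lets an alignment restart here instead of carrying a negative score.
            score[i][j] = max(0,
                              score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_i, best_j = score[i][j], i, j
    # Trace back from the best cell until the score falls to zero.
    row_a, row_b = [], []
    i, j = best_i, best_j
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(row_a)), "".join(reversed(row_b))


sw_score, sw_a, sw_b = smith_waterman("PQWVLKAHMM", "TTVLKAHGG")
print(sw_score, sw_a, sw_b)
```

The two test sequences share only the segment VLKAH, and the function reports just that segment, ignoring the unrelated flanking residues that a global alignment would be forced to include.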
Pairwise Database Searching • Running the Needleman-Wunsch or Smith-Waterman algorithm against every sequence in a large database requires enormous computational power • Heuristic approximations of these algorithms are therefore used in database searches • Examples of search tools are BLAST and FASTA • Both BLAST and FASTA first identify short matching words, which are then extended to produce local alignments
BLAST • The query sequence is broken into short words (k-tuples), and the database is searched for sequences containing matching words • Each word hit is extended in both directions for as long as the alignment score keeps improving, producing high-scoring segment pairs (HSPs) • A scoring matrix (default is BLOSUM62) and gap opening and gap extension penalties are taken into account in determining alignments • Alignments scoring above a set threshold are then reported in order of decreasing score
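The seed-and-extend idea can be illustrated with a heavily simplified sketch. It assumes exact three-letter word matches, a match/mismatch score of +1/-1 and an ungapped extension that stops once the running score falls 2 below the best seen. Real BLAST instead scores words with a substitution matrix, accepts similar as well as identical words, and performs gapped extension, none of which is modeled here. All function names and parameter values are hypothetical.

```python
def find_word_hits(query, subject, word_size=3):
    """Return (query_pos, subject_pos) pairs for every exact shared word."""
    index = {}
    for i in range(len(query) - word_size + 1):
        index.setdefault(query[i:i + word_size], []).append(i)
    hits = []
    for j in range(len(subject) - word_size + 1):
        for i in index.get(subject[j:j + word_size], []):
            hits.append((i, j))
    return hits


def extend_hit(query, subject, i, j, word_size=3, match=1, mismatch=-1, x_drop=2):
    """Extend a word hit without gaps in both directions; return (score, q_seg, s_seg)."""
    best = score = word_size * match
    right = 0                                   # residues kept to the right of the word
    k = 0
    while i + word_size + k < len(query) and j + word_size + k < len(subject):
        same = query[i + word_size + k] == subject[j + word_size + k]
        score += match if same else mismatch
        k += 1
        if score > best:
            best, right = score, k
        elif best - score >= x_drop:            # score has dropped too far: stop
            break
    score = best
    left = 0                                    # residues kept to the left of the word
    k = 1
    while i - k >= 0 and j - k >= 0:
        score += match if query[i - k] == subject[j - k] else mismatch
        if score > best:
            best, left = score, k
        elif best - score >= x_drop:
            break
        k += 1
    q_seg = query[i - left:i + word_size + right]
    s_seg = subject[j - left:j + word_size + right]
    return best, q_seg, s_seg


query, subject = "VLKAHLIDGGSKLTS", "VLKAHIDGGSRLTS"
hits = find_word_hits(query, subject)
hsp = extend_hit(query, subject, 6, 5)          # extend the "IDG" word hit
print(hits)
print(hsp)
```

Extending the IDG hit steps over the single K/R mismatch and continues to the end of both sequences, because the score never falls far enough below its running maximum to trigger the stop condition.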
FASTA • Regions of short length (words) in the query that match a target sequence are determined • High scoring regions (best initial regions) are used to rank matches for further analysis • Longer high scoring regions, including gaps, are generated by joining best initial regions • A full Smith-Waterman alignment is then performed between the high scoring regions • FASTA is slower than BLAST but may, in some cases, be more sensitive
A Final Few Words of Advice • Protein-protein searches are more informative than nucleotide-nucleotide searches (when the query is known to contain a protein-coding nucleotide sequence) • When performing a pairwise database search with a new, protein-coding nucleotide sequence, always use a translation of the nucleotide sequence in all six frames as the query • This can be done by using, for example, a translated BLAST search (such as tblastx, which translates both the query sequence and a nucleotide database)
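A six-frame translation can be sketched as follows, assuming the standard genetic code and an unambiguous DNA query. Three frames come from the forward strand and three from its reverse complement. The function names are hypothetical, and codons containing characters other than A, C, G or T are rendered as X.

```python
BASES = "TCAG"
# Standard genetic code, listed in TCAG order for first, second and third base.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = {b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
          for i, b1 in enumerate(BASES)
          for j, b2 in enumerate(BASES)
          for k, b3 in enumerate(BASES)}


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from position 0; '*' marks a stop codon."""
    return "".join(CODONS.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to protein translations."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames["%s%d" % (strand, offset + 1)] = translate(seq[offset:])
    return frames


frames = six_frames("ATGGTTCTGAAAGCTCAT")
for label, protein in frames.items():
    print(label, protein)
```

Only one of the six frames normally corresponds to the real protein, which is why translated search tools try all of them rather than relying on the frame being known in advance.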