Bioinformatics

Bioinformatics Ayesha M. Khan 7th March, 2012

Sequence alignment
• A comprehensive alignment must account for the position of every residue in both sequences.
• This means that many residues may have to be placed opposite residues that are not strictly identical.
• Can we maximize the number of identical matches by inserting gaps in an unrestricted manner?
• We can, and doing so might achieve an optimal score, but the resulting alignment would be biologically meaningless.
• That is why gap opening and gap extension penalties are introduced.
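To make the effect of gap penalties concrete, here is a minimal sketch that scores an alignment that has already been built. The function name, the penalty values, and the convention that the first gap position costs gap_open while each further position costs gap_extend are assumptions chosen for illustration; they are not taken from the slides.

```python
def score_alignment(row_a, row_b, match=1, mismatch=0, gap_open=-5, gap_extend=-1):
    """Score two aligned rows of equal length; '-' marks a gap position.

    Assumed convention: the first position of a gap costs gap_open and
    every following position of the same gap costs gap_extend.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    total = 0
    prev_gap = None  # row that held a gap in the previous column, if any
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            raise ValueError("a column cannot contain two gaps")
        if a == "-" or b == "-":
            gap_in = "a" if a == "-" else "b"
            # Continuing a gap in the same row is cheaper than opening one.
            total += gap_extend if prev_gap == gap_in else gap_open
            prev_gap = gap_in
        else:
            total += match if a == b else mismatch
            prev_gap = None
    return total


# Both alignments contain six identical matches and two gap positions.
one_long_gap = score_alignment("ACGTACGT", "AC--ACGT")     # 6 - 5 - 1 = 0
two_short_gaps = score_alignment("ACGTACGT", "A-G-ACGT")   # 6 - 5 - 5 = -4
print(one_long_gap, two_short_gaps)
```

With these toy penalties, the alignment with one contiguous gap scores higher than the one with two scattered gaps, even though the number of identical matches is the same. This is how the penalties discourage unrestricted gap insertion.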

Scoring matrices
• A scoring matrix should reflect:
• The degree of biological relationship between the amino acids or nucleotides
  - The probability that two amino acids occur in homologous positions in sequences that share a common ancestor
  - Or that one sequence is the ancestor of the other
• Scoring schemes based on physicochemical properties have also been proposed.

1. Unitary matrix
The simplest way of scoring is to assign one number for a match and another number for a mismatch. Such a matrix is often referred to as a unitary matrix.
• It carries sparse information, because it gives all identical matches equal weighting and all mismatches equal weighting.
• There is a need to enhance the scoring potential of weak but biologically significant signals, so that they can contribute to the matching process.

Example of unitary scoring matrix
Identical residues are scored 1.
All non-identical residue pairs are given a score of 0.
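The unitary scheme above can be sketched in a few lines. The helper names and the example peptides are hypothetical; only the 1/0 scoring rule comes from the slide.

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"  # the 20 standard one-letter codes


def unitary_matrix(alphabet=AMINO_ACIDS, match=1, mismatch=0):
    """Build a unitary matrix as a dict keyed by residue pairs."""
    return {(a, b): (match if a == b else mismatch)
            for a in alphabet for b in alphabet}


def score_ungapped(seq_a, seq_b, matrix):
    """Sum the matrix scores over an ungapped alignment of equal length."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length")
    return sum(matrix[(a, b)] for a, b in zip(seq_a, seq_b))


matrix = unitary_matrix()
print(score_ungapped("LIVE", "LVVE", matrix))        # three identical positions
print(matrix[("I", "V")], matrix[("W", "G")])        # both mismatches score 0
```

Note that the conservative isoleucine/valine substitution receives the same score as the very different tryptophan/glycine pair. This is the "sparse information" problem that the evolutionary matrices are meant to address.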

2. The Dayhoff Mutation Data Matrix: PAM (point accepted mutation)
• The most important improvement over the unitary matrix is that scoring is based on evolutionary distances.
• Dayhoff (1978) carried out an extensive study of the frequencies with which amino acids substitute for each other during evolution.
• The studies involved carefully aligning all of the proteins in several protein families and then constructing phylogenetic trees for each family. This led to a table of the relative frequencies with which amino acids replace each other over an evolutionary period.
• The Dayhoff matrix was derived from sequences that were at least 85% identical.
• The result is known as the PAM (Point Accepted Mutation) family of scoring matrices.

PAM matrix (contd.)
The PAM series is based on mutation rates (Percent Accepted Mutations) estimated from closely related proteins and will therefore be dominated by amino acid mutations caused by single base changes.
• Therefore, from a biological point of view, PAM matrices are based on observed mutations.
• Thus the highest-scoring alignment is statistically the most likely to have been generated by evolution rather than by chance.
• The PAM matrices are generally presented as log-odds matrices, i.e., each score in the matrix is the logarithm of an odds ratio.
• The odds ratio used is the number of times residue "A" is observed to replace residue "B", divided by the number of times residue "A" would be expected to replace residue "B" if the replacement occurred at random.
• Thus positive scores in the matrix designate pairs of residues that replace each other more often than expected by chance.
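The log-odds idea can be illustrated with a small calculation. The frequencies below are made-up numbers, not values from Dayhoff's data, and the 10 x log10 scaling is an assumed convention used here only to give readable scores.

```python
import math


def log_odds(observed_freq, expected_freq, scale=10.0):
    """Return scale * log10(observed / expected) for one residue pair.

    The scaling factor is an assumption made for this illustration.
    """
    if observed_freq <= 0 or expected_freq <= 0:
        raise ValueError("frequencies must be positive")
    return scale * math.log10(observed_freq / expected_freq)


# Illustrative (invented) frequencies for three hypothetical residue pairs.
twice_as_often = log_odds(0.02, 0.01)    # replaced more often than by chance
as_expected = log_odds(0.01, 0.01)       # replaced exactly as often as by chance
half_as_often = log_odds(0.005, 0.01)    # replaced less often than by chance
print(twice_as_often, as_expected, half_as_often)
```

The three results are positive, zero, and negative respectively, which is the sign convention described for the PAM scores. Working in logarithms also means that the scores of individual positions can be added to give the score of a whole alignment.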

PAM250 matrix
The amino acids are arranged by assuming that positive values represent evolutionarily conservative replacements. The amino acids are ranked here according to groups based on their physicochemical properties.

PAM matrix (contd.)
Likelihood (odds) ratio for residues a and b: probability that a-b is a mutation / probability that a-b is chance
The matrix value (val) is the logarithm of this ratio:
val > 0 : likely mutation
val = 0 : random mutation
val < 0 : unlikely mutation
• PAM250: similarity scores equivalent to about 20% identity
• Low PAM: good for finding short, strong local similarities
• High PAM: good for finding long, weak similarities

BLOSUM matrix (BLOcks SUbstitution Matrix)
• Henikoff & Henikoff (1992) derived a set of substitution matrices from blocks of aligned sequences in the BLOCKS database, to represent distant relationships more explicitly.
• Blocks are defined as ungapped regions of aligned amino acids from related proteins.
• Sequence segments are clustered on the basis of a minimum percentage identity.
• More than 2000 blocks were used to derive the scoring matrices.

BLOSUM matrix (contd.)
Overall procedure to develop a BLOSUM X matrix:
– Collect a set of multiple alignments
– Find the blocks (no gaps)
– Group the segments of each block that share at least X% identity
– Count the occurrences of all pairs of amino acids
– Use these counts to obtain log-odds ratios
The most common BLOSUM matrices are X = 45, 62 and 80; e.g., sequences clustered at greater than or equal to 80% identity are used to generate the BLOSUM 80 matrix.
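The counting and log-odds steps of this procedure can be sketched as follows. This is a simplified illustration, not the Henikoff implementation: the clustering of segments at X% identity is left out, the block is a tiny invented example, scores are left unrounded, and the 2 x log2 (half-bit) scaling is stated as an assumption.

```python
import math
from collections import Counter
from itertools import combinations


def pair_counts(block):
    """Count unordered residue pairs column by column in an ungapped block."""
    counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            counts[tuple(sorted((a, b)))] += 1
    return counts


def log_odds_scores(counts):
    """Turn pair counts into log-odds scores (assumed scaling: 2 * log2)."""
    total = sum(counts.values())
    observed = {pair: n / total for pair, n in counts.items()}

    # Background frequency of each residue, derived from the pair frequencies.
    background = Counter()
    for (a, b), freq in observed.items():
        if a == b:
            background[a] += freq
        else:
            background[a] += freq / 2
            background[b] += freq / 2

    scores = {}
    for (a, b), freq in observed.items():
        # Chance expectation: p_a^2 for identical pairs, 2 * p_a * p_b otherwise.
        if a == b:
            expected = background[a] ** 2
        else:
            expected = 2 * background[a] * background[b]
        scores[(a, b)] = 2 * math.log2(freq / expected)
    return scores  # pairs never observed in the block are simply absent


block = ["AI", "AI", "AV", "SI"]  # hypothetical block of four 2-residue segments
counts = pair_counts(block)
scores = log_odds_scores(counts)
for pair, value in sorted(scores.items()):
    print(pair, round(value, 2))
```

With so few sequences the resulting numbers are not biologically meaningful; the sketch only shows how column-wise pair counts become observed frequencies, expected frequencies, and finally log-odds scores.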

Local alignment search tools
• BLAST algorithm (Altschul et al., 1990): Basic Local Alignment Search Tool
• Segment pair: given two sequences, a segment pair is defined as a pair of sub-sequences of the same length that form an ungapped alignment.
• BLAST finds the segment pairs between the query and the database sequences that score above a threshold.
• The fixed-length hits are extended until certain threshold parameters are reached.
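The seed-and-extend idea behind segment pairs can be sketched as below. This is a deliberately simplified illustration and not the BLAST implementation: seeds are exact word matches (BLAST also considers similar words scored with a substitution matrix), scoring is a plain match/mismatch scheme, and the names word_size, x_drop and threshold, together with their default values, are assumptions.

```python
def find_seeds(query, subject, word_size=3):
    """Return (query_pos, subject_pos) for every exact shared word."""
    index = {}
    for i in range(len(query) - word_size + 1):
        index.setdefault(query[i:i + word_size], []).append(i)
    seeds = []
    for j in range(len(subject) - word_size + 1):
        for i in index.get(subject[j:j + word_size], []):
            seeds.append((i, j))
    return seeds


def _extend_one_way(query, subject, qi, sj, step, match, mismatch, x_drop):
    """Walk in one direction without gaps; return (best gain, length kept)."""
    best = running = 0
    best_len = length = 0
    while 0 <= qi < len(query) and 0 <= sj < len(subject):
        running += match if query[qi] == subject[sj] else mismatch
        length += 1
        if running > best:
            best, best_len = running, length
        elif best - running > x_drop:
            break  # score fell too far below the best seen so far
        qi += step
        sj += step
    return best, best_len


def extend_seed(query, subject, i, j, word_size=3, match=1, mismatch=-1, x_drop=2):
    """Extend a seed both ways; return (score, query_start, subject_start, length)."""
    right_gain, right_len = _extend_one_way(
        query, subject, i + word_size, j + word_size, 1, match, mismatch, x_drop)
    left_gain, left_len = _extend_one_way(
        query, subject, i - 1, j - 1, -1, match, mismatch, x_drop)
    score = word_size * match + left_gain + right_gain
    return score, i - left_len, j - left_len, left_len + word_size + right_len


def best_segment_pair(query, subject, word_size=3, threshold=4):
    """Return the highest-scoring extended seed at or above the threshold."""
    hits = [extend_seed(query, subject, i, j, word_size)
            for i, j in find_seeds(query, subject, word_size)]
    hits = [hit for hit in hits if hit[0] >= threshold]
    return max(hits, default=None)


score, q_start, s_start, length = best_segment_pair("TTGACCGTAA", "CCGACCGTGG")
print(score, "TTGACCGTAA"[q_start:q_start + length])
```

For these two toy sequences the best segment pair is the shared stretch GACCGT, starting at position 2 in both sequences, with a score of 6. The shorter CCG match elsewhere cannot be extended and falls below the threshold.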

Task/Self-study: BLAST at NCBI
URL: http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastHome
1. You will find a range of different BLAST programs (nucleotide BLAST, protein BLAST, etc.).
2. Choose nucleotide BLAST.
3. Check the algorithm parameters (at the end of the page); go through each parameter and work out what it means.
