Probability Theory and Basic Alignment of String Sequences

Probability Theory and Basic Alignment of String Sequences Chapter 1.1-2.3 S. Maarschalkerweerd & A. Tjhang

Overview • Probability Theory -Maximum Likelihood -Bayes Theorem • Pairwise Alignment -The Scoring Model -Alignment Algorithms S. Maarschalkerweerd & A. Tjhang

Probability Theory S. Maarschalkerweerd & A. Tjhang

Probability Theory • What is a probabilistic model? • Simple example: What is probability of base sequence x1x2…xn?  p(xi), p(x1), p(x2)…p(xn) independent of each other If pC = 0.3; pT = 0.2 and sequence is CTC: P(CTC)=0.3*0.2*0.3=0.018 S. Maarschalkerweerd & A. Tjhang

Maximum Likelihood Estimation • Estimate parameters of the model from large sets of examples (training set) • For example: P(T) and P(C) are estimated from their frequency in a database of residues • Avoid overfitting • Database too small, model also fits to noise in the training set S. Maarschalkerweerd & A. Tjhang

Probability Theory • Conditional Probability -P(X,Y) = P(X|Y) P(Y) (joint probability) -P(X) = Y P(X,Y) = Y P(X|Y) P(Y) (marginal probability) S. Maarschalkerweerd & A. Tjhang

Bayes’ Theorem • P(X|Y) = - Posterior probability • Example: P(X)=Probability tumor visible on x-ray P(C)=Probability breast-cancer = 0.01 P(X|C) = 0.9; P(X|¬C) = 0.05 - On the x-ray a tumor is seen. What is the probability that the woman has breast-cancer? P(Y|X) P(X) P(Y) S. Maarschalkerweerd & A. Tjhang

Pairwise Alignment S. Maarschalkerweerd & A. Tjhang

Pairwise Alignment • Goal: determine whether 2 sequences are related (homologous). • Issues regarding pairwise alignment: • What sorts of alignment should be considered? • The scoring system used to rank alignments. • The algorithm used to find optimal (or good) scoring alignments. • The statistical methods to evaluate significance of an alignment score. S. Maarschalkerweerd & A. Tjhang

Example • You need a ‘smart’ scoring model to distinguish b from c. S. Maarschalkerweerd & A. Tjhang

The Scoring Model S. Maarschalkerweerd & A. Tjhang

The Scoring Model • When sequences are related, then both sequences have to be from a common ancestor. • Due to mutation sequences can change. • Substitutions • Gaps (insertions or deletions) • Natural selection ensures that some mutations are seen more often than others. (Survival of the fittest) S. Maarschalkerweerd & A. Tjhang

The Scoring Model • Total score of an alignment: • Sum of terms for each aligned pair of residues • Terms for each gap • Take the sum of those terms S. Maarschalkerweerd & A. Tjhang

Substitution Matrices • We need a matrix with the scores for every possible pair of residues (e.g. bases or amino acids) • We can compute these score by: s(a,b) = log( ) pab= probability that residues a and b have been derived independently from some unknown original residue c. qa= frequency of a pab qaqb S. Maarschalkerweerd & A. Tjhang

BLOSUM50 S. Maarschalkerweerd & A. Tjhang

Gap Penalties • (g) = -gd (linear score) • (g) = -d-(g-1)e (affine score) • d = gap-open penalty • e = gap-extension penalty • g = gap length • P(gap) = f(g)  qxi i in gap S. Maarschalkerweerd & A. Tjhang

Alignment Algorithms S. Maarschalkerweerd & A. Tjhang

Alignment Algorithms • Needleman-Wunsch (global alignment) • Smith-Waterman (local alignment) • Repeated matches • Overlap matches • Hybrid match conditions S. Maarschalkerweerd & A. Tjhang

Dynamic Programming • Enormous amount of possible alignments • Algorithm for finding optimal alignment: Use Dynamic Programming • Save sub-results for later reuse, avoiding calculation of same problem S. Maarschalkerweerd & A. Tjhang

Needleman-Wunsch Algorithm • Global alignment • For sequences of size n and m, make (n+1)x(m+1) matrix • Fill in from top left to bottom right F(i-1, j-1) + s(xi,yj) • F(i,j) = max F(i-1, j) – d F(i, j-1) – d • Keep pointer to cell that is used to derive F(i,j) • Takes O(nm) time and memory { S. Maarschalkerweerd & A. Tjhang

0 -8 -8 -2 -8 -8 Matrix -2 S. Maarschalkerweerd & A. Tjhang

Matrix Traceback S. Maarschalkerweerd & A. Tjhang

Smith-Waterman Algorithm • Local alignment • Two differences with Needleman-Wunsch: 0 F(i-1, j-1) + s(xi,yj) F(i-1, j) – d F(i, j-1) – d 2. Local alignment can end anywhere, so choose highest value in matrix from where traceback starts (not necessarily bottom right cell) { • F(i,j) = max S. Maarschalkerweerd & A. Tjhang

Matrix S. Maarschalkerweerd & A. Tjhang

Smith-Waterman Algorithm • Expected score for a random match s(a,b) must be negative • There must be some s(a,b) greater than 0 or no alignment is found S. Maarschalkerweerd & A. Tjhang

Repeated Matches • Many local alignments possible if one or both sequences are long. Smith-Waterman only finds one of them • Find parts of sequence in the other sequence • Not every alignment is useful threshold S. Maarschalkerweerd & A. Tjhang

Repeated Matches { F(i, 0) F(i-1, j-1) + s(xi,yj) F(i-1, j) – d F(i, j-1) – d F(i-1, 0) F(i-1, j) – T, j = 1,…m; F(i,j) = max { F(i,0) = max S. Maarschalkerweerd & A. Tjhang

Matrix Threshold T = 20 S. Maarschalkerweerd & A. Tjhang

Overlap Matches • Find match between start of a sequence and end of a sequence (can be the same) • Alignment begins on left-hand or top border of the matrix and ends on right-hand or bottom border S. Maarschalkerweerd & A. Tjhang

Overlap Matches • F(0,j) = 0, for j = 1,…,m • F(i,0) = 0, for i = 1,…,n F(i-1, j-1) + s(xi,yj) • F(i,j) = max F(i-1, j) – d F(i, j-1) – d { S. Maarschalkerweerd & A. Tjhang

Matrix S. Maarschalkerweerd & A. Tjhang

Hybrid Match Conditions • Different types of alignment can be created by • adjusting rhs of this formula: F(i,j) = max {…. • adjusting the traceback • Example: • We want to align two sequences from the beginning of both the sequences until local alignment has been found. S. Maarschalkerweerd & A. Tjhang

Summary • Probability theory is important for sequence analysis • Goal: determine whether 2 sequences are related • For that, we need to find an optimal alignment between those sequences using algorithms • Scoring model is required to rank different alignments • Different algorithms for different types of alignments • use dynamic programming S. Maarschalkerweerd & A. Tjhang

Probability Theory and Basic Alignment of String Sequences

Probability Theory and Basic Alignment of String Sequences

Presentation Transcript

Basic String Alignment

Alignment of Long Sequences: LAGAN

Basic Probability Theory

Learning From String Sequences

BASIC NOTIONS OF PROBABILITY THEORY

String Theory

String Theory

String Theory

String Theory

String theory

String Theory

Alignment of large genomic sequences

Sequences Alignment Statistics

Multiple Sequences Alignment

Alignment of Genomic Sequences

Basic probability theory

Sequences, Induction and Probability

Alignment of large genomic sequences

String Theory

String Theory