Understanding Sequence Similarity & Alignment in Genetic Analysis

Class 3: Sequence similarity

Motivation • Same gene, or similar gene? • Suffix of A similar to prefix of B? • Suffix of A similar to prefix of B..Z? • Longest similar substring of A, B • Longest similar substring of A, B..Z • For each: how big? How similar?

Define alignment • Align these two sequences optimally: GACGGATT and GATCGGTT • Define precisely what an alignment is

Definition of alignment • Insert spaces so that the letters line up, or letters align with spaces:
GA-CGGATT
GATCGG-TT
• Don’t allow a space to line up with a space • Allow spaces even at the beginning and end:
GCAT-
-CATG

Define similarity • Given an alignment, compute a similarity score • Three possibilities for each column: letter-letter match, letter-letter mismatch, letter-space mismatch

Optimal alignment • Create score function • Conventionally: +1 bonus for a match, -1 penalty for a letter-letter mismatch, -2 penalty for a letter-space mismatch
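
With the conventional scores above, the score of a given alignment is a column-by-column sum. A minimal Python sketch follows; the function name `alignment_score` and the use of `-` to write a space are choices made here for illustration, not part of the slides.

```python
def alignment_score(top, bottom, match=1, mismatch=-1, space=-2):
    """Sum the column scores of an alignment written with '-' for spaces."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for x, y in zip(top, bottom):
        if x == "-" and y == "-":
            # The definition of alignment forbids a space over a space.
            raise ValueError("a space may not line up with a space")
        if x == "-" or y == "-":
            total += space       # letter-space column
        elif x == y:
            total += match       # letter-letter match
        else:
            total += mismatch    # letter-letter mismatch
    return total


# The example alignment from the definition slide: 7 matches, 2 spaces.
print(alignment_score("GA-CGGATT", "GATCGG-TT"))  # 3
```

The same function scores the second example, GCAT- over -CATG, as three matches and two spaces.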

Dynamic programming solution • Given sequences s,t of length m,n • Strategy: build up optimal alignment of prefixes • Base case? • Recurrence relation?

Recurrence • Given optimal alignments of prefixes of s,t shorter than i,j, find the optimal alignment of s[1..i], t[1..j] • Three possibilities for the last column: • extend s by a letter, t by a space • extend s by a letter, t by a letter • extend s by a space, t by a letter
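
Written out with the conventional scores (+1, -1, -2), the three possibilities give the recurrence below. The array name a and the helper p(i,j) are notation assumed here; a[i,j] stands for the best score of aligning s[1..i] with t[1..j].

```latex
\[
\begin{aligned}
a[i,0] &= -2i, \qquad a[0,j] = -2j, \\
a[i,j] &= \max
\begin{cases}
a[i-1,j] - 2        & \text{($s[i]$ aligned with a space)} \\
a[i-1,j-1] + p(i,j) & \text{($s[i]$ aligned with $t[j]$)} \\
a[i,j-1] - 2        & \text{(a space aligned with $t[j]$)}
\end{cases} \\
p(i,j) &=
\begin{cases}
+1 & \text{if } s[i] = t[j] \\
-1 & \text{otherwise}
\end{cases}
\end{aligned}
\]
```

The base cases say that aligning a prefix with the empty string costs one space per letter; the answer for the whole problem is a[m,n].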

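Tiny instance -- AGC, AAAC • Initialized array (first row and first column filled in, interior still to be computed):
         A    A    A    C
    0   -2   -4   -6   -8
A  -2
G  -4
C  -6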
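
Filling in the rest of the array for this instance can be sketched in Python as follows. The function name `fill_table` is an assumption made for illustration; the scores are the conventional +1, -1, -2.

```python
def fill_table(s, t, match=1, mismatch=-1, space=-2):
    """Return the (len(s)+1) x (len(t)+1) array of best prefix scores."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    # Base cases: a prefix aligned against nothing is all spaces.
    for i in range(1, m + 1):
        a[i][0] = i * space
    for j in range(1, n + 1):
        a[0][j] = j * space
    # Row by row, so every entry's three neighbours are already known.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(a[i - 1][j] + space,      # s[i] over a space
                          a[i - 1][j - 1] + p,      # s[i] over t[j]
                          a[i][j - 1] + space)      # space over t[j]
    return a


for row in fill_table("AGC", "AAAC"):
    print(" ".join(f"{v:3d}" for v in row))
```

The bottom-right entry is the optimal score for AGC against AAAC, which comes out as -1.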

Some dp details • What is a good order to fill the array? • How do you recover the opt alignment? • What do you do about ties? • What is the space complexity of this algorithm? • What is the time complexity of this algorithm?
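
One possible set of answers to these questions, as a hedged Python sketch: fill row by row, recover an alignment by walking back from the bottom-right corner, and break ties with a fixed preference (diagonal, then up, then left). The name `global_align` and the tie order are assumptions made here; any tie order gives an optimal alignment. Both time and space are proportional to m*n.

```python
def global_align(s, t, match=1, mismatch=-1, space=-2):
    """Return (score, aligned_s, aligned_t) for one optimal alignment."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        a[i][0] = i * space
    for j in range(1, n + 1):
        a[0][j] = j * space
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(a[i - 1][j - 1] + p,
                          a[i - 1][j] + space,
                          a[i][j - 1] + space)

    # Traceback: at each cell, find a neighbour that explains its value.
    top, bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            p = match if s[i - 1] == t[j - 1] else mismatch
            if a[i][j] == a[i - 1][j - 1] + p:   # letter over letter
                top.append(s[i - 1])
                bottom.append(t[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and a[i][j] == a[i - 1][j] + space:  # letter over space
            top.append(s[i - 1])
            bottom.append("-")
            i -= 1
            continue
        top.append("-")                               # space over letter
        bottom.append(t[j - 1])
        j -= 1
    return a[m][n], "".join(reversed(top)), "".join(reversed(bottom))


score, row_s, row_t = global_align("AGC", "AAAC")
print(score)
print(row_s)
print(row_t)
```

If only the score is needed, two rows of the array suffice, which brings the space down to linear; recovering the alignment as done here uses the full array.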

The gap penalty • Model above assumes two gaps of size 1 are equivalent to one gap of size 2 • Is this realistic? Why or why not?

General gap penalties • Alignments can no longer be scored as the sum of their parts • They still are the sum of blocks with one matched letter or one gap each • Blocks are: matched letters, s-gap, t-gap
A|A|C|---|A|GAT|A|A|C
A|C|T|CGG|T|---|A|A|T

DP for general gaps • Requires three arrays, one for each block type • Time complexity is cubic • This is expensive at best, prohibitive for large problems • See Setubal/Meidanis 3.3.2 for details

Affine gap penalty • Charge h for each gap, plus g * len(gap), so a gap of length k costs h + g*k • This still has quadratic complexity! • See Setubal/Meidanis
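
A hedged sketch of how the quadratic bound can be reached with three arrays, one per block type. This is a standard three-array formulation written from the definitions above, not a transcription of the textbook's pseudocode; the names `affine_score`, `M`, `X`, `Y` are assumptions, and h and g are given as positive penalties.

```python
NEG_INF = float("-inf")


def affine_score(s, t, h, g, match=1, mismatch=-1):
    """Best global score when a gap of length k costs h + g*k."""
    m, n = len(s), len(t)
    # M: last column is letter over letter
    # X: last column is a letter of s over a space
    # Y: last column is a space over a letter of t
    M = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    X = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    Y = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    M[0][0] = 0
    for i in range(1, m + 1):
        X[i][0] = -(h + g * i)   # one gap covering s[1..i]
    for j in range(1, n + 1):
        Y[0][j] = -(h + g * j)   # one gap covering t[1..j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            M[i][j] = p + max(M[i - 1][j - 1],
                              X[i - 1][j - 1],
                              Y[i - 1][j - 1])
            # Extending a gap costs g; opening a new one costs h + g.
            X[i][j] = max(M[i - 1][j] - (h + g),
                          X[i - 1][j] - g,
                          Y[i - 1][j] - (h + g))
            Y[i][j] = max(M[i][j - 1] - (h + g),
                          Y[i][j - 1] - g,
                          X[i][j - 1] - (h + g))
    return max(M[m][n], X[m][n], Y[m][n])


# With h = 0 the model reduces to the linear one (-2 per space).
print(affine_score("AGC", "AAAC", h=0, g=2))
# One gap of length 2 is cheaper than two gaps of length 1.
print(affine_score("AAA", "A", h=3, g=1))
```

Each cell does a constant amount of work, so the running time stays proportional to m*n.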

Point accepted mutations • Some mutations are more likely than others • In proteins, some amino acids are more similar than others (size, charge, hydrophobicity) • A point accepted mutation matrix is a table with the probability of each transition in a fixed time

PAM matrices • The entire matrix sums to 1 • A ‘unit of evolution’ is the time in which 1 in 100 amino acids is expected to change

Scoring matrix • Consider aligned letters a,b • Pr(b is a mutation of a) = M_ab • Pr(b is a random occurrence) = p_b • Score(a,b) = 10 log(M_ab / p_b)
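
The score is a log-odds ratio: positive when the pair is more likely under the mutation model than by chance, negative when less likely. Because logarithms turn products into sums, adding column scores corresponds to multiplying the per-column ratios. A small Python sketch, assuming the logarithm is base 10 (the slide does not state the base) and using made-up probabilities purely for illustration:

```python
import math


def log_odds_score(m_ab, p_b):
    """10 * log10 of (mutation probability / background probability)."""
    return 10 * math.log10(m_ab / p_b)


# Hypothetical numbers, not taken from any real PAM matrix.
print(log_odds_score(0.2, 0.02))    # ten times likelier than chance: positive
print(log_odds_score(0.05, 0.05))   # exactly as likely as chance: zero
print(log_odds_score(0.001, 0.05))  # rarer than chance: negative
```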

Blast • Basic Local Alignment Search Tool • Def: a ‘segment’ is a contiguous substring (no gaps) • Def: a ‘segment pair’ is two segments of equal length • Rem: the score of a segment pair is the sum of the scores of its aligned letter pairs

What Blast does • Input: a PAM matrix, a database of sequences B, a query sequence A, a threshold S • Output: all segment pairs (A, B) with score > S

How Blast works • Compile short, high-scoring strings (words) • Search for hits -- each hit gives a seed • Extend seeds

Blast on proteins • Words are w-mers which score at least T against A • Use hashing or a DFA to search for hits • Extend seed until a heuristically determined limit is reached

Blast on nucleic acids • Words are w-mers in query A • Letters compressed, four to a byte • Filter database B for very common words to avoid false positives • Extend seeds as in proteins
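
The three steps (compile words, search for hits, extend seeds) can be illustrated with a toy Python sketch for the nucleic acid case. This is not the Blast implementation: it uses a plain dictionary instead of compressed letters, does no filtering of common words, computes no statistics, and stops an extension when the running score falls more than `xdrop` below the best seen, which is one simple stand-in for the heuristic limit. All function and parameter names are assumptions made for illustration.

```python
from collections import defaultdict


def index_words(query, w):
    """Map every w-mer of the query to the positions where it starts."""
    words = defaultdict(list)
    for i in range(len(query) - w + 1):
        words[query[i:i + w]].append(i)
    return words


def extend_seed(query, subject, qi, si, w, match=1, mismatch=-1, xdrop=3):
    """Extend an exact w-mer hit in both directions without gaps."""
    def walk(q, s, step):
        gain = best = 0
        length = kept = 0
        while 0 <= q < len(query) and 0 <= s < len(subject):
            gain += match if query[q] == subject[s] else mismatch
            length += 1
            if gain > best:
                best, kept = gain, length   # remember the best extension
            elif best - gain > xdrop:
                break                       # fell too far below the best
            q += step
            s += step
        return best, kept

    right_gain, right_len = walk(qi + w, si + w, +1)
    left_gain, left_len = walk(qi - 1, si - 1, -1)
    score = w * match + left_gain + right_gain
    # (score, start in query, start in subject, segment length)
    return score, qi - left_len, si - left_len, w + left_len + right_len


def find_segment_pairs(query, subject, w=4, threshold=6):
    """Return distinct segment pairs whose score exceeds the threshold."""
    words = index_words(query, w)
    found = set()
    for si in range(len(subject) - w + 1):
        for qi in words.get(subject[si:si + w], ()):
            pair = extend_seed(query, subject, qi, si, w)
            if pair[0] > threshold:
                found.add(pair)
    return sorted(found)


print(find_segment_pairs("ACGTACGT", "TTACGTACGTTT"))
```

Several different seeds extend to the same segment pair here, which is why the results are collected in a set before being returned.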

What does Blast give you? • Efficiency • A rigorous statistical theory which gives the probability of a segment pair occurring by chance

Homework • Given sequences s,t of length m,n, how many alignments do they have? • Setubal/Meidanis, pp. 101, 102. Problems 2, 3, 4, 8, 16.
