Sequential Data Mining

Sequential Data Mining Jianlin Cheng Department of Computer Science University of Missouri, Columbia 2011

Sequential Data • Sequence data: a series of alphabets or numbers • Spatial sequence data: biological sequence, text document, a computer program, web page, a missile trajectory • Temporal sequence data: speech, movie clips, stock prices, a series of web clicks

Key Issue • Measure the similarity between two sequences, i.e. sequence comparison • Sequence database search / retrieval • Sequence classification • Sequence clustering • Sequence pattern recognition • Sequence similarity definition and calculation is a challenge compared to vector-based data

Sequence Alignment • Use biological sequences as an example • Optimal pairwise sequence alignment algorithm • Fast, heuristic sequence alignment (BLAST) • Statistical significance of sequence alignment score

Sequence Data • DNA Sequence (gene) gaagagagcttcaggtttggggaagagaca acaactcccgctcagaagcaggggccgata • RNA Sequence caaaucacuacacacaggguagaagguggaacgcacaggagcaugucaacggggugc • Protein Sequence RELQVWGRDNNSLSEAGANRQGDVSFNLPQITLWQRPLVTIKIGGQLKEALLDTGADDTVLEDIDLPGKWKP

Global Pairwise Sequence Alignment ITAKPAKTTSPKEQAIGLSVTFLSFLLPAGVLYHL ITAKPQWLKTSESVTFLSFLLPQTQGLYHL

Global Pairwise Sequence Alignment ITAKPAKTTSPKEQAIGLSVTFLSFLLPAGVLYHL ITAKPQWLKTSESVTFLSFLLPQTQGLYHL ITAKPAKT-TSPKEQAIGLSVTFLSFLLPAG-VLYHL ITAKPQWLKTSE-------SVTFLSFLLPQTQGLYHL Alignment (similarity) score

Three Main Issues • Definition of alignment score • Algorithms of finding the optimal alignment • Evaluation of significance of alignment score

A Simple Scoring Scheme • Score of a character pair: S(match)=1, S(not_match) = -1, S(gap-char) = -1 • Score of an alignment = ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL ITAKPQWLKSTE-------SVTFLSFLLPQTQGLYHL 5 – 7 – 7 +10 -4 + 4= 1

Optimization Problem • How many possible alignments exist for two sequences with length m and n? • How can we find the best alignment to maximize alignment score?

Possible Alignments Shortest ACTG--- ATGGATG Intermediate AC-TG--- ATG-GATG Longest ACTG------- ----ATGGATG ACTG ATGGATG

Total Number of Possible Alignments AGATCAGAAAT-G --AT-AG-AATCC m + n

Total Number of Alignments Select m positions out of m+n possible positions: Exponential! If m = 300, n = 300, total = 1037

Needleman and Wunsch Algorithm • Given sequences P and Q, we use a matrix M to record the optimal alignment scores of all prefixes of P and Q. M[i,j] is the best alignment score for the prefixes P[1..i] and Q[1..j]. • M[i,j] = max [ M[i-1,j-1] + S(P[i],Q[j]), M[i,j-1] + S(-, Q[j]) M[i-1,j] + S(P[i], -) ]

Needleman and Wunsch Algorithm • Given sequences P and Q, we use a matrix M to record the optimal alignment scores of all prefixes of P and Q. M[i,j] is the best alignment score for the prefixes P[1..i] and Q[1..j]. • M[i,j] = max [ M[i-1,j-1] + S(P[i],Q[j]), M[i,j-1] + S(-, Q[j]) M[i-1,j] + S(P[i], -) ] Dynamic Programming

Dynamic Programming Algorithm Three-Step Algorithm: • Initialization • Matrix filling (scoring) • Trace back (alignment)

1. Initialization of Matrix M j _ A T A G A A T _ A G A T C A i G A A A T G

2. Fill Matrix _ A T A G A A T _ A G A T M[i,j] = max [ M[i-1,j-1] + S(P[i],Q[j]), M[i,j-1] + S(-, Q[j]) M[i-1,j] + S(P[i], -) ] C A G A A A T G

3. Trace Back _ A T A G A A T _ A G A T C A AGATCAGAAATG G --AT-AG-AAT- A A A T G

Pairwise Alignment Algorithm Using Dynamic Programming • Initialization: Given two sequences with length m and n, create a (m+1)×(n+1) matrix M. Initialize the first row and first column according to scoring matrix. • For j in 1..n (column) for i in 1..m (row) M[i,j] = max( (M[i-1,j-1]+S(i,j), M[i,j-1]+S(-,j), M[i-1, j] + S(i,-) ) Record the selected path toward (i,j) • Report alignment score M[m,n] and trace back to M[0,0] to generate the optimal alignment. What is the time complexity of this algorithm?

Local Sequence Alignment Using DP • Biological sequences often only have local similarity. • During evolution, only functional and structural important regions are highly conserved. • Global alignment sacrifices the local similarity to maximize the global alignment score. • Need to use alignment method to identify the local similar regions disregard of other dissimilar regions.

Transcription binding site

Local Alignment Algorithm Goal: find an alignment of the substrings of P and Q with maximum alignment score. Naïve Algorithm: (m+1)*m/2 substrings of P, (n+1)*n/2 substrings of Q Using DP for each substring pairs: m2 * n2 * O(mn) = O(m3n3)

Smith-Waterman Algorithm Same dynamic programming algorithm as global alignment except for three differences. • All negative scores is converted to 0 (why?) • Alignment can start from anywhere in the matrix • Alignment can end at anywhere in the matrix

Local Alignment Algorithm • Initialization: Given two sequences with length m and n, create a (m+1)×(n+1) matrix M. Initialize the first row and first column to 0s. • For j in 1..n (column) for i in 1..m (row) M[i,j] = max( 0, M[i-1,j-1]+S(i,j), M[i,j-1]+S(-,j), M[i-1, j] + S(i,-) ) Record the selected path. • Find elements in matrix M with maximum values. Trace back till 0 and report the alignment corresponding to the path.

1. Initialization _ A T A G A A T _ A G A T C A G A A A T G

2. Fill the Matrix _ A T A G A A T _ A G A T C A G A A A T G

3. Trace Back _ A T A G A A T _ A G A T C A G A A A T G

Two Best Local Alignments Local Alignment 2: Local Alignment 1: ATCAGAAAT AT-AGA-AT ATCAGAA AT-AGAA

Scoring Matrix • How to accurately measure the similarity between amino acids (or nucleotides) is one key issue of sequence alignment. • For nucleotides, a simple identical / not identical scheme is mostly ok. • Due to various properties of amino acids, it is hard and also critical to measure the similarity between amino acids.

Evolutionary Substitution Approach • During evolution, the substitution of similar amino acids is more likely to be selected within a protein family of similar sequences than random substitutions (M. Dayhoff) • The frequency / probability that one residue substitutes another one is an indicator of their similarity.

PAM Scoring Matrices(M. Dayhoff) • Select a number of protein families. • Align sequences in each family and count the frequency of amino acid substitution in each column: Pij. • Similarity score is logarithm of the ratio of observed substitution probability over the random substitution probability. S(i,j) = log(Pij / (Pi * Pj)). Pi is the observed probability of residue i and Pj is the observed probability of residue j • PAM: Point Accepted Mutation • Let data tell strategy! - NBA

A Simplified Example Substitution Frequency Table ACGTCGAGT ACCACGTGT CACACTACT ACCGCATGA ACCCTATCT TCCGTAACA ACCATAAGT AGCATAAGT ACTATAAGT ACGATAAGT Total number of substitutions: 90 P(A<->C) = 0.07+0.07=0.14

A Simplified Example Substitution Frequency Table ACGTCGAGT ACCACGTGT CACACTACT ACCGCATGA ACCCTATCT TCCGTAACA ACCATAAGT AGCATAAGT ACTATAAGT ACGATAAGT Total number of substitutions: 90 P(A<->C) = 0.07+0.07=0.14 S(A,C) = log(0.14/(0.6*0.1)) = 0.36

PAM250 Matrix (log odds multiplied by 10)

BLOSUM Matrices(Henikoff and Henikoff) • PAM calculation is based on global alignments • PAM matrices don’t work well for aligning sequences with little similarity. • BLOSUM: BLOcks SUbstitution Matrix • BLOSUM based on highly conserved local regions /blocks without gaps.

Block 1 Block2

BLOSUM62 Matrix

Significance of Sequence Alignment Score • Why do we need significant test? • Mathematical view: unusual versus “by chance” • Biological view: evolutionary related or not?

Randomization Approach • Randomization is a fundamental idea due to a statistician Fisher. • Randomly permute chars within sequence P and Q to generate new sequences (P’ and Q’). Align random sequences and record alignment scores. • Assuming these scores obey normal distribution, compute mean (u) and standard derivation (σ) of alignment scores

95% Alignment score Normal distribution of alignment scores of two sequences • If S = u+2 σ, the probability of observing the alignment score equal to or more • extreme than this by chance is 2.5%, e.g., P(S>=u+2 σ) = 2.5%. • Thus we are 97.5% confident that the alignment score is not by chance, i.e. significant. • For any score x, we can compute P(S >= x), which is called p-value.

Model-Based Approach (Karlin and Altschul)http://www.people.virginia.edu/~wrp/cshl02/Altschul/Altschul-3.html • Extreme Value Distribution K and λare statistical parameters depending on substitution matrix. For BLOSUM62, λ =0.252, K=0.35

P-Value • P(S≥x) is called p-value. It is the probability that two random sequences has an alignment score >= x. • Smaller means more significant. • E-value = p-value * # sequences

Problems of Using Dynamic Programming to Search Large Sequence Databases • Search homologs in DNA and protein database is often the first step of a bioinformatics study. • DP is too slow for large sequence database search such as Genbank and UniProt. Each DP search can take hours. • Most DP search time is wasted on unrelated sequences or dissimilar regions. • Developing fast, practical sequence comparison methods for database search is important.

Fast Sequence Search Methods • All successful, rapid sequence comparison methods are based on a simple fact: similar sequences share some common words. • First such method is FASTP (Pearson & Lipman, 1985) • Most widely used methods are BLAST (Altschul et al., 1990) and PSI-BLAST (Altschul et al., 1997).

Sequential Data Mining