Learn about algorithms for exact string matching and comparing sequences, including edit distance and pair-wise alignment. Study long sequence handling and utilize the Alggen tool.
Contents
•First week: algorithms for exact string matching
 –One pattern: the algorithm depends on |p| and |Σ|
 –k patterns: the algorithm depends on k, |p| and |Σ|
•Second week: comparing short sequences
 –Dot plot
 –Edit distance between two strings: dynamic programming
 –Alignment of sequences: 2 sequences; 3 or more sequences
•Third week: dealing with long sequences
Dot plot
See alggen
GAAAATGAAGATTCGAACTAA GCCA ATACATGAACTGCAAAAACAAATAAAAGAAAATATAAAC A
Distance between words
What is the distance between the words:
 – table, maple
 – able, table
 – announce, pronounce
 – ACCTG, ACTT
… and between
 – ACGG, ACTGTGG
 – AATCTACTAGCGTACTACTC, ACTACTACGTACTACG
Edit distance
We accept three types of errors:
 1. Mismatch: ACCGTGAT → ACCGAGAT
 2. Insertion: ACCGTGAT → ACCGATGAT
 3. Deletion: ACCGTGAT → ACCGGAT
Insertions and deletions are jointly called indels.
The edit distance d between two strings is the minimum number of substitutions, insertions and deletions needed to transform the first string into the second one.
d(ACT,ACT)=
d(ACT,AC)=
d(ACT,)=
d(AC,ATC)=
d(ACTTG,ATCTG)=
d(ACT,C)=
Edit distance
Answers to the exercise:
d(ACT,ACT)=0
d(ACT,AC)=1
d(ACT,)=3
d(AC,ATC)=1
d(ACTTG,ATCTG)=2
d(ACT,C)=2
Edit distance and alignments
The alignment that gives the distance can be represented as follows (a * marks a column whose two characters are the same):

ACCGTGAT
ACCG-GAT
**** ***

ACCG-TGAT
ACCGATGAT
**** ****

ACCGTGAT
ACCGAGAT
**** ***

ACCGTGTTATGTGTATG--TGA--AT
ACCG-GAT--GTGT-TGTTTGAGTAT
**** * *  **** **  ***  **

The score of the alignment is the sum of the scores of its columns:
 – 0 if both chars are the same
 – 1 otherwise
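The column scoring rule can be sketched in a few lines of Python. This is only an illustration: the function name `alignment_score` is hypothetical, and it assumes the two rows have already been padded with `-` to the same length.

```python
def alignment_score(top: str, bottom: str) -> int:
    """Sum the column scores: 0 if both characters are the same, 1 otherwise."""
    if len(top) != len(bottom):
        raise ValueError("both rows of an alignment must have the same length")
    return sum(0 if x == y else 1 for x, y in zip(top, bottom))


# The three short alignments from the slide each contain exactly one error.
print(alignment_score("ACCGTGAT", "ACCG-GAT"))    # deletion
print(alignment_score("ACCG-TGAT", "ACCGATGAT"))  # insertion
print(alignment_score("ACCGTGAT", "ACCGAGAT"))    # mismatch
```

A column pairing a character with a gap counts as 1, exactly like a mismatch, which is why the score of the best alignment equals the edit distance.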
Edit distance and alignments
But there are many alignments between two sequences. Given ACCG and ACT:

ACCG
ACT-
**

ACCG---
----ACT

ACCG-
-AC-T

ACCG
AC-T
**

The edit distance is then the score of the best alignment, so we can find the distance by generating all alignments and picking the one with the smallest score.
Edit distance and Pairwise alignment
Given two DNA sequences A (a1a2...an) and B (b1b2...bm) from the alphabet {a,c,t,g}, we say that A* and B* from {a,c,t,g,-} are aligned iff
 i) |A*|=|B*|
 ii) A* and B* become A and B if gaps ( - ) are removed.
 iii) For all i, it is not possible that a*i = b*i = -
Write all alignments between AA and AC ...
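The definition can be turned into a brute-force enumeration. The sketch below is an assumption-laden illustration (the names `all_alignments` and `brute_force_distance` are hypothetical): it builds every alignment column by column and keeps the smallest column score. It is exponential, so it is only usable on very short strings.

```python
from typing import Iterator, Tuple


def all_alignments(a: str, b: str) -> Iterator[Tuple[str, str]]:
    """Yield every pair (A*, B*) of equal length that spells a and b
    once gaps are removed and that has no column made of two gaps."""
    if not a and not b:
        yield "", ""
        return
    if a and b:  # column pairing one character of each string
        for x, y in all_alignments(a[1:], b[1:]):
            yield a[0] + x, b[0] + y
    if a:  # column with a gap in B*
        for x, y in all_alignments(a[1:], b):
            yield a[0] + x, "-" + y
    if b:  # column with a gap in A*
        for x, y in all_alignments(a, b[1:]):
            yield "-" + x, b[0] + y


def brute_force_distance(a: str, b: str) -> int:
    """Edit distance as the smallest score over all alignments."""
    return min(sum(x != y for x, y in zip(s, t)) for s, t in all_alignments(a, b))


for top, bottom in all_alignments("AA", "AC"):
    print(top, bottom)
```

Even for two strings of length 2 there are 13 alignments, which motivates the dynamic programming table used next.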
Edit distance and Pairwise alignment To blackboard
Edit distance and alignment of strings
      C T A C T A C T A C G T

  A
  C
  T
  G
  A
Edit distance and alignment of strings
      C T A C T A C T A C G T

  A
  C           x
  T
  G
  A
The cell marked x (row C, fifth column) contains the distance between AC and CTACT.
Edit distance and alignment of strings
The first row is filled from left to right; each cell holds the distance between the empty string and a prefix of CTACTACTACGT:
      C T A C T A C T A C G T
    0 1 2 3 4 5 6 7 8 …
  A
  C
  T
  G
  A
The corresponding alignments contain only gaps in the first string, for example:
-      --      ------
C      CT      CTACTA
Edit distance and alignment of strings
The first column is filled in the same way, from top to bottom:
      C T A C T A C T A C G T
    0 1 2 3 4 5 6 7 8 …
  A 1
  C 2
  T 3
  G …
  A
The corresponding alignments contain only gaps in the second string, for example:
ACT
---
Edit distance and alignment of strings
      C T A C T A C T A C G T
    0 1 2 3 4 5 6 7 8 …
  A 1
  C 2
  T 3
  G
  A
BA(AC,CTAC) = best of
 – BA(AC,CTA) followed by the column -/C
 – BA(A,CTA) followed by the column C/C
 – BA(A,CTAC) followed by the column C/-
d(AC,CTAC) = min of
 – d(AC,CTA)+1
 – d(A,CTA)
 – d(A,CTAC)+1
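Filling the whole table with this rule gives the usual dynamic programming algorithm for edit distance. The following is a minimal sketch, assuming unit costs for mismatches and indels as in the slides; the function name `edit_distance` is hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Fill the (len(a)+1) x (len(b)+1) table row by row and return the last cell."""
    n, m = len(a), len(b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i  # first column: a[:i] against the empty string
    for j in range(1, m + 1):
        d[0][j] = j  # first row: the empty string against b[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i][j - 1] + 1,        # gap in a (cell to the left)
                d[i - 1][j - 1] + sub,  # match or mismatch (diagonal cell)
                d[i - 1][j] + 1,        # gap in b (cell above)
            )
    return d[n][m]


print(edit_distance("AC", "CTAC"))
print(edit_distance("ACTGA", "CTACTACTACGT"))
```

The table has (n+1)(m+1) cells and each one is computed in constant time, so the cost is quadratic in both time and space.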
Bioinformatics Pairwise alignment
Best alignment
How can an alignment be scored?
Catcactactgacgactatcgtagcgcggctat acatctacgccaa- ctac-t- gtgtagatcgccgg
c-tgactgc-- acgactatcgt- attgcggctacacactacgcacaactactgtatgtcgc- cgg----
* * *** * ************* ********* **** ******* * **** ** * ***
•Mismatch: unfavorable
•Match: favorable
•Gap: worst case
Then we assign a score to each case, for example 1 for a match, -1 for a mismatch and -2 for a gap.
Pairwise alignment
Edit distance: match=0 mismatch=1 indel=1
d(AC,CTAC) = minimum of
 – d(A,CTAC)+1
 – d(A,CTA)+0 (match; +1 for a mismatch)
 – d(AC,CTA)+1
Similarity: match=1 mismatch=-1 indel=-2
s(AC,CTAC) = maximum of
 – s(A,CTAC)-2
 – s(A,CTA)+1 (match; -1 for a mismatch)
 – s(AC,CTA)-2
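Switching from distance to similarity only changes the cell scores and replaces the minimum with a maximum. A minimal sketch, assuming the example scores match=1, mismatch=-1, indel=-2 (the function name `similarity` is hypothetical):

```python
def similarity(a: str, b: str, match: int = 1, mismatch: int = -1, indel: int = -2) -> int:
    """Global similarity score of a and b with a linear gap penalty."""
    n, m = len(a), len(b)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s[i][0] = i * indel  # a[:i] aligned with gaps only
    for j in range(1, m + 1):
        s[0][j] = j * indel  # b[:j] aligned with gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = match if a[i - 1] == b[j - 1] else mismatch
            s[i][j] = max(
                s[i - 1][j - 1] + diag,  # pair a[i-1] with b[j-1]
                s[i - 1][j] + indel,     # a[i-1] against a gap
                s[i][j - 1] + indel,     # b[j-1] against a gap
            )
    return s[n][m]


print(similarity("AC", "CTAC"))
```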
Pairwise alignment Connect to alggen tool
Best alignment
Given the maximum score, how can the best alignment be found?
Table: columns accaccacaccacaacgagcata … acctgagcgatat, rows a c c . . t
•Quadratic cost in space and time
•Sequences up to 10,000 bp in length
Download alggen tool
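One common way to recover the alignment is to walk back from the last cell, at each step moving to a neighbouring cell that explains the current score. The sketch below assumes the linear scores match=1, mismatch=-1, indel=-2; the function name `global_alignment` is hypothetical, and when several best alignments exist it returns just one of them.

```python
def global_alignment(a: str, b: str, match: int = 1, mismatch: int = -1, indel: int = -2):
    """Return (score, top_row, bottom_row) of one best global alignment."""
    n, m = len(a), len(b)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s[i][0] = i * indel
    for j in range(1, m + 1):
        s[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = match if a[i - 1] == b[j - 1] else mismatch
            s[i][j] = max(s[i - 1][j - 1] + diag, s[i - 1][j] + indel, s[i][j - 1] + indel)

    # Traceback: rebuild the columns from the last cell to the first one.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        diag = (match if a[i - 1] == b[j - 1] else mismatch) if i > 0 and j > 0 else None
        if diag is not None and s[i][j] == s[i - 1][j - 1] + diag:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and s[i][j] == s[i - 1][j] + indel:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return s[n][m], "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = global_alignment("ACCGTGAT", "ACCGGAT")
print(score)
print(top)
print(bottom)
```

Keeping the full table is what makes the space cost quadratic; the traceback itself only visits at most n+m cells.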
Some preconceived ideas
We have developed the theory according to the following principles:
 1) Both sequences have a similar length (global).
 2) The gap model is linear: k consecutive gaps are penalized with k(-2).
Semiglobal pairwise alignment
Assume that we have two sequences, S1 and S2, of different lengths. It is meaningless to introduce gaps until both sequences have a similar length. The most probable alignment should consist of initial gaps, the aligned region and final gaps. How can these alignments be found?
Semiglobal pairwise alignment
Note that
      C T A C T A C T A C G T

  A
  C
  T
(figure labels: Initial gaps, Final gaps)
Semiglobal pairwise alignment
Given a cell in the first row:
      C T A C T A C T A C G T
    0 0 0 0 0 0 0 0 0 0 0 0 0
  A
  C
  T
The cell below the first A contains the score of the best alignment of CTA with the empty sequence.
Semiglobal pairwise alignment
      C T A C T A C T A C G T
    0 0 0 0 0 0 0 …
  A
  C
  T
The contribution of the initial gaps is disregarded, then
      C T A C T A C T A C G T
    0 0 0 0 0 0 0 …
  A 1
  C 2
  T 3
But what happens with the final gaps?
Semiglobal pairwise alignment
      C T A C T A C T A C G T
    0 0 0 0 0 0 0 …
  A 1
  C 2
  T 3
How does the algorithm search for the best alignment? … by checking the last row for the best score.
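Both ideas (a first row of zeros, and the best value taken from the last row) can be sketched as follows. This illustration assumes edit-distance scoring, as the 1, 2, 3 in the first column of the table suggests, so the best score is the smallest one; the function name `semiglobal_distance` is hypothetical.

```python
def semiglobal_distance(short: str, long: str):
    """Best alignment of `short` inside `long`, with free initial and final gaps.

    Returns (distance, end), where `end` is the column of the last row
    holding the smallest value (the leftmost one if there are ties).
    """
    m = len(long)
    prev = [0] * (m + 1)  # first row: initial gaps cost nothing
    for i in range(1, len(short) + 1):
        cur = [i] + [0] * m  # first column: short[:i] against the empty string
        for j in range(1, m + 1):
            sub = 0 if short[i - 1] == long[j - 1] else 1
            cur[j] = min(prev[j - 1] + sub, prev[j] + 1, cur[j - 1] + 1)
        prev = cur
    # Final gaps cost nothing: any cell of the last row may end the alignment.
    end = min(range(m + 1), key=lambda j: prev[j])
    return prev[end], end


print(semiglobal_distance("ACT", "CTACTACTACGT"))
```

Only two rows are kept at a time here because the sketch reports a score and an end position, not the alignment itself.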
Affine-gap model score
Given the following alignments, which all have the same score …

agtaccccgtag
agt-cc--gta-

agtaccccgtag
agt-c-c-gta-

agtaccccgtag
agt-c--cgta-

agtaccccgtag
agt--cc-gta-

agtaccccgtag
agt---ccgta-

agtaccccgtag
agt--c-cgta-

Which is the most reliable case from a biological point of view?
Affine-gap model score
Then, how can we distinguish between consecutive gaps and separated gaps?

agtaccccgtag
agt---ccgta-

agtaccccgtag
agt--c-cgta-

By penalizing a gap opening more heavily than a gap extension, for instance -10 and -0.5. Then the penalty of k consecutive gaps becomes OG + (k-1) EG, which is an affine-gap function. How is the best alignment found?
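To see the effect of the affine penalty, the two alignments can be scored column by column. This is a sketch under assumed values: match=1 and mismatch=-1 are taken from the earlier similarity example, OG=-10 and EG=-0.5 from the text, and the function name `affine_alignment_score` is hypothetical.

```python
def affine_alignment_score(top: str, bottom: str, match: float = 1.0, mismatch: float = -1.0,
                           gap_open: float = -10.0, gap_extend: float = -0.5) -> float:
    """Score an existing alignment; a run of k gaps costs OG + (k-1)*EG."""
    score = 0.0
    prev_gap = None  # row ("top" or "bottom") that held a gap in the previous column
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            row = "top" if x == "-" else "bottom"
            # Extending a run in the same row is cheap; anything else opens a new gap.
            score += gap_extend if row == prev_gap else gap_open
            prev_gap = row
        else:
            score += match if x == y else mismatch
            prev_gap = None
    return score


print(affine_alignment_score("agtaccccgtag", "agt---ccgta-"))  # consecutive gaps
print(affine_alignment_score("agtaccccgtag", "agt--c-cgta-"))  # separated gaps
```

With these values the alignment with consecutive gaps scores -13.0 and the one with separated gaps -22.5, whereas under the linear model (every gap -2) both score 0.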
Affine-gap model score
      C T A C T A C T A C G T

  A
  C
  T
  G
  A
Smallest arrows: refer to the introduction of an opening gap.
Largest arrows: refer to the introduction of an extension gap.
But from which cell do the largest arrows originate?
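One well-known answer, which the slides do not name, is to keep three tables instead of one, so that each cell remembers whether the alignment ending there finishes with a gap (the scheme usually attributed to Gotoh). The sketch below is an assumption about how this could be done, not the course's own method; the function name and table names M, X, Y are hypothetical, and the default scores are the example values used earlier.

```python
def affine_similarity(a: str, b: str, match: float = 1.0, mismatch: float = -1.0,
                      gap_open: float = -10.0, gap_extend: float = -0.5) -> float:
    """Best global score when k consecutive gaps cost OG + (k-1)*EG."""
    neg = float("-inf")
    n, m = len(a), len(b)
    # M: last column pairs two characters; X: gap in b; Y: gap in a.
    M = [[neg] * (m + 1) for _ in range(n + 1)]
    X = [[neg] * (m + 1) for _ in range(n + 1)]
    Y = [[neg] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + pair
            # An extension continues a gap in the same table; anything else opens one.
            X[i][j] = max(M[i - 1][j] + gap_open, X[i - 1][j] + gap_extend, Y[i - 1][j] + gap_open)
            Y[i][j] = max(M[i][j - 1] + gap_open, Y[i][j - 1] + gap_extend, X[i][j - 1] + gap_open)
    return max(M[n][m], X[n][m], Y[n][m])


print(affine_similarity("AAAA", "AA"))
```

In this formulation an extension step always comes from the neighbouring cell of the same gap table, while an opening step comes from one of the other tables.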
Local alignment
Given two sequences, we can consider the alignments of all their substrings … how can the best of them be found?
Two questions arise:
 – how can the alignments be compared?
 – how can the best one be selected?
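A standard way to address both questions, not spelled out in the slides, is the Smith-Waterman scheme: score with a similarity function, never let a cell drop below 0 (so a poor prefix can be discarded), and take the largest value found anywhere in the table. A minimal sketch, assuming match=1, mismatch=-1, indel=-2 and a hypothetical function name:

```python
def local_similarity(a: str, b: str, match: int = 1, mismatch: int = -1, indel: int = -2) -> int:
    """Best similarity score over all pairs of substrings of a and b."""
    m = len(b)
    best = 0
    prev = [0] * (m + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (m + 1)
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            # The extra 0 lets a new local alignment start at any cell.
            cur[j] = max(0, prev[j - 1] + pair, prev[j] + indel, cur[j - 1] + indel)
            best = max(best, cur[j])
        prev = cur
    return best


print(local_similarity("TTACGTT", "GGACGGG"))
```

Distance would not work here, because the empty alignment always has distance 0; a similarity score rewards longer matching regions, which makes alignments of different substrings comparable.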
Bioinformatics Multiple alignment
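Pairwise to multiple alignment
What happens with three strings S1, S2 and S3? Let n be their length; then the table becomes a cube with O(n^3) cells, each cell has "O(2^3)" neighbouring cells to check, and each column takes "O(3^2)" pair comparisons to score.
And with k strings? O(n^k 2^k k^2)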
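A quick calculation shows how fast this bound grows. The helper below simply evaluates n^k · 2^k · k^2 and ignores constant factors; its name `multiple_alignment_cost` is hypothetical.

```python
def multiple_alignment_cost(n: int, k: int) -> int:
    """Evaluate n^k * 2^k * k^2 for k strings of length n."""
    return n ** k * 2 ** k * k ** 2


for k in (2, 3, 4, 5):
    print(k, multiple_alignment_cost(100, k))
```

Each extra string multiplies the count by roughly 2n, which is why practical programs for multiple alignment rely on heuristics.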
Multiple alignment
Multiple alignment programs use different heuristics:
 – Clustal (progressive alignment) http://www.ebi.ac.uk/clustalw
 – TCoffee (progressive alignment + databases) http://igs-server.cnrs-mrs.fr/Tcoffee_cgi/index.cgi
 – HMM (Hidden Markov Models)
Multiple alignment Connect to alggen tool
Advanced Data Structure: Bioinformatics •First week: Algorithms for exact string matching. •Second week: Alignment of sequences. •Third week: Dealing with long sequences.