
Sequence Alignment

Sequence Alignment. Motivation: storing, retrieving and comparing DNA sequences in databases; comparing two or more sequences for similarities; searching databases for related sequences and subsequences; exploring frequently occurring patterns of nucleotides.

layne

Presentation Transcript


  1. Sequence Alignment
     Motivation:
     • Storing, retrieving and comparing DNA sequences in databases.
     • Comparing two or more sequences for similarities.
     • Searching databases for related sequences and subsequences.
     • Exploring frequently occurring patterns of nucleotides.
     • Finding informative elements in protein and DNA sequences.
     • Various experimental applications (reconstruction of DNA, etc.)

  2. Similar sequences produce similar proteins
     [Diagram: sequence alignment of Protein1 and Protein2. More than 25% sequence identity ⇒ similar 3D structure ⇒ similar protein function.]

  3. Exact Pattern Matching
     • Given a pattern P of length m and a (longer) string T of length n, find all occurrences of P in T.
     • Naïve algorithm: O(m*n)
     • Boyer-Moore, Knuth-Morris-Pratt: O(n+m)
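
As an illustration of the O(n+m) bound, here is a minimal Python sketch of Knuth-Morris-Pratt matching. The function name `kmp_search` and the 0-based positions are choices made for this example, not part of the slides.

```python
def kmp_search(pattern, text):
    """Return the 0-based start positions of all occurrences of pattern in text."""
    if not pattern:
        return []
    # failure[i] = length of the longest proper prefix of pattern[:i+1]
    # that is also a suffix of it.
    failure = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k

    hits = []
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while k and ch != pattern[k]:
            k = failure[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = failure[k - 1]  # keep going to allow overlapping occurrences
    return hits


print(kmp_search("TTG", "ACTTGTTGT"))
```

The text is scanned once and the pattern is preprocessed once, which is where the n+m comes from.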

  4. Alignment - inexact matching
     • Substitution - replacing a sequence base by another.
     • Insertion - inserting a base (letter) or several bases into the sequence.
     • Deletion - deleting a base (or more) from the sequence.
     • (Insertion and deletion are the reverse of one another.)

  5. Sequence Alignment Score
     Commonly used scoring matrices: PAM250, BLOSUM62

  6. Global Alignment
     INPUT: Two sequences S and T of roughly the same length.
     QUESTION: What is the maximum similarity between them? Find one of the best alignments.

  7. The Idea
     Given s[1…n] and t[1…m], to align s[1…i] with t[1…j] we have three choices:
     * align s[1…i-1] with t[1…j-1] and match s[i] with t[j]
     * align s[1…i] with t[1…j-1] and match a space with t[j]
     * align s[1…i-1] with t[1…j] and match s[i] with a space
     The last column of the alignment in each case:
       s[1…i-1] s[i]      s[1…i]   -         s[1…i-1] s[i]
       t[1…j-1] t[j]      t[1…j-1] t[j]      t[1…j]   -

  8. Recursive Relation
     Define: a scoring matrix m(a,b), for a, b ∈ Σ ∪ {-}.
     Define: H_{ij}, the best score of an alignment between s[1…i] and t[1…j].
     For 1 <= i <= n, 1 <= j <= m:
     $$H_{ij} = \max \begin{cases} H_{i-1,j-1} + m(s_i, t_j) \\ H_{i,j-1} + m(-, t_j) \\ H_{i-1,j} + m(s_i, -) \end{cases}$$
     Boundary conditions:
     $$H_{i0} = \sum_{k=1}^{i} m(s_k, -), \qquad H_{0j} = \sum_{k=1}^{j} m(-, t_k)$$
     Optimal alignment score = H_{nm}
     (Needleman-Wunsch, 1970)
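
The recurrence can be sketched in Python as follows. This is a minimal illustration, assuming a simple scoring scheme in place of a full matrix m(a,b): the parameters `match`, `mismatch` and `gap` and their default values are hypothetical, and the function name is chosen only for this example.

```python
def needleman_wunsch(s, t, match=1, mismatch=-1, gap=-2):
    """Global alignment: return (score, aligned_s, aligned_t)."""
    n, m = len(s), len(t)

    def sub(a, b):
        # Stand-in for the scoring matrix m(a, b) on two letters.
        return match if a == b else mismatch

    # H[i][j] = best score of aligning s[:i] with t[:j].
    H = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        H[i][0] = H[i - 1][0] + gap      # s[:i] against spaces only
    for j in range(1, m + 1):
        H[0][j] = H[0][j - 1] + gap      # t[:j] against spaces only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i][j] = max(H[i - 1][j - 1] + sub(s[i - 1], t[j - 1]),
                          H[i][j - 1] + gap,
                          H[i - 1][j] + gap)

    # Trace back from H[n][m] to recover one of the best alignments.
    a, b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and H[i][j] == H[i - 1][j - 1] + sub(s[i - 1], t[j - 1]):
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and H[i][j] == H[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return H[n][m], "".join(reversed(a)), "".join(reversed(b))


score, top, bottom = needleman_wunsch("ACGT", "AGT")
print(score)
print(top)
print(bottom)
```

The table uses O(n*m) time and space; the traceback returns only one of possibly several optimal alignments.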

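  9. [Figure: dynamic-programming matrix with s and t along the axes, showing the three moves into a cell: s3 aligned with t4 (I over T), s3 aligned with a space (I over -), and a space aligned with t4 (- over T).]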

  10. Local Alignment
      INPUT: Two sequences S and T.
      QUESTION: What is the maximum similarity between a subsequence of S and a subsequence of T? Find the most similar subsequences.

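  11. Recursive Relation
      For 1 <= i <= n, 1 <= j <= m:
      $$H_{ij} = \max \begin{cases} H_{i-1,j-1} + m(s_i, t_j) \\ H_{i,j-1} + m(-, t_j) \\ H_{i-1,j} + m(s_i, -) \\ 0 \end{cases}$$
      Boundary conditions:
      $$H_{i0} = 0, \qquad H_{0j} = 0$$
      Optimal alignment score = max_{i,j} H_{ij}
      (Smith-Waterman, 1981)
      *Penalties should be negative*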
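
A minimal Python sketch of the local recurrence, under the same assumptions as any toy scoring scheme: `match`, `mismatch` and `gap` are hypothetical values standing in for m(a,b), with negative penalties as the slide requires.

```python
def smith_waterman(s, t, match=2, mismatch=-1, gap=-2):
    """Local alignment: return (best_score, aligned_part_of_s, aligned_part_of_t)."""
    n, m = len(s), len(t)

    def sub(a, b):
        # Stand-in for the scoring matrix m(a, b) on two letters.
        return match if a == b else mismatch

    H = [[0] * (m + 1) for _ in range(n + 1)]   # first row and column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub(s[i - 1], t[j - 1]),
                          H[i][j - 1] + gap,
                          H[i - 1][j] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a cell with score 0 is reached.
    a, b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + sub(s[i - 1], t[j - 1]):
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return best, "".join(reversed(a)), "".join(reversed(b))


print(smith_waterman("TTTACGTTT", "GGACGGG"))
```

The extra 0 in the maximum is what lets an alignment start anywhere: a prefix with negative score is simply dropped.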

  12. Sequence Alignment
      Complexity:
      Time: O(n*m)
      Space: O(n*m) (an algorithm using O(min(n,m)) space exists)
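
The space bound can be illustrated for the score alone: each row of the table depends only on the previous row, so two rows over the shorter sequence suffice. The sketch below is a Python illustration under assumed scoring parameters (`match`, `mismatch`, `gap` are hypothetical); recovering the alignment itself in linear space needs a divide-and-conquer scheme such as Hirschberg's, which is not shown.

```python
def global_score_linear_space(s, t, match=1, mismatch=-1, gap=-2):
    """Global alignment score using O(min(len(s), len(t))) extra space."""
    if len(t) > len(s):
        s, t = t, s   # scoring is symmetric, so keep rows over the shorter sequence
    prev = [j * gap for j in range(len(t) + 1)]   # row 0 of the table
    for i in range(1, len(s) + 1):
        cur = [i * gap] + [0] * len(t)            # column 0 of row i
        for j in range(1, len(t) + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + sub, cur[j - 1] + gap, prev[j] + gap)
        prev = cur
    return prev[-1]


print(global_score_linear_space("ACGT", "AGT"))
```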

  13. Ends-free alignment
      INPUT: Two sequences S and T (possibly of different length).
      QUESTION: Find one of the best alignments between subsequences of S and T when at least one of these subsequences is a prefix of the original sequence and one (not necessarily the other) is a suffix.
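
One common way to realise this definition is to make leading and trailing spaces free: the first row and column of the table are set to 0, and the answer is the maximum over the last row and last column. The Python sketch below follows that reading; the scoring parameters and the function name are assumptions made for the example.

```python
def ends_free_score(s, t, match=1, mismatch=-1, gap=-2):
    """Best ends-free alignment score: spaces at either end of s or t cost nothing."""
    n, m = len(s), len(t)
    # First row and first column stay 0, so leading spaces are free.
    H = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            H[i][j] = max(H[i - 1][j - 1] + sub,
                          H[i][j - 1] + gap,
                          H[i - 1][j] + gap)
    # Trailing spaces are free: take the best cell of the last row or last column.
    return max(max(H[n]), max(row[m] for row in H))


# A suffix of the first sequence overlaps a prefix of the second.
print(ends_free_score("AAAACGT", "CGTTTT"))
```

This variant is useful when two fragments are expected to overlap, for example when assembling reads.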

  14. Gap Alignment
      Definition: A gap is a maximal contiguous run of spaces in a single sequence within a given alignment. The length of a gap is the number of indel operations in it. A gap penalty function is a function that measures the cost of a gap as a (nonlinear) function of its length.
      Gap penalty
      INPUT: Two sequences S and T (possibly of different length).
      QUESTION: Find one of the best alignments between the two sequences using the gap penalty function.
      Affine gap: W_total = W_g + q*W_s, where q is the length of the gap
      W_g: weight to open the gap
      W_s: weight to extend the gap
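
To make the definitions concrete, the following Python sketch finds the gaps of an already aligned pair and sums their affine penalties W_g + q*W_s. The weights `w_open` and `w_extend` are hypothetical negative values, and the helper names are chosen only for this example; it scores a given alignment and does not search for the best one.

```python
import re


def affine_gap_penalty(q, w_open=-5, w_extend=-1):
    """W_total = W_g + q * W_s for a single gap of length q."""
    return w_open + q * w_extend


def gap_lengths(aligned):
    """Lengths of the maximal runs of '-' in one row of an alignment."""
    return [len(run) for run in re.findall(r"-+", aligned)]


def total_gap_penalty(aligned_s, aligned_t, w_open=-5, w_extend=-1):
    """Sum of the affine penalties over all gaps in both rows."""
    lengths = gap_lengths(aligned_s) + gap_lengths(aligned_t)
    return sum(affine_gap_penalty(q, w_open, w_extend) for q in lengths)


print(gap_lengths("AC---GT-A"))
print(total_gap_penalty("AC---GT-A", "ACGTTGTTA"))
```

With these weights one gap of length 3 costs -8, while three separate gaps of length 1 would cost -18, which is the point of charging more for opening a gap than for extending it.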

  15. What Kind of Alignment to Use?
      • The same protein from different organisms.
      • Two different proteins sharing the same function.
      • Protein domain against a database of complete proteins.
      • Protein against a database of small patterns (functional units).
      • ...

  16. Sequence Alignment vs. Database
      • Task: Given a query sequence and millions of database records, find the optimal alignment between the query and a record.
      Example sequence: ACTTTTGGTGACTGTAC

  17. Sequence Alignment vs. Database
      • Tool: Given two sequences, there exists an algorithm to find the best alignment.
      • Naïve solution: Apply the algorithm to each of the records, one by one.

  18. Sequence Alignment vs. Database
      • Problem: An exact algorithm is too slow to run millions of times (even a linear-time algorithm will run slowly on a huge DB).
      • Solution:
        • Run in parallel (expensive).
        • Use a fast (heuristic) method to discard irrelevant records, then apply the exact algorithm to the remaining few.

  19. Sequence Alignment vs. Database
      • General strategy of heuristic algorithms:
        • Homologous sequences are expected to contain at least some short ungapped segments (probably with substitutions, but without insertions or deletions).
        • Preprocess the DB into a fast-access data structure of short segments.

  20. FASTA Idea
      • Idea: a good alignment probably matches some identical 'words' (ktups).
      • Example (matching words of size 4):
        Database record:        ACTTGTAGATACAAAATGTG
        Aligned query sequence: A-TTGTCG-TACAA-ATCTGT

  21. Dictionaries of Words
      ACTTGTAGATAC is translated to the dictionary: ACTT, CTTG, TTGT, TGTA, …
      Dictionaries of well-aligned sequences share words.
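
A dictionary of this kind can be sketched in Python as a map from each word to the positions where it starts. The function name `word_dictionary` and the default word size `k=4` are choices made for this example.

```python
from collections import defaultdict


def word_dictionary(seq, k=4):
    """Map every length-k word of seq to the list of 0-based positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(seq) - k + 1):
        index[seq[pos:pos + k]].append(pos)
    return dict(index)


record_words = word_dictionary("ACTTGTAGATAC")
query_words = word_dictionary("TTGTAGAT")
print(list(record_words)[:4])
# Words present in both dictionaries hint at a possible good alignment.
print(sorted(set(record_words) & set(query_words)))
```

Storing positions rather than just the words makes it possible to ask later where two sequences share a word, not only whether they do.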

  22. FASTA Stage I
      • Prepare a dictionary for each DB sequence (in advance).
      • Upon query:
        • Prepare a dictionary for the query sequence.
        • For each DB record:
          • Find matching words.
          • Search for long diagonal runs of matching words.
          • Init-1 score: longest run.
          • Discard the record if the score is low.
      [Figure: dot plot in which each * marks a matching word; axes: position in DB record and position in query.]
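
The stage can be sketched in Python in a deliberately simplified form: a matching word at query position i and record position j lies on diagonal j - i, and the score below is just the longest run of consecutive matching words on one diagonal. The function name `init1_score` and this way of scoring are assumptions for illustration; the actual FASTA scoring of runs is more elaborate.

```python
from collections import defaultdict


def init1_score(query, record, k=4):
    """Length of the longest run of consecutive matching k-words on one diagonal."""
    # Dictionary of the record: word -> start positions.
    index = defaultdict(list)
    for j in range(len(record) - k + 1):
        index[record[j:j + k]].append(j)

    # Group matching words by diagonal (record position minus query position).
    diagonals = defaultdict(set)
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], ()):
            diagonals[j - i].add(i)

    best = 0
    for positions in diagonals.values():
        run, prev = 0, None
        for i in sorted(positions):
            run = run + 1 if prev is not None and i == prev + 1 else 1
            best = max(best, run)
            prev = i
    return best


print(init1_score("TTGTAGAT", "ACTTGTAGATAC"))
```

A record whose score falls below some chosen cutoff would be discarded before the more expensive stages.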

  23. FASTA Stage II
      • Good alignment: a path through many runs, with short connections.
      • Assign weights to runs (+) and connections (-).
      • Find a path of maximum weight.
      • Init-n score: total path weight.
      • Discard the record if the score is low.

  24. FASTA Stage III
      • Improve Init-1: apply an exact algorithm around the Init-1 diagonal, within a band of a given width.
      • Init-1 → Opt-score: the new weight.
      • Discard the record if the score is low.

  25. FASTA Final Stage
      • Apply an exact algorithm to the surviving records, computing the final alignment score.

  26. BLAST (Basic Local Alignment Search Tool)
      Approximate Matches
      BLAST: words are allowed to contain inexact matches.
      Example: in the polypeptide sequence IHAVEADREAM, the 4-long word HAVE starting at position 2 may match HAVE, RAVE, HIVE, HALE, …

  27. Approximate Matches
      For each word of length w from a database, generate all similar words.
      'Similar' means: score(word, word') > T
      Store all similar words in a look-up table.
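
Generating the similar words can be sketched in Python by brute force: enumerate every word of the same length over the alphabet and keep those scoring above T. The DNA alphabet and the match/mismatch scoring function below are hypothetical stand-ins for a real substitution matrix, and enumeration like this is only practical for short words.

```python
from itertools import product


def similar_words(word, alphabet, score, threshold):
    """All words w' of the same length with score(word, w') > threshold."""
    result = []
    for candidate in product(alphabet, repeat=len(word)):
        total = sum(score(a, b) for a, b in zip(word, candidate))
        if total > threshold:
            result.append("".join(candidate))
    return result


def toy_score(a, b):
    # Hypothetical scoring: +2 for identical letters, -1 otherwise.
    return 2 if a == b else -1


neighbours = similar_words("ACG", "ACGT", toy_score, threshold=2)
print(len(neighbours))
print("ACT" in neighbours, "TTG" in neighbours)
```

With this scoring and T = 2, the word itself and every word with exactly one substitution pass, while words with two substitutions do not.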

  28. DB search
      1) For each word of length w from the query sequence, generate all similar words.
      2) Access the DB.
      3) Extend each hit as much as possible -> High-scoring Segment Pair (HSP), with score(HSP) > V.
      Example sequences:
        THEFIRSTLINIHAVEADREAMESIRPATRICKREAD
        INVIEIAMDEADMEATTNAMHEWASNINETEEN
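
Step 3 can be sketched in Python as an ungapped extension of a seed hit in both directions, keeping the extension with the highest score. This is a simplified illustration: the function name, the argument layout and the scoring function are assumptions, and it scans to the end of the sequences rather than stopping early the way a production tool would.

```python
def extend_hit(query, record, qi, ri, w, score):
    """Extend the seed query[qi:qi+w] / record[ri:ri+w] without gaps.

    Returns (hsp_score, query_start, query_end) with query_end exclusive.
    """
    total = sum(score(a, b) for a, b in zip(query[qi:qi + w], record[ri:ri + w]))

    # Extend to the right, remembering the most profitable stopping point.
    gain, best_gain, right = 0, 0, 0
    k = 0
    while qi + w + k < len(query) and ri + w + k < len(record):
        gain += score(query[qi + w + k], record[ri + w + k])
        k += 1
        if gain > best_gain:
            best_gain, right = gain, k
    total += best_gain

    # Extend to the left in the same way.
    gain, best_gain, left = 0, 0, 0
    k = 1
    while qi - k >= 0 and ri - k >= 0:
        gain += score(query[qi - k], record[ri - k])
        if gain > best_gain:
            best_gain, left = gain, k
        k += 1
    total += best_gain

    return total, qi - left, qi + w + right


def toy_score(a, b):
    # Hypothetical scoring: +1 for identical letters, -1 otherwise.
    return 1 if a == b else -1


print(extend_hit("TTACGTAA", "CCACGTAG", qi=2, ri=2, w=3, score=toy_score))
```

The resulting segment pair would be kept as an HSP only if its score exceeds the cutoff V.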

  29. DB search (2)
      4) Around the HSP perform DP. At each step the alignment score should be > T.
      [Figure: DP region grown from the starting point (seed pair); axes: s-db and s-query.]

  30. Homework
      • Write a CGI script (using Perl) that performs pairwise local/global alignment for DNA sequences. All I/O is via HTML only.
      • Input:
        • Choice of local or global alignment.
        • Two sequences (text boxes).
        • Values for match, mismatch, ins/dels.
        • Number of iterations for computing random scores.
      • Output:
        • Alignment score.
        • z-score value: z = (score - average) / standard deviation.
      Remarks:
      1) You are allowed to use only linear space.
      2) To compute the z-score, perform random shuffling:
           srand(time | $$);   # initialize the generator; $$ is the process ID
           int(rand($i));      # returns a random integer between 0 and $i-1
      3) Shuffling is done in non-overlapping windows of 10 bases. The number of shuffles for each window is random, in [0,10].
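
The z-score procedure in the remarks can be sketched as follows. The sketch is in Python rather than Perl, and it rests on two assumptions that the slide does not settle: a "shuffle" is read as one random swap of two positions inside a window, and the standard deviation is taken over the whole set of random scores (population form). All function names are hypothetical.

```python
import random
import statistics


def shuffle_windows(seq, window=10, max_swaps=10, rng=random):
    """Shuffle seq inside non-overlapping windows, with 0..max_swaps random swaps each."""
    out = []
    for start in range(0, len(seq), window):
        chunk = list(seq[start:start + window])
        for _ in range(rng.randint(0, max_swaps)):   # randint includes both ends
            a = rng.randrange(len(chunk))
            b = rng.randrange(len(chunk))
            chunk[a], chunk[b] = chunk[b], chunk[a]
        out.extend(chunk)
    return "".join(out)


def z_score(score, random_scores):
    """z = (score - average) / standard deviation of the random scores."""
    mean = statistics.mean(random_scores)
    sd = statistics.pstdev(random_scores)
    return (score - mean) / sd


rng = random.Random(0)          # fixed seed so the run is repeatable
print(shuffle_windows("ACGTACGTACGTTTGGCCAA", rng=rng))
print(z_score(10, [4, 6]))
```

In use, the random scores would come from aligning one sequence against many shuffled copies of the other; the standard deviation must be non-zero for the z-score to be defined.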
