
Space-Efficient Sequence Alignment Bioinformatics 202 University of California, San Diego

Space-Efficient Sequence Alignment Bioinformatics 202 University of California, San Diego. Lecture Notes No. 7 Dr. Pavel Pevzner (prepared by Iman Famili). Outline. New computational ideas for sequence comparison: Divide-and-conquer technique Recursive programs Hash tables. Edit Graphs.





Presentation Transcript


  1. Space-Efficient Sequence Alignment. Bioinformatics 202, University of California, San Diego. Lecture Notes No. 7. Dr. Pavel Pevzner (prepared by Iman Famili)

  2. Outline
     New computational ideas for sequence comparison:
     • Divide-and-conquer technique
     • Recursive programs
     • Hash tables

  3. Edit Graphs
     • An edit graph is used to find similarities between two sequences.
     • Every alignment corresponds to a path from the source to the sink, so finding the best alignment is a longest path problem.
     • There are 3 types of edges in the edit graph: horizontal (H), diagonal (D), and vertical (V), corresponding to insertion (I), match/mismatch (M), and deletion (D), respectively.
     • Every edge of the edit graph (i.e. every movement) has a weight corresponding to the penalty or premium for that action.
     • The best path is the path with the maximum length (total edge weight).
     [Figure: Edit Graph for the sequences TGCATA and ATCTGAT, running from source to sink, with edges marked as deletions, insertions, mismatches, and matches.]
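The longest-path view of alignment can be sketched in Python as follows. This is a minimal illustration, not the lecture's own code: the function name `align` and the edge weights (match +1, mismatch -1, insertion/deletion -1) are assumptions chosen for the example. It fills the full matrix of longest-path scores and then backtracks from the sink to the source.

```python
def align(v, w, match=1, mismatch=-1, indel=-1):
    """Longest path in the edit graph of v (rows) and w (columns).

    The weights are illustrative assumptions, not values from the notes.
    Returns (score, aligned_v, aligned_w).
    """
    n, m = len(v), len(w)
    # s[i][j] = length of the longest path from the source (0,0) to (i,j)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s[i][0] = s[i - 1][0] + indel          # vertical edges only
    for j in range(1, m + 1):
        s[0][j] = s[0][j - 1] + indel          # horizontal edges only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = s[i - 1][j - 1] + (match if v[i - 1] == w[j - 1] else mismatch)
            s[i][j] = max(diag, s[i - 1][j] + indel, s[i][j - 1] + indel)

    # Backtrack from the sink (n,m) to the source, one edge at a time.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and s[i][j] == s[i - 1][j - 1] + (
            match if v[i - 1] == w[j - 1] else mismatch
        ):
            top.append(v[i - 1]); bottom.append(w[j - 1]); i -= 1; j -= 1
        elif i > 0 and s[i][j] == s[i - 1][j] + indel:
            top.append(v[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(w[j - 1]); j -= 1
    return s[n][m], "".join(reversed(top)), "".join(reversed(bottom))
```

Because the whole matrix `s` is kept for backtracking, this version uses memory proportional to the number of vertices in the edit graph.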

  4. Computational Complexity of Dynamic Programming
     Sequence alignment is limited by:
     • Time:
       • Four operations are needed at each vertex.
       • The required time is proportional to the number of edges in the edit graph, i.e. O(nm), where n and m are the sequence lengths.
     • Space:
       • The required memory is proportional to the number of vertices in the edit graph, O(nm).

  5. Computational Complexity of Dynamic Programming
     • To compute the score of an alignment, we only need to keep 2 columns in memory at any time. This works because the score of each box in the dynamic programming (DP) matrix depends only on three previously calculated boxes. Therefore only linear memory is required to compute the score.
     • To calculate the alignment itself (backtracking through the matrix), however, quadratic memory, O(nm), is needed, since all the scores are needed to trace back the best alignment.
     [Figure: only 2 columns are needed to determine the score of each box (forward calculation); all columns are needed for calculating the best alignment (backtracking).]
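The two-column idea above can be sketched like this. The function name `alignment_score` and the weights (match +1, mismatch -1, indel -1) are assumptions for illustration only; the point is that just the previous and the current column are held in memory.

```python
def alignment_score(v, w, match=1, mismatch=-1, indel=-1):
    """Score of the best alignment of v (rows) and w (columns).

    Keeps only two columns of the DP matrix at a time; the weights are
    illustrative assumptions, not values from the notes.
    """
    n = len(v)
    prev = [i * indel for i in range(n + 1)]      # column 0
    for j in range(1, len(w) + 1):
        cur = [prev[0] + indel] + [0] * n          # top box of column j
        for i in range(1, n + 1):
            diag = prev[i - 1] + (match if v[i - 1] == w[j - 1] else mismatch)
            cur[i] = max(diag, prev[i] + indel, cur[i - 1] + indel)
        prev = cur                                 # column j-1 is no longer needed
    return prev[n]
```

This returns only the score. The path cannot be recovered from two columns alone, which is the problem the space-efficient method addresses.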

  6. Space-Efficient Sequence Alignment
     To reduce the space complexity of sequence alignment:
     • Find the middle vertex between the source and the sink: for every row i, compute the score s(i, m/2) of the longest path from (0,0) to (i, m/2) and the score s_reverse(i, m/2) of the longest path from (i, m/2) to (n,m). The middle vertex is the vertex (i, m/2) with the largest sum of the two scores.
     • Repeat this process recursively on the two smaller rectangles, from the source to the middle vertex and from the middle vertex to the sink.
     [Figure: the n × m rectangle from the source (0,0) to the sink (n,m) is split at column m/2 through the middle vertex; each resulting rectangle is split again through its own middle vertex, and so on.]
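The middle-vertex step above can be sketched as follows, under assumed weights (match +1, mismatch -1, indel -1). The helper names `last_column` and `middle_vertex` are hypothetical. The reverse scores are obtained by running the same forward computation on the reversed sequences.

```python
MATCH, MISMATCH, INDEL = 1, -1, -1   # assumed weights, not from the notes


def last_column(v, w):
    """Longest-path scores from (0,0) to every vertex (i, len(w)), in linear space."""
    prev = [i * INDEL for i in range(len(v) + 1)]
    for ch in w:
        cur = [prev[0] + INDEL]
        for i in range(1, len(v) + 1):
            diag = prev[i - 1] + (MATCH if v[i - 1] == ch else MISMATCH)
            cur.append(max(diag, prev[i] + INDEL, cur[i - 1] + INDEL))
        prev = cur
    return prev


def middle_vertex(v, w):
    """Return (i, m//2, score): the row where the best path crosses column m//2."""
    n, mid = len(v), len(w) // 2
    fwd = last_column(v, w[:mid])                 # source -> (i, m/2)
    rev = last_column(v[::-1], w[mid:][::-1])     # (i, m/2) -> sink, computed backwards
    totals = [fwd[i] + rev[n - i] for i in range(n + 1)]
    i = max(range(n + 1), key=totals.__getitem__)
    return i, mid, totals[i]
```

The third returned value is the score of the best path through the middle column, which equals the score of the best alignment overall.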

  7. Space-Efficient Sequence Alignment
     • The computing time is proportional to the area of the rectangles. The total time to find all the middle vertices is therefore: area + area/2 + area/4 + … ≤ 2*area
     • The space complexity is linear, O(n).
     • Pseudocode for this algorithm is:
         Path(source, sink)
           if source and sink are in consecutive columns
             output the longest path from the source to the sink
           else
             middle ← middle vertex between source and sink
             Path(source, middle)
             Path(middle, sink)
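A runnable sketch of the Path recursion might look like the following. It is one possible reading of the pseudocode, not the lecture's implementation: the weights, the helper names (`sub`, `last_column`, `score`), and the choice to return the two aligned strings instead of printing the path are all assumptions. The base case handles a rectangle that is zero or one column wide.

```python
MATCH, MISMATCH, INDEL = 1, -1, -1   # assumed weights, not from the notes


def sub(a, b):
    """Weight of a diagonal edge."""
    return MATCH if a == b else MISMATCH


def last_column(v, w):
    """Longest-path scores from (0,0) to every vertex (i, len(w)), in linear space."""
    prev = [i * INDEL for i in range(len(v) + 1)]
    for ch in w:
        cur = [prev[0] + INDEL]
        for i in range(1, len(v) + 1):
            cur.append(max(prev[i - 1] + sub(v[i - 1], ch),
                           prev[i] + INDEL, cur[i - 1] + INDEL))
        prev = cur
    return prev


def path(v, w):
    """Best alignment of v and w as two gapped strings."""
    n, m = len(v), len(w)
    if m == 0:                                    # no columns: only vertical edges
        return v, "-" * n
    if m == 1:                                    # source and sink in consecutive columns
        best = max(range(n), key=lambda i: sub(v[i], w), default=None)
        if best is not None and sub(v[best], w) >= 2 * INDEL:
            return v, "-" * best + w + "-" * (n - 1 - best)
        return v + "-", "-" * n + w               # diagonal edge not worth taking
    mid = m // 2
    fwd = last_column(v, w[:mid])
    rev = last_column(v[::-1], w[mid:][::-1])
    i = max(range(n + 1), key=lambda r: fwd[r] + rev[n - r])   # middle vertex row
    left = path(v[:i], w[:mid])                   # Path(source, middle)
    right = path(v[i:], w[mid:])                  # Path(middle, sink)
    return left[0] + right[0], left[1] + right[1]


def score(top, bottom):
    """Score of a gapped alignment under the assumed weights."""
    return sum(INDEL if "-" in (a, b) else sub(a, b) for a, b in zip(top, bottom))
```

For example, `path("ATCTGAT", "TGCATA")` returns two gapped strings of equal length whose `score` matches the last entry of `last_column("ATCTGAT", "TGCATA")`. String slicing is used for brevity; passing index ranges instead would avoid the copies.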

  8. String Matching: naïve approach
     Let's say we want to compare a sequence of length l=10 against a database of length, for example, n=10^9, and we want to find exact occurrences of the l=10 sequence in the database. We can:
     • Move the l-length sequence along the database one base at a time and compare at every position (this takes a long time).
     [Figure: a short sequence (l=10) sliding along a long database (n=10^9); each position corresponds to a diagonal in the alignment of the two sequences.]
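A minimal sketch of the naïve scan, assuming a hypothetical function name `naive_find`: every one of the roughly n start positions is compared against the pattern, so the work grows with n times l.

```python
def naive_find(pattern, text):
    """Return every start position where pattern occurs exactly in text."""
    l = len(pattern)
    hits = []
    for i in range(len(text) - l + 1):      # slide one base at a time
        if text[i:i + l] == pattern:        # compare up to l characters
            hits.append(i)
    return hits
```

For instance, `naive_find("ATA", "GATATATA")` returns `[1, 3, 5]`.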

  9. String Matching: hashing
     • Create a hash table of all l-length strings that occur in the database, and search your l-length string against the hash table.
     [Figure: Hash Table]
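The hashing idea can be sketched with a Python dictionary that maps each l-length substring of the database to the positions where it occurs. The name `build_index` is an assumption made for this example.

```python
from collections import defaultdict


def build_index(text, l):
    """Hash table from every l-length substring of text to its start positions."""
    index = defaultdict(list)
    for i in range(len(text) - l + 1):
        index[text[i:i + l]].append(i)
    return index
```

Once the table is built, `build_index("GATATATA", 3).get("ATA", [])` gives `[1, 3, 5]` with a single lookup instead of a scan over the whole database. The cost is the memory needed to store the table.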

  10. Approximate String Matching
     • If instead of l=10 we have l=1000, we can apply the same method by dividing the l-length sequence into overlapping strings 10 bases long and combining the resulting matches.
     • String matching in this fashion may be done using filtration/verification algorithms, which are described next.
     [Figure: a long sequence divided into overlapping 10-base strings, each matched against the database.]

  11. Filtration/Verification Method
     • Let's say we want to find a string in a database with up to 2 mismatches, or in general, compare a text t1…tn with a query q1…qp allowing up to k mismatches.
     • The query matching problem is to find all m-substrings of the query and the text that match with at most k mismatches. Filtration/verification algorithms are used to perform this task.
     • Filtration/verification algorithms involve a two-stage process. First, a set of positions in the text that are potentially similar to the query is preselected. Second, each potential position is verified: it is accepted if at most k mismatches are found and rejected if more than k mismatches are found.
     [Figure: from a potential match, walk in both directions while the number of mismatches is ≤ k.]

  12. Filtration/Verification Method
     • The filtration algorithm is done in 2 steps:
       • Potential match detection: Find all matches of l-tuples in both the query and the text for l = ⌊m/(k+1)⌋ (such matches are sparse, i.e. they happen rarely).
       • Potential match verification: Verify each potential match by extending it to the left and to the right until either (i) the first k+1 mismatches are found or (ii) the beginning or end of the query or the text is reached.
     • This is the idea behind BLAST and FASTA.
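The two steps can be sketched as below. The function names are hypothetical, and the verification step is deliberately simplified: instead of extending each seed left and right, it checks every m-window that contains the seed and counts mismatches directly. The filtration step relies on the fact that two m-strings with at most k mismatches must share an exact l-tuple at the same offset for l = ⌊m/(k+1)⌋.

```python
from collections import defaultdict


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def query_matches(query, text, m, k):
    """All (i, j) with query[i:i+m] and text[j:j+m] differing in at most k positions."""
    l = m // (k + 1)
    if l == 0:
        raise ValueError("m must be at least k + 1 for the filter to apply")

    # Filtration: hash every l-tuple of the text, then look up the query's l-tuples.
    index = defaultdict(list)
    for b in range(len(text) - l + 1):
        index[text[b:b + l]].append(b)

    hits = set()
    for a in range(len(query) - l + 1):
        for b in index.get(query[a:a + l], ()):
            # Verification (simplified): try each m-window containing this seed.
            for shift in range(m - l + 1):
                i, j = a - shift, b - shift
                if i < 0 or j < 0 or i + m > len(query) or j + m > len(text):
                    continue
                if (i, j) not in hits and hamming(query[i:i + m], text[j:j + m]) <= k:
                    hits.add((i, j))
    return sorted(hits)
```

With `query_matches("ACGTACGT", "TTACGAACGTTT", m=8, k=1)` the only hit is `(0, 2)`: the text window `ACGAACGT` differs from the query in one position.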
