1 / 1

Problem: Max Blast Sequence

BLAST: Basic Local Alignment Search Tool. Query is compiled to form a list of length w substrings called w-mers. Search database for “hits”. A list of matches shows up. Then search for an exact match between any substring on the w-mers list and the database sequence.

jacoba
Télécharger la présentation

Problem: Max Blast Sequence

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. BLAST: Basic Local Alignment Search Tool Query is compiled to form a list of length w substrings called w-mers. Search database for “hits”. A list of matches shows up. Then search for an exact match between any substring on the w-mers list and the database sequence. If calculated score is less than Sg, then discard this database sequence from list of “hits.” If calculated score is equal to or greater than Sg, keep this sequence in the list of “hits.” Each substring is extended locally in both directions until the score of the substring no longer improves. Find alignment (either gapped or ungapped) with the max score. For every matched pair of nucleotides, add 5 to the total score. For every mismatched pair of nucleotides, called a “gap”, subtract 4 from the total score. Is the max alignment score statistically significant? • If yes, database sequence will be displayed as output. If no, discard database sequence. Select a moderate score Sg to which the calculated score is compared. Sg indicates if there are too many gaps to make the sequence a likely match to the substring and query sequences. Meng Cao, Arthur Lee, Peiying Li, Matt Prorok, Owen Astrachan Genome Revolution Focus, Duke University 2007 Shortcomings Introduction Flowchart of Steps APT BLAST, which stands for basic local alignment search tool, is a heuristic algorithm that is used to find similar sequences of amino acids or nucleotides. BLAST approximates the dynamic programming algorithm more directly than its predecessors. Dynamic programming is an optimization process for solving a problem. In this approach, the user finds the best decision for a subproblem and bases that decision from the best decision from the previous subproblem. Unfortunately, this method would have exceptional computational requirements, thus the use of the heuristic algorithms.       In order to find similar sequences, BLAST first finds the highest scoring pair of sequences , or the maximal segment pair (MSP), and in one version of BLAST, matching DNA nucleotides are given a score of +5, while mismatches are assigned -4 (as outlined in the flowchart). In order to more directly approximate dynamic programming, BLAST chooses the boundaries of the MSP to maximize its score by either extending or shortening the two segments that are compared. Because molecular biologists are more likely to be interested in all the conserved regions, not just the most conserved region, BLAST returns all MSP’s that score above a cutoff. While BLAST may be faster than the dynamic programming algorithm, it is a heuristic algorithm, and because it sacrifices accuracy for speed, BLAST can sometimes make mistakes. Left: Table of some of the v versions of BLAST (see Reference 4). Below: Screenshot of BLAST homepage (see Reference 3). Problem: Max Blast Sequence Problem Statement: Given a string named strand of DNA and a database represented as an array of strings, return the element from the database that has the highest BLAST score compared to strand. In this problem all strings in the database will have the same length as strand; BLAST assigns +5 when calculating a match score when nucleotides from each strand with the same index match and -4 if the nucleotides with the same index in the two strands do not match. If there is a tie, return the string that occurs first in the database. Definition: Class: MaxBlast Method: max Parameters: String strand, String [] datab Returns: String Method Signature: public String max(String strand, String[] datab) Class: public class MaxBlast { public String max(String strand, String[] datab) // fill in code here } } Constraints: - Every string in datab is the same length as strand - All strings contain at most 50 characters, each character is either 'g', 'a', 't', or 'c'. - datab contains at most 50 strings. Examples: 1) strands = {“aaa”, “ggg”, “ccc”, “ttt”} string = “ggg” Returns “ggg” because a string from the database that's the same has the maximal score possible. 2) strands = {“agtc”, “agtt”, “agta”, “aggg”} string = “agtg” Returns “agtc” because each string from datab has a score of 5+5+5-4 = 11, but the string "agtc" occurs first. More information on this APT can be found at http://www.cs.duke.edu/csed/algoprobs/4g07blast2.html Though BLAST represents a huge advancement in the ability to compare DNA, it is not without its shortcomings. The basic premise behind the algorithm is that it searches for segments of DNA that are likely to be the most similar, rather than comparing each individual section with every other one. This innovation increases the speed with which DNA can be searched, but is not perfect. It is possible that BLAST will return data that is off. This result has been shown empirically by using BLAST to analyze gene sequences. Koski and Golding report that, in E. coli, in 27% of cases, BLAST returned hits that were not from E. coli’s nearest phylogenetic neighbor, with 7% of cases returning a hit from a different domain of life. However, as BLAST is refined and its DNA database becomes larger, the accuracy should improve. Nonetheless, it is important to emphasize that the closest BLAST hit is based on the computer algorithm and thus, merely implies biological similarity. Conclusion BLAST is currently one of the most popular bioinformatics search programs. The algorithm’s major emphasis on speed appeals greatly to many researchers who are aiming to solve complex problems. BLAST also supplies statistical significance and other analytical techniques involved in computer science. Biological problems that BLAST can help answer deal with DNA and protein sequences. A researcher can use a BLAST search to find and compare gene or protein sequences between organisms and look for similarities. Similar sequences can then be used to describe biological relationships and to give further insight on how systems work. References • Altschul, S. F., Gish, W., Miller, W., Myers, E. W., & Lipman, D. J. (1990). Journal of Molecular Biology, 215, 403-410. • Korf, I., Yandell, M., & Bedell, J. (2003). BLAST. Cambridge: O’Reilly. • National Center for Biotechnology Information. (n.d.). BLAST: Basic Local Alignment Search Tool. Retrieved October 24, 2007, from http://www.ncbi.nlm.nih.gov/BLAST/. • Sotiriades, E., & Dollas, A. (2007). A General Reconfigurable Architecture for the BLAST Algorithm. Journal of VLSI Signal Processing, 48, 189–208. Retrieved October 16, 2007, from http://www.springerlink.com/content/p2175ql615589u22/fulltext.pdf. • University of Texas at El Paso. (n.d.). Basic Local Alignment Search Tool (BLAST). Retrieved October 15, 2007 from http://www.math.utep.edu/Faculty/mleung/teaching/m4370s04/slides/blast.pdf. • Contact information can be found at http://www.duke.edu/~mc136/contact.txt

More Related