Introduction

APT Solution

    public class BLASTStageOne {
        public String stageOne(String query, String[] library, int w) {
            int topResemblance = -1; // best match count so far; -1 so the first strand is always considered
            String bestResemblance = "";
            for (int i = 0; i < library.length; i++) { // explore the entire array
                int counter = 0;
                String current = library[i];
                int charCycle = 0;
                while (charCycle <= query.length() - w) { // cycle through every word start in the query
                    char trigger = query.charAt(charCycle);
                    int stringCycle = 0;
                    while (stringCycle <= current.length() - w) { // search for characters that match the trigger
                        // compare segments of length w from both points with equals(), not ==;
                        // on a match, increase the counter and skip past the matched segment
                        if (current.charAt(stringCycle) == trigger
                                && current.substring(stringCycle, stringCycle + w)
                                        .equals(query.substring(charCycle, charCycle + w))) {
                            counter++;
                            stringCycle += w;
                        } else {
                            stringCycle++;
                        }
                    }
                    charCycle++;
                }
                // more matches wins; on a tie, the shorter strand wins
                if (counter > topResemblance
                        || (counter == topResemblance && current.length() < bestResemblance.length())) {
                    topResemblance = counter;
                    bestResemblance = current;
                }
            }
            return bestResemblance;
        }
    }

Understanding BLAST today and its implications in the future.

Figure 1.

Jeff Shen, Morgan Kearse, Jeff Shi and Yang Ding
Department of Alspaugh, Duke University, Durham, North Carolina 2007

Conclusions

Introduction

BLAST (Basic Local Alignment Search Tool) is a widely used bioinformatics search tool for comparing DNA samples for similarities. Researchers can use it to compare their own DNA samples against the DNA and protein sequences held in various gene banks and libraries. BLAST takes a heuristic approach to comparing sequences, which dramatically increases search speed: the program scans at approximately 2 x 10^6 bases/s. This increase in speed has had a lasting impact on the fields of bioinformatics and computer science. Searches that once took days to finish can now be done in mere seconds.
Application

Because of its speed, BLAST has become a very popular bioinformatics search tool. BLAST has been cited in over twenty thousand scientific journal articles whose authors used it to compare DNA sequences or whole genomes for similarities. For example, a study published in Genome Research (Cold Spring Harbor Laboratory Press) used BLASTZ, an enhanced version of BLAST, to measure the similarity between the human genome and the mouse genome, and concluded that 39.154% of the human sequence aligned to mouse sequence. Other organisms, such as Drosophila (fruit fly), have also been compared with the human genome using BLAST. BLAST is equally prevalent in protein studies: researchers can use the program to find similarities between protein sequences as well as DNA sequences. BLAST has become an indispensable bioinformatics tool in the fields of biology, engineering, and biochemistry.

To align 2.8 Gb of human sequence versus 2.5 Gb of mouse sequence took 481 days of CPU time and a half day of wall clock time on a cluster of 1024 833-MHz Pentium III CPUs. This produced 9 Gbytes of output in a relatively space-efficient format that describes the alignments by coordinates within the sequences. These are translated to a textual representation, called axt, which includes the actual bases. Although axt files are large, for many post-processing steps the improved locality of reference (avoiding the need to retrieve parts of multigigabyte datasets) is a clear necessity. The initial axt files were 20 Gbytes, but running axtBest reduced them to 2.5 Gbytes. Only 3.3% of the human genome is covered by multiple alignments (assuming proper masking of interspersed repeats and low-complexity regions), but some of these places, particularly on chromosome 19, are covered to a great depth. Results of some whole-genome runs that measured coverage by outer alignments (Step 2 of Figure 1) are given in Table 1.
Placing a lower bound of 3000 on scores for gapped alignments (which should eliminate no outer alignments), 39.154% of the human sequence aligned to mouse, and only 0.164% aligned to reversed mouse. This confirms the high specificity of our approach even before axtBest is applied. Imposing the requirement that gapped alignments score at least 5000 reduced coverage by only 0.221% of human, but halved coverage by bogus alignments from 0.164% to 0.075%. Requiring a score of 10,000 and keeping only regions that align to just one place in the mouse genome, we still align 36.831% of human, whereas only 0.007% aligns to reversed mouse. Of course, for some applications, for example, exploring gene duplications, that strategy for attaining extremely high specificity would throw out the baby with the bath water.

Figure 2. Figure 3. Figure 4.

Literature cited

http://www.acm.org/crossroads/wikifiles/13-1-CE/13-1-11-CE.html
http://www.mrc-lmb.cam.ac.uk/genomes/madanm/articles/antim.htm
http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Alignment_Scores2.html
http://www.goroadachi.com/etemenanki/blog-05feb.htm
Worley, Kim C., Brent A. Wiese, and Randall F. Smith. "BEAUTY: An Enhanced BLAST-based Search Tool that Integrates Multiple Biological Information Resources into Sequence Similarity Search Results." Genome Research 1995 173-184. 10/15/2007.
Altschul, Stephen F., et al. "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs." Oxford Journals 1997 3389-3402. 10/15/2007.
McGinnis, Scott and Thomas L. Madden. "BLAST: at the core of a powerful and diverse set of sequence analysis tools." Oxford Journals 2004 20-25. 10/15/2007.

Method

BLAST uses a fast heuristic algorithm for sequence alignment, driven by a word length W supplied by the user. W is the number of consecutive matching base pairs that BLAST requires before it treats a region as a candidate match.
The user gives BLAST a target (query) sequence of variable length (generally a few thousand, or perhaps even a million, base pairs), and BLAST searches a sequence database (of up to a few billion base pairs) for matches of length W or more within each database sequence. Because it uses a heuristic approach, BLAST is fast but may miss some matches; that accuracy is sacrificed for speed.

There are three phases in any BLAST search (and thus in any sequence comparison). The first phase looks for and scores exact matches of length W or more between regions of the target sequence and members of the sequence database. These exact matches are called "seeds"; by default, W is 11 for nucleotide searches. In the second phase, BLAST takes these seeds and their scores and extends each match in both directions. Here, insertions and deletions are ignored and do not count towards the cumulative score, which is tallied up and used in the final phase. In the final phase, if the first two phases found high-scoring alignments, BLAST switches to a version of the more accurate Smith-Waterman algorithm. This algorithm compares segments of all possible lengths and returns an optimized measurement of similarity. Finally, the results deemed statistically significant are returned to the user.

Explaining the APT

APT Problem Statement

You are writing code to find which of several DNA strands in a DNA library is most similar to your query strand. The process is very similar to the first phase of the BLAST algorithm, in that it must search for exact matches of a small fixed length W between the query and the sequences in the library. You must find the strand in the library that best matches your query; if two strands have the same level of similarity, return the shorter strand.
Definition

Class: BLASTStageOne
Method: stageOne
Parameters: String query, String[] library, int w
Returns: String
Method signature: String stageOne(String query, String[] library, int w)

Class

    public class BLASTStageOne {
        public String stageOne(String query, String[] library, int w) {
            // fill in code here
        }
    }

BLAST in the future…

Companies such as Korilog have built software (KoriBLAST) that combines BLAST with other programs to help labs and researchers with data integration, visualization, and management. Their goal is to provide state-of-the-art graphical environments for quick and easy research. KoriBLAST aims to make BLAST more useful by supporting sequence data mining.

Figure 4.
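The APT covers only the first phase described in the Method section. As a rough illustration of the other two phases, the sketch below extends a seed in both directions without gaps (phase two) and computes a Smith-Waterman local alignment score (phase three). Everything named here is an assumption made for illustration and does not come from the poster or from the real BLAST code: the class BlastPhasesSketch, the methods extendUngapped and smithWatermanScore, the match/mismatch/gap values, and the dropOff stopping rule. Real BLAST uses scoring matrices and statistical cutoffs that this sketch leaves out.

```java
public class BlastPhasesSketch {

    // Hypothetical scoring values chosen for illustration only.
    static final int MATCH = 1;
    static final int MISMATCH = -1;
    static final int GAP = -2;

    /**
     * Phase two sketch: extend an exact seed of length w in both directions
     * without gaps. qStart and sStart are the seed's start positions in the
     * query and the subject. Returns the seed score plus the best score
     * gained on each side.
     */
    public static int extendUngapped(String query, String subject,
                                     int qStart, int sStart, int w, int dropOff) {
        int seedScore = w * MATCH; // the seed itself is an exact match
        int right = extendOneWay(query, subject, qStart + w, sStart + w, 1, dropOff);
        int left = extendOneWay(query, subject, qStart - 1, sStart - 1, -1, dropOff);
        return seedScore + right + left;
    }

    // Walks in one direction (step = +1 or -1) and returns the best running
    // score; stops once the running score falls more than dropOff below it.
    private static int extendOneWay(String query, String subject,
                                    int q, int s, int step, int dropOff) {
        int running = 0;
        int best = 0;
        while (q >= 0 && s >= 0 && q < query.length() && s < subject.length()) {
            running += (query.charAt(q) == subject.charAt(s)) ? MATCH : MISMATCH;
            if (running > best) {
                best = running;
            }
            if (best - running > dropOff) {
                break;
            }
            q += step;
            s += step;
        }
        return best;
    }

    /**
     * Phase three sketch: Smith-Waterman local alignment score with a linear
     * gap penalty. Only the best score is returned, not the alignment itself.
     */
    public static int smithWatermanScore(String a, String b) {
        int[][] h = new int[a.length() + 1][b.length() + 1]; // row 0 and column 0 stay 0
        int best = 0;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int diag = h[i - 1][j - 1]
                        + (a.charAt(i - 1) == b.charAt(j - 1) ? MATCH : MISMATCH);
                int up = h[i - 1][j] + GAP;
                int left = h[i][j - 1] + GAP;
                h[i][j] = Math.max(0, Math.max(diag, Math.max(up, left)));
                best = Math.max(best, h[i][j]);
            }
        }
        return best;
    }
}
```

With these assumed scores, calling extendUngapped("ACGTACGT", "ACGTACGT", 2, 2, 4, 2) extends the 4-base seed across the whole 8-base strand, while smithWatermanScore("AAAA", "TTTT") finds no local alignment and returns 0. The unused GAP penalty in phase two reflects the Method description: insertions and deletions are ignored until the final phase.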
