Genome Resequencing with Short Reads

Mentor: Ryan Cunningham Mentees: Aaron Luo, Naveen Arivazhagan • Motivation • Our Method • Human genome is about 3 billion base pairs long. Sequencing recovers little bits of this genome from random locations. • There are errors in reading: There can be caps, deletions, or insertions that don’t belong. • De novo sequencing projects seek to assemble the genome of an organism from scratch. • Resequencing projects use a reference genome as a template to map the sequencing data. • Traditional Sequencing with long reads may cost billions or trillions of dollars and months to finish. • Shorter reads will be cheaper (~150 million) and faster, albeit lower accuracy. • Most current resequencing techniques are not sensitive enough to find correlations between distantly related genomes. • We are developing a tool that can profile short reads quickly, and find distant correlations by being more sensitive while not losing out on accuracy. • We then use a nondeterministic finite automaton (NFA) to identify how closely the read matches a substring of the template genome. • Unlikely reads are eliminated quickly with an initial filter which we implement through an indexing structure • K-Mer Index • When we get a read we look up the first word of the read in the inverted index. • Since we are trying to be sensitive we need to allow for slight mismatches between the given read and the words found in the MSA. • So we need to find all other words that are close to the search word so that we can look them up in the index. • After identifying all the words that are very similar to the read we pass the set of alignment columns corresponding to these words to the NFA. • The Index can be looked up very quickly as the short words can be directly translated into integers and act as the index number in the word-index. • We identify all of the words in the MSA and create an inverted index for them. • Each word maps to the set of alignment columns in the MSA containing that word, this is implemented by a rolling hash. • NFA to Perform and Score Alignment • This is an example of how mismatches are handled. • The NFA is capable of matching strings containing all possible variants of the input string, but is unable to make sense of simple insertions, deletions, or substitutions within any given input string. • We compensate for this by taking the Cartesian product of the original NFA with the number of mismatches encountered. • For each state and transition in the original NFA, a corresponding state or transition is added to the modified NFA. • Additional transitions are added for additions, mismatches, and deletions. • This is an example NFA that is used in alignment. • Given an input string with language ∑ = {G, A, T, C}, each state corresponds to a column in the MSA. • Two nodes are considered adjacent and connected by a transition of a non-gap character. • Each transition can be paired with a score, and any given input string may be scored based on the likelihood of the path it took. • The most optimal alignment is simply finding the path that produces the closest match. As transitions are taken, the most optimal transitions and highest score can be tracked. Genome Resequencing with Short Reads • Acknowledgements/References • Future Work • Optimize the “seeding” and “extending” code • This is still a test model, we might come up with an improved way of building the index • The NFA can capture more sophisticated types of structural rearrangement of DNA • Since speed is the priority we will eventually have to multi-thread it • Compare it to other available tools using simulated data • There are animals already that require our technology to sequence • Mentor - Ryan Cunningham through the PURE Program, Funded by Rockwell Collins. • Gonzalo Navarro. A guided tour to approximate string matching. ACM computing surveys (CSUR), 33(1):31{88, 2001. • D. E. Knuth, J. H. Moms, and V. R. Pratt, "Fast Pattern Matching in Strings," SIAM J. Computing 6, 323-350 (1977) • M. 0. Rabin, “Probabilistic Algorithm for Testing Primality,” J Number Theor. 12, 128-138 (1980). • Gonzalo Navarro and Ricardo Baeza-Yates. A hybrid indexing method for approximate string matching. Journal of Discrete Algorithms, 1(1):205-239,2000. • Michael Sipser. Introduction to the Theory of Computation, volume 27. Thomson Course Technology Boston, MA, 2006.

Genome Resequencing with Short Reads

Genome Resequencing with Short Reads

Presentation Transcript

Ab Initio Whole Genome Shotgun Assembly With Mated Short Reads

De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer

Groundbreaking Reads

Souhegan Reads!

Good Reads

Resequencing Genome

DNA Sequencing with Longer Reads

Gene-Boosted Assembly of a Novel Bacterial Genome from Very Short Reads

Good Reads

HUGO: Hierarchical mUlti -reference Genome cOmpression tool for aligned short reads

Detecting Copy Number Variation With Short Paired Reads

Raw reads

Winter Reads

Arlington Reads

Linebacker Reads

Affymetrix Resequencing Arrays

Genome Resequencing analysis

Quick Reads

Lancashire Reads

San Diego firm offers Whole Genome Resequencing services

De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer

Holy Reads