Rapid Alignment of Short Sequences to Large Databases

Presentation – Homework 2 • Advanced Topics: Current Bioinformatics • Instructor: Dr. Jianhua Ruan • Group Members: Jamiul Jahid, Mohammad Iftekharul Islam, Tanzir Musabbir

NGS Analysis Papers • PatMaN: Rapid Alignment of Short Sequences to Large Databases • Kay Prüfer, Udo Stenzel, Michael Dannemann, Richard Green, Michael Lachmann • ProbeMatch: Rapid Alignment of Oligonucleotides to Genome Allowing Both Gaps and Mismatches • You Kim, Nikhil Teletia, Victor Ruotti, Maher, James Thomson and Jignesh Patel

PatMaN: Rapid Alignment of Short Sequences to Large Databases • PatMaN – Pattern Matching in Nucleotide Databases • A tool for performing exhaustive searches to identify all occurrences of a large number of short sequences within a genome-sized database. • Reads sequences in FASTA format and reports all hits within the given edit-distance cutoff. • Advantages: • Allows a predefined number of gaps and mismatches • Ambiguity codes can be searched • Search time is short for perfect matches

ProbeMatch: Rapid Alignment of Oligonucleotides to Genome Allowing Both Gaps and Mismatches • Matches a large set of oligonucleotide sequences against a genome database using gapped alignments • Advantages: • It generates both ungapped and gapped alignments • It allows up to three errors, including insertions, deletions, and mismatches • It is able to detect multiple classes of mutations: SNVs and indels.

ProbeMatch: Background • High-throughput DNA sequencing technologies: Illumina, 454 Life Sciences • A large set of short sequences is produced • These must be mapped to a genome, allowing for only a few errors • Traditional sequence alignment tools can do this, but are computationally impractical

ProbeMatch: Background • ELAND (Efficient Local Alignment of Nucleotide Data) • Searches DNA databases for a large number of short sequences • Only ungapped alignments allowing up to two mismatches • MAQ (Mapping and Assembly with Quality) • Only ungapped alignments allowing up to three mismatches • Measures the error probability of alignments using sequence quality information • SOAP • SeqMap

ProbeMatch: Background • These programs are often faster than BLAST by an order of magnitude or more • But they usually map only 60-80% of the query sequences to genomes • The unmapped sequences need further processing with a computationally expensive but sensitive alignment method • So the overall gain is limited • ProbeMatch addresses this challenge

ProbeMatch: Rapid Alignment of Oligonucleotides to Genome Allowing Both Gaps and Mismatches • Allows a richer match model • Finds gapped and ungapped alignments with up to three errors in any combination • Able to detect multiple classes of mutations

ProbeMatch: Methodology • Takes as input a query sequence set and a database of sequences. • The database is divided into small segments. • ProbeMatch loads each segment and builds a q-gram index. • To find potential hits, ProbeMatch searches the queries against the q-gram index and extends the hits to find longer alignments.
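The index-and-lookup step can be illustrated with a minimal sketch. It assumes a plain (ungapped) q-gram index held in a Python dictionary; the function names, the toy segment, and q = 4 are illustrative choices and are not taken from ProbeMatch itself.

```python
from collections import defaultdict

def build_qgram_index(segment, q):
    """Map every length-q substring of `segment` to its start positions."""
    index = defaultdict(list)
    for i in range(len(segment) - q + 1):
        index[segment[i:i + q]].append(i)
    return index

def candidate_hits(query, index, q):
    """Return (segment_pos, query_offset) seeds where a q-gram of the query occurs."""
    seeds = []
    for offset in range(len(query) - q + 1):
        for pos in index.get(query[offset:offset + q], []):
            seeds.append((pos, offset))
    return seeds

segment = "ACGTACGTTGCA"          # toy database segment (hypothetical)
index = build_qgram_index(segment, q=4)
seeds = candidate_hits("TACGTT", index, q=4)
```

Seeds that share the same value of `segment_pos - query_offset` lie on one diagonal, which is where an extension step would try to grow a longer alignment.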

ProbeMatch: Methodology • If two sequences, Q and T, match within k errors, and j non-overlapping fragments are taken from Q, then T contains at least one of the fragments with at most ⌊k/j⌋ errors. • The matched hits are then extended to check whether the entire query sequence and the target sequence can be aligned within k errors. • A gapped q-gram index (“Better Filtering with Gapped q-Grams”, Burkhardt and Kärkkäinen, 2002) provides more efficient filtering than an ungapped q-gram index.
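A minimal sketch of this filter-then-verify idea follows. It assumes the special case j = k + 1, so that ⌊k/j⌋ = 0 and at least one fragment must occur exactly; candidates are then verified with a standard semi-global edit-distance computation. The function names and the use of `str.find` in place of a q-gram index are simplifications for illustration, not ProbeMatch's implementation.

```python
def split_fragments(query, j):
    """Split query into j non-overlapping fragments; returns (offset, fragment) pairs."""
    base, extra = divmod(len(query), j)
    frags, start = [], 0
    for i in range(j):
        length = base + (1 if i < extra else 0)
        frags.append((start, query[start:start + length]))
        start += length
    return frags

def best_edit_distance(query, window):
    """Minimum edit distance between query and any substring of window."""
    prev = [0] * (len(window) + 1)        # the match may start anywhere in window
    for i, qc in enumerate(query, 1):
        cur = [i]
        for w, wc in enumerate(window, 1):
            cur.append(min(prev[w - 1] + (qc != wc),  # match or mismatch
                           prev[w] + 1,               # query base unaligned
                           cur[w - 1] + 1))           # window base unaligned
        prev = cur
    return min(prev)                      # the match may end anywhere in window

def find_within_k(query, target, k):
    """Filter with k + 1 exact fragments, then verify each candidate window."""
    hits = set()
    for offset, frag in split_fragments(query, k + 1):
        pos = target.find(frag)
        while pos != -1:
            lo = max(0, pos - offset - k)
            hi = min(len(target), pos - offset + len(query) + k)
            d = best_edit_distance(query, target[lo:hi])
            if d <= k:
                hits.add((lo, hi, d))     # window bounds and errors found
            pos = target.find(frag, pos + 1)
    return sorted(hits)
```

For example, `find_within_k("ACGTACGA", "TTTTACGTTCGATTTT", 1)` finds the exact fragment `ACGT`, then confirms that the surrounding window aligns to the query with one error.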

ProbeMatch: Result • 169,095 transcriptome short reads from a prostate cell line (RWPE), generated by the Illumina Genome Analyzer, were matched against the human genome using various alignment programs. • Table: Comparison of execution times and sensitivity

PatMaN: Rapid Alignment of Short Sequences to Large Databases • Algorithm • A single keyword tree is constructed from all the query sequences. • When the ambiguity flag is set, a match occurs if the base is one of the nucleotides in the ambiguity code. • When the ambiguity flag is omitted, a base aligned to an ambiguity character is counted as a mismatch. • All bases along a query sequence are added as a path from the root of the tree to a leaf, with each edge labelled by a base and the leaf labelled by the query sequence id. • Suffix links are also added to the tree.
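The effect of the ambiguity flag on a single base comparison can be sketched as below. The table holds the standard IUPAC nucleotide codes; the function name `base_matches` and the choice to place the ambiguity code on the query side are assumptions made for illustration, not PatMaN's actual interface.

```python
# Standard IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def base_matches(query_code, target_base, ambiguity=True):
    """Return True if target_base is an acceptable match for query_code."""
    if ambiguity:
        # Flag set: any base covered by the code counts as a match.
        return target_base in IUPAC.get(query_code, "")
    # Flag omitted: an ambiguity character always counts as a mismatch.
    return query_code in "ACGT" and query_code == target_base
```

With the flag set, `base_matches("R", "G")` is a match because R stands for A or G; with `ambiguity=False` the same comparison is a mismatch.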

PatMaN: Rapid Alignment of Short Sequences to Large Databases • Suppose the query sequences are ‘CCC’, ‘GA’, and ‘GT’. • Basic keyword tree: root -C-> -C-> -C-> leaf CCC; root -G-> -A-> leaf GA; root -G-> -T-> leaf GT
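The basic keyword tree for these three queries can be built with a few lines of code. This is a minimal sketch: the `Node` class and `build_keyword_tree` are hypothetical names, and each query's own string is used as its id for simplicity.

```python
class Node:
    def __init__(self):
        self.children = {}    # base -> child Node (one edge per base)
        self.query_ids = []   # ids of the queries that end at this node

def build_keyword_tree(queries):
    """Insert every query as a root-to-leaf path, one edge per base."""
    root = Node()
    for seq in queries:
        node = root
        for base in seq:
            node = node.children.setdefault(base, Node())
        node.query_ids.append(seq)   # the sequence doubles as its id here
    return root

root = build_keyword_tree(["CCC", "GA", "GT"])
```

Because ‘GA’ and ‘GT’ share the prefix G, they share the first edge out of the root and only branch at the second base.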

PatMaN: Rapid Alignment of Short Sequences to Large Databases • After adding the suffix links • [Figure: the keyword tree for ‘CCC’, ‘GA’, ‘GT’ with suffix links added]
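One way to compute such links is the breadth-first construction used for failure links in the Aho-Corasick algorithm, where each node points to the node spelling the longest proper suffix of its path that is also in the tree. The sketch below assumes PatMaN's suffix links follow that definition; the array-based representation and all names are illustrative.

```python
from collections import deque

def build_tree_with_suffix_links(queries):
    children = [{}]   # children[n]: base -> node index; node 0 is the root
    path = [""]       # path[n]: string spelled from the root to node n
    for seq in queries:
        node = 0
        for base in seq:
            if base not in children[node]:
                children.append({})
                path.append(path[node] + base)
                children[node][base] = len(children) - 1
            node = children[node][base]
    link = [0] * len(children)            # depth-1 nodes keep a link to the root
    queue = deque(children[0].values())
    while queue:                          # breadth-first, so parents come first
        node = queue.popleft()
        for base, child in children[node].items():
            fallback = link[node]
            while fallback and base not in children[fallback]:
                fallback = link[fallback]
            link[child] = children[fallback].get(base, 0)
            queue.append(child)
    return children, path, link

children, path, link = build_tree_with_suffix_links(["CCC", "GA", "GT"])
suffix_of = {path[n]: path[link[n]] for n in range(len(path))}
```

In this example the node for ‘CCC’ links to ‘CC’ and ‘CC’ links to ‘C’, while ‘GA’ and ‘GT’ link back to the root because no query starts with A or T.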

PatMaN: Rapid Alignment of Short Sequences to Large Databases • Completing the tree • [Figure: the completed keyword tree for ‘CCC’, ‘GA’, ‘GT’, with additional edges labelled A, T, N, and G]

PatMaN: Rapid Alignment of Short Sequences to Large Databases • Algorithm • Once the tree is completed, each sequence in the target database is evaluated base by base and compared to a list of partial matches. • Each partial match consists of • A node • The number of mismatches and gaps so far. • The list is initialized with • The root of the tree • An edit count of zero. • In each iteration of the algorithm, all partial matches are advanced along the perfectly matching outgoing edges.
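The partial-match list can be sketched as follows. This is a simplified illustration, not PatMaN's implementation: it handles mismatches only (no gaps), ignores suffix links, and instead starts a fresh partial match at the root for every target base. The names `build_trie` and `scan` are hypothetical.

```python
def build_trie(queries):
    root = {"children": {}, "ids": []}
    for seq in queries:
        node = root
        for base in seq:
            node = node["children"].setdefault(base, {"children": {}, "ids": []})
        node["ids"].append(seq)
    return root

def scan(target, queries, max_mismatches):
    """Return (query, end position in target, mismatches) for every hit."""
    root = build_trie(queries)
    partial = []                          # list of (node, mismatches so far)
    hits = []
    for pos, base in enumerate(target):
        partial.append((root, 0))         # a new match may start at this base
        advanced = []
        for node, edits in partial:
            for edge, child in node["children"].items():
                cost = edits + (edge != base)   # following a wrong edge costs one
                if cost <= max_mismatches:
                    advanced.append((child, cost))
                    for query in child["ids"]:
                        hits.append((query, pos, cost))
        partial = advanced                # matches that could not advance are dropped
    return hits
```

With `max_mismatches=0` only perfectly matching edges are followed, so the list stays short; each extra allowed mismatch lets every partial match branch along the non-matching edges as well.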

PatMaN: Rapid Alignment of Short Sequences to Large Databases • Complexity • Without ambiguity codes, O(L) time and space are required, where L is the total length of all query sequences. • When ambiguity is enabled, both the time and space requirements increase exponentially. • The search time depends on the size of the target database, and heavily on the maximum edit distance and the average length of the query sequences. • For each additional edit operation, an exponentially increasing number of partial matches must be considered.
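A rough feel for this growth comes from counting how many distinct DNA strings lie within k mismatches of a query of length n, which is the sum of C(n, i) · 3^i for i from 0 to k. This count is only an illustration under simplifying assumptions (mismatches only, no gaps, a hypothetical 25-base query); it is not the number of partial matches PatMaN actually tracks.

```python
from math import comb

def hamming_neighbourhood(n, k):
    """Number of length-n DNA strings within k mismatches of a given string."""
    # Choose i positions to change, then one of 3 other bases at each of them.
    return sum(comb(n, i) * 3**i for i in range(k + 1))

# Growth for a hypothetical 25-base query as more mismatches are allowed.
sizes = [hamming_neighbourhood(25, k) for k in range(4)]
```

For n = 25 this gives 1, 76, 2776, and 64876 candidate strings for k = 0 to 3, which is why the allowed edit distance dominates the running time.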

PatMaN: Rapid Alignment of Short Sequences to Large Databases • Result • Because of its running time, PatMaN is suitable for short sequences with a limited number of edit operations. • HG-U95 was matched against the chimpanzee genome (panTro2), allowing no gaps and one mismatch. • PatMaN took 2.5 h and found 15.9 million hits.

Q/A?
