Rapid Alignment of Short Sequences to Large Databases
Learn about PatMaN and ProbeMatch for genome sequence alignment, advantages, background, methodology, and results in bioinformatics analysis.
Presentation – Homework 2 • Advanced Topics: Current Bioinformatics • Instructor: Dr. Jianhua Ruan • Group Members: Jamiul Jahid, Mohammad Iftekharul Islam, Tanzir Musabbir
NGS Analysis Papers • PatMaN: Rapid Alignment of Short Sequences to Large Databases • Kay Prüfer, Udo Stenzel, Michael Dannemann, Richard Green, Michael Lachmann • ProbeMatch: Rapid Alignment of Oligonucleotides to Genome Allowing Both Gaps and Mismatches • You Kim, Nikhil Teletia, Victor Ruotti, Maher, James Thomson and Jignesh Patel
PatMaN: Rapid Alignment of Short Sequences to Large Databases • PatMaN – Pattern Matching in Nucleotide Databases • A tool for performing exhaustive searches to identify all occurrences of a large number of short sequences within a genome-sized database. • Reads sequences in FastA format and reports all hits within the given edit-distance cutoff. • Advantages: • Allows a predefined number of gaps and mismatches • Ambiguity codes can be searched • Search time is short for perfect matches
ProbeMatch: Rapid Alignment of Oligonucleotides to Genome Allowing Both Gaps and Mismatches • For matching a large set of oligonucleotide sequences against a genome database using gapped alignments • Advantages: • It generates both ungapped and gapped alignments • It allows up to three errors, including insertions, deletions and mismatches • It is able to detect multiple classes of mutations: SNVs and indels.
ProbeMatch: Background • High-throughput DNA sequencing technologies: Illumina, 454 Life Sciences • A large set of short sequences is produced • These must be mapped to a genome, allowing for only a few errors • Traditional sequence alignment tools can do this, but are computationally impractical
ProbeMatch: Background • ELAND (Efficient Local Alignment of Nucleotide Data) • Searches DNA databases for a large number of short sequences • Only ungapped alignments, allowing up to two mismatches • MAQ (Mapping and Assembly with Quality) • Only ungapped alignments, allowing up to three mismatches • Measures the error probability of alignments using sequence quality information • SOAP • SeqMap
ProbeMatch: Background • These programs are often faster than BLAST by an order of magnitude or more • But they usually map only 60-80% of the query sequences to genomes • Further processing is needed using a computationally expensive but sensitive alignment method • The overall gain is therefore limited • ProbeMatch addresses this challenge
ProbeMatch: Rapid Alignment of Oligonucleotides to Genome Allowing Both Gaps and Mismatches • Allows a richer match model • Finds gapped and ungapped alignments with up to three errors in any combination • Able to detect multiple classes of mutations
ProbeMatch: Methodology • Takes as input a query sequence set and a database of sequences. • The database is divided into small segments. • ProbeMatch loads each segment and builds a q-gram index. • To find potential hits, ProbeMatch searches the q-gram index and extends hits to find longer alignments.
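The index-and-lookup step can be sketched as follows. This is a minimal illustration, not ProbeMatch's implementation: it assumes plain contiguous q-grams, exact q-gram lookups, and a single in-memory segment. The names `build_qgram_index` and `candidate_positions` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_qgram_index(segment: str, q: int) -> Dict[str, List[int]]:
    """Map every length-q substring of the segment to its start positions."""
    index = defaultdict(list)
    for pos in range(len(segment) - q + 1):
        index[segment[pos:pos + q]].append(pos)
    return dict(index)


def candidate_positions(query: str, index: Dict[str, List[int]], q: int) -> Set[int]:
    """Return segment offsets where the query might start.

    Every q-gram of the query is looked up in the index; each occurrence
    votes for the segment position at which the whole query would begin.
    """
    candidates = set()
    for offset in range(len(query) - q + 1):
        for pos in index.get(query[offset:offset + q], []):
            start = pos - offset
            if start >= 0:
                candidates.add(start)
    return candidates


segment = "ACGTACGTTAGC"          # toy database segment (assumed data)
index = build_qgram_index(segment, q=4)
hits = candidate_positions("CGTTAG", index, q=4)
```

Each candidate offset would then be passed to a verification step that extends the hit into a full alignment.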
ProbeMatch: Methodology • If two sequences Q and T match within k errors, and j non-overlapping fragments are taken from Q, then T contains at least one of the fragments with at most ⌊k/j⌋ errors. • The matched hits are then extended to check whether the entire query sequence and the target sequence can be aligned within k errors. • A gapped q-gram index ("Better Filtering with Gapped q-Grams", Burkhardt and Kärkkäinen, 2002) provides more efficient filtering than an ungapped q-gram index.
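The fragment rule above is a pigeonhole argument: k errors spread over j fragments must leave at least one fragment with no more than ⌊k/j⌋ of them. A rough sketch of a filter built on it is shown below. For simplicity it counts mismatches only (no gaps) and scans the target by brute force. The function names are hypothetical and nothing here is taken from the ProbeMatch code.

```python
def split_fragments(query, j):
    """Split the query into j non-overlapping fragments of near-equal length."""
    base, extra = divmod(len(query), j)
    fragments, start = [], 0
    for i in range(j):
        length = base + (1 if i < extra else 0)
        fragments.append((start, query[start:start + length]))
        start += length
    return fragments


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def passes_filter(query, target, k, j):
    """True if some fragment occurs in the target with at most k // j mismatches.

    A query that cannot pass this filter cannot match the target within
    k mismatches, so it can be discarded without a full alignment.
    """
    allowed = k // j
    for _, frag in split_fragments(query, j):
        for pos in range(len(target) - len(frag) + 1):
            if hamming(frag, target[pos:pos + len(frag)]) <= allowed:
                return True
    return False


query = "ACGTACGTAC"
target = "TCGAACCTAC"   # differs from the query at positions 0, 3 and 6
survives = passes_filter(query, target, k=3, j=4)
```

With k = 3 and j = 4 the allowed error count per fragment is 0, so one of the four fragments has to appear exactly. Here the last fragment, "AC", does.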
ProbeMatch: Result • 169,095 transcriptome short reads from a prostate cell line (RWPE), generated by the Illumina Genome Analyzer, were matched against the human genome using various alignment programs. • Table: Comparison of execution times and sensitivity
PatMaN: Rapid Alignment of Short Sequences to Large Databases • Algorithm • Construct a single keyword tree of all the query sequences. • When the ambiguity flag is set, a match occurs if the base is one of the nucleotides in the ambiguity code. • When the ambiguity flag is omitted, a base aligned to an ambiguity character is counted as a mismatch. • All bases along a query sequence are added as a path from the root of the tree to a leaf, with each edge labeled by a base and the leaf labeled by the query sequence ID. • Suffix links are also added to the tree.
PatMaN: Rapid Alignment of Short Sequences to Large Databases • Suppose the query sequences are 'CCC', 'GA' and 'GT'. The basic keyword tree is: • root → C → C → C (leaf CCC) • root → G → A (leaf GA) • root → G → T (leaf GT)
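The basic keyword tree for this example can be built with a few lines of Python. This is only a sketch using nested dictionaries. The helper name `build_keyword_tree` and the `"ids"` key that marks where a query ends are assumptions made for illustration, not PatMaN's data structures.

```python
def build_keyword_tree(queries):
    """Build a keyword tree (trie) as nested dicts.

    Each single-character key is an edge labeled with one base. The
    special key "ids" on a node lists the queries that end at that node.
    """
    root = {}
    for query_id, seq in enumerate(queries):
        node = root
        for base in seq:
            # Follow the edge for this base, creating it if needed.
            node = node.setdefault(base, {})
        node.setdefault("ids", []).append(query_id)
    return root


tree = build_keyword_tree(["CCC", "GA", "GT"])
```

'GA' and 'GT' share the edge G from the root, which is what lets one pass over the target serve many queries at once.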
PatMaN: Rapid Alignment of Short Sequences to Large Databases • After adding the suffix links • [Figure: keyword tree for 'CCC', 'GA', 'GT' with suffix links added]
PatMaN: Rapid Alignment of Short Sequences to Large Databases • Completing the tree • [Figure: completed keyword tree for 'CCC', 'GA', 'GT', with additional edges labeled by the remaining bases, e.g. 'A, T, N']
PatMaN: Rapid Alignment of Short Sequences to Large Databases • Algorithm • Once the tree is completed, each sequence in the target database is evaluated base by base and compared to a list of partial matches. • Each partial match consists of • A node • The number of mismatches and gaps so far. • The list is initialized with • The root of the tree • An edit count of zero. • In each iteration of the algorithm, all partial matches are advanced along perfectly matching outgoing edges.
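The partial-match loop can be sketched as below. Several simplifications are assumed here and are not taken from PatMaN itself: gaps and ambiguity codes are left out, a fresh partial match is started at the root for every target position instead of following suffix links, and mismatching edges are followed at a cost of one edit. The names `search` and `build_keyword_tree` are hypothetical.

```python
def build_keyword_tree(queries):
    """Nested-dict keyword tree; the key "ids" marks where queries end."""
    root = {}
    for query_id, seq in enumerate(queries):
        node = root
        for base in seq:
            node = node.setdefault(base, {})
        node.setdefault("ids", []).append(query_id)
    return root


def search(target, queries, max_mismatches):
    """Report (query_id, end_position, mismatches) for every hit in target."""
    root = build_keyword_tree(queries)
    hits = []
    partial = []                      # list of (node, edits so far)
    for pos, base in enumerate(target):
        partial.append((root, 0))     # a match may start at any position
        advanced = []
        for node, edits in partial:
            for edge, child in node.items():
                if edge == "ids":
                    continue
                # A matching edge is free; any other edge costs one edit.
                cost = edits + (0 if edge == base else 1)
                if cost <= max_mismatches:
                    advanced.append((child, cost))
                    for query_id in child.get("ids", []):
                        hits.append((query_id, pos, cost))
        partial = advanced            # matches over the limit are dropped
    return hits


exact_hits = search("ACCCGT", ["CCC", "GA", "GT"], max_mismatches=0)
fuzzy_hits = search("GC", ["CCC", "GA", "GT"], max_mismatches=1)
```

In the exact run, 'CCC' ends at position 3 and 'GT' at position 5. In the second run, allowing one mismatch lets both 'GA' and 'GT' match the two-base target "GC".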
PatMaN: Rapid Alignment of Short Sequences to Large Databases • Complexity • Without ambiguity codes, O(L) time and space are required, where L is the total length of all query sequences. • When ambiguity is enabled, both time and space requirements increase exponentially. • The search time depends on the size of the target database, and also heavily on the maximum edit distance and the average length of the query sequences. • For each additional edit operation, an exponentially increasing number of partial matches must be considered.
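To give a feel for this growth, the sketch below counts how many distinct DNA sequences lie within a given number of mismatches of a single query. It is a back-of-the-envelope illustration for mismatches only (no gaps, four-letter alphabet), not a measurement of PatMaN's partial-match list. The function name is hypothetical.

```python
from math import comb


def mismatch_neighborhood_size(length, max_mismatches):
    """Count sequences within max_mismatches substitutions of one query.

    Choose which i positions differ (comb(length, i)) and, for each,
    one of the 3 other bases (3 ** i); sum over i = 0..max_mismatches.
    """
    return sum(comb(length, i) * 3 ** i for i in range(max_mismatches + 1))


# Growth for a 25-base query as the mismatch limit rises from 0 to 3.
sizes = [mismatch_neighborhood_size(25, d) for d in range(4)]
```

For a 25-base query the count goes from 1 with no mismatches to 76 with one and 2776 with two, which is why the edit limit has to stay small.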
PatMaN: Rapid Alignment of Short Sequences to Large Databases • Result • The time constraints of PatMaN mean it is suitable for short sequences with a limited number of edit operations. • HG-U95 was matched against the chimpanzee genome (panTro2) with no gaps and up to one mismatch. • PatMaN took 2.5 h and found 15.9 million hits.