Indel Mappers
Indel Mapper • Pindel – A Pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads • Kai Ye, Marcel H. Schulz, Quan Long, Rolf Apweiler and Zemin Ning
The programs • Stampy – A statistical algorithm for sensitive and fast mapping of Illumina sequence reads • Gerton Lunter and Martin Goodson (Genome Res, Oct 2010) • LAST – Probabilistic alignments with quality scores: an application to short-read mapping toward accurate SNP/indel detection • Michiaki Hamada, Edward Wijaya, Martin C. Frith and Kiyoshi Asai (Bioinformatics, Oct 2011)
Flow of Pindel • Aim: compute precise breakpoints, as well as the fragments inserted or deleted relative to the reference, from paired-end reads • Use SSAHA2 to map all reads to the reference • If both ends are uniquely mapped, keep them • If one end is uniquely mapped (no mismatch allowed for this anchoring end) • Other end must be mapped with a threshold of at least 20 (alignment score for ~36bp read)
Finding the unmapped end • Given a unique anchor of one end, find the locus of its unmapped pair and its fragments • 2 fragments if it is a deletion • 3 fragments if it is an insertion
Finding the unmapped end • Due to a deletion (must be supported by >=2 reads) • User specifies the max. deletion size, Min_F & Min_C
Finding the unmapped end • Due to an insertion (<=20bp for 36bp reads) • Insertion must be supported by >=2 reads • Compute min & max unique substrings (US) of both 5’ & 3’ ends of the unmapped read • Check if minUS_5’ is adjacent to maxUS_3’ and vice versa • The region between minUS_5’ and minUS_3’ is the inserted fragment
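The unique-substring idea for the deletion case can be sketched as follows. This is a minimal illustration, not Pindel's actual pattern-growth implementation: it uses naive string search, and the names `find_all`, `unique_prefix_range`, `unique_suffix_range` and `detect_deletion` are hypothetical. It assumes `window` is the reference region near the anchor and `read` is the unmapped end, already on the same strand.

```python
def find_all(text, pattern):
    """Return the start of every (possibly overlapping) occurrence."""
    hits, start = [], text.find(pattern)
    while start != -1:
        hits.append(start)
        start = text.find(pattern, start + 1)
    return hits


def unique_prefix_range(window, read):
    """Shortest and longest prefix of `read` occurring exactly once in `window`.

    Returns (min_len, max_len, start_in_window), or None if no prefix is unique.
    """
    min_len = max_len = start = None
    for n in range(1, len(read) + 1):
        hits = find_all(window, read[:n])
        if not hits:
            break  # longer prefixes cannot match either
        if len(hits) == 1:
            if min_len is None:
                min_len = n
            max_len, start = n, hits[0]
    return None if min_len is None else (min_len, max_len, start)


def unique_suffix_range(window, read):
    """Same as above for suffixes; the position is the exclusive end in `window`."""
    result = unique_prefix_range(window[::-1], read[::-1])
    if result is None:
        return None
    min_len, max_len, rev_start = result
    return min_len, max_len, len(window) - rev_start


def detect_deletion(window, read):
    """Split the read into a unique 5' part and a unique 3' part.

    Returns (deletion_start, deletion_end) in window coordinates
    (end exclusive), or None if the read does not span a deletion.
    """
    prefix = unique_prefix_range(window, read)
    suffix = unique_suffix_range(window, read)
    if prefix is None or suffix is None:
        return None
    p_min, p_max, p_start = prefix
    s_min, s_max, s_end = suffix
    for k in range(p_min, p_max + 1):
        rest = len(read) - k
        if s_min <= rest <= s_max:
            left_end = p_start + k
            right_start = s_end - rest
            if right_start > left_end:
                return left_end, right_start
    return None
```

For example, with the window `GATTACAGGC` + `TTTTTCCCCT` + `AGTCCGATAG` and a read made of the last six bases of the first block joined to the first six bases of the third, `detect_deletion` reports the middle block as deleted. A full implementation would also enforce the user-specified maximum deletion size and the requirement of at least two supporting reads.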
Outline of Stampy • Scanning the read • Phred scores • Similarity Filtering • Single End - Mapping Posterior • Paired-end reads: paired-end candidates
Scanning the read • Overlapping 15mers considered • Including 1-mismatch ‘neighbours’ • For reads >34bp and <50bp long, 1-mismatch neighbours are considered for half of the 15mers • For reads >=50bp long, only a third of the 15mers are considered
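A rough sketch of this scanning step is given below. The slide does not say which 15mers get neighbours, so taking every 2nd or every 3rd 15mer is an assumption, as are all function names; this is not Stampy's code.

```python
def overlapping_kmers(read, k=15):
    """All overlapping k-mers of the read as (offset, kmer) pairs."""
    return [(i, read[i:i + k]) for i in range(len(read) - k + 1)]


def one_mismatch_neighbours(kmer):
    """All sequences differing from `kmer` at exactly one position.

    Returns (position_in_kmer, neighbour) pairs; a 15mer has 15 * 3 = 45.
    """
    out = []
    for pos, base in enumerate(kmer):
        for alt in "ACGT":
            if alt != base:
                out.append((pos, kmer[:pos] + alt + kmer[pos + 1:]))
    return out


def neighbour_stride(read_len):
    """ASSUMPTION: thin the neighbour search by taking every n-th 15mer."""
    if read_len >= 50:
        return 3  # a third of the 15mers
    if read_len > 34:
        return 2  # half of the 15mers
    return 1


def scan_read(read, k=15):
    """Return seeds as (offset, sequence, mutated_read_position or None)."""
    stride = neighbour_stride(len(read))
    seeds = []
    for offset, kmer in overlapping_kmers(read, k):
        seeds.append((offset, kmer, None))  # exact 15mer
        if offset % stride == 0:
            for pos, neighbour in one_mismatch_neighbours(kmer):
                seeds.append((offset, neighbour, offset + pos))
    return seeds
```

Keeping the mutated read position with each neighbour seed makes it possible to look up the Phred quality of that base later.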
Phred scores • Corresponding positions of the read are marked with a Phred score • 0, if it is repetitive (>200 occurrences in the reference); for its 1-neighbor, it is marked by the Phred quality of the mutated base • All positions of non-repetitive 15mers are retrieved • The scores are used to calculate the mapping posterior later
Similarity Filtering • Three 4-nt words close to but non-overlapping with the 15mer are chosen • Counts of A-C-G-T for these 12 read-positions • Counts of A-C-G-T for these 12 positions at the putative genomic location • Get the absolute difference between the two sets of counts (read and reference); Score T • Candidate positions exceeding T will be discarded
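The base-composition comparison can be sketched as below. It assumes `reference` is the genomic segment laid against the read at the candidate location, so the same indices address both sequences; `positions` stands for the 12 read positions covered by the three 4-nt words. The names and the `threshold` parameter are hypothetical.

```python
from collections import Counter


def composition_difference(read, reference, positions):
    """Sum over A, C, G, T of |count in read - count in reference|."""
    read_counts = Counter(read[p] for p in positions)
    ref_counts = Counter(reference[p] for p in positions)
    return sum(abs(read_counts[b] - ref_counts[b]) for b in "ACGT")


def keep_candidate(read, reference, positions, threshold):
    """Discard candidate locations whose difference exceeds the threshold."""
    return composition_difference(read, reference, positions) <= threshold
```

A single substitution changes two counts by one each, so it contributes 2 to the difference; the filter is cheap because it ignores the order of the bases.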
Single End - Mapping Posterior • Probability that a mapping is incorrect: P(incorrect) = 1 - P(read | Lopt) / Σ_i P(read | Li) • Lopt is the maximum-likelihood mapping location • The sum runs over all considered locations Li • This is only an approximation, as the correct location may not be among the Li considered: • (1) Read contains highly repetitive 15mers • (2) Read is of low quality or highly diverged from the reference • (3) Sequence is not represented in the reference • Final mapping Phred score is obtained by summing 1, 2 and 3
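The posterior on this slide, 1 - P(read | Lopt) / Σ P(read | Li), can be computed directly once the likelihoods of the candidate locations are known. A minimal sketch, with hypothetical function names and the usual Phred conversion -10 log10(p):

```python
import math


def mismap_probability(likelihoods):
    """1 - P(read | Lopt) / sum_i P(read | Li) over the candidate locations."""
    best = max(likelihoods)
    total = sum(likelihoods)
    return 1.0 - best / total


def phred(prob, cap=60.0):
    """Phred-scale an error probability; `cap` (an assumption) avoids log(0)."""
    if prob <= 0.0:
        return cap
    return min(cap, -10.0 * math.log10(prob))
```

With a single candidate location the formula gives a mismapping probability of zero, which is why the three extra failure modes listed on the slide have to be accounted for separately.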
Paired-end reads: paired-end candidates • Pair is unmapped if no candidates are found for both reads • Report the pair-coordinates • Best locations for the pair are within 4 SD of the mean insert-length OR • Phred score >=2 in (1 & 2) • Else • Candidates which constitute 99.9% of the posterior mapping score of the single read are extracted • Its mate will be mapped against the reference region implied by the insert-length distribution
Paired-end reads: paired-end candidates • Final mapping quality • Product of the top-scoring single-end hits selected as the pair • Or Single-end posterior score of anchoring hit
LAST • Uses probabilistic alignment instead of maximum score-based alignment • Based on posterior decoding technique which uses marginal probabilities that incorporate all possible alignments with quality scores
Outline of LAST • Incorporating quality scores into a score matrix • Probabilistic model for alignment • Marginal Probabilities • Probabilistic alignments with quality scores • Y-centroid alignment • LAMA alignment
Incorporating quality scores into a score matrix • Old Method: Sa,b is the substitution score of aligning nucleotide reference-a onto read-b • Incorporate quality-score, q, into S • T is a scaling factor
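The slide's formulas are not reproduced here, so the following is only one plausible way to fold a quality score q into a substitution score, not necessarily the LAST formula. Assumptions: scores are log likelihood ratios S = T ln R, a base with Phred quality q is wrong with probability e = 10^(-q/10), and a wrong base is equally likely to be any of the other three. All names are hypothetical.

```python
import math

BASES = "ACGT"


def error_probability(q):
    """Phred quality q to error probability e = 10^(-q/10)."""
    return 10.0 ** (-q / 10.0)


def quality_aware_score(ref_base, read_base, q, ratio, T=1.0):
    """Score for aligning ref_base to a read base called with quality q.

    `ratio[a][b]` is the likelihood ratio for reference a against true
    read base b. ASSUMPTION: the called base is the true base with
    probability 1 - e, otherwise each other base with probability e / 3.
    """
    e = error_probability(q)
    others = [b for b in BASES if b != read_base]
    mixed = (1.0 - e) * ratio[ref_base][read_base]
    mixed += (e / 3.0) * sum(ratio[ref_base][b] for b in others)
    return T * math.log(mixed)


# Toy likelihood ratios: 3.7 for a match, 0.1 for a mismatch.
RATIO = {a: {b: (3.7 if a == b else 0.1) for b in BASES} for a in BASES}
```

Under this model a low-quality match earns less than a high-quality match, and a low-quality mismatch is penalised less than a high-quality one, which is the qualitative behaviour the slide describes.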
Probabilistic model for alignment • Let S(A) be the score for alignment A. • For a local alignment A, the probability of A • x is the genome region • y is the read-base with a quality score • S(A) is computed from the ‘new’ substitution score matrix
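The formula for the probability of A is not shown on the slide. Assuming the usual Boltzmann form, P(A | x, y) = exp(S(A)/T) / Z with Z the sum of exp(S(A')/T) over all alignments A', a toy version over an explicitly enumerated set of alignment scores looks like this. Real aligners obtain Z with a forward-style dynamic programme rather than by enumeration; the function name is hypothetical.

```python
import math


def alignment_probabilities(scores, T=1.0):
    """Normalise exp(S(A) / T) over a list of candidate alignment scores."""
    top = max(scores)
    # Subtracting the maximum keeps exp() from overflowing.
    weights = [math.exp((s - top) / T) for s in scores]
    z = sum(weights)
    return [w / z for w in weights]
```

A larger T flattens the distribution, so more probability mass goes to lower-scoring alignments.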
Marginal Probabilities • Pik is the marginal probability that a base xi (i-th base of x) aligns with a base yk (k-th base of y) • qi is the marginal probability that a base xi aligns with a gap • Ui is the marginal probability that xi belongs to an un-aligned region that is not contained in the local alignment
Probabilistic alignments with quality scores • Two methods considering quality scores by using the marginal probabilities • Y-centroid alignment • LAMA alignment
Y-centroid alignment • Maximizing S(A) for alignment A • Y (written γ in the LAST paper) is a parametric input • xi~yk is an aligned column (without gaps) in A • Computed with the Needleman-Wunsch (NW) algorithm
Parameter Y • Adjusts the sensitivity and precision of the aligned columns • When Y is low, LAST is conservative and only aligns bases with high probabilities • When Y is high, more columns are aligned, at the cost of more false positives • Drawback of Y-centroid: even with a low Y, the alignment may still contain many gaps
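The effect of Y can be seen in a small dynamic programme. The objective is not spelled out on the slides; this sketch assumes the form commonly used for centroid alignments, where each aligned column xi~yk contributes (Y + 1) * Pik - 1 and gaps contribute 0, so a column is worth aligning only when Pik > 1 / (Y + 1). The function name and the input layout (`p[i][k]` as a list of lists) are hypothetical.

```python
def centroid_alignment(p, gamma):
    """Aligned columns (i, k) maximising sum((gamma + 1) * p[i][k] - 1).

    `p[i][k]` is the marginal probability that x_i aligns with y_k.
    Needleman-Wunsch style recursion with gap score 0 (ASSUMED objective).
    """
    n = len(p)
    m = len(p[0]) if p else 0

    def column_score(i, k):
        return (gamma + 1.0) * p[i][k] - 1.0

    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for k in range(1, m + 1):
            diag = best[i - 1][k - 1] + column_score(i - 1, k - 1)
            best[i][k] = max(diag, best[i - 1][k], best[i][k - 1])

    # Trace back, keeping only columns with a positive contribution.
    pairs, i, k = [], n, m
    while i > 0 and k > 0:
        gain = column_score(i - 1, k - 1)
        if gain > 0 and best[i][k] == best[i - 1][k - 1] + gain:
            pairs.append((i - 1, k - 1))
            i, k = i - 1, k - 1
        elif best[i][k] == best[i - 1][k]:
            i -= 1
        else:
            k -= 1
    return pairs[::-1]
```

With diagonal probabilities 0.9, 0.4 and 0.8, a low Y of 1 keeps only the two columns above 0.5, while Y = 4 lowers the cut-off to 0.2 and aligns all three.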
LAMA alignment • Considers the aligned columns and the gaps explicitly • For the gaps: deletion in alignment, insertion in alignment