SHRiMP: The SHort Read Mapping Package Michael Brudno Department of Computer Science University of Toronto 11/09/08
Handling NGS Data • NGS: at least 3 distinct read types: • Illumina/Solexa, 454 letter-space • AB SOLiD color-space (di-base sequencing) • 2-pass SMS (Helicos) • 2 reads, same location • higher error rates • Need new algorithms • SOLiD: Biologists want letters, not colors • 2-pass: How to best handle two reads?
SHRiMP Overview Isolate similarity in stages: • Spaced Seed Filtering • Vectorized Smith-Waterman • Full Alignment • Specialized for SOLiD, 2-pass, Letter-space • Compute p-values (and other statistics) [Slide annotation: a brace labeled "Common" beside the stage list]
Outline • AB SOLiD Reads • 2-pass (SMS) Reads
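AB SOLiD: Dibase Sequencing hmm??? HMM!!! • AB SOLiD reads look like this: T012233102, T012033102 • Letter-space translations: TGAGCGTTC and TGAATACCT (only the first three letters, TGA, agree) • [Figure: di-base color-code diagram over A, C, G, T; color 0 = same base, 1 = A/C or G/T, 2 = A/G or C/T, 3 = A/T or C/G]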
AB SOLiD: Color space is complex! • INDELS: TGAGTTA = 122103; TGA-TTA = 12-303; TGAGTTTA = 1221003; TGAGTATA = 1221333 • SNPs: TGAGTT = 12210; TGACTT = 12120; TGAATT = 12030; TGATTT = 12300 • G: TTGAGTTATGGAT = 012210331023 • R: TTGACTTATGGAT = 012120331023 • It’s bloody complicated!
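The color strings on this slide can be reproduced with a short sketch. It assumes the standard di-base code, in which each color is the XOR of two adjacent bases numbered A=0, C=1, G=2, T=3. The function names are illustrative and are not part of SHRiMP.

```python
# Di-base color code, assuming A=0, C=1, G=2, T=3 and color = XOR of neighbors.
BASE_TO_INT = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_colors(letters):
    """Return the color-space string for a letter-space sequence."""
    codes = [BASE_TO_INT[b] for b in letters]
    return "".join(str(x ^ y) for x, y in zip(codes, codes[1:]))


def changed_positions(colors_a, colors_b):
    """Indices at which two equal-length color strings differ."""
    return [i for i, (x, y) in enumerate(zip(colors_a, colors_b)) if x != y]


reference = encode_colors("TGAGTT")  # slide value: 12210
snp = encode_colors("TGACTT")        # slide value: 12120
print(reference, snp, changed_positions(reference, snp))
```

An isolated SNP changes the two colors that flank the substituted base, which is why the SNP rows on the slide each differ from 12210 in two adjacent positions.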
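AB SOLiD: Translations • Look at: 012233102, which translates to TGAGCGTTC (all 9 letters match TGAGCGTTC) • Recall: 012033102, which translates to TGAATACCT (only the first 3 letters match TGAGCGTTC) • 4 translations for every color sequence • [Figure: di-base color-code diagram over A, C, G, T with transitions labeled 0-3]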
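The "4 translations" bullet can be illustrated with a minimal sketch, again assuming the standard di-base code (A=0, C=1, G=2, T=3, next base = previous base XOR color). The helper names are hypothetical.

```python
BASES = "ACGT"  # index of each base is its 2-bit code (assumed standard encoding)


def translate(start_base, colors):
    """Decode a color string into letters, given the base before the first color."""
    current = BASES.index(start_base)
    letters = []
    for color in colors:
        current ^= int(color)  # each color tells how to step to the next base
        letters.append(BASES[current])
    return "".join(letters)


def all_translations(colors):
    """One translation per possible starting base."""
    return {base: translate(base, colors) for base in BASES}


for start, letters in all_translations("012233102").items():
    print(start, letters)
```

Only the translation that starts from the read's known primer base (T in the reads shown) corresponds to the sequenced molecule; the other three become relevant once a color error shifts the decoding.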
AB SOLiD: Modified Smith-Waterman • 4 S-W matrices, one per translation • Errors transition into other matrix • ‘Crossover’ penalty charged for errors • [Figure: genome sequence aligned against the read's translations (Translation A, Translation C), with the alignment path crossing from one matrix to another at a color error]
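The crossover idea can be sketched in a much-reduced form. The example below is not SHRiMP's algorithm: it ignores indels and local alignment entirely and only tracks, per read position, the best score in each of the four translations, charging a penalty whenever the path switches translation. The scoring values (match +4, mismatch -3, crossover -5) and all names are assumptions chosen for illustration.

```python
BASES = "ACGT"  # assumed standard di-base code: next base = previous base XOR color


def best_ungapped_score(primer, colors, genome,
                        match=4, mismatch=-3, crossover=-5):
    """Best ungapped score of a color read against genome letters.

    State s means "currently decoding in the translation that starts with
    BASES[s]". Switching state models a color error and costs `crossover`.
    """
    if len(genome) != len(colors):
        raise ValueError("genome window must have one letter per color")
    neg_inf = float("-inf")
    # Start in the translation dictated by the primer base.
    score = [0 if BASES[s] == primer else neg_inf for s in range(4)]
    prefix = 0  # XOR of all colors seen so far
    for color, genome_base in zip(colors, genome):
        prefix ^= int(color)
        updated = []
        for s in range(4):
            stay = score[s]
            switch = max(score[t] for t in range(4) if t != s) + crossover
            letter = BASES[s ^ prefix]  # letter this translation emits here
            step = match if letter == genome_base else mismatch
            updated.append(max(stay, switch) + step)
        score = updated
    return max(score)


print(best_ungapped_score("T", "012233102", "TGAGCGTTC"))  # error-free read
print(best_ungapped_score("T", "012033102", "TGAGCGTTC"))  # one color error
```

With one color error the path pays a single crossover and then matches the rest of the genome window, rather than accumulating a mismatch at every downstream letter.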
AB SOLiD: Obligatory Comparison • SHRiMP and AB Mapper (1.6) • SHRiMP seed weight 8 (1111001111) • AB 35_2, 35_3 schemas • 10,000 35bp reads • C. savignyi (173Mb), very high polymorphism • Considering single top hits only
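For readers unfamiliar with spaced seeds, here is a minimal sketch of how a mask such as 1111001111 selects positions for lookup. It works in letter space for simplicity and uses a plain dictionary as the index; both choices, and all function names, are assumptions made for illustration rather than a description of SHRiMP's internals.

```python
SEED = "1111001111"  # seed from the slide: span 10, weight 8


def seed_weight(seed):
    """Number of positions that must match (the '1's)."""
    return seed.count("1")


def apply_seed(window, seed):
    """Keep only the characters at the seed's '1' positions."""
    return "".join(ch for ch, bit in zip(window, seed) if bit == "1")


def spaced_keys(sequence, seed):
    """Yield (start, key) for every window of the seed's span."""
    span = len(seed)
    for start in range(len(sequence) - span + 1):
        yield start, apply_seed(sequence[start:start + span], seed)


def build_index(genome, seed):
    """Map each spaced key to the genome positions where it occurs."""
    index = {}
    for pos, key in spaced_keys(genome, seed):
        index.setdefault(key, []).append(pos)
    return index


def candidate_hits(read, index, seed):
    """(read offset, genome position) pairs that share a spaced key."""
    return sorted({(rpos, gpos)
                   for rpos, key in spaced_keys(read, seed)
                   for gpos in index.get(key, [])})


genome = "GGACGTACGTACGG"
read = "ACGTTTGTAC"  # differs from genome[2:12] only at the two '0' positions
print(seed_weight(SEED), candidate_hits(read, build_index(genome, SEED), SEED))
```

Because the two differing letters fall on the seed's don't-care positions, the read still produces a hit at genome position 2, which a contiguous 10-mer would miss.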
AB SOLiD: Resultant Alignments • SHRiMP emits letter-space alignments • Clear to biologists • Color-space need not be scary!
G:  798  GAACCCCTTACAACTGAACCCCTTAC  823
         ||X||||||||||||||||||| |||
T:       GAaCCCCTTACAACTGAACCCC-TAC
R:    1 T1211000203110121201000-231  25
Outline • AB SOLiD Reads • 2-pass (SMS) Reads
2-pass SMS Reads • SMS reads have high error rates • “Dark bases” (skipped letters) • Multiple passes are possible • Ameliorate errors over passes • Good chance of missing base in one read • Acceptable chance of getting it in at least one
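As a rough illustration of the last two bullets, assume each pass skips a given base independently with probability p (an assumption; the slides do not state an error model). Taking p = 0.07, the deletion rate used for the synthetic reads later in the talk, purely as an example value:

```latex
\begin{align*}
P(\text{missed in a given pass}) &= p = 0.07 \\
P(\text{missed in at least one of two passes}) &= 1 - (1 - p)^2 = 0.1351 \\
P(\text{missed in both passes}) &= p^2 = 0.0049 \\
P(\text{seen in at least one pass}) &= 1 - p^2 = 0.9951
\end{align*}
```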
Mapping 2-pass Reads Original Reads C-GACTTTA CTGACTTA CTGA-T--- ? Reference Genome
SMS 2-pass: SHRiMP with 2 reads • Reads: CTGACT and CAGCAT • Scoring: Match = +4, Mismatch = -3, Gap = -2 • S=9: CTG-ACT / CTGCACT / CAGCA-T; CTGAC-T / CTGACAT / CAG-CAT • S=8: C-TG-ACT / CATGCACT / CA-GCA-T; CT-GAC-T / CTAGACAT / C-AG-CAT; C-TGAC-T / CATGCACT / CA-G-CAT; CT-GAC-T / CTAGACAT / C-AG-CAT
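The S=9 and S=8 values can be checked by scoring the gapped rows column by column with the slide's parameters. The function below is a small illustrative helper, not SHRiMP code.

```python
def alignment_score(row_a, row_b, match=4, mismatch=-3, gap=-2):
    """Score two gapped rows of equal length, one column at a time."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for x, y in zip(row_a, row_b):
        if x == "-" and y == "-":
            raise ValueError("a column cannot be a gap in both rows")
        if x == "-" or y == "-":
            score += gap
        elif x == y:
            score += match
        else:
            score += mismatch
    return score


print(alignment_score("CTG-ACT", "CAGCA-T"))    # first S=9 alignment
print(alignment_score("C-TG-ACT", "CA-GCA-T"))  # first S=8 alignment
```

The S=9 alignments accept one mismatch (T against A) and two gaps, while the S=8 alignments avoid the mismatch at the cost of four gaps.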
SMS 2-pass: Near-optimal Alignments • Compute a DP matrix • Sum it up with the DP matrix computed in reverse • Leave only near-optimal alignments • Represent the remaining cells as a directed graph (Schwikowski & Vingron, 2003) • Match = +4, Mismatch = -3, Gap = -2 • [Figure: summed DP matrix for reads CTGACT and CAGCAT; the surviving cells form a graph of aligned column pairs such as CC, GG, TT, AT, A-, -T]
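The forward-plus-reverse step can be sketched as follows. This is a simplified illustration that assumes global alignment with a linear gap cost and the slide's scoring values. The `slack` parameter and the function names are hypothetical, and the graph construction that follows in the talk is not shown.

```python
def global_dp(a, b, match=4, mismatch=-3, gap=-2):
    """dp[i][j] = best score aligning a[:i] with b[:j] (global, linear gaps)."""
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            dp[i][j] = max(dp[i - 1][j - 1] + step,
                           dp[i - 1][j] + gap,
                           dp[i][j - 1] + gap)
    return dp


def through_scores(a, b):
    """total[i][j] = best score of any alignment passing through cell (i, j)."""
    n, m = len(a), len(b)
    forward = global_dp(a, b)
    # The reverse matrix scores suffixes: reverse[n-i][m-j] covers a[i:], b[j:].
    reverse = global_dp(a[::-1], b[::-1])
    return [[forward[i][j] + reverse[n - i][m - j] for j in range(m + 1)]
            for i in range(n + 1)]


def near_optimal_cells(a, b, slack):
    """Cells lying on some alignment scoring within `slack` of the optimum."""
    total = through_scores(a, b)
    best = total[0][0]
    return {(i, j)
            for i, row in enumerate(total)
            for j, value in enumerate(row)
            if value >= best - slack}


optimal = near_optimal_cells("CTGACT", "CAGCAT", slack=0)
near = near_optimal_cells("CTGACT", "CAGCAT", slack=1)
print(len(optimal), len(near))
```

Raising `slack` from 0 to 1 admits the cells used by the S=8 alignments in addition to those on the S=9 alignments.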
SMS 2-pass: SHRiMP with 2-pass data • Build a DAG representing the (near) optimal alignments of the two reads • Generate seeds (short paths) from the DAG • Do k-mer scan; if seeds are encountered, align both reads to the location using vectorized SW • Do full alignment for top hits • [Figure: DAG whose nodes are aligned column pairs such as CC, GG, TT, AA, AT, A-, -A, -T, C-, -C]
SMS 2-pass: Results (in brief) • 10,000 synthetic reads (~25-65 bp) • 7% deletion, 1% insertion, 1% substitution rate • Mapped to Human chromosome 1 • Spaced seed weight 8: 111101111
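The slides do not say how the synthetic reads were generated. The sketch below shows one plausible way to simulate a 2-pass read with the stated rates, assuming independent per-base errors. The error model, the function names, and the template sequence are all assumptions.

```python
import random


def simulate_pass(template, rng, p_del=0.07, p_ins=0.01, p_sub=0.01):
    """Return one noisy pass over `template` under an independent-error model."""
    out = []
    for base in template:
        if rng.random() < p_ins:
            out.append(rng.choice("ACGT"))  # spurious inserted base
        r = rng.random()
        if r < p_del:
            continue                        # "dark base": letter skipped
        if r < p_del + p_sub:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)


def simulate_two_pass(template, rng, **rates):
    """Two independent passes over the same template."""
    return simulate_pass(template, rng, **rates), simulate_pass(template, rng, **rates)


rng = random.Random(0)  # fixed seed so the example is repeatable
first, second = simulate_two_pass("CTGACTTACGGATTACAGCATTGAC", rng)
print(first)
print(second)
```

Pairs produced this way can be fed to any 2-read mapping procedure to measure how often the true location is recovered.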
SHRiMP Summary • Fast mapping of short reads to a genome • Handles: color-space (SOLiD) reads, 2-pass (SMS) reads, insertions and deletions • Easy to parallelize • Computation of p-values & other statistics for hits
SHRiMP TODO List • Faster Mapping (biggest complaint) • Matepair data support • Transcriptome Data • Suggestions?
Acknowledgements SHRiMP is brought to you by: • Steve Rumble • Vlad Yanovsky • Adrian Dalca • Marc Fiume • Phil Lacroute • Arend Sidow http://compbio.cs.toronto.edu/shrimp University of Toronto Stanford University