Parallel Pair-HMM SNP Detection

GNUMAP-SNP Parallel Pair-HMM SNP Detection Nathan Clement The University of Texas Austin, TX, USA

Outline • Motivation • NGS Issues and Requirements • Pair-HMM • Memory Optimizations • Results • Conclusion

Motivation Mutation Detection: • SNP discovery • HapMap and resequencing • Species Identification • Bisulfite Sequencing • Epigenetic influences • RNA editing

Error Rates* * Data current as of May 2011: Glenn, Travis C, “Field guide to next-generation DNA sequencers,” Molecular Ecology Resources, vol 11, pp 759-769, 2011

Pair-HMM

Pair-HMM (Mathematics) • Match • Gap (in both directions)

Pair-HMM (M)

Pair-HMM (X)

Pair-HMM (Y)

Pair-HMM
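The equations on the Pair-HMM slides did not survive in this transcript, so the following is only a minimal sketch of the standard textbook forward recurrences for a three-state pair-HMM (match state M, gap states X and Y). The function name `pair_hmm_forward`, the parameters `delta` (gap open), `epsilon` (gap extend), `p_match`, and their default values are illustrative assumptions, not values from the talk.

```python
def pair_hmm_forward(read, ref, delta=0.05, epsilon=0.3, p_match=0.9):
    """Total forward likelihood of aligning read to ref (textbook pair-HMM).

    All parameter values are illustrative assumptions, not GNUMAP-SNP settings.
    """
    n, m = len(read), len(ref)
    q = 0.25                             # emission of a base against a gap
    p_mismatch = (1.0 - p_match) / 3.0   # spread the remainder over 3 other bases

    # fM, fX, fY hold forward values for the match state and the two gap states.
    fM = [[0.0] * (m + 1) for _ in range(n + 1)]
    fX = [[0.0] * (m + 1) for _ in range(n + 1)]
    fY = [[0.0] * (m + 1) for _ in range(n + 1)]
    fM[0][0] = 1.0

    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            if i > 0 and j > 0:
                emit = p_match if read[i - 1] == ref[j - 1] else p_mismatch
                # Match: arrive from M, or close a gap from X or Y.
                fM[i][j] = emit * (
                    (1 - 2 * delta) * fM[i - 1][j - 1]
                    + (1 - epsilon) * (fX[i - 1][j - 1] + fY[i - 1][j - 1])
                )
            if i > 0:
                # Gap in the reference: open from M or extend from X.
                fX[i][j] = q * (delta * fM[i - 1][j] + epsilon * fX[i - 1][j])
            if j > 0:
                # Gap in the read: open from M or extend from Y.
                fY[i][j] = q * (delta * fM[i][j - 1] + epsilon * fY[i][j - 1])

    return fM[n][m] + fX[n][m] + fY[n][m]


if __name__ == "__main__":
    print(pair_hmm_forward("ACGT", "ACGT"))
    print(pair_hmm_forward("ACGT", "TTTT"))
```

Under these assumed parameters an exact match scores far higher than a read aligned against an unrelated reference, which is the property a probabilistic mapper relies on when weighting candidate locations.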

Expected Results

Why Inline SNP Calling? • Post-Processing • More disk space, less memory • Inline • Requires more memory • Less disk space • Can include specific probabilities for each read

Previous Optimizations • Two methods for speeding up mapping: • Entire genome on one machine • Split memory among different machines • Must normalize across all genome portions • MPI reduction
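The normalization step for the split-genome case can be sketched as follows. This is a single-process stand-in written under assumptions: the function name `normalize_across_partitions` and the input layout are hypothetical, and the plain `sum` over partial totals takes the place of the MPI reduction, which in an MPI program would typically be a sum all-reduce so that every machine sees the same total.

```python
def normalize_across_partitions(local_scores):
    """Rescale a read's alignment likelihoods so they sum to 1 genome-wide.

    local_scores: one list per genome partition, holding the likelihoods of
    the candidate locations that partition found for a single read.
    (Hypothetical layout, for illustration only.)
    """
    # Each partition sums its own candidate locations (local work).
    local_totals = [sum(scores) for scores in local_scores]

    # Stand-in for the reduction: combine partial sums into one global total.
    global_total = sum(local_totals)
    if global_total == 0.0:
        return [[0.0 for _ in scores] for scores in local_scores]

    # Every partition rescales its own scores by the shared total.
    return [[s / global_total for s in scores] for scores in local_scores]


if __name__ == "__main__":
    # Three partitions; the third found no candidate location for this read.
    print(normalize_across_partitions([[0.2, 0.1], [0.1], []]))
```

Without the shared total, each partition would normalize only against its own hits and overstate the weight of locations on its portion of the genome.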

Previous Optimizations

Memory Requirements • Human Genome (3 Gb) • HashMap ≈ 12 GB • 4 bits/character = 1.5 GB • 5 floating point values per base (plus N) = sizeof(float) * 5 * 3 Gb = 60 GB • Also stores total for easy computation = sizeof(float) * 3 Gb = 12 GB • Total of ≈ 90 GB per run
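The arithmetic on this slide can be checked with a few lines. This is a sketch assuming 4-byte floats and decimal gigabytes; the variable names are illustrative, and the 12 GB hash map figure is copied from the slide rather than derived.

```python
GENOME_BASES = 3e9   # human genome, about 3 billion bases
SIZEOF_FLOAT = 4     # bytes, assuming 32-bit floats
GB = 1e9             # decimal gigabytes

hash_map_gb = 12.0                                    # figure taken from the slide
sequence_gb = GENOME_BASES * 4 / 8 / GB               # 4 bits per character
per_base_gb = SIZEOF_FLOAT * 5 * GENOME_BASES / GB    # 5 floats per base
totals_gb = SIZEOF_FLOAT * GENOME_BASES / GB          # one running total per base

total_gb = hash_map_gb + sequence_gb + per_base_gb + totals_gb

if __name__ == "__main__":
    print(f"sequence: {sequence_gb} GB, per-base floats: {per_base_gb} GB, "
          f"totals: {totals_gb} GB, sum: {total_gb} GB")
```

The itemized components sum to 85.5 GB, which the slide rounds up to roughly 90 GB per run.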

Three Memory Optimizations • Normal (no optimization) • Integer discretization • Centroid discretization

Integer Discretization • Only need one floating point value (for total) and 1 byte/nucleotide. • “Parts per 255” • Biggest hit: Going into and out of “integer space”

Integer Discretization • Step 1: Convert from Integer Space • Step 2: Add from read to Genome • Step 3: Convert back to Integer Space
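The three steps can be sketched as below. This is an assumed encoding, not the GNUMAP-SNP source: each genome position keeps one float `total` plus one byte per nucleotide holding that nucleotide's share in parts per 255. The helper names `from_integer_space`, `to_integer_space`, and `add_read` are hypothetical.

```python
def from_integer_space(parts, total):
    """Step 1: expand parts-per-255 bytes back into floating point values."""
    return [p / 255.0 * total for p in parts]


def to_integer_space(values):
    """Step 3: compress floats into parts-per-255 bytes plus one float total."""
    total = sum(values)
    if total == 0.0:
        return [0] * len(values), 0.0
    return [round(v / total * 255.0) for v in values], total


def add_read(parts, total, contribution):
    """Apply one read's per-nucleotide contribution to a genome position."""
    values = from_integer_space(parts, total)                 # step 1
    values = [v + c for v, c in zip(values, contribution)]    # step 2
    return to_integer_space(values)                           # step 3


if __name__ == "__main__":
    parts, total = to_integer_space([3.0, 1.0, 0.0, 0.0, 0.0])
    print(parts, total)
    print(add_read(parts, total, [0.0, 1.0, 0.0, 0.0, 0.0]))
```

Every update pays for two conversions and one rounding step, which is consistent with the slide's remark that going into and out of integer space is the biggest cost.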

Centroid Discretization • Many states not used: • [255, 255, 255, 255, 255] • [0, 0, 0, 0, 0] • Many states not biologically relevant • SNP transition (common) vs. transversion (not likely) • MSA uses this compression to perform fast one-to-many alignment

Centroid Discretization (cont)

Centroid Discretization (cont) • Benefits • Doesn’t waste impossible or infrequently used space • Much smaller memory footprint • Drawbacks: • Slight overhead in converting from centroid to floating point spaces • Rounding error (how significant?)
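The idea of centroid discretization can be illustrated with a nearest-centroid lookup: each genome position stores only the index of the closest entry in a small codebook of plausible states. The codebook below, the value order (A, C, G, T, other), and the function name `nearest_centroid` are all assumptions for illustration; the transcript does not list the centroids actually used.

```python
# Hypothetical codebook over (A, C, G, T, other); not the talk's actual centroids.
CENTROIDS = [
    (1.0, 0.0, 0.0, 0.0, 0.0),  # A
    (0.0, 1.0, 0.0, 0.0, 0.0),  # C
    (0.0, 0.0, 1.0, 0.0, 0.0),  # G
    (0.0, 0.0, 0.0, 1.0, 0.0),  # T
    (0.5, 0.0, 0.5, 0.0, 0.0),  # A/G mix (transition)
    (0.0, 0.5, 0.0, 0.5, 0.0),  # C/T mix (transition)
]


def nearest_centroid(values):
    """Return the index of the centroid closest to the normalized values.

    Returns None when the position has no accumulated evidence.
    """
    total = sum(values)
    if total == 0.0:
        return None
    probs = [v / total for v in values]

    def squared_distance(centroid):
        return sum((p - c) ** 2 for p, c in zip(probs, centroid))

    return min(range(len(CENTROIDS)), key=lambda k: squared_distance(CENTROIDS[k]))


if __name__ == "__main__":
    print(nearest_centroid([9.0, 0.0, 1.0, 0.0, 0.0]))  # mostly A
    print(nearest_centroid([5.0, 0.0, 4.0, 0.0, 0.0]))  # A/G mix
```

Snapping to the nearest centroid is where the rounding error listed under drawbacks comes from: a position whose true distribution lies between two centroids is stored as one of them.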

Speed Comparison

Optimization Stats (chrX)

Conclusion • For high error rates, HMM approach is ideal, but requires more memory • Distributing the genome across processors doesn’t scale linearly • Discretization methods provide good memory reductions (up to 42%) • Centroid discretization performs poorly • Integer discretization can be used when available memory is low

Questions
