
Removing Sequential Bottlenecks in Analysis of Next-Generation Sequencing Data

Removing Sequential Bottlenecks in Analysis of Next-Generation Sequencing Data. Yi Wang, Gagan Agrawal, Gulcin Ozer and Kun Huang. The Ohio State University. HiCOMB 2014, May 19th, Phoenix, Arizona. Outline: Introduction, Sequence Data Format Converter Design, Experimental Results.


Presentation Transcript


  1. Removing Sequential Bottlenecks in Analysis of Next-Generation Sequencing Data Yi Wang, Gagan Agrawal, Gulcin Ozer and Kun Huang The Ohio State University HiCOMB 2014 May 19th, Phoenix, Arizona

  2. Outline Introduction Sequence Data Format Converter Design Experimental Results Conclusion

  3. Explosion of Next-Generation Sequencing Data • NGS Advantages • Faster and cheaper • E.g., over one billion short reads per instrument run • More accurate: higher resolution and deeper coverage • Challenges • Urgent need for turning raw data into knowledge • Parallelism is the key

  4. Historical Trends in Storage Prices vs. DNA Sequencing Costs • Reported by Lincoln Stein

  5. Varieties of NGS Data Formats • Different Formats • SAM (Sequence Alignment/Map) • The de facto text format for storing large nucleotide sequence alignments • BAM (Binary Alignment/Map) • The compressed, indexable, binary form of the SAM format • Indexing is supported by a BAI (BAM Index) file • Other formats • BED (Browser Extensible Data), FASTA, FASTQ, WIG (wiggle), GFF (General Feature Format), etc.

  6. Analysis Pipeline • Current Pipeline • Parallelism mainly focuses on the analysis steps, e.g., SNP discovery and BLAST • Reality • Cross-utilization problem: sequencing data ≠ input expected by the analysis tools • Some other analysis steps stay sequential • Need to remove these other sequential bottlenecks

  7. Motivation: Removing Other Sequential Bottlenecks • Parallel Format Conversion • Current format conversion commonly uses a single core • Current downstream tools may not be interchangeable across different aligners • Not hard to implement, but important to scale out • Parallelizing Certain Statistical Analysis Steps • E.g., parallel analysis on the histogram data

  8. Framework (only the first component is discussed today) • Sequence Data Format Converter • Input: SAM/BAM • Output: • BAM/SAM • FASTA, FASTQ, BED, BEDGRAPH, JSON and YAML • Statistical Analysis Module • Parallelize other statistical analysis steps • E.g., non-local means (NL-Means) and false discovery rate (FDR) computation

  9. Outline Introduction Sequence Data Format Converter Design Experimental Results Conclusion

  10. Sequence Data Format Converter • 3 Converter Instances • SAM Format Converter • BAM Format Converter • Preprocessing-Optimized SAM Format Converter • Supports partial format conversion on a specific chromosome region

  11. SAM Format Converter • No communication among processes after partitioning • Partitioning is the key step for parallelization • Extensibility and programmability
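Because processes do not communicate after partitioning, each one can convert its own chunk of records independently. The following is a minimal sketch of such a per-chunk conversion from SAM to FASTA, not the authors' implementation. The function name `sam_to_fasta` is hypothetical. It relies only on the SAM layout in which header lines begin with `@` and the tab-separated fields 1 and 10 hold the read name and sequence, and it ignores details such as restoring the original orientation of reverse-strand reads.

```python
def sam_to_fasta(sam_chunk: str) -> str:
    """Convert the SAM records in one chunk to FASTA text.

    Illustrative only: each record is handled on its own, so chunks can be
    converted by separate processes without any communication.
    """
    out = []
    for line in sam_chunk.splitlines():
        # Skip blank lines and SAM header lines (which start with '@').
        if not line or line.startswith("@"):
            continue
        fields = line.split("\t")
        qname, seq = fields[0], fields[9]  # SAM columns 1 (QNAME) and 10 (SEQ)
        out.append(f">{qname}\n{seq}\n")
    return "".join(out)
```

Each process would call this on the byte range it was assigned and write its own output file.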

  12. Partitioning Algorithm • Key: each SAM record is delimited by a line break • Initial even partitioning • Adjust partition boundaries by detecting line breaks
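The two steps above can be sketched as follows. This is an illustration under stated assumptions rather than the code used in the talk: the function name `partition_offsets` is hypothetical, the whole input is held in memory as `bytes` for simplicity, and each tentative boundary is moved forward to the start of the next record.

```python
def partition_offsets(data: bytes, num_parts: int) -> list[tuple[int, int]]:
    """Split `data` into byte ranges that each begin at the start of a line."""
    size = len(data)
    bounds = [0]
    for i in range(1, num_parts):
        # Step 1: initial even partitioning gives a tentative boundary.
        tentative = i * size // num_parts
        if tentative <= bounds[-1]:
            continue  # an earlier adjustment already moved past this point
        # Step 2: adjust the boundary to just after the nearest line break
        # at or after position tentative - 1.
        nl = data.find(b"\n", tentative - 1)
        adjusted = size if nl == -1 else nl + 1
        if bounds[-1] < adjusted < size:
            bounds.append(adjusted)
    bounds.append(size)
    return list(zip(bounds[:-1], bounds[1:]))
```

A real converter would seek within the file instead of loading it, but the boundary adjustment is the same. Fewer than `num_parts` ranges may be returned when records are long relative to the partition size.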

  13. BAM Format Converter • Figure annotation: cannot be parallelized because of the third-party API • Challenge • No explicit delimiter: • Even partitioning -> unparsable records • Solution: add a preprocessing phase • Partition data by supporting random access

  14. BAMX and BAIX • BAMX (BAM eXtended) File • Transform each varying-length BAM record into a regular-layout BAMX record • Align varying-length BAM fields by padding • BAIX (BAI eXtended) File • Index file of the BAMX file • Store the alignment starting positions in BAM (logically) and in BAMX (physically)
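To see why padding enables random access, consider the toy layout below. It is only a sketch of the general idea: the slides do not give the actual BAMX layout, so the record format here (a 4-byte little-endian length prefix followed by a zero-padded payload) and the names `to_fixed_records` and `read_record` are assumptions made for illustration.

```python
import struct

def to_fixed_records(records: list[bytes]) -> tuple[bytes, int]:
    """Pad variable-length records to one common width.

    Each output record is a 4-byte length prefix plus the payload padded
    with zero bytes. Returns the packed blob and the record width.
    """
    width = 4 + max((len(r) for r in records), default=0)
    out = bytearray()
    for r in records:
        out += struct.pack("<I", len(r)) + r.ljust(width - 4, b"\x00")
    return bytes(out), width

def read_record(blob: bytes, width: int, i: int) -> bytes:
    """Fetch record i directly: its offset is simply i * width."""
    start = i * width
    (n,) = struct.unpack_from("<I", blob, start)
    return blob[start + 4 : start + 4 + n]
```

With a regular layout like this, any process can jump straight to its share of the records, which is what even partitioning requires. The cost is extra storage for the padding.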

  15. Partial Conversion • If only interested in a subset, no need for full conversion • Based on the BAIX file • Given logical alignment starting and ending positions, locate the physical starting and ending positions in the BAMX file (by binary search) • Evenly partition the subset and proceed in parallel
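The lookup and partitioning steps above might look like the sketch below. It assumes a sorted in-memory list of alignment starting positions standing in for the BAIX index and fixed-width records addressed by index. The names `locate_region` and `split_evenly` are hypothetical. As a simplification, it selects records whose starting position falls inside the requested range and ignores alignments that begin earlier but overlap it.

```python
from bisect import bisect_left, bisect_right

def locate_region(starts: list[int], region_start: int, region_end: int) -> tuple[int, int]:
    """Binary-search the sorted start positions for the half-open record
    index range [first, last) with start in [region_start, region_end]."""
    return bisect_left(starts, region_start), bisect_right(starts, region_end)

def split_evenly(first: int, last: int, num_procs: int) -> list[tuple[int, int]]:
    """Divide the record index range [first, last) evenly among processes."""
    n = last - first
    return [
        (first + p * n // num_procs, first + (p + 1) * n // num_procs)
        for p in range(num_procs)
    ]
```

Multiplying a record index by the fixed record width would then give the physical byte offset at which each process starts reading.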

  16. Preprocessing-Optimized SAM Format Converter • Figure labels: M procs, N procs, M × N target files • Main Ideas • Preprocessing can also optimize the SAM format conversion • Such preprocessing can be parallelized because of the easy partitioning on the SAM format

  17. Outline Introduction Sequence Data Format Converter Design Parallelization of Statistical Analysis Steps Experimental Results Conclusion

  18. Experimental Setup • Dataset • Whole genome DNA-sequencing of three mouse samples • Approximately 125 million sequences providing about 40-fold coverage of the genome • In the SAM/BAM format • Cluster • 8 GB Memory • Up to 32 8-core machines (256 cores in total)

  19. Performance of SAM Format Converter • Input: 100 GB SAM data • Output: BED, BEDGRAPH and FASTA

  20. Performance of BAM Format Converter • Input: 117 GB BAM data • Output: BED, BEDGRAPH and FASTA

  21. SAM Format Converter Comparison: Preprocessing-Optimized vs. Original • Input: 15.7 GB BAM data • Output: BED, BEDGRAPH and FASTA

  22. Outline Introduction Sequence Data Format Converter Design Parallelization of Statistical Analysis Steps Experimental Results Conclusion

  23. Conclusion • In the NGS analysis pipeline, the overall latency cannot be reduced unless all sequential bottlenecks are removed • The first framework that can easily support parallel sequence format conversion in a distributed environment • SAM format converter • BAM format converter • Preprocessing-optimized SAM format converter
