GPGPU in NGS Bioinformatics

Presentation Transcript

  1. GPGPU in NGS Bioinformatics: A massively parallel solution to a massively parallel problem • Brian Lam, PhD • HPC4NGS Workshop, May 21-22, 2012 • Principe Felipe Research Center, Valencia

  2. Talk Outline • Next-generation sequencing and current challenges • What is GPGPU computing? • The use of GPGPU in NGS bioinformatics • How it works – the BarraCUDA project • System requirements for GPU computing • What if I want to develop my own GPU code? • What to look out for in the near future? • Conclusions

  4. DNA sequencing “DNA sequencing includes several methods and technologies that are used for determining the order of the nucleotide bases A, G, C, T in a molecule of DNA.” - Wikipedia

  5. Next-generation DNA sequencing • Commercialised in 2005 by 454 Life Sciences • Sequencing in a massively parallel fashion

  6. How it works • An example: whole genome sequencing • Extract genomic DNA from blood/mouth swabs • Break into small DNA fragments of 200-400 bp • Attach DNA fragments to a surface (flow cells/slides/microtitre plates) at a high density • Perform concurrent “cyclic sequencing reactions” to obtain the sequence of each of the attached DNA fragments • An Illumina HiSeq 2000 can interrogate 825K spots/mm²

  7. Capturing the sequencing signals GTCCTGA ATTCNGG TATTTTT Not to scale

  8. What do we get from the sequencers? Billions of short DNA sequences, also called sequence reads, ranging from 25 to 400 bp

  9. The throughput of NGS has increased dramatically

  10. Current bioinformatics pipeline Image Analysis → Base Calling → Sequence Alignment → Variant Calling, Peak Calling

  11. Just how complex is the pipeline? Source: Genologics

  12. Talk Outline • Next-generation sequencing and current challenges • What is Many-core/GPGPU computing? • The use of GPGPU in NGS bioinformatics • How it works – the BarraCUDA project • System requirements for GPU computing • What if I want to develop my own GPU code? • What to look out for in the near future? • Conclusions

  13. Many-core computing architectures • A physical die that contains a ‘large number’ of processing cores • i.e. computation can be done in a massively parallel manner • Modern graphics cards (GPUs) consist of hundreds to thousands of computing cores

  14. Just how fast are today’s GPUs

  15. If GPUs are so fast why do we need CPUs? • GPUs are fast, but there is a catch: • SIMD – single instruction, multiple data • versus • CPUs are powerful multi-purpose processors • MIMD – multiple instructions, multiple data

  16. CPU vs GPU

  17. SIMD – pros and cons • The very same (low-level) instruction is applied to multiple data at the same time • e.g. a GTX680 can perform an addition on 1536 data points at a time, versus 16 on a 16-core CPU • Branching results in serialisation • The ALUs on a GPU are usually much more primitive than their CPU counterparts
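The "branching results in serialisation" point can be illustrated with a toy model. The sketch below is not GPU code: it is a plain Python simulation, and the function names `simd_passes` and `run_branch` are made up for illustration. It assumes a group of lanes that must all execute the same instruction, so an if/else is handled by running each side in turn while the lanes on the other side are masked off.

```python
def simd_passes(lane_conditions):
    """Count the passes a lock-step group of lanes needs for one if/else.

    lane_conditions holds one boolean per lane: True takes the 'if' side,
    False takes the 'else' side. Each side with at least one lane on it
    costs one pass, during which the other lanes sit idle.
    """
    takes_if = any(lane_conditions)
    takes_else = not all(lane_conditions)
    return int(takes_if) + int(takes_else)


def run_branch(values):
    """Apply 'x * 2 if x is even else x + 1' to every lane, SIMD style."""
    mask = [v % 2 == 0 for v in values]
    out = list(values)
    # Pass 1: only the lanes whose mask is set are active.
    if any(mask):
        out = [v * 2 if m else o for v, m, o in zip(values, mask, out)]
    # Pass 2: the remaining lanes run while the first group waits.
    if not all(mask):
        out = [v + 1 if not m else o for v, m, o in zip(values, mask, out)]
    return out, simd_passes(mask)


if __name__ == "__main__":
    print(run_branch([2, 4, 6, 8]))  # all lanes agree: one pass
    print(run_branch([1, 2, 3, 4]))  # lanes disagree: two passes
```

When every lane takes the same side, the branch costs one pass; as soon as the lanes disagree, both sides are executed one after the other, which is the serialisation the slide refers to.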

  18. GPU in scientific computing • Scientific computing often deals with large amounts of data and, on many occasions, applies the same set of instructions to these data. • Examples: • Monte Carlo simulations • Image analysis • Next-generation sequencing data analysis

  19. Why bother? • Low capital cost and energy efficient • Dell 12-core workstation: £5,000, ~1kW • Dell 40-core computing cluster: £20,000+, ~6kW • NVIDIA GeForce GTX680 (1536 cores): £400, <0.2kW • NVIDIA C2070 (448 cores): £1000, 0.2kW • Many supercomputers also contain multiple GPU nodes for parallel computations

  20. GPGPU in bioinformatics • Examples: • CUDASW++ 6.3x • MUMmerGPU 3.5x • GPU-HMMer 60-100x

  21. Talk Outline • Next-generation sequencing and current challenges • What is Many-core/GPGPU computing? • The use of GPGPU in NGS bioinformatics • How it works – the BarraCUDA project • System requirements for GPU computing • What if I want to develop my own GPU code? • What to look out for in the near future? • Conclusions

  22. A few GPGPU programs are available • Ion Torrent server (T7500 workstation) uses GPUs for base-calling • MUMmerGPU – comparisons among genomes • BarraCUDA, SOAP3 – short read alignments

  23. Talk Outline • Next-generation sequencing and current challenges • What is Many-core/GPGPU computing? • The use of GPGPU in NGS bioinformatics • How it works – the BarraCUDA project • System requirements for GPU computing • What if I want to develop my own GPU code? • What to look out for in the near future? • Conclusions

  24. Heterogeneous computing

  25. Current bioinformatics pipeline Image Analysis → Base Calling → Sequence Alignment → Variant Calling, Peak Calling

  26. Sequence alignment • Sequence alignment is a crucial step in the bioinformatics pipeline for downstream analyses • This step often takes many CPU hours to perform • Usually done on HPC clusters

  27. The BarraCUDA Project – an example of GPGPU computing • The main objective of the BarraCUDA project is to develop software that runs on GPU/many-core architectures • i.e. to map sequence reads in the same massively parallel way as they come out of the NGS instrument

  28. Software Pipeline [Diagram: the CPU reads the read library, then copies the genome and the read library to the GPU; the GPU runs many alignments concurrently against the genome; the alignment results are copied back to the CPU, which writes them to disk]

  29. Burrows-Wheeler transform • Originally intended for data compression; performs a reversible transformation of a string • In 2000, Ferragina and Manzini introduced a BWT-based index data structure for fast substring matching at O(n) • Substring matching is performed in a tree-traversal-like manner • Used in major sequencing read mapping programs, e.g. BWA, Bowtie, SOAP2

  30. How it works – a backward search algorithm matching substring ‘banan’

  31. In mathematical terms (modified from Li & Durbin, Bioinformatics 2009, 25:14, 1754-1760)
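In the notation of Li & Durbin (2009), the backward-search step can be written as the recurrence below. It is a sketch that matches the k and l updates in the pseudocode on the next slide, not a copy of the original slide. Here B is the BWT string, C(a) is the number of symbols in the reference X that are lexicographically smaller than a, O(a, i) is the number of occurrences of a in B[0..i] (with O(a, -1) = 0), and the lower and upper bounds of the suffix array interval of a string W are written as underlined and overlined R(W).

```latex
\begin{aligned}
\underline{R}(aW) &= C(a) + O\bigl(a,\ \underline{R}(W) - 1\bigr) + 1 \\
\overline{R}(aW)  &= C(a) + O\bigl(a,\ \overline{R}(W)\bigr)
\end{aligned}
```

The string aW occurs in X if and only if the lower bound does not exceed the upper bound, which is the `k <= l` test in the pseudocode.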

  32. Programming code (modified from Li & Durbin, Bioinformatics 2009, 25:14, 1754-1760)

      BWT_exactmatch(READ, i, k, l) {
        if (i < 0) then return RESULTS;       // whole read consumed: [k, l] holds the matches
        k = C(READ[i]) + O(READ[i], k-1) + 1;
        l = C(READ[i]) + O(READ[i], l);
        if (k <= l) then BWT_exactmatch(READ, i-1, k, l);
      }

      main() {
        Calculate reverse BWT string B from reference string X
        Calculate arrays C(.) and O(.,.) from B
        Load READS from disk
        For every READ in READS do {
          i = |READ| - 1;   // Position (last base of the read)
          k = 0;            // Lower bound
          l = |X|;          // Upper bound
          BWT_exactmatch(READ, i, k, l);
        }
        Write RESULTS to disk
      }
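The pseudocode above can be tried out on the ‘banana’ example with a small Python sketch. Several things here are assumptions made for illustration rather than part of the talk: the function names `build_index` and `exact_match`, the naive suffix sort (fine for a toy string, far too slow for a genome), the use of the plain BWT of X instead of the reverse BWT, and a loop in place of the recursion. The O table is stored shifted by one, so `O[a][i]` counts occurrences of `a` in `B[0..i-1]`.

```python
def build_index(reference):
    """Build the suffix array and the C and O tables for a short reference."""
    text = reference + "$"                      # '$' sorts before the letters
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)      # text[-1] is '$' when i == 0
    alphabet = sorted(set(reference))
    # C[a]: number of symbols in the reference that are smaller than a.
    C = {a: sum(ch < a for ch in reference) for a in alphabet}
    # O[a][i]: occurrences of a in bwt[0..i-1], so O[a][0] == 0.
    O = {a: [0] for a in alphabet}
    for ch in bwt:
        for a in alphabet:
            O[a].append(O[a][-1] + (ch == a))
    return sa, C, O


def exact_match(read, sa, C, O):
    """Return the sorted start positions of read in the reference."""
    k, l = 0, len(sa) - 1                       # interval of the empty string
    for a in reversed(read):                    # backward search: last base first
        if a not in C:
            return []
        k = C[a] + O[a][k] + 1                  # O[a][k] plays the role of O(a, k-1)
        l = C[a] + O[a][l + 1]                  # O[a][l + 1] plays the role of O(a, l)
        if k > l:                               # empty interval: no match
            return []
    return sorted(sa[k:l + 1])


if __name__ == "__main__":
    sa, C, O = build_index("banana")
    print(exact_match("banan", sa, C, O))
    print(exact_match("ana", sa, C, O))
```

Each base of the read costs one pair of table lookups, so the work per read grows with the read length and not with the size of the reference.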

  33. Porting to GPU • Simple data parallelism • Used mainly the GPU for matching

  34. Porting to GPU

      __device__ BWT_GPU_exactmatch(W, i, k, l) {
        if (i < 0) then return RESULTS;
        k = C(W[i]) + O(W[i], k-1) + 1;
        l = C(W[i]) + O(W[i], l);
        if (k <= l) then BWT_GPU_exactmatch(W, i-1, k, l);
      }

      __global__ GPU_kernel() {
        W = READS[thread_no];   // one read per GPU thread
        i = |W| - 1;            // Position (last base of the read)
        k = 0;                  // Lower bound
        l = |X|;                // Upper bound
        BWT_GPU_exactmatch(W, i, k, l);
      }

      main() {
        Calculate reverse BWT string B from reference string X
        Calculate arrays C(.) and O(.,.) from B
        Load READS from disk
        Copy B, C(.) and O(.,.) to GPU
        Copy READS to GPU
        Launch GPU_kernel with <<|READS|>> concurrent threads
        Copy alignment results back from GPU
        Write RESULTS to disk
      }

  35. How fast is that? • Very fast indeed: using a Tesla C2050, we can match 25 million 100 bp reads to the BWT in just over 1 min. • But… is this relevant?

  36. Inexact matching? Matching substring ‘anb’, where ‘b’ is substituted with an ‘a’

  37. Inexact matching requires base substitution within the query substring • Search space complexity = O(9^n)! Klus et al., BMC Res Notes 2012 5:27

  38. BarraCUDA started its life as a GPU version of BWA • It partially worked! • 10% faster than 8x X5472 cores @ 3GHz • BWA uses a greedy breadth-first search approach (takes up to 40MB per thread) • Not enough workspace for thousands of concurrent kernel threads (@ 4KB) – i.e. reduced accuracy – NOT GOOD ENOUGH!

  39. BarraCUDA uses a depth-first search approach [Diagram: search tree with two hits]
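A depth-first search over the BWT with a mismatch budget can be sketched in Python as follows. This is an illustration of the general idea only, not BarraCUDA's actual code: it handles substitutions but no gaps, the names `build_index` and `inexact_match` are made up, and the index is built with a naive suffix sort. The explicit stack is popped last-in-first-out, so one branch is followed to the end before its siblings are tried.

```python
def build_index(reference):
    """Suffix array plus C and O tables; O[a][i] counts a in bwt[0..i-1]."""
    text = reference + "$"
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(reference))
    C = {a: sum(ch < a for ch in reference) for a in alphabet}
    O = {a: [0] for a in alphabet}
    for ch in bwt:
        for a in alphabet:
            O[a].append(O[a][-1] + (ch == a))
    return sa, C, O


def inexact_match(read, sa, C, O, max_mismatches):
    """Start positions where read matches with at most max_mismatches substitutions."""
    hits = set()
    # Stack entry: (index of the next base to consume, k, l, mismatches left).
    stack = [(len(read) - 1, 0, len(sa) - 1, max_mismatches)]
    while stack:
        i, k, l, budget = stack.pop()           # LIFO pop gives depth-first order
        if i < 0:                               # whole read consumed: record hits
            hits.update(sa[k:l + 1])
            continue
        for a in C:                             # try every base at this position
            cost = 0 if a == read[i] else 1     # a substitution uses up budget
            if cost > budget:
                continue
            nk = C[a] + O[a][k] + 1
            nl = C[a] + O[a][l + 1]
            if nk <= nl:                        # follow only non-empty intervals
                stack.append((i - 1, nk, nl, budget - cost))
    return sorted(hits)


if __name__ == "__main__":
    sa, C, O = build_index("banana")
    print(inexact_match("anb", sa, C, O, 1))    # 'b' substituted with 'a'
    print(inexact_match("anb", sa, C, O, 0))    # no exact occurrence
```

Branches whose interval becomes empty are dropped at once, and the stack only ever holds the untried siblings along the current path, which is the property that makes a depth-first layout attractive when each thread has very little workspace.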

  40. SIMD – the catch • The very same (low-level) instruction is applied to multiple data at the same time • e.g. a GTX680 can perform an addition on 1536 data points at a time, versus 16 on a 16-core CPU • Branching results in serialisation • The ALUs on a GPU are usually much more primitive than their CPU counterparts

  41. Branch serialisation

  42. Branch divergence

  43. Multi-kernel design [Diagram labels: Thread 1; Thread 1.1; Thread 1.2; branches A and B; CPU thread queue: A, B; two hits]

  44. Mapping accuracy Klus et al., BMC Res Notes 2012 5:27

  45. Mapping speed [Figure: time taken (min)] Klus et al., BMC Res Notes 2012 5:27

  46. Porting to GPU • Simple data parallelism • Used mainly the GPU for matching

  47. Scalability Klus et al., BMC Res Notes 2012 5:27

  48. Talk Outline • Next-generation sequencing and current challenges • What is Many-core/GPGPU computing? • The use of GPGPU in NGS bioinformatics • How it works – the BarraCUDA project • System requirements for GPU computing • What if I want to develop my own GPU code? • What to look out for in the near future? • Conclusions

  49. System requirements • Hardware • The system must have at least one decent GPU • NVIDIA GeForce 210 will not work! • One or more PCIe x16 slots • A decent power supply with appropriate power connectors • 550W for one Tesla C2075, plus 225W for each additional card • Don’t use any Molex converters! • Ideally, dedicate cards to computation and use a separate cheap card for display

  50. System requirements