1 / 42

Massively Parallel Solutions for Molecular Sequence Analysis

Massively Parallel Solutions for Molecular Sequence Analysis. Bertil Schmidt School of Computer Engineering, Nanyang Technological University , Singapore Heiko Schröder School of Computer Science and Information Technology, RMIT University, Melbourme, Australia Manfred Schimmler

Télécharger la présentation

Massively Parallel Solutions for Molecular Sequence Analysis

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Massively Parallel Solutions for Molecular Sequence Analysis Bertil Schmidt School of Computer Engineering, Nanyang Technological University , Singapore Heiko Schröder School of Computer Science and Information Technology, RMIT University, Melbourme, Australia Manfred Schimmler Institut für Datentechnik und Kommunikationsnetze, TU Braunschweig, Germany

  2. Contents • Motivation • Smith-Waterman Algorithm • Parallelization on the Hybrid Architecture • Parallelization on the Fuzion 150 • Performance Evaluation • Conclusion and Future Work

  3. Motivation • Genetic sequence databases are growing exponentially • Growth rate will continue, since multiple concurrent genome projects have begun, with more to come

  4. Motivation • Discovered sequences are analyzed by comparison with databases • Complexity of sequence comparison is proportional to the product of query size times database size •  Analysis too slow on sequential computers • Two possible approaches • Heuristics, e.g. BLAST,FastA, but the more efficient the heuristics, the worse the quality of the results • Parallel Processing, get high-quality results in reasonable time

  5. Mycobacterium Smegmatis Mycobacterium Tuberculosis 3918 Protein Sequences 1.329.298 AminoAcids 4289 Protein Sequences 1.359.008 AminoAcids Full Genome Comparison • related Organisms, but Tuberculosis causes a disease  find common and different parts • 16106 pairwise sequence comparisons • Many Genome-Genome Comparisons will be required in the near future

  6. GGHSRLILSQLGEEG.RLLAIDRDPQAIAVAKT....IDDPRFSII |||::::| : |::| ||:::||||:|:|||:: ::| |:::: GGHAERFL.E.GLPGLRLIGLDRDPTALDVARSRLVRFAD.RLTLV Slower Search Speed Faster Data Quality Lower Higher Protein Sequence Alignment • BLAST, FastA, Smith-Waterman Smith- Waterman FastA BLAST

  7. Smith-Waterman Algorithm • Optimal local alignment of two sequences • Performs an exhaustive search for the optimal local alignment • Complexity O(nm) for sequence lengths n and m • Based on the 'dynamic programming' (DP) algorithm • Fill the DP matrix using a substitution (mutation) matrix • Find the maximal value (score) in the matrix • Trace back from the score until a 0 value is reached

  8. Smith-Waterman Algorithm • Aligning S1 and S2 of length n and m using Recurrences: • Calculate three possible ways to extend the alignment • by one AminoAcid (AA) in each sequence • by one AA in the first sequence and align it with a gap in the second • by one AA in the second sequence and align it with a gap in the first

  9. A T C T C G T A T G A T G  0 0 0 0 0 0 0 0 0 0 0 0 0 0 G 0 T 0 0 2 1 2 1 1 4 3 2 1 1 3 2 C 0 0 1 4 3 4 3 3 3 2 1 0 2 2 T 0 0 2 3 6 5 4 5 4 5 4 3 2 1 A 0 2 2 2 5 5 4 4 7 6 5 6 5 4 T 0 1 4 3 4 4 4 6 5 9 8 7 8 7 C 0 0 3 6 5 6 5 5 5 8 8 7 7 7 A 0 2 2 5 5 5 5 4 7 7 7 10 9 8 C 0 1 1 4 4 7 6 5 6 6 6 9 9 8 A T C T C G T A T G A T G G T C T A T C A C Smith-Waterman Algorithm Align S1=ATCTCGTATGATGS2=GTCTATCAC 0 0 0 0 0 0 2 1 0 0 2 1 0 2 2 =1, =1 4 3 5 7 9 8 10

  10. Systola 1024: PC add-on board with 1024 processors (ISATEC, Germany) • Fuzion 150: 1536 processors on a single chip (Clearspeed Technology, UK) Parallel Architectures for Bioinformatics • Embedded Massively Parallel Accelerators

  11. Systola1024 Systola1024 Systola1024 Systola1024 Systola1024 Systola1024 Systola1024 Systola1024 High speed Myrinet switch Systola1024 Systola1024 Systola1024 Systola1024 Systola1024 Systola1024 Systola1024 Systola1024 Parallel Architectures for Bioinformatics • Supercomputer performance at low cost • combines SIMD and MIMD paradigm within a parallel architecture Hybrid Computer

  12. Previous Applications • Scientific Computing • Volume Visualization • Automatic Visual Quality Control • Cryptography • Computer Tomography • Video Compression • Range of Transforms (Fourier, Wavelet, Hough, Radon) • Computer Graphics

  13. RAM NORTH RAM WEST Controller program memory host computer bus ISA Interface processors Architecture of Systola 1024 • Instruction Systolic Array: • 32  32 mesh of processing elements • wavefront instruction execution

  14. - + - - * - + - - - * * * * + - + * + - + + * * - + + * * + + * - + + * * + + + + * * column selectors + instructions - * + * - + - - + * row selectors Instruction Systolic Array • wavefront instruction execution  fast accumulation operations (e.g. row sum, broadcast, ringshift)

  15. A P1 P2 P13  A T C T C G T A T G A T G  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 1 0 0 2 1 0 2 G 0 0 0 0 0 2 T 0 0 0 2 2 1 1 1 2 2 1 1 4 3 2 1 1 3 2 3 2 B C 0 0 0 1 1 3 4 4 3 4 3 3 3 2 1 0 2 2 4 T 0 0 0 3 2 2 3 6 5 4 5 4 5 4 3 2 1 A 0 2 2 2 2 2 5 5 4 4 7 6 5 6 5 4 T 0 1 4 3 4 4 4 6 5 9 8 7 8 7 C 0 0 3 6 5 6 5 5 5 8 8 7 7 7 1 A 0 2 2 5 5 5 5 4 7 7 7 10 9 8 C 0 1 1 4 4 7 6 5 6 6 6 9 9 8 Parallelization of Smith-Waterman • matrix cells along a single diagonal are computed in parallel • comparison is performed in A+B1 steps on A PEs 0 5

  16. a1023 a1022 a992 a63 a62 a32 bk….b1b0 a31 a30 a0 Mapping onto Systola 1024 a: query sequence (equal to 1024) • Subject sequences can be pipelined with only 1 step delay  k steps for subject sequence of length k b: subject sequence …c1c0 X • Efficient routing on the ISA: Row Ringshift and Broadcast

  17. Query sequence length 256 512 1024 2048 4096 Systola 1024 speedup to PIII 850 294 5 577 6 1137 6 2241 6 4611 6 Cluster of 16 Systolas speedup to PIII 850 20 81 38 86 73 91 142 94 290 94 Performance Evaluation • Scan times in seconds for TrEMBL 14 (351’834 Protein Sequences) for various query sequence lengths • Parallel implementation scales linearly with sequence length and number of PCs • Computing time dominates data transfer time need a state-of-the-art architecture

  18. Fuzion 150 Architecture Linear SIMD Array 1536PEs each with 2 Kbytes DRAM SIMD Controller Instruction Fetch • 0.25-m, single-chip, SIMD architecture • 1536 PEs @ 200 MHz  300 GOPS • 600 GB/s on-chip, 6.4 GB/s off-chip bandwidth • Multithreading (control units interact via semaphores) • developed by Clearspeed Technology (UK) for graphics, networking processing Local Memory Host AGP Rambus FUZION Bus 1,2 or 4 Channels (6.4 GB/s) 32-bit EPU (ARC) Video I/O Display

  19. Instructions ALU (8 bits) Register file 32 Bytes Left PE Right PE PE Memory 2 KByte DRAM Block I/O Channel Local Memory Fuzion 150 Architecture Block 5 Fuzion Bus PE (5,0) PE (5,1) PE (5,255) Block 1 PE (1,0) PE (1,1) PE (1,255) Block 0 PE (0,0) PE (0,1) PE (0,255)

  20. a1535 a1534 a1280 a511 a510 a256 a0 a1 a255 Mapping onto the Fuzion 150 Block 5 a: query sequence (equal to 1536) Block 1 b: subject sequence Block 0 bk….b1b0 …c1c0 X • No fast global communication  2-step local communication • Subject sequence can be pipelined with only step delay

  21. Mapping onto the Fuzion 150 • Reduce communication time • Assign 16 AAs to each PE  query lengths up to 24576 AAs can be processed within a single pass • Partitioning for query lengths <24576: • each subarray of corresponding size computes the alignment of the same query sequence with different subject sequences

  22. Query sequence length 256 512 1024 2048 4096 Fuzion 150 speedup to PIII 850 12 136 22 151 42 157 82 163 162 165 Performance Evaluation • Scan times in seconds for TrEMBL 14 (351’834 Protein Sequences) for various query sequence lengths • Parallel implementation scales linearly with sequence length • Computing time dominates data transfer time

  23. Performance Evaluation • Normalized time Comparison for a 10 Mbase search on different parallel architectures with different query length • 4faster than 16K-PE MasPar • 6faster than Kestrel • 5faster than SAMBA (special-purpose 3-board architecture)

  24. Cluster of Systola 1024 speedup to PIII 850 17 min 79 Fuzion 150 speedup to PIII 850 11 min 133 Performance Evaluation for Full Genome Comparison • Scan times for pairwise protein sequence comparison of Mycobacterium Tuberculosis and Escherichia Coli • Comparison has to be performed for several parameters (Substitution matrices, gap penalties) • Mycobacterium Smegmatis will be published later this year • Results of the comparison will be interpreted with the Centre for Molecular Cell Biology, NUS, Singapore

  25. Conclusions and Future Work • Demonstrated how fine-grained parallel architectures can be applied efficiently for Comparative Genomics • Significant runtime savings for genome comparisons and database searching  More Discovery Is Possible at a good price-performance ratio • Other Computational Biology applications of interest to us: • ClustalW • HMM • pattern matching algorithms, such as inverted repeats, short tandem repeats, etc • Availability of accelerators as a special-resource in a Grid Environment

  26. Contents • Protein Structure • Protein Structure Prediction • Approach based on Local Protein Structure • Refinements • Conclusions and Future Work

  27. R side chain amine group carboxyl group alpha carbon(with attached hydrogen) Protein Structure • Proteins are large molecules composed of smaller molecules called amino acids • There are 20 kinds of amino acids found in natural proteins • All share a common structure

  28. Protein Structure

  29. From Primary to Tertiary Structure • A protein’s 3D shape is determined by its primary amino acid sequence (Anfinsen, 1963) • Predicting tertiary structure from amino acid sequence is an unsolved problem • Difficult to model the energies that stabilize a protein molecule • Conformational search space is enormous

  30. Target amino acid sequence YLAADTYK Template Fold library FISSETCN MEPSSYV TGLIRKN Template amino acid sequence 7 21 2 Target/template Score: Prediction Methods • Given an amino acid sequence: • search a set of known folds by aligning sequence and a template fold representative • predict the fold that gets the best scoring alignment

  31. Prediction Methods • This method is very effective when target and template have >30% sequence identity • Approximately 1/3 of protein sequences can be assigned folds and modeled this way • Our aim is to contribute to determine tertiary structures in case matching sequences cannot be found

  32. Local structure and prediction • What is Local structure ? • describes environment of an amino acid • an amino acid’s relationship to neighbors • we use this information to predict structure from primary sequence

  33. Dihedral Angles • The 6 atoms in each peptide unit lie in the same plane •  and  free to rotate • The structure of a protein is almost totally determined, if all angles  and  are known

  34. Idea of our Approach N Back bone • Stiff  free • local predictability  database of sub-chain structures • reduction of the number of degrees of freedom by 10, reduces the computation time significantly in combination with a global optimization algorithm (e.g. GA or SA) C C  and  Side chains

  35. multiple Selected PDB structures Histogram for each amino acids pair Dihedral angle extraction stiff flexible Classification of Dihedral Angles

  36. Frequency Frequency ALA-ALA GLY-ILE Frequency LEU-ARG Classification of Dihedral Angles multiple Stiff flexible

  37. multiple Selected PDB structures Histogram for each amino acids pair Dihedral angle extraction stiff flexible Classification of Dihedral Angles • Stiff angles: determine mean value • Multiple angles: determine sequence of mean values, one for each peak in decreasing order of these peaks • Flexible angles: determine mean value and mark as flexible

  38. Prediction based on Classification • Given a sequence of amino acids, find the subsequence in which all angles are of type stiff • predict structure of these subsequences, using the mean values of the corresponding histograms

  39. Prediction based on Classification • Part of a protein predicted with this method (backbone of a helix, original structure on the left, predicted structure on the right) • Successfully predicted certain stiff structures of subsequences up to the length of 15

  40. Refinement of the method • For multiple angles: • consider sequences of length 3 or 4: • extract sequences (C,A,B,D) and determine the histogram of angles  and related to the peptide chain between A and B • if histogram for  for amino acids (A,B) is multiple, check if angle for (A,B,C,D) is stiff • with longer subsequences the occurrences of these sequences drops dramatically

  41. Refinement of the method • For multiple angles: • if an amino acid sequence has only a small number of multiple edges, it is possible to try all combinations of possible peaks • many combinations lead to collisions in part of the protein, and thus can be eliminated

  42. Conclusion and Future Work • Presented a method to predict stiff structures of subsequences up to the certain length • Presented a refinement of the method to handle multiple angles • how to handle flexible angles ? • Using the local prediction as an input for a global optimization method, e.g. based on Simulated Annealing

More Related