Massively Parallel Solutions for Molecular Sequence Analysis

Massively Parallel Solutions for Molecular Sequence Analysis Bertil Schmidt School of Computer Engineering, Nanyang Technological University , Singapore Heiko Schröder School of Computer Science and Information Technology, RMIT University, Melbourme, Australia Manfred Schimmler Institut für Datentechnik und Kommunikationsnetze, TU Braunschweig, Germany

Contents • Motivation • Smith-Waterman Algorithm • Parallelization on the Hybrid Architecture • Parallelization on the Fuzion 150 • Performance Evaluation • Conclusion and Future Work

Motivation • Genetic sequence databases are growing exponentially • Growth rate will continue, since multiple concurrent genome projects have begun, with more to come

Motivation • Discovered sequences are analyzed by comparison with databases • Complexity of sequence comparison is proportional to the product of query size times database size •  Analysis too slow on sequential computers • Two possible approaches • Heuristics, e.g. BLAST,FastA, but the more efficient the heuristics, the worse the quality of the results • Parallel Processing, get high-quality results in reasonable time

Mycobacterium Smegmatis Mycobacterium Tuberculosis 3918 Protein Sequences 1.329.298 AminoAcids 4289 Protein Sequences 1.359.008 AminoAcids Full Genome Comparison • related Organisms, but Tuberculosis causes a disease  find common and different parts • 16106 pairwise sequence comparisons • Many Genome-Genome Comparisons will be required in the near future

GGHSRLILSQLGEEG.RLLAIDRDPQAIAVAKT....IDDPRFSII |||::::| : |::| ||:::||||:|:|||:: ::| |:::: GGHAERFL.E.GLPGLRLIGLDRDPTALDVARSRLVRFAD.RLTLV Slower Search Speed Faster Data Quality Lower Higher Protein Sequence Alignment • BLAST, FastA, Smith-Waterman Smith- Waterman FastA BLAST

Smith-Waterman Algorithm • Optimal local alignment of two sequences • Performs an exhaustive search for the optimal local alignment • Complexity O(nm) for sequence lengths n and m • Based on the 'dynamic programming' (DP) algorithm • Fill the DP matrix using a substitution (mutation) matrix • Find the maximal value (score) in the matrix • Trace back from the score until a 0 value is reached

Smith-Waterman Algorithm • Aligning S1 and S2 of length n and m using Recurrences: • Calculate three possible ways to extend the alignment • by one AminoAcid (AA) in each sequence • by one AA in the first sequence and align it with a gap in the second • by one AA in the second sequence and align it with a gap in the first

 A T C T C G T A T G A T G  0 0 0 0 0 0 0 0 0 0 0 0 0 0 G 0 T 0 0 2 1 2 1 1 4 3 2 1 1 3 2 C 0 0 1 4 3 4 3 3 3 2 1 0 2 2 T 0 0 2 3 6 5 4 5 4 5 4 3 2 1 A 0 2 2 2 5 5 4 4 7 6 5 6 5 4 T 0 1 4 3 4 4 4 6 5 9 8 7 8 7 C 0 0 3 6 5 6 5 5 5 8 8 7 7 7 A 0 2 2 5 5 5 5 4 7 7 7 10 9 8 C 0 1 1 4 4 7 6 5 6 6 6 9 9 8 A T C T C G T A T G A T G G T C T A T C A C Smith-Waterman Algorithm Align S1=ATCTCGTATGATGS2=GTCTATCAC 0 0 0 0 0 0 2 1 0 0 2 1 0 2 2 =1, =1 4 3 5 7 9 8 10

Systola 1024: PC add-on board with 1024 processors (ISATEC, Germany) • Fuzion 150: 1536 processors on a single chip (Clearspeed Technology, UK) Parallel Architectures for Bioinformatics • Embedded Massively Parallel Accelerators

Systola1024 Systola1024 Systola1024 Systola1024 Systola1024 Systola1024 Systola1024 Systola1024 High speed Myrinet switch Systola1024 Systola1024 Systola1024 Systola1024 Systola1024 Systola1024 Systola1024 Systola1024 Parallel Architectures for Bioinformatics • Supercomputer performance at low cost • combines SIMD and MIMD paradigm within a parallel architecture Hybrid Computer

Previous Applications • Scientific Computing • Volume Visualization • Automatic Visual Quality Control • Cryptography • Computer Tomography • Video Compression • Range of Transforms (Fourier, Wavelet, Hough, Radon) • Computer Graphics

RAM NORTH RAM WEST Controller program memory host computer bus ISA Interface processors Architecture of Systola 1024 • Instruction Systolic Array: • 32  32 mesh of processing elements • wavefront instruction execution

- + - - * - + - - - * * * * + - + * + - + + * * - + + * * + + * - + + * * + + + + * * column selectors + instructions - * + * - + - - + * row selectors Instruction Systolic Array • wavefront instruction execution  fast accumulation operations (e.g. row sum, broadcast, ringshift)

A P1 P2 P13  A T C T C G T A T G A T G  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 1 0 0 2 1 0 2 G 0 0 0 0 0 2 T 0 0 0 2 2 1 1 1 2 2 1 1 4 3 2 1 1 3 2 3 2 B C 0 0 0 1 1 3 4 4 3 4 3 3 3 2 1 0 2 2 4 T 0 0 0 3 2 2 3 6 5 4 5 4 5 4 3 2 1 A 0 2 2 2 2 2 5 5 4 4 7 6 5 6 5 4 T 0 1 4 3 4 4 4 6 5 9 8 7 8 7 C 0 0 3 6 5 6 5 5 5 8 8 7 7 7 1 A 0 2 2 5 5 5 5 4 7 7 7 10 9 8 C 0 1 1 4 4 7 6 5 6 6 6 9 9 8 Parallelization of Smith-Waterman • matrix cells along a single diagonal are computed in parallel • comparison is performed in A+B1 steps on A PEs 0 5

a1023 a1022 a992 a63 a62 a32 bk….b1b0 a31 a30 a0 Mapping onto Systola 1024 a: query sequence (equal to 1024) • Subject sequences can be pipelined with only 1 step delay  k steps for subject sequence of length k b: subject sequence …c1c0 X • Efficient routing on the ISA: Row Ringshift and Broadcast

Query sequence length 256 512 1024 2048 4096 Systola 1024 speedup to PIII 850 294 5 577 6 1137 6 2241 6 4611 6 Cluster of 16 Systolas speedup to PIII 850 20 81 38 86 73 91 142 94 290 94 Performance Evaluation • Scan times in seconds for TrEMBL 14 (351’834 Protein Sequences) for various query sequence lengths • Parallel implementation scales linearly with sequence length and number of PCs • Computing time dominates data transfer time need a state-of-the-art architecture

Fuzion 150 Architecture Linear SIMD Array 1536PEs each with 2 Kbytes DRAM SIMD Controller Instruction Fetch • 0.25-m, single-chip, SIMD architecture • 1536 PEs @ 200 MHz  300 GOPS • 600 GB/s on-chip, 6.4 GB/s off-chip bandwidth • Multithreading (control units interact via semaphores) • developed by Clearspeed Technology (UK) for graphics, networking processing Local Memory Host AGP Rambus FUZION Bus 1,2 or 4 Channels (6.4 GB/s) 32-bit EPU (ARC) Video I/O Display

Instructions ALU (8 bits) Register file 32 Bytes Left PE Right PE PE Memory 2 KByte DRAM Block I/O Channel Local Memory Fuzion 150 Architecture Block 5 Fuzion Bus PE (5,0) PE (5,1) PE (5,255) Block 1 PE (1,0) PE (1,1) PE (1,255) Block 0 PE (0,0) PE (0,1) PE (0,255)

a1535 a1534 a1280 a511 a510 a256 a0 a1 a255 Mapping onto the Fuzion 150 Block 5 a: query sequence (equal to 1536) Block 1 b: subject sequence Block 0 bk….b1b0 …c1c0 X • No fast global communication  2-step local communication • Subject sequence can be pipelined with only step delay

Mapping onto the Fuzion 150 • Reduce communication time • Assign 16 AAs to each PE  query lengths up to 24576 AAs can be processed within a single pass • Partitioning for query lengths <24576: • each subarray of corresponding size computes the alignment of the same query sequence with different subject sequences

Query sequence length 256 512 1024 2048 4096 Fuzion 150 speedup to PIII 850 12 136 22 151 42 157 82 163 162 165 Performance Evaluation • Scan times in seconds for TrEMBL 14 (351’834 Protein Sequences) for various query sequence lengths • Parallel implementation scales linearly with sequence length • Computing time dominates data transfer time

Performance Evaluation • Normalized time Comparison for a 10 Mbase search on different parallel architectures with different query length • 4faster than 16K-PE MasPar • 6faster than Kestrel • 5faster than SAMBA (special-purpose 3-board architecture)

Cluster of Systola 1024 speedup to PIII 850 17 min 79 Fuzion 150 speedup to PIII 850 11 min 133 Performance Evaluation for Full Genome Comparison • Scan times for pairwise protein sequence comparison of Mycobacterium Tuberculosis and Escherichia Coli • Comparison has to be performed for several parameters (Substitution matrices, gap penalties) • Mycobacterium Smegmatis will be published later this year • Results of the comparison will be interpreted with the Centre for Molecular Cell Biology, NUS, Singapore

Conclusions and Future Work • Demonstrated how fine-grained parallel architectures can be applied efficiently for Comparative Genomics • Significant runtime savings for genome comparisons and database searching  More Discovery Is Possible at a good price-performance ratio • Other Computational Biology applications of interest to us: • ClustalW • HMM • pattern matching algorithms, such as inverted repeats, short tandem repeats, etc • Availability of accelerators as a special-resource in a Grid Environment

Contents • Protein Structure • Protein Structure Prediction • Approach based on Local Protein Structure • Refinements • Conclusions and Future Work

R side chain amine group carboxyl group alpha carbon(with attached hydrogen) Protein Structure • Proteins are large molecules composed of smaller molecules called amino acids • There are 20 kinds of amino acids found in natural proteins • All share a common structure

Protein Structure

From Primary to Tertiary Structure • A protein’s 3D shape is determined by its primary amino acid sequence (Anfinsen, 1963) • Predicting tertiary structure from amino acid sequence is an unsolved problem • Difficult to model the energies that stabilize a protein molecule • Conformational search space is enormous

Target amino acid sequence YLAADTYK Template Fold library FISSETCN MEPSSYV TGLIRKN Template amino acid sequence 7 21 2 Target/template Score: Prediction Methods • Given an amino acid sequence: • search a set of known folds by aligning sequence and a template fold representative • predict the fold that gets the best scoring alignment

Prediction Methods • This method is very effective when target and template have >30% sequence identity • Approximately 1/3 of protein sequences can be assigned folds and modeled this way • Our aim is to contribute to determine tertiary structures in case matching sequences cannot be found

Local structure and prediction • What is Local structure ? • describes environment of an amino acid • an amino acid’s relationship to neighbors • we use this information to predict structure from primary sequence

Dihedral Angles • The 6 atoms in each peptide unit lie in the same plane •  and  free to rotate • The structure of a protein is almost totally determined, if all angles  and  are known

Idea of our Approach N Back bone • Stiff  free • local predictability  database of sub-chain structures • reduction of the number of degrees of freedom by 10, reduces the computation time significantly in combination with a global optimization algorithm (e.g. GA or SA) C C  and  Side chains

multiple Selected PDB structures Histogram for each amino acids pair Dihedral angle extraction stiff flexible Classification of Dihedral Angles

Frequency Frequency ALA-ALA GLY-ILE Frequency LEU-ARG Classification of Dihedral Angles multiple Stiff flexible

multiple Selected PDB structures Histogram for each amino acids pair Dihedral angle extraction stiff flexible Classification of Dihedral Angles • Stiff angles: determine mean value • Multiple angles: determine sequence of mean values, one for each peak in decreasing order of these peaks • Flexible angles: determine mean value and mark as flexible

Prediction based on Classification • Given a sequence of amino acids, find the subsequence in which all angles are of type stiff • predict structure of these subsequences, using the mean values of the corresponding histograms

Prediction based on Classification • Part of a protein predicted with this method (backbone of a helix, original structure on the left, predicted structure on the right) • Successfully predicted certain stiff structures of subsequences up to the length of 15

Refinement of the method • For multiple angles: • consider sequences of length 3 or 4: • extract sequences (C,A,B,D) and determine the histogram of angles  and related to the peptide chain between A and B • if histogram for  for amino acids (A,B) is multiple, check if angle for (A,B,C,D) is stiff • with longer subsequences the occurrences of these sequences drops dramatically

Refinement of the method • For multiple angles: • if an amino acid sequence has only a small number of multiple edges, it is possible to try all combinations of possible peaks • many combinations lead to collisions in part of the protein, and thus can be eliminated

Conclusion and Future Work • Presented a method to predict stiff structures of subsequences up to the certain length • Presented a refinement of the method to handle multiple angles • how to handle flexible angles ? • Using the local prediction as an input for a global optimization method, e.g. based on Simulated Annealing

Massively Parallel Solutions for Molecular Sequence Analysis

Massively Parallel Solutions for Molecular Sequence Analysis

Presentation Transcript

Massively Parallel Processors

Error model for massively parallel (454) DNA sequencing

Scalable Parallel I/O Alternatives for Massively Parallel Partitioned Solver Systems

Programming Massively Parallel Graphics Processors

Approximate History Map for Massively Parallel Environments

Massively Parallel/Distributed Data Storage Systems

Emulating Massively Parallel (Peta FLOPS ) Machines

A Massively Parallel Architecture for Bioinformatics

Sequence detection for parallel ACK

Parallel Computation in Biological Sequence Analysis

Sequence design for parallel ACK

Massively Parallel Multgrid for Finite Elements

Comparative Sequence Analysis in Molecular Biology

Massively Parallel Solutions for Molecular Sequence Analysis

Massively Parallel Signature Sequencing (MPSS)

Sequence design for parallel ACK

PAVEMENT/PIO Parallel I/O System for Massively Parallel Processors

CM-5 Massively Parallel Supercomputer

Module 5 Algorithmic Methods For Molecular Sequence Analysis

Parallel computational methods for sequence analysis

Massively Parallel Computing for Protein Alignment

A Fault Tolerant Protocol for Massively Parallel Machines