Carlo Pascoe (speaker), David Box, Herman Lam, Alan George

FPGA-Accelerated Isotope Pattern Calculator for Use in Simulated Mass Spectrometry Peptide and Protein Chemistry
Carlo Pascoe (speaker), David Box, Herman Lam, Alan George
NSF Center for High-Performance Reconfigurable Computing (CHREC)
Dept. of Electrical and Computer Engineering, University of Florida, Gainesville, FL, USA
Email: {pascoe, box, hlam, george}@chrec.org

Motivation
• Protein Identification Algorithms (PIAs)
  • Heavily utilized in pharmaceutical research and cancer diagnostics
  • Current industry-standard methods are unreliable (at best!) [1,2]
  • Highly accurate algorithms with the potential to revolutionize accuracy exist, but are unused or underutilized due to extreme computational intensity and prohibitive execution times
  • Must be accelerated for feasible use
Objective: Develop a sustainable solution for increasing the speed, and thus the achievable accuracy, of many PIAs
Approach:
• Accelerate the Isotope Pattern Calculator (IPC), a dominant subroutine common in de novo PIAs
• Provide a customizable design for general use
• Capitalize on reconfigurable computing at scale to achieve sustainable supercomputing performance

Presentation Outline
• Background
  • Protein Identification
  • De Novo PIAs
  • Theoretical Mass Spectrum Generation
• IPC Problem Description
  • Elemental Isotope SADs
  • Stage 1: SED Calculation
  • Stage 2: SED Combination
  • Additional IPC Functionality
• A Configurable & Scalable IPC Hardware Architecture
  • SED Calculation Reduced to LUTs
  • SED Iterative Combination in Hardware
• Performance Evaluation on Novo-G
  • Single-FPGA Performance
  • Multi-FPGA Performance
• Summary & Conclusions
• Future Work
• Q&A
SAD: Single-Atom Distribution; SED: Single-Element Distribution

Protein Identification
• Protein: biochemical molecule consisting of one or more polypeptides
  • Macromolecular chains of linked amino acids
• Current protein ID approach
  • Methodically fragment protein sample
  • Analyze with mass spectrometer
  • Employ PIAs to generate string representing amino acid primary structure
  • Algorithms classified as database or de novo
• This… → To this… → To this…: RPPGFSPFR (peptide amino acid sequence)

De Novo PIAs
• General de novo approach
  • Make educated guess for amino acid string
  • Generate theoretical mass spectrum and compare to experimental spectrum
  • Iteratively refine guess until theoretical and experimental spectra match
• Theoretically, need to consider all linear combinations of amino acids
  • Number of candidates grows exponentially with final sequence length
• Employ diverse heuristic pruning methods to limit protein search space
  • Necessary for practical use on conventional computing systems
  • Often leads to false identifications (e.g., N and GG can have the same mass)
By accelerating a key computation common to many de novo algorithms, algorithm developers can employ less restrictive pruning criteria, potentially allowing a greater degree of accuracy in less time
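The N versus GG ambiguity mentioned above can be checked from residue compositions alone. The sketch below uses the standard residue formulas for glycine (C2H3NO) and asparagine (C4H6N2O2); the helper name `sequence_formula` is hypothetical and only for illustration.

```python
from collections import Counter

# Residue compositions (amino acid minus one water), standard chemistry values.
GLY = Counter({"C": 2, "H": 3, "N": 1, "O": 1})   # G
ASN = Counter({"C": 4, "H": 6, "N": 2, "O": 2})   # N


def sequence_formula(residues):
    """Sum the elemental compositions of a list of residues (hypothetical helper)."""
    total = Counter()
    for residue in residues:
        total += residue
    return total


gg = sequence_formula([GLY, GLY])
n = sequence_formula([ASN])

# Identical elemental formulas imply identical isotope patterns, so mass alone
# cannot separate the two candidates.
print(dict(gg), dict(n), gg == n)
```

Because the two candidates share one elemental formula, any mass-based comparison sees them as the same molecule, which is why pruning on mass can produce false identifications.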

Theoretical Mass Spectrum Generation
• Accounts for the majority of execution time in many highly accurate de novo algorithms
• Calculation comprises:
  • Decomposition of candidate sequence string into many amino acid substrings
  • Generation of probable mass contributions for each predicted substring
  • Histogram-like combination of probable masses to form a theoretical distribution directly comparable to experimental mass spectra
• Complicated by the fact that, in nature, elements occur as mixtures of isotopes
  • Differences in neutron count produce a distribution of possible molecule masses
• Use IPC subroutine to predict possible masses
  • Enumerates all possible combinations of constituent element isotopes
  • Produces list of mass/probability pairs
Although a relatively simple calculation for the smallest of molecules, IPC executions for medium- to large-sized molecules quickly become a computational bottleneck of many chemistry applications, most notably de novo protein identification.

IPC Problem Description
Given a chemical formula and a database of element isotope SADs, produce a list of mass/probability pairs representing the distribution of possible molecular masses
• Analogous to expanding a polynomial: a product, over the unique elements of the formula, of a sum of isotope terms raised to a power
  • Each term represents the $i$-th isotope of the $j$-th unique element in a chemical formula containing $N_j$ atoms of the $j$-th element type
• Problem reducible to two-stage process
  • Stage 1: Compute each single-element distribution (SED)
  • Stage 2: Combine SEDs to form final distribution
SAD: Single-Atom Distribution; SED: Single-Element Distribution
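One way to write the polynomial described on this slide is sketched below. The symbol names are assumptions, not the authors' notation: $E_{i,j}$ stands for the $i$-th isotope of the $j$-th unique element, $I_j$ for the number of stable isotopes of that element, and $J$ for the number of unique elements in the formula.

```latex
\prod_{j=1}^{J} \left( \sum_{i=1}^{I_j} E_{i,j} \right)^{N_j}
```

Under this reading, each $E_{i,j}$ carries a mass and a probability, and "multiplying" two terms adds their masses and multiplies their probabilities. Stage 1 expands each parenthesized power into an SED; Stage 2 multiplies the resulting SEDs across $j$.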

Elemental Isotope SADs

Stage 1: SED Calculation
• Consider SEDs of hydrogen derived from its SAD
  • H1: (M= 1.007825, p= 9.99885e-01), (M= 2.014102, p= 1.15e-04)
  • H2: (M= 2.01565, p= 9.99770e-01), (M= 3.02193, p= 2.2997e-04), (M= 4.02820, p= 1.3225e-08)
  • HN: a really long list with many low-probability peaks! Impose a threshold probability
• Algorithm 1. Calculate HN SED
  • $p_0$ ← 0.999885, $m_0$ ← 1.007825
  • $p_1$ ← 0.000115, $m_1$ ← 2.014102
  • For $n_1$ ← 0 to $N$
    • $n_0$ ← $N - n_1$
    • p ← $\frac{N!}{n_0!\,n_1!}\,p_0^{n_0}\,p_1^{n_1}$
    • m ← $n_0 m_0 + n_1 m_1$
    • Print (m, p)
  • End Loop
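Algorithm 1 can be sketched in software as a binomial expansion. This is a minimal illustration, not the authors' implementation; the function name `two_isotope_sed` and the optional `threshold` argument are assumptions added for the example.

```python
from math import comb


def two_isotope_sed(n_atoms, p0, m0, p1, m1, threshold=0.0):
    """Return (mass, probability) peaks for n_atoms of a two-isotope element."""
    peaks = []
    for n1 in range(n_atoms + 1):          # atoms of the second (heavier) isotope
        n0 = n_atoms - n1                  # remaining atoms of the first isotope
        p = comb(n_atoms, n1) * p0**n0 * p1**n1
        m = n0 * m0 + n1 * m1
        if p >= threshold:                 # drop low-probability peaks
            peaks.append((m, p))
    return peaks


# Hydrogen SAD values from the slide.
h2 = two_isotope_sed(2, 0.999885, 1.007825, 0.000115, 2.014102)
for m, p in h2:
    print(f"M= {m:.5f}, p= {p:.4e}")
```

The list grows linearly with N for a two-isotope element, so the threshold mainly trims the long tail of negligible peaks.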

Stage 1: SED Calculation
• Can modify Algorithm 1 to handle any element with two stable isotopes (e.g., helium, carbon, nitrogen, etc.)
• What if an element has more than two stable isotopes?
  • Consider SEDs of sulfur
  • SN: a REALLY, REALLY long list!
• Algorithm 2. Calculate SN SED
  • $p_0$ ← 0.9493, $m_0$ ← 31.972079
  • $p_1$ ← 0.0076, $m_1$ ← 32.971459
  • $p_2$ ← 0.0429, $m_2$ ← 33.967867
  • $p_3$ ← 0.0002, $m_3$ ← 35.967081
  • For $n_1$ ← 0 to $N$
    • For $n_2$ ← 0 to $N - n_1$
      • For $n_3$ ← 0 to $N - (n_1 + n_2)$
        • $n_0$ ← $N - (n_1 + n_2 + n_3)$
        • p ← $\frac{N!}{n_0!\,n_1!\,n_2!\,n_3!}\,p_0^{n_0}\,p_1^{n_1}\,p_2^{n_2}\,p_3^{n_3}$
        • m ← $n_0 m_0 + n_1 m_1 + n_2 m_2 + n_3 m_3$
        • Print (m, p)
      • End Loop
    • End Loop
  • End Loop
Computation significantly increases as the number of stable isotopes increases
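The nested loops of Algorithm 2 generalize to any number of stable isotopes as a multinomial expansion. The sketch below is an illustration under that assumption; `compositions` and `multi_isotope_sed` are hypothetical names, and recursion replaces the fixed nesting depth.

```python
from math import factorial


def compositions(total, parts):
    """Yield every tuple of `parts` non-negative integers that sums to `total`."""
    if parts == 1:
        yield (total,)
        return
    for first in range(total + 1):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest


def multi_isotope_sed(n_atoms, isotopes, threshold=0.0):
    """isotopes: list of (probability, mass). Returns a list of (mass, probability)."""
    peaks = []
    for counts in compositions(n_atoms, len(isotopes)):
        coeff = factorial(n_atoms)         # multinomial coefficient, built exactly
        p, m = 1.0, 0.0
        for n_i, (p_i, m_i) in zip(counts, isotopes):
            coeff //= factorial(n_i)
            p *= p_i ** n_i
            m += n_i * m_i
        p *= coeff
        if p >= threshold:
            peaks.append((m, p))
    return peaks


# Sulfur SAD values from the slide.
SULFUR = [(0.9493, 31.972079), (0.0076, 32.971459),
          (0.0429, 33.967867), (0.0002, 35.967081)]
s3_all = multi_isotope_sed(3, SULFUR)
s3_kept = multi_isotope_sed(3, SULFUR, threshold=1.0e-05)
print(len(s3_all), len(s3_kept))
```

With four isotopes the number of peaks before filtering is C(N+3, 3), which is the growth the slide warns about.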

Stage 2: SED Combination
• With Stage 1 complete, problem is analogous to expanding a product, over the unique elements, of sums of SED peaks
  • Each term represents the $i$-th peak from the SED generated for the $j$-th unique element
  • Removal of exponent allows for straightforward combination
Simple Example) H2O:
• H2 SED: (M= 2.01565, p= 9.9977e-01), (M= 3.02193, p= 2.2997e-04), (M= 4.02820, p= 1.3225e-08)
• O SED: (M= 15.9949, p= 9.9757e-01), (M= 16.9991, p= 3.8e-04), (M= 17.9992, p= 2.05e-03)
• Combined distribution (masses add, probabilities multiply):
  • M= 2.01565 + 15.9949 = 18.0106, p= 9.9977e-01 * 9.9757e-01 = 9.9734e-01
  • M= 2.01565 + 16.9991 = 19.0148, p= 9.9977e-01 * 3.8e-04 = 3.7991e-04
  • M= 2.01565 + 17.9992 = 20.0149, p= 9.9977e-01 * 2.05e-03 = 2.0495e-03
  • M= 3.02193 + 15.9949 = 19.0168, p= 2.2997e-04 * 9.9757e-01 = 2.2941e-04
  • M= 3.02193 + 16.9991 = 20.0210, p= 2.2997e-04 * 3.8e-04 = 8.7389e-08
  • M= 3.02193 + 17.9992 = 21.0211, p= 2.2997e-04 * 2.05e-03 = 4.7144e-07
  • M= 4.02820 + 15.9949 = 20.0231, p= 1.3225e-08 * 9.9757e-01 = 1.3193e-08
  • M= 4.02820 + 16.9991 = 21.0273, p= 1.3225e-08 * 3.8e-04 = 5.0255e-12
  • M= 4.02820 + 17.9992 = 22.0274, p= 1.3225e-08 * 2.05e-03 = 2.7111e-11
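The H2O example can be reproduced with a pairwise Cartesian combination of SEDs. This is a minimal sketch; `combine_seds` and `combine_all` are hypothetical names, and the H2 probabilities are computed from the hydrogen SAD rather than typed in.

```python
from itertools import product

P0, P1 = 0.999885, 0.000115                      # hydrogen isotope abundances
H2_SED = [(2.01565, P0 * P0), (3.02193, 2 * P0 * P1), (4.02820, P1 * P1)]
O_SED = [(15.9949, 9.9757e-01), (16.9991, 3.8e-04), (17.9992, 2.05e-03)]


def combine_seds(sed_a, sed_b):
    """Every pair of peaks: masses add, probabilities multiply."""
    return [(ma + mb, pa * pb) for (ma, pa), (mb, pb) in product(sed_a, sed_b)]


def combine_all(seds):
    """Fold any number of SEDs into one distribution, one SED at a time."""
    result = [(0.0, 1.0)]                        # identity: zero mass, probability 1
    for sed in seds:
        result = combine_seds(result, sed)
    return result


h2o = combine_all([H2_SED, O_SED])
for m, p in h2o:
    print(f"M= {m:.4f}, p= {p:.4e}")
```

The output size is the product of the SED lengths, which is why thresholding and windowing matter once formulas contain several elements.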

Additional IPC Functionality
Simple Example Continued) H2O:
• Stage 2 output: (M= 18.0106, p= 9.9734e-01), (M= 19.0148, p= 3.7991e-04), (M= 20.0149, p= 2.0495e-03), (M= 19.0168, p= 2.2941e-04), (M= 20.0210, p= 8.7389e-08), (M= 21.0211, p= 4.7144e-07), (M= 20.0231, p= 1.3193e-08), (M= 21.0273, p= 5.0255e-12), (M= 22.0274, p= 2.7111e-11)
• Probability Threshold: Filter prob < PT (e.g., PT = 1.0e-05)
  • (M= 18.0106, p= 9.9734e-01), (M= 19.0148, p= 3.7991e-04), (M= 20.0149, p= 2.0495e-03), (M= 19.0168, p= 2.2941e-04)
• Sort by Mass
  • (M= 18.0106, p= 9.9734e-01), (M= 19.0148, p= 3.7991e-04), (M= 19.0168, p= 2.2941e-04), (M= 20.0149, p= 2.0495e-03), (M= 20.0210, p= 8.7389e-08), (M= 20.0231, p= 1.3193e-08), (M= 21.0211, p= 4.7144e-07), (M= 21.0273, p= 5.0255e-12), (M= 22.0274, p= 2.7111e-11)
• Sort by Probability
  • (M= 18.0106, p= 9.9734e-01), (M= 20.0149, p= 2.0495e-03), (M= 19.0148, p= 3.7991e-04), (M= 19.0168, p= 2.2941e-04), (M= 21.0211, p= 4.7144e-07), (M= 20.0210, p= 8.7389e-08), (M= 20.0231, p= 1.3193e-08), (M= 22.0274, p= 2.7111e-11), (M= 21.0273, p= 5.0255e-12)
• Window Filter: Filter any peaks after the Nth (e.g., N = 6)
  • Mass-sorted: (M= 18.0106, p= 9.9734e-01), (M= 19.0148, p= 3.7991e-04), (M= 19.0168, p= 2.2941e-04), (M= 20.0149, p= 2.0495e-03), (M= 20.0210, p= 8.7389e-08), (M= 20.0231, p= 1.3193e-08)
  • Probability-sorted: (M= 18.0106, p= 9.9734e-01), (M= 20.0149, p= 2.0495e-03), (M= 19.0148, p= 3.7991e-04), (M= 19.0168, p= 2.2941e-04), (M= 21.0211, p= 4.7144e-07), (M= 20.0210, p= 8.7389e-08)
• Mass Peak Centroiding: Essentially a moving-average filter over close peaks, weighted by probability
  • (M= 18.0106, p= 9.9734e-01), (M= 19.0156, p= 6.0932e-04), (M= 20.0149, p= 2.0496e-03)
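The four post-processing options can be sketched as small list operations. All function names here are hypothetical, and the centroiding rule (merge mass-sorted neighbours closer than a `tolerance`, assumed to be 0.1 mass units) is one plausible reading of "moving average over close peaks", not the authors' stated method.

```python
P0, P1 = 0.999885, 0.000115
H2 = [(2.01565, P0 * P0), (3.02193, 2 * P0 * P1), (4.02820, P1 * P1)]
O = [(15.9949, 0.99757), (16.9991, 3.8e-04), (17.9992, 2.05e-03)]
peaks = [(mh + mo, ph * po) for mh, ph in H2 for mo, po in O]   # H2O, 9 peaks


def threshold_filter(pk, pt):
    return [(m, p) for m, p in pk if p >= pt]


def sort_by_mass(pk):
    return sorted(pk, key=lambda mp: mp[0])


def sort_by_probability(pk):
    return sorted(pk, key=lambda mp: mp[1], reverse=True)


def window_filter(pk, n):
    return pk[:n]                       # keep only the first n peaks


def merge_group(group):
    """Probability-weighted mean mass and summed probability of a group."""
    total_p = sum(p for _, p in group)
    return (sum(m * p for m, p in group) / total_p, total_p)


def centroid(pk, tolerance):
    """Merge runs of mass-sorted peaks whose neighbours lie within tolerance."""
    merged, group = [], []
    for m, p in sort_by_mass(pk):
        if group and m - group[-1][0] > tolerance:
            merged.append(merge_group(group))
            group = []
        group.append((m, p))
    if group:
        merged.append(merge_group(group))
    return merged


kept = threshold_filter(peaks, 1.0e-05)
windowed = window_filter(sort_by_mass(peaks), 6)
centroided = centroid(windowed, tolerance=0.1)
for m, p in centroided:
    print(f"M= {m:.4f}, p= {p:.4e}")
```

Applying the window before centroiding, as done here, mirrors the order of the slide's example; other orderings are possible depending on what the targeted PIA needs.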

A Configurable & Scalable IPC Hardware Architecture
Adapt the two-stage procedure to a configurable and scalable hardware architecture capable of converting a stream of independent chemical formula queries into a delimited stream of variable-quantity mass/probability pairs
• Single module handles Stage 1 functionality
• Multiple modules handle Stage 2 computation
• Number of modules is independent of input stream data and host
• Stream consists of chemical formula query information and control data
• Result distributions are returned in the same order as received in the input stream

SED Calculation Reduced to LUTs
• Precompute SEDs exactly, pull SEDs from LUTs at runtime (SEDs vs. SADs vs. FCFDs)
• Single bank of LUTs feeds all distribution calculators
• Token-based round-robin scheduler
• In-stream control [3]
• SEDs presorted by probability, filtered at runtime with configurable threshold probability
• Sample LUT address space for SEDs (addresses 0 to 2047):
  • H1 − H256
  • C1 − C256
  • N1 − N256
  • O1 − O256
  • S1 − S64
  • 60 other elements with 16 SEDs per element
SAD: Single-Atom Distribution; SED: Single-Element Distribution; FCFD: Full Chemical Formula Distribution
Equation from Slide 8:
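The sample address space adds up to exactly 2048 entries (4 × 256 + 64 + 60 × 16). The sketch below shows one way such a layout could map an element and atom count to an address. The contiguous ordering, the `other_slot` index for the 60 remaining elements, and all names are assumptions for illustration; the slide does not specify the actual mapping.

```python
# Hypothetical layout: (element, number of stored SEDs), in the order listed on the slide.
DEDICATED = [("H", 256), ("C", 256), ("N", 256), ("O", 256), ("S", 64)]
OTHER_ELEMENTS = 60      # remaining elements
OTHER_DEPTH = 16         # SEDs stored per remaining element


def build_bases():
    """Assign each dedicated element a contiguous block starting at address 0."""
    bases, addr = {}, 0
    for symbol, depth in DEDICATED:
        bases[symbol] = (addr, depth)
        addr += depth
    return bases, addr   # addr is the first address of the "other elements" region


BASES, OTHER_BASE = build_bases()
TOTAL = OTHER_BASE + OTHER_ELEMENTS * OTHER_DEPTH


def lut_address(symbol, n_atoms, other_slot=None):
    """Address of the SED for n_atoms of an element (assumed mapping)."""
    if symbol in BASES:
        base, depth = BASES[symbol]
    else:
        if other_slot is None or not 0 <= other_slot < OTHER_ELEMENTS:
            raise ValueError("other_slot must be in 0..59 for non-dedicated elements")
        base, depth = OTHER_BASE + other_slot * OTHER_DEPTH, OTHER_DEPTH
    if not 1 <= n_atoms <= depth:
        raise ValueError(f"{symbol}{n_atoms} is outside the stored range 1..{depth}")
    return base + n_atoms - 1


print(TOTAL, lut_address("H", 1), lut_address("S", 64), lut_address("Cl", 16, other_slot=59))
```

A fixed-depth table like this trades memory for a Stage 1 that costs a single lookup per element, consistent with the slide's "reduced to LUTs" framing.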

SED Iterative Combination in Hardware
A single-cycle SED combination architecture sized for the worst case is excessively wasteful when processing the common case; employ iterative combination to boost hardware utilization
X: No. of Parallel Multipliers and Adders; Y: Buffer Depth
• Algorithm 3. Distribution Calculator Procedure
  • While Control ≠ “done”
    • SED[1…N] ← FIFO[1…N].pop(), Control ← FIFO[N+1].pop()
    • If Control = “begin”
      • PrevItBuff[1…N] ← SED[1…N], PrevItBuff[N+1…Y] ← (-1,0)
      • CurrItBuff[1…Y] ← (-1,0)
    • If Control = “middle” or Control = “end”
      • While tmp ← PrevItBuff[1…Y].shift() ≠ (-1,0)
        • i ← 1
        • While i ≤ N and SED[i].prob > 0
          • MultAdd[1…X].mass ← SED[i…i+X−1].mass + tmp.mass
          • MultAdd[1…X].prob ← SED[i…i+X−1].prob ∗ tmp.prob
          • PSort[1…X] ← Sort(Filter(MultAdd[1…X], TP))
          • CurrItBuff[1…Y] ← InSort(CurrItBuff[1…Y], PSort[1…X])
          • i ← i + X
        • End Loop
      • End Loop
      • PrevItBuff[1…Y] ← CurrItBuff[1…Y]
      • CurrItBuff[1…Y] ← (-1,0)
    • If Control = “end”
      • FinalResBuff[1…Y] ← PrevItBuff[1…Y]
  • End Loop
• Result reporting circuitry operates independently of distribution calculation
• Insert centroiding here if so desired
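A software model can make the data flow of Algorithm 3 easier to follow. The sketch below is an approximation under stated assumptions: the FIFO and control tokens are replaced by a plain list of SEDs, `Sort` and `InSort` are assumed to order by descending probability, and the function name `distribution_calculator` is hypothetical. `x` and `y` play the roles of X and Y, and `tp` is the threshold probability TP.

```python
EMPTY = (-1.0, 0.0)      # sentinel from Algorithm 3 marking an unused buffer slot


def distribution_calculator(seds, tp, x, y):
    """Iteratively combine SEDs, x products per step, keeping at most y peaks."""
    # "begin": load the first SED into the previous-iteration buffer (assumes len <= y).
    prev = (list(seds[0]) + [EMPTY] * y)[:y]
    for sed in seds[1:]:                              # "middle" and "end" SEDs
        curr = []
        for tmp in prev:
            if tmp == EMPTY:                          # no more valid peaks
                break
            for i in range(0, len(sed), x):           # x multiply/add units per step
                batch = [(m + tmp[0], p * tmp[1]) for m, p in sed[i:i + x]]
                batch = [pk for pk in batch if pk[1] >= tp]          # Filter(..., TP)
                curr = sorted(curr + batch, key=lambda pk: pk[1],
                              reverse=True)[:y]       # InSort into a depth-y buffer
        prev = (curr + [EMPTY] * y)[:y]
    return [pk for pk in prev if pk != EMPTY]


P0, P1 = 0.999885, 0.000115
H2 = [(2.01565, P0 * P0), (3.02193, 2 * P0 * P1), (4.02820, P1 * P1)]
O = [(15.9949, 0.99757), (16.9991, 3.8e-04), (17.9992, 2.05e-03)]

result = distribution_calculator([H2, O], tp=1.0e-05, x=2, y=8)
for m, p in result:
    print(f"M= {m:.4f}, p= {p:.4e}")
```

Larger `x` means fewer inner steps per peak at the cost of more parallel arithmetic units, and `y` bounds how many peaks survive each iteration, which mirrors the tradeoffs discussed in the performance slides.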

Performance Evaluation on Novo-G
Novo-G Annual Growth
• 2009: 96 top-end Stratix-III FPGAs, each with 4.25GB SDRAM
• 2010: 96 more Stratix-III FPGAs, each with 4.25GB SDRAM
• 2011: 96 top-end Stratix-IV FPGAs, each with 8.50GB SDRAM
• 2012: 96 more Stratix-IV FPGAs, each with 8.50GB SDRAM
Evaluation setup
• Previously discussed hardware architecture implemented in VHDL and tested on Novo-G [4,5]
  • Initial experiments on a single Altera Stratix IV E530 FPGA in a GiDEL PROCStar IV board, with an Intel Xeon E5620 CPU for host support
  • Single-device implementation scaled up to a single Novo-G “ps4” compute node
    • i.e., up to 16 E530s in 4 PROCStar IVs
  • Implications of scaling to multiple compute nodes of Novo-G discussed
• Software baseline: highly optimized, serial C++ code mirroring hardware algorithm
  • Executed on single E5620 core
  • Orders of magnitude faster than code at [6]
• Hardware and software results compared to confirm hardware correctness

Single-FPGA Performance
Performance Trends for Various IPC Parameter Configurations
• Reducing calculation word width: reduced logic usage & increased operating frequency vs. reduced result precision
• Increasing parallel computations per distribution calculator (DC): increased operations per clock cycle vs. increased logic usage, reduced routability, & reduced operating frequency
• Reducing distribution window width: reduced logic usage vs. reduced result exactness
• Configurations bandwidth limited: computation-bound problem in software becomes I/O-bound in FPGAs
• Suitable “sweet spot,” achieving remarkable speedup while ensuring results remain scientifically relevant

Multi-FPGA Performance
Performance Trends of “sweet spot” for Various Novo-G Node Configurations
• Increasing FPGAs per PROCStar IV
  • Scalability limited by I/O bandwidth
  • PROCStar IV only supports 8-lane, Gen 1 PCIe
  • Expect increased scalability with system configurations employing more lanes and/or the more recent Gen 3 PCIe standard
• Increasing PROCStar IVs per node
  • Available system bandwidth far exceeds the board link bandwidth bottleneck observed with a single board
  • Scalability now limited by CPU resources
  • Novo-G “ps4” nodes have 8 physical cores (16 logical with hyper-threading) vs. max 32 threads for row 9
  • Expect increased scalability with system configurations employing more physical cores
• Multi-Node Scaling Expectations
  • Assuming input queries are pre-partitioned, no communication is required between compute nodes
  • Overhead limited to initialization & completion synchronization, so expect performance to scale almost linearly with additional nodes
  • We plan to verify these expectations by scaling to multiple compute nodes in Novo-G as future work
Multiple FPGA Advantage?

Summary & Conclusions
• Presented first FPGA-based Isotope Pattern Calculator
  • Computationally intense subroutine common in de novo PIAs
  • Provides 23 customization parameters for general use
• Discussed parameter tradeoffs & experimentally demonstrated effect on performance
  • Between 72× and 566× speedup† on a single FPGA
  • Up to 1259× speedup† on a single board (4 FPGAs)
  • Up to 3340× speedup† on a single node (16 FPGAs)
  • Wide range of achieved single-node performance due to embarrassingly parallel scalability restricted by real-world system limitations such as insufficient I/O bandwidth and CPU resources
• Can enable use of previously dismissed protein identification algorithms with potentially revolutionary accuracy yet obscene execution time on conventional computing platforms
  • Still much to be done before this is a reality for protein identification
† With respect to a highly optimized, serial C++ IPC implementation

Future Work
• Continue scaling design to multiple nodes of Novo-G
• Integrate FPGA-accelerated IPC into full de novo PIA
  • First integrate with full theoretical spectrum generator
  • Move more of algorithm onto FPGA to lessen bandwidth bottleneck issue
• Explore the possibility of a GPU-accelerated IPC
  • GPU amenable given minor modifications to the algorithm as stated
  • Preliminary design already mapped out, ready for implementation & testing
• Implement non-sorted output option
  • Sorting fundamentally integral to current DC design
  • Non-sorting DC would allow greater parallelization while utilizing fewer resources
  • If sorted distribution not required by targeted PIA, expect much greater performance

Thank You For Listening! Any Questions?

References
[1] A. W. Bell et al., “A HUPO test sample study reveals common problems in mass spectrometry-based proteomics,” Nat. Methods, vol. 6, pp. 423–430, 2009.
[2] E. A. Kapp et al., “An evaluation, comparison, and accurate benchmarking of several publicly available MS/MS search algorithms: Sensitivity and specificity analysis,” Proteomics, vol. 5, pp. 3475–3490, 2005.
[3] C. Pascoe et al., “Reconfigurable supercomputing with scalable systolic arrays and in-stream control for wavefront genomics processing,” Proc. of Symposium on Application Accelerators in High-Performance Computing (SAAHPC), TN, 2010.
[4] A. George, H. Lam, A. Lawande, C. Pascoe, and G. Stitt, “Novo-G: A View at the HPC Crossroads for Scientific Computing,” Proc. of the Int. Conf. on Eng. of Reconf. Sys. and Algs. (ERSA), NV, 2010.
[5] A. George, H. Lam, and G. Stitt, “Novo-G: At the Forefront of Scalable Reconfigurable Computing,” IEEE Computing in Sci. & Eng. (CiSE), vol. 13, no. 1, pp. 82–86, Jan./Feb. 2011.
[6] Dirk (2005), Isotopic Pattern Calculator, http://isotopatcalc.sourceforge.net/index.php, File: gips-0.7.tar.gz.
