An Algorithmic Approach to Peptide Sequencing via Tandem Mass Spectrometry

Ming-Yang Kao, Department of Computer Science, Northwestern University, Evanston, Illinois, U.S.A.

Collaborators of This Project
• University of Southern California
  • Ting Chen
• Harvard Medical School
  • George M. Church
  • John Rush
  • Matthew Tepel

Genome (DNA) Transcriptome (RNA) Proteome (Protein) Perspectives A key goal of bioinformatics: To study biological systems based on global knowledge of genomes, transcriptomes, and proteomes. • Genome: entire sets of materials in the chromosomes. • Transcriptome: entire sets of gene transcripts. • Proteome: entire sets of proteins.

Genome (DNA) Transcriptome (RNA) Proteome (Protein) Perspectives A key goal of bioinformatics: To study biological systems based on global knowledge of genomes, transcriptomes, and proteomes. • Genome: entire sets of materials in the chromosomes. • Transcriptome: entire sets of gene transcripts. • Proteome: entire sets of proteins. this talk’s focus

Proteomics
• Proteome: all proteins encoded within a genome
  • about half a million distinct proteins (temporal, spatial, modifications)
  • ~30,000 human genes
  • mRNA and protein expression may not correlate
• Proteomics: the study of protein expression by biological systems
  • relative abundance and stability; post-translational modifications
  • fluctuations in response to the environment and altered cellular needs
  • correlations between protein expression and disease state
  • protein-protein interactions, protein complexes
• Technologies:
  • 2D gel electrophoresis
  • mass spectrometry (this talk’s focus)
  • yeast two-hybrid system
  • protein chips

A Key Step of Proteomics
• How to sequence proteins?
• How to sequence protein peptides? (this talk’s focus)

Outline of This Talk
• Problem Formulation (Biology)
• Problem Formulation (Computer Science)
• Basic Computational Techniques
• Improved Computational Complexity and More Robust Algorithms
• Conclusions


Protein Identification: HPLC-MS-MS
[Figure: workflow from proteins to peptides, then from one peptide to its b-ions and y-ions, ending in a tandem mass spectrum plotted against mass/charge.]

Peptide Fragmentation and Ionization
[Figure: a peptide fragmenting into a B-ion and a Y-ion.]
Complementary: Mass(B-ion) + Mass(Y-ion) = Mass(peptide) + 4H + O
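As a worked check of the complementary relation, the following sketch assumes nominal masses H ≈ 1 and O ≈ 16, and takes Mass(peptide) = W to be the sum of the residue masses, with a b-ion carrying an extra mass of 1 and a y-ion an extra mass of 19 (the offsets used in the talk's later formulation). The symbols m(·), b_i, and y_{n-i} are notation introduced here for illustration.

```latex
\begin{align*}
m(b_i)     &= \sum_{j \le i} m(x_j) + 1, \\
m(y_{n-i}) &= \sum_{j > i} m(x_j) + 19, \\
m(b_i) + m(y_{n-i}) &= W + 20 = W + 4\,m(\mathrm{H}) + m(\mathrm{O}), \\
\text{e.g. } m(b_2) + m(y_1) &= 274.112 + 175.113 = 449.225 \approx 429.212 + 20 .
\end{align*}
```

The small residual (449.225 versus 449.212) comes from the rounded values used in the talk's example spectrum.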

B-ions and Y-ions Fragmentation

Tandem Mass Spectrum
[Figure: abundance (%) versus mass/charge, with peaks at 88.033, 175.113, 274.112, 361.121, 430.213, and 448.225.]

Raw Tandem Mass Spectrum

Prediction from Raw Tandem Mass Spectrum

Protein Database Search
Find the peptide sequences in a protein database that optimally fit the spectrum.
• It does not work if the target peptide sequence is not in the database.
• It does not work if there is an unknown modification at some amino acid.
• It is very slow because it must search the entire database.
• E.g., SEQUEST (Yates, Univ. of Washington).

De Novo Peptide Sequencing Problem
• Input: (1) the mass W of an unknown target peptide, and (2) a set S of the masses of some or all b-ions and y-ions of the peptide.
• Output: a peptide P such that
  • (1) mass(P) = W, and
  • (2) S is a subset of all the ion masses of P.
Example: peptide mass 429.212 Daltons; P = SWR, Mass(P) = 429.212, Ions(P) = {88.033, 175.113, 274.112, 361.121, 430.213, 448.225}.
[Figure: spectrum with peaks at 274.112 and 361.121; axes: abundance (%) versus mass/charge.]

Tandem Mass Spectrum
Peptide mass: 429.212 Daltons
[Figure: abundance (%) versus mass/charge, with peaks at 88.033, 175.113, 274.112, 361.121, 430.213, and 448.225.]

Amino Acid Mass Table

Feature 1
All B-ions form a forward mass ladder.
[Figure: spectrum for peptide mass 429.212 Daltons; starting from mass 1, the peaks b1 = 88.033, b2 = 274.112, and b3 = 430.213 are separated by the masses of S, W, and R.]

Feature 2
All Y-ions form a reverse mass ladder.
[Figure: spectrum for peptide mass 429.212 Daltons; starting from mass 19, the peaks y1 = 175.113, y2 = 361.121, and y3 = 448.225 are separated by the masses of R, W, and S.]
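The two ladder features can be illustrated with a minimal Python sketch that reads residues off the gaps between consecutive ladder masses. The function name `read_ladder`, the three-entry mass table (rounded monoisotopic residue masses for S, W, and R only), and the 0.1 Da tolerance are assumptions made for this illustration; the tolerance is deliberately loose so that the rounded values from the example spectrum still match.

```python
# Rounded monoisotopic residue masses for the three amino acids in the example
# (an illustrative subset, not the talk's full amino acid mass table).
RESIDUE_MASS = {"S": 87.032, "W": 186.079, "R": 156.101}


def read_ladder(start, peaks, tol=0.1):
    """Translate the gaps between consecutive ladder masses into residue letters."""
    letters = []
    prev = start
    for peak in peaks:
        gap = peak - prev
        matches = [aa for aa, m in RESIDUE_MASS.items() if abs(gap - m) <= tol]
        letters.append(matches[0] if matches else "?")  # "?" marks an unexplained gap
        prev = peak
    return "".join(letters)


# Forward ladder: b-ions, starting from mass 1.
forward = read_ladder(1.0, [88.033, 274.112, 430.213])
# Reverse ladder: y-ions, starting from mass 19.
reverse = read_ladder(19.0, [175.113, 361.121, 448.225])
print(forward, reverse)
```

The forward ladder spells the peptide from its first residue, while the reverse ladder spells it from its last residue.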

Basic Difficulty #1
It is unknown whether an ion is a B-ion or a Y-ion.
[Figure: the spectrum for peptide mass 429.212 Daltons with unlabeled peaks at 88.033, 175.113, 274.112, 361.121, 430.213, and 448.225.]

Basic Difficulty #2
There are missing ions.
[Figure: the spectrum for peptide mass 429.212 Daltons with only two peaks left, at 274.112 and 361.121, labeled as ions 1 and 2.]

Feature 3 (to our Rescue)
Complementary Ion Pairs: b1/y2 and b2/y1
[Figure: spectrum for peptide mass 429.212 Daltons with b1 = 88.033, b2 = 274.112, b3 = 430.213 and y1 = 175.113, y2 = 361.121, y3 = 448.225 marked.]
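A minimal Python sketch of how complementary pairs could be spotted in an unlabeled spectrum: two peaks are paired when their masses sum to the peptide mass plus 20 (4H + O at nominal masses). The function name `complementary_pairs` and the 0.1 Da tolerance are assumptions for illustration, not part of the talk.

```python
from itertools import combinations


def complementary_pairs(peaks, peptide_mass, tol=0.1):
    """Return pairs of peaks whose masses sum to peptide_mass + 20 within tol."""
    target = peptide_mass + 20.0  # Mass(b-ion) + Mass(y-ion) = Mass(peptide) + 4H + O
    return [
        (a, b)
        for a, b in combinations(sorted(peaks), 2)
        if abs(a + b - target) <= tol
    ]


peaks = [88.033, 175.113, 274.112, 361.121, 430.213, 448.225]
pairs = complementary_pairs(peaks, 429.212)
print(pairs)
```

On the example spectrum this pairs 88.033 with 361.121 (b1/y2) and 175.113 with 274.112 (b2/y1). The peaks 430.213 and 448.225 stay unpaired because each corresponds to the whole peptide.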

Formulating the Computational Problem
• T = an alphabet of 20 characters a1, a2, …, a20.
• Two special characters: alpha and beta.
• The mass of alpha is 1, the mass of beta is 19, and the mass of ai is mi.
• A peptide sequence is x1, x2, x3, …, xn-1, xn, where each xi is from T.
• A b-ion is x0, x1, x2, …, xi for some 1 <= i <= n, where x0 = alpha.
• A y-ion is xi, …, xn-2, xn-1, xn, xn+1 for some 1 <= i <= n, where xn+1 = beta.

De Novo Peptide Sequencing Problem
• Input: (1) the mass W of an unknown target peptide, and (2) a set S of the masses of some or all b-ions and y-ions of the peptide.
• Output: a peptide P such that
  • (1) mass(P) = W, and
  • (2) S is a subset of all the ion masses of P.
Example: peptide mass 429.212 Daltons; P = SWR, Mass(P) = 429.212, Ions(P) = {88.033, 175.113, 274.112, 361.121, 430.213, 448.225}.
[Figure: spectrum with peaks at 274.112 and 361.121; axes: abundance (%) versus mass/charge.]
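The formulation can be made concrete with a small Python sketch that computes the b-ion and y-ion masses of a candidate peptide (using alpha = 1 and beta = 19) and checks the two output conditions. The names `ion_masses` and `is_solution`, the three-entry residue table (rounded monoisotopic masses), and the 0.1 Da tolerance are assumptions for illustration; a real mass table would cover all 20 amino acids.

```python
ALPHA, BETA = 1.0, 19.0  # masses of the special characters alpha and beta

# Illustrative subset of residue masses (rounded monoisotopic values).
RESIDUE_MASS = {"S": 87.032, "W": 186.079, "R": 156.101}


def ion_masses(peptide):
    """Masses of all b-ions (alpha + prefix) and y-ions (suffix + beta)."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    n = len(masses)
    b_ions = [ALPHA + sum(masses[:i]) for i in range(1, n + 1)]
    y_ions = [sum(masses[i:]) + BETA for i in range(n)]
    return b_ions + y_ions


def is_solution(peptide, W, S, tol=0.1):
    """Check (1) mass(P) = W and (2) every mass in S is an ion mass of P."""
    total = sum(RESIDUE_MASS[aa] for aa in peptide)
    if abs(total - W) > tol:
        return False
    ions = ion_masses(peptide)
    return all(any(abs(s - m) <= tol for m in ions) for s in S)


print(is_solution("SWR", 429.212, [274.112, 361.121]))
print(is_solution("SRW", 429.212, [274.112, 361.121]))
```

Both candidates have the right total mass, but only SWR explains the two observed ion masses, so the subset condition is what separates them.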


Basic Computing Scheme
• Input: peptide mass W and tandem mass spectrum S.
• Build the NC-spectrum graph.
• Find feasible paths to order the masses in S and identify all the b-ions and y-ions consistent with S.
• Convert feasible paths into legal peptide sequences.

NC-Spectrum Graph: Nodes (1)
[Figure: node N0 at mass 0 and node C0 at mass 429.22, the mass of this peptide.]

NC-Spectrum Graph: Nodes (2)
Ion #1 (274.11)
• Assumption 1: If Ion 1 is a y-ion, C1: a b-ion node.
• Assumption 2: If Ion 1 is a b-ion, N1: a b-ion node.
mass(N1) + mass(C1) = mass(P) + 18
[Figure: nodes by mass: N0 = 0, C1 = 174.11, N1 = 273.11, C0 = 429.22 (mass of this peptide).]

NC-Spectrum Graph: Nodes (3)
Ion #2 (88.10)
mass(N2) + mass(C2) = mass(P) + 18
[Figure: nodes by mass: N0 = 0, N2 = 87.10, C1 = 174.11, N1 = 273.11, C2 = 360.12, C0 = 429.22.]
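The node construction can be sketched in Python as follows. The coordinate formulas (N_j at w_j − 1 and C_j at W − w_j + 19) are inferred from the numbers on the slides and from the relation mass(N) + mass(C) = mass(P) + 18; they are an assumption of this sketch, as is the function name `nc_nodes`.

```python
def nc_nodes(W, ions):
    """Build the node masses of an NC-spectrum graph.

    W    -- peptide mass
    ions -- observed ion masses, in the order they are numbered (ion 1, ion 2, ...)
    """
    nodes = {"N0": 0.0, "C0": W}
    for j, w in enumerate(ions, start=1):
        nodes[f"N{j}"] = w - 1.0       # position if ion j is a b-ion
        nodes[f"C{j}"] = W - w + 19.0  # position if ion j is a y-ion
    return nodes


nodes = nc_nodes(429.22, [274.11, 88.10])
for name, mass in sorted(nodes.items(), key=lambda kv: kv[1]):
    print(name, round(mass, 2))
```

With the two example ions this reproduces the six node masses shown on the slides, and each pair N_j, C_j sums to W + 18 by construction.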

NC-Spectrum Graph: Edges (1)
Mass(S) = 87.08.
[Figure: nodes N0 = 0, N2 = 87.10, C1 = 174.11, N1 = 273.11, C2 = 360.12, C0 = 429.22, with an edge labeled S.]

NC-Spectrum Graph: Edges (2)
Mass(W) = 186.21, Mass(S) = 87.08.
[Figure: the same nodes, with edges labeled S and W.]

NC-Spectrum Graph: Edges (3)
Mass(W) = 186.21, Mass(S) = 87.08, Mass(S+W) = 273.29.
[Figure: the same nodes, with edges labeled S, W, and S+W.]

NC-Spectrum Graph: Edges (4)
Mass(W) = 186.21, Mass(R) = 156.19, Mass(S) = 87.08, Mass(S+W) = 273.29.
[Figure: the same nodes, with edges labeled S, W, R, and S+W.]
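A minimal Python sketch of edge construction: an edge is added from a lighter node to a heavier node when their mass difference matches a residue mass. This is a simplification under stated assumptions. It only considers single-residue gaps (the slides also draw a multi-residue edge, S+W), it uses a three-entry table of rounded monoisotopic masses rather than the values quoted on the slides (87.08, 186.21, 156.19, which look like average masses), and the 0.1 Da tolerance and the name `nc_edges` are hypothetical.

```python
# Illustrative subset of residue masses (rounded monoisotopic values).
RESIDUE_MASS = {"S": 87.032, "W": 186.079, "R": 156.101}

# Node masses as shown on the slides.
NODES = {"N0": 0.0, "N2": 87.10, "C1": 174.11, "N1": 273.11, "C2": 360.12, "C0": 429.22}


def nc_edges(nodes, tol=0.1):
    """Return (from, to, residue) for every node pair whose gap matches one residue."""
    ordered = sorted(nodes.items(), key=lambda kv: kv[1])
    edges = []
    for i, (u, mass_u) in enumerate(ordered):
        for v, mass_v in ordered[i + 1:]:
            gap = mass_v - mass_u
            for aa, m in RESIDUE_MASS.items():
                if abs(gap - m) <= tol:
                    edges.append((u, v, aa))
    return edges


edges = nc_edges(NODES)
for edge in edges:
    print(edge)
```

Because every edge points from a lighter node to a heavier one, the resulting graph is acyclic. Comparing all node pairs takes quadratic time in the number of nodes, which is fine for a sketch.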

NC-Spectrum Graph
[Figure: the complete graph on nodes N0 = 0, N2 = 87.10, C1 = 174.11, N1 = 273.11, C2 = 360.12, C0 = 429.22.]

NC-Spectrum Graph: Paths = Sequences
[Figure: a path through b-ion nodes whose edges are labeled S, W, and R.]

NC-Spectrum Graph: A Feasible Path (1)
Definition: A feasible path is a path from N0 to C0 that goes through exactly one node for each pair (either Nj or Cj).
[Figure: a feasible path through b-ion nodes, with edges labeled S, W, and R.]

NC-Spectrum Graph: A Feasible Path (2)
[Figure: a feasible path through b-ion and y-ion nodes, with edges labeled S, S, and GVV.]

NC-Spectrum Graph: Not A Feasible Path (1)
• Not a feasible path: it misses ion #2.

NC-Spectrum Graph: Not A Feasible Path (2)
• Not a feasible path: it repeats ion #1.

NC-Spectrum Graph: Not A Feasible Path (3)
• Not a feasible path: it misses ion #2 and repeats ion #1.
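The feasibility condition from the definition can be checked mechanically. The sketch below is a minimal Python illustration; the function name `is_feasible` and the representation of a path as a list of node names are assumptions, and the check covers only the "exactly one of Nj or Cj" condition, not whether consecutive nodes are actually joined by edges.

```python
def is_feasible(path, k):
    """Check that path runs from N0 to C0 and uses exactly one of Nj/Cj for j = 1..k."""
    if len(path) < 2 or path[0] != "N0" or path[-1] != "C0":
        return False
    inner = path[1:-1]
    if len(inner) != k:  # one node per ion, nothing extra
        return False
    for j in range(1, k + 1):
        hits = inner.count(f"N{j}") + inner.count(f"C{j}")
        if hits != 1:  # 0 means the ion is missed, 2 means it is repeated
            return False
    return True


print(is_feasible(["N0", "N2", "N1", "C0"], k=2))        # uses N2 and N1
print(is_feasible(["N0", "N1", "C0"], k=2))              # misses ion #2
print(is_feasible(["N0", "N2", "C1", "N1", "C0"], k=2))  # repeats ion #1
```

The three calls mirror the slides' cases: a feasible path, a path that misses ion #2, and a path that repeats ion #1.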

Reformulating the De Novo Peptide Sequencing Problem
• Input: an NC-spectrum graph G.
• Output: a feasible path from N0 to C0.
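To make the reformulated problem concrete, here is a minimal Python sketch that enumerates feasible paths by exhaustive depth-first search. This is not the talk's algorithm: it can take exponential time and is only meant to show what a solution looks like. The edge list is a hypothetical one for the running example, with endpoints inferred from the node masses (including a multi-residue edge labeled GVV from C1 to C0), and the name `feasible_paths` is an assumption.

```python
# Hypothetical edge list for the example graph: (from, to, label).
EDGES = [
    ("N0", "N2", "S"), ("N2", "C1", "S"), ("N2", "N1", "W"),
    ("C1", "C2", "W"), ("C1", "C0", "GVV"),
    ("N1", "C2", "S"), ("N1", "C0", "R"),
]


def feasible_paths(edges, k):
    """Enumerate edge-label sequences of all feasible paths from N0 to C0."""
    out = {}
    for u, v, label in edges:
        out.setdefault(u, []).append((v, label))
    results = []

    def walk(node, used, labels):
        if node == "C0":
            if len(used) == k:  # every ion 1..k was visited exactly once
                results.append(labels)
            return
        for nxt, label in out.get(node, []):
            ion = int(nxt[1:])  # "N2" -> 2, "C1" -> 1, "C0" -> 0
            if ion in used:     # would visit the same ion twice
                continue
            nxt_used = used if nxt == "C0" else used | {ion}
            walk(nxt, nxt_used, labels + [label])

    walk("N0", frozenset(), [])
    return results


paths = feasible_paths(EDGES, k=2)
print(paths)
```

On this edge list the search finds two feasible paths, one spelling S, W, R and one spelling S, S, GVV, which correspond to the two feasible paths illustrated in the talk.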

Observations
• A longest path does not always go through exactly one of each pair of nodes.
• It is an NP-hard problem if the spectrum graph is a general directed graph.

Basic Algorithm
• Input: a peptide mass W and a tandem mass spectrum S.
• Output: a feasible peptide sequence.
• Steps:
  • Compute the nodes of the NC-spectrum graph G.
  • Compute the edges of G.
  • Compute a feasible path P in G.
  • Convert P into a feasible sequence.


Compute the Nodes of the NC-Spectrum Graph
Step 1. Compute the nodes and place them in increasing order of mass.
[Figure: N0 = 0, N2 = 87.10, C1 = 174.11, N1 = 273.11, C2 = 360.12, C0 = 429.22.]
Step 2. Rename the nodes from left to right as X0, …, Xk, Yk, …, Y0.
[Figure: X0 = 0, X1 = 87.10, X2 = 174.11, Y2 = 273.11, Y1 = 360.12, Y0 = 429.22.]
Observation: Xi and Yi form a complementary pair of nodes Ni and Ci for ion i.
Running Time: O(k), where k = # of masses in the spectrum.
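Steps 1 and 2 can be sketched in Python as follows. The function name `rename_nodes` and the dictionary layout are assumptions for illustration. The sketch calls a general-purpose sort, which costs O(k log k), so it does not attempt to reproduce the O(k) bound stated above.

```python
# Node masses from the running example.
NODES = {"N0": 0.0, "C0": 429.22, "N1": 273.11, "C1": 174.11, "N2": 87.10, "C2": 360.12}


def rename_nodes(nodes):
    """Sort nodes by mass and rename them X0..Xk, Yk..Y0 from left to right."""
    ordered = sorted(nodes.items(), key=lambda kv: kv[1])
    k = len(ordered) // 2 - 1  # number of ions (the N0/C0 pair is not an ion)
    renamed = {}
    for i in range(k + 1):
        renamed[f"X{i}"] = ordered[i]       # i-th node from the left
        renamed[f"Y{i}"] = ordered[-1 - i]  # i-th node from the right
    return renamed


renamed = rename_nodes(NODES)
for i in range(3):
    print(f"X{i} =", renamed[f"X{i}"], f"Y{i} =", renamed[f"Y{i}"])
```

Because each pair Nj, Cj sums to the same value, the pairs nest symmetrically around the middle of the mass axis, which is why the i-th node from the left and the i-th node from the right always belong to the same ion.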

