Popitam,

Popitam, a mutation- and modification-tolerant method for identifying proteins from mass spectrometry (MS/MS) data
Patricia Hernandez, Swiss Institute of Bioinformatics

Overview
- proteomics
    - proteome
    - proteome visualization: 2D gels
- protein identification
    - classical workflow
    - shared peak count
- modifications and identification
    - modified peptides
    - SPC
    - spectral alignment, de novo sequencing, tag extraction
- Popitam
    - overview
    - tags
    - scoring function, genetic programming
    - some results

proteomics
Proteome
--> Proteomics: the science that studies the proteins expressed by a genome
--> proteome
    --> changes with the state of development, the tissue or the environmental conditions
--> identification and quantification
--> 3D structure prediction
--> localisation in the cell
--> biological function
--> modifications
--> interactions with other proteins
...

proteomics
2D gels
--> a simple way to "see" a proteome
--> numerous proteins from a biological sample (example: blood) are separated according to 2 criteria:
    - molecular weight of the protein
    - isoelectric point
--> this method separates thousands of proteins simultaneously and displays them on a two-dimensional map
--> spot = (generally) one purified protein
--> we can "see" the proteins, but we don't know which protein a given spot corresponds to...

protein identification
Spot identification: classical workflow
--> identifying a spot = giving a protein name to a spot
--> protein databases (for example SwissProt)
    - record all known protein sequences
    - annotated
MS identification (PMF)
    - select an unknown purified protein: MGQGWATAGLPSFRPEPYKCYGHPVPSQEASQQVTVKTHGTSSQATTSSQK…
    - cut the amino acid chain into peptides (at every K and R)
    - measure the masses of the peptides by MS
MS/MS identification
    - select a peptide
    - fragment it
    - measure the masses of the fragments by MS
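The digestion step of the workflow above can be sketched in a few lines. This is a minimal illustration that follows the slide's rule literally (cut after every K and R); the function name `digest` is an assumption made for this example, not part of Popitam.

```python
def digest(sequence: str, cleave_after: str = "KR") -> list[str]:
    """Cut a protein sequence after every residue listed in cleave_after."""
    peptides = []
    start = 0
    for i, aa in enumerate(sequence):
        if aa in cleave_after:
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        # Keep the C-terminal remainder when the sequence does not end in K or R.
        peptides.append(sequence[start:])
    return peptides


# Beginning of the example sequence shown on the slide.
protein = "MGQGWATAGLPSFRPEPYKCYGHPVPSQEASQQVTVK"
peptides = digest(protein)
print(peptides)
```

The second peptide produced here, PEPYK, is the one used as the running example on the "Modified peptides (2)" slide.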

protein identification
Shared peak count
- MS spectrum: list of the masses of the peptides that constitute the protein of interest
- MS/MS spectrum: list of the masses of the fragments that constitute a peptide of the protein of interest
- MS: virtually cut the theoretical sequences into peptides and compute their masses
- MS/MS: virtually cut the theoretical sequences into peptides, further cut the peptides into fragments, and compute their masses
--> compare the lists of experimental and theoretical masses in order to find the best match between the experimental and virtual spectra
[Figure: experimental spectrum (detection, ions, noise) compared with virtual spectra computed from the protein database, e.g. hbb_human]
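A shared peak count can be sketched as follows. The mass tolerance, the function names and the toy mass lists are assumptions made for illustration; real search engines use more elaborate matching and scoring.

```python
def shared_peak_count(experimental, theoretical, tolerance=0.5):
    """Count theoretical masses matched by at least one experimental mass."""
    return sum(
        any(abs(t - e) <= tolerance for e in experimental)
        for t in theoretical
    )


def best_match(experimental, database, tolerance=0.5):
    """Score every database entry and return the best one with all scores."""
    scores = {
        name: shared_peak_count(experimental, masses, tolerance)
        for name, masses in database.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical masses, chosen only to make the example easy to follow.
experimental = [100.0, 200.1, 300.0, 455.2]
database = {
    "candidate_1": [100.0, 200.0, 300.0, 400.0],
    "candidate_2": [150.0, 200.0, 350.0, 455.0],
}
winner, scores = best_match(experimental, database)
print(winner, scores)
```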

modifications and identification
Modified peptides (1)
PTMs
--> occur in most eukaryotic proteins
--> addition of a chemical group:
    - methylation: +14
    - phosphorylation: +80
    - glycosylation: >800
    ...
--> participate in:
    - protein structures
    - protein functions
    - control of metabolic pathways
The sequence in the database may differ from the experimental peptide:
- CONFLICT (different sources report differing sequences) --> in about 4'600 human entries
- VARIANT (authors report that sequence variants exist) = alleles --> in about 2'200 human entries
- MUTATIONS associated with diseases --> 187 references to mutations and diseases in the COMMENTS section

modifications and identification
Modified peptides (2)
[Figure: a modified protein (MGQGWATAGLPSFRPEPYKCYGHPVPSQEASQQVTVKTHGTSSQATTSSQK…) goes through digestion, then MS and selection of the peptide (PEPYK), then fragmentation (fragments such as PEP, EPYK, PYK); both spectra are plotted as intensity against m/z.]

modifications and identification
SPC and modified peptides
[Figure: an experimental MS/MS spectrum and a modified experimental MS/MS spectrum, each compared with the spectrum of the theoretical peptide (intensity against m/z).]
"Shared peak count" algorithms have to introduce modifications into the theoretical peptide databases.

modifications and identification
Database size (1)
Initial database: AAIEGKLMQRAPALK, cut into the peptides AAIEGK, LMQR, APALK
New database, if the two following modifications are taken into account:
- modification occurring on amino acid A: A->a
- modification occurring on amino acids L: L->l and E: E->e
= all the peptides from the initial database, plus all modified peptides that can be built from the initial database:
    AAIEGK aAIEGK AaIEGK aaIEGK AAIeGK aAIeGK AaIeGK aaIeGK
    LMQR lMQR
    APALK aPALK APaLK aPaLK APAlK aPAlK APalK aPalK
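The expansion of the database shown above can be reproduced with a short enumeration. This is a sketch only: the function name `modified_variants` and the convention of writing a modified residue in lowercase are taken from the slide's notation, not from any library.

```python
from itertools import product


def modified_variants(peptide, modifications):
    """Return every variant of peptide in which each modifiable residue
    is either left as-is or replaced by its modified form."""
    options = [
        (aa, modifications[aa]) if aa in modifications else (aa,)
        for aa in peptide
    ]
    return ["".join(choice) for choice in product(*options)]


mods = {"A": "a", "L": "l", "E": "e"}
initial_database = ["AAIEGK", "LMQR", "APALK"]
expanded = [v for p in initial_database for v in modified_variants(p, mods)]
print(len(initial_database), "->", len(expanded))
```

Each peptide with s modifiable positions yields 2^s entries, which is why the database grows so quickly once several modification types are allowed.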

modifications and identification
Database size (2)
Aim: assess the number of peptides that contain zero, one, two... "positions" for a possible modification
B(L,p,k) gives the probability of having k positions of modification in a sequence of length L, if p is the probability that a position may be modified (we assume the positions to be independent)
    N0: xxxx
    N1: oxxx xoxx xxox xxxo
    N2: ooxx oxox oxxo xoox xoxo xxoo
L = 10, p = 1/20: 800'000 = 478'990 + 252'100 + 59'710 + 8'380 + 771 + c
L = 10, p = 5/20: 800'000 = 45'050 + 150'169 + 225'254 + 200'225 + 116'798 + c
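The decomposition of the 800'000 peptides can be checked numerically, assuming B(L,p,k) is the binomial probability C(L,k) p^k (1-p)^(L-k), as the independence assumption suggests. The function names are chosen for this example.

```python
from math import comb


def binomial(L, p, k):
    """Probability that exactly k of L independent positions are modifiable."""
    return comb(L, k) * p**k * (1 - p) ** (L - k)


def peptides_by_site_count(n, L, p, max_k=4):
    """Expected number of peptides N_k with k modifiable positions, k = 0..max_k."""
    return [n * binomial(L, p, k) for k in range(max_k + 1)]


for p in (1 / 20, 5 / 20):
    counts = peptides_by_site_count(800_000, 10, p)
    print(p, [round(c) for c in counts])
```

The values for k = 0 to 4 agree with the figures on the slide to within rounding; the remainder c collects the terms for k > 4.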

modifications and identification
Database size (3)
- Expected numbers of peptides that may contain exactly M modifications
- Expected size of the database when taking into account 0 to M modifications
    N0: xxxx
    N1: oxxx xoxx xxox xxxo
    N2: ooxx ooxx oxox oxox ...

modifications and identification
Database size (3)
SwissProt Human, 10'000 proteins
n = 806'787 peptides in [300,3000] (=~ from 3 to 30 aa)
L = 11 amino acids
- 0 to 3 modifications occurring on one specific amino acid: p = 1/20, P0to3_mod = 1'375'700 + c
- 0 to 3 modifications that may occur on several loci, e.g. phosphorylation on H, D, S, T, Y (eukaryotes): p = 5/20, P0to3_mod = 4'865'100 + c
- 0 to 3 modifications that may occur on every amino acid: p = 1, P0to3_mod = 3.97e12 + c
Mutation scenario: each amino acid may mutate into one of the remaining 19 amino acids:
All possible words = 19k-1, P1_mut = 1.16e14
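The formulas behind these figures did not survive in the transcript. One reading that reproduces the first two of them to within rounding is sketched below, under these assumptions: a peptide with k modifiable positions contributes one database entry per way of modifying 0 to M of those positions, and the sum over k is cut off at k = 4 (the rest being the slide's "+ c"). Both the formula and the name `expected_database_size` are reconstructions, not the author's stated method, and this reading does not cover the p = 1 or the mutation figures.

```python
from math import comb


def expected_database_size(n, L, p, max_mods, max_sites=4):
    """Expected number of entries when 0..max_mods modifications are allowed.

    Assumption: peptides with more than max_sites modifiable positions are
    left out, which mirrors the "+ c" remainder on the slide.
    """
    total = 0.0
    for k in range(max_sites + 1):
        # Expected number of peptides with exactly k modifiable positions.
        n_k = n * comb(L, k) * p**k * (1 - p) ** (L - k)
        # Ways to modify between 0 and max_mods of those k positions.
        variants = sum(comb(k, m) for m in range(min(k, max_mods) + 1))
        total += n_k * variants
    return total


one_site = expected_database_size(806_787, 11, 1 / 20, max_mods=3)
phospho = expected_database_size(806_787, 11, 5 / 20, max_mods=3)
print(round(one_site), round(phospho))
```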

modifications and identification
Other strategies
2 major problems:
- size of the database
- a priori knowledge of the deltaMass due to the modification
Solutions: define an identification algorithm that is not based on an SPC
--> spectral convolution/alignment
    - PEDENTA (2000)
--> de novo sequencing followed by sequence matching
    - extraction of one or several complete sequences: LUTEFISK (1997), SHERENGA (1999)...
    - extraction of one or several small tags: PeptideSearch (1994), Patchwork sequencing...
--> Popitam (2003): "guided" sequencing

modifications and identification
Spectral convolution/alignment
Pevzner PA, Dancik V, Tang CL: Mutation-tolerant protein identification by mass spectrometry. J.Comput.Biol. 2000, 7:777-787
Key idea: k-similarity D(k)
Given Sexp and Stheo, the goal is to find a series of k shifts in Sexp that makes Sexp and Stheo as similar as possible. D(k) represents the maximum number of elements in common between a theoretical and an experimental spectrum after k shifts.
[Figure: a theoretical and an experimental MS/MS spectrum with peaks labelled A to F, and a recurrence with one case when (i',j') and (i,j) are co-diagonal and another case otherwise. SPC score: D(k=0) = 2; SA score: D(k=2) = 6.]
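The convolution named in the slide title can be illustrated on a toy pair of spectra. This sketch only counts mass differences between the two spectra; it is not the dynamic program that computes D(k). Integer masses and the +80 shift are hypothetical values chosen so that the counts are easy to verify by hand.

```python
from collections import Counter


def spectral_convolution(experimental, theoretical):
    """Multiplicity of every mass difference (experimental - theoretical)."""
    return Counter(e - t for e in experimental for t in theoretical)


theoretical = [100, 215, 340, 480, 650]
# Same peptide with a hypothetical +80 modification: the last three peaks move.
experimental = [100, 215, 420, 560, 730]

conv = spectral_convolution(experimental, theoretical)
print("shared peaks without shift:", conv[0])
print("most frequent offset:", conv.most_common(1)[0])
```

The multiplicity at offset 0 is the plain shared peak count, and a high multiplicity at another offset points to a candidate mass shift, without knowing that shift in advance.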

modifications and identification
De novo sequencing
Taylor JA, Johnson RS: Sequence database searches via de novo peptide sequencing by tandem mass spectrometry. Rapid Commun.Mass Spectrom. 1997, 11:1067-1075
Longest path problem in a directed acyclic graph
--> dynamic programming
--> complete sequences
--> mutations, but no modifications
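The longest-path formulation can be sketched with a small dynamic program. Assumptions made for this example: nodes are numbered in increasing mass order so that every edge goes from a lower to a higher index, each edge carries an amino acid label and a hypothetical score, and the path must run from node 0 to the last node.

```python
import math


def longest_path(n_nodes, edges):
    """Best-scoring path from node 0 to node n_nodes - 1.

    edges: iterable of (u, v, label, score) with u < v, so that visiting
    edges in sorted order respects a topological order of the graph.
    """
    best = [-math.inf] * n_nodes
    best[0] = 0.0
    back = [None] * n_nodes
    for u, v, label, score in sorted(edges):
        if best[u] + score > best[v]:
            best[v] = best[u] + score
            back[v] = (u, label)
    # Walk the back-pointers from the final node to recover the sequence.
    labels = []
    node = n_nodes - 1
    while back[node] is not None:
        node, label = back[node]
        labels.append(label)
    return best[-1], "".join(reversed(labels))


# Toy graph: "X" stands for a single jump that skips node 1.
edges = [(0, 1, "P", 1.0), (1, 2, "E", 1.0), (0, 2, "X", 1.5), (2, 3, "K", 1.0)]
print(longest_path(4, edges))
```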

modifications and identification
Tag extraction
Mann M, Wilm M: Error-tolerant identification of peptides in sequence databases by peptide sequence tags. Anal.Chem. 1994, 66:4390-4399
- island of sequence ions
- the tags (m1-SEQ-m2) are manually extracted
- 2 steps: tags as filtering, then SPC
Schlosser A, Lehmann WD: Patchwork peptide sequencing: Extraction of sequence information from accurate mass data of peptide tandem mass spectra recorded at high resolution. Proteomics. 2002, 2:524-533
- based on very accurate masses (10 mDa)
- small tags are extracted from low mass regions (2 aa)

Popitam
Popitam's key ideas
Spectrum graph
--> a good way to structure the information contained in the MS/MS spectrum; allows mutations
Tags
--> modified source peptides
--> fragmented spectra
Search space
--> use database information during tag extraction
--> take into account only mutations compatible with the spectrum (graph)
--> build only modification scenarios compatible with the current theoretical peptide
Scoring function
--> takes many parameters into account
--> genetic programming

Popitam
Popitam overview
For each Pi:
    extractTags();
    processTags();
    score();
[Figure: MS/MS spectrum, filter, spectrum graph with initial node and final node; peptide sequence database (any source of biological sequences) providing P1, P2, ...; I(P1), I(P2), ...; IDENTIFICATION]

Popitam
Spectrum graph
- # nodes > # peaks
- families
- selection based on intensity
- for each peak, make all possible hypotheses:
    "N-term": bMass = chargeNb * m/z - (chargeNb-1) - offset
    "C-term": bMass = PM - […]
[Figure: a measured mass [m/z] mapped to bMass (ideal fragmentation) under ion-type hypotheses such as b+, b+-NH3, a+-H2O, y++]
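The "N-term" conversion can be written out directly from the formula on the slide. Everything else in this sketch is an assumption: the offset table uses approximate mass shifts for a few ion types purely as placeholders, and the "C-term" case is left out because its formula is elided on the slide.

```python
def n_term_bmass(mz, charge, offset):
    """Slide formula: bMass = chargeNb * m/z - (chargeNb - 1) - offset."""
    return charge * mz - (charge - 1) - offset


# Placeholder offsets relative to the ideal b fragment (illustrative values).
ION_OFFSETS = {"b": 0.0, "a": -27.99, "b-NH3": -17.03, "b-H2O": -18.01}


def node_hypotheses(mz, offsets, max_charge=2):
    """One candidate node per (ion type, charge) hypothesis for a single peak."""
    return [
        (ion, charge, n_term_bmass(mz, charge, offset))
        for ion, offset in offsets.items()
        for charge in range(1, max_charge + 1)
    ]


for hypothesis in node_hypotheses(500.0, ION_OFFSETS):
    print(hypothesis)
```

A single peak produces several candidate nodes, which is why the graph has more nodes than the spectrum has peaks.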

Popitam
Tag extraction
[Figure: an example spectrum graph with labelled edges and the list of tags extracted from it]
9 nodes, 11 edges --> 21 tags
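Exhaustive tag extraction amounts to listing every path of the graph. The sketch below does this on a small hypothetical graph (not the one from the slide); the adjacency format and the function name are assumptions made for the example.

```python
def extract_tags(edges, min_len=1):
    """Return the label string of every path with at least min_len edges.

    edges: dict mapping a node to a list of (next_node, label) pairs.
    """
    tags = []

    def walk(node, labels):
        if len(labels) >= min_len:
            tags.append("".join(labels))
        for nxt, label in edges.get(node, []):
            walk(nxt, labels + [label])

    for start in edges:
        walk(start, [])
    return tags


# Hypothetical graph: two parallel edges (L and I) between nodes 0 and 1.
graph = {0: [(1, "L"), (1, "I")], 1: [(2, "T")], 2: [(3, "E")]}
print(extract_tags(graph))
print(extract_tags(graph, min_len=3))
```

Even this four-node graph yields nine tags, which gives a feel for how fast the number of tags grows with the number of edges.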

Popitam
Tag extraction (2)
Pentium, 1.6 GHz
LVNELTEFAK (125 peaks)
AIGGGLSSVGGSSTIK (1159 peaks)
    1    16/97      5.6*10^4    0m02s
    2    30/338     5.4*10^6    0m27s
    3    44/692     5.7*10^7    3m16s
    4    58/1121    3.4*10^8    21m09s
    5    72/1667    2.3*10^9    2h17m07s
AHFSISNSAEDPFIAIHADSK (145 peaks)
    1    24/121     6.1*10^4    0m02s
    2    46/308     1.9*10^8    16m15s
    3    68/831     2.0*10^10   22h06m47s

Popitam
Tag extraction (3)
Recursively extract from the graph all tags that are compatible with the current theoretical peptide
--> a tag = a path (bMass, edge label, ionic hypothesis…)
[Figure: recursive extraction for the theoretical peptide ACCACMCAK, matching graph edges (A, C, k, ...) against the remaining suffixes CACMCAK, CMCAK, MCAK]
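Guiding the extraction by the theoretical peptide can be sketched as a recursion that only follows an edge when its label matches the next residue of the peptide. This is a simplified illustration: it ignores bMass constraints, ionic hypotheses and edges that span two amino acids, and the graph and names are hypothetical.

```python
def guided_tags(edges, peptide, min_len=2):
    """Tags (substrings of peptide) that can be read along a path of the graph.

    edges: dict mapping a node to a list of (next_node, label) pairs.
    """
    tags = set()

    def walk(node, pos, start_pos):
        if pos - start_pos >= min_len:
            tags.add(peptide[start_pos:pos])
        if pos == len(peptide):
            return
        for nxt, label in edges.get(node, []):
            if label == peptide[pos]:  # prune edges the peptide cannot explain
                walk(nxt, pos + 1, start_pos)

    for node in edges:
        for pos in range(len(peptide)):
            walk(node, pos, pos)
    return tags


graph = {0: [(1, "L"), (1, "I")], 1: [(2, "T")], 2: [(3, "E")]}
print(sorted(guided_tags(graph, "PELTEK")))
```

The edge labelled I is never explored for this peptide, which is the point of the guided search: the database sequence prunes the graph while tags are being extracted.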

Popitam
Tag processing
- discard subtags
- discard tags that begin the theo. peptide, but not the graph (and vice versa)
- discard tags that finish on the last aa, but not on the last node
- group "family" tags
AVVQDPALKPLALVYGEATSR
PeakNb : 1260    ParentMass : 2197.15    NodeNb : 86    EdgeNb : 142 / 1098
29 tags --> 13 subSeqs
    KplALVYGE     30 39 43 45 50 58 63 64 68
    plALVYGE      39 43 45 50 58 63 64 68
    ALVYGE        43 45 50 58 63 64 68
    LVYGE         45 50 58 63 64 68
    VYGE          50 58 63 64 68
    YGE           58 63 64 68
    paLKplALvy    0 4 10 16 22 26 31 42
    LKplALvy      4 10 16 22 26 31 42
    KplALvy       10 16 22 26 31 42
    plALvy        16 22 26 31 42
    ALvy          22 26 31 42
    LKPla         10 13 19 22 31
    LKPla         10 14 19 22 31
    KPla          13 19 22 31
    KPla          14 19 22 31
    PLAlv         29 35 40 42 48
    LAlv          35 40 42 48
    DpaL          65 69 78 84
    LKP           11 15 20 24
    LVY           16 19 24 29
    LVY           44 49 57 62
    PAL           19 22 26 31
    QDP           10 16 20 24
    alkpL         54 63 71 75
    avVqd         0 5 9 18
    dpAL          37 43 45 50
    avVQD         55 60 65 70 75
    VQD           60 65 70 75
    paLK          59 66 69 75

Popitam
Subsequence processing (1)
Aim: find all possible arrangements of subsequences, given the theoretical peptide, BUT do not include in the same arrangement tags that are incompatible with the others.
Compatibility rules:
--> no peak shared
--> beginMasses must respect positions in the sequences
Theoretical peptide: A V V Q D P A L K P L A L V Y G E A T S R (positions 0 to 20)
    0  KplALVYGE   794.41    0 1 2 6 15 19 21 27 30
    1  LKPla       282.17    2 7 29 33 41
    2  PLAlv       785.34    6 8 19 21 28
    3  DpaL        1673.89   14 20 31 36
    4  LKP         284.11    17 22 32 36
    5  LVY         410.26    14 22 28 29
    ...
[Figure: compatibility graph, shown as a matrix over subsequences 0, 1, 2, 3, 4, 5, ... with an x for each compatible pair]
Each clique found in the graph is a possible arrangement of subsequences.
Here, 91 cliques, but most of them are really uninteresting.
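Clique enumeration on a compatibility graph can be sketched with the classic Bron-Kerbosch recursion. Two assumptions apply: the sketch lists maximal cliques only (the slide does not say whether non-maximal ones are counted), and the adjacency below is a made-up example, since the matrix on the slide could not be recovered.

```python
def maximal_cliques(adjacency):
    """Bron-Kerbosch enumeration of maximal cliques.

    adjacency: dict mapping each vertex to the set of vertices compatible with it.
    """
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(sorted(r))
            return
        for v in list(p):
            expand(r | {v}, p & adjacency[v], x & adjacency[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adjacency), set())
    return sorted(cliques)


# Hypothetical compatibility between five subsequences, numbered 0 to 4.
compatible = {0: {3, 4}, 1: {2, 3}, 2: {1, 3}, 3: {0, 1, 2}, 4: {0}}
print(maximal_cliques(compatible))
```

Each clique returned is a set of subsequences that are pairwise compatible and can therefore be placed together in one arrangement.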

Popitam
Scoring function (1)
--> 2-level scoring:
- scoring linked to the subsequences (local)
    subscores:
    - number of tags that compose the subsequence
    - length of the subsequence
    - occurrence probabilities of the hypothesized ionic types (geometric/arithmetic mean)
- scoring linked to the arrangement (global)
    subscores:
    - global coverage
    - linear regression
Example arrangements for AVVQDPALKPLALVYGEATSR:
    KplALVYGE 794.4    LKP 284.1    LVY 1202.7
    KplALVYGE 794.4    LKP 284.1    LVY 1202.7    avVqd 1.0
    KplALVYGE 794.4    LKP 284.1    avVqd 1.0
    KplALVYGE 794.4    LVY 1202.7
    ...

Popitam
Scoring function (2)
How can we combine the subscores in order to build an efficient scoring function?
--> empirical function (expert knowledge)
--> probabilistic function
--> function built using GENETIC PROGRAMMING
GENETIC PROGRAMMING
population of "programs": trees
- nodes:
    - mathematical operators (+, -, *, /, ^, ...)
    - boolean operators (AND, OR, NOT...)
    - conditional operators (if-then-else...)
    - iterative functions (do-until...)
    - other specific functions...
- leaves: subscores, coefficients
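A program tree of this kind can be represented and evaluated in a few lines. In this sketch, internal nodes are tuples (operator, left, right), leaves are either a subscore name or a numeric coefficient, and only three arithmetic operators are supported. The subscore names are hypothetical and do not come from Popitam.

```python
import operator

OPERATORS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def evaluate(tree, subscores):
    """Recursively evaluate a program tree on a dict of subscore values."""
    if isinstance(tree, tuple):
        op, left, right = tree
        return OPERATORS[op](evaluate(left, subscores), evaluate(right, subscores))
    if isinstance(tree, str):
        return subscores[tree]  # leaf: a named subscore
    return float(tree)  # leaf: a coefficient


# 2.0 * coverage + tag_count * mean_length
program = ("+", ("*", 2.0, "coverage"), ("*", "tag_count", "mean_length"))
score = evaluate(program, {"coverage": 0.5, "tag_count": 3, "mean_length": 4.0})
print(score)
```

Genetic operators then act on such trees, for instance by swapping subtrees between two programs or replacing a leaf.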

Popitam
Genetic operators (1)
Initialization: the programs are initially generated at random (structure, functions, values).
Iterations: at each iteration, the programs are evaluated (fitness function). Only the best are allowed to reproduce, using genetic operators (permutation, mutation, crossing-over...).

Popitam Genetic operators (2)

Popitam
Genetic programming
Genetic programming allows testing several scoring functions and making them "cleverly" evolve in order to find an optimal one.
Fitness contribution si of each test spectrum:
    if (correctId())
        si ∈ ]0.5;1[    (according to the discriminative power)
    else {
        if (belongToList())
            si ∈ ]0;0.5]    (according to the position in the list)
        else
            si = 0;
    }
[Figure: tree population; each tree (scoring function 1, scoring function 2, scoring function 3, ...) is plugged into Popitam, which returns its fitness]
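The piecewise fitness rule can be sketched as below. The slide only gives the intervals; the linear mappings inside each interval, the parameter names and the list size of 10 are assumptions made for this example.

```python
def fitness(rank, discriminative_power, list_size=10):
    """Score in [0, 1[ for one test spectrum.

    rank: 1-based position of the correct peptide in the candidate list,
          or None when it is absent from the list.
    discriminative_power: value in ]0, 1[ (hypothetical measure of how well
          the correct peptide is separated from the other candidates).
    """
    if rank == 1:
        # Correct identification: map into ]0.5; 1[.
        clamped = min(max(discriminative_power, 1e-9), 1 - 1e-9)
        return 0.5 + 0.5 * clamped
    if rank is not None and rank <= list_size:
        # Present in the list: rank 2 gives 0.5, lower positions give less.
        return 0.5 * (list_size - rank + 1) / (list_size - 1)
    return 0.0


print(fitness(1, 0.5), fitness(2, 0.0), fitness(10, 0.0), fitness(None, 0.0))
```

The fitness of a whole scoring function would then combine these values over a set of test spectra, for example by averaging them.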

Popitam Some results
