An Application of the Metric Access Methods to the Mass Spectrometry Data

Jiří Novák and David Hoksza, Charles University in Prague

Presentation Outline
• Mass Spectrometry (MS)
  • basic principles
  • existing methods for interpretation of the mass spectra
  • common problems of interpretation
• Proposed Method
  • setup of our algorithm, step by step
  • metric access methods (MAMs)
  • experiments
• Conclusions and Future Work

Mass Spectrometry (MS)
• method for identifying unknown protein (or peptide) sequences
• determines the masses of peptide (or peptide fragment) ions
• simple MS (one spectrum) vs. tandem MS/MS (collection of spectra)
• example: a protein sequence digested into peptides by trypsin: MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASEDLK...
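The digestion example can be reproduced in a few lines. The sketch below assumes the commonly used rule that trypsin cleaves after K or R unless the next residue is P; the function name `trypsin_digest` is hypothetical and missed cleavages are ignored.

```python
import re

def trypsin_digest(sequence: str) -> list[str]:
    """Split a protein after every K or R that is not followed by P."""
    # Zero-width split point: preceded by K/R, not followed by P.
    pieces = re.split(r"(?<=[KR])(?!P)", sequence)
    return [p for p in pieces if p]  # drop the empty tail after a final K/R

# Sequence fragment taken from the slide (the trailing "..." is omitted).
protein = "MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASEDLK"
peptides = trypsin_digest(protein)

for peptide in peptides:
    print(peptide)
```

Each printed peptide ends in K or R, which is what makes the masses of tryptic peptides predictable from a sequence database.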

Interpretation of Spectra
• main idea: different amino acids (AAs) have different masses
• 2 basic approaches: graph algorithms vs. a DB of known protein sequences
• De Novo Peptide Sequencing (for MS/MS)
  • direct spectra interpretation using graph algorithms
  • many paths in the graph represent many peptide sequences corresponding to one experimental spectrum
• Peptide Mass Fingerprinting (PMF, for MS) and Peptide Fragment Fingerprinting (PFF, for MS/MS)
  • search a DB of already known protein sequences (or a DB generated by DNA translation)
  • theoretical spectra are generated from the stored sequences and compared with experimental spectra
• Sequence Tag (for MS/MS)
  • combined approach
  • short sequences (tags) are determined by hand (or using De Novo) in each spectrum, then the DB is searched

Problems of Interpretation
• single amino acids (or groups of them) with similar masses can be mistaken for one another
• posttranslational modifications
  • AA masses are changed
• some peaks important for identification (y- or b-ions) are missing
  • fragment ions do not arise
• noise
  • up to 80% of peaks
  • peaks of fragment ions with unpredictable chemical structure
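The first problem can be made concrete with standard monoisotopic residue masses. The sketch below lists residue pairs whose masses differ by at most a tolerance; the 0.05 Da tolerance and the function name `confusable_residues` are assumptions chosen for illustration.

```python
from itertools import combinations

# Monoisotopic residue masses in Da (standard values, rounded to 5 decimals).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}

def confusable_residues(tol: float = 0.05) -> list[tuple[str, str]]:
    """Return pairs of single residues whose masses differ by at most tol."""
    return [
        (a, b)
        for a, b in combinations(sorted(RESIDUE_MASS), 2)
        if abs(RESIDUE_MASS[a] - RESIDUE_MASS[b]) <= tol
    ]

pairs = confusable_residues()
print(pairs)

# A group of residues can also mimic a single one: G+G versus N.
gg_vs_n = abs(2 * RESIDUE_MASS["G"] - RESIDUE_MASS["N"])
print(f"|m(GG) - m(N)| = {gg_vs_n:.5f} Da")
```

Leucine and isoleucine are isomers and cannot be separated by mass at all, while K and Q differ by roughly 0.036 Da.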

Proposed Method

Proposed Method - Step by Step (1)
• DB construction
  • theoretical mass/charge values (e.g. y-ions) are generated for each peptide sequence
  • vectors of a specified size are formed and stored in the DB
• Heuristics for selection of peaks from experimental spectra
  • last k peaks (figure labels: dim = 3, step = 2)
  • the set of peaks corresponding to complementary b- and y-ions, $m(b_i) + m(y_{k-i}) = m_p + 2$ (figure label: dim = 3)
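A minimal sketch of the DB construction step and of the complementary-ion relation follows. It assumes singly charged ions and standard monoisotopic masses; `to_vector` is a hypothetical layout (keep the `dim` largest values, zero-padded), since the slides do not specify how the vectors are formed.

```python
# Monoisotopic residue masses in Da (standard values, rounded to 5 decimals).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # H2O added to the intact peptide and to y-ions
PROTON = 1.00728   # charge carrier for singly charged ions

def peptide_mass(peptide: str) -> float:
    """Neutral monoisotopic mass m_p of the whole peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER

def b_ions(peptide: str) -> list[float]:
    """m/z of b_1 .. b_(k-1), built from the N-terminus."""
    total, out = 0.0, []
    for aa in peptide[:-1]:
        total += RESIDUE_MASS[aa]
        out.append(total + PROTON)
    return out

def y_ions(peptide: str) -> list[float]:
    """m/z of y_1 .. y_(k-1), built from the C-terminus."""
    total, out = 0.0, []
    for aa in reversed(peptide[1:]):
        total += RESIDUE_MASS[aa]
        out.append(total + WATER + PROTON)
    return out

def to_vector(mz_values: list[float], dim: int = 3) -> list[float]:
    """Hypothetical layout: the dim largest m/z values, zero-padded (dim >= 1)."""
    top = sorted(mz_values)[-dim:]
    return [0.0] * (dim - len(top)) + top

peptide = "GHPETLEK"
k = len(peptide)
b, y = b_ions(peptide), y_ions(peptide)
for i in range(1, k):
    # Complementary pair: b_i + y_(k-i) equals m_p plus two protons (~ m_p + 2).
    print(i, round(b[i - 1] + y[k - i - 1], 4), round(peptide_mass(peptide) + 2 * PROTON, 4))
print(to_vector(y, dim=3))
```

The constant "+ 2" in the slide's relation is the rounded mass of the two protons carried by the pair of singly charged fragments.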

MAMs
• Metric Access Methods (MAMs)
  • DB index structures (logarithmic search time complexity in the best case)
• Metric
  • satisfies reflexivity, positiveness, symmetry and the triangle inequality
  • quantifies the distance (or similarity) between theoretical and experimental spectra
• M-tree (Metric tree)
  • MAM; dynamic and balanced tree
  • organizes objects (vectors) into n-dimensional hyperspherical regions
  • inner nodes (routing entries): rout(Oi) = [Oi, r(Oi), ptr(T(Oi)), d(Oi, par(Oi))]
  • leaf nodes (ground entries): grnd(Oi) = [Oi, oid(Oi), d(Oi, par(Oi))]
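The two entry types map naturally onto small record types. The sketch below is an illustration only, not the authors' implementation: the tree is built by hand, insertion and node splitting are omitted, and the class and function names are hypothetical. The range search uses the standard M-tree pruning rule, which follows from the triangle inequality: a subtree can be skipped when d(Q, Oi) > r_Q + r(Oi).

```python
import math
from dataclasses import dataclass

def euclidean(a, b):
    return math.dist(a, b)

@dataclass
class GroundEntry:          # grnd(Oi) = [Oi, oid(Oi), d(Oi, par(Oi))]
    obj: tuple
    oid: int
    dist_to_parent: float

@dataclass
class RoutingEntry:         # rout(Oi) = [Oi, r(Oi), ptr(T(Oi)), d(Oi, par(Oi))]
    obj: tuple
    radius: float           # covering radius r(Oi)
    subtree: list           # ptr(T(Oi)): entries of the child node
    dist_to_parent: float

def range_search(node, query, r_q, dist=euclidean):
    """Collect oids of all objects within r_q of query."""
    hits = []
    for entry in node:
        d = dist(query, entry.obj)
        if isinstance(entry, RoutingEntry):
            # Descend only if the query ball intersects the covering region.
            if d <= r_q + entry.radius:
                hits += range_search(entry.subtree, query, r_q, dist)
        elif d <= r_q:
            hits.append(entry.oid)
    return hits

# Hand-built two-level tree over 2-dimensional vectors.
leaf_a = [GroundEntry((200, 300), 1, 0.0),
          GroundEntry((210, 305), 2, euclidean((210, 305), (200, 300)))]
leaf_b = [GroundEntry((900, 950), 3, 0.0)]
root = [RoutingEntry((200, 300), 12.0, leaf_a, 0.0),
        RoutingEntry((900, 950), 1.0, leaf_b, 0.0)]

print(range_search(root, (205, 300), 6.0))
print(range_search(root, (205, 300), 8.0))
```

The stored parent distances are not used in this sketch; in a full M-tree they allow some distance computations to be skipped as well.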

Proposed Method - Step by Step (2)
• a logarithmic distance (pseudo-semimetric) was designed
• in Euclidean space, a few big shifts of peaks between two compared vectors often yield a bigger distance (smaller similarity) than many smaller shifts of peaks
• the logarithmic distance behaves the opposite way (a common problem is the insertion of a few peaks that have no use in an experimental spectrum)
• example: x = {200, 300, 400, 500}, y = {200, 300, 460, 500}, z = {210, 305, 420, 475}
  • Euclidean distance: x and z are closer than x and y
  • Logarithmic distance: x and y are closer than x and z (only one peak is mistaken)
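The example can be checked numerically. The slides do not state the formula of the logarithmic distance, so `log_distance` below is a stand-in assumption (the sum of log(1 + |difference|) per coordinate) that merely reproduces the described behaviour; it should not be read as the authors' definition.

```python
import math

def euclidean_distance(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def log_distance(a, b):
    """Assumed stand-in: large single shifts are damped by the logarithm."""
    return sum(math.log(1 + abs(p - q)) for p, q in zip(a, b))

x = (200, 300, 400, 500)
y = (200, 300, 460, 500)   # one peak shifted by 60
z = (210, 305, 420, 475)   # every peak shifted a little

for name, dist in (("Euclidean", euclidean_distance), ("logarithmic", log_distance)):
    print(f"{name}: d(x, y) = {dist(x, y):.2f}, d(x, z) = {dist(x, z):.2f}")
```

Under the Euclidean distance the single 60-unit shift dominates, so z ranks as closer to x; under the damped distance, y ranks closer because only one of its peaks differs.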

Proposed Method - Step by Step (3)
• Search with MAMs
  • range query: selection of all objects up to a specified distance (radius)
  • k-NN query: selection of the k nearest neighbor objects
  • interval query: selection of objects within a specified distance interval (between a minimum and a maximum radius)
• a set of interval queries is used to search for peptide modifications, together with the maximum or Hausdorff distance
  • errors caused by amino acid mass modifications do not accumulate and correspond directly to the average radii of the interval queries
  • e.g. if we search for two occurrences of carbamidomethylated cysteine (+57 Da) in one peptide sequence, then the average radii are {57, 2x57}
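The three query types, and the set of interval queries for modifications, can be sketched as linear scans over a toy database. This is a sequential illustration of the query semantics only, not an M-tree search; the function names, the tolerance `tol` and the toy vectors are assumptions.

```python
import heapq

def max_distance(a, b):
    """Maximum (Chebyshev) distance: the largest single peak shift."""
    return max(abs(p - q) for p, q in zip(a, b))

def range_query(db, q, radius, dist=max_distance):
    return [oid for oid, v in db.items() if dist(q, v) <= radius]

def knn_query(db, q, k, dist=max_distance):
    return heapq.nsmallest(k, db, key=lambda oid: dist(q, db[oid]))

def interval_query(db, q, r_min, r_max, dist=max_distance):
    return [oid for oid, v in db.items() if r_min <= dist(q, v) <= r_max]

def modification_search(db, q, shift=57.0, max_mods=2, tol=0.5):
    """One interval query per modification count n, centred on n * shift."""
    hits = {}
    for n in range(max_mods + 1):
        center = n * shift
        hits[n] = interval_query(db, q, max(0.0, center - tol), center + tol)
    return hits

# Toy theoretical vectors (unmodified peptides) and one experimental vector.
db = {
    "pepA": (200.0, 300.0, 400.0, 500.0),
    "pepB": (150.0, 250.0, 350.0, 450.0),
    "pepC": (200.0, 300.0, 343.0, 443.0),
}
query = (200.0, 300.0, 457.0, 557.0)

print(range_query(db, query, 60.0))
print(knn_query(db, query, 2))
print(modification_search(db, query))
```

With the maximum distance, peaks shifted by one modification all contribute 57 and peaks shifted by two contribute 114, so the shifts do not add up across coordinates and the interval centres stay at {57, 2x57}.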

Proposed Method - Step by Step (4)
• Summary
  • the method preprocesses the protein DB and selects a small set of candidate peptide sequences that best correspond to an experimental spectrum
  • more sophisticated algorithms should then be used to select the most suitable peptide sequence
    • scoring: shared peak count (count of b- and y-ions)
  • suitable only as a part of PFF for MS/MS
  • the required distance function must be a metric, and it must be simple and fast because it is evaluated very often in a MAM
  • existing algorithms for spectra comparison are too complex and do not satisfy the triangle inequality
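The shared peak count used for scoring the candidates can be sketched as follows. The matching tolerance of 0.5 Da and the function name are assumptions; the slides only name the score.

```python
import bisect

def shared_peak_count(theoretical, experimental, tol=0.5):
    """Count theoretical peaks that have an experimental peak within tol."""
    exp = sorted(experimental)
    count = 0
    for t in theoretical:
        # First experimental peak that is not below the tolerance window.
        i = bisect.bisect_left(exp, t - tol)
        if i < len(exp) and exp[i] <= t + tol:
            count += 1
    return count

theoretical = [114.09, 261.16, 294.18]        # e.g. b- and y-ion m/z values
experimental = [114.10, 150.00, 294.30, 500.00]

print(shared_peak_count(theoretical, experimental))
```

A score of this kind is cheap enough to apply to the small candidate set returned by the index, even though it is not itself a metric.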

Experimental Results
• Test bed
  • collections of tandem mass spectra from the Quartz project - www.thegpm.org
    • Amethyst & Opal
    • human proteins
    • suitable for validating algorithms that analyze a single peptide sequence from a mass spectrum (not whole protein sequences from a collection of spectra)
    • the peptide sequences corresponding to the mass spectra are known (and much more...)
• Testing
  • M-tree structure
  • comparison of distances: Euclidean, maximum, Hausdorff, logarithmic, cosine similarity
  • suitability of the set of interval queries for searching peptide modifications
  • comparison with existing methods (quality of identification) and with sequential access

Experiments - comparison of distances (amet: 533 spectra)

Experiments - set of interval queries (amet: 773 spectra, opal: 622 spectra)

Experiments
• MASCOT: 62% (confirmed); ProteinProspector: 72%
• test bed: 50 spectra (web search engines were used)
• the quality of identification is slightly worse than that of today's most widely used search engines (MASCOT, ProteinProspector)
• the speed-up of the M-tree was about 10^3 over the sequential algorithm with a DB of realistic size (50 thousand proteins ~ 2.5 million peptides)

Conclusions and Future Work
• verified the ability of the metric approach to identify sequences
• future work
  • more sophisticated heuristics
  • new metrics
  • improvement of scoring schemes
  • testing with bigger datasets of experimental spectra that are suitable for identifying whole protein sequences (not only single peptides) from tandem mass spectra
