Data standards from the Proteomics Standards Initiative

Andy Jones, andrew.jones@liv.ac.uk, University of Liverpool

Overview • HUPO-PSI background • Data formats • Protein and peptide separations • GelML • spML • Mass spectrometry and proteomics informatics • mzML • mzIdentML • mzQuantML

HUPO-PSI background • HUPO was founded in 2001 with several objectives: • Consolidate worldwide proteome organisations • Assist in the coordination of public proteome initiatives • Engage in scientific and educational activities • Tissue proteome projects and other initiatives: • Plasma, Liver, Brain, Glyco and Antibody initiative • Proteomics Standards Initiative (PSI) • HUPO-PSI “The HUPO Proteomics Standards Initiative (PSI) defines community standards for data representation in proteomics to facilitate data comparison, exchange and verification.” • Main outputs are: • Minimum reporting guidelines (MIAPE modules) • Data exchange formats (usually in XML) • Ontologies or Controlled vocabularies

PSI main outputs • MIAPE – minimum information about a proteomics experiment • Information that should be recorded about a proteomics experiment (Taylor et al. Nature Biotechnology 25, 887-893; 2007) • Modules: gel electrophoresis, gel image informatics, capillary electrophoresis, column chromatography, mass spectrometry, mass spectrometry informatics and molecular interactions • Data formats for: • molecular interactions • mass spectrometry • protein identifications • gel electrophoresis and other separation methods • Plus supporting controlled vocabularies for each format • All outputs must pass a stringent standardisation process • Specifications reviewed by public comment and anonymous review • PSI editor will not sign off specification until reviewers’ comments have been satisfied

PSI data formats
• Protein separation
  • GelML: 2007-01-18 GelML 1.0; current: GelML 1.1 (no formal release yet)
  • spML: 2007, milestone 2; no active development...
• Mass spectrometry
  • mzML (mass spec): 2008-06-01 mzML 1.0.0 released; 2009-06-01 mzML 1.1.0 released
  • Previous / related standards: mzData v1.0.5 (PSI), mzXML (from ISB)
• Proteomics Informatics
  • mzIdentML (protein identifications): 2009-08-20 mzIdentML 1.0.0
  • mzQuantML (protein quantifications): early drafting only
• MI (molecular interactions): Version 2.5

GelML Data format for exchanging protocols and image data resulting from gel electrophoresis, extension of FuGE • Contents: • Models of 1D and 2D separation, electrophoresis protocol, detection, and includes DIGE • Status: • v1.0 was built by extending complete FuGE model; version 1.1 extends from “FuGElight” • v1.1 simplified protocols e.g. for electrophoresis (free-text not parameterized) • v1.1 shares the same CV structure as mzML and mzIdentML • v1.1 implemented in ProteoRed MIAPE database, beta implementation in MIAPEGelDB (SIB)

spML Data exchange format for non-gel based separations, extension of FuGE • Contents: • Multi-dimensional chromatography, generic model for other types of separation (capillary electrophoresis, rotofors, centrifugation etc.) • Status: • Milestone 2 extended from FuGE; • some work has been done to convert this to same structure as GelML v1.1 • No active development for some time, decision to be taken at next PSI meeting about community requirement for format

mzML History
• Early development: mzData 1.05 and mzXML 3.0, dataXML 0.6, mzML 0.90
• Timeline: SFO 2006-05, DC 2006-09, ISB 2006-11, Lyon 2007-04, EBI 2007-06
• Final development: mzML 0.91, mzML 0.99 RC, mzML 1.0.0, mzML 1.1.0RC5, mzML 1.1.0
• Timeline: PSI Doc Proc 2007-11, Toledo 2008-04, Release! 2008-06, Turku 2009-04, Release! 2009-06

Each spectrum contains a header with scan information and optionally precursor information, followed by two or more base64-encoded binary data arrays.
mzML
• cvList
• referenceableParamGroupList
• sampleList
• instrumentConfigurationList
• softwareList
• dataProcessingList
• acquisitionSettingsList
• run
  • spectrumList: spectrum, spectrum, ...
    • each spectrum: spectrumDescription, precursorList, scan, binaryDataArray, binaryDataArray
  • chromatogramList: chromatogram, chromatogram, ...
    • each chromatogram: binaryDataArray, binaryDataArray
Chromatograms may be encoded in mzML in a special element that contains cvParams to describe the type of chromatogram, followed by two base64-encoded binary data arrays.
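The binary data arrays described above can be sketched in a few lines of Python. mzML stores each array as base64 text, with cvParams on the binaryDataArray stating the precision (32- or 64-bit float) and whether zlib compression was applied. The function names and the sample m/z and intensity values below are illustrative assumptions, not part of any PSI library.

```python
import base64
import struct
import zlib


def encode_array(values, double_precision=True, compress=False):
    """Pack floats as little-endian binary, optionally zlib-compress, then base64-encode."""
    fmt = "<%d%s" % (len(values), "d" if double_precision else "f")
    raw = struct.pack(fmt, *values)
    if compress:
        raw = zlib.compress(raw)
    return base64.b64encode(raw).decode("ascii")


def decode_array(text, double_precision=True, compressed=False):
    """Reverse of encode_array: base64-decode, optionally decompress, unpack floats."""
    raw = base64.b64decode(text)
    if compressed:
        raw = zlib.decompress(raw)
    item_size = 8 if double_precision else 4
    count = len(raw) // item_size
    fmt = "<%d%s" % (count, "d" if double_precision else "f")
    return list(struct.unpack(fmt, raw))


# Hypothetical peak list: one m/z array and one intensity array per spectrum.
mz_values = [445.12, 512.25, 1024.5]
intensities = [1200.0, 350.5, 98.0]

encoded_mz = encode_array(mz_values, compress=True)
decoded_mz = decode_array(encoded_mz, compressed=True)
```

A reader of a real file would take the precision and compression flags from the cvParams of each binaryDataArray rather than passing them by hand.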

mzML implementations

mzIdentML overview
• Various software packages for searching:
  • MASCOT, SEQUEST, X!Tandem, OMSSA, InsPecT...
  • Each piece of software has its own output format
  • User interacts with results formatted as web pages
  • Not easy to submit to databases or re-analyse results
• mzIdentML
  • Standard format for results of searches with mass spec data
  • Can capture results from PMF and tandem MS
  • Flexible model of peptide and protein identifications
  • Captures search engine parameters, scores and modifications using controlled vocabulary terms:
<Modification location="7" residues="M" monoisotopicMassDelta="15.994919">
  <cvParam accession="UNIMOD:35" name="Oxidation" cvRef="UNIMOD" />
</Modification>
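As a minimal sketch of how the Modification fragment above can be consumed, the following Python reads its attributes and the controlled vocabulary term with the standard library. The function name and the returned dictionary keys are assumptions made for illustration. The fragment is parsed without an XML namespace, whereas a complete mzIdentML file declares one that the lookups would need to include.

```python
import xml.etree.ElementTree as ET

# The fragment from the slide, closed so that it is well-formed XML.
FRAGMENT = """
<Modification location="7" residues="M" monoisotopicMassDelta="15.994919">
  <cvParam accession="UNIMOD:35" name="Oxidation" cvRef="UNIMOD" />
</Modification>
"""


def read_modification(xml_text):
    """Return the position, residue, mass shift and CV term of one Modification."""
    element = ET.fromstring(xml_text.strip())
    cv_param = element.find("cvParam")
    return {
        "location": int(element.get("location")),
        "residues": element.get("residues"),
        "mass_delta": float(element.get("monoisotopicMassDelta")),
        "accession": cv_param.get("accession"),
        "name": cv_param.get("name"),
    }


modification = read_modification(FRAGMENT)
```

Because the modification is identified by a controlled vocabulary accession rather than free text, software can match it on `accession` without having to interpret the name.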

mzIdentML Schema overview
mzIdentML
• cvList
• AnalysisSoftwareList: software packages
• AnalysisSampleCollection: biological samples
• SequenceCollection: DB entries of protein / peptide sequences
• AnalysisCollection
  • SpectrumIdentification: inputs = external spectra 1..n; output = SpectrumIdentificationList 1
  • ProteinDetection: inputs = SpectrumIdentificationLists; output = ProteinDetectionList
• AnalysisProtocolCollection
  • SpectrumIdentificationProtocol: AdditionalSearchParams, ModificationParams, Enzymes, DatabaseFilters
  • ProteinDetectionProtocol
• DataCollection
  • Inputs: the database searched and the input file converted to mzIdentML
  • AnalysisData
    • SpectrumIdentificationList
      • SpectrumIdentificationResult: all identifications made from searching one spectrum
      • SpectrumIdentificationItem: one (poly)peptide-spectrum match
    • ProteinDetectionList
      • ProteinAmbiguityGroup: a set of related protein identifications, e.g. conflicting peptide-protein assignments
      • ProteinDetectionHypothesis: a single protein identification

mzIdentML Peptide identifications
SequenceCollection
• DBSequence: Accession = “HSP7D_MANSE”, Seq = “MAKAPAVGIDLGTTYSCVGVF...”
• DBSequence: Accession = “HSP70_ECHGR”, Seq = “MMSKGPAVGIDLGTTFSCVGV...”
• Peptide: Seq = “DAGMISGLNVLR”, Mod = Methionine oxidation (pos 4)
SpectrumIdentificationList 1
• SpectrumIdentificationResult 1 (references a spectrum in external data)
  • SpectrumIdentificationItem 1_1: Score = 67.2, E-value = 0.000867, Rank = 1
    • PeptideEvidence 1_1_A: start=161 end=172 pre=K post=I
    • PeptideEvidence 1_1_B: start=160 end=171 pre=K post=L
  • SpectrumIdentificationItem 1_2: Score = 54.4, E-value = 0.026, Rank = 2
    • PeptideEvidence 1_2_A: start=54 end=65 pre=K post=T
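The nesting shown above (one result per spectrum, ranked items within it, and one or more pieces of peptide evidence per item) can be sketched with Python dataclasses. The class names mirror the mzIdentML element names, but the field names, the `best_item` helper, and the assignment of the rank 1 scores to item 1_1 are assumptions made for illustration, not the schema itself.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PeptideEvidence:
    id: str
    start: int  # 1-based start position in the protein sequence
    end: int    # 1-based end position, inclusive
    pre: str    # residue before the peptide
    post: str   # residue after the peptide

    def length(self) -> int:
        return self.end - self.start + 1


@dataclass
class SpectrumIdentificationItem:
    id: str
    score: float
    e_value: float
    rank: int
    evidence: List[PeptideEvidence] = field(default_factory=list)


@dataclass
class SpectrumIdentificationResult:
    id: str
    items: List[SpectrumIdentificationItem] = field(default_factory=list)

    def best_item(self) -> SpectrumIdentificationItem:
        """Return the item with the lowest rank number (rank 1 is the top match)."""
        return min(self.items, key=lambda item: item.rank)


peptide_sequence = "DAGMISGLNVLR"  # carries a methionine oxidation at position 4

result = SpectrumIdentificationResult("1", [
    SpectrumIdentificationItem("1_1", 67.2, 0.000867, 1, [
        PeptideEvidence("1_1_A", 161, 172, "K", "I"),
        PeptideEvidence("1_1_B", 160, 171, "K", "L"),
    ]),
    SpectrumIdentificationItem("1_2", 54.4, 0.026, 2, [
        PeptideEvidence("1_2_A", 54, 65, "K", "T"),
    ]),
])

best = result.best_item()
```

Keeping the evidence separate from the item lets one peptide-spectrum match point to the same peptide in several database sequences, as item 1_1 does here.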

mzIdentML Protein identifications
• Protein ambiguity group: groups proteins that share the same set of peptides (protein inference problem)
• Protein detection hypothesis: one potential protein hit supported by peptide evidence
SequenceCollection
• DBSequence: Accession = “HSP7D_MANSE”, Seq = “MAKAPAVGIDLGTTYSCVGVF...”
• DBSequence: Accession = “HSP70_ECHGR”, Seq = “MMSKGPAVGIDLGTTFSCVGV...”
SpectrumIdentificationList
• SpectrumIdentificationResult 1
  • SpectrumIdentificationItem 1_1: PeptideEvidence 1_1_A, PeptideEvidence 1_1_B
• SpectrumIdentificationResult 2
  • SpectrumIdentificationItem 2_1: PeptideEvidence 2_1_A
• SpectrumIdentificationResult 3
  • SpectrumIdentificationItem 3_1: PeptideEvidence 3_1_A, PeptideEvidence 3_1_B
ProteinDetectionList
• ProteinAmbiguityGroup 1
  • ProteinDetectionHypothesis 1_1: PeptideHypothesis (1_1_A), (2_1_A), (3_1_A); Score = 141, Peptide coverage = 17%, E-value = 0.0034
  • ProteinDetectionHypothesis 1_2: PeptideHypothesis (1_1_B), (3_1_B); Score = 85, Peptide coverage = 12%, E-value = 0.055
• ProteinAmbiguityGroup 2
  • ProteinDetectionHypothesis 2_1

mzIdentML Protein identifications
• ProteinDetectionHypothesis 1_1 has 3 peptides: ESTLHLVLR, TLSDYNIQK, TITLEVEPSDTIENVK
• ProteinDetectionHypothesis 1_2 has 2 peptides: ESTLHLVLR, TLSDYNIQK
• Stronger evidence supports hypothesis 1_1, but both are placed within the same ambiguity group
SequenceCollection
• DBSequence: Accession = “HSP7D_MANSE”, Seq = “MAKAPAVGIDLGTTYSCVGVF...”
• DBSequence: Accession = “HSP70_ECHGR”, Seq = “MMSKGPAVGIDLGTTFSCVGV...”
SpectrumIdentificationList
• SpectrumIdentificationResult 1
  • SpectrumIdentificationItem 1_1: PeptideEvidence 1_1_A, PeptideEvidence 1_1_B
• SpectrumIdentificationResult 2
  • SpectrumIdentificationItem 2_1: PeptideEvidence 2_1_A
• SpectrumIdentificationResult 3
  • SpectrumIdentificationItem 3_1: PeptideEvidence 3_1_A, PeptideEvidence 3_1_B
ProteinDetectionList
• ProteinAmbiguityGroup 1
  • ProteinDetectionHypothesis 1_1: PeptideHypothesis (1_1_A), (2_1_A), (3_1_A); Score = 141, Peptide coverage = 17%, E-value = 0.0034
  • ProteinDetectionHypothesis 1_2: PeptideHypothesis (1_1_B), (3_1_B); Score = 85, Peptide coverage = 12%, E-value = 0.055
• ProteinAmbiguityGroup 2
  • ProteinDetectionHypothesis 2_1
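To illustrate why the two hypotheses above fall into one ambiguity group, the sketch below groups protein hypotheses that share at least one peptide. This simple shared-peptide rule is an assumption chosen for illustration and is not the grouping algorithm of any particular search engine. The peptide listed for hypothesis 2_1 is an invented placeholder, since the slide does not give one.

```python
def group_hypotheses(peptides_by_hypothesis):
    """Merge hypotheses into groups whenever their peptide sets overlap."""
    groups = []  # each group is {"members": [...], "peptides": set(...)}
    for hypothesis_id, peptides in peptides_by_hypothesis.items():
        merged = {"members": [hypothesis_id], "peptides": set(peptides)}
        # Absorb every existing group that shares a peptide with this hypothesis.
        for group in [g for g in groups if g["peptides"] & merged["peptides"]]:
            merged["members"].extend(group["members"])
            merged["peptides"] |= group["peptides"]
            groups.remove(group)
        merged["members"].sort()
        groups.append(merged)
    return groups


hypotheses = {
    "1_1": {"ESTLHLVLR", "TLSDYNIQK", "TITLEVEPSDTIENVK"},
    "1_2": {"ESTLHLVLR", "TLSDYNIQK"},
    "2_1": {"PLACEHOLDERK"},  # invented placeholder, not from the slide
}

ambiguity_groups = group_hypotheses(hypotheses)

# The peptide that distinguishes hypothesis 1_1 from hypothesis 1_2.
unique_to_1_1 = hypotheses["1_1"] - hypotheses["1_2"]
```

Hypothesis 1_2 is supported only by peptides that hypothesis 1_1 also explains, which is why the evidence for 1_1 is stronger even though both remain in the same group.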

mzIdentML export will be available from Mascot in the next release

• Sequest converter produced by MPC (Germany) as part of the ProDaC consortium: http://www.medizinisches-proteom-center.de
• Thermo is also working on an “official” exporter
• Basic scripts available for converting other search engine formats (X!Tandem, OMSSA, pepXML)
• Export in the next version of Scaffold
• Database implementation in PRIDE is coming...

mzQuantML • Format to capture proteins quantified from MS data • Very early drafting • Many methods of quantification • Label/tag based • Stable isotopes (SILAC) • Tags: ICAT / iTRAQ • Label-free • Extracted ion chromatogram – align parallel runs • Spectral counting • Methods still in flux • New methods reported frequently in the literature • Will need to reference back to spectra (+chromatograms) and identifications • Needs more community input – please offer to help!

Acknowledgements • PSI workgroups: • Protein separation • Chair: Juan-Pablo Albar (ProteoRed) • Mass spectrometry • Chair: Eric Deutsch (ISB) • Proteomics Informatics • Chair: Andy Jones (Liverpool) • Co-Chair: David Creasy (Matrix Science) • Molecular interactions • Chair: Henning Hermjakob (and chair of PSI) • and many developers worldwide... See: http://www.psidev.info/
