PRIDE: The Proteomics Identifications Database

PRIDE: The Proteomics Identifications Database. Phil Jones, EMBL-EBI. PRIDE: Current Progress and Future Developments. Topics: The PRIDE Team; The Science Supported by PRIDE; PRIDE Scope; Data included in PRIDE; PRIDE - Current Status; Tools and analyses that PRIDE offers; The Future.


Presentation Transcript


  1. PRIDE: The Proteomics Identifications Database Phil Jones EMBL-EBI

  2. PRIDE: Current Progress and Future Developments • The PRIDE Team • The Science Supported by PRIDE: • PRIDE Scope • Data included in PRIDE • PRIDE - Current Status • Tools and analyses that PRIDE offers • The Future • An Introduction to BioMart • The PRIDE BioMart

  3. The PRIDE Team • Richard Côté (OLS, Protein A/C Mapping tool) • Lennart Martens (Started PRIDE when a student at the EBI, ProDac) • Phil Jones (Technical Lead Developer, Proteome Harvest) • David Thorneycroft (Database Curator) • Overseen by: Henning Hermjakob

  4. With Contributions to Implementation from: • William Derache (PRIDE DAS Server – First Version) • Rafael Jimenez (Dasty2 – to be ‘plugged in’ to PRIDE) • Sebastian Klie (Experiment Set Comparison) • Patrick Nitschke (PRIDE DAS Server Version 2) • Antony Quinn (XSLT transformations)

  5. A Typical Proteomics Workflow: Protein Identification Based Upon Fragmentation Spectra • Protein: VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHK • Denature, digest with trypsin: VHLTPEEK SAVTALWGK VNVDEVGGEALGR LLVVYPWTQR FFESFGDLSTPDAVMGNPK VK AHGK K VLGAFSDGLAHLDNLK GTFATLSELHCDK LHVDPENFR LLGNVLVCVLAHHFGK EFTPPVQAAYQK VVAGVANALAHK • Select mass, collide (fragments of VHLTPEEK): V / HLTPEEK, VH / LTPEEK, VHL / TPEEK, VHLT / PEEK, VHLTP / EEK, VHLTPE / EK, VHLTPEE / K • Mass spectrum: compare with theoretical peptide spectra; ID = best similarity
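
The digestion and fragmentation steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular search engine: it assumes the common simplified rule that trypsin cleaves after K or R unless the next residue is P, and the function names `tryptic_digest` and `fragment_pairs` are hypothetical.

```python
# Protein sequence as shown on the slide.
PROTEIN = (
    "VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKK"
    "VLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQK"
    "VVAGVANALAHK"
)


def tryptic_digest(sequence):
    """Split a protein into peptides, cutting after K or R.

    Assumption: no cut is made when the following residue is P.
    """
    peptides = []
    start = 0
    for i, residue in enumerate(sequence):
        at_end = i + 1 == len(sequence)
        if residue in "KR" and (at_end or sequence[i + 1] != "P"):
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        # Trailing peptide that does not end in K or R.
        peptides.append(sequence[start:])
    return peptides


def fragment_pairs(peptide):
    """Return every (prefix, suffix) split of a peptide, as on the slide."""
    return [(peptide[:i], peptide[i:]) for i in range(1, len(peptide))]


peptides = tryptic_digest(PROTEIN)
print(peptides)
print(fragment_pairs(peptides[0]))
```

A real search engine would convert each prefix and suffix into theoretical fragment masses and score them against the measured spectrum; that scoring step is left out here.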

  6. The Need for PRIDE – Publish and Vanish Proteomics data is only made available as arbitrarily formatted PDF tables, carrying important limitations: • Source data (mass spectra) are not made available • No peer review validation possible • Very little raw material for testing innovative in silico techniques is available • Automated (re-)processing of the identifications is impossible

  7. Where PRIDE Fits In… • Sample generation: origin of sample (hypothesis, organism, environment, preparation, paper citations) • Sample processing, gel informatics: gels (1D/2D), columns, ‘chips’, other methods (images, gel type and ranges, band/spot coordinates, quantitation; stationary and mobile phases, flow rate, temperature, fractionation) • Mass Spectrometry (‘mzData’): machine type, ion source, voltages • Mass Spectrometry Informatics: peak lists, database name + version, partial sequence, search parameters, search hits, accession numbers, quantitation • Data dissemination and comparison (PRIDE): peak lists, protein and peptide identifications, post-translational modifications

  8. Data In PRIDE • Current Statistics: • 1892 Experiments • 285,943 Protein Identifications • 1,223,094 Peptide Identifications • 252,464 Spectra • Large Public Datasets: • HUPO Plasma Proteome Project • HUPO Brain Proteome Project (including spectra) • HUPO Liver Proteome Project (including spectra) • Human Cerebrospinal Fluid (Jing Zhang, U Washington School of Medicine).

  9. PRIDE Components • Consists of an XML schema for data exchange, a Java API to allow software engineers to exploit the core of PRIDE, a relational database for data storage (platform independent) and a web-application front-end. • Built entirely from open source components and is an open source project itself, licensed under the Apache Licence. • The Web interface is available for querying, reporting and data upload • Provides a DAS Service with annotation of identified peptides • “Proteome Harvest” Excel Spreadsheet available for data submission

  10. Some ‘Use Cases’ for PRIDE • Allows public access to the details of a proteomics experiment supporting a journal publication. • Allows comparison of data sets describing the same (for example) tissue but from different laboratories. • Allows referees to examine and validate data prior to publication. • Can be used to allow private sharing of data by collaborating laboratories.

  11. Simplified Schema (class diagram) • Project • Experiment • Sample <<mzData>>: Species, Tissue, Disease state, Cellular component, Developmental stage • Protocol: Ordered Steps • Instrumentation & Associated Software <<mzData>> • Mass Spectra <<mzData>> • Protein Identifications • Peptide Identifications • Protein Modifications (PTMs) • Multiplicities shown on the diagram: *, 1..*, 0..1

  12. Data Standards and PRIDE • The Human Proteome Organisation (HUPO) has been central in the development of data standards for proteomics through the HUPO Proteomics Standards Initiative. View PSI documentation at: • http://psidev.sourceforge.net • Relevant current PSI deliverables include: • The MIAPE documentation – A series of documents describing the ‘Minimal Information about a Proteomics Experiment’ • The mzData Mass Spectrometry XML data exchange standard for the communication of mass spectrometry instrument settings, sample details, peak lists and related data. • The PSI-MS Ontology (Controlled Vocabulary) that is used to extend the mzData XML format. • The PSI-MOD Ontology of protein modifications / post-translational modifications.

  13. Data Standards and PRIDE: Future Developments • Formats currently under development by PSI include: • The analysisXML standard for capturing search engine output, including the identification of proteins, peptides and post-translational modifications. This format will replace a large part of the existing PRIDE XML format once it is finalised. • The gelML XML format for the capture of gel electrophoresis data and the related GinML (Gel informatics) XML format. • These formats are all directly relevant to the PRIDE project, which is committed to implementing the PSI formats as soon as possible after each is ratified. You can expect to see all of these formats become part of PRIDE over the next two years.

  14. HUPO Proteomics Standards Initiative: Forthcoming Formats FuGE Based PSI Formats: • AnalysisXML – ready for submission for PSI document approval process early 2007? • gelML – has entered PSI document approval process. • All three formats will be implemented in PRIDE as high priority development goals when they are ratified. • PRIDE to become a FuGE implementation as a consequence?

  15. PRIDE Views

  16. The PRIDE Web Pages

  17. Data Security and Privacy in PRIDE • Users need to register to upload data. • All personal data is encrypted on the database. • All experimental data is linked to the person who has uploaded it. • Data can be marked as private. The person uploading the data may give a date upon which the data becomes public. • Group access to private data can be granted by creating a 'collaboration'. • Curator accounts are automatically created for all newly uploaded experiments. • Optional reviewer accounts can be created for peer review of journal submissions.

  18. Collaboration and Peer Review (diagram) • Collaboration: Unit A, Unit B, Unit C; release date; public availability • Journal peer review process: author, reviewer 1, reviewer 2; publication accepted; public availability • Organise peer review of the identifications based on the peak lists • Publish the identifications in an accessible format

  19. Comparing Data Sets in PRIDE

  20. Experiment Comparison Tool n=2

  21. Experiment Comparison Tool n=3
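
The comparison tool on the two preceding slides survives only as slide titles. As a rough sketch of the underlying idea, and assuming each experiment can be reduced to a set of protein accessions, the regions of an n-way Venn diagram can be computed with plain set operations. The experiment names and accessions below are made up for illustration, and this is not PRIDE's implementation.

```python
from itertools import combinations


def venn_regions(experiments):
    """Map each combination of experiment names to the accessions found in
    exactly those experiments and in no others. Empty regions are omitted."""
    names = sorted(experiments)
    regions = {}
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            # Accessions present in every experiment of this combination.
            shared = set.intersection(*(experiments[n] for n in combo))
            # Remove anything that also appears in an experiment outside it.
            others = [experiments[n] for n in names if n not in combo]
            exclusive = shared.difference(*others)
            if exclusive:
                regions[combo] = exclusive
    return regions


# Hypothetical data: three experiments (n=3) with placeholder accessions.
experiments = {
    "exp_A": {"ACC_1", "ACC_2", "ACC_3"},
    "exp_B": {"ACC_2", "ACC_3", "ACC_4"},
    "exp_C": {"ACC_3", "ACC_5"},
}
regions = venn_regions(experiments)
for combo, accessions in regions.items():
    print(" & ".join(combo), sorted(accessions))
```

Passing only two experiments gives the n=2 case. As the accession mapping slides point out, such a comparison is only meaningful once accessions from different source databases have been mapped to a common identifier.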

  22. Global Comparison / Algorithm Development (figure labels: Ionisation, Separation, Depletion, Search engine, Fourier transform, Alkylation)

  23. PRIDE ‘Spin Out’ Project: Protein Accession Mapping Tool

  24. Protein Accession Mapping Tool • PRIDE is a submission database – with a nasty consequence. • Submitters may search any protein sequence database, including their own ‘in house’ databases and then submit the identifications to PRIDE. (This is as it should be of course…) • However, this makes searching PRIDE by protein accession in a meaningful way difficult / impossible…

  25. Protein Accession Mapping Tool • Currently developing a generalised mapping tool that will allow any protein accession to be mapped to a UniProt accession. (Possibly requiring the protein sequence for obscure protein databases). • Will be used directly by PRIDE and IntAct to map all submitted proteins on to UniParc / UniProt. • Extensions to the UniParc database being implemented to assist with this mapping, including taxonomy ID. • Provides a SOAP web service to allow you to incorporate this into your applications / workflows.

  26. Protein Identifier Mapping Service: Home Page

  27. Protein Identifier Mapping Service: Progress… • Multiple levels of caching to increase the speed of searching. • Accesses SOAP (web) services at the NCBI, which can result in a short wait for the cross-reference.

  28. Protein Identifier Mapping Service: Detailed Results • Black – Current, active match for the same species • Red – Same species, but a deleted entry from the referenced database • Blue – Current active match – species different or unknown (match based upon the DR lines in the UniProt entry)

  29. Protein Identifier Mapping Service: Summary Results • If you don’t want everything in UniParc… • Black – Current, active match for the same species • Red – Same species, but a deleted entry from the referenced database • Blue – Current active match – species different or unknown (match based upon the DR lines in the UniProt entry)

  30. DAS: Distributed Annotation Services & PRIDE • DAS provides a simple mechanism to allow multiple institutions to share annotation on nucleic acid and protein sequences (e.g. peptide identifications, domain architecture etc.) • Sequence information is obtained from a reference server, e.g. the UniProt DAS Reference Server: • http://www.ebi.ac.uk/das-srv/uniprot/das • Annotations on the sequence can be obtained from any number of annotation servers. • A DAS client can be used to seamlessly collect this data from multiple sources and allow it to be viewed or analysed as if it were a single set of data.

  31. A General DAS Architecture (diagram) • DAS Reference Server: Sequence • Multiple DAS Annotation Servers… • Visualisation using a DAS client: DAS XML retrieved asynchronously and displayed as it is retrieved

  32. DAS Specification 1.53 • http://biodas.org/documents/spec.html

  33. What Information is Available? • DAS servers are queried by HTTP: • Information is requested by constructing a simple URL • Information is retrieved as an XML file • Several different kinds of request supported: • ‘Entry Points’ – List of available chromosomes / contigs / proteins as appropriate. • Sequence – for one or more molecules. As well as the sequence, returns a version number and start / end coordinates. Reference Servers only. • Features – for one or more molecules. Either ‘positional’ with coordinates or ‘non-positional’ being annotation of the entire molecule. • Types – essentially a summary of the annotated features.

  34. Sequence Request Example • http://www.ebi.ac.uk/das-srv/uniprot/das/aristotle/sequence?segment=Q12345 • version – might be a date, a version number or a digest / hash. • start / stop: • Protein DAS servers: start is normally 1, with stop indicating the length of the protein. • Nucleic acid servers: shows the coordinates of the gene / contig etc. in relation to the entry point (e.g. chromosome). • moltype – Protein, DNA, RNA.
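
Because a DAS request is just a URL, the request types listed on the previous slide can be sketched as a small URL builder. This is an illustration only: the helper name `das_url` is hypothetical, the command names and the `segment=ID:start,stop` range form are taken from the DAS 1.53 specification as understood here, and the server address is the one quoted on the slide, which may no longer be reachable. Nothing is fetched.

```python
from urllib.parse import quote

# Request types described on the slides.
DAS_COMMANDS = {"entry_points", "sequence", "features", "types"}

# Data source URL as quoted on the slide.
UNIPROT_SOURCE = "http://www.ebi.ac.uk/das-srv/uniprot/das/aristotle"


def das_url(source_url, command, segment=None, start=None, stop=None):
    """Build a DAS request URL for one command and an optional segment."""
    if command not in DAS_COMMANDS:
        raise ValueError(f"unsupported DAS command: {command}")
    url = f"{source_url.rstrip('/')}/{command}"
    if segment is not None:
        if start is not None and stop is not None:
            # Restrict the request to a coordinate range on the segment.
            segment = f"{segment}:{start},{stop}"
        url += "?segment=" + quote(segment, safe=":,")
    return url


print(das_url(UNIPROT_SOURCE, "sequence", "Q12345"))
print(das_url(UNIPROT_SOURCE, "features", "Q12345", 1, 100))
print(das_url(UNIPROT_SOURCE, "types"))
```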

  35. The Feature Request • Positional or non-positional features. • Coordinates (start and stop). Set to 0 for non-positional features. • Type and method to indicate the nature and source of the feature. • Orientation and phase for nucleic acid features. • Can include notes, hyperlinks and a score for the feature.
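
As a hedged sketch of how a client might apply the "set to 0 for non-positional features" convention, the snippet below parses a small features document and separates the two kinds of feature. The sample XML is invented for illustration. Its element names (`DASGFF`, `GFF`, `SEGMENT`, `FEATURE`, `TYPE`, `METHOD`, `START`, `END`, `NOTE`) are assumed from the DAS 1.53 features response layout and should be checked against the specification.

```python
import xml.etree.ElementTree as ET

# Invented response for illustration; not output from a real server.
SAMPLE_FEATURES = """
<DASGFF>
  <GFF version="1.0" href="http://example.org/das/demo/features?segment=Q12345">
    <SEGMENT id="Q12345" start="1" stop="120">
      <FEATURE id="pep1" label="Peptide identification">
        <TYPE id="peptide">peptide</TYPE>
        <METHOD id="ms">mass spectrometry</METHOD>
        <START>2</START>
        <END>9</END>
      </FEATURE>
      <FEATURE id="tissue1" label="Tissue">
        <TYPE id="tissue">tissue</TYPE>
        <METHOD id="curated">curated</METHOD>
        <START>0</START>
        <END>0</END>
        <NOTE>plasma</NOTE>
      </FEATURE>
    </SEGMENT>
  </GFF>
</DASGFF>
"""


def split_features(xml_text):
    """Return (positional, non_positional) lists of feature records."""
    root = ET.fromstring(xml_text.strip())
    positional, non_positional = [], []
    for feature in root.iter("FEATURE"):
        start = int(feature.findtext("START", default="0"))
        end = int(feature.findtext("END", default="0"))
        record = {
            "id": feature.get("id"),
            "type": feature.findtext("TYPE"),
            "start": start,
            "end": end,
            "note": feature.findtext("NOTE"),
        }
        # Coordinates of 0 mark annotation of the whole molecule.
        if start == 0 and end == 0:
            non_positional.append(record)
        else:
            positional.append(record)
    return positional, non_positional


positional, non_positional = split_features(SAMPLE_FEATURES)
print(positional)
print(non_positional)
```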

  36. Feature Request Example (figure labels: Non-positional feature; Co-ordinated feature)

  37. DAS Registry Service • Very large numbers of available services providing annotation of proteins and nucleic acid. • Need for a central service to catalogue and document all available services: • DAS Registration Server • http://www.dasregistry.org/ • Can be queried directly by the more recent DAS client applications (e.g. SPICE, Ensembl, Dasty) • Currently 219 separate DAS servers registered.

  38. DAS: Distributed Annotation Services & PRIDE (diagram) • UniProt DAS Reference Server: Sequence • Swiss-Prot Annotations • InterPro Domains • TrEMBL predictions • PRIDE DAS Annotation Server: Peptides, Protein Modifications, Experiment details • Multiple other DAS Annotation Servers… • DAS XML retrieved asynchronously and displayed as it is retrieved

  39. Dasty 2: Protein DAS Client • Dasty 2 to be released in February 2007 • Incorporation into PRIDE pages March 2007 • Powerful DAS client / viewer that will run in any modern internet browser • Designed to work as a stand-alone web application or to be incorporated into another web application • PRIDE requirements have been included in development considerations

  40. Dasty 2 in PRIDE • Will allow you to view annotation from a multitude of protein DAS servers alongside PRIDE annotation, including: • Positional features: • Peptide identifications; • Protein modification / PTM identifications; • Non-positional features: • Species, tissue etc. that the protein has been identified in. • Links to literature references supported by the PRIDE experiment. • Will also provide direct links to view PRIDE data in SPICE in a structural context.

  41. Rows can be re-ordered • Zoom into features • View ‘non-positional’ annotation • Links to external data from DAS tracks • Compare relative position of annotation from different sources • Hovering over features highlights annotated sequence

  42. BioMart • BioMart (http://www.biomart.org) is a query-oriented data management system, developed jointly by the European Bioinformatics Institute (EBI) and Cold Spring Harbor Laboratory (CSHL). Powered by BioMart software: • Central Server • Ensembl • HapMap • Dictybase • UniProt • Reactome • Array Express • Wormbase • Gramene • GermOnLine • DroSpeGe • PRIDE (soon!)

  43. BioMart and PRIDE • BioMart provides the user with the ability to perform powerful and fast queries across large, complex data sets: • possibly specifying complex filters involving multiple attributes of the data; • with the ability to specify precisely which attributes or ‘columns’ of data are included in the output; • and the ability to specify the format of the output, including: • HTML table (with links) • Excel spreadsheet • Tab-delimited file • Comma separated format

  44. BioMart provides a web-service • BioMart also provides a web-service that allows data integration across remote BioMarts and integration with software packages such as Taverna. (Diagram labels: XML; MartView; MartService; ports 80 and 3306; Local Mart (e.g. PRIDE); Remote Mart (e.g. UniProt))

  45. Typical BioMart Usage • Step 1 (Dataset): Choose your dataset • Step 2 (Attributes): Specify what information you want to include in the output • Step 3 (Filters): Restrict your query • Step 4 (Results): Preview (including a simple count) and output or download the results in your chosen format.
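
The four steps map naturally onto the XML query document that a BioMart MartService accepts. The sketch below only builds such a document and sends nothing. The dataset, attribute and filter names are hypothetical, since the PRIDE mart was not yet public at the time of the talk, and the attributes set on the `Query` element are assumptions that should be checked against the BioMart documentation for the server in use.

```python
import xml.etree.ElementTree as ET


def build_mart_query(dataset, attributes, filters, formatter="TSV"):
    """Build a BioMart-style XML query string.

    dataset    -- Step 1: the dataset to query
    attributes -- Step 2: the output columns
    filters    -- Step 3: mapping of filter name to value
    formatter  -- Step 4: the output format
    """
    query = ET.Element("Query", {
        "virtualSchemaName": "default",  # assumed default schema name
        "formatter": formatter,
        "header": "1",
        "uniqueRows": "1",
    })
    dataset_el = ET.SubElement(
        query, "Dataset", {"name": dataset, "interface": "default"}
    )
    for name, value in filters.items():
        ET.SubElement(dataset_el, "Filter", {"name": name, "value": value})
    for name in attributes:
        ET.SubElement(dataset_el, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")


# Hypothetical dataset, attribute and filter names.
query_xml = build_mart_query(
    dataset="pride",
    attributes=["experiment_accession", "protein_accession"],
    filters={"species": "Homo sapiens"},
)
print(query_xml)
```

Building the document with an XML library rather than string concatenation keeps filter values correctly escaped.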

  46. Building your own: Simple database schema • Query-optimised “Inverse Star” schema (diagram: one Main Table (PK) linked to four Dimension Tables, each carrying a foreign key (FK) to the main table; multiplicity *)
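
A minimal sketch of such a schema, using Python's built-in `sqlite3` module and an in-memory database: one main table holds the primary key, and each dimension table carries a foreign key back to it. All table names, column names and rows are invented for illustration and do not describe the actual PRIDE mart.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Main table: one row per experiment, holding the primary key.
    CREATE TABLE experiment_main (
        experiment_key INTEGER PRIMARY KEY,
        title TEXT
    );
    -- Dimension tables: many rows per experiment, linked by foreign key.
    CREATE TABLE experiment_species_dm (
        experiment_key INTEGER REFERENCES experiment_main(experiment_key),
        species TEXT
    );
    CREATE TABLE experiment_protein_dm (
        experiment_key INTEGER REFERENCES experiment_main(experiment_key),
        accession TEXT
    );
    INSERT INTO experiment_main VALUES (1, 'Plasma study'), (2, 'Liver study');
    INSERT INTO experiment_species_dm VALUES (1, 'Homo sapiens'), (2, 'Mus musculus');
    INSERT INTO experiment_protein_dm VALUES (1, 'ACC_1'), (1, 'ACC_2'), (2, 'ACC_3');
""")

# Filter on one dimension (species), report attributes from another (protein).
rows = conn.execute(
    """
    SELECT m.title, p.accession
    FROM experiment_main AS m
    JOIN experiment_species_dm AS s ON s.experiment_key = m.experiment_key
    JOIN experiment_protein_dm AS p ON p.experiment_key = m.experiment_key
    WHERE s.species = ?
    ORDER BY m.title, p.accession
    """,
    ("Homo sapiens",),
).fetchall()
print(rows)
```

Every query joins dimension tables to the main table on the same key, which is what lets a generic query builder combine arbitrary filters and attributes without schema-specific code.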

  47. MartEditor tool to define your user interface (No programming!)

  48. BioMart is ‘Skinnable’

  49. PRIDE BioMart – Dataset Page

  50. PRIDE BioMart – Defining a Complex Filter
