
Open source tools for data analysis


Presentation Transcript


  1. Open source tools for data analysis (Achim Treumann)

  2. General Workflow
  • Download all data from MassIVE
  • Using msconvert, convert to mzML
  • Using OpenMS (KNIME version), perform MSGF+ searches
  • Data will now be in mzIdentML and mzTab format
  • Using R, convert the mzTab results into a more accessible peptide.tsv (thanks to Julianus Pfeuffer)
  • Using R, perform a full join of the q-values for all files into one large table
  • Using Perseus, generate a heatmap

  3. msconvert
  • Part of the ProteoWizard suite: http://proteowizard.sourceforge.net/tools.shtml
  • We used default parameters for conversion of all files into mzML
  • This retains most (all?) of the information in the files, including metadata about data acquisition
  • ProteoWizard uses libraries supplied by the MS manufacturers
  • Files grow between 1.5- and 6-fold (HeLa digest file sizes range from 0.5 GByte to 8 GByte). This can be avoided by limiting the number of peaks per MS/MS spectrum (600 is sensible) during conversion; see the example command after this slide
  • 42 files converted
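As a hedged illustration of that peak-limited conversion, a call might look like the following (the input file name and output directory are placeholders; the threshold filter follows the documented ProteoWizard filter syntax):

    # convert a raw file to mzML, keeping only the 600 most intense peaks per spectrum
    msconvert hela_digest.raw --mzML --filter "threshold count 600 most-intense" -o converted/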

  4. OpenMS
  • Platform that allows you to do almost everything with your MS data (particularly within proteomics)
  • Works with data from all manufacturers
  • Tutorial: http://open-ms.sourceforge.net/wp-content/uploads/2016/01/handout1.pdf
  • We used it to search all datasets and calculate FDRs (a command-line sketch follows below)
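The deck ran these steps through KNIME; for reference, a minimal command-line sketch of the same search-and-FDR sequence using OpenMS TOPP tools (file names are placeholders, and the exact flags should be checked against your OpenMS version):

    # search one mzML file with MS-GF+ through the OpenMS adapter
    MSGFPlusAdapter -in run01.mzML -out run01.idXML -database human.fasta -executable MSGFPlus.jar
    # map peptide hits back to proteins, then compute q-values
    PeptideIndexer -in run01.idXML -out run01_indexed.idXML -fasta human.fasta
    FalseDiscoveryRate -in run01_indexed.idXML -out run01_fdr.idXML
    # export to mzTab for the downstream R steps
    MzTabExporter -in run01_fdr.idXML -out run01.mzTab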

  5. OpenMS
  • Workflows are constructed within KNIME (v3.3.1)
  • Each workflow node can have many parameters that can be set (e.g. for a search)
  • Default parameters do not always work and need to be tuned, but it is possible to build one workflow that produces results for all datasets

  6. mzTab tsv conversion
  • The default output for search results is mzIdentML, a format that is great for computers and contains all metadata, but is neither very human-readable nor easy to work with
  • A more usable output standard is mzTab
  • mzTab mixes protein lists and peptide lists in one table, which is inconvenient for further processing
  • Julianus Pfeuffer and Lars Nilse (OpenMS team) wrote an R script, which I modified, to generate a table of peptide results only, discarding q-values > 0.01 (a sketch of the idea follows below)
  • This script is called make_tsv.R; it generates one file per dataset, named psm.tsv
  • This is no longer necessary: the OpenMS team has since developed an improved mzTab exporter
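The original make_tsv.R is not reproduced here; the following is a minimal R sketch of the same idea, pulling the PSM section out of an mzTab file and keeping hits with q-value <= 0.01 (the file name and the q-value column name are assumptions and depend on the exporter):

    # Minimal sketch, not the original make_tsv.R.
    # mzTab is line-oriented: PSH marks the PSM header row, PSM marks the data rows.
    lines <- readLines("run01.mzTab")
    hdr   <- grep("^PSH\t", lines, value = TRUE)[1]
    rows  <- grep("^PSM\t", lines, value = TRUE)
    psm   <- read.delim(text = c(hdr, rows), check.names = FALSE)
    psm[[1]] <- NULL  # drop the PSH/PSM row-type column
    # "search_engine_score[1]" is an assumed name for the q-value column; check your export
    psm <- psm[psm[["search_engine_score[1]"]] <= 0.01, ]
    write.table(psm, "psm.tsv", sep = "\t", row.names = FALSE, quote = FALSE)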

  7. Summarise all data
  • All psm.tsv files were combined into one large file containing all identified peptides (q < 0.01)
  • Using the dplyr library in R, we performed a full join of the individual tables and kept, for each dataset, only the q-value (as a measure of identification confidence); see the sketch after this list
  • We then used Perseus to visualise the data in a heatmap
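A minimal sketch of that join step with dplyr (the directory and the column names "sequence" and "q_value" are assumptions; adapt them to the actual psm.tsv layout):

    library(dplyr)

    files  <- list.files("psm_tables", pattern = "\\.tsv$", full.names = TRUE)
    tables <- lapply(files, function(f) {
      d <- read.delim(f)[, c("sequence", "q_value")]  # assumed column names
      d <- d[!duplicated(d$sequence), ]               # one row per peptide per dataset
      names(d)[2] <- tools::file_path_sans_ext(basename(f))
      d
    })
    # the full join keeps every peptide seen in any dataset; gaps become NA (missing values)
    merged <- Reduce(function(x, y) full_join(x, y, by = "sequence"), tables)
    write.table(merged, "all_qvalues.tsv", sep = "\t", row.names = FALSE, quote = FALSE)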

  8. Visualise results
  • The results could have been visualised with an R script, but I did not have time, so we used Perseus (not open source, but free for academic use, with several published papers)
  • Perseus tutorials and lectures on YouTube: http://www.coxdocs.org/doku.php?id=perseus:user:tutorials

  9. Heatmap of peptide identifications
  • Red encodes high-confidence identifications, blue lower-confidence ones
  • Grey indicates missing values
  • Clustering was performed with a Euclidean distance function
  • I think this heatmap does show reasonable reproducibility
  • We do not yet know for sure how to interpret it best

  10. Conclusions
  • We have learnt a great deal about improving our QC experiments and procedures
  • Cross-platform data analysis for QC is difficult, but it can be implemented
  • Commercial standards (external or internal) cost money, but they are important for cross-laboratory reproducibility
  • ID-based and non-ID-based QC parameters are very complementary
  • For phase II we want to produce a generally applicable data analysis workflow that can be distributed to all participants (providing qcML output)

  11. Thank you
