
Open source tools for data analysis


Presentation Transcript


  1. Open source tools for data analysis (Achim Treumann)

  2. General Workflow
  • Download all data from MassIVE
  • Using msconvert, convert to mzML
  • Using OpenMS (KNIME version), perform MSGF+ searches
  • Data will now be in mzIdentML and mzTab format
  • Using R, convert the mzTab results into a more accessible peptide.tsv (thanks to Julianus Pfeuffer)
  • Using R, perform a full join of the q-values for all files into one large table
  • Using Perseus, generate a heatmap

  3. msconvert
  • Part of the ProteoWizard suite: http://proteowizard.sourceforge.net/tools.shtml
  • We used default parameters for conversion of all files into mzML
  • This retains most (all?) of the information in the files, including metadata about data acquisition
  • ProteoWizard uses libraries supplied by the MS manufacturers
  • Files grow between 1.5- and 6-fold (HeLa digest file sizes range from 0.5 GByte to 8 GByte). This can be avoided by limiting the number of peaks per MS/MS spectrum (600 is sensible) during conversion; see the example command after this slide
  • 42 files converted
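As a hedged illustration of that peak-limited conversion, a call might look like the following (the input file name and output directory are placeholders; the threshold filter follows the documented ProteoWizard filter syntax):

    # convert a raw file to mzML, keeping only the 600 most intense peaks per spectrum
    msconvert hela_digest.raw --mzML --filter "threshold count 600 most-intense" -o converted/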

  4. OpenMS
  • Platform that allows you to do almost everything with your MS data (particularly within proteomics)
  • Works with data from all manufacturers
  • Tutorial: http://open-ms.sourceforge.net/wp-content/uploads/2016/01/handout1.pdf
  • We used it to search all datasets and calculate FDRs (a command-line sketch follows below)
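The deck ran these steps through KNIME; for reference, a minimal command-line sketch of the same search-and-FDR sequence using OpenMS TOPP tools (file names are placeholders, and the exact flags should be checked against your OpenMS version):

    # search one mzML file with MS-GF+ through the OpenMS adapter
    MSGFPlusAdapter -in run01.mzML -out run01.idXML -database human.fasta -executable MSGFPlus.jar
    # map peptide hits back to proteins, then compute q-values
    PeptideIndexer -in run01.idXML -out run01_indexed.idXML -fasta human.fasta
    FalseDiscoveryRate -in run01_indexed.idXML -out run01_fdr.idXML
    # export to mzTab for the downstream R steps
    MzTabExporter -in run01_fdr.idXML -out run01.mzTab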

  5. OpenMS
  • Workflows are constructed within KNIME (v3.3.1)
  • Each workflow node can have many parameters that can be set (e.g. for a search)
  • Default parameters do not always work and need to be tuned, but it is possible to build one workflow that produces results for all datasets

  6. mzTab tsv conversion
  • The default output for search results is mzIdentML, a format that is great for computers and contains all metadata, but is neither very human-readable nor easy to work with
  • A more usable output standard is mzTab
  • mzTab mixes protein lists and peptide lists in one table, which is inconvenient for further processing
  • Julianus Pfeuffer and Lars Nilse (OpenMS team) wrote an R script, which I modified, to generate a table of peptide results only, discarding q-values > 0.01 (a sketch of the idea follows below)
  • This script is called make_tsv.R; it generates one file per dataset, named psm.tsv
  • This is no longer necessary: the OpenMS team has since developed an improved mzTab exporter
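The original make_tsv.R is not reproduced here; the following is a minimal R sketch of the same idea, pulling the PSM section out of an mzTab file and keeping hits with q-value <= 0.01 (the file name and the q-value column name are assumptions and depend on the exporter):

    # Minimal sketch, not the original make_tsv.R.
    # mzTab is line-oriented: PSH marks the PSM header row, PSM marks the data rows.
    lines <- readLines("run01.mzTab")
    hdr   <- grep("^PSH\t", lines, value = TRUE)[1]
    rows  <- grep("^PSM\t", lines, value = TRUE)
    psm   <- read.delim(text = c(hdr, rows), check.names = FALSE)
    psm[[1]] <- NULL  # drop the PSH/PSM row-type column
    # "search_engine_score[1]" is an assumed name for the q-value column; check your export
    psm <- psm[psm[["search_engine_score[1]"]] <= 0.01, ]
    write.table(psm, "psm.tsv", sep = "\t", row.names = FALSE, quote = FALSE)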

  7. Summarise all data
  • All psm.tsv files were combined into one large file containing all identified peptides (q < 0.01)
  • Using the dplyr library in R, we performed a full join of the individual tables and kept, for each dataset, only the q-value (as a measure of identification confidence); see the sketch after this list
  • We then used Perseus to visualise the data in a heatmap
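A minimal sketch of that join step with dplyr (the directory and the column names "sequence" and "q_value" are assumptions; adapt them to the actual psm.tsv layout):

    library(dplyr)

    files  <- list.files("psm_tables", pattern = "\\.tsv$", full.names = TRUE)
    tables <- lapply(files, function(f) {
      d <- read.delim(f)[, c("sequence", "q_value")]  # assumed column names
      d <- d[!duplicated(d$sequence), ]               # one row per peptide per dataset
      names(d)[2] <- tools::file_path_sans_ext(basename(f))
      d
    })
    # the full join keeps every peptide seen in any dataset; gaps become NA (missing values)
    merged <- Reduce(function(x, y) full_join(x, y, by = "sequence"), tables)
    write.table(merged, "all_qvalues.tsv", sep = "\t", row.names = FALSE, quote = FALSE)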

  8. Visualise results
  • The results could have been visualised with an R script, but I did not have time, so we used Perseus (not open source, but free for academic use, with several published papers)
  • Perseus tutorials and lectures on YouTube: http://www.coxdocs.org/doku.php?id=perseus:user:tutorials

  9. Heatmap of peptide identifications
  • Red encodes high-confidence identifications, blue lower-confidence ones
  • Grey indicates missing values
  • Clustering was performed with a Euclidean distance function
  • I think this heatmap does show reasonable reproducibility
  • We do not yet know for sure how to interpret it best

  10. Conclusions
  • We have learnt a great deal about improving our QC experiments and procedures
  • Cross-platform data analysis for QC is difficult, but it can be implemented
  • Commercial standards (external or internal) cost money, but they are important for cross-laboratory reproducibility
  • ID-based and non-ID-based QC parameters are very complementary
  • For phase II we want to produce a generally applicable data analysis workflow that can be distributed to all participants (providing qcML output)

  11. Thank you
