On Proteogenomics and its implications for the field of Bioinformatics Hugo Willy
What is proteogenomics? • A combination of the words Proteomics and Genomics. • Proteogenomics commonly refers to studies that use proteomic information, often derived from mass spectrometry, to improve gene annotations.
A brief introduction to Protein Mass Spectrometry • It is used to characterize protein sequences. • The basic idea is to ionize proteins and let them “fly” in a vacuum chamber. • The mass/charge (m/z) ratio of an ion can be deduced from its Time of Flight (TOF) to reach a detector, or from the frequency at which it circles in a magnetic field.
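The TOF relation mentioned above can be sketched as follows, assuming an idealized linear TOF instrument in which every ion gets the same kinetic energy from one accelerating voltage and then drifts over a fixed length. The function names, the 20 kV voltage and the 1 m drift length are illustrative assumptions, not values from the talk; real instruments need calibration and corrections that are ignored here.

```python
import math

E_CHARGE = 1.602176634e-19   # elementary charge, in coulombs
DALTON = 1.66053906660e-27   # one dalton, in kilograms

def flight_time(mz, voltage, length):
    """Time in seconds for an ion of the given m/z (Da per charge) to drift `length` metres."""
    # Energy balance: z*e*U = 1/2 * m * v^2, so v = sqrt(2*e*U / (m/z)).
    velocity = math.sqrt(2 * E_CHARGE * voltage / (mz * DALTON))
    return length / velocity

def mz_from_flight_time(t, voltage, length):
    """Invert the relation above: m/z = 2*e*U*t^2 / L^2, converted to Da per charge."""
    return 2 * E_CHARGE * voltage * t ** 2 / (length ** 2 * DALTON)

# Hypothetical instrument settings, chosen only for illustration.
t = flight_time(1000.0, voltage=20000.0, length=1.0)
print(f"flight time: {t:.3e} s, recovered m/z: {mz_from_flight_time(t, 20000.0, 1.0):.3f}")
```

Heavier ions (larger m/z) arrive later, which is what lets the detector arrival time be turned back into m/z.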
A brief introduction to Protein Mass Spectrometry • Some mass spectrometry techniques ionize whole proteins, but the currently popular method is to chop a protein into peptides. • The peptides are separated by their masses before ionization and sequenced independently. • The peptide sequences are mapped back to known protein sequences or used for de novo sequencing (very much like genome sequencing). • The peptide length, according to the people I met, is around 7-15 amino acids.
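The "chopping" step can be sketched in silico. The talk does not name the enzyme; this sketch assumes trypsin, which is commonly described as cutting after K or R unless the next residue is P. The protein sequence is a made-up toy, and the 7-15 length filter simply reuses the range quoted above.

```python
def tryptic_digest(protein):
    """Split a protein after every K or R that is not followed by P (assumed trypsin rule)."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        # Keep whatever follows the last cleavage site.
        peptides.append(protein[start:])
    return peptides

def in_length_range(peptides, min_len=7, max_len=15):
    """Keep only peptides whose length falls in the quoted 7-15 range."""
    return [p for p in peptides if min_len <= len(p) <= max_len]

toy_protein = "MKTAYIAKQRPLVNDEGSWTFHRAAK"  # hypothetical sequence
all_peptides = tryptic_digest(toy_protein)
print(all_peptides)
print(in_length_range(all_peptides))
```

Note that the R in "QRP" is not cut because it is followed by P, so that peptide stays in one piece.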
The pros and cons of Protein Mass Spectrometry • Pros: • It is accurate in determining mass. • Assuming the peptides map unambiguously to a protein sequence, it shows directly which proteins are translated in the cell, and hence which mRNAs get translated and which do not. • It can be used to quantify the amounts of different proteins in the sample, as opposed to predicting them from mRNA levels measured with microarrays.
The pros and cons of Protein Mass Spectrometry • Pros: • It can identify post-translational modifications, e.g. • If proteins are phosphorylated (then it is kinase related) • If proteins are methylated or acetylated (important in the histone code) • If proteins are ubiquitinated (related to protein degradation) • It can detect (ribosomal) programmed frameshifts and alternative splicing events.
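One way to see why the modifications listed above are visible in a mass spectrum: each adds a characteristic mass to the peptide. The sketch below compares an observed peptide mass with the unmodified expected mass and reports which modification could explain the difference. The monoisotopic mass shifts are standard textbook values, not numbers from the talk, and the function name and tolerance are assumptions. Ubiquitination is represented by the GlyGly remnant left on the lysine after tryptic digestion, which is also an assumption of this sketch.

```python
# Monoisotopic mass shifts in daltons (standard values, listed here as assumptions).
PTM_MASS_SHIFTS = {
    "phosphorylation": 79.96633,                   # +HPO3
    "acetylation": 42.01057,                       # +C2H2O
    "methylation": 14.01565,                       # +CH2
    "ubiquitination (GlyGly remnant)": 114.04293,  # two glycine residues
}

def explain_mass_shift(observed_mass, expected_mass, tolerance=0.01):
    """Return the names of single modifications whose shift matches the mass difference."""
    delta = observed_mass - expected_mass
    return [name for name, shift in PTM_MASS_SHIFTS.items()
            if abs(delta - shift) <= tolerance]

# Hypothetical peptide with an unmodified mass of 1000.0 Da.
print(explain_mass_shift(1079.966, 1000.0))
print(explain_mass_shift(1000.5, 1000.0))
```

This only considers a single modification per peptide; real search tools also have to localize the modified residue and handle combinations.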
The pros and cons of Protein Mass Spectrometry • Cons: • It is still expensive (but an expert at the RECOMB Satellite for Computational Proteomics said it is just as expensive as RNA-Seq). • It is hard to distinguish amino acids with identical or nearly identical masses (most notably Leucine and Isoleucine, which are isomers and have exactly the same mass). • We do not have a reliable way to amplify proteins in the sample (a serious problem).
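The Leucine/Isoleucine problem can be checked with a small peptide mass calculator. The residue masses are standard monoisotopic values (an assumption of this sketch, not data from the talk), and the peptides are made-up examples.

```python
# Monoisotopic residue masses in daltons (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water is added for the free N- and C-termini

def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified, uncharged peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER

# Swapping L for I leaves the mass unchanged, so mass alone cannot tell them apart.
print(peptide_mass("PEPTLDE"), peptide_mass("PEPTIDE"))
# K and Q are a near miss: they differ by only about 0.036 Da.
print(RESIDUE_MASS["K"] - RESIDUE_MASS["Q"])
```

Whether a pair like K/Q can be resolved depends on the mass accuracy of the instrument, while L/I cannot be resolved by mass at all.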
What does proteogenomics offer? • Accurate prediction of translation start sites. • Accurate prediction of programmed frameshifts. • Accurate prediction of post-translational modifications. • Confirmation of whether a (pseudo)gene is actually translated. • Observation: most current gene prediction algorithms are not based on proteomic data (because such data were not available).
What does proteogenomics struggle with? • For a novel protein, mapping the peptides from the mass spectrometry experiments to the exomes/genomes (a similar problem to RNA-Seq). • Currently they try to collect exomes (regions that are assumed to be exons) and translate them in 6 different frames (3 on each DNA strand). • They also build an exon splice graph, which models the different splicing alternatives of a single gene.
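The six-frame translation step can be sketched as below, assuming the standard genetic code and ignoring introns entirely. The function names and the toy DNA string are illustrative; this is not the pipeline used by the groups mentioned in the talk.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built in TCAG order; '*' marks a stop codon.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def reverse_complement(dna):
    """Reverse complement of a DNA string over the alphabet ACGT."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(dna):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))

def six_frame_translation(dna):
    """Return the 3 forward and 3 reverse-strand translations, keyed by frame label."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames

frames = six_frame_translation("ATGGCCTAA")  # toy sequence
for label, protein in frames.items():
    print(label, protein)
```

A peptide identified by mass spectrometry can then be looked up as a substring of each of the six translations, which is a simple way to place it on the genome when no gene model exists yet.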
Exon splice graph • Each box represents a single exon and the arrows represent possible combinations of them in the translated protein product. • They developed a program called Inspect to search for a peptide in this graph. It can be found at http://proteomics.ucsd.edu/Inspect
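A toy version of the exon splice graph, to make the idea concrete. This is not how Inspect works internally; it is a brute-force sketch that enumerates every path through a small graph and checks whether the peptide occurs in the concatenated product. It assumes each exon is already given as an amino-acid string, which ignores codons that are split across a splice junction. All exon names and sequences are hypothetical.

```python
def enumerate_paths(edges):
    """List every path from a source exon (no incoming arrow) to a sink exon."""
    targets = {t for successors in edges.values() for t in successors}
    sources = [node for node in edges if node not in targets]
    paths = []

    def walk(path):
        successors = edges.get(path[-1], [])
        if not successors:
            paths.append(tuple(path))
            return
        for nxt in successors:
            walk(path + [nxt])

    for source in sources:
        walk([source])
    return paths

def find_peptide(exons, edges, peptide):
    """Return the exon paths whose concatenated sequence contains the peptide."""
    return [path for path in enumerate_paths(edges)
            if peptide in "".join(exons[e] for e in path)]

# Hypothetical gene with four exons; e2 and e3 can each be skipped.
exons = {"e1": "MKT", "e2": "AYIAK", "e3": "QRQ", "e4": "ISFVK"}
edges = {"e1": ["e2", "e3"], "e2": ["e3", "e4"], "e3": ["e4"], "e4": []}

print(find_peptide(exons, edges, "AKISF"))  # spans the e2 -> e4 junction
print(find_peptide(exons, edges, "ISFVK"))  # lies inside e4, so every path matches
```

A peptide that spans a junction, like the first one, is evidence for one specific splicing alternative. Enumerating all paths grows exponentially with the number of exons, so a real tool needs a smarter search over the graph.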
Current work in proteogenomics • Revising gene models, and hence their annotations. • Finding novel peptides that map to non-exonic regions: novel genes?
Some papers and reviews in this field • Nitin Gupta et al. Whole proteome analysis of post-translational modifications: applications of mass-spectrometry for proteogenomic annotation. Genome Res 2007. • Natalie Castellana. Proteogenomics: Annotating Genomes using the Proteome. Poster in RECOMB CP 2011. http://proteomics.ucsd.edu/recombcp2011/Posters/Poster_B19.pdf • Natalie Castellana. Tutorial: Proteogenomics. http://bix.ucsd.edu/projects/recombcp10_tutorials/RECOMBCP_Tutorial_Castellana.pdf • Most of the work is done by Pavel Pevzner and other groups at UC San Diego. Here is their website: http://proteomics.ucsd.edu/
Comparative Proteogenomics • A branch of proteogenomics that compares proteomic data from multiple related species concurrently and exploits the homology between their proteins to improve annotations with higher statistical confidence. • In a sense, this is the approximate peptide matching problem. • However, it needs to take residue conservation in different parts of the proteins into account, e.g. sites which are post-translationally modified must be preserved to maintain function.
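The approximate matching idea, with the conservation constraint, can be sketched like this. It is a simplification that allows substitutions only (no insertions or deletions) and treats conservation as a hard rule: positions marked as protected in the target protein must match exactly. The function name, the sequences and the protected site are hypothetical.

```python
def approximate_matches(peptide, protein, max_mismatches=1, protected=frozenset()):
    """Return start positions in `protein` where `peptide` fits with few substitutions.

    `protected` holds 0-based protein positions (e.g. modified sites) that
    must match exactly; a mismatch there rejects the alignment.
    """
    hits = []
    for start in range(len(protein) - len(peptide) + 1):
        mismatches = 0
        allowed = True
        for offset, residue in enumerate(peptide):
            if protein[start + offset] != residue:
                mismatches += 1
                if start + offset in protected:
                    allowed = False
        if allowed and mismatches <= max_mismatches:
            hits.append(start)
    return hits

# Peptide observed in one species, protein from a related species (toy data).
protein = "MKTAYSAKQR"
peptide = "AYTAK"  # differs from the protein by one substitution, S -> T at position 5

print(approximate_matches(peptide, protein))                 # substitution tolerated
print(approximate_matches(peptide, protein, protected={5}))  # position 5 is conserved
```

With no protected sites the peptide maps at position 3; once position 5 is declared conserved, the same alignment is rejected.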
Comparative Proteogenomics • Some work in comparative proteogenomics: • Nitin Gupta et al. Comparative proteogenomics: Combining mass spectrometry and comparative genomics to analyze multiple genomes. Genome Res 2008. • GenoMS (Castellana et al. MCP 2010): a program to map peptides to the genomes of other related organisms.
Metaproteomics • Metaproteomics (also Community Proteomics, Environmental Proteomics, or Community Proteogenomics) is the study of all proteins recovered directly from environmental samples. • This involves simultaneous mapping of peptides to all known genomes and proteomes to identify the different organisms present in a sample. • An example of work in this field: Wilmes P, Bond PL. Metaproteomics: studying functional gene expression in microbial ecosystems. Trends Microbiol. 2006.
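A very small sketch of the mapping step described above: each peptide is searched against every reference proteome, and only peptides that hit exactly one organism are counted as evidence for it. The organism names, sequences and the "unique peptide" rule are assumptions made for illustration; real metaproteomic pipelines score spectra rather than exact strings.

```python
def count_unique_peptides(peptides, proteomes):
    """Count, per organism, the peptides that occur in that organism's proteome only.

    `proteomes` maps an organism name to a list of its protein sequences.
    """
    unique_hits = {organism: 0 for organism in proteomes}
    for peptide in peptides:
        matching = [organism for organism, proteins in proteomes.items()
                    if any(peptide in protein for protein in proteins)]
        if len(matching) == 1:
            # Shared or unmatched peptides say nothing about which organism is present.
            unique_hits[matching[0]] += 1
    return unique_hits

# Hypothetical reference proteomes and observed peptides.
proteomes = {
    "organism_A": ["MKTAYIAKQR", "GSWTFHR"],
    "organism_B": ["MKTAYLAKQR"],
    "organism_C": ["PLVNDEGK"],
}
peptides = ["TAYIAK", "GSWTFHR", "TAYLAK", "MK", "QQQQ"]

print(count_unique_peptides(peptides, proteomes))
```

Here "MK" occurs in two organisms and "QQQQ" in none, so neither contributes; organism_C gets no support at all.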
De Novo Novel Protein Sequencing • CSPS (Bandeira et al. Nat. Biot. 2009)
Mass Spectra Database • MassBank • http://www.massbank.jp/en/document.html
Discussion on the problems and possible future directions • I notice that Hoang’s problem, the one which may make it possible to store multiple reference genomes, is going to be very relevant. • RNA-Seq minus Mass Spectrometry = non-coding RNA? (That is, are transcripts seen by RNA-Seq but without peptide evidence non-coding?) • Anything else?