An Introduction to Proteomics

An Introduction to Proteomics The PROTEin complement of the genOME. Judson Hervey UT-ORNL GST Graduate Student whervey@utk.edu

What is Proteomics? • Defined as “the analysis of the entire protein complement in a given cell, tissue, or organism.” • Proteomics “also assesses activities, modifications, localization, and interactions of proteins in complexes.” • Proteomes differ intrinsically across species and growth conditions.

Alternative View • Definition by Mike Tyers (U Toronto): • Lumping everything “post-genomic” together, and alluding to proteomics as “protein chemistry on an unprecedented, high-throughput scale.” • In any case, no matter which definition you accept, consider proteomics as the “next step” in modern biology.

Importance of Proteins: • they serve as catalysts that maintain metabolic processes in the cell, • they serve as structural elements both within and outside the cell, • they are signals secreted by one cell or deposited in the extracellular matrix that are recognized by other cells, • they are receptors that convey information about the extracellular milieu to the cell, • they serve as intracellular signaling components that mediate the effects of receptors, • they are key components of the machinery that determines which genes are expressed and whether mRNAs are translated into proteins, • they are involved in manipulation of DNA and RNA through processes such as: DNA replication, DNA recombination, RNA splicing or editing. http://www-users.med.cornell.edu/~jawagne/proteins_&_purification.html

But what about the Genome? • What does having the genome of an organism give us? • A great diagram, or “blueprint,” of the genes within an organism. • Think of the genome as code that needs to be compiled into functional units. • The genome gets “compiled” into the proteome via the central dogma of biology. • Proteomic strategies use information from the genome to conceptualize protein function.
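The “compile” analogy can be made concrete with a toy sketch. The Python below is illustrative only: the function names `transcribe` and `translate` and the small codon table are assumptions made for this example, and a real tool would use the full 64-codon genetic code.

```python
# Toy illustration of the central dogma: DNA -> mRNA -> protein.
# CODON_TABLE is a deliberately small subset of the standard genetic code.
CODON_TABLE = {
    "AUG": "M",  # start codon, methionine
    "GCU": "A", "GCC": "A",
    "AAA": "K", "AAG": "K",
    "UGG": "W",
    "GAA": "E", "GAG": "E",
    "UAA": "*", "UAG": "*", "UGA": "*",  # stop codons
}


def transcribe(coding_dna: str) -> str:
    """Return the mRNA for a coding-strand DNA sequence (T -> U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate mRNA codon by codon until a stop codon or the end."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        # "X" marks a codon that is missing from the toy table.
        residue = CODON_TABLE.get(mrna[i:i + 3], "X")
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)


print(translate(transcribe("ATGGCTAAATGGGAATAA")))  # prints MAKWE
```

The genome fixes which sequences can be produced; which proteins are actually present, and in what modified forms, is what proteomics has to measure.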

Experimental Platforms Tyers and Mann, pg 194 “Systems biology is an approach to studying complex biological systems made possible through technological breakthroughs such as the human genome project. …systems biology simultaneously studies the complex interaction of many levels of biological information to understand how they work together.” http://www.systemsbiology.org/

Challenges facing Proteomic Technologies • Limited/variable sample material • Sample degradation (occurs rapidly, even during sample preparation) • Vast dynamic range required • Post-translational modifications (often skew results) • Specificity among tissue, developmental and temporal stages • Perturbations by environmental (disease/drugs) conditions • Researchers have deemed sequencing the genome “easy,” as PCR was able to assist in overcoming many of these issues in genomics.

The Peptide Bond

Protein Structure Figure 3-35. Three levels of organization of a protein. (Alberts - Molecular Biology of the Cell)

Amino Acid Properties

Pillar Proteomic Technologies • Amino Acid Composition • Array-based Proteomics • 2D PAGE • Mass Spectrometry • Structural Proteomics • Informatics (and the challenges facing the Human Proteome Project)

Amino Acid Composition (Edman) • Pioneering method of obtaining information from proteins. • Cumbersome and tedious by today’s standards. • Requires the use of foul-smelling β-mercaptoethanol. • Not “high-throughput” by today’s standards; hence, aa comp is no longer the most widely used technique.

Protein Sequencing step 1: fragmenting into peptides

Protein Sequencing step 2: sequencing the peptides by Edman degradation. Separate by HPLC and detect by absorbance at 269 nm.

Array-based Proteomics • Employ two-hybrid assays • Use GFP, FRET, and GST • GFP = green fluorescent protein • FRET = fluorescence resonance energy transfer • GST = glutathione S-transferase, a well-characterized protein used as a marker protein.

Array-based Proteomics

Array-based Proteomics • Offer a high-throughput technique for proteome analysis. • These small plates are able to hold many different samples at a time. • Current research is ongoing in an attempt to interface array methodologies with Mass Spectrometry at ORNL.

Two-Hybrid Assay Figure 12-35. Griffiths et al. Modern Genetic Analysis.

2D PAGE • 2-D gel electrophoresis is a multi-step procedure that can be used to separate hundreds to thousands of proteins with extremely high resolution. • It works by separation of proteins by their pI's in one dimension using an immobilized pH gradient (first dimension: isoelectric focusing) and then by their MW's in the second dimension.

2D PAGE • 2-D gel electrophoresis process consists of these steps: • Sample preparation • First dimension: isoelectric focusing • Second dimension: gel electrophoresis • Staining • Imaging analysis via software

2D PAGE product of Hs plasma http://us.expasy.org/ch2dothergifs/publi/elc.gif

Drawbacks of 2D PAGE • Results are difficult to reproduce reliably. • Spots often overlap, making identifications difficult. • More of “an art” than “a science.” • Slow and tedious. • Process contains many “open” phases where contamination is possible.

Structural Proteomics • Pioneering work is under way by Baumeister et al., which could significantly reduce the painstaking labor involved in crystallizing proteins. • Current techniques are not considered “high throughput” within the structural realm. • Novel solutions combine current technologies, such as NMR and XRC (X-ray crystallography).

Informatics • Significant improvements are needed in: • Data presentation standards and formatting • Software infrastructure • ISB has created many powerful software packages that interpret data from different techniques. • EBI and HUPO have come together to promote uniform data storage and analysis: • http://psidev.sourceforge.net • The proteomics community has, over the course of the past four years, become slightly “less proprietary.” Ron Beavis of U. Manitoba has developed X! Tandem, an open-source search algorithm, as an alternative to SEQUEST. • Developing novel analysis software and devising strategies for biologists to manage the data are two fronts that I can see as opportunities for folks with a CS background.

Clinical Proteomics • This area of proteomics focuses on accelerating drug development for diseases through the systematic identification of potential drug targets. • How could this be accomplished? • Hopefully, we will have more specific information, instead of raw genes, that will make those complex differential equations much simpler in the coming years.

Mass Spectrometry • Mass Spectrometry is another tool to analyze the proteome. • In general a Mass Spectrometer consists of: • Ion Source • Mass Analyzer • Detector • Mass Spectrometers are used to measure the mass-to-charge (m/z) ratios of substances. • From this measurement, a mass is determined, proteins are identified, and further analysis is performed.

“Mass Spec” Analyses can be run in Tandem • MS/MS refers to two MS experiments performed “in tandem.” • Among other things, MS/MS allows for the determination of sequence information, usually in the form of peptides (small parts of a protein). • This information is used by algorithms to identify a protein on the basis of mass of a constituent peptide.

If you are lost…. • Consider an example: calculating a person’s weight, without them knowing. • If we have a backpack that we know is 10 pounds, we could have them put it on. • Then, walk the subject over a hidden scale in the floor. • The weight of the person could be obtained by subtracting the weight of the backpack.

In a similar manner: • Mass spectrometers allow the determination of a mass-to-charge ratio of the analyte. • By knowing the charge state of the analyte, which arises through the addition of protons (the backpack in the example), the mass can be calculated after deconvolution of the spectrum.
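The backpack arithmetic can be written out directly. This is a minimal sketch, assuming the analyte is a positive ion of the form [M + zH]^z+ (as produced by electrospray ionization); the function names are hypothetical and chosen for this example. The second function assumes the two peaks are adjacent charge states of the same analyte.

```python
PROTON_MASS = 1.007276  # mass of a proton in daltons


def neutral_mass(mz: float, charge: int) -> float:
    """Neutral mass M of an [M + zH]^z+ ion: subtract the protons ("backpacks")."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return charge * (mz - PROTON_MASS)


def charge_from_adjacent_peaks(mz_low: float, mz_high: float) -> int:
    """Charge of the mz_high peak, assuming mz_low carries exactly one more proton.

    From mz_high = M/z + p and mz_low = M/(z + 1) + p it follows that
    z = (mz_low - p) / (mz_high - mz_low).
    """
    return round((mz_low - PROTON_MASS) / (mz_high - mz_low))


# A 2000 Da analyte observed as 2+ and 3+ ions.
z = charge_from_adjacent_peaks(667.673943, 1001.007276)
print(z, round(neutral_mass(1001.007276, z), 3))
```

Real deconvolution software combines many charge states and isotope peaks; the two-peak case only shows why knowing the charge is enough to recover the mass.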

LCQ Mass Spectrometer

Example MS/MS Spectrum This spectrum shows the fragmentation of a peptide, which is used to determine the sequence of the peptide, via a search algorithm.
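To see where the peaks in such a spectrum come from, one can compute the fragment masses a candidate peptide would be expected to produce. The sketch below assumes the common b-ion/y-ion model of backbone fragmentation with singly charged fragments and unmodified residues; the function name `fragment_ions` is hypothetical.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # added to y ions (intact C-terminus)
PROTON = 1.007276   # charge carrier for a 1+ ion


def fragment_ions(peptide: str):
    """Return (b_ions, y_ions): singly charged m/z values for each backbone cleavage."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(masses)):
        b_ions.append(sum(masses[:i]) + PROTON)          # N-terminal fragment
        y_ions.append(sum(masses[i:]) + WATER + PROTON)  # C-terminal fragment
    return b_ions, y_ions


b, y = fragment_ions("PEPTIDE")
for cut, (b_mz, y_mz) in enumerate(zip(b, y), start=1):
    print(f"cleavage {cut}: b = {b_mz:.4f}, y = {y_mz:.4f}")
```

Differences between consecutive b ions (or consecutive y ions) equal residue masses, which is what lets a sequence be read from the spectrum. Note that L and I have identical masses and cannot be told apart this way.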

Typical MS experiment:

Algorithmic approaches to “tag” identification • Peptide sequence tags (Mann): extract an unambiguous sequence tag for ID. • Cross Correlation (Eng et al. - SEQUEST): comparison between observed and theoretically generated spectra. • Probability-based matching (Perkins, and the proprietary Mascot by Matrix Science): takes into account the statistical significance of fragmentation. • Which one of these approaches would you employ? (Hint: Discussion fostering question.) • Could DP (dynamic programming) be employed in searching for post-translational modifications in future designs? • Could it be done in advance, in order to account for PTMs and speed up the search?
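The idea of comparing an observed spectrum with theoretically generated ones can be sketched in a few lines. This is not SEQUEST's cross-correlation score or Mascot's probability model; it is a simplified stand-in (a cosine similarity over binned peak positions, ignoring intensities), and the names `bin_peaks`, `cosine_score`, and `best_match` are hypothetical.

```python
import math
from collections import Counter


def bin_peaks(mz_values, bin_width=1.0):
    """Count peaks per m/z bin of the given width."""
    return Counter(int(mz // bin_width) for mz in mz_values)


def cosine_score(observed, theoretical, bin_width=1.0):
    """Cosine similarity between two binned peak lists (0.0 to 1.0)."""
    obs = bin_peaks(observed, bin_width)
    theo = bin_peaks(theoretical, bin_width)
    dot = sum(obs[b] * theo[b] for b in obs.keys() & theo.keys())
    norm = (math.sqrt(sum(v * v for v in obs.values()))
            * math.sqrt(sum(v * v for v in theo.values())))
    return dot / norm if norm else 0.0


def best_match(observed, candidates):
    """Return the name of the candidate whose theoretical peaks score highest.

    candidates maps a peptide name to its list of theoretical m/z values.
    """
    return max(candidates, key=lambda name: cosine_score(observed, candidates[name]))


observed = [100.2, 200.4, 300.1]
candidates = {
    "pepA": [100.0, 200.0, 300.0],
    "pepB": [150.0, 250.0, 300.0],
}
print(best_match(observed, candidates))  # prints pepA
```

A database search repeats this scoring against every candidate peptide whose mass matches the precursor, which is why allowing for PTMs (each one shifts fragment masses) enlarges the search so quickly.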

De novo algorithms

Sequence tagging Algorithms

+/- Sequence Tagging Algorithms

Other Proteomic Tools FYR

My $0.02: • Proteomics is undoubtedly a critical component of systems biology, however: • The lack of hypothesis-driven experiments isn’t necessarily “good” for science. Discovery-based science should be guided by hypotheses, IMO. • Along these lines, as with the HGP, when it comes to literature, what do you do, just publish the whole thing? • This is another stumbling block of what to do with all of this information. • Proteomics needs its “own PCR,” or “miracle” tool, to increase the throughput. • A new technology, or instrument that combines other approaches, would be useful, esp. in structural proteomics, quantification, and sample reproduction.

References • Nature Insight: Proteomics. Nature 422: 191-237. • Zhu, H. et al. Proteomics. Annual Review of Biochemistry 72: 783-812. • Griffiths et al. Modern Genetic Analysis. Online: http://ncbi.nih.gov
