Direct Experimental Observation of Functional Protein Isoforms by Tandem Mass Spectrometry

Nathan Edwards, Center for Bioinformatics and Computational Biology, University of Maryland, College Park

Synopsis • MS/MS spectra provide evidence for the amino-acid sequence of functional proteins. • Key concepts: • Spectrum acquisition is unbiased • Direct observation of amino-acid sequence • Sensitive to small sequence variations

Synopsis • MS/MS spectra provide evidence for the amino-acid sequence of functional proteins. • Applications: • Cancer biomarkers • Genome annotation

Mass Spectrometry for Proteomics • Measure mass of many (bio)molecules simultaneously • High bandwidth • Mass is an intrinsic property of all (bio)molecules • No prior knowledge required

Mass Spectrometer • Sample → Ionizer → Mass Analyzer → Detector • Ionizer: MALDI, Electro-Spray Ionization (ESI) • Mass Analyzer: Time-Of-Flight (TOF), Quadrupole, Ion-Trap • Detector: Electron Multiplier (EM)

High Bandwidth [Figure: mass spectrum, % intensity (0 to 100) vs. m/z (250 to 1000)]

Mass is fundamental!

Mass Spectrometry for Proteomics • Measure mass of many molecules simultaneously • ...but not too many, abundance bias • Mass is an intrinsic property of all (bio)molecules • ...but need a reference to compare to

Mass Spectrometry for Proteomics • Mass spectrometry has been around since the turn of the 20th century... • ...why is MS-based proteomics so new? • Ionization methods • MALDI, Electrospray • Protein chemistry & automation • Chromatography, Gels, Computers • Protein / genome sequences • A reference for comparison

Sample Preparation for Peptide Identification • Enzymatic Digest and Fractionation

Single Stage MS [Figure: MS spectrum, intensity vs. m/z]

Tandem Mass Spectrometry (MS/MS) • Precursor selection [Figure: MS spectrum vs. m/z with a selected precursor]

Tandem Mass Spectrometry (MS/MS) • Precursor selection + collision induced dissociation (CID) [Figure: MS spectrum and the resulting MS/MS spectrum, each vs. m/z]

Peptide Identification • For each (likely) peptide sequence 1. Compute fragment masses 2. Compare with spectrum 3. Retain those that match well • Peptide sequences from (any) sequence database • Swiss-Prot, IPI, NCBI’s nr, ESTs, genomes, ... • Automated, high-throughput peptide identification in complex mixtures
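The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular search engine: it assumes singly charged b- and y-ions only, monoisotopic residue masses, and a plain shared-peak count as the score. The function names, the 0.5 m/z tolerance, and the `min_matches` cutoff are hypothetical choices made for the example.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # mass of H2O
PROTON = 1.00728   # mass of the charge-carrying proton


def fragment_mz(peptide):
    """Step 1: singly charged b-ion then y-ion m/z values for a peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(peptide)):
        b_ions.append(sum(masses[:i]) + PROTON)          # N-terminal prefix
        y_ions.append(sum(masses[i:]) + WATER + PROTON)  # C-terminal suffix
    return b_ions + y_ions


def count_matches(peptide, peaks, tol=0.5):
    """Step 2: number of theoretical fragments within tol of an observed peak."""
    return sum(
        any(abs(mz - peak) <= tol for peak in peaks)
        for mz in fragment_mz(peptide)
    )


def best_peptides(candidates, peaks, min_matches=4, tol=0.5):
    """Step 3: keep candidates that match well, best first."""
    scored = [(count_matches(p, peaks, tol), p) for p in candidates]
    return sorted([sp for sp in scored if sp[0] >= min_matches], reverse=True)


if __name__ == "__main__":
    # Synthetic "spectrum": the theoretical fragments of MARNR.
    peaks = fragment_mz("MARNR")
    print(best_peptides(["MVMAR", "MARNR"], peaks))
```

Production search engines use more elaborate scoring and significance estimates; the shared-peak count here only shows the shape of the computation.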

Peptide Identification ...can provide direct experimental evidence for the amino-acid sequence of functional proteins. Evidence for: • Functional protein isoforms • Translation start and frame • Proteins with short open-reading-frames

Why is this useful for ...... genome annotation? • Evidence for SNPs and alternative splicing stops with transcription • No genomic or transcript evidence for translation start-site. • Conservation doesn’t stop at coding bases! • Statistical gene-finders struggle with micro-exons, translation start-site, and short ORFs.

Why is this useful for ...... cancer biomarkers? • Alternative splicing is the norm! • Only 20-25K human genes • Each gene makes many proteins • Some splicing is believed to be silencing • Lots of splicing in cancer • Proteins have clinical implications • Statistical biomarker discovery • Putative malfunctioning proteins

What can be observed? • Known coding SNPs • Novel coding mutations • Alternative splicing isoforms • Microexons (non-canonical splice-sites) • Alternative translation start-sites (codons) • Alternative translation frames • “Dark” open-reading-frames

Splice Isoform • Human Jurkat leukemia cell-line • Lipid-raft extraction protocol, targeting T cells • von Haller, et al. MCP 2003. • LIME1 gene: • LCK interacting transmembrane adaptor 1 • LCK gene: • Leukocyte-specific protein tyrosine kinase • Proto-oncogene • Chromosomal aberration involving LCK in leukemias. • Multiple significant peptide identifications

Splice Isoform

Novel Splice Isoform

Novel Mutation • HUPO Plasma Proteome Project • Pooled samples from 10 male & 10 female healthy Chinese subjects • Plasma/EDTA sample protocol • Li, et al. Proteomics 2005. (Lab 29) • TTR gene • Transthyretin (pre-albumin) • Defects in TTR are a cause of amyloidosis. • Familial amyloidotic polyneuropathy • late-onset, dominant inheritance

Novel Mutation Ala2→Pro associated with familial amyloid polyneuropathy

Novel Mutation

Translation Start-Site • Human erythroleukemia K562 cell-line • Depth of coverage study • Resing et al. Anal. Chem. 2004. • THOC2 gene: • Part of the heteromultimeric THO/TREX complex. • Initially believed to be a “novel” ORF • RefSeq mRNA in Jun 2007, no RefSeq protein • TrEMBL entry Feb 2005, no SwissProt entry • Genbank mRNA in May 2002 (complete CDS) • Plenty of EST support • ~ 100,000 bases upstream of other isoforms

Translation Start-Site

Easily distinguish minor sequence variations Two B. anthracis Sterne α/β SASP annotations • RefSeq/Gb: MVMARN... (7441 Da) • CMR: MARN... (7211 Da) • Intact proteins differ by 230 Da • 7441 Da vs 7211 Da • N-terminal tryptic peptides: • MVMAR (606.3 Da), MVMARNR (876.4 Da), vs • MARNR (646.3 Da) • Very different MS/MS spectra
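The peptide masses quoted above can be checked with a short calculation. The sketch below assumes monoisotopic neutral masses (sum of residue masses plus one water); only the five residues needed for these peptides are tabulated, and the function name is illustrative.

```python
# Monoisotopic residue masses (Da); subset sufficient for the peptides above.
RESIDUE_MASS = {
    "M": 131.04049, "V": 99.06841, "A": 71.03711,
    "R": 156.10111, "N": 114.04293,
}
WATER = 18.01056  # added once per peptide for the free N- and C-termini


def peptide_mass(peptide):
    """Monoisotopic neutral mass of a peptide in Da."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


if __name__ == "__main__":
    for pep in ("MVMAR", "MVMARNR", "MARNR"):
        print(pep, round(peptide_mass(pep), 1))
    # The two annotations differ by the residues M and V at the N-terminus.
    print("M + V:", round(RESIDUE_MASS["M"] + RESIDUE_MASS["V"], 1))
```

Rounded to one decimal place, this reproduces 606.3, 876.4, and 646.3 Da, and the Met + Val residue pair accounts for the roughly 230 Da difference between the intact proteins.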

Bacterial Gene-Finding • Find all the open-reading-frames... • Forward strand: …TAGAAAAATGGCTCTTTAGATAAATTTCATGAAAAATATTGA… [Figure labels: stop codons] ...courtesy of Art Delcher

Bacterial Gene-Finding • Find all the open-reading-frames... • ...but they overlap – which ones are correct? • Reverse strand: …ATCTTTTTACCGAGAAATCTATTTAAAGTACTTTTTATAACT… • Forward strand: …TAGAAAAATGGCTCTTTAGATAAATTTCATGAAAAATATTGA… [Figure labels: stop codons on both strands, shifted stop] ...courtesy of Art Delcher
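A naive open-reading-frame scan makes the overlap problem concrete. The sketch below is not Glimmer's algorithm; it assumes ATG is the only start codon, takes the first in-frame ATG after the previous stop, and reports coordinates on the strand being read. The function names are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=1):
    """Yield (strand, start, end, orf) for each ATG...stop ORF in six frames.

    start and end are 0-based, end-exclusive positions on the strand read.
    """
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # first ATG since last stop
                elif start is not None and codon in STOPS:
                    if (i - start) // 3 >= min_codons:
                        yield strand, start, i + 3, s[start:i + 3]
                    start = None                   # look for the next ORF


if __name__ == "__main__":
    # Sequence fragment shown on the slide (forward strand).
    seq = "TAGAAAAATGGCTCTTTAGATAAATTTCATGAAAAATATTGA"
    for orf in find_orfs(seq):
        print(orf)
```

On the slide's fragment this finds one ORF on each strand, and the two overlap on the genome, which is exactly the ambiguity a gene-finder has to resolve by scoring.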

Coding-Sequence “Score” ...courtesy of Art Delcher

Glimmer3 Performance • Glimmer3 trained & compared to RefSeq genes with annotated function • Correct STOP: 99.6% • Correct START: 84.3% • “Not all the genomes necessarily have carefully/accurately annotated start sites, so the results for number of correct starts may be suspect.”

N-terminal peptides • (Protein) N-terminal peptides establish the start-site of known & unexpected ORFs • Use: • Directly to annotate genomes • Evaluate and improve algorithms • Map cross-species

N-terminal peptide workflows • Typical proteomics workflows sample peptides from the proteome “randomly” • Caulobacter crescentus (70%) • 3733 Proteins (RefSeq Genome annot.) • 66K tryptic peptides (600 Da to 3000 Da) • 2085 N-terminal tryptic peptides (3%)
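Counts like those above come from an in silico digest of the annotated proteome. The sketch below assumes fully cleaved peptides, the common trypsin rule (cut after K or R, but not before P), monoisotopic masses, and the 600 Da to 3000 Da window from the slide. The function names and the toy protein sequences are made up for illustration.

```python
import re

# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def peptide_mass(peptide):
    """Monoisotopic neutral mass in Da."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def tryptic_peptides(protein):
    """Fully cleaved tryptic peptides: cut after K or R, except before P."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]


def n_terminal_counts(proteins, lo=600.0, hi=3000.0):
    """Return (in-range N-terminal peptides, all in-range peptides)."""
    n_term = total = 0
    for protein in proteins:
        peptides = tryptic_peptides(protein)
        in_range = [lo <= peptide_mass(p) <= hi for p in peptides]
        total += sum(in_range)
        if in_range and in_range[0]:   # first peptide is the protein N-terminus
            n_term += 1
    return n_term, total


if __name__ == "__main__":
    toy_proteome = ["MVMARNRAK", "MKAAAAAAAAR"]  # hypothetical sequences
    print(n_terminal_counts(toy_proteome))
```

Because many N-terminal tryptic peptides fall outside the mass window, fewer N-terminal peptides than proteins are counted, consistent with 2085 peptides for 3733 proteins above.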

N-terminal peptide workflow 1. Protect protein N-terminus 2. Digest to peptides 3. Chemically modify free peptide N-term 4. Use chem. mod. to capture unwanted peptides • Nat Biotech, Vol. 21, pp. 566-569, 2003.

Increasing N-terminal peptide coverage • Multiple (digest) enzymes: • trypsin-R: 60% (80%) • acid + lys-C + trypsin: 85% (94%) • Repeated LC-MS/MS • Precursor Exclusion / Inclusion lists • MALDI / ESI • Protein separation and/or orthogonal fractionation • Anal Chem, Vol. 76, pp. 4193-4201, 2004.

Proteomics Informatics • Search spectra against: • Entire bacterial genome; • All Met initiated peptides; or • Statistically likely Met initiated peptides. • Easily consider initial Met loss PTM, too • Off-the-shelf MS/MS search engines (Mascot / X!Tandem / OMSSA)

Other Practical Issues • Suitable for commonly available instrumentation • Only the sample prep. is (somewhat) novel. • Need living organism • Stage of life-cycle? • Bang for buck? • N-terminal peptides / $$$$ • In discussions with JCVI (ex TIGR) • Possible pilot project?

Other Research Projects • Improving peptide identification by MS/MS • Spectral matching using HMMs • Combining search engine results • Spectral matching for detection and quantitation • Microorganism identification using MS • Live public web-site and database • (Inexact) uniqueness guarantees • Primer/Probe oligo design • Pathogen detection (DNA & Peptide) • Significant false-positive peptide identifications

Spectral Matching • Detection vs. identification • Increased sensitivity • No novel peptides • NIST GC/MS Spectral Library • Identifies small molecules, • 100,000’s of (consensus) spectra • Bundled/Sold with many instruments • “Dot-product” spectral comparison • Current project: Peptide MS/MS
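A “dot-product” comparison can be sketched as a normalized dot product (cosine similarity) between binned spectra. The 1 m/z bin width, the absence of intensity scaling or peak weighting, and the function names are assumptions of this example rather than details of the NIST library software.

```python
import math
from collections import defaultdict


def binned(peaks, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins."""
    vec = defaultdict(float)
    for mz, intensity in peaks:
        vec[int(mz / bin_width)] += intensity
    return vec


def dot_product_score(spectrum_a, spectrum_b, bin_width=1.0):
    """Cosine similarity of two spectra given as (m/z, intensity) pairs."""
    va = binned(spectrum_a, bin_width)
    vb = binned(spectrum_b, bin_width)
    dot = sum(va[k] * vb[k] for k in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    library = [(175.1, 100.0), (289.2, 50.0)]   # toy library spectrum
    query = [(175.1, 50.0), (289.2, 25.0)]      # same peaks, half intensity
    other = [(175.1, 100.0), (400.3, 100.0)]    # one shared peak
    print(dot_product_score(library, query))
    print(dot_product_score(library, other))
```

The score is insensitive to overall intensity scale, so a weak acquisition of a library peptide still scores near 1, while a spectrum sharing only some peaks scores lower.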

Peptide DLATVYVDVLK

Hidden Markov Models for Spectral Matching • Capture statistical variation and consensus in peak intensity • Capture semantics of peaks • Extrapolate model to other peptides • Good specificity with superior sensitivity for peptide detection • Assign 1000’s of additional spectra (w/ p-value < 10^-5)

www.RMIDb.org

www.RMIDb.org Statistics: • 16.7 x 10^6 (6.4 x 10^6) protein sequences • ~ 40,000 organisms, ~ 19,700 species • 557 (415) complete genomes Sources: • TIGR’s CMR, SwissProt, TrEMBL, Genbank Proteins, RefSeq Proteins & Genomes • Inclusive Glimmer3 predictions on Genomes • Pfam and GO assignments using BOINC grid

www.RMIDb.org Accessed from all over the world...

Uniqueness guarantees • 20-mer oligo signatures for B. anthracis • In all available strains as exact match • No (inexact) match to other Bacillus species

Uniqueness guarantees • Human genome primer design problem • “4-unique” DNA 20-mers: • Edit-distance ≥ 5 to any non-specific hybridization site • No such valid loci on Chr. 22! • Currently analyzing entire genome • “3-unique” DNA 20-mers: • Initial experiments suggest ~ 0.01% valid • Approx. 1 valid oligo every 10,000 bases
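The “k-unique” condition can be stated as: the smallest edit distance between the oligo and any substring of the background sequence exceeds k. A minimal sketch follows, using the standard dynamic program for approximate substring matching. It assumes the background excludes the intended target site and checks one strand only; the function names are hypothetical, and a genome-scale analysis would need a far more efficient method than this quadratic scan.

```python
def min_edit_distance_to_substring(oligo, text):
    """Smallest edit distance between oligo and any substring of text."""
    # Row 0 is all zeros: a match may start anywhere in text at no cost.
    prev = [0] * (len(text) + 1)
    for i, a in enumerate(oligo, 1):
        cur = [i] + [0] * len(text)   # i deletions to match an empty substring
        for j, b in enumerate(text, 1):
            cur[j] = min(
                prev[j] + 1,              # delete a from the oligo
                cur[j - 1] + 1,           # insert b into the oligo
                prev[j - 1] + (a != b),   # match or substitute
            )
        prev = cur
    # A match may also end anywhere in text, so take the row minimum.
    return min(prev)


def is_k_unique(oligo, background, k):
    """True if every site in background is more than k edits from oligo."""
    return min_edit_distance_to_substring(oligo, background) > k


if __name__ == "__main__":
    print(min_edit_distance_to_substring("ACGT", "TTACGTTT"))
    print(is_k_unique("AAAA", "CCCCCCCC", 3))
```

With k = 4 this corresponds to the “edit-distance ≥ 5” requirement above for 20-mers.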
