Protein Sequence

Protein Sequence • Amino Acid Composition • IEC • RP HPLC • Ancient Sequencing methods • Modern Sequencing methods • Sequencing the Gene • Then what?

Amino Acid Composition • 1952 - Complete Acid Hydrolysis • Ion Exchange Chromatography with programmed buffer changes (~3 hr) • Post-column derivatization with • Ninhydrin • Fluorescamine • 1980 - Complete Acid Hydrolysis • Precolumn derivatization to Phenylthiohydantoins • Reversed-Phase HPLC (~30 min)

Sequencing • Sanger Endgroup Analysis • Modify the protein with fluorodinitrobenzene (amines), aka FDNB, Sanger’s reagent. • Alternative reagent, dansyl chloride, fluorescent. • Hydrolyze protein • Separate by TLC • Identify N-terminal amino acid by Rf • Treat protein with Aminopeptidase • Repeat until the end gets ragged • Use proteolytic fragments for simplicity

Sequencing • Generate proteolytic fragments • Use more than one protease in separate experiments • Trypsin cleaves after Arg and Lys residues • Chymotrypsin cleaves after Phe, Tyr, Trp • Separate fragments (HV paper electrophoresis/HPLC) • Sequence all peptides independently • Assemble the sequence using overlap info (figure: overlapping trypsin and chymotrypsin fragments)
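The cleavage rules on this slide can be sketched in a few lines of Python. This is a minimal illustration, not a full protease model: the function name `digest` and the example peptide are hypothetical, and the sketch ignores real-world exceptions such as trypsin's poor cleavage before proline.

```python
def digest(sequence, cleave_after):
    """Split a one-letter protein sequence after each residue in cleave_after."""
    fragments, start = [], 0
    for i, residue in enumerate(sequence):
        if residue in cleave_after:
            fragments.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        fragments.append(sequence[start:])  # C-terminal fragment with no cleavage site
    return fragments

TRYPSIN = set("KR")         # cleaves after Lys and Arg
CHYMOTRYPSIN = set("FYW")   # cleaves after Phe, Tyr, Trp

peptide = "MGKARSMFLKHSYKARS"  # hypothetical example peptide
tryptic = digest(peptide, TRYPSIN)
chymotryptic = digest(peptide, CHYMOTRYPSIN)
print(tryptic)
print(chymotryptic)
```

Because the two proteases cut at different residues, each chymotryptic fragment spans at least one tryptic cut site, which is the overlap information used to order the tryptic peptides.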

Automated Sequencing • Use proteolytic fragments • Sequence each peptide using automated Edman Degradation • Each Edman cycle removes one amino acid • Converts it to PTH amino acid for HPLC • Assemble the sequence using overlap info (figure: overlapping trypsin and chymotrypsin fragments)

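N-Terminal Edman Degradation • The N-terminal amine of the peptide attacks phenylisothiocyanate • H+ drives the rearrangement, releasing the anilinothiazolinone amino acid and the peptide shortened by one residue (N-1) • The anilinothiazolinone amino acid is converted to the PTH-amino acid • PTH-amino acids absorb at 260-275 nm and are RP-HPLC compatible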

C-Terminal Edman Degradation • Activation of the carboxyl by acetic anhydride • Attack by thiocyanate • Hydrolysis (+H2O) releases the TH-amino acid and the peptide shortened by one residue (N-1)

Alternative Sequencing - MS • Use non-fragmenting ionization • Electrospray ionization + traditional mass spec • Matrix-assisted laser desorption-ionization + time-of-flight mass spec (MALDI-TOF) • Measures mass of mature, intact protein and/or complexes

Sequencing the Gene • DNA synthesis in vitro requires • Template (the DNA you want to sequence) • Primer (complementary to a region upstream of where you want to sequence) • Polymerase • dXTPs, Mg2+ • Primer pairs with template, free 3’-OH group ready for action • As dXTPs base-pair with the template, the 3’-OH attacks the α-phosphate of the dXTP, displacing PPi, making a phosphodiester, extending the nascent DNA chain by one base

The Polymerase Reaction • Elongation of a primer that is base-paired with a template • Requires a free 3’-OH group (figure: primer with a free 3’-OH base-paired to a template strand, 5’ and 3’ ends labeled)

Di-deoxy Terminators • If only 2’, 3’-dideoxy nucleoside triphosphates were used, the reaction would proceed for only one cycle because there would be no free 3’-OH group to attack the next dXTP • If a fraction of a percent of ONE 2’, 3’-dideoxy nucleoside triphosphate (say ddTTP) were used • SOME polymer would be terminated EACH time that base was incorporated, i.e., each time dA occurs in the template • If 1/1000th of the dTTP were ddTTP, then 1/1000th of the polymers would terminate at each dA in the template… the rest would continue • You would get many polymers of different sizes, each corresponding to the occurrence of a dA in the template • Use four separate reactions, one with ddTTP, one with ddATP, one with ddGTP, and one with ddCTP (and all other components) • For every base in the sequence, one of the four reaction mixtures would contain a polymer that terminated at that base

Dideoxy Terminators • Use fluorescent or radioactive primer so you can see every polymer • Separate them by size (gel electrophoresis), one lane each for ddATP, ddTTP, ddCTP, ddGTP • Read sequence of polymers from gel, small to large • Infer the sequence of the template by Watson-Crick base pairing
Sequence of template: 3’ ATGTCACAGGACAGA 5’
Base in polymer: 5’ TACAGTGTCCTGTCT 3’
(figure: agarose gel with four lanes, fragments running from small to large)
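The logic of the four reactions can be simulated as a sanity check. The sketch below is an idealized model under stated assumptions: every possible terminated length is taken to be present in its lane, fragment length counts only newly added bases, and the function names are hypothetical. It uses the template strand from the slide, written 3’ to 5’.

```python
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}

def terminated_lengths(template_3to5, dd_base):
    """Lengths of new strands that end in the given dideoxy base."""
    new_strand = "".join(PAIR[b] for b in template_3to5)  # grows 5'->3'
    return [i + 1 for i, b in enumerate(new_strand) if b == dd_base]

def read_gel(lanes):
    """Read bases from the smallest to the largest fragment across all lanes."""
    by_length = {n: base for base, lengths in lanes.items() for n in lengths}
    return "".join(by_length[n] for n in sorted(by_length))

template = "ATGTCACAGGACAGA"  # template from the slide, written 3'->5'
lanes = {base: terminated_lengths(template, base) for base in "ATGC"}
new_strand = read_gel(lanes)                              # polymer, 5'->3'
inferred_template = "".join(PAIR[b] for b in new_strand)  # back to 3'->5'
print(lanes)
print(new_strand, inferred_template == template)
```

Each length appears in exactly one lane, which is why reading the gel from small to large gives one unambiguous base per position.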

A, T, G, and C. What are the Amino Acids? • Standard Genetic Code

ORFs - Look for longest uninterrupted sequence
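A minimal sketch of the ORF search, assuming the standard genetic code, an uppercase A/C/G/T input, and a simplified ORF definition (from the first Met in a stop-free stretch to the end of that stretch, in any of the six reading frames). The function names and the example DNA are hypothetical.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

def reverse_complement(dna):
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def longest_orf(dna):
    """Longest Met-initiated, stop-free translation over all six frames."""
    best = ""
    for strand in (dna, reverse_complement(dna)):
        for frame in range(3):
            protein = "".join(CODON_TABLE[strand[i:i + 3]]
                              for i in range(frame, len(strand) - 2, 3))
            for segment in protein.split("*"):  # '*' marks a stop codon
                start = segment.find("M")
                if start != -1 and len(segment) - start > len(best):
                    best = segment[start:]
    return best

example = "CCATGGGTAAAGCTCGTTAAGG"  # hypothetical DNA fragment
print(longest_orf(example))
```

Real gene finders also weigh codon usage, ribosome binding sites and introns, so the longest ORF is a starting hypothesis rather than proof of a gene.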

So, you’ve got the sequence… So what? • Next topic: Bioinformatics • Inferences based on homology

Questions • Has the gene been sequenced before? (Will I be able to publish?) • What is the sequence of the protein encoded by the gene? • Has the protein been sequenced before? • Is the gene similar to one that has been sequenced before? • Did I sequence the right gene? • Will I be able to find structural or functional relatives? • Is the protein similar to one that has been sequenced before? • How similar? • What does the similarity mean? • Can I predict the function of the gene product, or is the predicted function consistent with what I know about the protein? • Can I get information about structural features of the gene product? • Secondary structure • Folding domains or other common patterns • Hydropathy profiles • How might predicted helices and/or sheet pack? • Is it likely to be a membrane protein, a transmembrane protein?

Answers: Sequence Similarities and Similarity Searches • Search sequence databases for homologous proteins. • Find families of proteins that are similar to your protein. • Use information about the structure and properties of the similar protein(s) to establish inferences about your protein. • If the exact sequence is in the database, the similarity search routines will find that, too. • Determine whether two sequences are related (or identical) by aligning them so that homologous regions are adjacent. • For two identical sequences:
MGKARSMVLKHSTKARS
MGKARSMVLKHSTKARS

But, what about:
Imperfect homology
MGKARSMLLKHSTKARS
MGKARTMVLKHSTRARS
Gaps/insertions
MGKARSMLLKHSLKARS
MGRA----LKHSLRART
And, how homologous is homologous?
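One crude way to put a number on "how homologous" is percent identity over the aligned columns. The sketch below assumes the sequences are already aligned and that the gap in the second example is four residues long, written as dashes; the function name `identity` is hypothetical.

```python
def identity(aligned_a, aligned_b, gap="-"):
    """Return (identical columns, compared columns, gapped columns)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = list(zip(aligned_a, aligned_b))
    compared = [(x, y) for x, y in pairs if gap not in (x, y)]
    identical = sum(x == y for x, y in compared)
    return identical, len(compared), len(pairs) - len(compared)

imperfect = identity("MGKARSMLLKHSTKARS", "MGKARTMVLKHSTRARS")
gapped = identity("MGKARSMLLKHSLKARS", "MGRA----LKHSLRART")
for same, total, gaps in (imperfect, gapped):
    print(f"{same}/{total} identical ({100 * same / total:.0f}%), {gaps} gap columns")
```

Identity treats every mismatch alike (Ser/Thr counts the same as Gly/Trp) and says nothing about gaps, which is why similarity scores and gap penalties are needed.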

Need • Similarity scores for pairs of amino acids • Method for dealing with gaps • Algorithms for comparing a sequence with a database • Ways to assess the degree of homology • Ways to link structural info with sequence info

Dynamic Programming • Needleman-Wunsch Algorithm • Compares similarity of two proteins a & b at positions i & j:
NW_{i,j} = max( NW_{i-1,j-1} + s(a_i, b_j), NW_{i-1,j} + g, NW_{i,j-1} + g )
• NW_{i-1,j-1} = running total • s(a_i, b_j) = similarity between residue i of protein a and residue j of protein b • g = gap penalty
http://www.avatar.se/molbioinfo2001/dynprog/dynamic.html

Fill a Matrix with all possibilities • Simple example: s = 1 for a match, 0 for a mismatch, and g = 0
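The matrix fill can be sketched directly from the recurrence. This is a minimal illustration with a match/mismatch score standing in for s(a_i, b_j); the defaults follow the simple example (s = 1 or 0, g = 0), and the function name and parameters are assumptions made for this sketch. It returns the filled matrix only and does not trace back the alignment.

```python
def needleman_wunsch(a, b, match=1, mismatch=0, gap=0):
    """Fill the Needleman-Wunsch matrix; nw[i][j] scores a[:i] against b[:j]."""
    rows, cols = len(a) + 1, len(b) + 1
    nw = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        nw[i][0] = i * gap          # a[:i] aligned against nothing but gaps
    for j in range(1, cols):
        nw[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            nw[i][j] = max(nw[i - 1][j - 1] + s,   # align a_i with b_j
                           nw[i - 1][j] + gap,     # gap in b
                           nw[i][j - 1] + gap)     # gap in a
    return nw

matrix = needleman_wunsch("MGKARS", "MGRS")
for row in matrix:
    print(row)
print("global score:", matrix[-1][-1])
```

With s = 1, 0 and g = 0 the bottom-right cell simply counts the largest number of residues that can be matched in order; a negative gap penalty, for example `needleman_wunsch(a, b, 1, -1, -2)`, makes gaps cost something.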

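Smith-Waterman • Always compare the NW terms to zero so that the running score never drops below zero; a poor region then resets the alignment instead of dragging down a good local match.
NW_{i,j} = max( NW_{i-1,j-1} + s(a_i, b_j), NW_{i-1,j} + g, NW_{i,j-1} + g, 0 )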
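A minimal sketch of the local-alignment variant. Note that the zero floor only matters when mismatches and gaps score negatively, so the defaults here (match = 2, mismatch = -1, gap = -2) are illustrative assumptions rather than values from the slides, and the function name is hypothetical. The sketch reports the best local score without a traceback.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (no traceback)."""
    sw = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            sw[i][j] = max(sw[i - 1][j - 1] + s,
                           sw[i - 1][j] + gap,
                           sw[i][j - 1] + gap,
                           0)                 # restart the alignment here
            best = max(best, sw[i][j])
    return best

# A shared LKHS core flanked by unrelated residues still scores well locally.
print(smith_waterman_score("AAALKHSAAA", "GGGLKHSGGG"))
print(smith_waterman_score("MGKARSMLLKHSLKARS", "MGRALKHSLRART"))
```

The best score is taken over the whole matrix rather than from the bottom-right cell, because a local alignment may end anywhere.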

BLAST & FASTA • FASTA: great, but we won’t talk about it • much faster and more selective than SW, but less sensitive • BLAST: Basic Local Alignment Search Tool • less selective and more sensitive than FASTA, i.e., you may get more hits, but some of them may be wrong

BLAST • Divide sequence into “words” of length W (e.g., BLASTp, initial W = 3) • Compare all W-length words • Retain only pairs with similarity above a threshold, T • Call them High-Scoring Pairs (HSPs) • Increase W, repeat with HSPs • Keep going while remaining above a minimum similarity • Compare to random probability (E)
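The word-then-extend idea can be illustrated with a deliberately simplified sketch. It is not the BLAST implementation: it keeps only exact word matches and extends them while residues stay identical, whereas the threshold T described above is applied to similarity scores rather than exact identity, and no E value is computed. All function names here are hypothetical.

```python
from collections import defaultdict

def word_hits(query, subject, w=3):
    """(query_pos, subject_pos) for every exact shared word of length w."""
    index = defaultdict(list)
    for j in range(len(subject) - w + 1):
        index[subject[j:j + w]].append(j)
    return [(i, j)
            for i in range(len(query) - w + 1)
            for j in index.get(query[i:i + w], [])]

def extend_hit(query, subject, i, j, w=3):
    """Grow a word hit in both directions while residues stay identical."""
    start_i, start_j = i, j
    while start_i > 0 and start_j > 0 and query[start_i - 1] == subject[start_j - 1]:
        start_i, start_j = start_i - 1, start_j - 1
    end_i, end_j = i + w, j + w
    while end_i < len(query) and end_j < len(subject) and query[end_i] == subject[end_j]:
        end_i, end_j = end_i + 1, end_j + 1
    return query[start_i:end_i]

query, subject = "MGKARSMLLKHSLKARS", "MGRALKHSLRART"
hits = word_hits(query, subject)
print(hits)
print({extend_hit(query, subject, i, j) for i, j in hits})
```

Indexing the words first is what makes this kind of search fast: only positions that share a word are ever examined, instead of filling a full dynamic-programming matrix against every database sequence.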

Scoring Matrices - Making similarity quantitative • Compare the actual frequency to the frequency expected by chance alone. • Probability that alanine appears at position x in a protein • = fraction of Ala in all proteins • = p_Ala • Probability that one protein has Ala at position x, and another protein has Gly? • = p_Ala × p_Gly • This is the frequency due to chance alone.

Similarity • q_{Ala,Gly} = ACTUAL frequency that Ala and Gly are at position x in two proteins (in your database) • R_{i,j} = q_{i,j} / (p_i p_j) • Score: S_{i,j} = log2(R_{i,j}) = log2( q_{i,j} / (p_i p_j) ) • “Log-Odds Scores” • Remember Chou & Fasman?
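A short worked example of the log-odds score. The frequencies below are hypothetical round numbers (powers of two, chosen so the arithmetic is easy to follow), not values from any real database, and the function name is an assumption.

```python
import math

def log_odds(q_ij, p_i, p_j):
    """S = log2(q_ij / (p_i * p_j)): observed pair frequency vs. chance."""
    return math.log2(q_ij / (p_i * p_j))

p_ala, p_gly = 0.125, 0.0625      # hypothetical background frequencies
chance = p_ala * p_gly            # expected Ala/Gly pair frequency by chance

print(log_odds(2 * chance, p_ala, p_gly))    # pair seen twice as often as chance
print(log_odds(chance, p_ala, p_gly))        # pair seen exactly as often as chance
print(log_odds(chance / 4, p_ala, p_gly))    # pair seen four times less often
```

Positive scores mean the substitution is seen more often than chance predicts, zero means no preference, and negative scores mean it is avoided. Because the scores are logarithms, adding them along an alignment multiplies the underlying odds ratios.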

PAM Matrices • Margaret Dayhoff assembled the Atlas of Protein Sequence and Structure • Evolutionarily-accepted mutations • Calculated q_{i,j} for all aa’s in closely-related proteins • These were accepted by Nature as similar/close enough • Generate half matrices: Point Accepted Mutation/Percent Accepted Mutations • Scale, so PAM1 reflects 1 accepted mutation per 100 residues, PAM50 reflects 50 accepted mutations per 100 residues

BLOSUM • Henikoff and Henikoff • BLOcks SUbstitution Matrix (amino acid substitutions) • BLOCKS is a database of related proteins

BLAST Search • Go to BLAST Website • Enter nucleotide or AA sequence • Choose BLAST type • Nucleotide-nucleotide: BLASTn • Protein-protein: BLASTp • 6-frame-translated nucleotide vs. protein: BLASTx • others

Then? • Does it make sense? • Multisequence Alignment • Secondary structure prediction • Domains • Families

Caveat It ain't what you don't know that'll kill you, it's what you know that ain't so.
