Neural Networks for Protein Structure Prediction Brown, JMB 1999. CS 466 Saurabh Sinha. Outline. Goal is to predict “secondary structure” of a protein from its sequence Artificial Neural Network used for this task Evaluation of prediction accuracy. What is Protein Structure?.
 
                
Neural Networks for Protein Structure Prediction. Brown, JMB 1999. CS 466, Saurabh Sinha
Outline • Goal is to predict “secondary structure” of a protein from its sequence • Artificial Neural Network used for this task • Evaluation of prediction accuracy
http://academic.brooklyn.cuny.edu/biology/bio4fv/page/3d_prot.htm
http://matcmadison.edu/biotech/resources/proteins/labManual/images/220_04_114.png
Protein Structure • An amino acid sequence “folds” into a complex 3-D structure • Finding out this 3-D structure is a crucial and challenging task • Experimental methods (e.g., X-ray crystallography) are very tedious • Computational predictions are a possibility, but very difficult
“Strand” “Helix” http://www.wiley.com/college/pratt/0471393878/student/structure/secondary_structure/secondary_structure.gif
“Helix” “Strand” http://www.npaci.edu/features/00/Mar/protein.jpg
Secondary structure prediction • The whole 3-D “tertiary” protein structure may be hard to predict from sequence • But can we at least predict the secondary structural elements such as “strand”, “helix” or “coil”? • This is what this paper does • ... and so do many other papers (it is a hard problem!)
A survey of structure prediction • The most reliable technique is “comparative modeling” • Find a protein P whose amino acid sequence is very similar to your “target” protein T • Hope that this other protein P has a known structure • Predict a structure for T similar to that of P, after carefully considering how the sequences of P and T differ
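The "find a very similar protein P" step can be illustrated with a minimal sketch. It assumes the candidate sequences are already aligned to the target at equal length and uses plain percent identity as the similarity measure; the names `percent_identity`, `pick_template` and the toy sequences are hypothetical, and real comparative modeling relies on proper alignment tools.

```python
def percent_identity(a, b):
    """Fraction of aligned positions holding the same residue (gaps '-' never match)."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to the same length")
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return matches / len(a)


def pick_template(target, candidates):
    """Return (name, identity) of the candidate most similar to the target."""
    scored = ((name, percent_identity(target, seq)) for name, seq in candidates.items())
    return max(scored, key=lambda pair: pair[1])


# Hypothetical toy data: proteins with known structure, pre-aligned to the target.
known = {"P1": "MKTAYIGK", "P2": "MKTLYIGK", "P3": "GSSAYWWK"}
target = "MKTLYIAK"
name, identity = pick_template(target, known)
print(name, identity)
```

If even the best identity is low, there is no suitable template and comparative modeling is not an option for this target.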
A survey of structure prediction • Comparative modeling fails if we don’t have a suitable homologous “template” protein P for our protein T • “Ab initio” tertiary methods attempt to predict the structure without using a known protein structure as a template • They incorporate basic physical and chemical principles into the structure calculation • This gets very hairy, and is highly computationally intensive • The other option is prediction of secondary structure only (i.e., making the goal more modest) • These predictions may be used to provide constraints for tertiary structure prediction
Secondary structure prediction • Early methods were based on stereochemical principles • Later methods showed that we can do better if we use not only the one sequence T (our sequence), but also a family of “related sequences” • Search for sequences similar to T, build a multiple alignment of these, and predict secondary structure from the multiple alignment of sequences
What’s multiple alignment doing here? • The most conserved regions of a protein sequence are either functionally important or buried in the protein “core” • More variable regions are usually on the surface of the protein, where there are few constraints on what type of amino acids can occur (apart from a bias towards hydrophilic residues) • Multiple alignment tells us which portions are conserved and which are not
hydrophobic core http://bio.nagaokaut.ac.jp/~mbp-lab/img/hpc.png
What’s multiple alignment doing here? • Therefore, by looking at the multiple alignment, we could predict which residues are in the core of the protein and which are on the surface (“solvent accessibility”) • Secondary structure is then predicted by comparing the observed accessibility pattern with the patterns associated with helices, strands, etc. • This approach (Benner & Gerloff) is mostly manual • Today’s paper suggests an automated method
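One common way to turn "which portions are conserved" into numbers is the Shannon entropy of each alignment column. The sketch below uses that measure as an assumption; it is not taken from Benner & Gerloff or from the paper, and the four aligned sequences are made-up toy data.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the residues in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def conservation_profile(alignment):
    """Entropy of every column; 0.0 means the column is perfectly conserved."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    return [column_entropy([seq[i] for seq in alignment]) for i in range(length)]


# Hypothetical toy alignment: columns 0 and 4 are conserved, column 3 is variable.
alignment = ["LVKDA", "LVREA", "LIKNA", "LVQSA"]
entropies = conservation_profile(alignment)
print(entropies)
```

Low-entropy columns are candidates for the buried core (or functional sites); high-entropy columns are more likely to be on the surface.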
The PSIPRED algorithm • Given an amino-acid sequence, predict secondary structure elements in the protein • Three stages: • Generation of a sequence profile (the “multiple alignment” step) • Prediction of an initial secondary structure (the neural network step) • Filtering of the predicted structure (another neural network step)
Generation of sequence profile • A BLAST-like program called “PSI-BLAST” is used for this step • We saw BLAST earlier -- it is a fast way to find high scoring local alignments • PSI-BLAST is an iterative approach: • do an initial scan of a protein database using the target sequence T • align all matching sequences to construct a “sequence profile” • scan the database again using this new profile • Because the profile captures the whole family, later scans can also pick out and align distantly related protein sequences for our target sequence T
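The scan / build profile / rescan loop can be mimicked with a deliberately simplified toy. Everything here is an assumption made for illustration: sequences are ungapped and of equal length, the profile is a log-odds table against a uniform background with a pseudocount of 1, and a fixed score threshold stands in for PSI-BLAST's statistical significance test. It is not how PSI-BLAST itself is implemented.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(sequences, pseudocount=1.0):
    """One dict per position mapping each amino acid to a log-odds score."""
    profile = []
    for i in range(len(sequences[0])):
        column = [seq[i] for seq in sequences]
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log((column.count(aa) + pseudocount) / total * len(AMINO_ACIDS))
            for aa in AMINO_ACIDS
        })
    return profile


def profile_score(profile, seq):
    """Sum of the profile scores of the residues in seq (ungapped)."""
    return sum(row[aa] for row, aa in zip(profile, seq))


def iterative_search(target, database, threshold=2.0, max_rounds=5):
    """Rescan the database with a growing profile until the hit list stops changing."""
    hits = [target]
    for _ in range(max_rounds):
        profile = build_profile(hits)
        new_hits = [target] + [s for s in database if profile_score(profile, s) >= threshold]
        if new_hits == hits:
            break
        hits = new_hits
    return hits


# Hypothetical toy database: a close relative, a distant relative, an unrelated sequence.
target = "MKTLYIAK"
database = ["MKTLYIGK", "MRTVFIGR", "GSSAWWWP"]
hits = iterative_search(target, database)
print(hits)
```

In this toy run the distant relative scores below the threshold against the target alone, and is only picked up once the close relative has been folded into the profile.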
The sequence profile looks like this • Has 20 x M numbers, where M is the length of the sequence • Each number is the log likelihood of one residue (amino acid) at one position
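To make the "20 x M numbers" concrete, here is a minimal sketch that builds such a table from a small ungapped alignment. The log-odds formula, the uniform background and the pseudocount are assumptions chosen for simplicity; PSI-BLAST computes its position-specific scores in a more refined way. The table is stored as M rows of 20 numbers, in the (arbitrary) order given by `AMINO_ACIDS`.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # column order of the profile (an arbitrary choice)


def build_profile(alignment, pseudocount=1.0):
    """Return M rows of 20 log-odds scores, one row per alignment position."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background frequencies
    profile = []
    for i in range(len(alignment[0])):
        # Collect the residues in column i, skipping gaps or unknown characters.
        column = [seq[i] for seq in alignment if seq[i] in AMINO_ACIDS]
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        row = []
        for aa in AMINO_ACIDS:
            freq = (column.count(aa) + pseudocount) / total
            row.append(math.log(freq / background))
        profile.append(row)
    return profile


# Hypothetical toy alignment with M = 5 positions.
alignment = ["LVKDA", "LVREA", "LIKNA", "LVQSA"]
profile = build_profile(alignment)
print(len(profile), "positions x", len(profile[0]), "numbers")
```

A positive number means the residue is seen at that position more often than the background would suggest; a negative number means less often.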
Preparing for the second step • Feed the sequence profile to an artificial neural network • But before feeding, do a simple “scaling” to bring the numbers to the 0-1 scale
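The slide does not say which scaling is used. One simple choice that maps any real number into the 0-1 range is the logistic function 1 / (1 + e^-x); the sketch below assumes that choice, and the toy profile rows are shortened to three numbers for readability.

```python
import math


def logistic(x):
    """Map a real-valued score into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def scale_profile(profile):
    """Apply the logistic scaling to every number in the profile."""
    return [[logistic(value) for value in row] for row in profile]


# Hypothetical toy profile (rows truncated to 3 numbers instead of 20).
profile = [[-2.0, 0.0, 3.5], [1.2, -0.7, 0.0]]
scaled = scale_profile(profile)
print(scaled)
```

A score of 0 lands exactly at 0.5, strongly negative scores approach 0, and strongly positive scores approach 1.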
Intro to Neural nets (the second and third steps of PSIPRED)
Artificial Neural Network • Supervised learning algorithm • Works from training examples, each of which has a label • The label is the “class” of the example, e.g., “positive” or “negative”; in our case, “helix”, “strand”, or “coil” • Learns how to predict the class of an example
Artificial Neural Network • Directed graph • Nodes or “units” or “neurons” • Edges between units • Each edge has a weight (not known a priori)
Layered Architecture http://www.akri.org/cognition/images/annet2.gif Input here is a four-dimensional vector. Each dimension goes into one input unit
Layered Architecture http://www.geocomputation.org/2000/GC016/GC016_01.GIF (units)
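What a unit (neuron) does • Unit i receives a total input $x_i = \sum_j w_{ij} y_j + w_i$ from the units j connected to it, where $y_j$ is the output of unit j and $w_{ij}$ is the weight on the edge from j to i • It produces an output $y_i = f_i(x_i)$, where $f_i()$ is the “transfer function” of unit i • $w_i$ is called the “bias” of the unit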
Weights, bias and transfer function • Unit takes n inputs • Each input edge i has weight wi • Bias b • Output a • Transfer function f(): linear, sigmoidal, or other
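A single unit can be sketched in a few lines, assuming the usual form in which the output is the transfer function applied to the weighted sum of the inputs plus the bias. The function names are illustrative only.

```python
import math


def sigmoid(x):
    """Sigmoidal transfer function: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def linear(x):
    """Linear transfer function: passes the total input through unchanged."""
    return x


def unit_output(inputs, weights, bias, transfer=sigmoid):
    """Output a = f(sum_i w_i * x_i + b) for one unit with n inputs."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return transfer(total)


print(unit_output([1.0, 2.0], [0.5, -0.25], 0.0))          # sigmoidal unit
print(unit_output([1.0, 2.0], [3.0, 4.0], 1.0, linear))    # linear unit
```

Swapping the `transfer` argument is all it takes to change the unit from sigmoidal to linear.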
Weights, bias and transfer function • Weights wij and bias wi of each unit are “parameters” of the ANN. • Parameter values are learned from input data • Transfer function is usually the same for every unit in the same layer • Graphical architecture (connectivity) is decided by you. • Could use fully connected architecture: all units in one layer connect to all units in “next” layer
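A fully connected layered network is then just a stack of such units. The sketch below, with hypothetical layer sizes (4 inputs, 3 hidden units, 2 outputs) and random initial weights, shows a forward pass and counts the parameters that training would have to learn.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def make_layer(n_in, n_out, rng):
    """weights[i][j] is the weight from input j to unit i; one bias per unit."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def layer_forward(inputs, layer):
    """Every unit in the layer sees every input (fully connected)."""
    weights, biases = layer
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def count_parameters(layers):
    """Total number of weights and biases across all layers."""
    return sum(len(w) * len(w[0]) + len(b) for w, b in layers)


rng = random.Random(0)                      # fixed seed so the run is repeatable
net = [make_layer(4, 3, rng), make_layer(3, 2, rng)]
activations = [0.1, 0.2, 0.3, 0.4]          # a four-dimensional input vector
for layer in net:
    activations = layer_forward(activations, layer)
n_params = count_parameters(net)
print(activations, n_params)
```

Here the architecture (4-3-2, fully connected) is fixed by hand, while the 23 weights and biases are the parameters left for training.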
Where’s the algorithm? • It’s in the training of parameters! • Given several examples and their labels (the training data), search for parameter values such that the output units make correct predictions on the training examples • This search is done by the “back-propagation” algorithm • Read up more on neural nets if you are interested
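To give a feel for "searching for parameter values", here is gradient-descent training of a single sigmoid unit (the delta rule). This is only the one-layer special case: back-propagation extends the same idea to hidden layers by passing the error backwards through the network. The learning rate, epoch count and the logical-OR toy data are all assumptions for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict(inputs, weights, bias):
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)


def train_unit(examples, epochs=2000, rate=0.5):
    """Adjust weights and bias to reduce the error on the labelled examples."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in examples:
            # Gradient of the cross-entropy loss with respect to the unit's total input.
            error = predict(inputs, weights, bias) - target
            weights = [w - rate * error * x for w, x in zip(weights, inputs)]
            bias -= rate * error
    return weights, bias


# Toy training data: inputs with labels (logical OR).
examples = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
weights, bias = train_unit(examples)
for inputs, target in examples:
    print(inputs, target, round(predict(inputs, weights, bias), 3))
```

After training, the unit's output is close to the label on every training example, which is exactly what the search for parameter values is meant to achieve.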
Step 2 • Feed the sequence profile to the input layer of an ANN • Not the whole profile, only a window of 15 consecutive positions • For each position, there are 20 numbers in the profile (one for each amino acid) • Therefore ~ 15 x 20 = 300 numbers are fed in, so the ANN has ~ 300 “input units” • 3 output units, for “strand”, “helix”, “coil” • Each output number is the confidence in that secondary structure for the central position in the window of 15
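Building the ~300-number input for one position might look like the sketch below. How the ends of the sequence are handled is not stated on the slide; padding with zeros is an assumption made here, and the toy profile values are arbitrary.

```python
WINDOW = 15     # consecutive positions per window
N_AMINO = 20    # numbers per position in the profile


def window_input(profile, center, window=WINDOW):
    """Flatten the profile rows around `center` into one input vector.

    Positions that fall off either end of the sequence are filled with zeros
    (an assumption; the slide does not specify the padding).
    """
    half = window // 2
    width = len(profile[0])
    vector = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(profile):
            vector.extend(profile[pos])
        else:
            vector.extend([0.0] * width)
    return vector


# Hypothetical scaled profile for a sequence of M = 30 positions.
profile = [[((i + j) % 10) / 10 for j in range(N_AMINO)] for i in range(30)]
x = window_input(profile, 10)
print(len(x))
```

Sliding `center` from 0 to M-1 produces one input vector, and hence one 3-number prediction, per position of the sequence.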
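[Figure: the first network. A window of 15 positions feeds the input layer, which connects through a hidden layer to three output units labelled helix, strand and coil; example output values shown: 0.18, 0.09, 0.67]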
Step 3 • Feed the output of the 1st ANN to the 2nd ANN • Each window of 15 positions gave 3 numbers from the 1st ANN • Take the outputs of 15 successive windows and feed them to the 2nd ANN • Therefore, ~ 15 x 3 = 45 input units in the 2nd ANN • 3 output units, for “strand”, “helix”, “coil”
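The input to the second network can be assembled the same way as for the first, only from 3-number outputs instead of 20-number profile rows. In this sketch the (helix, strand, coil) ordering, the zero padding at the ends, and picking the largest output as the final call are all assumptions.

```python
def second_stage_input(first_outputs, center, window=15):
    """Concatenate the 1st-network outputs of `window` successive positions.

    first_outputs holds one (helix, strand, coil) triple per sequence position.
    Positions beyond either end are zero-padded (an assumption).
    """
    half = window // 2
    vector = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(first_outputs):
            vector.extend(first_outputs[pos])
        else:
            vector.extend([0.0, 0.0, 0.0])
    return vector


def call_state(scores, labels="HSC"):
    """Turn three confidences into a single letter by taking the largest one."""
    best = max(range(len(scores)), key=lambda k: scores[k])
    return labels[best]


# Hypothetical 1st-network outputs for a 40-residue sequence.
first_outputs = [(0.18, 0.09, 0.67)] * 40
x2 = second_stage_input(first_outputs, 20)
print(len(x2), call_state((0.18, 0.09, 0.67)))
```

Because the second network sees the predictions for neighbouring positions, it can smooth out isolated calls that the first network made in the middle of a helix or strand.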
Cross-validation • Partition the available examples (sequences with known secondary structure) into a “training set” (two thirds of the examples) and a “test set” (the remaining one third) • Train PSIPRED on the training set, then make predictions on the test set and compare them with the known answers • What is an answer? • For each position of the sequence, a prediction of what secondary structure that position is involved in • That is, a sequence over “H/S/C” (helix/strand/coil) • How to compare an answer with the known answer? • Count the number of positions that match
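Both parts of the evaluation, the two-thirds / one-third split and the position-by-position comparison, are short enough to sketch directly. The random shuffle with a fixed seed and the toy H/S/C strings are assumptions for illustration; the match count is reported as a fraction of all positions so that sequences of different lengths are comparable.

```python
import random


def split_examples(examples, train_fraction=2 / 3, seed=0):
    """Shuffle the examples and split them into a training set and a test set."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def fraction_correct(predicted, known):
    """Fraction of positions where the predicted H/S/C letter matches the known one."""
    if len(predicted) != len(known):
        raise ValueError("prediction and known answer must have the same length")
    matches = sum(1 for p, k in zip(predicted, known) if p == k)
    return matches / len(known)


train, test = split_examples(range(9))
accuracy = fraction_correct("HHHCCSSC", "HHCCCSSS")
print(len(train), len(test), accuracy)
```

In the toy comparison, 6 of the 8 positions agree, so the prediction scores 0.75.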