
Predicting protein function from heterogeneous data


Presentation Transcript


  1. Predicting protein function from heterogeneous data Prof. William Stafford Noble GENOME 541 Intro to Computational Molecular Biology

  2. One-minute response • Wasn’t sure what the relevance was of the dot product in the feature space. • I think more examples would be helpful. Anything visual is helpful. • Confused about how exactly to use kernel to determine separating plane. • More time on last three slides. • Confused on how the weights in the kernel function will be used in the final prediction. • Please include a toy example with numbers for the Bayesian network. • Also, a biologically relevant motivating example for the SVM. • This was the first time I understood the kernel thing. • I am not sure when to use the SVM versus other clustering approaches. • For the kernel trick, the main thing that is missing is the motivation for not enumerating the entire feature space. • The kernel discussion was hard to follow. More math background would have helped. • I was good with everything up to the “Kernel function as dot product” slide. I’m not sure what the purpose of phi is. • I liked the concrete examples. • I got a good feel for the big picture but failed to fully grasp everything about the kernel section. • I was distracted by jargon in parts of the lecture. Better to introduce the term “kernel” when it is first used. • Still a bit shaky on the weights and the SVM optimization. • What are some examples of other common methods that use kernels? • Draw bigger on the board. • Hope you go into why we choose certain k over others.

  3. Outline • Support vector machines • Diffusion / message passing

  4. Kernel function • The kernel function plays the role of the dot product operation in the feature space. • The mapping from input to feature space is implicit. • Using a kernel function avoids representing the feature space vectors explicitly. • Any continuous, positive semi-definite function can act as a kernel function. Proof of Mercer’s Theorem: Intro to SVMs by Cristianini and Shawe-Taylor, 2000, pp. 33-35.
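
To make the "dot product in feature space" idea concrete, here is a minimal Python sketch (not from the slides): for the degree-2 polynomial kernel, the value computed directly on the inputs equals an ordinary dot product after an explicit feature mapping phi, so phi never has to be computed in practice. For inputs of dimension d, the explicit feature space already has d^2 coordinates, which is the motivation for not enumerating it.

    import numpy as np

    def poly2_kernel(x, y):
        """Degree-2 polynomial kernel, computed directly on the input vectors."""
        return np.dot(x, y) ** 2

    def phi(x):
        """Explicit feature map for this kernel: all pairwise products x_i * x_j."""
        return np.outer(x, x).ravel()

    x = np.array([1.0, 2.0, 3.0])
    y = np.array([0.5, -1.0, 2.0])

    # The kernel value equals the dot product in the (much larger) feature space.
    print(poly2_kernel(x, y))        # 20.25
    print(np.dot(phi(x), phi(y)))    # 20.25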

  5. Learning gene classes • Diagram: a training set of 2465 genes × 79 experiments (Eisen et al.), with functional class labels from MYGD, is given to the learner, which produces a model; the model is then used as a predictor to assign classes to a test set of 3500 genes × 79 experiments (Eisen et al.).

  6. Class prediction

  7. SVM outperforms other methods

  8. Predictions of gene function Fleischer et al. “Systematic identification and functional screens of uncharacterized proteins associated with eukaryotic ribosomal complexes” Genes Dev, 2006.

  9. Overview • 218 human tumor samples spanning 14 common tumor types • 90 normal samples • 16,063 “genes” measured per sample • Overall SVM classification accuracy: 78%. • Random classification accuracy: 1/14 ≈ 7%.

  10. Summary: Support vector machine learning • The SVM learning algorithm finds a linear decision boundary. • The hyperplane maximizes the margin, i.e., the distance to the nearest training example. • The optimization is convex; the solution is sparse. • A soft margin allows for noise in the training set. • A complex decision surface can be learned by using a non-linear kernel function.
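
As an illustration of these points (a sketch using scikit-learn, not part of the original slides), the following code trains a soft-margin SVM with a non-linear (RBF) kernel on toy data that is not linearly separable; C controls the softness of the margin, and only the support vectors determine the learned boundary.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)

    # Toy two-class data: points inside vs. outside the unit circle (not linearly separable).
    X = rng.normal(size=(200, 2))
    y = (np.sum(X ** 2, axis=1) > 1.0).astype(int)

    # Soft margin (C) tolerates noisy examples; the RBF kernel gives a non-linear boundary.
    clf = SVC(kernel="rbf", C=1.0, gamma="scale")
    clf.fit(X, y)

    # Sparsity: only a subset of the training examples ends up as support vectors.
    print("support vectors:", len(clf.support_), "of", len(X), "training examples")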

  11. Cost/Benefits of SVMs • SVMs perform well on high-dimensional data sets with few examples. • Convex optimization implies that you get the same answer every time. • Kernel functions allow encoding of prior knowledge. • Kernel functions handle arbitrary data types. • The hyperplane does not provide a good explanation, especially with a non-linear kernel function.

  12. Vector representation • Each matrix entry is an mRNA expression measurement. • Each column is an experiment. • Each row corresponds to a gene.

  13. Similarity measurement • Normalized scalar product: the scalar product of two expression vectors, divided by the product of their norms. • Similar vectors receive high values; dissimilar vectors receive low values. (Figure: a similar pair and a dissimilar pair of profiles.)
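
A minimal sketch of the normalized scalar product as defined above (the expression values are made up):

    import numpy as np

    def normalized_scalar_product(x, y):
        """Scalar product divided by the product of the vector norms."""
        return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

    gene_a = np.array([2.1, -0.5, 1.3, 0.7])    # expression across 4 experiments
    gene_b = np.array([1.9, -0.4, 1.5, 0.6])    # similar profile -> value near +1
    gene_c = np.array([-2.0, 0.6, -1.2, -0.8])  # dissimilar profile -> value near -1

    print(normalized_scalar_product(gene_a, gene_b))
    print(normalized_scalar_product(gene_a, gene_c))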

  14. Kernel matrix

  15. Sequence kernels • We cannot compute a scalar product on a pair of variable-length, discrete strings.
>ICYA_MANSE
GDIFYPGYCPDVKPVNDFDLSAFAGAWHEIAKLPLENENQGKCTIAEYKY
DGKKASVYNSFVSNGVKEYMEGDLEIAPDAKYTKQGKYVMTFKFGQRVVN
LVPWVLATDYKNYAINYNCDYHPDKKAHSIHAWILSKSKVLEGNTKEVVD
NVLKTFSHLIDASKFISNDFSEAACQYSTTYSLTGPDRH
>LACB_BOVIN
MKCLLLALALTCGAQALIVTQTMKGLDIQKVAGTWYSLAMAASDISLLDA
QSAPLRVYVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTKIPAVFKI
DALNENKVLVLDTDYKKYLLFCMENSAEPEQSLACQCLVRTPEVDDEALE
KFDKALKALPMHIRLSFNPTQLEEQCHI

  16. Pairwise comparison kernel

  17. Pairwise comparison kernel
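
The slides do not spell out the construction, but one standard way to build a pairwise comparison kernel is the empirical kernel map: represent each sequence as its vector of comparison scores against a fixed reference set, then take scalar products of those vectors. The sketch below follows that assumption; pairwise_score is a hypothetical stand-in for a real alignment score such as Smith-Waterman.

    import numpy as np

    def pairwise_score(seq_a, seq_b):
        """Hypothetical placeholder for a real pairwise comparison score
        (e.g., a Smith-Waterman or BLAST score); here, shared 3-mers, so the sketch runs."""
        kmers = lambda s: {s[i:i + 3] for i in range(len(s) - 2)}
        return len(kmers(seq_a) & kmers(seq_b))

    def score_vector(seq, reference_seqs):
        """Represent one sequence by its scores against a fixed reference set."""
        return np.array([pairwise_score(seq, r) for r in reference_seqs], dtype=float)

    def pairwise_comparison_kernel(seq_x, seq_y, reference_seqs):
        """Kernel value = scalar product of the two score vectors."""
        return float(np.dot(score_vector(seq_x, reference_seqs),
                            score_vector(seq_y, reference_seqs)))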

  18. Protein-protein interactions • Pairwise interactions can be represented as a graph or as a binary matrix (figure: an example interaction graph and its 0/1 interaction matrix).

  19. Linear interaction kernel • The simplest kernel counts the number of interaction partners shared by each pair of proteins, i.e., the scalar product of their rows in the interaction matrix (figure: two interaction vectors whose scalar product is 3).
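
A small sketch of this idea (the 0/1 matrix below is made up): each protein is a row of the interaction matrix, and the kernel is the matrix of scalar products between rows.

    import numpy as np

    # Rows = proteins, columns = potential interaction partners (1 = interacts).
    A = np.array([
        [1, 0, 0, 1, 0, 1, 0, 1],
        [1, 0, 1, 0, 1, 1, 0, 1],
        [0, 0, 0, 0, 1, 1, 0, 0],
    ], dtype=float)

    # Linear interaction kernel: K[i, j] = number of partners shared by proteins i and j.
    K = A @ A.T
    print(K)          # e.g., K[0, 1] == 3: proteins 0 and 1 share three partners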

  20. Diffusion kernel • A general method for establishing similarities between nodes of a graph. • Based upon a random walk. • Efficiently accounts for all paths connecting two nodes, weighted by path lengths.
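
A minimal sketch of a diffusion kernel in the matrix-exponential form of Kondor and Lafferty (the small graph and the value of beta are illustrative):

    import numpy as np
    from scipy.linalg import expm

    # Adjacency matrix of a small undirected interaction graph.
    A = np.array([
        [0, 1, 1, 0],
        [1, 0, 1, 0],
        [1, 1, 0, 1],
        [0, 0, 1, 0],
    ], dtype=float)

    # Negative graph Laplacian: H = A - D, with D the diagonal degree matrix.
    H = A - np.diag(A.sum(axis=1))

    # Diffusion kernel K = exp(beta * H): sums over all paths between two nodes,
    # down-weighting longer paths; beta controls how far the random walk diffuses.
    beta = 1.0
    K = expm(beta * H)
    print(np.round(K, 3))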

  21. Hydrophobicity profile • Transmembrane regions are typically hydrophobic, and conversely, strongly hydrophobic stretches often correspond to transmembrane regions. • The hydrophobicity profile of a membrane protein is evolutionarily conserved. (Figure: hydrophobicity profiles of a membrane protein and a non-membrane protein.)

  22. Hydrophobicity kernel • Generate hydropathy profile from amino acid sequence using Kyte-Doolittle index. • Prefilter the profiles. • Compare two profiles by • Computing fast Fourier transform (FFT), and • Applying Gaussian kernel function. • This kernel detects periodicities in the hydrophobicity profile.
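
A rough sketch of this pipeline (the hydropathy values are the published Kyte-Doolittle scale; the smoothing window, FFT length, and Gaussian width are illustrative assumptions, not the lecture's exact choices):

    import numpy as np

    # Kyte-Doolittle hydropathy index (Kyte & Doolittle, 1982).
    KD = {"I": 4.5, "V": 4.2, "L": 3.8, "F": 2.8, "C": 2.5, "M": 1.9, "A": 1.8,
          "G": -0.4, "T": -0.7, "S": -0.8, "W": -0.9, "Y": -1.3, "P": -1.6,
          "H": -3.2, "E": -3.5, "Q": -3.5, "D": -3.5, "N": -3.5, "K": -3.9, "R": -4.5}

    def hydropathy_profile(seq, window=7):
        """Map a sequence to its hydropathy profile, smoothed with a moving average (the prefilter)."""
        raw = np.array([KD[a] for a in seq])
        return np.convolve(raw, np.ones(window) / window, mode="valid")

    def fft_features(profile, n=64):
        """Magnitude spectrum of the profile, which captures its periodicities."""
        return np.abs(np.fft.rfft(profile, n=n))

    def hydrophobicity_kernel(seq_a, seq_b, sigma=10.0):
        """Gaussian (RBF) kernel applied to the two FFT magnitude spectra."""
        fa = fft_features(hydropathy_profile(seq_a))
        fb = fft_features(hydropathy_profile(seq_b))
        return float(np.exp(-np.sum((fa - fb) ** 2) / (2 * sigma ** 2)))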

  23. Combining kernels • Concatenating two feature representations, A and B, into A:B is equivalent to adding the corresponding kernel matrices: K(A:B) is identical to K(A) + K(B) (figure: schematic of the two constructions).
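
A quick numeric check of this identity, assuming linear kernels on the two feature sets: the kernel computed on the concatenated features equals the sum of the two separate kernels.

    import numpy as np

    rng = np.random.default_rng(1)
    A = rng.normal(size=(5, 10))   # 5 genes described by feature set A
    B = rng.normal(size=(5, 4))    # the same 5 genes described by feature set B

    K_A = A @ A.T
    K_B = B @ B.T
    K_concat = np.hstack([A, B]) @ np.hstack([A, B]).T   # kernel of concatenated features

    print(np.allclose(K_concat, K_A + K_B))  # True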

  24. Semidefinite programming • Define a convex cost function to assess the quality of a kernel matrix. • Semidefinite programming (SDP) optimizes convex cost functions over the convex cone of positive semidefinite matrices.

  25. Semidefinite programming • Integrate the constructed kernels by learning a linear mix: learn K from the convex cone of positive semidefinite matrices, or a convex subset of it, according to a convex quality measure. Here the quality measure is the margin of a large-margin classifier (SVM), so maximizing the margin yields both the kernel weights and the classifier; the optimization is an SDP.

  26. Integrate constructed kernels • Learn a linear mix of the kernels. • Train a large-margin classifier (SVM). • Maximize the margin.
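
A simplified sketch of the final step only (not the SDP itself): once a weight has been chosen for each kernel, the mixed kernel matrix can be handed to an SVM as a precomputed kernel. The weights below are fixed by hand; the SDP approach would instead learn them while maximizing the margin.

    import numpy as np
    from sklearn.svm import SVC

    def mix_kernels(kernels, weights):
        """Weighted sum of precomputed kernel matrices."""
        return sum(w * K for w, K in zip(weights, kernels))

    # Two toy kernel matrices over the same 6 genes (stand-ins for real data kernels).
    rng = np.random.default_rng(2)
    X1, X2 = rng.normal(size=(6, 8)), rng.normal(size=(6, 3))
    K1, K2 = X1 @ X1.T, X2 @ X2.T

    K = mix_kernels([K1, K2], weights=[0.7, 0.3])
    y = np.array([1, 1, 1, 0, 0, 0])

    clf = SVC(kernel="precomputed", C=1.0)
    clf.fit(K, y)                 # train on the mixed kernel matrix
    print(clf.predict(K[:2]))     # rows of K between test genes and training genes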

  27. Markov Random Field • General Bayesian method, applied by Deng et al. to yeast functional classification. • Used five different types of data. • For their model, the input data must be binary. • Reported improved accuracy compared to using any single data type.

  28. Yeast functional classes

  29. Six types of data • Presence of Pfam domains. • Genetic interactions from CYGD. • Physical interactions from CYGD. • Protein-protein interaction by TAP. • mRNA expression profiles. • (Smith-Waterman scores).

  30. Results • (Figure: classification results comparing the MRF, SDP/SVM with binary kernels, and SDP/SVM with enriched kernels.)

  31. Pros and cons • Learns relevance of data sets with respect to the problem at hand. • Accounts for redundancy among data sets, as well as noise and relevance. • Discriminative approach yields good performance. • Kernel-by-kernel weighting is simplistic. • In most cases, unweighted kernel combination works fine. • Does not provide a good explanation.

  32. Top performing methods

  33. GeneMANIA • Normalize each network (divide each element by the square root of the product of its row sum and column sum). • Learn a weight for each network via ridge regression; essentially, learn how informative the network is with respect to the task at hand. • Sum the weighted networks. • Assign labels to the nodes: +1 for positives, -1 for negatives, and the mean label (n+ − n−)/(n+ + n−) for unlabeled genes. • Perform label propagation in the combined network. Mostafavi et al. Genome Biology. 9:S4, 2008.
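
A condensed sketch of these steps (the ridge-regression step is simplified to a regularized least-squares fit of the networks against the target label matrix, and all data are made up; see Mostafavi et al. for the exact formulation):

    import numpy as np

    def normalize(W):
        """Divide each element by the square root of the product of its row and column sums."""
        return W / np.sqrt(np.outer(W.sum(axis=1), W.sum(axis=0)))

    def network_weights(networks, y, ridge=1.0):
        """Simplified ridge regression of the target matrix y y^T on the networks."""
        target = np.outer(y, y).ravel()
        X = np.column_stack([W.ravel() for W in networks])
        w = np.linalg.solve(X.T @ X + ridge * np.eye(X.shape[1]), X.T @ target)
        return np.clip(w, 0.0, None)          # keep the network weights non-negative

    def label_propagation(W, y_init, alpha=0.95, iters=50):
        """Repeatedly blend each node's initial label with its neighbors' current scores."""
        f = y_init.copy()
        for _ in range(iters):
            f = (1 - alpha) * y_init + alpha * (W @ f)
        return f

    # Toy usage: two random networks over 5 genes, 2 positives and 1 negative labeled.
    rng = np.random.default_rng(3)
    def random_network(n=5):
        M = np.abs(rng.normal(size=(n, n)))
        return normalize((M + M.T) / 2)

    nets = [random_network() for _ in range(2)]
    y = np.array([1.0, 1.0, -1.0, 0.0, 0.0])
    y[y == 0] = (2 - 1) / (2 + 1)             # mean label for the unlabeled genes

    w = network_weights(nets, y)
    if w.sum() == 0:                          # guard for this toy example
        w = np.ones_like(w)
    W = normalize(sum(wi * Wi for wi, Wi in zip(w, nets)))
    print(np.round(label_propagation(W, y), 3))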

  34. Toy example • Label propagation on a small network, with α = 0.95. Round 0: each node starts from its initial label, and each edge carries a weight (figure: node scores and edge weights on the graph).

  35.–39. Toy example (continued) • In each round, a node's score is recomputed from its initial label plus α times each neighbor's score weighted by the connecting edge. For example, one node is updated to 0.2 + (0.95 × 0.8 × 0.9) = 0.884, and another to 0 + (0.95 × 0.2 × 0.7) = 0.133 (figure: node scores after Rounds 1 and 2).
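
To connect the arithmetic above to an update rule, here is a tiny sketch that reproduces the two highlighted values, under the assumption that a node's new score is its initial label plus α times each neighbor's score weighted by the connecting edge:

    # Assumed update: new_score = initial_label + alpha * sum(edge_weight * neighbor_score)
    alpha = 0.95

    def update(initial_label, weighted_neighbors):
        return initial_label + alpha * sum(w * score for w, score in weighted_neighbors)

    print(update(0.2, [(0.8, 0.9)]))   # 0.884
    print(update(0.0, [(0.2, 0.7)]))   # 0.133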
