This study presents a regression-based K-nearest neighbors (KNN) approach for predicting gene functions from heterogeneous data sources, including microarray expression, protein interactions, and evolutionary history. The method learns a similarity metric by estimating the likelihood that a pair of genes belongs to the same functional class, and then combines the neighbors' votes through a probabilistic voting scheme that attaches a confidence score to each prediction. The findings indicate that combining all predictors yields the best results, and that on expression data alone the regression-based KNN outperforms an SVM for gene classification across multiple functional categories.
Regression-based KNN for gene function prediction using heterogeneous data sources
Zizhen Yao, Larry Ruzzo
{yzizhen, ruzzo}@cs.washington.edu
Background
• E. coli classification schemes: KEGG, COG, MultiFun
  • Common functional classes (10-19 classes): Metabolism, Translation, Transporter, Cell Motility
• Biological information used for inference: microarray expression, protein interaction, evolutionary history
• Methods: support vector machine, Bayesian, rule-based
Introduction to KNN
• Idea - for each query instance:
  • Choose the k nearest neighbors
  • Assign the class that receives the majority of the neighbors' votes (a minimal sketch follows below)
• Design issues:
  • Similarity / distance metric
  • Voting scheme
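The sketch below illustrates only the basic KNN idea from this slide, not the authors' implementation; the `similarity` function and the choice of k are placeholders for the design issues listed above.

```python
# Minimal KNN sketch (illustrative only, not the authors' implementation).
# "similarity" can be any function scoring how alike two instances are;
# k and the voting rule are the main design choices named on the slide.
from collections import Counter

def knn_predict(query, training_data, similarity, k=5):
    """Return the majority-vote class among the k most similar training instances.

    training_data: list of (instance, class_label) pairs.
    """
    scored = sorted(training_data,
                    key=lambda pair: similarity(query, pair[0]),
                    reverse=True)
    neighbors = scored[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```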
Algorithm Flow Chart
• Training: for every pair of training genes, calculate the predictors; learn the similarity metric.
• Testing: calculate the predictor values between the test gene and the training data; choose the k nearest neighbors; vote; output a list of predictions with confidence scores.
Predictors
• Microarray expression data:
  • Expression correlation
• Sequencing data:
  • Chromosomal position
  • Chromosomal distance
  • Transcription direction
  • Block indicator
  • Protein sequence similarity
  • Paralog indicator
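As a hedged illustration of how two of these predictors might be computed, the snippet below derives expression correlation from a pair of expression profiles and a chromosomal distance on a circular genome; the exact definitions used by the authors are not given on the slide, so these are assumptions, and the remaining predictors would be computed analogously from the sequence data.

```python
# Illustrative predictor computations (assumed forms, not the authors' code).
import numpy as np

def expression_correlation(expr_a, expr_b):
    """Pearson correlation between two microarray expression profiles."""
    return float(np.corrcoef(expr_a, expr_b)[0, 1])

def chromosomal_distance(pos_a, pos_b, genome_length):
    """Shortest distance between two gene positions on a circular chromosome."""
    d = abs(pos_a - pos_b)
    return min(d, genome_length - d)
```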
Similarity (Distance) Metric
• Classical metrics are not appropriate because the predictors:
  • have heterogeneous data types and scales
  • differ in relevance
  • are correlated with one another
• Goal: estimate the likelihood that a pair of genes is in the same class, based on the predictors
Learning Similarity Metric
• Regression methods
  • Response: whether a pair of training genes belongs to the same functional class
  • Find f mapping the predictor values of a gene pair to the probability of a shared class
• Logistic regression
• Local regression
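A minimal sketch of the logistic-regression variant, under the assumption that each gene pair is featurized by the predictors above: `X_pairs` has one row of predictor values per training pair and `y_same_class` is 1 when the pair shares a functional class. The fitted model's probability output then serves as the learned similarity between two genes; the local-regression variant is not shown.

```python
# Sketch only: learning a similarity metric via logistic regression on gene pairs.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_similarity(X_pairs, y_same_class):
    """Fit P(same functional class | predictor values of the pair)."""
    model = LogisticRegression(max_iter=1000)
    model.fit(X_pairs, y_same_class)
    return model

def similarity(model, pair_features):
    """Estimated probability that the two genes are in the same class."""
    return model.predict_proba(np.asarray(pair_features).reshape(1, -1))[0, 1]
```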
Probabilistic voting scheme
• Goal: estimate the probability that the query gene belongs to each class
• Range: [0, 1]
• Assigns a higher confidence score to predictions voted for by more neighbors, or by neighbors with higher credibility
• Reports predictions that are above a chosen threshold value
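The slide does not spell out the exact voting formula, so the following is only one plausible scheme consistent with the stated properties: each neighbor contributes its similarity probability to every class it belongs to, the class confidence is the probability that at least one neighbor is a true same-class neighbor, and only predictions above a threshold are reported.

```python
# Hypothetical voting scheme (assumed, not taken from the slides).
from collections import defaultdict

def vote(neighbors, threshold=0.5):
    """neighbors: list of (similarity_prob, set_of_class_labels) for the k nearest neighbors."""
    not_same = defaultdict(lambda: 1.0)          # running product of (1 - p) per class
    for p, classes in neighbors:
        for c in classes:
            not_same[c] *= (1.0 - p)
    scores = {c: 1.0 - prod for c, prod in not_same.items()}
    # Report only classes whose confidence exceeds the threshold.
    return {c: s for c, s in scores.items() if s >= threshold}
```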
Results Summary
• Combining all four predictors yields the best result.
• Using expression data only, the regression-based KNN methods outperform SVM.
• Performance varies across function classes.
• Confidence scores are strongly correlated with accuracy.
Contribution
• KNN: simplicity, efficiency, flexibility; results are easy to interpret and useful for guiding case studies
• Similarity metric: integrates heterogeneous data sources
• Voting scheme: statistical inference
• A general framework to incorporate other information