
Regression based KNN for gene function prediction using heterogeneous data sources

Zizhen Yao, Larry Ruzzo ({yzizhen, ruzzo}@cs.washington.edu)


Presentation Transcript


  1. Regression based KNN for gene function prediction using heterogeneous data sources
     Zizhen Yao, Larry Ruzzo ({yzizhen, ruzzo}@cs.washington.edu)

  2. Background
     • E. coli classification schemes
       • KEGG, COG, MultiFun
       • Common functional classes (10-19 classes)
         • Metabolism, Translation, Transporter, Cell Motility
     • Biological information used for inference
       • Microarray expression, protein interaction, evolutionary history
     • Methods
       • Support vector machine, Bayesian, Rule-based

  3. Introduction to KNN
     • Idea: for each query instance
       • Choose the k nearest neighbors
       • Choose the class voted for by the majority of the neighbors
     • Design issues
       • Similarity / distance metric
       • Voting schemes
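
The basic idea on this slide can be sketched in a few lines of Python. This is a minimal illustration of plain majority-vote KNN, not the method developed in the talk: the Euclidean distance, the two-dimensional feature vectors, and the toy labels are all assumptions made for the example.

```python
import math
from collections import Counter


def knn_majority_vote(query, training, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `training` is a list of (feature_vector, label) pairs.
    Euclidean distance is an assumption made for this sketch.
    """
    # Sort training points by distance to the query and keep the k closest.
    neighbors = sorted(training, key=lambda item: math.dist(query, item[0]))[:k]
    # Count the labels among the neighbors and return the most common one.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): two classes in a 2-D feature space.
training = [
    ((0.0, 0.0), "Metabolism"),
    ((0.1, 0.2), "Metabolism"),
    ((1.0, 1.0), "Translation"),
    ((0.9, 1.1), "Translation"),
    ((1.2, 0.8), "Translation"),
]
prediction = knn_majority_vote((0.2, 0.1), training, k=3)
```

The two design issues listed on the slide correspond to the two replaceable pieces here: the `key` function (the metric) and the `Counter` step (the voting scheme).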

  4. Algorithm Flow Chart
     • Training (input: training data)
       • For every pair of training genes, calculate the predictors.
       • Learn the similarity metric.
     • Testing (input: testing data)
       • Calculate the predictor values using the testing and training data.
       • Choose the k nearest neighbors.
       • Voting.
     • Output: a list of predictions with confidence scores.
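
The neighbor-selection step of the testing branch might look like the sketch below: score the query gene against every training gene, then keep the k highest-scoring ones. The `similarity` callable stands in for the learned metric; `toy_similarity`, the `x` field, and the gene names are hypothetical and exist only to make the example runnable.

```python
import heapq


def k_nearest(query, training_genes, similarity, k):
    """Return the k (score, gene) pairs with the highest similarity to `query`.

    `similarity` is any callable taking two genes and returning a number;
    in the flow chart it would be the metric learned during training.
    """
    scored = ((similarity(query, gene), gene) for gene in training_genes)
    # nlargest returns the results sorted from most to least similar.
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])


def toy_similarity(a, b):
    # Hypothetical stand-in: similarity decays with the gap in one feature.
    return 1.0 / (1.0 + abs(a["x"] - b["x"]))


training_genes = [
    {"name": "geneA", "x": 0.1},
    {"name": "geneB", "x": 0.5},
    {"name": "geneC", "x": 0.9},
]
query_gene = {"name": "query", "x": 0.2}
top_two = k_nearest(query_gene, training_genes, toy_similarity, k=2)
top_names = [gene["name"] for _, gene in top_two]
```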

  5. Predictors
     • Microarray Expression Data
       • Expression correlation
     • Sequencing Data
       • Chromosomal position
       • Chromosomal distance
       • Transcription direction
       • Block indicator
       • Protein sequence similarity
       • Paralog indicator
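
As an illustration of how pairwise predictors of this kind could be computed, the sketch below derives three of them for a pair of genes. The slide does not define the predictors precisely, so everything here is an assumption: Pearson correlation for expression, shortest-arc distance on a circular chromosome, and a same-strand flag for transcription direction. The record fields (`expression`, `position`, `strand`) and the genome length are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a flat profile carries no correlation signal
    return cov / (sd_x * sd_y)


def circular_distance(pos_a, pos_b, genome_length):
    """Shortest distance between two positions on a circular chromosome."""
    d = abs(pos_a - pos_b) % genome_length
    return min(d, genome_length - d)


def pair_predictors(gene_a, gene_b, genome_length):
    """Compute a few pairwise predictors for two gene records (dicts)."""
    return {
        "expression_correlation": pearson(gene_a["expression"], gene_b["expression"]),
        "chromosomal_distance": circular_distance(
            gene_a["position"], gene_b["position"], genome_length
        ),
        "same_direction": gene_a["strand"] == gene_b["strand"],
    }


# Hypothetical records; the genome length is an illustrative value.
GENOME_LENGTH = 4_639_221
gene_a = {"expression": [1.0, 2.0, 3.0, 4.0], "position": 100, "strand": "+"}
gene_b = {"expression": [2.0, 4.0, 6.0, 8.0], "position": 4_600_000, "strand": "-"}
predictors = pair_predictors(gene_a, gene_b, GENOME_LENGTH)
```

The output mixes a correlation in [-1, 1], a distance in base pairs, and a boolean, so the predictors differ in both type and scale.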

  6. Similarity (Distance) Metric
     • Classical metrics are not appropriate because the predictors
       • have heterogeneous data types and scales
       • differ in relevance
       • are correlated
     • Goal: estimate the likelihood that a pair of genes is in the same class, based on the predictors

  7. Learning Similarity Metric
     • Regression methods
       • Response
       • Find f
       • Logistic regression
       • Local regression
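
One way to read this slide is as a binary regression problem over gene pairs: the response is 1 when two training genes share a class and 0 otherwise, and the fitted function f maps a pair's predictor values to a similarity between 0 and 1. That reading is an assumption, since the transcript does not preserve the slide's formulas. The sketch below fits a logistic regression by batch gradient descent in plain Python. The toy data, learning rate, and epoch count are made up for the example, and local regression is not shown.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit logistic regression weights by batch gradient descent.

    X: list of predictor tuples, one per gene pair.
    y: list of 0/1 responses (assumed: 1 if the pair shares a class).
    """
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            # Prediction error for this pair under the current weights.
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def similarity(w, b, x):
    """Estimated probability that a pair with predictors x shares a class."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))


# Toy pairs (hypothetical): (expression correlation, paralog indicator).
X = [(0.9, 1), (0.8, 0), (0.7, 1), (0.2, 0), (0.1, 0), (-0.3, 0)]
y = [1, 1, 1, 0, 0, 0]
w, b = fit_logistic(X, y)
sim_high = similarity(w, b, (0.85, 1))
sim_low = similarity(w, b, (0.0, 0))
```

Because the fitted weights absorb differences in scale and relevance among the predictors, the resulting score can be used where a hand-chosen distance would otherwise go.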

  8. Probabilistic voting scheme
     • Goal: estimate the probability that the query gene belongs to each class.
     • Range: [0, 1]
     • Assigns a higher confidence score to predictions voted for by more neighbors, or by neighbors with higher credibility.
     • Reports predictions that are above a certain threshold value.
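
The transcript does not give the voting formula, so the following is only one simple scheme with the properties listed on this slide, not the authors' method. Each neighbor votes with a weight equal to its similarity to the query, the per-class totals are normalized to lie in [0, 1], and only classes at or above a threshold are reported. The function name, the threshold value, and the example neighbors are all hypothetical.

```python
from collections import defaultdict


def weighted_vote(neighbors, threshold=0.5):
    """Turn (similarity, label) neighbor pairs into thresholded class scores.

    Returns a list of (label, score) pairs, highest score first,
    keeping only classes whose normalized score is >= threshold.
    """
    totals = defaultdict(float)
    for sim, label in neighbors:
        totals[label] += sim  # more similar neighbors carry more weight
    total = sum(totals.values())
    if total == 0:
        return []  # no usable evidence from the neighbors
    scores = {label: weight / total for label, weight in totals.items()}
    kept = [(label, score) for label, score in scores.items() if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical neighbors of one query gene: (similarity, class label).
neighbors = [(0.9, "Transporter"), (0.8, "Transporter"), (0.3, "Metabolism")]
predictions = weighted_vote(neighbors, threshold=0.5)
```

In this scheme a class backed by several highly similar neighbors scores near 1, while a class backed by a single weak neighbor falls below the threshold and is not reported.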

  9. Performance comparison

  10. Functional Classes ROC analysis (KEGG)

  11. Confidence Score vs. Accuracy

  12. Results Summary
      • Combining all 4 predictors yields the best result.
      • Using expression data only, regression based KNN methods outperform SVM.
      • Performance varies across functional classes.
      • Confidence scores are strongly correlated with accuracy.

  13. Contribution
      • KNN
        • Simplicity, efficiency, flexibility
        • Easy to interpret the results, useful to guide case studies
      • Similarity metric
        • Integrates heterogeneous data sources
      • Voting scheme
        • Statistical inference
      • A general framework to incorporate other information.
