Machine Learning for Functional Genomics II

Presentation Transcript

  1. Machine Learning for Functional Genomics II. Matt Hibbs, http://cbfg.jax.org

  2. Functional Genomics: identify the roles played by genes/proteins (Sealfon et al., 2006).

  3. Promise of Computational Functional Genomics [Diagram labels: Data & Existing Knowledge; Laboratory Experiments; Computational Approaches; Predictions]

  4. Computational Solutions
     • Machine learning & data mining
       • Use existing data to make new predictions
       • Similarity search algorithms
       • Bayesian networks
       • Support vector machines
       • etc.
       • Validate predictions with follow-up lab work
     • Visualization & exploratory analysis
       • Seeing and interacting with data is important
       • Show data so that questions can be answered
       • Scalability, incorporate statistics, etc.


  6. Bayesian Networks: encode dependence relationships between observed and unobserved events [Example network nodes: Raining?; Jim brought umbrella; Cloudy this morning; Rain in forecast]

  7. Bayesian Network Overview • Graphical representation of relationships • Probabilistic information from data to concepts


  9. Bayesian Network Overview
     • Query: P(FR | CE, AP, Y2H)
     • P(FR | CE=yes, AP=yes, Y2H=yes) = α P(FR) P(CE=yes | FR) Σ_PI P(PI | FR) P(AP=yes | PI) P(Y2H=yes | PI)
     • Bayes’ Rule: P(A | B) ∝ P(A) P(B | A)
     • P(FR=yes) + P(FR=no) = 0.0105α + 0.0216α = 1
     • → P(FR=yes) = 0.0105 / (0.0105 + 0.0216) ≈ 0.327 (up from 0.10)
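The normalization on this slide can be checked numerically. The following is a minimal sketch that takes the two unnormalized values from the slide as given; the function name `normalize_posterior` is hypothetical and not from the presentation.

```python
def normalize_posterior(unnorm_yes: float, unnorm_no: float) -> float:
    """Return P(FR=yes | evidence) by dividing out the shared constant alpha."""
    total = unnorm_yes + unnorm_no  # this sum equals 1 / alpha
    return unnorm_yes / total


# Unnormalized values read off the slide: 0.0105*alpha and 0.0216*alpha
posterior_yes = normalize_posterior(0.0105, 0.0216)
print(f"{posterior_yes:.3f}")  # 0.327
```

Because α multiplies both terms, it never has to be computed explicitly: dividing by the sum of the two unnormalized values is enough.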

  10. Naïve Bayes
      • No internal hidden nodes
      • Greatly simplifies the problem; reduces computational complexity and time
      • Imposes an independence assumption

  11. Naïve Bayes
      • P(FR | D1, D2, D3, D4) = α P(FR) P(D1 | FR) P(D2 | FR) P(D3 | FR) P(D4 | FR)
      • Bayes’ Rule: P(A | B) ∝ P(A) P(B | A)
      • Assumes that all measures are independent given FR
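As a rough illustration of the naïve Bayes product above, the sketch below multiplies a prior by one likelihood per dataset for FR=yes and FR=no and then normalizes. The function name and all numeric values are made up for illustration; they are not taken from the presentation.

```python
from math import prod


def naive_bayes_posterior(prior_yes, likelihoods_yes, likelihoods_no):
    """Posterior P(FR=yes | D1..Dn) under the naive independence assumption.

    likelihoods_yes[i] stands for P(Di | FR=yes); likelihoods_no[i] for P(Di | FR=no).
    """
    score_yes = prior_yes * prod(likelihoods_yes)
    score_no = (1.0 - prior_yes) * prod(likelihoods_no)
    # Dividing by the sum plays the role of the constant alpha.
    return score_yes / (score_yes + score_no)


# Illustrative numbers only (hypothetical prior and likelihoods for four datasets)
p = naive_bayes_posterior(0.10, [0.6, 0.5, 0.7, 0.4], [0.3, 0.5, 0.2, 0.4])
print(round(p, 4))  # 0.4375
```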

  12. Learning Naïve Bayes Nets

  13. Steps for Bayesian network integration
      • Construct a gold standard
      • Convert data to pair-wise format
      • Count positive/negative pairs in each dataset
      • Create CPTs to define Bayes net
      • Inference to calculate all pair-wise probabilities
      • Evaluate performance
      • Predict functions given network


  15. Gold Standard Construction
      • Gene Ontology annotations used to define known functional relationships
      [Figure labels: threshold for positive relationships; threshold for negative relationships] (Myers et al., 2006)

  16. Gold Standard Used For Training [Figure labels: Global Gold Standard; positive relationships; negative relationships]


  18. Gene-Gene Scores
      • Binary data
        • PPI, co-localization, synthetic lethality
        • Can use binary scores
        • Can use profiles to generate scores (dot product)
      • Continuous data
        • Profile distance metrics
      • Binning results
        • Converts everything to the discrete case

  19. Distance Metrics: Euclidean Distance, Pearson Correlation, Spearman Correlation
      • Choice of distance measure is important for quantifying relationships in datasets
      • Pair-wise metrics compare vectors of numbers
        • e.g. genes x and y, each with n measurements

  20. Distance Metrics Euclidean Distance Pearson Correlation Spearman Correlation
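The three metrics named on these slides can be sketched in plain Python for two genes x and y with n measurements each. This is a small illustration using the standard textbook definitions, not the presenter's implementation; the helper `ranks` and the sample vectors are hypothetical.

```python
from math import sqrt


def euclidean(x, y):
    """Euclidean distance between two equal-length profiles."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


gene_x = [1.0, 2.0, 3.0, 4.0]
gene_y = [1.0, 4.0, 9.0, 16.0]
print(euclidean(gene_x, gene_y), pearson(gene_x, gene_y), spearman(gene_x, gene_y))
```

In this example the relationship is monotonic but not linear, so the Spearman correlation is 1 while the Pearson correlation is slightly below 1.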

  21. Sensible Binning
      • The commonly used Pearson correlation yields greatly different distributions of correlation values in different datasets
      • These differences complicate comparisons
      [Figure: histograms of Pearson correlations between all pairs of genes; DeRisi et al., 1997; Primig et al., 2000]

  22. Sensible Binning
      • Fisher Z-transform followed by Z-scoring equalizes the distributions
      • Increases comparability between datasets
      [Figure: histograms of Z-scores between all pairs of genes]
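A minimal sketch of this normalization, assuming the usual definitions: the Fisher transform is z = atanh(r), and the Z-score subtracts the mean and divides by the standard deviation of the transformed values within one dataset. The clamping constant `eps` and the sample correlations are assumptions added for illustration.

```python
from math import atanh, sqrt


def fisher_z(r, eps=1e-6):
    """Fisher transform of a correlation; r is clamped so atanh stays finite."""
    r = max(-1.0 + eps, min(1.0 - eps, r))
    return atanh(r)


def z_scores(correlations):
    """Fisher-transform all correlations in a dataset, then standardize them."""
    z = [fisher_z(r) for r in correlations]
    mean = sum(z) / len(z)
    sd = sqrt(sum((v - mean) ** 2 for v in z) / len(z))
    return [(v - mean) / sd for v in z]


# Hypothetical correlations for a handful of gene pairs in one dataset
sample = [-0.5, 0.0, 0.3, 0.8, 0.95]
standardized = z_scores(sample)
```

After this step every dataset's scores have mean 0 and standard deviation 1, so the same bin edges can be applied to all datasets.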

  23. Pre-calculation and Storage
      • Pair-wise distances only need to be calculated once, even if using different binnings
      • Typical mouse microarray: ~5-20k genes
      • 16M pair-wise distances
      • ~50-700 MB of storage for one dataset
      • ~800 datasets in GEO
      • ~200 GB for all datasets
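The storage figures can be roughly reproduced from the number of unordered gene pairs, n(n-1)/2. The sketch below assumes 4 bytes per stored distance (a single-precision float); that size is an assumption, not something stated on the slide.

```python
def pair_count(n_genes: int) -> int:
    """Number of unordered gene pairs: n * (n - 1) / 2."""
    return n_genes * (n_genes - 1) // 2


def storage_megabytes(n_genes: int, bytes_per_value: int = 4) -> float:
    """Approximate storage for all pair-wise distances, in units of 10^6 bytes."""
    return pair_count(n_genes) * bytes_per_value / 1e6


for n in (5_000, 20_000):
    print(n, pair_count(n), round(storage_megabytes(n)))
```

Under that assumption, 5k genes need about 50 MB and 20k genes about 800 MB, the same order of magnitude as the range quoted on the slide. The 16M-pair figure corresponds to roughly 5.7k genes.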


  25. Counting & Learning
      • Conceptually straightforward
      • Counting
        • Look at every pair in each dataset, see which bin it falls into, and increment a counter
        • But… you need to do this 16M times per dataset
        • “Dumb” parallelization: each dataset is independent
      • Learning CPTs
        • Fractions based on counts
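The counting and CPT-learning steps might look like the sketch below for a single dataset. The data layout (dictionaries keyed by gene pair), the function names, and the `pseudocount` smoothing are assumptions for illustration; the slide only says that the CPT entries are fractions based on counts.

```python
def bin_index(score, edges):
    """Index of the bin a score falls into, given ascending bin edges."""
    for i, edge in enumerate(edges):
        if score < edge:
            return i
    return len(edges)


def learn_cpt(pair_scores, gold_standard, edges, pseudocount=1.0):
    """Estimate P(bin | FR) for one dataset from gold-standard pairs.

    pair_scores: {(gene_a, gene_b): score}
    gold_standard: {(gene_a, gene_b): True for positive, False for negative}
    pseudocount is an added smoothing assumption that avoids zero probabilities.
    """
    n_bins = len(edges) + 1
    counts = {True: [0] * n_bins, False: [0] * n_bins}
    for pair, score in pair_scores.items():
        if pair not in gold_standard:
            continue  # pairs outside the standard are ignored
        counts[gold_standard[pair]][bin_index(score, edges)] += 1
    cpt = {}
    for label, c in counts.items():
        total = sum(c) + pseudocount * n_bins
        cpt[label] = [(v + pseudocount) / total for v in c]
    return cpt


edges = [-1.0, 0.0, 1.0]  # four bins over Z-scores (hypothetical edges)
scores = {("a", "b"): 1.5, ("a", "c"): -0.2, ("b", "c"): 0.4, ("a", "d"): 2.0}
gold = {("a", "b"): True, ("a", "c"): False, ("b", "c"): False}
cpt = learn_cpt(scores, gold, edges)
```

Since each dataset is processed independently, a call like this can be run once per dataset in parallel, which matches the "dumb" parallelization mentioned on the slide.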


  27. Inference
      • Also pretty straightforward
      • For all pairs of genes…
        • For each dataset
          • Look up value from pre-calculated distances
          • Determine bin and value from CPT
          • Multiply probability into product
        • Do this for FR=yes and FR=no
        • Normalize out α
        • Store result
      • 1.5 GB result file
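The inference loop for one gene pair could be sketched as follows. It accumulates the product in log space to avoid underflow when many datasets are combined; that choice, the data layout, and every number in the example are assumptions for illustration rather than details from the presentation.

```python
from math import exp, log


def bin_index(score, edges):
    """Index of the bin a score falls into, given ascending bin edges."""
    for i, edge in enumerate(edges):
        if score < edge:
            return i
    return len(edges)


def infer_pair(scores_by_dataset, cpts, edges, prior_yes):
    """Posterior P(FR=yes) for one gene pair across all datasets that measured it.

    scores_by_dataset: {dataset_name: pre-calculated score for this pair}
    cpts: {dataset_name: {True: [P(bin | FR=yes)], False: [P(bin | FR=no)]}}
    """
    log_yes = log(prior_yes)
    log_no = log(1.0 - prior_yes)
    for name, score in scores_by_dataset.items():
        b = bin_index(score, edges)
        log_yes += log(cpts[name][True][b])   # multiply into the FR=yes product
        log_no += log(cpts[name][False][b])   # multiply into the FR=no product
    # Normalize out alpha; subtracting the max keeps exp() well behaved.
    m = max(log_yes, log_no)
    yes, no = exp(log_yes - m), exp(log_no - m)
    return yes / (yes + no)


edges = [-1.0, 0.0, 1.0]
cpts = {
    "expr1": {True: [0.1, 0.2, 0.3, 0.4], False: [0.3, 0.4, 0.2, 0.1]},
    "y2h": {True: [0.2, 0.2, 0.2, 0.4], False: [0.25, 0.25, 0.25, 0.25]},
}
posterior = infer_pair({"expr1": 1.5, "y2h": 0.3}, cpts, edges, prior_yes=0.10)
```

A pair that is missing from a dataset simply contributes no factor for that dataset in this sketch.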


  29. Evaluation Metrics
      • TPs, FPs, TNs, FNs
      • Agnostic to pairs not appearing in the standard
      • ROC curves: Sensitivity-Specificity
      • PR curves: Precision-Recall

  30. Precision Recall Curves
      • Computed over ordered predictions
      • Precision = TP / (TP + FP)
      • Recall = TP / (TP + FN)
      [Plot: Precision (0 to 1) against Recall (0 to 1)]
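A precision-recall curve can be traced by walking down the ordered predictions and updating the TP and FP counts at each rank. The sketch below assumes the predictions are already sorted from most to least confident and are given as gold-standard labels; the function name and sample labels are hypothetical.

```python
def precision_recall_points(ranked_labels):
    """Return (recall, precision) after each prediction in ranked order.

    ranked_labels: gold-standard labels (True = positive pair), sorted from the
    most confident prediction to the least confident one.
    """
    total_pos = sum(ranked_labels)  # TP + FN over the whole list
    tp = fp = 0
    points = []
    for is_pos in ranked_labels:
        if is_pos:
            tp += 1
        else:
            fp += 1
        points.append((tp / total_pos, tp / (tp + fp)))
    return points


points = precision_recall_points([True, True, False, True, False])
for recall, precision in points:
    print(f"recall={recall:.2f} precision={precision:.2f}")
```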

  31. Summary Statistics
      • AUC: area under the (ROC) curve
        • equivalent to the Mann-Whitney U statistic, normalized by the number of positive-negative pairs
      • Average Precision: average of the precisions calculated at each true positive
        • quantized version of area under the precision-recall curve (AUPRC)
      • Precision @ n% recall
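Both summary statistics can be computed directly from their definitions, as in the sketch below. The AUC version counts, over all positive-negative pairs, how often the positive scores higher (ties count half). It is quadratic in the number of examples and is meant only as an illustration; the function names and sample data are hypothetical.

```python
def auc_mann_whitney(scores, labels):
    """ROC AUC as the normalized Mann-Whitney U statistic."""
    pos = [s for s, l in zip(scores, labels) if l]
    neg = [s for s, l in zip(scores, labels) if not l]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(ranked_labels):
    """Mean of the precisions measured at the rank of each true positive."""
    tp = 0
    precisions = []
    for rank, is_pos in enumerate(ranked_labels, start=1):
        if is_pos:
            tp += 1
            precisions.append(tp / rank)
    return sum(precisions) / len(precisions)


scores = [0.9, 0.8, 0.7, 0.6, 0.5]
labels = [True, True, False, True, False]
auc = auc_mann_whitney(scores, labels)
ap = average_precision(labels)  # labels are already in descending score order
```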

  32. Cross Validation


  34. Graph Analysis for Predictions
      • c_i = confidence of function for gene g_i
      • S = set of genes in function
      • G = set of all genes
      • w_i,j = weight of edge

  35. Steps for Our Evaluation
      • Construct a gold standard
      • Convert data to pair-wise format
      • Count positive/negative pairs in each dataset
      • Create CPTs to define Bayes net
      • Inference to calculate all pair-wise probabilities
      • Evaluate performance
      • Predict functions given network

  36. Bayesian Network Integration
      • Input data: gene expression (gene expression datasets 1 through N); physical interactions (yeast two-hybrid dataset 1, co-precipitation dataset 1); genetic interactions (synthetic lethality dataset, synthetic rescue dataset); transcription factor binding sites; localization; curated literature; other
      • Data integration via a Bayesian network
      • Probabilistic, weighted networks of gene function
      • User-selected query focuses search
      • Results displayed, e.g. new genes predicted to interact with known mitochondrial genes
      Myers et al., 2005; Huttenhower et al., 2006; Guan et al., 2008

  37. Basic Approach Applied Several Times: Myers et al., 2005; 2007; Huttenhower et al., 2007; Guan et al., 2008; Huttenhower et al., 2009

  38. Limitations and Improvements
      • Original work was designed for yeast, with a general notion of "functionally related"
        • Ignores the reality that some genes are related only under certain conditions
        • Treats multi-cellular organisms as big single-celled organisms
      • Increased specificity can be used to improve results
        • 2nd iteration of bioPIXIE incorporated biological processes into gold standards
        • Currently working on 2nd generation mouseNET to account for tissue and developmental stages

  39. General mouseNET Approach

  40. Global Gold Standard [Figure labels: positive relationships; negative relationships]

  41. Specific Gold Standards
      • Not all datasets capture all functional relationships
        • Process/Pathway specific
      • Functionally related genes aren’t always functionally related
        • Tissue specific
        • Developmental stage specific

  42. Specific Gold Standard Construction [Figure labels: positive relationships; negative relationships; Global Gold Standard; Specific Gold Standard]

  43. Tissue/Stage Gold Standards
      • Based on data from GXD
      • Cross-reference Theiler stages with the mammalian anatomy hierarchy
      • 729 total intersections
        • ranging from 50 to ~3500 genes
        • not including post-natal stages

  44. Initial Computational Evaluations

  45. Preliminary Results: running 4-fold cross validation using tissue/stage specific GO-based gold standards [Plot panels: training evaluation; test evaluation]

  46. Preliminary Results: accounting for developmental stage helps [Plot panels: training evaluation; test evaluation]

  47. Preliminary Results: many specific tissue/stage combinations are overfitting [Plot panels: training evaluation; test evaluation]

  48. Preliminary Results: folds were randomly generated and are biased; positives and negatives need to be balanced across folds
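One common way to balance positives and negatives across folds is stratified assignment: shuffle each class separately and deal its members out to the folds in turn. The sketch below shows that idea for 4 folds. It is one possible reading of the fix mentioned on the slide, not the presenter's method, and the function name and example sizes are hypothetical.

```python
import random


def stratified_folds(labels, k=4, seed=0):
    """Split example indices into k folds with near-equal positives and negatives.

    labels: list of booleans (True = positive pair, False = negative pair).
    """
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for label in (True, False):
        indices = [i for i, l in enumerate(labels) if l == label]
        rng.shuffle(indices)
        # Deal this class out round-robin so each fold gets an equal share.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


labels = [True] * 8 + [False] * 40
folds = stratified_folds(labels, k=4)
for fold in folds:
    positives = sum(labels[i] for i in fold)
    print(len(fold), positives)
```

With 8 positives and 40 negatives, each of the 4 folds receives 2 positives and 10 negatives, whereas purely random folds could leave some folds with very few positives.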

  49. New Visualization Interface: Graphle

  50. Simple Things → Long Times
      • No single step is too complicated
        • Mostly O(G²D)
        • 16M * 800 * 4
      • Evaluating one fold takes ~7 hours
      • So far have results for ~200 tissue/stages
        • Should take ~3 days on the cluster
        • Actually took ~15 days