Weighing Evidence in the Absence of a Gold Standard

Phil Long, Genome Institute of Singapore (joint work with K.R.K. “Krish” Murthy, Vinsensius Vega, Nir Friedman and Edison Liu).


Presentation Transcript


  1. Weighing Evidence in the Absence of a Gold Standard • Phil Long, Genome Institute of Singapore (joint work with K.R.K. “Krish” Murthy, Vinsensius Vega, Nir Friedman and Edison Liu)

  2. Problem: Ortholog mapping • Pair genes in one organism with their equivalent counterparts in another • Useful for supporting medical research using animal models

  3. A little molecular biology • DNA has nucleotides (A, C, T and G) arranged linearly along chromosomes • Regions of DNA, called genes, encode proteins • Proteins are the biochemical workhorses • Proteins are made up of amino acids • also strung together linearly • fold up to form a 3D structure

  4. Mutations and evolution • Speciation often roughly as follows: • one species separated into two populations • separate populations’ genomes drift apart through mutation • important parts (e.g. genes) drift less • Orthologs have common evolutionary ancestor • Genes sometimes copied • original retains function • copy drifts or dies out • Both fine-grained and coarse-grained mutations

  5. Evidence of orthology • (protein) sequence similarity • comparison with third organism • conservation of synteny ...

  6. Conserved synteny • Neighbor relationships among genes are often preserved • Consequently, similarity among their neighbors is evidence that a pair of genes are orthologs

  7. Plan • Identify numerical features corresponding to • sequence similarity • common similarity to third organism • conservation of synteny • “Learn” mapping from feature values to prediction

  8. Problem – no “gold standard” • for mouse-human orthology, Jackson database reasonable • for human-zebrafish? human-pombe?

  9. Another “no gold standard” problem: protein-protein interactions • Sources of evidence: • Yeast two-hybrid • Rosetta Stone • Phage display • All yield errors . . .

  10. Related Theoretical Work [MV95] – Problem • Goal: given m training examples generated as below, output an accurate classifier h • Training example generation: • All variables are {0,1}-valued • Y is chosen randomly, then fixed • X_1, ..., X_n are chosen independently with Pr(X_i = Y) = p_i, where p_i is unknown and is the same whether Y is 0 or 1 (crucial for the analysis) • Only X_1, ..., X_n are given to the training algorithm

  11. Related Theoretical Work [MV95] – Results • If n ≥ 3, can approach the Bayes error (best possible for the source) as m gets large • Idea: • a variable is “good” if it often agrees with the others • can, e.g., solve for Pr(X_1 = Y) as a function of Pr(X_1 = X_2), Pr(X_1 = X_3), and Pr(X_2 = X_3) • can estimate Pr(X_1 = X_2), Pr(X_1 = X_3), and Pr(X_2 = X_3) from the training data • can plug in to get estimates of Pr(X_1 = Y), ..., Pr(X_n = Y) • can use the resulting estimates to approximate the optimal classifier for the source
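To make the "solve for Pr(X1 = Y)" step concrete, here is a minimal sketch. Under the model on the previous slide (independent variables, same accuracy whether Y is 0 or 1), Pr(Xi = Xj) = pi·pj + (1 − pi)(1 − pj), which rearranges to 2·Pr(Xi = Xj) − 1 = (2pi − 1)(2pj − 1). The function name is hypothetical, and the sketch additionally assumes each variable is better than chance (pi > 1/2) so that the positive square root is the right one; that assumption is not stated in the slides.

```python
import math


def accuracy_from_agreements(a12, a13, a23):
    """Recover p1 = Pr(X1 = Y) from the three pairwise agreement rates.

    Uses 2*Pr(Xi = Xj) - 1 = (2*pi - 1)*(2*pj - 1), so that
    (2*p1 - 1)^2 = c12 * c13 / c23 with c = 2*a - 1.
    Assumption (not from the slides): p1 > 1/2, so we take the positive root.
    """
    c12, c13, c23 = 2 * a12 - 1, 2 * a13 - 1, 2 * a23 - 1
    if c23 == 0:
        raise ValueError("X2 and X3 agree at chance level; p1 is not identifiable")
    ratio = c12 * c13 / c23
    if ratio < 0:
        raise ValueError("agreement rates are inconsistent with the model")
    return 0.5 * (1 + math.sqrt(ratio))


def agreement(p, q):
    """Pr(Xi = Xj) for two independent variables with accuracies p and q."""
    return p * q + (1 - p) * (1 - q)


# Toy check with known accuracies 0.9, 0.8, 0.7 (illustrative values only).
p1, p2, p3 = 0.9, 0.8, 0.7
estimate = accuracy_from_agreements(
    agreement(p1, p2), agreement(p1, p3), agreement(p2, p3)
)
print(estimate)  # recovers p1 up to floating-point error
```

In practice the agreement rates would be empirical frequencies from the training data, so the recovered accuracy is only an estimate.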

  12. In our problem(s)... • Pr(Y = 1) is small • X_1, ..., X_n are continuous-valued • Reasonable to assume X_1, ..., X_n are conditionally independent given Y • Reasonable to assume Pr(Y = 1 | X_i = x) is increasing in x, for all i • Sufficient to sort training examples in order of their associated conditional probabilities that Y = 1

  13. Key Idea • Suppose Pr(Y = 1) is known • For each variable i, define a binary variable U_i by thresholding X_i, with the threshold set so that Pr(U_i = 1) = Pr(Y = 1) • Then Pr(Y = 1 and U_i = 0) = Pr(Y = 0 and U_i = 1) • Can solve for these error probabilities for all i in terms of the probabilities that the U_i’s agree, ... • [Figure: a row of examples labeled + and −, split by the threshold into a U_i = 0 region and a U_i = 1 region]
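The equal-error claim on this slide follows from simple counting: if exactly as many examples are marked 1 by the threshold as are truly positive, then every positive that falls below the threshold must be matched by a negative that falls above it. A minimal sketch on toy data follows; the function name and the sample values are made up for illustration.

```python
def binarize_top_fraction(values, prior):
    """Mark the top `prior` fraction of values with 1 and the rest with 0.

    This plays the role of choosing the threshold so that the fraction of
    ones matches the assumed Pr(Y = 1). Ties are broken by input order.
    """
    k = round(prior * len(values))
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    u = [0] * len(values)
    for i in order[:k]:
        u[i] = 1
    return u


# Toy feature values and hidden labels (3 positives out of 10, so prior = 0.3).
x = [0.1, 2.3, -0.5, 1.7, 0.4, 3.0, -1.2, 0.9, 0.2, 1.1]
y = [0, 1, 0, 0, 0, 1, 0, 1, 0, 0]

u = binarize_top_fraction(x, prior=0.3)
false_negatives = sum(1 for ui, yi in zip(u, y) if yi == 1 and ui == 0)
false_positives = sum(1 for ui, yi in zip(u, y) if yi == 0 and ui == 1)
print(false_negatives, false_positives)  # the two counts are equal
```

The labels are used here only to verify the identity; the method itself never sees them.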

  14. Final Plan (informal) • Assume various values of Pr(Y = 1); predict orthologs given each • For pairs of genes predicted to be orthologs even when Pr(Y = 1) assumed small, confidently predict orthology • For pairs of genes predicted to be orthologs only when Pr(Y = 1) assumed pretty big, predict orthology more tentatively

  15. Final Plan – Probabilistic Viewpoint • Consider hidden variable Z: • takes values uniformly distributed in [0,1] • interpretation: “obviously orthologous” • Assumptions: • Pr(Y = 1 | Z = z) is increasing in z • For all z, Pr(Z ≥ z | X_i = x) is increasing in x • For various z: • Let V_z = 1 if Z ≥ z, V_z = 0 otherwise • Let U_{z,i} = 1 if X_i ≥ θ_{z,i}, U_{z,i} = 0 otherwise, where θ_{z,i} is chosen so that Pr(U_{z,i} = 1) = Pr(V_z = 1) • Interpretations: • V_z: “in the top 100(1 − z)% most likely to have Y = 1 overall” • U_{z,i}: “in the top 100(1 − z)% most likely to have Y = 1 given X_i”

  16. Final Plan – Algorithm • For each z, estimate the conditional probability that V_z = 1, i.e. that Z ≥ z, given each training example, using the estimated probabilities that pairs of U_{z,i}’s agree • Add these estimates over the values of z to estimate Z for each example; sort the examples by these estimates.
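One way to read the two steps above is sketched below. It is not the authors' implementation: it assumes the binarized variables are conditionally independent given Vz (a naive Bayes step the slides do not spell out), and it assumes each disagreement probability Pr(Uz,i ≠ Vz) has already been estimated and splits evenly into the two error types, which is what matching Pr(Uz,i = 1) to Pr(Vz = 1) implies. All function and parameter names are hypothetical.

```python
def posterior_top(u, q, disagree, eps=1e-6):
    """Estimate Pr(V = 1 | U_1..U_n) by naive Bayes.

    u        : list of 0/1 indicators for one example
    q        : Pr(V = 1), i.e. 1 - z
    disagree : disagree[i] is an estimate of Pr(U_i != V)
    Conditional probabilities are clamped to [eps, 1 - eps] so that noisy
    estimates cannot produce a zero or negative likelihood.
    """
    like1, like0 = q, 1.0 - q
    for ui, d in zip(u, disagree):
        miss = min(max((d / 2) / q, eps), 1 - eps)               # Pr(U_i = 0 | V = 1)
        false_alarm = min(max((d / 2) / (1 - q), eps), 1 - eps)  # Pr(U_i = 1 | V = 0)
        like1 *= (1 - miss) if ui else miss
        like0 *= false_alarm if ui else (1 - false_alarm)
    return like1 / (like1 + like0)


def estimate_z(u_by_z, q_by_z, disagree_by_z):
    """Add the posteriors over a grid of z values to score one example."""
    return sum(
        posterior_top(u, q, d)
        for u, q, d in zip(u_by_z, q_by_z, disagree_by_z)
    )


# Two grid points (Pr(V = 1) of 0.1 and 0.2) and three variables; numbers are made up.
qs = [0.1, 0.2]
ds = [[0.02, 0.02, 0.02], [0.04, 0.04, 0.04]]
strong = estimate_z([[1, 1, 1], [1, 1, 1]], qs, ds)
weak = estimate_z([[0, 0, 0], [0, 0, 0]], qs, ds)
print(strong > weak)
```

Sorting examples by the value returned from `estimate_z` then gives the ranking used for prediction.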

  17. Practical problem • Small errors in the estimates of the Pr(U_{z,i} = U_{z,j})’s can lead to large errors in the estimates of the Pr(U_{z,i} = V_z)’s (in fact, the program crashes) • Solution: • the case where Pr(V_z = 1) is small is the important one (confident predictions) • in that case, can approximate: Pr(U_{z,i} ≠ V_z) ≈ ½ (Pr(U_{z,i} ≠ U_{z,j}) + Pr(U_{z,i} ≠ U_{z,k}) − Pr(U_{z,j} ≠ U_{z,k}))
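The approximation on this slide needs only three empirical disagreement rates, so it is easy to sketch. The helper names below are hypothetical, and clamping the result at zero is an added safeguard against sampling noise rather than something the slides specify.

```python
def disagreement_rate(u, v):
    """Fraction of examples on which two 0/1 indicator lists differ."""
    if len(u) != len(v) or not u:
        raise ValueError("need two non-empty lists of equal length")
    return sum(a != b for a, b in zip(u, v)) / len(u)


def approx_disagreement_with_top(d_ij, d_ik, d_jk):
    """Approximate Pr(U_i != V) as (d_ij + d_ik - d_jk) / 2.

    d_ij, d_ik, d_jk are pairwise disagreement rates among three variables.
    The result is clamped at 0 (an assumption of this sketch) because noisy
    estimates can make the raw expression slightly negative.
    """
    return max(0.0, 0.5 * (d_ij + d_ik - d_jk))


# Three binarized variables over eight examples (toy data).
u_i = [1, 0, 0, 0, 1, 0, 0, 0]
u_j = [1, 0, 0, 0, 0, 0, 1, 0]
u_k = [1, 0, 0, 1, 0, 0, 0, 0]

estimate = approx_disagreement_with_top(
    disagreement_rate(u_i, u_j),
    disagreement_rate(u_i, u_k),
    disagreement_rate(u_j, u_k),
)
print(estimate)
```

Unlike the exact solution, this expression involves no division or square root, so small estimation errors cannot blow up.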

  18. Evaluation: Artificial Source • Examples generated using a randomly chosen probability distribution: • Pr(Y = 1) = 0.1, n = 5 • For each i, • choose μ_i uniformly from [min, max] • set the distributions for the ith variable: • Pr(X_i | Y = 0) = N(−μ_i, 1), • Pr(X_i | Y = 1) = N(μ_i, 1) • Evaluate using area under the ROC curve • Repeat 100 times, average
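A sketch of a generator for this artificial source is given below. The slide does not state the values of min and max, so the default `mu_range` here is a placeholder chosen only for illustration; the function name and the seed parameter are likewise hypothetical.

```python
import random


def sample_source(m, n=5, prior=0.1, mu_range=(0.5, 1.5), seed=0):
    """Draw m examples from one randomly chosen Gaussian source.

    Each variable i gets a mean magnitude mu_i drawn uniformly from mu_range
    (placeholder bounds; the slides only say [min, max]). Given Y, variable i
    is N(+mu_i, 1) when Y = 1 and N(-mu_i, 1) when Y = 0.
    Returns (features, labels, mus).
    """
    rng = random.Random(seed)
    mus = [rng.uniform(*mu_range) for _ in range(n)]
    xs, ys = [], []
    for _ in range(m):
        y = 1 if rng.random() < prior else 0
        sign = 1 if y else -1
        xs.append([rng.gauss(sign * mu, 1.0) for mu in mus])
        ys.append(y)
    return xs, ys, mus


xs, ys, mus = sample_source(m=200)
print(len(xs), len(xs[0]), sum(ys))
```

The labels would be withheld from the learning algorithm and used only afterwards to score its ranking; repeating with different seeds gives the 100 independent trials mentioned above.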

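  19. ROC curve • [Figure: ROC curve, with false positives on the horizontal axis and true positives on the vertical axis, each running up to 1; the area under the ROC curve is marked]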
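The area under the ROC curve equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one, with ties counted as one half. That gives a short way to compute it without drawing the curve. The function name is hypothetical, and the quadratic pairwise loop is chosen for clarity rather than speed.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores.

    Counts, over all (positive, negative) pairs, how often the positive
    example has the higher score; ties contribute one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy ranking: one of the two positives is scored below a negative.
print(auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
```

A perfect ranking gives 1 and a random ranking gives about 0.5, which is why the measure suits a method whose output is a sorted list rather than hard predictions.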

  20. Results: Artificial Source

  21. Evaluation: mouse-human ortholog mapping • Use Jackson mouse-human ortholog database as “gold standard” • Apply algorithm, post-processing to map each gene to unique ortholog • Compare with analogous BLAST-only algorithm • Plot ROC curve • Treat anything not in database as non-ortholog • some “false positives” in fact correct • error rate overestimated

  22. Results: mouse-human ortholog mapping

  23. Open problems • Given our assumptions, is there an algorithm for learning using random examples that always approaches the optimal AUC given knowledge of the source? • Is discretizing the independent variables necessary? • How does our method compare with other natural algorithms? (E.g. what about algorithms based on clustering?)
