Weighing Evidence in the Absence of a Gold Standard Phil Long Genome Institute of Singapore (joint work with K.R.K. “Krish” Murthy, Vinsensius Vega, Nir Friedman and Edison Liu.)
Problem: Ortholog mapping • Pair genes in one organism with their equivalent counterparts in another • Useful for supporting medical research using animal models
A little molecular biology • DNA has nucleotides (A, C, T and G) arranged linearly along chromosomes • Regions of DNA, called genes, encode proteins • Proteins are biochemical workhorses • Proteins are made up of amino acids • also strung together linearly • fold up to form a 3D structure
Mutations and evolution • Speciation often proceeds roughly as follows: • one species is separated into two populations • the separate populations’ genomes drift apart through mutation • important parts (e.g. genes) drift less • Orthologs have a common evolutionary ancestor • Genes are sometimes copied • the original retains its function • the copy drifts or dies out • Both fine-grained and coarse-grained mutations occur
Evidence of orthology • (protein) sequence similarity • comparison with third organism • conservation of synteny ...
Conserved synteny • Neighbor relationships among genes are often preserved • Consequently, similarity among their neighbors is evidence that a pair of genes are orthologs
Plan • Identify numerical features corresponding to • sequence similarity • common similarity to third organism • conservation of synteny • “Learn” mapping from feature values to prediction
Problem – no “gold standard” • for mouse-human orthology, Jackson database reasonable • for human-zebrafish? human-pombe?
Another “no gold standard” problem: protein-protein interactions • Sources of evidence: • Yeast two-hybrid • Rosetta Stone • Phage display • All yield errors ...
Related Theoretical Work [MV95] – Problem • Goal: • given m training examples generated as below • output accurate classifier h • Training example generation: • All variables {0,1}-valued • Y chosen randomly, fixed • X1, ..., Xn chosen independently with Pr(Xi = Y) = pi, where pi is • unknown, • the same when Y is 0 or 1 (crucial for analysis) • only X1, ..., Xn given to training algorithm
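As a concrete reference, here is a minimal sketch of this source (my code and function name, not from the talk): Y is a hidden coin flip, and each visible Xi independently agrees with Y with probability pi, the same for both values of Y.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_mv95(m, p):
    """m examples from the [MV95] source: Pr(Xi = Y) = p[i], independently."""
    p = np.asarray(p)
    y = rng.integers(0, 2, size=m)               # hidden label, uniform on {0, 1}
    agree = rng.random((m, len(p))) < p          # Xi = Y with probability p[i]
    x = np.where(agree, y[:, None], 1 - y[:, None])
    return x, y                                  # a learner sees only x
```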
Related Theoretical Work [MV95] – Results • If n ≥ 3, can approach Bayes error (best possible for the source) as m gets large • Idea: • a variable is “good” if it often agrees with the others • can e.g. solve for Pr(X1 = Y) as a function of Pr(X1 = X2), Pr(X1 = X3), and Pr(X2 = X3) • can estimate Pr(X1 = X2), Pr(X1 = X3), and Pr(X2 = X3) from the training data • can plug in to get estimates of Pr(X1 = Y), ..., Pr(Xn = Y) • can use the resulting estimates of Pr(X1 = Y), ..., Pr(Xn = Y) to approximate the optimal classifier for the source
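A sketch of the solving step, assuming each pi > 1/2 so square-root signs are determined. Writing qi = 2pi − 1, independence gives Pr(Xi = Xj) = pi pj + (1 − pi)(1 − pj), i.e. 2 Pr(Xi = Xj) − 1 = qi qj, so q1² = (q1q2)(q1q3)/(q2q3) is expressible purely in observable quantities:

```python
import numpy as np

def estimate_accuracies(x):
    """Estimate Pr(Xi = Y) for each column of x without ever seeing Y."""
    m, n = x.shape
    s = 2.0 * x - 1.0                  # map {0, 1} to {-1, +1}
    A = (s.T @ s) / m                  # off-diagonal A[i, j] estimates qi * qj
    p_hat = np.empty(n)
    for i in range(n):
        j, k = [t for t in range(n) if t != i][:2]   # any two other variables
        q_i = np.sqrt(abs(A[i, j] * A[i, k] / A[j, k]))
        p_hat[i] = (1.0 + q_i) / 2.0
    return p_hat
```

With more than three variables one could average over the choices of j and k for robustness; given the estimates, the optimal classifier is a weighted majority vote with weights log(p̂i / (1 − p̂i)).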
In our problem(s)... • Pr(Y = 1) small • X1, ..., Xn continuous-valued • Reasonable to assume X1, ..., Xn conditionally independent given Y • Reasonable to assume Pr(Y = 1 | Xi = x) increasing in x, for all i • Sufficient to sort training examples in order of the associated conditional probabilities that Y = 1
Key Idea • Suppose Pr(Y = 1) known • For variable i: • set a threshold on Xi so that Pr(Ui = 1) = Pr(Y = 1), where Ui = 1 iff Xi is above the threshold • then Pr(Y = 1 and Ui = 0) = Pr(Y = 0 and Ui = 1) • Can solve for these error probabilities for all i in terms of the probabilities that the Ui’s agree, ... • [Figure: examples (+’s and −’s) arranged along the Xi axis, with the threshold separating the Ui = 0 region from the Ui = 1 region]
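To see why the two error masses balance: Pr(Y = 1 and Ui = 0) = Pr(Y = 1) − Pr(Y = 1 and Ui = 1), and Pr(Y = 0 and Ui = 1) = Pr(Ui = 1) − Pr(Y = 1 and Ui = 1), so the two coincide exactly when Pr(Ui = 1) = Pr(Y = 1). A minimal sketch of the thresholding step (my code; cutting at the empirical quantile is an assumption about how the threshold is set):

```python
import numpy as np

def binarize(x, prior):
    """Ui = 1 iff Xi is above its empirical (1 - prior)-quantile,
    so that the fraction of examples with Ui = 1 matches Pr(Y = 1) = prior."""
    return (x >= np.quantile(x, 1.0 - prior, axis=0)).astype(int)
```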
Final Plan (informal) • Assume various values of Pr(Y = 1); predict orthologs given each • For pairs of genes predicted to be orthologs even when Pr(Y = 1) assumed small, confidently predict orthology • For pairs of genes predicted to be orthologs only when Pr(Y = 1) assumed pretty big, predict orthology more tentatively
Final Plan – Probabilistic Viewpoint • Consider a hidden variable Z: • takes values uniformly distributed in [0, 1] • interpretation: “obviously orthologous” • Assumptions: • Pr(Y = 1 | Z = z) increasing in z • For all z, Pr(Z ≥ z | Xi = x) increasing in x • For various z: • Let Vz = 1 if Z ≥ z, Vz = 0 otherwise • Let Uz,i = 1 if Xi ≥ θz,i, Uz,i = 0 otherwise, where θz,i is chosen so that Pr(Uz,i = 1) = Pr(Vz = 1) • Interpretations: • Vz is “in the top 100(1−z)% most likely to have Y = 1 overall” • Uz,i is “in the top 100(1−z)% most likely to have Y = 1 given Xi”
Final Plan – Algorithm • Estimate the conditional probability that Vz = 1 (i.e. that Z ≥ z) given each training example, using estimated probabilities that pairs of Uz,i’s agree • Add these up over z to estimate each example’s Z; sort by the estimates.
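A minimal end-to-end sketch of this ranking, under the talk’s assumptions (Uz,i conditionally independent given Vz). The function name, the grid of z values, and the clipping constants are my choices; the stabilized error-rate estimate it uses is the approximation from the “Practical problem” slide that follows, halved because the two error masses balance (Key Idea slide).

```python
import numpy as np

def score_examples(x, z_grid=np.linspace(0.5, 0.99, 50)):
    """Return one score per row of x; sorting by it ranks examples
    by their estimated Z (equivalently, by likelihood that Y = 1)."""
    m, n = x.shape
    scores = np.zeros(m)
    for z in z_grid:
        prior = 1.0 - z                                     # Pr(Vz = 1)
        u = (x >= np.quantile(x, z, axis=0)).astype(int)    # Pr(Uz,i = 1) = prior
        d = np.array([[np.mean(u[:, i] != u[:, j]) for j in range(n)]
                      for i in range(n)])                   # pairwise disagreement rates
        eps = np.empty(n)                                   # eps[i] = Pr(Uz,i = 1, Vz = 0)
        for i in range(n):
            j, k = [t for t in range(n) if t != i][:2]
            # half of the next slide's approximation to Pr(Uz,i != Vz);
            # the joint error is half the total because the two masses balance
            eps[i] = 0.25 * (d[i, j] + d[i, k] - d[j, k])
        eps = np.clip(eps, 1e-6, prior - 1e-6)              # keep probabilities valid
        # naive-Bayes posterior Pr(Vz = 1 | u) for every example at once
        p1 = np.where(u == 1, (prior - eps) / prior, eps / prior)
        p0 = np.where(u == 1, eps / (1 - prior), 1 - eps / (1 - prior))
        post = prior * p1.prod(axis=1)
        post = post / (post + (1 - prior) * p0.prod(axis=1))
        scores += post                                      # summing over z estimates Z
    return scores
```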
Practical problem • Small errors in the estimates of the Pr(Uz,i = Uz,j)’s can lead to large errors in the estimates of the Pr(Uz,i = Vz)’s (in fact, the program crashes). • Solution: • the case where Pr(Vz = 1) is small is the important one (confident predictions) • there, can approximate: Pr(Uz,i ≠ Vz) ≈ ½ (Pr(Uz,i ≠ Uz,j) + Pr(Uz,i ≠ Uz,k) − Pr(Uz,j ≠ Uz,k)).
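The approximation holds because, when Pr(Vz = 1) is small, the error events {Uz,i ≠ Vz} and {Uz,j ≠ Vz} rarely overlap, so Pr(Uz,i ≠ Uz,j) ≈ Pr(Uz,i ≠ Vz) + Pr(Uz,j ≠ Vz); writing this for all three pairs and solving the linear system gives the half-sum above. A toy numeric check (my numbers, not from the talk):

```python
# With exactly disjoint error events, e_i = Pr(U_i != V) and d_ij = e_i + e_j.
e_i, e_j, e_k = 0.02, 0.03, 0.05
d_ij, d_ik, d_jk = e_i + e_j, e_i + e_k, e_j + e_k
assert abs(0.5 * (d_ij + d_ik - d_jk) - e_i) < 1e-12
```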
Evaluation: Artificial Source • Examples generated using a randomly chosen probability distribution: • Pr(Y = 1) = 0.1, n = 5 • For each i: • choose μi uniformly from [min, max] • set the distributions for the ith variable: • Xi | Y = 0 ~ N(−μi, 1), • Xi | Y = 1 ~ N(μi, 1). • Evaluate using the area under the ROC curve • Repeat 100 times, average
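A sketch of this experiment. The slide does not give [min, max] or the sample size, so the values below are placeholders; roc_auc_score is scikit-learn’s AUC, my choice of implementation, and score_examples is the algorithm sketch above.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def run_trial(m=2000, n=5, lo=0.5, hi=2.0, prior=0.1):
    mu = rng.uniform(lo, hi, size=n)             # placeholder range for [min, max]
    y = (rng.random(m) < prior).astype(int)      # Pr(Y = 1) = 0.1
    x = rng.normal(np.where(y[:, None] == 1, mu, -mu), 1.0)  # Xi | Y ~ N(+/- mu_i, 1)
    return roc_auc_score(y, score_examples(x))   # score_examples: earlier sketch

print(np.mean([run_trial() for _ in range(100)]))  # average AUC over 100 trials
```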
ROC curve • [Figure: ROC curve plotting true positives against false positives, from (0, 0) to (1, 1); the shaded area under the curve is the AUC.]
Evaluation: mouse-human ortholog mapping • Use the Jackson mouse-human ortholog database as the “gold standard” • Apply the algorithm, post-processing to map each gene to a unique ortholog • Compare with an analogous BLAST-only algorithm • Plot ROC curve • Treat anything not in the database as a non-ortholog • some “false positives” are in fact correct • so the error rate is overestimated
Open problems • Given our assumptions, is there an algorithm for learning using random examples that always approaches the optimal AUC given knowledge of the source? • Is discretizing the independent variables necessary? • How does our method compare with other natural algorithms? (E.g. what about algorithms based on clustering?)