Statistical and data analytical aspects

Statistical and data analytical aspects Wolfgang Huber EBI

Overview • o Some basic statistical concepts that you need for the comparison of studies • o Preprocessing, normalization, quality assessment/control • o Comparison with microarray data • o Epistatic analysis, graphs and networks • o Synthetic analysis • o Software

The Yin and Yang of Statistics: biasaccuracy Sensitivity vs specificity False positives vs false negatives  precision variance

Receiver Operator Characteristic (ROC) curves 1 qualitatively different methods True Positive Rate (TP/Preal) Sensitivity  ‘trivial’ parameter changes  False Positive Rate (FP/Preal) (1-specificity)

The basic dogma of statistics: Can always increase sensitivity on the cost of specificity, or vice versa, the art is to find a method does both well and then to find the best trade-off. X X X X X X X X X

Inter-study comparisons o Phenotypes in HT-experiments are rarely clear-cut yes/no decisions – often they involve some kind of graduation. o Many studies focus their thresholding on minimizing False Positives. o Two studies can produce completely non-overlapping gene lists, yet both could be 100% correct. o Systematic comparison of studies on the basis of gene lists is difficult (or impossible). Gene lists reflect just one particular point along the ROC curve. o Need to understand sensitivity as well as specificity o Need the quantitative scores.

Thresholding Boutros, Kiger, et al. Science 2004

Statistical tests: what is a p-value? o It is difficult to positively prove general statements (“hypotheses”) from induction (i.e. the observation of special cases in nature). It is relatively straightforward to falsify them (Popper). o Statisticians have adapted this idea to situations with noisy data (and other types of uncertainty): the methodology of hypothesis testing.

Statistical tests: what is a p-value? A hypothesis test consists of oa hypothesis oa rule to calculate the probability that a certain observation of nature could arise if the hypothesis were true: p-value. If we make an observation of nature for which the p-value is too small, the hypothesis can be rejected. There are many off-the shelf rules for common set-ups; you can also build your own.

 Limitations of hypothesis testing Boer et al. Genome Res. 2001

 Limitations of hypothesis testing space of all possible hypotheses true state of nature null hypothesis Significance, p-value: only depends on whether or not null hypothesis and true state of nature (TSN) overlap, and on number of observations (N), Effect size: measures the distance between null hypothesis and TSN; expected value is independent of N. For phenotype screens, it is difficult enough to calculate a p-value; and even harder to calculate an effect size score that is comparable across studies / labs (how do you quantify “distance”?)

 Multiple testing Suppose you perform a gene knock-out experiment, and the observed phenotype is different from the wildtype at 5% significance level Suppose you perform a screen with 10,000 gene knock-out experiments. Even if no gene at all had a phenotype, you would stillobserve about 10,000*0.05=500 hits at the p=0.05 level. Multiple testing corrections have become standard tools for bioinformaticians in the context of microarray analysis. But they don’t really solve the problem – they just formalize it. Multiple testing correction inevitably means a large loss of power (sensitivity). Exploratory multivariate data analysis, asking directed (hypothesis driven) questions, integration of prior knowledge / other experiments can help.

Summary: some basic statistical concepts that you need for the comparison of studies o hypothesis testing o sensitivity / specificity o ROC curves o significance vs effect size o multiple testing

Normalization Boutros, Kiger, et al. Science 2004

 (linear) regression models for normalization Example: y: observed signal P: plate effect W: well effect (e.g. spatial) G: probe effect C: condition effect D: overall dye effect F: probe-specific dye effect e: noise X: gene-specific condition effect k: probe / gene c: treatment, condition d: dye i: replicate p=p(kci): plate index w=w(kc): plate index Similar problems/solutions as for microarrays: Small no. of replicates - moderation of the variances Large dynamic range / heteroskedasticity - scale transformations

 ANOVA to study synthetic phenotypes y: observed signal C: drug treatment (0/1, or dose) R: RNAi (0/1, … efficiency) S: synergistic / antagonistic effect e: noise Open questions On what scale to measure y, C, R (normalization)? e.g. log(a+b) ≠ log(a)+log(b)

Genetic interactions • Epistasis • KO of gene A has phenotype a, KO of gene B has phenotype b, KO of A+B has phenotype b. • B is epistatic to A: A and B are in a linear pathway, and B is after A. • Synthetic interaction • KO of gene A or B individually has no phenotype, but KO of A+B has phenotype. • Buffering • KO of gene A or B individually has weak phenotype, but KO of A+B has strong phenotype.

 (Epistatic) Interaction Networks UV Apoptosis different compounds differentiation events Hormone stimulation apoptosis

Cell line E = Stimulus T = Temporal Response = Assay for Response … = assay timepoints SHAPE CHANGE CYCLE ARREST CELL DEATH A E T1 T2 T3 Simple Models T1 B E T2 T3 Amy Kiger RNAi1 No phenotype E RNAi + Gene needed for shape change and cell death (Model A) RNAi 2 Example Results? RNAi 3 Gene needed for shape change (Model B) RNAi 4 Gene needed for cell death (Model A or B) RNAi 5 Gene inhibits cell death or promotes survival (Model B?)

Global Mapping of the Yeast Genetic Interaction Network Amy Hin Yan Tong,…49 other people, …Charles Boone Science 303 (6 Feb 2004)

 Buffering and Genetic Variation In yeast, ~73% of gene deletions are "non-essential" (Glaever et al. Nature 418 (2002) In Drosophila, ~95% had no viability phenotype in cell lines (Boutros et al. Science 303 (2004)) In Human, ca. 1 SNP / 1.5kB Evolutionary pressure for robustness Bilateral asymmetry is positively correlated with inbreeding Most genetic variation is neutral to fitness, but may well affect quality of life Probably mechanistic overlap between buffering of genetic, environmental and stochastic perturbations

 Models for Buffering Comparison of single mutants to double mutants in otherwise isogenic genetic background Synthetic Genetic Array (SGA) analysis (Tong, Science 2001): cross mutation in a "query" gene into a (genome-wide) array of viable mutants, and score for phenotype. Tong 2004: 132 queries x 4700 mutants

 Buffering A buffered by B (i) molecular function of A can also be performed by B with sufficient efficiency (ii) A and B part of a complex, with loss of A or B alone, complex can still function, but not with loss of both (iii) A and B are in separate pathways, which can substitute each other's functions. structural similarity - physically interaction - maybe, but neither is necessary.

 Selection of 132 queries o actin-based cell polarity o cell wall biosynthesis o microtubule-based chromosome segregation o DNA synthesis and repair Reproducibility Each screen 3 times: 3x132x4700 = 1.8 Mio measurements 25% of interactions observed only 1/3 times 4000 interactions amongst 1000 genes confirmed by tetrad or random spore analysis ("FP neglible") FN rate: 17-41%

 Statistics Hits per query gene: range 1...146, average 34 (!) power-law (g=-2) Physical interactions: ~8 Dubious calculation: ~100,000 interactions

 GO Genes with same or "similar" GO category are more likely to interact

 Patterns SGI more likely between genes - withsame mutant phenotype - withsame localization - in same complex (but this explains only 1 % of IAs) - that are homologous (but this explains only 2% of IAs) Genes that have many common SGI partners tend to also physically interact: 30 / 4039 SGI pairs are also physically interacting 27 / 333 gene pairs with >=16 common SGI partners factor: 11

Assignment of function to "new" genes

Bioconductor • an open source and open development software project for the analysis of biomedical and genomic data. • Started in the fall of 2001 by Robert Gentleman (then at Harvard) and now includes 23 core developers in the US, Europe, and Australia. • R and the R package system are used to design and distribute software. • Initial focus on microarray statistics, now also: proteomics, cell-based assays, bioinformatic metadata, graph-theoretic methods

preprocessing / detection of effectors package prada: data import, quality control, primary statistical analysis, and visualization for flow-cytometry based cell-assays Generic: lm, rlm: linear modeling / ANOVA nlme: mixed effects models locfit: non-parametric (`local`) regression

T = C = M = N = A = I = insufficient transfection efficiency few cells mock transfection neutral control activator control inhibitor control screening plate 5 apoptosis assay MAP kinase assay

graph, RBGL, Rgraphviz graphbasic class definitions and functionality RBGLgraph algorithms (e.g. shortest path, connectivity) Rgraphviz graph rendering. Different layout algorithms. Seamlessly combinable with R graphics.

graph, RBGL, Rgraphviz

Current developments Interface to image analysis libraries for detection of complex phenotypes & combination with R for scripting, statistical analysis, data visualization. Extension of interfaces to BGL et al. Stochastic models based on graph theoretic concepts, but suitable for noisy data (e.g. cliques in the presence of FN, shortest paths in the presence of FP)

Reproducible Research and Compendia There is a tendency to accept seemingly realistic computational results, as presented by figures and tables, without any proofof correctness. F. Leisch, T. Rossini, Chance 16 (2003) We re-analyzed the breast cancer data from van‘t Veer et al. (2002). ... Even with some helpof the authors, we were unable to exactly reproduce this analysis. R. Tibshirani, B. Efron, SAGMB (2002)

Re-analysis of a breast cancer outcome study • E. Huang et al., Gene expression predictors • of breast cancer outcome, The Lancet 361 (9369): 1590-6 (2003) • 89 primary breast tumors on Affymetrix Chips (HG-U95av2), among them: 52 with 1-3 positive lymph nodes, 18 led to recurrence within 3 years, 34 did not. • Goal: predict recurrence • Claim: 5 misclassification errors, 1 unclear (leave-one-out cross-validation) • Method: Bayesian binary prediction trees (at the time, unpublished) • http://www.cagp.duke.edu

…we tried to reproduce these results, starting from the published raw data But couldn't. The paper (and supplements) didn't contain the necessary details to re-implement their algorithm. Authors didn't provide comparisons to simple well-known methods. In our hands, all other methods resulted in worse misclassification results. Is their new Bayesian tree method miles better than everything else? Or was their analysis over-optimistic? (over-fitting, selection bias)

A general pattern Analysis of large-scale data sets involves many steps. It is impossible to report all the details in Materials & Methods section. Science works by “standing of the shoulders of giants”. For bioinformatics / mathematical modeling this means being able to reproduce other people’s analysis, to try different variations, and to build further analysis on top. The difficulty to exactly reproduce a published analysis is at best a waste of time, and often a substantial impediment to the progress of the field.

Compendia • Interactive documents that contain: • Primary data • Processing methods (computer code) • Derived data, figures, tables and other output • Text: research report (result, materials and methods, conclusions) • Package compHuang: reanalysis of Huang et al. data, using different classification and preprocessing methods and a correct cross-validation procedure for estimating the prediction error • Based on R/Bioconductor's package and vignette technologies M. Ruschhaupt, W. Huber, A. Poustka, U. Mansmann, Statistical Applications in Genetics and Molecular Biology (2004) PLoS Medicine Feb 2005.

source markup (here: latex & R) processed document (here: PDF) Sweave <<MCRestimate call,eval=FALSE,echo=TRUE>>=r.forest <- MCRestimate(eset,class.label, class.function="RF.wrap", select.fun=red.fct,cross.outer=10, cross.inner=5,cross.repeat=20)@ <<rf.save,echo=FALSE,results=hide>>=savepdf(plot(r.forest, main="Random Forest"),"image-RF.pdf")@ <<result>>=r.forest@ The final document includes results of the calculation, graphicaloutputs, tables, and optionally parts of the R-Code which has beenused. Also the description of theexperiment,the interpretation of theresults, and the conclusion can be integrated. In this example we applied our compendium toT. Golubs ALL/AMLdata~\cite{Golub.1999}. \begin{figure}[h]\begin{center}\includegraphics[width=0.4\textwidth]{image-RF}\end{center}\end{figure}\smallskip <<summary,echo=FALSE>>=method.list <- list(r.forest,r.pam,r.logReg,r.svm)name.list<- c("RF","PAM","PLR","SVM")conf.table <-MCRconfusion(method.list, col.names=name.list)@ <<writinglatex1,echo=FALSE, results=tex>>=xtable(conf.table,"Overall number ofmisclassifications",label="conf.table",display=rep("d",6))@ %\input{samples.1}%\input{conf.table} \begin{thebibliography}{1}\bibitem[Golub et al. ,1999]{Golub.1999} Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES.\newblock Molecular classification of cancer: class discovery and class prediction by gene expression monitoring\newblock\textit{Science} 286(5439): 531-7 (1999). \end{thebibliography}

Structure of a compendium Package directory General info: author, version, … software documentation source markup additional software code data directory function definition manual page data files function definition manual page data files function definition manual page . . . data files . . . . . .

Compendia See also the work by Donald Knuth HP Wolf Günther Sawitzki Friedrich Leisch Robert Gentleman Duncan Temple Lang

References • R www.r-project.org • software (CRAN) • documentation • newsletter: R News • mailing list • Bioconductor www.bioconductor.org • software, data, and documentation (vignettes) • training materials from short courses • mailing list • Group webpage • www.ebi.ac.uk/huber

References Bioinformatics and Computational Biology Solutions using R and Bioconductor. Gentleman, Carey, Huber, Irizarry, Dudoit.Springer (August 2005). www.bioconductor.org – software for computing with and visualizing graphs http://genepath.org – analysis of genetic interactions Evaluating the Effect of Perturbations in Reconstructing Network Topologies. Markowetz, Spang, DSC 2003. Untangling the wires: a strategy to trace functional interactions in signaling and gene networks. Kholodenko et al. PNAS 99(20), 2002.

Acknowledgements • Oleg Sklyar • Jörn Tödling • Raeka Aiyar • Michael Boutros • Amy Kiger • Robert Gentleman • Vince Carey

Is differential expression a good predictor for ’signaling’ function? RIP/IMD pathways RNAi phenotypes Differentially regulated genes R RIP Tak1 280 genes ~ 70 IKK Rel Targets Michael Boutros

Most pathway targets are not required for pathway function RIP/IMD pathways RNAi phenotypes Differentially regulated genes R RIP Tak1 280 genes ~ 70 IKK Rel 3 Targets Michael Boutros

Statistical and data analytical aspects