Integrating Biology and Statistics: Gene Set Methods

BIOS 691-003 Winter/Spring 2010

Philosophical Overture • Integrating biology and statistics • Gene sets: genes whose protein products collaborate on a well-defined function • Vague! • Hard to define ‘function’ or draw boundary on ‘gene sets’ • Statistical methods often ad-hoc • Be skeptical... but optimistic

Historical Motivations • Too many genes are significant • Researchers used to generate a list by p-value and comb for genes that work together • First pathway tools automated this process • Patterns may be more significant than any individual gene • e.g. if most genes in glycogen biosynthesis are up, but none is significant individually (after multiple-comparisons adjustment) • We can infer that glycogen is being made

Goals of Current Practice • Characterize biological meaning of joint changes in gene expression • Organize expression (or other) changes into meaningful ‘chunks’ (themes) • Identify crucial points in process where intervention could make a difference

Gene Sets • Gene Ontology • Biological Process • Molecular Function • Cellular Component (location) • Pathway Databases • KEGG • BioCarta • MSigDB • Broad Institute

Approaches • Univariate (most of current practice): • Discrete methods based on counting • Continuous methods: summarize gene test statistics by set • Multivariate (promising but unclear): • Compare differences to normal covariation of genes in groups across individuals • Use known biological relationships to construct test statistics

Univariate Approaches • Discrete tests: enrichment for groups in gene lists • Select genes differentially expressed at some cutoff • For each gene group cross-tabulate • Test for significance (Hypergeometric or Fisher test) • Continuous tests: from gene scores to group scores • Compare distribution of scores within each group to random selections • GSEA (Gene Set Enrichment Analysis) • PAGE (Parametric Analysis of Gene Set Enrichment)

Discrete Approach – 2 × 2 Table • For each set in turn construct a 2 × 2 table of significance vs membership in the set • The p-value P for the set is computed from this table
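The cross-tabulation step can be sketched as follows. This is a minimal illustration, not the slide's own table: the function name `contingency_table`, the cell layout, and the gene identifiers are all assumptions made for the example.

```python
def contingency_table(all_genes, significant, gene_set):
    """Cross-tabulate significance against gene-set membership.

    Returns ((a, b), (c, d)) where
      a = significant and in the set,   b = significant, not in the set,
      c = not significant, in the set,  d = neither.
    Genes that are not in `all_genes` (e.g. not on the chip) are ignored.
    """
    universe = set(all_genes)
    sig = set(significant) & universe
    members = set(gene_set) & universe
    a = len(sig & members)
    b = len(sig - members)
    c = len(members - sig)
    d = len(universe) - a - b - c
    return (a, b), (c, d)


# Toy example with made-up gene identifiers; "g99" is not on the chip.
genes = [f"g{i}" for i in range(1, 11)]
table = contingency_table(genes, ["g1", "g2", "g3"], ["g2", "g3", "g4", "g99"])
print(table)
```

Restricting both lists to the genes actually measured matters: the reference population for the test is the chip, not the genome.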

Significance Testing of Categories • Fisher’s Exact Test • Condition on margins fixed • Of all tables with same margins, how many have dependence as or more extreme? • Hard to compute when either n or k is large • Approximations • Binomial (when k/n is small) • Chi-square (when expected values > 5) • G² (log-likelihood ratio; compare to χ² on 1 df)
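A minimal sketch of the exact one-sided enrichment test, assuming the usual hypergeometric setup. The symbol names (`N` genes on the chip, `K` of them in the set, `n` genes called significant, `k` in both) are chosen for this example and may not match the slide's n and k.

```python
from math import comb


def hypergeom_upper_tail(N, K, n, k):
    """One-sided enrichment p-value P(X >= k).

    X is the number of set members among n genes drawn without
    replacement from N genes, K of which belong to the set.
    This is the one-sided Fisher exact test with all margins fixed.
    """
    total = comb(N, n)
    upper = min(K, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Toy numbers: 10 genes on the chip, 3 in the set, 3 significant, 2 in both.
p_value = hypergeom_upper_tail(N=10, K=3, n=3, k=2)
print(p_value)
```

Python integers are exact, so the sum itself does not overflow, but the number of terms and the size of the binomial coefficients grow quickly, which is why the approximations listed above are used for large tables.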

Practical Issues – I • What is appropriate Null Distribution? • Gene sets are highly correlated because many overlap • Must do permutation analysis • How to permute? • Random sets of genes? Or • Random assignments of samples? • P-value or FDR? • Heuristic method • More constrained by annotation than statistics

Practical Issues – II • If a child category is declared significant, how to assess significance of parent category? • Include child category • Consider only genes external to child • In practice big categories are not useful • Small categories may not be well represented on chip • Select categories in middle range: 5-20 represented on chip
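The size filter in the last bullet can be sketched as below. The function name `usable_sets` and the toy data are assumptions for illustration; the 5 to 20 range is taken from the slide.

```python
def usable_sets(gene_sets, genes_on_chip, min_size=5, max_size=20):
    """Keep gene sets with min_size..max_size members represented on the chip.

    gene_sets: dict mapping set name -> iterable of gene identifiers.
    Returns a dict mapping each kept set name to its on-chip members.
    """
    on_chip = set(genes_on_chip)
    kept = {}
    for name, members in gene_sets.items():
        represented = set(members) & on_chip
        if min_size <= len(represented) <= max_size:
            kept[name] = represented
    return kept


# Toy example with made-up identifiers.
chip = [f"g{i}" for i in range(30)]
sets = {
    "small": chip[:3],    # too few genes represented
    "middle": chip[:10],  # inside the 5-20 range
    "big": chip[:30],     # too large to be informative
}
kept = usable_sets(sets, chip)
print(sorted(kept))
```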

Critiques of Discrete Approach • No use of information about size of change • Large t scores count like small t’s • Continuous procedures have more power than discrete procedures on discretized continuous data

GSEA (Gene Set Enrichment Analysis) • Introduced in 2003 by Mootha to address a puzzle in a diabetes data set • No genes significant individually • But Oxidative Phosphorylation mostly up • GSEA tests rank of genes in a gene set against randomly distributed ranks • Kolmogorov-Smirnov test: • Maximum difference between ranks of genes in set and uniform distribution

Kolmogorov-Smirnov Test • Based on statistics of the ‘Brownian Bridge’: a random walk with a fixed end • Maximum difference is the test statistic • Null distribution known • Reformulated by GSEA as the deviation of (CDF – uniform) from the axis
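A minimal sketch of an unweighted running-sum (K-S style) statistic for one gene set. The step sizes are an assumption chosen so that the walk returns to zero at the end of the list, which is the fixed-end property of a Brownian bridge; the function names are made up for this example.

```python
from math import sqrt


def ks_running_sum(ranked_genes, gene_set):
    """Walk down the ranked list, stepping up on set members and down otherwise.

    With N genes and G set members, hits add sqrt((N-G)/G) and misses
    subtract sqrt(G/(N-G)), so the walk ends at zero (up to rounding).
    """
    members = set(gene_set)
    N = len(ranked_genes)
    G = sum(1 for g in ranked_genes if g in members)
    if G == 0 or G == N:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    up = sqrt((N - G) / G)
    down = sqrt(G / (N - G))
    walk, total = [], 0.0
    for g in ranked_genes:
        total += up if g in members else -down
        walk.append(total)
    return walk


def ks_statistic(walk):
    """Signed maximum deviation of the walk from zero."""
    return max(walk, key=abs)


# Toy example: both set members sit at the top of the ranking.
walk = ks_running_sum(["a", "b", "c", "d"], {"a", "b"})
print(walk, ks_statistic(walk))
```

A set whose members cluster at the top of the ranking gives a large positive deviation; members clustered at the bottom give a large negative one.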

GSEA

K-S Test Finds Irrelevant Sets • Sometimes ranks concentrated in middle • K-S statistic high, but not meaningful for pathway change • Fix: ad-hoc weighting by actual t-scores emphasizes departures at extreme ends • No theory • Generate null distribution by permutation
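The weighting fix can be sketched as follows, assuming the commonly described form in which each hit is weighted by the absolute value of its score raised to a power `p`, and each miss steps down by an equal amount. The function name and the default `p=1.0` are assumptions for this example.

```python
def weighted_enrichment_score(ranked, gene_set, p=1.0):
    """Weighted running-sum enrichment score.

    ranked: list of (gene, score) pairs sorted by score, highest first.
    Hits step up in proportion to |score|**p; misses step down by
    1 / (number of non-members). Returns the signed maximum deviation.
    """
    members = set(gene_set)
    hit_weights = [abs(s) ** p for g, s in ranked if g in members]
    n_miss = len(ranked) - len(hit_weights)
    norm = sum(hit_weights)
    if not hit_weights or n_miss == 0 or norm == 0:
        raise ValueError("need hits with non-zero scores and at least one miss")
    running, best = 0.0, 0.0
    for g, s in ranked:
        if g in members:
            running += abs(s) ** p / norm
        else:
            running -= 1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Toy example with made-up genes and t-scores.
ranked = [("a", 3.0), ("b", 2.0), ("c", -1.0), ("d", -4.0)]
print(weighted_enrichment_score(ranked, {"a"}))
```

Because hits in the middle of the list have scores near zero, they barely move the walk, so sets concentrated in the middle no longer score highly. The statistic has no known null distribution, hence the permutation step.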

Group Z- or T- Scores • PAGE: log fold-changes over all genes follow ‘close to’ Normal distribution • Can estimate σ from overall distribution • T-Profiler: under Null Hypothesis, each gene’s t-score follows t distribution ‘near’ N(0,1) distribution • Hence the sum over genes in a specific set G, suitably scaled, is approximately Normal for both PAGE and T-profiler • If most genes in a pathway are up-regulated then gene set scores will be significantly high
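A minimal sketch of the two group scores, under stated assumptions. `page_z` uses the form Z = (set mean − overall mean) · √m / σ, with σ estimated from all genes; `summed_t` scales the sum of per-gene t-scores by √m, which follows the slide's description rather than any particular software. Both function names are made up for this example.

```python
from math import sqrt
from statistics import mean, pstdev


def page_z(scores, gene_set):
    """Z-score for the mean log fold-change of a set of m genes.

    scores: dict mapping gene -> log fold-change over all genes.
    sigma is estimated from the overall distribution of scores.
    """
    values = list(scores.values())
    mu, sigma = mean(values), pstdev(values)
    in_set = [scores[g] for g in gene_set if g in scores]
    m = len(in_set)
    return (mean(in_set) - mu) * sqrt(m) / sigma


def summed_t(t_scores, gene_set):
    """Sum of per-gene t-scores in the set, scaled to unit variance.

    Assumes each t-score is roughly N(0, 1) and independent under the null.
    """
    in_set = [t_scores[g] for g in gene_set if g in t_scores]
    return sum(in_set) / sqrt(len(in_set))


# Toy example with made-up values.
toy = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 4.0}
print(page_z(toy, ["c", "d"]), summed_t(toy, ["a", "b"]))
```

Both scores treat genes as independent, which is the IID assumption criticized on the next slide.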

Issues and Critiques • Same issues as discrete approach • Null distribution by permuting samples • GSEA finally gets that right in 2005 • Null distribution for Z-test assumes IID • Methods assume all meaningful changes in same direction • Don’t use information about normal co-variation
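Building the null distribution by permuting samples can be sketched as below. The set score used here (mean difference in group means across the set's genes) is only a stand-in for whichever statistic is being tested, and the function names, the 0/1 label coding, and the toy data are assumptions.

```python
import random
from statistics import mean


def set_score(expr, labels, gene_set):
    """Mean over genes in the set of (mean of group 1 - mean of group 0)."""
    diffs = []
    for g in gene_set:
        case = [x for x, lab in zip(expr[g], labels) if lab == 1]
        ctrl = [x for x, lab in zip(expr[g], labels) if lab == 0]
        diffs.append(mean(case) - mean(ctrl))
    return mean(diffs)


def sample_permutation_p(expr, labels, gene_set, n_perm=999, seed=0):
    """Two-sided p-value from shuffling sample labels.

    Shuffling labels keeps each sample's expression profile intact,
    so gene-gene correlation is preserved in the null distribution.
    """
    rng = random.Random(seed)
    observed = set_score(expr, labels, gene_set)
    shuffled = list(labels)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(set_score(expr, shuffled, gene_set)) >= abs(observed):
            extreme += 1
    return (extreme + 1) / (n_perm + 1)


# Toy data: two made-up genes, five controls followed by five cases.
labels = [0] * 5 + [1] * 5
expr = {
    "g1": [1.0, 1.2, 0.9, 1.1, 1.0, 2.0, 2.2, 1.9, 2.1, 2.0],
    "g2": [0.5, 0.4, 0.6, 0.5, 0.5, 1.5, 1.4, 1.6, 1.5, 1.5],
}
p_perm = sample_permutation_p(expr, labels, ["g1", "g2"])
print(p_perm)
```

Permuting genes instead would break the correlation among genes in a set and tends to give p-values that are too small.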

Why Is Covariation Important? • Most cellular processes are homeostatic: • They find a good functional set-point • Coping with variation in inputs … • … AND in specific regulatory couplings • Most of us have regulatory SNPs that vary expression by a factor of two or more • Other genes are expressed at somewhat different levels to accommodate key processes

Multivariate Approaches • Classical multivariate methods • Multi-dimensional Scaling • Hotelling’s T² • Machine learning approaches • Topological score relative to network • Prediction by machine learning tool • e.g. ‘random forest’

PCA • Three correlated variables • PC1 lies along the direction of maximal variation • PC2 lies at right angles to PC1, along the direction of next-highest variation

Multi-Dimensional Scaling • Aim: to represent graphically the most information about relationships among samples with multi-dimensional attributes in 2 (or 3) dimensions • Algorithm: • Transform distances into cross-product matrix • Initial PCA onto 2 (or 3) axes • Deform until better representation • Minimize a ‘strain’ measure of the mismatch between the original and the low-dimensional representation
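A minimal sketch of the first two algorithm steps (classical MDS), assuming NumPy is available. The iterative deformation step is not shown, and the strain formula used here, the relative mismatch between the cross-product matrix and the low-dimensional inner products, is one common definition rather than necessarily the one on the slide.

```python
import numpy as np


def classical_mds(D, k=2):
    """Classical MDS of an n x n distance matrix D into k dimensions.

    Returns (X, strain): X is n x k coordinates; strain is the relative
    mismatch between the cross-product matrix and X @ X.T.
    """
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    # Double-centre the squared distances to get the cross-product matrix.
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J
    # Project onto the k leading eigenvectors (the PCA step).
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:k]
    lam = np.clip(eigvals[order], 0.0, None)
    X = eigvecs[:, order] * np.sqrt(lam)
    strain = np.sqrt(np.sum((B - X @ X.T) ** 2) / np.sum(B ** 2))
    return X, strain


# Toy example: three points in the plane, so two dimensions suffice.
pts = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]])
D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
X, strain = classical_mds(D, k=2)
D_hat = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
print(strain, np.allclose(D, D_hat))
```

The recovered coordinates match the originals only up to rotation and reflection, which is why the check compares distances rather than coordinates.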

Separating Using MDS • Figure: left, distributions of individual variables; right, MDS plot (in this case PCA)

MDS for Pathways • BAD pathway: controlled cell death (plot legend: Normal, IBC, Other BC) • Clear separation between groups • Cancer samples don’t have coherent variation

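Hotelling’s T² • Compute distance between sample means using (common) metric of covariation: $T^2 = \frac{n_1 n_2}{n_1 + n_2} (\bar{x}_1 - \bar{x}_2)^\top S^{-1} (\bar{x}_1 - \bar{x}_2)$ • Where $\bar{x}_1, \bar{x}_2$ are the group mean vectors, $n_1, n_2$ the group sizes, and $S$ the pooled covariance matrix • Multidimensional analog of t (actually F) statistic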
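A minimal sketch of the standard two-sample Hotelling statistic with a pooled covariance estimate, assuming NumPy is available. The function name and the toy data are assumptions; the F conversion uses p and n1 + n2 − p − 1 degrees of freedom.

```python
import numpy as np


def hotelling_t2(X1, X2):
    """Two-sample Hotelling T^2 and its F-scaled version.

    X1, X2: arrays of shape (n1, p) and (n2, p), samples in rows
    and the genes of one set in columns.
    Requires n1 + n2 - 2 >= p so that the pooled covariance is invertible.
    """
    X1 = np.asarray(X1, dtype=float)
    X2 = np.asarray(X2, dtype=float)
    n1, p = X1.shape
    n2 = X2.shape[0]
    d = X1.mean(axis=0) - X2.mean(axis=0)
    # Pooled (common) covariance across the two groups.
    S1 = np.atleast_2d(np.cov(X1, rowvar=False))
    S2 = np.atleast_2d(np.cov(X2, rowvar=False))
    S = ((n1 - 1) * S1 + (n2 - 1) * S2) / (n1 + n2 - 2)
    t2 = (n1 * n2) / (n1 + n2) * float(d @ np.linalg.solve(S, d))
    f_stat = (n1 + n2 - p - 1) / (p * (n1 + n2 - 2)) * t2
    return t2, f_stat


# With a single variable, T^2 reduces to the squared two-sample t statistic.
t2, f_stat = hotelling_t2([[1.0], [2.0], [3.0]], [[2.0], [4.0], [6.0]])
print(t2, f_stat)
```

For a gene set with more genes than samples the pooled covariance is singular and this plain version fails, which is the small-sample problem raised under Issues.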

Principles of Kong et al. Method • Normal covariation generally acts to preserve homeostasis • The transcription of genes that participate in many processes will be changed • The joint changes in genes will be most distinctive for those genes active in pathways that are working differently

Issues • Not robust to outliers • In practice this may not matter much (?) • Assumes same covariance in each sample • Small samples -> unreliable S estimates • Loss of power • Robust / Regularized Methods improve sensitivity by up to a factor of 10! • Yates & Reimers (in prep)

Overall Assessment • Gene sets are somewhat arbitrary • Most ‘modules’ overlap extensively with others • Many ‘modules’ act by protein modification rather than gene expression • Current methods represent a first attempt to bring biological information to bear on the significance problem
