Lecture 9: Gene expression analysis/Clustering

CSCI 5461: Functional Genomics, Systems Biology and Bioinformatics (Spring 2012) Lecture 9: Gene expression analysis/Clustering Chad Myers Department of Computer Science and Engineering University of Minnesota cmyers@cs.umn.edu

Outline for today • Finish data normalization/processing for microarrays/RNA seq • Begin section on clustering gene expression data • Paper discussion: M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein. Cluster analysis and display of genome-wide expression patterns. PNAS, 95:14863–14868, 1998.

Homework #1 assigned Due: 3/1, before midnight http://www-users.cselabs.umn.edu/classes/Spring-2012/csci5461/index.php?page=homework • Differential expression analysis of breast cancer metastatic vs. non-metastatic tumors • Apply t-test and rank-sum as well as 2 methods of multiple hypothesis correction • Guidelines: • you may work with a partner in teams of at most 2 • turn in 1 copy of your homework with both names • both team members should be able to describe/defend what was done

Microarray fabrication slides online http://www-users.cselabs.umn.edu/classes/Spring-2012/csci5461/files/Microarray_fabrication.pdf Thanks Seung Ho!

Gene expression data normalization and pre-processing (microarrays and RNA-seq)

The Starting Point: The Ratio (2-color arrays)

Log ratios treat up- and down-regulated genes equally (two-color arrays) • log2(1) = 0 • log2(2) = 1 • log2(1/2) = -1
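A minimal sketch of the point above: a 2-fold increase and a 2-fold decrease give log ratios of equal size and opposite sign. The function name `log_ratio` and the choice of red over green as the numerator are assumptions made for illustration.

```python
import math


def log_ratio(red, green):
    """Return log2(red / green) for one spot on a two-color array."""
    return math.log2(red / green)


# Raw ratios 2 and 0.5 are asymmetric around 1; their log2 values are
# symmetric around 0.
up = log_ratio(200.0, 100.0)      # 2-fold up
down = log_ratio(100.0, 200.0)    # 2-fold down
unchanged = log_ratio(150.0, 150.0)
```

On the raw scale, down-regulation is squeezed into the interval (0, 1) while up-regulation spans (1, infinity); the log transform removes that asymmetry.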

A note about Affymetrix (1-color) pre-processing • Log transform the probe intensities [Figure: typical Affymetrix probe intensity distribution, before and after log-transform]

Normalization overview • Within-array • Correct systematic differences among probes within the same chip (spatial, intensity, …) • Between-array • Correct global biases across multiple arrays, so that comparisons can be made

Within-array normalization: Spatial biases Solution: spatial background estimation/subtraction

Intensity-dependent normalization (Yang, Speed) (Lowess – local linear fit) • Compensate for intensity-dependent biases

Detecting intensity-dependent biases: M vs A plots (also called R-I plots) • X axis: A – average intensity, A = 0.5*log(Cy3*Cy5) • Y axis: M – log ratio, M = log(Cy3/Cy5)
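The M and A values can be computed directly from the two channel intensities. The sketch below assumes base-2 logarithms (the slide writes only "log") and uses a hypothetical helper name, `ma_values`.

```python
import math


def ma_values(cy3, cy5):
    """Return (M, A) lists for paired Cy3/Cy5 spot intensities.

    M = log2(Cy3 / Cy5)        log ratio (Y axis)
    A = 0.5 * log2(Cy3 * Cy5)  average log intensity (X axis)
    """
    m = [math.log2(g / r) for g, r in zip(cy3, cy5)]
    a = [0.5 * math.log2(g * r) for g, r in zip(cy3, cy5)]
    return m, a
```

Plotting M against A for every spot gives the M vs A plot; a trend in M that changes with A indicates an intensity-dependent bias.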

[Figure: M vs A plot showing intensity-dependent bias. Y axis: M = log(Cy3/Cy5); X axis: A, from low to high intensities. M>0: Cy3>Cy5; M<0: Cy3<Cy5]

We expect the M vs A plot to be centered on M = 0 across the whole range of A [Figure: expected M vs A plot; Y axis: M = log(Cy3/Cy5), X axis: A]

LOWESS (Locally Weighted Scatterplot Smoothing) • Estimates values of log(Cy5/Cy3) as a function of log(Cy3*Cy5) • Local linear regression model • Tri-cube weight function • Least squares
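The three ingredients listed above (local linear model, tri-cube weights, least squares) can be sketched in plain Python as follows. This is a simplified illustration, not the exact procedure of Yang and Speed: it omits the robustness iterations of full LOWESS, and the names `lowess_fit`, `lowess_normalize`, and the `frac` window parameter are assumptions.

```python
def tricube(u):
    """Tri-cube weight (1 - |u|^3)^3 for |u| < 1, else 0."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0


def lowess_fit(x, y, frac=0.5):
    """Fit a weighted local line around each x[i]; return the fitted values."""
    n = len(x)
    k = min(n, max(3, int(round(frac * n))))  # points in each local window
    fitted = []
    for x0 in x:
        # Bandwidth: distance from x0 to its k-th nearest neighbour.
        dists = sorted(abs(xj - x0) for xj in x)
        h = dists[k - 1]
        if h == 0.0:
            h = 1e-12  # guard against many identical x values
        w = [tricube((xj - x0) / h) for xj in x]
        # Closed-form weighted least squares for a straight line.
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0.0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted


def lowess_normalize(m, a, frac=0.5):
    """Subtract the intensity-dependent trend of M (as a function of A)."""
    trend = lowess_fit(a, m, frac)
    return [mi - ti for mi, ti in zip(m, trend)]
```

After normalization the residual M values should scatter around zero at every intensity. This version costs O(n^2 log n) because it re-sorts distances for every point, so for real arrays an optimized library routine would be the practical choice.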

Other strategies for between-array normalization • Global (linear) scaling • Quantile normalization

Global linear normalization • A linear normalization (scale and/or shift) is computed to balance chips/channels: Xi_norm = k*Xi, or log2 R/G → log2 R/G – c (2-color) • Equalizes the mean (median) intensity/ratio and variance among compared chips • Assumption: the total RNA (mass) used is the same for both samples; averaged across thousands of genes, total hybridization should be the same for both samples

Global Normalization (2-color) • log2 R/G → log2 R/G – c, where e.g. c = log2 (∑Ri / ∑Gi) or c = (1/n) ∑ log2 (Ri/Gi) [Figure: frequency histograms of log-ratios, un-normalized vs. normalized]
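A minimal sketch of the shift log2 R/G → log2 R/G – c, with c chosen either as the mean log ratio or as the log of the total-intensity ratio. The function name `global_normalize` and the `method` argument are assumptions made for illustration.

```python
import math


def global_normalize(red, green, method="mean"):
    """Center two-color log ratios by subtracting a single constant c.

    method="mean":  c = mean of log2(R_i / G_i)
    method="total": c = log2(sum(R_i) / sum(G_i))
    Returns (normalized_log_ratios, c).
    """
    log_ratios = [math.log2(r / g) for r, g in zip(red, green)]
    if method == "mean":
        c = sum(log_ratios) / len(log_ratios)
    elif method == "total":
        c = math.log2(sum(red) / sum(green))
    else:
        raise ValueError("method must be 'mean' or 'total'")
    return [lr - c for lr in log_ratios], c
```

With `method="mean"` the normalized log ratios average exactly zero. The "total" variant is dominated by the brightest spots, which is one reason a median-based constant is also commonly used.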

Quantile Normalization • One of the most widely used methods (implemented in many standard tools) • Ignores causes of variation and technical covariates • Shoehorns all data into the same shape distribution – matching quantiles

Motivation: probe intensity distribution across 23 Affy replicates (black is global)

Quantile Normalization • xnorm = F2^(-1)(F1(x)), where F1 is the cumulative distribution function of the array’s probe intensities and F2 is that of the reference distribution • Assumes: gene distribution changes little (Irizarry et al 2002) [Figure: density functions and cumulative distribution functions F1(x) and F2(x), mapping x to y]
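The mapping xnorm = F2^(-1)(F1(x)) amounts to replacing each value by the reference value of the same rank. The sketch below assumes the reference distribution is the mean of the sorted arrays (a common choice, not stated on the slide) and breaks ties by position rather than averaging them; `quantile_normalize` is a hypothetical name.

```python
def quantile_normalize(arrays):
    """Quantile-normalize equal-length intensity lists, one list per chip."""
    n = len(arrays[0])
    if any(len(a) != n for a in arrays):
        raise ValueError("all arrays must have the same number of probes")

    # Reference distribution: mean across chips of the i-th smallest value.
    sorted_arrays = [sorted(a) for a in arrays]
    reference = [
        sum(s[i] for s in sorted_arrays) / len(arrays) for i in range(n)
    ]

    normalized = []
    for a in arrays:
        # order[rank] is the index of the probe with that rank on this chip.
        order = sorted(range(n), key=lambda i: a[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized
```

After this step every chip has exactly the same set of values, only in a different probe order, which is why the method "shoehorns" all arrays into one distribution.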

Quantile normalization applied to 23 Affy replicates (black is reference distribution)

Which genes/probes should be used for between-array normalization? • All genes on the chip • Housekeeping genes (pre-defined) • “Invariant” genes (estimated from the data) • Spiked-in controls (each method has advantages/disadvantages in different settings)

Normalization - tools • Normalization is typically provided in microarray vendor’s software/core facilities but you should always understand the data you’re working with • How has your data been processed? • Are there any lingering effects? • Bioconductor (both Affymetrix and two-color): • Many different packages implemented in R language: affy, oligo, limma • dChip (Affymetrix): • Quantile, Invariant set • MAANOVA • Microarray ANOVA analysis For Affymetrix arrays specifically: • MAS 5.0 (now GCOS/GDAS) by Affymetrix (compares PM and MM probes) • RMA by Speed group (UC Berkeley) (ignores MM probes)

General rules of thumb for dealing with systematic biases in arrays • Keep track of all possible labels! • Careful experiment design (randomization goes a long way!) • Replicates (both biological and technical) • Investigate for possible systematic effects • Check global intensity, log-ratio distributions across arrays in your dataset • ANOVA (Analysis of Variance) including various experimental factors • Correct obvious effects where possible • Generalized linear models • Physical error models

Summary for normalization • Systematic biases exist in microarray data • Normalization can help to remove these biases, to leave the biology behind • Within-array vs. between-array normalization methods • Design experiments to minimize biases in the first place!

RNA sequencing data processing

Reads in RNA-seq [Figure: reads’ mappings at chromosome and transcript level, showing Exons A to D on the chromosome and in the transcript] (slide from Daniel Nicorici)

General steps of RNA-seq analysis • Filtering of short reads • Aligning the reads against a reference genome • Computational analysis of reads’ alignments • compute summary statistic for the gene/transcript/exon expression • find new/known alternative splicing events • find new/known fusion genes • find new/known SNPs • Visualization (slide from Daniel Nicorici)

RNA Seq analysis workflow [Diagram: Illumina Pipeline (FASTQ); Alignment (BAM); FASTX Toolkit (FASTQ/FASTA); Expression profiles/RNA abundance; Splice variants; SNP analysis] (slide from Alexander Kanapin)

Summary of read mapping tools (slide from Wing Hung Wong)

Summarizing expression values at the gene level • FPKM: Fragments Per Kilobase of exon model per Million mapped fragments • C = the number of reads mapped onto the gene's exons • N = the total number of mapped reads in the experiment • L = the summed length of the gene's exons in base pairs • Mortazavi A et al. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008. (slide from Alexander Kanapin)
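The formula itself did not survive on this slide. With C, N, and L as defined above, the usual definition is FPKM = 10^9 * C / (N * L): the count is divided by the exon length in kilobases (L / 10^3) and by the library size in millions (N / 10^6). The sketch below assumes that definition; the function name `fpkm` is hypothetical. Mortazavi et al. state the measure in terms of reads (RPKM); FPKM is the same calculation applied to fragments.

```python
def fpkm(c, n, l):
    """Fragments per kilobase of exon model per million mapped fragments.

    c: fragments mapped to the gene's exons
    n: total mapped fragments in the experiment
    l: summed exon length of the gene, in base pairs
    """
    if n <= 0 or l <= 0:
        raise ValueError("n and l must be positive")
    return 1e9 * c / (n * l)
```

For example, 1,000 fragments on a gene with 2,000 bp of exons in a library of 20 million mapped fragments gives 1000 / 2 kb / 20 M = 25 FPKM. Dividing by L makes long and short genes comparable; dividing by N makes libraries of different depth comparable.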

Examples of RNA-seq visualization Visualization using MapView (slide from Daniel Nicorici)

Examples of RNA-seq visualization – cont’d Coverage plot (slide from Daniel Nicorici)

Examples of RNA-seq visualization – cont’d Coverage plot visualization [Figure: coverage plots for gene ERBB2 in breast cancer (normalized coverage from 0.00 to 130.71) and in normal breast (normalized coverage from 0.00 to 4.41)] (slide from Daniel Nicorici)

Differential expression analysis for RNA seq • Parametric methods: counts are modeled using known probability distributions such as Binomial, Poisson, Negative Binomial, etc. • Example software (all available in R’s Bioconductor): • edgeR (Robinson et al., 2010): exact test based on the Negative Binomial distribution • DESeq (Anders and Huber, 2010): exact test based on the Negative Binomial distribution (more flexible dependence of variance on mean) • DEGseq (Wang et al., 2010): MA-plot based method (assumes a normal distribution conditioned on A, the average read count) • baySeq (Hardcastle et al., 2010): empirical Bayesian approach assuming either Poisson or NB distributions.

Variance properties of RNA seq data • Gene counts could be modeled with a multinomial distribution (or the Poisson approximation to the multinomial) • Poisson: mean = λ, variance = λ (http://www.itl.nist.gov/div898/handbook/eda/section3/eda366j.htm) • The Poisson approximation isn’t valid due to overdispersion! (Anders and Huber, Genome Biology 2010, 11:R106)
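One way to see overdispersion is to compare the sample variance of a gene's counts across replicates with its mean: under a Poisson model the ratio should be close to 1. The sketch below is illustrative only; the helper names are hypothetical, and the Negative Binomial variance is written in the common mu + phi * mu^2 parameterization, which is an assumption about notation rather than something taken from the slide.

```python
from statistics import mean, variance


def dispersion_index(counts):
    """Sample variance divided by the mean (about 1 for Poisson counts)."""
    return variance(counts) / mean(counts)


def negative_binomial_variance(mu, phi):
    """Variance of a Negative Binomial with mean mu and dispersion phi.

    With phi = 0 this reduces to the Poisson variance (equal to mu).
    """
    return mu + phi * mu ** 2
```

A dispersion index well above 1 across many genes suggests that replicate-to-replicate variability exceeds what Poisson sampling alone would produce, which is the motivation for the Negative Binomial models used by edgeR and DESeq.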

Summary of software tools for RNA seq • Short reads aligners • Stampy, BWA, Novoalign, Bowtie,… • Data preprocessing (reads statistics, adapter clipping, formats conversion, read counters) • Fastx toolkit • Htseq • MISO • samtools • Expression studies • Cufflinks package • RSEQtools • R packages (DESeq, edgeR, baySeq, DEGseq) • Alternative splicing • Cufflinks • Augustus • Commercial software • Partek • CLCBio (slide from Alexander Kanapin)

Clustering analysis of gene expression data

Gene expression data analysis • Statistical Analysis • Unsupervised Analysis – clustering: K-means, Self-Organizing Maps, Hierarchical Clustering, CLICK, Biclustering, DBSCAN, OPTICS, DENCLUE, … • Supervised Analysis • Pattern Analysis • Visualization & Decomposition

Some Concepts • Unsupervised analysis/learning: clustering, pattern mining, principal components analysis (PCA), ... • Supervised analysis/learning: Classification, regression, differential hypothesis tests, ...

Introduction to clustering • What is clustering? • Process of grouping a set of objects (e.g. tumor samples or genes) into classes of similar objects • Why cluster? • Find natural structure in the data, not just differences that correspond with things we know about (e.g. differential expression)

Example: Clustering samples (or conditions) • Tumors with similar signatures may respond to the same treatment! [Figure: clustered expression data; axes: Genes, Samples] Garber, Troyanskaya et al. Diversity of gene expression in adenocarcinoma of the lung. PNAS 2001, 98(24):13784-9.

Example: clustering genes Genes with similar profiles may perform similar functions! Genomic expression programs in the response of yeast cells to environmental changes. Mol Biol Cell. 2000 Dec;11(12):4241-57. Gasch AP, Spellman PT, Kao CM, Carmel-Harel O, Eisen MB, Storz G, Botstein D, Brown PO.

Example: clustering genes [Figure: axes: Gene expression, Environmental conditions]

Clustering – why? • Reduce the dimensionality of the problem – identify the major patterns in the dataset • Gene Clusters: Co-expression • Common functional annotations • Co-regulation in regulatory networks • Experiment Clusters: subclasses • Subtypes of tumors • Similar experimental conditions

How do we define a clustering method? • We need to define what we mean by similarity: when are two genes’ profiles similar? • We need to define an algorithm (i.e. a recipe) for grouping “similar” objects

Hierarchical Clustering • Organize the genes in a hierarchical tree structure • Initial step: each gene is regarded as a cluster with one item • Find the 2 most similar clusters and merge them into a common node • The length of the branch is proportional to the distance • Iterate on merging nodes until all genes are contained in one cluster, the root of the tree. [Figure: dendrogram over genes g1 to g5 with nodes {1,2}, {1,2,3}, {4,5}, and root {1,2,3,4,5}]
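The merging procedure above can be sketched as a naive agglomerative loop. The slide does not say how the distance between two multi-gene clusters is measured; this sketch assumes Euclidean distance between profiles and average linkage between clusters, and all function names are hypothetical.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def average_linkage(c1, c2, profiles):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(profiles[i], profiles[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def hierarchical_cluster(profiles):
    """Return the merge history as (cluster_a, cluster_b, distance) tuples.

    Clusters are tuples of gene indices; the last merge is the tree root.
    """
    clusters = [(i,) for i in range(len(profiles))]  # one gene per cluster
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest linkage distance.
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = average_linkage(clusters[x], clusters[y], profiles)
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        merges.append((clusters[x], clusters[y], d))
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return merges
```

The recorded distances are what a dendrogram draws as branch lengths. This version recomputes every pairwise linkage at each step, so it is only practical for small examples.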

How do we define similarity? • Clustering requires us to define what we mean by similarity (Example: what do you mean when you say two genes have “similar” profiles?) • Possible metrics • Euclidean distance • Pearson correlation • Dot product (cosine similarity) • Mutual information • … • Which metric you pick depends on what kind of similarity you hope to find. [Figure: gene expression profiles plotted against time/environment]
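To make the choice of metric concrete, the sketch below implements three of the listed measures in plain Python. The function names are hypothetical, and the profiles used in the note after the code are made-up illustrative values, not data from the lecture.

```python
import math


def euclidean_distance(x, y):
    """Straight-line distance; sensitive to the magnitude of expression."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_correlation(x, y):
    """Correlation of the two profiles after centering each on its mean."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def cosine_similarity(x, y):
    """Dot product of the profiles after scaling each to unit length."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)
```

For the toy profiles `[1, 2, 3]` and `[10, 20, 30]`, Pearson correlation and cosine similarity are both 1 (same shape), while the Euclidean distance is large (different magnitude). Two genes can therefore be "similar" under one metric and far apart under another.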
