## The second-simplest cDNA microarray data analysis problem


**The second-simplest cDNA microarray data analysis problem**

Terry Speed, UC Berkeley

Fred Hutchinson Cancer Research Center, March 9, 2001

[Diagram: cDNA microarray workflow. cDNA clones (probes); PCR product amplification, purification, printing (0.1 nl/spot); mRNA target; hybridise target to microarray; excitation by laser 1 and laser 2, emission, scanning; overlay images and normalise; analysis.]

**Biological question**

Biological question (differentially expressed genes, sample class prediction, etc.) → Experimental design → Microarray experiment → 16-bit TIFF files → Image analysis → (Rfg, Rbg), (Gfg, Gbg) → Normalization → R, G → Estimation, Testing, Clustering, Discrimination → Biological verification and interpretation

**Some motherhood statements**

Important aspects of a statistical analysis include:

- Tentatively separating systematic from random sources of variation
- Removing the former and quantifying the latter, when the system is in control
- Identifying and dealing with the most relevant source of variation in subsequent analyses

Only if this is done can we hope to make more or less valid probability statements.

**The simplest cDNA microarray data analysis problem is identifying differentially expressed genes using one slide**

- This is a common enough hope
- Efforts are frequently successful
- It is not hard to do by eye
- The problem is probably beyond formal statistical inference (valid p-values, etc.) for the foreseeable future, and here's why

**An M vs. A plot**

$M = \log_2(R / G)$, $A = \tfrac{1}{2}\log_2(R \cdot G)$

**Background matters**

[Figure: two panels, "From Spot" and "From GenePix".]

[Figure: two panels, "No background correction" and "With background correction". From the NCI60 data set (Stanford web site).]

**Background makes a difference**

| Background method | Segmentation method | Exp1 | Exp2 |
|---|---|---|---|
| No background | S.nbg | 6 | 6 |
| | Gp.nbg | 7 | 6 |
| | SA.nbg | 6 | 6 |
| | QA.fix.nbg | 7 | 6 |
| | QA.hist.nbg | 7 | 6 |
| | QA.adp.nbg | 14 | 14 |
| Local surrounding | S.valley | 17 | 21 |
| | GP | 11 | 11 |
| | SA | 12 | 14 |
| | QA.fix | 18 | 23 |
| | QA.hist | 9 | 8 |
| | QA.adp | 27 | 26 |
| Others | S.morph | 9 | 9 |
| | S.const | 14 | 14 |

Medians of the SD of log2(R/G) for 8 replicated spots, multiplied by 100 and rounded to the nearest integer.

**Normalisation - lowess**

- Global lowess (Matt Callow's data, LBNL)
- Assumption: changes roughly symmetric at all intensities.

**Normalisation - print tip**

Assumption: for every print group, changes roughly symmetric at all intensities.

**Normalization (ctd) Another data set**

- After within-slide global lowess normalization.
- Likely to be a spatial effect.

[Figure: log-ratios by print-tip group.]

**Taking scale into account**

Assumption: all print-tip groups have the same spread in M.

The true log ratio is $m_{ij}$, where $i$ indexes print-tip groups and $j$ indexes spots. The observed value is $M_{ij}$, where $M_{ij} = a_i m_{ij}$.

A robust estimate of $a_i$ is

$$\mathrm{MAD}_i = \mathrm{median}_j \{\, |M_{ij} - \mathrm{median}_j(M_{ij})| \,\}$$

**Normalization (ctd) That same data set**

- After print-tip location and scale normalization.
- Incorporate quality measures.

[Figure: log-ratios by print-tip group.]

**Matt Callow's Srb1 dataset (#5). Newton's and Chen's single slide method**

**Matt Callow's Srb1 dataset (#8). Newton's, Sapir & Churchill's and Chen's single slide method**

**The approach of Roberts et al (Rosetta)**

Genomic DNA vs. genomic DNA. Data from Bing Ren.

**The second simplest cDNA microarray data analysis problem is identifying differentially expressed genes using replicated slides**

There are a number of different aspects:

- First, between-slide normalization; then
- What should we look at: averages, SDs, t-statistics, other summaries?
- How should we look at them?
- Can we make valid probability statements?

A report on work in progress.

**Normalization (ctd) Yet another data set**

- Between slides this time (10 here)
- Only small differences in spread apparent
- We often see much greater differences

[Figure: log-ratios by slide.]

**Lowess Normalized M**

[Figure: Apo A1 experiments.]

**Lowess Normalized M**

[Figure: Srb1 experiments.]

**Taking scale into account**

Assumption: all slides have the same spread in M.

The true log ratio is $m_{ij}$, where $i$ indexes slides and $j$ indexes spots. The observed value is $M_{ij}$, where $M_{ij} = a_i m_{ij}$.

A robust estimate of $a_i$ is

$$\mathrm{MAD}_i = \mathrm{median}_j \{\, |M_{ij} - \mathrm{median}_j(M_{ij})| \,\}$$

**Which genes are (relatively) up/down regulated?**

Two samples, e.g. KO vs. WT or mutant vs. WT.

[Design diagram: T vs. C, n slides.]

For each gene form the t statistic:

$$t = \frac{\text{average of } n \text{ trt } M\text{s}}{\sqrt{\tfrac{1}{n}\,(\text{SD of } n \text{ trt } M\text{s})^2}}$$

**Which genes are (relatively) up/down regulated?**

Two samples with a reference (e.g. pooled control).

[Design diagram: T vs. C* (n slides) and C vs. C* (n slides).]

For each gene form the t statistic:

$$t = \frac{\text{average of } n \text{ trt } M\text{s} - \text{average of } n \text{ ctl } M\text{s}}{\sqrt{\tfrac{1}{n}\left[(\text{SD of } n \text{ trt } M\text{s})^2 + (\text{SD of } n \text{ ctl } M\text{s})^2\right]}}$$

**Samples: liver tissue from mice treated with cholesterol-modifying drugs.**

Question 1: Find genes that respond differently between the treatment and the control.

Question 2: Find genes that respond similarly across two or more treatments relative to control.

One factor: more than 2 samples.

[Design diagram: T1, T2, T3 and T4 each compared with C, × 2.]

**Samples: tissues from different regions of the mouse olfactory bulb.**

Question 1: Differences between different regions.

Question 2: Identify genes with pre-specified patterns across regions.

One factor: more than 2 samples.

[Design diagram: T1, T2, T3, T4, T5, T6.]

**Two or more factors**

6 different experiments at each time point. Dye swaps. 4 time points (30 minutes, 1 hour, 4 hours, 24 hours). 2 x 2 x 4 factorial experiment.

[Design diagram: ctl, OSM, EGF, OSM & EGF; 4 times.]

**Which genes have changed? When permutation testing possible**

1. For each gene and each hybridisation (8 ko + 8 ctl), use $M = \log_2(R/G)$.
2. For each gene form the t statistic:

   $$t = \frac{\text{average of 8 ko } M\text{s} - \text{average of 8 ctl } M\text{s}}{\sqrt{\tfrac{1}{8}\left[(\text{SD of 8 ko } M\text{s})^2 + (\text{SD of 8 ctl } M\text{s})^2\right]}}$$

3. Form a histogram of 6,000 t values.
4. Do a normal Q-Q plot; look for values "off the line".
5. Permutation testing.
6. Adjust for multiple testing.

**Histogram & qq plot**

[Figure: ApoA1.]

**Apo A1: Adjusted and unadjusted p-values for the 50 genes with the largest absolute t-statistics.**

**Which genes have changed? Permutation testing not possible**

Our current approach is to use averages, SDs, t-statistics and a new statistic we call B, inspired by empirical Bayes. We hope in due course to calibrate B and use that as our main tool. We begin with the motivation, using data from a study in which each slide was replicated four times.

[Two figures, panels labelled M, t, and t M. Results from the Apo AI ko experiment.]

[Two figures, panels labelled M, B, t and combinations of these. Results from the SR-BI transgenic experiment.]

**Extensions include dealing with**

- Replicates within and between slides
- Several effects: use a linear model
- ANOVA: are the effects equal?
- Time series: selecting genes for trends

**Rosetta once more: In vivo binding sites of Gal4p in galactose**

[Figure labels: P < 0.001; un-enriched DNA (Cy3); antibody-enriched DNA (Cy5).]

**Summary (for the second simplest problem)**

- Microarray experiments typically have thousands of genes, but only a few (1-10) replicates for each gene.
- Averages can be driven by outliers.
- Ts can be driven by tiny variances.
- B = LOR will, we hope
  - use information from all the genes
  - combine the best of M and T
  - avoid the problems of M and T
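
The quantities defined in the slides (M and A for a spot, the MAD as a robust scale estimate, and the two-sample t statistic with equal group sizes) can be sketched in a few lines of Python using only the standard library. This is a minimal illustration and not the authors' code. All function names are hypothetical. Two choices are assumptions that the slides do not state: rescaling each group to the geometric mean of the group MADs, and computing a simple per-gene permutation p-value with no multiple-testing adjustment.

```python
import math
import random
import statistics


def ma_values(r, g):
    """Return (M, A) for one spot: M = log2(R/G), A = log2(R*G)/2."""
    return math.log2(r / g), math.log2(r * g) / 2


def mad(values):
    """Median absolute deviation: median_j |M_j - median(M)|."""
    values = list(values)
    centre = statistics.median(values)
    return statistics.median(abs(v - centre) for v in values)


def scale_normalise(groups):
    """Rescale each group of M values so that all groups share one MAD.

    Assumption (not in the slides): the common scale is the geometric
    mean of the group MADs. Every group must have a non-zero MAD.
    """
    mads = [mad(group) for group in groups]
    common = math.exp(sum(math.log(s) for s in mads) / len(mads))
    return [[m * common / s for m in group]
            for group, s in zip(groups, mads)]


def t_statistic(trt, ctl):
    """Two-sample t with equal group sizes n:
    (mean trt - mean ctl) / sqrt((SD_trt^2 + SD_ctl^2) / n)."""
    n = len(trt)
    if len(ctl) != n:
        raise ValueError("this sketch assumes equal group sizes")
    numerator = statistics.mean(trt) - statistics.mean(ctl)
    pooled = statistics.variance(trt) + statistics.variance(ctl)
    return numerator / math.sqrt(pooled / n)


def permutation_p_value(trt, ctl, n_perm=2000, seed=0):
    """Two-sided permutation p-value for one gene (no multiplicity
    adjustment): the share of random relabellings whose |t| is at
    least as large as the observed |t|."""
    rng = random.Random(seed)
    observed = abs(t_statistic(trt, ctl))
    pooled = list(trt) + list(ctl)
    n = len(trt)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if abs(t_statistic(pooled[:n], pooled[n:])) >= observed:
            hits += 1
    return hits / n_perm


if __name__ == "__main__":
    # Made-up M values for one gene on 8 ko and 8 ctl slides.
    ko = [2.1, 1.9, 2.3, 2.0, 2.2, 1.8, 2.4, 2.0]
    ctl = [0.1, -0.2, 0.0, 0.3, -0.1, 0.2, -0.3, 0.1]
    print("t =", t_statistic(ko, ctl))
    print("permutation p =", permutation_p_value(ko, ctl))
```

In a real analysis these functions would be applied gene by gene across all spots, and the resulting p-values would still need the multiple-testing adjustment listed in the slides.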