The second-simplest cDNA microarray data analysis problem

Terry Speed, UC Berkeley. Bioinformatic Strategies for Application of Genomic Tools to Environmental Health Research, March 5, 2001. NIEHS National Center for Toxicogenomics; NCSU Bioinformatics Research Center.


Presentation Transcript


  1. The second-simplest cDNA microarray data analysis problem. Terry Speed, UC Berkeley. Bioinformatic Strategies for Application of Genomic Tools to Environmental Health Research, March 5, 2001. NIEHS National Center for Toxicogenomics; NCSU Bioinformatics Research Center.

  2. Workflow of a microarray study:
     • Biological question: differentially expressed genes, sample class prediction, etc.
     • Experimental design
     • Microarray experiment
     • 16-bit TIFF files
     • Image analysis: (Rfg, Rbg), (Gfg, Gbg)
     • Normalization: R, G
     • Estimation, testing, clustering, discrimination
     • Biological verification and interpretation

  3. Some motherhood statements
     Important aspects of a statistical analysis include:
     • Tentatively separating systematic from random sources of variation
     • Removing the former and quantifying the latter, when the system is in control
     • Identifying and dealing with the most relevant source of variation in subsequent analyses
     Only if this is done can we hope to make more or less valid probability statements.

  4. The simplest cDNA microarray data analysis problem is identifying differentially expressed genes using one slide
     • This is a common enough hope
     • Efforts are frequently successful
     • It is not hard to do by eye
     • The problem is probably beyond formal statistical inference (valid p-values, etc.) for the foreseeable future, and here’s why

  5. An M vs. A plot: M = log2(R / G), A = log2(R*G) / 2
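A minimal sketch of the M and A definitions on this slide, assuming R and G are positive (for example background-corrected) intensities for a single spot. The function name `ma_values` and the sample intensities are hypothetical and only serve as illustration.

```python
import math

def ma_values(red, green):
    """Return (M, A) for one spot: M = log2(R/G), A = log2(R*G)/2."""
    if red <= 0 or green <= 0:
        raise ValueError("intensities must be positive to take logarithms")
    m = math.log2(red / green)        # log-ratio
    a = 0.5 * math.log2(red * green)  # average log-intensity
    return m, a

# Hypothetical (R, G) intensities, not taken from any data set in the talk.
spots = [(1024.0, 256.0), (500.0, 500.0), (64.0, 256.0)]
for r, g in spots:
    m, a = ma_values(r, g)
    print(f"R={r:7.1f} G={g:7.1f}  M={m:+.3f}  A={a:.3f}")
```

Plotting M against A for all spots on a slide gives the M vs. A plot; spots with equal red and green intensity sit on the line M = 0.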

  6. Background matters (images: from Spot; from GenePix)

  7. No background correction vs. with background correction. From the NCI60 data set (Stanford web site)

  8. An experiment having within-slide replicates

  9. Background makes a difference

     Background method    Segmentation method    Exp1    Exp2
     No background        S.nbg                     6       6
                          Gp.nbg                    7       6
                          SA.nbg                    6       6
                          QA.fix.nbg                7       6
                          QA.hist.nbg               7       6
                          QA.adp.nbg               14      14
     Local surrounding    S.valley                 17      21
                          GP                       11      11
                          SA                       12      14
                          QA.fix                   18      23
                          QA.hist                   9       8
                          QA.adp                   27      26
     Others               S.morph                   9       9
                          S.const                  14      14

     Medians of the SD of log2(R/G) for 8 replicated spots, multiplied by 100 and rounded to the nearest integer.
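The summary statistic in the table can be sketched as follows. This assumes that each gene contributes one SD computed over its replicated spots and that the median is then taken over genes; the slide does not say whether the sample or population SD was used, so the sample SD here is an assumption. The function name and the toy values are hypothetical.

```python
import statistics

def replicate_sd_summary(log_ratios_by_gene):
    """Median over genes of the SD of replicate log-ratios, times 100, rounded.

    log_ratios_by_gene: one list of M = log2(R/G) values per gene,
    each list holding that gene's replicated spots.
    """
    sds = [statistics.stdev(ms) for ms in log_ratios_by_gene]  # sample SD (assumed)
    return round(100 * statistics.median(sds))

# Toy data with 2 replicates per gene (the slide uses 8 replicated spots).
toy = [[0.0, 0.2], [0.0, 0.1], [0.0, 0.4]]
print(replicate_sd_summary(toy))
```

A smaller value means the replicated spots agree more closely under that combination of segmentation and background method.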

  10. Normalisation - lowess
      • Global lowess (Matt Callow’s data, LBNL)
      • Assumption: changes roughly symmetric at all intensities.
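A rough sketch of intensity-dependent normalisation in the spirit of global lowess: estimate the trend of M as a function of A and subtract it. This is a simplified stand-in, not the lowess routine used in the talk: it fits tricube-weighted local lines and omits lowess's robustness iterations. The function names and the `span` default are assumptions, and the quadratic-time loop is only suitable for small examples.

```python
import math

def tricube(u):
    """Tricube kernel: weight 1 at u = 0, falling to 0 at |u| >= 1."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0

def local_linear_trend(a_values, m_values, span=0.4):
    """Fit M as a smooth function of A using tricube-weighted local lines."""
    n = len(a_values)
    k = min(n, max(3, math.ceil(span * n)))  # neighbours per local fit
    fitted = []
    for a0 in a_values:
        # Bandwidth: distance from a0 to its k-th nearest neighbour.
        h = sorted(abs(a - a0) for a in a_values)[k - 1] or 1e-12
        w = [tricube((a - a0) / h) for a in a_values]
        sw = sum(w)
        mean_a = sum(wi * a for wi, a in zip(w, a_values)) / sw
        mean_m = sum(wi * m for wi, m in zip(w, m_values)) / sw
        sxx = sum(wi * (a - mean_a) ** 2 for wi, a in zip(w, a_values))
        sxy = sum(wi * (a - mean_a) * (m - mean_m)
                  for wi, a, m in zip(w, a_values, m_values))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(mean_m + slope * (a0 - mean_a))
    return fitted

def normalize_m(a_values, m_values, span=0.4):
    """Return M minus the fitted trend, so normalised M centres on zero."""
    trend = local_linear_trend(a_values, m_values, span)
    return [m - c for m, c in zip(m_values, trend)]

# Hypothetical spots with a purely intensity-dependent dye bias.
a = [float(x) for x in range(6, 16)]
m = [0.5 + 0.1 * x for x in a]
print([round(v, 6) for v in normalize_m(a, m)])
```

The symmetry assumption on the slide is what justifies treating the fitted trend as dye bias rather than biology: if up- and down-regulation roughly balance at every intensity, the smooth curve should lie near M = 0.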

  11. From the NCI60 data set (Stanford web site)

  12. Ngai lab, UCB

  13. Tiago’s data from the Goodman lab, UCB

  14. From the Ernest Gallo Clinic & Research Center

  15. From the Peter MacCallum Cancer Research Institute, Australia

  16. Normalisation - print tip
      Assumption: for every print-tip group, changes are roughly symmetric at all intensities.

  17. M vs A after print-tip normalisation

  18. Normalization (ctd): another data set (plot of log-ratios by print-tip group)
      • After within-slide global lowess normalization.
      • Likely to be a spatial effect.

  19. Taking scale into account
      Assumption: all print-tip groups have the same spread in M.
      The true log ratio is m_ij, where i indexes print-tip groups and j indexes spots.
      The observed log ratio is M_ij, where M_ij = a_i m_ij.
      A robust estimate of a_i is MAD_i = median_j { |M_ij - median_j(M_ij)| }
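A minimal sketch of the scale adjustment, assuming the log-ratios in each group have already been location-normalised so that they centre on zero. The slide gives MAD_i as the robust estimate of a_i; dividing by MAD_i relative to the geometric mean of all the MADs is an assumption made here so that the overall scale of M is preserved. Function names and data are hypothetical.

```python
import math
import statistics

def mad(values):
    """Median absolute deviation from the median."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])

def scale_normalize(groups):
    """Rescale each group's log-ratios so all groups share one MAD.

    groups: dict mapping a group label (e.g. a print-tip group) to its M values.
    """
    mads = {g: mad(ms) for g, ms in groups.items()}
    if any(v <= 0 for v in mads.values()):
        raise ValueError("every group needs a positive MAD")
    # Reference scale: geometric mean of the group MADs (an assumption).
    ref = math.exp(sum(math.log(v) for v in mads.values()) / len(mads))
    return {g: [m * ref / mads[g] for m in ms] for g, ms in groups.items()}

# Hypothetical data: the second group is the first with twice the spread.
groups = {"tip1": [-2.0, -1.0, 0.0, 1.0, 2.0],
          "tip2": [-4.0, -2.0, 0.0, 2.0, 4.0]}
scaled = scale_normalize(groups)
print({g: round(mad(ms), 4) for g, ms in scaled.items()})
```

The same function applies unchanged when the groups are slides rather than print-tip groups.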

  20. Normalization (ctd): that same data set (plot of log-ratios by print-tip group)
      • After print-tip location and scale normalization.
      • Incorporate quality measures.

  21. Matt Callow’s Srb1 dataset (#5). Newton’s and Chen’s single slide method

  22. Matt Callow’s Srb1 dataset (#8). Newton’s, Sapir & Churchill’s and Chen’s single slide method

  23. The approach of Roberts et al (Rosetta) Genomic DNA vs. Genomic DNA Data from Bing Ren

  24. The second simplest cDNA microarray data analysis problem is identifying differentially expressed genes using replicated slides
      There are a number of different aspects:
      • First, between-slide normalization; then
      • What should we look at: averages, SDs, t-statistics, other summaries?
      • How should we look at them?
      • Can we make valid probability statements?
      A report on work in progress

  25. Normalization (ctd): yet another data set (plot of log-ratios by slide)
      • Between slides this time (10 here)
      • Only small differences in spread apparent
      • We often see much greater differences

  26. The “NCI 60” experiments (no bg)

  27. Taking scale into account
      Assumption: all slides have the same spread in M.
      The true log ratio is m_ij, where i indexes slides and j indexes spots.
      The observed log ratio is M_ij, where M_ij = a_i m_ij.
      A robust estimate of a_i is MAD_i = median_j { |M_ij - median_j(M_ij)| }

  28. Which genes are (relatively) up/down regulated?
      Two samples, e.g. KO vs. WT or mutant vs. WT (design: T hybridized against C on n slides).
      For each gene form the t statistic:
      t = (average of n trt Ms) / sqrt((1/n) (SD of n trt Ms)^2)
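The per-gene statistic for the direct two-sample comparison can be sketched as below, assuming each of the n slides yields one log-ratio M for the gene and that SD means the sample standard deviation. The function name and values are hypothetical.

```python
import math
import statistics

def one_sample_t(ms):
    """t = mean(M) / sqrt(SD(M)^2 / n) for one gene's n log-ratios."""
    n = len(ms)
    if n < 2:
        raise ValueError("need at least two slides to estimate the SD")
    standard_error = math.sqrt(statistics.variance(ms) / n)
    return statistics.mean(ms) / standard_error

# Hypothetical log-ratios for one gene across n = 3 slides.
print(round(one_sample_t([1.0, 2.0, 3.0]), 4))
```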

  29. Which genes are (relatively) up/down regulated?
      Two samples with a reference, e.g. pooled control (design: T vs. C* on n slides, C vs. C* on n slides).
      For each gene form the t statistic:
      t = (average of n trt Ms - average of n ctl Ms) / sqrt((1/n) ((SD of n trt Ms)^2 + (SD of n ctl Ms)^2))
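A minimal sketch of the two-sample statistic with a common reference, assuming equal numbers of treatment and control slides (as on the slide) and sample standard deviations. The function name and data are hypothetical.

```python
import math
import statistics

def two_sample_t(trt, ctl):
    """t = (mean(trt) - mean(ctl)) / sqrt((SD(trt)^2 + SD(ctl)^2) / n)."""
    n = len(trt)
    if len(ctl) != n or n < 2:
        raise ValueError("need equal group sizes of at least two")
    difference = statistics.mean(trt) - statistics.mean(ctl)
    standard_error = math.sqrt(
        (statistics.variance(trt) + statistics.variance(ctl)) / n)
    return difference / standard_error

# Hypothetical log-ratios (each vs. the pooled reference) for one gene.
print(round(two_sample_t([2.0, 3.0, 4.0], [0.0, 1.0, 2.0]), 4))
```

Because both groups are measured against the same reference C*, the reference cancels in the difference of the averages.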

  30. One factor: more than 2 samples
      Samples: liver tissue from mice treated with cholesterol-modifying drugs.
      Question 1: find genes that respond differently between the treatment and the control.
      Question 2: find genes that respond similarly across two or more treatments relative to control.
      Design: T1, T2, T3, T4, each against C (x 2 each).

  31. One factor: more than 2 samples
      Samples: tissues from different regions of the mouse olfactory bulb.
      Question 1: find differences between regions.
      Question 2: identify genes with pre-specified patterns across regions.
      Design: T1, T2, T3, T4, T5, T6.

  32. Two or more factors
      6 different experiments at each time point. Dye swaps. 4 time points (30 minutes, 1 hour, 4 hours, 24 hours). 2 x 2 x 4 factorial experiment.
      Design: ctl, OSM, EGF, OSM & EGF (x 4 times).

  33. Which genes have changed? When permutation testing is possible
      1. For each gene and each hybridisation (8 ko + 8 ctl), use M = log2(R/G).
      2. For each gene form the t statistic:
         t = (average of 8 ko Ms - average of 8 ctl Ms) / sqrt((1/8) ((SD of 8 ko Ms)^2 + (SD of 8 ctl Ms)^2))
      3. Form a histogram of 6,000 t values.
      4. Do a normal Q-Q plot; look for values “off the line”.
      5. Permutation testing.
      6. Adjust for multiple testing.
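Steps 2, 5 and 6 can be sketched for a single gene as follows. The permutation p-value here enumerates every way of relabelling the pooled values into two equal groups and counts how often the absolute t statistic is at least as large as the observed one. The slide does not name the multiple-testing adjustment, so the Bonferroni correction below is a simple stand-in and an assumption, as are all function names and data.

```python
import itertools
import math
import statistics

def two_sample_t(ko, ctl):
    """Equal-n two-sample t statistic, as in step 2."""
    n = len(ko)
    diff = statistics.mean(ko) - statistics.mean(ctl)
    se = math.sqrt((statistics.variance(ko) + statistics.variance(ctl)) / n)
    if se == 0.0:  # degenerate relabelling with no within-group spread
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / se

def permutation_p_value(ko, ctl):
    """Two-sided p-value from all equal-size relabellings of the pooled data."""
    if len(ko) != len(ctl):
        raise ValueError("this sketch assumes equal group sizes")
    pooled = list(ko) + list(ctl)
    indices = range(len(pooled))
    observed = abs(two_sample_t(ko, ctl))
    hits = total = 0
    for chosen in itertools.combinations(indices, len(ko)):
        chosen_set = set(chosen)
        group1 = [pooled[i] for i in chosen]
        group2 = [pooled[i] for i in indices if i not in chosen_set]
        if abs(two_sample_t(group1, group2)) >= observed - 1e-12:
            hits += 1
        total += 1
    return hits / total

def bonferroni(p_values):
    """Bonferroni-adjusted p-values (a stand-in for the talk's adjustment)."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Hypothetical gene with 4 ko and 4 ctl log-ratios (the slide has 8 + 8).
p = permutation_p_value([5.0, 6.0, 7.0, 8.0], [1.0, 2.0, 3.0, 4.0])
print(round(p, 4), bonferroni([p, 0.5]))
```

With 8 ko and 8 ctl slides there are 12,870 relabellings per gene, so full enumeration over 6,000 genes is slow in pure Python; random sampling of relabellings is a common shortcut.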

  34. Histogram & qq plot ApoA1

  35. Apo A1: Adjusted and Unadjusted p-values for the 50 genes with the largest absolute t-statistics.

  36. Which genes have changed? Permutation testing not possible
      Our current approach is to use averages, SDs, t-statistics and a new statistic we call B, inspired by empirical Bayes. We hope in due course to calibrate B and use that as our main tool. We begin with the motivation, using data from a study in which each slide was replicated four times.

  37. Results from 4 replicates

  38. B=LOR compared

  39. Results from the Apo AI ko experiment (panels: M; t; t, M)

  40. Results from the Apo AI ko experiment (panels: M; t; t, M)

  41. Empirical Bayes log posterior odds ratio

  42. Results from the SR-BI transgenic experiment (panels: M; B; t; M, B; t, B; t, M, B)

  43. Results from the SR-BI transgenic experiment (panels: M; B; t; M, B; t, B; t, M, B)

  44. Extensions include dealing with
      • Replicates within and between slides
      • Several effects: use a linear model
      • ANOVA: are the effects equal?
      • Time series: selecting genes for trends

  45. Rosetta once more: in vivo binding sites of Gal4p in galactose (P < 0.001). Un-enriched DNA (Cy3) vs. antibody-enriched DNA (Cy5)

  46. Summary (for the second simplest problem)
      • Microarray experiments typically have thousands of genes, but only a few (1-10) replicates for each gene.
      • Averages can be driven by outliers.
      • Ts can be driven by tiny variances.
      • B = LOR will, we hope
          • use information from all the genes
          • combine the best of M. and T
          • avoid the problems of M. and T
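The two failure modes named in the summary can be illustrated with made-up numbers. This sketch assumes a direct design in which each gene has one log-ratio per replicate slide; both genes and the helper name `average_and_t` are hypothetical. It does not implement B, which the slides do not define.

```python
import math
import statistics

def average_and_t(ms):
    """Return (average M, one-sample t) for one gene's replicate log-ratios."""
    mean = statistics.mean(ms)
    t = mean / math.sqrt(statistics.variance(ms) / len(ms))
    return mean, t

# Gene A: three replicates near zero and one outlier drive the average up.
gene_a = [0.1, 0.0, -0.1, 4.0]
# Gene B: a negligible change measured with a tiny variance inflates t.
gene_b = [0.10, 0.11, 0.10, 0.11]

for name, ms in (("A", gene_a), ("B", gene_b)):
    mean, t = average_and_t(ms)
    print(f"gene {name}: average M = {mean:.3f}, t = {t:.1f}")
```

Ranking by the average alone would favour gene A, and ranking by t alone would favour gene B, although neither shows convincing evidence of a real change.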

  47. Acknowledgments
      UCB/WEHI: Yee Hwa Yang, Sandrine Dudoit, Ingrid Lönnstedt, Natalie Thorne, David Freedman
      CSIRO Image Analysis Group: Michael Buckley, Ryan Lagerstrom
      Ngai lab, UCB
      Goodman lab, UCB
      Peter Mac CI, Melb.
      Ernest Gallo CRC
      Brown-Botstein lab
      Matt Callow (LBNL)
      Bing Ren (WI)

  48. Some web sites
      Technical reports, talks, software etc.: http://www.stat.berkeley.edu/users/terry/zarray/Html/
      Statistical software R (“GNU’s S”): http://lib.stat.cmu.edu/R/CRAN/
      Packages within the R environment:
      -- Spot: http://www.cmis.csiro.au/iap/spot.htm
      -- SMA (statistics for microarray analysis): http://www.stat.berkeley.edu/users/terry/zarray/Software/smacode.html

  49. Factorial design (diagram: age effect and zone effect, with hybridizations numbered 1 to 5)

  50. Factorial design
      Diagram labels: m, m+a, m+z, m+z+a+za.
      Different ways of estimating parameters, e.g. the zone effect z:
      • 1 = (m + z) - (m) = z
      • 2 - 5 = ((m + a) - (m)) - ((m + a) - (m + z)) = (a) - (a - z) = z
      • 4 + 3 - 5 = … = z
      How do we combine the information?
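One standard way to combine several estimates of the same parameter is least squares in a linear model. The sketch below uses only hybridizations 1, 2 and 5, whose expectations follow from the slide (1 measures z, 2 measures a, 5 measures a - z); hybridizations 3 and 4 and the interaction za are left out because the slide elides their derivation. Equal error variance on every slide is an assumption, and the function name and log-ratios are hypothetical.

```python
def least_squares_two_params(design, y):
    """Solve a two-parameter least-squares fit via the normal equations."""
    s11 = sum(row[0] * row[0] for row in design)
    s12 = sum(row[0] * row[1] for row in design)
    s22 = sum(row[1] * row[1] for row in design)
    b1 = sum(row[0] * yi for row, yi in zip(design, y))
    b2 = sum(row[1] * yi for row, yi in zip(design, y))
    det = s11 * s22 - s12 * s12
    if det == 0:
        raise ValueError("design matrix is not of full rank")
    return (s22 * b1 - s12 * b2) / det, (s11 * b2 - s12 * b1) / det

# Rows are hybridizations 1, 2, 5; columns are coefficients of (a, z).
DESIGN = [(0, 1),    # 1 = (m + z) - m       = z
          (1, 0),    # 2 = (m + a) - m       = a
          (1, -1)]   # 5 = (m + a) - (m + z) = a - z

# Hypothetical observed log-ratios for one gene on slides 1, 2 and 5.
y1, y2, y5 = 0.9, 0.3, -0.4
a_hat, z_hat = least_squares_two_params(DESIGN, [y1, y2, y5])
print(round(a_hat, 4), round(z_hat, 4))
```

For this design the least-squares estimate of z works out to (2*y1 + (y2 - y5)) / 3: the direct estimate from slide 1 gets twice the weight of the indirect estimate 2 - 5, which carries the noise of two slides.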
