Biostatistics Article Oncology Journal Club May 28, 2004

M. Kathleen Kerr, “Design Considerations for Efficient and Effective Microarray Studies,” Biometrics 59, 822-828; December 2003

A couple introductory points • Different kinds of microarrays • Two main distinctions • One-color (e.g. Affymetrix, long oligo) • Two-color (e.g. spotted cDNA) • Some of the statistical tools are the same and some are different • Using two color arrays is slightly more complicated in terms of design

Statistics and Microarrays • Statistical principles certainly apply to microarray analyses • We should be considering some of the same basic tenets when performing microarray studies • Randomization • Sample size/Replication issues • Experimental design • Good design is critical to making efficient and valid inferences.

Randomization • Might not sound applicable • But… • If you have a ‘treatment’ you are giving, samples should be randomly assigned to treatment groups • Randomize order in which samples are processed • Randomize order in which hybridizations are performed • Randomize the order in which arrays are chosen from array batch • Example: Dosing study • Looking for genetic changes in cells as a function of dose • Perform all dose=0 experiments first, then dose=1, then dose=2, etc. • But, as you proceed, you learn more and get better at processing samples, performing hybridizations, and using the scanner • Your results may be associated with dose even if dose has no effect on genetic changes: CONFOUNDING!
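A minimal sketch of randomizing the processing order in a dosing study like the one above. The sample IDs, the three dose levels, the four samples per dose, and the seed are all hypothetical values chosen for illustration; the same shuffle could be applied to hybridization order or to the order in which arrays are drawn from a batch.

```python
import random


def randomized_run_order(samples, seed=None):
    """Return a shuffled copy of `samples`, leaving the input list untouched."""
    rng = random.Random(seed)  # a fixed seed makes the order reproducible
    order = list(samples)
    rng.shuffle(order)
    return order


# Hypothetical study: 4 samples at each of dose 0, 1, 2, listed dose by dose
# (the order that would confound dose with operator experience).
doses = [0] * 4 + [1] * 4 + [2] * 4
samples = [(f"sample{i:02d}", dose) for i, dose in enumerate(doses, start=1)]

run_order = randomized_run_order(samples, seed=2004)
for position, (sample_id, dose) in enumerate(run_order, start=1):
    print(position, sample_id, f"dose={dose}")
```

Recording the seed alongside the lab notes lets the run order be regenerated later if needed.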

Sample Size and Replication • Three types of ‘replication’ in microarrays A. Spotting genes multiple times on same array B. Hybridizing multiple arrays to the same RNA samples C. Using multiple individuals of a certain type • A and B are considered ‘technical’ replicates • C describes ‘random sampling’ from the population • THESE ARE CRITICALLY DIFFERENT!

Sample Size and Replication • Technical replication: • DOES NOT address biological variability • DOES address measurement error of assay • Usually, we are interested in how a condition affects individuals in general • We are NOT usually interested in how a condition affects any given individual • Example: AML • Do we want to make inferences about differences in gene expression across AML subpopulations? • Or, do we want to make inferences about differences in gene expression in two particular AML patients, each of whom has a different type of AML?

Sample Size and Replication • Why/When would we be interested in technical replication? • Medical diagnosis • Need to know how precise the measures are • Sensitivity and specificity of the assay depend on that

Sample Size and Replication • Biological replicates • Tell us about the variability across samples of the same type. • Biological variability is critical for • finding differences in gene expressions across populations • Classification procedures which try to use gene expression patterns that differentiate individuals of different types • If you use just one sample or cell line to make inferences about the population of interest • You are making a BIG assumption: “Population is relatively homogeneous” • Cannot evaluate your assumption based on the data from the study.

Sample Size and Replication • For a fixed sample size: • It is preferable to sample NEW individuals rather than perform technical replicates • Why? It is more efficient in terms of variance, power, etc. • You gain much less from technical replicates than from new samples • But, if it is expensive to sample new individuals • Examples: samples are very rare, recruitment is difficult, procedure for acquiring samples is risky or expensive • In this case, it might be worthwhile to perform some technical replicates due to “cost-benefit” analysis • GENERAL RULE: TRUE REPLICATION BEATS TECHNICAL REPLICATION FOR GAINS IN PRECISION WHEN ESTIMATING PARAMETERS
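The general rule can be illustrated with a simple variance-components sketch. It assumes (this model is not stated on the slide) that each measurement is a population mean plus an independent biological effect with variance var_bio plus an independent technical error with variance var_tech. Under that assumption, the variance of a group mean from n individuals with m technical replicates each is var_bio/n + var_tech/(n*m). The variance values and the 12-array budget below are hypothetical.

```python
def variance_of_group_mean(n_individuals, n_tech_reps, var_bio, var_tech):
    """Variance of the estimated group mean under the assumed two-level model."""
    return var_bio / n_individuals + var_tech / (n_individuals * n_tech_reps)


# Hypothetical variance components and a fixed budget of 12 arrays.
VAR_BIO, VAR_TECH = 1.0, 0.5
allocations = [(12, 1), (6, 2), (4, 3)]  # (individuals, technical replicates each)

results = {
    alloc: variance_of_group_mean(alloc[0], alloc[1], VAR_BIO, VAR_TECH)
    for alloc in allocations
}
for (n, m), var in results.items():
    print(f"{n} individuals x {m} replicate(s): variance of mean = {var:.4f}")
```

With the total number of arrays fixed, the technical term is the same for every allocation, so only the biological term changes, and it shrinks only when more individuals are sampled.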

Pooling of Samples • Often motivated by insufficient quantity of RNA, which is reasonable. • Sometimes proposed to ‘control’ for biological variability • Bad idea! • We need to understand, not eliminate, biological variability • To understand the differences in mean expression across two populations (e.g. Normal karyotype and t(15;17)), we need to be able to estimate the population means and their variability • We cannot do that if we have pooled RNA • We can estimate the mean difference between two groups based on pooled samples • But, we cannot make inferences about whether or not there is a difference in mean expression.
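A small sketch of why a single pool per group supports an estimate but not an inference. It assumes one measurement per pool and uses an unpooled (Welch-style) standard error for the individual-sample case; the function name and all expression values are hypothetical.

```python
import statistics


def mean_difference_and_se(group_a, group_b):
    """Return (difference in means, standard error), or None for the standard
    error when a group has fewer than two measurements."""
    diff = statistics.mean(group_a) - statistics.mean(group_b)
    if len(group_a) < 2 or len(group_b) < 2:
        # A single pooled measurement gives no within-group variance estimate.
        return diff, None
    se = (
        statistics.variance(group_a) / len(group_a)
        + statistics.variance(group_b) / len(group_b)
    ) ** 0.5
    return diff, se


# Hypothetical log-expression values for one gene.
pooled_diff, pooled_se = mean_difference_and_se([8.1], [7.4])
indiv_diff, indiv_se = mean_difference_and_se(
    [8.3, 7.9, 8.4, 7.8], [7.2, 7.6, 7.5, 7.3]
)
print("pooled:", pooled_diff, pooled_se)
print("individual samples:", indiv_diff, indiv_se)
```

In the pooled case the difference is available but there is nothing to compare it against, so no test statistic can be formed.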

Pooling of Samples • Pooling is ALWAYS bad if your goal is • Finding a classification scheme • Discovering unknown subtypes • ‘In between’ strategy for pooling when we are interested in determining if average expression is different in two phenotypes (Kendziorski et al. (2003)). • Pooling RNA for use as a ‘reference’ is OK (more in a minute).

Experimental Layout • Discussion specific to two-color arrays • Complicated due to pairing of samples on arrays • One-color array design considerations usually more straightforward • Critical determinant of design efficiency. • Three main types of designs in two-color arrays: • Reference • Loop • Dye swap

Reference Design • Each arrow represents an array • Let's say that the origin of the arrow is green and the head of the arrow is red • Each sample of interest is paired with the same “reference” sample • AML example: reference was 11 pooled cell lines • Here, each sample is labeled with red (Cy5) and reference is labeled with green (Cy3) • Each sample is only hybridized to ONE array (each reference) • [Diagram labels: Type 1, Type 2, Reference sample]

Loop Design • Each sample is paired with a sample of the other type (no reference!) • Each sample is hybridized to TWO arrays and is both red and green • Can compare any two samples through the arrays between them in the loop • Relative efficiency is 4 to 1 comparing loop to reference • Downside: what if just ONE array goes bad? Loop is not a loop anymore! • Good design for small number of samples: uses information very effectively • [Diagram labels: Type 1, Type 2]

Dye Swap Design • Each sample is paired with the same sample of the other type TWICE • Each sample is hybridized to TWO arrays • Dyes are swapped • Relative efficiency is 4 to 1 comparing loop to reference • More robust than loop • Less complicated than loop • Direct comparisons are not as easy because samples are not linked through other samples as in other two designs • [Diagram labels: Type 1, Type 2]
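The 4 to 1 figure can be checked in the simplest case: two samples, A and B, and two arrays. The sketch assumes each array yields one log ratio (red minus green) with the same independent technical error variance, and it ignores biological variability. With only two samples, a loop of two arrays is the same layout as a dye swap. The function name and the unit variance are illustrative assumptions.

```python
from fractions import Fraction


def variance_of_linear_combination(weights, var_per_array=Fraction(1)):
    """Variance of sum(w_i * r_i) for independent log ratios r_i with equal variance."""
    return sum(Fraction(w) ** 2 for w in weights) * var_per_array


# Reference design: array 1 is A (red) vs reference (green), array 2 is
# B (red) vs reference (green). A - B is estimated by r1 - r2; the reference
# and dye terms cancel.
reference_var = variance_of_linear_combination([1, -1])

# Dye swap: array 1 is A (red) vs B (green), array 2 is B (red) vs A (green).
# A - B is estimated by (r1 - r2) / 2; the dye term cancels.
dye_swap_var = variance_of_linear_combination([Fraction(1, 2), Fraction(-1, 2)])

relative_efficiency = reference_var / dye_swap_var
print(reference_var, dye_swap_var, relative_efficiency)  # 2 1/2 4
```

Because this calculation counts only technical error, the advantage shrinks once variability between individuals is added to both designs.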

Why reference so often? • As population variance increases, loop and dye swaps have less advantage. • Sample comparisons must go ‘through’ loop • Direct comparisons not easy in dye swap if samples are not on same chip. • If you have a large number of samples, loop is risky due to ‘bad chips’ • Logically, however, by using reference on every chip, we are ‘wasting’ a resource. • But, less efficiency advantage in complex designs as number of RNAs increases

Robustness • Two robust alternatives, each requiring 2x as many arrays • “Double reference” • “Double loop”

Practical Considerations • Simplicity • Large study with many technicians • Extendability • Open-ended • Can add additional samples at a later time depending on what early results suggest • Reference and “symmetric” reference designs • Useful subdesigns • “subgroup analyses” • Example: all AMLs vs. normal karyotype
