cDNA Microarray Design and Pre-processing By H. Bjørn Nielsen
Why Experimental Design • To enable statistical hypothesis verification/falsification • To balance the effects from undesired controllable effects • To ensure sufficient statistical power
1. To enable statistical hypothesis verification/falsification Typically, we want to identify differentially expressed genes between a set of conditions using t-test or ANOVA-like statistics. This implies that we replicate sampling from a set of fixed conditions. Example designs: • Control vs. Treatment • Treatment 1, Treatment 2, Treatment 3 • Multifactorial: Control, Treatment, Mutant, Mutant Treated
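As a rough illustration of the t-test-like statistic mentioned above, the following sketch computes Welch's t statistic for one gene measured in replicated control and treatment samples. The function name `welch_t` and the log2 expression values are hypothetical and chosen only for illustration; Welch's variant is one possible choice, not necessarily the statistic used in the lecture.

```python
import math
import statistics

def welch_t(group_a, group_b):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    n_a, n_b = len(group_a), len(group_b)
    # Squared standard errors of the two group means (sample variance / n).
    se2_a = statistics.variance(group_a) / n_a
    se2_b = statistics.variance(group_b) / n_b
    t = (statistics.mean(group_a) - statistics.mean(group_b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df

# Hypothetical log2 expression values for a single gene, 4 replicates each.
control = [7.9, 8.1, 8.0, 8.2]
treatment = [9.0, 9.3, 8.9, 9.2]

t_stat, dof = welch_t(treatment, control)
print(f"t = {t_stat:.2f}, df = {dof:.1f}")
```

Without replicates the variance terms cannot be estimated at all, which is why replicated sampling from fixed conditions is a precondition for this kind of test.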
1. To enable statistical hypothesis verification/falsification But we may also fit to a trend using alternative statistics (Bayesian fit, bootstrapping, ANOVA, etc.). • Fixed conditions (Control vs. Treatment; Treatment 1, Treatment 2, Treatment 3; Multifactorial: Control, Treatment, Mutant, Mutant Treated): replication is essential • Series (T0, T1, T2, ..., Tn): the length of the series or the sampling density may be most important
2. To balance the effects from undesired controllable effects Minimize and balance. Typical controllable effects: • Labeling dye • Microarray slide • Sampling time • Growth conditions Typical uncontrollable effects: • Random effects • Unintended deviations in sample handling, growth conditions, etc.
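One way to balance the labeling-dye effect on two-channel slides is to alternate which condition receives which dye. The sketch below is a minimal illustration under assumptions: the dye names Cy3 and Cy5, the sample names, and the helper `dye_balanced_pairs` are all hypothetical and not taken from the slides.

```python
from collections import Counter

def dye_balanced_pairs(group_a, group_b):
    """Pair samples from two groups on slides, swapping dyes on every other slide."""
    slides = []
    for i, (a, b) in enumerate(zip(group_a, group_b)):
        if i % 2 == 0:
            slides.append({"Cy3": a, "Cy5": b})
        else:
            slides.append({"Cy3": b, "Cy5": a})  # dye swap
    return slides

# Hypothetical sample names.
ill = ["ill_1", "ill_2", "ill_3", "ill_4"]
healthy = ["healthy_1", "healthy_2", "healthy_3", "healthy_4"]

design = dye_balanced_pairs(ill, healthy)

# Count how often each condition is labeled with each dye.
dye_counts = Counter(
    (sample.split("_")[0], dye) for slide in design for dye, sample in slide.items()
)
for slide_number, slide in enumerate(design, start=1):
    print(slide_number, slide)
```

With an even number of slides, each condition ends up on each dye equally often, so a dye bias is not confounded with the condition effect.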
3. To ensure sufficient statistical power An appropriate number of replicates is required to distinguish noise from 'effect'. Gene expression studies typically require 3 or more replicates. • Make sure to replicate over the most important sources of variance • The typical order of noise contributions is: (1) Biological variation (2) Sample preparation batch (3) Hybridization/slide effect (4) Dye effect/Spot effect
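To make the link between replicate number and power concrete, here is a rough sketch based on the normal approximation for a two-group comparison, n = 2 ((z_{1-α/2} + z_{power}) σ / δ)² per group. The function name `replicates_needed` and the example values for the effect size and standard deviation are assumptions for illustration. The approximation tends to underestimate n when n is small, so treat the result as a lower bound.

```python
import math
from statistics import NormalDist

def replicates_needed(effect, sd, alpha=0.05, power=0.8):
    """Approximate replicates per group for a two-sided two-group comparison.

    effect: difference in mean log2 expression to detect (assumed value)
    sd:     per-sample standard deviation on the log2 scale (assumed value)
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    n = 2 * ((z_alpha + z_power) * sd / effect) ** 2
    return math.ceil(n)

# Hypothetical scenario: detect a 2-fold change (1 log2 unit) with sd = 0.5.
n_default = replicates_needed(effect=1.0, sd=0.5)
# A stricter threshold, e.g. to allow for testing many genes, needs more replicates.
n_strict = replicates_needed(effect=1.0, sd=0.5, alpha=0.001)
print(n_default, n_strict)
```

Because the standard deviation enters squared, replicating over the largest source of variance has the biggest influence on the number of samples required.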
An example Aim: Identify differentially expressed genes between ill and healthy patients. Samples: 4 ill and 4 healthy patients, using a two-channel cDNA array. How should we design the experiment?
Another example Aim: Identify differentially expressed genes between ill and healthy patients. Samples: 4 ill (2xM + 2xF) and 4 healthy (2xM + 2xF), using a two-channel cDNA array. How should we design the experiment?
Yet another example Aim: Identify genes differentially affected by starving in obese and lean people. Samples: 4 obese (2x starving + 2x not starving) and 4 lean (2x starving + 2x not starving), using a one-channel GeneChip. How should we design the experiment?
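One possible reading of "differentially affected by starving in obese and lean people" is an interaction effect in a 2x2 factorial design: the starving effect in obese people minus the starving effect in lean people. The sketch below computes that contrast for a single gene. The log2 intensities and the function name `interaction_contrast` are hypothetical, and a real analysis would also need a variance estimate to test the contrast.

```python
from statistics import mean

def interaction_contrast(obese_starving, obese_fed, lean_starving, lean_fed):
    """(starving effect in obese) minus (starving effect in lean), on the log2 scale."""
    effect_obese = mean(obese_starving) - mean(obese_fed)
    effect_lean = mean(lean_starving) - mean(lean_fed)
    return effect_obese - effect_lean

# Hypothetical log2 intensities for one gene, 2 replicates per cell of the design.
contrast = interaction_contrast(
    obese_starving=[10.0, 10.2],
    obese_fed=[8.0, 8.2],
    lean_starving=[9.0, 9.2],
    lean_fed=[8.5, 8.7],
)
print(f"interaction (log2 scale): {contrast:.2f}")
```

A contrast near zero would mean starving changes expression by a similar amount in both groups; a large absolute value suggests the response differs between obese and lean people.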
cDNA pre-processing • Background correction • Normalization: within slide and between slide
Background correction Is it meaningful? Methods: • subtraction • movingmin (3x3) • normexp • none Ritchie et al. 2007, Bioinformatics
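The difference between plain subtraction and a 3x3 moving minimum can be sketched as follows. This assumes that "movingmin (3x3)" means replacing each spot's background estimate with the minimum background over the spot and its neighbors in a 3x3 window before subtracting. The intensity grids are made up, and normexp is not covered here.

```python
def subtract_background(foreground, background):
    """Plain per-spot subtraction; can yield negative values."""
    return [[f - b for f, b in zip(f_row, b_row)]
            for f_row, b_row in zip(foreground, background)]

def moving_min(background):
    """Replace each background value with the minimum of its 3x3 neighborhood."""
    rows, cols = len(background), len(background[0])
    smoothed = []
    for r in range(rows):
        smoothed_row = []
        for c in range(cols):
            # Window is clipped at the grid edges.
            window = [background[i][j]
                      for i in range(max(0, r - 1), min(rows, r + 2))
                      for j in range(max(0, c - 1), min(cols, c + 2))]
            smoothed_row.append(min(window))
        smoothed.append(smoothed_row)
    return smoothed

# Hypothetical foreground and background intensities for a 3x3 block of spots.
fg = [[500, 120, 900], [300, 90, 400], [1000, 150, 600]]
bg = [[100, 130, 90], [110, 95, 105], [120, 100, 85]]

plain = subtract_background(fg, bg)
smoothed = subtract_background(fg, moving_min(bg))

count_negative = lambda grid: sum(v < 0 for row in grid for v in row)
print(count_negative(plain), count_negative(smoothed))
```

Negative corrected intensities cannot be log-transformed, which is one reason the question "Is it meaningful?" arises for plain subtraction on weak spots.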
Normalization within array Correct for any bias that follows an undesired, uncontrollable effect: • Print tip • Microtiter plate • Printing order • Spatial trends (uneven hybridization) As well as intensity-dependent biases
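A very simple form of within-array correction is to center the log-ratios separately for each group of spots that shares a nuisance factor, such as the print tip. The sketch below subtracts the per-group median. The function name `center_by_group`, the log-ratio values, and the print-tip labels are hypothetical. This only removes a constant offset per group; intensity-dependent bias would need a curve fit such as lowess within each group.

```python
import statistics
from collections import defaultdict

def center_by_group(log_ratios, groups):
    """Subtract the median log-ratio of each group (e.g. print tip) from its spots."""
    by_group = defaultdict(list)
    for value, group in zip(log_ratios, groups):
        by_group[group].append(value)
    medians = {group: statistics.median(values) for group, values in by_group.items()}
    return [value - medians[group] for value, group in zip(log_ratios, groups)]

# Hypothetical log-ratios for six spots printed by two print tips.
log_ratios = [0.5, 0.7, 0.6, -0.2, -0.4, -0.3]
print_tips = [1, 1, 1, 2, 2, 2]

centered = center_by_group(log_ratios, print_tips)
print([round(v, 2) for v in centered])
```

The same function could be applied with microtiter plate or printing order as the grouping variable.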
Normalization between array Correction for intensity-dependent biases (MA plot: M versus A) • Lowess • Qspline • Quantiles • And more
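For reference, M and A are conventionally defined from the two channel intensities as M = log2(R/G) and A = ½ log2(R·G). The names R (red) and G (green) are an assumption, since the slide shows only the axis labels. The sketch below computes M and A and then illustrates quantile normalization, one of the listed methods, in a minimal form that ignores ties. The input values are made up.

```python
import math

def ma_values(red, green):
    """Return log-ratios M and average log-intensities A for paired channel values."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [0.5 * math.log2(r * g) for r, g in zip(red, green)]
    return m, a

def quantile_normalize(arrays):
    """Give every array the same distribution: the mean of the sorted arrays.

    Minimal version: all arrays must have equal length; ties are not averaged.
    """
    n = len(arrays[0])
    sorted_arrays = [sorted(values) for values in arrays]
    reference = [sum(s[i] for s in sorted_arrays) / len(arrays) for i in range(n)]
    normalized_arrays = []
    for values in arrays:
        order = sorted(range(n), key=lambda i: values[i])
        normalized = [0.0] * n
        for rank, index in enumerate(order):
            normalized[index] = reference[rank]  # value of rank k gets k-th reference quantile
        normalized_arrays.append(normalized)
    return normalized_arrays

# Hypothetical intensities.
m, a = ma_values(red=[400.0, 150.0], green=[100.0, 300.0])
normalized = quantile_normalize([[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]])
print(m, a)
print(normalized)
```

After quantile normalization every array has identical sorted values, while the rank of each gene within its own array is preserved.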