Analysis of DNA Microarray Data: Sensitivity, Specificity, and Other Real-World Issues

1. Definitions and basic considerations

DNA microarrays • Major advantage • Simultaneous measurement of level of expression for nearly all transcribed genes within given cell or tissue • Major disadvantage • Cost

Therefore, to get the most bang for the buck, it is imperative to understand the role of uncertainty in measurement…

Categorical tests (yes/no, based upon threshold) • Gene arrays • Is gene expressed or not? • Is gene differentially expressed under two different experimental conditions? • Medical tests • Does patient have disease or not?

Key concepts for categorical tests • Specificity • true negative rate • 1 – FPR (false positive rate) • Sensitivity • TPR (true positive rate)

Specificity provides the answer to questions like… • What fraction of patients who are disease-free are correctly classified as disease-free? • What fraction of genes that are not differentially expressed are correctly classified as being non-differentially expressed?

Specificity • Specificity is defined as true negative rate • Probability that disease-free patient will be correctly categorized as disease-free • False positive rate (FPR) = 1 – specificity • Probability that disease-free patient will be incorrectly categorized as having disease

Sensitivity and specificity deal with distinct sets of patients or genes • Specificity • Healthy patients lacking the disease • Non-expressed genes • Non-differentially expressed genes • Sensitivity • Sick patients having the disease • Expressed genes • Differentially expressed genes

Sensitivity provides the answer to questions like… • What fraction of patients who have a given disease are correctly classified as diseased? • What fraction of genes that are differentially expressed are correctly classified as being differentially expressed?

Sensitivity • Sensitivity is defined as true positive rate • Probability that diseased patient will be correctly categorized as having the disease
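The definitions above can be written out as a minimal sketch. The function names and the counts below are hypothetical, chosen only to illustrate the arithmetic; they are not data from this presentation.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """True positive rate: fraction of truly positive cases classified as positive."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """True negative rate: fraction of truly negative cases classified as negative."""
    return true_neg / (true_neg + false_pos)


def false_positive_rate(true_neg: int, false_pos: int) -> float:
    """FPR = 1 - specificity."""
    return 1.0 - specificity(true_neg, false_pos)


# Hypothetical counts (illustration only):
# 100 differentially expressed genes, 90 of them detected;
# 1000 non-differentially expressed genes, 50 of them wrongly flagged.
print("sensitivity:", sensitivity(true_pos=90, false_neg=10))
print("specificity:", specificity(true_neg=950, false_pos=50))
print("FPR:", round(false_positive_rate(true_neg=950, false_pos=50), 3))
```

Note that the two quantities are computed on disjoint sets: sensitivity uses only the truly positive cases, specificity only the truly negative ones.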

Yin and yang of sensitivity and specificity • For a given test with overlapping distributions, moving the threshold to improve specificity worsens sensitivity • Moving the threshold to improve sensitivity worsens specificity

Since when is the world ever ideal?

If we choose a threshold λ of 1.5, then...

And if we choose a threshold λ of 0.5, then...
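The effect of moving the threshold can be sketched numerically. The two normal distributions below (negatives centered at 0, positives centered at 2, both with standard deviation 1) are assumptions made for illustration; the distributions behind the original slides are not given in the text. Only the two threshold values, 1.5 and 0.5, come from the slides.

```python
import math


def normal_cdf(x: float, mean: float, sd: float) -> float:
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


# Assumed (hypothetical) distributions of the test value.
NEG_MEAN, NEG_SD = 0.0, 1.0   # truly negative cases
POS_MEAN, POS_SD = 2.0, 1.0   # truly positive cases


def specificity_at(threshold: float) -> float:
    """Fraction of negatives falling below the threshold (called negative)."""
    return normal_cdf(threshold, NEG_MEAN, NEG_SD)


def sensitivity_at(threshold: float) -> float:
    """Fraction of positives falling at or above the threshold (called positive)."""
    return 1.0 - normal_cdf(threshold, POS_MEAN, POS_SD)


for threshold in (1.5, 0.5):
    print(
        f"threshold={threshold}: "
        f"specificity={specificity_at(threshold):.3f}, "
        f"sensitivity={sensitivity_at(threshold):.3f}"
    )
```

Under these assumed distributions, the higher threshold gives the better specificity and the worse sensitivity, and the lower threshold reverses the two.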

2. Sources of uncertainty in categorical tests

SMEASURE = measured signal • STRUE = true signal • N = noise (error)

Noise-to-Signal (N:S) Ratio • N : S << 1 • reliable and trustworthy measurement • N ~ S • unreliable measurement • N > S • highly unreliable measurement

Sources of uncertainty in categorical measurements • Measurement uncertainty • SMEASURE does not necessarily equal STRUE • N ~ S or N > S • “Overlap” uncertainty • Some patients with disease truly have negative test values • Some patients without disease truly have positive test values

Gene arrays and medical tests have distinct and different sources of uncertainty

Variability in medical tests is mostly “overlap” • Measurement variability • Essentially none (error is of no clinical significance) • N : S << 1 • Hence, perform test once and only once • “Overlap” variability • Ubiquitous and essentially unavoidable • Feature of all medical tests to one degree or another • So what’s the solution? • Search for a better test

Variability in DNA microarrays is mostly measurement uncertainty • Measurement variability • Ever-present • N > S for many genes • “Overlap” variability • None • Absent gene has expression level of zero, whereas present gene has expression level of non-zero • Differentially expressed gene… • So what’s the solution? • Repeated measurements

So how do we improve the N:S ratio?

Take mean of repeated measurements...

Benefits of repeated measurements • Assuming that noise N is independent between measurements and has a normal (Gaussian) distribution, the error of the mean decreases in proportion to the square root of the number n of measurements • Example: to reduce N : S by half, take mean of 4 measurements
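The square-root rule can be checked with a small simulation. This is a sketch under stated assumptions: the noise is independent and Gaussian, and the true signal, noise level, trial count, and seed are arbitrary illustrative values, not quantities from the arrays discussed here.

```python
import random
import statistics


def sd_of_mean(n: int, noise_sd: float = 1.0, true_signal: float = 10.0,
               trials: int = 20000, seed: int = 0) -> float:
    """Estimate the standard deviation of the mean of n noisy measurements.

    Each measurement is modeled as true_signal + independent Gaussian noise.
    """
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        samples = [true_signal + rng.gauss(0.0, noise_sd) for _ in range(n)]
        means.append(sum(samples) / n)
    return statistics.stdev(means)


single = sd_of_mean(1)
for n in (1, 4, 16):
    ratio = sd_of_mean(n) / single
    print(f"n={n}: error relative to a single measurement = {ratio:.2f}")
```

With four measurements the relative error comes out close to one half, and with sixteen close to one quarter, which matches the 1/√n behavior. If the noise were correlated between arrays, the improvement would be smaller.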

3. Measurements using Affymetrix (MSV 5.0)

Affymetrix Microarray Suite Version 5.0 (MSV 5.0)

For our analysis, we used...

Signal Log Ratio (SLR) • SLR = logarithm to base 2 of the ratio of the signal for gene under experimental condition A (SA1) to that for the same gene under experimental condition N (SN1)

Examples of SLR • Example 1: SA1 = 4000, SN1 = 1000, SLR = log2 (4000/1000) = log2 (4) = 2 • Example 2: SA1 = 2, SN1 = 16, SLR = log2 (2/16) = log2 (1/8) = –3
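The SLR definition and the two worked examples can be reproduced with a short sketch. The function names are hypothetical and this is not the Affymetrix implementation, which computes the SLR from probe-level data; it only applies the ratio definition given above to two signal values.

```python
import math


def signal_log_ratio(signal_a: float, signal_n: float) -> float:
    """Base-2 logarithm of the ratio of signal under condition A to condition N."""
    if signal_a <= 0 or signal_n <= 0:
        raise ValueError("signals must be positive to take a log ratio")
    return math.log2(signal_a / signal_n)


def fold_change(slr: float) -> float:
    """Convert an SLR back to a fold change (2 raised to the SLR)."""
    return 2.0 ** slr


print(signal_log_ratio(4000, 1000))  # 4-fold higher under condition A
print(signal_log_ratio(2, 16))       # 8-fold lower under condition A
print(fold_change(2.0))
```

A positive SLR means the signal is higher under condition A, a negative SLR means it is lower, and an SLR of zero means no change.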

4. Specificity of MSV 5.0

To get a handle on specificity, perform same-versus-same comparisons • SLRTRUE must be zero • log2 (1) = 0 • Hence, SLRMEASURE is all noise

Perform separate analyses for “present” and “absent” genes • Present genes • N : S << 1 • Absent genes • N : S ≥ 1

Experimental system • Primary cultures of peritoneal macrophages from mice of 3 strains • BALB/c (normal) • MRL/+ (autoimmune lupus) • MRL/lpr (autoimmune lupus) • Each array represents mRNA pooled from distinct sets of ~ 6 mice harvested on separate days • Macrophages were stimulated with bacterial endotoxin (lipopolysaccharide, LPS) for 8 or 24 hours

Present genes: Same-vs.-same comparison (single array)

Present genes: Same-vs.-same comparison (single array) • Average SLR = ~ 0.02 ± 0.04 (~ 1.014-fold) • not different from zero • that’s good! • Standard deviation = ~ 0.69 ± 0.30 • ~ 32% genes have SLR > 0.69 (1.61-fold induction) • ~ 4% genes have SLR > 1.38 (2.60-fold induction) • that’s not good
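The fold values and percentages on this slide can be cross-checked with a short calculation. The sketch assumes that the SLR noise is normally distributed with mean zero and standard deviation 0.69, and that the quoted percentages count genes whose SLR magnitude exceeds the cutoff in either direction; the second point is an interpretation, not something the slide states.

```python
import math


def fold_change(slr: float) -> float:
    """Fold change corresponding to a signal log ratio (base 2)."""
    return 2.0 ** slr


def two_sided_tail_fraction(cutoff: float, sd: float) -> float:
    """Fraction of a zero-mean normal distribution with |value| > cutoff."""
    return math.erfc(cutoff / (sd * math.sqrt(2.0)))


SD = 0.69  # standard deviation of SLR for present genes, from the slide

for cutoff in (0.69, 1.38):
    print(
        f"cutoff={cutoff}: fold={fold_change(cutoff):.2f}, "
        f"expected fraction beyond cutoff={two_sided_tail_fraction(cutoff, SD):.1%}"
    )
```

Under these assumptions the expected fractions are roughly 32% at one standard deviation and a little under 5% at two, in line with the reported values. In a same-versus-same comparison every such gene is a false positive, so these fractions are direct estimates of the false positive rate at each cutoff.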

Present genes: Statistical distribution of SLR • Entire distribution • Not normal (p < 0.01, by D statistic) • Central 95% • Normal (p > 0.2, by D statistic) • Highly noteworthy, since D statistic detects tiny tiny deviations from normality • 5% at tails overestimate the SLR

Present genes: Same-vs.-same comparison (single array)

If we compare genes in central 95% versus genes in 5% tails… • Center (95% genes) • Mean signal intensity = 1493 • Tails (5% genes) • Mean signal intensity = 620 (p < 10^-19, t-test) • Consistent with intuitive idea that measurement variability is inversely related to level of gene’s expression

Absent genes: Same-vs.-same comparison (single array)

Absent genes: Same-vs.-same comparison (single array) • Average SLR = ~ 0.33 ± 0.31 (~ 1.26-fold induction) • definitely not good • Standard deviation = ~ 1.12 ± 0.24 • > 35% genes have SLR > 1.0 (2-fold induction) • > 5% genes have SLR > 2.0 (4-fold induction) • even worse!

Absent genes: Statistical distribution of SLR • Entire distribution • Not normal (p < 0.01, by D statistic) • Central 95% • Not normal (p < 0.01, by D statistic) • Central 60% • Not normal (p < 0.01, by D statistic)

Summary of same-vs.-same comparisons (single array) • Use SLR only for genes that are actually expressed (i.e., “present” genes) • Central 95% normally distributed with standard deviation of ~ 0.69 • 2.5% at each tail exceeds normal distribution • Do not use SLR for genes that are marginally, if at all, expressed (i.e., “absent” genes) • Most of measured signal is noise • SLR is therefore ratio of two small randomly distributed values

Specificity of single array comparisons
