
Microarrays: Common Analysis Approaches


hasad

Presentation Transcript


  1. Microarrays: Common Analysis Approaches

  2. Outline • Missing Value Estimation • Differentially Expressed Genes • Clustering Algorithms • Principal Components Analysis

  3. Missing Data: Outline • Missing data problem, basic concepts and terminology • Classes of procedures • Case deletion • Single imputation • Filling with zeroes • Row averaging • SVD imputation • KNN imputation • Multiple imputation

  4. The Missing Data Problem Causes for missing data • Low resolution • Image corruption • Dust/scratched slides • Missing measurements Why estimate missing values? • Many algorithms cannot deal with missing values • Distance measure-dependent algorithms (e.g., clustering, similarity searches)

  5. Basic concepts and terminology Statistical overview • Population of complete data: θ • Sample of complete data: θs • Sample of incomplete data: θi (obtained from the complete sample through the missing data mechanism) • Need to estimate θ from the incomplete data and investigate its performance over repetitions of the sampling procedure

  6. Basic concepts • Y = sample data • f(Y;θ) = distribution of sample data • θ = parameters to be estimated • R = indicators, whether elements of Y are observed or missing • g(R|Y) = missing data mechanism (maybe with other params) • Y = (Yobs, Ymis) • Yobs = observed part of Y • Ymis = missing part of Y • Goal: Propose methods to estimate θ from Yobs and accurately assess its error

  7. Basic concepts (cont.) Classes of mechanisms (cf. Rubin, 1976, Biometrika) • Missing Completely At Random (MCAR) • g(R|Y) does not depend on Y • Missing At Random (MAR) • g(R|Y) may depend on Yobs but not on Ymis • Missing Not At Random (MNAR) • g(R|Y) depends on Ymis

  8. Example • Suppose we measure age and income of a collection of individuals… • MCAR • The dog ate the response sheets! • MAR • Probability that the income measurement is missing varies according to the age but not income • MNAR • Probability that an income is recorded varies according to the income level within each age group Note: we can disprove MCAR by examining the data, but we cannot disprove MAR or MNAR.
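The three mechanisms in the age/income example can be simulated to see how each one affects the observed incomes. This is a minimal sketch only: the function name `make_missing`, the missingness probabilities and the age/income cut-offs are illustrative assumptions, not values from the slides.

```python
import random

def make_missing(ages, incomes, mechanism, rng):
    """Return a copy of incomes with some entries replaced by None.

    All probabilities and cut-offs below are hypothetical, chosen for illustration.
    """
    observed = []
    for age, income in zip(ages, incomes):
        if mechanism == "MCAR":
            p_missing = 0.3                              # ignores age and income
        elif mechanism == "MAR":
            p_missing = 0.5 if age < 40 else 0.1         # depends on observed age only
        elif mechanism == "MNAR":
            p_missing = 0.5 if income > 60000 else 0.1   # depends on the income itself
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        observed.append(None if rng.random() < p_missing else income)
    return observed

def observed_mean(values):
    """Mean of the entries that are not missing."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present)

rng = random.Random(0)
ages = [rng.randint(20, 70) for _ in range(5000)]
incomes = [rng.uniform(20000, 100000) for _ in range(5000)]  # simulated, independent of age
true_mean = sum(incomes) / len(incomes)
mcar_mean = observed_mean(make_missing(ages, incomes, "MCAR", rng))
mnar_mean = observed_mean(make_missing(ages, incomes, "MNAR", rng))
```

In this simulation the MNAR mean of the observed incomes falls well below the true mean, because high incomes are dropped more often, while the MCAR mean stays close to it.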

  9. Outline • Missing data problem, basic concepts and terminology • Classes of procedures • Case deletion • Single imputation • Filling with zeroes • Row averaging • SVD imputation • KNN imputation • Multiple imputation

  10. Classes of procedures: Case Deletion • Remove subjects with missing values on any item needed for analysis • Advantages • Easy • Valid analysis under MCAR • OK if proportion of missing cases is small and they are not overly influential • Disadvantages • Can be inefficient, may discard a very high proportion of cases (5669 out of 6178 rows discarded in Spellman yeast data) • May introduce substantial bias, if missing data are not MCAR (complete cases may be un-representative of the population)
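Case deletion is simple enough to sketch in a few lines. The example below assumes a small made-up expression matrix in which `None` marks a missing value; the name `complete_cases` is hypothetical.

```python
def complete_cases(matrix):
    """Keep only rows (genes) with no missing entries; None marks a missing value."""
    return [row for row in matrix if all(v is not None for v in row)]

# Toy data, not taken from any real experiment.
expression = [
    [0.5, 1.2, None],
    [0.1, 0.4, 0.3],
    [None, None, 2.0],
    [1.1, 0.9, 1.0],
]
kept = complete_cases(expression)
fraction_discarded = 1 - len(kept) / len(expression)
```

Even in this toy matrix half of the rows are lost, which is the inefficiency the slide warns about.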

  11. Classes of procedures: Single Imputation (I) • Replace with zeroes • Fill-in all missing values with zeroes • Advantages • Easy • Disadvantages • Distorts the data disproportionately (changes statistical properties) • May introduce bias • Why zero?

  12. Classes of procedures: Single Imputation (II) • Row averaging • Replace missing values by the row average for that row • Advantages • Easy • Keeps same mean • Disadvantages • Distorts distributions and relationships between variables
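The two simplest single-imputation schemes, zero filling and row averaging, can be compared side by side. This is a sketch under the assumption that `None` marks a missing value; the function names are hypothetical.

```python
def fill_zeroes(matrix):
    """Replace every missing value (None) with 0.0."""
    return [[0.0 if v is None else v for v in row] for row in matrix]

def fill_row_average(matrix):
    """Replace every missing value with the mean of the observed values in its row."""
    filled = []
    for row in matrix:
        present = [v for v in row if v is not None]
        if not present:
            # A fully missing row has no average; leave it unchanged.
            filled.append(list(row))
            continue
        mean = sum(present) / len(present)
        filled.append([mean if v is None else v for v in row])
    return filled

row = [[2.0, None, 4.0]]
zero_filled = fill_zeroes(row)
mean_filled = fill_row_average(row)
```

Row averaging keeps the row mean at 3.0, whereas zero filling pulls it down to 2.0, which illustrates the "Why zero?" objection.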

  13. Classes of procedures: Single Imputation (III) • “Hot deck” imputation • Replace each missing value by a randomly drawn observed value • Advantages • Easy • Preserves distributions very well • Disadvantages • May distort relationships • Can use, e.g., “similar” rows to draw random values from (to help constrain distortion) • Depends on definition of “similar”

  14. Classes of procedures: Single Imputation (IV) • Regression imputation • Fit regression to observed values, use it to obtain predictions for missing ones • SVD imputation • Fill missing entries with regressed values from a set of characteristic patterns, using coefficients determined by the proximity of the missing row to the patterns • KNN imputation (more later) • Isolate rows whose values are similar to those of the one with missing values (choosing (i) similarity measure, and (ii) size of this set) • Fill missing values with averages from this set of genes, with weights proportional to similarities (i.e., inversely proportional to distances) • Computationally intensive • May distort relationships between variables (could use Yimp+random residual)

  15. Classes of procedures: Multiple Imputation • Main Idea • Replace Ymis by M>1 independent draws • $\{Y_{mis}^{(1)}, \ldots, Y_{mis}^{(M)}\} \sim P(Y_{mis} \mid Y_{obs})$ • Produce M different versions of complete data • Analyse each one in same fashion and combine results at the end, with standard error estimates (Rubin, 1987) • More difficult to implement • Requires (initially) more computations • More work involved in interpreting results
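The "combine results at the end" step is usually done with Rubin's combining rules: the pooled estimate is the mean of the M estimates, and its variance is the average within-imputation variance plus (1 + 1/M) times the between-imputation variance. The sketch below assumes each completed data set has already produced one estimate and one variance; the function name and the numbers are hypothetical.

```python
import math
import statistics

def pool_rubin(estimates, variances):
    """Combine M per-imputation estimates and their variances (Rubin's rules)."""
    m = len(estimates)
    pooled = statistics.fmean(estimates)        # pooled point estimate
    within = statistics.fmean(variances)        # average within-imputation variance
    between = statistics.variance(estimates)    # between-imputation variance
    total = within + (1 + 1 / m) * between
    return pooled, math.sqrt(total)

# Made-up results from M = 3 imputed data sets.
pooled_estimate, pooled_se = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

The between-imputation term is what single imputation ignores: it carries the extra uncertainty due to not knowing the missing values.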

  16. KNN Imputation • Troyanskaya et al., Bioinformatics, 2001 • The Algorithm • 0. Given gene A with a missing value in experiment 1 • 1. Find K other genes with values present in experiment 1, with expression most similar to A in the other experiments • 2. Weighted average of values in experiment 1 from the K closest genes is used as an estimate for the missing value in A

  17. KNN Imputation: Considerations • K – the number of nearest neighbours • Method appears to be relatively insensitive to K within the range 10-20 • Distance metric to be used for computing gene similarity • Troyanskaya: “Euclidean is sufficient” • No clear comparison or reason – would expect that the metric to be used depends on the type of experiment • Not recommended on matrices with fewer than four columns • Computationally intensive! • ~O(m²n) for m rows (genes) and n columns (experiments) • “3.23 minutes on a Pentium III 500 MHz for 6153 genes, 14 experiments with 10% of the entries missing”
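The KNN steps above can be sketched for a single missing entry. This is not the published implementation: the function name, the use of `None` for missing values, the averaging of squared differences over shared columns, and the 1/distance weights are all assumptions made for illustration.

```python
import math

def knn_impute_value(matrix, gene, col, k=10):
    """Estimate matrix[gene][col] from the k most similar genes observed in col."""
    target = matrix[gene]
    candidates = []
    for i, row in enumerate(matrix):
        if i == gene or row[col] is None:
            continue  # a neighbour must have a value in the target experiment
        # Compare only experiments observed in both genes, excluding the target one.
        shared = [j for j in range(len(row))
                  if j != col and target[j] is not None and row[j] is not None]
        if not shared:
            continue
        dist = math.sqrt(sum((target[j] - row[j]) ** 2 for j in shared) / len(shared))
        candidates.append((dist, row[col]))
    if not candidates:
        return None
    candidates.sort(key=lambda c: c[0])
    nearest = candidates[:k]
    # Assumed weighting: closer genes count more (weight = 1 / distance).
    weights = [1.0 / (d + 1e-9) for d, _ in nearest]
    return sum(w * v for w, (_, v) in zip(weights, nearest)) / sum(weights)

# Toy matrix: gene 0 is missing its value in experiment 0.
matrix = [
    [None, 1.0, 2.0],
    [5.0, 1.1, 2.1],
    [5.4, 0.9, 1.9],
    [9.0, 8.0, 9.0],
]
estimate = knn_impute_value(matrix, gene=0, col=0, k=2)
```

With k = 2 the distant fourth gene is ignored and the estimate is close to the average of 5.0 and 5.4. Each imputed gene is compared against every other gene, which is where the quadratic cost in the number of genes comes from.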

  18. KNN Imputation: Expression Profiler

  19. Outline • Missing Value Estimation • Differentially Expressed Genes • Clustering Algorithms • Principal Components Analysis

  20. Identifying Differentially Expressed Genes [Slides courtesy of John Quackenbush, TIGR]

  21. Two vs. Multiple conditions • Two conditions - t-test - Significance analysis of microarrays (SAM) - Volcano Plots • - ANOVA • Multiple conditions - Clustering - K-means - PCA

  22. How Many Replicates? $n = \frac{4\,(z_{\alpha/2} + z_{\beta})^2}{\left(\delta / (1.4\,\sigma)\right)^2}$ where $z_{\alpha/2}$ and $z_{\beta}$ are normal percentile values at false positive (Type I error) rate $\alpha$ and false negative (Type II error) rate $\beta$; $\delta$ represents the minimum detectable log2 ratio; and $\sigma$ represents the SD of log ratio values. For $\alpha = 0.001$ and $\beta = 0.05$, get $z_{\alpha/2} = -3.29$ and $z_{\beta} = -1.65$. Assume $\delta = 1.0$ (2-fold change) and $\sigma = 0.25$, which gives n = 12 samples (6 query and 6 control) (Simon et al., Genetic Epidemiology 23: 21-36, 2002)
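The worked example on this slide can be checked numerically. The sketch below uses the standard library's normal distribution; the function name is hypothetical, and the upper-tail (positive) percentiles are used, which makes no difference because the sum is squared.

```python
import math
from statistics import NormalDist

def replicates_needed(alpha, beta, delta, sigma):
    """Evaluate n = 4 (z_{alpha/2} + z_beta)^2 / (delta / (1.4 sigma))^2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 3.29 for alpha = 0.001
    z_beta = NormalDist().inv_cdf(1 - beta)        # about 1.645 for beta = 0.05
    return 4 * (z_alpha + z_beta) ** 2 / (delta / (1.4 * sigma)) ** 2

n_exact = replicates_needed(alpha=0.001, beta=0.05, delta=1.0, sigma=0.25)
n_samples = math.ceil(n_exact)  # round up to a whole number of samples
```

The exact value is a little under 12, which rounds up to the 12 samples quoted on the slide.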

  23. Some Concepts from Statistics

  24. Probability Distributions • The probability of an event is the likelihood of its occurring. • It is sometimes computed as a relative frequency (rf), where $rf = \frac{\text{the number of “favorable” outcomes for an event}}{\text{the total number of possible outcomes for that event}}$ • The probability of an event can sometimes be inferred from a “theoretical” probability distribution, such as a normal distribution.

  25. Normal Distribution [Figure: normal curve centred at X = μ (mean of the distribution); σ = standard deviation of the distribution]

  26. [Figure: distributions of Population 1 (Mean 1) and Population 2 (Mean 2), with the sample mean “s” marked] • Less than a 5% chance that the sample with mean s came from Population 1 • s is significantly different from Mean 1 at the p < 0.05 significance level. • But we cannot reject the hypothesis that the sample came from Population 2

  27. Probability and Expression Data • Many biological variables, such as height and weight, can reasonably be assumed to approximate the normal distribution. • But expression measurements? Probably not. • Fortunately, many statistical tests are considered to be fairly robust to violations of the normality assumption, and other assumptions used in these tests. • Randomization / resampling based tests can be used to get around the violation of the normality assumption. • Even when parametric statistical tests (the ones that make use of normal and other distributions) are valid, randomization tests are still useful.

  28. Outline of a Randomisation Test 1. Compute the value of interest (i.e., the test-statistic s) from your data set. 2. Make “fake” data sets from your original data, by taking a random sub-sample of the data, or by re-arranging the data in a random fashion. Re-compute s from the “fake” data set. [Figure: the original data set yields s; each randomized “fake” data set yields a “fake” s]

  29. Outline of a Randomisation Test (II) 3. Repeat step 2 many times (often several hundred to several thousand times) and record the “fake” s values from step 2. 4. Draw inferences about the significance of your original s value by comparing it with the distribution of the randomized (“fake”) s values. [Figure: range of randomized s values; the original s value could be significant as it exceeds most of the randomized s values]

  30. Outline of a Randomisation Test (III) • Rationale • Ideally, we want to know the “behavior” of the larger population from which the sample is drawn, in order to make statistical inferences. • Here, we don’t know that the larger population “behaves” like a normal distribution, or some other idealized distribution. All we have to work with are the data in hand. • Our “fake” data sets are our best guess about this behavior (i.e., if we had been pulling data at random from an infinitely large population, we might expect to get a distribution similar to what we get by pulling random sub-samples, or by reshuffling the order of the data in our sample)
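The four steps of the randomisation test can be sketched generically. Here the test statistic s is taken to be a Pearson correlation and the "fake" data sets are made by reshuffling one variable; both choices, along with the function names and the toy data, are assumptions for illustration.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def randomization_test(x, y, statistic, n_shuffles=1000, seed=0):
    """Return the observed statistic and the fraction of fake values at least as extreme."""
    rng = random.Random(seed)
    observed = statistic(x, y)                        # step 1: s from the real data
    shuffled = list(y)
    fake_values = []
    for _ in range(n_shuffles):                       # step 3: repeat many times
        rng.shuffle(shuffled)                         # step 2: re-arrange at random
        fake_values.append(statistic(x, shuffled))    # re-compute s on the fake data
    n_extreme = sum(abs(f) >= abs(observed) for f in fake_values)
    return observed, n_extreme / n_shuffles           # step 4: compare with fake values

x = list(range(20))
y = [2 * v + 1 for v in x]   # toy data with a perfect linear relationship
observed_r, p_value = randomization_test(x, y, pearson, n_shuffles=500)
```

No assumption about the shape of the underlying distribution enters this calculation; the reshuffled data sets play the role of the reference distribution.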

  31. The Problem of Multiple Testing (I) • Let’s imagine there are 10,000 genes on a chip, and • none of them is differentially expressed. • Suppose we use a statistical test for differential expression, where we consider a gene to be differentially expressed if it meets the criterion at a p-value of p < 0.05.

  32. The Problem of Multiple Testing (II) • Let’s say that applying this test to gene “G1” yields a p-value of p = 0.01 • Remember that a p-value of 0.01 means that, if the gene were not differentially expressed, there would be only a 1% chance of observing a difference at least this extreme, i.e., • When we conclude that the gene is differentially expressed (because p < 0.05), we accept a small risk of a false positive: a gene that is not differentially expressed passes the p < 0.05 criterion only 5% of the time. • We might be willing to live with such a low probability of being wrong • BUT .....

  33. The Problem of Multiple Testing (III) • We are testing 10,000 genes, not just one!!! • Even though none of the genes is differentially expressed, about 5% of the genes (i.e., 500 genes) will be erroneously concluded to be differentially expressed, because we have decided to “live with” a p-value of 0.05 • If only one gene were being studied, a 5% margin of error might not be a big deal, but 500 false conclusions in one study? That doesn’t sound too good.
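The 500-false-positives figure is easy to reproduce by simulation. The sketch assumes that, when no gene is differentially expressed, each gene's p-value is uniformly distributed between 0 and 1 and that the tests are independent; the variable names are hypothetical.

```python
import random

n_genes = 10000
alpha = 0.05
rng = random.Random(1)

# Assumption: under the null hypothesis, p-values are uniform on [0, 1).
null_p_values = [rng.random() for _ in range(n_genes)]

false_positives = sum(p < alpha for p in null_p_values)  # genes wrongly called significant
expected_false_positives = n_genes * alpha               # 10,000 x 0.05

# Probability of at least one false positive across all tests (independence assumed).
p_at_least_one = 1 - (1 - alpha) ** n_genes
```

The simulated count lands near the expected 500, and the chance of getting through 10,000 tests without a single false positive is effectively zero.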

  34. The Problem of Multiple Testing (IV) • There are “tricks” we can use to reduce the severity of this problem. • They all involve “slashing” the p-value for each test (i.e., gene), so that while the critical p-value for the entire data set might still equal 0.05, each gene will be evaluated at a lower p-value. • We’ll go into some of these techniques later.

  35. The Problem of Multiple Testing (V) • Don’t get too hung up on p-values. • Ultimately, what matters is biological relevance. • P-values should help you evaluate the strength of the evidence, rather than being used as an absolute yardstick of significance. • Statistical significance is not necessarily the same as biological significance.

  36. Finding Significant Genes • Assume we will compare two conditions with multiple replicates for each class • Our goal is to find genes that are significantly different between these classes • These are the genes that we will use for later data mining

  37. Finding Significant Genes (II) • Average Fold Change Difference for each gene • suffers from being arbitrary and not taking into account systematic variation in the data

  38. Finding Significant Genes (III) • t-test for each gene: $t = \frac{\text{signal}}{\text{noise}} = \frac{\text{difference between means}}{\text{variability of groups}} = \frac{\langle X_q \rangle - \langle X_c \rangle}{SE(X_q - X_c)}$ • Tests whether the means of the query and reference groups are the same • Essentially measures signal-to-noise • Calculate p-value (permutations or distributions) • May suffer from intensity-dependent effects
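The signal-to-noise ratio for one gene can be computed directly. The slide does not say how the standard error is formed, so this sketch assumes the unequal-variance (Welch) form; the function name and the expression values are made up.

```python
import math
import statistics

def t_statistic(query, control):
    """Difference between group means divided by its standard error (Welch form, assumed)."""
    signal = statistics.fmean(query) - statistics.fmean(control)
    noise = math.sqrt(statistics.variance(query) / len(query)
                      + statistics.variance(control) / len(control))
    return signal / noise

# Toy log-ratio values for two genes, three replicates per group.
t_clear = t_statistic([5.0, 6.0, 7.0], [1.0, 2.0, 3.0])   # large, consistent difference
t_noisy = t_statistic([1.0, 6.0, 11.0], [0.0, 2.0, 4.0])  # same mean difference, more spread
```

Both genes have the same difference between means, but the noisier gene gets a much smaller t, which is what fold change alone fails to capture.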

  39. T-Tests [Figure: two comparisons of group distributions, one labelled “A significant difference”, the other “Probably not”]

  40. T-Tests (I) 1. Assign experiments to two groups, e.g., in an expression matrix of Genes 1-6 (rows) by Experiments 1-6 (columns), assign Experiments 1, 2 and 5 to group A, and Experiments 3, 4 and 6 to group B. 2. Question: Is mean expression level of a gene in group A significantly different from mean expression level in group B?

  41. T-Tests (II) 3. Calculate t-statistic for each gene 4. Calculate probability value of the t-statistic for each gene either from: A. Theoretical t-distribution OR B. Permutation tests.

  42. T-Tests (III) Permutation tests i) For each gene, compute t-statistic ii) Randomly shuffle the values of the gene between groups A and B, such that the reshuffled groups A and B respectively have the same number of elements as the original groups A and B. [Figure: Gene 1 under the original grouping (group A: Exp 1, 2, 5; group B: Exp 3, 4, 6) and under a randomized grouping]

  43. T-Tests (IV) Permutation tests - continued iii) Compute t-statistic for the randomized gene iv) Repeat steps ii-iii n times (where n is specified by the user). v) Let x = the number of times the absolute value of the original t-statistic exceeds the absolute value of the randomized t-statistic over n randomizations. vi) Then, the p-value associated with the gene = 1 – (x/n)
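Steps i-vi translate almost line for line into code. In this sketch the t-statistic uses the unequal-variance form, which is an assumption, and the function names and data are hypothetical; it also assumes no reshuffled group has zero variance.

```python
import math
import random
import statistics

def t_statistic(a, b):
    """Difference in means over its standard error (Welch form, assumed)."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.fmean(a) - statistics.fmean(b)) / se

def permutation_p_value(group_a, group_b, n=1000, seed=0):
    """p = 1 - x/n, where x counts randomizations that the original |t| exceeds."""
    rng = random.Random(seed)
    original = abs(t_statistic(group_a, group_b))          # step i
    pooled = list(group_a) + list(group_b)
    size_a = len(group_a)
    x = 0
    for _ in range(n):                                     # step iv
        rng.shuffle(pooled)                                # step ii: keep group sizes
        randomized = abs(t_statistic(pooled[:size_a], pooled[size_a:]))  # step iii
        if original > randomized:                          # step v
            x += 1
    return 1 - x / n                                       # step vi

p_value = permutation_p_value([10.0, 11.0, 12.0], [1.0, 2.0, 3.0])
```

With three replicates per group there are only 20 distinct ways to split the six values, and two of them (the original split and its mirror image) tie with the original |t|, so the p-value cannot fall much below 0.1 however large the difference. This is one reason small designs limit what permutation tests can show.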

  44. T-Tests (V) 5. Determine whether a gene’s expression levels are significantly different between the two groups by one of three methods: A) “Just alpha” (α significance level): If the calculated p-value for a gene is less than or equal to the user-input α (critical p-value), the gene is considered significant. OR use Bonferroni corrections to reduce the probability of erroneously classifying non-significant genes as significant. B) Standard Bonferroni correction: The user-input α is divided by the total number of genes to give a critical p-value that is used as above: pcritical = α/N.

  45. T-Tests (VI) 5C) Adjusted Bonferroni: i) The t-values for all the genes are ranked in descending order. ii) For the gene with the highest t-value, the critical p-value becomes (α/N), where N is the total number of genes; for the gene with the second-highest t-value, the critical p-value will be (α/[N-1]), and so on.
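The standard and adjusted Bonferroni thresholds can be compared on a handful of genes. This is a sketch with two labelled assumptions: genes are ranked by absolute t-value, and the adjusted procedure stops at the first gene that fails its threshold (as in Holm's step-down method), which the slide does not state. Function names and numbers are hypothetical.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Standard Bonferroni: every gene is tested at alpha / N."""
    n = len(p_values)
    return [p <= alpha / n for p in p_values]

def adjusted_bonferroni_significant(p_values, t_values, alpha=0.05):
    """Adjusted Bonferroni: thresholds alpha/N, alpha/(N-1), ... by descending |t|."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: abs(t_values[i]), reverse=True)
    significant = [False] * n
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (n - rank):
            significant[i] = True
        else:
            break  # assumed Holm-style stop at the first failure
    return significant

# Made-up p-values and t-values for four genes.
p_values = [0.001, 0.015, 0.04, 0.5]
t_values = [5.0, 3.0, 2.5, 0.5]
standard = bonferroni_significant(p_values)
adjusted = adjusted_bonferroni_significant(p_values, t_values)
```

The second gene (p = 0.015) misses the standard threshold of 0.0125 but passes the adjusted threshold of about 0.0167, so the adjusted version is slightly less conservative.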

  46. Finding Significant Genes (IV) • Significance Analysis of Microarrays (SAM) • Uses a modified t-test by estimating and adding a small positive constant to the denominator • Significant genes are those which exceed the expected values from permutation analysis.
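The effect of the small positive constant can be seen in a short sketch of the modified statistic. SAM estimates this constant from the data; here it is simply passed in as a hypothetical fixed value `s0`, and the pooled standard error used for the denominator is an assumption about the exact form.

```python
import math
import statistics

def sam_d(group_a, group_b, s0):
    """Modified t: difference in means divided by (pooled standard error + s0)."""
    n_a, n_b = len(group_a), len(group_b)
    mean_a, mean_b = statistics.fmean(group_a), statistics.fmean(group_b)
    sum_sq = (sum((v - mean_a) ** 2 for v in group_a)
              + sum((v - mean_b) ** 2 for v in group_b))
    s = math.sqrt((1 / n_a + 1 / n_b) * sum_sq / (n_a + n_b - 2))
    return (mean_a - mean_b) / (s + s0)

# A gene with a tiny difference but almost no variance (toy values).
flat_a, flat_b = [1.01, 1.02, 1.03], [1.00, 1.01, 1.02]
d_plain = sam_d(flat_a, flat_b, s0=0.0)   # ordinary pooled t-statistic
d_damped = sam_d(flat_a, flat_b, s0=0.5)  # hypothetical constant added
d_ref = sam_d([5.0, 6.0, 7.0], [1.0, 2.0, 3.0], s0=0.0)
```

Without the constant, a gene with a negligible difference but very low variance can still produce a sizeable statistic; adding `s0` to the denominator damps it.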

  47. SAM • SAM can be used to select significant genes based on differential expression between sets of conditions • Currently implemented for two-class unpaired design – i.e., we can select genes whose mean expression level is significantly different between two groups of samples (analogous to t-test). • Stanford University, Rob Tibshirani: http://www-stat.stanford.edu/~tibs/SAM/index.html

  48. SAM • SAM gives estimates of the False Discovery Rate (FDR), which is the proportion of genes likely to have been wrongly identified by chance as being significant. • It is a very interactive algorithm – allows users to dynamically change thresholds for significance (through the tuning parameter delta) after looking at the distribution of the test statistic. • The ability to dynamically alter the input parameters based on immediate visual feedback, even before completing the analysis, should make the data-mining process more sensitive.

  49. SAM Two-class 1. Assign experiments to two groups, in an expression matrix of Genes 1-6 (rows) by Experiments 1-6 (columns): Experiments 1, 2 and 5 to group A; Experiments 3, 4 and 6 to group B. 2. Question: Is mean expression level of a gene in group A significantly different from mean expression level in group B?

  50. SAM Two-class Permutation tests i) For each gene, compute d-value (analogous to t-statistic). This is the observed d-value for that gene. ii) Randomly shuffle the values of the gene between groups A and B, such that the reshuffled groups A and B have the same number of elements as the original groups A and B. Compute the d-value for each randomized gene. [Figure: Gene 1 under the original grouping (group A: Exp 1, 2, 5; group B: Exp 3, 4, 6) and under a randomized grouping]
