
Statistical Analysis of Microarray Data


Presentation Transcript


  1. Statistical Analysis of Microarray Data Ka-Lok Ng Asia University

  2. Statistical Analysis of Microarray Data Ratios and reference samples • Compute the ratio of fluorescence intensities for two samples that are competitively hybridized to the same microarray. One sample acts as a control, or “reference” sample, and is labeled with a dye (Cy3) that has a different fluorescence spectrum from the dye (Cy5) used to label the experimental sample. • A convention emerged that two-fold induction or repression of an experimental sample, relative to the reference sample, was indicative of a meaningful change in gene expression. • This convention does not reflect the standard statistical definition of significance. • It often has the effect of selecting the top 5% or so of the clones present on the microarray.

  3. Statistical Analysis of Microarray Data Reasons for adopting ratios as the standard for comparison of gene expression • Microarrays do not provide data on absolute expression levels. Formulating a ratio captures the central idea that it is the change in relative level of expression that is biologically interesting. • Ratios remove variation among arrays from the analysis. Differences between microarrays include (1) the absolute amount of DNA spotted on the arrays, and (2) local variation introduced either during slide preparation and washing, or during image capture.

  4. Simple normalization of microarray data. The difference between the raw fluorescence intensities is a meaningless number. Computing ratios allows immediate visualization of which genes are higher in the red channel than in the green channel, and logarithmic transformation of this measure on the base 2 scale results in a symmetric distribution of values. Finally, normalization by subtraction of the mean log ratio adjusts for the fact that the red channel was generally more intense than the green channel, and centers the data around zero. Statistical Analysis of Microarray Data All microarray experiments must be normalized to ensure that biases inherent in each hybridization are removed. This is true whether ratios or raw fluorescence intensities are adopted as the measure of transcript abundance.
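
A compact way to write this normalization (a sketch; the notation R_g, G_g for the background-subtracted red and green intensities of spot g, and N for the number of spots, is ours, not the slide's):

M_g = \log_2\frac{R_g}{G_g}, \qquad M_g^{\mathrm{norm}} = M_g - \frac{1}{N}\sum_{k=1}^{N} M_k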

  5. Statistical Analysis of Microarray Data

  6. Calculate which genes are differentially expressed. Statistical Analysis of Microarray Data The table gives the fluorescence intensity for the Cy3 and Cy5 channels after background subtraction. Calculate which genes are at least twofold different in their abundance on this array using two different approaches: (a) by forming the Cy3:Cy5 ratio, and (b) by calculating the difference in the log base 2 transformed values. In both cases, make sure that you adjust for any overall difference in intensity between the two dyes, and comment on whether this adjustment affects your conclusions.
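
A minimal sketch of both approaches in Python; the intensity vectors below are hypothetical placeholders, since the actual values live in the table on the slide:

import numpy as np

# Hypothetical background-subtracted intensities for six genes (placeholders for the slide's table)
cy3 = np.array([1500.0, 400.0, 2100.0, 900.0, 4500.0, 1300.0])
cy5 = np.array([1400.0, 950.0, 2000.0, 880.0, 2400.0, 1250.0])

# (a) Ratio method: divide each ratio by the average ratio to remove the overall dye bias
ratio = cy3 / cy5
adj_ratio = ratio / ratio.mean()
twofold_ratio = (adj_ratio >= 2) | (adj_ratio <= 0.5)

# (b) Log method: mean-center each channel in log2 space, then take the difference
log_diff = (np.log2(cy3) - np.log2(cy3).mean()) - (np.log2(cy5) - np.log2(cy5).mean())
twofold_log = np.abs(log_diff) >= 1          # |log2 fold change| >= 1 means at least twofold

for g in range(len(cy3)):
    print(f"gene {g + 1}: adjusted ratio = {adj_ratio[g]:.2f}, "
          f"log2 difference = {log_diff[g]:.2f}, "
          f"twofold (ratio/log) = {twofold_ratio[g]}/{twofold_log[g]}")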

  7. Statistical Analysis of Microarray Data Divide each ratio by the average ratio (0.954) to adjust for the overall dye effect.

  8. Statistical Analysis of Microarray Data Using the ratio method, without adjustment for overall dye effects, genes 2 and 9 appear to have Cy3/Cy5 < 0.5, suggesting that they are differentially regulated. No genes have Cy3/Cy5 > 2. However, the average ratio is 0.95, indicating that overall fluorescence is generally 5% greater in the Cy5 (red) channel. One way to adjust for this is to divide the individual ratios by the average ratio, which gives the adjusted ratio column. This confirms that gene 2 is underexpressed in Cy3, but not gene 9, whereas gene 5 may be overexpressed.

  9. Statistical Analysis of Microarray Data Using the log transformation method, you get very similar results (-1 and +1). The adjusted columns give the difference between the log2 fluorescence intensity and the mean log2 intensity for the respective dye, and hence express the fluorescence intensity relative to the sample mean. The difference between these values gives the final column, indicating that genes 2 and 5 may be differentially expressed by twofold or more.

  10. Statistical Analysis of Microarray Data If you just subtract the raw log2 values, you will see that gene 9 appears to be underexpressed in Cy3, but gene 5 appears to be slightly less than twofold overexpressed.

  11. Finding significant genes • After normalizing, filtering and averaging the data, one can identify genes whose expression ratios are significantly different from 1 (i.e. whose log2 ratios differ significantly from 0) • Some genes fluctuate a great deal more than others (Hughes et al. 2000a, b) • In general, the genes whose expression is most variable are those whose expression is stress induced, modulated by the immune system, or hormonally regulated (Pritchard et al. 2001) • Microarray data sets also suffer from the missing-value problem; missing values can be estimated by interpolation • References • Hughes TR, et al. (2000a) Functional discovery via a compendium of expression profiles. Cell 102(1):109-126 • Hughes TR, et al. (2000b) Widespread aneuploidy revealed by DNA microarray expression profiling. Nat Genet 25(3):333-337 • Pritchard CC, et al. (2001) Project normal: defining normal variance in mouse gene expression. PNAS 98:13266.

  12. Measure of similarity – definition of distance A measure of similarity is a distance • Euclidean distance between two gene expression vectors, for example p53 and mdm2
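
The Euclidean distance formula itself does not survive in the transcript; the standard definition, for two expression vectors x = (x1, …, xn) and y = (y1, …, yn) measured over n conditions, is:

d_{E}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}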

  13. Measure of similarity – definition of distance Non-Euclidean metrics • Any distance dij between two vectors i and j must satisfy a number of rules: • The distance must be positive definite • The distance must be symmetric, dij = dji • An object is zero distance from itself, dii = 0 • Triangle inequality, dik ≦ dij + djk • Distance measures that obey rules 1 to 3 but not 4 are referred to as semi-metric. • The Manhattan (or city block) distance is an example of a non-Euclidean distance metric. It is defined as the sum of the absolute differences between the components i of the two expression vectors x and y (see the formula below). It measures the route one might have to travel between two points in a place such as Manhattan, where the streets and avenues are arranged at right angles to one another. It is known as the Hamming distance when applied to data expressed in binary form, e.g. if the expression levels of the genes have been discretized into 1s and 0s.
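
The Manhattan distance formula, again reconstructed from the standard definition for vectors x and y of length n:

d_{M}(x, y) = \sum_{i=1}^{n} |x_i - y_i|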

  14. Measure of similarity – definition of distance • The Chebychev distance (the L∞, Chebychev or maximum metric) between two n-dimensional vectors x = (x1, x2, …, xn) and y = (y1, y2, …, yn) is the largest component-wise difference. • The Chebychev distance picks the one experiment in which the two genes are most different (the largest difference) and takes that value as the distance between the genes. • Because it only looks at one dimension, the Chebychev distance behaves inconsistently with respect to outliers: if any or all other coordinates are changed by measurement error without changing the maximum difference, the Chebychev distance remains the same; in that sense it is resilient to noise and outliers. • However, if any one coordinate is affected enough that the maximum difference changes, the Chebychev distance will change. • In general, the Chebychev distance is resilient to small amounts of noise, even if the noise affects several coordinates, but it is affected by a single large change.

  15. Measure of similarity – definition of distance • The Minkowski distance is a generalization of the Euclidean distance. The parameter p is called the order: the higher the value of p, the more significant is the contribution of the largest component difference |ai – bi|. p = 1 gives the Manhattan distance, p = 2 the Euclidean distance, and p = ∞ the Chebychev distance. The Mahalanobis metric is defined in terms of Cov(D), the covariance matrix of the dataset D; if Cov(D) is the identity matrix, the Mahalanobis distance reduces to the Euclidean distance. Herman Minkowski (1864-1909) http://library.thinkquest.org/05aug/01273/whoswho.html http://www.comp.lancs.ac.uk/~kristof/research/notes/basicstats/index.html
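
A minimal sketch of these metrics in Python, using a small hypothetical expression matrix; the Minkowski distance is taken as (Σ|ai − bi|^p)^(1/p) and the Mahalanobis distance as sqrt((a − b)ᵀ Cov(D)⁻¹ (a − b)), following the standard definitions rather than the (missing) slide formulas:

import numpy as np

def minkowski(a, b, p):
    """Minkowski distance of order p: p=1 Manhattan, p=2 Euclidean, p=inf Chebychev."""
    diff = np.abs(np.asarray(a, float) - np.asarray(b, float))
    if np.isinf(p):
        return float(diff.max())
    return float((diff ** p).sum() ** (1.0 / p))

def mahalanobis(a, b, data):
    """Mahalanobis distance using the covariance matrix of `data` (rows = observations)."""
    cov = np.cov(data, rowvar=False)
    cov_inv = np.linalg.pinv(cov)        # pseudo-inverse guards against a singular covariance matrix
    d = np.asarray(a, float) - np.asarray(b, float)
    return float(np.sqrt(d @ cov_inv @ d))

# Hypothetical expression matrix (rows = genes, columns = conditions)
data = np.array([[1.0, 2.0, 0.5],
                 [0.8, 2.2, 0.4],
                 [2.5, 0.3, 1.9],
                 [2.4, 0.1, 2.0],
                 [1.5, 1.0, 1.2],
                 [0.2, 2.8, 0.1]])
x, y = data[0], data[2]
print(minkowski(x, y, 1), minkowski(x, y, 2), minkowski(x, y, np.inf))
print(mahalanobis(x, y, data))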

  16. Measure of similarity – definition of distance Graphical illustration of the Manhattan and Euclidean distances between two points (in the example shown, Manhattan distance = 3, and the Hamming distance for the binary case is also 3). http://www.comp.lancs.ac.uk/~kristof/research/notes/basicstats/index.html

  17. Measure of similarity – definition of distance The higher the value of p, the more significant is the contribution of the largest component difference |ai – bi|. (Worked example on the slide: one Minkowski distance, 3.037, is already close to the largest component difference of 3 and smaller than the Euclidean value of 3.162; for larger p the distance is close to 10.)

  18. Measure of similarity – definition of distance The Canberra metric sums, over the components, the absolute difference divided by the sum of the absolute values of the components. The output ranges from 0 to the number of variables used: when xi and yi have opposite signs, the maximum of |xi – yi| is |xi| + |yi|. The Canberra distance is very sensitive to small changes near zero, that is, when there is a change of sign near zero. http://www.comp.lancs.ac.uk/~kristof/research/notes/basicstats/index.html
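
The formula is not shown in the transcript; the standard Canberra distance for vectors x and y of length n is:

d_{C}(x, y) = \sum_{i=1}^{n} \frac{|x_i - y_i|}{|x_i| + |y_i|}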

  19. Measure of similarity – definition of distance • Euclidean distance is one of the most intuitive ways to measure the distance between points in space, but it is not always the most appropriate one for expression profiles. • We need distance measures that score as similar those gene expression profiles that show a similar trend, rather than ones that depend on the absolute levels. • Two simple measures that can be used are the angular and chord distances (illustrated on the slide by two vectors A and B, the angle between them, and the chord joining their unit vectors).

  20. Measure of similarity – definition of distance • A = (ax, ay), B = (bx, by) • The cosine of the angle between the two vectors A and B is given by their normalized dot product and can be used as a similarity measure. In n-dimensional space, for vectors A = (a1, …, an) and B = (b1, …, bn), the cosine is the dot product A·B divided by the product of the vector lengths. The chord distance is defined as the length of the chord between the vectors of unit length having the same directions as the original ones.
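
The formulas themselves do not survive in the transcript; the standard definitions are:

\cos\theta = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2}\,\sqrt{\sum_{i=1}^{n} b_i^2}}, \qquad d_{\mathrm{chord}}(A, B) = \left\lVert \frac{A}{\lVert A\rVert} - \frac{B}{\lVert B\rVert} \right\rVert = \sqrt{2\,(1 - \cos\theta)}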

  21. Semimetric distance – Pearson correlation coefficient or covariance Statistics – standard deviation and variance, var(X) = s2, apply to 1-dimensional data • What about higher-dimensional data? • It is useful to have a similar measure of how much the dimensions vary from their means with respect to each other. • Covariance is measured between 2 dimensions: if one has a 3-dimensional data set (X,Y,Z), then one can calculate Cov(X,Y), Cov(X,Z) and Cov(Y,Z). • To compare heterogeneous pairs of variables, define the correlation coefficient, or Pearson correlation coefficient, with -1 ≦ rXY ≦ 1: -1 indicates perfect anticorrelation, 0 independence, +1 perfect correlation.

  22. Semimetric distance – the squared Pearson correlation coefficient • The Pearson correlation coefficient (PCC) is useful for examining correlations in the data. • One may imagine an instance, for example, in which the same TF can cause both enhancement and repression of expression. • A better alternative in such cases is the squared Pearson correlation coefficient, which takes values in the range 0 ≦ rsq ≦ 1: 0 for uncorrelated vectors, 1 for perfectly correlated or anti-correlated vectors. • The PCC and squared PCC are measures of similarity. Similarity and distance have a reciprocal relationship: as similarity increases, distance decreases, so d = 1 – r is typically used as a measure of distance.
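
A minimal sketch of these correlation-based measures in Python, using two hypothetical expression profiles:

import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient between two expression profiles."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

# Hypothetical profiles over six conditions
a = [2.1, 3.0, 1.2, 0.5, 4.4, 2.8]
b = [1.9, 3.2, 1.0, 0.7, 4.1, 2.5]

r = pearson(a, b)
r_sq = r ** 2                 # squared PCC: close to 1 for correlated or anti-correlated profiles
dist = 1 - r                  # correlation distance d = 1 - r
print(f"r = {r:.3f}, r^2 = {r_sq:.3f}, d = 1 - r = {dist:.3f}")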

  23. Semimetric distance – Pearson correlation coefficient or covariance • The resulting rXY value will be larger than 0 if X and Y tend to increase or decrease together, below 0 if one tends to increase while the other decreases, and near 0 if they are independent. • Remark: rXY only tests whether there is a linear dependence, Y = aX + b. If two variables are independent, rXY is low; however, a low rXY does not imply independence – the relation may be non-linear. A high rXY is a sufficient but not a necessary condition for variable dependence.

  24. Semimetric distance – the squared Pearson correlation coefficient • To test for a non-linear relation among the data, one can transform the data by variable substitution. • Suppose one wants to test the relation u(v) = avn. • Take the logarithm of both sides: log u = log a + n log v. • Set Y = log u, b = log a, and X = log v. • This gives a linear relation, Y = b + nX, so log u correlates (n > 0) or anti-correlates (n < 0) with log v.
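
A small sketch of this substitution in Python, using synthetic data generated from a power law u = a·v^n with multiplicative noise (the data and parameter values are ours, for illustration only):

import numpy as np

rng = np.random.default_rng(0)
v = np.linspace(1.0, 50.0, 40)
u = 2.0 * v ** 1.5 * rng.lognormal(0.0, 0.05, size=v.size)   # u = a*v^n with noise

# Pearson r on the raw v-u scale captures only the linear part of the relation
r_raw = np.corrcoef(v, u)[0, 1]

# After the substitution X = log v, Y = log u, the relation is linear: Y = log a + n*X
X, Y = np.log(v), np.log(u)
r_log = np.corrcoef(X, Y)[0, 1]
n_est, log_a_est = np.polyfit(X, Y, 1)       # slope estimates the exponent n

print(f"r(raw) = {r_raw:.3f}, r(log-log) = {r_log:.3f}, estimated n = {n_est:.2f}")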

  25. Semimetric distance – Pearson correlation coefficient or covariance matrix A covariance matrix is merely a collection of the pairwise covariances, arranged in a d x d matrix:
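
The matrix itself is not reproduced in the transcript; for a d-dimensional data set the standard form is:

\Sigma = \begin{pmatrix} \mathrm{Cov}(X_1,X_1) & \mathrm{Cov}(X_1,X_2) & \cdots & \mathrm{Cov}(X_1,X_d) \\ \mathrm{Cov}(X_2,X_1) & \mathrm{Cov}(X_2,X_2) & \cdots & \mathrm{Cov}(X_2,X_d) \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{Cov}(X_d,X_1) & \mathrm{Cov}(X_d,X_2) & \cdots & \mathrm{Cov}(X_d,X_d) \end{pmatrix}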

  26. Spearman’s rank correlation • One of the problems with using the PCC is that it is susceptible to being skewed by outliers: a single data point can make two genes appear to be correlated, even when all the other data points suggest that they are not. • Spearman’s rank correlation (SRC) is a non-parametric measure of correlation that is robust to outliers. • SRC ignores the magnitude of the changes. The idea of rank correlation is to transform the original values into ranks, and then to compute the correlation between the series of ranks. • First, order the values of genes A and B in ascending order and assign the lowest value rank 1. The SRC between A and B is then defined as the PCC between the ranked A and B. • In case of ties, assign midranks: if two values would occupy ranks 5 and 6, both are assigned a rank of 5.5.

  27. Spearman’s rank correlation The SRC can be calculated by the following formula, where xi and yi denote the ranks of the x and y values respectively; an approximate formula is used in the case of ties.
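
The formulas do not survive in the transcript; the standard (no-ties) form, with d_i the difference between the ranks of the i-th pair and n the number of pairs, is:

r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}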

  28. Distances in discretized space • Sometimes it is advantageous to use a discretized expression matrix as the starting point, e.g. to assign the values 0 (expression unchanged), 1 (expression increased) and -1 (expression decreased). • The similarity between two discretized vectors can be measured using the notion of Shannon entropy.

  29. Entropy and the Second Law of Thermodynamics: disorder and the unavailability of energy Entropy always increases. When ice melts, it becomes more disordered and less structured.

  30. Statistical Interpretation of Entropy and the Second Law S = k ln W S = entropy, k = Boltzmann constant, ln W = natural logarithm of the number of microstates W corresponding to the given macrostate. L. Boltzmann (1844-1906) http://automatix.physik.uos.de/~jgemmer/hintergrund_en.html

  31. Entropy and the Second Law of Thermodynamics: Disorder and the Unavailability of energy

  32. Concept of entropy Toss 5 coins; the possible outcomes and their microstate counts are 5H0T – 1, 4H1T – 5, 3H2T – 10, 2H3T – 10, 1H4T – 5, 0H5T – 1, for a total of 32 microstates. The most probable macrostates are 3H2T and 2H3T. Propose that entropy grows with the number of microstates W, i.e. S ~ W. (The coin tosses can be generated with Excel.)
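
An equivalent sketch in Python (standing in for the Excel suggestion), counting how often each macrostate appears in repeated simulated tosses and comparing with the binomial counts above:

import numpy as np
from collections import Counter
from math import comb

rng = np.random.default_rng(1)
n_trials, n_coins = 100_000, 5

# Each row is one toss of 5 fair coins; count the heads per trial
heads = rng.integers(0, 2, size=(n_trials, n_coins)).sum(axis=1)
counts = Counter(heads.tolist())

# Compare observed frequencies with the microstate counts 1, 5, 10, 10, 5, 1 (out of 32)
for k in range(n_coins + 1):
    expected = comb(n_coins, k) / 2 ** n_coins
    print(f"{k}H{n_coins - k}T: observed {counts[k] / n_trials:.3f}, expected {expected:.3f}")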

  33. Shannon entropy • Shannon entropy is related to physical entropy. • Shannon asked the question “What is information?” • Energy is defined as the capacity to do work, not the work itself; work is a form of energy. • Similarly, define information as the capacity to store and transmit meaning or knowledge, not the meaning or knowledge itself. • For example, there is a lot of information on the WWW, but that does not mean it is knowledge. • Shannon suggested that entropy is the measure of this capacity. Summary: information is defined as the capacity to store and transmit knowledge, and entropy is its measure – the Shannon entropy. Entropy ~ randomness ~ measure of the capacity to store and transmit knowledge. Reference: Gatlin L.L., Information Theory and the Living System, Columbia University Press, New York, 1972.

  34. Shannon entropy • How do we relate randomness to a measure of this capacity? Consider the coin-toss microstates again (5H0T: 1, 4H1T: 5, 3H2T: 10, 2H3T: 10, 1H4T: 5, 0H5T: 1). Physical entropy is S = k ln W. Assuming equal probability pi = 1/W for each individual microstate, S = -k ln pi. Information ~ 1/pi = W. If pi = 1, there is no information, because the outcome is certain. If pi << 1, there is more information; that is, information corresponds to a decrease in certainty.

  35. Distances in discretized space • Sometimes it is advantageous to use a discretized expression matrix as the starting point, e.g. to assign the values 0 (expression unchanged), 1 (expression increased) and -1 (expression decreased). • The similarity between two discretized vectors can be measured using the notion of Shannon entropy. • The Shannon entropy H1 is defined in terms of pi, the probability of observing a particular symbol or event i within a given sequence (logarithms are taken to base 2). Consider a binary system: an element X has two states, 0 or 1. Claude Shannon is the father of information theory. • H1 measures the “uncertainty” of a probability distribution – it is the expectation (average) value of the information. References: 1. http://www.cs.unm.edu/~patrik/networks/PSB99/genetutorial.pdf 2. http://www.smi.stanford.edu/projects/helix/psb98/liang.pdf 3. plus.maths.org/issue23/features/data/
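
The defining formula is not reproduced in the transcript; the standard base-2 definition over n symbols is:

H_1 = -\sum_{i=1}^{n} p_i \log_2 p_i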

  36. Shannon Entropy H1 is zero when one outcome is certain (no information) and takes its maximal value for a uniform probability distribution. DNA sequences: n = 4 states, so the maximum is H1 = -4 × (1/4) × log2(1/4) = 2 bits. Protein sequences: n = 20 states, so the maximum is H1 = -20 × (1/20) × log2(1/20) = 4.322 bits, which is between 4 and 5 bits.

  37. The Divergence from equi-probability • When all letters are equi-probable, pi = 1/n and H1 = log2(n), the maximum value H1 can take; define Hmax1 = log2(n). • Define the divergence from this equi-probable state as D1 = Hmax1 - H1. • D1 tells us how much of the total divergence from the maximum entropy state is due to the divergence of the base composition from a uniform distribution. For example, the E. coli genome has no divergence from equi-probability because H1Ec = 2 bits, but for the M. lysodeikticus genome H1Ml = 1.87 bits, so D1 = 2.00 – 1.87 = 0.13 bit. • D1 is based on single-letter events only, and so contains no information about how these letters are arranged in a linear sequence; the divergence from independence is treated next.

  38. Divergence from independence – Conditional Entropy Question: does the occurrence of any one base along the DNA sequence alter the probability of occurrence of the base next to it? • What are the numerical values of the conditional probabilities? • p(X|Y) = probability of event X conditioned on event Y, e.g. p(A|A), p(T|A), p(C|A), p(G|A), etc. • If the bases were independent, p(A|A) = p(A), p(T|A) = p(T), and so on. • Extreme ordering case, an equi-probable sequence AAAA…TTTT…CCCC…GGGG…: p(A|A) is very high, p(T|A) is very low, p(C|A) = 0, p(G|A) = 0. • Another extreme case, ATCGATCGATCG…: here p(T|A) = p(C|T) = p(G|C) = p(A|G) = 1, and all others are 0. • Equi-probable state ≠ independent events.

  39. Divergence from independence – Conditional Entropy • Consider the space of DNA dimers (nearest neighbors), S2 = {AA, AT, …, TT}. • Entropy of S2: H2 = -[p(AA) log p(AA) + p(AT) log p(AT) + … + p(TT) log p(TT)]. • If the single-letter events are independent, p(X|Y) = p(X), and the dimer probabilities factorize: p(AA) = p(A)p(A), p(AT) = p(A)p(T), and so on. • If the dimers are not independent, p(XY) = p(X)p(Y|X), e.g. p(AA) = p(A)p(A|A), p(AT) = p(A)p(T|A), etc. • Let HInp2 be the dimer entropy computed under complete independence. • The divergence from independence is D2 = HInp2 – H2, and D1 + D2 is the total divergence from the maximum entropy state.

  40. Divergence from independence – Conditional Entropy • Calculate D1 and D2 for M. phlei DNA, where p(A) = 0.164, p(T) = 0.162, p(C) = 0.337, p(G) = 0.337. • H1 = -(0.164 log2 0.164 + 0.162 log2 0.162 + …) = 1.910 bits, so D1 = 2.000 – 1.910 = 0.090 bit. • Using the dimer frequencies (see the Excel file), D2 = HInp2 – H2 = 3.8216 – 3.7943 = 0.0273 bit. • Total divergence: D1 + D2 = 0.090 + 0.0273 = 0.1173 bit.
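
A minimal sketch of the D1 calculation (and the setup for D2) in Python, using the M. phlei base composition from the slide; the observed dimer probabilities are not reproduced here, since they live in the Excel file referenced above:

import numpy as np

def entropy(probs):
    """Shannon entropy in bits of a probability distribution."""
    p = np.asarray(probs, float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# M. phlei single-base composition (from the slide)
base_p = {"A": 0.164, "T": 0.162, "C": 0.337, "G": 0.337}

H1 = entropy(list(base_p.values()))
D1 = 2.0 - H1                                # Hmax1 = log2(4) = 2 bits
print(f"H1 = {H1:.3f} bits, D1 = {D1:.3f} bit")

# D2 also needs the 16 observed dimer probabilities p(XY) from the Excel file.
# HInp2 is the dimer entropy under independence, i.e. the entropy of p(X)p(Y).
p_indep = [base_p[x] * base_p[y] for x in "ATCG" for y in "ATCG"]
HInp2 = entropy(p_indep)
# dimer_p = [...observed p(XY) values...]    # placeholder: would come from the Excel file
# D2 = HInp2 - entropy(dimer_p)
print(f"HInp2 = {HInp2:.4f} bits")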

  41. Divergence from independence – Conditional Entropy • H can be used to compare different sequences and to establish relationships. (To convert between bases: if Y = 2^x, then log10 Y = x log10 2, so x = log10 Y / log10 2.) • Given the knowledge of one sequence, say X, can we estimate the uncertainty of Y relative to X? • The conditional entropies H(X|Y) and H(Y|X) express the uncertainty relative to known information. • The joint entropy decomposes as H(X,Y) = H(Y|X) + H(X) = H(X|Y) + H(Y): the uncertainty of Y given knowledge of X plus the uncertainty of X sums to the joint entropy of X and Y. • For the example on the next slide, H(Y|X) = H(X,Y) – H(X) = 1.85 – 0.97 = 0.88 bit.

  42. Shannon Entropy – Mutual Information Joint entropy H(X,Y) is defined in terms of pij, the joint probability of finding xi and yj. • Example joint probabilities for (X,Y): p00 = 0.1, p01 = 0.3, p10 = 0.4, p11 = 0.2. Mutual information M(X,Y) • The mutual information is the information shared by X and Y, and can be used as a similarity measure between X and Y. • H(X,Y) = H(X) + H(Y) – M(X,Y), just as in set theory A∪B = A + B – (A∩B). • M(X,Y) = H(X) + H(Y) - H(X,Y) = H(X) – H(X|Y) = H(Y) – H(Y|X) = 1.00 – 0.88 = 0.12 bit; equivalently M(X,Y) = 0.97 + 1.00 – 1.85 = 0.12 bit. • A small M(X,Y) indicates that X and Y are close to independent, p(X,Y) ≈ p(X)p(Y); a large M(X,Y) indicates that X and Y are associated.
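
A short sketch reproducing these numbers in Python from the joint distribution given on the slide:

import numpy as np

# Joint probability table p(X, Y) from the slide: rows index X, columns index Y
p_xy = np.array([[0.1, 0.3],
                 [0.4, 0.2]])

def H(p):
    """Shannon entropy in bits; zero-probability cells are ignored."""
    p = np.asarray(p, float).ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

Hx = H(p_xy.sum(axis=1))          # marginal entropy H(X), about 0.97 bit
Hy = H(p_xy.sum(axis=0))          # marginal entropy H(Y), 1.00 bit
Hxy = H(p_xy)                     # joint entropy H(X,Y), about 1.85 bits
M = Hx + Hy - Hxy                 # mutual information, about 0.12 bit
H_y_given_x = Hxy - Hx            # conditional entropy H(Y|X), about 0.88 bit
print(f"H(X)={Hx:.2f}  H(Y)={Hy:.2f}  H(X,Y)={Hxy:.2f}  H(Y|X)={H_y_given_x:.2f}  M={M:.2f}")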

  43. Shannon Entropy – Conditional Entropy The conditional entropy H(Y|X) is built from the conditional probabilities p(Y|X=x) for a particular x and then averaged over all x’s. From the joint table above: p(Y=0|X=0) = 1/4, p(Y=1|X=0) = 3/4, p(Y=0|X=1) = 4/6, p(Y=1|X=1) = 2/6.

  44. Statistical Analysis of Microarray Data

  45. Statistical Analysis of Microarray Data • Normalize each channel separately: Gn - <G> and Rn - <R>. • Subtraction of the mean log fluorescence intensity for the channel from each value transforms the measurements such that the abundance of each transcript is represented as a fold increase or decrease relative to the sample mean, namely as a relative fluorescence intensity. • That is, compute log Gn - <log Gn> and log Rn - <log Rn>, where n = 1, 2, ….

  46. Central Limit Theorem • Consider the following set of measurements for a given population: 55.20, 18.06, 28.16, 44.14, 61.61, 4.88, 180.29, 399.11, 97.47, 56.89, 271.95, 365.29, 807.80, 9.98, 82.73. The population mean is 165.570. • Now consider two samples from this population. • These two samples could have means very different from each other and also very different from the true population mean. • What happens if we consider, not only two samples, but all possible samples of the same size? • The answer to this question is one of the most fascinating facts in statistics – the central limit theorem. • It turns out that if we calculate the mean of each sample, those mean values tend to be distributed as a normal distribution, independently of the original distribution. The mean of this new distribution of the means is exactly the mean of the original population, and the variance of the new distribution is reduced by a factor equal to the sample size n.

  47. Central Limit Theorem • When sampling from a population with mean μ and variance σ2, the distribution of the sample mean (the sampling distribution of X̄) has the following properties: • The distribution of X̄ will be approximately normal; the larger the sample, the more the sampling distribution resembles the normal distribution. • The mean of the distribution of X̄ will be equal to μ, the mean of the population from which the samples were drawn. • The variance of the distribution of X̄ will be equal to σ2/n, the variance of the original population divided by the sample size; the quantity σ/√n is called the standard error of the mean. http://cnx.org/content/m11131/latest/ http://www.riskglossary.com/link/central_limit_theorem.htm http://www.indiana.edu/~jkkteach/P553/goals.html
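
A minimal sketch of this in Python, resampling (with replacement) from the 15 population values given on the previous slide:

import numpy as np

population = np.array([55.20, 18.06, 28.16, 44.14, 61.61, 4.88, 180.29, 399.11,
                       97.47, 56.89, 271.95, 365.29, 807.80, 9.98, 82.73])
rng = np.random.default_rng(0)

n = 5                                     # sample size
sample_means = rng.choice(population, size=(20_000, n), replace=True).mean(axis=1)

print(f"population mean       = {population.mean():.3f}")      # 165.570
print(f"mean of sample means  = {sample_means.mean():.3f}")     # close to 165.570
print(f"population var / n    = {population.var() / n:.1f}")
print(f"variance of the means = {sample_means.var():.1f}")      # close to sigma^2 / n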

  48. Statistical hypothesis testing • The expression level of a gene in a given condition is measured several times and a mean x of these measurements is calculated. From many previous experiments, it is known that the mean expression level of the given gene in normal conditions is m. How can you decide which genes are significantly regulated in a microarray experiment? For instance, one can apply an arbitrary cutoff such as a threshold of at least twofold up- or down-regulation. Alternatively, one can formulate the following hypotheses: • The gene is up-regulated in the condition under study: x > m • The gene is down-regulated in the condition under study: x < m • The gene is unchanged in the condition under study: x = m • Something has gone awry during the lab experiments and the gene's measurements are completely off; the mean of the measurements may be higher or lower than normal: x ≠ m.

  49. Statistical hypothesis testing When a hypothesis test is viewed as a decision procedure, two types of error are possible, depending on which hypothesis, H0 or H1, is actually true. If a test rejects H0 (and accepts H1) when H0 is true, it is called a type I error. If a test fails to reject H0 when H1 is true, it is called a type II error. The table of the different decisions (shown on the slide) is: reject H0 when H0 is true – type I error; reject H0 when H1 is true – correct decision; fail to reject H0 when H0 is true – correct decision; fail to reject H0 when H1 is true – type II error.

  50. Statistical hypothesis testing • The next step is to formalize the two hypotheses. The two hypotheses must be mutually exclusive and all-inclusive. • Mutually exclusive – the two hypotheses cannot both be true at the same time. • All-inclusive means that their union has to cover all possibilities. • Expression ratios are converted into probability values to test the hypothesis that particular genes are significantly regulated. • The null hypothesis H0 is that there is no difference in signal intensity across the conditions being tested. • The other hypothesis (called the alternative or research hypothesis) is named H1. If we believe that the gene is up-regulated, the research hypothesis is H1: x > m. The null hypothesis has to be mutually exclusive with H1 and has to include all other possibilities, therefore the null hypothesis is H0: x ≦ m. • One assigns a p-value for testing the hypothesis. The p-value is the probability of obtaining, just by chance, a measurement at least as extreme as the one observed, assuming the null hypothesis is true. • The probability of rejecting the null hypothesis when it is true is the significance level α, which is typically set at p < 0.05; in other words, we accept that in 1 of 20 cases our conclusion may be wrong.
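
A minimal sketch of such a test in Python, assuming hypothetical replicate measurements for one gene and a known normal-condition mean m; scipy's one-sample t-test is used here as a stand-in for whichever test the course actually applies:

import numpy as np
from scipy import stats

# Hypothetical log2 expression measurements of one gene in the condition under study
measurements = np.array([1.35, 1.10, 1.52, 1.28, 1.44])
m = 1.0                                  # known mean expression level in normal conditions

# Two-sided one-sample t-test of H0: mean == m against H1: mean != m
t_stat, p_value = stats.ttest_1samp(measurements, popmean=m)

alpha = 0.05
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < alpha:
    print("Reject H0: the gene appears significantly regulated at the 5% level.")
else:
    print("Fail to reject H0: no significant change detected.")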
