
Significance testing Lorenz Wernisch

Presentation Transcript


  1. Significance testing (Lorenz Wernisch)

  2. Compare target score with rest of scores
     • Histogram of the scores of unrelated (random) database hits (x-axis: scores, y-axis: number)
     • The target score 330 lies way outside the heap of random scores

  3. Fit a Normal (Gaussian) distribution
     • Fitted parameters: mean μ = -47.1, standard deviation σ = 20.8
     • Target score s = 330

  4. p-value for Normal distribution
     • The red area is the probability that a random N(-47.1, 20.8) distributed variable has score > 0: Pr[s > 0] = 0.0117
     • For the target score 330:
       > 1-pnorm(330,-47.1,20.8)
       1.593339e-73
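The tail probability on this slide can be reproduced with a short sketch. It uses the rounded parameters printed on the slides, so the value for score 330 will not match the slide's 1.593339e-73 exactly; presumably that figure came from the unrounded fit. Note also that in current R the expression `1 - pnorm(...)` underflows to 0 this far out in the tail, so the sketch asks for the upper tail directly with `lower.tail = FALSE`.

```r
# Parameters as rounded on the slides (the unrounded fit is not given)
mu <- -47.1
sigma <- 20.8
score <- 330

# Probability that a random score exceeds 0
p_zero <- pnorm(0, mu, sigma, lower.tail = FALSE)

# Naive upper tail: pnorm() returns exactly 1 here, so the difference is 0
p_naive <- 1 - pnorm(score, mu, sigma)

# Upper tail computed directly, which keeps the tiny probability
p_tail <- pnorm(score, mu, sigma, lower.tail = FALSE)

print(c(p_zero = p_zero, p_naive = p_naive, p_tail = p_tail))
```

Whenever a score lies many standard deviations from the mean, `lower.tail = FALSE` is the safer way to get a p-value.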

  5. More distributions
     • Many more distribution functions are available for fitting:
       • Gamma distribution
       • Extreme value distribution
       • Chi-square distribution
       • t distribution
     • Some software packages define hundreds of them

  6. The Gamma function
     • The Gamma function Γ is the continuation of the factorial n! to real numbers: $\Gamma(n+1) = n!$
     • It is used in many distribution functions
     • Moreover, (formula not preserved in the transcript)
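The factorial property can be checked numerically. This is a minimal sketch assuming the standard integral definition $\Gamma(x) = \int_0^\infty t^{x-1} e^{-t}\,dt$, which the transcript itself does not show; `gamma()`, `factorial()` and `integrate()` are base R functions.

```r
# Gamma function at integers reproduces the factorial: gamma(n + 1) == n!
n <- 0:10
print(all.equal(gamma(n + 1), factorial(n)))

# Integral definition (assumed standard form) at a non-integer argument
x <- 2.5
integral <- integrate(function(t) t^(x - 1) * exp(-t), lower = 0, upper = Inf)$value
print(c(integral = integral, gamma = gamma(x)))
```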

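  7. Gamma distribution
     • density function (pdf), expectation and variance (formulas not preserved in the transcript)
     • R functions: pgamma(x, alpha, scale = 1/lambda), dgamma(x, alpha, scale = 1/lambda), rgamma(n, alpha, scale = 1/lambda)
     • (figure: Gamma density for α = 3, λ = 1/4)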
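A hedged sketch of the R calls for this slide. It assumes that λ is a rate parameter, which is consistent with the later slide stating that the χ² distribution is a Gamma distribution with α = n/2 and λ = 1/2. Under that assumption the standard moments are expectation α/λ and variance α/λ²; the transcript does not preserve the slide's own formulas, so treat these as the textbook values rather than the author's wording.

```r
alpha <- 3        # shape, as in the slide's figure
lambda <- 1 / 4   # rate (assumed interpretation of the slide's lambda)

# Naming the argument avoids confusing rate and scale (scale = 1 / rate)
p_rate  <- pgamma(10, shape = alpha, rate = lambda)
p_scale <- pgamma(10, shape = alpha, scale = 1 / lambda)

# Numerical first and second moments from the density
m1 <- integrate(function(x) x * dgamma(x, shape = alpha, rate = lambda), 0, Inf)$value
m2 <- integrate(function(x) x^2 * dgamma(x, shape = alpha, rate = lambda), 0, Inf)$value

# Compare with alpha / lambda and alpha / lambda^2
print(c(mean = m1, variance = m2 - m1^2))
```

In current R the third positional argument of `pgamma`, `dgamma` and `rgamma` is the rate, so an unnamed `1/lambda` in that position would be read as a rate; naming `rate` or `scale` removes the ambiguity.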

  8. Shape parameter of Gamma distribution (figure: densities for α = 1/5, 1/3, 1, 3, 5)

  9. Gamma distribution and Poisson process
     • The gamma distribution arises as a limiting process: many trials n spread over the unit interval [0,1], each with a tiny probability p of throwing a 1, so that λ = np
     • How long does it take until the third 1? The waiting time is Gamma distributed with α = 3
     • 0.11% chance of seeing three 1s in the indicated region (figure: unit interval from 0 to 1 with a marked region X)
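The link between the waiting time and the count of 1s can be sketched as follows. The rate `lambda` and the region end `x` are hypothetical values chosen for illustration; the slide does not state them, so the sketch does not try to reproduce the 0.11% figure.

```r
lambda <- 4   # hypothetical rate lambda = n * p (not given on the slide)
x <- 0.5      # hypothetical end of the region [0, x]

# Waiting time until the third 1 is Gamma(shape = 3, rate = lambda)
p_gamma <- pgamma(x, shape = 3, rate = lambda)

# Same event: at least three 1s fall in [0, x] under a Poisson process
p_pois <- 1 - ppois(2, lambda * x)

# Finite version: n trials spread over [0, 1], each a 1 with probability p
n <- 1e6
p <- lambda / n
p_binom <- 1 - pbinom(2, size = n * x, prob = p)

print(c(gamma = p_gamma, poisson = p_pois, binomial = p_binom))
```

The three numbers agree because "the third 1 arrives before x" and "at least three 1s fall in [0, x]" describe the same event.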

  10. Extreme value distribution
     • cumulative distribution function (cdf) and probability density function (pdf) (formulas not preserved in the transcript)
     • No simple form for expectation and variance
     • (figure: Extreme value density for μ = 3, σ = 4)
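Base R has no extreme value functions, so the `pexval` used later in the slides must be defined by the user. The sketch below assumes the Gumbel (maximum) form, $F(x) = \exp(-\exp(-(x-\mu)/\sigma))$, which reproduces the values 0.84 and 0.00024 quoted on later slides. The names `dexval` and `rexval` are hypothetical helpers following R's d/p/r naming scheme, not functions from the slides.

```r
# Gumbel (maximum) cdf, assumed form: F(x) = exp(-exp(-(x - m) / s))
pexval <- function(x, m, s) exp(-exp(-(x - m) / s))

# Density: derivative of the cdf with respect to x
dexval <- function(x, m, s) {
  z <- (x - m) / s
  exp(-z - exp(-z)) / s
}

# Random numbers by inverting the cdf (hypothetical helper)
rexval <- function(n, m, s) m - s * log(-log(runif(n)))

print(pexval(10, 3, 4))

# The density should integrate to 1
total <- integrate(dexval, -Inf, Inf, m = 3, s = 4)$value
print(total)
```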

  11. Examples for the Extreme Value Distribution (EVD)
     • 1) An example from meteorology:
       • Wind speed is measured daily at noon: it follows a Normal distribution around the average wind speed
       • The monthly maximum wind speed is recorded as well
       • The monthly maximum wind speed does not follow a normal distribution; it follows an EVD
     • 2) Scores of sequence alignments (local alignments) often follow an EVD

  12. Scores for local alignment of DNA sequences
     • 5000 scores with mean μ = 42.29 and standard deviation σ = 7.62
     • The Normal distribution N(42.29, 7.62) does not fit!

  13. p-value for EVD
     • Probability of seeing a value higher than 10?
     • Get it from the cumulative distribution function (cdf), here for the extreme value distribution with μ = 3, σ = 4:
       1-pexval(10,3,4)
     • The cdf at 10 is 0.84, so the p-value is 1 - 0.84 = 0.16

  14. Extreme value fits much better
     • Fitted parameters: EVD μ = 38.80, σ = 6.14; Normal μ = 42.29, σ = 7.62
     • p-value for score 90: EVD 0.00024, Normal 1.9e-10
     • The Normal p-value is misleadingly small compared to the EVD p-value
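The two p-values for score 90 can be recomputed from the fitted parameters. This is a sketch that again assumes the Gumbel (maximum) form for the slides' `pexval`, since the transcript does not show its definition.

```r
# Assumed Gumbel (maximum) cdf for the slides' pexval
pexval <- function(x, m, s) exp(-exp(-(x - m) / s))

score <- 90
p_evd  <- 1 - pexval(score, 38.80, 6.14)
p_norm <- pnorm(score, 42.29, 7.62, lower.tail = FALSE)

print(c(EVD = p_evd, Normal = p_norm, ratio = p_evd / p_norm))
```

The ratio shows how strongly the Normal fit overstates the significance of a high score when the background scores actually follow an EVD.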

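  15. χ² distribution
     • For independent standard normal random variables X_i ~ N(0,1), i = 1, …, n, the variable $Y = X_1^2 + \dots + X_n^2$ (note: squared!) has a χ²_n distribution with n degrees of freedom
     • density, expectation and variance (formulas not preserved in the transcript)
     • R functions: pchisq(x,n), dchisq(x,n), rchisq(num,n)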

  16. Shape of χ² distribution (figure: densities for n = 1, 2, 4, 6, 10)
     • The χ² distribution with n degrees of freedom is actually a Gamma distribution with α = n/2 and λ = 1/2
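Both statements about the χ² distribution can be checked in a few lines: that it coincides with a Gamma distribution with shape n/2 and rate 1/2, and that it describes a sum of n squared standard normals. The sketch uses simulated data with an arbitrary seed; the theoretical mean n and variance 2n quoted in the comment are standard results, not taken from the transcript.

```r
n <- 4
x <- seq(0.5, 20, by = 0.5)

# chi-square with n degrees of freedom versus Gamma(shape = n/2, rate = 1/2)
same_cdf <- all.equal(pchisq(x, n), pgamma(x, shape = n / 2, rate = 1 / 2))
same_pdf <- all.equal(dchisq(x, n), dgamma(x, shape = n / 2, rate = 1 / 2))
print(c(same_cdf, same_pdf))

# Simulation: sum of n squared standard normals
set.seed(1)
y <- rowSums(matrix(rnorm(1e5 * n), ncol = n)^2)
print(c(mean = mean(y), variance = var(y)))  # theory: n and 2n
```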

  17. t distribution
     • If Z ~ N(0,1) is independent of U ~ χ²_n, then $T = Z / \sqrt{U/n}$ has a t distribution with n degrees of freedom
     • density (formula not preserved in the transcript)
     • R functions: pt(x,n), dt(x,n), rt(num,n)

  18. Shape of t distribution (figure: densities for n = 1, 3, 5, 10 compared with N(0,1))
     • Approaches the normal N(0,1) distribution for large n (n > 20 or 30)
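How quickly the t distribution approaches N(0,1) can be made concrete by measuring the largest gap between the two cdfs. A minimal sketch; the grid and the list of degrees of freedom are arbitrary choices for illustration.

```r
x <- seq(-6, 6, by = 0.01)
df <- c(1, 3, 5, 10, 30, 1000)

# Largest absolute gap between the t cdf and the standard normal cdf
gap <- sapply(df, function(n) max(abs(pt(x, n) - pnorm(x))))
names(gap) <- df
print(signif(gap, 3))
```

The gap shrinks steadily as the degrees of freedom grow, which is the behaviour the slide describes.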

  19. Define scalable t distribution
     • The functions for the t distribution in R accept only two arguments: x, the data vector, and n, the degrees of freedom, e.g. pt(x,n)
     • Functions accepting a location parameter m and a scaling parameter s:
       ptt <- function(x,m,s,n) pt((x-m)/s,n)
       dtt <- function(x,m,s,n) dt((x-m)/s,n)/s
       rtt <- function(sz,m,s,n) rt(sz,n)*s + m

  20. Goodness of fit
     • So many possible distributions to fit! Which one is the best?
     • Assessing goodness of fit by
       • eye (very reliable!)
       • Kolmogorov-Smirnov test
       • Shapiro-Wilk test of normality

  21. Assessment of fit by eye: histogram
     • 200 data points seem to follow a normal distribution with μ = -0.017, σ = 1.45
     • But something is not quite right

  22. Sample cumulative distribution function
     • At each sample point the sample cdf rises by 1/n (n is the number of points)
     • Example: uniformly distributed points
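The sample cdf is available in base R as `ecdf()`. The sketch below uses 20 simulated uniform points (an arbitrary choice, with an arbitrary seed) and confirms that the step height at every sample point is 1/n.

```r
set.seed(1)
x <- runif(20)      # uniformly distributed points, as in the slide's example
Fn <- ecdf(x)       # step function that rises by 1/n at each sample point

n <- length(x)
xs <- sort(x)
steps <- diff(c(0, Fn(xs)))   # jump height at each sorted sample point
print(range(steps))

# Largest vertical distance to the uniform cdf on [0, 1] at the sample points
print(max(abs(Fn(xs) - punif(xs))))
```

Calling `plot(Fn)` draws the staircase, and `curve(punif(x), add = TRUE)` overlays the theoretical cdf for a comparison by eye.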

  23. Assessment of fit by eye: CDFs
     • The fitted Normal distribution is too wide, probably an effect induced by too many outliers
     • Would a t distribution fit better?

  24. t distribution fits better: histogram (figure: t and Normal densities over the histogram)
     • Fitted t distribution with μ = -0.046, σ = 1.12, n = 4.77
     • The real data were generated from t(0,1,3)

  25. t distribution fits better: CDFs

  26. Formal tests for goodness of fit
     • Formal tests compare a data set with a suggested distribution and produce a p-value
     • If the p-value is small (< 0.05 or < 0.01), it is unlikely that the distribution really fits the data
     • If the p-value is intermediate (say 0.1 < p < 0.7), there is no strong reason to reject a fit of the distribution, but there might be better ones
     • If the p-value is high (> 0.7), one might be more confident that the distribution is the right one

  27. Kolmogorov-Smirnov test
     • Measures the largest difference D between the theoretical and the empirical cdf
     • If there are more than 80 data points, there is a simple rule for the KS test based on D (rule not preserved in the transcript)
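The statistic D can be computed by hand and compared with what `ks.test` reports. This is a sketch on simulated standard normal data, not the data set from the slides.

```r
set.seed(1)
x <- rnorm(100)              # simulated data, not the slides' data set
n <- length(x)
Fx <- pnorm(sort(x))         # theoretical cdf at the sorted data points

# The empirical cdf jumps from (i-1)/n to i/n at the i-th sorted point,
# so the distance is checked on both sides of each jump
D_manual <- max(c((1:n) / n - Fx, Fx - (0:(n - 1)) / n))

D_ks <- unname(ks.test(x, pnorm)$statistic)
print(c(manual = D_manual, ks.test = D_ks))
```

Checking both sides of each jump matters because the largest gap can occur just before a step as well as just after it.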

  28. Kolmogorov-Smirnov test of goodness of fit in R
     • Normal distribution:
       ks.test(x,"pnorm",-0.017,1.45)
       Result: p-value = 0.6226
     • p-value: if the data came from this normal distribution, there would be a 62.3% chance of seeing differences at least this large between the cdfs of the data and of the normal distribution
     • t distribution:
       ks.test(x,"ptt",-0.046,1.12,4.77)
       Result: p-value = 0.9951
     • p-value: if the data came from this t distribution, there would be a 99.5% chance of seeing differences at least this large between the cdfs of the data and of the t distribution!
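A self-contained version of this comparison might look as follows. It is a sketch on simulated data: the sample is drawn from the scalable t distribution t(0,1,3) with an arbitrary seed, so the p-values will differ from the ones on the slide.

```r
# Scalable t cdf and random generator, as defined on the earlier slide
ptt <- function(x, m, s, n) pt((x - m) / s, n)
rtt <- function(sz, m, s, n) rt(sz, n) * s + m

set.seed(1)
x <- rtt(200, 0, 1, 3)       # simulated stand-in for the slides' data

# Passing the function object avoids any name lookup of "ptt"
fit_norm <- ks.test(x, pnorm, mean(x), sd(x))
fit_t    <- ks.test(x, ptt, 0, 1, 3)

print(c(normal = fit_norm$p.value, t = fit_t$p.value))
```

One caveat: when the parameters are estimated from the same data that is being tested, as for the normal fit here, the KS p-value tends to be too optimistic, so it is better read as a rough guide than as an exact probability.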

  29. Shapiro-Wilk normality test in R
     • shapiro.test(x)
       Result: p-value = 0.001027
     • Almost no chance that the data come from a normal distribution!
     • For comparison:
       # generate 200 data points from N(0,1.5)
       x <- rnorm(200,0,1.5)
       shapiro.test(x)
       Result: p-value = 0.2067
     • Even normal data get a low p-value!
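Why truly normal data can still produce a fairly low p-value becomes clearer when the test is repeated many times. The sketch below is a simulation with an arbitrary seed and 1000 repetitions; under the null hypothesis the p-values are spread roughly evenly over [0, 1], so a value near 0.2 is not unusual.

```r
set.seed(1)

# p-values of the Shapiro-Wilk test on 1000 truly normal samples of size 200
p <- replicate(1000, shapiro.test(rnorm(200, 0, 1.5))$p.value)

# Under the null hypothesis the p-values are roughly uniform on [0, 1]
frac_05 <- mean(p < 0.05)   # share of normal samples rejected at the 5% level
frac_21 <- mean(p < 0.21)   # share with a p-value below about 0.21
print(c(below_0.05 = frac_05, below_0.21 = frac_21))
```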

  30. Conclusions
     • Problem: separate interesting, significant signals (scores) from statistical background noise
     • Solution: fit a distribution to the data and calculate the p-value:
       • Fitting by the Maximum Likelihood method
       • Assessing fit by Kolmogorov-Smirnov
       • p-value from the cumulative distribution function
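The whole recipe can be sketched end to end for the extreme value case. The transcript does not show how the author performed the Maximum Likelihood fit, so this is one possible approach using base R's `optim`. It assumes the Gumbel (maximum) form of the EVD and uses simulated scores whose true parameters are borrowed from the fitted values on the earlier slide.

```r
set.seed(1)

# Simulated EVD scores (inverse-cdf sampling); parameters are illustrative
m_true <- 38.80
s_true <- 6.14
x <- m_true - s_true * log(-log(runif(5000)))

# Negative log-likelihood of the Gumbel density; log(s) keeps s positive
nll <- function(par) {
  m <- par[1]
  s <- exp(par[2])
  z <- (x - m) / s
  sum(log(s) + z + exp(-z))
}

# Start from the sample mean and standard deviation
fit <- optim(c(mean(x), log(sd(x))), nll)
m_hat <- fit$par[1]
s_hat <- exp(fit$par[2])

# p-value of a score of 90 from the fitted cdf
p_value <- 1 - exp(-exp(-(90 - m_hat) / s_hat))
print(c(m = m_hat, s = s_hat, p = p_value))
```

Optimising over log(s) instead of s is a design choice that keeps the scale positive without needing a constrained optimiser.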
