
Machine Learning and Bioinformatics 機器學習與生物資訊學






Presentation Transcript


  1. Machine Learning and Bioinformatics 機器學習與生物資訊學 Machine Learning & Bioinformatics

  2. Statistics Machine Learning and Bioinformatics

  3. Statistical test • In statistics, a result is called statistically significant if it is unlikely to have occurred by chance alone • A test determines what outcomes of an experiment would lead to a rejection of the null hypothesis, helping to decide whether experimental results contain enough information to cast doubt on conventional wisdom • It answers the question • assuming that the null hypothesis is true, what is the probability of observing a value for the test statistic that is at least as extreme as the one actually observed? • that probability is known as the P-value Machine Learning and Bioinformatics

  4. Similar to a criminal trial • A defendant is considered not guilty as long as his or her guilt is not proven • The prosecutor tries to prove the guilt of the defendant • Only when there is enough incriminating evidence is the defendant convicted • At the start of the procedure, there are two hypotheses • H0: “the defendant is not guilty” • H1: “the defendant is guilty” • The first one is called the null hypothesis and is accepted for the time being • The second one is called the alternative hypothesis; it is the hypothesis one hopes to support Machine Learning and Bioinformatics

  5. The hypothesis of innocence is only rejected when an error is very unlikely, because one does not want to convict an innocent defendant • Such an error is called an error of the first kind (i.e. the conviction of an innocent person), and the occurrence of this error is controlled to be rare • As a consequence of this asymmetric behavior, the error of the second kind (acquitting a person who committed the crime) is often rather large Machine Learning and Bioinformatics

  6. Philosopher’s beans • Few beans of this handful are white. Most beans in this bag are white. • Therefore, probably, these beans were taken from another bag. • this is a hypothetical inference • Terminology • the beans in the bag are the population • the handful is the sample • the null hypothesis is that the sample originated from the population Machine Learning and Bioinformatics

  7. The criterion for rejecting the null hypothesis is the “obvious” difference in appearance (an informal difference in the mean) • Again, assuming that the null hypothesis is true, what is the probability of observing a difference that is at least as extreme as the one actually observed? • To be a real statistical hypothesis test, this example requires the formalities of a probability calculation and a comparison of that probability to a standard Machine Learning and Bioinformatics

  8. Clairvoyant card game • A person (the subject) is tested for clairvoyance. He is shown the reverse of a randomly chosen playing card 25 times and asked which of the four suits it belongs to. • The number of hits, or correct answers, is called X. • As we try to find evidence of his clairvoyance • the null hypothesis is that the person is not clairvoyant • the alternative is, of course, that the person is (more or less) clairvoyant Machine Learning and Bioinformatics

  9. If the null hypothesis is valid, the only thing the test subject can do is guess • for every card, the probability (relative frequency) of any single suit appearing is ¼ • If the alternative is valid, the test subject will predict the suit correctly with probability greater than ¼ • Suppose that the probability of predicting correctly is p; the hypotheses then are • null hypothesis (H0): p = ¼ (just guessing) • alternative hypothesis (H1): p > ¼ (truly clairvoyant) Machine Learning and Bioinformatics

  10. What’s the decision? • When the test subject correctly predicts all 25 cards, we will consider him clairvoyant and reject the null hypothesis. Thus also with 24 or 23 hits. • With only 5 or 6 hits, on the other hand, there is no cause to consider him so. But what about 12 hits, or 17 hits? • what is the critical number, c, of hits, at which point we consider the subject to be clairvoyant? • how do we determine the critical value c? • It is obvious that with the choice c=25 we are more critical than with c=10 Machine Learning and Bioinformatics

  11. In practice, one decides how critical one will be • one decides how often one is willing to accept an error of the first kind (a false positive, or Type I error) • With c=25 the probability of such an error is very small • Being less critical, with c=10, yields a much greater probability of a false positive • These error probabilities are p-values Machine Learning and Bioinformatics

  12. The probability of Type I error • Before the test is actually performed, the maximum acceptable probability of a Type I error (α) is determined • Depending on this Type I error rate, the critical value c is calculated. For example, if we select an error rate of 1%, c must satisfy P(X ≥ c | H0) ≤ 0.01 • from all the numbers c with this property we choose the smallest, in order to minimize the probability of a Type II error (false negative) • for the above example, we select c=13 Machine Learning and Bioinformatics
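
A minimal sketch of this calculation in Python, assuming scipy is available, using the clairvoyant card game from the previous slides (25 draws, success probability ¼ under H0) and the 1% error rate chosen above:

    from scipy.stats import binom

    n, p0, alpha = 25, 0.25, 0.01      # draws, P(correct | H0), chosen Type I error rate

    # smallest c with P(X >= c | H0) <= alpha; binom.sf(c - 1, n, p0) equals P(X >= c)
    c = next(k for k in range(n + 1) if binom.sf(k - 1, n, p0) <= alpha)

    print(c, binom.sf(c - 1, n, p0))   # c = 13, P(X >= 13 | H0) is roughly 0.003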

  13. P-value vs. α Machine Learning and Bioinformatics

  14. About the figure in the last slide Machine Learning and Bioinformatics

  15. Where does the distribution (the blue curve) come from? Machine Learning and Bioinformatics

  16. You have to choose the right one • the hardest part for many people • but please understand the basics rather than just the practice Machine Learning and Bioinformatics

  17. Normal distribution • A continuous probability distribution, defined on the entire real line, that has a bell-shaped probability density function • known as the Gaussian function: f(x) = (1 / (σ√(2π))) exp(−(x − μ)² / (2σ²)) • μ is the mean or expectation (location of the peak); σ² is the variance; σ is known as the standard deviation • The distribution with μ=0 and σ²=1 is called the standard normal distribution or the unit normal distribution • Normal distribution - Wikipedia, the free encyclopedia Machine Learning and Bioinformatics
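
A small check of the density formula above, assuming numpy and scipy; the values of μ and σ are arbitrary illustrations:

    import numpy as np
    from scipy.stats import norm

    mu, sigma = 2.0, 1.5                        # arbitrary mean and standard deviation
    x = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 9)

    # density computed directly from the formula on the slide
    manual = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

    assert np.allclose(manual, norm.pdf(x, loc=mu, scale=sigma))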

  18. The normal distribution is considered the most prominent probability distribution in statistics • The normal distribution arises from the central limit theorem • under mild conditions, the mean of a large number of random variables independently drawn from the same distribution is distributed approximately normally, irrespective of the form of the original distribution • Very tractable analytically, that is, a large number of results involving this distribution can be derived in explicit form • For these reasons, the normal distribution is commonly encountered in practice • for example, the observational error in an experiment is usually assumed to follow a normal distribution Machine Learning and Bioinformatics

  19. Machine Learning and Bioinformatics http://upload.wikimedia.org/wikipedia/commons/thumb/7/74/Normal_Distribution_PDF.svg/2000px-Normal_Distribution_PDF.svg.png

  20. Machine Learning and Bioinformatics http://upload.wikimedia.org/wikipedia/commons/thumb/c/ca/Normal_Distribution_CDF.svg/2000px-Normal_Distribution_CDF.svg.png

  21. Z-test • Z-test - Wikipedia, the free encyclopedia • Any test whose statistic has a distribution under the null hypothesis that can be approximated by a normal distribution • Because of the central limit theorem, many test statistics are approximately normally distributed for large samples • Many statistical tests can be conveniently performed as approximate Z-tests if the sample size is large or the population variance is known • if the population variance is unknown (and therefore has to be estimated from the sample itself) and the sample size is not large (n < 30), the Student t-test may be more appropriate Machine Learning and Bioinformatics

  22. If T is a statistic that is approximately normally distributed under the null hypothesis • estimate the expected value θ of T under the null hypothesis • obtain an estimate s of the standard deviation of T • calculate the standard score Z = (T − θ) / s • one-tailed and two-tailed p-values can be calculated as Φ(−|Z|) and 2Φ(−|Z|), respectively • Φ is the standard normal cumulative distribution function Machine Learning and Bioinformatics
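
These steps can be written out directly; a minimal sketch, with T, θ and s supplied by whatever test statistic is being used (the function name is only for illustration):

    from scipy.stats import norm

    def z_test_pvalues(T, theta, s):
        """Standard score plus one- and two-tailed p-values, i.e. Phi(-|Z|) and 2 * Phi(-|Z|)."""
        Z = (T - theta) / s
        one_tailed = norm.cdf(-abs(Z))
        return Z, one_tailed, 2 * one_tailed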

  23. Z-test example • Suppose that in a particular geographic region, the mean and standard deviation of scores on a reading test are 100 and 12 points, respectively. • Our interest is in the scores of 55 students in a particular school who received a mean score of 96 • We can ask whether this mean score is significantly lower than the regional mean • are the students in this school comparable to a simple random sample of 55 students from the region as a whole • or are their scores surprisingly low? Machine Learning and Bioinformatics

  24. The standard error of the mean is σ/√n = 12/√55 ≈ 1.62 • The z-score, which is the distance from the sample mean to the population mean in units of the standard error, is (96 − 100) / 1.62 ≈ −2.47 • Looking up the table of the standard normal distribution, the probability of observing a standard normal value below −2.47 is about 0.0068 • with 99.32% confidence we reject the null hypothesis • If instead of a classroom, we considered a sub-region containing 900 students whose mean score was 99, nearly the same z-score and p-value would be observed Machine Learning and Bioinformatics
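
The same numbers in Python, assuming scipy:

    from math import sqrt
    from scipy.stats import norm

    mu, sigma, n, xbar = 100, 12, 55, 96   # regional mean/sd, school sample size and mean

    se = sigma / sqrt(n)                   # standard error, about 1.62
    z = (xbar - mu) / se                   # about -2.47
    p = norm.cdf(z)                        # one-tailed p-value, about 0.0068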

  25. Hyper-geometric distribution • A discrete probability distribution that describes the probability of k successes in n draws, without replacement, from a finite population of size N containing m successes • A random variable X follows the hyper-geometric distribution if its probability mass function is given by P(X = k) = C(m, k) C(N − m, n − k) / C(N, n) • N is the population size; m is the number of success states in the population; n is the number of draws; k is the number of successes • Hypergeometric distribution - Wikipedia, the free encyclopedia Machine Learning and Bioinformatics
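
A sketch of evaluating this pmf with scipy; note that scipy orders the parameters as (k, population size, success states, draws), and the numbers below are made up for illustration:

    from scipy.stats import hypergeom

    N, m, n = 50, 5, 10                       # population size, successes in population, draws
    k = 2

    p_k = hypergeom.pmf(k, N, m, n)           # P(X = k)
    p_tail = hypergeom.sf(k - 1, N, m, n)     # P(X >= k), the form used later for overlap tests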

  26. http://www.statsref.com/HTML/hypergeom.png

  27. Fisher’s exact test • Used in the analysis of contingency tables • Although in practice it is employed when sample sizes are small, it is valid for all sample sizes • It is called exact because the significance of the deviation from a null hypothesis can be calculated exactly, rather than relying on an approximation that becomes exact in the limit as the sample size grows to infinity • Fisher devised the test due to a boast • try googling ‘lady tasting tea’ • Fisher's exact test - Wikipedia, the free encyclopedia Machine Learning and Bioinformatics

  28. The test is useful for categorical data that result from classifying objects in two different ways • It is used to examine the significance of the association (contingency) between the two kinds of classification • The numbers in the cells of the table follow a hyper-geometric distribution under the null hypothesis of independence Machine Learning and Bioinformatics

  29. Fisher’s exact test example • A sample of teenagers might be divided into • male and female • and those that are and are not currently dieting • Test whether the observed difference of proportions is significant • what is the probability that the 10 dieters would be so unevenly distributed between the women and the men? • if we were to choose 10 of the teenagers at random, what is the probability that 9 of them would be among the 12 women, and only 1 from among the 12 men? Machine Learning and Bioinformatics

  30. The probability follows the hyper-geometric distribution • the exact probability of this particular arrangement of the data • on the null hypothesis of independence, that men and women are equally likely to be dieters • assuming the given marginal totals • We can calculate the exact probability of any arrangement • Fisher showed that to generate a significance level, we need consider only the more extreme cases with the same marginal totals Machine Learning and Bioinformatics
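
A minimal sketch of this example with scipy, using the counts from the previous slide (1 of 12 men and 9 of 12 women dieting):

    from scipy.stats import fisher_exact

    #             dieting  not dieting
    table = [[1, 11],      # men
             [9,  3]]      # women

    odds_ratio, p_value = fisher_exact(table, alternative="two-sided")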

  31. Distribution is “assumed” • Different tests may use the same distribution • One test statistic could be tested under different assumptions Machine Learning and Bioinformatics

  32. Overlap significance • Determine the degree of the overlap • e.g. the intersection size |A∩B|, the Jaccard index |A∩B| / |A∪B|, or the overlap coefficient |A∩B| / min(|A|, |B|) • The above statistics answer the degree but not the confidence of the overlap • Also consider the area outside the two leaves • Can you formulate a statistical test based on the hyper-geometric distribution? Machine Learning and Bioinformatics

  33. Suppose that we are drawing an area as large as the first leaf • What is the probability of obtaining an area with a larger overlap with the second leaf by chance? • N is the size of the entire area • Notice that the p-value answers the confidence when we claim that these two leaves overlap, but not the degree of the overlap Machine Learning and Bioinformatics
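
One way to formalize this as a hyper-geometric test, sketched with scipy; all of the sizes below are hypothetical:

    from scipy.stats import hypergeom

    N = 1000        # size of the entire area (hypothetical)
    a = 120         # size of the first leaf
    b = 150         # size of the second leaf
    k = 40          # observed overlap between the two leaves

    # P(overlap >= k) when an area of size a is drawn at random from the N cells,
    # with the b cells of the second leaf counted as "successes"
    p_value = hypergeom.sf(k - 1, N, b, a)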

  34. Gene Ontology Enrichment Analysis http://www.nature.com/nrc/journal/v7/n1/images/nrc2036-f1.jpg

  35. Student’s t-test • The test statistic follows a Student’s t distribution if the null hypothesis is supported • Commonly applied when the test statistic would follow a normal distribution if the value of a scaling term were known • When the scaling term is unknown and is replaced by an estimate based on the data, the test statistic follows a Student’s t distribution • The t-statistic was introduced in 1908 by William Sealy Gosset (“Student” was his pen name) • Student's t-test - Wikipedia, the free encyclopedia Machine Learning and Bioinformatics

  36. Compared to the normal distribution • The probability of seeing a normally distributed value far (i.e. more than a few standard deviations) from the mean drops off extremely rapidly • thus, the normal distribution is not robust to the presence of outliers (data that are unexpectedly far from the mean, due to exceptional circumstances, observational error, etc.) • data with outliers may be better described using a heavy-tailed distribution such as the Student’s t-distribution • Suppose X1, …, Xn are independent normally distributed random variables, each with mean μ and variance σ² Machine Learning and Bioinformatics

  37. The sample mean X̄ = (X1 + … + Xn) / n follows a normal distribution with mean μ and variance σ²/n • The standardized sample mean, (X̄ − μ) / (s / √n) with s the sample standard deviation, follows the Student’s t-distribution with n−1 degrees of freedom • this is useful to compare two sets of numerical data • The sum of the squared standardized values, Σ ((Xi − μ) / σ)², has the chi-squared distribution with n degrees of freedom Machine Learning and Bioinformatics
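
A minimal sketch of comparing two sets of numerical data with a t-test, assuming numpy and scipy; the data below are simulated only for illustration:

    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(0)
    a = rng.normal(10.0, 2.0, size=20)     # hypothetical measurements, group A
    b = rng.normal(11.5, 2.0, size=20)     # hypothetical measurements, group B

    # two-sample t-test; appropriate here because n < 30 and the variance is estimated
    t_stat, p_value = ttest_ind(a, b)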

  38. Remember? Machine Learning and Bioinformatics

  39. That’s why we have • Choosing the Correct Statistical Test in SAS, Stata and SPSS • GraphPad - FAQ 1790 - Choosing a statistical test • The testing process • Common test statistics • But… Machine Learning and Bioinformatics

  40. Do not use them unless you understand the concepts introduced in these slides Machine Learning and Bioinformatics

  41. Chi-squared distribution • The chi-squared distribution (also chi-square or χ²-distribution) with k degrees of freedom is the distribution of a sum of the squares of k independent standard normal random variables • Used in chi-squared tests for • goodness of fit of an observed distribution to a theoretical one • the independence of two criteria of classification • confidence interval estimation for a population standard deviation of a normal distribution from a sample standard deviation • many other statistical tests also use this distribution, like Friedman’s analysis of variance by ranks Machine Learning and Bioinformatics

  42. A special case of the gamma distribution • If Z1, …, Zk are independent standard normal random variables, then the sum of their squares, Q = Z1² + … + Zk², is distributed according to the chi-squared distribution with k degrees of freedom • This is usually denoted as Q ~ χ²(k) or Q ~ χk² • Chi-squared distribution - Wikipedia, the free encyclopedia Machine Learning and Bioinformatics
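
A quick simulation of this fact, assuming numpy and scipy; the degrees of freedom and sample size are arbitrary:

    import numpy as np
    from scipy.stats import chi2, kstest

    k = 5
    z = np.random.default_rng(1).standard_normal((100_000, k))
    q = (z ** 2).sum(axis=1)                 # sum of squares of k standard normals

    # the empirical distribution of q should agree closely with chi-squared(k)
    print(kstest(q, chi2(df=k).cdf))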

  43. Chi-squared tests • Also known as chi-square test or χ² test • Note the distinction between the test statistic and its distribution • The distribution is a chi-squared distribution when the null hypothesis is true, or asymptotically true • the sampling distribution can be approximated to a chi-squared distribution as closely as desired by enlarging the sample size • Often the shorthand for Pearson’s chi-squared test, also known as • the chi-squared goodness-of-fit test • the chi-squared test for independence Machine Learning and Bioinformatics

  44. Pearson’s chi-squared test • Pearson's chi-squared test - Wikipedia • The best-known of several chi-squared tests • Tests the frequency distributions of events • the considered events must be mutually exclusive and have total probability 1 • e.g., tests the “fairness” of a die • Used to assess two types of comparison • the test of goodness of fit answers whether an observed frequency distribution differs from a theoretical one • the test of independence answers whether paired observations on two variables, expressed in a contingency table, are independent Machine Learning and Bioinformatics

  45. Steps • Calculate the chi-squared test statistic, χ², which resembles a normalized sum of squared deviations between observed and theoretical frequencies • Determine the degrees of freedom, d, of that statistic, which is essentially the number of frequencies reduced by the number of parameters of the fitted distribution • χ² is then compared to the critical value of the chi-squared distribution with d degrees of freedom to obtain a p-value • A test that does not rely on the χ² approximation is Fisher’s exact test, which is more accurate in obtaining a significance level, especially with few observations Machine Learning and Bioinformatics

  46. Test for fit of a distribution • Suppose that there are N observations divided among n cells • A simple application is to test the hypothesis that, in the general population, values would occur in each cell with equal frequency • the “theoretical frequency” for any cell (under the null hypothesis of a discrete uniform distribution) is Ei = N / n • the reduction in the degrees of freedom is p=1, notionally because the observed frequencies Oi are constrained to sum to N • the degrees of freedom are therefore n − 1 • The value of the test statistic is X² = Σi (Oi − Ei)² / Ei, where X² is Pearson’s cumulative test statistic, which asymptotically approaches a χ² distribution Machine Learning and Bioinformatics
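
A sketch of such a goodness-of-fit test with scipy, testing the fairness of a six-sided die; the observed counts are made up for illustration:

    from scipy.stats import chisquare

    observed = [16, 18, 16, 14, 12, 24]      # hypothetical counts from N = 100 rolls
    # expected frequencies default to the uniform N / n = 100 / 6 per face
    chi2_stat, p_value = chisquare(observed)
    # degrees of freedom = n - 1 = 5, regardless of how many times the die is rolled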

  47. When testing whether observations are random variables whose distribution belongs to a given family of distributions, the “theoretical frequencies” are calculated using a distribution from that family • the reduction in the degrees of freedom is calculated as p=s+1, where s is the number of co-variates used in fitting the distribution • for instance, when checking a normal distribution (where the parameters are mean and standard deviation), p=3 • the degrees of freedom is n-p • It should be noted that the degrees of freedom are not based on the number of observations as with a Student’s t distribution • if testing for a fair, six-sided die, there would be five degrees of freedom because there are six categories • the number of times the die is rolled will have absolutely no effect on the number of degrees of freedom Machine Learning and Bioinformatics

  48. Test of independence • An “observation” consists of the values of two outcomes and the null hypothesis is that the occurrence of these outcomes is statistically independent • Each observation is allocated to one cell of a two-dimensional array of cells (called a contingency table) according to the values of the two outcomes • If there are r rows and c columns in the table, the value of the test statistic is X² = Σi Σj (Oi,j − Ei,j)² / Ei,j, where the expected frequency Ei,j is (row i total × column j total) / N • Fitting the model of “independence” reduces the number of degrees of freedom by p = r + c − 1 • The number of degrees of freedom is equal to the number of cells, r×c, minus the reduction in degrees of freedom, p, which reduces to (r − 1)(c − 1). Machine Learning and Bioinformatics
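
A minimal sketch with scipy; the r × c table below is hypothetical:

    from scipy.stats import chi2_contingency

    # hypothetical 3 x 4 contingency table of observed counts
    table = [[90, 60, 104, 95],
             [30, 50,  51, 20],
             [30, 40,  45, 35]]

    chi2_stat, p_value, dof, expected = chi2_contingency(table)
    # dof == (3 - 1) * (4 - 1) == 6; `expected` holds the E_ij under independence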

  49. Summary • Statistical test • criminal trial • philosopher’s beans • clairvoyant card game • P-value vs. α • You have to choose the right distribution • normal distribution (z-test) • hyper-geometric distribution (Fisher’s exact test) • Distinguish between distributions and tests • different tests with the same distribution • overlap significance • enrichment analysis • different distributions for the same test statistic • Student’s t-test • Chi-squared tests • goodness of fit • test of independence Machine Learning and Bioinformatics

  50. Feature selection • Test whether the selected features are significantly better than the others • Upload and test them in our simulation system • Finally, commit your best version and send TA Jang a report before 23:59 1/8 (Tue) Machine Learning & Bioinformatics
