
Descriptive and inferential statistics


Presentation Transcript


  1. Descriptive and inferential statistics Asst. Prof. Georgi Iskrov, PhD Department of Social Medicine

  2. Before we start http://www.raredis.work/edu/ Lecture slides to be updated!

  3. Outline • Statistics • Sample, population and sampling • Descriptive and inferential statistics • Types of variables and level of measurement • Measures of central tendency and spread • Normal distribution • Confidence intervals • Sample size calculation • Hypothesis testing • Significance, power and errors • Normality tests

  4. Why do we need to use statistical methods? • To make the strongest possible conclusions from limited amounts of data; • To generalize from a particular set of data to a more general conclusion. • What do we need to pay attention to? • Bias • Probability

  5. Population vs Sample Population → parameters (μ, σ); sample → statistics (x̄, s)

  6. Population vs Sample Population includes all objects of interest, whereas a sample is only a portion of the population: Parameters are associated with populations and statistics with samples; Parameters are usually denoted using Greek letters (μ, σ), while statistics are usually denoted using Roman letters (x̄, s). There are several reasons why we do not work with populations: They are usually large, and it is often impossible to get data for every object we are studying; Sampling does not usually occur without cost – the more items surveyed, the larger the cost.

  7. Inferential statistics From population to sample: sampling. From sample to population: inferential statistics. The population is described by parameters; the sample is described by statistics.

  8. Descriptive vs Inferential statistics We compute statistics and use them to estimate parameters. The computation is the first part of the statistical analysis (descriptive statistics) and the estimation is the second part (inferential statistics). Descriptive statistics: the procedures used to organize and summarize masses of data. Inferential statistics: the methods used to find out something about a population, based on a sample.

  9. Sampling Individuals in the population vary from one another with respect to an outcome of interest.

  10. Sampling When a sample is drawn, there is no certainty that it will be representative of the population. [Diagram: two possible samples, A and B]

  11.–12. Sampling [Diagrams: samples A and B drawn from the population]

  13. Sampling Random sample: In random sampling, each element of the population has an equal chance of being chosen at each draw. While this is the preferred way of sampling, it is often difficult to do, as it requires a complete list of every element in the population. Computer-generated lists are often used with random sampling. Properties of a good sample: Random selection; Representativeness by structure; Representativeness by number of cases.

  14. Sampling Systematic sampling: The list of elements is “counted off” – that is, every k-th element is taken. This is similar to lining everyone up and numbering off “1, 2, 3, 4; 1, 2, 3, 4; etc.”; when done numbering, all people numbered 4 would be used. Convenience sampling: In convenience sampling, readily available data are used – that is, the first people the surveyor runs into.

  15. Sampling Cluster sampling: It is accomplished by dividing the population into groups (clusters), usually geographically. The clusters are randomly selected, and each element in the selected clusters is used. Stratified sampling: It divides the population into groups, called strata. However, this time it is by some characteristic, not geographically – for instance, the population might be separated into males and females. A sample is taken from each of these strata using random, systematic, or convenience sampling.
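To make three of these schemes concrete, here is a minimal Python sketch. The numbered population of 1,000 individuals, the sample sizes, and the odd/even “strata” are all invented for illustration.

```python
import random

# Hypothetical sampling frame: 1,000 numbered individuals
population = list(range(1, 1001))

# Simple random sampling: every element has an equal chance at each draw
random_sample = random.sample(population, 50)

# Systematic sampling: every k-th element after a random start
k = len(population) // 50          # k = 20
start = random.randrange(k)
systematic_sample = population[start::k]

# Stratified sampling: split by some characteristic (odd/even IDs stand in
# for, e.g., males and females), then sample randomly within each stratum
strata = {"even": [i for i in population if i % 2 == 0],
          "odd":  [i for i in population if i % 2 == 1]}
stratified_sample = [i for group in strata.values()
                     for i in random.sample(group, 25)]
```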

  16. Random and systematic errors Random error can be conceptualized as sampling variability. Bias (systematic error) is a difference between an observed value and the true value due to all causes other than sampling variability. Biased sample: A biased sample is one in which the method used to create the sample results in samples that are systematically different from the population. Accuracy is a general term denoting the absence of error of all kinds.

  17. Sample size calculation Law of Large Numbers: As the number of trials of a random process increases, the percentage difference between the expected and actual values goes to zero. Application in biostatistics: the bigger the sample size, the smaller the margin of error. A properly designed study will include a justification for the number of experimental units (people/animals) being examined. Sample size calculations are necessary to design experiments that are large enough to produce useful information and small enough to be practical.
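A short simulation makes the Law of Large Numbers concrete; this sketch uses coin flips (expected proportion of heads 0.5) as an arbitrary example of a random process.

```python
import random

# As the number of coin flips grows, the observed proportion of heads
# converges to the expected value of 0.5 (Law of Large Numbers)
for n in (10, 100, 1_000, 10_000, 100_000):
    heads = sum(random.random() < 0.5 for _ in range(n))
    print(f"n = {n:>6}: observed proportion of heads = {heads / n:.3f}")
```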

  18. Sample size calculation Generally, the sample size for any study depends on: Acceptable level of confidence; Power of the study; Expected effect size and absolute error of precision; Underlying scatter in the population.

  19. Sample size calculation For quantitative variables: n = Z² × SD² / d², where Z – the value corresponding to the chosen confidence level (e.g., 1.96 for 95%); SD – standard deviation; d – absolute error of precision.

  20. Sample size calculation For quantitative variables: A researcher is interested in knowing the average systolic blood pressure in the pediatric age group at a 95% level of confidence and a precision of 5 mmHg. The standard deviation, based on previous studies, is 25 mmHg.
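Plugging the slide's numbers into the formula from slide 19 (a sketch; rounding up, since a sample must contain a whole number of subjects):

```python
import math

# n = Z^2 * SD^2 / d^2 (formula from slide 19), rounded up
Z = 1.96    # 95% level of confidence
SD = 25     # standard deviation from previous studies (mmHg)
d = 5       # absolute error of precision (mmHg)

n = math.ceil(Z**2 * SD**2 / d**2)
print(n)    # 97 children
```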

  21. Sample size calculation For qualitative variables: n = Z² × p × (1 − p) / d², where Z – the value corresponding to the chosen confidence level; p – expected proportion in the population; d – absolute error of precision.

  22. Sample size calculation For qualitative variables: A researcher is interested in knowing the proportion of diabetes patients who have hypertension. According to a previous study, this proportion is no more than 15%. The researcher wants to calculate the sample size with a 5% absolute precision error and a 95% confidence level.
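Applying the proportion formula from slide 21 to this example, again rounding up:

```python
import math

# n = Z^2 * p * (1 - p) / d^2 (formula from slide 21), rounded up
Z = 1.96    # 95% level of confidence
p = 0.15    # expected proportion with hypertension
d = 0.05    # absolute error of precision

n = math.ceil(Z**2 * p * (1 - p) / d**2)
print(n)    # 196 patients
```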

  23. Variables • Different types of data require different kinds of analyses.

  24. Levels of measurement There are four levels of measurement: nominal, ordinal, interval and ratio. These go from the lowest level to the highest level. Data are classified according to the highest level into which they fit. Each additional level adds something the previous level did not have: Nominal is the lowest level – only names are meaningful here; Ordinal adds an order to the names; Interval adds meaningful differences; Ratio adds a zero, so that ratios are meaningful.

  25. Levels of measurement Nominal scale – e.g., genotype. You can code it with numbers, but the order is arbitrary and any calculations would be meaningless. Ordinal scale – e.g., pain score from 1 to 10. The order matters, but not the difference between values. Interval scale – e.g., temperature in °C. The difference between two values is meaningful. Ratio scale – e.g., height. It has a clear definition of 0: when the variable equals 0, there is none of that variable. When working with ratio variables, but not interval variables, you can look at the ratio of two measurements.
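A tiny worked example of the interval/ratio distinction: because the zero of the Celsius scale is arbitrary, ratios of temperatures are not meaningful, and re-expressing the same temperatures in Fahrenheit changes the ratio.

```python
def c_to_f(celsius):
    return celsius * 9 / 5 + 32

# 20 C looks like "twice" 10 C, but the same two temperatures
# expressed in Fahrenheit give a completely different ratio
print(20 / 10)                   # 2.0
print(c_to_f(20) / c_to_f(10))   # 68 / 50 = 1.36
```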

  26. Central tendency and spread • Central tendency: Mean, mode and median • Spread: Range, interquartile range, standard deviation • Mistakes: • Focusing on only the mean and ignoring the variability • Standard deviation and standard error of the mean • Variation and variance • What is best to use in different scenarios? • Symmetrical data: mean and standard deviation • Skewed data: median and interquartile range
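As a quick illustration of these measures (and of why skewed data call for the median and interquartile range), a sketch on a small invented data set with one outlier:

```python
import statistics

data = [2, 3, 3, 4, 5, 6, 18]       # hypothetical right-skewed data

print(statistics.mean(data))        # ~5.86, pulled upward by the outlier
print(statistics.median(data))      # 4, robust to the outlier
print(statistics.mode(data))        # 3, the most frequent value
print(statistics.stdev(data))       # sample standard deviation
q1, q2, q3 = statistics.quantiles(data, n=4)
print(q3 - q1)                      # interquartile range
```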

  27. Normal (Gaussian) distribution When data are approximately normally distributed: approximately 68% of the data lie within one SD of the mean; approximately 95% of the data lie within two SDs of the mean; approximately 99.7% of the data lie within three SDs of the mean.
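These percentages can be checked directly against the standard normal cumulative distribution function; a one-loop sketch assuming SciPy is available:

```python
from scipy.stats import norm

# Area under the standard normal curve within k SDs of the mean
for k in (1, 2, 3):
    print(f"within {k} SD: {norm.cdf(k) - norm.cdf(-k):.4f}")
# within 1 SD: 0.6827; within 2 SD: 0.9545; within 3 SD: 0.9973
```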

  28. Normal (Gaussian) distribution • Central limit theorem: • Create a population with a known distribution that is not normal; • Randomly select many samples of equal size from that population; • Tabulate the means of these samples and graph the frequency distribution. • Central limit theorem states that if your samples are large enough, the distribution of the means will approximate a normal distribution even if the population is not Gaussian. • Mistakes: • Normal vs common (or disease free); • Few biological distributions are exactly normal.
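The three steps of this demonstration translate directly into a simulation. In this sketch the population is exponential (deliberately non-normal, with mean 1) and the sample size is 50; both choices are arbitrary.

```python
import random
import statistics

# 1. Population with a known, non-normal distribution (exponential, mean 1)
# 2. Randomly draw many samples of equal size and record each sample mean
sample_size = 50
means = [statistics.mean(random.expovariate(1.0) for _ in range(sample_size))
         for _ in range(10_000)]

# 3. The sample means pile up in an approximately normal distribution,
# centered on the population mean, with spread close to 1 / sqrt(50)
print(statistics.mean(means))
print(statistics.stdev(means))
```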

  29. Confidence interval for the population mean • Population mean: point estimate vs interval estimate • Standard error of the mean – how close the sample mean is likely to be to the population mean. • Assumptions: a random representative sample, independent observations, the population is normally distributed (at least approximately). • The confidence interval depends on: sample mean, standard deviation, sample size, degree of confidence. • Mistakes: • Believing that 95% of the values lie within the 95% CI; • Believing that a 95% CI covers the mean ± 2 SD.

  30. Confidence interval for the population mean The duration of time from first exposure to HIV infection to AIDS diagnosis is called the incubation period. The incubation periods (in years) of a random sample of 30 HIV-infected individuals are: 12.0, 10.5, 9.5, 6.3, 13.5, 12.5, 7.2, 12.0, 10.5, 5.2, 9.5, 6.3, 13.1, 13.5, 12.5, 10.7, 7.2, 14.9, 6.5, 8.1, 7.9, 12.0, 6.3, 7.8, 6.3, 12.5, 5.2, 13.1, 10.7, 7.2. Calculate the 95% CI for the population mean incubation period in HIV. x̄ = 9.5 years; SD = 2.8 years; SEM = 0.5 years. 95% level of confidence => Z = 1.96. µ = 9.5 ± (1.96 × 0.5) = 9.5 ± 1 years. The 95% CI for µ is (8.5; 10.5) years.

  31. Confidence interval for the population mean x̄ = 9.5 years; SD = 2.8 years; SEM = 0.5 years. 95% level of confidence => Z = 1.96. µ = 9.5 ± (1.96 × 0.5) = 9.5 ± 1 years. The 95% CI for µ is (8.5; 10.5) years. 99% level of confidence => Z = 2.58. µ = 9.5 ± (2.58 × 0.5) = 9.5 ± 1.3 years. The 99% CI for µ is (8.2; 10.8) years.
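Both intervals can be reproduced from the raw data of slide 30 with the standard library alone; since the slide's summary figures are rounded, the output may differ slightly in the last decimal.

```python
import math
import statistics

# Incubation periods (years) of the 30 HIV-infected individuals (slide 30)
x = [12.0, 10.5, 9.5, 6.3, 13.5, 12.5, 7.2, 12.0, 10.5, 5.2,
     9.5, 6.3, 13.1, 13.5, 12.5, 10.7, 7.2, 14.9, 6.5, 8.1,
     7.9, 12.0, 6.3, 7.8, 6.3, 12.5, 5.2, 13.1, 10.7, 7.2]

mean = statistics.mean(x)                        # sample mean
sem = statistics.stdev(x) / math.sqrt(len(x))    # standard error of the mean

for z, level in ((1.96, 95), (2.58, 99)):
    print(f"{level}% CI: ({mean - z * sem:.1f}; {mean + z * sem:.1f}) years")
```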

  32. Hypothesis testing Is there a difference? Diabetes type 2 study – Experimental group: mean blood sugar level 103 mg/dl; Control group: mean blood sugar level 107 mg/dl. Pancreatic cancer study – Experimental group: 1-year survival rate 23%; Control group: 1-year survival rate 20%.

  33. Hypothesis testing The general idea of hypothesis testing involves: Making an initial assumption; Collecting evidence (data); Based on the available evidence (data), deciding whether to reject or not reject the initial assumption. Every hypothesis test – regardless of the population parameter involved – requires the above three steps.

  34. Null hypothesis – H0 This is the hypothesis under test, denoted as H0. The null hypothesis is usually stated as the absence of a difference or an effect; The null hypothesis says there is no effect; The null hypothesis is rejected if the significance test shows the data are inconsistent with the null hypothesis.

  35. Alternative hypothesis – H1 This is the alternative to the null hypothesis. It is denoted as H', H1, or HA. It is usually the complement of the null hypothesis; If, for example, the null hypothesis says two population means are equal, the alternative says the means are unequal.

  36. Criminal trial The criminal justice system assumes that “the defendant is innocent until proven guilty”. That is, our initial assumption is that the defendant is innocent. In the practice of statistics, we make our initial assumption when we state our two competing hypotheses – the null hypothesis (H0) and the alternative hypothesis (HA). Here, our hypotheses are: H0: Defendant is not guilty (innocent); HA: Defendant is guilty. In statistics, we always assume the null hypothesis is true. That is, the null hypothesis is always our initial assumption.

  37. Criminal trial The prosecution team then collects evidence with the hopes of finding “sufficient evidence” to make the assumption of innocence refutable. In statistics, the data are the evidence. The jury then makes a decision based on the available evidence: If the jury finds sufficient evidence – beyond a reasonable doubt – to make the assumption of innocence refutable, the jury rejects H0 and deems the defendant guilty. We behave as if the defendant is guilty. If there is insufficient evidence, then the jury does not reject H0. We behave as if the defendant is innocent.

  38. Making the decision Recall that it is either likely or unlikely that we would observe the evidence we did given our initial assumption. If it is likely, we do not reject the null hypothesis; If it is unlikely, then we reject the null hypothesis in favor of the alternative hypothesis; Effectively, then, making the decision reduces to determining “likely” or “unlikely”.

  39. Making the decision In statistics, there are two ways to determine whether the evidence is likely or unlikely given the initial assumption: We could take the “critical value approach” (favored in many of the older textbooks). Or, we could take the “p-value approach” (what is used most often in research, journal articles, and statistical software).

  40.–41. Making the decision Suppose we find a difference between two groups in survival: patients on a new drug have a survival of 15 months; patients on the old drug have a survival of 18 months. So, the difference is 3 months. Do we accept or reject the hypothesis of no true difference between the groups (the two drugs)? Is a difference of 3 a lot, statistically speaking – a huge difference that is rarely seen? Or is it not much – the sort of thing that happens all the time?

  42. Making the decision A statistical test tells you how often you would get a difference of 3, simply by chance, if the null hypothesis is correct – no real difference between the two groups. Suppose the test is done and its result is that p = 0.32. This means that you would get a difference of 3 quite often just by the play of chance – 32 times in 100 – even when there is in reality no true difference between the groups.

  43. Making the decision On the other hand, if we did the statistical analysis and got p = 0.0001, we would say that a difference as big as 3 arises by the play of chance only 1 time in 10,000. That is so rare that we want to reject our hypothesis of no difference: there is something different about the new therapy.

  44. Hypothesis testing Somewhere between 0.32 and 0.0001 we may not be sure whether to reject the null hypothesis or not. Mostly we reject the null hypothesis when, if the null hypothesis were true, the result we got would have happened less than 5 times in 100 by chance. This is the ‘conventional’ cutoff of 5%, or p < 0.05. This cutoff is commonly used, but it is arbitrary – there is no particular reason why we use 0.05 rather than 0.06 or 0.048.
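To make the decision rule concrete, here is a hedged sketch of a two-sample t-test on simulated survival times. The group sizes, the common standard deviation, and the use of a plain t-test (real survival data would usually call for survival-analysis methods) are all assumptions for illustration only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)

# Hypothetical survival times (months) for the two drug groups
new_drug = rng.normal(loc=15, scale=6, size=30)
old_drug = rng.normal(loc=18, scale=6, size=30)

# The p-value answers: how often would a difference at least this large
# arise by chance alone, if the null hypothesis (no true difference) held?
t_stat, p_value = stats.ttest_ind(new_drug, old_drug)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")   # reject H0 if p < 0.05
```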

  45. Hypothesis testing

  46. Type I and II errors A type I error is the incorrect rejection of a true null hypothesis (also known as a “false positive” finding). The probability of a type I error is denoted by the Greek letter α (alpha). A type II error is incorrectly retaining a false null hypothesis (also known as a “false negative” finding). The probability of a type II error is denoted by the Greek letter β (beta).

  47. Level of significance • Level of significance (α) – the threshold for declaring whether a result is significant. If the null hypothesis is true, α is the probability of rejecting the null hypothesis. • α is decided as part of the research design, while the p-value is computed from the data. • α = 0.05 is most commonly used. • A small α value reduces the chance of a Type I error, but increases the chance of a Type II error. • The trade-off is based on the consequences of Type I (false-positive) and Type II (false-negative) errors.

  48. Power • Power – the probability of rejecting a false null hypothesis. Statistical power is inversely related to β or the probability of making a Type II error (power is equal to 1 – β). • Power depends on the sample size, variability, significance level and hypothetical effect size. • You need a larger sample when you are looking for a small effect and when the standard deviation is large.
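These relationships are what power and sample-size software solves numerically. A sketch assuming the statsmodels package is available, with a medium standardized effect size of 0.5 chosen arbitrarily:

```python
from statsmodels.stats.power import TTestIndPower

# Required sample size per group for a two-sample t-test:
# smaller effects, lower alpha and higher power all increase n
n = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05, power=0.80)
print(round(n))   # about 64 per group for a medium (0.5 SD) effect
```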

  49. Choosing a statistical test Choice of a statistical test depends on: Level of measurement for the dependent and independent variables Number of groups or dependent measures Number of units of observation Type of distribution The population parameter of interest (mean, variance, differences between means and/or variances)

  50. Choosing a statistical test • Multiple comparison – two or more data sets to be analyzed: • repeated measurements made on the same individuals; • entirely independent samples. • Degrees of freedom – the number of scores, items, or other units in the data set that are free to vary. • One- and two-tailed tests: • a one-tailed test of significance is used for a directional hypothesis; • two-tailed tests are used in all other situations. • Sample size – the number of cases on which data have been obtained. • Which of the basic characteristics of a distribution are more sensitive to the sample size?
