Introduction to Statistics

Introduction to Statistics Biomedical Sciences Degrees Honours Students Derek Scott d.scott@abdn.ac.uk

Why use statistics? • Statistics are used to analyse populations and predict changes in terms of probability. • Normally, a representative sample is taken, large enough to make likely conclusions about the population as a whole. • Descriptive statistics: summarise the data and describe the population. These values allow you to see how large and how variable the data are. • Inferential statistics: propose null hypothesis and endeavour to disprove it. By looking at these, you can check for error.

When analysing data, you want to make the strongest possible conclusion from limited amounts of data. To do this, you need to overcome 2 problems: • Important differences can be obscured by biological variability and experimental error. This makes it difficult to distinguish real differences from random variability. • The human brain excels at finding patterns, even from random data. Our natural inclination (especially with our own data) is to conclude that any differences are real, and to minimise the contribution of random variability. Statistical rigor prevents you from making this mistake.

Errors • Bias or systematic error: Data go in a predictable direction perhaps due to experimental design or human errors. Can remove the errors if you identify them. • Random error: Unpredictable errors. Can’t get rid of these. • Usually you will quote a measure of error with your data (e.g. standard deviation, standard error of the mean). • EXAMPLE: The mean height of a student in BM4005 is: 1.71 ± 0.20 (43) metres. Here 1.71 is the mean value, 0.20 is the SD or SEM, 43 is n (the number of samples), and metres are the units. Always state the units!
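A minimal sketch of how a summary in the "mean ± error (n) units" form could be produced with Python's standard `statistics` module. The heights are made-up illustration values, not BM4005 data, and `summarise` is a hypothetical helper name.

```python
import math
import statistics

def summarise(values, units):
    """Return 'mean ± SD (n) units' as text, plus the SEM, for a list of measurements."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)   # sample SD (n - 1 in the denominator)
    sem = sd / math.sqrt(n)         # standard error of the mean
    return f"{mean:.2f} ± {sd:.2f} ({n}) {units}", sem

# Hypothetical heights in metres, for illustration only
heights = [1.60, 1.70, 1.80, 1.65, 1.75]
text, sem = summarise(heights, "metres")
print(text)
print(f"SEM = {sem:.3f} metres")
```

Whichever error measure is quoted, the text should say whether it is the SD or the SEM, since the two differ by a factor that depends on n.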

Independent Sampling - 1 • Measure BP in rats, 5 rats per group. • Measure BP 3 times in each animal. • You do not have 15 independent measurements, since the triplicate measurements in each animal will be closer to one another than to those in other animals. • You should average the values from each rat. • You now have 5 independent mean values.
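The averaging step can be sketched as follows. The BP readings are invented for illustration; the point is only that the repeats are collapsed to one value per animal before any group statistics are calculated.

```python
import statistics

# Hypothetical BP readings (mmHg): 3 repeat measurements in each of 5 rats
readings = {
    "rat1": [118, 120, 122],
    "rat2": [130, 131, 129],
    "rat3": [110, 112, 111],
    "rat4": [125, 124, 126],
    "rat5": [119, 121, 120],
}

# Collapse the repeats: one mean per animal counts as one independent value
rat_means = [statistics.mean(values) for values in readings.values()]

n = len(rat_means)                        # 5 animals, not 15 readings
group_mean = statistics.mean(rat_means)
group_sd = statistics.stdev(rat_means)
print(f"{group_mean:.1f} ± {group_sd:.1f} ({n}) mmHg")
```

The same pattern would apply to triplicates within a biochemical experiment: average within each experiment first, then analyse the per-experiment means.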

Independent Sampling - 2 • Perform a biochemical test 3 times, each time in triplicate. • Do not have 9 independent values, as an error in preparing the reagents for 1 experiment could affect all 3 triplicates. • Average the triplicates, and you have 3 independent mean values.

Independent Sampling - 3 • Doing a human exercise study. • Recruit 10 people from the inner-city, and 10 people from the countryside. • Have not independently sampled 20 subjects from one population. • Data from inner-city subjects may be closer to each other than to the data from rural subjects. You have sampled from 2 populations, and need to account for this in your analysis.

Gaussian (Normal) Distribution • Data usually follow a bell-shaped distribution called the Gaussian distribution. t-tests and ANOVA tests assume that the population follows an approximately Gaussian distribution. • For example, if we measure the height of everyone in 4th year and plot this, most people would fall in the middle of the curve, with a few at the bottom end, and a few at the top end of the curve. • For a Gaussian distribution, we use parametric tests.

Gaussian Distribution “Bell-shaped” curve

Outliers • When analysing data, some values can be very different from the rest. • It is tempting to delete such a value from the analysis. • Was the value typed in correctly? • Was there an experimental problem with that value? • Is it due to biological diversity? • What if the answers to these questions are all no?

Outliers • If outlier is due to chance, keep it in the data set. • If it is due to a mistake (e.g. bad pipetting, voltage spike, apparatus problem) then you must remove it from the analysis. • If you want to be absolutely sure whether the outlier is due to chance or not, there are specific statistical tests you can do, but usually these basic checks are enough to decide.

Mean • The sample mean will probably not be exactly the population mean. It is a more accurate estimate if you have a bigger sample size with low variability. • You may calculate Confidence Intervals (CIs), which give the range within which you can be 95% confident that the true population mean lies. • EXAMPLE: Mean height of a student in BM4005 is 1.71 metres. The 95% confidence limits for this value are 1.5 and 1.8 metres. These are the lower and upper limits between which we can be 95% confident that the true mean height lies.

Confidence Intervals • Nothing magical about 95%. You could do it for any value you liked, e.g. 99% or 90%. • If you set a value of 99%, then the interval would be wider, because a wider range is needed to be 99% confident that it contains the true mean. • 95% confidence limits mean you have a reasonable level of confidence that the true population mean lies within that range.
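A rough sketch of how the width of a confidence interval for the mean changes with the chosen confidence level. It uses the normal approximation from Python's `statistics.NormalDist`, which is an assumption that suits larger samples; for small samples, a t-distribution based interval (as reported by statistics software) would be somewhat wider. The summary values and the helper name `mean_confidence_interval` are hypothetical.

```python
import math
from statistics import NormalDist

def mean_confidence_interval(mean, sd, n, level=0.95):
    """Approximate CI for the population mean, using the normal approximation."""
    sem = sd / math.sqrt(n)
    # Two-sided critical value, about 1.96 for a 95% interval
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return mean - z * sem, mean + z * sem

# Hypothetical summary: mean BP 120 mmHg, SD 10 mmHg, n = 25
for level in (0.90, 0.95, 0.99):
    low, high = mean_confidence_interval(120, 10, 25, level)
    print(f"{level:.0%} CI: {low:.2f} to {high:.2f} mmHg")
```

The interval widens as the confidence level rises from 90% to 99%, and it would narrow if n were larger.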

Standard Deviation (SD) • Quantifies variability. • If data follow a Gaussian distribution, then about 68% of values lie within one SD of the mean (on either side) and about 95% of values lie within 2 SDs of the mean. • So, as a rough rule of thumb, if 2 points on a graph are more than 2 SDs away from each other, they are likely to be significantly different, but this should be confirmed with a proper test. • Expressed in the same units as the data.
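The 68% and 95% figures can be checked directly from the Gaussian curve. This is a small sketch using `statistics.NormalDist` from the Python standard library; `fraction_within` is a hypothetical helper name.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def fraction_within(k):
    """Fraction of a Gaussian population lying within k SDs of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2):
    print(f"within {k} SD: {fraction_within(k):.1%}")
```

These fractions only hold if the data really are approximately Gaussian.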

Standard Error of the Mean (SEM) • Measure of how far the sample mean is likely to be from the true population mean. SEM = SD/√n • Smaller than SD, so used more to give smaller error bars! • SD quantifies scatter: how much values vary from each other. It doesn’t really change much even if you have a bigger sample size. • SEM quantifies how accurately you know the true mean of the population. SEM gets smaller as the sample gets larger.
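The SEM is the SD divided by the square root of n, so it shrinks as the sample grows even when the scatter stays the same. A minimal sketch, assuming an SD of 0.20 metres (borrowed from the height example) that is held fixed while n varies:

```python
import math

def sem(sd, n):
    """Standard error of the mean: SD divided by the square root of n."""
    return sd / math.sqrt(n)

sd = 0.20  # assumed SD in metres, kept constant for illustration
for n in (10, 43, 100, 1000):
    print(f"n = {n:4d}  SEM = {sem(sd, n):.4f} metres")
```

Quadrupling the sample size only halves the SEM, which is why very large samples are needed to pin down a mean precisely.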

P Values

Student’s t-test • Used to compare the means of two groups of data. • Paired t-test: control expt. and treatment done on same person, animal or cell etc. • Unpaired t-test: control done on 1 group of subjects, with the treatment being done on another separate group. • Can be 1- or 2-tailed.

[Figure: Iron and zinc evoke electrogenic responses that are pH-dependent. Panels: IRON (100mM), ZINC (100mM). Conditions: Krebs pH 6.0, Krebs pH 7.4.]

[Figure: Iron- and zinc-evoked transport is temperature-dependent. Panels: IRON, ZINC. Conditions: 4 °C, 37 °C.]

Paired or Unpaired? • Choose paired if the 2 columns of data are matched, e.g. • You measure weight before and after an intervention in the same subjects. • You recruit subjects as pairs, matched for variables such as age, ethnic group, disease severity. One of the pair gets one treatment, the other gets an alternative treatment. • You perform the control experiment in one cell or piece of tissue, and then apply a drug. You measure the effect of the drug in the same cell or tissue. • Matching shouldn’t be based on the variable you are comparing. For example, if measuring BP, you can match subjects based on their age or postcode, but not on their BPs.

Student’s t-test • You will probably always use a 2-tailed t-test. • 2-tailed test just asks whether there is a difference between the 2 means. • 1-tailed test predicts whether: • Mean 1 is bigger than Mean 2 or • Mean 2 is bigger than Mean 1. • For 1 tailed you must know which mean will be bigger before you start – not usually possible • Stick to a 2-tailed t-test to be safe!!!
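To make the paired/unpaired distinction concrete, here is a sketch that computes the t statistic both ways using only the Python standard library. The weights are invented, the function names are hypothetical, and the unpaired version assumes equal variances (the classic pooled-variance Student's t-test). Turning t into a 2-tailed P value needs the t distribution, which is left to statistics software or tables.

```python
import math
import statistics

def unpaired_t(a, b):
    """Student's t for two independent groups (pooled variance); returns (t, df)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a) +
                  (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    se = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (statistics.mean(a) - statistics.mean(b)) / se, na + nb - 2

def paired_t(before, after):
    """Student's t on the within-subject differences; returns (t, df)."""
    diffs = [y - x for x, y in zip(before, after)]
    n = len(diffs)
    se = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / se, n - 1

# Hypothetical weights (kg) in the same 5 subjects before and after an intervention
before = [70, 82, 65, 90, 77]
after = [68, 79, 64, 86, 75]

t_paired, df_paired = paired_t(before, after)
t_unpaired, df_unpaired = unpaired_t(after, before)
print(f"paired:   t = {t_paired:.2f}, df = {df_paired}")
print(f"unpaired: t = {t_unpaired:.2f}, df = {df_unpaired}")
```

With these made-up numbers the paired analysis gives a much larger t, because it looks only at the change within each subject and ignores the large differences between subjects. Treating matched data as unpaired would throw that advantage away.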

Analysis of Variance (ANOVA) • Used to compare means of 3 or more groups. • Again, can have matched (paired) or unmatched (unpaired) values. • You will probably only use 1-way ANOVA • EXAMPLE: Your null hypothesis is that the average BP for 4 men is equal. ANOVA can compare each subject’s BP and say if they are different or not.

Features of ANOVA • ANOVA produces an F value, which compares the variation between the group means with the variation within the groups. A higher F value means the groups differ more relative to the scatter within each group. • Dunnett’s post test allows you to compare against 1 group e.g. A v B, A v C, A v D. Handy if A is the control group. • Tukey’s post test allows you to compare all columns against one another just to check for any differences between any groups. Good way of finding significant differences that you may not have expected.
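A minimal sketch of where the F value in a 1-way ANOVA for unmatched groups comes from. The BP readings and the function name `one_way_anova_f` are hypothetical; the P value and any post tests (Dunnett's, Tukey's) would still come from statistics software.

```python
import statistics

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of values."""
    all_values = [v for g in groups for v in g]
    grand_mean = statistics.mean(all_values)
    k, n_total = len(groups), len(all_values)

    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual values around their own group mean
    ss_within = sum((v - statistics.mean(g)) ** 2 for g in groups for v in g)

    df_between, df_within = k - 1, n_total - k
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within

# Hypothetical BP readings (mmHg) for a control group and two treatment groups
control = [120, 122, 118, 121]
drug_a = [115, 117, 113, 116]
drug_b = [110, 111, 109, 112]

f_value, df_between, df_within = one_way_anova_f([control, drug_a, drug_b])
print(f"F({df_between}, {df_within}) = {f_value:.2f}")
```

F is a ratio: it is large when the group means are far apart compared with the scatter inside each group.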

[Figure: The effect of non-selective protein kinase inhibition with staurosporine. Panels: IRON, ZINC. Conditions: Control, 8-Br cGMP (100 mM), Staurosporine (0.5 mM), 8-Br cGMP + Staurosporine.]

Non-Gaussian Distribution • Use non-parametric tests for these unusual situations. They rank the data from low to high and analyse the distribution of the ranks. • Less powerful than parametric tests, but useful when some values are too low or too high to measure and have to be assigned arbitrary values. Also used if the outcome is a rank or score with only a few categories. • P values are usually higher.
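The ranking idea can be sketched as follows. The slides do not name a specific test; the Mann-Whitney U statistic is used here only as one common example of a rank-based comparison of two unpaired groups. The scores and function names are hypothetical.

```python
def rank_values(values):
    """Rank from low (1) to high; tied values share their average rank."""
    ordered = sorted(values)
    ranks = {}
    for v in set(ordered):
        first = ordered.index(v) + 1      # 1-based position of the first copy
        count = ordered.count(v)          # number of tied copies
        ranks[v] = first + (count - 1) / 2
    return [ranks[v] for v in values]

def mann_whitney_u(a, b):
    """U statistic for group a, from its rank sum in the pooled data."""
    ranks = rank_values(list(a) + list(b))
    rank_sum_a = sum(ranks[:len(a)])
    return rank_sum_a - len(a) * (len(a) + 1) / 2

# Hypothetical symptom scores (0 to 10 scale) in two separate groups
control = [3, 5, 4, 6, 5]
treated = [7, 8, 6, 9, 8]
print(rank_values(control + treated))
print(mann_whitney_u(control, treated))
```

Only the order of the values matters here, so an extreme value or an arbitrarily assigned "too high to measure" value has no more influence than any other top-ranked value.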

Skewness

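Correlation • [Figure: scatter plots showing +ve correlation and -ve correlation.] • Correlation doesn’t tell you about the cause of the effect, it just tells you that there is a link between value X and value Y. • The nearer the R value is to +1 (positive correlation) or -1 (negative correlation), the stronger the correlation. An R value near 0 means little or no linear relationship.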
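A short sketch of how a correlation coefficient is calculated, here Pearson's r written out by hand with the standard library. The data are invented to show one positive and one negative relationship.

```python
import math
import statistics

def pearson_r(x, y):
    """Pearson correlation coefficient between paired lists x and y."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical paired measurements
x = [1, 2, 3, 4, 5]
y_up = [2.1, 3.9, 6.2, 8.0, 9.8]     # rises with x
y_down = [9.8, 8.0, 6.2, 3.9, 2.1]   # falls as x rises

r_up = pearson_r(x, y_up)
r_down = pearson_r(x, y_down)
print(f"r (rising)  = {r_up:.3f}")
print(f"r (falling) = {r_down:.3f}")
```

The sign of r gives the direction of the relationship and its size gives the strength; neither says anything about cause.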

Regression Regression calculates a line of best fit. Often used to calculate a standard curve which you could use to estimate value x if you know value y. Unknowns must fall within your standard curve’s range.
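A minimal sketch of a standard curve: fit a least-squares line to the standards, then read an unknown off the line, refusing readings outside the range of the standards. The concentrations, absorbances and function names are hypothetical, and a straight-line relationship is assumed.

```python
import statistics

def fit_line(x, y):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) /
             sum((a - mean_x) ** 2 for a in x))
    return slope, mean_y - slope * mean_x

def read_unknown(y_value, x, y):
    """Estimate x from a measured y, rejecting values outside the standards."""
    if not min(y) <= y_value <= max(y):
        raise ValueError("reading lies outside the standard curve")
    slope, intercept = fit_line(x, y)
    return (y_value - intercept) / slope

# Hypothetical protein standards: concentration (mg/ml) against absorbance
conc = [0.0, 0.5, 1.0, 1.5, 2.0]
absorbance = [0.02, 0.26, 0.51, 0.74, 1.01]

slope, intercept = fit_line(conc, absorbance)
print(f"absorbance = {slope:.3f} * conc + {intercept:.3f}")
print(f"unknown with absorbance 0.60: {read_unknown(0.60, conc, absorbance):.2f} mg/ml")
```

The range check reflects the rule that unknowns must fall within the standard curve: the fitted line says nothing reliable about values beyond the highest or lowest standard.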

Correlation and regression • A word of caution about doing regression and finding correlations. • Just because you can draw a line of best fit through some points and make quite a good straight line, it does not necessarily mean there is a relationship. • Correlation does not necessarily imply causation! • For example, the consumption of tropical fruit in the UK since WW2 has increased, and so has the birth rate in the UK. If I plotted this on a graph and did a regression, I would probably get a nice straight line as both increase together. I would probably also show there is a good correlation. • This does not mean that I can say that eating tropical fruit improves your fertility!!! • Use some common sense when interpreting your data!

Summary • This is just a basic introduction. • For extra information, try the Help files in GraphPad Prism (on the University PCs). • If you end up doing an Honours project with certain types of data (e.g. collecting psychological data, epidemiological studies etc.), your supervisor should inform you about any special tests/calculations they use for that type of data. • Finally, if you are still unsure, make it clear to your supervisor that you do not understand what you are doing or why.
