Laboratory & professional skills for Bioscientists term 3 : Data Analysis in R

Laboratory & professional skills for Bioscientists term 3:Data Analysis in R Revision: Overview and Developing understanding of term 2 statistics

Module Learning Outcomes The successful student will be able to: • Explain the purpose of data analysis • Name, identify and choose classical univariate statistical tests (and some non-parametric equivalents) appropriate to a given scenario and recognise when these are not suitable • Use R to perform these analyses on data in a variety of formats • Interpret, report and graphically present the results of covered tests Meeting the learning outcomes will enable you to: • Write-up your laboratory report • Design and analyse experiments including those for projects in stages 2, 3, and 4 and year-away • Evaluate and interpret the data analysis in papers • Perform well in assessments • Improve your employability!

Plan • What is the biological question • Organising your data • Overview for Choosing tests • Figures – use ggplot • Workshop

Overview: Experiments Something we measure Some things we control, choose or set Dependent variables Response variables The ‘y’ s Independent variables Predictor variables treatments The ‘x’ s inferences made

What is the biological question? • Very many ‘statistics’ problems are biology problems • Think about your questions before you design experiments • What are your null hypotheses? • What is your response variable? • What are your explanatory variables? • Generate dummy data

What is the biological question? Response: blood pressure Explanatory variables you want to understand: drug Explanatory variable for control: sex, age, disease??

Overview: data analysis Some things we control, choose or set Something we measure can be explained by Independent variables Dependent variable can be explained by Response variable can be explained by Predictor variables y can be explained by One or more x y ~ x y ~ x1 + x2 y ~ x1 * x2

What is the biological question? • Tests apply to a particular question • Statistics are just evidence: as references are to your introduction, statistics are to your results. • At this level you should be able to say what your results are before analysis

Collecting data • Design and execute your experiment • Experimental design and data analysis go hand-in-hand • Type of experiment determines statistics • But understanding of statistics informs design • Organise your data in ’tidy’ format • Workshop

Collecting data • Organise your data in ’tidy’ format • Fine to use a spread sheet but save in a plain text format (txt, csv, dat, etc) • Write data straight into a blank tidy format

Response all in one column Additional column for each explanatory i.e., case (individual) per row

Explore your data • Frequency histograms • Scatter plots • Bar charts • Descriptive statistics – often absent from reports. • Sample sizes/number of cases • Variables • Means/medians, s.e • You don’t put all your exploring in your report

Choose a test • Always better to do BEFORE you design experiments and collect data • Consider assumptions • Take appropriate action – transform or chose another test

Execute the test • Do results make sense based on your data exploration? • Chose APPROPRIATE figures and/or tables • Report results scientifically • Appropriate descriptive statistics • Results of test – significance, magnitude, direction • Explaining results – leave for the Discussion

Figures • Should match the test • t-tests, ANOVA tests on means thus figures show means • Wilcoxon, Mann-Whitney tests on ranks thus figures use medians/mean ranks • Correlation should NOT have line of best fit • Regression should have line of best fit • Likewise descriptive statistics

Why chi-squared? • We often count the number of individuals or events in categories and compare the numbers we observe with numbers we expect under a null hypothesis. • H0 might expect numbers to • be the same, or • follow a particular pattern, or • match the pattern in another group • Chi-squared allows us to make the comparison statistically

Two types of scenario thus two types of 2 test • Those with a priori expectations – we know what the proportions should be Goodness of fit • Those without a priori expectations – we know proportions should be same between groups Contingency

2 Goodness of fit test • The expected values (null hypothesis) are derived from some theory • We test the fit of our data to the theory

2 Contingency test • We use these when our expected values are derived from the data • Null hypothesis: proportion of foods taken by each breed is the same, i.e., no association between breed and food type • Expecteds are Row total x column total / grand total • d.f. are (r - 1)(c - 1)

Why t-tests? • Continuous response, category explanatory. • Is there a difference between categories in response • Standard formula for all t-tests Size of difference and variability matter

Assumptions of all t-tests • All t-tests assume the ‘data’ are normally distributed • 0ne-sample – the data themselves • Paired sample – the differences between the pairs • Two-sample – the data themselves • (Actually – the residuals in all cases) • Additionally the two-sample t-test assumes equal variances

Figures t-tests • Based on the mean and s.e. thus figures should have Two sample One sample - probably not needed but possibly...

Non-parametric alternatives • Non-parametric tests make fewer assumptions • Based on the ranks rather than the actual data • Null hypotheses are about the mean rank(not the mean) • More conservative (less likely to be significant) • P values maybe estimates (may generate a warning). You can’t usually ‘fix’ that

Non-parametric alternatives • One sample or paired sample : Wilcoxon signed ranked test • Two sample: Wilcoxon rank sum test aka Mann-Whitney

Why ANOVA? • Like a t-test • Continuous response, category explanatory • Is there a difference between categories in response • But more groups in the categorical variable (one way) • But an additional categorical variable (two-way) • Can test for interaction between the two explanatory variables

Assumptions of one-way ANOVA • Like a t-test • Residuals must be normally distributed • Variance in the treatment groups is assumed to be equal • Both of these can be tested in R

Figures ANOVA • Like a t-test. Based on the mean and s.e. thus figures should have

Non-parametric equivalent: Kruskal-Wallis • Residuals not normal • Repeated values • Unequal variance • Small sample size • Unequal sample size, esp when variances look unequal

Two-way ANOVA: Can be with or without replication Without reps With reps You have usually Cannot test for interaction

Assumptions of 2-way ANOVA • Same as for t-tests and one-way ANOVA • Normality and ‘homoscedascity’ of residuals Normality examine residuals after ANOVA Equalvariance examine residuals after ANOVA Use common sense

A figure Understand the interaction Sex – Sig Spp – Sig Int – Sig Effect of sex is greater in F.concocti ‘effect of one factor depends on the level of another’

Possible results Sex – NS Spp – NS Int – Sig But sex does have an effect! It is just reversed If you have a significant interaction, interpret main effects with care.

Type of question Is there an association/relationship between …..? • Correlation and regression • One continuous explanatory variable

Choosing tests Explanatory variables to explain the response variable. The type of test depends on the type of question and the type of data. t-tests ANOVA Correlation Regression Continuous Categorical

Always do a scatterplot first Correlation • Linear association • No cause and effect • Axes could be swapped

Always do a scatterplot first Regression • Linear relationship • Cause and effect • Axes cannot be swapped Dependent variable / effect/ response Independent variable / cause / explanatory / predictor

Assumptions • Correlation • Bivariate normal • Both variables are randomly sampled • Any association is linear • Regression • Normality and homoscedascity of residuals • y values are independent • x is measured without error • Linear regression assumes linear relationship

Introduction to the practical Scenarios for biological phenomena identify an appropriate design and statistical test for the general research question generate the data using the random number functions that would give the effects specified. create figures to accompany the results

Laboratory & professional skills for Bioscientists term 3 : Data Analysis in R