HOW TO USE STATISTICS IN YOUR RESEARCH

LIES, DAMNED LIES AND STATISTICS!

What we will cover: the WHY and the HOW (Graphpad, Excel) • Why do statistics • Descriptive Statistics • Distributions • Sampling & Hypotheses • Presenting Results • Chart junk

Why do you need statistics? • Why are you doing a research project? • Also very important in everyday life • Measure things • Examine relationships • Make predictions • Test hypotheses • Explore issues • Explain activities or attitudes • Make comparisons • Draw conclusions based on samples • Develop new theories • …

Misuse of statistics by design • Ignoring some ‘inconvenient data points’ • Focusing on certain variables and excluding others • Altering scales to present your data in a more positive way • Presenting correlation as causation

Misuse is generally accidental • Bias • Need to be particularly careful in ‘questionnaire’ type research • Also when sampling • Using the wrong statistical tests • Making incorrect inferences • In going from your sample to the general case • Incorrectly drawing conclusions based on correlations

Descriptive Statistics • Used to describe or summarise what your data shows • Not used to draw any conclusions that extend beyond your own data • Mean • Median • Mode • Variance • Standard Deviation

Mean (Average) • Imagine you have collected some data (from running an algorithm on a problem, by measuring execution time, by asking opinions) • You want to summarise your data • Don’t present all the results • mean {-30, 1, 2, 3, 4} = -4 • mean {0, 1, 2, 3, 4} = 2 • Measures centrality • Excel: =AVERAGE(A1:A10) • Also available in Graphpad
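The two means quoted on the slide can be checked outside Excel. The following is a minimal sketch using Python's standard `statistics` module; the variable names are illustrative and not part of the original slides.

```python
from statistics import mean

# The two data sets quoted on the slide.
with_outlier = [-30, 1, 2, 3, 4]
without_outlier = [0, 1, 2, 3, 4]

# The mean is the sum of the values divided by how many there are.
mean_with_outlier = mean(with_outlier)        # sum is -20, over 5 items
mean_without_outlier = mean(without_outlier)  # sum is 10, over 5 items

print(mean_with_outlier, mean_without_outlier)
```

A single extreme value (-30) is enough to drag the mean from 2 down to -4, which is the point the next slides build on.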

The mean is not the whole story... [Figure: results of Emma’s algorithm and Malcolm’s algorithm]

Standard Deviation • Standard deviation measures the spread of your data around the mean • Important as it gives you some indication of the reliability or variability of your results • sd {-30, 1, 2, 3, 4} = 14.6 • sd {0, 1, 2, 3, 4} = 1.6 • Measures spread • Excel: =STDEV(A1:A10)
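As a rough check of the figures above, here is a short sketch with Python's standard `statistics.stdev`, which computes the sample standard deviation (dividing by n - 1), the same convention that reproduces the slide's values. The variable names are illustrative.

```python
from statistics import stdev

# Same two data sets as on the slide.
with_outlier = [-30, 1, 2, 3, 4]
without_outlier = [0, 1, 2, 3, 4]

# stdev() is the *sample* standard deviation (divides by n - 1).
sd_with_outlier = stdev(with_outlier)
sd_without_outlier = stdev(without_outlier)

print(round(sd_with_outlier, 1), round(sd_without_outlier, 1))
```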

The mean is not the whole story... [Figure: Emma’s algorithm (STD: 0.71) and Malcolm’s algorithm (STD: 28.07)]

True or False ? The majority of Scots have more than the average number of legs

TRUE! Most Scots have more than the average number of legs! • (None have 3 legs) • Most have 2 legs • Some have 1 leg • Some have 0 legs • The average < 2 (~1.9) • The mean is not a relevant measure!

When can I use the mean? • The data that you are sampling should follow a normal distribution • Most values are close to the mean, and a few lie at either extreme • 68% of values within 1 SD of mean • 95% of values within 2 SD • In practice, a lot of data does follow this kind of distribution
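The 68% and 95% figures follow from the shape of the normal curve. As a small sketch (the function name `fraction_within` is made up for this illustration), the fraction of a normal distribution within k standard deviations of the mean can be computed with the error function from Python's `math` module:

```python
import math

def fraction_within(k: float) -> float:
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

within_1_sd = fraction_within(1)
within_2_sd = fraction_within(2)

print(f"within 1 SD: {within_1_sd:.1%}, within 2 SD: {within_2_sd:.1%}")
```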

But not all data has a normal distribution • In a skewed distribution, the majority of the data can be less than the mean m • More than half the population has less than the mean value • More than half the population is “below average”! [Figure: skewed distribution with m - sd, m and m + sd marked]

The Median • Median: the item with the middle rank • Rank the items in order, and pick the middle one • median {-30, 1, 2, 3, 4} = 2 • median {0, 1, 2, 2, 2, 3, 4, 10, 27} = 2 • Excel: =MEDIAN(A1:A10)
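The two medians on the slide can be reproduced with a short sketch using Python's standard `statistics.median`, which sorts the data and picks the middle item. Variable names are illustrative.

```python
from statistics import median

# Data sets quoted on the slide.
first = [-30, 1, 2, 3, 4]
second = [0, 1, 2, 2, 2, 3, 4, 10, 27]

# median() ranks the items and returns the middle one.
median_first = median(first)
median_second = median(second)

print(median_first, median_second)
```

Note that the outlier -30 in the first set and the large values 10 and 27 in the second have no effect on the result.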

Example: Mean vs Median • Suppose we ask 7 students how much money they have on them • Mean: £146 • Median: £3 • The median is much less affected by outliers in the data • It is more representative of the sample
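The individual amounts from this example were not preserved in the transcript. The sketch below uses hypothetical amounts, chosen only so that they reproduce the quoted mean (£146) and median (£3); they are an assumption, not the original data.

```python
from statistics import mean, median

# HYPOTHETICAL amounts in pounds for 7 students (not the original data):
# six students carry small change, one carries £1000.
amounts = [0, 1, 2, 3, 5, 11, 1000]

mean_amount = mean(amounts)      # pulled far upwards by the single outlier
median_amount = median(amounts)  # the middle (4th) value after ranking

print(f"mean = £{mean_amount}, median = £{median_amount}")
```

Six of the seven students have far less than the mean, so the median describes the typical student much better here.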

Measuring spread in non-normal data • Quartiles (25th and 75th percentiles) are a nonparametric measure of spread • First quartile (Q1) = lower quartile = cuts off the lowest 25% of data • Third quartile (Q3) = upper quartile = cuts off the lowest 75% of data [Figure: distribution with q1, median and q3 marked]
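As a sketch of how quartiles might be computed, the example below applies Python's standard `statistics.quantiles` to the data set from the median slide. The `method="inclusive"` choice is an assumption made for this illustration; tools differ slightly in how they interpolate quartiles, so other software may report somewhat different values.

```python
from statistics import quantiles

# Data set reused from the median slide.
data = [0, 1, 2, 2, 2, 3, 4, 10, 27]

# n=4 splits the ranked data into four parts and returns the three cut points.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")

# The interquartile range (Q3 - Q1) is a spread measure that ignores the tails.
iqr = q3 - q1

print(q1, q2, q3, iqr)
```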

Sampling and experiments

Bag contains 1000 balls • They are either red or black • How can we estimate what proportion is red and what proportion is black without looking at all the balls in the bag ?

Sampling • Most experiments involve taking samples from a much larger “population” of data • 20 people asked to rate a website • An algorithm run 10 times to benchmark speed • A measure of quality of service on 10 consecutive days from a network • We want to assume that our sample is representative of the larger population
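Returning to the bag of 1000 balls, the idea of estimating a proportion from a sample can be sketched as a small simulation. The true split (300 red, 700 black), the sample size of 50 and the seed are all assumptions made up for this illustration.

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable

# HYPOTHETICAL bag: in a real experiment the true split is unknown.
bag = ["red"] * 300 + ["black"] * 700

# Draw 50 balls without replacement and count the red ones.
sample = rng.sample(bag, 50)
estimated_red = sample.count("red") / len(sample)

print(f"estimated proportion of red: {estimated_red:.2f} (true value here: 0.30)")
```

Changing the seed gives a different sample and a slightly different estimate, which is exactly the sampling variation the following slides discuss.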

Sampling • Imagine throwing a die 600 times… • We know what the distribution of outcomes should be in theory • Assume we throw the die 30 times • We might take ‘good’ samples [Figure: frequency of each face, 1 to 6]

Sampling • Imagine throwing a die 600 times… • We know what the distribution of outcomes should be in theory • Assume we throw the die 30 times • We might take ‘good’ samples • Or we might be ‘unlucky’ with our samples [Figure: frequency of each face, 1 to 6]

Sampling • Now imagine we have a weighted die… • We make 30 throws • The results look a lot like the ‘unlucky’ results from our previous sample… • How can we tell whether the die is really different or whether we were just unlucky during our sampling? • (In most experiments we don’t know what the underlying distribution actually is) [Figures: frequency of each face, 1 to 6, for the fair die and for the weighted die]
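The die experiments can be simulated to see how noisy 30 throws are compared with 600. This is only a sketch: the seed and the weighting of the loaded die (a six three times as likely as any other face) are assumptions invented for the illustration.

```python
import random

rng = random.Random(7)  # fixed seed so repeated runs give the same throws
FACES = [1, 2, 3, 4, 5, 6]

def face_counts(n_throws, weights=None):
    """Throw a die n_throws times and count how often each face comes up."""
    throws = rng.choices(FACES, weights=weights, k=n_throws)
    return [throws.count(face) for face in FACES]

fair_30 = face_counts(30)     # small sample: counts can look very uneven
fair_600 = face_counts(600)   # large sample: counts settle near 100 each

# HYPOTHETICAL weighting: a six is three times as likely as any other face.
weighted_30 = face_counts(30, weights=[1, 1, 1, 1, 1, 3])

print("fair, 30 throws:     ", fair_30)
print("fair, 600 throws:    ", fair_600)
print("weighted, 30 throws: ", weighted_30)
```

With only 30 throws, the fair and weighted results can be hard to tell apart by eye, which is why a statistical test is needed.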

Statistical Tests – Student’s t-test • The t-test asks how likely a difference in means at least as large as the one observed would be if both sets of data came from the same underlying distribution; that probability is the p-value • If the p-value is very small (< 5%), we conclude that the samples come from DIFFERENT distributions • We can then say that one experiment gives significantly different results from the other • But… • If the p-value is > 5%, we cannot rule out that both samples came from the same distribution • Any differences in mean or standard deviation could be due to random sampling alone • There is no significant difference between the samples • Excel: =TTEST(Range1, Range2, tails, type) • Range1: first set of data • Range2: second set of data • tails: set this to 2 (a two-tailed test) • type: set this to 2 (an unpaired t-test assuming equal variances) • Also available in Graphpad
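To show what a two-tailed, unpaired, equal-variance t-test computes, here is a hand-rolled sketch using only the Python standard library. It is not Excel's implementation: the function names are invented, the p-value is obtained by numerically integrating the t density with Simpson's rule, and the run times are hypothetical.

```python
import math
from statistics import mean, variance

def pooled_t_statistic(a, b):
    """Return (t, degrees of freedom) for an unpaired, equal-variance t-test."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / math.sqrt(pooled_var * (1 / na + 1 / nb))
    return t, df

def t_pdf(x, df):
    """Probability density of Student's t distribution with df degrees of freedom."""
    coeff = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coeff * (1 + x * x / df) ** (-(df + 1) / 2)

def two_tailed_p(t, df, steps=2000):
    """Two-tailed p-value: 1 minus twice the area under the density from 0 to |t|."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = total * h / 3
    return max(0.0, 1 - 2 * central_area)

# HYPOTHETICAL run times in seconds, invented for this illustration.
mary = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3]
john = [12.6, 12.2, 12.9, 12.4, 12.7, 12.5]

t, df = pooled_t_statistic(mary, john)
p = two_tailed_p(t, df)
print(f"t = {t:.3f}, df = {df}, two-tailed p = {p:.4f}")
```

In practice a spreadsheet or statistics package would be used instead; the sketch is only meant to make the steps behind the p-value visible.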

Statistical Tests – Student’s t-test • Mary and John each write an algorithm to sort a large database. Mary claims hers is faster than John’s. • They each run their algorithm 20 times on the same machine and record the results and some descriptive statistics. • John claims she is wrong: his algorithm is definitely faster • Is he right? • Two-tailed p-value = 0.25 • If the two algorithms really performed the same, a difference at least this large would appear in about 25% of experiments • So the difference in results can plausibly be put down to random variation in sampling • There is no statistically significant difference in performance between John’s and Mary’s algorithms

Another Example • Mary and John both roll a die many times and record the mean score. • Mary claims that John’s die is biased • Is she right? • Two-tailed p-value = 0.00002 • If the two dice behaved the same, a difference at least this large would appear only about 0.002% of the time • Therefore the difference in results is statistically significant • We can conclude that John’s die behaves differently from Mary’s

Some words of caution… • Strictly speaking, the t-test should only be used if the underlying data distribution is normal • If you don’t think it is, there are similar nonparametric tests you can use: • Wilcoxon signed-rank test (paired data) • Wilcoxon rank-sum test (unpaired data)

Some more tests • For some experiments, we might have a hypothesis: • Students have no preference as to which of 3 browsers they use when they go into the JKCC • From the hypothesis, we can calculate what we would expect to find in an experiment if the hypothesis were true • A researcher goes into the lab and records which of 3 browsers is being used by each of 60 students • We would expect to see 20 students using each browser • He records the actual results observed

CHITEST • The CHITEST asks: • What is the probability of finding the observed results if the hypothesis were true? • It generates a number called the p-value • If p < 0.05, we REJECT the hypothesis • If p > 0.05, we ACCEPT the hypothesis (strictly, we have no grounds to reject it) • In this case, if p < 0.05, then we reject the hypothesis that students have no preference for browsers (i.e. they do have a preference!) • In EXCEL: =CHITEST(actualValues, expectedValues)

Chi test • Hypothesis: students have no preference as to which browser they use when they go into the JKCC • The two-tailed P value equals 0.1423 • By conventional criteria, this difference is considered to be not statistically significant • P > 0.05, so we ACCEPT the hypothesis • If students really had no preference, results at least this far from the expected counts would occur about 14% of the time • This is NOT statistically significant (the value would need to be < 0.05), so we have no grounds to reject the theory that students have NO preference as to which browser they use
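The observed browser counts were not preserved in the transcript. The sketch below uses hypothetical counts (27, 18 and 15 out of 60), chosen only because they are consistent with the quoted P value of 0.1423; they are an assumption, not the researcher's data. With exactly 3 categories there are 2 degrees of freedom, and in that special case the chi-square p-value has the simple closed form exp(-χ²/2), so no statistics library is needed.

```python
import math

def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def p_value_three_categories(chi2):
    """Upper-tail chi-square probability for 2 degrees of freedom (3 categories only)."""
    return math.exp(-chi2 / 2)

# HYPOTHETICAL observed counts for the 3 browsers (not the original data).
observed = [27, 18, 15]
# Expected counts if the 60 students had no preference.
expected = [20, 20, 20]

chi2 = chi_square_statistic(observed, expected)
p = p_value_three_categories(chi2)

print(f"chi-square = {chi2:.2f}, p = {p:.4f}")
```

For a different number of categories the closed form above does not apply, and a spreadsheet function or statistics package should be used instead.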

Linear Regression • Sometimes you want to find correlations between two variables: • QualityOfS & SizeOfNetwork • LinesOfCode & SpeedOfExecution • Age & TimeSpentOnSocialMedia • Show trends • Use to predict future values

Understanding the Graph • Equation of the line: y = mx + c, where x is the variable on the x axis (independent variable), y is the variable on the y axis (dependent variable), m is the slope of the line and c is the y-intercept • Fitted line: mark = 0.6958 × attendance + 4.9333 • R² = 0.98134, a measure of quality of fit (maximum = 1) • Prediction: what mark would a student who attended 75% of the time get? Mark = 0.6958 × 75 + 4.9333 ≈ 57.1
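The fit and the prediction can be sketched in a few lines of Python. The slope and intercept in `predict_mark` are the ones quoted on the slide; the function names and the small demonstration data set are assumptions for this illustration, since the original attendance data is not in the transcript.

```python
def fit_line(xs, ys):
    """Least-squares slope, intercept and R-squared for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)  # 1.0 means a perfect straight-line fit
    return slope, intercept, r_squared

def predict_mark(attendance, slope=0.6958, intercept=4.9333):
    """Predicted mark from attendance (%), using the equation quoted on the slide."""
    return slope * attendance + intercept

# HYPOTHETICAL points lying exactly on y = 2x + 1, used only to exercise fit_line.
demo_slope, demo_intercept, demo_r2 = fit_line([1, 2, 3, 4], [3, 5, 7, 9])

predicted = predict_mark(75)
print(f"predicted mark at 75% attendance: {predicted:.1f}")
```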

Doing this in Excel • Create a scatterplot of the data • Make sure it is in columns, with the independent variable (x) first • Chart Layout: • Add linear trendline • Trendline Options: tick “Display Equation on chart” and “Display R-squared value on chart”

And finally: presenting your results

Chart Junk “The interior decoration of graphics generates a lot of ink that does not tell the viewer anything new. The purpose of decoration varies — to make the graphic appear more scientific and precise, to enliven the display, to give the designer an opportunity to exercise artistic skills. Regardless of its cause, it is all non-data-ink or redundant data-ink, and it is often chartjunk.” “Chartjunk can turn bores into disasters, but it can never rescue a thin data set.”

Examples of chart junk

Much better!

Other bad examples

Too much information! • Too much info

SUMMARY • Remember you need to use statistics to properly analyse your work • Make sure you use the right statistic • Make sure you present your data/statistics well • Don’t lie with statistics!

Dropbox links to slides and a workbook: http://bit.ly/MrW1K4 and http://bit.ly/1fanfQ3
