HOW TO USE STATISTICS IN YOUR RESEARCH

idalia

Presentation Transcript


  1. HOW TO USE STATISTICS IN YOUR RESEARCH: LIES, DAMNED LIES AND STATISTICS!

  2. What we will cover: the WHY, and the HOW (Graphpad, EXCEL) • Why do statistics • Descriptive Statistics • Distributions • Sampling & Hypotheses • Presenting Results • Chart junk

  3. Why do you need statistics ? • Why are you doing a research project! • Also very important in everyday life • Measure things • Examine relationships • Make predictions • Test hypotheses • Explore issues • Explain activities or attitudes • Make comparisons • Draw conclusions based on samples • Develop new theories • …

  4. Misuse of statistics by design • Ignoring some ‘inconvenient data points’ • Focusing on certain variables and excluding others • Altering scales to present your data in a more positive way • Presenting correlation as causation

  7. Misuse is generally accidental • Bias • Need to be particularly careful in ‘questionnaire’-type research • Also when sampling • Using the wrong statistical tests • Making incorrect inferences • In going from your sample to the general case • Incorrectly drawing conclusions based on correlations

  8. Descriptive Statistics • Used to describe or summarise what your data shows • Not used to draw any conclusions that extend beyond your own data • Mean • Median • Mode • Variance • Standard Deviation

  9. Mean (Average) • Imagine you have collected some data • From running an algorithm on a problem • By measuring execution time • By asking opinions • You want to summarise your data • Don’t present all the results • mean {-30, 1, 2, 3, 4} = -4 • mean {0, 1, 2, 3, 4} = 2 • Measures centrality • Excel: =AVERAGE(A1:A10) • Can also be done in Graphpad

  10. The mean is not the whole story... [Figure: results of Emma’s Algorithm and Malcolm’s Algorithm]

  11. Standard Deviation • Standard Deviation measures something about the spread of your data • Important as it gives you some indication of reliability or variability of your results • sd {-30, 1, 2, 3, 4} = 14.6 • sd {0, 1, 2, 3, 4} = 1.6 • Measures spread • Excel: =STDEV(A1:A10)
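
As a cross-check outside Excel, the two example sets from the mean and standard deviation slides can be run through Python's standard `statistics` module. This is only a minimal sketch: the variable names are illustrative, and `stdev` is the sample standard deviation (dividing by n - 1), which is the convention the slide's figures appear to follow.

```python
from statistics import mean, stdev

# The two example data sets quoted on the slides.
with_outlier = [-30, 1, 2, 3, 4]
without_outlier = [0, 1, 2, 3, 4]

for data in (with_outlier, without_outlier):
    # mean summarises centrality, stdev summarises spread
    print(data, "mean =", mean(data), "sd =", round(stdev(data), 1))
```

A single extreme value (-30) moves the mean from 2 to -4 and inflates the standard deviation roughly ninefold.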

  12. The mean is not the whole story... [Figure: Emma’s Algorithm (STD: 0.71) and Malcolm’s Algorithm (STD: 28.07)]

  13. True or False ? The majority of Scots have more than the average number of legs

  14. TRUE! Most Scots have more than the average number of legs! • (None have 3 legs) • Most have 2 legs • Some have 1 leg • Some have 0 legs • The average < 2 (~1.9) • The mean is not a relevant measure!
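
The reasoning on this slide can be checked with a small made-up population. The counts below are hypothetical, chosen only for illustration (they are not real figures for Scotland, so the average differs from the slide's rough value), but any population where a few people have fewer than 2 legs and nobody has more gives the same result.

```python
from statistics import mean

# Hypothetical population of 1000 people (illustrative counts only).
legs = [2] * 990 + [1] * 8 + [0] * 2

average = mean(legs)
# Share of people whose leg count is strictly above the mean.
above_average = sum(1 for n in legs if n > average) / len(legs)

print("average number of legs:", average)
print("share with more than the average:", above_average)
```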

  15. When can I use the mean? • The data that you are sampling should follow a normal distribution • Most values are close to the mean, and a few lie at either extreme • About 68% of values lie within 1 SD of the mean • About 95% of values lie within 2 SD • In practice, a lot of data does follow this kind of distribution
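
The 68% and 95% figures follow from the shape of the normal distribution. A short sketch using `statistics.NormalDist` from the Python standard library shows where they come from; the helper name `share_within` is made up for this example.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

print("within 1 SD:", round(share_within(1), 3))
print("within 2 SD:", round(share_within(2), 3))
```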

  16. But not all data has a normal distribution • In a skewed distribution the majority of the data is < m (the mean) • More than half the population has less than the mean value • More than half the population is “below average”! [Figure: skewed distribution with m - sd, m and m + sd marked]

  17. The Median • Median: the item with the middle rank • Rank the items in order, and pick the middle one • median {-30, 1, 2, 3, 4} = 2 • median {0, 1, 2, 2, 2, 3, 4, 10, 27} = 2 • EXCEL: =MEDIAN(A1:A10)
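
The "rank and pick the middle one" recipe can be written out by hand. This is a minimal sketch; the slide does not say what to do with an even number of items, so averaging the two middle values is an assumption here (it is the usual convention, and what Excel's MEDIAN does).

```python
def median(values):
    """Rank the items and return the middle one."""
    ranked = sorted(values)
    middle = len(ranked) // 2
    if len(ranked) % 2 == 1:
        return ranked[middle]
    # Even count: assume the usual convention of averaging the two middle items.
    return (ranked[middle - 1] + ranked[middle]) / 2

print(median([-30, 1, 2, 3, 4]))
print(median([0, 1, 2, 2, 2, 3, 4, 10, 27]))
```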

  18. Example: Mean vs Median • Suppose we ask 7 students how much money they have on them • Mean: £146 • Median: £3 • The median is much less affected by outliers in the data • It is more representative of the sample
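
The individual amounts are not given in this transcript. The list below is hypothetical, invented so that it reproduces the quoted mean of £146 and median of £3, and it shows how one student carrying a lot of cash is enough to produce that gap.

```python
from statistics import mean, median

# Hypothetical amounts in pounds (NOT the data from the slide),
# chosen so that the mean is 146 and the median is 3.
amounts = [0, 1, 2, 3, 5, 11, 1000]

print("mean:", mean(amounts))
print("median:", median(amounts))
# Dropping the single outlier brings the mean close to the median.
print("mean without the outlier:", round(mean(amounts[:-1]), 2))
```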

  19. Measuring Spread in non-normal data • Quartiles (25th and 75th percentiles) are a nonparametric measure of spread • First quartile (Q1) = lower quartile = cuts off the lowest 25% of data • Third quartile (Q3) = upper quartile = cuts off the lowest 75% of data [Figure: distribution with q1, med and q3 marked]
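
Quartiles can be computed with `statistics.quantiles` from the Python standard library. The sketch below reuses one of the example sets from the median slide. Packages differ slightly in how they interpolate quartiles, so Excel or Graphpad may report slightly different values for Q1 and Q3 on small samples.

```python
from statistics import quantiles

data = [0, 1, 2, 2, 2, 3, 4, 10, 27]

# n=4 splits the data into four equal-probability groups,
# returning the three cut points Q1, Q2 (the median) and Q3.
q1, q2, q3 = quantiles(data, n=4)

print("Q1 =", q1, "median =", q2, "Q3 =", q3)
print("interquartile range =", q3 - q1)
```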

  20. Sampling and experiments

  21. Bag contains 1000 balls • They are either red or black • How can we estimate what proportion is red and what proportion is black without looking at all the balls in the bag ?

  22. Sampling • Most experiments involve taking samples from a much larger “population” of data • 20 people asked to rate a website • An algorithm run 10 times to benchmark speed • A measure of quality of service on 10 consecutive days from a network • We want to assume that our sample is representative of the larger population
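
The bag-of-balls question can be tried out in simulation. In this sketch the bag's true make-up (300 red, 700 black) is a made-up assumption, as are the seed and the sample sizes. In a real experiment the true proportion is exactly what is not known.

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical bag: 300 red and 700 black balls.
bag = ["red"] * 300 + ["black"] * 700

def estimate_red(sample_size):
    """Draw balls without replacement and return the observed share of red."""
    sample = rng.sample(bag, sample_size)
    return sample.count("red") / sample_size

for size in (10, 50, 200):
    print("sample of", size, "-> estimated share of red:", estimate_red(size))
```

Larger samples tend to land closer to the true proportion, but any single sample can still be off.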

  23. Sampling • Imagine throwing a die 600 times… • We know what the distribution of outcomes should be theoretically • Now assume we throw the die only 30 times • We might take ‘good’ samples [Figure: frequency of each face, 1 to 6]

  25. Sampling • Imagine throwing a die 600 times… • We know what the distribution of outcomes should be theoretically • Now assume we throw the die only 30 times • We might take ‘good’ samples • Or we might be ‘unlucky’ with our samples [Figure: frequency of each face, 1 to 6]
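
The difference between ‘good’ and ‘unlucky’ samples is easy to see by simulating a fair die. This is an illustrative sketch: the function name and the seeds are arbitrary, and the exact counts depend on the random number generator.

```python
import random
from collections import Counter

def throw_counts(n_throws, seed):
    """Throw a fair six-sided die n_throws times; return counts for faces 1..6."""
    rng = random.Random(seed)
    counts = Counter(rng.randint(1, 6) for _ in range(n_throws))
    return [counts.get(face, 0) for face in range(1, 7)]

# Small samples: some runs look even, others look lopsided.
for seed in range(3):
    print("30 throws: ", throw_counts(30, seed))

# A larger sample sits much closer to the theoretical 100 per face.
print("600 throws:", throw_counts(600, 0))
```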

  26. Sampling • Now imagine we have a weighted die… • We make 30 throws • The results look a lot like the ‘unlucky’ results from our previous sample… • How can we tell whether the die is really different or whether we were just unlucky during our sampling? • (In most experiments we don’t know what the underlying distribution actually is) [Figures: frequency of each face, 1 to 6, for the two sets of throws]

  28. Statistical Tests – Student’s t-test • The t-test gives a p-value: the probability of seeing a difference in means at least this large if the two sets of data came from the same underlying distribution • If the p-value is very small (< 5%) then we conclude the samples come from DIFFERENT distributions • We can then say the difference between the two experiments is statistically significant • But… • If > 5%, we cannot reject the idea that both samples came from the same distribution • Any differences in mean or standard deviation may be due only to random sampling • There is no significant difference between the samples • Excel: =TTEST(Range1, Range2, tails, type) • Range1 – first set of data • Range2 – second set of data • tails: set this to 2 (a two-tailed test) • type: set this to 2 (an unpaired t-test assuming equal variances) • Can also be done in Graphpad
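
For readers who want to see what sits behind a call like TTEST(..., 2, 2), here is a minimal sketch of an unpaired, equal-variance, two-tailed t-test using only the Python standard library. The function names and the sample data are hypothetical. The p-value is obtained by numerically integrating the t distribution with Simpson's rule, which is a simple stand-in chosen for this example, not necessarily how Excel or Graphpad compute it.

```python
import math
from statistics import mean, variance

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    return math.exp(log_c) / math.sqrt(df * math.pi) * (1 + x * x / df) ** (-(df + 1) / 2)

def two_tailed_p(t, df, steps=20000):
    """Area of the t distribution beyond -|t| and +|t| (steps must be even)."""
    limit = abs(t)
    if limit == 0:
        return 1.0
    h = limit / steps
    # Simpson's rule on [0, |t|]; the density is symmetric, so double it.
    total = t_pdf(0, df) + t_pdf(limit, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = 2 * total * h / 3
    return max(0.0, 1.0 - central_area)

def unpaired_t_test(a, b):
    """Pooled-variance two-sample t-test; returns (t statistic, two-tailed p)."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))
    return t, two_tailed_p(t, df)

# Hypothetical run times in seconds, for illustration only.
mary = [1, 2, 3, 4, 5]
john = [2, 3, 4, 5, 6]
t, p = unpaired_t_test(mary, john)
print("t =", round(t, 3), "p =", round(p, 3))
print("significant at 5%" if p < 0.05 else "no significant difference at 5%")
```

With samples this small and this overlapping, the difference in means is well within what random sampling alone would produce.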

  29. Statistical Tests – Student’s t-test • Mary and John each write an algorithm to sort a large database. Mary claims hers is faster than John’s. • They each run their algorithms 20 times on the same machine and record the results and some descriptive statistics. • John claims she is wrong – his algorithm is definitely faster • Is he right? • Two-tailed p value = 0.25 • If both samples came from the same distribution, a difference this large would turn up 25% of the time • So the difference in results could easily be down to random variation in sampling • There is no statistically significant difference in performance between John’s and Mary’s algorithms

  30. Another Example • Mary and John both roll a die many times and record the mean score. • Mary claims that John’s die is biased • Is she right? • Two-tailed p value = 0.00002 • If both samples came from the same distribution, a difference this large would turn up only 0.002% of the time • Therefore the difference in results is statistically significant • We can conclude that John’s die is different to Mary’s

  31. Some words of caution… • Strictly speaking, the t-test should only be used if the underlying data distribution is normal • If you don’t think it is, there are similar tests you can use: • Wilcoxon • RankSum

  32. Some more tests • For some experiments, we might have a hypothesis: • Students have no preference as to which of 3 browsers they use when they go in the JKCC • From the hypothesis, we can calculate what we would expect to find in an experiment if the hypothesis was true • A researcher goes into the lab and records which of 3 browsers is being used by 60 students • Would expect to see 20 students using each browser • He records the actual results observed

  33. CHITEST • The CHITEST asks: • What is the probability of finding results at least this far from the expected ones if the hypothesis was true? • It generates a number called the p-value • If p < 0.05, we REJECT the hypothesis • If p > 0.05, we cannot reject the hypothesis, so we ACCEPT it for now • In this case, if p < 0.05, then we reject the hypothesis that students have no preference for browsers (i.e. they do have a preference!) • In EXCEL: =CHITEST(actualValues, expectedValues)

  34. Chi test • Hypothesis: students have no preference as to which browser they use when they go in the JKCC • The two-tailed P value equals 0.1423 • By conventional criteria, this difference is considered to be not statistically significant • P > 0.05 so we ACCEPT the hypothesis • If students had no preference, results at least this uneven would turn up about 14% of the time • This is NOT statistically significant (the value must be < 0.05 to be significant), so we have to assume that students have NO preference as to which browser they use, i.e. the data is consistent with the theory
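
The calculation behind CHITEST can be sketched in a few lines of standard-library Python. The observed counts are not given in this transcript, so the counts below are hypothetical, chosen because they happen to give a p-value close to the 0.1423 quoted above. The function names are made up, and the p-value uses a textbook series for the incomplete gamma function, not a claim about Excel's internals.

```python
import math

def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def chi_square_p(statistic, df):
    """Probability of a chi-square value at least this large (upper tail)."""
    a, z = df / 2, statistic / 2
    if z <= 0:
        return 1.0
    # Series expansion of the lower incomplete gamma function.
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-16 * total:
        term *= z / (a + n)
        total += term
        n += 1
    lower = total * math.exp(a * math.log(z) - z - math.lgamma(a))
    return max(0.0, 1.0 - lower)

observed = [25, 22, 13]   # hypothetical counts for 60 students, 3 browsers
expected = [20, 20, 20]   # no preference: 60 / 3 per browser

stat = chi_square_statistic(observed, expected)
p = chi_square_p(stat, df=len(observed) - 1)
print("chi-square =", round(stat, 2), "p =", round(p, 4))
print("reject 'no preference'" if p < 0.05 else "cannot reject 'no preference'")
```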

  35. Linear Regression • Sometimes you want to find correlations between two variables: • QualityOfS & SizeOfNetwork • LinesOfCode & SpeedOfExecution • Age & TimeSpentOnSocialMedia • Show trends • Use to predict future values

  36. Understanding the Graph • Variable on x axis: independent variable • Variable on y axis: dependent variable • y = mx + c, where m is the slope of the line and c is the y-intercept • Fitted line: mark = 0.6958 × attendance + 4.9333 • R² = 0.98134: measure of quality of fit (maximum = 1) • Prediction: what mark would a student who attended 75% of the time get? • Mark = 0.6958 × 75 + 4.9333 ≈ 57.12
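
What the trendline does can be sketched as an ordinary least-squares fit in plain Python. The attendance and mark values below are hypothetical (the data behind the chart is not in this transcript), and `fit_line` is an illustrative helper, not part of any library. The last lines reuse the fitted equation quoted on the slide to reproduce the prediction.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares fit of y = m*x + c; returns (m, c, R squared)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # R squared = 1 - (residual sum of squares / total sum of squares)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot

# Hypothetical attendance (%) and marks, for illustration only.
attendance = [20, 40, 60, 80, 100]
marks = [18, 35, 45, 62, 74]
m, c, r2 = fit_line(attendance, marks)
print("mark =", round(m, 4), "* attendance +", round(c, 4), " R^2 =", round(r2, 4))

# Prediction using the fitted line quoted on the slide.
slide_prediction = 0.6958 * 75 + 4.9333
print("predicted mark at 75% attendance:", round(slide_prediction, 2))
```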

  37. Doing this in Excel • Scatterplot of data • Make sure it is in columns, with the independent variable (x) first • Chart Layout: • Add linear trendline • Trendline Options: choose to display the R² value and the equation on the chart

  38. And finally Presenting your results

  39. Chart Junk “The interior decoration of graphics generates a lot of ink that does not tell the viewer anything new. The purpose of decoration varies — to make the graphic appear more scientific and precise, to enliven the display, to give the designer an opportunity to exercise artistic skills. Regardless of its cause, it is all non-data-ink or redundant data-ink, and it is often chartjunk.” “Chartjunk can turn bores into disasters, but it can never rescue a thin data set.”

  40. Examples of chart junk

  41. Much better!

  42. Other bad examples

  43. Too much information! • Too much info

  44. SUMMARY • Remember you need to use statistics to properly analyse your work • Make sure you use the right statistic • Make sure you present your data/statistics well • Don’t lie with statistics!

  45. Dropbox links to slides and a workbook: http://bit.ly/MrW1K4 http://bit.ly/1fanfQ3
