Statistics for Science Journalists

Presentation Transcript


  1. Statistics for Science Journalists Steve Doig Cronkite School of Journalism

  2. Common Research Methods • Randomized experiments: Measure deliberate manipulation of the environment • Observational studies: Measure the differences that occur naturally • Meta-analyses: Quantitative review of multiple studies • Case Study: Descriptive in-depth examination of one or a few individuals

  3. Simple Measures... ...don’t exist!

  4. Measurement Variability • Variable measurements include unpredictable errors or discrepancies that aren’t easily explained. • Natural variability is the result of the fact that individuals and other things are different.

  5. Reasons for variable measures • Measurement error • Natural variability between individuals • Natural variability over time in a single individual Statistics are tools to help us work with measurements that vary

  6. Some Pitfalls in Studies

  7. Deliberate Bias? If you found a wallet with $20, would you: • “Keep it?” (23% would keep it) • “Do the honest thing and return it?” (13% would keep it)

  8. Unintentional Bias? • “Do you use drugs?” • “Are you religious?”

  9. Desire to Please? People routinely say they have voted when they actually haven’t, that they don’t smoke when they do, and that they aren’t prejudiced. One study six months after an election: • 96% of actual voters said they voted. • 40% of non-voters said they voted.

  10. Asking the uninformed? Washington Post poll: “Some people say the 1975 Public Affairs Act should be repealed. Do you agree or disagree that it should be repealed?” • 24% said yes • 19% said no • rest had no opinion

  11. Asking the uninformed? Later Washington Post poll: “President Clinton says the 1975 Public Affairs Act should be repealed. Do you agree or disagree that it should be repealed?” • 36% of Democrats agreed • 16% of Republicans agreed • rest had no opinion

  12. Unnecessary Complexity? • “Do you support our soldiers in Iraq so that terrorists won’t strike the U.S. again?”

  13. Question Order • “About how many times a month do you normally go out on a date?” • “How happy are you with life in general?”

  14. Sampling

  15. Margin of Error 95% of the time, a random sample’s characteristics will differ from the population’s by no more than about 1/√N, where N = sample size

  16. Two Important Concepts about Error Margin • The larger the sample, the smaller the margin of sampling error. • The size of the population being surveyed doesn’t matter.* *Unless the sample is a significant fraction of the population.

  17. Sampling realities • Bigger sample means more cost (money and/or time) • Diminishing return on error margin improvement as sample increases. • N=100: +/- 10 percentage points • N=400: +/- 5 percentage points • N=900: +/- 3.3 percentage points • Sample needs only to be large enough to give a reasonable answer. • Sampling error affects subsamples, too.
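
A quick way to sanity-check those numbers is to compute the approximate 95% margin of error, 1/√N, directly. The sketch below (Python, not part of the original slides) reproduces the figures quoted above for the same sample sizes.

```python
import math

def margin_of_error(n):
    """Approximate 95% margin of sampling error, in percentage points, for sample size n."""
    return 100 / math.sqrt(n)

for n in (100, 400, 900):
    print(f"N={n}: +/- {margin_of_error(n):.1f} percentage points")
# N=100: +/- 10.0, N=400: +/- 5.0, N=900: +/- 3.3
```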

  18. Describing data sets

  19. Three Useful Features of a Set of Data • The Center • The Variability • The Shape

  20. The Center • Mean (average): Total of the values, divided by the number of values • Median: The middle value of an ordered list of values • Mode: The most common value • Outliers: Atypical values far from the center

  21. Example: Baseball Salaries • Average: $2,827,104 • Median: $950,000 • Mode: $327,000 (also the minimum) • Outlier: $21.7 million (Alex Rodriguez of the NY Yankees)
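
To see how an outlier pulls the mean away from the median, here is a minimal sketch using Python’s statistics module on a made-up salary list (the numbers are illustrative, not the actual baseball payroll data from the slide).

```python
from statistics import mean, median, mode

# Hypothetical salaries in dollars; the last value is an outlier
salaries = [327_000, 327_000, 500_000, 950_000, 2_100_000, 4_800_000, 21_700_000]

print(f"Mean:   ${mean(salaries):,.0f}")    # pulled upward by the outlier
print(f"Median: ${median(salaries):,.0f}")  # middle value, resistant to outliers
print(f"Mode:   ${mode(salaries):,.0f}")    # most common value
```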

  22. The Variability Some measures of variability: • Maximum and minimum: Largest and smallest values • Range: The distance between the largest and smallest values • Quartiles: The medians of each half of the ordered list of values • Standard deviation: Think of it as the average distance of all the values from the mean.
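
The same module covers the variability measures listed here. A minimal sketch on made-up data (statistics.quantiles and statistics.pstdev are in the Python standard library from 3.8 on):

```python
from statistics import pstdev, quantiles

values = [4, 7, 7, 8, 9, 10, 12, 13, 15, 21]   # made-up measurements

print("Minimum:", min(values))
print("Maximum:", max(values))
print("Range:  ", max(values) - min(values))
print("Quartiles (Q1, median, Q3):", quantiles(values, n=4))
print("Standard deviation:", round(pstdev(values), 2))  # roughly the typical distance from the mean
```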

  23. What is “normal”? • Don’t consider the average to be “normal” • Variability is normal • Anything within about 3 standard deviations of the mean is “normal”

  24. Bell-Shaped “Normal” Curve

  25. Some Characteristics of a Normal Distribution • Symmetrical (not skewed) • One peak in the middle, at the mean • The wider the curve, the greater the standard deviation • Area under the curve is 1 (or 100%)

  26. Percentiles Your percentile for a particular measure (like height or IQ) is the percentage of the population that falls below you. Compared to other American males: • My height (5’ 11”): 75th percentile • My weight (230 lbs.): 85th percentile • My age (63): 86th percentile Therefore, I am older and heavier than I am tall.
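
If you have the comparison data in hand, a percentile rank is just the share of values that fall below yours. A minimal sketch with hypothetical numbers (the heights here are invented, not the population data behind the slide):

```python
def percentile_rank(value, population):
    """Percentage of the population that falls below the given value."""
    below = sum(1 for x in population if x < value)
    return 100 * below / len(population)

# Hypothetical heights in inches for a small comparison group
heights = [64, 66, 67, 68, 69, 70, 70, 71, 72, 74]
print(f"{percentile_rank(71, heights):.0f}th percentile")  # 7 of 10 values are below 71
```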

  27. Standardized Scores A standardized score (also called the z-score) is simply the number of standard deviations a particular value is either above or below the mean. The standardized score is: • Positive if above the mean • Negative if below the mean Useful for defining data points as outliers.

  28. The Empirical Rule For any normal curve, approximately: • 68% of values within one StdDev of the mean • 95% of values within two StdDevs of the mean • 99.7% of values within three StdDevs of the mean

  29. Outlier • A value that is more than three standard deviations above or below the mean.
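
Putting slides 27–29 together: the standardized score is (value − mean) / standard deviation, and a value more than 3 standard deviations out gets flagged as an outlier. A minimal sketch on made-up data:

```python
from statistics import mean, pstdev

# Hypothetical measurements; the last value looks suspicious
data = [57, 58, 58, 59, 59, 60, 60, 60, 61, 61,
        61, 62, 62, 62, 63, 63, 64, 64, 65, 95]

m, sd = mean(data), pstdev(data)

for x in data:
    z = (x - m) / sd    # standardized (z-) score
    if abs(z) > 3:      # more than 3 standard deviations from the mean
        print(f"{x} is an outlier (z = {z:+.1f})")
```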

  30. Correlation

  31. Strength of Relationship Correlation (also called the correlation coefficient or Pearson’s r) is the measure of strength of the linear relationship between two variables. Think of strength as how closely the data points come to falling on a line drawn through the data.

  32. Features of Correlation • Correlation can range from +1 to -1 • Positive correlation: As one variable increases, the other increases • Negative correlation: As one variable increases, the other decreases • Zero correlation means the best line through the data is horizontal • Correlation isn’t affected by the units of measurement
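
Pearson’s r is the covariance of the two variables divided by the product of their standard deviations. A minimal sketch with an invented hours-studied vs. exam-score example (Python 3.10+ also ships statistics.correlation, which does the same job):

```python
from statistics import mean, pstdev

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (pstdev(x) * pstdev(y))

hours  = [1, 2, 3, 4, 5, 6, 7, 8]          # explanatory variable
scores = [52, 55, 61, 60, 68, 70, 75, 80]  # response variable
print(round(pearson_r(hours, scores), 2))  # close to +1: strong positive relationship
```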

  33. Positive Correlations [scatter plots with r = +.1, +.4, +.8, and +1]

  34. Negative Correlations [scatter plots with r = -.1, -.4, -.8, and -1]

  35. Zero Correlation [two scatter plots, each with r = 0]

  36. Number of Points Doesn’t Matter [two scatter plots, both with r = .8]

  37. Important! Correlation does not imply causation. (Churches and liquor stores, shoe size and reading ability)

  38. Correlation of variables • When considering relationships between measurement variables, there are two kinds: • Explanatory (or independent) variable: The variable that attempts to explain or is purported to cause (at least partially) differences in the… • Response (or dependent or outcome) variable • Often, chronology is a guide to distinguishing them (examples: baldness and heart attacks, poverty and test scores)

  39. Some reasons why two variables could be related • The explanatory variable is the direct cause of the response variable Example: pollen counts and percent of population suffering allergies, intercourse and babies

  40. Some reasons two variables could be related • The response variable is causing a change in the explanatory variable Example: hotel occupancy and advertising spending, divorce and alcohol abuse

  41. Some reasons two variables could be related • The explanatory variable is a contributing -- but not sole -- cause Example: birth complications and violence, gun in home and homicide, hours studied and grade, diet and cancer

  42. Some reasons two variables could be related • Confounding variables may exist Example: happiness and heart disease, traffic deaths and speed limits

  43. Some reasons two variables could be related • Both variables may result from a common cause Example: SAT score and GPA, hot chocolate and tissues, storks and babies, fire losses and firefighters, WWII fighter opposition and bombing accuracy

  44. Some reasons two variables could be related • Both variables are changing over time Example: divorces and drug offenses, divorces and suicides

  45. Some reasons two variables could be related • The association may be nothing more than coincidence Example: clusters of disease, brain cancer from cell phones

  46. So how can we confirm causation? The only way to confirm is with a designed (randomized double-blind) experiment. But non-statistical evidence of a possible connection may include: • A reasonable explanation of cause and effect. • A connection that happens under varying conditions. • Potential confounding variables ruled out.

  47. Regression

  48. Linear Regression In addition to figuring the strength of the relationship, we can create a simple equation that describes the best-fit line (also called the “least-squares” line) through the data. This equation will help us predict one variable, given the other.
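
A minimal sketch of fitting that best-fit line by hand, using the same invented hours/scores data as above: the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the means (Python 3.10+ offers statistics.linear_regression for the same job).

```python
from statistics import mean

def least_squares(x, y):
    """Intercept a and slope b of the best-fit line y = a + b*x."""
    mx, my = mean(x), mean(y)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
    a = my - b * mx
    return a, b

hours  = [1, 2, 3, 4, 5, 6, 7, 8]
scores = [52, 55, 61, 60, 68, 70, 75, 80]

a, b = least_squares(hours, scores)
print(f"score ~ {a:.1f} + {b:.1f} * hours")             # the prediction equation
print(f"Predicted score after 5 hours: {a + b*5:.0f}")  # plug in a value of the explanatory variable
```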

  49. Best-fit (“least-squares”) Line
