
How to Fake Data if you must

How to Fake Data if you must. Rachel Fewster, Department of Statistics. Who wants to fake data? Electoral finance returns… Toxic emissions reports… Business tax returns… Land areas of world countries: real or fake?


Presentation Transcript


  1. How to Fake Data if you must Rachel Fewster Department of Statistics

  2. Who wants to fake data? • Electoral finance returns… • Toxic emissions reports… • Business tax returns…

  3. Land areas of world countries: real or fake?

  4. Land areas of world countries: real or fake? 1 2 3 4 5 6 7 8 9 IIIII III III I I II I

  5. Land areas of world countries: real or fake? 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7 8 9 IIIII III III I I II I I I III I IIII I II III

  6. Land areas of world countries: real or fake? [Two tally charts of first digits 1 to 9. First: IIIII III III I I II I. Second: I I III I IIII I II III.] First chart: This one is right! This one has as many 1s as 5-9s put together! Second chart: This one seems more even…

  7. Real land areas of world countries [Tally chart of first digits 1 to 9: IIIII III III I I II I] 11 of them begin with digits 1 – 4… Only 5 begin with digits 5 – 9…

  8. Friday’s Newspaper: 1 2 3 4 5 6 7 8 9 IIII IIII IIII III IIII II IIII II III 10 out of 34 numbers began with a 1… None out of 34 began with a 9!

  9. The Curious Case of the Grimy Log-books • In 1881, American astronomer Simon Newcomb noticed something funny about books of logarithm tables…

  10. The Curious Case of the Grimy Log-books The books always seemed grubby on the first pages… … but clean on the last pages. The first pages are for numbers beginning with digits 1 and 2… The last pages are for numbers beginning with digits 8 and 9…

  11. The Curious Case of the Grimy Log-books People seemed to look up numbers beginning with 1 and 2 more often than they looked up numbers beginning with 8 and 9. Why? Because numbers beginning with 1 and 2 are MORE COMMON than numbers beginning with 8 and 9!!

  12. Newcomb’s Law 30% of numbers begin with a 1 !! < 5% of numbers begin with a 9 !! American Journal of Mathematics, 1881

  13. The First Digits… Over 30% of numbers begin with a 1 Only 5% of numbers begin with a 9

  14. The First Digits… Numbers beginning with a 1 Numbers beginning with a 9 There is the same “opportunity” for numbers to begin with 9 as with 1 … but for some reason they don’t!

  15. Chance of a number starting with digit d: log10((d+1)/d). 0.301 = log10(2/1), 0.176 = log10(3/2), 0.125 = log10(4/3)
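
The three values on this slide follow the pattern log10((d+1)/d). As a minimal sketch, assuming that pattern holds for every first digit d from 1 to 9, the full table of chances can be computed as below. The function name `benford_probability` is ours for illustration and does not come from the slides.

```python
import math

def benford_probability(d):
    """Chance that a number starts with digit d (1 to 9) under the logarithmic law."""
    if not 1 <= d <= 9:
        raise ValueError("first digit must be between 1 and 9")
    return math.log10((d + 1) / d)

# Print the chance for each possible first digit, rounded as on the slide.
for d in range(1, 10):
    print(d, round(benford_probability(d), 3))

# The nine chances cover every possible first digit, so they add up to 1.
total = sum(benford_probability(d) for d in range(1, 10))
```

The sum is 1 because the nine logarithms telescope to log10(10/1).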

  16. Reactions to Newcomb’s law Nothing! …for 57 years!

  17. Enter Frank Benford: 1938 Physicist with the General Electric Company Assembled over 20,000 numbers and counted their first digits! ‘A study as wide as time and energy permitted.’

  18. • Populations • Numbers from newspapers • Drainage rates of rivers • Numbers from Reader’s Digest articles • Street addresses of American Men of Science

  19. About 30% begin with a 1 About 5% begin with a 9

  20. Anomalous numbers !! Benford gave the ‘law’ its name… …but no explanation.

  21. “…The logarithmic law applies to outlaw numbers that are without known relationship, rather than to those that follow an orderly course; and so the logarithmic relation is essentially a Law of Anomalous Numbers.”

  22. What is the explanation? Explanations for Benford’s Law • Numbers from a wide range of data sources have about 30% of 1’s, down to only 5% of 9’s. • Benford called these ‘outlaw’ or ‘anomalous’ numbers. They include street addresses of American Men of Science, populations, areas, numbers from magazines and newspapers. • Benford’s ‘orderly’ numbers don’t follow the law – like atomic weights and physical constants

  23. Popular Explanations • Scale Invariance • Base Invariance • Complicated Measure Theory • Divine choice • Mystery of Nature The first two (Scale Invariance and Base Invariance) say that IF there is a universal law, it must be Benford’s. They don’t explain why there should be a law to start with!

  24. Complicated Measure Theory In a nutshell … If you grab numbers from all over the place (a random mix of distributions), their digit frequencies ultimately converge to Benford’s Law
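
The "grab numbers from all over the place" idea can be illustrated with a small simulation. This is only a sketch under our own assumptions, not the measure-theoretic result itself: every draw picks one of three distribution families at random and a scale spread over six orders of magnitude. The function names, the families, the scale range and the seed are all hypothetical choices made for illustration.

```python
import math
import random

def first_digit(x):
    """First significant digit of a positive number, read off its scientific notation."""
    return int(f"{x:.15e}"[0])

def draw_from_random_distribution(rng):
    """Pick a distribution family and a scale at random, then draw one number from it."""
    scale = 10 ** rng.uniform(0, 6)  # assumption: scales span six orders of magnitude
    family = rng.choice(["uniform", "exponential", "lognormal"])
    if family == "uniform":
        return rng.uniform(0, scale)
    if family == "exponential":
        return rng.expovariate(1 / scale)
    return scale * rng.lognormvariate(0, 1)

def digit_frequencies(n_draws=50_000, seed=1):
    """Observed share of each first digit (index 1 to 9) among n_draws mixed numbers."""
    rng = random.Random(seed)
    counts = [0] * 10
    drawn = 0
    while drawn < n_draws:
        x = draw_from_random_distribution(rng)
        if x > 0:  # skip the very unlikely exact zero, which has no first digit
            counts[first_digit(x)] += 1
            drawn += 1
    return [c / n_draws for c in counts]

freqs = digit_frequencies()
for d in range(1, 10):
    # Observed share next to the logarithmic law's value for the same digit.
    print(d, round(freqs[d], 3), round(math.log10((d + 1) / d), 3))
```

Spreading the scales evenly on a log scale is what makes this mix settle so close to the law. A mix with a narrower range of scales may agree less well.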

  25. That’s why THIS works well

  26. It doesn’t really explain WHAT will work well, nor why • It doesn’t explain why street addresses of American Men of Science work well!

  27. The Key Idea… If a hat is covered evenly in red and white stripes… Photo - Eric Pouhier http://commons.wikimedia.org/wiki/Napoleon

  28. The Key Idea… If a hat is covered evenly in red and white stripes… … it will be half red and half white. Photo - Eric Pouhier http://commons.wikimedia.org/wiki/Napoleon

  29. A Hat

  30. A Hat

  31. A Hat • If the red stripes cover half the base, they’ll cover about half the hat The red stripes and the white stripes even out over the shape of the hat

  32. What if the red stripes cover 30% of the base? 0 0.3 1 1.3 2 2.3 3 3.3 4 4.3 5 5.3 6 Then they’ll cover about 30% of the hat.

  33. What if the red stripes cover precisely fraction 0.301 of the base? Then they’ll cover fraction ~0.301 of the hat. 0 0.301 1 1.301 2 2.301 3 3.301 4 4.301 5 5.301 6 0.301 = log10(2/1)
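
The stripe claim can be checked numerically for one particular hat. The sketch below assumes a triangular hat over the base 0 to 6, which is our own choice of shape and not the hat drawn on the slides. It measures areas with a simple midpoint rule, and all function names are hypothetical.

```python
def hat_height(x):
    """A hypothetical triangular 'hat' over the base 0..6, peaking at 3."""
    return max(0.0, 3.0 - abs(x - 3.0))

def area(f, lo, hi, steps=10_000):
    """Midpoint-rule approximation of the area under f between lo and hi."""
    width = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * width) for i in range(steps)) * width

def red_fraction(stripe_width, base=6):
    """Fraction of the hat's area lying over the red stripes [n, n + stripe_width]."""
    red = sum(area(hat_height, n, n + stripe_width) for n in range(base))
    return red / area(hat_height, 0, base)

print(red_fraction(0.3))    # stripes covering 30% of the base
print(red_fraction(0.301))  # stripes covering fraction 0.301 of the base
```

For this symmetric triangle the red fraction comes out very close to the stripe width. For other hat shapes the match is only approximate, as the "about" on the slides suggests.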

  34. Think of X as a random number… • We want the probability that X has first digit = 1 • Let the ‘hat’ be a probability density curve for X • Then AREAS on the hat give PROBABILITIES for X

  35. Think of X as a random number… • We want the probability that X has first digit = 1 • Let the ‘hat’ be a probability density curve for X • Then AREAS on the hat give PROBABILITIES for X Area = 0.95 from 1 to 5 Pr(1 < X < 5) = 0.95 Total area = 1

  36. In the same way …. 0 0.301 1 1.301 2 2.301 3 3.301 4 4.301 5 5.301 6 If the red stripes somehow represent the X values with first digit = 1, and the red stripes have area ~ 0.301, then Pr(X has first digit 1) ~ 0.301.

  37. So X values with first digit=1 somehow lie on a set of evenly spaced stripes? Write X in Scientific Notation:

  38. So X values with first digit=1 somehow lie on a set of evenly spaced stripes? Write X in Scientific Notation: X = r × 10^n, where r is between 1 and 10 and n is an integer

  39. For example… r is between 1 and 10 n is an integer

  40. For example… For the first digit of X, only r matters!

  41. For example… If r > 2, the first digit of X is not 1. If 1 < r < 2, the first digit of X is 1. For the first digit of X, only r matters!
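
The scientific-notation step can be sketched in code. The sketch assumes X is a positive number, written as r times a power of ten with r at least 1 and below 10. The helper names `scientific_parts` and `first_digit` and the sample values are ours, since the slide's own examples did not survive in this transcript.

```python
import math

def scientific_parts(x):
    """Split a positive x into (r, n) with x = r * 10**n and 1 <= r < 10."""
    if x <= 0:
        raise ValueError("x must be positive")
    n = math.floor(math.log10(x))
    r = x / 10 ** n
    # Guard against floating-point rounding near exact powers of ten.
    if r >= 10:
        r, n = r / 10, n + 1
    elif r < 1:
        r, n = r * 10, n - 1
    return r, n

def first_digit(x):
    """The first digit of x depends only on r, not on the power of ten n."""
    r, _ = scientific_parts(x)
    return int(r)

# Hypothetical sample values, chosen only for illustration.
for x in (1234, 0.0271, 950.0):
    r, n = scientific_parts(x)
    print(x, "= r x 10^n with r about", round(r, 4), "and n =", n, "so first digit", first_digit(x))
```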

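  42. Take logs to base 10… log10(X) = log10(r) + n. Or in other words… log10(X) is the integer n plus a remainder log10(r) that lies between 0 and 1.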

  43. r is between 1 and 10 n is an integer

  44. r is between 1 and 10 n is an integer

  45. r is between 1 and 10 n is an integer

  46. n is an integer X has first digit 1 precisely when log(X) is between n and n + 0.301 for any integer n n = 0 : X from 1 to 2 n = 1 : X from 10 to 20 n = 2 : X from 100 to 200
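
The stripe rule on this slide can be checked against the ordinary way of reading a first digit. This is a minimal sketch with hypothetical function names. It uses the exact stripe width log10(2), which the slides round to 0.301.

```python
import math

LOG10_2 = math.log10(2)  # stripe width, about 0.301

def in_red_stripe(x):
    """True when log10(x) lies between n and n + log10(2) for some integer n."""
    log_x = math.log10(x)
    return log_x - math.floor(log_x) < LOG10_2

def starts_with_one(k):
    """Read the first digit of a positive integer directly from its decimal digits."""
    return str(k)[0] == "1"

# Compare the two rules on 1.5, 2.5, ..., 9999.5. Adding 0.5 keeps every test
# value away from the stripe edges (such as 2, 20, 200), where floating-point
# rounding of the logarithm could tip the comparison either way.
agreement = all(in_red_stripe(k + 0.5) == starts_with_one(k) for k in range(1, 10_000))
print(agreement)
```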

  47. n is an integer X has first digit 1 precisely when log(X) is between n and n + 0.301 for any integer n STRIPES!! n = 0 : log(X) from 0 to 0.301 n = 1 : log(X) from 1 to 1.301 n = 2 : log(X) from 2 to 2.301

  48. The ‘hat’ is the probability density curve for log(X) 0 0.301 1 1.301 2 2.301 3 3.301 4 4.301 5 5.301 6 • X values with first digit = 1 satisfy: n = 0 : log(X) from 0 to 0.301 n = 1 : log(X) from 1 to 1.301 n = 2 : log(X) from 2 to 2.301 and so on!

  49. The ‘hat’ is the probability density curve for log(X) 0 0.301 1 1.301 2 2.301 3 3.301 4 4.301 5 5.301 6 • X values with first digit = 1 satisfy: n = 0 : X from 1 to 2 n = 1 : X from 10 to 20 n = 2 : X from 100 to 200

  50. 0 0.301 1 1.301 2 2.301 3 3.301 4 4.301 5 5.301 6 So X values with first digit=1 DO lie on evenly spaced stripes, on the log scale! The PROBABILITY of getting first digit 1 is the AREA of the red stripes, which is approximately the fraction of the base that they cover: 0.301.
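
As a worked illustration of "probability = area of the red stripes", the sketch below assumes the hat is a normal (bell-shaped) density for log10(X). The slides do not specify that shape, so it is purely our assumption, as are the function names and the mean and spread values. The stripe areas are added up using the normal cumulative distribution function.

```python
import math

LOG10_2 = math.log10(2)  # stripe width, about 0.301

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prob_first_digit_one(mean, sd, n_values=range(-40, 41)):
    """Pr(first digit of X is 1) when log10(X) is normal(mean, sd).

    Adds up the area under the density over each stripe [n, n + log10(2)].
    """
    total = 0.0
    for n in n_values:
        total += normal_cdf((n + LOG10_2 - mean) / sd) - normal_cdf((n - mean) / sd)
    return total

# A wide hat: log10(X) spread over several whole units of the base.
wide = prob_first_digit_one(mean=3.0, sd=1.0)
# A narrow hat: log10(X) squeezed into a fraction of one unit.
narrow = prob_first_digit_one(mean=3.0, sd=0.1)
print(round(wide, 4), round(narrow, 4))
```

With the wide hat the stripe area lands very close to 0.301. With the narrow hat it does not, which is why the stripe argument is an approximation that depends on the hat spanning several stripes.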
