Applying Benford’s Law of Leading Digits to Large, Natural Data Sets
Background of Benford’s Law • Discovered by Simon Newcomb in 1881 and again by Frank Benford in 1938, Benford’s Law of Leading Digits suggests that in a majority of real-life data sets, the leading digits of data entries are logarithmically, rather than uniformly, distributed. • As such, in a base 10 number system we would expect to observe a leading digit of 1 about 30.1% of the time, whereas we would expect to observe a leading digit of 9 only about 4.6% of the time.
Introduction to Benford’s Law • For any positive real number x, we can represent x in scientific notation as x = M_B(x) ∙ B^{k(x)}, where M_B(x), with 1 ≤ M_B(x) < B, is called the mantissa of x, and the integer k(x) is the exponent. • Benford’s Law of Leading Digits provides us with an expected distribution of the leading digits of the mantissas in a natural data set. According to the law, the probability of observing a data entry beginning with digit d in base B is Prob(first digit is d) = log_B(1 + 1/d).
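The decomposition into mantissa and exponent can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in the study; the function names `mantissa_exponent` and `leading_digit` are hypothetical, and the inputs are assumed to be ordinary positive floating-point numbers.

```python
import math


def mantissa_exponent(x: float, base: int = 10) -> tuple[float, int]:
    """Return (M, k) such that x = M * base**k with 1 <= M < base (x > 0)."""
    if x <= 0:
        raise ValueError("x must be a positive real number")
    k = math.floor(math.log(x) / math.log(base))
    m = x / base**k
    # The floating-point logarithm can be slightly off near exact powers of
    # the base, so nudge the pair back into the range 1 <= M < base.
    if m >= base:
        m, k = m / base, k + 1
    elif m < 1:
        m, k = m * base, k - 1
    return m, k


def leading_digit(x: float, base: int = 10) -> int:
    """Leading digit of x in the given base: the integer part of the mantissa."""
    m, _ = mantissa_exponent(x, base)
    return int(m)
```

Because the mantissa is computed in floating point, a value that sits exactly on a digit boundary may be rounded to either side; working from the reported text of each value avoids that problem.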
Benford Base-10 Distribution • For a base 10 number system, we expect to see the following distribution of leading digits in a data set that satisfies Benford’s Law: 1: 30.1%, 2: 17.6%, 3: 12.5%, 4: 9.7%, 5: 7.9%, 6: 6.7%, 7: 5.8%, 8: 5.1%, 9: 4.6%.
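The expected first-digit percentages follow directly from the formula Prob(first digit is d) = log_B(1 + 1/d). The short sketch below evaluates it for every digit; the function name `benford_first_digit` is an illustrative choice, not something taken from the study.

```python
import math


def benford_first_digit(base: int = 10) -> dict[int, float]:
    """Expected probability of each leading digit 1..base-1 under Benford's Law."""
    return {d: math.log(1 + 1 / d) / math.log(base) for d in range(1, base)}


# Print the base-10 distribution as percentages, one digit per line.
for digit, prob in benford_first_digit().items():
    print(f"{digit}: {prob:.1%}")
```

The probabilities sum to 1 because the logarithms telescope: the sum over d of log_B((d + 1)/d) equals log_B(B).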
Applications of Benford’s Law • Benford’s Law has been a useful tool in detecting fraud and data irregularities in the past due to the fact that humans are notoriously bad random number generators. Discrepancies from the Benford distribution may suggest issues in data validity, such as inconsistencies in data collection methods, rounding errors, or even nefarious activities such as fraud or data distortion. • Knowing that the leading digits of the mantissas should be logarithmically distributed (becoming gradually more uniform the further out we progress), we can compare combinations of the first digits and last digits to the Benford and Uniform distributions, respectively, to judge conformity to the expected digit distribution.
Comparison of Natural Data Sets • In this study, we compare two natural data sets and their conformity to the Benford distribution. • The first data set is an example of a natural data set with an extremely good fit for the Benford distribution. This data set is made up of hydrology and streamflow statistics from the U.S. Geological Survey. • The second is a much more controversial data set that demonstrates considerable discrepancies from the digit distribution we would expect if it were truly Benford. This data set was derived from a paper published by Phil Jones and Michael Mann, two of the researchers accused of data distortion in the 2009 Climategate Scandal.
Issues Arising From Benford Analysis • It is still an open question as to which data sets should conform to Benford’s Law. In general, conforming data sets tend to be large, span multiple orders of magnitude, and have a sufficient number of significant digits. However, it is still possible for the data to fail to be Benford without any nefarious activity despite having met these conditions. • Though the Chi-square statistic is the most popular and well-documented statistic, we must take into account its extreme sensitivity when dealing with large data sets that have few degrees of freedom (in cases such as these, the Chi-square statistic tends to overstate the importance of small deviations). For comparison’s sake, we include these values, but rely primarily on the mean absolute deviation (percent deviation from the expected distribution) for our analysis.
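The sensitivity of the Chi-square statistic to sample size can be illustrated with a small sketch. It assumes the usual Pearson Chi-square statistic and takes the mean absolute deviation to be the average absolute difference between observed and expected proportions; the study does not spell out its exact formula, so that definition is an assumption. The digit counts are made up for illustration and are not data from the study.

```python
import math


def benford_first_digit_probs() -> list[float]:
    """Expected base-10 first-digit probabilities for digits 1..9."""
    return [math.log10(1 + 1 / d) for d in range(1, 10)]


def chi_square(observed_counts, expected_probs) -> float:
    """Pearson Chi-square statistic of observed counts against expected probabilities."""
    n = sum(observed_counts)
    return sum((o - n * p) ** 2 / (n * p)
               for o, p in zip(observed_counts, expected_probs))


def mean_absolute_deviation(observed_counts, expected_probs) -> float:
    """Average absolute gap between observed and expected proportions (assumed definition)."""
    n = sum(observed_counts)
    return sum(abs(o / n - p)
               for o, p in zip(observed_counts, expected_probs)) / len(expected_probs)


# Hypothetical first-digit counts for digits 1..9 (not taken from the study).
small = [305, 170, 125, 97, 79, 67, 58, 51, 48]
large = [400 * c for c in small]  # identical proportions, 400 times as many entries

probs = benford_first_digit_probs()
print("chi-square:", chi_square(small, probs), chi_square(large, probs))
print("MAD:       ", mean_absolute_deviation(small, probs),
      mean_absolute_deviation(large, probs))
```

For fixed proportions the Chi-square statistic grows linearly with the number of entries, while the mean absolute deviation does not change. That is the reason for relying on the latter when a data set has hundreds of thousands of values.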
Intentions • The primary goal of this study is to get a sense of when Benford’s Law should hold for natural data sets. As such, discrepancies from Benford’s Law need not indicate fraud or nefarious activity, and it is not our intent to accuse anyone of such behavior; our goal is to see whether or not certain data sets follow Benford’s Law, and comment on the results.
Hydrology Statistics: Background and Data Description • This data set consists of streamflow statistics from the U.S. Geological Survey. • The characteristics of this data make it perfect for a Benford analysis: • The data spans a time period of 130 years. • The data set is the largest analyzed in Benford literature to date. • The data set spans nine orders of magnitude. • The methods employed to measure stream flow have not changed at all during the time period, suggesting that there will be no distortions due to data collection method changes.
Hydrology Statistics: Description of Benford Tests • In a previous study by Miller-Nigrini (2008), a first two-digit analysis was performed on this data set, displaying a very close fit to the Benford distribution. • In this study, we analyzed the distribution of the first three digits. The statement of Benford’s Law may be revised as follows to predict the probability of observing a data entry beginning with a pre-determined three-digit combination: Prob(first three digits are d_1 d_2 d_3) = log_10[1 + 1/(100 d_1 + 10 d_2 + d_3)]
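As a sanity check on the three-digit formula, the sketch below evaluates it for all 900 combinations from 100 to 999. It confirms that the probabilities sum to 1 and that summing over the second and third digits recovers the first-digit law. The function name `first_three_digit_prob` is a hypothetical label.

```python
import math


def first_three_digit_prob(d1: int, d2: int, d3: int) -> float:
    """Benford probability that a value begins with the digits d1 d2 d3 (base 10)."""
    if not (1 <= d1 <= 9 and 0 <= d2 <= 9 and 0 <= d3 <= 9):
        raise ValueError("need 1 <= d1 <= 9 and 0 <= d2, d3 <= 9")
    return math.log10(1 + 1 / (100 * d1 + 10 * d2 + d3))


# Total probability over all 900 leading three-digit combinations.
total = sum(first_three_digit_prob(a, b, c)
            for a in range(1, 10) for b in range(10) for c in range(10))

# Marginal probability of a leading 1, which should equal log10(2).
leading_one = sum(first_three_digit_prob(1, b, c)
                  for b in range(10) for c in range(10))

print(total, leading_one)
```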
Hydrology Statistics: Restricting the Data Set • Due to potential rounding discrepancies, we initially wanted to include only numbers with at least four significant digits so that the first three would be unaltered by rounding. However, by pruning our data set to include only values for which we could trust the first three digits, we were limiting ourselves to a mere 16.1% of our original 457,440 data entries, all within a single order of magnitude. This resulted in a strange, non-Benford distribution. • Having restricted our data set so thoroughly, we could not conclude that our data set was truly non-Benford. Therefore, we decided to ignore the limitation on significant digits and perform a Benford analysis on a larger portion of the data set under the assumption that any rounding errors would “smooth out” over such a large data set. This enabled us to look at a data set with over 400,000 entries spanning six orders of magnitude. • We compare and contrast these two subsets of the hydrology statistics in the following slides.
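One way to express the restricted and unrestricted selections in code is sketched below. It assumes each value is available as the text in which it was reported, since a float would lose trailing zeros. It also assumes that, in the unrestricted case, values with fewer than three digits are padded with zeros (so "12" counts as 120); the study does not say how such values were handled. The names `significant_digits` and `first_k_digits` are hypothetical.

```python
from decimal import Decimal


def significant_digits(text: str) -> tuple[int, ...]:
    """Digits of a reported value as written, with leading zeros dropped."""
    return Decimal(text).as_tuple().digits


def first_k_digits(text: str, k: int = 3, min_sig: int = 0):
    """First k digits of a value, or None if it is zero or has fewer than
    min_sig significant digits. Short values are zero-padded (an assumption)."""
    digits = significant_digits(text)
    if not digits or digits[0] == 0 or len(digits) < min_sig:
        return None
    padded = digits + (0,) * max(0, k - len(digits))
    return int("".join(str(d) for d in padded[:k]))


# Hypothetical reported values (not taken from the study).
reported = ["0.004567", "12", "12.50", "3480", "0.00"]

restricted = [first_k_digits(v, 3, min_sig=4) for v in reported]
unrestricted = [first_k_digits(v, 3) for v in reported]
print(restricted)
print(unrestricted)
```

Requiring `min_sig=4` mirrors the rule that the first three digits must be unaffected by rounding, while the default keeps every non-zero value.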
Hydrology Statistics: Comparison of Restricted vs. Unrestricted Data • [Figure panels: Restricted First 3-Digits; Unrestricted First 3-Digits] • In this comparison, we can see that the unrestricted data set displays a much better conformance to the Benford distribution (shown in pink). This may be attributed in part to the difference in size, but is primarily due to the presence of data entries spanning an extra five orders of magnitude in the unrestricted data as opposed to the restricted set.
Hydrology Statistics: Measuring Conformance to Benford’s Law • The following table reports the Chi-square and mean absolute deviation values for Benford tests of the first, first two, and first three (both restricted and unrestricted) digits. Again, we treat the significant Chi-square values with caution, as several of our data sets contain over 400,000 values.
Hydrology Statistics: Conclusions • Though both data sets display a relatively good fit for the Benford distribution, we notice that ignoring our initial limitation on the number of significant digits (thereby giving us a much larger data set spanning five additional orders of magnitude) gives us a better value for the mean absolute deviation. Our unrestricted data set is a mere 0.02% away from what we would expect in a Benford distributed data set.
Climate Data: Background and Data Description • A massive email leak at the Climatic Research Unit in November 2009 led to allegations of scientific misconduct and data distortion in the climate science community. The scandal soon earned the title “Climategate”. • The data set analyzed consisted of data from a paper published by Phil D. Jones and Michael Mann (two of the researchers accused in the scandal) in 2004, titled “Global Surface Temperatures Over the Past Two Millennia”. • The data set contains a total of 32,451 observations (measured as deviations from an average temperature); this data set was further broken down into 30 data subsets (ranging in size from 335 to 1991 entries) covering different regions of the world.
Climate Data: Description of Benford Tests • Because these data entries were measured as deviations from an average temperature, the option of a first-digit analysis was discarded (due to the presence of so many data entries beginning with 0). • Instead, the last two digits were analyzed in four different Benford tests: • Endings: Compares each of the 100 possible last two-digit combinations to the expected uniform probability, 1/100. • Non/Doubles: Compares the proportion of total non-double endings to 9/10 and the proportion of total double-digit endings to 1/10. • Non/Doubles (Split): Compares the proportion of total non-double endings to 9/10 and the proportion of each double-digit ending to 1/100. • Doubles (Conditional): Evaluates the double-digit ending proportions conditionally (given that a double occurs), comparing each double-digit ending combination to 1/10 of the total double-digit endings.
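The four last-two-digit tests described above can be sketched as follows. This is an illustrative reading of the slide, not the study's own code. Values are assumed to be plain decimal strings such as "-0.377", and the deviation measure used here (average absolute gap between observed and expected proportions) is an assumption. The names `last_two_digits`, `ending_tests` and `deviation` are hypothetical.

```python
from collections import Counter

DOUBLES = [f"{d}{d}" for d in range(10)]  # "00", "11", ..., "99"


def last_two_digits(text: str):
    """Last two digits of a plain decimal string, ignoring sign and decimal point."""
    digits = [c for c in text if c in "0123456789"]
    return "".join(digits[-2:]) if len(digits) >= 2 else None


def ending_tests(values) -> dict[str, list[tuple[float, float]]]:
    """For each test, return a list of (observed proportion, expected proportion)."""
    endings = [e for e in map(last_two_digits, values) if e is not None]
    n = len(endings)
    counts = Counter(endings)
    n_doubles = sum(counts[d] for d in DOUBLES)
    non_doubles = ((n - n_doubles) / n, 9 / 10)
    tests = {
        "Endings": [(counts[f"{i:02d}"] / n, 1 / 100) for i in range(100)],
        "Non/Doubles": [non_doubles, (n_doubles / n, 1 / 10)],
        "Non/Doubles (Split)": [non_doubles]
        + [(counts[d] / n, 1 / 100) for d in DOUBLES],
    }
    if n_doubles:  # the conditional test is undefined when no double occurs
        tests["Doubles (Conditional)"] = [
            (counts[d] / n_doubles, 1 / 10) for d in DOUBLES
        ]
    return tests


def deviation(pairs) -> float:
    """Average absolute gap between observed and expected proportions."""
    return sum(abs(o - e) for o, e in pairs) / len(pairs)


# Hypothetical values whose endings 00..99 each occur exactly once.
uniform_values = [f"{i / 100:.2f}" for i in range(100)]
for name, pairs in ending_tests(uniform_values).items():
    print(name, deviation(pairs))
```

With perfectly uniform endings every test reports a deviation of zero. A data set in which one double such as 77 is over-represented would show up most strongly in the Doubles (Conditional) test, because that test has only ten categories.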
Climate Data: Distribution of Double-Digit Ending Combinations • In an amalgamation of all 30 data subsets, we observe a significant spike of values ending in the double-digit combination 77, and a deficit of values ending in 00.
Climate Data: Analysis of Climate Data Amalgamation • In an analysis of all 32,451 data entries, we see a 3.93% deviation from our expected Uniform distribution in the Doubles (Conditional) test. • An issue that arises in the climate data analysis is the fact that with only three significant digits, we would not expect the last two-digit distribution to be entirely uniform, as we have not progressed far enough out in the mantissa to ensure uniformity. We should see a distribution that is slightly biased toward lower values, though less so than a first-digit Benford distribution.
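The size of this bias can be estimated from the three-digit form of Benford’s Law. The sketch below assumes values with exactly three significant digits whose leading three digits follow Benford’s Law. That is an idealisation and not a claim about the climate data. It sums the three-digit probabilities over the first digit to get the expected distribution of the last two digits; the function name `last_two_of_three_probs` is hypothetical.

```python
import math


def last_two_of_three_probs() -> dict[str, float]:
    """Expected probability of each ending d2 d3 for Benford-distributed
    three-digit values, obtained by summing over the first digit d1."""
    probs = {}
    for d2 in range(10):
        for d3 in range(10):
            probs[f"{d2}{d3}"] = sum(
                math.log10(1 + 1 / (100 * d1 + 10 * d2 + d3))
                for d1 in range(1, 10)
            )
    return probs


probs = last_two_of_three_probs()
print(f"P(ending 00) = {probs['00']:.4f}")
print(f"P(ending 99) = {probs['99']:.4f}")
print(f"uniform      = {1 / 100:.4f}")
```

Under these assumptions the ending 00 is expected about 1.2% of the time and the ending 99 about 0.8% of the time, compared with the uniform value of 1%. This is a mild downward slope rather than the steep drop seen in the first-digit distribution.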
Climate Data: Analysis of Individual Data Subsets • Due to large discrepancies in the number of times that particular ending digit combinations were observed, we chose to analyze each of the thirty data subsets individually. This uncovered a number of subsets with ending digit distributions that seemed to be outside the realm of random chance. We include the double-digit ending statistics for two of these strange subsets (the Western US Unsmoothed and Tasmania Unsmoothed data sets) below. • In our next few slides, we provide an example of the analysis that was performed on each of the thirty data subsets, using the data from these two subsets.
Climate Data: Analysis of Subsets – Western US Unsmoothed • If this data subset were truly Benford, we would expect to see a slight bias toward the lower double-digit combinations, and a more uniform decrease as the ending digit combinations increase. When the last two-digit analysis was expanded to include all 100 possible ending combinations, we observed a random scattering of large numbers of occurrences interspersed with 17 ending combinations that did not occur a single time. In addition to the non-Benford pattern of ending digit combinations, we have a 9.78% deviation from the distribution of double-digit endings (Doubles (Conditional)) that we would expect if this subset were Benford.
Climate Data: Analysis of Subsets – Tasmania Unsmoothed • As seen in the previous table, this data subset demonstrates a strong bias towards lower double-digit ending combinations. Originally, we suspected that this anomaly might be due to a lack of range (i.e. if our range covered only the interval [0, 0.4], we would not expect to observe any ending combinations above 40). However, our range covers the interval [-4.43, 3.59], and includes ending digit combinations ranging from 00 to 99. • An expansion of the analysis to include all 100 possible two-digit ending combinations demonstrates an impressive lack of pattern, with 46 ending combinations being observed not a single time, while others are observed as many as 80 times. A sample of this data is included below:
Climate Data: Analysis of Subsets – Tasmania Unsmoothed (cont.) • Though the majority of the last two-digit tests reveal only a 1 or 2% discrepancy from the expected distribution, our Doubles (Conditional) test reports a deviation of 12.00% from the Uniform distribution that we would expect if this data set were truly Benford.
Climate Data: Conclusions • An identical analysis performed on all thirty data subsets revealed multiple cases of disparities from the uniform last two-digit distribution to which they were compared. Even with significant deviations in several of the subsets, we would expect that if the data were truly Benford, these discrepancies would smooth out in an amalgamation of all 32,451 values. • As mentioned previously, we currently have no way of determining if this data set should conform to Benford’s Law. Though the data set is large, spans multiple orders of magnitude, and reports data values to three significant digits, it is still entirely possible that the deviations are due to non-nefarious factors, such as rounding errors, discrepancies in data collection methods, or simply non-Benford behavior.
Conclusions • In this study we have seen two natural data sets whose conformance to Benford’s Law varies drastically: a set of hydrology statistics whose leading digit distribution is a very close fit, and a set of controversial climate statistics whose ending digit distribution reveals many discrepancies from a Benford data set. • It is still an open question as to which data sets should conform to Benford’s Law; though many researchers believe that this law is a characteristic intrinsic to our number system, there is no set of criteria that guarantees conformance to the expected leading digit distribution. • It has been our goal to provide an in-depth Benford analysis of several large natural data sets, to demonstrate Benford techniques, and to address common issues that arise in a Benford analysis.