Benford’s Law states that the leading significant digit of many naturally occurring numbers will likely be small. For example, in sets which obey the law, the number 1 appears as the most significant digit about 30% of the time, while 9 appears as the most significant digit less than 5% of the time. By contrast, if the digits were distributed uniformly, they would each occur about 11.1% of the time. This is Benford’s Law’s theoretical probability distribution:
\[ P(d) = \log\left(1 + \frac{1}{d}\right)\]
Benford’s Law is a theoretical probability distribution that can be leveraged to detect data quality issues, fraudulent behavior or other data anomolies by looking for data which deviate from this pattern. In order to get the best results, it is often best to compare past deviations from non-fraudulent data to your test set rather than relying on a direct comparison with the theoretical distribution.
As the Shiny app demonstrates, there are many cases that will not follow this pattern, so it is important not to jump to conclusions or to use Benford’s Law as a key component of any predictive or inferential analysis. It can be most useful in passively monitoring and exploring high volumes of data when building a more accurate model is not feasible. Benford’s Law is a useful heuristic that indicates further investigation might be warranted, especially in data streams which previously followed the theoretical distribution, but have recently begun to deviate from it.