Probably my favourite session at SQLBits the other week was Professor Mark Whitehorn on exploiting exotic patterns in data. One of the things he talked about was Benford’s Law, something I first heard about several years ago (in fact I’m sure I wrote a blog post on implementing Benford’s Law in MDX but I can’t find it), about the frequency distribution of digits in data. I won’t try to explain it myself but there are plenty of places you can read up on it, for example: http://en.wikipedia.org/wiki/Benford%27s_law . I promise, it’s a lot more interesting that it sounds!
Anyway, it struck me that it would be quite useful to have a Power Query function that could be used to find the distribution of the first digits in any list of numbers, for example for fraud detection purposes. The first thing I did was write a simple query that returned the expected distributions for the digits 1 to 9 according to Benford’s Law:
let //function to find the expected distribution of any given digit Benford = (digit as number) as number => Number.Log10(1 + (1/digit)), //get a list of values between 1 and 9 Digits = {1..9}, // get a list containing these digits and their expected distribution DigitsAndDist = List.Transform(Digits, each {_, Benford(_)}), //turn that into a table Output = #table({"Digit", "Distribution"}, DigitsAndDist) in Output
Next I wrote the function itself:
//take a single list of numbers as a parameter (NumbersToCheck as list) as table=> let //remove any non-numeric values RemoveNonNumeric = List.Select(NumbersToCheck, each Value.Is(_, type number)), //remove any values that are less than or equal to 0 GreaterThanZero = List.Select(RemoveNonNumeric, each _>0), //turn that list into a table ToTable = Table.FromList(GreaterThanZero, Splitter.SplitByNothing(), null, null, ExtraValues.Error), RenameColumn = Table.RenameColumns(ToTable,{{"Column1", "Number"}}), //function to get the first digit of a number FirstDigit = (InputNumber as number) as number => Number.FromText(Text.Start(Number.ToText(InputNumber),1))-1, //get the distributions of each digit GetDistributions = Table.Partition(RenameColumn, "Number", 9, FirstDigit), //turn that into a table DistributionTable = Table.FromList(GetDistributions, Splitter.SplitByNothing(), null, null, ExtraValues.Error), //add column giving the digit AddIndex = Table.AddIndexColumn(DistributionTable, "Digit", 1, 1), //show how many times each first digit occurred CountOfDigits = Table.AddColumn(AddIndex, "Count", each Table.RowCount([Column1])), RemoveColumn = Table.RemoveColumns(CountOfDigits ,{"Column1"}), //merge with table showing expected distributions Merge = Table.NestedJoin(RemoveColumn,{"Digit"}, Benford,{"Digit"},"NewColumn",JoinKind.Inner), ExpandNewColumn = Table.ExpandTableColumn(Merge, "NewColumn", {"Distribution"}, {"Distribution"}), RenamedDistColumn = Table.RenameColumns(ExpandNewColumn, {{"Distribution", "Expected Distribution"}}), //calculate actual % distribution of first digits SumOfCounts = List.Sum(Table.Column(RenamedDistColumn, "Count")), AddActualDistribution = Table.AddColumn(RenamedDistColumn, "Actual Distribution", each [Count]/SumOfCounts) in AddActualDistribution
There’s not much to say about this code, apart from the fact that it’s a nice practical use case for the Table.Partition() function I blogged about here. It also references the first query shown above, called Benford, so that the expected and actual distributions can be compared.
Since this is a function that takes a list as a parameter, it’s very easy to pass it any column from any other Power Query query that’s in the same worksheet (as I showed here) for analysis. For example, I created a Power Query query on this dataset in the Azure Marketplace showing the number of minutes that each flight in the US was delayed in January 2012. I then invoked the function above, and pointed it at the column containing the delay values like so:
The output is a table (to which I added a column chart) which shows that this data follows the expected distribution very closely:
You can download my sample workbook containing all the code from here.