Probably my favourite session at SQLBits the other week was Professor Mark Whitehorn on exploiting exotic patterns in data. One of the things he talked about was Benford’s Law, something I first heard about several years ago (in fact I’m sure I wrote a blog post on implementing Benford’s Law in MDX but I can’t find it), about the frequency distribution of digits in data. I won’t try to explain it myself but there are plenty of places you can read up on it, for example: http://en.wikipedia.org/wiki/Benford%27s_law . I promise, it’s a lot more interesting than it sounds!
Anyway, it struck me that it would be quite useful to have a Power Query function that could be used to find the distribution of the first digits in any list of numbers, for example for fraud detection purposes. The first thing I did was write a simple query that returned the expected distributions for the digits 1 to 9 according to Benford’s Law:
let
    //function to find the expected distribution of any given digit
    Benford = (digit as number) as number => Number.Log10(1 + (1/digit)),
    //get a list of values between 1 and 9
    Digits = {1..9},
    //get a list containing these digits and their expected distribution
    DigitsAndDist = List.Transform(Digits, each {_, Benford(_)}),
    //turn that into a table
    Output = #table({"Digit", "Distribution"}, DigitsAndDist)
in
    Output
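For reference, the formula that the Benford step in this query evaluates can be written out as below, together with a quick check that the nine expected values add up to 1 (the product telescopes to 10). The rounded figures are just the formula evaluated for three of the digits.

```latex
P(d) = \log_{10}\!\left(1 + \frac{1}{d}\right), \qquad d \in \{1, \dots, 9\}

\sum_{d=1}^{9} P(d)
  = \sum_{d=1}^{9} \log_{10}\frac{d+1}{d}
  = \log_{10} \prod_{d=1}^{9} \frac{d+1}{d}
  = \log_{10} 10
  = 1

P(1) \approx 0.301, \qquad P(2) \approx 0.176, \qquad P(9) \approx 0.046
```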
Next I wrote the function itself:
//take a single list of numbers as a parameter
(NumbersToCheck as list) as table =>
let
    //remove any non-numeric values
    RemoveNonNumeric = List.Select(NumbersToCheck, each Value.Is(_, type number)),
    //remove any values that are less than or equal to 0
    GreaterThanZero = List.Select(RemoveNonNumeric, each _ > 0),
    //turn that list into a table
    ToTable = Table.FromList(GreaterThanZero, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
    RenameColumn = Table.RenameColumns(ToTable, {{"Column1", "Number"}}),
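    //function to get the first significant digit of a number, minus 1,
    //so that it can be used as a zero-based partition index;
    //leading zeros and decimal separators are trimmed so values below 1 (such as 0.05) work too
    FirstDigit = (InputNumber as number) as number =>
        Number.FromText(Text.Start(Text.TrimStart(Number.ToText(InputNumber), {"0", ".", ","}), 1)) - 1,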
    //get the distributions of each digit
    GetDistributions = Table.Partition(RenameColumn, "Number", 9, FirstDigit),
    //turn that into a table
    DistributionTable = Table.FromList(GetDistributions, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
    //add column giving the digit
    AddIndex = Table.AddIndexColumn(DistributionTable, "Digit", 1, 1),
    //show how many times each first digit occurred
    CountOfDigits = Table.AddColumn(AddIndex, "Count", each Table.RowCount([Column1])),
    RemoveColumn = Table.RemoveColumns(CountOfDigits, {"Column1"}),
    //merge with table showing expected distributions
    Merge = Table.NestedJoin(RemoveColumn, {"Digit"}, Benford, {"Digit"}, "NewColumn", JoinKind.Inner),
    ExpandNewColumn = Table.ExpandTableColumn(Merge, "NewColumn", {"Distribution"}, {"Distribution"}),
    RenamedDistColumn = Table.RenameColumns(ExpandNewColumn, {{"Distribution", "Expected Distribution"}}),
    //calculate actual % distribution of first digits
    SumOfCounts = List.Sum(Table.Column(RenamedDistColumn, "Count")),
    AddActualDistribution = Table.AddColumn(RenamedDistColumn, "Actual Distribution", each [Count]/SumOfCounts)
in
    AddActualDistribution
There’s not much to say about this code, apart from the fact that it’s a nice practical use case for the Table.Partition() function I blogged about here. It also references the first query shown above, called Benford, so that the expected and actual distributions can be compared.
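To make the partitioning step a little more concrete, here is a minimal standalone sketch. The sample numbers and the step names (Source, FirstDigitIndex, Partitions, Counts) are made up purely for illustration. As documented, Table.Partition() puts each row into the table whose position is the hash value modulo the number of groups, which is why 1 is subtracted from the first digit: digit 1 goes to position 0 and digit 9 to position 8.

```powerquery
let
    //hypothetical sample data, chosen only for illustration
    Source = Table.FromColumns({{12, 25, 17, 3, 250}}, {"Number"}),
    //first digit minus 1, giving a zero-based partition index
    FirstDigitIndex = (n as number) as number =>
        Number.FromText(Text.Start(Number.ToText(n), 1)) - 1,
    //split the table into a list of 9 tables, one per possible first digit
    Partitions = Table.Partition(Source, "Number", 9, FirstDigitIndex),
    //count the rows in each of the 9 tables
    Counts = List.Transform(Partitions, Table.RowCount)
in
    Counts
```

With these values, 12 and 17 should end up in the first table, 25 and 250 in the second and 3 in the third, with the remaining six tables empty.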
Since this is a function that takes a list as a parameter, it’s very easy to pass it any column from any other Power Query query that’s in the same workbook (as I showed here) for analysis. For example, I created a Power Query query on this dataset in the Azure Marketplace showing the number of minutes that each flight in the US was delayed in January 2012. I then invoked the function above, and pointed it at the column containing the delay values like so:
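In code, an invocation along these lines might look like the sketch below. All three names here are assumptions made for the sake of the example: FlightDelays stands for the query on the flight dataset, DelayMinutes for its delay column, and BenfordCheck for whatever name the function query above was given.

```powerquery
let
    //hypothetical query containing the flight delay data
    Source = FlightDelays,
    //extract the delay column as a list (hypothetical column name)
    Delays = Table.Column(Source, "DelayMinutes"),
    //pass the list to the function shown earlier (hypothetical function name)
    Output = BenfordCheck(Delays)
in
    Output
```

Table.Column() returns the column as a list, which matches the list parameter that the function expects.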
The output is a table (to which I added a column chart) which shows that this data follows the expected distribution very closely:
You can download my sample workbook containing all the code from here.