In my last post you saw how to join two tables together using conditions other than the built-in inner, outer and anti join conditions. However, as I mentioned, you need to be aware of some of the performance implications of using this technique – and that is the subject I’ll be looking at in this post.
Let’s take the events-in-progress problem, one that I have blogged about many times in the past (see here, here, here and here for example) and see how we can solve it in Power Query. It’s a very common problem that you encounter when you have a fact table where each row represents an event, there are columns containing the start and end dates of each event, and you want to count the number of events that were in progress on any given date.
Using the Adventure Works DW database, my normal way of illustrating the problem is this: taking the DimDate and FactInternetSales tables, for each date show the number of purchases that had been ordered but not shipped on that date. One way of solving this problem (but not the only way, as my previous blog posts show) is to do a cross join between the DimDate and FactInternetSales tables, then filter the result so that you get the rows where the value in the DateKey column from the DimDate table is between the values in the OrderDateKey and the ShipDateKey columns from the FactInternetSales table.
As a first attempt you might come up with a query like this one, which uses a custom column containing a table value that is subsequently expanded to do a cross join (the technique shown in my previous post):
let
//Connect to SQL Server
Source = Sql.Database("localhost", "Adventure Works DW"),
//Get data from the DimDate table
dbo_DimDate = Source{[Schema="dbo",Item="DimDate"]}[Data],
//Remove all columns except the DateKey column
RemovedOtherColumns = Table.SelectColumns(dbo_DimDate,{"DateKey"}),
//Insert a custom column that contains the whole of FactInternetSales as a table in each row
InsertedCustom = Table.AddColumn(RemovedOtherColumns, "FactInternetSalesTable",
each Source{[Schema="dbo",Item="FactInternetSales"]}[Data]),
//Expand this new column to show the OrderDateKey and ShipDateKey columns
#"Expand FactInternetSalesTable" = Table.ExpandTableColumn(InsertedCustom,
"FactInternetSalesTable",
{"OrderDateKey", "ShipDateKey"},
{"FactInternetSalesTable.OrderDateKey",
"FactInternetSalesTable.ShipDateKey"}),
//Filter where DateKey is greater than or equal to OrderDateKey and
//DateKey is less than or equal to ShipDateKey
FilteredRows = Table.SelectRows(#"Expand FactInternetSalesTable",
each [DateKey] >= [FactInternetSalesTable.OrderDateKey] and
[DateKey] <= [FactInternetSalesTable.ShipDateKey]),
//Find the count of the number of rows grouped by DateKey
GroupedRows = Table.Group(FilteredRows, {"DateKey"},
{{"Count", each Table.RowCount(_), type number}})
in
GroupedRows
There is, however, a big problem with this query: on my laptop it runs and runs forever – well, maybe not forever but I cancelled it after several minutes. Some tuning is necessary.
I don’t think anyone outside the Power Query dev team has much experience of performance tuning Power Query yet. However there is one golden rule that I do know: where possible, allow Power Query to push as much of the work back to the data source. This behaviour is known as “query folding” and it’s something that I’ve blogged about, as have Matt Masson and Darren Gosbell. Looking in SQL Server Profiler for the query above it is clear that no query folding is taking place: the only activity visible is Power Query reading the data from the DimDate and FactInternetSales tables separately.
After a bit of trial and error I came up with the following alternative:
let
//Connect to SQL Server
Source = Sql.Database("localhost", "adventure works dw"),
//Get data from the DimDate table
dbo_DimDate = Source{[Schema="dbo",Item="DimDate"]}[Data],
//Remove all columns except DateKey
RemovedOtherColumns = Table.SelectColumns(dbo_DimDate,{"DateKey"}),
//Add a custom column to DimDate containing the value 1
InsertedCustom = Table.AddColumn(RemovedOtherColumns, "Dummy", each 1),
//Get data from the FactInternetSales table
dbo_FactInternetSales = Source{[Schema="dbo",Item="FactInternetSales"]}[Data],
//Remove all columns except OrderDateKey and ShipDateKey
RemovedOtherColumns1 = Table.SelectColumns(dbo_FactInternetSales,
{"OrderDateKey", "ShipDateKey"}),
//Add a custom column to FactInternetSales containing the value 1
InsertedCustom1 = Table.AddColumn(RemovedOtherColumns1, "Dummy", each 1),
//Join DimDate and FactInternetSales on the two columns that contain 1
Custom1 = Table.Join(InsertedCustom1, "Dummy", InsertedCustom, "Dummy"),
//Filter rows where DateKey is between OrderDateKey and ShipDateKey
FilteredRows = Table.SelectRows(Custom1,
each [DateKey] >= [OrderDateKey] and
[DateKey] <= [ShipDateKey]),
//Group by DateKey and find the number of rows for each date
GroupedRows = Table.Group(FilteredRows, {"DateKey"},
{{"Count", each Table.RowCount(_), type number}}),
//Sort dates in ascending order
SortedRows = Table.Sort(GroupedRows,{{"DateKey", Order.Ascending}})
in
SortedRows
This returns the correct result more or less instantly:
The main difference between this query and the previous one is how I’m getting a cross join between the two tables. This time I’m creating custom columns on DimDate and FactInternetSales that both contain the value 1, and then doing an inner join between this two tables on the new columns – which of course results in the equivalent of a cross join.
In Profiler I can see the following SQL query being generated by Power Query:
select [_].[DateKey],
[_].[Count]
from
(
select [rows].[DateKey] as [DateKey],
count(1) as [Count]
from
(
select [_].[OrderDateKey],
[_].[ShipDateKey],
[_].[Dummy],
[_].[DateKey]
from
(
select [$Outer].[OrderDateKey],
[$Outer].[ShipDateKey],
[$Inner].[Dummy],
[$Inner].[DateKey]
from
(
select [_].[OrderDateKey] as [OrderDateKey],
[_].[ShipDateKey] as [ShipDateKey],
1 as [Dummy]
from
(
select [OrderDateKey],
[ShipDateKey]
from [dbo].[FactInternetSales] as [$Table]
) as [_]
) as [$Outer]
inner join
(
select [_].[DateKey] as [DateKey],
1 as [Dummy]
from
(
select [DateKey]
from [dbo].[DimDate] as [$Table]
) as [_]
) as [$Inner] on ([$Outer].[Dummy] = [$Inner].[Dummy])
) as [_]
where [_].[DateKey] >= [_].[OrderDateKey] and [_].[DateKey] <= [_].[ShipDateKey]
) as [rows]
group by [DateKey]
) as [_]
order by [_].[DateKey]
Query folding is definitely taking place now!
The last question to ask here is whether the first query was slow because query folding was not taking place, or slow because of the way the query was written. You can test this quite easily by rewriting the second query to prevent query folding taking place using the Table.Buffer() function. For example, in the second query the step to get the data from the DimDate table is:
dbo_DimDate = Source{[Schema="dbo",Item="DimDate"]}[Data],
To prevent query folding it needs to be altered to:
dbo_DimDate = Table.Buffer(Source{[Schema="dbo",Item="DimDate"]}[Data]),
(The step to get data from the FactInternetSales table needs to be altered in the same way.)
With this change made the query now executes in around a minute. So clearly the new query is more efficient when it is executed inside the Power Query engine itself, without query folding, but the Power Query engine is still nowhere near as fast as SQL Server and query folding gives the best possible performance.
You can download the sample workbook here.