In a late addition to the series of posts that I started here a few months ago, I thought it would be useful to look at how to optimise the performance of a rare (but nonetheless useful) type of merge: the cross join. Cross joins are not available as a join kind in the Table.Join or Table.NestedJoin M functions but there are two ways of creating them in Power Query:
- Creating a custom column in your query that contains the query that you want to cross join with and then expanding, as Matt Allington describes here
- Creating a custom column that contains a single value, such as the number 1, on both your queries and then using that column to merge the two queries on (something I mentioned a long time ago here)
Which one gives you the best performance? Apart from my old blog post there’s at least one other post suggesting the second approach is much faster, but I thought it would be useful to do some proper testing.
For my source data I created a query called SourceQuery that returned a table with one thousand rows, containing all the integers from 1 to 1000:
To test the first approach, I created a second query that referenced SourceQuery, then added a custom column that also returned the SourceQuery query:
…and then expanded this new custom column. The result was a query that returned a million rows – the cross join of the thousand-row SourceQuery table with itself.
Here’s the M code:
let Source = SourceQuery, #"Added Custom" = Table.AddColumn( Source, "Custom", each SourceQuery ), #"Expanded Custom" = Table.ExpandTableColumn( #"Added Custom", "Custom", {"Column1"}, {"Custom.Column1"} ), #"Changed Type" = Table.TransformColumnTypes( #"Expanded Custom", {{"Custom.Column1", Int64.Type}} ) in #"Changed Type"
The two timings from Profiler (the same ones that I have used to measure refresh performance throughout this series) were:
- Progress Report End/25 Execute SQL – 0.03 seconds
- Progress Report End/17 Read Data – 3.80 seconds
To test the second approach I added a custom column to the SourceQuery query that contained the value 1:
…and then created a new query that joined this query to itself:
…and then did an expand to achieve the same cross joined output as in the first test. Here’s the M code (all created in the Power Query Editor – no custom M code required):
let Source = Table.NestedJoin( #"SourceQuery With Join Column", {"JoinColumn"}, #"SourceQuery With Join Column", {"JoinColumn"}, "SourceQuery With Join Column", JoinKind.Inner ), #"Expanded SourceQuery With Join Column" = Table.ExpandTableColumn( Source, "SourceQuery With Join Column", {"Column1"}, {"SourceQuery With Join Column.Column1"} ), #"Removed Columns" = Table.RemoveColumns( #"Expanded SourceQuery With Join Column", {"JoinColumn"} ) in #"Removed Columns"
The performance of this query was much faster:
- Progress Report End/25 Execute SQL – 0.03 seconds
- Progress Report End/17 Read Data – 0.80 seconds
So, as you can see, the second approach is the one to use.
There’s another advantage of this second approach too, if you’re using a foldable data source like SQL Server: it is possible to make query folding happen, which is of course incredibly important from a performance point of view, although you have to be careful not to change the data type of your custom column (or at least be careful how you do it). For example, here’s the M for a query that gets the cross join of the DimScenario table from the Adventure Works DW database with itself:
let Source = Sql.Databases("localhost"), AdventureWorksDW2017 = Source{[Name = "AdventureWorksDW2017"]}[Data], dbo_DimScenario = AdventureWorksDW2017{[ Schema = "dbo", Item = "DimScenario" ]}[Data], #"Removed Columns" = Table.RemoveColumns( dbo_DimScenario, {"FactFinance"} ), #"Added Custom" = Table.AddColumn( #"Removed Columns", "Custom", each 1 ), #"Merged Queries" = Table.NestedJoin( #"Added Custom", {"Custom"}, #"Added Custom", {"Custom"}, "Added Custom", JoinKind.Inner ), #"Expanded Added Custom" = Table.ExpandTableColumn( #"Merged Queries", "Added Custom", {"ScenarioKey", "ScenarioName", "Custom"}, { "Added Custom.ScenarioKey", "Added Custom.ScenarioName", "Added Custom.Custom" } ), #"Removed Other Columns" = Table.SelectColumns( #"Expanded Added Custom", { "ScenarioName", "Added Custom.ScenarioName" } ) in #"Removed Other Columns"
Here’s the SQL code generated by this query:
select [$Outer].[ScenarioName] as [ScenarioName], [$Inner].[ScenarioName2] as [Added Custom.ScenarioName] from ( select [_].[ScenarioKey] as [ScenarioKey], [_].[ScenarioName] as [ScenarioName], 1 as [Custom] from [dbo].[DimScenario] as [_] ) as [$Outer] inner join ( select [_].[ScenarioKey] as [ScenarioKey2], [_].[ScenarioName] as [ScenarioName2], 1 as [Custom2] from [dbo].[DimScenario] as [_] ) as [$Inner] on ([$Outer].[Custom] = [$Inner].[Custom2])