In the last two posts in this series I showed how removing columns from the tables used in a Power Query merge operation can improve refresh performance. In this post I’ll show you a different trick that – when you can use it – can give you an equally significant performance boost.
When you merge data from two queries in the Power Query Editor the M code generated uses the Table.NestedJoin function. There is, however, another M function that can be used to merge data: Table.Join. The interesting thing about this function is that has a parameter that Table.NestedJoin doesn’t have: the joinAlgorithm parameter allows you to specify the algorithm used by the Power Query engine for the merge. The documentation doesn’t tell you what the possible values for this parameter are but #shared shows them:
The SortMerge algorithm, last in the list above, is the focus of this blog post. I mentioned in my earlier posts that the reason that merge operations on non-foldable data sources are often slow is that both of the tables used in the merge need to be held in memory. There is an exception though: if you know that the data in the columns used to join the two tables is sorted in ascending order, you can use the Table.Join function and the SortMerge algorithm and the data from both sources can be streamed rather than held in memory, which in turn results in the merge being much faster.
Here’s an example. As before there are two source queries that take data from a one million row CSV file, but for this test no columns are removed and there’s no filter on rows. The two source queries, called First and Second, are almost identical (in First the columns are named A1 to G1 and in Second the columns are named A2 to G2) and are not loaded into the dataset. Here’s the M code for the query called First:
let Source = Csv.Document( File.Contents("C:\Users\chwebb\Documents\NumbersMoreColumns.csv"), [Delimiter = ",", Columns = 7, Encoding = 65001, QuoteStyle = QuoteStyle.None] ), #"Promoted Headers" = Table.PromoteHeaders(Source, [PromoteAllScalars = true]), #"Changed Type" = Table.TransformColumnTypes( #"Promoted Headers", { {"A", Int64.Type}, {"B", Int64.Type}, {"C", Int64.Type}, {"D", Int64.Type}, {"E", Int64.Type}, {"F", Int64.Type}, {"G", Int64.Type} } ), #"Renamed Columns" = Table.RenameColumns( #"Changed Type", {{"A", "A1"}, {"B", "B1"}, {"C", "C1"}, {"D", "D1"}, {"E", "E1"}, {"F", "F1"}, {"G", "G1"}} ) in #"Renamed Columns"
Here’s a query that merges these queries using Table.NestedJoin and returns all columns from the source queries and is enabled:
let Source = Table.NestedJoin(First, {"A1"}, Second, {"A2"}, "Second", JoinKind.Inner), #"Expanded Second" = Table.ExpandTableColumn( Source, "Second", {"A2", "B2", "C2", "D2", "E2", "F2", "G2"}, {"Second.A2", "Second.B2", "Second.C2", "Second.D2", "Second.E2", "Second.F2", "Second.G2"} ) in #"Expanded Second"
The timings for refreshing the merge query are:
- Progress Report End/25 Execute SQL – 54 seconds
- Progress Report End/17 Read Data – 58 seconds
[As I mentioned before, these timings may vary by a few seconds each way from run to run because paging is taking place]
Now it just so happens that in this case I know the data in the A columns in both the source queries is sorted in ascending order, so this means I can rewrite the merge query using Table.Join and the SortMerge algorithm like so:
let Source = Table.Join( First, {"A1"}, Second, {"A2"}, JoinKind.Inner, JoinAlgorithm.SortMerge ) in Source
Here are the timings for this version:
- Progress Report End/25 Execute SQL – 0.1 seconds
- Progress Report End/17 Read Data – 21 seconds
This new query is clearly much faster than the original version of the merge!
Now let’s talk about the limitations of this approach. First of all, if the data in the columns used to join the two tables together is not sorted in ascending order you won’t get an error message, you’ll just get incorrect data returned, so you really need to be sure that the data is indeed sorted. Secondly, if your data is not sorted, then you can sort it in Power Query before the merge – but since sorting itself takes time and sorting for non-foldable data sources is another one of those operations which requires the table to be held in memory, you’re unlikely to get any performance improvement.
That said, I can see that there are going to be plenty of scenarios where you can use this technique. For example, if you’re extracting data to a CSV file it may be possible to specify that you want to sort the data when the extract takes place. Also, if you’re trying to merge data from two different data sources (say, SQL Server and Oracle) that both support query folding for sorting, then the time it takes to sort the data may be less than the performance gain from using the SortMerge algorithm.