Merging (or, in SQL terms, joining) tables in Power Query is a common cause of refresh performance problems. I’ve often wondered whether there’s anything you can do to optimise the performance of merges for non-foldable data sources, so I decided to run some tests to try to answer all the questions I had. In this series of posts I’ll tell you what I found.
For these tests the only data source I used was a CSV file with one million rows and seven numeric columns named A, B, C, D, E, F and G:
I used SQL Server Profiler to measure the amount of time taken for a query to execute using the technique I blogged about here. If you read that post (and I strongly recommend you do) you’ll see there are actually two Profiler events whose duration is significant when measuring refresh performance:
- Progress Report End/25 Execute SQL
- Progress Report End/17 Read Data
It also turns out that these two events provide some insight into how certain transformations are handled in the Power Query engine, but we’ll come to that later.
The first question I decided to investigate was this:
Does the number of columns in a table affect the performance of a merge?
First of all, I created two identical queries called First and Second that connected to the CSV file, used the first row from the file as the headers, and set the data types of all seven columns to Whole Number. Nothing very special, but here’s the M code all the same:
```
let
    Source = Csv.Document(
        File.Contents("C:\NumbersMoreColumns.csv"),
        [Delimiter = ",", Columns = 7, Encoding = 65001, QuoteStyle = QuoteStyle.None]
    ),
    #"Promoted Headers" = Table.PromoteHeaders(Source, [PromoteAllScalars = true]),
    #"Changed Type" = Table.TransformColumnTypes(
        #"Promoted Headers",
        {
            {"A", Int64.Type}, {"B", Int64.Type}, {"C", Int64.Type}, {"D", Int64.Type},
            {"E", Int64.Type}, {"F", Int64.Type}, {"G", Int64.Type}
        }
    )
in
    #"Changed Type"
```
I disabled these queries so that they were not loaded into the dataset.
Next, I created a third query that used the Table.NestedJoin function to merge the data from these two queries using an inner join and return all of the columns from both source queries:
```
let
    Source = Table.NestedJoin(First, {"A"}, Second, {"A"}, "Second", JoinKind.Inner),
    #"Expanded Second" = Table.ExpandTableColumn(
        Source,
        "Second",
        {"A", "B", "C", "D", "E", "F", "G"},
        {"Second.A", "Second.B", "Second.C", "Second.D", "Second.E", "Second.F", "Second.G"}
    )
in
    #"Expanded Second"
```
When I refreshed this query, in Profiler the two events I mentioned above had the following durations:
- Progress Report End/25 Execute SQL – 40 seconds
- Progress Report End/17 Read Data – 56 seconds
Pretty slow. But what is performance like when you merge two tables with one column instead of seven?
To test this, I added an extra step to the First and Second queries that removed all but the A column (the one needed for the merge), like so:
```
let
    Source = Csv.Document(
        File.Contents("C:\NumbersMoreColumns.csv"),
        [Delimiter = ",", Columns = 7, Encoding = 65001, QuoteStyle = QuoteStyle.None]
    ),
    #"Promoted Headers" = Table.PromoteHeaders(Source, [PromoteAllScalars = true]),
    #"Changed Type" = Table.TransformColumnTypes(
        #"Promoted Headers",
        {
            {"A", Int64.Type}, {"B", Int64.Type}, {"C", Int64.Type}, {"D", Int64.Type},
            {"E", Int64.Type}, {"F", Int64.Type}, {"G", Int64.Type}
        }
    ),
    #"Removed Other Columns" = Table.SelectColumns(#"Changed Type", {"A"})
in
    #"Removed Other Columns"
```
I then updated the third query that contained the merge to reflect this change:
```
let
    Source = Table.NestedJoin(First, {"A"}, Second, {"A"}, "Second", JoinKind.Inner),
    #"Expanded Second" = Table.ExpandTableColumn(Source, "Second", {"A"}, {"Second.A"})
in
    #"Expanded Second"
```
When this query was refreshed, Profiler showed the following durations:
- Progress Report End/25 Execute SQL – 9 seconds
- Progress Report End/17 Read Data – 1 second
This query is a lot quicker, but then I thought: what if the performance is more to do with the size of the table returned by the query than with the merge itself? So I added an extra step to the end of the merge query, like so:
```
let
    Source = Table.NestedJoin(First, {"A"}, Second, {"A"}, "Second", JoinKind.Inner),
    #"Expanded Second" = Table.ExpandTableColumn(Source, "Second", {"A"}, {"Second.A"}),
    #"Counted Rows" = Table.RowCount(#"Expanded Second")
in
    #"Counted Rows"
```
…and then reran the two tests above. My thinking was that, since the merge query now only returns a single value, the amount of data returned by the query should not be a factor in the duration of the queries.
Here are the timings for the version with the merge on the tables with seven columns:
- Progress Report End/25 Execute SQL – 56 seconds
- Progress Report End/17 Read Data – 0 seconds
Here are the timings for the version with the merge on the tables with just one column:
- Progress Report End/25 Execute SQL – 14 seconds
- Progress Report End/17 Read Data – 0 seconds
This does seem to confirm that the number of columns in a table affects the performance of a merge, although of course it might be that it takes longer to count the rows of a table that has more columns.
This shows something else too: when the query returns only a row count the Read Data event takes no time at all, whereas in the first two tests it accounted for a noticeable share of the duration – in the seven-column case it took even longer than the Execute SQL event.
Why does the number of columns influence the performance of a merge? If you read my recent post on monitoring memory usage with Query Diagnostics, you’ll remember that merges have to take place in memory – so I guess the larger the tables involved in the merge, the more memory is needed and the more paging takes place if the 256MB limit is exceeded. Looking at the performance counter data generated for the last two queries showed that the 256MB limit was indeed exceeded in both cases, but while the version joining the one-column tables had a maximum commit of 584MB, the maximum commit for the version joining the seven-column tables was almost 3GB.
That’s enough for now. At this point I think it’s fair to say the following:
Removing any unwanted columns before you merge two tables in Power Query will improve refresh performance.
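To make that takeaway concrete, here’s a minimal sketch of the pattern based on the queries used above. It assumes the same First and Second queries and the A key column from this post’s examples; in your own model you would keep whichever columns you actually need downstream:

```
// Sketch only: keep just the columns you need on each side before merging,
// then expand only the columns you need afterwards.
// First, Second and column A are the example queries/columns from this post.
let
    // Remove unwanted columns before the merge
    FirstKeysOnly = Table.SelectColumns(First, {"A"}),
    SecondKeysOnly = Table.SelectColumns(Second, {"A"}),
    // Merge the two narrow tables on the key column
    Merged = Table.NestedJoin(FirstKeysOnly, {"A"}, SecondKeysOnly, {"A"}, "Second", JoinKind.Inner),
    // Expand only the columns that are actually required
    Expanded = Table.ExpandTableColumn(Merged, "Second", {"A"}, {"Second.A"})
in
    Expanded
```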