In my last post I demonstrated how the size of a table affects the performance of Power Query merge operations on non-foldable data sources in Power BI. Specifically, I showed that removing columns from the tables involved in a merge before the merge took place improved performance. But does it matter when you remove the columns? Is it enough to only select the columns you need when you expand the nested table returned by a merge, for example, or just to remove columns after the merge step? So, today’s question is:
Does it make a difference to Power Query merge performance if you remove unwanted columns from your source tables in the step before the merge or in the step afterwards?
The first problem I had to deal with when testing was that, as I said in my previous post, even when I reduced the number of columns in the million-row table I was using all my queries were exceeding the 256MB container size limit and paging was taking place. This meant that any refresh timings I took in Profiler varied by up to 1-2 seconds from run to run. When you’re comparing queries that take around 50 seconds with queries that take around 10 seconds a bit of variability in the timings from run to run doesn’t matter much. However, if you’re trying to compare two different test scenarios whose performance may not differ that much anyway it becomes more of a problem.
To deal with this I reduced the number of rows in the tables I was using in my merge to 300,000 rows. This kept container memory usage under 256MB and reduced the variability in timings to a minimum. Here are the average timings across multiple runs for the last scenario tested in my previous post – where I removed all but one column in my source tables before the merge and then counted the rows in the table returned by the merge – with these 300,000 row source tables:
- Progress Report End/25 Execute SQL – 2.4 seconds
- Progress Report End/17 Read Data – 0 seconds
I then changed my source queries so they again returned all seven columns and changed the merge query so that it removed all but the two A columns in a Removed Columns step after the step containing the merge, like so:
let Source = Table.NestedJoin(First, {"A"}, Second, {"A"}, "Second", JoinKind.Inner), #"Expanded Second" = Table.ExpandTableColumn( Source, "Second", {"A", "B", "C", "D", "E", "F", "G"}, {"Second.A", "Second.B", "Second.C", "Second.D", "Second.E", "Second.F", "Second.G"} ), #"Removed Other Columns" = Table.SelectColumns(#"Expanded Second", {"A", "Second.A"}), #"Counted Rows" = Table.RowCount(#"Removed Other Columns") in #"Counted Rows"
The average timings for this version of the query were identical to those from the previous test:
- Progress Report End/25 Execute SQL – 2.4 seconds
- Progress Report End/17 Read Data – 0 seconds
Last of all, I removed all but column A from the source query called First, did not remove any columns from the source query called Second, and then only expanded column A in the #”Expanded Second” step in the merge query:
let Source = Table.NestedJoin(First, {"A"}, Second, {"A"}, "Second", JoinKind.Inner), #"Expanded Second" = Table.ExpandTableColumn(Source, "Second", {"A"}, {"Second.A"}), #"Counted Rows" = Table.RowCount(#"Expanded Second") in #"Counted Rows"
Here are the average timings from this test:
- Progress Report End/25 Execute SQL – 2.6 seconds
- Progress Report End/17 Read Data – 0 seconds
These timings were slightly slower, but I don’t think the 200ms increase is really significant.
Therefore, from these tests, I think it’s safe to conclude that:
Removing unwanted columns in a step immediately after a merge results in the same performance benefits as removing unwanted columns in a step immediately before a merge
These results may surprise some of you. Why does removing columns after a merge perform so well? The answer is that just because you remove columns from a table in the step after a merge, it doesn’t mean that the Power Query engine actually removes the columns after it has performed the merge. On the contrary, it’s clever enough to realise that if you remove columns in the step after a merge then there’s no need for those columns to be present at the point where it performs the merge. You probably know how query folding works with certain data sources – well, it turns out that something similar to query folding happens within the Power Query engine whatever data source you’re using. The best explanation of this behaviour – which is fundamental to understanding how the Power Query engine works – can be found in the section on “Streaming Semantics” in this blog post by Ben Gribaudo in his excellent series on M:
https://bengribaudo.com/blog/2018/02/28/4391/power-query-m-primer-part5-paradigm
Of course there may be scenarios where this does not happen, or doesn’t happen efficiently, so I still recommend removing unwanted columns as early as possible in your queries just to be safe.