Following on from my last two posts on Power Query performance, I thought I would write about another trick that can make a big difference to your query execution times. It’s actually something that Ehren from the dev team mentioned in a forums thread that I blogged about here but it deserves to be better known.
Here’s one of the almost-optimised queries from my previous posts:
let Source = Csv.Document( File.Contents( "C:\Users\chwebb\Downloads\pp-2018.csv" ), [Delimiter=",", Columns=16, Encoding=1252, QuoteStyle=QuoteStyle.None]), #"Changed Type" = Table.TransformColumnTypes( Source, {{"Column2", Int64.Type}} ), #"Sorted Rows" = Table.Sort( #"Changed Type", {{"Column2", Order.Descending}} ), Column2 = #"Sorted Rows"[Column2]{0} in Column2
This query returns a number (the maximum value in the column called Column2 of the table loaded in from a csv file) and takes about 5 seconds to run. Now consider the following query, which is the same as the previous query but with only the last line different:
let Source = Csv.Document( File.Contents( "C:\Users\chwebb\Downloads\pp-2018.csv" ), [Delimiter=",", Columns=16, Encoding=1252, QuoteStyle=QuoteStyle.None]), #"Changed Type" = Table.TransformColumnTypes( Source, {{"Column2", Int64.Type}} ), #"Sorted Rows" = Table.Sort( #"Changed Type", {{"Column2", Order.Descending}} ), Column2 = #"Sorted Rows"[Column2]{0} in Column2 + Column2 + Column2 + Column2
This query returns the number returned by the previous query added together four times. As you would expect – and hope – this query also takes about 5 seconds to run. Now consider this query:
let Source = Csv.Document( File.Contents( "C:\Users\chwebb\Downloads\pp-2018.csv" ), [Delimiter=",", Columns=16, Encoding=1252, QuoteStyle=QuoteStyle.None]), #"Changed Type" = Table.TransformColumnTypes( Source, {{"Column2", Int64.Type}} ), #"Sorted Rows" = Table.Sort( #"Changed Type", {{"Column2", Order.Descending}} ) in #"Sorted Rows"[Column2]{0} + #"Sorted Rows"[Column2]{0} + #"Sorted Rows"[Column2]{0} + #"Sorted Rows"[Column2]{0}
Instead of taking the value #”Sorted Rows”[Column2]{0} and storing it in the variable Column2 then adding Column2 four times, I’m adding the expression #”Sorted Rows”[Column2]{0} together four times. The query returns the same number as the previous query. However this query takes 20 seconds to run! Why?
As well as the brief remarks at the end of the post I mentioned earlier, more details can be found in section 3.6 of the Power Query Language Specification on Immutability:
Once a value has been calculated, it is immutable, meaning it can no longer be changed. This simplifies the model for evaluating an expression and makes it easier to reason about the result since it is not possible to change a value once it has been used to evaluate a subsequent part of the expression. For instance, a record field is only computed when needed. However, once computed, it remains fixed for the lifetime of the record. Even if the attempt to compute the field raised an error, that same error will be raised again on every attempt to access that record field.
An important exception to the immutable-once-calculated rule applies to list and table values. Both have streaming semantics. That is, repeated enumeration of the items in a list or the rows in a table can produce varying results. Streaming semantics enables the construction of M expressions that transform data sets that would not fit in memory at once.
In the second query above the variable Column2 returns a number and after that number has been assigned to Column2 it cannot be changed. However in the last line of the third query when I add #”Sorted Rows”[Column2]{0} together four times, #”Sorted Rows” is a variable that returns a table which means it gets evaluated four times during the addition.
In summary, if you’re getting a table or list from a data source and you want to use a value inside that table or list multiple times, store it in a variable in a let expression before you do so.