I saw an interesting post the other day on the Power Query TechNet forum which showed how the List.Buffer() function could be used to improve calculation performance. This is something I’d seen hinted at in other places, so I thought it was worth a bit of investigation.
Consider the following query:
    let
        //Connect to SQL Server
        Source = Sql.Database("localhost", "adventure works dw"),
        //Get first 2000 rows from FactInternetSales
        dbo_FactInternetSales = Table.FirstN(
            Source{[Schema="dbo",Item="FactInternetSales"]}[Data],
            2000),
        //Remove unwanted columns
        RemoveColumns = Table.SelectColumns(
            dbo_FactInternetSales,
            {"SalesOrderLineNumber", "SalesOrderNumber", "SalesAmount"}),
        //Get sorted list of values from SalesAmount column
        RankValues = List.Sort(RemoveColumns[SalesAmount], Order.Descending),
        //Calculate ranks
        AddRankColumn = Table.AddColumn(RemoveColumns, "Rank",
            each List.PositionOf(RankValues, [SalesAmount]) + 1)
    in
        AddRankColumn
It gets the first 2000 rows from the FactInternetSales table in the Adventure Works DW database, removes most of the columns, and adds a custom column that shows the rank of the current row based on its Sales Amount.
On my laptop it takes around 35 seconds to run this query – pretty slow, in my opinion, given the amount of data in this table.
However, using the List.Buffer() function in the RankValues step like so:
    let
        //Connect to SQL Server
        Source = Sql.Database("localhost", "adventure works dw"),
        //Get first 2000 rows from FactInternetSales
        dbo_FactInternetSales = Table.FirstN(
            Source{[Schema="dbo",Item="FactInternetSales"]}[Data],
            2000),
        //Remove unwanted columns
        RemoveColumns = Table.SelectColumns(
            dbo_FactInternetSales,
            {"SalesOrderLineNumber", "SalesOrderNumber", "SalesAmount"}),
        //Get sorted list of values from SalesAmount column
        //And buffer them!
        RankValues = List.Buffer(List.Sort(RemoveColumns[SalesAmount], Order.Descending)),
        //Calculate ranks
        AddRankColumn = Table.AddColumn(RemoveColumns, "Rank",
            each List.PositionOf(RankValues, [SalesAmount]) + 1)
    in
        AddRankColumn
makes the query run in just 2 seconds. The List.Buffer() function stores the sorted list of values used to calculate the rank in memory, which means it will only be evaluated once; in the original query it seems as though this step and those before it are being evaluated multiple times. Curt Hagenlocher’s comment (on this thread) on what List.Buffer() does for a similar calculation is telling:
The reason for this is that M is both functional and lazy, so unless we buffer the output of List.Select, we’re really just building a query that needs to be evaluated over and over. This is similar to the Enumerable functions in LINQ, if you’re familiar with those.
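If you want to try the pattern without a SQL Server connection, here is a minimal sketch of the same buffered rank calculation over an in-memory table. The 2000 values are made up purely for illustration, and I would not assume the timings quoted above carry over, since they come from a query against a database source.

```powerquery
let
    // Hypothetical stand-in for the SalesAmount column: 2000 made-up values
    Amounts = List.Transform({1..2000}, each Number.Mod(_ * 37, 500)),
    Source = Table.FromColumns({Amounts}, {"SalesAmount"}),
    // Sort the values once and hold the resulting list in memory
    RankValues = List.Buffer(List.Sort(Source[SalesAmount], Order.Descending)),
    // Each row looks up its position in the buffered list
    AddRankColumn = Table.AddColumn(Source, "Rank",
        each List.PositionOf(RankValues, [SalesAmount]) + 1)
in
    AddRankColumn
```

Removing the List.Buffer() call from the RankValues step gives the unbuffered version for comparison.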
Table.Buffer() and Binary.Buffer() functions also exist, and do similar things.
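As a rough sketch of how Table.Buffer() might be used in the same spirit, the query below buffers a small lookup table before it is filtered once per row in a custom column. The table contents and the step names (Prices, BufferedPrices, Orders) are hypothetical; whether buffering actually helps will depend on where the lookup table comes from.

```powerquery
let
    // Hypothetical lookup table; in practice this might come from an external source
    Prices = Table.FromRecords({
        [Product = "A", Price = 10],
        [Product = "B", Price = 25]
    }),
    // Hold the lookup table in memory so it is only evaluated once
    BufferedPrices = Table.Buffer(Prices),
    // Hypothetical table that needs a value from the lookup table on every row
    Orders = Table.FromRecords({
        [Product = "B", Quantity = 2],
        [Product = "A", Quantity = 5]
    }),
    // For each order row, find the matching row in the buffered table
    AddPrice = Table.AddColumn(Orders, "Price",
        each Table.SelectRows(BufferedPrices, (r) => r[Product] = [Product]){0}[Price])
in
    AddPrice
```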
A few other points to make:
- This is not necessarily the optimal way to calculate ranks in Power Query – it’s just an example of how List.Buffer() can be used.
- In the first query above, query folding is not taking place. If it had been, performance would probably have been better. Because List.Buffer() explicitly prevents query folding, in many cases using it could make performance worse rather than better.
- I’m 100% certain you’ll get much better performance for a rank calculation by loading the table to the Excel Data Model/Power Pivot and writing the calculation in DAX. You should only really do calculations like this in Power Query if they are needed for other transformations in your query.
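On the first point in the list above: recent versions of the M library include a Table.AddRankColumn() function, which may be a simpler way to get the same kind of rank if it is available in your version of Power Query. The sketch below uses a small made-up table rather than FactInternetSales, and sets RankKind.Competition explicitly so that tied values share a rank, as they do with the List.PositionOf() approach.

```powerquery
let
    // Hypothetical sample rows standing in for the sales table
    Source = Table.FromRecords({
        [SalesOrderNumber = "A", SalesAmount = 300],
        [SalesOrderNumber = "B", SalesAmount = 100],
        [SalesOrderNumber = "C", SalesAmount = 300]
    }),
    // Rank by SalesAmount, highest first; ties get the same rank
    AddRankColumn = Table.AddRankColumn(Source, "Rank",
        {"SalesAmount", Order.Descending},
        [RankKind = RankKind.Competition])
in
    AddRankColumn
```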