You may have noticed that a new dataflows connector was announced in the August 2021 release of Power BI Desktop, and that it now supports query folding between a dataset and a dataflow – which you may be surprised to learn was not possible before. In this post I thought I’d take a look at how much of an improvement in performance this can make to dataset refresh performance.
For my tests I created a new PPU workspace and a dataflow, and made sure the Enhanced Compute Engine was turned on for the dataflow on the Settings page:
Query folding will only happen if the Enhanced Compute Engine is set to “On”, and won’t happen with the “Optimized” setting. The Enhanced Compute Engine is only available with PPU and Premium.
For my data source I used a CSV file with a million rows in and seven integer columns. I then created two tables in my dataflow like so:
The Source table simply connects to the CSV file, uses the first row as the headers, then sets the data type on each column. The second table called Output – which contains no tranformations at all – is needed for the data to be stored in the Enhanced Compute Engine, and the lightning icon in the top-left corner of the table in the diagram shows this is the case.
Next, in Power BI Desktop, I created a Power Query query that used the old Power BI dataflows connector:
If you have any existing datasets that connect to dataflows, this is the connector you will have used – it is based on the PowerBI.Dataflows function. My query connected to the Output table and filtered the rows to where column A is less than 100. Here’s the M code, slightly edited to remove all the ugly GUIDs:
let
Source = PowerBI.Dataflows(null),
ws = Source{[workspaceId="xxxx"]}[Data],
df = ws{[dataflowId="yyyy"]}[Data],
Output1 = df{[entity="Output"]}[Data],
#"Filtered Rows" = Table.SelectRows(Output1, each [A] < 100)
in
#"Filtered Rows"
Remember, this connector does not support query folding. Using this technique to measure how long the query ran when the results from the query were loaded into the dataset, I could see it took almost 12.5 seconds to get the data for this query:
In fact the performance in Desktop is worse: when refresh was taking place, I could see Power BI downloading 108MB of data even though the original source file is only 54MB.
Why is the data downloaded twice? I strongly suspect it’s because of this issue – because, of course, no query folding is happening. So the performance in Desktop is really even worse.
I then created the same query with the new dataflows connector:
This connector uses the PowerPlatform.Dataflows function; it’s not new, but what is new is that you can now access Power BI dataflows using it.
Here’s the M code, again cleaned up to remove GUIDS:
let
Source = PowerPlatform.Dataflows(null),
Workspaces = Source{[Id="Workspaces"]}[Data],
ws = Workspaces{[workspaceId="xxxx"]}[Data],
df = ws{[dataflowId="yyyy"]}[Data],
Output_ = df{[entity="Output",version=""]}[Data],
#"Filtered Rows" = Table.SelectRows(Output_, each [A] < 100)
in
#"Filtered Rows"
When this query was loaded into the dataset, it only took 4 seconds:
This is a lot faster, and Power BI Desktop was a lot more responsive during development too.
It’s reasonable to assume that query folding is happening in this query and the filter on [A]<100 is now taking place inside the Enhanced Compute Engine rather than in Power BI Desktop. But how can you be sure query folding is happening? The “View Native Query” option is greyed out, but of course this does not mean that query folding is not happening. However, if you use Query Diagnostics, hidden away in the Data Source Query column of the detailed diagnostics query, you can see a SQL query with the WHERE clause you would expect:
In conclusion, you can see that the new dataflows connector can give you some big improvements for dataset refresh performance and a much better development experience in Power BI Desktop. Query folding support also means that you can now use dataset incremental refresh when using a dataflow as a source. However, you will need to use Premium or PPU, you may also need to make some changes to your dataflow to make sure it can take advantage of the Enhanced Compute Engine, and you will also need to update any existing Power Query queries to use the new connector. I think the potential performance gains are worth making these changes though. If you do make these changes in your dataflows and find that it helps, please leave a comment!