Chris Webb's BI Blog: Power Query

Webcast: Accessing Web Services With Power BI And Power Query


Earlier this week I gave a webcast on accessing web services with Power BI, Power Query and M on Reza Rad’s YouTube channel. You can watch it here:

It’s an introduction to the subject: I cover the basics of using Web.Contents but don’t go into all the obscure details of what each of the options for it do (most of which I have blogged about anyway). I hope you find it useful!


Testing The Performance Impact Of AzureStorage.DataLake() Options On Power BI Refresh Performance


Continuing my series on tuning the performance of importing data from ADLSgen2 into Power BI, in this post I’m going to look at the performance impact of setting some of the various options in the second parameter of the AzureStorage.DataLake() M function. In the last post in this series I showed how setting the HierarchicalNavigation option can improve refresh performance, but what about BlockSize, RequestSize or ConcurrentRequests?

Here’s what the documentation says about these options:

  • BlockSize : The number of bytes to read before waiting on the data consumer. The default value is 4 MB.
  • RequestSize : The number of bytes to try to read in a single HTTP request to the server. The default value is 4 MB.
  • ConcurrentRequests : The ConcurrentRequests option supports faster download of data by specifying the number of requests to be made in parallel, at the cost of memory utilization. The memory required is (ConcurrentRequest * RequestSize). The default value is 16.
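
With the default values, for example, that formula works out to 16 × 4MB = 64MB of memory used for concurrent requests.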

Using the same 8 million row set of csv files I have used in my previous posts and the same queries generated by the From Folder source (see this post for more details – note that in this post I am not using Synapse Serverless, just loading direct from the files), I tested various options. Here’s an example of how these options can be set:

AzureStorage.DataLake(
  "https://xyz.dfs.core.windows.net/myfolder",
  [ConcurrentRequests = 1]
)
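
For the test that combines several options (the last row of the results table below), all of them can be set in the same record. Here's a sketch, using the same placeholder URL as above:

AzureStorage.DataLake(
  "https://xyz.dfs.core.windows.net/myfolder",
  [
    // several options set at once in one record
    ConcurrentRequests = 32,
    BlockSize = 8388608,  // 8MB
    RequestSize = 8388608 // 8MB
  ]
)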

Here are the average dataset refresh times measured in the Power BI Service using Profiler:

Option                                                           Average Refresh Time (seconds)
None set – defaults used                                         67
ConcurrentRequests=1                                             70
ConcurrentRequests=32                                            67
BlockSize=1                                                      70
BlockSize=8388608 (8MB)                                          68
RequestSize=1                                                    Error (see below)
RequestSize=8388608 (8MB)                                        68
ConcurrentRequests=32, BlockSize=8388608, RequestSize=8388608    67

From these results it looks like it’s possible to make performance slightly worse in some cases but none of the configurations tested made performance better than the default settings.

There are two somewhat interesting things to note. First, this is pretty much what the developers told me to expect when I asked about these options a while ago. However I was told that there may be some scenarios where reducing the value of ConcurrentRequests can be useful to reduce the memory overhead of a Power Query query – I guess to avoid paging on the Desktop (as discussed here) or memory errors in the Power BI Service.

Second, when I set RequestSize=1 (which means that each HTTP request was only allowed to return 1 byte of data, which is a pretty strange thing to want to do) I got the following error:

Expression.Error: The evaluation reached the allowed cache entry size limit. Try increasing the allowed cache size.

This reminds me I need to do some research into how the Power Query cache works in Power BI Desktop and write that up as a post…

Overall, no major revelations here, but sometimes it’s good to know what doesn’t make any difference as much as what does.

Measuring The Performance Of AzureStorage.DataLake() Using Power Query Query Diagnostics


In my last post I showed how changing the various options on the AzureStorage.DataLake() M function didn’t have much impact on dataset refresh performance in Power BI. I’ll admit I was slightly surprised by this, but it got me wondering why this was – and so I decided to do some tests to find out.

The answer can be found using Power Query’s query diagnostics functionality. Although you can’t use it to find out what happens when a dataset refresh takes place in the Power BI Service, you can use it to view requests to web services for refreshes in Power BI Desktop, as I showed in this post. The Detailed diagnostic log query shows each request Power Query makes to get data from the ADLSgen2 API: the URLs show the names of the files being accessed, and you can also see how long each request takes, the start and end time of each request and the amount of data read (the Content Length value in the response), amongst other things:

I wrote a Power Query query to extract all this useful information and put it in a more readable format, which can then be shown in Power BI. It’s fairly rough-and-ready, but I turned it into an M function and posted the code here if you’d like to try it yourself – I haven’t done any serious testing on it though.
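
To give a rough idea of the approach, here's a minimal sketch of the kind of query involved. It assumes the Detailed diagnostics output has been loaded into a query called DetailedDiagnostics and that the column names used below (Data Source Query, Start Time, End Time) match what your version of Power BI Desktop generates, so treat it as illustrative rather than definitive:

let
  // the Detailed query diagnostics output, loaded as a query called DetailedDiagnostics
  Source = DetailedDiagnostics,
  // keep only the rows that represent requests to the ADLSgen2 API
  // (the column names here are assumptions and may vary between versions)
  #"Filtered To Requests" = Table.SelectRows(
    Source,
    each [Data Source Query] <> null
      and Text.StartsWith([Data Source Query], "https://")
  ),
  // work out how long each request took from its start and end times
  #"Added Duration" = Table.AddColumn(
    #"Filtered To Requests",
    "Request Duration (s)",
    each Duration.TotalSeconds([End Time] - [Start Time]),
    type number
  ),
  #"Kept Columns" = Table.SelectColumns(
    #"Added Duration",
    {"Data Source Query", "Start Time", "End Time", "Request Duration (s)"}
  )
in
  #"Kept Columns"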

Here’s the data I captured for a refresh in Power BI Desktop that started at 10:55:42am yesterday and ended at 10:57:33am which took 111 seconds overall. I was using the default options for AzureStorage.DataLake() and this table only shows data for the GET requests to the ADLSgen2 API that returned data:

The main thing to notice here is that the total duration of all the requests was just 5.25 seconds – less than 5% of the overall refresh time – which explains why changing the options in AzureStorage.DataLake() didn’t make much difference to dataset refresh performance. Maybe if the files were larger, or there were more of them, changing the options would make a more noticeable impact. Of course there’s a lot more happening inside both the Power Query engine and the Analysis Services engine here beyond calling the web service to get the raw data. I also ran a Profiler trace while this refresh was running (see here for how to do this) and from the point of view of the Analysis Services engine it took 104 seconds to read the data from Power Query: the ExecuteSQL Profiler event took 4.5 seconds and the ReadData event took 99.5 seconds.

Conclusion: getting raw data from ADLSgen2 only represents a small part of the time taken to refresh a dataset that uses ADLSgen2 as a source, so any attempts to tune this may not have much impact on overall refresh times.

Parquet Files In Power BI/Power Query And The “Streamed Binary Values” Error


If you’re using the new Parquet connector in Power BI there’s a chance you will run into the following error:

Parameter.Error: Parquet.Document cannot be used with streamed binary values.
Details:
[Binary]

This isn’t a bug or anything that can be fixed, so it’s important to understand why it occurs and what you can do about it.

One easy way to reproduce this problem is by trying to access a reasonably large (larger than a few MB) Parquet file stored in SharePoint, something like this:

let
  Source = SharePoint.Files(
    "https://microsoft-my.sharepoint.com/personal/abc",
    [ApiVersion = 15]
  ),
  GetFile = Source
    {
      [
        Name = "myfile.parquet",
        #"Folder Path"
          = "https://microsoft-my.sharepoint.com/personal/abc/Documents/"
      ]
    }
    [Content],
  #"Imported Parquet" = Parquet.Document(
    GetFile
  )
in
  #"Imported Parquet"

The problem is that reading data from Parquet files requires random file access, and this is something that isn’t possible in Power Query for certain data sources like SharePoint and Google Cloud Storage. This problem will never occur with locally-stored files or files stored in ADLSgen2.

There is one possible workaround but it comes with some serious limitations: buffer the Parquet file in memory using the Binary.Buffer() M function. Here’s an example of how the above query can be rewritten to do this:

let
  Source = SharePoint.Files(
    "https://microsoft-my.sharepoint.com/personal/abc",
    [ApiVersion = 15]
  ),
  GetFile = Source
    {
      [
        Name = "myfile.parquet",
        #"Folder Path"
          = "https://microsoft-my.sharepoint.com/personal/abc/Documents/"
      ]
    }
    [Content],
  #"Imported Parquet" = Parquet.Document(
    Binary.Buffer(GetFile)
  )
in
  #"Imported Parquet"

The problem with buffering files in memory like this is that it’s only feasible for fairly small files because of the limits on the amount of memory Power Query can use (see here for more information): you’re likely to get really bad performance or errors if you try to buffer files that are too large, and Parquet files are often fairly large. The best way of solving this problem is to switch to using a data source like ADLSgen2 where this problem will not happen.

[Thanks to Eric Gorelik for the information in this post]

Parquet File Performance In Power BI/Power Query


There has been a lot of excitement around the newly-added support for reading from Parquet files in Power BI. However I have to admit that I was disappointed not to see any big improvements in performance when reading data from Parquet compared to reading data from CSV (for example, see here) when I first started testing it. So, is Power Query able to take advantage of Parquet’s columnar storage when reading data?

The answer is yes, but you may need to make some changes to your Power Query queries to ensure you get the best possible performance. Using the same data that I have been using in my recent series of posts on importing data from ADLSgen2, I took a single 10.1MB Parquet file and downloaded it to my PC. Here’s what the data looked like:

I then created a query to count the number of rows in the table stored in this Parquet file where the TransDate column was 1/1/2015:

let
  Source = Parquet.Document(
    File.Contents(
      "C:\myfile.snappy.parquet"
    )
  ),
  #"Filtered Rows" = Table.SelectRows(
    Source,
    each [TransDate] = #date(2015, 1, 1)
  ),
  #"Counted Rows" = Table.RowCount(
    #"Filtered Rows"
  )
in
  #"Counted Rows"

Here’s the output:

I then used SQL Server Profiler to find out how long this query took to execute (as detailed here): on average it took 3 seconds.

Here’s what I saw in Power BI Desktop while loading the data just before refresh finished:

As you can see, Power Query is scanning all the data in the file.

I then added an extra step to the query to remove all columns except the TransDate column:

let
  Source = Parquet.Document(
    File.Contents(
      "C:\myfile.snappy.parquet"
    )
  ),
  #"Removed Other Columns"
    = Table.SelectColumns(
    Source,
    {"TransDate"}
  ),
  #"Filtered Rows" = Table.SelectRows(
    #"Removed Other Columns",
    each [TransDate] = #date(2015, 1, 1)
  ),
  #"Counted Rows" = Table.RowCount(
    #"Filtered Rows"
  )
in
  #"Counted Rows"

This version of the query only took an average of 0.7 seconds to run – a substantial improvement. This time the maximum amount of data read by Power Query was only 2.44MB:

As you can see, in this case removing unnecessary columns improved the performance of reading data from Parquet files a lot. This is not always true though – I tested a Group By transformation and in that case the Power Query engine was clever enough to only read the required columns, and manually removing columns made no difference to performance.
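
For example, a query something like the following sketch – which uses the same Parquet file and column names as above, although the exact aggregation I tested may have been different – performed the same whether or not I removed the other columns first:

let
  Source = Parquet.Document(
    File.Contents(
      "C:\myfile.snappy.parquet"
    )
  ),
  // group by TransDate without removing any other columns first -
  // the engine works out that only TransDate and NetAmount need to be read
  #"Grouped Rows" = Table.Group(
    Source,
    {"TransDate"},
    {{"TotalNetAmount", each List.Sum([NetAmount]), type number}}
  )
in
  #"Grouped Rows"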

This demonstrates that Power Query is able to take advantage of Parquet’s columnar storage to only read data from certain columns. However, this is the only performance optimisation available to Power Query on Parquet – it doesn’t do predicate pushdown or anything like that. What’s more, when reading data from the ADLSgen2 connector, the nature of Parquet storage stops Power Query from making parallel requests for data (I guess the same behaviour that is controlled by the ConcurrentRequests option) which puts it at a disadvantage compared to reading data from CSV files.

I think a lot more testing is needed to understand how to get the best performance when reading data from Parquet, so look out for more posts on this subject in the future…

[Thanks once again to Eric Gorelik from the Power Query development team for providing the information about how the Parquet connector works, and to Ben Watt and Gerhard Brueckl for asking the questions in the first place]

Bonus fact: in case you’re wondering, the following compression types are supported by the Parquet connector: GZip, Snappy, Brotli, LZ4, and ZStd.

Optimising The Performance Of Combining Data From Multiple Parquet Files In Power Query/Power BI


In all the testing I’ve done recently with importing data from Parquet files into Power BI I noticed something strange: loading data from a folder containing multiple Parquet files seemed a lot slower than I would expect, based on the time taken to load data from a single file. So I wondered – is there something that can be optimised? It turns out there is and in this blog post I’ll show you what I did.

If you import data from a folder containing Parquet files – whether it’s a local folder or a folder in ADLSgen2 storage – you’ll see a series of queries created for you in the Power Query Editor window that looks like this:

The query called Query1 shown in the screenshot iterates over all the files in the folder you’ve chosen and calls a function that reads the data from each Parquet file. It returns a table that contains a column with the name of the original source file in (which isn’t all that interesting for Parquet files) and all the columns from the Parquet files you’re combining.

Using the Parquet files from my series of posts on importing data from ADLSgen2 as a source, here’s the M code Power Query generates for this query, which I have modified to remove the column with the source file name in:

let
  Source = Folder.Files("C:\MyFolder"),
  #"Filtered Hidden Files1"
    = Table.SelectRows(
    Source,
    each [Attributes]?[Hidden]? <> true
  ),
  #"Invoke Custom Function1"
    = Table.AddColumn(
    #"Filtered Hidden Files1",
    "Transform File (3)",
    each #"Transform File (3)"(
      [Content]
    )
  ),
  #"Renamed Columns1"
    = Table.RenameColumns(
    #"Invoke Custom Function1",
    {"Name", "Source.Name"}
  ),
  #"Removed Other Columns1"
    = Table.SelectColumns(
    #"Renamed Columns1",
    {"Transform File (3)"}
  ),
  #"Expanded Table Column1"
    = Table.ExpandTableColumn(
    #"Removed Other Columns1",
    "Transform File (3)",
    Table.ColumnNames(
      #"Transform File (3)"(
        #"Sample File (3)"
      )
    )
  ),
  #"Changed Type"
    = Table.TransformColumnTypes(
    #"Expanded Table Column1",
    {
      {"TransDate", type date},
      {"GuestId", type text},
      {"ProductId", type text},
      {"NetAmount", type number}
    }
  )
in
  #"Changed Type"

Here’s the output:

On my PC this query took an average of 102 seconds to refresh.

Apart from this query being slower than I expected, I also noticed that there is a “Changed Type” step at the end – which I thought was unnecessary because unlike CSV files, Parquet has typed columns. If you connect to a single Parquet file in Power Query then it recognises the column types, so why not here? Well, it’s because of the way it’s combining files by expanding table columns, and there is a way to work around this that I blogged about here:
https://blog.crossjoin.co.uk/2017/09/25/setting-data-types-on-nested-tables-in-m/

Setting a type on the table column before expanding it did indeed improve performance, but this led me to another optimisation.

I know that using the Table.Combine M function can perform differently to the Table.ExpandTableColumn function used in the original version of the query (although it does not always perform better). Therefore I made the following change to the query above: using Table.Combine to return a single table with all the data in (note that setting a type on the table column is not necessary for this optimisation). Here’s the new version:

let
  Source = Folder.Files("C:\Myfolder"),
  #"Filtered Hidden Files1"
    = Table.SelectRows(
    Source,
    each [Attributes]?[Hidden]? <> true
  ),
  #"Invoke Custom Function1"
    = Table.AddColumn(
    #"Filtered Hidden Files1",
    "Transform File",
    each #"Transform File"([Content])
  ),
  #"Renamed Columns1"
    = Table.RenameColumns(
    #"Invoke Custom Function1",
    {"Name", "Source.Name"}
  ),
  #"Removed Other Columns1"
    = Table.SelectColumns(
    #"Renamed Columns1",
    {"Source.Name", "Transform File"}
  ),
  Combine = Table.Combine(
    #"Removed Other Columns1"[
      Transform File
    ]
  )
in
  Combine

This version of the query took, on average, 43 seconds to refresh – a massive improvement.

If you’ve been following my series on ADLSgen2 refresh you may remember that I blogged about importing from a folder of Parquet files there too: in this post I noted that it took on average 72 seconds to load the same data from an ADLSgen2 folder in the Power BI Service using the original code; that was with the Source File column in, and removing that column made no difference to performance. This new version of the query took on average 49 seconds.

The conclusion is obvious: if you need to load data from a folder of Parquet files then you should use this new approach because the performance benefits are substantial. I know what you’re thinking: does this technique work for other file types like CSV? The answer is no: because these file types don’t have typed columns like Parquet does, this approach unfortunately won’t work.

Importing Data From ADLSgen2 Into Power BI: Summary Of Findings


Over the last few months I’ve written a series of posts looking at different aspects of one question: what is the best way to import data from ADLSgen2 storage into a Power BI dataset? For example, is Parquet really better than CSV? Should you use Azure Synapse Serverless? In this post I’m going to summarise my findings and offer some recommendations – although, as always, I need to stress that these are the conclusions I can draw from my test results and not the absolute, incontrovertible “Microsoft-says” truth so please do your own testing too.

Partitioning makes import faster

Whatever other choices you make about importing data into Power BI, creating partitioned tables in your dataset is the key to getting the best refresh performance (see here for more details, and here for more thoughts/findings on the subject). However, creating partitioned tables manually adds a lot of complexity since you need to use external tools like Tabular Editor to do so; it also adds cost since you need to have Premium or PPU to get access to XMLA Endpoints for Tabular Editor to connect to. The bigger the Premium capacity SKU you use, the more partitions you’ll be able to refresh in parallel and the faster your refresh will be.

Since incremental refresh also creates partitions in the background, and incremental refresh does not require Premium or PPU, you may want to consider using it instead of creating partitions manually. However it’s nowhere near as flexible, and if you’re connecting direct to ADLSgen2 then you’d have to use the technique that Miguel Escobar describes here to ensure that query folding takes place.

Do you need to filter?

The most important question you need to ask in this whole process is this:

  1. Are you loading only some of the data from one or more files (for example by filtering on the values in one or more columns), or
  2. Are you loading all the data (ie all the rows) from one or more files?

Scenario #1 is, I guess, a classic data lake scenario where you have multiple Power BI datasets created by different users, each of which is importing a different subset of the data from the files in the lake. Scenario #2, on the other hand, will be the case when you’re creating a set of files in ADLSgen2 that contain exactly the data you want to load into a single Power BI dataset for a specific project.

If your answer is scenario #1 and you’re filtering data before you load, then you’ll get the best import performance if you store your data in Parquet files and query it through Azure Synapse Serverless (see here and here). Although Power BI can take advantage of Parquet format to a certain extent and will give you better performance if you are only importing some of the columns from a file (see here), it doesn’t do predicate pushdown. There is an additional cost associated with using Synapse Serverless, of course, but it’s very reasonably priced (see here for more details on how the pricing works).

If your answer is scenario #2 and you’re not filtering data, then you’ll get better performance (and lower costs) by connecting to your files in ADLSgen2 direct from Power BI. Using Azure Synapse Serverless isn’t necessarily a bad option but it does come with an overhead.

Combining data from multiple Parquet files can be optimised

If you are connecting direct to files in ADLSgen2 (and aren’t using Azure Synapse Serverless) and aren’t creating one partition per file then you’ll be combining data from multiple files in your Power Query M code. The code that Power Query generates automatically when you do this performs faster for CSV files than Parquet files (see here) but as I show here, with some simple changes you can create a much faster query to combine data from multiple Parquet files – although this technique does not work with CSV files.

Always use HierarchicalNavigation=true with AzureStorage.DataLake()

If you are not using Azure Synapse Serverless and are reading the data direct from ADLSgen2 using the AzureStorage.DataLake() M function then you should always set the HierarchicalNavigation=true option. As I showed in this post you can get some significant performance benefits from using this option. There are other options that you can set on AzureStorage.DataLake() but I couldn’t find any benefits from using them (see here) – probably because requesting data from ADLSgen2 is relatively fast and the act of loading the data returned into a table in your dataset is much slower (see here).
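
As a reminder, this option just goes in the record passed to the second parameter of the function – a minimal sketch with a placeholder URL:

AzureStorage.DataLake(
  "https://xyz.dfs.core.windows.net/myfolder",
  [HierarchicalNavigation = true]
)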

Attaching Common Data Model folders as Dataflows can be a user-friendly option

While connecting to tables in Azure Synapse Serverless is reasonably straightforward, connecting direct to files and folders in ADLSgen2 can be quite intimidating for inexperienced Power BI users. As a result exposing ADLSgen2 data stored in Common Data Model format by attaching it as a Dataflow may be worth considering. There’s a performance difference between doing this and connecting direct to multiple CSV files (see here) but it’s certainly a lot more user-friendly. It’s also worth noting that support for the newer CDM manifest format in Dataflows has just been announced, although Dataflows don’t support CDM folders with data stored in Parquet format yet.

Power BI Dataflow Performance, Premium Per User And The Enhanced Compute Engine


Over the years I have written a lot about Power BI/Power Query performance but it has always been in the context of loading data direct into datasets, not dataflows. A lot of cool things have been happening in dataflows recently, though, and now that Premium Per User has made Premium features available to a much wider audience I thought it would be interesting to look at an example of how PPU can help dataflow performance and specifically how and when the Enhanced Compute Engine can make dataflow refresh faster.

Using the same CSV file that I used in my posts from last year on optimising the performance of merges in Power Query, a file with one million rows and seven numeric columns named A, B, C, D, E, F and G, I created the following dataflow in a shared capacity (ie non-Premium) workspace:

The queries called First and Second are identical and just load all the data from the (same) source CSV file; they also have their load disabled. The query called Merge does an inner join between these two queries on the column called A:

The Merge query has its load enabled so it’s the only output of the dataflow; after it has joined the data it expands the nested columns returned and sets the data types on all the output columns.
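
For reference, here's a sketch of roughly what the Merge query looks like. I'm assuming the seven columns are named A to G as described above, and I'm only expanding (and typing) a couple of the nested columns to keep the example short:

let
  // inner join First and Second on column A
  Source = Table.NestedJoin(
    First,
    {"A"},
    Second,
    {"A"},
    "Second",
    JoinKind.Inner
  ),
  // expand some of the nested columns returned by the join
  // (the real query expands all of them)
  #"Expanded Second" = Table.ExpandTableColumn(
    Source,
    "Second",
    {"B", "C"},
    {"Second.B", "Second.C"}
  ),
  // set data types on the output columns
  #"Changed Type" = Table.TransformColumnTypes(
    #"Expanded Second",
    {
      {"A", Int64.Type},
      {"Second.B", Int64.Type},
      {"Second.C", Int64.Type}
    }
  )
in
  #"Changed Type"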

Refreshing this dataflow in shared capacity took on average 150 seconds.

I then moved the workspace to Premium Per User capacity and without making any changes, I refreshed again.

Refreshing the same dataflow in PPU took on average 73 seconds.

So the first finding is that moving the dataflow to PPU more than halved the refresh time, which is a pretty good result.

However, at this point the Enhanced Compute Engine is not being used – so, to enable it, I enabled loading for the First and Second queries which in turn made the Merge query a Computed Table (what used to be a Computed Entity before the recent terminology changes, indicated by the lightning bolt icon):

For a full explanation of when the Enhanced Compute Engine can and can’t be used see Matthew Roche’s blog post here; basically it loads data into a SQL-based cache which Computed Tables can then leverage which means that data access is faster and the Power Query engine can push transformations back to it via query folding. The only other change I made was to set data types on the columns in the output of First and Second.

Refreshing this new version of the dataflow in PPU took on average 90 seconds.

So performance was worse – but why? Enabling loading on First and Second means that more work is done at refresh time because their output needs to be ingested twice (once into ADLSgen2 and once into the SQL cache used by the Enhanced Compute Engine) before the Enhanced Compute Engine can access it. In this case the extra work needed to load First and Second outweighs the performance gains from using the Enhanced Compute Engine. The new metrics available from the dataflow’s Refresh History provide some insight into this (I strongly recommend you read the docs on these metrics here); here’s some of the data from one of the refresh history CSV files loaded into Excel:

In this particular case the overall refresh time of the dataflow was 88 seconds. First and Second refreshed in parallel – First taking 48 seconds and Second taking 51 seconds – and once they had both finished, Merge could refresh and only took 36 seconds to join the output of First and Second. So in this case Merge is indeed faster (36 seconds compared to 73 seconds before) as a result of using the Enhanced Compute Engine but that improvement isn’t enough to cancel out the additional time needed to load the data returned by First and Second into it.

What about a scenario where the Enhanced Compute Engine does make a positive difference? Take a look at the following dataflow, a slight variation on the dataflow above:

There are now three new tables: Output Table 1, Output Table 2 and Output Table 3. Each of these tables gets the maximum value from a different column in the table returned by Merge. Note that there are no Computed Tables in this dataflow so the Enhanced Compute Engine is not used, and that First, Second and Merge have load disabled.
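
Each of these output tables is only a couple of steps long. Here's a sketch of roughly what one of them looks like – I'm assuming it returns the maximum of column A, and each of the three real tables uses a different column:

let
  // take the table returned by the Merge query
  Source = Merge,
  // return a single-row table containing the maximum value of one column
  Output = #table(
    type table [MaxValue = number],
    {{List.Max(Source[A])}}
  )
in
  Output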

Refreshing this dataflow on PPU took on average 95 seconds.

Here are the refresh metrics for one of the refreshes:

As you can see, the three tables were refreshed in parallel and took between 84 and 93 seconds. It’s important to remember that for each of these tables the source data was loaded and the Merge query evaluated independently, which explains why they each take so long. The fact that Merge is evaluated three times when this dataflow refreshes is counter-intuitive but really important here – for more details see this post on how queries are evaluated in Power Query.

Now consider this version of the same dataflow where First, Second and Merge have their load enabled, making Merge, Output Table 1, Output Table 2 and Output Table 3 all Computed Tables.

Refreshing this dataflow on PPU took on average 88 seconds.

Not a massive improvement, but an improvement. Now look at how different the refresh metrics are:

In this case Output Table 1, Output Table 2 and Output Table 3 only take 1 second to evaluate, but that’s because they are working from data cached in the Enhanced Compute Engine – the table returned by Merge – and the transformations in them fold. The Merge table also uses data cached in the Enhanced Compute Engine: the tables returned by First and Second. What’s more, because Merge is a Computed Table it is only evaluated once in this dataflow. Loading the data for First and Second takes 52 seconds and 50 seconds respectively and Merge takes 35 seconds. In this case the hit of loading the data into the Enhanced Compute Engine is worth taking.

In conclusion, there are two things that these tests have shown:

  • Moving your dataflow to PPU can make a big difference to refresh performance.
  • The Enhanced Compute Engine can make dataflow refresh faster but not in all cases: you need to understand how it works, and in some cases the overhead of loading the data into it outweighs the performance advantages it gives you for any transformations later on. Use the information in Refresh History to work out what’s happening for your dataflow.


Power Query And Power BI Connectivity Announcements At The Microsoft Business Applications Summit


There were a lot of exciting announcements at the Microsoft Business Applications Summit this week but if you only watched the keynotes or read the recap on the Power BI blog you will have missed all the Power Query-related news in the “Data Prep in Power BI, Power Platform and Excel using Power Query” session:

https://mymbas.microsoft.com/sessions/1332f59f-a051-4a06-ae50-8f3185501a88

It covers all the new things that have happened in Power Query over the last few months such as Diagram View and, more importantly, talks about what’s going to happen in the next few months. It’s relatively short but for those of you with no time or patience, here’s a summary of the roadmap announcements:

[BTW “Power Query Online” is the browser-based version of Power Query that is used in Power BI dataflows]

My highlights are:

  • The ability to create a dataflow quickly by uploading a file to Power Query Online without needing to use a gateway to connect to a file on-premises, useful for one-time import scenarios.
  • Multi-value M parameter support – useful for dynamic M parameters and other things I can’t talk about yet 😉
  • The things that Miguel talks about regarding “easier design experiences” for Synapse are kept intentionally vague but it’s worth listening to carefully to what he says here!
  • Native SQL support for Snowflake, BigQuery and Redshift – this is really useful for anyone who wants to use DirectQuery with these databases because it will allow you to write your own SQL query and use it as the source of a table, rather than having to use a table or a view.
  • AAD based Single Sign-On support for Redshift and BigQuery (similar to what we have today for Snowflake) will also be very important for DirectQuery, because it means that the identity of the user running the report can be passed back to the database.
  • A dataflows connector for Excel Power Query – which means, at last, you’ll be able to get data from a dataflow direct into Excel. This will make a lot of Excel users very happy, I think: a lot of the time all users want is a table of data dumped to Excel, and dataflows will be a great way to provide them with that.

Last of all, the session showcases the great new home for all things Power Query – http://www.powerquery.com/ – which has lots of resources, newly-updated documentation and a blog. Make sure you check it out!

Video: Performance Tuning Power BI Dataset Refresh


The team at SQLBits have been publishing all the session recordings from their last (online) conference on their YouTube channel. There’s a lot of great content there to check out and this post is to highlight one of my sessions, “Performance tuning Power BI dataset refresh”.

In this session I look at all of the factors that can influence how long it takes to import data into Power BI and what you can do to make it faster. Topics covered include:

  • Choosing a dataset storage mode
  • The importance of good data modelling
  • How the type of data source you use affects how quickly data can load
  • Ways to measure refresh performance, such as using SQL Server Profiler and Power Query Query Diagnostics
  • Power Query options that can influence refresh times such as disabling data previews
  • Query folding in the Power Query engine
  • Vertipaq engine features that affect refresh, such as calculated columns and calculated tables
  • How dataflows can help refresh performance

A Look At Lobe – A Free, Easy-To-Use Tool For Training Machine Learning Models


A few months ago I heard about a new tool from Microsoft called Lobe which makes it easy to train machine learning models. It’s nothing to do with Power BI but I find anything to do with self-service data analytics interesting, and when I finally got round to playing with it today I thought it was so much fun that it deserved a blog post.

You can download it and learn more at https://www.lobe.ai/ and there’s a great ten minute video describing how to use it here:

The most impressive thing about it is not what it does but how it does it: a lot of tools claim to make machine learning easy for non-technical users but Lobe really is easy to use. My AI/ML knowledge is very basic but I got up and running with it extremely quickly.

To test it out I downloaded lots of pictures of English churches and trained a model to detect whether the church had a tower or a spire. After I labelled the pictures appropriately:

…Lobe was able to train the model:

I could test it inside the tool. The model was able to tell whether a church had a tower:

…or a spire:

…very reliably!

If I have one criticism it’s that when you want to use your model things get a lot more technical, at least compared to something like AI Builder for Power Apps and Power Automate, but I guess that’s because it is just a tool for training models. There have been some recent improvements here though (see this blog post) and Lobe does provide a local API for testing purposes that can be consumed in Power BI with some custom M code.

Here’s an example of how to call the local API in Power Query:

let
  Source = Folder.Files("C:\Churches"),
  #"Removed Other Columns"
    = Table.SelectColumns(
    Source,
    {"Content", "Name"}
  ),
  #"Added Custom" = Table.AddColumn(
    #"Removed Other Columns",
    "CallAPI",
    each Text.FromBinary(
      Web.Contents(

        //Insert Lobe Connect URL here                              
        "http://localhost...",
        [
          Content = Json.FromValue(
            [
              image = Binary.ToText(
                [Content],
                BinaryEncoding.Base64
              )
            ]
          ),
          Headers = [
            #"Content-Type"
              = "application/json"
          ]
        ]
      )
    )
  ),
  #"Parsed JSON"
    = Table.TransformColumns(
    #"Added Custom",
    {{"CallAPI", Json.Document}}
  ),
  #"Expanded CallAPI"
    = Table.ExpandRecordColumn(
    #"Parsed JSON",
    "CallAPI",
    {"predictions"},
    {"predictions"}
  ),
  #"Expanded predictions"
    = Table.ExpandListColumn(
    #"Expanded CallAPI",
    "predictions"
  ),
  #"Expanded predictions1"
    = Table.ExpandRecordColumn(
    #"Expanded predictions",
    "predictions",
    {"label", "confidence"},
    {"label", "confidence"}
  ),
  #"Pivoted Column" = Table.Pivot(
    #"Expanded predictions1",
    List.Distinct(
      #"Expanded predictions1"[label]
    ),
    "label",
    "confidence",
    List.Sum
  ),
  #"Changed Type"
    = Table.TransformColumnTypes(
    #"Pivoted Column",
    {
      {"Tower", type number},
      {"Spire", type number}
    }
  ),
  #"Removed Columns"
    = Table.RemoveColumns(
    #"Changed Type",
    {"Content"}
  )
in
  #"Removed Columns"

You can export models to a variety of other places for production use, including Azure Functions and Azure Machine Learning.

Definitely something to keep an eye on, especially because it will soon be able to do object detection and data classification as well as image classification.

Speed Up Power Query In Power BI Desktop By Allocating More Memory To Evaluation Containers


A really useful new Power Query performance enhancement was added to Power BI Desktop in an update to the May release via the Microsoft Store a week or so ago (if you’re not installing Power BI Desktop through the Microsoft Store you’ll have to wait for the June release I’m afraid). You can read the documentation here:

https://docs.microsoft.com/en-us/power-bi/create-reports/desktop-evaluation-configuration

However if you have just read the docs you may be wondering what these two new registry key settings actually do. In this post I’m only going to talk about one, MaxEvaluationWorkingSetInMB; I’ll leave ForegroundEvaluationContainerCount for a future post.

At various times in the past I have blogged about how, when you run a Power Query query, the query itself is executed inside a separate process called an evaluation (or mashup) container and how this process has a limit on the amount of memory it can use. Some transformations such as sorting a table, doing a group by, pivoting and unpivoting require an entire table of data to be held in memory and if these operations require more memory than the evaluation container is able to use then it starts paging and query performance gets a lot worse. This post provides more details:

https://blog.crossjoin.co.uk/2020/05/21/monitoring-power-query-memory-usage-with-query-diagnostics-in-power-bi/

Two things have now changed though. First of all, the default amount of memory available to an evaluation container in Power BI Desktop has been increased from 256MB to 432MB. This on its own will make many Power Query queries run a lot faster. Secondly, it is now possible to define how much memory an evaluation container can use yourself via the new MaxEvaluationWorkingSetInMB registry setting described in the documentation.

Here’s an example that shows how much of an impact this can have. In Power BI Desktop I created a Power Query query that reads data from a csv file with around one million rows in it and then sorts the resulting table by the values in one column:

let
  Source = Csv.Document(
    File.Contents("C:\demo.csv"), 
    [
      Delimiter  = ",", 
      Columns    = 16, 
      Encoding   = 1252, 
      QuoteStyle = QuoteStyle.None
    ]
  ), 
  #"Sorted Rows" = Table.Sort(
    Source, 
    {{"Column2", Order.Ascending}}
  )
in
  #"Sorted Rows"

Using SQL Server Profiler in the way described here, I found that the Power Query query took almost 87 seconds to start returning data and a further 19 seconds to return all the data:

What’s more, in Task Manager I could see that the evaluation container doing the work was limited to using around 423MB of RAM:

I then used Regedit to set MaxEvaluationWorkingSetInMB to 4096, giving each evaluation container a maximum of 4GB of RAM to use:

After restarting Desktop I reran the same query. This time Task Manager showed the evaluation container doing the work using around 1.2GB of RAM:

…and Profiler showed that the query started returning data after only 14 seconds and returned all the data in a further 12 seconds:

As you can see, that’s a massive performance improvement. Before you get too excited about this, though, a few things need to be made clear.

First, this setting only affects the performance of Power Query queries in Power BI Desktop. It does not affect the performance of queries in the Power BI Service, although there is another setting that (I think) will have the same effect for queries that go through an on-premises data gateway – but that’s yet another topic for a future post. So while this will make development much quicker and easier it won’t make dataset refreshes in the Power BI Service quicker.

Second, you need to be very careful when changing this setting. There’s no safety net here – you can set MaxEvaluationWorkingSetInMB to whatever value you want. When a dataset is refreshed, multiple evaluation containers may be used to handle the Power Query transformations, each of which can use the amount of memory specified by MaxEvaluationWorkingSetInMB. Since there’s a finite amount of memory on your development PC, it’s important not to set MaxEvaluationWorkingSetInMB too high: if you do, there’s a risk that Power BI will try to use more memory than you have available and bring your PC to a grinding halt. What’s more, there’s no way of knowing how much memory any given query will need without some experimentation, so my advice is that if you do change MaxEvaluationWorkingSetInMB you should only increase it by a small amount at a time, and only increase it further if you are sure you need to.

I’d love to hear how much changing this setting improves the performance of your queries. If it does prove to be useful to a large number of people I hope we can get it added to the Options dialog in Power BI Desktop (which is much more convenient than changing a registry key); I also think it would be very useful in Excel Power Query. Please leave a comment with your findings!

Speed Up Power Query In Power BI Desktop By Increasing Or Decreasing The Number Of Evaluation Containers


Last week I showed how the new MaxEvaluationWorkingSetInMB registry setting could increase the performance of memory-hungry Power Query queries in Power BI Desktop. In this post I’ll show how the other new registry setting, ForegroundEvaluationContainerCount, can also help performance. Before I carry on I recommend you read the documentation on these new registry settings if you haven’t done so already.

To illustrate the effect of this setting I created ten identical Power Query queries feeding an Import mode dataset in a new .pbix file, each of which reads data from the same 150MB CSV file, applies a filter and then counts the number of rows returned. These queries don’t require a large amount of memory but do take a couple of seconds to execute:
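
Each of the ten queries looked something like this – a sketch with a placeholder file path, column name and filter condition, since the exact filter isn't important:

let
  // placeholder file path
  Source = Csv.Document(
    File.Contents("C:\demo.csv"),
    [Delimiter = ",", Encoding = 1252, QuoteStyle = QuoteStyle.None]
  ),
  #"Promoted Headers" = Table.PromoteHeaders(
    Source,
    [PromoteAllScalars = true]
  ),
  // apply a simple filter on one of the columns ("A" is a placeholder name)
  #"Filtered Rows" = Table.SelectRows(
    #"Promoted Headers",
    each [A] <> null
  ),
  // return the row count as a single-row table so it can be loaded into the dataset
  #"Counted Rows" = #table(
    type table [RowCount = number],
    {{Table.RowCount(#"Filtered Rows")}}
  )
in
  #"Counted Rows"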

With ForegroundEvaluationContainerCount not set, refreshing the entire dataset (with background queries disabled) initially showed ten active evaluation containers:

I’m pretty sure these containers were used to determine the schemas of the tables returned (see here for more background); these were then joined by ten more containers which I assume were actually used by the refresh:

With these default settings refresh took 18 seconds according to Profiler.

With ForegroundEvaluationContainerCount set to 3:

This time there were never more than three evaluation containers active at any one time:

…and refresh took 24 seconds.

So we’ve proved that by setting ForegroundEvaluationContainerCount to a low value we can limit the amount of parallelism and, in this case, make performance worse. So why would you ever want to limit the amount of parallelism like this? The maximum amount of memory available to an evaluation container isn’t just controlled by the MaxEvaluationWorkingSetInMB registry setting; as the docs say, the effective maximum is also determined by the number of evaluation containers used. So reducing the amount of parallelism can increase the amount of memory available to each evaluation container and possibly increase performance.

I then created twenty new copies of the Power Query query, bringing the total number of queries in the pbix file to thirty, and removed the ForegroundEvaluationContainerCount registry key to go back to using the default settings. During refresh I saw that no more than twenty evaluation containers were active – as expected, because the docs state that with the default settings no more than twenty containers will be used. I’ll spare you the screenshot. Refresh took 62 seconds.

Then I set ForegroundEvaluationContainerCount to 30 and refreshed. This time I could see thirty evaluation containers being used during refresh, and refresh took 55 seconds – not a massive improvement, but an improvement that I’m pretty sure can be attributed to the increased parallelism (I suspect that there was some other bottleneck here, possibly IO).

In conclusion, the ForegroundEvaluationContainerCount registry setting is another useful tool to improve refresh performance for Import mode datasets (it’s also useful for DirectQuery but that’s something for a future post) in Power BI Desktop. Finding the optimal value to set it to is not straightforward though and is likely to involve a lot of experimentation. As always, please let me know how you get on using it.

Power BI/Power Query And Nullable Columns


Recently I’ve been asked by colleagues with various types of performance problems why Power BI is generating SQL in a particular way, and the answer has been the presence of nullable columns in the underlying database – whether it’s SQL Server, Snowflake or Databricks. Now I’m not a DBA or any kind of database tuning expert so I can’t comment on why a SQL query performs the way it does on any given platform, but what I can do is show you two examples of how the presence of nullable columns changes the way Power BI and Power Query generate SQL.

Consider the following table in a SQL Server database with a single integer column that does not allow null values:

If you connect to this table in DirectQuery mode, drag the MyNumber field into a card in a Power BI report and select the Distinct Count aggregation type:

…here’s the TSQL that is generated:

SELECT 
COUNT_BIG(DISTINCT [t0].[MyNumber])
 AS [a0]
FROM 
(
(
select [$Table].[MyNumber] as [MyNumber]
from [dbo].[NotNullableColumn] as [$Table]
)
)
 AS [t0] 

Now if you do the same thing with a table that is identical in all respects but where the MyNumber column does allow null values:

…here’s the TSQL that Power BI generates:

SELECT 
(COUNT_BIG(DISTINCT [t1].[MyNumber]) 
+ MAX(CASE WHEN [t1].[MyNumber] IS NULL THEN 1 ELSE 0 END))
 AS [a0]
FROM 
(
(
select [$Table].[MyNumber] as [MyNumber]
from [dbo].[NullableColumn] as [$Table]
)
)
 AS [t1] 

Notice the extra code in the third line of this second query that has been added to handle the possible presence of null values.

It’s not just when you’re using DirectQuery mode that you can see a difference. Let’s say you’re using Import mode and you take each of these tables and join them to themselves in the Power Query Editor like so:

Here’s the M code for this query:

let
  Source = Sql.Databases("localhost"),
  FoldingTest = Source
    {[Name = "FoldingTest"]}
    [Data],
  dbo_NotNullableColumn = FoldingTest
    {
      [
        Schema = "dbo",
        Item   = "NotNullableColumn"
      ]
    }
    [Data],
  #"Merged Queries" = Table.NestedJoin(
    dbo_NotNullableColumn,
    {"MyNumber"},
    dbo_NotNullableColumn,
    {"MyNumber"},
    "dbo_NotNullableColumn",
    JoinKind.Inner
  ),
  #"Expanded dbo_NotNullableColumn"
    = Table.ExpandTableColumn(
    #"Merged Queries",
    "dbo_NotNullableColumn",
    {"MyNumber"},
    {"dbo_NotNullableColumn.MyNumber"}
  )
in
  #"Expanded dbo_NotNullableColumn"

Joining the table with the not nullable column to itself folds and results in the following TSQL query being generated:

select [$Outer].[MyNumber] as [MyNumber],
    [$Inner].[MyNumber2] as [dbo_NotNullableColumn.MyNumber]
from [dbo].[NotNullableColumn] as [$Outer]
inner join 
(
    select [_].[MyNumber] as [MyNumber2]
    from [dbo].[NotNullableColumn] as [_]
) as [$Inner] on ([$Outer].[MyNumber] = [$Inner].[MyNumber2])

If you do the same thing with the table with the nullable column, here’s the TSQL that is generated:

select [$Outer].[MyNumber] as [MyNumber],
    [$Inner].[MyNumber2] as [dbo_NullableColumn.MyNumber]
from [dbo].[NullableColumn] as [$Outer]
inner join 
(
    select [_].[MyNumber] as [MyNumber2]
    from [dbo].[NullableColumn] as [_]
) as [$Inner] on ([$Outer].[MyNumber] = [$Inner].[MyNumber2] 
or [$Outer].[MyNumber] is null and [$Inner].[MyNumber2] is null)

Once again you can see how the SQL generated for an operation on a nullable column is different to the SQL generated for an operation on a non-nullable column. Whether one SQL query performs significantly better or worse than the other is something you need to test.

The last thing to say is that there is no supported way in Power BI or Power Query to treat a nullable column as if it was not nullable. If you have a nullable column and the extra SQL to handle those nulls results in a performance problem then your only option is to alter the design of your table and make the column not nullable.

Data Type Conversions For SQL Server Sources And Query Folding In Power Query


It’s surprisingly easy to stop query folding happening in Power Query by changing the data type of a column. This is mentioned in the docs here, and it’s something several people have blogged about already (for example here). However there is something new to note: an option that will allow you to convert text columns to number or date columns in a foldable way for SQL Server data sources.

Consider the following table in a SQL Server database that consists of a single nvarchar(50) column containing numeric values:

Here’s an M query that converts this column into a numeric column and which folds:

let
  Source = Sql.Databases(
    "localhost",
    [UnsafeTypeConversions = true]
  ),
  FoldingTest1 = Source
    {[Name = "FoldingTest"]}
    [Data],
  dbo_NumberFoldingTest = FoldingTest1
    {
      [
        Schema = "dbo",
        Item   = "NumberFoldingTest"
      ]
    }
    [Data],
  #"Added Custom" = Table.AddColumn(
    dbo_NumberFoldingTest,
    "ConvertedNumber",
    each Number.From([NumberAsText]),
    Int64.Type
  )
in
  #"Added Custom"

Here’s the output of the query, where a new custom column called ConvertedNumber contains the converted numeric values:

Here’s the resulting SQL generated by Power Query:

select [_].[NumberAsText] as [NumberAsText],
    convert(float, [_].[NumberAsText]) as [ConvertedNumber]
from [dbo].[NumberFoldingTest] as [_]

There are three important things to point out about the M query above:

  1. I have set the (relatively new) UnsafeTypeConversions property on the Sql.Databases function to true
  2. In the custom column I have used the Number.From function to convert the text in the NumberAsText column to numbers
  3. I have used the optional third parameter of Table.AddColumn to set the data type of the new custom column to the Int64 type

All three of these things are necessary to get a properly typed numeric column in your Power Query query – if you vary from this too much then folding won’t happen.

It’s also possible to use this technique to convert text to datetime values. Here’s another SQL Server table, this time with dates stored in an nvarchar(50) column:

Here’s another M query that does the conversion and folds:

let
  Source = Sql.Databases(
    "localhost",
    [UnsafeTypeConversions = true]
  ),
  FoldingTest = Source
    {[Name = "FoldingTest"]}
    [Data],
  dbo_DateFoldingTest = FoldingTest
    {
      [
        Schema = "dbo",
        Item   = "DateFoldingTest"
      ]
    }
    [Data],
  #"Added Custom" = Table.AddColumn(
    dbo_DateFoldingTest,
    "ConvertedDate",
    each DateTime.From([DateAsText]),
    type datetime
  )
in
  #"Added Custom"

And here’s the resulting SQL:

select [_].[DateAsText] as [DateAsText],
    convert(datetime2, [_].[DateAsText]) as [ConvertedDate]
from [dbo].[DateFoldingTest] as [_]

Why, you ask, is this new property on Sql.Databases called “UnsafeTypeConversions”? As the name suggests, it allows you to do something that is potentially unsafe. Consider this SQL Server table that has an nvarchar(50) column containing some numeric values and one non-numeric value:

If you connect to this table and set the data type on this column to be Whole Number using the dropdown in the column header (the normal way to change the data type of a column), something like the M code below will be generated:

let
  Source = Sql.Databases("localhost"),
  FoldingTest = Source
    {[Name = "FoldingTest"]}
    [Data],
  dbo_NumberFoldingErrorsTest
    = FoldingTest
    {
      [
        Schema = "dbo",
        Item = "NumberFoldingErrorsTest"
      ]
    }
    [Data],
  #"Changed Type"
    = Table.TransformColumnTypes(
    dbo_NumberFoldingErrorsTest,
    {{"MixedTextNumbers", Int64.Type}}
  )
in
  #"Changed Type"

Here’s the output of this query:

Note how this query returns four rows and the third row contains the error value shown.

If, however, you try to use the UnsafeTypeConversions approach here using something like the following M:

let
  Source = Sql.Databases(
    "localhost",
    [UnsafeTypeConversions = true]
  ),
  FoldingTest = Source
    {[Name = "FoldingTest"]}
    [Data],
  dbo_NumberFoldingErrorsTest
    = FoldingTest
    {
      [
        Schema = "dbo",
        Item = "NumberFoldingErrorsTest"
      ]
    }
    [Data],
  #"Added Custom" = Table.AddColumn(
    dbo_NumberFoldingErrorsTest,
    "ConvertedToNumber",
    each Number.From([MixedTextNumbers]),
    Int64.Type
  )
in
  #"Added Custom"

You get the following result:

Notice now that there is an error value in both columns and, more importantly, only three rows are returned – the fourth has been lost. So, if you are going to use the UnsafeTypeConversions option you need to be 100% sure that it will work and that you don’t have problems with your data quality.

[Thanks to Curt Hagenlocher for the information in this post]


Dynamic M Parameters, Snowflake Native SQL And Paginated Reports


There were a few new features and enhancements to existing features in the June 2021 Power BI Desktop release that don’t seem to have much to do with each other but which I think can be combined to do cool things. They are:

  1. The new paginated report visual
  2. Native SQL support in the Snowflake connector
  3. Improvements to dynamic M parameters

Let me give you an example of what I mean…

First of all, let’s start with native SQL support in the Snowflake connector. I deal with a lot of customers who use Snowflake and Power BI together and I know just how much people have wanted this. What does it allow you to do? Well, you have always been able to use the Power Query Editor to transform data coming from Snowflake in either Import mode or DirectQuery mode. Now, though, you can write your own native SQL query and use it as the source for a Power Query query (something that has always been possible with some other connectors, such as the SQL Server connector). Incidentally, this also means that the EnableFolding=true option for Value.NativeQuery that I blogged about recently now works for Snowflake too.

The main reason you’d want to use a native SQL query when connecting to Snowflake, or indeed any database, is to do something that’s possible in SQL but not in Power Query. One example of this is to use regular expressions to filter data. I have the AdventureWorks DW DimCustomer table loaded into Snowflake and I can use Snowflake’s REGEXP function to filter on the LASTNAME column something like this:

SELECT 
DISTINCT FIRSTNAME, LASTNAME, ENGLISHOCCUPATION 
FROM "AWORKS"."PUBLIC"."DIMCUSTOMER" 
WHERE LASTNAME REGEXP 'To.*'

So that’s useful. I can use a query like this as the source of a table in DirectQuery mode in Power BI, but wouldn’t it be useful if end users of my report could change the regular expression used to filter the data? This is where dynamic M parameters come in. Assuming I have a table of pre-defined regular expressions:

And an M parameter:

…I can write an M query like this that uses the M parameter to return the regular expression used in the WHERE clause of the SQL query:

let
  Source = Value.NativeQuery(
    Snowflake.Databases(
      "mysnowflake.com", 
      "DEMO_WH"
    ){[Name = "AWORKS"]}[Data], 
    "SELECT DISTINCT FIRSTNAME, LASTNAME, ENGLISHOCCUPATION 
    FROM ""AWORKS"".""PUBLIC"".""DIMCUSTOMER"" 
    WHERE LASTNAME REGEXP '"
      & pRegEx
      & "'", 
    null, 
    [EnableFolding = true]
  )
in
  Source

…and then turn this into a dynamic M parameter in the Power BI diagram pane:

…and get a report that does this:

One limitation of dynamic M parameters in regular Power BI reports today is that the values you pass into them have to come from a column somewhere inside your dataset, so all of these values have to be pre-defined. Wouldn’t it be useful if the end user could enter any regular expression that they wanted though? That may not be possible in a regular Power BI report but it is possible with a paginated report, because with paginated reports you can write whatever DAX query you want – and therefore pass any value you want to a dynamic M parameter – and also, in a paginated report, you have the option of creating parameters where the user can enter whatever value they want.

I blogged about how to write DAX queries that contain dynamic M parameters here. Here’s an example of a parameterised DAX query (yes, I know, so many types of parameters…) that takes a regular expression and the name of an occupation and returns a table of customers whose last names match the regular expression and whose occupations match the one entered:

DEFINE
    MPARAMETER pRegEx = @DAXRegExParam
EVALUATE
FILTER (
    Customers,
    'Customers'[ENGLISHOCCUPATION] = @DAXOccupationParam
)

This can be used in a paginated report dataset connected to the Power BI dataset created above (yes, I know, so many types of datasets…) like so:

…which can then be used to build a paginated report that does this:

And of course, with the new paginated report visual, this paginated report can be embedded in a regular Power BI report:

All this is very much a proof-of-concept and not something I would recommend for production (I would be worried about SQL injection attacks for a start). There are more enhancements to these features still to come too. However, I do think it’s interesting to see how these features can be put together now and to imagine how they could be used in the future. What do you think?
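
As a very rough illustration of the kind of problem involved, the sketch below doubles up any single quotes in the pRegEx parameter value before it gets concatenated into the native SQL query – to be clear, this is just a sketch of one issue, not proper protection against SQL injection:

let
  // Double up any single quotes in the parameter value before concatenating it
  // into the SQL query text - a crude mitigation only, not real security
  SafeRegEx = Text.Replace(pRegEx, "'", "''"),
  QueryText
    = "SELECT DISTINCT FIRSTNAME, LASTNAME, ENGLISHOCCUPATION "
      & "FROM ""AWORKS"".""PUBLIC"".""DIMCUSTOMER"" "
      & "WHERE LASTNAME REGEXP '" & SafeRegEx & "'"
in
  QueryText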

How Query Folding And The New Power BI Dataflows Connector Can Help Dataset Refresh Performance


You may have noticed that a new dataflows connector was announced in the August 2021 release of Power BI Desktop, and that it now supports query folding between a dataset and a dataflow – which you may be surprised to learn was not possible before. In this post I thought I'd take a look at how much of an improvement this can make to dataset refresh performance.

For my tests I created a new PPU workspace and a dataflow, and made sure the Enhanced Compute Engine was turned on for the dataflow on the Settings page:

Query folding will only happen if the Enhanced Compute Engine is set to “On”, and won’t happen with the “Optimized” setting. The Enhanced Compute Engine is only available with PPU and Premium.

For my data source I used a CSV file with a million rows and seven integer columns. I then created two tables in my dataflow like so:

The Source table simply connects to the CSV file, uses the first row as the headers, then sets the data type on each column. The second table, called Output – which contains no transformations at all – is needed for the data to be stored in the Enhanced Compute Engine, and the lightning icon in the top-left corner of the table in the diagram shows this is the case.
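
To give you an idea, a Source table like this might look something like the following sketch – the file path is a placeholder and I'm assuming the seven integer columns are called A to G (A being the column filtered on later in this post):

let
  Source = Csv.Document(
    // Hypothetical path - substitute the location of your own CSV file
    File.Contents("C:\Data\MillionRows.csv"),
    [Delimiter = ",", Encoding = 65001, QuoteStyle = QuoteStyle.None]
  ),
  PromotedHeaders = Table.PromoteHeaders(Source, [PromoteAllScalars = true]),
  ChangedTypes = Table.TransformColumnTypes(
    PromotedHeaders,
    {
      {"A", Int64.Type}, {"B", Int64.Type}, {"C", Int64.Type},
      {"D", Int64.Type}, {"E", Int64.Type}, {"F", Int64.Type},
      {"G", Int64.Type}
    }
  )
in
  ChangedTypes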

Next, in Power BI Desktop, I created a Power Query query that used the old Power BI dataflows connector:

If you have any existing datasets that connect to dataflows, this is the connector you will have used – it is based on the PowerBI.Dataflows function. My query connected to the Output table and filtered the rows to where column A is less than 100. Here’s the M code, slightly edited to remove all the ugly GUIDs:

let
    Source = PowerBI.Dataflows(null),
    ws = Source{[workspaceId="xxxx"]}[Data],
    df = ws{[dataflowId="yyyy"]}[Data],
    Output1 = df{[entity="Output"]}[Data],
    #"Filtered Rows" = Table.SelectRows(Output1, each [A] < 100)
in
    #"Filtered Rows"

Remember, this connector does not support query folding. Using this technique to measure how long the query ran when its results were loaded into the dataset, I could see it took almost 12.5 seconds to get the data:

In fact the performance in Power BI Desktop is even worse: while refresh was taking place, I could see Power BI downloading 108MB of data even though the original source file is only 54MB.

Why is the data downloaded twice? I strongly suspect it's because of this issue – because, of course, no query folding is happening. As a result, the experience in Desktop is even worse than the refresh time above suggests.

I then created the same query with the new dataflows connector:

This connector uses the PowerPlatform.Dataflows function; it’s not new, but what is new is that you can now access Power BI dataflows using it.

Here’s the M code, again cleaned up to remove GUIDS:

let
    Source = PowerPlatform.Dataflows(null),
    Workspaces = Source{[Id="Workspaces"]}[Data],
    ws = Workspaces{[workspaceId="xxxx"]}[Data],
    df = ws{[dataflowId="yyyy"]}[Data],
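    // Note that the record key for the new connector includes a version field as well as the entity name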
    Output_ = df{[entity="Output",version=""]}[Data],
    #"Filtered Rows" = Table.SelectRows(Output_, each [A] < 100)
in
    #"Filtered Rows"

When this query was loaded into the dataset, it only took 4 seconds:

This is a lot faster, and Power BI Desktop was a lot more responsive during development too.

It’s reasonable to assume that query folding is happening in this query and the filter on [A]<100 is now taking place inside the Enhanced Compute Engine rather than in Power BI Desktop. But how can you be sure query folding is happening? The “View Native Query” option is greyed out, but of course this does not mean that query folding is not happening. However, if you use Query Diagnostics, hidden away in the Data Source Query column of the detailed diagnostics query, you can see a SQL query with the WHERE clause you would expect:

In conclusion, you can see that the new dataflows connector can give you some big improvements in dataset refresh performance and a much better development experience in Power BI Desktop. Query folding support also means that you can now use dataset incremental refresh when using a dataflow as a source. However, you will need to use Premium or PPU, you may need to make some changes to your dataflow to make sure it can take advantage of the Enhanced Compute Engine, and you will also need to update any existing Power Query queries to use the new connector. I think the potential performance gains make these changes worth it though. If you do make these changes in your dataflows and find that it helps, please leave a comment!

Connecting To REST APIs With OAuth2 Authentication In Power Query/Power BI


There are a lot of articles and blog posts out there on how to handle OAuth2 authentication when connecting to REST APIs from Power Query in Power BI. However, there is also a lot of confusion and contradictory information, so in this post I want to give you the definitive, Microsoft-endorsed answer to this question, which is:

If you want to connect from Power BI to a REST API that uses OAuth2 authentication then you need to build a custom connector. You can find documentation on how to implement an OAuth2 flow in a custom connector here.

The only exception is that you can connect to some APIs that use AAD authentication using the built-in web or OData connectors, as documented here.

A quick web search will turn up several examples of how to implement an OAuth2 credential flow in regular Power Query queries without needing a custom connector. This is not recommended: it's not secure and it's not reliable. In particular, hard-coding usernames/passwords or client IDs/client secrets in your M code is a really bad idea. What's more, requesting a new token every time a query runs isn't great either.

Unfortunately Excel Power Query doesn’t support custom connectors at the time of writing. Also, if you use a custom connector in the Power BI Service then you’ll need to use an on-premises gateway. Finally, there’s an article here explaining why it isn’t easy to connect Power BI to the Microsoft Graph API.

[Thanks to Curt Hagenlocher and Matt Masson for the information in this post]

Preserving Data Types With SQL Queries In Power Query And Power BI


My post earlier this year on enabling query folding when using SQL queries as a data source in Power Query provoked a lot of interest. This post adds one more useful detail: how to preserve the original data types of the columns in your query when using this technique with SQL Server-related sources.

Consider the DimDate table in the AdventureWorksDW2017 sample database for SQL Server:

Notice that the FullDateAlternateKey column has the data type Date.

If you connect to this table in the normal way, by selecting it in the Navigation pane when you connect to your SQL Server instance, the M code for your Power Query query will look something like this:

let
  Source = Sql.Databases("localhost"), 
  AdventureWorksDW2017 = Source
    {[Name = "AdventureWorksDW2017"]}
    [Data], 
  dbo_DimDate = AdventureWorksDW2017
    {[Schema = "dbo", Item = "DimDate"]}
    [Data]
in
  dbo_DimDate

Unsurprisingly, the FullDateAlternateKey column in the Power Query query also has a data type of Date, as indicated by the calendar icon on the left side of the column header in the Power Query Editor:

However, if you use Value.NativeQuery to run a SQL query to get the same data and set EnableFolding=true, like so:

let
  Source = Sql.Databases("localhost"), 
  AdventureWorksDW2017 = Source
    {[Name = "AdventureWorksDW2017"]}
    [Data], 
  Q = Value.NativeQuery(
    AdventureWorksDW2017, 
    "Select * From DimDate", 
    null, 
    [EnableFolding = true]
  )
in
  Q

…you’ll see that the FullDateAlternateKey column comes through as a DateTime type instead:

The same thing would happen with a column of type Time too, ie it would come through as a DateTime.

If you want the types in the output of Value.NativeQuery to match the types in the output of the first Power Query query above there’s an extra option you need to add: PreserveTypes=true.

let
  Source = Sql.Databases("localhost"), 
  AdventureWorksDW2017 = Source
    {[Name = "AdventureWorksDW2017"]}
    [Data], 
  Q = Value.NativeQuery(
    AdventureWorksDW2017, 
    "Select * From DimDate", 
    null, 
    [
      PreserveTypes = true, 
      EnableFolding = true
    ]
  )
in
  Q

In the output of this query, FullDateAlternateKey has the data type Date again:

This option is only available for the SQL Server connector (and connectors related to it) at the time of writing.

[Thanks to Curt Hagenlocher for this information]

Three New Power BI/Power Query Books


I decided to stop writing book reviews here on my blog a long time ago: it’s a lot of work to read a book and write a proper, detailed review and what’s more I don’t like the idea of writing a bad review and upsetting someone who has gone to all the effort of writing a book. That said, from time to time I get given free copies of books (which I’m always happy to receive – I like to see how other people go about explaining Power BI concepts and functionality) and in return I give the authors some free publicity here. Recently I received two copies of new books from people that I know:

Expert data modeling with Power BI, by Soheil Bakhshi (Buy it here on Amazon UK)

Soheil is an MVP whose blog I have read and admired for some time so I’m pleased to see he has written a book. It’s an important subject too: good data modelling is key to success with Power BI, and the problems of many customers I work with stem from not taking the time to learn how data should be modelled for Power BI. This book introduces you to concepts like dimensional modelling and star schemas and shows you how to build datasets that follow best practices. It also covers topics such as calculation groups and object-level security that won’t be in older books.

Power Query cookbook, by Andrea Janicijevic (Buy it here on Amazon UK)

Andrea is a colleague of mine at Microsoft and of course Power Query is a technology close to my heart. This book follows the cookbook format which teaches through a series of worked examples and easy-to-follow steps; anyone learning Power Query will find it useful to follow these recipes to get practice creating queries. I liked the inclusion of Power BI Dataflows as well as Power Query in Power BI Desktop, and again this book has the advantage of being new – it covers recently-added features such as Schema View and Diagram View in Dataflows and Query Diagnostics in Power BI Desktop that won’t be covered in other books.

There’s another book I was curious about and was lucky enough to be able to read via Microsoft’s online library for employees:

Pro Power BI theme creation, by Adam Aspin (Buy it here on Amazon UK)

When I heard that someone had written a book about Power BI theme files I couldn't believe it, but Adam is an experienced writer and he has pulled it off. As you might expect it's everything you ever wanted to learn about Power BI themes and, as such, if themes are something you're interested in you should read this book. It explains how theme files are structured, how to edit them and how the various attributes are applied to different visuals.
