Over the last few months I’ve written a series of posts looking at different aspects of one question: what is the best way to import data from ADLSgen2 storage into a Power BI dataset? For example, is Parquet really better than CSV? Should you use Azure Synapse Serverless? In this post I’m going to summarise my findings and offer some recommendations – although, as always, I need to stress that these are the conclusions I can draw from my test results and not the absolute, incontrovertible “Microsoft-says” truth, so please do your own testing too.
Partitioning makes import faster
Whatever other choices you make about importing data into Power BI, creating partitioned tables in your dataset is the key to getting the best refresh performance (see here for more details, and here for more thoughts/findings on the subject). However, creating partitioned tables manually adds a lot of complexity since you need to use external tools like Tabular Editor to do so; it also adds cost since you need to have Premium or PPU to get access to XMLA Endpoints for Tabular Editor to connect to. The bigger the Premium capacity SKU you use, the more partitions you’ll be able to refresh in parallel and the faster your refresh will be.
Since incremental refresh also creates partitions in the background, and does not require Premium or PPU, you may want to consider using it instead of creating partitions manually. However, it's nowhere near as flexible, and if you're connecting direct to ADLSgen2 then you'd have to use the technique that Miguel Escobar describes here to ensure that query folding takes place; a sketch of that kind of file-filtering pattern is shown below.
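To give a flavour of what that involves, here's a minimal sketch (not Miguel's exact code) of the general pattern: filter the list of files returned by AzureStorage.DataLake() on their modified date using the RangeStart and RangeEnd parameters that incremental refresh requires, so that files outside the refresh window are never downloaded. The storage account URL is a placeholder and I'm assuming the connector returns a "Date modified" column for each file:

```
// A minimal sketch, assuming the RangeStart and RangeEnd datetime parameters
// used by incremental refresh already exist, that the (placeholder) URL
// points at a folder containing only data files, and that the file list
// exposes a "Date modified" column
let
    Source = AzureStorage.DataLake(
        "https://myaccount.dfs.core.windows.net/mycontainer/sales",
        [HierarchicalNavigation = true]
    ),
    // Only keep files whose modified date falls inside the refresh window,
    // so files outside it are never downloaded or parsed
    FilteredFiles = Table.SelectRows(
        Source,
        each [#"Date modified"] >= RangeStart and [#"Date modified"] < RangeEnd
    )
in
    FilteredFiles
```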
Do you need to filter?
The most important question you need to ask in this whole process is this:
- Are you loading only some of the data from one or more files (for example by filtering on the values in one or more columns), or
- Are you loading all the data (ie all the rows) from one or more files?
Scenario #1 is, I guess, a classic data lake scenario where you have multiple Power BI datasets created by different users, each of which is importing a different subset of the data from the files in the lake. Scenario #2, on the other hand, will be the case when you’re creating a set of files in ADLSgen2 that contain exactly the data you want to load into a single Power BI dataset for a specific project.
If your answer is scenario #1 and you’re filtering data before you load, then you’ll get the best import performance if you store your data in Parquet files and query it through Azure Synapse Serverless (see here and here). Although Power BI can take advantage of the Parquet format to a certain extent, and will give you better performance if you are only importing some of the columns from a file (see here), it doesn’t do predicate pushdown. There is an additional cost associated with using Synapse Serverless, of course, but it’s very reasonably priced (see here for more details on how the pricing works).
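To make that concrete, here's a hedged sketch of the kind of M query you could use: it passes a native SQL query to the Synapse Serverless SQL endpoint with the filter included, so only the matching rows ever come back to Power BI. The endpoint name, database, storage path and column names are all placeholders rather than anything from my tests:

```
// A sketch only: the server name, database, storage path and column names
// below are placeholders
let
    Source = Sql.Database(
        "myworkspace-ondemand.sql.azuresynapse.net",
        "SalesDB",
        [Query =
            "SELECT TransactionDate, ProductKey, SalesAmount " &
            "FROM OPENROWSET(BULK 'https://myaccount.dfs.core.windows.net/data/sales/*.parquet', " &
            "FORMAT = 'PARQUET') AS Sales " &
            // The WHERE clause is evaluated by Serverless, not by Power BI,
            // so only the filtered rows are returned over the wire
            "WHERE TransactionDate >= '2021-01-01'"
        ]
    )
in
    Source
```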
If your answer is scenario #2 and you’re not filtering data, then you’ll get better performance (and lower costs) by connecting to your files in ADLSgen2 direct from Power BI. Using Azure Synapse Serverless isn’t necessarily a bad option but it does come with an overhead.
Combining data from multiple Parquet files can be optimised
If you are connecting direct to files in ADLSgen2 (and aren’t using Azure Synapse Serverless) and aren’t creating one partition per file then you’ll be combining data from multiple files in your Power Query M code. The code that Power Query generates automatically when you do this performs faster for CSV files than Parquet files (see here) but as I show here, with some simple changes you can create a much faster query to combine data from multiple Parquet files – although this technique does not work with CSV files.
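To illustrate the general approach (this is a simplified sketch rather than the exact query from that post, and the storage account URL and folder are placeholders), the idea is to turn each file's binary content into a table with Parquet.Document and then append those tables directly, bypassing the helper queries that Power Query generates for you:

```
// Simplified sketch: assumes the (placeholder) folder in the URL contains
// only Parquet files, no subfolders, and that they all share the same schema
let
    Source = AzureStorage.DataLake(
        "https://myaccount.dfs.core.windows.net/mycontainer/sales",
        [HierarchicalNavigation = true]
    ),
    // Read each file's binary Content column into a table
    ParquetTables = List.Transform(Source[Content], each Parquet.Document(_)),
    // Append all the per-file tables into a single table
    Combined = Table.Combine(ParquetTables)
in
    Combined
```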
Always use HierarchicalNavigation=true with AzureStorage.DataLake()
If you are not using Azure Synapse Serverless and are reading the data direct from ADLSgen2 using the AzureStorage.DataLake() M function, then you should always set the HierarchicalNavigation=true option. As I showed in this post you can get some significant performance benefits from using this option. There are other options that you can set on AzureStorage.DataLake() but I couldn’t find any benefit from using them (see here), probably because requesting data from ADLSgen2 is relatively fast and the act of loading the data returned into a table in your dataset is much slower (see here).
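For reference, here's a minimal example showing where the option goes (the storage account URL is a placeholder); the other options mentioned above would be set in the same record:

```
// Minimal example: HierarchicalNavigation is set in the options record
// passed as the second parameter of AzureStorage.DataLake (placeholder URL)
let
    Source = AzureStorage.DataLake(
        "https://myaccount.dfs.core.windows.net/mycontainer",
        [HierarchicalNavigation = true]
    )
in
    Source
```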
Attaching Common Data Model folders as Dataflows can be a user-friendly option
While connecting to tables in Azure Synapse Serverless is reasonably straightforward, connecting direct to files and folders in ADLSgen2 can be quite intimidating for inexperienced Power BI users. As a result, exposing ADLSgen2 data stored in Common Data Model format by attaching it as a Dataflow may be worth considering. There’s a performance difference between doing this and connecting direct to multiple CSV files (see here) but it’s certainly a lot more user-friendly. It’s also worth noting that support for the newer CDM manifest format in Dataflows has just been announced, although Dataflows don’t support CDM folders with data stored in Parquet format yet.