Channel: Chris Webb's BI Blog: Power Query

Optimise The Performance Of Reading Data From ADLSgen2 In Power BI With The HierarchicalNavigation Option


Last year Marco Russo wrote a very useful blog post pointing out the performance problems you can run into when connecting to data stored in ADLSgen2 from Power BI when there are a large number of files elsewhere in the container. You can read that post here:

https://www.sqlbi.com/blog/marco/2020/05/29/optimizing-access-to-azure-data-lake-storage-adls-gen-2-in-power-query/

Marco’s advice, which is 100% correct, is that you should either pass the full path to the folder that you want to connect to in the initial call to AzureStorage.DataLake() or, if you’re connecting to a single file, pass the path to the file itself. This avoids the overhead of reading metadata from files you’re not interested in, which can be quite considerable.
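As a rough sketch of the single-file case, the query below passes the full file path to AzureStorage.DataLake(). The account, container, folder and file names are the placeholders used in the example later in this post, and reading the first row with {0} assumes the function returns a one-row table when it is given a file path; check that in the Power Query Editor before relying on it.

```powerquery
let
  // Placeholder URL: replace with the path to a file in your own storage account
  Source = AzureStorage.DataLake(
    "https://xyz.dfs.core.windows.net/MyContainer/ParentFolder/aSales.csv"
  ),
  // Assumption: with a file path the table has a single row, so take the first one
  FileContent = Source{0}[Content],
  // Parse the binary content as CSV, using the same settings as the example below
  #"Imported CSV" = Csv.Document(
    FileContent,
    [Delimiter = ",", Encoding = 1252, QuoteStyle = QuoteStyle.None]
  ),
  #"Promoted Headers" = Table.PromoteHeaders(
    #"Imported CSV",
    [PromoteAllScalars = true]
  )
in
  #"Promoted Headers"
```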

There are some scenarios where this advice doesn’t work, though. There is another way to avoid this overhead and make reading data much faster: the HierarchicalNavigation option of the AzureStorage.DataLake() function. I blogged about what this option does some time ago, but at the time I didn’t realise the performance benefits of using it:

https://blog.crossjoin.co.uk/2019/09/29/hierarchical-navigation-adlsgen2-power-bi/

Consider the following scenario. Let’s say you want to connect to a CSV file called aSales.csv in a folder called ParentFolder, which also contains a subfolder called ManySmallFiles that holds many (in this example 20,000) other files that you’re not interested in.

[I’m only going to connect to a single file here to keep the example simple; I know I could just connect directly to the file rather than the folder and avoid the performance overhead that way]

Here’s the M code generated by the Power Query Editor using the default options to get the contents of the aSales.csv file:

let
  Source = AzureStorage.DataLake(
    "https://xyz.dfs.core.windows.net/MyContainer/ParentFolder"
  ),
  Navigate = Source{
    [
      #"Folder Path" = "https://xyz.dfs.core.windows.net/MyContainer/ParentFolder/",
      Name = "aSales.csv"
    ]
  }[Content],
  #"Imported CSV" = Csv.Document(
    Navigate,
    [
      Delimiter  = ",",
      Columns    = 2,
      Encoding   = 1252,
      QuoteStyle = QuoteStyle.None
    ]
  ),
  #"Promoted Headers" = Table.PromoteHeaders(
    #"Imported CSV",
    [PromoteAllScalars = true]
  ),
  #"Changed Type" = Table.TransformColumnTypes(
    #"Promoted Headers",
    {
      {"Product", type text},
      {"Sales", Int64.Type}
    }
  )
in
  #"Changed Type"

In Power BI Desktop, refreshing the table that this M query returns (even with the Allow Data Preview To Download In The Background option turned off) takes 23 seconds. I measured refresh time using a stopwatch, starting when I clicked the refresh button and ending when the refresh dialog disappeared. This is a lot longer than the refresh time that you might see using the Profiler technique I have blogged about previously, but as a developer this is the refresh time that you’ll care about.

The problem here is the Source step, which returns a list of all the files in both the ParentFolder folder and the ManySmallFiles subfolder, even though only aSales.csv is needed.
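One way to get a feel for how much the Source step returns is to count its rows. This is a minimal sketch using the same placeholder URL as the query above; the RowCount step name is my own and is not part of the generated code.

```powerquery
let
  // Same call as the Source step in the query above, with default options
  Source = AzureStorage.DataLake(
    "https://xyz.dfs.core.windows.net/MyContainer/ParentFolder"
  ),
  // Counts every row that the Source step returns
  RowCount = Table.RowCount(Source)
in
  RowCount
```

If the count is far larger than the number of files you actually need, the listing itself is a likely place for the refresh time to be going.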

Now, here’s an M query that returns the same data but where the HierarchicalNavigation=true option is set:

let
  Source = AzureStorage.DataLake(
    "https://xyz.dfs.core.windows.net/MyContainer/ParentFolder",
    // The only functional change from the previous query
    [HierarchicalNavigation = true]
  ),
  Navigation = Source{
    [
      #"Folder Path" = "https://xyz.dfs.core.windows.net/MyContainer/ParentFolder/",
      Name = "aSales.csv"
    ]
  }[Content],
  #"Imported CSV" = Csv.Document(
    Navigation,
    [
      Delimiter  = ",",
      Columns    = 2,
      Encoding   = 1252,
      QuoteStyle = QuoteStyle.None
    ]
  ),
  #"Promoted Headers" = Table.PromoteHeaders(
    #"Imported CSV",
    [PromoteAllScalars = true]
  ),
  #"Changed Type" = Table.TransformColumnTypes(
    #"Promoted Headers",
    {
      {"Product", type text},
      {"Sales", Int64.Type}
    }
  )
in
  #"Changed Type"

This takes just 3 seconds to refresh in Power BI Desktop – a really big improvement.
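To check what the Source step returns once the option is set, you could look at just the columns used for navigation. This is only a sketch with the same placeholder URL; the TopLevelItems step name is mine, and my assumption is that the improvement comes from the contents of the subfolder no longer being listed at this step.

```powerquery
let
  Source = AzureStorage.DataLake(
    "https://xyz.dfs.core.windows.net/MyContainer/ParentFolder",
    [HierarchicalNavigation = true]
  ),
  // Keep only the two columns that the queries above use to navigate to the file
  TopLevelItems = Table.SelectColumns(Source, {"Name", "Folder Path"})
in
  TopLevelItems
```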

Conclusion: always use the HierarchicalNavigation=true option in AzureStorage.DataLake() when connecting to data in ADLSgen2 storage from Power BI to get the best refresh performance and the best developer experience in Power BI Desktop.

