Speed Up Data Refresh Performance In Power BI Desktop Using Table.View

It can sometimes be frustrating to work with slow data sources or complex Power Query queries in Power BI Desktop: you open the Power Query Editor, make some changes, click Close & Apply and then wait a loooong time for your data to refresh. In this post I’m going to show you a technique that can cut this wait by up to 50%. It involves some fairly complex M code but I promise you, the effort is worth it!

In fact, what I’m going to describe is more or less what I showed towards the end of my appearance on Guy In A Cube last year and in a few other posts, but at that time I didn’t understand properly why it worked or what the performance implications actually were. Now, thanks to a lot of help from Curt Hagenlocher of the Power Query development team I have all the details I need to blog about it.

Let’s see a simple example. Say you have a large JSON file – for this test I generated one that is 67MB containing random data using this handy online tool – and you want to load it into Power BI. You’ll end up with a query that looks something like this:

let
  Source = Json.Document(File.Contents("C:\generated.json")),
  #"Converted to Table" = Table.FromList(
    Source, 
    Splitter.SplitByNothing(), 
    null, 
    null, 
    ExtraValues.Error
  ),
  #"Expanded Column1" = Table.ExpandRecordColumn(
    #"Converted to Table", 
    "Column1", 
    {"_id", "index", "guid", "isActive", "balance"}, 
    {"_id", "index", "guid", "isActive", "balance"}
  ),
  #"Changed Type" = Table.TransformColumnTypes(
    #"Expanded Column1", 
    {
      {"_id", type text}, 
      {"index", Int64.Type}, 
      {"guid", type text}, 
      {"isActive", type text}, 
      {"balance", type text}
    }
  )
in
  #"Changed Type"

Here’s what the output of this query looks like:

[Screenshot: query output showing the _id, index, guid, isActive and balance columns]

On my laptop this query takes around 7-8 seconds to load. Using Process Monitor to see how much data is being read from the file (see this post for more details) shows that Power Query is reading the data from this file twice. This graph, generated using data from Process Monitor, has time on the x axis and amount of data read from the file on the y axis, and the two peaks indicate that the data is being read twice:

[Graph: cumulative data read from the JSON file over time, showing two peaks because the file is read twice]
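If you want to reproduce this kind of analysis yourself, one option is to load the Process Monitor trace back into Power Query. The sketch below is only a starting point, not the exact approach from the post linked above: it assumes you've filtered the trace to ReadFile events on the file in question, exported it as C:\procmon.csv, and that the size of each read appears in the Detail column in text like "Length: 65,536".

let
  // Path and column names below are assumptions about the Process Monitor export
  Source = Csv.Document(
    File.Contents("C:\procmon.csv"), 
    [Delimiter = ",", Encoding = 65001]
  ),
  PromotedHeaders = Table.PromoteHeaders(Source, [PromoteAllScalars = true]),
  // Keep only file read events
  ReadsOnly = Table.SelectRows(PromotedHeaders, each [Operation] = "ReadFile"),
  // Pull the read size out of text like "Offset: 0, Length: 65,536, I/O Flags: ..."
  AddedBytesRead = Table.AddColumn(
    ReadsOnly, 
    "BytesRead", 
    each Number.FromText(
      Text.Replace(Text.BetweenDelimiters([Detail], "Length: ", ", "), ",", "")
    ), 
    Int64.Type
  ),
  // Total bytes read during the refresh - roughly twice the file size
  // when the zero-row schema check forces a second read
  TotalBytesRead = List.Sum(AddedBytesRead[BytesRead])
in
  TotalBytesRead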

Why? When you refresh a table in Power BI Desktop two things happen:

  • First, Power BI has to check what columns are present in the table. To do this, it runs the query for the table but asks the Power Query engine to filter the table so it returns zero rows (conceptually this is like the snippet shown after this list).
  • Second, the query is run again, this time with no filter so that all rows are returned; this is when the data is actually loaded into the table in Power BI.
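To make that first step concrete: the schema check behaves roughly like wrapping your query in a top-0 filter. This is a sketch of the idea rather than the exact request Power BI sends:

let
  // SchemaCheck is a hypothetical name, used here only to illustrate
  // what the zero-row filter asks the engine for
  SchemaCheck = Table.FirstN(#"Changed Type", 0)
in
  // A table with all five columns but no rows. Without query folding
  // (or the Table.View trick shown below), evaluating this still runs
  // the whole query, including reading the JSON file
  SchemaCheck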

It’s on this first run of the query where problems can occur. If you’re using a data source like SQL Server and you’re not doing any complex transformations then query folding will take place, so Power BI’s request to return the table but with no rows can be handled very efficiently. However, if you’re using a data source where query folding is not possible (such as Excel, CSV or JSON files) or you have complex transformations that stop folding taking place then the only way Power Query can work out what columns the query returns is by running the entire query. That’s what is happening in the example above.
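For comparison, here's what a foldable source looks like. This is a minimal sketch with hypothetical server, database and table names; the point is that when the zero-row filter is applied to a query like this, folding turns it into something like SELECT TOP (0) ... on the server, so the schema check is almost free:

let
  // Hypothetical server and database names, for illustration only
  Source = Sql.Database("MyServer", "MyDatabase"),
  Sales = Source{[Schema = "dbo", Item = "Sales"]}[Data],
  // Because Table.FirstN folds against SQL Server, this evaluates as
  // a TOP (0) query on the server rather than reading the whole table
  ZeroRows = Table.FirstN(Sales, 0)
in
  ZeroRows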

This only happens in Power BI Desktop, though. When you refresh a dataset in the Power BI Service the query is only run once.

The solution is to trick the Power Query engine into running these zero-row filter queries instantly, and you can do that using the Table.View M function. I blogged about Table.View here (and I strongly suggest you read that post before carrying on) but that example didn’t quite get to the solution you need here. Here’s a new version of the query above with an extra step:

let
  Source = Json.Document(File.Contents("C:\generated.json")),
  #"Converted to Table" = Table.FromList(
    Source, 
    Splitter.SplitByNothing(), 
    null, 
    null, 
    ExtraValues.Error
  ),
  #"Expanded Column1" = Table.ExpandRecordColumn(
    #"Converted to Table", 
    "Column1", 
    {"_id", "index", "guid", "isActive", "balance"}, 
    {"_id", "index", "guid", "isActive", "balance"}
  ),
  #"Changed Type" = Table.TransformColumnTypes(
    #"Expanded Column1", 
    {
      {"_id", type text}, 
      {"index", Int64.Type}, 
      {"guid", type text}, 
      {"isActive", type text}, 
      {"balance", type text}
    }
  ),
  OverrideZeroRowFilter = Table.View(
    null, 
    [
      GetType = () => 
        type table [
          _id = Text.Type, 
          index = Int64.Type, 
          guid = Text.Type, 
          isActive = Text.Type, 
          balance = Text.Type
        ], 
      GetRows = () => 
        #"Changed Type", 
      OnTake = (count as number) => 
        if count = 0 then 
          #table(
            type table [
              _id = Text.Type, 
              index = Int64.Type, 
              guid = Text.Type, 
              isActive = Text.Type, 
              balance = Text.Type
            ], 
            {}
          ) 
        else 
          Table.FirstN(#"Changed Type", count)
    ]
  )
in
  OverrideZeroRowFilter

The OverrideZeroRowFilter step is where the magic happens. It uses Table.View to override query folding behaviour by intercepting what happens when the table returned by the query is filtered. There are three fields in the record in the second parameter of Table.View that you need to change:

  • GetType returns a table type that describes the columns, and their data types, present in the output of the query. The five columns listed here are the five columns you can see in the screenshot of the query output above. It’s very easy to generate the required M code when you use the custom function that I blogged about here.
  • GetRows returns all the rows that the query can return, that is to say the table returned by the #"Changed Type" step.
  • OnTake is used when a top n filter is applied to the table returned by the query. In this case the code looks at the number of rows being requested (specified in the count parameter) and, if that’s zero, it just returns an empty table with the same columns as the #"Changed Type" step; if it’s more than zero then it uses Table.FirstN to get the actual number of rows requested from #"Changed Type". This means that when Power BI does that zero-row filter it can now happen immediately, because there’s no need to go back to the data source or execute any of the transformations in the query. A generic version of this pattern is sketched after this list.
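To reuse this on one of your own queries, the pattern generalises well. Here’s a hedged, self-contained template: LastStep stands in for your query’s real final step, and the column lists are placeholders you need to replace with whatever your query actually returns:

let
  // Stand-in for your query's real final step - replace with e.g. #"Changed Type"
  LastStep = #table(
    type table [MyColumn1 = Text.Type, MyColumn2 = Int64.Type], 
    {{"a", 1}, {"b", 2}}
  ),
  // Hard-coded schema: must list exactly the columns LastStep returns
  MyTableType = type table [MyColumn1 = Text.Type, MyColumn2 = Int64.Type],
  Output = Table.View(
    null, 
    [
      // Describe the schema without evaluating the query
      GetType = () => MyTableType, 
      // Return the full table when all rows are requested
      GetRows = () => LastStep, 
      // Answer the zero-row schema check instantly; honour other top n filters
      OnTake = (count as number) => 
        if count = 0 then 
          #table(MyTableType, {}) 
        else 
          Table.FirstN(LastStep, count)
    ]
  )
in
  Output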

This version of the query now runs in 4-5 seconds, and Process Monitor now shows that the JSON file is only read once and, obviously, reading the data once is a lot faster than reading it twice:

[Graph: cumulative data read from the JSON file over time, showing a single peak now that the file is only read once]

Look at how the graph flattens at the end… but that’s something for another blog post.

There is a downside to this approach: you have to hard-code the table schema that you expect your query to return, and if you change your query to return different columns you’ll have to update the table type in the last step.
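One thing that makes maintaining the hard-coded type a little easier: you can run Table.Schema over your query’s final step in the Power Query Editor and read the column names and types off the result. A minimal sketch, using the query above:

let
  // Returns one row per column of #"Changed Type", with its Name and
  // TypeName - useful for writing, or updating, the hard-coded table
  // type used in the Table.View step
  QuerySchema = Table.Schema(#"Changed Type")
in
  QuerySchema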

I’ve used this technique on other slow queries and it has made a massive difference to the development experience in Power BI Desktop – one query that was taking five minutes to refresh when I closed the Power Query Editor went down to two and a half minutes. If you try this out yourself please let me know if it makes a difference by leaving a comment.

[Don’t forget that there are other things you can do that will also speed up the development experience for slow queries: you should definitely turn off the “Allow data preview to download in the background” option, and you might want to consider turning off data privacy checks, so long as you fully understand the implications]

 

