Merging (or, in SQL terms, joining) tables in Power Query is a common cause of refresh performance problems. I’ve often wondered whether there’s anything you can do to optimise the performance of merges for non-foldable data sources, so I decided to run some tests to try to answer all the questions I had. In this series of posts I’ll tell you what I found.
For these tests the only data source I used was a CSV file with one million rows and seven numeric columns named A, B, C, D, E, F and G:
I used SQL Server Profiler to measure the amount of time taken for a query to execute using the technique I blogged about here. If you read that post (and I strongly recommend you do) you’ll see there are actually two Profiler events whose duration is significant when measuring refresh performance:
- Progress Report End/25 Execute SQL
- Progress Report End/17 Read Data
It also turns out that these two events provide some insight into how certain transformations are handled in the Power Query engine, but we’ll come to that later.
The first question I decided to investigate was this:
Does the number of columns in a table affect the performance of a merge?
First of all, I created two identical queries called First and Second that connected to the CSV file, used the first row from the file as the headers, and set the data types of all seven columns to Whole Number. Nothing very special, but here’s the M code all the same:
```
let
    Source = Csv.Document(
        File.Contents("C:\NumbersMoreColumns.csv"),
        [Delimiter = ",", Columns = 7, Encoding = 65001, QuoteStyle = QuoteStyle.None]
    ),
    #"Promoted Headers" = Table.PromoteHeaders(Source, [PromoteAllScalars = true]),
    #"Changed Type" = Table.TransformColumnTypes(
        #"Promoted Headers",
        {
            {"A", Int64.Type}, {"B", Int64.Type}, {"C", Int64.Type}, {"D", Int64.Type},
            {"E", Int64.Type}, {"F", Int64.Type}, {"G", Int64.Type}
        }
    )
in
    #"Changed Type"
```
I disabled these queries so that they were not loaded into the dataset.
Next, I created a third query that used the Table.NestedJoin function to merge the data from these two queries using an inner join and return all of the columns from both source queries:
```
let
    Source = Table.NestedJoin(First, {"A"}, Second, {"A"}, "Second", JoinKind.Inner),
    #"Expanded Second" = Table.ExpandTableColumn(
        Source,
        "Second",
        {"A", "B", "C", "D", "E", "F", "G"},
        {"Second.A", "Second.B", "Second.C", "Second.D", "Second.E", "Second.F", "Second.G"}
    )
in
    #"Expanded Second"
```
When I refreshed this query, in Profiler the two events I mentioned above had the following durations:
- Progress Report End/25 Execute SQL – 40 seconds
- Progress Report End/17 Read Data – 56 seconds
Pretty slow. But what is performance like when you merge two tables with one column instead of seven?
To test this, I added an extra step to the First and Second queries that removed all but the A column (the one needed for the merge), like so:
```
let
    Source = Csv.Document(
        File.Contents("C:\NumbersMoreColumns.csv"),
        [Delimiter = ",", Columns = 7, Encoding = 65001, QuoteStyle = QuoteStyle.None]
    ),
    #"Promoted Headers" = Table.PromoteHeaders(Source, [PromoteAllScalars = true]),
    #"Changed Type" = Table.TransformColumnTypes(
        #"Promoted Headers",
        {
            {"A", Int64.Type}, {"B", Int64.Type}, {"C", Int64.Type}, {"D", Int64.Type},
            {"E", Int64.Type}, {"F", Int64.Type}, {"G", Int64.Type}
        }
    ),
    #"Removed Other Columns" = Table.SelectColumns(#"Changed Type", {"A"})
in
    #"Removed Other Columns"
```
I then updated the third query that contained the merge to reflect this change:
```
let
    Source = Table.NestedJoin(First, {"A"}, Second, {"A"}, "Second", JoinKind.Inner),
    #"Expanded Second" = Table.ExpandTableColumn(Source, "Second", {"A"}, {"Second.A"})
in
    #"Expanded Second"
```
When this query was refreshed, Profiler showed the following durations:
- Progress Report End/25 Execute SQL – 9 seconds
- Progress Report End/17 Read Data – 1 second
This query is a lot quicker, but then I thought: what if the performance is more to do with the size of the table returned by the query than with the merge itself? So I added an extra step to the end of the merge query, like so:
```
let
    Source = Table.NestedJoin(First, {"A"}, Second, {"A"}, "Second", JoinKind.Inner),
    #"Expanded Second" = Table.ExpandTableColumn(Source, "Second", {"A"}, {"Second.A"}),
    #"Counted Rows" = Table.RowCount(#"Expanded Second")
in
    #"Counted Rows"
```
…and then reran the two tests above. My thinking was that, since the merge query now only returns a single value, the amount of data returned by the query should not be a factor in the duration of the queries.
Here are the timings for the version with the merge on the tables with seven columns:
- Progress Report End/25 Execute SQL – 56 seconds
- Progress Report End/17 Read Data – 0 seconds
Here are the timings for the version with the merge on the tables with just one column:
- Progress Report End/25 Execute SQL – 14 seconds
- Progress Report End/17 Read Data – 0 seconds
This does seem to confirm that the number of columns in a table affects the performance of a merge, although of course it might be that it takes longer to count the rows of a table that has more columns.
This shows something else too: when the query returns only a row count the Read Data event takes no time at all, whereas in the first two tests it accounted for a noticeable share of the duration – in the seven-column case it took even longer than the Execute SQL event.
Why does the number of columns influence the performance of a merge? If you read my recent post on monitoring memory usage with Query Diagnostics, you’ll remember that merges have to take place in memory – so I guess the larger the tables involved in the merge, the more memory is needed and the more paging takes place if the 256MB limit is exceeded. Looking at the performance counter data generated for the last two queries showed that the 256MB limit was indeed exceeded in both cases, but while the version joining the one-column tables had a maximum commit of 584MB, the maximum commit for the version joining the seven-column tables was almost 3GB.
That’s enough for now. At this point I think it’s fair to say the following:
Removing any unwanted columns before you merge two tables in Power Query will improve refresh performance.
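To make that takeaway concrete, here’s a minimal sketch of the pattern based on the queries used above. It assumes the same First and Second queries and the A key column from this post’s examples; in your own model you would keep whichever columns you actually need downstream:

```
// Sketch only: keep just the columns you need on each side before merging,
// then expand only the columns you need afterwards.
// First, Second and column A are the example queries/columns from this post.
let
    // Remove unwanted columns before the merge
    FirstKeysOnly = Table.SelectColumns(First, {"A"}),
    SecondKeysOnly = Table.SelectColumns(Second, {"A"}),
    // Merge the two narrow tables on the key column
    Merged = Table.NestedJoin(FirstKeysOnly, {"A"}, SecondKeysOnly, {"A"}, "Second", JoinKind.Inner),
    // Expand only the columns that are actually required
    Expanded = Table.ExpandTableColumn(Merged, "Second", {"A"}, {"Second.A"})
in
    Expanded
```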