There has been a lot of excitement around the newly added support for reading from Parquet files in Power BI. However, I have to admit that when I first started testing it I was disappointed not to see any big improvements in performance when reading data from Parquet compared to reading data from CSV (for example, see here). So, is Power Query able to take advantage of Parquet’s columnar storage when reading data?
The answer is yes, but you may need to make some changes to your Power Query queries to ensure you get the best possible performance. Using the same data that I have been using in my recent series of posts on importing data from ADLSgen2, I took a single 10.1MB Parquet file and downloaded it to my PC. Here’s what the data looked like:
I then created a query to count the number of rows in the table stored in this Parquet file where the TransDate column was 1/1/2015:
let
    Source = Parquet.Document(
        File.Contents(
            "C:\myfile.snappy.parquet"
        )
    ),
    #"Filtered Rows" = Table.SelectRows(
        Source,
        each [TransDate] = #date(2015, 1, 1)
    ),
    #"Counted Rows" = Table.RowCount(
        #"Filtered Rows"
    )
in
    #"Counted Rows"
Here’s the output:
I then used SQL Server Profiler to find out how long this query took to execute (as detailed here): on average it took 3 seconds.
Here’s what I saw in Power BI Desktop while loading the data just before refresh finished:
I then added an extra step to the query to remove all columns except the TransDate column:
let
    Source = Parquet.Document(
        File.Contents(
            "C:\myfile.snappy.parquet"
        )
    ),
    #"Removed Other Columns" = Table.SelectColumns(
        Source,
        {"TransDate"}
    ),
    #"Filtered Rows" = Table.SelectRows(
        #"Removed Other Columns",
        each [TransDate] = #date(2015, 1, 1)
    ),
    #"Counted Rows" = Table.RowCount(
        #"Filtered Rows"
    )
in
    #"Counted Rows"
This version of the query took an average of only 0.7 seconds to run, a substantial improvement. This time the maximum amount of data read by Power Query was only 2.44MB:
As you can see, in this case removing unnecessary columns improved the performance of reading data from Parquet files a lot. This is not always true though – I tested a Group By transformation and in that case the Power Query engine was clever enough to only read the required columns, and manually removing columns made no difference to performance.
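For illustration only, a Group By query of this kind might look something like the following. It assumes the same file as above and groups on TransDate with a row count; the grouping column, the step name and the Count column are hypothetical rather than taken from the test described here.

let
    Source = Parquet.Document(
        File.Contents(
            "C:\myfile.snappy.parquet"
        )
    ),
    // Hypothetical grouping: count the rows for each TransDate value
    #"Grouped Rows" = Table.Group(
        Source,
        {"TransDate"},
        {{"Count", each Table.RowCount(_), Int64.Type}}
    )
in
    #"Grouped Rows"

A query like this has no explicit Table.SelectColumns step, which is the situation where adding one made no difference.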
This demonstrates that Power Query is able to take advantage of Parquet’s columnar storage to read data from only the columns it needs. However, this is the only performance optimisation available to Power Query on Parquet: it doesn’t do predicate pushdown or anything like that. What’s more, when reading data through the ADLSgen2 connector, the nature of Parquet storage stops Power Query from making parallel requests for data (I guess this is the same behaviour that is controlled by the ConcurrentRequests option), which puts it at a disadvantage compared to reading data from CSV files.
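As a rough sketch of how the same column-selection pattern might be applied to a Parquet file stored in ADLSgen2: the storage account URL, container and folder below are placeholders, and the query assumes the file can be picked out by name from the table returned by AzureStorage.DataLake. No timings are given for this version.

let
    // Placeholder endpoint: replace with your own storage account, container and folder
    Source = AzureStorage.DataLake(
        "https://myaccount.dfs.core.windows.net/mycontainer/myfolder"
    ),
    // Take the Content binary of the row whose Name matches the Parquet file
    FileContent = Table.SelectRows(
        Source,
        each [Name] = "myfile.snappy.parquet"
    ){0}[Content],
    ParquetData = Parquet.Document(FileContent),
    // Keep only the column needed before filtering, as in the local-file example
    #"Removed Other Columns" = Table.SelectColumns(
        ParquetData,
        {"TransDate"}
    ),
    #"Filtered Rows" = Table.SelectRows(
        #"Removed Other Columns",
        each [TransDate] = #date(2015, 1, 1)
    ),
    #"Counted Rows" = Table.RowCount(
        #"Filtered Rows"
    )
in
    #"Counted Rows"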
I think a lot more testing is needed to understand how to get the best performance when reading data from Parquet, so look out for more posts on this subject in the future…
[Thanks once again to Eric Gorelik from the Power Query development team for providing the information about how the Parquet connector works, and to Ben Watt and Gerhard Brueckl for asking the questions in the first place.]
Bonus fact: in case you’re wondering, the following compression types are supported by the Parquet connector: GZip, Snappy, Brotli, LZ4, and ZStd.