In all the testing I’ve done recently with importing data from Parquet files into Power BI, I noticed something strange: loading data from a folder containing multiple Parquet files seemed a lot slower than I would expect, based on the time taken to load data from a single file. So I wondered: is there something that can be optimised? It turns out there is, and in this blog post I’ll show you what I did.
If you import data from a folder containing Parquet files, whether it’s a local folder or a folder in ADLSgen2 storage, you’ll see a series of queries created for you in the Power Query Editor window that look like this:
The query called Query1 shown in the screenshot iterates over all the files in the folder you’ve chosen and calls a function that reads the data from each Parquet file. It returns a table that contains a column holding the name of the original source file (which isn’t all that interesting for Parquet files) and all the columns from the Parquet files you’re combining.
Using the Parquet files from my series of posts on importing data from ADLSgen2 as a source, here’s the M code Power Query generates for this query, which I have modified to remove the column containing the source file name:
let
    // List every file in the folder
    Source = Folder.Files("C:\MyFolder"),
    #"Filtered Hidden Files1" = Table.SelectRows(
        Source,
        each [Attributes]?[Hidden]? <> true
    ),
    // Call the function that reads each Parquet file into a nested table
    #"Invoke Custom Function1" = Table.AddColumn(
        #"Filtered Hidden Files1",
        "Transform File (3)",
        each #"Transform File (3)"([Content])
    ),
    #"Renamed Columns1" = Table.RenameColumns(
        #"Invoke Custom Function1",
        {"Name", "Source.Name"}
    ),
    // Keep only the column of nested tables (source file name removed)
    #"Removed Other Columns1" = Table.SelectColumns(
        #"Renamed Columns1",
        {"Transform File (3)"}
    ),
    // Expand the nested tables into a single flat table
    #"Expanded Table Column1" = Table.ExpandTableColumn(
        #"Removed Other Columns1",
        "Transform File (3)",
        Table.ColumnNames(
            #"Transform File (3)"(#"Sample File (3)")
        )
    ),
    // Column types have to be set again after the expand
    #"Changed Type" = Table.TransformColumnTypes(
        #"Expanded Table Column1",
        {
            {"TransDate", type date},
            {"GuestId", type text},
            {"ProductId", type text},
            {"NetAmount", type number}
        }
    )
in
    #"Changed Type"
Here’s the output:
On my PC this query took an average of 102 seconds to refresh.
Apart from this query being slower than I expected, I also noticed that there is a “Changed Type” step at the end – which I thought was unnecessary because unlike CSV files, Parquet has typed columns. If you connect to a single Parquet file in Power Query then it recognises the column types, so why not here? Well, it’s because of the way it’s combining files by expanding table columns, and there is a way to work around this that I blogged about here:
https://blog.crossjoin.co.uk/2017/09/25/setting-data-types-on-nested-tables-in-m/
Setting a type on the table column before expanding it did indeed improve performance, but this led me to another optimisation.
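To make that workaround a little more concrete, here’s a minimal, self-contained sketch of the idea rather than the exact query I used. It assumes the four columns shown in the “Changed Type” step above; the names FileType, FakeTransformFile and Files are hypothetical stand-ins so that the code can run without any Parquet files. The fourth argument of Table.AddColumn gives the new table column a type, so the expanded columns should keep their types without a separate “Changed Type” step.

```powerquery
let
    // Hypothetical table type describing the columns in each file
    FileType = type table [
        TransDate = date,
        GuestId = text,
        ProductId = text,
        NetAmount = number
    ],
    // Stand-in for the generated transform function (assumption: a real
    // query would read the Parquet file from [Content] instead)
    FakeTransformFile = (fileName as text) as table =>
        #table(FileType, {{#date(2021, 1, 1), "G1", "P1", 10.5}}),
    // Stand-in for the output of Folder.Files
    Files = #table({"Name"}, {{"file1.parquet"}, {"file2.parquet"}}),
    // The fourth argument sets the type of the new table column
    AddTables = Table.AddColumn(
        Files,
        "Data",
        each FakeTransformFile([Name]),
        FileType
    ),
    OnlyData = Table.SelectColumns(AddTables, {"Data"}),
    // Expanding a typed table column carries the column types through
    Expanded = Table.ExpandTableColumn(
        OnlyData,
        "Data",
        {"TransDate", "GuestId", "ProductId", "NetAmount"}
    )
in
    Expanded
```

In a real query the type would go on the existing “Invoke Custom Function1” step, and the trailing “Changed Type” step could then be deleted.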
I know that the Table.Combine M function can perform differently from the Table.ExpandTableColumn function used in the original version of the query (although it does not always perform better). I therefore changed the query above to use Table.Combine to return a single table containing all the data (note that setting a type on the table column is not necessary for this optimisation). Here’s the new version:
let
    Source = Folder.Files("C:\MyFolder"),
    #"Filtered Hidden Files1" = Table.SelectRows(
        Source,
        each [Attributes]?[Hidden]? <> true
    ),
    // Call the function that reads each Parquet file into a nested table
    #"Invoke Custom Function1" = Table.AddColumn(
        #"Filtered Hidden Files1",
        "Transform File",
        each #"Transform File"([Content])
    ),
    #"Renamed Columns1" = Table.RenameColumns(
        #"Invoke Custom Function1",
        {"Name", "Source.Name"}
    ),
    #"Removed Other Columns1" = Table.SelectColumns(
        #"Renamed Columns1",
        {"Source.Name", "Transform File"}
    ),
    // Append the nested tables directly instead of expanding the table column
    Combine = Table.Combine(
        #"Removed Other Columns1"[Transform File]
    )
in
    Combine
This version of the query took, on average, 43 seconds to refresh, which is a massive improvement.
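If you want to see why no “Changed Type” step is needed here, the following toy example may help. It is only a sketch under stated assumptions: FileType, File1, File2 and Files are hypothetical in-memory stand-ins for the typed tables returned from the Parquet files. It appends the nested tables with Table.Combine and then uses Table.Schema to list the column types of the result.

```powerquery
let
    // Hypothetical table type describing the columns in each file
    FileType = type table [
        TransDate = date,
        GuestId = text,
        ProductId = text,
        NetAmount = number
    ],
    // Two small typed tables standing in for two Parquet files
    File1 = #table(FileType, {{#date(2021, 1, 1), "G1", "P1", 10.5}}),
    File2 = #table(FileType, {{#date(2021, 1, 2), "G2", "P2", 20}}),
    // Stand-in for the table of files with a column of nested tables
    Files = #table(
        type table [Name = text, Data = table],
        {{"file1.parquet", File1}, {"file2.parquet", File2}}
    ),
    // Files[Data] is a list of tables; Table.Combine appends them
    Combined = Table.Combine(Files[Data]),
    // Inspect the column names and types of the combined table
    Schema = Table.SelectColumns(
        Table.Schema(Combined),
        {"Name", "TypeName"}
    )
in
    Schema
```

Because every nested table already carries the same column types, the combined table should report date, text and number types for its columns, which is what you would expect to see in the TypeName column.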
If you’ve been following my series on ADLSgen2 refresh you may remember that I blogged about importing from a folder of Parquet files there too: in this post I noted that it took on average 72 seconds to load the same data from an ADLSgen2 folder in the Power BI Service using the original code. That was with the Source File column included, and removing that column made no difference to performance. This new version of the query took on average 49 seconds.
The conclusion is obvious: if you need to load data from a folder of Parquet files then you should use this new approach, because the performance benefits are substantial. I know what you’re thinking: does this technique work for other file types apart from Parquet, such as CSV? Unfortunately the answer is no, because those file types don’t have typed columns like Parquet does.