A recent conversation in the comments of this blog post brought up the subject of how to handle columns that have either been removed from or added to a data source in Power Query. Anyone who has worked with csv files knows that they have a nasty habit of changing format even when they aren’t supposed to, and added or removed columns can cause all kinds of problems downstream.
Ken Puls (whose excellent blog you are probably already reading if you’re interested in Power Query) pointed out that it’s very easy to protect yourself against new columns in your data source. When creating a query, select all the columns that you want and then right-click and select Remove Other Columns:
This means that if any new columns are added to your data source in the future, they won’t appear in the output of your query. In the M code the Table.SelectColumns() function is used to do this.
Dealing with missing columns is a little bit more complicated. In order to find out whether a column is missing, first of all you’ll need a list of columns that should be present in your query. You can of course store these tables in a table in Excel and enter the column names manually, or you can do this in M fairly easily by creating a query that connects to your data source and using the Table.ColumnNames() function something like this:
let //Connect to CSV file Source = Csv.Document( File.Contents( "C:\Users\Chris\Documents\Power Query demos\SampleData.csv" ),null,",",null,1252), //Use first row as headers FirstRowAsHeader = Table.PromoteHeaders(Source), //Get a list of column names GetColumns = Table.ColumnNames(FirstRowAsHeader), //Turn this list into a table MakeATable = Table.FromList( GetColumns, Splitter.SplitByNothing(), null, null, ExtraValues.Error), //Rename this table's sole column RenamedColumns = Table.RenameColumns( MakeATable , {{"Column1", "ColumnName"}}) in RenamedColumns
Given a csv file that looks like this:
…the query above returns the following table of column names:
You can then store the output of this query in an Excel table for future reference – just remember not to refresh the query!
Having done that, you can then look at the columns returned by your data source and compare them with the columns you are expecting by using the techniques shown in this post. For example, here’s a query that reads a list of column names from an Excel table and compares them with the columns returned from a csv file:
let //Connect to Excel table containing expected column names ExcelSource = Excel.CurrentWorkbook(){[Name="GetColumnNames"]}[Content], //Get list of expected columns ExpectedColumns = Table.Column(ExcelSource, "ColumnName"), //Connect to CSV file CSVSource = Csv.Document( File.Contents( "C:\Users\Chris\Documents\Power Query demos\SampleData.csv" ),null,",",null,1252), //Use first row as headers FirstRowAsHeader = Table.PromoteHeaders(CSVSource), //Get a list of column names in csv CSVColumns = Table.ColumnNames(FirstRowAsHeader), //Find missing columns MissingColumns = List.Difference(ExpectedColumns, CSVColumns), //Find added columns AddedColumns = List.Difference(CSVColumns, ExpectedColumns), //Report what has changed OutputMissing = if List.Count(MissingColumns)=0 then "No columns missing" else "Missing columns: " & Text.Combine(MissingColumns, ","), OutputAdded = if List.Count(AddedColumns)=0 then "No columns added" else "Added columns: " & Text.Combine(AddedColumns, ","), Output = OutputMissing & " " & OutputAdded in Output
Given a csv file that looks like this:
…and an Excel table like the one above containing the three column names Month, Product and Sales, the output of this query is:
It would be very easy to convert this query to a function that you could use to check the columns expected by multiple queries, and also to adapt the output to your own needs. Also, in certain scenarios (such as when you’re importing data from SQL Server) you might also want to check the data types used by the columns; I’ll leave that for another blog post though. In any case, data types aren’t so much of an issue with CSV files because it’s Power Query that imposes the types on the columns within a query, and any type conversion issues can be dealt with by Power Query’s error handling functionality (see Gerhard Brueckl’s post on this topic, for example).
You can download a workbook containing the two queries from this post here.