CSV files often contain text fields, and text fields often contain line breaks that shouldn’t be there (especially if the text is entered by end users). For example, here are the contents of a simple CSV file that should contain three fields (Product, Comment and Sales) and six rows:
Product,Comment,Sales
Apples,This is a comment,10
Oranges,Another comment,20
Pears,A comment
with a line break in,30
Grapes,A normal comment,40
Bananas,A comment
with
two line breaks,50
Pineapples,A normal comment,60
However, as you can see, the Comment field contains line breaks in some cases. If you try to load this file into Power Query/Power BI, you’ll see that it doesn’t handle things well:
The easiest way of solving this problem is to make sure you have double quotes around the text in the Comment field, so that Power Query/Power BI knows to ignore the line breaks:
Product,Comment,Sales
Apples,"This is a comment",10
Oranges,"Another comment",20
Pears,"A comment
with a line break in",30
Grapes,"A normal comment",40
Bananas,"A comment
with
two line breaks",50
Pineapples,"A normal comment",60
[By the way, no text in the screenshot above has been truncated. The line breaks are still there in the Comment field; you can see the full text in the bottom pane of the Query Editor by clicking on a cell in the table.]
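This quoting behaviour isn’t specific to Power Query: any standards-compliant CSV parser treats a line break inside a double-quoted field as part of the field value rather than as a row boundary. As a quick illustration outside Power Query, here’s a sketch using Python’s standard csv module on a small sample matching the quoted file above (the sample string is my own):

```python
import csv
import io

# Sample matching the quoted CSV above: the Pears comment contains
# an embedded line break, but the field is wrapped in double quotes.
raw = (
    'Product,Comment,Sales\n'
    'Apples,"This is a comment",10\n'
    'Pears,"A comment\nwith a line break in",30\n'
)

rows = list(csv.reader(io.StringIO(raw)))
# The parser treats the quoted line break as part of the field, so we
# get 3 rows, and the Pears comment keeps its embedded newline.
```
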
But what if you can’t fix the source data? I was asked this question the other week, and since I had been asked about it before and not come up with a good answer, I decided to spend some time researching the problem.
What I found was that it was relatively easy to write some M code that gave me the correct results, but very hard to write code that performed acceptably well on a large data set (I was testing on a CSV file containing almost half a million rows). Here’s the code for the function I ended up with:
(FileName as text, DelimiterCharacter as text, DelimitersPerLine as number, UseFirstRowAsHeaders as logical) =>
let
    // Load file and buffer in memory
    BufferedBinary = Binary.Buffer(File.Contents(FileName)),
    // Convert to a single block of text
    Source = Text.FromBinary(BufferedBinary),
    // Find the position of all line breaks
    LineBreaks = List.Buffer(Text.PositionOf(Source, "#(cr,lf)", Occurrence.All)),
    // Get a list of numbers from 0 to the number of line breaks-1
    LinePositions = List.Positions(LineBreaks),
    // Count the number of line breaks
    TotalLines = List.Count(LineBreaks),
    // Split the text by line breaks
    FirstSplit = Text.Split(Source, "#(cr,lf)"),
    // On each line, count the number of column delimiter characters
    DelimiterCounts = List.Buffer(
        List.Transform(
            LinePositions,
            each List.Count(Text.PositionOf(FirstSplit{_}, DelimiterCharacter, Occurrence.All)))),
    // Loop through each line and
    // a) Calculate a running total of the number of column delimiter characters on each line
    // b) Divide the running total by the expected number of delimiters per line,
    //    and round down to the nearest integer
    RunningSums = List.Buffer(
        List.Generate(
            ()=>[Counter=0, RunningTotal=0],
            each [Counter]<TotalLines,
            each [Counter=[Counter]+1, RunningTotal=[RunningTotal]+DelimiterCounts{[Counter]}],
            each Number.IntegerDivide([RunningTotal]+DelimiterCounts{[Counter]}, DelimitersPerLine))),
    // If the result of the division calculated in the previous step is
    // the same as on the previous line, then the line break can be ignored
    // so return null, otherwise return the position of the line break
    FindFullLines = List.Transform(
        LinePositions,
        each if _=0 or RunningSums{_}<>RunningSums{_-1} then LineBreaks{_} else null),
    // Remove nulls from the list
    RemoveNulls = List.RemoveNulls(FindFullLines),
    // Split the original text by the positions of the line breaks left
    LineSplitterFunction = Splitter.SplitTextByPositions(List.Combine({{0}, RemoveNulls})),
    ActualLines = LineSplitterFunction(Source),
    // Convert the result to a table
    ConvertToTable = Table.FromList(
        ActualLines,
        Splitter.SplitTextByDelimiter(DelimiterCharacter),
        null, null, ExtraValues.Ignore),
    // Use the first row as headers if desired
    PromoteHeaders = if UseFirstRowAsHeaders
        then Table.PromoteHeaders(ConvertToTable)
        else ConvertToTable
in
    PromoteHeaders
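If the M syntax makes the core idea hard to see, the running-total trick can be sketched in a few lines of Python (my own function and parameter names; this is an illustration of the logic, not the M implementation):

```python
def merge_broken_lines(text, delimiter=",", delimiters_per_line=2):
    """Rejoin physical lines that belong to the same logical CSV row.

    The idea mirrors the M function above: keep a running count of
    delimiter characters, and only treat a line break as a real row
    boundary once the current row contains the expected number of
    delimiters.
    """
    logical, current, count = [], [], 0
    for line in text.split("\n"):
        current.append(line)
        count += line.count(delimiter)
        if count >= delimiters_per_line:
            # The row is complete: join its fragments back together.
            logical.append(" ".join(current))
            current, count = [], 0
    if current:  # keep any incomplete trailing fragment
        logical.append(" ".join(current))
    return logical
```

Joining the fragments with a space is a simplification for the sketch; the M function above leaves the break positions to the splitter, which is why (as noted below) Excel and Power BI Desktop differ over whether the line breaks survive.
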
Assuming that you called this function LoadFile(), here’s an example of how to call it:
LoadFile(
    //File name
    "C:\MyFile.csv",
    //Column delimiter character - in this case a comma
    ",",
    //The expected number of commas to be found per line
    //ie one less than the number of columns
    2,
    //If true then first row contains header values
    true)
The comments in the code should give you an idea of how it all works. A few other things to note:
- If any of your lines contain extra columns, this approach won’t work
- I’ve noticed a difference between how Power BI Desktop and Excel 2016 behave with this code: the current Excel 2016 Query Editor keeps the line breaks after splitting on them, whereas Power BI Desktop does not. It’s easy to fix the Excel output, though: just use the Clean transform to remove the line breaks.
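For readers following along outside Power Query: a Clean-style transform simply strips non-printable control characters (including embedded carriage returns and line feeds) from a text value. A rough Python equivalent might look like this (a sketch of the behaviour, not the actual Power Query implementation):

```python
def clean(text):
    # Remove non-printable characters, such as embedded CR/LF,
    # from a text value - roughly what a "Clean" transform does.
    return "".join(ch for ch in text if ch.isprintable())
```
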
I’m not really sure why it was so hard to come up with code that performed well, even after liberal use of List.Buffer(). It seemed that whenever I referenced the original text (as returned by the Source step) inside the List.Generate() function, performance died – so you’ll notice in the code above that I only reference the list containing the number of delimiters found per line.