
Working With CSV Files That Contain Rogue Line Breaks In Power Query And Power BI


CSV files often contain text fields, and text fields often contain line breaks that shouldn’t be there (especially if the text is entered by end users). For example, here are the contents of a simple CSV file that should contain three fields (Product, Comment and Sales) and six rows:

Product,Comment,Sales
Apples,This is a comment,10
Oranges,Another comment,20
Pears,A comment with a
line break in,30
Grapes,A normal comment,40
Bananas,A comment
with
two line breaks,50
Pineapples,A normal comment,60

However, as you can see, the Comment field contains line breaks in some cases. If you try to load this file into Power Query/Power BI, you’ll see that it doesn’t handle things well:

[Screenshot: the file loaded into the Query Editor, with rows split incorrectly at the rogue line breaks]
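For reference, a query like the one the Query Editor generates for this file would look something like the sketch below (the file path is a placeholder and the exact options vary by version). Because the file contains no quotes, every line break is treated as the end of a row, so the broken comments spill over onto extra rows:

let
    // Hypothetical path to the example file shown above
    Source = Csv.Document(
              File.Contents("C:\MyFile.csv"),
              [Delimiter=",", Columns=3, QuoteStyle=QuoteStyle.None]),
    PromoteHeaders = Table.PromoteHeaders(Source)
in
    PromoteHeaders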

The easiest way of solving this problem is to make sure you have double quotes around the text in the Comment field, so that Power Query/Power BI knows to ignore the line breaks:

Product,Comment,Sales
Apples,"This is a commment",10
Oranges,"Another comment",20
Pears,"A comment with a
line break in",30
Grapes,"A normal comment",40
Bananas,"A comment
with
two line breaks",50
Pineapples,"A normal comment",60

[Screenshot: the quoted version of the file loaded correctly into the Query Editor]

[By the way, no text in the screenshot above has been truncated. The line breaks are still there in the Comment field; you can see the full text in the bottom pane of the Query Editor by clicking on a cell in the table.]
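If you can add the quotes, no special code is needed at all: in current versions of Power Query the standard Csv.Document function treats quoted line breaks as part of the data when the QuoteStyle.Csv option is used. Here’s a minimal sketch, again using a placeholder file path:

let
    // Hypothetical path to the quoted version of the file
    Source = Csv.Document(
              File.Contents("C:\MyFile.csv"),
              [Delimiter=",", Columns=3, QuoteStyle=QuoteStyle.Csv]),
    PromoteHeaders = Table.PromoteHeaders(Source)
in
    PromoteHeaders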

But what if you can’t fix the source data? I was asked this question the other week, and since I had been asked about it before and not come up with a good answer, I decided to spend some time researching the problem.

What I found was that it was relatively easy to write some M code that gave me the correct results, but very hard to write code that performed acceptably well on a large data set (I was testing on a CSV file containing almost half a million rows). Here’s the code for the function I ended up with:

(FileName as text, DelimiterCharacter as text, DelimitersPerLine as number, UseFirstRowAsHeaders as logical) =>
let
    // Load file and buffer in memory
    BufferedBinary = Binary.Buffer(File.Contents(FileName)),
    // Convert to a single block of text
    Source = Text.FromBinary(BufferedBinary),
    // Find the position of all line breaks
    LineBreaks = List.Buffer(Text.PositionOf(Source, "#(cr,lf)", Occurrence.All)),
    // Get a list of numbers from 0 to the number of line breaks-1
    LinePositions = List.Positions(LineBreaks),
    // Count the number of line breaks
    TotalLines = List.Count(LineBreaks),
    // Split the text by line breaks
    FirstSplit = Text.Split(Source, "#(cr,lf)"),
    // On each line, count the number of column delimiter characters
    DelimiterCounts = List.Buffer(
                           List.Transform(
                            LinePositions, 
                            each List.Count(Text.PositionOf(FirstSplit{_}, 
                            DelimiterCharacter, 
                            Occurrence.All)))),
    // Loop through each line and
    // a) Calculate a running total of the number of column delimiter characters on each line
    // b) Divide the running total by the expected number of delimiters per line, and round down to the nearest integer
    RunningSums = List.Buffer(
                           List.Generate(
                            ()=>[Counter=0, RunningTotal=0], 
                            each [Counter]<TotalLines, 
                            each [Counter=[Counter]+1, RunningTotal=[RunningTotal]+DelimiterCounts{[Counter]}],
                            each Number.IntegerDivide([RunningTotal]+DelimiterCounts{[Counter]},DelimitersPerLine))),
    // If the result of the division calculated in the previous step is
    // the same as on the previous line, then the line break can be ignored
    // so return null, otherwise return the position of the line break
    FindFullLines = List.Transform(LinePositions, each if _=0 or RunningSums{_}<>RunningSums{_-1} then LineBreaks{_} else null),
    // Remove nulls from the list
    RemoveNulls = List.RemoveNulls(FindFullLines),
    // Split the original text by the positions of the line breaks left
    LineSplitterFunction = Splitter.SplitTextByPositions(List.Combine({{0},RemoveNulls})),
    ActualLines = LineSplitterFunction(Source),
    // Convert the result to a table
    ConvertToTable = Table.FromList(ActualLines, Splitter.SplitTextByDelimiter(DelimiterCharacter), null, null, ExtraValues.Ignore),
    // Use the first rows as headers if desired
    PromoteHeaders = if UseFirstRowAsHeaders then Table.PromoteHeaders(ConvertToTable) else ConvertToTable
in
    PromoteHeaders

Assuming that you called this function LoadFile(), here’s an example of how to call it:

LoadFile(
//File name
"C:\MyFile.csv", 
//Column delimiter character - in this case a comma
",", 
//The expected number of commas to be found per line
//ie one less than the number of columns
2, 
//If true then first row contains header values
true)

The comments in the code should give you an idea of how it all works. A few other things to note:

  • If any of your lines contain extra columns, this approach won’t work
  • I’ve noticed a difference between how Power BI Desktop and Excel 2016 behave with this code: the current Excel 2016 Query Editor keeps the line breaks after splitting on them, whereas Power BI Desktop does not. It’s easy to fix the Excel output though: just use the Clean transform to remove the line breaks (a sketch of the equivalent M step is shown after this list).
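For example, if the Comment column is the one affected and the previous step is called PromoteHeaders, the Clean transform generates a step equivalent to the following sketch (the step and column names here are placeholders):

    // Remove any remaining control characters, including line breaks,
    // from the Comment column
    CleanedComments = Table.TransformColumns(
                       PromoteHeaders,
                       {{"Comment", Text.Clean, type text}})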

I’m not really sure why it was so hard to come up with code that performed well, even after liberal use of List.Buffer(). It seemed like any time I referenced the original text (as returned by the Source step) inside the List.Generate() function, performance died, so you’ll notice in the code above that I only reference the list that contains the number of delimiters found per line.
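To illustrate the pattern that did perform well, here’s the running-total logic from the RunningSums step on its own, applied to a small hard-coded list of per-line delimiter counts (the values below are made up to match the first six lines of the unquoted example file):

let
    // Hypothetical per-line delimiter counts, buffered in memory
    DelimiterCounts = List.Buffer({2, 2, 2, 1, 1, 2}),
    TotalLines = List.Count(DelimiterCounts),
    // Running total of delimiter counts, calculated without
    // referring back to the original block of text
    RunningTotals = List.Generate(
                     ()=>[Counter=0, RunningTotal=0],
                     each [Counter]<TotalLines,
                     each [Counter=[Counter]+1, RunningTotal=[RunningTotal]+DelimiterCounts{[Counter]}],
                     each [RunningTotal]+DelimiterCounts{[Counter]})
in
    // Returns {2, 4, 6, 7, 8, 10}
    RunningTotals

With DelimitersPerLine set to 2, dividing these running totals by 2 and rounding down gives {1, 2, 3, 3, 4, 5}: the repeated 3 is how the function spots that the line break in the middle of the Pears row is a rogue one and can be ignored.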


