Quantcast
Channel: Chris Webb's BI Blog: Power Query
Viewing all articles
Browse latest Browse all 248

Removing Punctuation From Text In Power Query

$
0
0

In one of my first posts on Power Query (and still my favourite) I found the top 100 words in the complete works of Shakespeare. As always when you’re learning a new tool, though, I look back at what I wrote then and realise there are better ways of doing things… one of which is removing all of the punctuation from a piece of text.

In the post I’m talking about I used the following expression to remove punctuation:

Text.Remove(RemoveLineBreaks,{"," ,"." ,"?" ,";" ,":" ,";" ,"’","@" ,"#" ,"~" ,"{" ,"[" ,"}" ,"]" ,"(" ,")", "*"})

However, as you can see, it only removes the punctuation could be bothered to hard-code in the list – luckily Shakespeare didn’t use too much exotic punctuation! There is a better way of doing this, though, and it relies on the Character.FromNumber() function which takes a number and returns the equivalent Unicode character; for example, the expression Character.FromNumber(38) returns an ampersand & symbol.

Looking at the list of Unicode characters here it’s easy to identify the ranges of numbers that represent punctuation characters. These ranges can then be represented in M as lists. Remembering that lists of ranges of numbers can be declared easily by using the notation {1..5}, which results in the list of numbers {1,2,3,4,5}, you can write a list of lists containing the numbers representing the Latin punctuation characters in Unicode like so:

{{0..31},{33..47},{58..64},{91..96},{123..191}}

and you can use List.Combine() to turn this list of lists into a single list of numbers.  Finally, you can use List.Transform() and Character.FromNumber() to get a list of the Unicode characters for these numbers, and pass that list to Text.Remove(). Here’s a query showing everything:

let

    //get a list of lists containing the numbers of Unicode punctuation characters

    numberlists = {{0..31},{33..47},{58..64},{91..96},{123..191}},

    //turn this into a single list

    combinedlist = List.Combine(numberlists),

    //get a list of all the punctuation characters that these numbers represent

    punctuationlist = List.Transform(combinedlist, each Character.FromNumber(_)),

    //some text to test this on

    inputtext = "Hello! My name is Chris, and I'm hoping that this *cool* post will help you!",

    //the text with punctuation removed

    outputtext = Text.Remove(inputtext, punctuationlist)

in

    outputtext

 

The output of this query is this:

image

You can download the sample workbook for this post here.



Viewing all articles
Browse latest Browse all 248

Trending Articles



<script src="https://jsc.adskeeper.com/r/s/rssing.com.1596347.js" async> </script>