In one of my first posts on Power Query (and still my favourite) I found the top 100 words in the complete works of Shakespeare. As always when you’re learning a new tool, though, I look back at what I wrote then and realise there are better ways of doing things… one of which is removing all of the punctuation from a piece of text.
In the post I’m talking about I used the following expression to remove punctuation:
Text.Remove(RemoveLineBreaks,{"," ,"." ,"?" ,";" ,":" ,";" ,"’","@" ,"#" ,"~" ,"{" ,"[" ,"}" ,"]" ,"(" ,")", "*"})
However, as you can see, it only removes the punctuation could be bothered to hard-code in the list – luckily Shakespeare didn’t use too much exotic punctuation! There is a better way of doing this, though, and it relies on the Character.FromNumber() function which takes a number and returns the equivalent Unicode character; for example, the expression Character.FromNumber(38) returns an ampersand & symbol.
Looking at the list of Unicode characters here it’s easy to identify the ranges of numbers that represent punctuation characters. These ranges can then be represented in M as lists. Remembering that lists of ranges of numbers can be declared easily by using the notation {1..5}, which results in the list of numbers {1,2,3,4,5}, you can write a list of lists containing the numbers representing the Latin punctuation characters in Unicode like so:
{{0..31},{33..47},{58..64},{91..96},{123..191}}
and you can use List.Combine() to turn this list of lists into a single list of numbers. Finally, you can use List.Transform() and Character.FromNumber() to get a list of the Unicode characters for these numbers, and pass that list to Text.Remove(). Here’s a query showing everything:
let
//get a list of lists containing the numbers of Unicode punctuation characters
numberlists = {{0..31},{33..47},{58..64},{91..96},{123..191}},
//turn this into a single list
combinedlist = List.Combine(numberlists),
//get a list of all the punctuation characters that these numbers represent
punctuationlist = List.Transform(combinedlist, each Character.FromNumber(_)),
//some text to test this on
inputtext = "Hello! My name is Chris, and I'm hoping that this *cool* post will help you!",
//the text with punctuation removed
outputtext = Text.Remove(inputtext, punctuationlist)
in
outputtext
The output of this query is this:
You can download the sample workbook for this post here.