Chris Webb’s BI Blog: Power Query

Working with Web Services in Power Query


One of many cool things about Power Query is the way that it allows you to retrieve data from web services and load it into Excel. While this is a subject that lots of people (including me) have already blogged about, the Web.Contents() function – which is the M function that you’ll need to use to call a web service in your query – has gained some new, useful functionality over the last few releases that isn’t fully documented anywhere and which is important to know about.
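Before looking at the new functionality, it’s worth remembering what the simplest possible call looks like. Web.Contents() takes a URL and returns the body of the HTTP response as a binary value, which functions like Json.Document() or Xml.Tables() can then parse. Here’s a minimal sketch (the URL is just a placeholder):

let
    //Issues a GET request and returns the response body as a binary value
    Source = Web.Contents("http://example.com/somewebservice")
in
    Source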

For this post I’ll use a real-life example of a web service that I’ve been working with recently. As you may know, I’m one of the organisers of SQLBits and a few weeks ago I was given the task of building a Power BI solution to monitor registrations. SQLBits uses RegOnline to handle registrations and they have an API that allows developers to access registration data for events. I’ve been using this API in Power Query.

To take a simple example that illustrates the new functionality in Web.Contents(), take a look at the documentation for RegOnline’s GetEvents method:
http://developer.regonline.com/getevents/

When calling this method in Power Query using Web.Contents() there are two things I need to do:

  1. Pass two query parameters to it, called filter and orderBy – although I only need to pass empty values to these parameters
  2. Handle authentication by passing an API token through the HTTP header

Here’s one way of using Web.Contents() to do this:

Web.Contents(
  "https://www.regonline.com/api/default.asmx/GetEvents",
  [
   Query=[ #"filter"="", #"orderBy"=""],
   Headers=[#"APIToken" = "insertAPITokenHere"]
  ])

 

In the example above, the second parameter to Web.Contents() is a record (so it has to be enclosed in square brackets) containing two option fields, Query and Headers. The Query field is itself a record, with one field for each of the two query parameters I’m passing. The Headers field is another record, with one field for each custom HTTP header I need (see this post on the Power Query forum for more details). The RegOnline API specifies that for the purposes of authentication a header called APIToken is needed, containing the API token, so the Headers option here contains one field for that header.

The problem with handling authentication like this is that I need to hard-code the RegOnline API token in the code for my Power Query query, which is not exactly secure. Therefore instead of doing this I use the ApiKeyName option to tell Power Query the name of the header that should contain the RegOnline API token:

Web.Contents(
 "https://www.regonline.com/api/default.asmx/GetEvents",
 [
  Query=[ #"filter"="", #"orderBy"=""],
  ApiKeyName="APIToken"
 ])

 

This way, when the Power Query query is executed for the first time, you get prompted to tell Power Query how to authenticate the GetEvents method and in the dialog you have to choose Web API and enter the API token in the Key box, as shown below:

image

After this the API token gets stored separately from the query in Power Query’s own secure credentials store and you don’t need to enter it again; this also means that the API token will not travel with the workbook if you email it to someone else, for example.

One last thing to mention (and this is something that deserves a separate blog post really) is that while Web.Contents() generates a GET request by default, you can make it generate a POST request by specifying the Content option. You can find details on how to do this on this thread started by Lee Hawthorn on the old Data Explorer forum.



Web Services And POST Requests In Power Query


In my recent post on web services in Power Query I mentioned that while the Power Query Web.Contents() function generates a GET request by default, you can make it generate a POST request by specifying the Content option. Since this is a useful thing to be able to do I thought I’d put together a detailed example of how this works.

For my example I’m going to use the Bing Maps Elevations API, which allows you to get the elevation in metres for a set of geographic locations. You can read the documentation here:

http://msdn.microsoft.com/en-us/library/jj158961.aspx

As it says at the bottom of the page, if you have a large number of locations to pass to the web service you can do so by passing them as a comma delimited list of latitudes and longitudes using a POST request. Here’s the code for a Power Query query that generates a list of latitudes and longitudes that stretches across the UK from North Wales in the west to the Wash in the east and finds the elevation for each point:

let
    //Insert your Bing Maps API key here
    BingMapsKey = "xxxx",
    //Latitude
    Latitude = 53,
    //Generate a list of longitudes
    Longitudes = List.Numbers(-4.5,66,0.1),
    //Generate a list of latitudes and longitudes
    PointList = List.Transform(Longitudes,
        each Number.ToText(Latitude) & "," & Number.ToText(_)),
    //Turn this list into comma delimited text
    PointListText = Text.Combine(PointList,","),
    //Add the text "points=" to the beginning
    PostContents = "points=" & PointListText,
    //Call the Elevations web service
    GetElevations = Web.Contents(
                    "http://dev.virtualearth.net/REST/v1/Elevation/List?key=" & BingMapsKey,
                    [Content=Text.ToBinary(PostContents)]),
    //Treat the result as a JSON document
    ImportedJSON = Json.Document(GetElevations),
    //Navigate to the elevations data
    resourceSets = ImportedJSON[resourceSets],
    resourceSets1 = resourceSets{0},
    resources = resourceSets1[resources],
    resources1 = resources{0},
    elevations = resources1[elevations],
    //Turn the elevations data into a table
    TableFromList = Table.FromList(elevations, Splitter.SplitByNothing(),
                                null, null, ExtraValues.Error),
    //Rename the column containing the elevations
    RenamedColumns = Table.RenameColumns(TableFromList,{{"Column1", "Elevations"}}),
    //Add a column containing the latitude
    InsertedCustom = Table.AddColumn(RenamedColumns, "Latitude", each Latitude),
    //Add an index column
    InsertedIndex = Table.AddIndexColumn(InsertedCustom,"Index"),
    //Use the index column to find the longitude for the current row
    InsertedCustom1 = Table.AddColumn(InsertedIndex, "Longitude", each Longitudes{[Index]}),
    //Remove the index column
    RemovedColumns = Table.RemoveColumns(InsertedCustom1,{"Index"}),
    //Set all columns to data type number
    ChangedType = Table.TransformColumnTypes(RemovedColumns,
                    {{"Elevations", type number}, {"Latitude", type number},
                    {"Longitude", type number}})
in
    ChangedType

 

It’s quite easy to edit the code so that it generates a list of latitudes and longitudes across the country of your choice…
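For example, here’s a minimal sketch (with purely hypothetical coordinates) showing how changing just the Latitude and Longitudes steps changes the list of points that gets built:

let
    //A hypothetical latitude for a different transect
    Latitude = 54.5,
    //Generate 60 longitudes starting at -3.0, in steps of 0.1
    Longitudes = List.Numbers(-3.0, 60, 0.1),
    //Build the comma delimited list of points exactly as in the query above
    PointList = List.Transform(Longitudes,
        each Number.ToText(Latitude) & "," & Number.ToText(_)),
    PointListText = Text.Combine(PointList, ",")
in
    PointListText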

Two things to point out:

  • To get this to work you need to insert a Bing Maps API key in the first step where indicated. If you don’t have one, you can get your own at https://www.bingmapsportal.com
  • The important step is GetElevations. The code is:

    Web.Contents(
    "http://dev.virtualearth.net/REST/v1/Elevation/List?key=" & BingMapsKey,
    [Content=Text.ToBinary(PostContents)]),

    You can see here how the list of locations is passed to the Web.Contents() function (documentation here) via the Content field; notice also that I’ve had to use Text.ToBinary() on the text that I’m passing in.
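    Distilled down to its essentials, the pattern looks something like this (a sketch – the URL and body here are placeholders, and the Headers option from my previous post can be combined with Content if a service requires a particular content type):

    let
        //Specifying the Content option makes Web.Contents() issue a POST request
        Response = Web.Contents(
            "http://example.com/api/method",
            [Content = Text.ToBinary("param1=value1&param2=value2"),
             Headers = [#"Content-Type" = "application/x-www-form-urlencoded"]]),
        //Parse the response as JSON
        Result = Json.Document(Response)
    in
        Result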

Here’s the output in Power Map:

image

You can download the sample workbook here.


Displaying Help For A Power Query Function


Matt Masson and Theresa Palmer gave an excellent presentation on Power Query and M for the PASS DW/BI Virtual Chapter a few days ago (hopefully it will be on their YouTube channel soon). One thing that they showed which I hadn’t seen before was that you can display help for a function in M’s built in library simply by typing its name in the formula bar.

Take, for example, the Text.Replace() function. With a new blank query, if you create a step with the following definition:

= Text.Replace

You will see help and examples as shown below:

image

By doing this you are creating a step that returns the function itself – note that this is not the same as invoking the function, although it does mean you can invoke the function in a subsequent step. If you do decide to use the function you just need to click Invoke and a dialog will appear to prompt you for the values to pass to the function:

image

Clicking OK will show the output of the function:

image

Here’s the full M code for the query for those of you who are curious:

let
    Source = Text.Replace,
    InvokedSource = Source("the cat sat on the mat", "cat", "dog")
in
    InvokedSource

A very useful tip! Unfortunately you can’t specify help text for your own functions yet, although Matt did say it was something they wanted to do.


Expanding All Columns In A Table In Power Query


When I’m working with XML files, or web pages, or any data with columns containing nested tables in Power Query, I often end up having to expand every expandable column in the table and then expanding any new columns that are revealed after that to find the data that I’m looking for. This is such a pain that I thought I’d write a function to do it for me – which is the subject of this post.

For example, consider the following XML:

image

If you load this into Power Query you will see the following table created for the first step:

image

Getting to a table where all of the data is visible requires clicking on the expand icons in the address and the employees columns (highlighted), and then three more clicks after that. Sigh.

Here’s my function, called ExpandAll, to expand all the columns in a table that can be expanded:

let
    //Define function taking two parameters - a table and an optional column number
    Source = (TableToExpand as table, optional ColumnNumber as number) =>
    let
        //If the column number is missing, make it 0
        ActualColumnNumber = if (ColumnNumber=null) then 0 else ColumnNumber,
        //Find the column name relating to the column number
        ColumnName = Table.ColumnNames(TableToExpand){ActualColumnNumber},
        //Get a list containing all of the values in the column
        ColumnContents = Table.Column(TableToExpand, ColumnName),
        //Iterate over each value in the column and then
        //if the value is of type table get a list of all of the columns in the table
        //then get a distinct list of all of these column names
        ColumnsToExpand = List.Distinct(List.Combine(List.Transform(ColumnContents,
                            each if _ is table then Table.ColumnNames(_) else {}))),
        //Append the original column name to the front of each of these column names
        NewColumnNames = List.Transform(ColumnsToExpand, each ColumnName & "." & _),
        //Is there anything to expand in this column?
        CanExpandCurrentColumn = List.Count(ColumnsToExpand)>0,
        //If this column can be expanded, then expand it
        ExpandedTable = if CanExpandCurrentColumn
                            then
                            Table.ExpandTableColumn(TableToExpand, ColumnName,
                                ColumnsToExpand, NewColumnNames)
                            else
                            TableToExpand,
        //If the column has been expanded then keep the column number the same, otherwise add one to it
        NextColumnNumber = if CanExpandCurrentColumn then ActualColumnNumber else ActualColumnNumber+1,
        //If the column number is now greater than the number of columns in the table
        //then return the table as it is
        //else call the ExpandAll function recursively with the expanded table
        OutputTable = if NextColumnNumber>(Table.ColumnCount(ExpandedTable)-1)
                            then
                            ExpandedTable
                            else
                            ExpandAll(ExpandedTable, NextColumnNumber)
    in
        OutputTable
in
    Source

 

You can then use this function on the XML file shown above as follows:

let
    //Load XML file
    Source = Xml.Tables(File.Contents("C:\Users\Chris\Documents\PQ XML Expand All Demo.xml")),
    ChangedType = Table.TransformColumnTypes(Source,{{"companyname", type text}}),
    //Call the ExpandAll function to expand all columns
    Output = ExpandAll(ChangedType)
in
    Output

 

And bingo, in one step, you get everything:

image

You can download the sample workbook here.


Implementing A Basic LIKE/Wildcard Search Function In Power Query


Last week someone asked me whether it was possible to do the equivalent of a SQL LIKE filter in Power Query. Unfortunately there isn’t a function to do this in the standard library but, as always, it is possible to write some M code to do it. Here’s what I came up with while I was waiting around at the stables during my daughter’s horse-riding lesson. At the moment it only supports the % wildcard character; also I can’t guarantee that it’s the most efficient implementation or indeed 100% bug-free, but it seems to work fine as far as I can see…

let
    Like = (Phrase as text, Pattern as text) =>
    let
        //Split the pattern up into a list using % as a delimiter
        PatternList = Text.Split(Pattern, "%"),
        //if the first character in the pattern is %
        //then the first item in the list is an empty string
        StartsWithWc = (List.First(PatternList)=""),
        //if the last character in the pattern is %
        //then the last item in the list is an empty string
        EndsWithWc = (List.Last(PatternList)=""),
        //if the first character is not %
        //then we have to match the first string in the pattern
        //with the opening characters of the phrase
        StartsTest = if (StartsWithWc=false)
           then Text.StartsWith(Phrase, List.First(PatternList))
           else true,
        //if the last item is not %
        //then we have to match the final string in the pattern
        //with the final characters of the phrase
        EndsText = if (EndsWithWc=false)
           then Text.EndsWith(Phrase, List.Last(PatternList))
           else true,
        //now we also need to check that each string in the pattern appears
        //in the correct order in the phrase
        //and to do this we need to declare a function PhraseFind
        PhraseFind = (Phrase as text, SearchString as list) =>
        let
            //does the first string in the pattern appear in the phrase?
            StringPos = Text.PositionOf(Phrase, SearchString{0}, Occurrence.First),
            PhraseFindOutput =
                if
                //if the string is not found then return false
                (StringPos=-1)
                then false
                else if
                //we have found the string in the pattern, and
                //if this is the last string in the pattern, return true
                List.Count(SearchString)=1
                then true
                else
                //if it isn't the last string in the pattern
                //test the next string in the pattern by removing
                //the first string from the pattern list
                //and all text up to and including the string we have found in the phrase
                (true and
                @PhraseFind(
                Text.RemoveRange(Phrase, 0, StringPos + Text.Length(SearchString{0})),
                List.RemoveRange(SearchString, 0, 1)))
        in
            PhraseFindOutput,
        //return true if we have passed all tests
        Output = StartsTest and EndsText and PhraseFind(Phrase, PatternList)
    in
        Output
in
    Like

 

Using the following test data:

image

I can run the following query:

let
    Source = Excel.CurrentWorkbook(){[Name="Phrases"]}[Content],
    ChangedType = Table.TransformColumnTypes(Source,{{"Phrases", type text}}),
    InsertedCustom = Table.AddColumn(ChangedType, "Test", each Like([Phrases],"%cat%sat%mat%"))
in
    InsertedCustom

 

And get this output:

image

You can download the sample workbook here.

I know the Power Query team have been asked for this several times already, but it would be really useful if we could package up functions like this and make it easy to share them publicly with other Power Query users…


Join Conditions In Power Query, Part 1


In last week’s post I showed how to create a simple LIKE function in Power Query which could be used in a calculated column. This week I’m going to show you how you can use this function in a condition when joining two tables together.

Consider the following two tables in an Excel worksheet:

image

Let’s say you want to join the Patterns table to the Phrases table, but only return the rows for each pattern where the Like() function returns true. Power Query has good support for different types of joins in the Table.Join() and Table.NestedJoin() functions but it isn’t immediately obvious how to handle join conditions such as this one.
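For comparison, a standard equi-join needs no tricks at all. Here’s a minimal sketch of Table.NestedJoin() syntax, using two hypothetical tables declared inline:

let
    //Two hypothetical tables declared inline for the sake of the example
    Orders = #table({"OrderID", "CustomerID"}, {{1, "A"}, {2, "B"}}),
    Customers = #table({"CustomerID", "Name"}, {{"A", "Jane"}, {"B", "Bob"}}),
    //Table.NestedJoin() adds a column containing, for each row of Orders,
    //a nested table of the matching rows from Customers
    Joined = Table.NestedJoin(
        Orders, {"CustomerID"},
        Customers, {"CustomerID"},
        "CustomerDetails",
        JoinKind.LeftOuter)
in
    Joined

Neither of these functions, though, accepts an arbitrary condition like Like() – hence the workaround that follows.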

Here’s the M code for a query that shows you how to do this:

let
    //Load Patterns table
    PatternsTable = Excel.CurrentWorkbook(){[Name="Patterns"]}[Content],
    //Load Phrases table
    PhrasesTable = Excel.CurrentWorkbook(){[Name="Phrases"]}[Content],
    //Add custom column to Patterns that returns the Phrases table for each row
    InsertedCustom = Table.AddColumn(PatternsTable, "AllPhrases", each PhrasesTable),
    //Expand the new column
    ExpandedTable = Table.ExpandTableColumn(InsertedCustom, "AllPhrases", {"Phrases"}, {"Phrases"}),
    //Filter the expanded table using the Like() function
    Custom1 = Table.SelectRows(ExpandedTable, each Like([Phrases],[Patterns]))
in
    Custom1

 

What this query does is the following:

  • Gets the data from both the Patterns table and the Phrases table. I’ve done this as two steps in the same query for simplicity; you might want to create two separate queries to do this.
  • On the Patterns table, adds a new custom column that returns the entire Phrases table as a value. The Insert Custom Column dialog and its output will look like this:

image

  • Expands the new custom column so that every row of the Phrases table is displayed against every row of the Patterns table – in effect, it does a cross join between the two tables. Interestingly there is no option to do a cross join using the Table.Join() and Table.NestedJoin() functions.

image

  • Finally, it uses the Like() function in Table.SelectRows() as follows:

    Table.SelectRows(ExpandedTable, each Like([Phrases],[Patterns]))

    …to filter the table from the previous step so that you only get the rows back where the Like() function returns true and the pattern matches the phrase:

image

So… we have a solution to our problem (and you can download the example workbook here). However, be warned: for large tables this approach may not perform well! In part two I’ll show you another example of a complex join condition and show you how different approaches can yield very different performance.


Join Conditions in Power Query, Part 2: Events-In-Progress, Performance and Query Folding


In my last post you saw how to join two tables together using conditions other than the built-in inner, outer and anti join conditions. However, as I mentioned, you need to be aware of some of the performance implications of using this technique – and that is the subject I’ll be looking at in this post.

Let’s take the events-in-progress problem, one that I have blogged about many times in the past (see here, here, here and here for example) and see how we can solve it in Power Query. It’s a very common problem that you encounter when you have a fact table where each row represents an event, there are columns containing the start and end dates of each event, and you want to count the number of events that were in progress on any given date.

Using the Adventure Works DW database, my normal way of illustrating the problem is this: taking the DimDate and FactInternetSales tables, for each date show the number of purchases that had been ordered but not shipped on that date. One way of solving this problem (but not the only way, as my previous blog posts show) is to do a cross join between the DimDate and FactInternetSales tables, then filter the result so that you get the rows where the value in the DateKey column from the DimDate table is between the values in the OrderDateKey and the ShipDateKey columns from the FactInternetSales table.

As a first attempt you might come up with a query like this one, which uses a custom column containing a table value that is subsequently expanded to do a cross join (the technique shown in my previous post):

let
    //Connect to SQL Server
    Source = Sql.Database("localhost", "Adventure Works DW"),
    //Get data from the DimDate table
    dbo_DimDate = Source{[Schema="dbo",Item="DimDate"]}[Data],
    //Remove all columns except the DateKey column
    RemovedOtherColumns = Table.SelectColumns(dbo_DimDate,{"DateKey"}),
    //Insert a custom column that contains the whole of FactInternetSales as a table in each row
    InsertedCustom = Table.AddColumn(RemovedOtherColumns, "FactInternetSalesTable",
                        each Source{[Schema="dbo",Item="FactInternetSales"]}[Data]),
    //Expand this new column to show the OrderDateKey and ShipDateKey columns
    #"Expand FactInternetSalesTable" = Table.ExpandTableColumn(InsertedCustom,
                        "FactInternetSalesTable",
                        {"OrderDateKey", "ShipDateKey"},
                        {"FactInternetSalesTable.OrderDateKey",
                        "FactInternetSalesTable.ShipDateKey"}),
    //Filter where DateKey is greater than or equal to OrderDateKey and
    //DateKey is less than or equal to ShipDateKey
    FilteredRows = Table.SelectRows(#"Expand FactInternetSalesTable",
                        each [DateKey] >= [FactInternetSalesTable.OrderDateKey] and
                        [DateKey] <= [FactInternetSalesTable.ShipDateKey]),
    //Find the count of the number of rows grouped by DateKey
    GroupedRows = Table.Group(FilteredRows, {"DateKey"},
                        {{"Count", each Table.RowCount(_), type number}})
in
    GroupedRows

There is, however, a big problem with this query: on my laptop it runs and runs forever – well, maybe not forever but I cancelled it after several minutes. Some tuning is necessary.

I don’t think anyone outside the Power Query dev team has much experience of performance tuning Power Query yet. However there is one golden rule that I do know: where possible, allow Power Query to push as much of the work as possible back to the data source. This behaviour is known as “query folding” and it’s something that I’ve blogged about, as have Matt Masson and Darren Gosbell. Looking in SQL Server Profiler for the query above it is clear that no query folding is taking place: the only activity visible is Power Query reading the data from the DimDate and FactInternetSales tables separately.
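For contrast, here’s a minimal sketch (assuming the same server and database) of the kind of simple query where folding does take place – the column selection and the row filter would typically both be translated into a single SQL SELECT statement with a WHERE clause:

let
    Source = Sql.Database("localhost", "Adventure Works DW"),
    dbo_DimDate = Source{[Schema="dbo",Item="DimDate"]}[Data],
    //Both of the following steps can be folded back to SQL Server,
    //so the filtering happens in the database rather than in Power Query
    RemovedOtherColumns = Table.SelectColumns(dbo_DimDate, {"DateKey"}),
    FilteredRows = Table.SelectRows(RemovedOtherColumns, each [DateKey] >= 20030101)
in
    FilteredRows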

After a bit of trial and error I came up with the following alternative:

let
    //Connect to SQL Server
    Source = Sql.Database("localhost", "adventure works dw"),
    //Get data from the DimDate table
    dbo_DimDate = Source{[Schema="dbo",Item="DimDate"]}[Data],
    //Remove all columns except DateKey
    RemovedOtherColumns = Table.SelectColumns(dbo_DimDate,{"DateKey"}),
    //Add a custom column to DimDate containing the value 1
    InsertedCustom = Table.AddColumn(RemovedOtherColumns, "Dummy", each 1),
    //Get data from the FactInternetSales table
    dbo_FactInternetSales = Source{[Schema="dbo",Item="FactInternetSales"]}[Data],
    //Remove all columns except OrderDateKey and ShipDateKey
    RemovedOtherColumns1 = Table.SelectColumns(dbo_FactInternetSales,
                        {"OrderDateKey", "ShipDateKey"}),
    //Add a custom column to FactInternetSales containing the value 1
    InsertedCustom1 = Table.AddColumn(RemovedOtherColumns1, "Dummy", each 1),
    //Join DimDate and FactInternetSales on the two columns that contain 1
    Custom1 = Table.Join(InsertedCustom1, "Dummy", InsertedCustom, "Dummy"),
    //Filter rows where DateKey is between OrderDateKey and ShipDateKey
    FilteredRows = Table.SelectRows(Custom1,
                    each [DateKey] >= [OrderDateKey] and
                    [DateKey] <= [ShipDateKey]),
    //Group by DateKey and find the number of rows for each date
    GroupedRows = Table.Group(FilteredRows, {"DateKey"},
                    {{"Count", each Table.RowCount(_), type number}}),
    //Sort dates in ascending order
    SortedRows = Table.Sort(GroupedRows,{{"DateKey", Order.Ascending}})
in
    SortedRows

This returns the correct result more or less instantly:

image

The main difference between this query and the previous one is how I’m getting a cross join between the two tables. This time I’m creating custom columns on DimDate and FactInternetSales that both contain the value 1, and then doing an inner join between these two tables on the new columns – which of course results in the equivalent of a cross join.

In Profiler I can see the following SQL query being generated by Power Query:

select [_].[DateKey],
    [_].[Count]
from
(
    select [rows].[DateKey] as [DateKey],
        count(1) as [Count]
    from
    (
        select [_].[OrderDateKey],
            [_].[ShipDateKey],
            [_].[Dummy],
            [_].[DateKey]
        from
        (
            select [$Outer].[OrderDateKey],
                [$Outer].[ShipDateKey],
                [$Inner].[Dummy],
                [$Inner].[DateKey]
            from
            (
                select [_].[OrderDateKey] as [OrderDateKey],
                    [_].[ShipDateKey] as [ShipDateKey],
                    1 as [Dummy]
                from
                (
                    select [OrderDateKey],
                        [ShipDateKey]
                    from [dbo].[FactInternetSales] as [$Table]
                ) as [_]
            ) as [$Outer]
            inner join
            (
                select [_].[DateKey] as [DateKey],
                    1 as [Dummy]
                from
                (
                    select [DateKey]
                    from [dbo].[DimDate] as [$Table]
                ) as [_]
            ) as [$Inner] on ([$Outer].[Dummy] = [$Inner].[Dummy])
        ) as [_]
        where [_].[DateKey] >= [_].[OrderDateKey] and [_].[DateKey] <= [_].[ShipDateKey]
    ) as [rows]
    group by [DateKey]
) as [_]
order by [_].[DateKey]

 

Query folding is definitely taking place now!

The last question to ask here is whether the first query was slow because query folding was not taking place, or slow because of the way the query was written. You can test this quite easily by rewriting the second query to prevent query folding taking place using the Table.Buffer() function. For example, in the second query the step to get the data from the DimDate table is:

dbo_DimDate = Source{[Schema="dbo",Item="DimDate"]}[Data],

To prevent query folding it needs to be altered to:

dbo_DimDate = Table.Buffer(Source{[Schema="dbo",Item="DimDate"]}[Data]),

(The step to get data from the FactInternetSales table needs to be altered in the same way.)

With this change made the query now executes in around a minute. So clearly the new query is more efficient when it is executed inside the Power Query engine itself, without query folding, but the Power Query engine is still nowhere near as fast as SQL Server and query folding gives the best possible performance.

You can download the sample workbook here.


Using List.Generate() To Make Multiple Replacements Of Words In Text In Power Query


Recently I had a request for help from someone who wanted to do the following in Power Query: take a piece of text and then, using a table, search for all of the occurrences of the words in one column of the table in the text and replace those words with those in the other column. So, for example, given these two tables in Excel:

image

You want to take the table on the left and for each piece of text replace the words in the ‘Word To Replace’ column of the right-hand table with those in the ‘Replace With’ column of the right-hand table. The output would therefore be:

image

An interesting challenge in itself, and one I solved first of all using a recursive function. Here’s some code showing how I did it:

let
    //Get table of word replacements
    Replacements = Excel.CurrentWorkbook(){[Name="Replacements"]}[Content],
    //Get table containing text to change
    TextToChange = Excel.CurrentWorkbook(){[Name="Text"]}[Content],
    //Get a list of all words to replace
    WordsToReplace = Table.Column(Replacements, "Word To Replace"),
    //Get a list of all words to replace with
    WordsToReplaceWith = Table.Column(Replacements, "Replace With"),
    //Recursive function to do the replacement
    ReplacementFunction = (InputText, Position)=>
    let
        //Use Text.Replace to do each replace
        ReplaceText = Text.Replace(
            InputText,
            WordsToReplace{Position},
            WordsToReplaceWith{Position})
    in
        //If we have reached the end of the list of replacements
        if Position=List.Count(WordsToReplace)-1
        then
        //return the output of the query
        ReplaceText
        else
        //call the function again
        @ReplacementFunction(ReplaceText, Position+1),
    //Add a calculated column to call the function on every row in the table
    //containing text to change
    Output = Table.AddColumn(TextToChange, "Changed Text", each ReplacementFunction([Text], 0))
in
    Output

 

It does the job, but… after thinking about this some more, I wondered if there was a better way. A lot of my recent Power Query blog posts have used recursive functions, but are they a Good Thing? So I asked on the forum, and as usual the nice people on the Power Query dev team answered very promptly (that’s one of the things I like about the Power Query dev team – they engage with their users). Recursive functions are indeed something that should be avoided if there is an alternative, and in this case List.Generate() can be used instead. Here’s how:

let
    //Get table of word replacements
    Replacements = Excel.CurrentWorkbook(){[Name="Replacements"]}[Content],
    //Get table containing text to change
    TextToChange = Excel.CurrentWorkbook(){[Name="Text"]}[Content],
    //Get list of words to replace
    WordsToReplace = Table.Column(Replacements, "Word To Replace"),
    //Get list of words to replace them with
    WordsToReplaceWith = Table.Column(Replacements, "Replace With"),
    //A non-recursive function to do the replacements
    ReplacementFunction = (InputText)=>
    let
        //Use List.Generate() to do the replacements
        DoReplacement = List.Generate(
                          ()=> [Counter=0, MyText=InputText],
                          each [Counter]<=List.Count(WordsToReplaceWith),
                          each [Counter=[Counter]+1,
                                MyText=Text.Replace(
                                         [MyText],
                                         WordsToReplace{[Counter]},
                                         WordsToReplaceWith{[Counter]})],
                          each [MyText]),
        //Return the last item in the list that
        //List.Generate() returns
        GetLastValue = List.Last(DoReplacement)
    in
        GetLastValue,
    //Add a calculated column to call the function on every row in the table
    //containing the text to change
    Output = Table.AddColumn(TextToChange, "Changed Text", each ReplacementFunction([Text]))
in
    Output

 

List.Generate() is a very powerful function indeed, albeit one that took me a while to understand properly. It’s a bit like a FOR loop, even though it’s a function that returns a list. Here’s what each of the parameters I’m passing to the function in the example above does:

  •  ()=> [Counter=0, MyText=InputText] returns a function that itself returns a record (a record is a bit like a table with just one row in it). The record contains two fields: Counter, which has the value 0, and MyText which is given the value of the text where the values are to be replaced. This record is the initial value that List.Generate() will modify at each iteration.
  • each [Counter]<=List.Count(WordsToReplaceWith) returns a function too. An each expression is a quick way of declaring a function that takes one, unnamed parameter, and in this case the value that will be passed to this parameter is a record of the same structure as the one declared in the previous bullet. The expression [Counter] gets the value of the Counter field from that record. The function returns a boolean value, true when the value in the [Counter] field of the record is less than or equal to the number of items in the list of words to replace. List.Generate() returns a list, and while this function returns true it will keep on iterating and adding new items to the list it returns.
  • each [Counter=[Counter]+1, MyText=Text.Replace([MyText], WordsToReplace{[Counter]}, WordsToReplaceWith{[Counter]})] returns yet another function, once again declared using an each expression. The function here takes the record from the current iteration and returns the record to be used at the next iteration: a record where the value of the Counter field is increased by one, and where the value of the MyText field has one word replaced. The word that gets replaced in MyText is the word in the (zero-based) row number given by Counter in the ‘Word To Replace’ column; this word is replaced by the word in the row number given by Counter in the ‘Replace With’ column.
  • each [MyText] returns a very simple function, one that returns the value from the MyText field of the record from the current iteration. It’s the value that this function returns that is added to the list returned by List.Generate() at every iteration.

To illustrate this, here’s a simplified example showing how List.Generate() works in this case:

let
    WordsToReplace = {"cat", "dog", "mat"},
    WordsToReplaceWith = {"fish", "snake", "ground"},
    Demo = List.Generate(
              ()=> [Counter=0, MyText="the cat and the dog sat on the mat"],
              each [Counter]<=List.Count(WordsToReplaceWith),
              each [Counter=[Counter]+1,
                    MyText=Text.Replace(
                             [MyText],
                             WordsToReplace{[Counter]},
                             WordsToReplaceWith{[Counter]})],
              each [MyText])
in
    Demo

 

The output of this query is the list:

image

This list can be written as (with the words changed at each iteration highlighted):

{“the cat and the dog sat on the mat”, “the fish and the dog sat on the mat”,  “the fish and the snake sat on the mat”, “the fish and the snake sat on the ground”}

So, another useful function to know about. I’m slowly getting to grips with all this functional programming!

You can download the sample workbook here.



Power Query Book Published!


Looking for some summer holiday (or winter holiday, depending on which hemisphere you live in) reading? If so, may I suggest my new Power Query book? “Power Query for Power BI and Excel” is available now from the Apress site, Amazon.com, Amazon.co.uk and all good bookstores.

Power Query for Power BI and Excel Cover Image

It’s an introductory level book. It covers all of the stuff you can do in the UI, it has a chapter on M, and it goes into a reasonable amount of detail on more advanced topics; it is not a 500-page exhaustive guide to the product. I’ve focused on readability and teaching the fundamentals of Power Query rather than looking at every obscure M function, but at the same time if you’ve already used Power Query I think there’ll be plenty of material in there you’ll find interesting.

Now for the bad news: the book is out-of-date already, although not by much. One of the best things about Power Query is the monthly release cycle; unfortunately that makes writing a book on it a bit of a nightmare. I started off writing in January and had to deal with lots of added functionality and changes to the UI over the next few months; I had to retake pretty much all of the screenshots as a result. The published version of the book is based on the version of Power Query that was released in early June rather than the current version. Hopefully you can forgive this – the differences are minor – but it’s a good reason to buy the book as soon as you can! I want to do a second edition in a year’s time once (if?) the release cycle slows down.

I’ve been teased a bit for blogging and teaching so much about Power Query recently, so the final thing I want to say here is why an old corporate BI/SSAS guy like me is getting so excited about a self-service ETL tool. Well, the main reason is that Power Query is a great piece of software. It does what it does very well; it does useful things rather than what the marketing guys/analysts/journalists think is hot in BI; it is easy to use but at the same time is flexible enough for the advanced user to do really complex stuff; it is updated regularly based on feedback from its users. I only wish all Microsoft software was this good… Honestly, I wouldn’t be able to motivate myself to blog and write about Power Query if I didn’t think it was cool, and even though it hasn’t been hyped in the same way as other parts of the Power BI stack it is nonetheless the part that people get excited about when I show them Power BI. It’s not just me either – every day I see positive comments like Greg Low’s here. I think it is as important, if not more important, than Power Pivot and I think it will be a massive success.

Oh, and did I mention that I’m also teaching a Power Query course in London later this year….?


Working With Excel Named Ranges In Power Query


One of the more recent additions to Power Query is the ability to access data from named ranges in the Excel worksheet rather than an Excel table. I’ve got used to formatting data as tables in Excel because that’s what Power Pivot needs to import data directly from the worksheet, but if you are working with Power Query and a pre-existing workbook then trying to reformat data as tables can be a pain. Also, if you just want to import a single value, for example as a parameter to a query, a table seems like overkill.

(Incidentally, if you’re wondering what a named range is in Excel, there are tons of good introductions to the subject on the internet like this one. You can do loads of cool stuff with them.)

Consider the following Excel worksheet:

image

There are three named ranges here: FirstRange, SecondRange and ThirdRange, and the values in the cells show which range the cells are in. FirstRange consists of two cells in two columns; SecondRange consists of three cells in a single row; and ThirdRange consists of three non-contiguous cells. (You can also use this trick to display the names of all contiguous ranges in an Excel workbook, but alas it doesn’t seem to work for non-contiguous ranges.)

At the moment, the Power Query ribbon doesn’t make it obvious that you can use named ranges as data sources. However you can see all the tables and named ranges in a workbook, and even return that list from a query, by creating a blank query and using the expression

= Excel.CurrentWorkbook()

image

Once you’ve done this you can see all the ranges (and also any tables) in the workbook, and click on the table link next to the name to see the data. For example, clicking on FirstRange shows the following table in a new step in the query editor:

image

The expression to get at this table in a single step is:

= Excel.CurrentWorkbook(){[Name="FirstRange"]}[Content]

The range SecondRange in my example is equally straightforward to reference, and you can see its contents by using the expression

= Excel.CurrentWorkbook(){[Name="SecondRange"]}[Content]

image

Unfortunately ThirdRange, which is not contiguous, is a problem: I can only get the first cell in the range. So the expression

= Excel.CurrentWorkbook(){[Name="ThirdRange"]}[Content]

returns just this table:

image

It would be nice if we could get a list containing the cell values, rather than a table, for ranges like this…

Last thing to mention is that if you do want the value in a cell, rather than a table, you just need to right-click inside the cell in the Query Editor and select Drill Down:

image

This returns the value (in this case the text “Third Range Cell 1”) in the cell you clicked on:

image

This is a much more useful value to return than a table containing a single row/column, if you intend to use a value from a single cell in a named range as a parameter to another query.
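For reference, the expression that Drill Down generates is just a positional lookup on the table, something like this (the single column of a named range is called Column1 by default):

= Excel.CurrentWorkbook(){[Name="ThirdRange"]}[Content]{0}[Column1]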

You can download the sample workbook for this post here.


Comparers, Combiners, Replacers and Splitters in Power Query


At the end of the Power Query Formula Library Specification (which can be downloaded here) are sections on Comparer, Combiner, Replacer and Splitter functions. These functions are most often used in conjunction with other functions like Table.CombineColumns() and Table.SplitColumn, but what you may not realise from the documentation (which also has a few minor but nonetheless confusing bugs in it) is what these functions do: they are functions that return functions, and the functions that they return can be used independently just like any other function.

Take Splitter.SplitTextByDelimiter() as an example. It returns a function that splits a piece of text by a delimiter, and returns a list containing the resulting pieces. The following M code calls this function to return a function that splits comma delimited text:

let
    demo = Splitter.SplitTextByDelimiter(",")
in
    demo

 

As noted here, once you have a query that returns a function you can see that function’s signature and invoke it from the Query Editor window. Here’s what the query above shows in the Query Editor window:

image

If you click the Invoke button and enter the text

one,two,three,four

as follows:

image

then the code for the query becomes:

let
    demo = Splitter.SplitTextByDelimiter(","),
    Invokeddemo = demo("one,two,three,four")
in
    Invokeddemo

What is returned is the list {“one”, “two”, “three”, “four”} which looks like this in the Query Editor window:

image

There are various other Splitter functions that can be used to return functions that split text in different ways. Similarly, the Combiner functions return functions that can be used to combine a list of text into a single piece of text. For example:

let
    demo = Combiner.CombineTextByDelimiter("--"),
    Invokeddemo = demo({"one","two","three","four"})
in
    Invokeddemo

returns the text:

one--two--three--four

The Replacer functions return functions for replacing values in text, while the Comparer functions return functions that can be used for comparing text using specific cultures and case sensitivities.
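To show one function from each of these two families in action, here’s a minimal sketch (the text values are just examples). Replacer.ReplaceText() can be called directly to replace one piece of text inside another, while Comparer.FromCulture() returns a comparison function:

let
    //Replace the word "cat" with "dog" in a piece of text
    ReplaceDemo = Replacer.ReplaceText("the cat sat on the mat", "cat", "dog"),
    //Get a case-insensitive comparer function for the en-GB culture
    CaseInsensitiveComparer = Comparer.FromCulture("en-GB", true),
    //Returns 0, meaning the two values compare as equal
    CompareDemo = CaseInsensitiveComparer("POWER QUERY", "power query")
in
    CompareDemo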


Removing Punctuation From Text In Power Query


In one of my first posts on Power Query (and still my favourite) I found the top 100 words in the complete works of Shakespeare. As always when you’re learning a new tool, though, I look back at what I wrote then and realise there are better ways of doing things… one of which is removing all of the punctuation from a piece of text.

In the post I’m talking about I used the following expression to remove punctuation:

Text.Remove(RemoveLineBreaks,{"," ,"." ,"?" ,";" ,":" ,";" ,"’","@" ,"#" ,"~" ,"{" ,"[" ,"}" ,"]" ,"(" ,")", "*"})

However, as you can see, it only removes the punctuation characters I could be bothered to hard-code in the list – luckily Shakespeare didn’t use too much exotic punctuation! There is a better way of doing this, though, and it relies on the Character.FromNumber() function, which takes a number and returns the equivalent Unicode character; for example, the expression Character.FromNumber(38) returns an ampersand (&) symbol.

Looking at the list of Unicode characters here it’s easy to identify the ranges of numbers that represent punctuation characters. These ranges can then be represented in M as lists. Remembering that lists of ranges of numbers can be declared easily by using the notation {1..5}, which results in the list of numbers {1,2,3,4,5}, you can write a list of lists containing the numbers representing the Latin punctuation characters in Unicode like so:

{{0..31},{33..47},{58..64},{91..96},{123..191}}

and you can use List.Combine() to turn this list of lists into a single list of numbers.  Finally, you can use List.Transform() and Character.FromNumber() to get a list of the Unicode characters for these numbers, and pass that list to Text.Remove(). Here’s a query showing everything:

let
    //get a list of lists containing the numbers of Unicode punctuation characters
    numberlists = {{0..31},{33..47},{58..64},{91..96},{123..191}},
    //turn this into a single list
    combinedlist = List.Combine(numberlists),
    //get a list of all the punctuation characters that these numbers represent
    punctuationlist = List.Transform(combinedlist, each Character.FromNumber(_)),
    //some text to test this on
    inputtext = "Hello! My name is Chris, and I'm hoping that this *cool* post will help you!",
    //the text with punctuation removed
    outputtext = Text.Remove(inputtext, punctuationlist)
in
    outputtext

 

The output of this query is this:

image

You can download the sample workbook for this post here.


Create Your Own Relationships Between Tables In The Excel Data Model With Power Query


You probably know that, when you are importing data from multiple tables in SQL Server into the Excel Data Model in Excel 2013 using Power Query, Power Query will automatically create relationships between those tables in the Data Model. But did you know that you can get Power Query to do this for other data sources too?

Now wait – don’t get excited. I’ve known about this for a while but not blogged about it because I don’t think it works all that well. You have to follow some very precise steps to make it happen and even then there are some problems. That said, I think we’re stuck with the current behaviour (at least for the time being) so I thought I might as well document it.

Consider the following Excel worksheet with two tables in it, called Dimension and Fact:

image

If you were to load these two tables into the Excel Data Model, you would probably want to create a relationship between the two tables based on the FruitID column. Here are the steps to use Power Query to create the relationship automatically:

  1. Click inside the Dimension table and then, on the Power Query tab in the Excel ribbon, click the From Table button to create a new query.
  2. When the Query Editor window opens, right click on the FruitID column and select Remove Duplicates.
    image
    Why are we doing this when there clearly aren’t any duplicate values in this column? The new step contains the expression
    Table.Distinct(Source, {"FruitID"})
    …and one of the side-effects of using Table.Distinct() is that it adds a primary key to the table. Yes, tables in Power Query can have primary keys – the Table.AddKey() function is another way of doing this (see the sketch after this list). There’s a bit more information on this subject in my Power Query book, which I hope you have all bought!
  3. Click the Close & Load To… button to close the Query Editor, and then choose the Only Create Connection option to make sure the output of the query is not loaded anywhere and the query is disabled, then click the Load button. (Am I the only person that doesn’t like this new dialog? I thought the old checkboxes were much simpler, although I do appreciate the new flexibility on where to put your Excel table output)
    image
  4. Click inside the Fact table in the worksheet, click the From Table button again and this time do load it into the Data Model.
  5. Next, in the Power Query tab in the Excel ribbon, click the Merge button. In the Merge dialog select Dimension as the first table, Fact as the second, and in both select the FruitID column to join on.
    image
  6. Click OK and the Query Editor window opens again. Click the Close & Load To… button again, and this time load this new table into the Data Model.
  7. Open the Power Pivot window and you will see that not only have your two tables been loaded into the Data Model, but a relationship has been created between the two:
    image
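As mentioned in step 2, Table.AddKey() is another way of marking a column as a primary key. Here’s a minimal sketch (not one of the steps above) of how it could be used on the Dimension table instead of Remove Duplicates:

let
    Source = Excel.CurrentWorkbook(){[Name="Dimension"]}[Content],
    //Mark FruitID as the table's key; the third argument
    //specifies that this is a primary key
    AddedKey = Table.AddKey(Source, {"FruitID"}, true)
in
    AddedKey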

What are the problems I talked about then? Well, for a start, if you don’t follow these instructions exactly then you won’t get the relationship created – it is much harder than I would like. There may be other ways to make sure the relationships are created but I haven’t found them yet (if you do know of an easier way, please leave a comment!). Secondly if you delete the two tables from the Data Model and delete the two Power Query queries, and then follow these steps again, you will find the relationship is not created. That can’t be right. Thirdly, I don’t like having to create a third query with the Merge, and would prefer it if I could just create two queries and define the relationship somewhere separately. With all of these issues I don’t think there’s any practical use for this functionality right now.

I guess the reason I think the ability to create relationships automatically is so important is because the one thing that the Excel Data Model/Power Pivot/SSAS Tabular sorely lacks is a simple way to script the structure of a model. Could Power Query and M one day be the modelling language that Marco asks for here? To be fair to the Power Query team this is not and should not be their core focus right now: Power Query is all about data acquisition, and this is data modelling. If this problem was solved properly it would take a lot of thought and a lot of effort. I would love to see it solved one day though.

You can download the sample workbook for this post here.


Power Pivot / Power Query Read-Only Connection Problems In Excel 2013 – And What To Do About Them


Anyone who has tried to do any serious work with Power Pivot and Power Query will know about this problem: you use Power Query to load some tables into the Data Model in Excel 2013; you make some changes in the Power Pivot window; you then go back to Power Query, make some changes there and you get the dreaded error

We couldn’t refresh the table ‘xyz’ from the connection ‘Power Query – xyz’. Here’s the error message we got:

COM Error: Microsoft.Mashup.OleDbProvider; The query ‘xyz’ or one of its inputs was modified in Power Query after this connection was added. Please disable and re-enable loading to the Data Model for this query..

image

This post has a solution for the same problem in Excel 2010, but it doesn’t work for Excel 2013 unfortunately. There is a lot of helpful information out there on the web about this issue if you look around, though, and that’s why I thought it would be useful to bring it all together into one blog post and also pass on some hints and tips about how to recover from this error if you get it. This is the single biggest source of frustration among the Power Query users I speak to; a fix for it is being worked on, and I hope it gets released soon.

Problem Description

Why does this problem occur? Let’s take a simple repro.

  1. Import the data from a table in SQL Server using Power Query. Load it into the Excel Data Model.
  2. Open the PowerPivot window in Excel, then create measures/calculated fields, calculated columns, relationships with other tables as usual.
  3. Go back to the worksheet and build a PivotTable from data in this table, using whatever measures or calculated columns you have created.
  4. Go back to the PowerPivot window and rename one of the columns there. The column name change will be reflected in the PivotTable and everything will continue to work.
  5. Re-open the Power Query query editor, and then rename any of the columns in the table (not necessarily the one you changed in the previous step). Close the query editor window and when the query refreshes, bang! you see the error above. The table in the Excel Data Model is unaffected, however, and your PivotTable continues to work – it’s just that now you can’t refresh the data any more…
  6. Do what the error message suggests and change the Load To option on the Power Query query, unchecking the option to load to the Data Model. When you do this, on the very latest build of Power Query, you’ll see a “Possible Data Loss” warning dialog telling you that you’ll lose any customisations you made. Click Continue, and the query will be disabled. The destination table will be deleted from your Excel Data Model and your PivotTable, while it will still show data, will be frozen.
  7. Change the Load To option on the query to load the data into the Excel Data Model again. When you do this, and refresh the data, the table will be recreated in the Excel Data Model. However, your measures, calculated columns and relationships will all be gone. What’s more, although your PivotTable will now work again, any measures or calculated columns you were using in it will also have gone.
  8. Swear loudly at your computer and add all the measures, calculated columns and relationships to your Data Model all over again.

So what exactly happened here? The important step is step 4. As Miguel Llopis of the Power Query team explains here and here, when you make certain changes to a table in the Power Pivot window the connection from your Power Query query to the Excel Data Model goes into ‘read-only’ mode. This then stops Power Query from making any subsequent changes to the structure of the table.

What changes put the connection to the Excel Data Model in ‘read-only’ mode?

Here’s a list of changes (taken from Miguel’s posts that I linked to above) that you can make in the PowerPivot window that put the connection from your query to the Data Model into ‘read-only’ mode:

  • Edit Table Properties
  • Column-level changes: Rename, Data type change, Delete
  • Table-level changes: Rename, Delete
  • Import more tables using Power Pivot Import Wizard
  • Upgrade existing workbook

How can you tell whether your connection is in ‘read-only’ mode?

To find out whether your connection is in ‘read-only’ mode, go to the Data tab in Excel and click on the Connections button. Then, in the Workbook Connections dialog you’ll see the connection from Power Query to the Data Model listed – it will be called something like ‘Power Query – Query1’ and the description will be ‘Connection to the Query1 query in the Data Model’. Select this connection and click on the Properties button. When the Connection Properties dialog opens, go to the Definition tab. If the connection is in read-only mode the properties will be greyed out, and you’ll see the message ‘Some properties cannot be changed because this connection was modified using the PowerPivot Add-In’. If you do see this message, you’re already in trouble!

image

How to avoid this problem

Avoiding this problem is pretty straightforward: if you’re using Power Query to load data into the Excel Data Model, don’t make any of the changes listed above in the PowerPivot window! Make them in Power Query instead.

How to recover from this problem

But what if your connection is already in ‘read-only’ mode? There is no magic solution, unfortunately, you are going to have to rebuild your model. However there are two things you can do to reduce the amount of pain you have to go through to recreate your model.

First, you can use the DISCOVER_CALC_DEPENDENCY DMV to list out all of your measure and calculated column definitions to a table in Excel. Here’s some more information about the DMV:

http://cwebbbi.wordpress.com/2011/09/17/documenting-dependencies-between-dax-calculations/

To use this, all you need to do is to create a DAX query table in the way Kasper shows at the end of this post, and use the query:

select * from $system.discover_calc_dependency

Secondly, before you disable and re-enable your Power Query query (as in step 6 above), install the OLAP PivotTable Extensions add-in (if you don’t already have it) and use its option to disable auto-refresh on all of your PivotTables, as described here:

http://olappivottableextend.codeplex.com/wikipage?title=Disable%20Auto%20Refresh&referringTitle=Home

http://www.artisconsulting.com/blogs/greggalloway/Lists/Posts/Post.aspx?ID=26

Doing this prevents the PivotTables from auto-refreshing when the table is deleted from the Data Model when you disable the Power Query query. This means that they remember all of their references to your measures and calculated columns, so once you have recreated those in your Data Model (assuming all of the names are still the same) and re-enabled auto-refresh, the PivotTables will not have changed at all and will continue to work as before.

[After writing this post, I realised that Barbara Raney covered pretty much the same material in this post: http://www.girlswithpowertools.com/2014/06/power-query-refresh-fails/ . I probably read that post when it was published and then forgot about it. I usually don't blog about things that other people have already blogged about, but since I'd already done the hard work and the tip on using OLAP PivotTable Extensions is new, I thought I'd post anyway. Apologies...]


Handling Data Source Errors In Power Query


Recently I was asked an interesting question by Ondra Plánička in the comments of a blog post: how can you handle errors caused by unavailable or missing data sources in Power Query?

Let’s imagine you are loading a csv file like this one into Excel using Power Query:

image

The M query generated by Power Query will be as follows:

let
    Source = Csv.Document(File.Contents("C:\Users\Chris\Documents\SampleData.csv")
                ,null,",",null,1252),
    #"First Row as Header" = Table.PromoteHeaders(Source),
    #"Changed Type" = Table.TransformColumnTypes(#"First Row as Header"
                ,{{"Month", type text}, {"Product", type text}, {"Sales", Int64.Type}})
in
    #"Changed Type"

If you load into the Excel Data Model you’ll see the following in the Power Pivot window:

image

So far so good. But what happens if you try to refresh the query and the csv file is not there any more? The query refreshes but you will see the following in the Power Pivot window:

image

The structure of the table that has been loaded has changed: instead of three columns you get just one, containing the error message. This wipes any selections in Excel PivotTables that are based on this table; they will need to be recreated when the source file is available once again. Similarly, any relationships between this table and other tables in the Excel Data Model get deleted and have to be added again manually when the source file is there again. Not good.

Here’s how to alter the query so that it handles the error more gracefully:

let
    //This is the original code generated by Power Query
    Source = 
      Csv.Document(File.Contents("C:\Users\Chris\Documents\SampleData.csv"),null,",",null,1252),
    #"First Row as Header" = Table.PromoteHeaders(Source),
    #"Changed Type" = Table.TransformColumnTypes(#"First Row as Header",
      {{"Month", type text}, {"Product", type text}, {"Sales", Int64.Type}}),
    //End of original code
    //Define the alternative table to return in case of error
    AlternativeOutput=#table(type table [Month=text,Product=text,Sales=Int64.Type],
      {{"Error", "Error", 0}}),
    //Does the Source step return an error?
    TestForError= try Source,
    //If Source returns an error then return the alternative table output
    //else return the value of the #"Changed Type" step
    Output = if TestForError[HasError] then AlternativeOutput else #"Changed Type"
in
    Output

While the code from the original query remains intact, the following extra steps have been added:

  • The AlternativeOutput step returns a table (defined using #table) that has exactly the same columns as the csv file. This table has one row containing the text “Error” in the two text columns and 0 in the Sales column.
  • The TestForError step uses a try expression to see whether the Source step returns an error (for example because the file is missing).
  • The Output step checks to see whether TestForError found an error – if it does, it returns the table defined in the AlternativeOutput step, otherwise it returns the contents of the csv file as returned by the #”Changed Type” step.
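
Incidentally, the same logic can be expressed more compactly using M’s built-in try ... otherwise syntax, which evaluates an expression and returns a fallback value if evaluating it raises an error. Here’s a sketch of the whole query rewritten that way:

let
    Source = Csv.Document(File.Contents("C:\Users\Chris\Documents\SampleData.csv"),null,",",null,1252),
    #"First Row as Header" = Table.PromoteHeaders(Source),
    #"Changed Type" = Table.TransformColumnTypes(#"First Row as Header",
      {{"Month", type text}, {"Product", type text}, {"Sales", Int64.Type}}),
    //the alternative table to return in case of error
    AlternativeOutput = #table(type table [Month=text,Product=text,Sales=Int64.Type],
      {{"Error", "Error", 0}}),
    //if evaluating #"Changed Type" (and therefore Source) raises an error,
    //return the alternative table instead
    Output = try #"Changed Type" otherwise AlternativeOutput
in
    Output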

Now when you run the query and the csv file is missing, you see the following in the Power Pivot window:

image

Because this table has the same structure as the one the query returns when the csv file is present, any PivotTables connected to this table will still retain their selections and the relationships in the Excel Data Model are left intact. This means that when the csv file is back in its proper place everything works again with no extra work required.

You can download the example workbook and csv file here.



Advanced Options For Loading Data From SQL Server With Power Query


Loading data from SQL Server using Power Query is fairly straightforward, and the basics are well-documented (see here for example). However, there are a few advanced options, not all of which are currently shown in the online help (although they do appear in the latest version of the Library Specification document and inside the Power Query window’s own help), and not all of which are documented in much detail.

Here’s what the Power Query window shows for the Sql.Database function when you show help:

image

Here are all of the options available:

MaxDegreeOfParallelism does what you would expect, setting the MAXDOP query hint for the SQL query that Power Query generates. For example, the Power Query query:

let
    Source = Sql.Database("localhost", "adventure works dw", [MaxDegreeOfParallelism=2]),
    dbo_DimDate = Source{[Schema="dbo",Item="DimDate"]}[Data]
in
    dbo_DimDate

Generates the query:

select [$Ordered].[DateKey],
    [$Ordered].[FullDateAlternateKey],
    [$Ordered].[DayNumberOfWeek],
    [$Ordered].[EnglishDayNameOfWeek],
    [$Ordered].[SpanishDayNameOfWeek],
    [$Ordered].[FrenchDayNameOfWeek],
    [$Ordered].[DayNumberOfMonth],
    [$Ordered].[DayNumberOfYear],
    [$Ordered].[WeekNumberOfYear],
    [$Ordered].[EnglishMonthName],
    [$Ordered].[SpanishMonthName],
    [$Ordered].[FrenchMonthName],
    [$Ordered].[MonthNumberOfYear],
    [$Ordered].[CalendarQuarter],
    [$Ordered].[CalendarYear],
    [$Ordered].[CalendarSemester],
    [$Ordered].[FiscalQuarter],
    [$Ordered].[FiscalYear],
    [$Ordered].[FiscalSemester]
from [dbo].[DimDate] as [$Ordered]
order by [$Ordered].[DateKey]
option(maxdop 2)

[as an aside – yes, the SQL query has an Order By clause in it. Power Query likes all of its tables ordered. It would be nice to have an option to turn off the Order By clause though, I think]

CreateNavigationProperties controls the creation of the navigation properties that allow you to browse from table to table in the Power Query Query Editor. For example, when you connect to a table in SQL Server and Power Query can see foreign key relationships between that table and other tables, you’ll see extra columns that allow you to follow these relationships:

image

The problem is that these columns will appear as useless text columns when you load the data into a table on the worksheet or the Excel Data Model, although of course you can delete them manually using the Remove Columns functionality in the Power Query Query Editor:

image

Setting CreateNavigationProperties=false will stop these extra columns being created, for example:

let
    Source = Sql.Database("localhost", "adventure works dw",[CreateNavigationProperties=false]),
    dbo_DimDate = Source{[Schema="dbo",Item="DimDate"]}[Data]
in
    dbo_DimDate

I believe it should also prevent the automatic creation of relationships between tables in the Excel Data Model, but I can’t work out how to test this.

NavigationPropertyNameGenerator controls how the names for these columns are generated. There’s no public documentation about this at all, and I’m rather pleased with myself for working out how to use it. It’s quite useful because I find the default names that get generated aren’t as clear as they could be in all cases. Here’s an example of how to build more detailed, albeit more verbose, names:

let
    //declare a function that combines a list of text using commas
    ConcatByComma = Combiner.CombineTextByDelimiter(","),
    //declare a name function
    MyNameFunction = (p, a) => 
      List.Transform(a, each 
        "Navigate from " & _[SourceTableName] & " to " & _[TargetTableName] & 
        " via " & ConcatByComma(_[SourceKeys]) & " and " & ConcatByComma(_[TargetKeys])),
    //use this name function in Sql.Database
    Source = Sql.Database("localhost", "adventure works dw",
      [NavigationPropertyNameGenerator=MyNameFunction]),
    dbo_DimDate = Source{[Schema="dbo",Item="DimDate"]}[Data]
in
    dbo_DimDate

image

The NavigationPropertyNameGenerator option takes a function with two arguments:

  • Pattern, which appears to be a list containing all of the names of the columns on the current table
  • Arguments, which is a list of records containing information about each of the navigation columns. Each record contains the following fields
    • SourceTableName – the name of the source table, ie the table you’re currently on
    • TargetTableName – the name of the table to navigate to
    • SourceKeys – a list containing the names of the columns on the source table involved in the relationship
    • TargetKeys – a list containing the names of the columns on the target table involved in the relationship
    • SingleTarget – not quite sure about this, but it appears to be a logical (ie true or false) value indicating whether there is just one target table involved in all relationships

In my example above, I’ve created two functions. The first, ConcatByComma, is a simple combiner that concatenates all of the text in a list using commas (see here for more details on how to do this). The second, MyNameFunction, is the important one – it uses List.Transform to iterate over the list passed to Arguments (called a in my function) and generate the text for each column header.
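
If you’re wondering what Combiner.CombineTextByDelimiter() returns on its own, here’s a minimal sketch – the column names in the list are just hypothetical examples:

let
    //Combiner.CombineTextByDelimiter returns a function that concatenates
    //a list of text values using the delimiter passed to it
    ConcatByComma = Combiner.CombineTextByDelimiter(","),
    Output = ConcatByComma({"DateKey", "OrderDateKey"})
in
    Output

The Output step returns the text value “DateKey,OrderDateKey”.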

Query allows you to specify your own SQL query for Power Query to use. Very useful, but there’s not much to say about this that isn’t already covered in the official help on this feature. The only thing that’s important is that if you do use this option it will prevent query folding from taking place for the rest of your query – which could lead to performance problems.

Here’s an example query:

let
    Source = Sql.Database("localhost", "adventure works dw", 
     [Query="select distinct CalendarYear from DimDate"])
in
    Source

CommandTimeout allows you to specify a query timeout as a value of type duration. The default timeout is ten minutes. Here’s an example of how to set a new timeout of one day, two hours, three minutes and four seconds using the #duration() intrinsic function:

let
    Source = Sql.Database("localhost", "adventure works dw", 
     [CommandTimeout=#duration(1,2,3,4)]),
    dbo_DimDate = Source{[Schema="dbo",Item="DimDate"]}[Data]
in
    dbo_DimDate


Creating Histograms With Power Query


A few months ago someone at a conference asked me what the Power Query Table.Partition() function could be used for, and I had to admit I had no idea. However, when I thought about it, I realised one obvious use: for creating histograms! Now I know there are lots of other good ways to create histograms in Excel but here’s one more, and hopefully it will satisfy the curiosity of anyone else who is wondering about Table.Partition().

Let’s start with a table in Excel (called “Customers”) containing a list of names and ages:

image

Here’s the M code for the query to find the buckets:

let
    //Get data from Customers table
    Source = Excel.CurrentWorkbook(){[Name="Customers"]}[Content],
    //Get a list of all the values in the Age column
    Ages = Table.Column(Source,"Age"),
    //Find the maximum age
    MaxAge = List.Max(Ages),
    //The number of buckets is the max age divided by ten, then rounded up to the nearest integer
    NumberOfBuckets = Number.RoundUp(MaxAge/10),
    //Hash function to determine which bucket each customer goes into
    BucketHashFunction = (age) => Number.RoundDown(age/10),
    //Use Table.Partition() to split the table into multiple buckets
    CreateBuckets = Table.Partition(Source, "Age", NumberOfBuckets, BucketHashFunction),
    //Turn the resulting list into a table
    #"Table from List" = Table.FromList(CreateBuckets, Splitter.SplitByNothing()
                           , null, null, ExtraValues.Error),
    //Add a zero-based index column
    #"Added Index" = Table.AddIndexColumn(#"Table from List", "Index", 0, 1),
    //Calculate the name of each bucket
    #"Added Custom" = Table.AddColumn(#"Added Index", "Bucket", 
                        each Number.ToText([Index]*10) & " to " & Number.ToText(([Index]+1)*10)),
    //Find the number of rows in each bucket - ie the count of customers
    #"Added Custom1" = Table.AddColumn(#"Added Custom", "Count", each Table.RowCount([Column1])),
    //Remove unnecessary columns
    #"Removed Columns" = Table.RemoveColumns(#"Added Custom1",{"Column1", "Index"})
in
    #"Removed Columns"

And here’s the output in Excel, with a bar chart:

image 

How does this work?

  • After loading the data from the Excel table in the Source step, the first problem is to determine how many buckets we’ll need. This is fairly straightforward: I use Table.Column() to get a list containing all of the values in the Age column, then use List.Max() to find the maximum age, then divide this number by ten and round up to the nearest integer.
  • Now for Table.Partition(). The first thing to understand about this function is what it returns: it takes a table and returns a list of tables, so you start with one table and end up with multiple tables. Each row from the original table will end up in one of the output tables. A list object is something like an array.
  • One of the parameters that the Table.Partition() function needs is a hash function that determines which bucket table each row from the original table goes into. The BucketHashFunction step serves this purpose here: it takes a value, divides it by ten and rounds the result down; for example pass in the age 88 and you get the value 8 back.
  • The CreateBuckets step calls Table.Partition() with the four parameters it needs: the name of the table to partition, the column to partition by, the number of buckets to create and the hash function. For each row in the original table the age of each customer is passed to the hash function. The number that the hash function returns is the index of the table in the list that Table.Partition() returns. In the example above nine buckets are created, so Table.Partition() returns a list containing nine tables; for the age 8, the hash function returns 0 so the row is put in the table at index 0 in the list; for the age 88 the hash function returns 8, so the row is put in the table at index 8 in the list. The output of this step, the list of tables, looks like this:

    image
  • The next thing to do is to convert the list itself to a table, then add a custom column to show the names for each bucket. This is achieved by adding a zero-based index column and then using that index value to generate the required text in the step #”Added Custom”.
  • Next, find the number of customers in each bucket. Remember that at this point the query still includes a column (called “Column1”) that contains a value of type table, so all that is needed is to create another custom column that calls Table.RowCount() for each bucket table, as seen in the step #”Added Custom1”.
  • Finally I remove the columns that aren’t needed for the output table.

I’m not convinced this is the most efficient solution for large data sets (I bet query folding stops very early on if you try this on a SQL Server data source) but it’s a good example of how Table.Partition() works. What other uses for it can you think of?
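
For what it’s worth, here’s a sketch of the first half of the query with the bucket size parameterised. The Customers table is the same assumption as above; I’ve also added one to the maximum age so that the oldest customer still falls inside the last bucket:

let
    //the width of each bucket: change this value to regroup the data
    BucketSize = 5,
    Source = Excel.CurrentWorkbook(){[Name="Customers"]}[Content],
    MaxAge = List.Max(Table.Column(Source, "Age")),
    //add one to the maximum age so the top value falls inside the last bucket
    NumberOfBuckets = Number.RoundUp((MaxAge + 1) / BucketSize),
    BucketHashFunction = (age) => Number.RoundDown(age / BucketSize),
    //returns a list of NumberOfBuckets tables
    CreateBuckets = Table.Partition(Source, "Age", NumberOfBuckets, BucketHashFunction)
in
    CreateBuckets

The remaining steps – converting the list to a table, naming the buckets and counting the rows – would be the same as before, with 10 replaced by BucketSize.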

You can download the sample workbook here.


Sentiment Analysis In Excel With Azure Machine Learning And Power Query


You may have seen Joseph Sirosh’s blog post last week about the ability to publish Azure Machine Learning models to the Azure Marketplace, and that MS have published a number of APIs there already. There’s a new Excel add-in that can be used with these APIs but I noticed that at least one of them, the Sentiment Analysis API, can be used directly from Power Query too.

To do this, the first thing you need to do is to go to the Azure Marketplace, sign in with your Microsoft account, and subscribe to the Lexicon Based Sentiment Analysis API. The docs say you get 25000 transactions free per month although there doesn’t appear to be a way to pay for more; that said the number of transactions remaining shown on my account kept resetting, so maybe there is no limit. The API itself is straightforward: pass it a sentence to evaluate and it will return a score between –1 and 1, where 1 represents a positive sentiment and –1 is a negative sentiment. For example, the sentence “I had a good day” returns the value 1:

image

…whereas the sentence “I had a bad day” returns –1:

image

You can now go to Power Query and click From Azure Marketplace (you will need to enter your credentials at this point if this is the first time you’ve used the Azure Marketplace from Power Query):

image

…and then, when the Navigator pane appears, double-click on Score:

image

The API is exposed as a function in Power Query (I guess because it’s an OData service operation, but I’m not sure) and double-clicking on Score invokes the function. You can enter a sentence here and the Query Editor will open to show the score:

image

image

However, to do anything useful you don’t want to invoke the function just yet – you want a query that returns the function. To do this you need to edit the query. Go to the Advanced Editor and you’ll see the M code for the query will be something like this:

let
    Source = Marketplace.Subscriptions(),
    #"https://api.datamarket.azure.com/data.ashx/aml_labs/
      lexicon_based_sentiment_analysis/v1/" = 
      Source{
       [ServiceUrl="https://api.datamarket.azure.com/data.ashx/aml_labs/
       lexicon_based_sentiment_analysis/v1/"]
       }[Feeds],
    Score1 = 
       #"https://api.datamarket.azure.com/data.ashx/aml_labs/
       lexicon_based_sentiment_analysis/v1/"
       {[Name="Score"]}[Data],
    #"Invoked FunctionScore1" = Score1("I had a good day")
in
    #"Invoked FunctionScore1"

You need to remove the last line (called #”Invoked FunctionScore1”) which invokes the function, leaving:

let
    Source = Marketplace.Subscriptions(),
    #"https://api.datamarket.azure.com/data.ashx/aml_labs/
      lexicon_based_sentiment_analysis/v1/" = 
      Source{[ServiceUrl="https://api.datamarket.azure.com/data.ashx/aml_labs/
      lexicon_based_sentiment_analysis/v1/"]}[Feeds],
    Score1 = #"https://api.datamarket.azure.com/data.ashx/aml_labs/
      lexicon_based_sentiment_analysis/v1/"
      {[Name="Score"]}[Data]
in
    Score1

You can now click the Close and Load button to close the Query Editor window; you now have a function called Score that you can call in other queries. For example, take the following Excel table (called Sentences):

image

Click on the From Table button to load this table into Power Query, then in the Query Editor click the Add Custom Column button and add a new custom column called SentimentScore with the expression

Score([Sentence])

image

You’ll then be prompted to set a data privacy level for the data you’re using, because calling this function involves sending data from your worksheet up to the API where someone could potentially see it.

image

Click the Continue button and set the privacy level for the workbook to Public, so that this data can be sent to the API:

image

Click Save and you’ll see the sentiment score has been added as a column containing a Record value. Click on the Expand icon in the SentimentScore column then OK:

image

And at last you’ll see the scores themselves:

image

Here’s the code:

let
    Source = Excel.CurrentWorkbook(){[Name="Sentences"]}[Content],
    #"Added Custom" = Table.AddColumn(Source, "SentimentScore", each Score([Sentence])),
    #"Expand SentimentScore" = 
       Table.ExpandRecordColumn(#"Added Custom", "SentimentScore", 
       {"result"}, {"SentimentScore.result"})
in
    #"Expand SentimentScore"
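
One caveat: if the call to the Score function fails for a single row, perhaps because of a network problem or throttling, the whole refresh fails. Here’s a sketch of the same query with the function call wrapped in M’s try ... otherwise syntax, so that failed rows return null instead of an error (I’m assuming that Table.ExpandRecordColumn() passes nulls through to the expanded column, which appears to be the case):

let
    Source = Excel.CurrentWorkbook(){[Name="Sentences"]}[Content],
    //if Score fails for a row, return null for that row instead of an error
    #"Added Custom" = Table.AddColumn(Source, "SentimentScore",
       each try Score([Sentence]) otherwise null),
    #"Expand SentimentScore" =
       Table.ExpandRecordColumn(#"Added Custom", "SentimentScore",
       {"result"}, {"SentimentScore.result"})
in
    #"Expand SentimentScore"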

You can download the sample workbook here.

Of course, I couldn’t resist repeating this exercise with all of my Facebook status updates – which can be accessed from Power Query very easily. Here’s my monthly average sentiment score from June 2009 to now:

image

As you can see, I was in a particularly good mood this August – probably because I was on holiday for almost the whole month.


Waiting Between Web Service Requests In Power Query


Often when you’re working with web services in Power Query you’ll be making a large number of requests to the web service in a small amount of time – and you’ll notice that your query seems to be a lot slower than you would expect. The reason is, of course, that many public websites and APIs implement throttling to stop you from making too many requests all at once, I guess as a defence against denial-of-service attacks.

While looking into this issue I found a very interesting post containing the solution to this problem on the Power Query forum, from Alejandro Lopez-Lago. In it he shows how to create a function that creates a delay of a specified number of seconds in a Power Query query. Here’s an example of how to use it in a query:

let
    //Find current time
    One = DateTime.LocalNow(),
    //Define Wait function
    Wait = (seconds as number, action as function) =>
            if (List.Count(
             List.Generate(
              () => DateTimeZone.LocalNow() + #duration(0,0,0,seconds),
              (x) => DateTimeZone.LocalNow() < x,
              (x) => x)
              ) = 0)
               then null else action(),
    //Call Wait function to wait 5 seconds
    //then return the current time again
    Two = Wait(5,DateTime.LocalNow),
    //Display start time, finish time, and duration
    Output = "Start Time: " & DateTime.ToText(One) &
                   " Finish Time:" & DateTime.ToText(Two) &
                   " Difference: " & Duration.ToText(Two-One)
in
    Output

Here’s an example of the output:

image

The important step in this query is Wait, which defines the function. It works by using List.Generate() to keep on generating a list until a given number of seconds (passed in through the parameter seconds) have elapsed; when that has happened, it will execute the function that is passed in through the parameter action. Very clever stuff!
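
To put the function to use with a web service, you can call it inside List.Transform() so that a delay happens before each request. Here’s a minimal sketch: the URLs are hypothetical placeholders and the two-second delay is arbitrary, so adjust both for the service you’re calling:

let
    //the Wait function from Alejandro's post, unchanged
    Wait = (seconds as number, action as function) =>
            if (List.Count(
             List.Generate(
              () => DateTimeZone.LocalNow() + #duration(0,0,0,seconds),
              (x) => DateTimeZone.LocalNow() < x,
              (x) => x)
              ) = 0)
               then null else action(),
    //hypothetical list of URLs to request
    URLs = {"http://example.com/api/page1", "http://example.com/api/page2"},
    //wait two seconds before each request, then call the web service
    Results = List.Transform(URLs, each Wait(2, () => Web.Contents(_)))
in
    Results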


Timing Power Query Queries


If you’re interested in performance-tuning a Power Query query, you’re going to need a reliable way of telling how long your query takes to run. It’s not as easy as it seems to adapt your query to do this though, and that’s because of the way the M formula language works. This post does show you how to find how long a query takes to run, but I have to admit that I’m not 100% satisfied with the approach and I suspect there are other, better ways of achieving the same thing that do not involve any M code. However this post does highlight a lot of interesting points about Power Query and M so I think it’s worth reading anyway…

M language background

In my last post, showing how to create a Power Query function that waits a given number of seconds, I used the following query to demonstrate the function:


let
    //Find current time
    One = DateTime.LocalNow(),
    //Define Wait function
    Wait = (seconds as number, action as function) =>
            if (List.Count(
             List.Generate(
              () => DateTimeZone.LocalNow() + #duration(0,0,0,seconds),
              (x) => DateTimeZone.LocalNow() < x,
              (x) => x)
              ) = 0)
               then null else action(),
    //Call Wait function to wait 5 seconds
    //then return the current time again
    Two = Wait(5,DateTime.LocalNow),
    //Display start time, finish time, and duration
    Output = "Start Time: " & DateTime.ToText(One) &
                   " Finish Time: " & DateTime.ToText(Two) &
                   " Difference: " & Duration.ToText(Two-One)
in
    Output

The output of this seems very straightforward:

image

However, consider the output when you change the last step in the query to be:

    Output = "Finish Time: " & DateTime.ToText(Two) &
                   " Start Time: " & DateTime.ToText(One) &
                   " Difference: " & Duration.ToText(Two-One)

The only difference is that the finish time is displayed before the start time, and yet the output is now:

image

The start time and the finish time are now the same, and the duration is 0. However the query does still take the same amount of time to run. What’s going on here?

Looking at the M code for a query, you’d be forgiven for thinking that the first step in a query is evaluated first, the second step is evaluated second, the last step is evaluated last, and so on. This is not the case though. From page 10 of the Language Reference document (which can be downloaded here):

List and record member expressions (as well as let expressions, introduced further below) are evaluated using lazy evaluation, which means that they are evaluated only as needed.

A Power Query query is usually a single let expression, and within a let expression steps are evaluated as and when they are needed; if a step’s output is not needed anywhere else then it may not be evaluated at all. Therefore, in the second example above, even though the step called One is the first step in the query, it is evaluated after the step called Two because that is the order that the values are needed in the expression calculated for the step Output.
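
A trivial sketch makes the point: in the following query the step Unused would raise an error, but because no other step references it, it is never evaluated and the error never surfaces:

let
    //this step is never referenced anywhere else, so under lazy
    //evaluation it is never evaluated and its error is never raised
    Unused = error "this error never surfaces",
    Output = "hello world"
in
    Output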

Another important point to understand is that (from page 30 of the Language Reference):

Once a value has been calculated, it is immutable, meaning it can no longer be changed. This simplifies the model for evaluating an expression and makes it easier to reason about the result since it is not possible to change a value once it has been used to evaluate a subsequent part of the expression.

In our example, this means that once One and Two have been evaluated for the first time, the values they return will not change on subsequent calls.

The final thing to point out is that I’m using the DateTime.LocalNow() function here to get the system date and time at the point that the expression is evaluated. If you use DateTime.FixedLocalNow() it will return the same value every time you call it in the same query – which is obviously not going to be much use for our purposes.

How to time your queries

Now we know all this we can alter a query that does something useful so we can see how long it takes to execute. Getting status updates from Facebook is sufficiently slow for us to be able to measure a duration, and here’s an example query that shows how to measure how long the request takes:

let
    //Get the current time, which will be the query start time
    StartTime = DateTime.LocalNow(),
    //Make sure StartTime is evaluated just before we request data from Facebook
    Source = if StartTime <> null
             then
             Facebook.Graph("https://graph.facebook.com/me/statuses")
             else
             null,
    //Find the number of rows returned
    NumberOfRows = Number.ToText(Table.RowCount(Source)),
    //Get the current time, which will be the query end time
    EndTime = DateTime.LocalNow(),
    //Make sure PQ evaluates all expressions in the right order:
    //first, get the number of rows, which ensures that
    //the Source expression is evaluated, and in turn StartTime is evaluated;
    //second, EndTime is evaluated as part of the duration calculation
    Output = "Query returned " & NumberOfRows
             & " rows and took "
             & Duration.ToText(EndTime - StartTime)
in
    Output

Here’s an example of the output of the query:

image

How does this work?

  • The Output step has the expression that returns the query output
  • The first value evaluated in the Output step is the NumberOfRows step, which finds the number of rows returned by Facebook and in turn forces the evaluation of the Source step. If we didn’t include an expression in the output that was in some way derived from data loaded from Facebook, Power Query would not bother to evaluate the Source step.
  • The Source step gets data from Facebook, but before that happens there is an if expression that checks to see whether StartTime is null or not. Of course this always returns true, but the point here is that by making this check we are forcing StartTime to be evaluated.
  • The second value evaluated in Output is the duration. The duration is the EndTime (evaluated for the first time at this point) minus the StartTime (which has already been evaluated, remember), so we get the amount of time elapsed between the start of the request to Facebook and the end of the query.

With all of the other stuff stripped out, here’s the basic template I’ve used:

let
    StartTime = DateTime.LocalNow(),
    Source = if StartTime <> null
             then
             //do whatever we were going to do in our first step
             else
             null,

    //insert all other steps here

    NumberOfRows = Number.ToText(Table.RowCount(WhateverThePreviousStepIs)),
    EndTime = DateTime.LocalNow(),
    Output = "Query returned " & NumberOfRows
             & " rows and took "
             & Duration.ToText(EndTime - StartTime)
in
    Output

I’m making a pretty big assumption here, though, and that is that finding the number of rows in the NumberOfRows step will force the evaluation of all previous steps, which may not be the case! You might have to think carefully about how you write an expression that forces evaluation of all of your steps.
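
One option that might help is Table.Buffer(), which reads an entire table into memory and so should force the evaluation of all of the steps it depends on. Here’s a sketch of the template above using it; the csv file path is a hypothetical placeholder:

let
    StartTime = DateTime.LocalNow(),
    Source = if StartTime <> null
             then Csv.Document(File.Contents("C:\SomeFolder\SomeFile.csv"))
             else null,
    //Table.Buffer reads the whole table into memory, which should force
    //the evaluation of Source and, in turn, StartTime
    Buffered = Table.Buffer(Source),
    NumberOfRows = Number.ToText(Table.RowCount(Buffered)),
    EndTime = DateTime.LocalNow(),
    Output = "Query returned " & NumberOfRows
             & " rows and took "
             & Duration.ToText(EndTime - StartTime)
in
    Output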

This is all very complicated (indeed, probably too complicated) so I’m also investigating what the Power Query trace file can tell you about things like query execution times. Hopefully I’ll be able to blog about that in a few weeks.

You can download an Excel workbook with some sample Power Query queries here.

