Extract pattern string and numbers from text using List.Accumulate in Power Query

A typical task when cleaning data is to extract substrings from string that follow a certain pattern. In this post I’m going to describe a method that uses the List.Accumulate function for it. Extract a pattern string.

Task

I have to extract a payroll key from a description field that starts with 8 number, followed by an “-” and another number.

aölsfdk0125-fds  da12345678-0asdf

So I’m after the 12345678-0.

Plan

I plan to approach this by

  1. stepping through the string and check each character if it is valid.
  2. If so, store it
  3. and if not, forget it. And in that case, also forget the stored values so far so the collection starts from scratch.
  4. Then if a series of matches builds up, I will have to check the count of the stored values to not exceed the length of my target pattern.
  5. Once the length is reached, no further checks shall be performed and the found values be returned.

My aim is to find a generic way so that this solution can be adapted to many other similar use cases as well. Therefore the pattern must be described in a scalable way and the identification of the pattern elements should be easy to compute.

Execution

First comes defining the pattern structure. Therefore I create a placeholder string, that holds one placeholder symbol for each kind of valid values:

xxxxxxxxyx

This reads as: For an x, any number between 0 and 9 is valid and for a y, only a “-” is allowed.

Therefore I create a table (“ValidValues”) with valid values for each kind of position:

Extract a pattern string

Valid values are organized as placeholders

Then I create another table (“Position”) where I define the pattern and match these values to the actual pattern:

Extract a pattern string

Pattern definition with placeholders from valid values

So for each position in the target string, I define the placeholder that identifies the valid value(s) from the first table. This is the key that I now use to merge that ValidValues-table to my positions table:

Merge the allowed values to the pattern table via placeholder key

When I expand out the “ValidValues”-column I get a long table with all valid values for each position in the pattern:

Expanding returns all valid values for each position in the pattern

This is a structure that lets me easily grab the valid values for each position with the following syntax:

Table.SelectRows(Positions, each [Position] = <NumberI’mAfter>) [ValidValue]

The blue part selects all rows which contain the number that I’m after and the green part selects the last column and transforms it to a list that is easy digestible for further computation. I’m going to use this formula in the List.Accumulate operation later on.

Now I start preparations to walk through the string that contains my pattern. Therefore I turn the text into a list:

Text.ToList(“aölsfdk0125-fds  da12345678-0asdf aölsdfj”)

Tranform text to list

List.Accumulate

Then I use List.Accumulate to step through this list, check each element and perform the task outlined in my plan above:

List.Accumulate to step through the values and create a list of matching characters

BTW: This code has been formatted with the great new Power Query Formatter tool .

The following picture contains the description of what each step does:

Commented code

After the ListAccumulate, I select the field “Result” (in step “Result”). It contains the list with matching strings which I then combine in the last step.

Please check the enclosed file to see it in action:  RegexDummy1_.zip

If you’ve enjoyed this tutorial so far, please stay tuned for the next episode. There I will transform the query to a function that can be applied to all rows of a table and adjust it to make the last 2 characters optional, so that also strings with just 8 numbers in them can be found.

Enjoy & stay queryious 😉

Comments (12) Write a comment

  1. Nice Imke. Im not at this level with my M code, and will come back and study later. I had a similar but different problem last week that I wonder if a similar solution could solve. The issue was the need to remove a technical string like this from a text string, hence “cleaning” The source. Any thoughts on an approach for this?

    Reply

    • Yes, you can use this approach as well.
      After the string has been identified that matches the pattern, you determine its position using Text.PositionOf
      Then you use Text.RemoveRange with the position parameters (and length of the string) to remove it from the main string.

      Reply

  2. How easy it is if IT can put the Payroll key, in a separate field, and not in the Description field 😉

    ..the “Assembly line” of the process just become so much easier and cheaper (OpEx):

    Reply

  3. it would be great if you have the option to copy your code with copy/paste. Picture are nice, but not always user friendly.

    Reply

  4. Hi Imke 🙂
    Very elegant and very nice piece of code 🙂
    I sent you a file with my method (different) to do this.
    Check it out if you want 🙂
    sq

    Reply

    • Hi Bill, that’s really cool!
      Copying the code here:
      fnCheckPattern = (t as text, length as number) => //, placeholder as list, validValue as list) =>
      let
      Step1 = Text.ToList(t),
      Step2 = List.Transform(Step1, each try Placeholder{List.PositionOf(ValidValue, _)} otherwise “|”),
      Step3 = Text.Combine(Step2),
      Result = Text.Range(t, Text.PositionOf(Step3, Pattern), length)
      in
      Result
      Thanks and sq 😉

      Reply

      • Old post that maybe is not checked for comments but in the code above what would Pattern be?

        Reply

  5. Dear Imke,

    I hope you are well.

    Can you explain how you use ‘previous[Result]’ in the code above, i.e. how it works?

    You declare ‘[Result = {}]’ as the Seed, so this is a Record with one field called Result which is a List.

    I’m not sure of the syntax – can you show a brief example of how this is filled for a couple of iterations?

    Many thanks and best wishes, Matt.

    Reply

  6. Hello again Imke,

    I think I know how this syntax is working now, some code on DataChant’s site gave me a clue,

    = List.Accumulate(
    {1, 2, 100, -40, 4},
    [min=#infinity,max=-#infinity],
    (state, current) =>
    [min=if state[min] > current then current else state[min],
    max=if state[max] < current then current else state[max]]
    ),

    Many thanks, Matt.

    Reply

Leave a Reply