A typical task when cleaning data is to extract substrings from string that follow a certain pattern. In this post I’m going to describe a method that uses the List.Accumulate function for it. Extract a pattern string.
I have to extract a payroll key from a description field that starts with 8 number, followed by an “-” and another number.
So I’m after the 12345678-0.
I plan to approach this by
- stepping through the string and check each character if it is valid.
- If so, store it
- and if not, forget it. And in that case, also forget the stored values so far so the collection starts from scratch.
- Then if a series of matches builds up, I will have to check the count of the stored values to not exceed the length of my target pattern.
- Once the length is reached, no further checks shall be performed and the found values be returned.
My aim is to find a generic way so that this solution can be adapted to many other similar use cases as well. Therefore the pattern must be described in a scalable way and the identification of the pattern elements should be easy to compute.
First comes defining the pattern structure. Therefore I create a placeholder string, that holds one placeholder symbol for each kind of valid values:
This reads as: For an x, any number between 0 and 9 is valid and for a y, only a “-” is allowed.
Therefore I create a table (“ValidValues”) with valid values for each kind of position:
Then I create another table (“Position”) where I define the pattern and match these values to the actual pattern:
So for each position in the target string, I define the placeholder that identifies the valid value(s) from the first table. This is the key that I now use to merge that ValidValues-table to my positions table:
When I expand out the “ValidValues”-column I get a long table with all valid values for each position in the pattern:
This is a structure that lets me easily grab the valid values for each position with the following syntax:
Table.SelectRows(Positions, each [Position] = <NumberI’mAfter>) [ValidValue]
The blue part selects all rows which contain the number that I’m after and the green part selects the last column and transforms it to a list that is easy digestible for further computation. I’m going to use this formula in the List.Accumulate operation later on.
Now I start preparations to walk through the string that contains my pattern. Therefore I turn the text into a list:
Text.ToList(“aölsfdk0125-fds da12345678-0asdf aölsdfj”)
Then I use List.Accumulate to step through this list, check each element and perform the task outlined in my plan above:
BTW: This code has been formatted with the great new Power Query Formatter tool .
The following picture contains the description of what each step does:
After the ListAccumulate, I select the field “Result” (in step “Result”). It contains the list with matching strings which I then combine in the last step.
Please check the enclosed file to see it in action: RegexDummy1_.zip
If you’ve enjoyed this tutorial so far, please stay tuned for the next episode. There I will transform the query to a function that can be applied to all rows of a table and adjust it to make the last 2 characters optional, so that also strings with just 8 numbers in them can be found.
Enjoy & stay queryious 😉