Clean up or harmonize mis- or differently spelled category data with Power Query

A typical problem with data that has been created by manual entries is that category values are often misspelled or missed. So in this article I’m showing a very powerful technique on how to deal with this problem to clean up dirty category data. It was inspired by the “Preppin’ data” challenge whose instructions you can read here.

Task

Categorize dirty data

Solution

Create or import a table with all allowed category values:

Categories table

Merge that table to the column with dirty data and “Use Fuzzy matching to perform the merge”.

Fuzzy merge to match valid categories

This will activate some very clever AI algorithms that check for similarities between the dirty data and the allowed categories.

There are a couple of options that you can use to fine-tune the matching algorithm. See Miguel Escobar’s great article about all those options: Fuzzy Matching in Power BI / Power Query — Powered Solutions. But for the current challenge at hand, the default settings just work fine and return correct matches for all rows. In an upcoming post I’ll show a method how to handle cases where no matches have been found. That will include data driven alerts in the Power BI service as well.

 Video

If you want to see how this works in detail, please check out the video.

Enjoy and stay queryious 😉

Comments (3) Write a comment

  1. Thank You. I am also doing the Preppin Data challenges in Tableau Prep and Power BI, I didn’t know about fuzzy match so did manual find and replace multiple times.

    Reply

Leave a Reply