This article was edited on 2nd Nov 2019. JoinKind.Local has been removed to avoid problems with merges on tables with primary keys:
In this article you’ll learn how to speed up the aggregation of joined/merged tables by orders of magnitude (I recorded up to 30 times faster execution times). This method works for merges where both table have multiple rows for each keys. If one of your tables has a primary key, the method Chris Webb describes here works just as good: Chris Webb’s article on how to improve performance on aggregations after joins using primary keys .
You can follow along the different methods in this file:
When you join a table to another table in Power Query, the UI gives you the option to either expand the columns (default) or aggregate the contents of the joint tables. That’s useful if multiple rows are returned for the rows of the table that has been joined to (left table):
Today I want to share a scenario where a running total calculation in the query editor saved a model that run out of memory when done with DAX:
The model couldn’t be refreshed and returned out of memory error with a calculated column in the fact table of over 20 Mio rows (from a csv-file). A running total should be calculated for each “JourneyID”, of which there were over 1 Mio in the table itself. This rose memory consumption during refresh by over 300 % – until it finally errored out:
SUM ( Fact[Entries] )
– SUM ( Fact[Exits] );
ALLEXCEPT ( Fact; Fact[JourneyID] );
<= EARLIER ( Fact[StopId] )
Lately I was working on a fairly advanced allocation algorithm on large data which forced me to search for different tricks to improve performance than those that you can find on my site here already.
I was using List.Generate to check for every month in my table, if there was enough free capacity on a platform to start new wells. As every well had a certain production scheme (producing different amounts for a certain length of time), I first had to check the total production amount of active wells before I could determine the spare capacity for a new month. So I had to look into every active well, grab the capacity of the new month and add it up.
Therefore I’ve stored the active production schemes in one table in my List.Generate-record. That lead to an exponentially decreasing performance unfortunately.
Solution to improve performance of List.Generate
Buffering my tables in the “next”-function reduced the query duration by almost 70% !
Although a Table.Buffer or List.Buffer is always high on my list when it comes to performance issues, I was fairly surprised to see that behaviour here: As List.Generate returns the last element of its list as an argument for the next step, I was always assuming that this would be cached (and that was the reason because List.Generate performs recursive operations faster than the native recursion in M). Also, I had just referenced that table once ane in such a case, a buffer would normally not have come into my mind. (But desperation sometimes leads to unexpected actions …)
I also buffered a table that had just been referenced within the current record (and not recursively) and this improved performance as well. (Although in that case, the tables has been referenced multiple times within the current record). But this buffer didn’t have such a big impact on performance than the one on the table that was referenced by the recursive action.
Here is some pseudo-code illustrating the general principle:
Solution with buffers:
How to improve performance of List.Generate: Use Table.Buffer
Is that new to you or have you made the same experience? Which grades of performance improvements did you achieve with this method? Please let me know in the comments!
Enjoy & stay queryious 😉
When you import Excel sheets who have empty leading or trailing columns and rows (showing null-values), you can substantially improve the complexity and speed of your import process with a simple trick:
Remove the reasons for the empty trailing rows and columns 😉
Usually, when you import data from an Excel sheet, Power Query will automatically detect the used range in a sheet and will just return those rows and columns who have content in it. So how can it come that in some cases, additional rows or columns are returned who have nothing but empty values in them?
The reason for it can be cell formatting of empty cells. They often occur in old workbooks where cells have been deleted. These cells will be returned with a null-value during the import process with Power Query. See this blogpost for more details of potential pitfalls that come with it.
The “Inquire” Excel Add-On lets you clean any excess cell formatting. After you’ve executed this command, Power Query will not import any of those leading or trailing empty rows or columns any more. Often this will reduce the file size of the Excel files dramatically as well.
You will benefit from:
- simpler query logic
- potentially huge improved import speed, due to the reduced file size
Enjoy and stay queryious 😉
Recently I’ve distributed some techniques for partial matches or relative joins between tables using PowerQuery or the query editor in PowerBI. They are very flexible and powerful – yet slow.
To improve performance you can check if there is a chance to “partition” your table using a Table.Group. If you have an equality expression in your statement like we had in our rolling-12-months-exercise here for example:
You can boost performance into a different dimension by grouping your table on the “Associate”-table instead like this:
During my evaluation of Power Query as a reporting engine I wondered why we should keep detail fields in our cube at all if the preferred output is a flat table anyway. Cubes are meant for aggregation, aren’t they?
Especially in the Finance- & Accounting area you will come across many cubes with detail fields because sometimes you simply need to perform analysis on ledger entry level. But this seems like a loose/loose scenario in my eyes: Not only do these detail reports often perform badly, their biggest negative impact might lie in the fact that they cause the fact tables to be x-times bigger than the next aggregation level, thereby decreasing the overall performance of the cube.
So how about this approach then: Use Power Query for your reports on detail level: Directly connect to your fact table in the DWH and merge to your SSAS-data in order to retrieve the attributes/filters only. Or keep your fact tables in a dedicated DB if your DWH serves other purposes as well and you fear the performance impacts of those queries.
So this would leave the cubes’ fact tables with much less data -> improving performance.
I tried some scenarios that worked fine. But putting the fact tables into a separate tabular model instead of a relational DB performed quite badly.
Does anyone have experience with this approach? If you know someone who might, please forward.
What do you think about this approach, any other obstacles I’ve missed?