Custom Queries For “Detect Data Changes” In Power BI Incremental Refresh

 1 year ago
source link: https://blog.crossjoin.co.uk/2022/07/31/custom-queries-for-detect-data-changes-in-power-bi-incremental-refresh/

One feature of Power BI incremental refresh I’ve always been meaning to test out is the ability to create your own M queries to work with the “detect data changes” feature, and last week I finally had the chance to do it. The documentation is reasonably detailed but I thought it would be a good idea to show a worked example of how to use it to get direct control over what data is refreshed during an incremental refresh.

First of all I created a simple dataset with incremental refresh enabled. The source was a SQL Server table with two columns: Date (actually a datetime column) and Sales.

[Screenshot: the source table with its Date and Sales columns]

I then configured incremental refresh as follows:

[Screenshot: the incremental refresh settings]

In the background this created six yearly partitions:

[Screenshot: the six yearly partitions]

Nothing interesting here so far, but the real challenge lies ahead: how exactly do you use custom queries with “detect data changes”?

I created a new table in my SQL Server database called DetectDataChangesTable with one row for every partition in the dataset (even though the incremental refresh configuration above means only the 2021 and 2022 partitions will ever be refreshed) and the values for the RangeStart and RangeEnd M parameters that would be set when each partition is refreshed:

[Screenshot: the contents of DetectDataChangesTable]
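To make the shape of this table concrete, here is a minimal Python sketch that generates one row per yearly partition. The year range 2017 to 2022 is an assumption inferred from "six yearly partitions" ending in 2022, the initial Output value of 1 for every row is also an assumption, and the helper name yearly_partition_rows is hypothetical.

```python
from datetime import datetime


def yearly_partition_rows(first_year, last_year, output=1):
    """Build one (RangeStart, RangeEnd, Output) row per yearly partition."""
    return [
        (datetime(year, 1, 1), datetime(year + 1, 1, 1), output)
        for year in range(first_year, last_year + 1)
    ]


# Assumed year range: six yearly partitions, the most recent being 2022.
rows = yearly_partition_rows(2017, 2022)
for range_start, range_end, output in rows:
    print(range_start.date(), range_end.date(), output)
```

Each RangeEnd equals the next row's RangeStart, which mirrors the half-open date ranges (>= RangeStart and < RangeEnd) that incremental refresh uses when it filters the source data.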

I then created an M query in my dataset called DetectDataChangesQuery that connected to this table, filtered the RangeStart column by the current value of the RangeStart M parameter and the RangeEnd column by the current value of the RangeEnd M parameter, and then returned just the Output column:

let
    // Connect to the SQL Server instance and navigate to the table
    Source = Sql.Databases("ThisIsMySQLServerName"),
    IncrementalRefreshDemo = Source{[Name = "IncrementalRefreshDemo"]}[Data],
    dbo_DetectDataChangesTable = IncrementalRefreshDemo{[Schema = "dbo", Item = "DetectDataChangesTable"]}[Data],
    // Keep only the row whose range matches the current RangeStart/RangeEnd parameters
    FilterByParams = Table.SelectRows(
        dbo_DetectDataChangesTable,
        each [RangeStart] = RangeStart and [RangeEnd] = RangeEnd
    ),
    // Return a table containing just the Output column
    #"Removed Other Columns" = Table.SelectColumns(FilterByParams, {"Output"})
in
    #"Removed Other Columns"

Here’s the output of the query in the Power Query Editor with the RangeStart M parameter set to 1/1/2021 and the RangeEnd M parameter set to 1/1/2022:

[Screenshot: the output of DetectDataChangesQuery in the Power Query Editor]

The important thing to point out here is that while the documentation says the query must return a scalar value, in fact the query needs to return a table with one column and one row containing a single scalar value.
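As a rough illustration of that result shape, the following Python sketch mimics what the query does: filter the table by the two parameter values and return only the Output column. It is a simulation under assumptions, not how Power Query evaluates the query; the in-memory TABLE and the function name are hypothetical, and only the two partitions mentioned in the post are included.

```python
from datetime import datetime

# Hypothetical in-memory stand-in for DetectDataChangesTable.
TABLE = [
    {"RangeStart": datetime(2021, 1, 1), "RangeEnd": datetime(2022, 1, 1), "Output": 1},
    {"RangeStart": datetime(2022, 1, 1), "RangeEnd": datetime(2023, 1, 1), "Output": 1},
]


def detect_data_changes_query(table, range_start, range_end):
    """Filter by both parameters, then keep only the Output column."""
    return [
        {"Output": row["Output"]}
        for row in table
        if row["RangeStart"] == range_start and row["RangeEnd"] == range_end
    ]


result = detect_data_changes_query(TABLE, datetime(2021, 1, 1), datetime(2022, 1, 1))
print(result)  # prints [{'Output': 1}]
```

The result is a list with exactly one row and one column, which corresponds to the one-row, one-column table described above rather than a bare scalar.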

After publishing the dataset once again, the next thing to do was to set the pollingExpression property described in the documentation. I did this by connecting to the dataset via the XMLA Endpoint using Tabular Editor 3, then clicking on the Sales table and looking in the Refresh Policy section in the Properties pane. I set the property to the name of the query I had just created, DetectDataChangesQuery:

[Screenshot: the Refresh Policy section of the Properties pane in Tabular Editor 3]

I then forced a full refresh of the Sales table, including all partitions, by running a TMSL script in SQL Server Management Studio and setting the applyRefreshPolicy parameter to false, as documented here. Here’s the TMSL script:

{
  "refresh": {
    "type": "full",
    "applyRefreshPolicy": false,
    "objects": [
      {
        "database": "IncrementalRefreshDetectDataChangesTest",
        "table": "Sales"
      }
    ]
  }
}

Scripting the entire table out to TMSL, I could then see that the refreshBookmark property was set to 1 on the two partitions (2021 and 2022) that could be refreshed in an incremental refresh. This is the value returned for those partitions in the Output column of the DetectDataChangesQuery query:

[Screenshot: the TMSL for the Sales table showing the refreshBookmark property]

The refreshBookmark property is important because it stores the value that Power BI compares with the output of the DetectDataChangesQuery query on subsequent dataset refreshes to determine if the partition needs to be refreshed. So, in this case, the value of refreshBookmark is 1 for the 2021 partition, but if in a future refresh the DetectDataChangesQuery returns a different value for this partition then Power BI knows it needs to be refreshed.
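The comparison described above can be sketched in a few lines of Python. This is only a simulation of the idea, not Power BI's actual implementation: the function name, the dictionaries keyed by partition name, and the use of strings for the stored values are all assumptions made for illustration.

```python
def partitions_to_refresh(bookmarks, polled):
    """Return the partitions whose polled value differs from the stored bookmark."""
    return [name for name, value in polled.items() if bookmarks.get(name) != value]


# Stored refreshBookmark values after the initial full refresh.
bookmarks = {"2021": "1", "2022": "1"}

# Hypothetical values returned by DetectDataChangesQuery on a later refresh.
polled = {"2021": "2", "2022": "1"}

changed = partitions_to_refresh(bookmarks, polled)
print(changed)  # prints ['2021']

# After refreshing, the new value becomes the stored bookmark.
for name in changed:
    bookmarks[name] = polled[name]
```

Only partitions inside the incremental refresh window are polled, so in this sketch the polled dictionary contains just the 2021 and 2022 partitions.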

I then went back to the DetectDataChangesTable table in SQL and set the Output column to be 2 for the row relating to the 2021 partition:

[Screenshot: DetectDataChangesTable with Output set to 2 for the 2021 row]

Next, I went back to SQL Server Management Studio and refreshed the table using a TMSL script with applyRefreshPolicy set to true (which is the default, and what would happen if you refreshed the dataset through the Power BI portal):

{
  "refresh": {
    "type": "full",
    "applyRefreshPolicy": true,
    "objects": [
      {
        "database": "IncrementalRefreshDetectDataChangesTest",
        "table": "Sales"
      }
    ]
  }
}
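The two refresh scripts in this post differ only in the applyRefreshPolicy value, so if you wanted to generate them rather than edit them by hand, a small helper along these lines might do. It is a minimal sketch: the function name build_refresh_command is hypothetical, and it only produces the JSON text, which you would still have to run from SQL Server Management Studio or another XMLA client.

```python
import json


def build_refresh_command(database, table, apply_refresh_policy=True):
    """Return a TMSL full-refresh command for one table as a JSON string."""
    command = {
        "refresh": {
            "type": "full",
            "applyRefreshPolicy": apply_refresh_policy,
            "objects": [{"database": database, "table": table}],
        }
    }
    return json.dumps(command, indent=2)


# Full refresh of every partition, ignoring the refresh policy.
print(build_refresh_command(
    "IncrementalRefreshDetectDataChangesTest", "Sales", apply_refresh_policy=False
))
```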

In the Messages pane of the query window I saw that Power BI had detected the value returned by DetectDataChangesQuery for the 2021 partition had changed, and that therefore the partition needed to be refreshed:

[Screenshot: the Messages pane reporting the changed value for the 2021 partition]

Lower down in the Messages pane the output confirmed that only the 2021 partition was being refreshed:

[Screenshot: the Messages pane showing only the 2021 partition being refreshed]

In Profiler I saw three SQL queries. The first two queried the DetectDataChangesTable table, once for each of the two partitions that might be refreshed, to check whether the value returned in the Output column had changed:

select [_].[Output]
from [dbo].[DetectDataChangesTable] as [_]
where ([_].[RangeStart] = convert(datetime2, '2022-01-01 00:00:00')
    and [_].[RangeStart] is not null)
    and ([_].[RangeEnd] = convert(datetime2, '2023-01-01 00:00:00')
    and [_].[RangeEnd] is not null)

select [_].[Output]
from [dbo].[DetectDataChangesTable] as [_]
where ([_].[RangeStart] = convert(datetime2, '2021-01-01 00:00:00')
    and [_].[RangeStart] is not null)
    and ([_].[RangeEnd] = convert(datetime2, '2022-01-01 00:00:00')
    and [_].[RangeEnd] is not null)

The third was to get the data for the 2021 partition, which was the only partition that needed to be refreshed:

select [_].[Date],
    [_].[Sales]
from [dbo].[Sales] as [_]
where [_].[Date] >= convert(datetime2, '2021-01-01 00:00:00')
    and [_].[Date] < convert(datetime2, '2022-01-01 00:00:00')

Finally, scripting the Sales table again to TMSL after the refresh had completed showed that the refreshBookmark property had changed to 2 for the 2021 partition:

[Screenshot: the TMSL for the Sales table after the refresh]

And that’s it. I really like this feature, but I’ve never seen anyone use it in the real world, which is a shame. Maybe this blog will inspire someone out there to try it in production?
