Power BI 2020 Updates: Incremental Processing for Power BI Pro from (Source) Azure SQL Database (Bug)

This is probably the most exciting update for all of us working with Power BI. We already have a backlog of reports in Pro that have needed incremental loading for some time, so it's great to finally get the chance to try it.

Our project, which surfaces tweets with sentiment scores and related data in a Power BI report, involves the following services:

Logic Apps

A Logic app that is triggered when a new tweet is posted regarding a specific company

Content Moderator – the Logic App uses Content Moderator to check for profanity

Cognitive Services – the Logic App uses Cognitive Services to add a sentiment score and to extract key phrases

There is also a second Logic App that uses the same logic for Tweets posted by the company.

Azure Storage Account – Table Storage

  • The Logic Apps load the tweet information into a Tweets table
  • The key phrases go into a Keyphrases table that links to the Tweets table
  • The media items go into a Media table that links to the Tweets table

Data Factory

Data Factory is used to incrementally load the information from Table Storage into a staging area in the Azure SQL Database

The logic is fairly straightforward in that data items are only ever inserted. Nothing is updated or deleted

There is a pipeline for each table

The SQL for the Lookup activity on the Mentions dataset:

SELECT MAX(WatermarkValue) AS WatermarkValue From [staging].[watermarktable]
WHERE WatermarkValue IS NOT NULL
AND TableName = 'staging.mentions'

The watermark is a table in SQL that is updated with the max date at the end of the process
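As a rough sketch of what that table could look like (the column types here are assumptions; only the names TableName and WatermarkValue come from the lookup query above):

-- Minimal sketch of the watermark table, one row per incrementally loaded table
CREATE TABLE [staging].[watermarktable]
(
    TableName      VARCHAR(255) NOT NULL,   -- e.g. 'staging.mentions'
    WatermarkValue DATETIME2    NULL        -- max date loaded so far for that table
);

INSERT INTO [staging].[watermarktable] (TableName, WatermarkValue)
VALUES ('staging.mentions', '2019-01-01');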

The query for the Source

CreatedAt gt '@{activity('LookupWatermarkOld').output.firstRow.WaterMarkValue}'

This basically brings through records greater than the value in the lookup table

There is a pipeline over the separate copy pipelines that runs them all sequentially

Next comes a pipeline to run all the stored procedures that move data from staging to the dims and facts in SQL

At the end of these stored procedures we move the date on to the max date in the watermark table (and at the beginning too, in case there is an error in the SQL pipeline)
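As an illustration of that pattern (the fact table columns below are made up for the example; only the watermark table itself comes from the project), one of those stored procedures might look roughly like this:

-- Rough sketch only: the [fact].[Mentions] columns here are illustrative
CREATE PROCEDURE [staging].[usp_LoadFactMentions]
AS
BEGIN
    -- Insert only the newly staged rows into the fact table
    INSERT INTO [fact].[Mentions] (TweetId, CreatedAt, SentimentScore)
    SELECT s.TweetId, s.CreatedAt, s.SentimentScore
    FROM [staging].[mentions] s
    WHERE NOT EXISTS (SELECT 1 FROM [fact].[Mentions] f WHERE f.TweetId = s.TweetId);

    -- Move the watermark on to the latest date that has been loaded
    UPDATE [staging].[watermarktable]
    SET WatermarkValue = (SELECT MAX(CreatedAt) FROM [staging].[mentions])
    WHERE TableName = 'staging.mentions';
END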

Doing this means that Data Factory only loads new records and doesn't have to reload the staging area every time

The top-level pipeline runs all the incremental copy pipelines and then the stored procedures

Let's have a look at our watermark table before and then after a load
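The screenshots aren't reproduced here, but checking it yourself is just a query against the watermark table used in the lookup above:

-- Watermark values per table, before and after a load
SELECT TableName, WatermarkValue
FROM [staging].[watermarktable]
ORDER BY TableName;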

And a look at our last import of tweets in the audit table.

There are more audit tables to help find any issues. This is after the run on the 13th of March (Test 2)

Data Factory Trigger

Because it's not in full use at the moment, the dataset is loaded once a week, on a Sunday at 12:50, and until this is retested the trigger is set to off

Azure SQL Database

In here we have the watermark table, all the audit tables, the staging tables, and the dimensions and facts

The facts and dimensions are currently created via stored procedures, but the hope is to change over to data flows.

Power BI

The data is imported into Power BI Pro (full process), so the model is dropped and recreated on every refresh.

Azure Data Studio

Is there any way we can have a look at what is going on when we load? Yes, by using Azure Data Studio

https://docs.microsoft.com/en-us/sql/azure-data-studio/download-azure-data-studio?view=sql-server-ver15

Once installed, connect to the SQL Database that is your data source

So long as you have the Profiler extension installed you can Launch Profiler

If you don’t have it, you can download the extension

Once launched, start a Profiler session

Now we need something to profile. Go into the Power BI Service and open Datasets.


Click on Refresh now and then go to Data Studio to see what's happening

From logon to logout during the run it took 20 minutes, because the entire model is refreshed. Obviously it would be really good if we could get that time down using incremental refresh

Before you set up Incremental processing, ensure that the services preceding the Power BI Load have been well tested and signed off.

Incremental Processing in Power BI Pro

In Power BI Desktop, incremental refresh is now out of preview, so there is no need to go to Options and Settings to turn it on anymore.

Define your Incremental refresh policy

If the systems are acting as they should and there are no bugs or issues:

  • New rows are added into the data set
  • No historical data is updated or deleted
  • Incremental loading can be added to every table apart from the Media table; it doesn't hold that many records, so it can be left as a full load

Set up incremental refresh in Desktop. Create Parameters

It isn't practical to hold all your data in Desktop when you are working with a large model.

Go to Power Query Editor

Select Manage Parameters

The two parameters that need setting up for incremental loading are RangeStart and RangeEnd

These are reserved, pre-defined parameter names used specifically for incremental processing

RangeStart and RangeEnd are set in the background when Power BI runs the refresh. They are used to partition the data

You need to be aware of query folding here. This is where the steps you write in Power Query (M) to transform the data are, where possible, translated and applied at the source, so the RangeStart and RangeEnd filters will be pushed down to the source system. It's not recommended to run incremental processing on data sources that can't query fold (flat files, web feeds), and you do get a warning message if the query can't be folded
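As a rough illustration of what folding means for this model (the dates below are made-up partition boundaries; the table and column come from the filter list further down), the filter ends up being pushed to the Azure SQL source as a WHERE clause along these lines:

SELECT *
FROM [dim].[mentionsTweet]
WHERE CreatedAtDateTime >= '2019-06-01T00:00:00'   -- RangeStart for the partition
  AND CreatedAtDateTime <  '2019-07-01T00:00:00'   -- RangeEnd for the partition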

The suggested values are simply placeholder values you add; they get amended later during processing.

This start date was chosen because the data only started being collected in 2019, so at present there is only around a year of data

Filter the data in the model

Still in Power Query Editor, all the tables that require incremental load need to have the RangeStart and RangeEnd parameters applied as filters on the date column

Incremental refresh isn’t designed to support cases where the filtered date column is updated in the source system.

With this in mind, imagine you have a sales table with an OrderDate and an UpdateDate. The OrderDate is static; the UpdateDate changes whenever the record is amended.

OrderDate would need to be chosen as it's static. So let's go through the tweet tables and set the filters; click on the column header icon to get to the filters

In Power BI Desktop you don't need that much data to do the testing, so this is a great way to keep the dataset smaller in Desktop. At the moment it's using the default parameter values we provided.

  • dim.mentionsKeyphrases – TwitterTimestamp
  • dim.mentionsTweet – CreatedAtDateTime
  • dim.BusinessKeyphrases – TwitterTimeStamp2
  • dim.BusinessTweets – CreatedAt
  • dim.BusinessReplies – CreatedAt
  • fact.Mentions – Date (for this, a Date column was created from the date-time key in Power Query Editor)

Close and Apply

Define your Incremental Refresh Policy in Power BI Pro

Go to your first table and choose Incremental refresh

Everything is being stored for 5 years, but the period is set in months (60) so that the partitions are smaller

If this ran every single day then you would only need to refresh rows from the last 1 day. However, 1 month has been used as a just-in-case, in case for any reason the job is suspended or doesn't run.

Detect Data Changes has also been used. A month's data will only be refreshed if the ImportDate for records in that month has changed (or there are new records)

No records are deleted so we don’t need to worry about this
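Conceptually, Detect Data Changes means the service checks the maximum of the chosen column for each partition before deciding whether to refresh it. This isn't the exact query the service issues, but the check it performs is along these lines:

-- One check per monthly partition (the boundary dates are just examples)
SELECT MAX(ImportDate)
FROM [fact].[Mentions]
WHERE [Date] >= '2020-02-01' AND [Date] < '2020-03-01';
-- If this value is higher than the value stored after the last refresh,
-- the partition is refreshed; otherwise it is skipped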

Publish the new Power BI Report and Dataset

You might be thinking at this point: but I don't want the filters that I set for Desktop to be applied in the Service; I want to see all my data in the Service

Don't worry: in the Service, RangeStart and RangeEnd don't keep the dates specified for the filters in Desktop.

They are set via your incremental refresh policy, so they become the partition boundaries for our 60 months. Instead of one RangeStart and one RangeEnd covering the whole 5 years, you get a RangeStart and RangeEnd for month 1, a RangeStart and RangeEnd for month 2, and so on, breaking your 5 years down into much smaller partitions to work with.

You need to set up the Incremental Refresh policy for every table that has been filtered with RangeStart and RangeEnd

Test the Process

I have a visual for Number of Tweets

Today so far there are 11 Tweets

I also have an audit report

  1. The Logic App has been processing tweets in real time into Table Storage
  2. Run Data Factory (2 new records)
  3. Reprocess the Power BI Pro dataset – Error: Resource name and Location need to match
  4. If there hadn't been an error we would move to the Azure Data Studio check. Note that it now takes a second to run
  5. Check the visual

Error: Resource name and Location need to match

The dataset now has a warning sign. After speaking to Microsoft, this is a known issue and should be fixed in April; it is believed to be something to do with Detect Data Changes. So basically… to be continued
