Azure Data Factory – Moving from Development to Production – Part 2: Using Key Vault for Linked Services

In Azure Data Factory – Moving from Development to Production we looked at how to use Azure DevOps to move the JSON code for the Development Data Factory into Production.

It's going well; however, I have been left with an issue. Every time I move into Production, the details for the Linked Services have to be re-added. Let's have a look at the SQL Server and the Data Lake Gen 2 account.

Development

Notice that the information has been entered manually including the Storage account Key.

Again, in this instance the information has been entered manually. SQL Server Authentication is being used because we have a user in the SQL DB with all the privileges that Data Factory needs.

DevOps Data Factory release Pipeline

Go into Edit of the Release Pipeline

Within Prod Stage we have an Agent Process

We are looking for the Override Template Parameters section.

Note that currently Account Key and SQL Database Connection String are null.

Provisioning Azure Key vault to hold the Secrets

Managed Identity for Data Factory

Copy your Azure Data Factory Name from Data Factory in Azure

You need to have a Key vault set up in Development

GET and LIST allow Data Factory to read secret information from the Key Vault.

Paste the data factory name into Select Principal

Key Vault, create a Secret for the Azure Data Lake Storage

For the Key Vault Secret, I set the secret value by copying across the Access Key from the Storage Account's Keys section.

The Content type was simply set as the name of the Storage Account for this exercise.

In Data Factory Create a Linked Service to the Key Vault

Test and ensure it successfully connects

Use the New Key Vault to reset the data Lake Linked Service

How does this Data Lake Linked Service change the DevOps Release Pipeline?

It's time to release our new Data Factory settings into Production. Make sure you have published the Data Factory into the DevOps Git repo.

Production Key vault Updates

We need to update Production in the same way as Development

  1. In the Production Key Vault, add the Production Data Factory name to Access Policies (as an application) with Get and List permissions on secrets
  2. Ensure that there is a Secret for the Production Data Lake Key AzureDataLakeStorageGen2_LS_accountKey
  3. Check your Key vault connection works in Production before the next step

Azure DevOps Repos

In Azure Devops go to your Data Factory Repos

Notice that the Linked Service information for the Data Lake now references the Key Vault Secret. It's not hardcoded anymore, which is exactly what we want.

Azure DevOps Release Pipeline

Go to Edit in the Data Factory release pipeline

When you click on the job in Prod, you can go to the Override Parameters section and notice there is now an error.

AzureKeyVault1_properties_typeProperties_baseUrl is the missing parameter. At this point you need to delete the code in the Override Template Parameters box and then click the … button to regenerate the new parameters.

Override with the production information (I saved the old code so I could re-copy the bits I need).

Once done, notice that the -AzureDataLakeStorageGen2_LS_accountKey "" parameter is now gone because it's being handled by the Key Vault.

Lets Save and Create a Release

New failures in the Release

2021-02-08T13:45:13.7321486Z ##[error]ResourceNotFound: The Resource 'Microsoft.DataFactory/factories/prod-uks-Project-adf' under resource group 'prd-uks-Project-rg' was not found. For more details please go to https://aka.ms/ARMResourceNotFoundFix

Make sure that your override parameters are ok. I updated:

  • Data Factory name from Data Factory
  • Primary endpoint data Lake Storage from Properties
  • Vault URI from Key vault Properties

Repeat the Process for SQL Database

With everything in place we need to introduce a connection string into Key Vault

I have a user set up in my SQL database. The user has been granted SELECT, INSERT, UPDATE, DELETE, EXECUTE and ALTER on all schemas.

I want to include the user name and password in the Connection string and use SQL authentication
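As a rough sketch, the user and its permissions could be created along these lines (the user name and password come from the connection string below; the schemas listed are just examples):

-- Contained database user for Data Factory (run against the Azure SQL Database itself)
CREATE USER [projectDBowner] WITH PASSWORD = 'Password1';
-- Grant the privileges Data Factory needs, per schema
GRANT SELECT, INSERT, UPDATE, DELETE, EXECUTE, ALTER ON SCHEMA::[dbo] TO [projectDBowner];
GRANT SELECT, INSERT, UPDATE, DELETE, EXECUTE, ALTER ON SCHEMA::[staging] TO [projectDBowner];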

Secret Name

AzureSQLDatabase-project-ConnectionString

Secret

Server=tcp:dev-uks-project-sql.database.windows.net,1433;Database=dev-uks-project-sqldb;User Id=projectDBowner;Password=Password1;

the connection string has been set as above. For more information on connection strings see SQL Server connection Strings

Go back to Data factory and set up the new secret for SQL Server

This is successful

Data Factory and DevOps

  1. Back in Data Factory, publish the new Linked Service code
  2. Go into the Dev Repos and check that you are happy with the new Key Vault information in the Linked Service code
  3. Go to the Prod Key Vault and make sure the Secret is set with the Connection String for the SQL DB
  4. Test that the Key Vault secret works in Prod
  5. Back in DevOps, go to Release Pipelines and Edit the adf Release CD pipeline (Release Pipelines are CD, Continuous Delivery; Build Pipelines are CI, Continuous Integration)
  6. Edit the Prod Stage (I only have Prod) ARM Template Deployment task and copy the Override Template Parameters code into a file for later
  7. Delete the code and click the … to get the latest parameter information
  8. Re-add your production parameters; most can be taken from the code you just copied
  9. Create a new Release
  10. Go to Linked Services in Data Factory and check they are still Production. They still use Key Vault and they still work

Now this is all in place, the Development Data Factory can be published up to Production. There is no need to reset Linked Services, and all your information about keys and passwords is hidden in the Key Vault.

Azure Data Factory – Moving from Development to Production

When working on larger projects we need to merge changes from Developers. When all the changes are in the central branch we can then have an automated process to move development to Production

Smoke tests

In computer programming and software testing, smoke testing is preliminary testing to reveal simple failures severe enough to, for example, reject a prospective software release

Integration testing

Integration testing is the phase in software testing in which individual software modules are combined and tested as a group. Integration testing is conducted to evaluate the compliance of a system or component with specified functional requirements. It occurs after unit testing and before validation testing

Resources Involved with the current Project

  • Azure DevOps
  • Azure SQL Server
  • Azure SQL Database
  • Azure Data Factory
  • Azure Data Lake Gen 2 Storage
  • Azure Blob Storage
  • Azure Key vault

Each resource has its own specific requirements when moving from Dev to Prod.

We will be looking at all of them separately along with all the security requirements that are required to ensure that everything works on the Production side

This post specifically relates to Azure Data Factory and DevOps

Azure Data Factory CI/CD Lifecycle

Git handles the creation of the feature branches and then merging them back into the main (master) branch.

Git is used for version controlling.

In terms of Data Factories, you will have a Dev factory, a UAT factory (if used) and a Prod Data Factory. You only need to integrate your Development Data Factory with Git.

The Pull request merges feature into master

Once published, we need to move the changes to the next environment, in this case Prod (when ready).

This is where Azure Devops Pipelines come into play

If we are using the Azure DevOps pipelines for continuous development, the following things will happen:

  • The DevOps pipeline will get the PowerShell script from the master branch
  • Then get the ARM template from the publish branch
  • Deploy the PowerShell script to the next environment
  • Deploy the ARM template to the next environment

Why use Git with data Factory

  • Source control allows you to track and audit changes
You can do partial saves when, for example, you have an error. Data Factory won't allow you to publish, but with Git you can save where you are and then resolve the issues another time
  • It allows you to collaborate more with team members
  • Better CI/CD when deploying to multiple environments
Data Factory is many times faster with a Git back end than it is when authoring directly against the Data Factory service, because resources are downloaded from Git
  • Adding your code into Git rather than simply into the Azure Service is actually more secure and faster to process

Setting Up Git

We already have an Azure Devops Project with Repos and Pipelines turned on

We already have Azure Subscriptions and Resource Groups for both the Production and Development environments

There is already a working Data Factory in development

In this example Git was set up through the Data Factory management hub (the toolbox)

DevOps Git was used for this project rather than GitHub because we have Azure DevOps

Settings

The Project Name matches the Project in Devops

The Collaboration branch is used for Publishing and by default it’s the master branch. You can change the setting in case you want to publish from another branch.

Import existing resources to repository means that all the work done before adding Git can be added to the repository.

Devops is now set up.

If Azure Data Factory is open, close it so we can reopen it again and go through the Git process

Where to find your Azure DevOps

You should now be able to select your own Area in Devops / Repos and select the created project within Azure Devops and Repos

You will need an account in Azure DevOps to click on Repos, and that account must have an access level higher than Stakeholder to access Repos

You can then select the Project you have created

Using Git with Azure Data Factory

In Azure Open up your Data Factory

Git has been Enabled (Go to Manage to review Git)

The master branch is the main branch with all the development work on it

We now develop a new Feature. Create a Feature Branch + New Branch

We are now in the feature branch and I am simply adding a description to a Stored Procedure Activity in the pipeline. However this is where you will now do your development work rather than within the master

For the test, the description of a Pipeline is updated. Once completed my changes are in the feature 1 branch. I can now save my feature

You don’t need to publish to save the work. Save all will save your feature, even if there are errors

You can go across to Devops to see your Files and history created in Azure Devops for the feature branch (We will look at this once merged back into Production)

Once happy Create the pull request

This takes you to a screen to include more details

Here I have also included the Iteration we are currently working on in Devops Boards.

A few tags are also added. Usually, someone will review the work and will also be added here.

The next screen allows you to approve the change and Complete the change

In this case I have approved. You can also do other things like make suggestions and reject

Completing allows us to complete the work and removes the feature branch. Now all the development in the feature branch will be added to the main, master branch.

In Data Factory, go back to the master branch and note that your feature updates are included

We now publish the changes in the master branch, which creates the adf_publish branch. The publish creates the ARM templates that represent the pipelines, linked services, triggers etc.

Once published, In Devops Repos , there are now files to work with

You can see your change within the master branch

(The changes would normally be highlighted on the two comparison screens)

Here we open the Pipelines folder, go to the Compare tab and find the before and after code

And you can also see your history

The ARM templates are in the adf_publish branch, if you select this branch

Once done, we need to move the changes to the next environment, in this case Prod (when ready)

This is where Azure Devops Pipelines come into play

Continuous Development using Azure DevOps

We need another Data Factory object to publish changes to

In this case, the Production Data Factory has been created in the Azure Portal within the Production Subscription and Production Resource Group

Git Configuration is not needed on the Production resource. Skip this step

Create your tags and Review and Create

DevOps Pipelines

For this specific Project, We don’t want to update production automatically when we publish to Dev. We want this to be something that we can do manually.

Go to Pipelines and create a new release Pipeline (In DevOps)

Click on Empty job because we don’t want to start with a template

And because for this project there is no UAT, just Production, name the Release Pipeline Prod

Click on the X to close the blade

We need to sort out the Artefact section of the Pipeline
 
Click on Add an Artefact and choose an artefact from Azure Repos
 
We may as well add the adf_publish branch, which contains the ARM templates, and the master branch

The Source alias was updated to _adf_publish

Both branches are Azure Repos artefacts

Next We move to Prod and Start adding tasks

Click on 1 job, 0 tasks to get to tasks

Click + against the Agent Job to add a task. Our task is ARM Template Deployment


Click Add

Then click on the new Task to configure

The first section is where you select your production environment

Next you need to select the ARM template and the ARM template parameters file. These are always updated in the DevOps artefact every time you publish to Dev.

The JSON templates are in the adf_publish branch

Now you need to override the template parameters because these are all for Dev and we need them to be production.  These are:

These will be specific to your own data Factory environment. In this instance we need to sort out the information for the Key vault and data lake storage account

factoryName

This one is easy. The only difference is changing dev to prd

AzureDataLakeStorageGen2_LS_properties_typeProperties_url

The Production Data lake Storage account must be set up in dev and prod before continuing. Go to this Storage account resource

This information is also stored in our Key Vault as a secret which we can hopefully use at a later date.

It is taken from Storage Account, Properties. We want the Primary endpoint Storage for the data lake

Copy the Primary Endpoint URL and override the old with the new Prod URL in DevOps

AzureKeyVault1_properties_typeProperties_baseUrl

We need to update https://dev-uks-project-kv.vault.azure.net/

Lets get this overridden. We already have a Key vault Set up in production. Get the URI from Overview in the Production Key Vault Service

and lets add this into our DevOps parameter

AzureDataLakeStorageGen2_LS_accountKey

This is empty but we could add to it later in the process.

Account keys are the kind of things that should be kept as secrets in Key vault in both Dev And Prod

Lets get them set up. Just for the time being, lets ensure we have the Data Lake storage account key within our development and Production Key vaults

Key Vault.   

Within the Key Vault in Development, create a secret with the name AzureDataLakeStorageGen2_LS_accountKey

And the key itself comes from the Storage Account's Access Keys section

And Repeat for Production Key vault

For the time being though, let's leave this parameter blank now we have captured the information in the Key Vault. It should come in useful at a later date.

AzureSqlDatabaseTPRS_LS_connectionString

This was also empty within the parameters for dev.

You can get the connection string value by going to your SQL Database, then Connection Strings, then the PHP tab, and finding the try statement

And here is the Connection String value for production

Server=tcp:prd-uks-project-sql.database.windows.net,1433;Database=prd-uks-project-sqldb;

You can also add this information into Key Vault as a secret and repeat for Production

For the first instance we are going to leave it empty, as per the Dev parameters. At some point we should be able to set up the Service Principal so we can change the hardcoded values to secrets.

The parameters created in dev are now overridden with the production values
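Put together, the Override Template Parameters box for Prod ends up containing something along these lines (shown here on separate lines for readability; the parameter names are the ones listed above, and the values are placeholders for your own production factory name, storage URL and Key Vault URI):

-factoryName "prod-uks-Project-adf"
-AzureDataLakeStorageGen2_LS_properties_typeProperties_url "https://prdstorageaccount.dfs.core.windows.net/"
-AzureDataLakeStorageGen2_LS_accountKey ""
-AzureKeyVault1_properties_typeProperties_baseUrl "https://prd-uks-project-kv.vault.azure.net/"
-AzureSqlDatabaseTPRS_LS_connectionString ""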

The Pipeline is then named

Create a release

Once Saved. Click back on Releases

For this type of release we only want to do it manually

Create a Release for our very first manual release

Click back on releases

And click on release 1 to see how it is doing

You can click on Logs under the Stages box to get more information

Now you should be able to go back to the production data Factory and see that everything has been set up exactly like Dev.

Go and have a look at linked Services in the Production data Factory

Note that they are all set with the Production information

We now have a process to move Dev to Prod whenever we want

The Process

Throughout the sprint, the development team will have been working on feature branches. These branches are then committed into the master branch and deployed to Dev.

Once you are happy that you want to move your Data Factory across from Dev into Prod, go to the DevOps Release Pipeline

Create Release to create a new release

It uses the Artefact of the Arm template which is always up to date after a publish.

This will create a new release and move the new information to Prod

All your resources will then be able to move quickly from Dev to Prod, and we will look at this in further posts

Data Factory, Moving multiple lookup worksheets from Excel to one lookup table in SQL Server

A current project has an xlsx containing around 40 lookups in individual worksheets

Each worksheet consists of a code and a description

We decide that we want every single lookup in one lookups table in SQL Server.

This will have a Lookup Name, Code and Description that we can then use for the rest of the project

We want to do everything in one go in Data Factory.

For this Im going to use a simple example with just 3 worksheets

Azure Data Lake Gen 2

We are going to store the source data within a data lake.

The Source data looks like this

Lookup B worksheet

Lookup C Worksheet

SQL Server

I have an Azure SQL Database and on it I create the one table that all the reference lookups will go into

GO
CREATE TABLE [staging].[Lookups](
[LabelKey] [int] IDENTITY(1,1) NOT NULL,
[LabelName] [varchar](100) NULL, -- lengths assumed; adjust to suit
[Code] [int] NULL,
[LabelDescr] [varchar](255) NULL,
[Importdate] [datetime] NULL
) ON [PRIMARY]
GO
ALTER TABLE [staging].[Lookups] ADD DEFAULT (getdate()) FOR [Importdate]
GO

LabelKey has been added just to create a valid key for the table. LabelName has also been added which will be the name of the worksheet.

Finally ImportDate is added because we want to know exactly what time this data was imported into the table

Now we need to provide Data Factory with a list of worksheets

CREATE TABLE [staging].[LookupNames](
[LabelKey] [int] IDENTITY(1,1) NOT NULL,
[Labels] [varchar](100) NULL, -- length assumed; adjust to suit
[Importdate] [datetime] NULL
) ON [PRIMARY]
GO
ALTER TABLE [staging].[LookupNames] ADD DEFAULT (getdate()) FOR [Importdate]
GO

Lookup Names is our seed table and will provide us with the worksheet names

we have populated it like this

INSERT INTO [staging].[LookupNames] (Labels)
SELECT 'Lookup A' UNION
SELECT 'Lookup B' UNION
SELECT 'Lookup C'

Data Factory

Linked Services

Firstly we need to provide our linked services. Source and destination

go to Linked services via

and choose new.

call it ADLS_LS and select your Azure Subscription and Storage account.

At this point the connection was tested and was successful so we didn’t need to do anything further

Next, create your Azure SQL Database Linked Service

And call it SQLDB_LS (or whatever you feel is the right naming convention; _LS is good because you can see exactly which objects are the Linked Services in the JSON script created)

Again add in your details (we used a role created in the SQL database specifically for Data Factory, with EXECUTE, SELECT, INSERT, UPDATE and DELETE granted on all the schemas)
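If you want to follow the same approach, the role could be set up with something like this (a sketch; the role, user and schema names are placeholders):

-- Create a role for Data Factory and grant it what it needs on each schema
CREATE ROLE [DataFactoryRole];
GRANT SELECT, INSERT, UPDATE, DELETE, EXECUTE ON SCHEMA::[staging] TO [DataFactoryRole];
-- repeat the GRANT for any other schemas Data Factory uses
ALTER ROLE [DataFactoryRole] ADD MEMBER [DataFactoryUser];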

Ensure the connection is successful

Data Sets

Now to come up with the actual source and destination datasets. If we parameterise them then we can reuse a single data set for lots of other activities within the pipeline

Click on the and choose New dataset

Choose the Format. In this case its Excel

We don’t want to specify any of the location values until we get to the pipeline, including the worksheet

Make sure First row as header is ticked (Unless you don’t have a header in Excel)

And create parameters

This means we can use this one Data set for all the SQL data sources

Pipelines

Now to create the pipeline specifically for the lookup

This is the basic pipeline we are going to add.

Lookup

First of all In Activities search for lookup and drag this into the pane

This uses the SQL dataset because we are going to use our SQL table that contains all the names of the worksheets.

Note that first row only is not ticked because we are bringing all the information through
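The query behind the Lookup can be as simple as selecting the worksheet names from the seed table (a sketch based on the tables created above):

SELECT Labels FROM [staging].[LookupNames]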

ForEach

@activity('GetLookups').output.Value

We are going to feed the entire data set (the Value output from the GetLookups Lookup) into the ForEach.

Sequential is ticked because we are going to move through all the worksheet names in the table (ensure that your worksheets have exactly the same names as those specified in your table)

Click on the Activities (1) to get to the activity

Copy Activity within the Foreach

We now set up the source of the copy activity

We use all the parameters within the dataset and add in the information from our Azure data Lake Gen 2 in the Storage Resource.

Within our Lookups table there is a column called LabelName, and we are going to populate this with the Labels column from our item. The item comes from the ForEach loop and was created via the Lookup, and that Lookup contained all the columns from our LookupNames SQL table.

The data will go into the Lookups table

That's everything. You should be able to test your pipeline by clicking Debug, and the ForEach should move through each worksheet specified within your LookupNames table and add your information into SQL.

Truncating lookup tables before re adding data

We want to be able to repeat this process, and unless we add a truncate statement we will keep re-adding the same information.

We can add the following Stored procedure into SQL

/*
05/10/2020 Debbie Edwards - Peak - Truncate lookups
EXEC [staging].[USP_Truncatelookups]
*/
CREATE PROCEDURE [staging].[USP_Truncatelookups]
AS
BEGIN
    IF EXISTS(SELECT * FROM [dbo].[sysobjects] WHERE Name = 'lookups')
        TRUNCATE TABLE [staging].[Lookups]
    DBCC CHECKIDENT ('staging.Lookups', RESEED, 1)
END

And this can be added to the pipeline before the ForEach loop and Lookup with a Stored Procedure activity

You won't be able to see the stored procedure if you haven't granted EXECUTE access to the specific database role on that schema

Always give the least amount of privilege and then move up if you need to

--Bring back information about the members in your roles
SELECT DP1.name AS DatabaseRoleName,
isnull (DP2.name, 'No members') AS DatabaseUserName
FROM sys.database_role_members AS DRM
RIGHT OUTER JOIN sys.database_principals AS DP1
ON DRM.role_principal_id = DP1.principal_id
LEFT OUTER JOIN sys.database_principals AS DP2
ON DRM.member_principal_id = DP2.principal_id
WHERE DP1.type = 'R'
ORDER BY DP1.name;
SELECT DISTINCT rp.name,
ObjectType = rp.type_desc,
PermissionType = pm.class_desc,
pm.permission_name,
pm.state_desc,
ObjectType = CASE
WHEN obj.type_desc IS NULL
OR obj.type_desc = 'SYSTEM_TABLE' THEN
pm.class_desc
ELSE obj.type_desc
END,
s.Name as SchemaName,
[ObjectName] = Isnull(ss.name, Object_name(pm.major_id))
FROM sys.database_principals rp
INNER JOIN sys.database_permissions pm
ON pm.grantee_principal_id = rp.principal_id
LEFT JOIN sys.schemas ss
ON pm.major_id = ss.schema_id
LEFT JOIN sys.objects obj
ON pm.[major_id] = obj.[object_id]
LEFT JOIN sys.schemas s
ON s.schema_id = obj.schema_id
WHERE rp.type_desc = 'DATABASE_ROLE'
AND pm.class_desc <> 'DATABASE'
AND rp.name = 'db_NameofRole'

You should hopefully now have a good pipeline that loads your lookup information into one lookup table and truncates that table whenever you run the process

Power BI 2020 Updates Incremental Processing for Power BI Pro from (Source) Azure SQL Database (Bug)

This is really the most exciting update for probably all of us working with Power BI. We already have a backlog of reports in Pro that have required incremental loading for some time, so it's great to finally get the chance to try it.

Our Project, to view Tweets with sentiments etc in a Power BI Report involves the following Services:

Logic Apps

A Logic app that is triggered when a new tweet is posted regarding a specific company

Content Moderator – Logic Apps uses the content moderator to check for profanity

Cognitive Services – Logic apps uses the Cognitive Service to add a sentiment score and to find key phrases

There is also a second Logic App that uses the same logic for Tweets posted by the company.

Azure Storage Account – Tabular Data Storage

  • The Logic Apps loads the tweet information into a Tweets Table
  • The Keyphrases into a Keyphrases table that connects to the tweets table
  • The Media items into a Media table that connects to the tweets table

Data Factory

Data Factory is used to load the information from the Tabular Data Storage into a SQL Server Data base staging area incrementally

The logic is fairly straight forward in that data items are inserted. Nothing is updated or deleted

There is a Pipeline for each single table

The SQL For the Lookup for the data set Mentions

SELECT MAX(WatermarkValue) AS WatermarkValue From [staging].[watermarktable]
WHERE WatermarkValue IS NOT NULL
AND TableName = 'staging.mentions'

The watermark is a table in SQL that is updated with the max date at the end of the process

The query for the Source

CreatedAt gt '@{activity('LookupWatermarkOld').output.firstRow.WaterMarkValue}'

This brings through records with a date greater than the value in the watermark table

I have a parent pipeline over the separate pipelines to run them all sequentially

Next comes a pipeline to run all the stored Procedures to move data from staging to dims and facts in SQL

At the end of these stored procedures we move the date on to the max date in the watermark table (And at the beginning in case there is an error in the SQL pipeline)

Doing this means that Data Factory only loads new records and doesn't have to reload the staging area every time
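The watermark update at the end of the run can be a simple statement along these lines (a sketch; it assumes the staging table has a CreatedAt column and that the watermark table is the one queried above):

UPDATE [staging].[watermarktable]
SET WatermarkValue = (
    -- ISO 8601 format, matching the CreatedAt values held in Azure Table Storage
    SELECT CONVERT(varchar(23), MAX(CreatedAt), 126) + 'Z'
    FROM [staging].[mentions]
)
WHERE TableName = 'staging.mentions';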

The Top level pipeline runs all the incremental Copy pipelines and then the Stored Procedures

Lets have a look at our watermark table before and then after a load

And a look at our last import of tweets in the audit table.

There are more audit tables to help find any issues. This is after the run on the 13th March (Test 2)

Data Factory Trigger

Because it's not in full use at the moment, the dataset is loaded once a week on a Sunday at 12:50, and until this is retested the trigger is set to off.

Azure SQL Database

In here we have the watermark table, all the audit tables, the staging tables, and the dimensions and facts.

The facts and dimensions are currently created via stored procedures, but the hope is to change over to Data Flows.

Power BI

The data is imported into Power BI Pro (full process), so the model is dropped and recreated.

Azure Data Studio

Is there any way we can have a look at what is going on when we load? Yes, by using Azure Data Studio.

https://docs.microsoft.com/en-us/sql/azure-data-studio/download-azure-data-studio?view=sql-server-ver15

Once installed, connect to the SQL Database that is your data source

So long as you have the Profiler extension installed you can Launch Profiler

If you don’t have it, you can download the extension

Once Launched Start a Profiler Session

Now we need something to profile. Go into the Power BI Service, Datasets.


Click on Refresh Now and then go to Data Studio to see what's happening

From logon to logout during the run, it took 20 minutes because the entire model is refreshed. Obviously it would be really good if we could get the time down using incremental refresh.

Before you set up Incremental processing, ensure that the services preceding the Power BI Load have been well tested and signed off.

Incremental Processing in Power BI Pro

In Power BI Desktop, incremental refresh is now out of preview, so there is no need to go to Options and Settings to turn it on anymore.

Define your Incremental refresh policy

If the systems are acting as they should and there are no bugs or issues:

  • New rows are added into the data set
  • No historical data is updated or deleted
  • Incremental loading can be added to every table apart from the Media tables; because there aren't that many records in them, they can be left as a full load

Set up incremental refresh in Desktop. Create Parameters

It isn’t practical to hold all your data when you are working in desktop if you are working with a Large model.

Go to Power Query Editor

Select Manage Parameters

The two parameters that need setting up for incremental loading are RangeStart, RangeEnd

These are pre-defined parameter names reserved for incremental processing

RangeStart and RangeEnd are set in the background when Power BI runs; they partition the data.

You need to be aware of query folding here. This is when you write lots of steps in M Query to transform the data and, where possible, they are applied at source, so the RangeStart and RangeEnd filters will be pushed down to the source system. It's not recommended to run incremental processing on data sources that can't query fold (flat files, web feeds), and you do get a warning message if the query can't fold.

The suggested values are simply ones you add that get amended later during processing.

This start date was chosen because the data only started to be collected in 2019, so at present there is only around a year of data.

Filter the data in the model

Still in Power Query Editor, all the tables that require incremental load need to have the RangeStart and RangeEnd parameters applied as a filter on the date column.

Incremental refresh isn’t designed to support cases where the filtered date column is updated in the source system.

With this in mind, imagine you have a sales table with an OrderDate and an UpdateDate. The OrderDate is static; the UpdateDate will be updated if there are any changes to the record.

OrderDate would need to be chosen as it's static. So let's go through the tweet tables and set the filters; click on the column header icon to get to the filters.

In Power BI Desktop you don't need as much data for testing, so this is a great way to keep the data smaller in Desktop. At the moment it's using the default parameter values we provided.

  • dim.mentionsKeyphrases Twittertimestamp
  • dim.mentionsTweet CreatedAtDateTime
  • dim.BusinessKeyphrases TwitterTimeStamp2
  • dim.BusinessTweets CreatedAt
  • dim.BusinessReplies CreatedAt
  • fact.Mentions Date (For this, date was created from the date time Key in Power BI Query Editor)

Close and Apply

Define your Incremental Refresh Policy in Power BI Pro

go to your first table and choose incremental refresh

We are storing everything for 5 years; it's set to months so the partitions are smaller.

If this is running every single day then you would only need to refresh rows from the last 1 day. However, 1 month has been used as a just-in-case, in case for any reason the job is suspended or doesn't run.

Detect Data Changes has been used. A month's data will only be refreshed if the ImportDate for a record has changed (or there are new records).

No records are deleted so we don’t need to worry about this

Publish the new Power BI Report and Data Flow

You might be thinking at this point, but I dont want the filters that I have set for Desktop to be applied in Service. I want to see all my data in Service

Dont worry, in Service RangeStart and RangeEnd don’t keep the dates specified for the filters in Desktop.

They are set via your incremental refresh policy, as the partition boundaries for our 60 months. Instead of one RangeStart and one RangeEnd for the whole 5 years, you get a RangeStart and RangeEnd for month 1, a RangeStart and RangeEnd for month 2, and so on, breaking your 5 years down into much smaller partitions to work with.

You need to set up the Incremental Refresh policy for every table that has been filtered with RangeStart and RangeEnd

Test the Process

I have a visual for Number of Tweets

Today so far there are 11 Tweets

I also have an audit report

  1. The Logic App has been processing tweets realtime into Table Storage
  2. Run Data Factory (2 new records)
  3. Reprocess the Power BI Pro dataset – this fails with the error 'Resource name and Location need to match'
  4. If there hadn't been an error, we would move on to the Azure Data Studio check. Note that it now takes a second to run
  5. Check the Visual

Error Resource name and Location need to match

The dataset now has a warning sign. After speaking to Microsoft, this is a known issue and should be fixed in April. It is believed to be something to do with Detect Data Changes. So basically… to be continued.

Data Factory to move daily snapshot data in Azure SQL DB to files in a Gen2 Data Lake

I have the following problem to resolve (And this was my initial attempt at figuring out a solution)

I have a data source that contains a date, and the data holds a daily snapshot of each record. This means that a record will be in the data set once per day. This will amount to a lot of data, and we would rather hold it as files in Azure Data Lake Gen 2 Storage.

The logic is to pull each day's data from the source database into dated files within the Azure Data Lake.

Once running, this will probably pick up just one day's data each time because the rest of the files have already been created. However, on the initial run I want it to pick up all the days that have been loaded so far.

At the minute I have about 6 months of data to load

Tables in Azure SQL Database (destination)

CREATE TABLE [audit].[MemberDailyMetricsDates](
     [DATEUTC] datetime2 NULL
 ) ON [PRIMARY]

This table collects all the dates from the source snapshot table

Remember, the table records the same record every day with any changes to the record.

CREATE TABLE [audit].[IdWatermarks](
     [TableName] [nvarchar](100) NOT NULL, -- length assumed; adjust to suit
     [WatermarkValue] [bigint] NOT NULL,
     [WatermarkDate] [datetime] NULL,
     [WatermarkDate2] datetime2 NULL
 ) ON [PRIMARY]
 GO

this is where I add the dates from the tables to show where we are.

For example if we add the last lot of data from 02/01/2020 then this value will be stored in the watermark table. I record them in different formats just in case.

CREATE TABLE [audit].[ProcessingMetrics](
     [ID] [int] IDENTITY(1,1) NOT NULL,
     [TableName] [varchar](100) NULL, -- length assumed; adjust to suit
     [DateProcessed] [datetime] NULL,
     [DateUTC] datetime2 NOT NULL,
     [Duration] [int] NOT NULL,
     [NoRows] [int] NULL,
 PRIMARY KEY CLUSTERED 
 (
     [ID] ASC
 )WITH (STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF) ON [PRIMARY]
 ) ON [PRIMARY]
 GO
 ALTER TABLE [audit].[ProcessingMetrics] ADD  DEFAULT (getdate()) FOR [DateProcessed]
 GO

We can record meta data from the data factory Pipeline as it runs

Stored Procedures in SQL Database (destination)

/*Debbie Edwards
 17/12/2019 Created initial SP
 Create a watermark table
 EXEC [audit].[USP_Watermark] '2014-12-31 00:00:00.000'
 */

 ALTER PROCEDURE [audit].[USP_Watermark] @NullDate datetime2

 AS   

 BEGIN

 IF EXISTS (Select * FROM [audit].[IdWatermarks]  
 WHERE TableName = 'DailyMetrics')

 DELETE FROM [audit].[IdWatermarks]  
 WHERE TableName = 'DailyMetrics';

 --DECLARE @NullDate datetime2 SET @NullDate = '2014-12-31 00:00:00.000'

 WITH CTEDailyMetricsDates (DATEUTC)
 AS
 (SELECT ISNULL(MAX(DATEUTC),@NullDate) FROM [audit].[DailyMetricsDates]) 

 INSERT INTO [audit].[IdWatermarks] 
 (TableName, WatermarkValue, WatermarkDate, WatermarkDate2)
 SELECT 'DailyMetrics',CONVERT(bigint,CONVERT(datetime, MAX(DATEUTC))), MAX(DATEUTC), MAX(DATEUTC)
 FROM CTEDailyMetricsDates
 END 

This is the Stored procedure that you run to create the watermark

Currently I’m just running this for one table. You could redesign to run this SP for different tables.

Also note, If there is nothing in the table I am setting a default date to work from @NullDate

/*Debbie Edwards
 18/12/2019 Created initial SP
 Update Processing details
 EXEC [audit].[USP_UpdateMetrics]
 */
 ALTER PROCEDURE [audit].[USP_UpdateMetrics] @TableName varchar(100),   @DateUTC datetime2, 
 @Duration int,  @NoRows int

 AS   

 BEGIN

 INSERT INTO [audit].[ProcessingMetrics]
 (TableName, DateUTC, Duration, NoRows)
 VALUES

 (@TableName, @DateUTC, @Duration, @NoRows)

 END

the Pipeline will run this stored Procedure to add meta data to the ProcessingMetrics table

Data Lake Gen2 Storage Account

Along with the Destination SQL Database you need to have an Azure Data Lake

Quick set up for the Azure Data Lake: the posts Setting up an Azure Data Lake V2 to use with Power BI dataflows in Service (as a data source) and Setting up a Service Principal will give you information on how to:

  • Set up the Storage Account
  • Ensure Hierarchical name space is enabled
  • Create a file system
  • Grant reader Role to power BI service
  • Set up the Service Principal (Set up the App Registration)

Set up Data Factory

Now we are ready to set up everything with a Data Factory

create your data Factory and then go to author and monitor

Connections

AzureSQLDatabasereportingdb is the Source Azure SQL Database

Our Destination is a Destination Azure SQL Database AzureSQLDatabase[destination]reportingdb

And we have a Gen 2 Data Lake Storage Account, AzureDataLakeStorageGen2 (which will need the Service Principal setting up in order to use it) – see Setting up a Service Principal.

DataSets

the Linked Services are in place. Now we can add Data sets

Source Data Set – Azure SQL database

Add an Azure SQL Server dataset. The first thing we need is two parameters, for the TableName and TableSchema.

Connect this up to the Source Linked Service and use the parameters to create a table with schema

The schema will change whenever you use different table parameters, so there is no need to set it at this point.

Destination Data Set – Azure SQL database

Destination Data Set – Azure data Lake Gen2

First of all, create a FileName parameter. Because we are creating multiple files, each file needs to be named using the UTC date from the ForEach loop later. We need to set this up within the ForEach rather than here, therefore we will just set the name as a parameter at this point.

What we want to do is add the file into the file system we created in the Data Lake, and we want to add a date to every single file because these will be the snapshots.

In add dynamic content

@dataset().FileName

This is where the actual information for Filename from the data set will be added within the copy activity later.

I imported a schema from a test file I already created.

Pipeline 1 Create Member Daily Metric Dates into destination database

The first part of the process is to create a list of dates in the source data that we don't yet have in the Data Lake. The first time we run it, that's literally every date we have so far in the source database.

Lets have a look at the Activities in the Pipeline

LookupWatermarkOld

I know that my dates are in datetime2 format, so I'm using this column and aliasing it as WatermarkValue, so all the steps always use WatermarkValue no matter what the format. Here is the query in full:

SELECT MAX(WatermarkDate2) AS WatermarkValue From [audit].[IdWatermarks]
WHERE WatermarkDate2 IS NOT NULL
AND TableName = 'DailyMetrics'

We record the table name of the source in the watermark table. Basically this tells us the date we need to work from. Remember, we set up a default date for when there is nothing in the table to work with, which will run everything into the Data Lake.

CopyDailyMetricsDateUTC

This is where we get to the Copy activity. We are simply copying the dates of the snapshots that we haven't done yet into the table.

Here we add the parameters of the table we want to use from Source. These parameters were set up in the data set section.

Add a Query. We want the Distinct Date from the source table Where the DATEUTC (The column in the table) is greater than WatermarkValue from the Previous Watermark Activity

SELECT DISTINCT DateUTC FROM [dbo].[DailyMetrics]
WHERE DateUTC > '@{activity('LookupWatermarkOld').output.firstRow.WaterMarkValue}'

Now we are at the destination data set

The only item in here is DateUTC

USP.Watermark

This will trigger the Watermark Stored procedure and uses the Azure SQL database destination Linked Service

This is everything required for Section 1 that will run all the dates into our table based on snapshots we haven’t yet prepared

Pipeline 2 CopyMemberDailyMetricsDateUTC

Again, lets have a look at these Activities in more detail

LookupDates

this is the Lookup Activity that takes the full record set of Dates from the destination SQL Server

Note that we haven’t ticked First row only because the entire data set is required

CopyDailyFilesIntoDL (ForEach Activity)

We are taking the output.value from our activity LookupDates

Because it only has one column we don’t need to specify anything further. output.value means the entire record set
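So the Items box of the ForEach contains an expression along these lines (assuming the Lookup activity is named LookupDates, as above):

@activity('LookupDates').output.value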

Sequential – The Items will be read one by one from the data set

There are 2 activities within this ForEach loop. Double-click on the activity to see what is happening in the loop.

Again Lets have a look at this in more detail

CopyDailyMetrics

The source is the table in the source database because we want to store everything in daily snapshot files.

The query is a SELECT … FROM … WHERE statement: select all the columns from the DailyMetrics table where the DateUTC in the source data is equal to the DateUTC in @item(), which is generated by the ForEach activity.
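A sketch of what that source query might look like, following the same pattern as the earlier lookup query (DateUTC is assumed to be the column name in both the table and the ForEach item):

SELECT * FROM [dbo].[DailyMetrics]
WHERE DateUTC = '@{item().DateUTC}'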

The sink uses the Data Lake. Remember, in the dataset we set the FileName parameter, and here is where it gets set. Click on the FileName to view the dynamic content.

This is Concatenating a file name, the Item UTC Date and .csv


Use @item() to reference the current item of a single enumeration in the ForEach activity. This value is generated by the ForEach activity (in the pipeline).

We are converting the date to a string in order to use it within the file name. Because the ForEach loop is setting the item for us and we are inside this activity, we can create the file name in this copy activity rather than in the dataset itself.
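As an illustration, the dynamic content for the FileName could look something like this (the DailyMetrics prefix and the date format are assumptions; adjust to your own naming convention):

@concat('DailyMetrics', formatDateTime(item().DateUTC, 'yyyyMMdd'), '.csv')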

We can then simply Import schemas to view the mappings between the source and sink

SPUpdateDetails

Finally, we want to keep a record of everything we are doing in the loop, so we run our stored procedure to add the metadata into a table in our destination database.

We can use our SQL destination Linked Service for this

the Stored procedure contains 4 parameters.

We can take the Date from the Item in the foreach loop.

Duration, and NoRows can be added from metadata. See https://docs.microsoft.com/en-gb/azure/data-factory/copy-activity-overview Monitor programmatically

  • Duration: @activity('CopyDailyMetrics').output.copyDuration
  • No Rows: @activity('CopyDailyMetrics').output.rowsRead

The table name is simply DailyMetrics

USP_TruncateMemberDailyMetricsDates

Finally, now that everything is complete, we can truncate the date table. This sits outside of the ForEach loop.
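The stored procedure behind this activity can be as small as a single truncate (a sketch, assuming the dates table created earlier):

CREATE PROCEDURE [audit].[USP_TruncateMemberDailyMetricsDates]
AS
BEGIN
    TRUNCATE TABLE [audit].[MemberDailyMetricsDates]
END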

The max date is held in the watermark table, which will be used to create the new files next time (set in Pipeline 1).

Test your solution

Now we have everything in place, we can test each pipeline by clicking Debug.

You can see the files coming in via Azure Storage Explorer

If you get errors, a great way of debugging is to go into the code. I had an error where, after a file name, it had /n/n.

I simply removed this empty row and it worked.

Pipeline 3

There is now a Pipeline 1 to create the dates. And pipeline 2 to create the files. This is good because they can be tested separately.

We need a top level Pipeline to run them

And this is the new model for the flow of this Data factory

Add a trigger

Now we have successfully tested the pipelines, we can set them up on a trigger by adding a parent pipeline that runs all the pipelines in order.

Another Post will be created on Adding Triggers

Considerations

You need to make sure that the last file is complete when you run the pipeline. In our case the snapshot table is loaded at 12 AM and takes around 20 minutes, so we need to set this pipeline off at 1 AM.

Add Alerts to Monitor the Pipeline

https://azure.microsoft.com/en-gb/blog/create-alerts-to-proactively-monitor-your-data-factory-pipelines/

Another Post will be created on Adding Monitors

Setting up a Service Principal for Azure Data Lake Gen 2 (Storage) to use with Data Factory

An Azure service principal is a security identity used by user-created apps, services, and automation tools to access specific Azure resources. Think of it as a ‘user identity’ (login and password or certificate) with a specific role, and tightly controlled permissions to access your resources

Azure Service Principal

I am constantly having to remind myself how to set up the Service Principal for Access to things like Azure Data Lake Gen 2 when I am setting up a data factory (Or using the storage with another web app).

So I wanted to write a blog post specifically on this.

As the example, imagine you are moving data from an Azure SQL Database to files in Azure Data Lake Gen 2 using Azure Data Factory.

You attempt to add a Data Lake Connection but you need a Service Principal account to get everything Authorised.

You need this so the Data Factory will be authorised to read and add data into your data lake

An application (E.g. data Factory) must be able to participate in a flow that requires authentication. It needs to establish Secure credentials. The default method for this is a client ID and a Secret Key.

There are two types of permissions

Application permissions – no user context is required. The app (e.g. Data Factory) needs to access the Web API by itself.

Delegated permissions – the client application (e.g. Data Factory) needs to access the Web API as a signed-in user.

Create an App

In Azure choose App Registrations

Here you can create an app – New Registration

Provide a name for your app. e.g. DataFactoryDataLakeApp

Grant your Registered App permissions to Azure Storage

This will enable your app to authorise Requests to the Storage Account With Azure Active Directory (AD)

You can get to your app by going to Azure Active Directory

Then App Registrations and choose the App

In your new App, go to Overview and View API Permissions

Next go to Add a permission

Go to Azure Storage API which contains Data Lake Gen 2

Notice that we are setting up Delegated Permissions for Azure Storage

You are warned that Permissions have been changed and you need to wait a few minutes to grant admin consent.

I am not an admin so I always get my admin to go into Azure Active Directory and Grant Admin Consent for Peak Indicators

Note that your app now has configured permissions for Azure Active Directory Graph and Azure Storage

Assign your new app to a subscription

Now you have an app, you need to assign it Contributor status at the level of service you require in Azure: subscription level, resource group level or resource level.

For this app I am going to set it up against the subscription. First go to the Subscription you want to add it to and then Access Control (IAM)

I have added the app as a contributor

Creating a Key Vault

We will be selecting and creating IDs and secrets in the next steps, but instead of simply trying to remember your secret, why not store it in a Key Vault?

  • Centralise Application Secrets
  • Store Secrets and Keys Securely
  • Monitor Access And Use

Lets set one up in our Proof of Concept area.

Create a Key vault if you don’t have one already

remember to add any tags you need before Review + Create

Once completed you can go to the resource (E.g. Data Factory) but for the time being that is all you need to do

Application ID and Tenant ID

You can now go into your new app in Azure (App registrations) to get more details for Data Factory (When you set up the connection)

Tenant from Data Factory will be mapped to Directory (Tenant ID) from the App Overview

Service Principal ID from Data Factory will be mapped to Application (Client) ID From the App Overview

Create a Client Secret

Next, create your Client Secret.

In your App go to Certificates and Secrets

Click New Client Secret

I'm going to allow this secret to expire in a year (anything using the app will then start to fail, so you would need to set a new secret and re-authorise).

We can add this into the Key Vault so we don't lose it, because once you have finished here you don't see it again.

Open a new Azure Window and Go to your new Key Vault

Go to Secrets

Click + Generate Import

Notice I have set the expiration date to match the expiry date of the app

Ensuring the Access is set for the Data Lake Storage

For this you need to have a Data Lake Gen 2 set up and Microsoft Azure Storage Explorer downloaded

In Microsoft Azure Storage Explorer, navigate to the storage

Then Right click on the File System (In this case factresellersales) go to Manage Access and add the app.

Notice that we have set Read Write and Execute for the app on the file system and all the files will inherit these permissions

Adding The Data Lake Gen 2 Connector in Data Factory (Test)

I have a Data Lake Gen 2 with some files and I want to move them into a SQL Data base.

To test, Open or create a Data Factory

Go into Author and Monitor. Then Author

Go to Connections, +New and Choose Azure Data Lake Gen 2

Tenant = Directory (Tenant ID) from the App Overview

Service Principal ID = Application (Client) ID From the App Overview

Service Principal Key – you can get this from Azure Key Vault. Click on Secrets, then the name and the current version.

You can then copy the secret value and add it into Data Factory

Test your Connection

Create the Data Lake Data Set

Here is where you know that all your efforts were worthwhile.

Create a new Dataset which will be an Azure Datalake Gen 2

This is great. I have access to the files in the data lake. Achievement unlocked.

Incremental Processing in Data Factory using Watermark Table

I am pulling tweets into an Azure Table Storage area and then processing them into a Warehouse

The following shows the very basic Data factory set up

Connections

I have created a Linked Service for the Azure Storage Table PowerBIMentions

And another Linked Service for my Azure SQL Server Table PowerBIMentions

Datasets

the Storage Table was set up in Logic Apps. I wont go into detail about the schema here

I have created a SQL Table to Mirror the Azure Table Storage

Watermark Table in SQL

The Watermark Table is as follows

CREATE TABLE [staging].[watermarktable](
[TableName] [varchar](100) NOT NULL, -- lengths assumed; adjust to suit
[WaterMarkDate] [datetime] NULL,
[WaterMarkValue] [varchar](100) NULL,
[WaterMarkOffSetDate] [datetimeoffset](7) NULL
) ON [PRIMARY]

For the time being, the WaterMarkValue is used to hold the date in the same format as it appears in the Azure Table Storage.

Ignore the offset date for the time being

I have then created a stored procedure that adds the table name staging.PowerBImentions, the maximum created date from what is in the table, and some extra script to format that date into the format used in Azure Table Storage.

To understand how this should be formatted you can look at the Azure Table using Microsoft Azure Storage Explorer

https://azure.microsoft.com/en-gb/features/storage-explorer/

As you can see, the format is YYYY-MM-DDTHH:MM:SS.000Z

I haven't included the code for the SP here, but it is fairly straightforward: you take the max date from the table you are interested in and include code to format the value correctly.
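As a rough sketch, such a stored procedure could look something like this (the table and column names are assumptions based on the rest of this post):

CREATE PROCEDURE [staging].[USP_WatermarkPowerBIMentions]
AS
BEGIN
    DELETE FROM [staging].[watermarktable]
    WHERE TableName = 'staging.PowerBImentions';

    INSERT INTO [staging].[watermarktable] (TableName, WaterMarkDate, WaterMarkValue)
    SELECT 'staging.PowerBImentions',
           MAX(CreatedAt),
           -- ISO 8601 format to match the CreatedAt values held in Azure Table Storage
           CONVERT(varchar(23), MAX(CreatedAt), 126) + 'Z'
    FROM [staging].[PowerBImentions];
END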

Pipeline

Added a Pipeline called CopyPowerBIMentionsIncremental

The Source is the Storage mentions table

the Sink (destination) is the Azure SQL Table

And then we can easily import the schemas and map

Then run the very simple pipeline, which just moves the data and then runs the watermark stored procedure to recreate the max datetime.

Change Pipeline to include a lookup activity

Now, the next time we process, we only want to bring new records into the table.

the Query to select the correct date from the Watermark table

SELECT MAX(WatermarkValue) AS WatermarkValue From [staging].[watermarktable]
WHERE WatermarkValue IS NOT NULL
AND TableName = 'staging.PowerBImentions'

You can test this in SQL Server Management Studio

CreatedAt gt '@{activity('LookupWatermarkOld').output.firstRow.WaterMarkValue}'

CreatedAt is the column in the source Azure Table Storage table, which holds the date in the format shown earlier. The filter syntax uses the following operators:

Operator – URI expression
  • Equal – eq
  • GreaterThan – gt
  • GreaterThanOrEqual – ge
  • LessThan – lt
  • LessThanOrEqual – le
  • NotEqual – ne

Then it uses the Lookup activity and the WaterMarkValue from the SQL Table

Test the New Watermark Logic

in SSMS

SELECT COUNT(*) FROM [staging].[PowerBImentions]

there are 16 records and the max date is 2019-11-01T16:00:55.000Z

Run the new data Factory (Debug)

The new record count is 21

The new Max Date is 2019-11-01T17:01:52.000Z

And let's do a quick check to make sure there are no duplicates

SELECT TweetText, COUNT(*) FROM [staging].[PowerBImentions] GROUP BY TweetText HAVING COUNT(*) > 1

It's all working. We have no duplicates, and this will ultimately help keep the processing down.

There is another way to do incremental processing by using change tracking. Ill be looking at that in a later post

Power BI Service Data Lineage View

I was logging into Power BI this morning when I saw this exciting new feature

we are always looking at new solutions to provide good data lineage so this is well worth a look

Data lineage includes the data origin, what happens to it and where it moves over time. Data lineage gives visibility while greatly simplifying the ability to trace errors back to the root cause in a data analytics process. 

Wikipedia

I have an App workspace set up for Adventureworks so lets have a look at Lineage using this project

Column 1 is my data source. I can see I’m using a local database and I’m also using an xlsx spreadsheet to bring in data.

In most of my projects I'm working on the ETL in Data Factory, transforming data in stored procedures etc. For example, for a social media feed, I have a Logic App that moves tweets to an Azure Table Storage NoSQL table. Data Factory then transfers this data across into a central Azure Data Warehouse. The Power BI lineage would pick up at the Data Warehouse stage; it won't take into account that there is a lot of work previous to this.

Column 2 is the data set in Power BI

Column 3 provides Report information

Column 4 displays the Dashboards

You can click on a data flow node to drill down into more detail

Currently you cant go any further to look at the data items

Click on the Link icon to see the data flow for that item. In this case the Report.

This is a great start but there definitely needs to be more information here to make it something that you would want to use as a proper Data Lineage Tool

  • It would be good to see the fields in each Entity for the Data Sets
  • As an extra, it would be great to see what fields are being used in Measures and calculated Fields
  • Reports – For me, I'd like to know for every page in my report
    • What field am I using from the data source
    • What calculated columns I have created (Even better with the DAX Logic)
    • Any Name changes from Data Source to Power BI
    • What measures I have created (Even better with the DAX Logic)
  • For the Dashboard, what items I am using in the dashboards (fields, measures, calculated columns)
  • An important part of data lineage is getting an understanding of the entire process. This includes data transformations pre Power BI. If you can't do that in here, it would be great to be able to extract all the information out so you can use it in some way with your other lineage information to provide the full story. For example:

Azure Data Catalogue

Azure Data Catalog is a fully managed cloud service where users can discover and consume data sources. It is a single, central place for all of the organisation to contribute to and understand all your data sources.

https://eun-su1.azuredatacatalog.com/#/home

I have already registered Our Data Catalog, and I have downloaded the desktop app

As an Example I want to connect to Azure Table Storage (Connect using Azure Account name and Access Key)

At this point I’m registering everything in the storage table. then I can view the information in the Azure Portal.

You can add a friendly name, a description, an expert (in this case me), tags and management information.

I have added Data Preview so you can view the data within the object. there is also documentation and Column information to look at

In the data catalog you can manually add lots of description to your tables along with documentation.

This is great for providing lots of information about your data. You can explore databases and open the information in other formats (great if you need to supply information to another data lineage package).

I will be having a look at the Azure Data catalog in more detail later to see how it could help to provide full data lineage

Azure Data Factory

Data Factory is the Azure ETL orchestration tool. Go into Monitoring for lineage information. However, there doesn't seem to be a way to export this information for use elsewhere, and Data Factory won't take into account the work done in, for example, a stored procedure.

Again this is another area to look into more.

Stored Procedures

When you use stored procedures to transform your data, it's harder to provide automated lineage on your code. There are automated data lineage tools for SQL out there, but it would be great if there was a specific tool within Azure that creates data lineage information from your stored procedures.

Azure Logic Apps

Data for my project is collected via Logic Apps before being Processed into an Azure Data Warehouse.

Essentially, we need our data lineage to capture everything all in one place.

And just as important, everything should be as automated as possible. If I quickly create a measure, the data lineage should reflect this with no manual input needed (unless you want to add a description to the new measure as to why it was created).

Azure – Data Factory – changing Source path of a file from Full File name to Wildcard

I originally had one file to import into a SQL Database: Survey.txt.

The files are placed in Azure blob storage ready to be imported

I then use Data Factory to import the file into the sink (Azure SQL Database)

However, the data actually arrives as one file per year. For the full logic I need to be able to add a file to the blob storage to get it imported, and each file name will contain the year.

This means I need to change the Source and Pipeline in Data Factory

First of all remove the file name from the file path. I used 1 file to set up the Schema. All files are the same so this should be OK.

Next I go to the Pipeline and set up the Wildcard in here Survey*.txt

When the pipeline is run, it will pick up every file that matches the wildcard Survey*.txt.
