
Azure Logic App – Copying a file from Sharepoint to a Data Lake

I have been asked to set up a Logic App in Azure (the Azure equivalent of Power Automate) to copy specific file(s) from a SharePoint folder and add them to an Azure Data Lake.

The first example file is around 16,000 rows and not likely to grow significantly. The same is true of the other files.

There is a specific use case behind this first Logic App:

  • The data in the CSV file(s) is updated every day, so the file name stays the same
  • We need to copy the file and overwrite the copy in the data lake every day, after the task that updates the SharePoint file has finished (around 5 PM)
  • We want the Logic App to be triggered from Data Factory
  • Once the Logic App has run, we want to trigger the pipeline that populates the SQL database from the file in the data lake.

Set up the Logic App

In Azure, go to Logic App and select New.

Log Analytics: gives you richer debugging information about your Logic Apps at runtime.

Consumption plan: the easiest way to get started and fully managed (pay-as-you-go). Suits workflows that grow slowly or are fairly static.

Standard plan: newer than the Consumption plan, runs in a single tenant and is billed at a flat monthly fee, which can give you cost savings.

Create the Logic App

Once you have added tags and created the resource, it's time to build the Logic App.

Because we want to trigger it from Azure Data Factory, choose the trigger 'When a HTTP request is received'.

The HTTP POST URL will be used in Data Factory to trigger the Logic App.

I have added a JSON schema that covers the important information for this project: the Container for the data lake, the folder, the file name and isFolder (which becomes more important a little later).

{
    "properties": {
        "Container": {
            "type": "string"
        },
        "fileName": {
            "type": "string"
        },
        "folder": {
            "type": "string"
        },
        "isFolder": {
            "type": "boolean"
        }
    },
    "type": "object"
}
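As a rough illustration, a request body that matches this schema would look something like the following (the container, folder and file names here are hypothetical, not the real project values):

{
    "Container": "raw",
    "fileName": "DailyExtract.csv",
    "folder": "sharepoint/lookups",
    "isFolder": false
}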

List Folder

Now we want to list the SharePoint folder, so create a new step and search for List folder.

'Returns files contained in a SharePoint folder.'

Next you have to sign into SharePoint with a valid account that has access to the SharePoint site.

Here is where we have a question. For this test my own username and password have been used, but obviously I change my password periodically, which means the connection will need updating manually whenever that happens.

What we need is a way of logging into SharePoint that isn't tied to a user and that we can use within the Logic App. This needs further thought and investigation.

When you log in, a SharePoint API connection is created in the Azure resource group.

To get the site address you can go into SharePoint, click the … against the file and choose Copy link.

The link needed amending slightly because it needs to be in the form tenant.sharepoint.com/sites/ProjectMainArea

If you have access you should then be able to click the folder icon against File Identifier and select the correct area.

For Each

Next step: for each 'Body' item returned by the List folder step, we get the file content. Go to New step and choose the For each control (because there will be multiple files).

Get File Content

Now we want to Get file content from SharePoint.

'Gets file contents using the file identifier. The contents can be copied somewhere else or used as an attachment.'

You need to use the same SharePoint site address as before. Then click on File identifier and choose Id from the SharePoint dynamic content pop-up.

So here we can see that from the List folder step we have lots of file metadata we can use, like DisplayName, Id, LastModified etc.

We know we need ID for Get File Content

We are at a point where we can run this as a test.

Note that so far we have this set up,

but we hit specific issues:

Status 404 File not found

cannot write more bytes to the buffer than the configured maximum buffer size of 10457600

So we have two issues to resolve, and after a bit of help on the Q&A forums we find out that:

List folder ('Returns files contained in a SharePoint folder') actually also returns folders, which error because they are not files.

Logic Apps aren't really set up for large files. There doesn't appear to be any way to get past the size limit, so we need to check our files and think of ways to bring through smaller data sets if need be.

Thankfully our files are way below the threshold and the business thinks they won't grow too much.

So here is where we can start applying criteria, which we want to do anyway because we only want certain files.

  1. If it's a folder we don't want to use it
  2. If it's over 10457600 bytes in size we don't want to use it
  3. Only bring through files called…….

So we need to change our For each.

Within the For each, add a new step and search for Condition,

and add your conditions (combined with And/Or).

Then you can move Get file content into the True branch.

So if IsFolder is false and the size is less than 10457600 bytes, we grab file A or file B.
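For reference, in the Logic App code view the condition ends up looking roughly like this. This is only a sketch: 'For_each' is the default loop name, 'File A' and 'File B' stand in for the real file names, and the metadata property names come from the List folder output.

"expression": {
    "and": [
        { "equals": [ "@items('For_each')?['IsFolder']", false ] },
        { "less": [ "@items('For_each')?['Size']", 10457600 ] },
        { "or": [
            { "equals": [ "@items('For_each')?['DisplayName']", "File A" ] },
            { "equals": [ "@items('For_each')?['DisplayName']", "File B" ] }
        ] }
    ]
}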

When you now test this Logic App, Get file content should succeed, with most items not even meeting the criteria.

Create Blob

Finally, within the True section we need to add the file to our data lake.

Search for Create blob.

Here you have to sign into your blob storage, which again creates another API connection in Azure.

You have to supply the storage account name and choose an authentication type. Access Key has been used and the details added here. Normally in Data Factory the access key is obtained through a Key Vault, so more work is needed to come up with the most secure way of doing this. There are two other authentication types to choose from.

More investigation is needed into these other approaches.

Now we can do a full test of the Logic App

Testing the Logic App

When you trigger the Logic App:

The Body contains a long list of every object. Really handy to know what the details are inside this action.

For testing, this was copied into a Word document.

Next comes Get file content.

Now most of the files don't satisfy the condition.

Next was clicked to get through to a file in Get file content (the first one appeared as number 32).

And now we can see the Body of the Create blob. (This happens for every file specified in the criteria.)

And if you use the Microsoft Azure Storage Explorer app you can check that the files have indeed been updated (either it's a new file or it overwrites what is already there).

Data Factory

Now we have saved the Logic App, we want to trigger it from Data Factory.

Create a pipeline and add a Web activity.

Copy the HTTP POST URL from the Logic App and paste it here.

For the Body I simply used the JSON from the start of this article.

Now you can trigger this pipeline along with all your other pipelines to load the data into your data lake and then into SQL to be used for analytics.

https://www.mssqltips.com/sqlservertip/5893/transfer-files-from-sharepoint-to-blob-storage-with-azure-logic-apps/


Azure SQL Database Dev to Production Part 4

I have had quite a lot of issues with the whole dev to prod process for the SQL database. My last attempt, which I wrote up in this blog, worked well until I shut the project down. Then, once reopened, I would always lose my project or Git connection, so I went back to the drawing board, did a lot more research, and here are my new findings.

There is a Data Factory part to this, but I have already written a blog about that and it has worked consistently ever since setting up the dev to prod process.

Resources used

  • Azure SQL Database and Server
  • Visual Studio (Enterprise)
  • Azure DevOps

Azure Devops Repository

First of all you need to have Azure DevOps set up (I won't go into detail on this here).

In the DevOps repository I have a folder for Data Factory. The folder for SQLDB will be created later.

  • In DevOps, ensure you have a Git repository created, then click Clone to copy the Git location URL

In this example I am cloning right at the top of the repository

  • Click the Copy Button

Visual Studio

You cannot do this yet in Visual Studio Code. It has to be Visual Studio; I have Visual Studio Enterprise 2019.

  • You can copy the repository location from the Clone copy (or browse a repository – Azure DevOps or GitHub)
  • Make sure you are happy with the path for the local copy
  • Click Clone. Your local repository then shows in Solution Explorer
  • This has added the folder on your C drive (it added the top level and the Data Factory and SQLDB folders)
  • And you can see this project in Solution Explorer
  • Copy the path on the C drive (including the folder, for example SQLDB)
  • In Visual Studio, in the top bar choose File – New – Project
  • Choose SQL Server Database Project – Next

The project will be SQLDB and will contain the SQL objects.

  • Click Create

In Visual Studio SOLUTION EXPLORER: You can see your empty database objects.

On the C Drive, Note you now have a SQLDB folder along with the Data Factory Folder

  • Right click on the database name in Solution Explorer and go to properties

It's important to be on the right version for the target platform.

  • Right click on the database name and choose Import – Database
  • Select the connection for the development database (and use Show connection properties to make sure your username and password are OK and the database connects)
  • Import the objects into the local project (no need to select it in the above box)
  • Then click Finish. Note that all your objects are now in Solution Explorer and on the C: drive (your local copy)
  • Is the project complete? Go to Build – Rebuild Solution, which checks and validates the objects

Any time anything changes you need to rebuild your solution to update the code.

Warnings and Errors

For warnings and errors you can see all the issues by clicking on them. The build may fail because of errors; these always need resolving before you deploy to the target DB, e.g. production.

Example warning: SQL71558: The object reference [staging].[].[KEY] differs only by case from the object definition [staging].[ST2].[Key].

  • Right click on the database in Solution Explorer and go to Properties.
  • In Project Settings, untick validate casing on identifiers

Example warning: SQL71502: Procedure: [dim].[USP_Date] has an unresolved reference to object [sys].[all_objects].

You can add the master database as a reference (right click on References).

Add Database Reference

Rebuild your codebase. It's important here to make sure your warnings and errors have been dealt with.

Rebuild updates your project locally after the import – I will look at how making changes with, for example, SQL Server Management Studio alters the process in a later blog.

  • Publish to the Git repository – go to Git Changes, add a note describing your change and Commit All

Then click the arrow to Push changes to the GIT repository

We now have the code in the repository in DevOps

It seems annoyingly easy to slightly mess up your folder structure. Here I have a SQLDB folder and another SQLDB folder inside it.

I only wanted the one. This keeps happening to me and it's very frustrating. Any pointers to where I went wrong would be really appreciated.

Create your CI (Continuous Integration) Pipeline in Azure Devops

Now we have the Code in GIT we can create our artifacts for the release pipeline.

  • In Azure DevOps go to Pipelines (the build pipelines rather than Releases for this step)
  • Click New Pipeline
  • And choose your repository
  • Select a template

Right click on tasks and remove selected tasks until you are left with the following

You don't really have to do much with these three jobs.

At the pipeline level ensure you use the right agent, for example windows-2019. We had errors because we use the OPENJSON function in the SQL code, but setting the right agent resolved the issue.

All the other jobs are parameterised. This should now be all set.

  • Save and queue, and you can then run the pipeline to create your artifacts (Save and run)
  • There may be warnings here. For some reason the warnings you clear in Visual Studio still show in DevOps. I would like to do a bit more research on this.
  • But if a warning hasn't failed the process, you should now have your continuous integration artifacts.

Create your CD (Continuous Delivery) Release Pipeline in Azure Devops

Now we are onto Continuous Delivery. Moving the new code into Azure SQL DB

  • In Azure DevOps go to Pipelines – Releases – New release pipeline
  • For Artifacts, click Add

We want to use the latest build.

Now add a stage. In our case we are only deploying to Prod, so it's a simple release.

  • Start with an Empty Job

Add a task to the job

Here we choose the production subscription.

And we link to the DACPAC file that was created by the Visual Studio build. The DACPAC contains all the SQL objects.

The database uses variables and you can set these up in the variables tab

You can create a release to update your Production database

Once pushed, check your SQL database in Production to make sure you are happy that your changes have gone through.

And you can save your Visual Studio project and reopen it. The next stage is to update some objects and go through the process again, so watch this space.

Autoscaling with Power BI Premium Gen 2

We have been working with Power BI Premium for a few weeks. Simply switching it on and seeing how it goes.

However, it's worth paying some attention to the autoscaling that you get with Premium Gen 2 (preview).

Auto Scaling

Previously our Power BI Premium capacity could struggle under high load. For example, if we reached full capacity someone's automatic refresh would fail, or if too many things were happening on the capacity, report users would find that reports took longer to render.

There are lots of use cases where this could happen so Auto scaling is definitely something that can help with these issues.

You can now scale and autoscale using Azure Pay as you go, which is around £62 per vCore for a 24 hour period

We use the DTU Pricing structure in Azure for SQL DBs. What is the difference between the DTU and the vCore pricing structure?

Autoscaling is an opt in feature and can be charged to an Azure Subscription

Once the spike is over, scale down happens and you stop paying for the scale up.

Autoscale Notifications

Toast notifications pop up in the Admin Portal's Capacity settings to let your admin know when autoscale is running.

It would be good to be able to tell everyone working with Power BI when this is happening simply for reference

Get Started with setting up Autoscaling in Azure

Go into Azure and select Subscriptions. You need to have decided beforehand which subscription will be used for autoscaling billing.

Next create an autoscaling resource group.

Enable Autoscale in Power BI Admin Portal

To do this you need to be a Power BI admin (or Global admin).

In addition, the person needs to be at least a Contributor on the Azure subscription to go through all the steps successfully.

Go to Capacity Settings

Make sure Premium Generation 2 is already enabled

Then Select Manage Auto Scale

Enable auto scale and then select your Azure Subscription

And then assign the number of vCores to the Autoscale

Here we have set the max of 2.

How many vCores does a Premium P1 capacity have?

8 virtual Cores

Once completed you are all set. There are some questions. Apart from the Toast pop ups are there other ways to monitor and log the usage of Autoscaling?

This needs its own page but there are Apps you can try like https://appsource.microsoft.com/en-us/product/power-bi/pbi_pcmm.capacity-metrics-dxt?tab=Overview

The big takeaway from all this is that we should never be in a situation where we are surprised that we have reached capacity. Or, if we do set up Auto Scaling it should not be used on a day to day basis.

More investigation is needed on how to set up proper monitoring so we have full knowledge about what is going on in Premium Capacity. And we must never forget that there will be Pro workspaces already set up and these shouldn’t go under the radar either.

We will have a look at these issues in future posts

Azure Data Factory – Moving from Development to Production Part 2: Using Key Vault for Linked Services

In Azure Data Factory – Moving from Development to Production we looked at how we can use Azure DevOps to move the JSON code for the development Data Factory into production.

It's going well; however, I have been left with an issue. Every time I move into production, the details for the linked services have to be re-added. Let's have a look at the SQL Server and the Data Lake Gen 2 account.

Development

Notice that the information has been entered manually including the Storage account Key.

Again, in this instance the information has been entered manually. SQL Server Authentication is being used because we have a user in the SQL DB with all the privileges that Data Factory Needs.

DevOps Data Factory release Pipeline

Go into Edit of the Release Pipeline

Within Prod Stage we have an Agent Process

We are looking for the Section Override Template Parameters

Note that currently Account Key and SQL Database Connection String are null.

Provisioning Azure Key vault to hold the Secrets

Managed Identity for Data Factory

Copy your Azure Data Factory Name from Data Factory in Azure

You need to have a Key vault set up in Development

GET and LIST permissions allow Data Factory to read secret information from the Key Vault.

Paste the Data Factory name into Select principal.

In Key Vault, create a secret for the Azure Data Lake Storage.

For the Key Vault secret, I gave it the secret value by copying across the access key from the storage account's Access keys section.

The content type was simply set to the name of the storage account for this exercise.

In Data Factory Create a Linked Service to the Key Vault

Test and ensure it successfully connects

Use the New Key Vault to reset the data Lake Linked Service

How does this Data Lake Linked Service change the DevOps Release Pipeline?

It's time to release our new Data Factory settings into production. Make sure you have published Data Factory to the DevOps Git repository.

Production Key vault Updates

We need to update Production in the same way as Development

  1. In the production Key Vault, add the production Data Factory name to Access policies (as an application) with Get and List on secrets
  2. Ensure that there is a secret for the production data lake key, AzureDataLakeStorageGen2_LS_accountKey
  3. Check that your Key Vault connection works in production before the next step

Azure DevOps Repos

In Azure Devops go to your Data Factory Repos

Notice that your linked service information for the data lake now references the Key Vault secret. It's not hardcoded anymore, which is exactly what we want.
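For illustration, the data lake linked service JSON in the repos ends up looking roughly like this. This is a sketch only – the storage URL, linked service name and Key Vault linked service name below are examples rather than the exact values from this project:

{
    "name": "AzureDataLakeStorageGen2_LS",
    "properties": {
        "type": "AzureBlobFS",
        "typeProperties": {
            "url": "https://devuksprojectdl.dfs.core.windows.net/",
            "accountKey": {
                "type": "AzureKeyVaultSecret",
                "store": {
                    "referenceName": "AzureKeyVault_LS",
                    "type": "LinkedServiceReference"
                },
                "secretName": "AzureDataLakeStorageGen2_LS_accountKey"
            }
        }
    }
}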

Azure DevOps Release Pipeline

Go to Edit in the Data Factory release pipeline

When the job in the Prod stage is clicked on, you can go to the Override template parameters section, and notice there is now an error.

AzureKeyVault1_properties_typeProperties_baseUrl is the missing parameter. Basically, at this point you need to delete the code in the Override template parameters box and then click the button to regenerate the new parameters.

Override with the production information (I saved the old code first so I could re-copy the bits I need).

Once done, notice that the -AzureDataLakeStorageGen2_LS_accountKey "" parameter is now gone, because it's being handled by the Key Vault.

Let's save and create a release.

New failures in the Release

2021-02-08T13:45:13.7321486Z ##[error]ResourceNotFound: The Resource ‘Microsoft.DataFactory/factories/prod-uks-Project-adf’ under resource group ‘prd-uks-Project-rg’ was not found. For more details please go to https://aka.ms/ARMResourceNotFoundFix

Make sure that your override parameters are ok. I updated:

  • Data Factory name from Data Factory
  • Primary endpoint data Lake Storage from Properties
  • Vault URI from Key vault Properties

Repeat the Process for SQL Database

With everything in place we need to introduce a connection string into Key Vault

I have a user set up in my SQL database. The user has been granted SELECT, INSERT, UPDATE, DELETE, EXEC and ALTER on all schemas.

I want to include the user name and password in the Connection string and use SQL authentication

Secret Name

AzureSQLDatabase-project-ConnectionString

Secret

Server=tcp:dev-uks-project-sql.database.windows.net,1433;Database=dev-uks-project-sqldb;User Id=projectDBowner;Password=Password1;

The connection string has been set as above. For more information on connection strings see SQL Server connection strings.

Go back to Data factory and set up the new secret for SQL Server

This is successful

Data Factory and DevOps

  1. Back in Data Factory, publish the new linked service code
  2. Go into the Dev repos and check that you are happy with the new Key Vault information in the linked service code
  3. Go to the Prod Key Vault and make sure the secret is set with the connection string for the SQL DB
  4. Test that the Key Vault secret works in Prod
  5. Back in DevOps, go to Release pipelines and Edit for the adf release CD pipeline (release pipelines are continuous delivery; build pipelines are CI, continuous integration)
  6. Edit the Prod stage (I only have Prod) ARM Template Deployment task, and copy the Override template parameters code into a file for later
  7. Delete the code and click the … to get the latest parameter information
  8. Re-add your production parameters; most can be taken from the code you just copied
  9. Create a new release
  10. Go to Linked services in Data Factory and check they are still production. They still use Key Vault and they still work

Now this is all in place, the development Data Factory can be published up to production. There is no need to reset linked services, and all your information about keys and passwords is hidden in the Key Vault.

Azure SQL Database. Publishing from Development to Production Part 2

The Dev to Prod Process

Initially in part one we set up the process with Visual Studio 2019 and Devops and moved all our objects across to Production. Then with Data Factory we moved data from the production data lake into Production Azure SQL DB

We have all the source data in a data lake (and it's been confirmed that the production data lake is the same as the development data lake).

We have a Data Factory in Production that goes through a DevOps release pipeline so we should now be able to use the Production Data Factory to Load all the Production Data into the Production SQL database on a regular basis.

What happens when you already have Objects and data in Target?

Only the changes will be released. So the next time you release into production you are releasing the delta

Lets see this in action

  • The Initial SQL database was released into Production with Visual Studio
  • A Production Data Factory moved all the data into the new Objects
  • Now we have an updated Dev SQL Database.

Open your visual Studio Project

Open the project that was created in the last session.

In SQL Server Object Explorer

You have the Azure Server and Database. Right click and Refresh

And you have the Local Project DB which contains all the Objects. We can Check the Schema differences between the project DB and the latest DB within Azure

Go to the dev SQL Database (Click on the icon to the left of the server to open it up)

On the Azure SQL database, right click and choose Schema Compare.

For the target, click Select Target.

Open the local DB project. The database in this case has the same name as your solution in Solution Explorer. (Now I know I should have given my solution the project name and my project a _DB suffix to differentiate the two.)

Click OK

Click Compare.

Now you get to see what has been deleted. In this case a table and a procedure have been dropped.

Next we can see changes. If the Table Change is clicked on, you get more information about that change in the object definitions. In this case a test column has been added.

This creates a quandary when it comes to loading in the data because this table should be fully populated but the new column will be blank. Is it possible to do a full load for these updated tables with Data Factory, OR do we need to look at something more complex?

And finally, additions. In this case there are lots of new tables and procedures, and two new functions.

Once happy Click Update and your changes will be published into Solution Explorer

To Check, have a look for some of the new tables, SPs etc in Solution Explorer

Once completed you can click the x icon to close the comparison window and you can save your Comparison information

Rebuild the Project in Visual Studio

Now we want to rebuild our project within Solution Explorer.

Right click on the project in Solution Explorer and choose Rebuild. This rebuilds all the files.

  • Rebuild rebuilds your entire project
  • Build only builds the changes

Process your Changes with GIT

Now it's in your project, you need to process those changes with Git.

In Git Changes, Commit All and Push.

And remember to add a message


These objects should now be in DevOps. You can go to the DevOps repos, then to your database-specific project and check for new tables, SPs etc.

My new junk dimension objects are there, so this is all working.

Release the new database objects into Production database

Now all the code is in the repos, we can push the new and updated objects into production with a DevOps release pipeline.

There is already data in my production database. As an initial starting point I do a quick check on a few tables to get a feel for the data.

This SQL Script allows you to do a quick check on row counts in the production database

SELECT
    QUOTENAME(SCHEMA_NAME(sOBJ.schema_id)) + '.' + QUOTENAME(sOBJ.name) AS [TableName],
    SUM(sPTN.Rows) AS [RowCount]
FROM
    sys.objects AS sOBJ
    INNER JOIN sys.partitions AS sPTN
        ON sOBJ.object_id = sPTN.object_id
WHERE
    sOBJ.type = 'U'
    AND sOBJ.is_ms_shipped = 0x0
    AND index_id < 2 -- 0:Heap, 1:Clustered
GROUP BY
    sOBJ.schema_id,
    sOBJ.name
ORDER BY [TableName]
GO

Azure DevOps

Choose the database repository (You should also have a repository for data factory)

Build Pipelines

Go to Pipelines. Before releasing to Prod we actually have to build all our code into an artifact for the release pipeline

Click on your Project _DBComponentsCI pipeline (Continuous integration) set up in Part 1

Lets remind ourselves of this pipeline by clicking Edit


We build the solution file from the repos in DevOps, then copy the files to the staging directory, and finally publish the artifact ready for release.

Come out of Edit and this time choose Run Pipeline

And Run.

Once it's run, there are warnings again, but for the time being I'm going to ignore these.

Otherwise the build pipeline runs successfully.

Release pipeline

Now we have rebuilt the artifact using the build pipeline, go to pipelines, Releases and to your DB_CD release (continuous delivery)

We have a successful release. I can run the above SQL and check for differences. For a start, there were 39 objects and now there are 43, so you can immediately see that our production database has been updated.

The row count shows that we haven’t lost any data. We have simply updated the objects.

Part three will allow us to look more in-depth at the data side. How we deal with the data in Dev and Prod

Azure Data Factory – Moving from Development to Production

When working on larger projects we need to merge changes from Developers. When all the changes are in the central branch we can then have an automated process to move development to Production

Smoke tests

In computer programming and software testing, smoke testing is preliminary testing to reveal simple failures severe enough to, for example, reject a prospective software release

Integration testing

Integration testing is the phase in software testing in which individual software modules are combined and tested as a group. Integration testing is conducted to evaluate the compliance of a system or component with specified functional requirements. It occurs after unit testing and before validation testing

Resources Involved with the current Project

  • Azure DevOps
  • Azure SQL Server
  • Azure SQL Database
  • Azure Data Factory
  • Azure Data Lake Gen 2 Storage
  • Azure Blob Storage
  • Azure Key vault

Each resource has its own specific requirements when moving from Dev to Prod.

We will be looking at all of them separately along with all the security requirements that are required to ensure that everything works on the Production side

This post specifically relates to Azure Data Factory and DevOps

Azure Data Factory CI/CD lifecycle

Git handles creating the feature branches and then merging them back into main (master).

Git is used for version control.

In terms of Data Factories, you will have a dev factory, a UAT factory (if used) and a prod Data Factory. You only need to integrate your development Data Factory with Git.

The Pull request merges feature into master

Once published we need to move the changes  to the next environment, in this case Prod (When ready)

This is where Azure Devops Pipelines come into play

If we are using Azure DevOps pipelines for continuous deployment, the following things will happen:

  • The DevOps pipeline will get the PowerShell script from the master branch
  • Then get the ARM template from the publish branch
  • Deploy the PowerShell script to the next environment
  • Deploy the ARM template to the next environment

Why use Git with Data Factory?

  • Source control allows you to track and audit changes
  • You can do partial saves when, for example, you have an error. Data Factory won't allow you to publish, but with Git you can save where you are and then resolve the issue another time
  • It allows you to collaborate more with team members
  • Better CI/CD when deploying to multiple environments
  • Data Factory is many times faster with a Git back end than when authoring against the Data Factory service directly, because resources are downloaded from Git
  • Adding your code into Git rather than simply into the Azure service is actually more secure and faster to process

Setting Up Git

We already have an Azure Devops Project with Repos and Pipelines turned on

We already have Azure subscriptions and resource groups for both the production and development environments.

There is already a working Data Factory in development

In this example Git was set up through the Data Factory management hub (the toolbox)

DevOps Git was used for this project rather than GitHub because we have Azure DevOps

Settings

The Project Name matches the Project in Devops

The Collaboration branch is used for Publishing and by default it’s the master branch. You can change the setting in case you want to publish from another branch.

Import existing resources to repository means that all the work done before adding Git can be added to the repository.

DevOps is now set up.

If Azure Data Factory is open, close it so we can reopen it and go through the Git process.

Where to find your Azure DevOps

You should now be able to go to your own area in Azure DevOps, open Repos and see the created project.

You will need an account in Azure DevOps, and to access Repos that account's access level must be higher than Stakeholder.

You can then select the project you have created.

Using Git with Azure Data Factory

In Azure Open up your Data Factory

Git has been Enabled (Go to Manage to review Git)

The master branch is the main branch with all the development work on it

We now develop a new feature. Create a feature branch with + New branch.

We are now in the feature branch, and I am simply adding a description to a Stored Procedure activity in the pipeline. However, this is where you would now do your development work rather than in master.

For the test, the description of a pipeline is updated. Once completed, my changes are in the feature 1 branch. I can now save my feature.

You don’t need to publish to save the work. Save all will save your feature, even if there are errors

You can go across to Devops to see your Files and history created in Azure Devops for the feature branch (We will look at this once merged back into Production)

Once happy Create the pull request

This takes you to a screen to include more details

Here I have also included the Iteration we are currently working on in Devops Boards.

A few tags are also added. Usually, someone will review the work and will also be added here.

The next screen allows you to approve the change and Complete the change

In this case I have approved. You can also do other things like make suggestions and reject

Completing allows us to complete the work and removes the feature branch. Now all the development in the feature branch will be added to the main, master branch.

In Data Factory, go back to the master branch and note that your feature updates are included

We now publish the changes in the master branch, which creates the adf_publish branch. This publish branch holds the ARM template that represents the pipelines, linked services, triggers etc.

Once published, In Devops Repos , there are now files to work with

You can see your change within the master branch

(The changes would normally be highlighted on the two comparison screens)

Here we open the Pipelines folder, go to the Compare tab and find the before and after code.

And you can also see your history.

The ARM templates are in the adf_publish branch, if you select that branch.

Once done, we need to move the changes to the next environment, in this case Prod (when ready).

This is where Azure Devops Pipelines come into play

Continuous Development using Azure DevOps

We need another Data Factory object to publish changes to

In this case, the production Data Factory has been created in the Azure portal within the production subscription and production resource group.

Git Configuration is not needed on the Production resource. Skip this step

Create your tags and Review and Create

DevOps Pipelines

For this specific Project, We don’t want to update production automatically when we publish to Dev. We want this to be something that we can do manually.

Go to Pipelines and create a new release Pipeline (In DevOps)

Click on Empty job because we don’t want to start with a template

And because for this project there is no UAT, just production, name the release pipeline Prod.

Click on the X to close the blade

We need to sort out the Artefact section of the pipeline.

Click on Add an artefact and choose an artefact from Azure Repos.

We may as well add the adf_publish branch, which contains the ARM templates,
and the master branch.

The source alias was updated to _adf_publish.

Both are Azure Repos artefacts.

Next we move to Prod and start adding tasks.

Click on '1 job, 0 tasks' to get to the tasks.

Click + against the agent job to add a task. Our task is ARM template deployment.

Click Add.

Then click on the new task to configure it.

The first section is where you select your production environment

Next you need to select the ARM template and the ARM template parameters file. These are updated in the DevOps artefact every time you publish to dev.

The JSON templates are in the adf_publish branch

Now you need to override the template parameters, because these are all for dev and we need them to be production values. These are:

These will be specific to your own Data Factory environment. In this instance we need to sort out the information for the Key Vault and the data lake storage account.

factoryName

This one is easy. The only difference is changing dev to prd

AzureDataLakeStorageGen2_LS_properties_typeProperties_url

The data lake storage account must be set up in both dev and prod before continuing. Go to the production storage account resource.

This information is also stored in our Key Vault as a secret, which we can hopefully use at a later date.

It is taken from the storage account's Properties. We want the primary endpoint for Data Lake Storage.

Copy the Primary Endpoint URL and override the old with the new Prod URL in DevOps

AzureKeyVault1_properties_typeProperties_baseUrl

We need to update https://dev-uks-project-kv.vault.azure.net/

Let's get this overridden. We already have a Key Vault set up in production. Get the URI from Overview in the production Key Vault service,

and add this into our DevOps parameter.

AzureDataLakeStorageGen2_LS_accountKey

This is empty but we could add to it later in the process.

Account keys are the kind of thing that should be kept as secrets in Key Vault in both dev and prod.

Let's get them set up. Just for the time being, let's ensure we have the data lake storage account key within our development and production Key Vaults.

Key Vault

Within the development Key Vault, create a secret with the name AzureDataLakeStorageGen2_LS_accountKey.

And the key from the storage account comes from……

And repeat for the production Key Vault.

For the time being though, let's leave this parameter blank now we have captured the information in the Key Vault. It should come in useful at a later date.

AzureSqlDatabaseTPRS_LS_connectionString

This was also empty within the parameters for dev.

You can get the connection string value by going to your SQL database – Connection strings – PHP, and finding it in the try statement.

And here is the connection string value for production:

Server=tcp:prd-uks-project-sql.database.windows.net,1433;Database=prd-uks-project-sqldb;

You can also add this information into Key Vault as a secret, and repeat for production.

For the first pass we are going to leave this empty, as per the dev parameters. At some point we should be able to set up the security principal so we can change the hardcoded values to secrets.

The parameters created in dev are now overridden with the production values.
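As a rough example, the Override template parameters box ends up containing something along these lines (shown on separate lines for readability; the production resource names below are hypothetical):

-factoryName "prd-uks-project-adf"
-AzureDataLakeStorageGen2_LS_properties_typeProperties_url "https://prduksprojectdl.dfs.core.windows.net/"
-AzureKeyVault1_properties_typeProperties_baseUrl "https://prd-uks-project-kv.vault.azure.net/"
-AzureDataLakeStorageGen2_LS_accountKey ""
-AzureSqlDatabaseTPRS_LS_connectionString ""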

The Pipeline is then named

Create a release

Once Saved. Click back on Releases

For this type of release we only want to run it manually.

Create a Release for our very first manual release

Click back on releases

And click on release 1 to see how it is doing

You can click on Logs under the Stages box to get more information

Now you should be able to go back to the production data Factory and see that everything has been set up exactly like Dev.

Go and have a look at linked Services in the Production data Factory

Note that they are all set with the Production information

We now have a process to move Dev to Prod whenever we want

The Process

Throughout the sprint, the development team will have been working on feature branches. These branches are then committed into the master branch and deployed to dev.

Once you are happy that you want to move your Data Factory across from dev into Prod. Go to DevOps Release pipeline

Create Release to create a new release

It uses the ARM template artefact, which is always up to date after a publish.

This will create a new release and move the new information to Prod.

The aim is for all your resources to be able to move quickly from dev to prod, and we will look at this in further posts.

Data Factory, Moving multiple lookup worksheets from Excel to one lookup table in SQL Server

A current project has an xlsx containing around 40 lookups in individual worksheets

Each worksheet consists of a code and a description

We decide that we want every single lookup in one lookups table in SQL Server.

This will have a Lookup Name, Code and Description that we can then use for the rest of the project

We want to do everything in one go in Data Factory.

For this I'm going to use a simple example with just 3 worksheets.

Azure Data Lake Gen 2

We are going to store the source data within a data lake.

The Source data looks like this

Lookup B worksheet

Lookup C Worksheet

SQL Server

I have an Azure SQL Database and on it I create the one table that all the reference lookups will go into

GO
CREATE TABLE [staging].[Lookups](
    [LabelKey] [int] IDENTITY(1,1) NOT NULL,
    [LabelName] varchar(255) NULL, -- column lengths assumed; the sizes were lost in the original post's formatting
    [Code] [int] NULL,
    [LabelDescr] varchar(255) NULL,
    [Importdate] [datetime] NULL
) ON [PRIMARY]
GO
ALTER TABLE [staging].[Lookups] ADD DEFAULT (getdate()) FOR [Importdate]
GO

LabelKey has been added just to create a valid key for the table. LabelName has also been added which will be the name of the worksheet.

Finally ImportDate is added because we want to know exactly what time this data was imported into the table

Now we need to provide Data Factory with a list of worksheets

CREATE TABLE [staging].[LookupNames](
    [LabelKey] [int] IDENTITY(1,1) NOT NULL,
    [Labels] varchar(255) NULL, -- length assumed, as above
    [Importdate] [datetime] NULL
) ON [PRIMARY]
GO
ALTER TABLE [staging].[LookupNames] ADD DEFAULT (getdate()) FOR [Importdate]
GO

LookupNames is our seed table and will provide us with the worksheet names.

We have populated it like this:

INSERT INTO [staging].[LookupNames] ([Labels])
SELECT 'Lookup A' UNION
SELECT 'Lookup B' UNION
SELECT 'Lookup C'

Data Factory

Linked Services

Firstly we need to create our linked services: source and destination.

Go to Linked services via the Manage hub

and choose New.

Call it ADLS_LS and select your Azure subscription and storage account.

At this point the connection was tested and was successful so we didn’t need to do anything further

Next, create your Azure SQL Database linked service.

Call it SQLDB_LS (or whatever you feel is the right naming convention – the _LS suffix is good because you can see exactly which objects are linked services in the JSON that gets created).

Again add in your details (we used a role created in the SQL DB specifically for Data Factory, with GRANT EXEC, SELECT, INSERT, UPDATE, DELETE on all the schemas).

Ensure the connection is successful

Data Sets

Now to come up with the actual source and destination datasets. If we parameterise them then we can reuse a single data set for lots of other activities within the pipeline

Click on the … against Datasets and choose New dataset.

Choose the format. In this case it's Excel.

We don't want to specify any of the location values until we get to the pipeline, including the worksheet.

Make sure First row as header is ticked (unless you don't have a header in Excel).

And create parameters.

This means we can use this one dataset for all the source worksheets.
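For reference, the parameterised Excel dataset JSON looks roughly like this. It's a sketch only – the dataset name and the parameter names Container, Directory, File and Worksheet are the ones I would use here, so adjust them to your own:

{
    "name": "Excel_DS",
    "properties": {
        "linkedServiceName": {
            "referenceName": "ADLS_LS",
            "type": "LinkedServiceReference"
        },
        "parameters": {
            "Container": { "type": "string" },
            "Directory": { "type": "string" },
            "File": { "type": "string" },
            "Worksheet": { "type": "string" }
        },
        "type": "Excel",
        "typeProperties": {
            "sheetName": { "value": "@dataset().Worksheet", "type": "Expression" },
            "location": {
                "type": "AzureBlobFSLocation",
                "fileSystem": { "value": "@dataset().Container", "type": "Expression" },
                "folderPath": { "value": "@dataset().Directory", "type": "Expression" },
                "fileName": { "value": "@dataset().File", "type": "Expression" }
            },
            "firstRowAsHeader": true
        }
    }
}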

Pipelines

Now to create the pipeline specifically for the lookup

This is the basic pipeline we are going to add.

Lookup

First of all, in Activities search for Lookup and drag it onto the design pane.

This uses the SQL dataset because we are going to use our SQL table that contains all the names of the worksheets.

Note that first row only is not ticked because we are bringing all the information through

ForEach

@activity('GetLookups').output.Value

We are going to feed the entire result set (Value) of the GetLookups lookup into the ForEach.

Sequential is ticked because we are going to move through all the worksheet names in the table (ensure that your worksheets have exactly the same names as those specified in your table).

Click on the Activities (1) to get to the activity

Copy Activity within the Foreach

We now set up the source of the copy activity.

We use all the parameters within the dataset and add in the information from our Azure Data Lake Gen 2 storage resource.

Within our Lookups table there is a column called LabelName, and we are going to populate it with the Labels column from our item. Our item comes from the ForEach loop and was created via the lookup, and that lookup contained all the columns from our LookupNames SQL table.

The data will go into the Lookups table
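As a rough sketch, using the hypothetical dataset and parameter names above, the source side of the copy activity passes the current ForEach item into the Excel dataset like this, with the worksheet name coming from the lookup row (the container, directory and file values are placeholders):

"inputs": [
    {
        "referenceName": "Excel_DS",
        "type": "DatasetReference",
        "parameters": {
            "Container": "sourcedata",
            "Directory": "lookups",
            "File": "Lookups.xlsx",
            "Worksheet": "@item().Labels"
        }
    }
]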

That's everything. You should be able to test your pipeline by clicking Debug; the ForEach should move through each worksheet specified in your LookupNames table and add the data into SQL.

Truncating the lookup table before re-adding data

We want to be able to repeat this process, and unless we add a truncate statement we will keep re-adding the same information.

We can add the following Stored procedure into SQL

/*
05/10/2020 Debbie Edwards - Peak - Truncate lookups
EXEC [staging].[USP_Truncatelookups]
*/
Create PROCEDURE [staging].[USP_Truncatelookups]
AS
BEGIN
IF EXISTS(SELECT * FROM [dbo].[sysobjects] WHERE Name = 'lookups')
TRUNCATE TABLE [staging].[Lookups]
DBCC CHECKIDENT ('Staging.Lookups', RESEED, 1)
END

And this can be added to the pipeline before the ForEach loop and lookup, using a Stored Procedure activity.

You won't be able to see the stored procedure if you haven't granted EXEC access to the specific database role and schema.

Always give the least amount of privilege and then move up if you need to.

--Bring back information about the members in your roles
SELECT DP1.name AS DatabaseRoleName,
       ISNULL(DP2.name, 'No members') AS DatabaseUserName
FROM sys.database_role_members AS DRM
RIGHT OUTER JOIN sys.database_principals AS DP1
    ON DRM.role_principal_id = DP1.principal_id
LEFT OUTER JOIN sys.database_principals AS DP2
    ON DRM.member_principal_id = DP2.principal_id
WHERE DP1.type = 'R'
ORDER BY DP1.name;

--Bring back the permissions granted to a specific database role
SELECT DISTINCT rp.name,
       ObjectType = rp.type_desc,
       PermissionType = pm.class_desc,
       pm.permission_name,
       pm.state_desc,
       ObjectType = CASE
                        WHEN obj.type_desc IS NULL
                             OR obj.type_desc = 'SYSTEM_TABLE' THEN pm.class_desc
                        ELSE obj.type_desc
                    END,
       s.Name AS SchemaName,
       [ObjectName] = ISNULL(ss.name, OBJECT_NAME(pm.major_id))
FROM sys.database_principals rp
INNER JOIN sys.database_permissions pm
    ON pm.grantee_principal_id = rp.principal_id
LEFT JOIN sys.schemas ss
    ON pm.major_id = ss.schema_id
LEFT JOIN sys.objects obj
    ON pm.[major_id] = obj.[object_id]
LEFT JOIN sys.schemas s
    ON s.schema_id = obj.schema_id
WHERE rp.type_desc = 'DATABASE_ROLE'
    AND pm.class_desc <> 'DATABASE'
    AND rp.name = 'db_NameofRole'

You should hopefully now have a good pipeline that loads your lookup information into one lookup table and truncates that table whenever you run the process.

Power BI Dataflow issues. Let the whole dev team know

Currently, if your dataflow fails the only person who will be notified is the owner of the dataflow.

We want all the developers in the team to know. There doesn't appear to be any way to do this at the moment, but there is a workaround that was suggested to me on the Power BI forums by collinq as an idea starter, and I thought I would run with it and see what happens.

It all relies on a Refresh date

Refresh date

In my main dataflow I have the following Query

This was created from a blank query

let
    Source = #table(type table [LastRefresh = datetime], {{DateTime.LocalNow()}})
in
    Source

This gets updated every time there is a refresh on the main dataflow

Create a New Report in Power BI Desktop

Go to Power BI desktop and Get Data

Dataflow is the source of the data

And we only need this object

I am going to edit the Query in Power Query Editor

Last Refresh date has been split into Date and time.

Then a custom column was created for todays date

DateTime.LocalNow()

This was split into Date and Time. It is very likely that we may decide to use time later so this is why it has been added for now.

Now we find the number of days between the last refresh and today

0 -Duration.Days(Duration.From([last Refresh Date]-[Todays Date]))

0- is added to remove the minus at the start of the number so -1 becomes 1
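Pulled together, the query in the new report looks roughly like this. It is only a sketch: the navigation steps to the dataflow entity are replaced by a stand-in #table, and the column names are the ones described above.

let
    // Stand-in for the dataflow entity – in the real report Source comes from the Power BI dataflows connector
    Source = #table(type table [LastRefresh = datetime], {{DateTime.LocalNow()}}),
    // Split out the date parts of the last refresh and of "now"
    AddRefreshDate = Table.AddColumn(Source, "last Refresh Date", each Date.From([LastRefresh]), type date),
    AddTodaysDate = Table.AddColumn(AddRefreshDate, "Todays Date", each Date.From(DateTime.LocalNow()), type date),
    // 0 - … flips the negative difference into a positive day count
    AddDaysBetween = Table.AddColumn(AddTodaysDate, "Days Since Refresh", each 0 - Duration.Days(Duration.From([last Refresh Date] - [Todays Date])), Int64.Type)
in
    AddDaysBetween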

Close and Apply and a card is added

Publish to Service

Power BI Service

Go to the report that has just been created

And Pin the card to a dashboard. In this case, an Issues dashboard has been created

The idea at this point is: if the refresh date isn't updated, the number of days between will become 2 (because today's date will still change), and we can be notified.

This will need testing

Power BI Dashboard (Create Alerts)

Go to the dashboard

Choose Manage Alerts

We want to know if it goes above 1; this means that last night's refresh has failed to happen.

(But if it fails to happen, will the current date refresh?)

At present, an alert will only be sent to the owner of the report and it will be seen within Power BI but we want everyone to know.

This is why we are going to use Power Automate / Logic Apps

We have an Azure Subscription so I can add a Logic App within our subscription

Instead of clicking the above link we are going to go into Azure, but the principle will be the same.

Save the Alert

Schedule a refresh

The new report has created a dataset

go to Settings and Scheduled refresh to keep this up to date

Azure Logic Apps

Create a new Logic App in Azure

Search for Power BI. We want the trigger 'When a data driven alert is triggered (preview)'.

I am signing in with my own credentials (note that my password is updated every month, so if these credentials are used this will need recording in our governance notes).

Our alert has been saved and added to the alert list

Just for the time being the check frequency is being left at 3 times a day.

We have our trigger, now we need to know what will happen on the trigger

New Step

For the time being we chose the fairly easy option of sending an email.

You can search for the dynamic content as you create the body and subject. Here we want to bring to attention the value in the tile and the alert threshold.

  • The HTML <li> element is used to represent an item in a list
  • The <strong> tag is used to separate the text from the rest of the content. Browsers traditionally bold the text found within the <strong> tag
  • The <big> tag is used to make the text one size bigger
  • The <ul> tag defines an unordered (bulleted) list
  • The <a> tag defines a hyperlink, which is used to link from one page to another. The most important attribute of the <a> element is the href attribute, which indicates the link’s destination.

I have added two users to the email so they can both be notified

Save your Logic App. It's ready.

Testing the New Process

The dataflow is scheduled to refresh at 11 PM.

The Dataflow Issues dataset is scheduled to refresh at 12 AM.

On the night of the 28th of September, everything failed. I got the emails because I am the dataflow owner, but no email from the new alert set-up.

Testing has failed.

Let's have a look to see what's happened.

We have two fails, and one on the dataflow we have set up

It looks like the refresh token expired. Please go to this dataset’s settings page, and reenter the OAuth2 credentials for the Extension data source.

Going into the report and we still see this

Which is incorrect.

We would get more of an understanding if we could match up the dates to what is happening.

However it's clearly not updated.

Dataflow Settings

Scheduled refresh is on and set to run at 12 midnight. The errors were emailed through just after 11.

The alert is there.

Let's go back to Desktop and add some more information.

After a refresh in Desktop we can now see this information,

which is correct. This says to me that even though we have a refresh going on, it possibly didn't refresh in the Service. The new report is published up to the Service.

Back in Service

This is interesting. Our new multi row card shows the correct information. However our card still says 1 day which isn’t correct.

A quick refresh of the dataset and we can still see one on the card, so we have a difference between the Service and Desktop.

A refresh of the report and now it's worked: we can see 2 days difference.

So there are a few issues here. Why did it not refresh the card on the data set refresh but it did when the actual report was refreshed?

It's actually the dashboard that is doing the work here. The new multi-row card is pinned to the dashboard. Let's go and have a look at it.

The dashboard only updated once the new visual was pinned to it

So the failure has been that the report and dashboard didn’t refresh, even though it is set to refresh.

You can get to the data set refresh history in Data sets and then Refresh History

And you can get to the Dataflow refresh history via Dataflows

Data Set Issues Refresh History

Dataflow Issues Refresh History

The actual schedule seems to be fine. All I can think of is that at 12 it is possibly still only 1 day, so I could introduce more refreshes of the Dataflow Issues dataset.

Test 2: Adding more refreshes on the Dataflow Issues dataset

It's a very quick refresh because it's just two dates. Let's see if this changes things.

Quick Incremental Refresh check List

Incremental refresh became available for Power BI Pro a few months ago, but when tested there was an issue: 'Error: Resource Name and Location Name need to match'. This should have been fixed in April, so here is a quick checklist of how to approach incremental refresh.

Define your Incremental Refresh Policy

  • What is your data source?
  • Are new rows simply added to the dataset in Power BI?
  • Are records deleted?
  • Can old rows be updated?
  • If rows can be updated, how far back does this go?
  • How many years' worth of data do you want to retain?
  • Which tables in your dataset need incremental refresh?
  • Can you define the static date within the table that will be used for the incremental refresh?
  • Have you published to P…

Each of these points is very important and will establish what you need to do to set up incremental refresh, from your data source up to Power BI Desktop and the Service.

Set up incremental Refresh in Power Query Editor. Create Parameters

Go to Transform data to get to the Power Query Editor (you can either be in Desktop or creating a dataflow in the Service).

The two parameters that need setting up for incremental loading are RangeStart and RangeEnd.

RangeStart and RangeEnd are set in the background when Power BI runs the refresh. They partition the data.

Query folding: the RangeStart and RangeEnd filters will be pushed to the source system. It's not recommended to run incremental processing on data sources that can't query fold (flat files, web feeds). You do get a warning message if the query can't fold.

You can't query fold over a spreadsheet. It's possible to query fold over a SharePoint list, but the recommendation is to set incremental processing up over a relational data store.

For Desktop, allow yourself a good slice of the data to work with, for example one or two years' worth of data.

Filter the data in the Model

Still in Power Query Editor.

Add your parameters to every table in your data set that requires incremental load

Find your static date. E.g. Order date, Received Date etc
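The filter itself is a single step on that static date column; as a sketch, assuming a hypothetical column called OrderDate and a previous step called Source:

= Table.SelectRows(Source, each [OrderDate] >= RangeStart and [OrderDate] < RangeEnd)

Using >= on one end and < on the other stops a row that sits exactly on a partition boundary from being loaded twice.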

Close and Apply

Define your Incremental Refresh policy in Power BI Desktop

Go to your first table and choose incremental refresh

Example screen shot of an Incremental refresh policy

Store Rows

In the above example we are storing everything for 5 years. It's set to months so the partitions are smaller.

Refresh Rows

Our refresh policy is very simple. Data is inserted, not deleted or imported.

If this was running every single day then you would only need to refresh rows from the last 1 day. However, as a just-in-case, 1 month has been used, in case for any reason the job is suspended or doesn't run.

Detect Data Changes

Detect data changes has been used. The month's data will only be refreshed if the ImportDate for a record has changed (or there are new records).

No records are deleted so we don’t need to worry about this

If you want to use Detect Data changes you must have an Update date on your source data. This may impact your data source.

  • Are you running straight from source into Power BI and there is no Update Date available?

Then you will need to make the decision to have a reporting database layer, where you can add UpdateDate logic to your table.

  • Is there a possibility that records are also deleted?

You need to deal with this slightly differently

Set up soft deletes in your reporting data:

Add an IsDeleted column, and when a record is removed from the source, update LastUpdateTime and set IsDeleted to 1 in the warehouse.

This will come through to your model as an update, and you can filter out all the IsDeleted records.
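A minimal sketch of that soft-delete step, assuming hypothetical table and column names (dbo.FactOrders as the warehouse table, staging.Orders as the latest source extract):

-- Flag warehouse rows that no longer exist in the source instead of deleting them
UPDATE f
SET    f.IsDeleted      = 1,
       f.LastUpdateTime = GETDATE()
FROM   dbo.FactOrders AS f
WHERE  f.IsDeleted = 0
  AND  NOT EXISTS (SELECT 1 FROM staging.Orders AS s WHERE s.OrderId = f.OrderId);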

  • Publish the new Power BI Report and Data Flow

You might be thinking at this point: but I don't want the filters that I have set for Desktop to be applied in the Service. I want to see all my data in the Service.

Don't worry; in the Service, RangeStart and RangeEnd don't keep the dates specified for the filters in Desktop.

They are set via your incremental refresh policy, so they define the partitions for our 60 months. Instead of one RangeStart and one RangeEnd for the whole 5 years, you get a RangeStart and RangeEnd for month 1, a RangeStart and RangeEnd for month 2, and so on, breaking your 5 years down into much smaller partitions to work with.

Testing Your Incremental Refresh

Make sure that before you add the incremental refresh you have done a full process of your data. It's good to get an idea of the time you are working with.

Azure Data Studio

One way of testing is with Azure Data Studio.

https://docs.microsoft.com/en-us/sql/azure-data-studio/download-azure-data-studio?view=sql-server-ver15

Once installed, connect to the SQL Database that is your data source

So long as you have the profiler extension installed you can launch Profiler.

If you don't have it, you can download the extension.

Once launched, start a profiler session.

Go into the Power BI Service and then Datasets.

Click on Refresh now and then go to Data Studio to see what's happening.

From logon to logout during the run, it took 20 minutes because the entire model is refreshed. Obviously it would be really good if we could get the time down using incremental refresh.

Before you set up Incremental processing, ensure that the services preceding the Power BI Load have been well tested and signed off.

Once incremental refresh is in place, you can refresh again and check your findings in Azure Data Studio.

Due to all the possible data source amendments and issues, the current recommendation is to start with a relational database as your reporting layer. Then you can query fold, and add IsDeleted and UpdateDate columns and logic to this data source.

Azure Built In Tagging Policy. Update Resource by Inheriting Subscription Tag

Ensure your Azure Resources are tagged with the Tag from Subscription

In this example we have the following requirements

  • The costCentre tag has been manually added to the subscription
  • Every resource must be created with a costCentre tag
  • The costCentre tag must exist on the resources
  • Resources inherit the tag from the container they are in, but it can be manually overridden

Costcentre Tag Configuration

  • Modify Resources to add the CostCentre tag from the parent Subscription where missing

Time to assign a new Policy

No need to select a resource group, because they haven't been created yet and this applies to all resource groups that have yet to be created.

Choose Inherit a tag from the subscription if missing

The Assignment Name has been changed to include the tag

And the tag name is added as a parameter.

We need a Managed Identity because this has a modify effect
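For reference, the policy rule behind 'Inherit a tag from the subscription if missing' looks roughly like this (simplified from the built-in definition; the modify operation is why the assignment needs a managed identity with a role such as Contributor):

{
    "if": {
        "allOf": [
            {
                "field": "[concat('tags[', parameters('tagName'), ']')]",
                "exists": "false"
            },
            {
                "value": "[subscription().tags[parameters('tagName')]]",
                "notEquals": ""
            }
        ]
    },
    "then": {
        "effect": "modify",
        "details": {
            "roleDefinitionIds": [
                "/providers/microsoft.authorization/roleDefinitions/b24988ac-6180-42a0-ab88-20f7382dd24c"
            ],
            "operations": [
                {
                    "operation": "add",
                    "field": "[concat('tags[', parameters('tagName'), ']')]",
                    "value": "[subscription().tags[parameters('tagName')]]"
                }
            ]
        }
    }
}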

Then go to Review + Create and Create

Test the New Modify Policy

This is the tag on the Subscription

Create a new resource group (and remember to apply any tags required by policies you have already set).

Within this resource group, create a new resource (don't add the tag).

Once created, the Resource will inherit the Tag from the Subscription