Azure Data Factory – Moving from Development to Production

When working on larger projects we need to merge changes from developers. Once all the changes are in the central branch, we can then have an automated process to move development to Production

Smoke tests

In computer programming and software testing, smoke testing is preliminary testing to reveal simple failures severe enough to, for example, reject a prospective software release

Integration testing

Integration testing is the phase in software testing in which individual software modules are combined and tested as a group. Integration testing is conducted to evaluate the compliance of a system or component with specified functional requirements. It occurs after unit testing and before validation testing

Resources Involved with the current Project

  • Azure DevOps
  • Azure SQL Server
  • Azure SQL Database
  • Azure Data Factory
  • Azure Data Lake Gen 2 Storage
  • Azure Blob Storage
  • Azure Key vault

Each resource has its own specific requirements when moving from Dev to Prod.

We will be looking at all of them separately along with all the security requirements that are required to ensure that everything works on the Production side

This post specifically relates to Azure Data Factory and DevOps

Azure Data Factory CI/CD Lifecycle

Git handles the creation of the feature branches and the merging of them back into main (master)

Git is used for version control.

In terms of Data Factories, you will have a Dev factory, a UAT factory (if used) and a Prod Data Factory. You only need to integrate your development Data Factory with Git.

The pull request merges the feature branch into master

Once published we need to move the changes to the next environment, in this case Prod (when ready)

This is where Azure Devops Pipelines come into play

If we are using Azure DevOps Pipelines for continuous deployment, the following will happen:

  • The DevOps pipeline gets the PowerShell script from the master branch
  • It gets the ARM template from the adf_publish branch
  • It deploys the PowerShell script to the next environment
  • It deploys the ARM template to the next environment

Why use Git with Data Factory?

  • Source control allows you to track and audit changes
  • You can do partial saves when, for example, you have an error. Data Factory won't allow you to publish with errors, but with Git you can save where you are and then resolve the issues another time
  • It allows you to collaborate more with team members
  • Better CI/CD when deploying to multiple environments
  • Data Factory is many times faster with a Git back end than it is when authoring against the Data Factory service, because resources are downloaded from Git
  • Adding your code into Git rather than simply into the Azure service is more secure and faster to process

Setting Up Git

We already have an Azure Devops Project with Repos and Pipelines turned on

We already have Azure subscriptions and resource groups for both the Production and Development environments

There is already a working Data Factory in development

In this example Git was set up through the Data Factory management hub (the toolbox)

DevOps Git was used for this project rather than GitHub because we have Azure DevOps

Settings

The Project Name matches the Project in Devops

The Collaboration branch is used for Publishing and by default it’s the master branch. You can change the setting in case you want to publish from another branch.

Import existing resources to repository means that all the work done before adding Git can be added to the repository.

Devops is now set up.

If Azure Data Factory is open, close it so we can reopen it and go through the Git process

Where to find your Azure DevOps

You should now be able to go to your own area in Azure DevOps and open Repos

You will need an account in Azure DevOps, and to access Repos that account must have an access level higher than Stakeholder

You can then select the Project you have created

Using Git with Azure Data Factory

In Azure Open up your Data Factory

Git has been Enabled (Go to Manage to review Git)

The master branch is the main branch with all the development work on it

We now develop a new feature. Create a feature branch with + New Branch

We are now in the feature branch and I am simply adding a description to a Stored Procedure Activity in the pipeline. However this is where you will now do your development work rather than within the master

For the test, the description of a Pipeline is updated. Once completed my changes are in the feature 1 branch. I can now save my feature

You don’t need to publish to save the work. Save all will save your feature, even if there are errors

You can go across to Devops to see your Files and history created in Azure Devops for the feature branch (We will look at this once merged back into Production)

Once happy, create the pull request

This takes you to a screen to include more details

Here I have also included the Iteration we are currently working on in Devops Boards.

A few tags are also added. Usually, someone will review the work and will also be added here.

The next screen allows you to approve the change and Complete the change

In this case I have approved. You can also do other things like make suggestions and reject

Completing allows us to complete the work and removes the feature branch. Now all the development in the feature branch will be added to the main, master branch.

In Data Factory, go back to the master branch and note that your feature updates are included

We now publish the changes in the master branch, which creates the adf_publish branch. This publish branch contains the ARM template that represents the pipelines, linked services, triggers etc.

Once published, in DevOps Repos there are now files to work with

You can see your change within the master branch

(The changes would normally be highlighted on the two comparison screens)

Here we open the Pipelines folder, go to the compare tab and find the before and after code

And you can also see your history

The ARM templates are in the adf_publish branch, if you select this branch

Once done we need to move the changes to the next environment, in this case Prod (when ready)

This is where Azure Devops Pipelines come into play

Continuous Deployment using Azure DevOps

We need another Data Factory object to publish changes to

In this case, the production Data Factory has been created in the Azure portal, within the Production subscription and Production resource group

Git Configuration is not needed on the Production resource. Skip this step

Create your tags and Review and Create

DevOps Pipelines

For this specific project, we don't want to update Production automatically when we publish to Dev. We want this to be something that we can do manually.

Go to Pipelines and create a new release Pipeline (In DevOps)

Click on Empty job because we don’t want to start with a template

Because for this project there is no UAT, just Production, name the release pipeline stage Prod

Click on the X to close the blade

We need to sort out the Artefact section of the Pipeline
 
Click on Add an Artefact and choose an artefact from Azure Repos
 
We may as well add the adf_publish branch, which contains the ARM templates, and the master branch

The Source alias was updated to _adf_publish

Both are added as Azure Repos artefacts

Next we move to Prod and start adding tasks

Click on 1 job, 0 tasks to get to tasks

Click + against Agent job to add a task. Our task is an ARM template deployment


Click Add

Then click on the new Task to configure

The first section is where you select your production environment

Next you need to select the ARM template and the ARM template parameters file. These are always updated in the DevOps artefact every time you publish to Dev.

The JSON templates are in the adf_publish branch

Now you need to override the template parameters, because these are all set for Dev and we need them to be Production values.

These will be specific to your own Data Factory environment. In this instance we need to sort out the information for the Key Vault and the Data Lake storage account

factoryName

This one is easy. The only difference is changing dev to prd

AzureDataLakeStorageGen2_LS_properties_typeProperties_url

The Production Data lake Storage account must be set up in dev and prod before continuing. Go to this Storage account resource

This information is also stored in our Key Vault as a secret which we can hopefully use at a later date.

It is taken from Storage Account, Properties. We want the Primary endpoint for Data Lake Storage

Copy the Primary Endpoint URL and override the old with the new Prod URL in DevOps

AzureKeyVault1_properties_typeProperties_baseUrl

We need to update https://dev-uks-project-kv.vault.azure.net/

Let's get this overridden. We already have a Key Vault set up in Production. Get the URI from Overview in the Production Key Vault service

And let's add this into our DevOps parameter

AzureDataLakeStorageGen2_LS_accountKey

This is empty but we could add to it later in the process.

Account keys are the kind of things that should be kept as secrets in Key vault in both Dev And Prod

Let's get them set up. Just for the time being, let's ensure we have the Data Lake storage account key within our Development and Production Key Vaults

Key Vault.   

Within the Key Vault in Development, create a secret with the name AzureDataLakeStorageGen2LSaccountKey

The key itself comes from the storage account's Access keys blade

And Repeat for Production Key vault

For the time being though, let's leave this blank now that we have captured the information in the Key Vault. It should come in useful at a later date
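As a rough sketch of where this could go later (the names and URL below are illustrative), the Data Lake linked service can point its account key at that Key Vault secret instead of a hardcoded value:

{
    "name": "AzureDataLakeStorageGen2_LS",
    "properties": {
        "type": "AzureBlobFS",
        "typeProperties": {
            "url": "https://<prod storage account>.dfs.core.windows.net/",
            "accountKey": {
                "type": "AzureKeyVaultSecret",
                "store": { "referenceName": "AzureKeyVault1", "type": "LinkedServiceReference" },
                "secretName": "AzureDataLakeStorageGen2LSaccountKey"
            }
        }
    }
}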

AzureSqlDatabaseTPRS_LS_connectionString

This was also empty within the parameters for dev.

You can get the connection string value by going to your SQL database, then Connection strings, then the PHP tab, and finding it inside the try statement

And here is the Connection String value for production

Server=tcp:prd-uks-project-sql.database.windows.net,1433;Database=prd-uks-project-sqldb;

You can also add this information into Key Vault as a secret and repeat for Production

For the first instance we are going to leave this empty, as per the Dev parameters. At some point we should be able to set up the service principal so we can change the hardcoded values to secrets

The parameters created in dev are now overridden with the production values
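To give an idea of what we have just done (all names and values below are placeholders), the overridden parameters amount to an ARM parameters file along these lines. In the release task you type the production values into the Override template parameters box rather than editing the file itself:

{
    "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentParameters.json#",
    "contentVersion": "1.0.0.0",
    "parameters": {
        "factoryName": { "value": "<prod data factory name>" },
        "AzureDataLakeStorageGen2_LS_properties_typeProperties_url": { "value": "https://<prod storage account>.dfs.core.windows.net/" },
        "AzureKeyVault1_properties_typeProperties_baseUrl": { "value": "https://<prod key vault>.vault.azure.net/" },
        "AzureDataLakeStorageGen2_LS_accountKey": { "value": "" },
        "AzureSqlDatabaseTPRS_LS_connectionString": { "value": "" }
    }
}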

The Pipeline is then named

Create a release

Once saved, click back on Releases

For this type of release we only want to do it manually

Create a Release for our very first manual release

Click back on releases

And click on release 1 to see how it is doing

You can click on Logs under the Stages box to get more information

Now you should be able to go back to the production data Factory and see that everything has been set up exactly like Dev.

Go and have a look at linked Services in the Production data Factory

Note that they are all set with the Production information

We now have a process to move Dev to Prod whenever we want

The Process

Throughout the sprint, the development team will have been working on feature branches. These branches are then committed into the master branch and deployed to Dev

Once you are happy that you want to move your Data Factory across from Dev into Prod, go to the DevOps release pipeline

Create Release to create a new release

It uses the ARM template artefact, which is always up to date after a publish.

This will create a new release and move the new information to Prod

All your resources will need to be able to move quickly from Dev to Prod, and we will look at this in further posts

Data Factory, Moving multiple lookup worksheets from Excel to one lookup table in SQL Server

A current project has an xlsx containing around 40 lookups in individual worksheets

Each worksheet consists of a code and a description

We decide that we want every single lookup in one lookups table in SQL Server.

This will have a Lookup Name, Code and Description that we can then use for the rest of the project

We want to do everything in one go in Data Factory.

For this I'm going to use a simple example with just 3 worksheets

Azure Data Lake Gen 2

We are going to store the source data within a data lake.

The Source data looks like this

Lookup B worksheet

Lookup C Worksheet

SQL Server

I have an Azure SQL Database and on it I create the one table that all the reference lookups will go into

GO
CREATE TABLE [staging].[Lookups](
[LabelKey] [int] IDENTITY(1,1) NOT NULL,
[LabelName] [varchar](255) NULL, -- varchar lengths assumed; adjust to suit your data
[Code] [int] NULL,
[LabelDescr] [varchar](255) NULL,
[Importdate] [datetime] NULL
) ON [PRIMARY]
GO
ALTER TABLE [staging].[Lookups] ADD DEFAULT (getdate()) FOR [Importdate]
GO

LabelKey has been added just to create a valid key for the table. LabelName has also been added which will be the name of the worksheet.

Finally ImportDate is added because we want to know exactly what time this data was imported into the table

Now we need to provide Data Factory with a list of worksheets

CREATE TABLE [staging].[LookupNames](
[LabelKey] [int] IDENTITY(1,1) NOT NULL,
[Labels] [varchar](255) NULL, -- varchar length assumed; adjust to suit your data
[Importdate] [datetime] NULL
) ON [PRIMARY]
GO
ALTER TABLE [staging].[LookupNames] ADD DEFAULT (getdate()) FOR [Importdate]
GO

Lookup Names is our seed table and will provide us with the worksheet names

We have populated it like this:

INSERT INTO [staging].[LookupNames] ([Labels])
SELECT 'Lookup A' UNION
SELECT 'Lookup B' UNION
SELECT 'Lookup C'

Data Factory

Linked Services

Firstly we need to provide our linked services. Source and destination

Go to Linked services and choose New.

Call it ADLS_LS and select your Azure Subscription and Storage account.

At this point the connection was tested and was successful so we didn’t need to do anything further

Next, create your Azure SQL Database Linked Service

And call it SQLDB_LS (or whatever you feel is the right naming convention; the _LS suffix is good because you can see exactly which objects are linked services in the JSON script created)

Again add in your details (We used a role that we created in SQL Server DB specifically for data factory with GRANT EXEC, SELECT, INSERT, UPDATE, DELETE on all the schemas)

Ensure the connection is successful

Data Sets

Now to come up with the actual source and destination datasets. If we parameterise them then we can reuse a single data set for lots of other activities within the pipeline

Click on + and choose New dataset

Choose the format. In this case it's Excel

We don’t want to specify any of the location values until we get to the pipeline, including the worksheet

Make sure First row as header is ticked (Unless you don’t have a header in Excel)

And create parameters

This means we can use this one dataset for all of the Excel source worksheets
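As a hedged sketch of what this parameterised dataset looks like in JSON (the dataset and parameter names are illustrative; ADLS_LS is the linked service created above), the sheet name and file location are left as parameters to be supplied by the pipeline:

{
    "name": "Excel_DS",
    "properties": {
        "linkedServiceName": { "referenceName": "ADLS_LS", "type": "LinkedServiceReference" },
        "parameters": {
            "FileSystem": { "type": "String" },
            "FileName": { "type": "String" },
            "SheetName": { "type": "String" }
        },
        "type": "Excel",
        "typeProperties": {
            "sheetName": { "value": "@dataset().SheetName", "type": "Expression" },
            "location": {
                "type": "AzureBlobFSLocation",
                "fileSystem": { "value": "@dataset().FileSystem", "type": "Expression" },
                "fileName": { "value": "@dataset().FileName", "type": "Expression" }
            },
            "firstRowAsHeader": true
        },
        "schema": []
    }
}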

Pipelines

Now to create the pipeline specifically for the lookup

This is the basic pipeline we are going to add.

Lookup

First of all, in Activities search for Lookup and drag this into the pane

This uses the SQL dataset because we are going to use our SQL table that contains all the names of the worksheets.

Note that "First row only" is not ticked because we are bringing all the information through
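The lookup activity JSON ends up looking something like the sketch below (the SQL dataset name and reader query are illustrative; the key settings are the SQL source and firstRowOnly set to false):

{
    "name": "GetLookups",
    "type": "Lookup",
    "typeProperties": {
        "source": {
            "type": "AzureSqlSource",
            "sqlReaderQuery": "SELECT Labels FROM staging.LookupNames"
        },
        "dataset": { "referenceName": "LookupNames_DS", "type": "DatasetReference" },
        "firstRowOnly": false
    }
}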

ForEach

@activity('GetLookups').output.value

We feed the entire result set (value) of the GetLookups lookup into the ForEach.

Sequential is ticked because we are going to move through all the worksheet names in the table (ensure that your worksheets have exactly the same names as those specified in your table)
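The ForEach activity then looks roughly like this (the activity name is illustrative), with the copy activity described next sitting inside the activities array:

{
    "name": "ForEachLookup",
    "type": "ForEach",
    "dependsOn": [ { "activity": "GetLookups", "dependencyConditions": [ "Succeeded" ] } ],
    "typeProperties": {
        "items": { "value": "@activity('GetLookups').output.value", "type": "Expression" },
        "isSequential": true,
        "activities": [ ]
    }
}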

Click on the Activities (1) to get to the activity

Copy Activity within the Foreach

We now set up the source of the copy activity

We use all the parameters within the dataset and add in the information from our Azure data Lake Gen 2 in the Storage Resource.

Within our Lookups table there is a column called LabelName, and we are going to populate this with the Labels column from our item. Our item comes from the ForEach loop and was created via the lookup, and that lookup contained all the columns from our LookupNames SQL table

The data will go into the Lookups table
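Inside the ForEach, the copy activity might look like the following sketch. The dataset names match the illustrative ones above, the file system and workbook name are placeholders, and populating LabelName through an additional column on the source is one way of doing it, which is what I am assuming here:

{
    "name": "CopyWorksheetToSQL",
    "type": "Copy",
    "inputs": [ {
        "referenceName": "Excel_DS",
        "type": "DatasetReference",
        "parameters": {
            "FileSystem": "<your container>",
            "FileName": "<your workbook>.xlsx",
            "SheetName": { "value": "@item().Labels", "type": "Expression" }
        }
    } ],
    "outputs": [ { "referenceName": "Lookups_DS", "type": "DatasetReference" } ],
    "typeProperties": {
        "source": {
            "type": "ExcelSource",
            "additionalColumns": [
                { "name": "LabelName", "value": { "value": "@item().Labels", "type": "Expression" } }
            ]
        },
        "sink": { "type": "AzureSqlSink" }
    }
}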

That's everything. You should be able to test your pipeline by clicking Debug; the ForEach should move through each worksheet specified in your LookupNames table and add the information into SQL

Truncating lookup tables before re-adding data

We want to be able to repeat this process, and unless we add a truncate statement into the process we will keep re-adding the same information

We can add the following Stored procedure into SQL

/*
05/10/2020 Debbie Edwards - Peak - Truncate lookups
EXEC [staging].[USP_Truncatelookups]
*/
Create PROCEDURE [staging].[USP_Truncatelookups]
AS
BEGIN
IF EXISTS(SELECT * FROM [dbo].[sysobjects] WHERE Name = 'lookups')
TRUNCATE TABLE [staging].[Lookups]
DBCC CHECKIDENT ('Staging.Lookups', RESEED, 1)
END

And this can be added to the pipeline before the ForEach loop and lookup with a Stored Procedure activity

You won't be able to see the stored procedure if you haven't granted EXEC access to the specific database role name and schema

Always give the least amount of privileges and then move up if you need to

--Bring back information about the members in your roles
SELECT DP1.name AS DatabaseRoleName,
isnull (DP2.name, 'No members') AS DatabaseUserName
FROM sys.database_role_members AS DRM
RIGHT OUTER JOIN sys.database_principals AS DP1
ON DRM.role_principal_id = DP1.principal_id
LEFT OUTER JOIN sys.database_principals AS DP2
ON DRM.member_principal_id = DP2.principal_id
WHERE DP1.type = 'R'
ORDER BY DP1.name;
SELECT DISTINCT rp.name,
ObjectType = rp.type_desc,
PermissionType = pm.class_desc,
pm.permission_name,
pm.state_desc,
ObjectType = CASE
WHEN obj.type_desc IS NULL
OR obj.type_desc = 'SYSTEM_TABLE' THEN
pm.class_desc
ELSE obj.type_desc
END,
s.Name as SchemaName,
[ObjectName] = Isnull(ss.name, Object_name(pm.major_id))
FROM sys.database_principals rp
INNER JOIN sys.database_permissions pm
ON pm.grantee_principal_id = rp.principal_id
LEFT JOIN sys.schemas ss
ON pm.major_id = ss.schema_id
LEFT JOIN sys.objects obj
ON pm.[major_id] = obj.[object_id]
LEFT JOIN sys.schemas s
ON s.schema_id = obj.schema_id
WHERE rp.type_desc = 'DATABASE_ROLE'
AND pm.class_desc <> 'DATABASE'
AND rp.name = 'db_NameofRole'

You should hopefully now have a good pipeline that loads your lookup information into one lookup table and truncates that table whenever you run the process

Setting up a Service Principal for Azure Data Lake Gen 2 (Storage) to use with Data Factory

An Azure service principal is a security identity used by user-created apps, services, and automation tools to access specific Azure resources. Think of it as a ‘user identity’ (login and password or certificate) with a specific role, and tightly controlled permissions to access your resources

Azure Service Principal

I am constantly having to remind myself how to set up the Service Principal for Access to things like Azure Data Lake Gen 2 when I am setting up a data factory (Or using the storage with another web app).

So I wanted to write a blog post specifically on this.

As the example, imagine you are moving data from an Azure SQL Database to files in Azure Data Lake Gen 2 using Azure Data Factory.

You attempt to add a Data Lake Connection but you need a Service Principal account to get everything Authorised.

You need this so the Data Factory will be authorised to read and add data into your data lake

An application (E.g. data Factory) must be able to participate in a flow that requires authentication. It needs to establish Secure credentials. The default method for this is a client ID and a Secret Key.

There are two types of permissions

Application Permissions: no user context is required. The app (e.g. Data Factory) needs to access the Web API by itself.

Delegated Permissions: the client application (e.g. Data Factory) needs to access the Web API as a signed-in user.

Create an App

In Azure choose App Registrations

Here you can create an app – New Registration

Provide a name for your app. e.g. DataFactoryDataLakeApp

Grant your Registered App permissions to Azure Storage

This will enable your app to authorise Requests to the Storage Account With Azure Active Directory (AD)

You can get to your app by going to Azure Active Directory

Then App Registrations and choose the App

In your new App, go to Overview and View API Permissions

Next go to Add a permission

Go to Azure Storage API which contains Data Lake Gen 2

Notice that we are setting up Delegated Permissions for Azure Storage

You are warned that Permissions have been changed and you need to wait a few minutes to grant admin consent.

I am not an admin so I always get my admin to go into Azure Active Directory and Grant Admin Consent for Peak Indicators

Note that your app now has configured permissions for Azure Active Directory Graph and Azure Storage

Assign your new app to a subscription

Now you have an app, you need to assign it Contributor status at the level of service you require in Azure: subscription level, resource group level or resource level.

For this app I am going to set it up against the subscription. First go to the Subscription you want to add it to and then Access Control (IAM)

I have added the app as a contributor

Creating a Key Vault

We will be selecting and creating IDs in the next steps, but instead of simply remembering your secret, why not store it in a Key Vault?

  • Centralise Application Secrets
  • Store Secrets and Keys Securely
  • Monitor Access And Use

Let's set one up in our Proof of Concept area.

Create a Key vault if you don’t have one already

Remember to add any tags you need before Review + Create

Once completed you can go to the resource (E.g. Data Factory) but for the time being that is all you need to do

Application ID and Tenant ID

You can now go into your new app in Azure (App registrations) to get more details for Data Factory (When you set up the connection)

Tenant from Data Factory will be mapped to Directory (Tenant ID) from the App Overview

Service Principal ID from Data Factory will be mapped to Application (Client) ID From the App Overview

Create a Client Secret

Next, create your Client Secret.

In your App go to Certificates and Secrets

Click New Client Secret

I'm going to allow this secret to expire in a year (anything using the app will then start to fail, so you would need to set a new secret and re-authorise)

We can add this into the Key Vault so we don't lose it, because once you have finished here you don't see it again.

Open a new Azure Window and Go to your new Key Vault

Go to Secrets

Click + Generate/Import

Notice I have set the expiration date to match the expiry date of the app

Ensuring the Access is set for the Data Lake Storage

For this you need to have a Data Lake Gen 2 set up and Microsoft Azure Storage Explorer downloaded

In Microsoft Azure Storage Explorer, navigate to the storage

Then Right click on the File System (In this case factresellersales) go to Manage Access and add the app.

Notice that we have set Read Write and Execute for the app on the file system and all the files will inherit these permissions

Adding The Data Lake Gen 2 Connector in Data Factory (Test)

I have a Data Lake Gen 2 with some files and I want to move them into a SQL Data base.

To test, Open or create a Data Factory

Go into Author and Monitor. Then Author

Go to Connections, +New and Choose Azure Data Lake Gen 2

Tenant = Directory (Tenant ID) from the App Overview

Service Principal ID = Application (Client) ID From the App Overview

Service Principal Key (you can get it from Azure Key Vault: click on Secrets, then the name and the current version)

You can then copy the secret value and add it into Data Factory
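Put together, the linked service JSON boils down to something like this sketch (the name and URL are placeholders, and the secret value can be pasted in or pulled from Key Vault):

{
    "name": "ADLSGen2_LS",
    "properties": {
        "type": "AzureBlobFS",
        "typeProperties": {
            "url": "https://<storage account name>.dfs.core.windows.net/",
            "tenant": "<Directory (tenant) ID from the app Overview>",
            "servicePrincipalId": "<Application (client) ID from the app Overview>",
            "servicePrincipalKey": { "type": "SecureString", "value": "<client secret value>" }
        }
    }
}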

Test your Connection

Create the Data Lake Data Set

Here is where you know that all your efforts were worthwhile.

Create a new Dataset which will be an Azure Datalake Gen 2

This is great. I have access to the files in the data lake. Achievement unlocked.

Adding a Logic App to Refresh your Analysis Services Data

You have a shiny new Analysis Services model containing all the data you need for your Power BI reports and Dashboards

It has been added into Analysis Services because (a) there is a lot of data, possibly too much for a Power BI import, (b) you want to do incremental processing without needing Power BI Premium, and (c) you want to refresh more than 8 times a day.

Everything is all set up but the final part of the puzzle is how to schedule the refresh into the Analysis Services Tabular model.

It's easy enough in Power BI: you simply set the schedule in the Service. Let's have a look at how to do this using one of the options, Logic Apps

Create a Service Principal (SPN)

First we need to create a new Azure Active Directory application and service principal to use with role-based access control. Whenever you have code that needs to access and/or modify resources (the Logic App will refresh the Analysis Services data) you need to create an identity for the app.

Sign into Azure

Go to Azure Active Directory

Next go to App Registrations and + New registration

The web URL is simply one set like the example in the documentation

Then click Register

Assign the Application to a role

Before you assign resources, the app must be assigned a role. For this example we are going to assign a role at the subscription scope for our Proof of Concept subscription

Go to All Services and then Subscriptions

I'm going to select the Proof of Concept subscription

Next select Access Control and add Role assignment

Note I have selected the new analysisservicesapp

Click save to add the Role Assignment

The Service Principal is set up. Now we need to get the values so we can sign into the app

Get Sign in Values

Go to Azure Active Directory

Go to App Registrations again and select the new application

Copy the Tenant ID for later use

Copy the Application ID for later use

Set up your Credentials

I was going to create a certificate but I found it an extremely complicated process, and in the end I couldn't export the created certificate in PowerShell

Because I haven’t been able to Create and Export a certificate I am going to use Client Secrets instead

Click on New Client Secret

Once added, you need to copy the secret value for later use. You won't be able to access this information again so DON'T LOSE IT

Give the Service Principal Name Authority to administer the Analysis Services

Still in your App click on API Permissions

Next View API Permissions

+ Add a permission and find Azure Analysis Services in APIs my organisation uses

Ensure that the Permission allows you to read and write all models (Tick the box)

then Add permissions

Note that although permissions have changed we still need admin consent

Grant Admin Consent

Granting admin consent requires you to sign in as global administrator, an application administrator, or a cloud application administrator.

From Azure Active Directory go to App registrations and select the app

Within the app, go to API permissions

Grant admin consent

Configure Permissions in Azure Analysis Services

Next we need to ensure the new service principal we created has server administrator permissions in Analysis Services

Open SQL Server Management Studio by right clicking and choosing Run as administrator, then connect to the Analysis Services server

Add in your User name (In my case the email address) but not the Password.

You add your Office 365 password when you sign into your Account (Next)

Right click on the Analysis Services name and go to Properties, Then Security

Click add, then Search for the app that has been created. This can then be added as a server administrator (Allows it to add models, amend models, refresh data etc)

Click OK

Create the Logic App

Finally, we can create the Logic App in Azure

The Logic App will be triggered by an HTTP request, which will be called by Data Factory, the Azure orchestration tool

In Azure Go to Logic Apps and Add

Then New Step and search for HTTP and then HTTP

Method

Post. Post appends the form data inside the body of the HTTP request

URI (Uniform Resource Identifier)

URI = https://<your server region>.asazure.windows.net/servers/<aas server name>/models/<your database name>/refreshes

Here is my example:

https://ukwest.asazure.windows.net/servers/adventureworksas/models/AdventureWorksV2/refreshes

The Server is the Analysis Services created in Visual Studio.

The Model is the model that we want to refresh. There may be multiple models on the server

Headers

As per the example, I have set the Content-Type header to application/json

Queries

Nothing set

Body

This is where you set up all the processing information

{
    "Type": "Full",
    "CommitMode": "transactional",
    "MaxParallelism": 2,
    "RetryCount": 2,
    "Objects": [
        { "table": "DimCustomer" },
        { "table": "DimDate" },
        { "table": "DimEmployee" },
        { "table": "DimProduct" },
        { "table": "DimProductCategory" },
        { "table": "DimProductSubCategory" },
        { "table": "DimSalesTerritory" },
        { "table": "FactResellerSales" }
    ]
}

Type: Full

The type of processing to perform. Refresh command types can be:

  • full – Process the Entire Model and recalculate all the dependents (Measures, columns etc)
  • clearValues – Clear values in the objects and dependents
  • calculate – recalculate your formulas
  • dataOnly – refresh the data in the objects, does not force recalculation
  • automatic – If the object needs refreshing and recalculating then do so.
  • add- Append data to the partition and recalculate dependents (measures, columns etc)
  • defragment – Defragmentation option will clean up values in dictionaries that are no longer used

CommitMode: transactional

Determines if objects are committed in batches or when complete. Modes include

  • default
  • transactional
  • partialBatch

It's always a good idea to set up partitions within your data, especially on the fact table, to process the data

However, the above is an example specifying what to process and when. You can simplify the code if you wish

{
    "CommitMode": "transactional",
    "MaxParallelism": 2,
    "RetryCount": 2,
    "Type": "Full"
}

The following is an example of adding Partitions into your code. In this example, DimCustomer is one Partition.

"table": "DimCustomer",
"partition": "DimCustomer"

The fact table can be made up of many partitions. For example, you could have one for each year. Then you can decide which partitions to process; for example, you may only want to process the current year's data.
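For example, a body that only refreshes a couple of yearly partitions on the fact table might look like this (the partition names are made up; use whatever partitions exist in your model):

{
    "Type": "dataOnly",
    "CommitMode": "transactional",
    "MaxParallelism": 2,
    "RetryCount": 2,
    "Objects": [
        { "table": "FactResellerSales", "partition": "FactResellerSales2019" },
        { "table": "FactResellerSales", "partition": "FactResellerSales2020" }
    ]
}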

Authentication

Active Directory OAuth

Tenant

Use the Tenant ID that we collected previously

Audience

https://*.asazure.windows.net

Client ID

Use the Client ID that we collected Previously

Credential Type

Secret (Remember that we added the Secret to the App)

Secret

The secret value that we saved (remember, for this one you can't go back to Azure and look at the secret; it is shown one time only)

Save and test the Logic App by Clicking Run

Consume Logic App with Azure Data Factory

We now have a Logic App that will incrementally process the data. We now need to schedule this by creating a Data Factory to orchestrate the processing

First we need to copy the HTTP POST URL from the Logic App's "When a HTTP request is received" trigger

And then create a new Data Factory in Azure

For the time being I am not enabling Git

In Author and Monitor, create a pipeline and drag a Web activity across to the design pane.
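The Web activity boils down to something like this sketch (the URL is the HTTP POST URL copied from the Logic App trigger; the empty body is only there because Data Factory expects a body on a POST):

{
    "name": "RefreshAASModel",
    "type": "WebActivity",
    "typeProperties": {
        "url": "<HTTP POST URL copied from the Logic App trigger>",
        "method": "POST",
        "headers": { "Content-Type": "application/json" },
        "body": "{}"
    }
}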

And then set up the schedule by adding a trigger

This should now be enough to process all your data into the Analysis Services model, which has a live connection into Power BI.

What we need to do is check that this is doing its job. I will be looking at this in a later blog post

Power BI Service Data Lineage View

I was logging into Power BI this morning when I saw this exciting new feature

We are always looking at new solutions to provide good data lineage, so this is well worth a look

Data lineage includes the data origin, what happens to it and where it moves over time. Data lineage gives visibility while greatly simplifying the ability to trace errors back to the root cause in a data analytics process. 

Wikipedia

I have an App workspace set up for Adventureworks so lets have a look at Lineage using this project

Column 1 is my data source. I can see I’m using a local database and I’m also using an xlsx spreadsheet to bring in data.

In most of my projects I'm working on the ETL in Data Factory, transforming data in stored procedures etc. For example, for a social media feed, I have a Logic App that moves tweets to an Azure Table Storage NoSQL table. Data Factory then transfers this data across into a central Azure Data Warehouse. The Power BI lineage would pick up at the data warehouse stage; it won't take into account that there is a lot of work previous to this

Column 2 is the data set in Power BI

Column 3 provides Report information

Column 4 displays the Dashboards

You can click on a data flow node to drill down into more detail

Currently you can't go any further to look at the data items

Click on the Link icon to see the data flow for that item. In this case the Report.

This is a great start but there definitely needs to be more information here to make it something that you would want to use as a proper Data Lineage Tool

  • It would be good to see the fields in each Entity for the Data Sets
  • As an extra, it would be great to see what fields are being used in Measures and calculated Fields
  • Reports – For me, I'd like to know for every page in my report
    • What field am I using from the data source
    • What calculated columns I have created (Even better with the DAX Logic)
    • Any Name changes from Data Source to Power BI
    • What measures I have created (Even better with the DAX Logic)
  • For the Dashboard, What items I am using in the dashboards (Fields, Measures, Calculated Columns
  • An important part of data lineage is getting an understanding of the entire process. This includes data transformations pre Power BI. If you can't do that in here, it would be great to be able to extract all the information out so you can use it in some way with your other lineage information to provide the full story. For example:

Azure Data Catalogue

Azure Data Catalog is a fully managed cloud service. Users can discover and consume data sources via the catalog, and it is a single, central place for all of the organisation to contribute to and understand all your data sources.

https://eun-su1.azuredatacatalog.com/#/home

I have already registered Our Data Catalog, and I have downloaded the desktop app

As an Example I want to connect to Azure Table Storage (Connect using Azure Account name and Access Key)

At this point I'm registering everything in the storage table. Then I can view the information in the Azure portal.

You can add a friendly Name, description, Add in expert (in this case me). Tags and management information

I have added Data Preview so you can view the data within the object. There is also documentation and column information to look at

In the data catalog you can manually add lots of description to your tables along with documentation.

This is great for providing lots of information about your data. You can explore databases and open the information in other formats (great if you need to supply information to another data lineage package)

I will be having a look at the Azure Data catalog in more detail later to see how it could help to provide full data lineage

Azure Data Factory

Data Factory is the Azure ETL orchestration tool. Go into Monitoring for lineage information. However, there doesn't seem to be a way to export this information to use. Data Factory won't take into account the work done in, for example, a stored procedure

Again this is another area to look into more.

Stored Procedures

When you use stored procedures to transform your data, it's harder to provide automated lineage on your code. There are automated data lineage tools for SQL out there, but it would be great if there was a specific tool within Azure that creates data lineage information from your stored procedures

Azure Logic Apps

Data for my project is collected via Logic Apps before being Processed into an Azure Data Warehouse.

Essentially, we need our data lineage to capture everything all in one place.

And just as important, everything should be as automated as possible. If I quickly create a measure, the data lineage should reflect this with no manual input needed (unless you want to add some description to the new measure as to why it was created)

Azure – Data Factory – changing Source path of a file from Full File name to Wildcard

I originally had one file to import into a SQL database: Survey.txt

The files are placed in Azure blob storage ready to be imported

I then use Data Factory to import the file into the sink (Azure SQL Database)

However, the data actually arrives in one file per year. For the full logic I need to be able to add a file to the blob storage to get it imported, and each file name will contain the year.

This means I need to change the Source and Pipeline in Data Factory

First of all, remove the file name from the file path. I used one file to set up the schema. All files are the same, so this should be OK.

Next I go to the pipeline and set up the wildcard in here: Survey*.txt

When the pipeline is run, it will pick up all the files matching the wildcard, for example Survey*.txt
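In the copy activity source, the wildcard setting ends up as something like this sketch (assuming a delimited text source over Blob storage):

"source": {
    "type": "DelimitedTextSource",
    "storeSettings": {
        "type": "AzureBlobStorageReadSettings",
        "recursive": true,
        "wildcardFileName": "Survey*.txt"
    },
    "formatSettings": { "type": "DelimitedTextReadSettings" }
}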
