Use Data Lake Storage V2 as Dataflow Storage

This blog post follows on from https://debbiesmspowerbiazureblog.home.blog/2019/11/28/setting-up-an-azure-data-lake-v2-to-use-with-power-bi-data-flows-in-service-as-a-data-source/

Dataflows are essentially an online collection and storage tool. Power Query Connects to data at source and collects and transforms that data. The dataflow then stores the data in a table within the cloud. They are stored in Data Lakes which is all automated for you.

Dataflows unify data from all your different sources. It should be noted that a Data Warehouse is still the recommended architecture with Data Flows over the top.

Dataflows also introduce the concept of the Common Data Service (CDS) and the Common Data Model (CDM). CDM allows organisations to use data formats to provide consistency across deployments. Now we have Azure Data Lake gen2 storage can be combined with data flows to store the data flows to provide and structured centralised data source.

Thanks to https://docs.microsoft.com/en-us/common-data-model/use for helping me understand the differences between the two

When you integrate CDM with Data Lake Gen 2 you get structural consistency and you can use CDM Folders in the lake that contain your schemas in standard CDM format.

Dataflow definitions and data are stored in Model.json format. If you have a Model.json file, it shows that you are compliant with CDM.

Dataflows store their definition and data in CDM folders, in the Model.json formats. If you have Model.Json it shows you are adhering to CDM.

Of course, this can be quite difficult when you are working with data that does not adhere to CDM format. I can see it being really useful when you are starting from scratch but I have done some work looking at my data sources and they are always quite far from CDM format.

You can find information about the CDM here https://docs.microsoft.com/en-us/common-data-model/

And more information about CDS here

https://docs.microsoft.com/en-us/powerapps/maker/common-data-service/data-platform-intro

Advantages to setting up your own Data Lake Gen2 Store

Essentially, you can get by using the internal Data Lake Store but there are lots of reasons (And some of these advantages will be coming in future releases) why setting your own store up is a good thing.

  • Great for Re-use, If you are given access to the workspace you can use the dataflows already created for you.
  • Centralise your prepped data

Remember this doesn’t replace your data warehouse. It just adds a preparation and transformation layer above the data warehouse. Instead of having to wait to get your small change added to the warehouse you can add it to this layer.

  • Data within your data lake can be integrated into other solutions by developers
  • Data Lake Gen 2 is enormously scale-able for data
  • Dataflow data and definition file can be leveraged by developers for AI Services

Steps to Set Azure Data Lake Gen2 as Dataflow Storage

  • The storage account must be created in the same Azure Active Directory tenant as the Power BI tenant.
  • The storage account must be created in the same region as your Power BI tenant. To determine where you Power BI tenant is located
  • The storage account must have the Hierarchical Name Space feature enabled.
  • Power BI service must be granted a Reader role on the storage account.
  • A File system named powerbi must be created.
  • Power BI services must be authorized to the powerbi filesystem you create.

Once configured, it cant be changed. You cant go back to the default data store for your dataflows

Set Storage Account in Same Region as your Power BI tenant

Log into Power BI service , then Help ? and About Power BI

The Data is Stored in North Europe Ireland

When you set up your Data Lake Ensure North Europe region is selected

Set up Azure Data Lake V2 in Azure Portal

When you go into Azure Currently and look for Data Lake V2 you can only Find Gen 1

So the question is, how do you set up Gen 2 in Azure Portal? (Currently we are on the 25th November 2019. Hopefully this may get easier in the future)

First of all I go to the Subscription where I want to add the new data Lake v2

Open up the Portal menu (Now hidden to the left of the screen)

Choose Create a resource

next Choose Storage and Storage Account

Note that the Account kind is Storage V2 (General Purpose)

Make sure Location is the same as Power BI Service. I’m not using this functionality at the moment but there is no harm in applying this logic.

It is recommended to set replication setting to Read-access geo-redundant storage (RA-GRS)

For the time being, I am leaving everything else as standard

Next go to Advanced

the most important setting here is the Data Lake Storage Gen 2 . Enable the Hierarchical namespace and your storage account will now be created as data Lake Storage V2

Click Review and Create

Grant Reader Role to Power BI Service

This is all built in so it should be fairly straightforward.

In Azure go to your new storage account (If you aren’t already in it) and go to Add role Assignment

One there, choose the reader role and select Power BI Service which as you can see is already in the list.

It takes about 30 minutes for this to take effect.

Create a powerbi file System

Power BI Needs to use a Filesystem specifically named powerbi  so data flows can be stored in this specific file system.

We now have a few options available to us. I have some files to add so I am going to add them to a container

Click on Containers And then + File System

Note that to store dataflows its called powerbi Click OK

clicking on your new storage account(s) you are told to Download Azure Storage Explorer.

I already have this Azure Storage Explorer downloaded. If you don’t have this, its something you will absolutely need to work with Azure Storage accounts.

Once downloaded Open Azure Storage Explorer

You will need to Add in your Azure Storage Accounts by clicking the little connector icon

You will be asked to sign into your Account with your Office 365 credentials and 2fa authentication

This will log you into all your Subscriptions and Services

You are good to go

Here you find your subscription, Then go to the data Lake Storage Gen 2 and find the new File system powerbi.

Grant Power BI permissions to the file system

Before we connect we need to grant permission for Power BI to use the File System (Again this is specific to using DLV2 as a data flow store but at this point we may as well set up the permissions)

go to Azure Portal and Azure Active Directory

then select Enterprise Applications

Change the Application type Drop down to All Applications

Power Query Online and Power BI Premium and Power BI Service are in the list.

You will need the Object IDs of these applications.

Back to Azure Storage Explorer (Manage Access)

Navigate back to powerbi file system, Right click and Choose Manage Access

Click Add, Grab the object ID of Power BI Service to Manage Access

Set Read, Write and Execute Access to Service and Repeat the Process for Power BI Premium

Repeat for Power Query Online but Set Write and Execute Access

Other Also needs setting up as follows

Connect the datalake Gen Storage Account to Power BI Dataflows

To do this you need to be Power BI Admin. go to Power BI Service and navigate to the Admin Portal

From here Connect your Azure Data Lake Storage Gen2.

Add your Subscription ID, Resource group Name and Storage Account name of your Data Lake

It is now connected to Power BI

Allow Admins to Assign Workspaces

Finally, still in Admin Portal, go to dataflow Settings

Switch Allow Workspace admins to assign workspaces to this storage account to On

Workspace admins can now assign workflows to the filesystem created

Things to Consider

Here is where it starts to get a little bit hazy

  • This is all very much still in preview and there will be lots of updates coming
  • Once your dataflow storage location is configured it Cant be changed so dont do this on a whim.
  • You have to be an owner of the dataflow or be authorised to the CDM folder in the data lake to use the data flow
  • Once you have created a dataflow you ant change the storage location
  • It is your organisations data flow so there can only be one.

because of this, and the fact that its still in development I am going to wait to set up a central storage account for our workflows.

Im still unsure what you would do with Workflows that are already set up, Do they stay in the default area or can you reprocess them into the central data lake.

What happens if you want to move the data lake to a new subscription? is it not possible?

I will be going back to this when I have a few more answers to these questions