This blog post follows on from https://debbiesmspowerbiazureblog.home.blog/2019/11/28/setting-up-an-azure-data-lake-v2-to-use-with-power-bi-data-flows-in-service-as-a-data-source/
Dataflows are essentially an online data collection and storage tool. Power Query connects to the data at source, then collects and transforms it. The dataflow stores the resulting data in tables in the cloud, backed by a data lake that is managed for you automatically.
Dataflows unify data from all your different sources. It should be noted that a data warehouse is still the recommended architecture, with dataflows sitting over the top of it.
Dataflows also introduce the concepts of the Common Data Service (CDS) and the Common Data Model (CDM). CDM allows organisations to use standard data formats to provide consistency across deployments. Azure Data Lake Storage Gen2 can now be combined with dataflows to provide a structured, centralised data store.
When you integrate CDM with Data Lake Gen2 you get structural consistency: CDM folders in the lake contain your schemas in the standard CDM format.
Dataflows store their definitions and data in CDM folders, with the metadata described in a Model.json file. The presence of a Model.json file indicates that the folder adheres to the CDM format.
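For reference, here is a trimmed, hypothetical sketch of what a Model.json might look like for a CDM folder containing a single entity. All the entity, attribute and partition names are made up for illustration; the real file that a dataflow generates contains more metadata.

```shell
# Write a trimmed, hypothetical Model.json for a CDM folder holding one entity.
# Entity and attribute names here are made up for illustration only.
cat > model.json <<'EOF'
{
  "name": "SalesDataflow",
  "version": "1.0",
  "entities": [
    {
      "$type": "LocalEntity",
      "name": "Customer",
      "attributes": [
        { "name": "CustomerId", "dataType": "int64" },
        { "name": "CustomerName", "dataType": "string" }
      ],
      "partitions": [
        { "name": "Customer-snapshot", "location": "Customer/part-00000.csv" }
      ]
    }
  ]
}
EOF

# Sanity-check that the file is valid JSON
python3 -m json.tool model.json > /dev/null && echo "model.json is valid JSON"
```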
Of course, this can be quite difficult when you are working with data that does not adhere to the CDM format. I can see it being really useful when you are starting from scratch, but having looked at my own data sources, they are always quite far from CDM format.
You can find information about the CDM here https://docs.microsoft.com/en-us/common-data-model/
And more information about CDS here
https://docs.microsoft.com/en-us/powerapps/maker/common-data-service/data-platform-intro
Advantages of setting up your own Data Lake Gen2 store
Essentially, you can get by using the internal data lake store, but there are lots of reasons (and some of these advantages are coming in future releases) why setting up your own store is a good thing.
- Great for re-use: if you are given access to the workspace, you can use the dataflows already created for you.
- Centralise your prepped data
Remember, this doesn’t replace your data warehouse. It just adds a preparation and transformation layer above the warehouse. Instead of having to wait to get your small change added to the warehouse, you can add it to this layer.
- Data within your data lake can be integrated into other solutions by developers
- Data Lake Storage Gen2 is enormously scalable.
- Dataflow data and definition files can be leveraged by developers for AI services.
Steps to Set Azure Data Lake Gen2 as Dataflow Storage
- The storage account must be created in the same Azure Active Directory tenant as the Power BI tenant.
- The storage account must be created in the same region as your Power BI tenant (see below for how to find where your tenant is located).
- The storage account must have the hierarchical namespace feature enabled.
- The Power BI service must be granted the Reader role on the storage account.
- A filesystem named powerbi must be created.
- The Power BI services must be authorised on the powerbi filesystem you create.
Once configured, it can't be changed. You can't go back to the default storage for your dataflows.
Set Storage Account in Same Region as your Power BI tenant
Log into the Power BI service, then choose Help (?) and About Power BI.
This shows where the data is stored; in my case, North Europe (Ireland).
When you set up your data lake, ensure the North Europe region is selected.
Set up Azure Data Lake V2 in Azure Portal
When you go into Azure currently and look for Data Lake, you can only find Gen1.
So the question is: how do you set up Gen2 in the Azure Portal? (At the time of writing, 25th November 2019; hopefully this will get easier in the future.)
First of all, go to the subscription where you want to add the new Data Lake Gen2.
Open up the portal menu (now hidden to the left of the screen).
Choose Create a resource.
Next, choose Storage and then Storage account.
Note that the Account kind is StorageV2 (general purpose v2).
Make sure the location is the same as your Power BI service. I’m not using this functionality at the moment, but there is no harm in applying this logic.
It is recommended to set the replication setting to read-access geo-redundant storage (RA-GRS).
For the time being, I am leaving everything else at the defaults.
Next, go to Advanced.
The most important setting here is Data Lake Storage Gen2: enable the hierarchical namespace and your storage account will be created as Data Lake Storage Gen2.
Click Review + create.
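If you prefer scripting to clicking through the portal, the same account can be sketched with the Azure CLI. This is just a sketch: the resource group and account names below are made up, and it assumes you are already signed in with az login.

```shell
# Hypothetical names: rg-powerbi (resource group) and mydataflowlake (account).
# --enable-hierarchical-namespace is what makes this a Data Lake Storage Gen2
# account; Standard_RAGRS matches the RA-GRS replication recommended above.
az storage account create \
  --name mydataflowlake \
  --resource-group rg-powerbi \
  --location northeurope \
  --kind StorageV2 \
  --sku Standard_RAGRS \
  --enable-hierarchical-namespace true
```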
Grant Reader Role to Power BI Service
This is all built in so it should be fairly straightforward.
In Azure, go to your new storage account (if you aren't already in it) and go to Add role assignment.
Once there, choose the Reader role and select Power BI Service, which as you can see is already in the list.
It takes about 30 minutes for this to take effect.
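The role assignment can also be scripted. Again a sketch, reusing the hypothetical names from above; note that depending on your Azure CLI version the service principal's object ID field is called objectId or id.

```shell
# Look up the object ID of the Power BI Service enterprise application.
# On newer CLI versions replace "[0].objectId" with "[0].id".
pbi_service_id=$(az ad sp list --display-name "Power BI Service" \
  --query "[0].objectId" -o tsv)

# Grant it the Reader role on the (hypothetical) storage account.
az role assignment create \
  --assignee "$pbi_service_id" \
  --role "Reader" \
  --scope "/subscriptions/<subscription-id>/resourceGroups/rg-powerbi/providers/Microsoft.Storage/storageAccounts/mydataflowlake"
```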
Create a powerbi filesystem
Power BI needs a filesystem specifically named powerbi so that dataflows can be stored in it.
We now have a few options available to us. I have some files to add, so I am going to add them to a container.
Click on Containers and then + File system.
Note that to store dataflows it must be named powerbi. Click OK.
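As a CLI alternative to the portal, a sketch of creating the filesystem (the account name is again hypothetical):

```shell
# Create the filesystem that Power BI requires; the name must be exactly
# "powerbi". Uses your signed-in Azure AD identity for authorisation.
az storage fs create \
  --name powerbi \
  --account-name mydataflowlake \
  --auth-mode login
```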
Clicking on your new storage account, you are told to download Azure Storage Explorer.
I already have Azure Storage Explorer downloaded. If you don’t have it, it’s something you will absolutely need when working with Azure storage accounts.
Once downloaded, open Azure Storage Explorer.
You will need to add your Azure storage accounts by clicking the little connector icon.
You will be asked to sign into your account with your Office 365 credentials and two-factor authentication.
This will log you into all your Subscriptions and Services
You are good to go
Here, find your subscription, then go to the Data Lake Storage Gen2 account and find the new powerbi filesystem.
Grant Power BI permissions to the file system
Before we connect, we need to grant permission for Power BI to use the filesystem. (Again, this is specific to using Data Lake Gen2 as a dataflow store, but at this point we may as well set up the permissions.)
Go to the Azure Portal and Azure Active Directory.
Then select Enterprise applications.
Change the Application type dropdown to All applications.
Power Query Online, Power BI Premium and Power BI Service are in the list.
You will need the object IDs of these applications.
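As a sketch, the object IDs can also be looked up with the Azure CLI rather than the portal:

```shell
# Print the object ID of each of the three enterprise applications.
# On newer CLI versions the field is "[0].id"; on older ones it is "[0].objectId".
for app in "Power BI Service" "Power BI Premium" "Power Query Online"; do
  echo -n "$app: "
  az ad sp list --display-name "$app" --query "[0].objectId" -o tsv
done
```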
Back to Azure Storage Explorer (Manage Access)
Navigate back to the powerbi filesystem, right-click it and choose Manage Access.
Click Add and use the object ID of Power BI Service to manage access.
Set Read, Write and Execute access for Power BI Service, and repeat the process for Power BI Premium.
Repeat for Power Query Online, but set only Write and Execute access.
Other also needs setting up, as shown below.
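The same permissions can be sketched with the CLI instead of Storage Explorer. The object IDs are placeholders, and the base entries (user::, group::, other::) are my assumption; check them against your own setup, since `access set` replaces the whole ACL on the path.

```shell
# Set the ACL on the root of the powerbi filesystem. rwx / -wx mirror the
# permissions described above; the base user::/group::/other:: entries are
# assumptions, and the <...> object IDs are placeholders to fill in.
az storage fs access set \
  --acl "user::rwx,group::r-x,other::--x,user:<powerbi-service-object-id>:rwx,user:<powerbi-premium-object-id>:rwx,user:<power-query-online-object-id>:-wx" \
  --path / \
  --file-system powerbi \
  --account-name mydataflowlake \
  --auth-mode login
```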
Connect the Data Lake Gen2 storage account to Power BI dataflows
To do this you need to be a Power BI admin. Go to the Power BI service and navigate to the Admin portal.
From here, connect your Azure Data Lake Storage Gen2.
Add the subscription ID, resource group name and storage account name of your data lake.
It is now connected to Power BI
Allow Admins to Assign Workspaces
Finally, still in the Admin portal, go to Dataflow settings.
Switch "Allow workspace admins to assign workspaces to this storage account" to On.
Workspace admins can now assign their workspaces to this storage account.
Things to Consider
Here is where it starts to get a little bit hazy
- This is all very much still in preview and there will be lots of updates coming
- Once your dataflow storage location is configured, it can't be changed, so don't do this on a whim.
- You have to be an owner of the dataflow, or be authorised to the CDM folder in the data lake, to use the dataflow.
- Once you have created a dataflow, you can't change its storage location.
- It is your organisation's dataflow storage, so there can only be one.
Because of this, and the fact that it's still in development, I am going to wait to set up a central storage account for our dataflows.
I'm still unsure what you would do with dataflows that are already set up. Do they stay in the default area, or can you reprocess them into the central data lake?
What happens if you want to move the data lake to a new subscription? Is that even possible?
I will come back to this when I have a few more answers to these questions.