Difference between dataflows, datamarts and datasets
Let's have a quick look at the history of the dataset.
Here we see everything in one pbix file. Only one person can work with the file at any one time, and we can't reuse anything or work on anything separately. Our dataset lives inside the one pbix file. With Import mode the dataset is held in the Power BI columnar data storage; with DirectQuery the data stays at the source.
The only use case for this now would be simply working on your own small projects outside of a team environment, on a Pro or even a Power BI Free license.
Here we can see that the dataset is now separate from the dataflow (the data transformation) and the actual reporting pbix files. The dataset is the model and the measures.
This is the setup we currently use, although our main transformations sit outside of this, within the SQL database.
Dataflows are packaged, ETL-style transformations. We package the transformation logic into a dataflow so it can be reused. They are really good for reusable dimensions, like Dates, Locations, etc.
They hold the individual entities that you bring together later on in the process.
Dataflow data sits in a data lake, so you can use it for machine learning tasks really easily. This is one of the big wins for dataflows.
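To make the "reusable dimension" idea concrete, here is a rough sketch of the kind of table a shared Date dataflow typically produces. In a real dataflow you would build this in Power Query (M); pandas is used here purely for illustration, and the column names are my own choices, not a Power BI convention.

```python
import pandas as pd

def build_date_dimension(start: str, end: str) -> pd.DataFrame:
    """Build a simple, reusable Date dimension covering start..end inclusive."""
    dates = pd.date_range(start=start, end=end, freq="D")
    return pd.DataFrame({
        "DateKey": dates.strftime("%Y%m%d").astype(int),  # surrogate key, e.g. 20220101
        "Date": dates,
        "Year": dates.year,
        "MonthNumber": dates.month,
        "MonthName": dates.strftime("%B"),
        "DayOfWeek": dates.day_name(),
        "IsWeekend": dates.dayofweek >= 5,  # Saturday = 5, Sunday = 6
    })

dim_date = build_date_dimension("2022-01-01", "2022-12-31")
```

Because a dimension like this never changes shape, building it once in a dataflow and reusing it across datasets is exactly the win being described.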
But can you do all of your transformations in them?
Some of the Power Query transformations can be really time-consuming and memory-intensive, especially when you are trying to create a star schema from transactional tables across lots of separate data sources.
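The expensive step being described is the OLTP-to-star-schema reshape: splitting a flat transactional extract into dimensions plus a fact table. A minimal sketch of that reshape, with entirely made-up column names and pandas standing in for Power Query:

```python
import pandas as pd

# A flat transactional extract (all names here are illustrative only)
sales = pd.DataFrame({
    "OrderId":  [1, 2, 3],
    "Product":  ["Bike", "Bike", "Helmet"],
    "Category": ["Cycling", "Cycling", "Safety"],
    "Amount":   [250.0, 300.0, 45.0],
})

# Dimension: one row per distinct product, with a surrogate key
dim_product = (sales[["Product", "Category"]]
               .drop_duplicates()
               .reset_index(drop=True))
dim_product["ProductKey"] = dim_product.index + 1

# Fact: the transactions, with dimension attributes swapped for the key
fact_sales = (sales.merge(dim_product, on=["Product", "Category"])
                   [["OrderId", "ProductKey", "Amount"]])
```

Doing this de-duplication and joining across many large source tables is what makes the refresh so heavy, which is why pushing it upstream is recommended below.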
You also need to think about Premium versus Pro, because there are certain things you can't do in a Pro dataflow as they need Premium in-lake compute (append and duplicate, for example).
If you do all of this in your pbix file, it can easily grind the file to a halt. Moving it to a dataflow means the work can be done at a different time, and you refresh your pbix file with work that has already been done.
However, even this can be too much. Imagine you are developing: you have to go to the dataflow and refresh it, and Power BI has to grind through all the steps, which can be really complicated. It is also very easy to go wrong, backtrack by creating more steps, and leave the incorrect steps in, building up a great number of applied steps. All of those steps have to be refreshed, even the wrong ones.
It is still recommended to do the heavy processing work outside of Power BI, say with Azure (Data Factory and a SQL Database).
Then, when developing in the dataflow, you can do things quickly, and the logic can be transferred to the SQL Database at another time, still allowing the user to develop quickly.
The new Premium feature announced at Microsoft Build, May 2022
The self-service database. It doesn't replace the data warehouse.
Datamarts allow you to combine and work with data from all sources in a single place.
Datamarts replace the step we would previously have called the shared dataset.
We would have a pbix file where we would bring in the dataflow (which is used over the SQL datamart, with all the friendly naming done in the dataflow).
The shared dataset contains the model and the measures (I don't use calculated columns, as they can bloat the model).
The pbix file would be published to the service, then report pbix files are created over the top of the published dataset. In this example there are two pbix files.
Datamarts allow you to just have the one report pbix file instead.
Premium or PPU only. As a user, you have to understand that this isn't a feature we can use with Power BI Pro.
Datamarts are about self-service data analytics, bridging the gap between business users and IT. How do we create the data warehouse without having to go to central IT?
No Code Low Code
But does it mean you don’t have to create your database and ELT inside Azure?
There is still the need to create full enterprise solutions with SQL datamarts and warehouses.
Just like with dataflows, transforming from OLTP (or data sources that aren't even OLTP, just scattered data sources) to an OLAP schema can be very memory- and processing-intensive.
For large, more complex projects, creating a datamart with better control and governance should still be done before Power BI.
So what other use cases and plus points are there for the datamart?
Another good use case for the datamart is that datamarts refresh the dataflow and then the dataset. There is no need to use APIs to trigger the dataset refresh straight after the dataflow, or to set up separate refreshes in Power BI for both while guessing how long the dataflow will take to run.
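For comparison, this is roughly the manual orchestration the datamart removes. The two `/refreshes` endpoints shown are the documented Power BI REST API refresh endpoints for dataflows and datasets, but the group/artifact IDs are placeholders, and token acquisition and polling are only sketched in comments:

```python
# Manual refresh chaining via the Power BI REST API: trigger the dataflow
# refresh, wait for it to finish, then trigger the dataset refresh.
# All IDs below are placeholders; a real call also needs an AAD bearer token.
BASE = "https://api.powerbi.com/v1.0/myorg"

def refresh_url(group_id: str, artifact: str, artifact_id: str) -> str:
    # artifact is "dataflows" or "datasets"; both expose a /refreshes endpoint
    return f"{BASE}/groups/{group_id}/{artifact}/{artifact_id}/refreshes"

dataflow_refresh = refresh_url("my-group-id", "dataflows", "my-dataflow-id")
dataset_refresh  = refresh_url("my-group-id", "datasets",  "my-dataset-id")

# In practice: POST to dataflow_refresh, poll the dataflow's refresh history
# until it completes, then POST to dataset_refresh (e.g. with requests).
```

With a datamart, this chaining happens for you, which is exactly the point being made above.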
Our Datamart users
This is a great option for people who use Macs and can't use Power BI Desktop. It also enables a SQL endpoint for you.
Datamarts are geared towards self-service: the citizen data analyst.
“a person who creates or generates models that leverage predictive or prescriptive analytics but whose primary job function is outside of the field of statistics and analytics.” (Gartner)
Would you use the Datamart in an Enterprise setting?
In an enterprise setting you have data engineers and developers, and you will have a BI team as well as analysts. There is a place for the datamart in the self-service "bronze" approach, still with the aim of moving to the more governed approach of having the logic held centrally in a SQL Database.
Our analysts creating self-service content probably aren't creating star schemas or fully attempting to transform within the dataflow; this will still need to be done by the BI developers.
However, it's probable that without the datamart, all the relationships and measures were created inside one pbix file, and there may not be a SQL Database layer at all, just datasets created from files and so on.
The datamart allows for a better-governed, blended approach.
Would a Developer or a data engineer use a datamart?
The BI developers and data engineers are probably working outside of Power BI, in the SQL Database and with Data Factory or other ETL packages. However, they can now leverage the datamart features if they want to look at the data for a project quickly.
The Datamart model
So how does this change our dataset and dataflow models above?
We can see how the datamart unifies the dataflow and the dataset that is usually created in the shared pbix file. It also raises lots of questions.
- Do we still create dataflows separately?
- What is this new layer, the SQL Database?
- If we have our Datamart in SQL do we need to use the datamart in Power BI?
The Datamart SQL Database layer
Dataflows store their data in a data lake.
Datamarts are stored in an Azure SQL Database. You will hear this being called the data warehouse. When we think of the data warehouse, we think in terms of star schemas.
If your logic is complex and the datasets are large, it's always best to use technologies outside of Power BI (Data Factory, SQL Database).
The data warehouse being spoken about here is simply data storage, like the staging layer in an Azure SQL database. Our users here are probably not users who understand how to create OLAP schemas, so you can see this as your staging layer.
Then you have the dataset layer with the relationships and calculations, so the SQL layer is the middle layer between the dataflow and the dataset.
But what can you do with the SQL Layer and what can’t you do?
You can't write DDL (ALTER, CREATE) or DML (INSERT, UPDATE, DELETE, etc.) queries, just DQL (SELECT).
So you can't write stored procedures or do any real transformations within SQL; this still has to be done in the dataflow. You can only query it.
The SQL Database is not part of the transform layer
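To illustrate the read-only restriction, here is a tiny client-side guard that mirrors the rule stated above (the endpoint itself enforces this server-side; the keyword check and table names below are my own illustration, not part of Power BI):

```python
# The datamart SQL endpoint only accepts DQL (SELECT); DDL and DML are
# rejected. This guard simply mirrors that rule for illustration, and
# the table names in the examples are made up.
def allowed_on_datamart(sql: str) -> bool:
    # A statement is only runnable on the endpoint if it is a SELECT
    return sql.strip().split()[0].upper() == "SELECT"

queries = {
    "SELECT TOP 10 * FROM model.DimDate": True,      # DQL: allowed
    "UPDATE model.DimDate SET Year = 2022": False,   # DML: blocked
    "CREATE TABLE staging (id INT)": False,          # DDL: blocked
}
results = {q: allowed_on_datamart(q) for q in queries}
```

In other words: querying yes, transforming no, which is why the SQL layer sits outside the transform story.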
How to set up the Datamart in Power BI service
At the moment you can set this in the admin tenant settings. You either allow the whole organisation to use datamarts or no one. Hopefully this will change soon so you can allow a small group of users to test the functionality.
I will do another example post soon, but basically you can create the model (really great for Mac users who can't use Desktop).
You can also write measures in DAX. My main concern here is that simple base measures are fine, but for complex ones I always test them against a visual, and you don't have the ability to do that here.
Also, you can't create calculated columns or calculated tables. This is a good thing: you don't want to be creating these anyway, as they bloat your model because they don't compress.
Behind the scenes, a managed SQL Server runs the SQL layer, and you still have the Power BI columnar data store layer for the dataset.
Row-level security can also be done here, at both the SQL layer and the dataset layer. (Both layers are created by applying security on the dataset as you usually would, but in the service, not in Desktop.)
Ad hoc analysis can be done by the user in Power Query over the SQL layer, and if you know SQL you can write T-SQL within Power Query too.
You can also take your SQL endpoint into SSMS, for example (read-only).
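Because the endpoint behaves like a read-only Azure SQL database, any SQL Server client can reach it. As a sketch, here is how a connection string for it might be assembled in Python; the server name is a placeholder (the real one is copied from the datamart's settings page), and the authentication keyword assumes Azure AD interactive sign-in, as SSMS uses:

```python
# Sketch: connecting to a datamart's SQL endpoint as if it were a normal
# (read-only) Azure SQL database. The server value is a placeholder; copy
# the real one from the datamart's settings in the Power BI service.
def endpoint_connection_string(server: str, database: str) -> str:
    return (
        "Driver={ODBC Driver 18 for SQL Server};"
        f"Server={server};"
        f"Database={database};"
        "Authentication=ActiveDirectoryInteractive;"  # AAD sign-in, as in SSMS
        "Encrypt=yes;"
    )

conn_str = endpoint_connection_string(
    "example-datamart.datamart.pbidedicated.windows.net",  # placeholder
    "my-datamart",
)
# With pyodbc installed: pyodbc.connect(conn_str), then SELECT queries only.
```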
You can manage roles in the datamart and assign internal and external users to a role, or share the endpoint with them if XMLA endpoints are on.
This is a really exciting new development for the self-service side of Power BI. We now need to understand where it sits, who our users are, and how we can apply it to projects.
At the moment, if you create reporting in the Power BI service you can't publish to other workspaces or tenants. That's where a separate pbix file comes in: it sits outside the service and can be republished to other tenants. How will the datamart help with this kind of functionality?
What are the future developments of the datamart going to be? For example, slowly changing dimensions, monitoring, version control?
Will this cost any more money over having a Preview license?
Will the SQL layer ever become part of the transform functionality?