I was logging into Power BI this morning when I saw this exciting new feature
we are always looking at new solutions to provide good data lineage so this is well worth a look
Data lineage includes the data origin, what happens to it and where it moves over time. Data lineage gives visibility while greatly simplifying the ability to trace errors back to the root cause in a data analytics process.Wikipedia
I have an App workspace set up for Adventureworks so lets have a look at Lineage using this project
Column 1 is my data source. I can see I’m using a local database and I’m also using an xlsx spreadsheet to bring in data.
In most of my projects I’m working on the ETL in Data factory, transforming data in Stored Procedures etc. for example, for a social media feed, I have a logic app that moves tweets to an Azure Data Storage NOSQL table. Data Factory then transfers this data across into a central Azure Data Warehouse. The Power BI Lineage would pick up at the data Warehouse stage. It wont take into account that there is a lot of work previous to this
Column 2 is the data set in Power BI
Column 3 provides Report information
Column 4 displays the Dashboards
You can click on a data flow node to drill down into more detail
Currently you cant go any further to look at the data items
Click on the Link icon to see the data flow for that item. In this case the Report.
This is a great start but there definitely needs to be more information here to make it something that you would want to use as a proper Data Lineage Tool
- It would be good to see the fields in each Entity for the Data Sets
- As an extra, it would be great to see what fields are being used in Measures and calculated Fields
- Reports – For me, Id like to know for every page in my report
- What field am I using from the data source
- What calculated columns I have created (Even better with the DAX Logic)
- Any Name changes from Data Source to Power BI
- What measures I have created (Even better with the DAX Logic)
- For the Dashboard, What items I am using in the dashboards (Fields, Measures, Calculated Columns
- An Important part of data lineage is getting and understanding of the entire process. This includes data transformations pre Power BI. If you cant do that in here, it would be great to be able to extract all the information out so you can use it in some way with your other Linage information to provide the full story. for example:
Azure Data Catalogue
Azure Data Catalog is a fully managed cloud service. Users can discover and consume data sources via the catalog and is a single , central place for all the organisation to contribute and understand all your data sources.
I have already registered Our Data Catalog, and I have downloaded the desktop app
As an Example I want to connect to Azure Table Storage (Connect using Azure Account name and Access Key)
At this point I’m registering everything in the storage table. then I can view the information in the Azure Portal.
You can add a friendly Name, description, Add in expert (in this case me). Tags and management information
I have added Data Preview so you can view the data within the object. there is also documentation and Column information to look at
In the data catalog you can manually add lots of description to your tables along with documentation.
This is great for providing lots of information about your data . You can explore databases and open the information in other formats (Great if you need to supply information to another Data lineage package
I will be having a look at the Azure Data catalog in more detail later to see how it could help to provide full data lineage
Azure Data Factory
Data factory is the Azure ETL Orchestration tool. Go into Monitoring for Lineage Information. However, there doesn’t seem to be a way to export this information to use. Data Factory wont take into account the work done in, for example a stored Procedure
Again this is another area to look into more.
When you use Stored Procedures to transform you data, its harder to provide automated Linage on your code. There are automated data lineage tool for SQL out there, but it would be great if there was a specific Tool within Azure that creates Data Lineage information from your Stored Procedures
Azure Logic Apps
Data for my project is collected via Logic Apps before being Processed into an Azure Data Warehouse.
Essentially, we need out data lineage to capture everything all in one place.
And just as important. everything should be as automated as possible. If I quickly create a measure, the data lineage should reflect this with no manual input needed (Unless you want to add some description to the new measure as to why it was created)