Lessons learned whilst working with Microsoft Fabric – Part 2: Power BI and Semantic Modelling

In Part 1 of our initial lessons-learned-in-Fabric blog, we looked at the Data Engineering components. Our transformed dimension and fact tables are now stored in Delta Parquet format within the Fabric Gold Lakehouse.

Now it's time to look at what we can do with the Power BI functionality in Fabric. Let's build our semantic model and reports.

The Semantic Model

Previously, the semantic model was created in Power BI Desktop. We could then create another Desktop file and connect to the semantic model that had been published to the Power BI Service.

With Fabric, you create the semantic model in the Fabric workspace.

We are going to use a Direct Lake connection to the Delta Parquet files, which means we don't need to create dataflows. Let's look at how this worked previously.

Without the SQL database, and with Power BI Premium, dataflows can get complex, as the transformations usually happen within the SQL database. Here they are used simply to store the data centrally before it is imported into the semantic model.

You can use dataflows in multiple semantic models; for example, your date dimension could be reused in many models.

Now, we are looking at the following logic.

As a consequence, this project has been simplified to Delta Parquet plus a semantic model.

In the Lakehouse, go to SQL analytics endpoint

At the bottom of the screen we have a model view. This is the default model view.

This takes us to the semantic model. However, when trying to update the default model we get hit with errors.

After looking into some of these issues online, it would appear that the 'default' semantic model is currently quite glitchy and many people are having issues with it.

The workaround people are using is to create their own semantic models rather than use the default one. So for now, the default is left alone and a new semantic model is created.

Back in the Lakehouse

Choose New Semantic Model

Then work in the same way as you would within your Power BI Desktop file. Note that we don't have to publish; the model auto-saves.

You can then commit your model changes to your Git repository, eventually building up versions of your model. You can quickly create a new report in Fabric to check your DAX as you go. This is a real change from how we used to work, where the Power BI PBIX file was simply a black box, with no opportunity to store the code in Git.

Notice the connections on each object

Direct Lake: Power BI connects directly to the Delta Parquet files. Like Power BI's own storage engine, the files are a columnar data store, and we do not lose any DAX functionality.

Power BI Reporting

As we have seen, you can create reports directly in Fabric. A big question is: is Power BI Desktop still the best thing to use?

Again, I think the answer to this question is definitely yes, although the Fabric reports can be good for quick data model testing.

  • Desktop allows for offline development
  • Power BI Desktop can be more responsive than working in the cloud
  • Power BI Desktop allows for better version control; you can save locally and publish to the service when ready
  • Desktop integrates with other external tools and services
  • Desktop provides more options and flexibility than reporting within Fabric

With this in mind, another update we can take advantage of is the Power BI PBIP (Project) Desktop file. PBIP allows for version control and collaboration. Finally, our Power BI files are broken up and the code can be stored in our Git repositories.

PBIP (the Power BI Project File)

Get Data

In Power BI Desktop you can access your Power BI semantic models

Create reports from the Semantic model(s) in Fabric.

Create the Power BI Project (PBIP)

Instead of simply saving as a PBIX (black box) file, save as a project file and see how this can really change how you work. We should see benefits like:

  • Items are stored in JSON format instead of being unreadable in one file
  • The JSON text files are readable and contain the semantic model and report metadata
  • Source control: finally, real source control for Power BI
  • Files are amendable by more than one person at a time
  • The possibility of using CI/CD (Continuous Integration and Continuous Delivery) with Power BI

Saving as a project is (currently) in preview, so let's turn it on.

Options and Settings / Options

And now we can Save as report.pbip

After saving, the project is stated in the title. Let's have a look at what has been created in Explorer.

The main project file

The reporting folder

The objects within the reporting folder.
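As a rough sketch, the layout is something like the following. The exact file names vary between Power BI Desktop versions, and because this report live-connects to the semantic model in Fabric, only the report definition is saved locally:

report.pbip              <- the main project file that Power BI Desktop opens
report.Report\           <- the reporting folder
    report.json          <- the report definition in readable JSON
    item.metadata.json   <- item metadata
    item.config.json     <- item configuration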

For the initial test, the PBIP was saved to OneDrive. It needs to be added into DevOps so it can be brought into the full source control process along with all the Fabric code.

Azure DevOps: build pipelines for continuous integration

Ensure Fabric is already connected to Azure DevOps and the repository is cloned to a local drive.

Add a powerbi folder in the local repository and move all the Power BI objects here.

The PBIP is now in the local repository but not yet in the cloud.

Git Bash is installed locally to work with the repository: https://git-scm.com/downloads

Using Git Bash you can push the Power BI files up to the central repository by creating a new branch to work on.

Add and commit the changes, then push to the cloud.
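A minimal sketch of that sequence, assuming the repository is already cloned locally and using an example branch name:

git checkout -b feature/powerbi-reports      # create and switch to a new branch
git add powerbi/                             # stage the Power BI project files
git commit -m "Add Power BI project files"   # commit the changes locally
git push -u origin feature/powerbi-reports   # push the new branch to Azure DevOps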

Back in DevOps

We can see the Power BI items. If you have pushed to a branch, you can then create a pull request to merge the new Power BI files into main.

To work with the file again, remember to open the project (PBIP) file from your local repository. Then you can work in Git Bash once again: create a new branch, then add, commit and push the changes.

For self-service Power BI developers, this may take time to embed itself, since the process is more complex and you need some understanding of version control. But it is really worthwhile to train your developers and build this into your standard procedures, especially for Silver and Gold standard (Promoted and Certified) content.

The Fabric Capacity Metrics App

Finally, let's take a quick look at the Fabric Capacity Metrics app. You need to be an admin to install and view the report.

Why is the Fabric capacity app important?

  • It provides insights into how the capacity is being used.
  • Monitoring usage patterns helps to identify when to scale resources up and down.
  • High-usage items can be quickly identified.

Installing the App

Go into your Fabric workspace, click on the settings cog at the top right of the screen, and select the Admin Portal under Governance and insights.

I am the admin for the Fabric capacity, so ensure you know who your Fabric capacity admin is.

Click on the capacity and get the capacity ID from the URL

You can use this ID for the Capacity Metrics app.

In the Fabric workspace, go to your Power BI experience at the bottom left of the screen.

Click on Apps and Get Apps

Search for the Fabric Capacity Metrics app and click Get It Now to install.

Connect to start working with the app

This screen is where you paste the capacity ID copied from the Admin Portal. You then need to sign in and connect.

Here is where the Fabric experience currently falls down slightly for me.

The capacity app doesn't show financial costs (although covered by the licence fee, there are still costs to running Spark jobs). Also, only the Fabric admin can see this information.

As a non-admin user, I still want to be able to see my utilisation as I work with Fabric.

Conclusion

The more you learn about Fabric, the more exciting it gets. The next goals are to work with the Data Warehouse and the real-time capabilities.

There are so many opportunities to use the Delta Lake or the SQL data warehouse as our transformation (staging) area with a Direct Lake Power BI connection. I can see Delta Lake being the option of choice for quickly building up smaller solution projects.

As a SQL developer, I am now a big fan of PySpark. The two make a great toolset for the Fabric analytics engineer.

And as always with Power BI, there are exciting updates every month to look forward to. You never stop learning as an analytics engineer.

At the time of writing this blog, I am working towards the Fabric Engineer certification after getting my Azure Engineer Associate some time ago, so there is lots more to think about. My current excitement is the real-time analytics within Fabric; I'm really looking forward to trying to implement a project with this.

Microsoft Fabric Part 17. Taskmaster Project. The Pipeline. Run multiple notebooks from one Notebook activity using the ForEach activity

In Part 16 we processed Series 4 of Taskmaster. We had a lot of issues to resolve along the way.

Now we can create a pipeline to process all of the notebooks we have created. We will start as simply as possible and try to build in better processes as we go along.

New Data Pipeline

Back in the Fabric workspace, create the first Data Pipeline.

Add activity: Notebook 1 Transformed

Click on Notebook to create the first Notebook activity.

It's very simple at the moment: there are no parameters and nothing else to set.

Now for the next Notebook

Create an On Success connection between Transformed Silver and Dim Contestant.

Now we can set up the rest of the pipeline

This is as basic as it gets, and if it all goes well, this will be fine. However, there are some things we haven't dealt with:

  • We have success connections, but what happens if anything fails?
  • Each one of these is a Notebook activity. Imagine having lots of these here. Is there a way of having one Notebook activity that processes each notebook iteratively? This would clean up the pipeline.
  • It would be super useful to record metrics: what has been processed, how many rows, and so on. Can we get metrics from the run?

Iterate through Each Notebook

In this first pipeline post, let's process each notebook using just one Notebook activity.

Eventually we are going to add Series 5 into the data lake, but in the first instance each notebook should run and do nothing because there is no new data to add.

Create a CSV file

Also get the IDs of the workspace and the notebooks. These were taken from the Fabric URLs.

Then add this file into the bronze data lake.
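However the file is created (by hand is fine), it just needs the notebook names alongside their IDs. Here is a minimal PySpark sketch of producing it; the names, GUID placeholders and output path are all illustrative:

from pyspark.sql import Row

# spark is the session a Fabric notebook provides by default.
# Placeholder rows: swap in the real notebook names and the GUIDs taken
# from the Fabric URLs.
notebooks = [
    Row(notebookName="Transformed Silver", notebookID="<notebook-guid>", workspaceID="<workspace-guid>"),
    Row(notebookName="Dim Contestant", notebookID="<notebook-guid>", workspaceID="<workspace-guid>"),
]

df = spark.createDataFrame(notebooks)

# coalesce(1) writes a single CSV file; the relative Files/ path assumes the
# notebook is attached to the bronze lakehouse.
df.coalesce(1).write.mode("overwrite").option("header", True) \
    .csv("Files/metadata/notebooklist")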

Now we should be able to use this information in the pipeline

Create a lookup

In the pipeline we need a Lookup activity to get the information from the CSV file.

And we can preview the data

Add a ForEach Activity

Drag and drop a ForEach activity onto your pipeline canvas and create an On Success Relationship between this and the Lookup.

Sequential is ticked because the lookup returns multiple rows, one per notebook, and we want to move through them sequentially.

Set the Items in Settings by clicking to get to the pipeline expression builder

We are using the output.value of our lookup activity.

@activity('GetNotebookNames').output.value

  • @activity refers to the activity we are reading from: GetNotebookNames
  • .output accesses the output of that activity
  • .value gets the specific value of the output: the array of rows, sketched below
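With First row only unchecked, a Lookup activity returns its rows inside a value array, so the output is shaped roughly like this (the values are placeholders):

{
    "count": 2,
    "value": [
        { "notebookName": "Transformed Silver", "notebookID": "<guid>", "workspaceID": "<guid>" },
        { "notebookName": "Dim Contestant", "notebookID": "<guid>", "workspaceID": "<guid>" }
    ]
}

This is why the ForEach iterates over output.value rather than output itself.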

Configure the Notebook Activity Inside ForEach

Inside the ForEach, add a Notebook activity.

Again, click to use the dynamic content expression builder to build the settings.

Workspace ID: @item().workspaceID

Notebook ID: @item().notebookID

Note: when you first set this up you see Workspace and Notebook name; the labels change to ID, I think because we are using item(), but this can be confusing.

This is the reason the IDs were added into the CSV file. But we still wanted the names in the file in order to better understand what is going on.

  • @item(): refers to the current item in the iteration, which is a row. When you're looping through a collection of items, @item() represents each individual item as the loop progresses.
  • .notebookID: accesses the notebookID property of the current item; notebookID is a column in the CSV file.

Running the pipeline

You can check the inputs and outputs of each activity.

If it fails you can also open the failure information.

Monitor

Click Monitor to get more information

Here you can see a run before certain issues were resolved.

Click on the failed text bubble to get to the details. I had added the wrong column name.

Go back to the Pipeline.

We ran with no new file. It would be useful at this point to create some SQL to quickly check the data manually. We also need to add this new pipeline activity into our medallion architecture task flow.

Medallion Architecture Task Flow

In the Fabric Workspace.

Check the data using the SQL analytics endpoint

Technically, there should still only be data up to Series 4. It would be nice to have some SQL that quickly tells us what has happened to the data.

You can only use SQL on Delta Parquet tables, so we can't look at our plain parquet Processed file. This may be a reason to change it to a Delta Parquet table in future, so we can use SQL to check information easily.
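As a sketch of that future change, with an assumed path and table name, the conversion would be a one-off PySpark job like this:

# Read the plain parquet file and re-save it as a Delta table so it becomes
# queryable through the SQL analytics endpoint. Path and name are assumptions.
df = spark.read.parquet("Files/processed/processedfile.parquet")
df.write.format("delta").mode("overwrite").saveAsTable("processedfile")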

From our workspace, go into our gold lakehouse.

We now want to work with the SQL analytics endpoint

Choose New SQL query and name it Check Dim and Facts.

These are the queries that have been created to check our data.

--CONTESTANTS
--Make sure there are no duplicate contestants
SELECT ContestantID, ContestantName, COUNT(*)  As total FROM [GoldDebbiesTraininglh].[dbo].[dimcontestant]
Group BY ContestantID, ContestantName
Order by ContestantID
--Get the total number of rows (minus the default row) and divide by 5 (contestants per series), which should tell you how many series of contestants we have
SELECT (Count(*)-1)/5 AS totalSeries FROM [GoldDebbiesTraininglh].[dbo].[dimcontestant]

4 series: this is correct.

--EPISODES
--Again, make sure we don't have duplicate records
Select Series, EpisodeName, Count(*) as Total 
from [GoldDebbiesTraininglh].[dbo].[dimepisode]
GROUP BY Series, EpisodeName
Order by Count(*) desc
--And get a distinct list of series
SELECT Distinct Series FROM [GoldDebbiesTraininglh].[dbo].[dimepisode]
Where Series <> '-1'
Order by Series
--TASKS
--Check we have no duplicates
SELECT Task, TaskOrder, COUNT(*) as Total FROM [GoldDebbiesTraininglh].[dbo].[dimtask]
GROUP BY Task, TaskOrder
Order by Count(*) Desc
--Check the number of rows per series via the fact table
SELECT e.Series, COUNT(*) AS Total FROM [GoldDebbiesTraininglh].[dbo].[facttaskmaster] f
INNER JOIN [GoldDebbiesTraininglh].[dbo].[dimepisode] e on f.EpisodeKey = e.EpisodeKey
GROUP BY e.Series
Order by COUNT(*) Desc

Now we can see a problem. Series 1, 2 and 3 are quite small, but here they have way more rows than S4. Something has gone wrong.

We need to add more to the scripts to ensure that we can identify specific issues, like what the original number of rows was. It is only because we can remember that 240 was the original row count for S4 that we know S4 is correct. We need to be able to identify this quickly and specifically.

Let's do more checking.

--And check for duplicates with the fact as the central table
SELECT d.date As SeriesStartDate, e.Series,e.EpisodeName, c.ContestantName, t.Task, t.TaskOrder, t.TaskType,  f.Points, COUNT(*) AS Total
FROM [GoldDebbiesTraininglh].[dbo].[facttaskmaster] f 
INNER JOIN [GoldDebbiesTraininglh].[dbo].[dimcontestant] c on f.ContestantKey = c.ContestantKey
INNER JOIN [GoldDebbiesTraininglh].[dbo].[dimepisode] e on f.EpisodeKey = e.EpisodeKey
INNER JOIN [GoldDebbiesTraininglh].[dbo].[dimdate] d on d.DateKey = f.SeriesStartDateKey
INNER JOIN [GoldDebbiesTraininglh].[dbo].[dimtask] t on t.TaskKey = f.TaskKey
Where  e.Series = 'S1'
GROUP BY d.date, e.Series,e.EpisodeName, c.ContestantName, t.Task, t.TaskOrder, t.TaskType,  f.Points
Order by COUNT(*) DESC

There should only be one row per contestant per task, but each has 4: the data has been added 4 times.

Changing the WHERE criteria to S2, S3 and S4: there is 1 row each for S4, while S2 and S3 have 4 rows each. We know they were added along with S1, so now we have to understand why this is.

To do this, we could do with some metadata in the Delta Parquet tables recording the date and time the information was added.
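As a sketch of the idea, added wherever each notebook writes its table (the column and table names are illustrative, and df stands for the dataframe the notebook has built):

from pyspark.sql.functions import current_timestamp

# Stamp every row in this load with the time it was written, so a later
# GROUP BY on LoadDateTime shows when, and how many times, data was added.
df = df.withColumn("LoadDateTime", current_timestamp())
df.write.format("delta").mode("append").saveAsTable("dimcontestant")

A quick GROUP BY on LoadDateTime in the SQL endpoint would then show exactly which runs inserted the duplicate rows.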

Conclusion

We now have a pipeline that iterates through our Notebooks and we have discovered some issues along the way.

In the next blog post we are going to make some updates to the notebooks and check why Series 1 has been added 4 times. Then we can move on to our next pipeline updates: recording more metadata about the runs.
