Microsoft Fabric Part 12. Taskmaster Project. Pyspark Broadcast

In Part 10 we created the reporting and Project PBIP file. In Part 11 we looked at new Task flows and added our items into a medallion architecture.

It's now time to make some small tweaks to the project so far. One thing I discovered along the way is the PySpark broadcast function.

This function is used to optimise the performance of Spark jobs by reducing data shuffling.

Broadcast is really good for join operations, and we have a lot of joins in the notebooks. Can we make the code better?

First of all, what is data shuffling?

Data Shuffling

In distributed data processing, shuffling is the redistribution of data across partitions or nodes in a cluster. It happens when the data needs to be re-organised, for example during:

  • Joins
  • Aggregations
  • Groupings

So, data is distributed in partitions to allow for parallel processing.

When you join data that lives in different partitions, Spark has to shuffle it during the transformation.

Shuffling is expensive and creates network overhead.
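
To make that concrete, here is a small illustrative PySpark sketch (my own example, not from the Taskmaster notebooks) of a grouping that forces a shuffle:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical example data, spread across partitions for parallel processing
df = spark.range(0, 1_000_000).withColumnRenamed("id", "points")
df = df.withColumn("contestant", df.points % 5)

# groupBy is a wide transformation: rows with the same key must end up on the
# same partition, so Spark shuffles data across the cluster to do the aggregation
totals = df.groupBy("contestant").sum("points")
totals.explain()  # the plan includes an Exchange step, which is the shuffle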

These are the stages of shuffling.

  • Map Stage – When the data is processed in partitions
  • Shuffle Stage – Shuffling and exchanging
  • Reduce Stage – Further processing after shuffling

We want to try to avoid unnecessary shuffling by using broadcast. Let's go back to our notebooks.

Taskmaster transformed Notebook

# Join the extra contestant information
dfcont = dfc.join(dfp, dfc["Name"] == dfp["contestant"], "left_outer").drop(dfp.contestantID)\
.drop(dfp.contestant).drop(dfp.team).drop(dfp.team_label).drop(dfp.champion)\
.drop(dfp.TMI)

# The resulting DataFrame 'dfcont' contains all rows from dfc and matching rows from dfp
display(dfcont)

This is the original Join. Clicking run, it takes 6 seconds.

If you run it again it takes two seconds. Why is it quicker?

  • File caching
  • The second run will benefit from better data distribution and partitioning
  • JVM (Java Virtual Machine) warmup: the JVM has already optimised by the second run

How do you get back to the timing of the first run?

spark.catalog.clearCache()

This clears any cached data before trying again.
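
To compare like for like, a small timing sketch (my own addition, reusing the dfc/dfp join from the notebook) can clear the cache and time a cold run; count() is only there to force the lazy join to execute:

import time

spark.catalog.clearCache()  # drop any cached data so this is a cold run

start = time.time()
dfjoined = dfc.join(dfp, dfc["Name"] == dfp["contestant"], "left_outer")
dfjoined.count()  # force the lazy join to actually execute
print(f"Join took {time.time() - start:.1f} seconds")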

Working with Broadcast on a join

You need to broadcast the smaller table. In the above instance, we can do a quick check of the sizes of the two data sets.

dfcrow_count = dfc.count()
dfprow_count = dfp.count()

print(f"Number of rows in the dfc DataFrame: {dfcrow_count}")
print(f"Number of rows in the dfp DataFrame: {dfprow_count}")

Immediately we can see which DataFrame should be broadcast: dfc.

from pyspark.sql import SparkSession, functions as F

df_small = F.broadcast(dfc)

# Join the extra contestant information
dfcont = df_small.join(dfp, df_small["Name"] == dfp["contestant"], "left_outer").drop(dfp.contestantID)\
.drop(dfp.contestant).drop(dfp.team).drop(dfp.team_label).drop(dfp.champion)\
.drop(dfp.TMI)

# The resulting DataFrame 'dfcont' contains all rows from df_small (the broadcast dfc) and matching rows from dfp
display(dfcont)

dfcont.explain()

Explain allows us to see the plan.

This took 1 second to run as opposed to the 6 seconds previously. If you are working with a lot of data, this could change your run times considerably.

Let's remove the original join and lock this one in.

The Execution Plan

Gives you more detail on how the command is being executed.
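
The thing to look for (general Spark behaviour rather than anything Fabric-specific) is a BroadcastHashJoin with a BroadcastExchange in the plan, rather than a SortMergeJoin with a shuffle Exchange. It is also worth knowing that for outer joins Spark can only broadcast the non-preserved side, so the plan is where you confirm the hint was actually applied. A rough comparison sketch:

# Compare the two physical plans (Spark 3.x syntax). The broadcast version should
# show BroadcastHashJoin / BroadcastExchange; the plain join typically shows
# SortMergeJoin with an Exchange, which is the shuffle we are trying to avoid.
dfcont.explain(mode="formatted")

df_plain = dfc.join(dfp, dfc["Name"] == dfp["contestant"], "left_outer")
df_plain.explain(mode="formatted")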

Conclusion

With this in mind, we can go through all the code and update anything we can with a broadcast, which should really help with the processing.
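
As a rough, hypothetical pattern for doing that (my own sketch rather than code from the project), a small helper could count both sides and broadcast the smaller one; counting adds a job of its own, so for very large tables you would rely on known sizes instead:

from pyspark.sql import DataFrame, functions as F

def join_with_broadcast(left: DataFrame, right: DataFrame, on, how: str = "inner") -> DataFrame:
    # Hypothetical helper: broadcast whichever side has fewer rows
    if left.count() <= right.count():
        return F.broadcast(left).join(right, on, how)
    return left.join(F.broadcast(right), on, how)

# e.g. dfcont = join_with_broadcast(dfc, dfp, dfc["Name"] == dfp["contestant"], "left_outer")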

A really great addition to the PySpark toolbox.

Microsoft Fabric Part 11. Taskmaster Project. New Fabric Feature, Task Flows

In Part 10 we created Power BI reporting and changed the usual PBIX to a PBIP Project file.

Before moving on, let's have a quick look at a new feature in Fabric: Task flows.

Back in the Fabric Workspace

We can now build a task flow. But what is it?

It's a new workspace feature to visualise the collection of processes that make up an end-to-end solution.

There are specific types of tasks that we will look at later.

There are two ways of going about this.

  1. You have started a brand new project and you have your initial task flow set up to start adding in tasks. This helps you work with your architecture.
  2. You have already created tasks (Like this project) and want to assign them to a task flow.

Let's begin.

There are 8 flows available at the moment. Throughout the Fabric journey, the medallion architecture has been championed, so let's go with this.

Colouring

  • Green Items are Get data
  • Blue Items are Store Data
  • Purple items are Prepare data
  • Yellow items are Visualise data
  • Red Items are Analyse and train Data

Creating new Tasks

If you haven't yet started, then when you set up your board you need to assign items to each task.

There are no Data Engineering steps in this development project for getting data, so let's start with the green tasks.

Click New Item

And we get options of what to create for low volume data. Great. We have recommended items (which we can change to all). So if you wanted to go low code, a Data Factory Dataflow Gen2 might be the way to go.

Let's have a look at high volume data.

Here, a Data Pipeline would probably be used for this project.

Data Factory was previously the go-to for copying files across into the data lake, so as the data engineer I would possibly pick pipelines for both types of data processing.

Already this is a great way of working. We know we need to do all these processes, but the number of possibilities can be bewildering. So let Fabric give you recommendations every step of the way.

Assigning Items to a Task

Bronze Data

If we quickly click new item to check the recommendations

We have already set up a Lakehouse for Taskmaster.

Click the little paper clip icon to attach an item to the task

And Select the Lakehouse

Now we have 3 items at Bronze Data Level

  • DebbiesFabricLakehouse Top Level
    • The SQL Analytics Endpoint
    • The semantic model (default). This wasn't used because of errors that occurred whilst developing.

Selecting the Bronze data Task shows you the objects in that task.

Let's move on to

Initial process

We have a Data Engineering notebook set up that creates our initial transformed Silver layer. This can be added here.

One current real issue for me is the vertical bar in Fabric and Power BI.

You have to go to the end to scroll down, and you can't see your items. This is extremely annoying when working on smaller screens. This is actually a large screen, but you can't make the box any bigger; you have to manually make the columns smaller each time.

I hope they sort this out soon.

What happens when you accidentally assign an item to the wrong task?

I have accidentally added the Lakehouse to the Initial Process task.

Hover over the items and select the ones you want to remove.

Then click Unassign from all tasks, which appears.

Silver Data

In this case the Lakehouse contains both the bronze and silver layers, so it is used for both.

What happens when we assign the lakehouse to silver?

Well, you can't. An item can only belong to a single task. So how do we relate this to our medallion architecture above?

For the time being, go to Bronze Data and edit

Assign the Lakehouse

Click on Silver Data

Delete it; we don't need this task.

This leaves a gap in our flow

Connecting Tasks

Click the edge of the starting task and join to the end task

Further transformation

We have notebooks that prep the dims and facts. Let's add them.

Golden Data

Another data store, and again our golden data is in Delta PARQUET in the Lakehouse. It looks like we can edit the initial Store data task again.

Mini gripe: you can't see all of the information, so you can't tell it's bronze, silver and gold.

Delete Golden Data

Personally I would prefer to have all three on the pane and share the Data lake between all three tasks.

But where do we add the semantic model?

The Semantic model comes before visualising the data. But we have nowhere for this.

Creating paginated reports and semantic models from a task isn’t supported.

https://learn.microsoft.com/en-us/fabric/get-started/task-flow-overview

Well, this is a real shame. It feels like this needs to be added before it's a really usable solution.

Data Visualise

Our report is in a PBIX File.

We don’t have any Dashboards

Let's publish the content from the PBIX file to the Fabric workspace.

Then go back to the workspace

Interestingly these are now uncommitted items and show up in source control.

In Part 10 we did a lot of work getting the PBIX file into source control. What will this do? We will look at this later.

Back to Data Visualize

We add the reporting

The semantic model has also been added here. Hopefully a new task will be added soon specifically for the semantic model; there are currently no item types for semantic modelling.

There is no ML Service so we could get rid of this task.

Old Notebooks not being used (Filters).

There are a few notebooks from the old version that aren't part of this; can we see them?

Another issue: selecting Not Specific for Task doesn't filter to the items without a task specified. Something else that needs to be ironed out.

Conclusion.

There are currently a fair few issues with this but it is new and hopefully they will get ironed out.

I still don’t like scrolling in any of the Power BI / Fabric Service solutions because it doesn’t automatically resize to page size. This has been an annoyance for a long time now. I really wish it would get sorted.

Semantic modelling needs to be added. Filtering needs to be much better.

And I genuinely wish you could have an item in multiple tasks, especially store data. But overall I really love this feature and can't wait to see it get some more development.

Microsoft Fabric Part 10. Taskmaster Project. Creating the Reporting and the Project File (PBIP)

So far we have created our Delta PARQUET files in the Delta Lake using Notebooks with Pyspark.

We have created a semantic model with storage mode Direct Lake in Fabric.

It's time to create some visuals, but the question is: do we create them in Fabric, or in a PBIX file?

Reporting: Fabric or PBIX File?

Fabric – Centrally managed

PBIX – offline development, with version control available for the files via .pbip (the Project file)

For this project we are going to go for the hybrid approach: the semantic model in Fabric, and reports developed in Desktop with DevOps version control, which are then published into Fabric.

This gives us better opportunities for version control and collaboration.

Get Data

In Power BI Desktop

Power BI Reporting

We won't go into too much detail on the front-end reporting here; we are more interested in other areas of Fabric. So here are the basic pages.

Drill through to Series

And Drill through to Episode

We now have a PBIX report we can publish through to our Fabric workspace.

Create the Power BI Project (PBIP)

Instead of simply saving as a PBIX (black box) file, let's save as a project file and see how this can really change how we work with others in Power BI. We should see benefits like:

  • Items are stored in JSON format instead of being unreadable in one file
  • JSON text files are readable and contain the semantic model and report meta data
  • Source Control. Finally real source control for Power BI
  • Amendable by more than one person at a time?
  • The possibility of using (CI/CD) Continuous Integration and Continuous Delivery with Power BI

Saving as a project is in preview, so let's turn it on.

Options and Settings / Options

TMDL

This was mentioned in the Power BI June 2024 Updates.

TMDL is the improved semantic model file format for Power BI Project files.

Our semantic model has been created within Fabric. We might look at this in more detail later.

And now we can Save as

And we can see that it's a project in the title.

Clicking on the title shows us the file paths of the objects. We only have the report because the semantic model has been created within Fabric.

Let's have a look at what we have created in Explorer.

The main project file

The reporting folder

Our objects within the reporting folder.

Currently this is in OneDrive. We want to get this added into DevOps and Git so it can be part of our full source control process along with all the Fabric code.

Azure DevOps build pipelines for continuous integration

First of all we need to make sure Fabric is connected to Azure DevOps, which it is.

And cloned to a local drive. This was done using Visual Studio but there are other ways you can clone.

It would have been better to have saved into this cloned local project, but we can create a folder and move the files instead.

Created a Power BI Folder.

And Moved all the objects mentioned above into this folder

It's local but not yet in the cloud.

To work with Git I have installed Git Bash: https://git-scm.com/downloads

In Git Bash, I change directory to the correct local Git directory and use ls to list the information in there.

cd source/repos/DebbiesTraining
ls
git status

We can see we have powerbi code that is uncommitted and needs pushing to the central GIT repository.

We don't really want to work on the main branch. A better, cleaner process is to create a feature branch, InitialPowerBIFile (created locally with git checkout -b InitialPowerBIFile, the same approach we use for the next branch below).

git add .
The . in git add . refers to the current directory, basically telling Git to stage all changes within it.

git add powerbi

This just adds the powerbi folder. We don't really want to do anything with the Fabric items as these are taken care of already.

git commit -m "InitialPowerBIFile"

The -m flag allows you to provide a commit message inline, and this command commits the staged changes.

So far we have added the files to the staging area and committed them. Time to push to the cloud.

git push --set-upstream origin InitialPowerBIFile

You will initially get a login screen to authenticate the push.

And this is the line you want to see. We know that we have pushed it to DevOps.

Back in DevOps

Remember to change to the correct branch. You won't see it in main.

We can see the Power BI Items.

Create a Pull Request

Very simple. We are in Dev and don't have any reviewers or work items on Boards to connect to.

Create and Complete

Power BI Files are now in Main

How do you now work with the Power BI Project file?

Open the project file from your local source control

Let's create a quick new report page for Demographics.

And save

Back to GIT

git checkout -b "InitialPowerBIFileAllDemographicsPowerBIPage"
git status
# git status shows we have modifications
git add .
git commit -m "InitialPowerBIFileAllDemographicsPowerBIPage"
git push --set-upstream origin InitialPowerBIFileAllDemographicsPowerBIPage

Back in Devops

Create and Complete merge. (If you don’t use the reviewing process in Development)

We can see the page is in, with lots of changes added to this part of the JSON.

Where is the Semantic Model?

The Semantic model has been created in Fabric.

It says Synced in the Git status.

And here it is in DevOps.

Conclusion

The above can be done in Power BI Pro as well as Fabric, which is good news.

However, my worry is that someone creating PBIX reporting won't take to the whole Git process. I suspect that when it comes down to it, it simply won't be done properly.

If you were to go the PBIX file route, this would need a lot of governance work to get people to use Git. At enterprise level this would be doable, but I doubt it would become a real part of the process at self-service level.

I did start to question the process after doing some more reading. It feels like if you created the reporting inside Fabric, the Git process would be simplified, even though the documentation states that for full Git control you should use PBIX. At some point I will be really interested in creating a Power BI report inside Fabric to see how this would work in the process.

In the next blogs we are going to look at some new features of Fabric. We can also look at using this feature along with DevOps Pipelines later (Premium and Fabric only).

Microsoft Fabric Part 9. Taskmaster Project. Creating the Semantic Model

The Semantic Model

Moving straight on from Part 8, we were in the SQL analytics endpoint.

At the bottom of the screen we have a model view. This is the Default Model view

This takes us to the Semantic Model

Semantic Modelling in Fabric

Let's update the default semantic model: set the relationships, hide keys, set up sort orders, hierarchies etc.

Direct lake

Notice the storage mode is Direct Lake. This has greatly reduced the number of copies of the data we have. Power BI reads the PARQUET files directly, in the same way it would read data imported into the columnar Power BI data store.

Direct Lake is only available with Microsoft Fabric and Delta PARQUET tables.

Unfortunately I have hit a problem.

The semantic model isn't working. After every change we hit this error:

 ‘We cannot access source column ‘<ccon>series</ccon>’ from delta table ‘<ccon>dimepisode</ccon>’’

However, checking the data, the series column is available.

Also, the directions on the join are incorrect (before the changes are removed).

After looking at some of these issues online, it would appear that currently the ‘default’ semantic model is very glitchy and many people are having issues with it.

People are using workarounds by creating their own semantic models rather than using the default. This seems very disappointing, and I hope it's rectified soon (as at 30/05/2024).

https://powerbi.microsoft.com/en-gb/blog/improvements-for-creating-new-direct-lake-semantic-models/

Create a Semantic Model

Back in the Lakehouse

Choose New Semantic Model

Select All

Join Dim Episode to the fact on EpisodeKey.

Immediately we see an issue: the many should be on the fact side.

Time to go back to the SQL analytics endpoint to quickly check what is happening.

Going back to My Queries and the Taskmaster data checks file.

SELECT f.TaskKey AS FactTaskKey, COUNT(*)
FROM [DebbiesFabricLakehouse].[dbo].[facttaskmaster] f
GROUP BY f.TaskKey

SELECT t.TaskKey AS DimTaskKey, COUNT(*)
FROM [DebbiesFabricLakehouse].[dbo].[dimtask] t
GROUP BY t.TaskKey
HAVING COUNT(*) > 1

SELECT Task, TaskOrder, COUNT(*) FROM [DebbiesFabricLakehouse].[dbo].[dimtask] 
GROUP BY Task, TaskOrder
HAVING COUNT(*) > 1

From running the above queries, it would seem that everything is correct: 1 task to many. In Power BI Desktop this would automatically create the correct relationship for you.

So in this instance, go into the relationship and reset it. (Is this an issue in Fabric at the moment?)

Relationships and hiding keys

Friendly Names

Now I want to set up friendly names for the columns and tables. If we change a table name here, how does this affect the Delta PARQUET table? Let's check.

It doesn't affect the Delta PARQUET file, so let's add friendly names into the model for the dims.

E.g.

Add Sort orders and hide

This has been repeated on:

  • Dim Date Quarter / Quarter No
  • Dim Episode Episode Name / Episode No
  • Dim Task Task / Task Order

Hierarchies

Dim Date

Hide the original fields outside of the hierarchy so as not to confuse the user.

For this project, let's not use Quarter and hide it for the time being.

Let's also hide financial year.

Dim episode

Don’t forget to click Apply level changes

DAX

Now we can start to add the DAX. In Power BI Desktop we can create DAX and then immediately create a visual to check it's looking good; how do we do this in the semantic model in Fabric?

Points

Total Points = SUM('fact Taskmaster'[Points])

And hide the base metric

And add the new DAX to a Folder

So how do we quickly check this?

Using the new report we can quickly create a test report.

No need to save this report; it is just a tester.

There is something wrong with the date connection. We can go back to the SQL analytics endpoint to quickly check (after checking the relationships).

Our date dimension starts in 2020, but this data is back in 2017, so the date Delta PARQUET needs an update.

Dim Date NoteBook

Back in Dim Date, all we need to do here is reconfigure the start and end dates.

# Define your start and end dates
start_date = '2015-01-01'
end_date = '2027-12-21'

And rerun the entire notebook
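
For reference, here is a minimal sketch of how a notebook like this might turn those two dates into one row per day (the sequence approach and the extra columns are my assumptions; the real Dim Date notebook has more to it):

from pyspark.sql import functions as F

# Hypothetical sketch: one row per day between the configured start and end dates
dfdate = spark.sql(
    f"SELECT explode(sequence(to_date('{start_date}'), to_date('{end_date}'), interval 1 day)) AS date"
)
dfdate2 = dfdate.withColumn("year", F.year("date")).withColumn("month", F.month("date"))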

Then go back to the query we just ran.

And after a refresh on the report

Great, this issue has now been resolved.

Week

And whilst in the Dim Date dimension, there is one thing that we haven't added, and that's week.

Week is really important because each episode is once a week; it's more important than day. So let's quickly update to bring in a week column, then we can update the hierarchy and hide day.

from pyspark.sql.functions import to_date, weekofyear

# Add a week-of-year column derived from the date column
dfDateWeek = dfdate2.withColumn('weekofyear', weekofyear(dfdate2.date))

dfDateWeek.show()

And update the creation of the files to use this DataFrame.
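
A rough sketch of that write step (the table name and options are my assumptions, not the project's actual code); because we have added a new column, the schema has to be allowed to change on overwrite:

# Hypothetical sketch: overwrite the date dimension Delta table with the new column.
# overwriteSchema is needed because the schema has changed (new weekofyear column).
(dfDateWeek.write
    .format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .saveAsTable("dimdate"))  # "dimdate" is an assumed table name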

Update the date Hierarchy

Once set, let's go back to the semantic model.

Right click and Refresh

This didn’t work. Week of year isn’t appearing in the model.

Clicking this seems to have worked.

We don't want to sum up the weeks, so in Properties / Advanced, set Summarize by to None.

Update the hierarchy

Also, don't forget to set it as the date table.

Back to the DAX.

Base Metrics

Total Contestants = DISTINCTCOUNT('dim contestant'[Contestant Name])

Total Episodes = DISTINCTCOUNT('fact Taskmaster'[EpisodeKey])

Total Points = SUM('fact Taskmaster'[Points])

Max Points Scored (ALL Contestant Images) = MAXX(
                    ALL( 'dim contestant'[Image]),[Total Points])

Points for Group Tasks = CALCULATE(SUM('fact Taskmaster'[Points]),'dim task'[Assignment] = "Group")

Points for Special Tasks = CALCULATE(SUM('fact Taskmaster'[Points]),'dim task'[Assignment] = "Special")

Total Series = CALCULATE(DISTINCTCOUNT('dim episode'[Series]), 'dim episode'[Series] <> "NA")

Total Tasks = DISTINCTCOUNT('fact Taskmaster'[TaskKey])

DAX with variables
Winner + Tiebreaker = 
var tiewin = CALCULATE(SUM('fact Taskmaster'[Winner]),FILTER('dim task','dim task'[Task Type] = "Tie Breaker"))

RETURN  [Total Points]+tiewin

DAX using If

Group Tasks = IF(CALCULATE(DISTINCTCOUNT('fact Taskmaster'[SeriesStartDateKey]),'dim task'[Assignment] = "Group") = BLANK(), 0,
              CALCULATE(DISTINCTCOUNT('fact Taskmaster'[TaskKey]),'dim task'[Assignment] = "Group"))

Special Tasks = IF(CALCULATE(DISTINCTCOUNT('fact Taskmaster'[SeriesStartDateKey]),'dim task'[Assignment] = "Special") = BLANK(), 0,
              CALCULATE(DISTINCTCOUNT('fact Taskmaster'[TaskKey]),'dim task'[Assignment] = "Special"))

Tie Breaker = IF(CALCULATE(DISTINCTCOUNT('fact Taskmaster'[TaskKey]),'Dim Task'[Task Type] = "Tie Breaker") = BLANK(), 0,
              CALCULATE(DISTINCTCOUNT('fact Taskmaster'[TaskKey]),'Dim Task'[Task Type] = "Tie Breaker"))

Tie Breaker episode ? = IF([Tie Breaker]>=1 , "Tie Breaker", "Won on points")

Percentages

% Tasks = [Total Tasks]/ CALCULATE([Total Tasks], ALL('dim task'[Assignment]))*100

% Tasks by Series = [Total Tasks]/ CALCULATE([Total Tasks], ALL('dim episode'[Series]))*100

% Tasks by Type = [Total Tasks]/ CALCULATE([Total Tasks], ALL('dim task'[Task Type]))*100

And each one is added to a display folder. E.g. Measures\Tie Breaker to give us a sub folder under Measures

Highlighting Metrics By Colour will be used later in the reporting.

Highlight Max Points Scored (ALL Contestant Images) = 
Var MaxPoints = MAXX(
                    ALL( 'dim contestant'[Image]),[Total Points])


Var Check = IF([Total Points] = MaxPoints, "Light Green","White")

Return Check

Highlight Max Points Scored (ALL Episodes) = 
Var MaxPoints = MAXX(
                    ALL('dim episode'[Episode Name]),[Total Points])


Var Check = IF([Total Points] = MaxPoints, "Light Green","White")

Return Check

Highlight Max Points Scored By Episode = 
Var MaxPoints = MAXX(
                    ALL('dim episode'[Episode Name]), [Total Points])


Var Check = IF([Total Points] = MaxPoints, "Light Green","White")

Return Check

Highlight Min Points Scored (ALL Contestant Images) = 
Var MinPoints = MINX(
                    ALL('dim contestant'[Image]), [Total Points])

Var Check = IF([Total Points] = MinPoints, "Light Red","White")

Return Check

Hide original fact data

With everything hidden in the table, we get our fact icon, establishing that this is a fact table.

Final Updates

Have Thousands separators set for numbers (Measures)

Make sure percentages are also set correctly

Set Image as Image URL in Properties > Advanced

Our Semantic model is complete. Or at least ready for us to start creating reports.

The other thing to note about this semantic model is there is no need to publish. It autosaves and we simply need to refresh our data source for the reporting to bring through changes.

Let's move on to creating the visuals in the next post.
