Microsoft Fabric Part 17. Taskmaster Project. The Pipeline. Run multiple notebooks from one Notebook activity using the foreach activity

In Part 16 we processed Series 4 of Taskmaster, and we had a lot of issues to resolve along the way.

Now we can create a pipeline to process all of the Notebooks we have created. We will start as simply as possible and try to build in better processes as we go along.

New Data Pipeline

Back in the Fabric workspace, create the first Data Pipeline

Add activity: Notebook 1 Transformed

Click on Notebook to create the first Notebook

It's very simple at the moment. There are no parameters and nothing else to set.

Now for the next Notebook

Create an on Success connection between Transformed Silver and Dim Contestant.

Now we can set up the rest of the pipeline

This is as basic as it gets, and if it all goes well, this will be fine. However, there are some things we haven't dealt with:

  • We have handled successes, but what happens if anything fails?
  • Each one of these is a Notebook activity. Imagine having lots of them here. Is there a way of having one Notebook activity that processes each notebook iteratively? This would clean up the pipeline.
  • It would be super useful to record metrics: what has been processed, how many rows, etc. Can we get metrics from the run?

Iterate through Each Notebook

Let's create our first Pipeline to process each notebook using just one Notebook activity.

Eventually we are going to add Series 5 into the Data Lake, but in this first instance we are running without any new data, so technically each Notebook should run and do nothing because there is nothing new to add.

Create a csv file

Also get the IDs of the workspace and the notebooks. These were taken from the Fabric URLs. e.g.
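The exact file isn't shown here, so as a rough sketch: the notebook URL contains both IDs (something like https://app.fabric.microsoft.com/groups/<workspaceID>/synapsenotebooks/<notebookID>), and the csv could look along these lines, where the column headers and GUIDs below are placeholders:

notebookName,notebookID,workspaceID
Transformed Silver,11111111-aaaa-bbbb-cccc-222222222222,99999999-aaaa-bbbb-cccc-000000000000
Dim Contestant,33333333-aaaa-bbbb-cccc-444444444444,99999999-aaaa-bbbb-cccc-000000000000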

And add this file into the bronze delta lake

Now we should be able to use this information in the pipeline

Create a lookup

In the pipeline we need a Lookup activity to get the information from the csv file

And we can preview the data

Add a ForEach Activity

Drag and drop a ForEach activity onto your pipeline canvas and create an On Success Relationship between this and the Lookup.

Sequential is ticked because there are multiple rows, one for each notebook, and we want to move through them sequentially.

Set the Items in Settings by clicking to get to the pipeline expression builder

We are using the output.value of our lookup activity.

@activity('GetNotebookNames').output.value

  • @activity refers to the actual activity we are using: GetNotebookNames
  • .output accesses the output of the activity
  • .value gets the specific value of the output, which here is the list of rows from the file (see the example below)
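For example, with a csv like the sketch above, the Lookup's output.value would be an array with one object per row, something like this (the values are placeholders):

[
  { "notebookName": "Transformed Silver", "notebookID": "11111111-aaaa-bbbb-cccc-222222222222", "workspaceID": "99999999-aaaa-bbbb-cccc-000000000000" },
  { "notebookName": "Dim Contestant", "notebookID": "33333333-aaaa-bbbb-cccc-444444444444", "workspaceID": "99999999-aaaa-bbbb-cccc-000000000000" }
]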

Configure the Notebook Activity Inside ForEach

Inside the ForEach, add a Notebook activity

Again, click to use the Dynamic content expression builder to build the settings

Workspace ID: @item().workspaceID

Notebook ID: @item().notebookID

Note that when you first set this up you see Workspace and Notebook Name. They change to ID, I think because we are using item(), but this can be confusing.

This is the reason the IDs have been added into the csv file. But we still wanted the names in the file in order to better understand what is going on.

  • @item(): This refers to the current item in the iteration, which is a row. When you're looping through a collection of items, @item() represents each individual item as the loop progresses.
  • .notebookID: This accesses the notebookID property of the current item; notebookID is a column in the csv file (see below).
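So, on the first iteration of the sketch above, @item() would be the first object from the output.value array, and the property accessors would resolve to its fields (placeholder values again):

@item()             -> { "notebookName": "Transformed Silver", "notebookID": "11111111-aaaa-bbbb-cccc-222222222222", "workspaceID": "99999999-aaaa-bbbb-cccc-000000000000" }
@item().notebookID  -> 11111111-aaaa-bbbb-cccc-222222222222
@item().workspaceID -> 99999999-aaaa-bbbb-cccc-000000000000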

Running the pipeline

You can check the inputs and outputs of each activity.

If it fails you can also open the failure information.

Monitor

Click Monitor to get more information

Here you can see a run before certain issues were resolved.

Click on the failed text bubble to get to the details. I had added the wrong column name.

Go back to the Pipeline.

We ran with no new file. It would be useful at this point to create some SQL to quickly check the data manually. We also need to add this new Pipeline into our medallion architecture task flow.

Medallion Architecture Task Flow

In the Fabric Workspace.

Check the data using the SQL endpoint

Technically, the data should still only go up to Series 4. It would be nice to have some SQL that quickly tells us what has happened to the data.

You can only use SQL on Delta Parquet files, so we can't look at our parquet Processed file. This may be a reason to change it to a Delta Parquet file in future, so we can use SQL to check information easily.

From our Workspace, go into our gold lake.

We now want to work with the SQL analytics endpoint

Choose New SQL Query and name it Check Dim and Facts.

These are the queries that have been created to check our data:

--CONTESTANTS
--Make sure there are no duplicate contestants
SELECT ContestantID, ContestantName, COUNT(*)  As total FROM [GoldDebbiesTraininglh].[dbo].[dimcontestant]
Group BY ContestantID, ContestantName
Order by ContestantID
--Get the total of rows (minus the default row) and divide by 5, which should tell you how many series of contestants we have
SELECT (Count(*)-1)/5 AS totalSeries FROM [GoldDebbiesTraininglh].[dbo].[dimcontestant]

4 Series. This is correct.

--EPISODES
--Again, make sure we don't have duplicate records
Select Series, EpisodeName, Count(*) as Total 
from [GoldDebbiesTraininglh].[dbo].[dimepisode]
GROUP BY Series, EpisodeName
Order by Count(*) desc
--And get a distinct list of series
SELECT Distinct Series FROM [GoldDebbiesTraininglh].[dbo].[dimepisode]
Where Series <> '-1'
Order by Series
--TASKS
--Check we have no duplicates
SELECT Task, TaskOrder, COUNT(*) as Total FROM [GoldDebbiesTraininglh].[dbo].[dimtask]
GROUP BY Task, TaskOrder
Order by Count(*) Desc
--Check the number of rows per series via the fact table
SELECT e.Series, COUNT(*) AS Total FROM [GoldDebbiesTraininglh].[dbo].[facttaskmaster] f
INNER JOIN [GoldDebbiesTraininglh].[dbo].[dimepisode] e on f.EpisodeKey = e.EpisodeKey
GROUP BY e.Series
Order by COUNT(*) Desc

Now we can see a problem. Series 1, 2 and 3 are quite small, but here they have far more rows than S4. Something has gone wrong.

We need to add more to the scripts so that we can identify specific issues, like what the original number of rows was. It is only because we remember that 240 was the original row count for S4 that we know S4 is correct. We need to be able to identify this quickly and specifically.

Let's do more checking.

--And check the duplicates with fact as the central table
SELECT d.date As SeriesStartDate, e.Series,e.EpisodeName, c.ContestantName, t.Task, t.TaskOrder, t.TaskType,  f.Points, COUNT(*) AS Total
FROM [GoldDebbiesTraininglh].[dbo].[facttaskmaster] f 
INNER JOIN [GoldDebbiesTraininglh].[dbo].[dimcontestant] c on f.ContestantKey = c.ContestantKey
INNER JOIN [GoldDebbiesTraininglh].[dbo].[dimepisode] e on f.EpisodeKey = e.EpisodeKey
INNER JOIN [GoldDebbiesTraininglh].[dbo].[dimdate] d on d.DateKey = f.SeriesStartDateKey
INNER JOIN [GoldDebbiesTraininglh].[dbo].[dimtask] t on t.TaskKey = f.TaskKey
Where  e.Series = 'S1'
GROUP BY d.date, e.Series,e.EpisodeName, c.ContestantName, t.Task, t.TaskOrder, t.TaskType,  f.Points
Order by COUNT(*) DESC

There should only be one row per Contestant, but each has 4. The data has been added 4 times.

Changing the Where criteria to S2, S3 and S4: there is 1 row each for S4, but S2 and S3 have 4 rows. We know they were added along with S1, so now we have to understand why this is.

To do this, we could do with some metadata in the Delta Parquet tables recording the date and time the information was added.
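As a sketch of what that would give us, assuming we later add a load-timestamp column to the fact table (called LoadDate here purely for illustration), a query like this would show which load each batch of rows came from:

--Hypothetical check: LoadDate is an assumed metadata column, not one that exists yet
SELECT f.LoadDate, e.Series, COUNT(*) AS Total
FROM [GoldDebbiesTraininglh].[dbo].[facttaskmaster] f
INNER JOIN [GoldDebbiesTraininglh].[dbo].[dimepisode] e on f.EpisodeKey = e.EpisodeKey
GROUP BY f.LoadDate, e.Series
ORDER BY f.LoadDate, e.Series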

Conclusion

We now have a pipeline that iterates through our Notebooks and we have discovered some issues along the way.

In the next blog post we are going to make some updates to the Notebooks and check why Series 1 has been added 4 times. Then we can move on to our next Pipeline updates: recording more metadata about the runs.
