May 20, 2024 – Debbies Microsoft Power BI, SQL Fabric and Azure Blog

May 20, 2024May 21, 2024

Power BI Story Telling in PowerPoint. New Enhancements April 2024

A Previous Project we worked on had a specific requirement.

Project Requirement

Create reporting once a month and allow us to tell the story in narrative via text boxes. The narrative must show against the reports.

The narrative must be formattable. Bold, italic.

Bullet point, etc.

Our users don’t want to go to Power BI service. They want the information emailed as PowerPoint slides to the report viewers.

The License is Power BI Pro and we couldn’t use, for example SQL DB for the project to have the transformations before going to Power BI.

There are a lot of slicers in the reports, so there were numerous PowerPoint slides needed based on the specific slicers

There were immediate restrictions based on the requirements.

Restrictions

With a SQL Database we could have used a visual that allowed us to add narrative to each page because it could have been recorded to a database but this was too complex and we didn’t have that option. Also the user might not have been able to format the text enough on the page.
We tried to use the Storytelling in PowerPoint. But, at the time this was live only. You couldn’t snapshot the data. And there were issues in the amount of sub slicers and the users needing to have paginated reporting.

April 2024

As At April. It seems that there have been changes to Storytelling with PowerPoint. And its of interest to see if this could have worked with the above requirements. Based on very basic data.

I am going to use TV shows watched as an example for this.

We start off with a couple of records in excel

Which is imported into Power BI.

A Couple of measures have been created.

Show Episodes Watched = COUNT(Shows[Show])

Total Shows Watched = DISTINCTCOUNT(Shows[Show])

And a Few Basic Visuals have been created

Just enough for us to play with

Lets publish to Service

Then Click on the ellipses and Select Share and Open in PowerPoint

And Now. We go across to a new PowerPoint

Click Insert

Lets choose a few more Visuals and add some text explaining that day.

Lets save this PowerPoint

Now we are moving into another day and adding more data.

Save and Close XLSX. Lets see how this looks in Power BI after a refresh.

So now lets open PowerPoint again.

What we are hoping for is a PowerPoint snapshot for each point in time

So What we have is Live Reporting which now doesn’t match the text added. What we need is to have Business Logic that allows us to Snapshot the Power point.

Each of the 3 Visuals has options and currently we are on Live data.

There are two other options.

Snapshot

Show and image of Power BI data to people with existing access.

so, these are people who can view the reporting in Power BI service,.

Public Snapshot.

Show an image of Power BI data for people with access to this deck

Essentially, a public snapshot allows you to share a static image representation of your Power BI data with a broader audience, even those who don’t have direct access to the report.

We can chose snapshot for this exercise but what will the business process be? How will this work?

The Business Logic

Essentially we want to have a template in place of Power BI visuals and Text.

We want to make a copy of the PowerPoint and send this to users.

The Next day after a Refresh we want to do the same thing and build up PowerPoint Snapshots where the text tells the story on that day

So what happens if we change all three to Snapshot in the slide deck?

Remember, this slide deck will have many visuals and be much more complex.

Lets make a copy of the PowerPoint slide deck and see.

The text has been updated for the current 2 day data source

Every Visual has had to be manually changed to Snapshot.

This could be really problematic if you had to do this every single time you want to create a snapshot.

lets add another days worth of data into Power BI and refresh

Our Live PowerPoint shows the latest.

Our Snapshot still shows

So now we have a problem. We could do this

BUT. Imagine we have a lot of pages and a lot of visuals. This involves manual updates for every PowerPoint snapshot to move each visual from Live to Snapshot.

Do you want to add this in as a process?

Could your team handle this as ongoing business logic?

This change is really great. But because there are manual aspects to it. its still not that efficient and could be error prone. So we are not quite there yet.

The test needs a manual update every month in the PowerPoint and it does need saving manually.

Changing from individual visuals to a full report

Could we simplify from individual visuals to a full report?

For this, the excel data will be taken back to day 1

There are two possibilities here.

We don’t want to embed an image because that will be snapshot data and the users don’t want to go in and do this every month.

We want to try and Embed Live Data and then change to snapshot. After a few tweaks to the Power BI Page.

Embed live Data and copy the link.

You only need to move from Live to Snapshot once instead of three times for an entire report page.

You can leave gaps on the Power BI report to add in a textbox in PowerPoint.

The question is

Would the users be happy with the look of this?

If yes, we can we now have new manual business logic.

Complications in the Logic.

We have made the decision to do a whole page as one but lets complicate the logic and add some more data.

We want a page in PowerPoint for every TV Watcher

Shows

TV Watchers

We are on Day 1. Lets Update Power BI with a new page

First up all, Power Query is refreshed with the new data set and the updated data set.

Not a Star Schema but this is about the end product. the PowerPoint so we can relax on this point for the time being.

Logic – Pull in an entire report page. not just visuals.

After refreshing and publishing. we need to add the new page into our live PowerPoint Embed Live Data.

At this point in time there are only two TV watchers. So we can load in the data twice and set the Slicer. now we have a Page for each Person. including some personal text for the specific person.

Next Step.

To Load in more data. Which means more people could potentially be watching.

How would be logically deal with this?

And will the Slicers stay correct for each page?

Add in another days worth of data to the xlsx and refresh
In Power BI, we manually need to check if we have more watchers. Which we do. there are two. this means, manually another two pages need adding.
Open the Live PowerPoint. After A refresh. the good news is the filters have held.
Manually create two new Pages and filter to the two new people

Question. Why can’t we just have one page and allow the user to flick through TV watchers?

Because a manager will come and add unique text for each slicer. And without saving this text somewhere that can’t be added dynamically, we cannot do this.

Question. Will the Business Owners be able to add this into the process every snapshot because its a manual process to use the reports to see if there are new users and then add them as pages.

Set all pages to Snapshot and Save File as Snapshot. Keeping the Live File as the template.

After a Second test this works. there are certain Manual points to the process. But the user gets what they wanted an a template of Live visuals with room for text.

That can be Saved to Snapshot PPTX.

And have filters set for the sub reports. each watcher gets their own page. We just have to manually understand when we need to add new pages. So we couldn’t do this if we have thousands of watchers with new ones adding every snapshot at a great amount.

Report Builder

Another specific Rule for this project is that there will be alot of tables used and these tables will go over pages (1 to Many)

Because of this. As we go we have to go for the Paginated Option of report Builder.

Question. Would this work with Report Builder Objects?

Using Power BI Report Builder

Add the Data Source as the TV Show Semantic Model created above

Then a Data set

Really nothing fancy here. We just need a table that will go over Pages

The paginated report is saved to Service.

And you can see that immediately we have only Microsoft PowerPoint option. It opens in its own file so can’t be added to the PowerPoint containing Power BI visuals

This doesn’t appear to update. It seems to be a simple Snapshot.

Can we do this another way?

We have created and published the Paginated Report.

In Power BI Desktop

We now have the paginated reporting in Power BI. But its not doing the one thing we needed to use Report Builder for in the first place. Its not paginating the table.

We now can either connect to Live or Snapshot data which is good

But the table is not paginated. It is scrollable. So no need to add in Paginated reporting really. we could stick to the table in Power BI for this functionality

Conclusion

The new Snapshot functionality is super helpful when you don’t want your PowerPoint to simply be one object pointing to the latest data.

we can now create snapshots of information via a live PowerPoint quite quickly. The text manually added into each snapshot PowerPoint will always be specific to that point in time. It needs some manual intervention. But for those users who want to keep things simple. This could be just what is required for this kind of logic.

The PowerPoint even holds the slicer you select on a page.

The problem lies in the extra complexity, if the user requires large tables over many PowerPoint pages.

There are questions here of course. Why do you need large tables? If you do need tables. Would the user be happy to scroll through on a slide?

Note that if you select Public Snapshot. because you are sending out to people without licenses, all functionality is lost because you just get images, So if you have tables with sliders you will lose the additional records. Something to think about when setting up a project like this.

If so, Paginated report can be bypassed, and you have licenses for all viewers. Great. But, if this is a requirement. The new functionality will not work for you.

Lots to think about here. As always every month poses lots of questions for current and past projects.

May 20, 2024

Microsoft Fabric Part 4. Taskmaster Project. Adding the fact table

In parts 1 and 2 we created and updated DimContestant

In Part 3 we created DimTask, DimEpisode and DimDate

Its time to create the fact table. the first thing we need to do is to get an understanding of what Facts we have and how they would join to the data we have created in the dims so far.

Lets create a notebook.

%%configure
{
    "defaultLakehouse": {  
        "name": "DebbiesFabricLakehouse"
    }
}

Adding possible csvs into DataFrames

Attempts

#Read the first file into a dataframe
dfattempts = spark.read.format("csv").option("header","true").load("Files/Data/Bronze/attempts.csv")
# dfattempts now is a Spark DataFrame containing CSV data from "Files/Data/Bronze/attempts.csv".
display(dfattempts)

Contains points and ranks and appears to be at the episode and contestant level

episode_scores

#Read the file into a dataframe
dfepsc = spark.read.format("csv").option("header","true").load("Files/Data/Bronze/episode_scores.csv")

display(dfepsc)

episode_scores is new to this project but seems a really good data set. Holds scores and Ranks at the episode and contestant level (but not at task level)

we actually don’t need this because its just aggregated data. We can remove this from our transformations. Power BI will aggregate for us

episodes

#Read the file into a dataframe
dfepis = spark.read.format("csv").option("header","true").load("Files/Data/Bronze/episodes (1).csv")

display(dfepis)

This contains Dates and points. However the Scores are at Episode Level. We will be able to get this information from DAX, summing the individual Scores so we don’t need the scores at all. All we are interested in is the air date.

Lets look at the row counts

ct = dfepsc.count()
print(f"dfepsc: {ct}")

ct = dfepis.count()
print(f"dfepis: {ct}")

ct = dfattempts.count()
print(f"dfattempts: {ct}")

the number of columns vary. We need to make sense of what we have

Create the data we need from attempts

# Select multiple columns with aliases
dfdfattemptsf = dfattempts.select(
    dfattempts.task.alias('attTaskID'), #Join to DimTask
    dfattempts.task_label.alias('attTask'),
    dfattempts.episode.alias('attEpisode'),
    dfattempts.series.alias('attSeries'),
    dfattempts.contestant_label.alias('attContestant'), #Join to DimContestant
    dfattempts.points.alias('attPoints')
).distinct()

# Show the resulting DataFrame
display(dfdfattemptsf)

The Points in Attempts are the best ones we have at the moment because the Task csv was empty.

we are clearly now hitting real data problems and in a normal project we would be going back to the team creating the data to get them to resolve the problems.

Filter attempts

the attempts is our main table but we could do with the episode and Series information working with our dimensions we have and we only have Ids for both

Are episode and series related with the correct IDs? Lets do some filter queries to look at series 1, episode 1.

After a few filtered queries, there is clearly a some data issues, or unknowns that I am not aware of. With attempts. The episode 1 and series 1 doesn’t match the contestants and tasks (I know because I watched it).

For episodes (1) (Above) we do have the correct information for episode and series. But we aren’t using these to get data.

More problems with Episode Scores. Episode appears to be correct. But the series ID is completely different so we cant match on this. But we aren’t using this data any more.

We need to look at our PARQUET file DImEpisode

The first thing we need to do is Load our DimContestant PARQUET data back into a dataframe and to to this, we have to update the code slightly as previously we have loaded in csv files.

# Read Parquet file using read.parquet()
dfdimep  = spark.read.parquet("Files/Data/Silver/DimEpisode.parquet")
display(dfdimep)

To Get the path, Click on you File in the Files Folder in the Lakehouse and Copy ABFS Path, Copy Relative path for spark gives you a more concise path

We are now using the silver layer of the medallion architecture in the path. All our bronze landing data has been transformed.

Are our IDs correct in the Dim for Attempts, or for the Tasks by Objective csv?

Now we have a real issue. Our Episodes cant join to attempts.

Join with one Condition

We can join episode Scores to Episode first via the episode name? This would be an issue if you have repeating episode names in Series, but we are lucky that this is not the case.

# condition
original_condition = dfepis["title"] == dfepscf["scEpisode"]

# Now perform the left outer join
dfEpScorej =  dfepis.join(dfepscf, on=original_condition, how="full").drop(dfepscf.scEpisode)

display(dfEpScorej)

Filter new transformed table for episode 1 series 1

import pyspark.sql.functions as F

filtered_df = dfEpScorej.filter((F.col('episode') == 1) & (F.col('series') ==1))
display(filtered_df)

Filtering on Task 1 Season 1 gives us all 5 contestants against the episode. There are lots of columns we can remove. For example, we don’t need to hard code the winner. This can be done via Power BI DAX. So, the above join will be slightly updated with .drop.

# condition
original_condition = dfepis["title"] == dfepscf["scEpisode"]

# Now perform the left outer join
dfEpScorej =  dfepis.join(dfepscf, on=original_condition, how="full").drop(dfepscf.scEpisode)\
.drop(dfepscf.scEpisodeID).drop(dfepis.winner).drop(dfepis.winner_label).drop(dfepis.studio_date).drop(dfepis.winner).drop(dfepis.finale).drop(dfepis.TMI)

display(dfEpScorej)


Full is being used as the join because we want each side even if they haven't been joined.

Use dfEpScorej now instead of dfepis and dfepscf

We have the attempts (dfattemptsf) left

So how can we join this final table?

We can’t join by Task because Episodes doesn’t include task
We can’t join by Series because the Series IDs don’t match and neither does Episode.
We can match by Contestant but Contestants can take part in multiple series so the join doesnt work

As it stands we cant actually match in this table so we need to go with Episodes and Scores for the fact table.

Add the Dimension Keys into the transformed table

Join DimContestants Via Contestant Name

We Can’t use ID but we can use name.

Also when we join we only want the Key and we can immediately remove everything relating to Contestant.

# Original condition
original_condition = dfdimcont["contestant"] == dfEpScorej["sccontestant"]

# Adding another condition based on the 'name' column
#combined_condition = original_condition & (dfdimcont["contestant"] == dfEpScorej["sccontestant"])

# Now perform the left outer join
dfEpScorej = dfEpScorej.join(dfdimcont, on=original_condition, how="left")\
.drop(dfdimcont.TeamName).drop(dfdimcont.teamID).drop(dfdimcont.age).drop(dfdimcont.gender)\
.drop(dfdimcont.hand).drop(dfdimcont.ageRange).drop(dfdimcont.dob)\
.drop(dfdimcont.contestant).drop(dfdimcont.contestantID)\
.drop(dfEpScorej.sccontestant).drop(dfEpScorej.scContestantID)

display(dfEpScorej)

If we have any null values they can be replaced with -1

But first we can check if any exists

dfEpScorej.filter(dfEpScorej.ContestantKey.isNull()).show()

fillna to replace null with a value

Even though we have none at the moment, there could be some in the future so fillna allows us to replace with our Default keys

dfEpScorej = dfEpScorej.fillna(-1, subset=[“ContestantKey”])
dfEpScorej.show()

Add Episode Key

All is good. we can join it into the transformed table. Removing everything but the key using .drop

the data was checked that it was ok before adding all the .drop’s into the code,

# Original condition
original_condition = dfdimep["seriesID"] == dfEpScorej["series"]

# Adding another condition based on the 'name' column
combined_condition = original_condition & (dfdimep["episodeID"] == dfEpScorej["episode"])

# Now perform the left outer join
df = dfEpScorej.join(dfdimep, on=combined_condition, how="left")\
.drop(dfdimep.episodeTitle).drop(dfdimep.episodeID).drop(dfdimep.series).drop(dfdimep.seriesID)\
.drop(dfEpScorej.series_label).drop(dfEpScorej.series).drop(dfEpScorej.episode).drop(dfEpScorej.title)

display(df)

Set Key to -1 if NULL

dfdimtask = dfdimtask.fillna(-1, subset=["taskKey"])
dfdimtask.show()

Conclusion

This specific project has been abandoned because the data doesn’t give me what I need join wise to create the star schema. Mostly because there is clearly a main table with missing data.

But, I can’t wait to take what I have learned and try again with another data set.

Parts 1 to 4 have given me some great insight into analysis transforming and modelling with Pyspark.

M	T	W	T	F	S	S
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30	31