Azure Synapse – Creating a Type 1 Upload with Pipelines

Working through exercises and learning paths for Microsoft Synapse is a really good way of becoming familiar with concepts and tools. And whilst these are a fantastic source of learning, the more you do, the more questions you end up asking. Why use this option over that option, for example?

So, let's look at one of the initial Pipeline creation exercises for Azure Synapse in more detail and try to answer the questions that arise.

https://microsoftlearning.github.io/dp-203-azure-data-engineer/Instructions/Labs/10-Synpase-pipeline.html

The end game is to take a file from the Serverless SQL Pool (Data Lake) and load it into the Dedicated SQL Pool: the Synapse Data Warehouse.

But if the data is already there, we simply update it rather than adding a duplicate record.

For this you need Azure Synapse and a Data Lake.

Analytics Pools

In Synapse Studio, go to Manage.

Our starting point is the Built-in Serverless SQL Pool, connected to a Data Lake (a storage account resource in Azure).

We also have a Dedicated SQL Pool, which you can see in Data.

This data can also be queried serverlessly, without having to unpause the dedicated capacity, by adding an External Data Source, an External File Format and External Tables.
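
For reference, the external objects that make this possible look something like the sketch below. This is illustrative only: the storage URL, object names and column list are assumptions rather than taken from the lab, and depending on your setup you may also need a database scoped credential.

```sql
-- External data source pointing at the data lake container
-- (replace <storageaccount> with your own storage account name).
CREATE EXTERNAL DATA SOURCE files_ds
WITH (LOCATION = 'https://<storageaccount>.dfs.core.windows.net/files');

-- File format describing the delimited csv files, skipping the header row.
CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',', STRING_DELIMITER = '"', FIRST_ROW = 2)
);

-- External table over the Product.csv file (columns are assumed for illustration).
CREATE EXTERNAL TABLE dbo.Products_Ext (
    ProductID   INT,
    ProductName NVARCHAR(100),
    Category    NVARCHAR(50),
    ListPrice   DECIMAL(10, 2)
)
WITH (
    LOCATION    = 'data/Product.csv',
    DATA_SOURCE = files_ds,
    FILE_FORMAT = CsvFormat
);
```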

Back to Manage. We now know we have SQL Pools, both Serverless and Dedicated. Serverless will be our source; Dedicated is the destination.

Linked Services

Default Linked Services

The following were already created. It's useful to understand what is and isn't used for this dataflow.

  • Name: synapse*********-WorkspaceDefaultStorage
  • Type: Azure Data Lake Storage Gen2

This is the default linked service to the Data Lake Storage Gen 2 in Azure. You can find this in Azure as your Storage account resource.

  • Name: synapse********WorkspaceDefaultSqlServer
  • Type: Azure Synapse Analytics
  • Authentication Type: System Assigned Managed Identity

This one isn’t used for this particular process. However, what makes it different from the other Azure Synapse Analytics Linked Service that has been set up (below)? We will answer that shortly.

  • Name: DataWarehouse_ls
  • Type: Azure Synapse Analytics
  • Authentication Type: System Assigned Managed Identity
  • Database Name: TrainingDB

This was added for this project and is the Dedicated SQL Pool Database.

The answer seems to be that there is no real difference, apart from the fact that the default one is parameterised. Here are the differences:

  • Fully Qualified Domain Name: in the one we created it is synapse*****.sql.azuresynapse.net. In the automated Linked Service it is tcp:synapse*****.sql.azuresynapse.net,1433, which has more detail, such as the Transmission Control Protocol (TCP) and the TCP port.
  • Database Name: in the one we created it is TrainingDB, after the database in the Synapse Workspace. In the automated Linked Service it is @{linkedService().DBName}.
  • Parameters: the created one doesn't have any parameters. The default linked service has one parameter, with Name DBName and Type String.

It will be interesting to see whether changing the linked service to the default one changes things for us.

Integrate

For this task we create a Pipeline called Load Product Data Type 1 and, in Move and transform, we use a Data Flow.

The Copy activity is the simpler way to load data: you specify the source, sink and data mapping. A Data Flow can transform the data as well as copy it. At run time, the Data Flow is executed in a Spark environment rather than on the Data Factory execution runtime.

Copy Activity: around £0.083 an hour. Orchestration is around £1.237 per 1,000 runs.

Data Flows: start at £0.0228 an hour.

The DataFlow is called LoadProductsData

Settings

Staging should only be configured when your data flow has Azure Synapse Analytics as a source or sink, which it does here with the Data Warehouse destination (sink).

Double-click on the dataflow to get into the details.

Source

Sources are the first to be created

Source Type – Integration dataset, Inline, Workspace DB

There is no information in the dataflow as to what the source types are and which is best to choose. Looking at the documentation, we can glean the following:

Some formats support both inline and dataset objects. Dataset objects are reusable and can be used in other data flows and copy activities; they are good for hardened schemas. Datasets are NOT based on Spark.

Inline datasets are good for flexible schemas or one-off instances. They are also good for parameterised sources. Inline datasets are based on Spark.

  • Integration Dataset
  • Inline Dataset
  • Workspace DB

An Integration Dataset is a separate, reusable dataset object built on top of a linked service, whereas an Inline source connects straight to the linked service set up in Linked Services, bypassing that extra dataset step. Workspace DB lets you do the same, but it is only available for Azure Synapse Analytics objects (like the Data Warehouse).

So for this project, the Integration Dataset has been used, which is not based on Spark.

Dataset

A Dataset needs to be created:

  • Type: Azure Data Lake Storage Gen2
  • Format: Delimited text (because we are working with csv files)
  • Name: Products_Csv
  • Linked service: synapsexxxxxxx-WorkspaceDefaultStorage (found in Manage, Linked Services; this is the Azure Data Lake Storage Gen2 Linked Service)
  • File path: files/data/Product.csv (in the data lake we have a files container, then a data folder; see the query sketch below this list)
  • First row as header: Selected (there is a header in the csv files)
  • Import schema: From connection/store (take the schema from the actual data source)
  • Allow schema drift: Selected
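
As a side note (not part of the lab), you can sanity-check the same file from the Serverless SQL Pool with an OPENROWSET query. This is only a sketch: the storage account URL is an assumption.

```sql
-- Preview the csv the dataset points at (files/data/Product.csv).
-- Replace <storageaccount> with your own Data Lake storage account name.
SELECT TOP 10 *
FROM OPENROWSET(
        BULK 'https://<storageaccount>.dfs.core.windows.net/files/data/Product.csv',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0',
        HEADER_ROW = TRUE      -- matches "First row as header"
     ) AS product_rows;
```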

In Data Factory there is a Datasets folder where you can find all your datasets and, of course, parameterise them so you can use, for example, the same csv dataset for all your csv files. However, there is no dataset folder within Integrate.

So where can you find your datasets once they have been created in Synapse?

Schema Drift

What is Schema Drift? If you have a schema which changes, for example new columns are added and old ones deleted, you would have to develop against this drift, making sure that you are capturing the right information. This could mean a lot of development work, and it affects your projects from source through to your Power BI analytics models.

It would be really useful if we could remove some development work and allow the Pipeline to deal with this.

Allowing Schema Drift lets the flow absorb these schema changes without constant redevelopment.

Projection

Here we see the projected columns from the data source. Hopefully we could also parameterise this, so that one data flow with the same functionality could handle multiple loads, e.g. 5 csv files being moved to 5 tables in a database.

Another source is also created

Data transformation – Lookup

From the + under the first source, Lookup is selected.

Here we take the Products in the csv file and look them up against the destination table in the dedicated SQL Pool.

Match on

Why has Last row been chosen? If we chose Any row, we would just specify the lookup conditions, which match the csv file's ProductID to the Dedicated SQL Pool's ProductAltKey. So this is the only lookup required: great when you know there are no duplicates.

Because Last row has been chosen, a sort order on the data is required, which here is the ProductAltKey.


Note that this is all that is being used in this instance: does the business key already exist in the data warehouse?
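
Conceptually, the lookup behaves like a left join from the incoming csv rows to the dimension table. A rough T-SQL equivalent might look like this (the staging and dimension table names are assumptions for illustration):

```sql
-- Does the business key (ProductID) already exist in the warehouse?
-- A NULL ProductKey on the warehouse side means the product is new.
SELECT s.ProductID,
       s.ProductName,
       d.ProductKey
FROM   stg.Products      AS s          -- rows from Product.csv (assumed name)
LEFT JOIN dbo.DimProduct AS d          -- Dedicated SQL Pool dimension (assumed name)
       ON d.ProductAltKey = s.ProductID;
```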

Alter Row

From the lookup, a new activity is selected: Alter Row.

Now we are loading into the DW.

The incoming stream is the previous activity. Here we add Alter Row conditions.

In this example the join between the tables finds two matches. Where there is no match, both ProductKey and ProductAltKey are NULL in the result set.

Therefore, if the warehouse ProductKey is NULL in the result set, insert the record. If it is not NULL, then upsert (update or insert).

We could refine this with more logic where we check the actual data row and, if it hasn't changed, do nothing: only update or insert when the row has changed. This is better for larger data sets.
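
To make that refinement concrete, here is what the same Type 1 logic could look like as a T-SQL MERGE against the Data Warehouse. This is a hedged sketch only: the table and column names are assumptions, and in a Dedicated SQL Pool you would want to confirm MERGE support and the distribution of your tables first.

```sql
-- Type 1 load: update changed rows in place, insert new ones, ignore unchanged ones.
MERGE dbo.DimProduct AS d                      -- warehouse dimension (assumed name)
USING stg.Products   AS s                      -- incoming csv rows (assumed name)
    ON d.ProductAltKey = s.ProductID           -- business key match, as in the lookup
WHEN MATCHED AND ( d.ProductName <> s.ProductName
                OR d.Category    <> s.Category
                OR d.ListPrice   <> s.ListPrice ) THEN
    UPDATE SET d.ProductName = s.ProductName,
               d.Category    = s.Category,
               d.ListPrice   = s.ListPrice
WHEN NOT MATCHED BY TARGET THEN
    INSERT (ProductAltKey, ProductName, Category, ListPrice)
    VALUES (s.ProductID, s.ProductName, s.Category, s.ListPrice);
```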

Debugging

Now we have a simple data flow, it's time to turn on the Spark cluster and debug.

Once the Spark pool has started, you can click on the end sink and go to Preview.

The icons to the left show new and upserted records. There were three ProductIDs that already existed in the data warehouse. Now we can also go back to the pipeline and choose Trigger now.

So what has happened?

In Monitor you can check how the flow is doing. Once it shows as Succeeded, go to the table and run a SQL statement.
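
For example, something like the following against the Dedicated SQL Pool confirms the load (the dimension table name is an assumption):

```sql
-- How many product rows landed in the warehouse, and what do they look like?
SELECT COUNT(*) AS ProductRowCount
FROM   dbo.DimProduct;

SELECT TOP 10 *
FROM   dbo.DimProduct
ORDER  BY ProductKey;
```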

The first thing to note is that we now have all the records from the csv file in the dedicated SQL pool data warehouse. But what about the staging options in the dataflow settings?

If we now go into Data – Linked and to the primary data lake store, we can see we have a folder called Stage_Products. It's empty. So what does it do?

If you run it again and keep refreshing, you can see a Parquet file being created before being deleted.

Basically, the source data is loaded into staging first. The transformation process can then work over the staged copy rather than the actual source file, and once the process completes the staging file is removed.

So, throughout the creation of this basic Type 1 dataflow, a lot of questions have been asked and answered, along with a lot of ideas for more complex load processes: how to use parameterisation of objects where possible, and how to create reusable content.

So lots more to come.

