Published on October 8, 2024

Exploring Data Integration and Orchestration in Azure Synapse Analytics

Azure Synapse Analytics is a powerful tool that unifies data exploration, visualization, and integration experiences for users. In this blog post, we will dive into the data integration and orchestration features of Synapse and explore how they can be used to build a sample data pipeline.

Overview of Data Integration Features

Azure Synapse Analytics inherits most of its data integration components from Azure Data Factory (ADF). In addition, Synapse introduces new integration components, such as Spark notebooks and Spark job definitions, which let users run notebooks with Python, Scala, and .NET code as part of their data integration pipelines. With these additions, Microsoft aims to reduce the need for Databricks in data engineering tasks, although the legacy Databricks components from ADF are still available in Synapse.

If you are not familiar with Azure Data Factory, I would recommend reading my previous tips dedicated to ADF. Having a good understanding of ADF will help you better grasp the material provided here and make the most out of Synapse’s data integration features.

Data Integration Use-Case Scenario

In this demo, we will build a data pipeline that reads JSON files from Blob Storage, performs basic transformations, and writes the results into a SQL pool. Here are the specifics:

  • We will use data from the AdventureWorks database, a sample database provided by Microsoft.
  • The sales order fact tables and the product dimension table will be exported in JSON format and uploaded to Blob Storage.
  • The pipeline will run a PySpark notebook that reads the JSON files, converts them to DataFrames, joins the product dimension with the fact tables to build a denormalized fact table, and writes the results to a temporary location within Blob Storage.
  • The pipeline will then read the files produced by the notebook and merge the data into the destination fact table in the SQL pool.

Before proceeding, make sure you have SQL and Spark pools attached to your Synapse workspace. If not, please provision and start them as we will need to run a number of SQL and Spark scripts.

Creating a Data Transformation Notebook

Let’s open Synapse Studio and navigate to the Develop tab to create a notebook. Name the notebook “DWH_ETL” and select PySpark as the language. Add the commands to initialize the notebook parameters and import the required libraries. Then read the JSON files into DataFrames and perform the required transformations. Finally, publish the notebook to preserve the changes.

Creating DWH Structures

Now that we have processed data in the storage account, we can create the SQL database schema objects. We will use PolyBase technology to read the data from Blob Storage. First, we need to create the file format and data source objects. Then, we can create an external table pointing to the DWH folder in the storage account. Finally, we create a destination table and a stored procedure to merge the data from the external table into the destination table.
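A minimal T-SQL sketch of these objects might look like the following. All object, account, and column names are assumptions for illustration; for non-public storage you would also need a database-scoped credential, and if MERGE is not available in your dedicated SQL pool, an UPDATE/INSERT pair works as well:

```sql
-- 1. File format and data source for PolyBase (names are illustrative).
CREATE EXTERNAL FILE FORMAT ParquetFormat
WITH (FORMAT_TYPE = PARQUET);

CREATE EXTERNAL DATA SOURCE BlobDwh
WITH (
    TYPE = HADOOP,
    LOCATION = 'abfss://data@yourstorageaccount.dfs.core.windows.net'
);

-- 2. External table over the files produced by the notebook.
CREATE EXTERNAL TABLE dbo.FactSales_Ext (
    SalesOrderNumber NVARCHAR(20),
    SalesOrderLineNumber INT,
    ProductName NVARCHAR(100),
    OrderQty INT,
    SalesAmount MONEY
)
WITH (
    LOCATION = '/DWH/FactSales',
    DATA_SOURCE = BlobDwh,
    FILE_FORMAT = ParquetFormat
);

-- 3. Destination table and a procedure to merge new data in.
CREATE TABLE dbo.FactSales (
    SalesOrderNumber NVARCHAR(20),
    SalesOrderLineNumber INT,
    ProductName NVARCHAR(100),
    OrderQty INT,
    SalesAmount MONEY
)
WITH (DISTRIBUTION = HASH(SalesOrderNumber), CLUSTERED COLUMNSTORE INDEX);
GO

CREATE PROCEDURE dbo.MergeFactSales
AS
BEGIN
    MERGE dbo.FactSales AS t
    USING dbo.FactSales_Ext AS s
        ON  t.SalesOrderNumber = s.SalesOrderNumber
        AND t.SalesOrderLineNumber = s.SalesOrderLineNumber
    WHEN MATCHED THEN
        UPDATE SET ProductName = s.ProductName,
                   OrderQty = s.OrderQty,
                   SalesAmount = s.SalesAmount
    WHEN NOT MATCHED THEN
        INSERT (SalesOrderNumber, SalesOrderLineNumber, ProductName, OrderQty, SalesAmount)
        VALUES (s.SalesOrderNumber, s.SalesOrderLineNumber, s.ProductName, s.OrderQty, s.SalesAmount);
END;
```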

Creating a Data Integration Pipeline

The pipeline building interface in Synapse resembles that of Azure Data Factory. We can move to the Orchestrate tab and add a pipeline. Add the necessary parameters and activities to the pipeline, such as the Notebook activity and the SQL pool Stored procedure activity. Set up the dependencies between these activities and publish the pipeline.
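Behind the designer, the published pipeline is stored as JSON. A trimmed sketch of what such a definition might look like is below; the pipeline, activity, notebook, pool, and procedure names are illustrative assumptions:

```json
{
  "name": "LoadSalesPipeline",
  "properties": {
    "activities": [
      {
        "name": "RunDwhEtlNotebook",
        "type": "SynapseNotebook",
        "typeProperties": {
          "notebook": { "referenceName": "DWH_ETL", "type": "NotebookReference" }
        }
      },
      {
        "name": "MergeIntoFactSales",
        "type": "SqlPoolStoredProcedure",
        "dependsOn": [
          { "activity": "RunDwhEtlNotebook", "dependencyConditions": [ "Succeeded" ] }
        ],
        "sqlPool": { "referenceName": "dwhpool", "type": "SqlPoolReference" },
        "typeProperties": { "storedProcedureName": "dbo.MergeFactSales" }
      }
    ]
  }
}
```

The `dependsOn` entry with the `Succeeded` condition is what makes the stored procedure run only after the notebook finishes successfully.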

Conclusion

In this blog post, we explored the data integration and orchestration features of Azure Synapse Analytics. We learned how to build a data pipeline that involves SQL, storage, and Spark services. Synapse pipelines offer seamless integration with various services, both internal and external to the Synapse workspace. By leveraging these features, users can efficiently integrate and orchestrate their data workflows in a unified environment.

Stay tuned for upcoming tips where we will dive deeper into building more complex pipelines in Azure Synapse Analytics.

