In today’s data-driven world, organizations are constantly dealing with large volumes of data. To efficiently process and analyze this data, it is crucial to have a well-designed data integration pipeline. In this article, we will explore how to design an incremental data integration pipeline in SQL Server.
Data Preparation
Before we dive into the pipeline design, let’s first prepare our data. In this example, we will be using the Adventure Works database. We will introduce date variations spanning a few weeks by updating the ModifiedDate column in the SalesOrderHeader table. This will simulate historical data that we want to incrementally ingest into our pipeline.
Here is the SQL script to update the ModifiedDate column:
UPDATE [SalesLT].[SalesOrderHeader] SET ModifiedDate='2022-01-01' WHERE SalesOrderID < 71831;
UPDATE [SalesLT].[SalesOrderHeader] SET ModifiedDate='2022-01-08' WHERE SalesOrderID BETWEEN 71831 AND 71901;
UPDATE [SalesLT].[SalesOrderHeader] SET ModifiedDate='2022-01-17' WHERE SalesOrderID BETWEEN 71902 AND 71920;
UPDATE [SalesLT].[SalesOrderHeader] SET ModifiedDate='2022-01-25' WHERE SalesOrderID > 71920;
Next, we need to fetch the data from the Azure SQL DB to the Bronze layer of our data lake. We can do this by running the Azure SQL DB to Raw and Raw to Bronze layer pipelines that we created in earlier posts.
To ensure that the Bronze table has received the latest updates, we can run the following SQL script:
SELECT
TOP 100 *
FROM
OPENROWSET(
BULK 'https://synstg.blob.core.windows.net/syn-fs/delta/bronze/SalesOrderHeader/',
FORMAT = 'DELTA'
) AS [result]
ORDER BY ModifiedDate DESC;
Data Integration Pipeline Design
Now that our data is prepared, let’s move on to designing the incremental data integration pipeline. We will create a pipeline that divides the historical data into smaller pieces based on certain time intervals and calls a mapping data flow for each interval.
First, we need to create a data integration pipeline and add two parameters of the string type: WindowStart and WindowEnd. These parameters will receive the trigger window timestamps.
Next, we add a data flow activity to the pipeline and point it to the data flow that we created in the previous post.
In the Parameters tab of the data flow activity, we provide the following pipeline expressions as the parameter values:
- WindowStart – @pipeline().parameters.MyStartDate
- WindowEnd – @pipeline().parameters.MyEndDate
These expressions pass the pipeline parameters to the mapping data flow.
Pipeline Orchestration
To handle both historical and future time periods, we need to design the orchestration of our pipeline. In Azure Synapse Analytics, we can use tumbling triggers to create backdated window ranges.
In our example, we will create a weekly trigger starting from January 1, 2022. This trigger will run on an hourly frequency for 168 hours. We also need to ensure that the “Start trigger on creation” checkbox is selected.
It is important to limit the max concurrency setting to one in the Advanced panel. This ensures that multiple instances of the pipeline do not run simultaneously, as Delta Lake tables do not allow concurrent writes.
Once the trigger is published, we can navigate to the Monitor page to see multiple instances of the pipeline running sequentially. We can also view the details of each execution, including the data flow details.
Validating the Data
Finally, we can validate the data in the destination Delta table by running the following SQL script:
SELECT
TOP 100 *
FROM
OPENROWSET(
BULK 'https://synstg.blob.core.windows.net/syn-fs/delta/silver/SalesOrderHeader/',
FORMAT = 'DELTA'
) AS [result];
This will retrieve the top 100 records from the Delta table.
Conclusion
In this article, we have explored how to design an incremental data integration pipeline in SQL Server. By dividing the historical data into smaller pieces and using tumbling triggers, we can efficiently ingest and process large volumes of data. This pipeline design also allows for future time periods, making it a versatile solution for data integration.
Thank you for reading!
Article Last Updated: 2022-04-21