Designing an Incremental Data Integration Pipeline in SQL Server

In today’s data-driven world, organizations are constantly dealing with large volumes of data. To efficiently process and analyze this data, it is crucial to have a well-designed data integration pipeline. In this article, we will explore how to design an incremental data integration pipeline in SQL Server.

Data Preparation

Before we dive into the pipeline design, let’s first prepare our data. In this example, we will be using the Adventure Works database. We will introduce date variations spanning a few weeks by updating the ModifiedDate column in the SalesOrderHeader table. This will simulate historical data that we want to incrementally ingest into our pipeline.

Here is the SQL script to update the ModifiedDate column:

UPDATE [SalesLT].[SalesOrderHeader] SET ModifiedDate='2022-01-01' WHERE SalesOrderID < 71831;
UPDATE [SalesLT].[SalesOrderHeader] SET ModifiedDate='2022-01-08' WHERE SalesOrderID BETWEEN 71831 AND 71901;
UPDATE [SalesLT].[SalesOrderHeader] SET ModifiedDate='2022-01-17' WHERE SalesOrderID BETWEEN 71902 AND 71920;
UPDATE [SalesLT].[SalesOrderHeader] SET ModifiedDate='2022-01-25' WHERE SalesOrderID > 71920;

Next, we need to fetch the data from the Azure SQL DB to the Bronze layer of our data lake. We can do this by running the Azure SQL DB to Raw and Raw to Bronze layer pipelines that we created in earlier posts.

To ensure that the Bronze table has received the latest updates, we can run the following SQL script:

SELECT
    TOP 100 *
FROM
    OPENROWSET(
        BULK 'https://synstg.blob.core.windows.net/syn-fs/delta/bronze/SalesOrderHeader/',
        FORMAT = 'DELTA'
    ) AS [result]
ORDER BY ModifiedDate DESC;

Data Integration Pipeline Design

Now that our data is prepared, let’s move on to designing the incremental data integration pipeline. We will create a pipeline that divides the historical data into smaller pieces based on certain time intervals and calls a mapping data flow for each interval.

First, we need to create a data integration pipeline and add two parameters of the string type: WindowStart and WindowEnd. These parameters will receive the trigger window timestamps.

Next, we add a data flow activity to the pipeline and point it to the data flow that we created in the previous post.

In the Parameters tab of the data flow activity, we provide the following pipeline expressions as the parameter values:

WindowStart – @pipeline().parameters.MyStartDate
WindowEnd – @pipeline().parameters.MyEndDate

These expressions pass the pipeline parameters to the mapping data flow.

Pipeline Orchestration

To handle both historical and future time periods, we need to design the orchestration of our pipeline. In Azure Synapse Analytics, we can use tumbling triggers to create backdated window ranges.

In our example, we will create a weekly trigger starting from January 1, 2022. This trigger will run on an hourly frequency for 168 hours. We also need to ensure that the “Start trigger on creation” checkbox is selected.

It is important to limit the max concurrency setting to one in the Advanced panel. This ensures that multiple instances of the pipeline do not run simultaneously, as Delta Lake tables do not allow concurrent writes.

Once the trigger is published, we can navigate to the Monitor page to see multiple instances of the pipeline running sequentially. We can also view the details of each execution, including the data flow details.

Validating the Data

Finally, we can validate the data in the destination Delta table by running the following SQL script:

SELECT
    TOP 100 *
FROM
    OPENROWSET(
        BULK 'https://synstg.blob.core.windows.net/syn-fs/delta/silver/SalesOrderHeader/',
        FORMAT = 'DELTA'
    ) AS [result];

This will retrieve the top 100 records from the Delta table.

Conclusion

In this article, we have explored how to design an incremental data integration pipeline in SQL Server. By dividing the historical data into smaller pieces and using tumbling triggers, we can efficiently ingest and process large volumes of data. This pipeline design also allows for future time periods, making it a versatile solution for data integration.

Thank you for reading!

Article Last Updated: 2022-04-21

Click to rate this post!

[Total: 0 Average: 0]

Comprehensive 360 Degree Assessment

Data Replication

Performance Optimization

Data Security

Database Migration

Expert Consultation

Cloud Migration Made Easy

Considering a move to the cloud? Axial SQL brings you proven migration strategies to streamline your transition. Our expert team ensures a smooth, efficient shift, keeping your data safe and accessible. Start your journey to the cloud with confidence!

SQL Performance Optimization

Is your SQL running slower than expected? Don't let sluggish performance hinder your business. Our optimization experts at Axial SQL specialize in tuning your databases for peak performance. Speed up your SQL and supercharge your data processing today!

Database Stability Solutions

Tired of frequent database outages? Discover stability with Axial SQL! Our comprehensive analysis identifies and resolves your database vulnerabilities. Enhance reliability, reduce downtime, and keep your operations running smoothly with our expert guidance.

Expert Database Team Evaluation

Questioning your database team's efficiency? Let Axial SQL provide an expert, unbiased analysis. We assess your team's strategies and workflows, offering insights and improvements to boost productivity. Elevate your database management to new heights!

Data Security Assurance

Concerned about your database security? Axial SQL is here to fortify your data defenses. Our specialized security assessments identify potential risks and implement robust protections. Keep your sensitive data secure and your peace of mind intact with our expert services.

Published on