Welcome to Part 2 of our series on uploading data from SQL Server to Azure Blob Storage. In Part 1, we discussed how to copy a full SQL database table into Azure Blob Storage as a CSV file. In this article, we will focus on uploading incremental data changes to Azure after the initial data load.
When working with large databases, it is often better to upload only the changed rows instead of the entire dataset. This saves time and resources, especially when the data is updated frequently. The general steps for uploading incremental changes from a table are as follows:
- Enable Change Tracking on the database and table
- Enable Snapshot Isolation on the database (optional)
- Create a table to record Change Tracking versions
- Create a stored procedure to update the Change Tracking table
- Create Linked Services
- Create Datasets
- Make changes in SQL tables
- Create Pipelines
- Trigger Pipeline
Change tracking is a SQL Server feature that records which rows in a table have changed. It maintains a version counter that increases with each change, so you can query for everything that has changed since a given version. With change tracking enabled on the source database, Azure Data Factory (ADF) can retrieve just the changed rows and copy them to the destination, which in our case is a CSV file in Azure Blob Storage.
To enable change tracking, you can run the following SQL statements:
-- Enable change tracking at the database level with a 2-day retention period
ALTER DATABASE AdventureWorks2016
SET CHANGE_TRACKING = ON
(CHANGE_RETENTION = 2 DAYS, AUTO_CLEANUP = ON);

-- Enable change tracking on the table to be loaded incrementally
ALTER TABLE HumanResources.Department
ENABLE CHANGE_TRACKING
WITH (TRACK_COLUMNS_UPDATED = ON);
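To confirm that change tracking is active, you can query the change tracking catalog views. This quick check is optional and not required for the load itself:
-- Confirm change tracking is enabled at the database level
SELECT DB_NAME(database_id) AS database_name,
       retention_period,
       retention_period_units_desc
FROM sys.change_tracking_databases;

-- Confirm change tracking is enabled on the table
SELECT OBJECT_NAME(object_id) AS table_name,
       is_track_columns_updated_on
FROM sys.change_tracking_tables;

-- The current change tracking version for the database
SELECT CHANGE_TRACKING_CURRENT_VERSION() AS current_version;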
Enabling snapshot isolation on your database is optional but advised. If changes are still being made while the pipeline reads them, the version checks and the change query could see different points in time and return inconsistent results; wrapping them in a single snapshot transaction guarantees a consistent view. If you decide to enable snapshot isolation, you can use the following SQL statements:
ALTER DATABASE AdventureWorks2016
SET ALLOW_SNAPSHOT_ISOLATION ON;
-- Example of using snapshot isolation
SET TRANSACTION ISOLATION LEVEL SNAPSHOT;
BEGIN TRAN
-- Verify that the version of the previous synchronization is valid.
-- Obtain the version to use next time.
-- Obtain changes.
COMMIT TRAN
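The comments above outline the synchronization pattern. A minimal sketch of what those steps look like in T-SQL, with an illustrative hard-coded last version, might be:
SET TRANSACTION ISOLATION LEVEL SNAPSHOT;
BEGIN TRAN;

DECLARE @last_sync_version BIGINT = 0;   -- version saved after the previous load (illustrative)
DECLARE @next_sync_version BIGINT;

-- Verify that the previous version is still within the retention window.
IF @last_sync_version < CHANGE_TRACKING_MIN_VALID_VERSION(OBJECT_ID('HumanResources.Department'))
    RAISERROR('Last sync version is too old; a full reload is required.', 16, 1);

-- Obtain the version to use next time.
SET @next_sync_version = CHANGE_TRACKING_CURRENT_VERSION();

-- Obtain the changes made since the last synchronization.
SELECT *
FROM CHANGETABLE(CHANGES HumanResources.Department, @last_sync_version) AS ct;

COMMIT TRAN;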
Next, you need to create a table to record the change tracking versions. This table will store the version number and the table name. Here is an example of how to create the table:
USE AdventureWorks2016;
GO
CREATE TABLE ChangeTrackingVersion
(
tablename VARCHAR(255),
sys_change_version BIGINT
);
-- Seed the table with the current change tracking version for the source table
DECLARE @sync_version BIGINT;
SET @sync_version = CHANGE_TRACKING_CURRENT_VERSION();
INSERT INTO ChangeTrackingVersion (tablename, sys_change_version)
VALUES ('[HumanResources].[Department]', @sync_version);
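A quick SELECT confirms that the seed row recorded the current version:
SELECT tablename, sys_change_version
FROM ChangeTrackingVersion;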
After creating the table, you can create a stored procedure to update the change tracking version. This stored procedure will be invoked by the ADF pipeline to update the version number for the next pipeline run. Here is an example of how to create the stored procedure:
USE [AdventureWorks2016];
GO
CREATE PROCEDURE [dbo].[Update_ChangeTrackingVersion]
@CurrentChangeTrackingVersion BIGINT,
@TableName VARCHAR(100)
AS
BEGIN
UPDATE ChangeTrackingVersion
SET sys_change_version = @CurrentChangeTrackingVersion
WHERE tablename = @TableName;
END
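The pipeline will invoke this procedure through a Stored Procedure activity, but you can also test it manually; the values below are illustrative. Note that doing so advances the stored version, so the next pipeline run will only pick up changes made after this point:
-- Manually record the current version for the tracked table (test only)
DECLARE @new_version BIGINT = CHANGE_TRACKING_CURRENT_VERSION();

EXEC [dbo].[Update_ChangeTrackingVersion]
    @CurrentChangeTrackingVersion = @new_version,
    @TableName = '[HumanResources].[Department]';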
Now that we have the necessary SQL Server configurations in place, we can move on to creating the Linked Services and Datasets in Azure Data Factory. These components establish the communication links between our SQL Server, ADF, and Azure Blob Storage.
In Part 1 of this series, we already created Linked Services for our SQL Server and Azure Blob Storage, and we can reuse them for the incremental load. As a quick recap, the steps to create the linked services are:
- Sign in to your Azure account and open Azure Data Factory.
- Create a Source Linked Service for your SQL Server by providing the necessary details such as server name, database name, and authentication method.
- Create a Sink Linked Service for your Azure Blob Storage by providing the necessary details such as the storage account name and authentication method.
With the Linked Services in place, we can now create the Datasets. In this step, we will create three datasets: one for the source table in SQL Server, one for the destination data (the CSV file in Azure Blob Storage), and one for the ChangeTrackingVersion table. These datasets define the structure and location of the data.
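Before building the pipeline, make a few changes in the source table so the incremental load has something to pick up. The statements below are only sample changes; any insert, update, or delete against the tracked table will do:
-- Sample changes to generate change tracking data (illustrative values)
INSERT INTO HumanResources.Department (Name, GroupName)
VALUES ('Cloud Engineering', 'Research and Development');

UPDATE HumanResources.Department
SET ModifiedDate = GETDATE()
WHERE Name = 'Engineering';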
With the Datasets created and some test changes in the source table, we can proceed to create the Pipeline. A Pipeline is a workflow that orchestrates the data movement and transformation activities. In our case, the Pipeline will look up the old and new change tracking versions, copy the changed rows from SQL Server to Azure Blob Storage, and then call the stored procedure to record the new version for the next run.
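At the heart of the pipeline is the source query for the copy activity, which joins the change tracking data with the source table between the old and new versions. The sketch below shows the idea using local variables; in ADF, the two version values would be supplied by the lookup activities instead:
DECLARE @old_version BIGINT, @new_version BIGINT;

-- Old version: the value stored after the previous load
SELECT @old_version = sys_change_version
FROM ChangeTrackingVersion
WHERE tablename = '[HumanResources].[Department]';

-- New version: the current change tracking version
SET @new_version = CHANGE_TRACKING_CURRENT_VERSION();

-- Rows inserted, updated, or deleted since the last load
SELECT d.DepartmentID, d.Name, d.GroupName, d.ModifiedDate,
       ct.SYS_CHANGE_VERSION, ct.SYS_CHANGE_OPERATION
FROM CHANGETABLE(CHANGES HumanResources.Department, @old_version) AS ct
LEFT JOIN HumanResources.Department AS d
    ON d.DepartmentID = ct.DepartmentID
WHERE ct.SYS_CHANGE_VERSION <= @new_version;
Deleted rows appear with NULL data columns and SYS_CHANGE_OPERATION = 'D', so the downstream process can decide how to handle them in the CSV.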
Finally, we can trigger the Pipeline. You can run it on demand for testing, or attach a schedule or event-based trigger so it runs automatically at regular intervals.
By following these steps, you can efficiently upload incremental data changes from SQL Server to Azure Blob Storage. This approach saves time and resources by only transferring the necessary data.
Thank you for reading Part 2 of our series on uploading data to Azure Blob Storage with SQL Server. Stay tuned for more articles on SQL Server concepts and ideas!
For any further questions or assistance, please feel free to contact us at SQLRx@sqlrx.com. We are here to help!
Visit us at www.sqlrx.com for more information and resources.