Many businesses today are looking to leverage Azure's data services to build an ELT process that moves data from their on-premises SQL Server instances to Azure Data Lake Storage. In this article, we will explore how to solve this problem using Azure Data Factory.
Prerequisite Resources
Before we begin, let’s make sure we have the necessary resources set up:
- Azure Data Factory V2: This will serve as the ELT orchestration tool.
- Azure Data Lake Storage Gen2: The data will be stored in ADLS2.
- On-Premises SQL Server: The source data will be stored in an on-premises SQL Server.
- Azure SQL Database (Standard): We will use Azure SQL Database to store the pipeline parameter table and its values.
- Azure Data Factory Self-Hosted IR: This will allow us to link our on-premises resources to Azure.
Prepare and Verify SQL Server Database Objects
First, we will open SQL Server Management Studio, connect to the on-premises SQL Server containing the databases we want to move, and verify that those databases contain tables.
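As a quick check, a query along the lines of the following can be run in SSMS against each source database (the database name used here is just a placeholder) to confirm that base tables exist:

```sql
-- Hypothetical check: list the user tables in one source database.
-- Replace SourceDatabase with the actual database name.
USE SourceDatabase;
GO

SELECT TABLE_CATALOG,
       TABLE_SCHEMA,
       TABLE_NAME
FROM   INFORMATION_SCHEMA.TABLES
WHERE  TABLE_TYPE = 'BASE TABLE'
ORDER BY TABLE_SCHEMA, TABLE_NAME;
```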
Prepare and Verify Azure SQL Database Objects
Next, we will create a pipeline parameter table in our Azure SQL Database to store the table names, catalog names, and process flags. This table will drive the pipeline configurations at runtime.
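A minimal sketch of such a table is shown below; the table and column names (dbo.pipeline_parameter, table_catalog, table_name, process_flag) are illustrative and can be adjusted or extended as needed:

```sql
-- Illustrative pipeline parameter table in Azure SQL Database.
-- One row per source table; process_flag controls whether the table is loaded.
CREATE TABLE dbo.pipeline_parameter
(
    parameter_id  INT IDENTITY(1, 1) NOT NULL PRIMARY KEY,
    table_catalog NVARCHAR(128) NOT NULL,      -- source database name
    table_name    NVARCHAR(128) NOT NULL,      -- source table name
    process_flag  BIT NOT NULL DEFAULT (1)     -- 1 = include in the load
);
```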
Prepare Azure Data Lake Storage Gen2 Container
We also need to create an ADLS2 container and a folder for our server. We should confirm that there is no existing data in the server-level folder.
Create Azure Data Factory Pipeline Resources
Now we are ready to begin creating our Azure Data Factory Pipeline. We will navigate to our Azure Data Factory resource and click on “Author and monitor”. Once the Azure Data Factory canvas loads, we will click on “Create Pipeline”.
Create Self Hosted IR
We will create our Self-Hosted Integration Runtime and verify that it shows a “Running” status. This will allow us to connect our on-premises resources to Azure.
Create Linked Services
Once our Self-Hosted IR is created, we will create all the required Linked Services, including SQL Server, Azure SQL Database, and Azure Data Lake Storage Gen2.
Create Datasets
Our pipelines will require three datasets:
- DS_ADLS2: This dataset will use the Parquet format and will allow us to partition our data into YY-MM-DD-HH folders.
- DS_SQLSERVER: This dataset will connect to our on-premises SQL Server.
- DS_ASQLDB_PIPELINE_PARAMETER: This dataset will connect to our pipeline parameter table in Azure SQL Database.
Create Azure Data Factory Pipelines
Now we can start creating our Azure Data Factory Pipelines:
P_Insert_Base_Table_Info
This pipeline will query the on-premises INFORMATION_SCHEMA.TABLES view as its source to get the table name and database (catalog) name for each table, and will then write the results to the pipeline parameter table in Azure SQL Database. The purpose of this pipeline is to populate the parameter table that drives the rest of our pipelines.
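As a rough sketch, the source query for this pipeline's copy activity could look like the following, run against each source database, with the column aliases matching the illustrative parameter table above:

```sql
-- Illustrative source query for P_Insert_Base_Table_Info.
-- Returns one row per base table in the current database; the results are
-- written to the pipeline parameter table in Azure SQL Database.
SELECT TABLE_CATALOG AS table_catalog,
       TABLE_NAME    AS table_name
FROM   INFORMATION_SCHEMA.TABLES
WHERE  TABLE_TYPE = 'BASE TABLE';
```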
P_SQL_to_ADLS
This pipeline will perform the SQL Server to ADLS2 data orchestration. It will use a Lookup activity to retrieve the rows of the pipeline parameter table and a ForEach activity to iterate over them. Within the ForEach activity, a Copy activity will copy each table's data from the on-premises SQL Server to ADLS2.
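A sketch of the Lookup activity's query, again assuming the illustrative dbo.pipeline_parameter table from earlier:

```sql
-- Illustrative Lookup query: return only the tables flagged for processing,
-- so the ForEach activity iterates over exactly these rows.
SELECT table_catalog,
       table_name
FROM   dbo.pipeline_parameter
WHERE  process_flag = 1;
```

Inside the ForEach, the Copy activity's source can then reference the current item with a dynamic expression, for example a query built from @{item().table_catalog} and @{item().table_name}, while the sink dataset builds the matching database-level and table-level folder path in ADLS2.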
Run the Data Factory Pipeline & Verify Azure Data Lake Storage Gen2 Objects
Finally, we can run the P_SQL_to_ADLS pipeline. Once it completes, we can verify that the data has been moved to Azure Data Lake Storage: we should see the appropriate database-level and table-level folders in ADLS2.
By following these steps, businesses can easily move their on-premises SQL Server data to Azure Data Lake Storage using Azure Data Factory. This allows for a structured, partitioned, and repeatable process for all servers in their SQL Server ecosystem.