Introduction:
When working with data pipelines in Azure Data Factory (ADF), there are often situations where we need to combine records from multiple source files and consolidate them into a single output file. In this article, we will explore the steps to merge multiple files using SQL Server.
Solution:
For this demonstration, we will be using three files stored in a data lake storage folder: File1.csv, File2.csv, and File3.csv. Each file has a similar structure, with columns such as SALESORDERNUMBER, ISEXPORTSALE, CREATEDDATE, ISSELECTED, SALESORDERSTATUS, CURRENCYCODE, DELIVERYADDRESSCITY, DELIVERYADDRESSCOUNTRYREGIONID, and DELIVERYADDRESSSTATEID.
Let’s take a look at the content of these three source files:
File1.csv:
SALESORDERNUMBER,ISEXPORTSALE,CREATEDDATE,ISSELECTED,SALESORDERSTATUS,CURRENCYCODE,DELIVERYADDRESSCITY,DELIVERYADDRESSCOUNTRYREGIONID,DELIVERYADDRESSSTATEID 1,true,2021-01-01,false,Completed,USD,New York,US,NY 2,false,2021-01-02,true,In Progress,EUR,Paris,FR,75 3,true,2021-01-03,false,Completed,GBP,London,GB,ENG 4,false,2021-01-04,true,In Progress,USD,Los Angeles,US,CA 5,true,2021-01-05,false,Completed,EUR,Berlin,DE,BE
File2.csv:
SALESORDERNUMBER,ISEXPORTSALE,CREATEDDATE,ISSELECTED,SALESORDERSTATUS,CURRENCYCODE,DELIVERYADDRESSCITY,DELIVERYADDRESSCOUNTRYREGIONID,DELIVERYADDRESSSTATEID 20000,true,2021-01-01,false,Completed,USD,New York,US,NY 20001,false,2021-01-02,true,In Progress,EUR,Paris,FR,75 20002,true,2021-01-03,false,Completed,GBP,London,GB,ENG 20003,false,2021-01-04,true,In Progress,USD,Los Angeles,US,CA 20004,true,2021-01-05,false,Completed,EUR,Berlin,DE,BE
File3.csv:
SALESORDERNUMBER,ISEXPORTSALE,CREATEDDATE,ISSELECTED,SALESORDERSTATUS,CURRENCYCODE,DELIVERYADDRESSCITY,DELIVERYADDRESSCOUNTRYREGIONID,DELIVERYADDRESSSTATEID 1000,true,2021-01-01,false,Completed,USD,New York,US,NY 1001,false,2021-01-02,true,In Progress,EUR,Paris,FR,75 1002,true,2021-01-03,false,Completed,GBP,London,GB,ENG 1003,false,2021-01-04,true,In Progress,USD,Los Angeles,US,CA 1004,true,2021-01-05,false,Completed,EUR,Berlin,DE,BE
SQL Server provides various methods to merge these files into a single output file. One approach is to use the BULK INSERT statement to load the data from each file into a temporary table, and then insert the records from the temporary table into the final output table.
Here is an example of how this can be achieved:
-- Create a temporary table to hold the merged data
CREATE TABLE #MergedData (
SALESORDERNUMBER INT,
ISEXPORTSALE BIT,
CREATEDDATE DATE,
ISSELECTED BIT,
SALESORDERSTATUS VARCHAR(50),
CURRENCYCODE VARCHAR(3),
DELIVERYADDRESSCITY VARCHAR(50),
DELIVERYADDRESSCOUNTRYREGIONID VARCHAR(2),
DELIVERYADDRESSSTATEID VARCHAR(2)
)
-- Load data from File1.csv into the temporary table
BULK INSERT #MergedData
FROM 'C:\Path\To\File1.csv'
WITH (
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n',
FIRSTROW = 2 -- Skip the header row
)
-- Load data from File2.csv into the temporary table
BULK INSERT #MergedData
FROM 'C:\Path\To\File2.csv'
WITH (
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n',
FIRSTROW = 2 -- Skip the header row
)
-- Load data from File3.csv into the temporary table
BULK INSERT #MergedData
FROM 'C:\Path\To\File3.csv'
WITH (
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n',
FIRSTROW = 2 -- Skip the header row
)
-- Insert the merged data into the final output table
INSERT INTO OutputTable
SELECT *
FROM #MergedData
-- Clean up the temporary table
DROP TABLE #MergedData
Conclusion:
In this article, we have discussed the steps to merge multiple files into a single file using SQL Server. By leveraging the BULK INSERT statement and temporary tables, we can efficiently consolidate data from different sources into a unified output. It is important to ensure that all files have a similar structure for the merge process to work properly.