Introduction:
When working with data pipelines in Azure Data Factory (ADF), there are often situations where we need to combine records from multiple source files and consolidate them into a single output file. In this article, we will explore the steps to merge multiple files using SQL Server.
Solution:
For this demonstration, we will be using three files stored in a data lake storage folder: File1.csv, File2.csv, and File3.csv. Each file has a similar structure, with columns such as SALESORDERNUMBER, ISEXPORTSALE, CREATEDDATE, ISSELECTED, SALESORDERSTATUS, CURRENCYCODE, DELIVERYADDRESSCITY, DELIVERYADDRESSCOUNTRYREGIONID, and DELIVERYADDRESSSTATEID.
Let’s take a look at the content of these three source files:
File1.csv:
SALESORDERNUMBER,ISEXPORTSALE,CREATEDDATE,ISSELECTED,SALESORDERSTATUS,CURRENCYCODE,DELIVERYADDRESSCITY,DELIVERYADDRESSCOUNTRYREGIONID,DELIVERYADDRESSSTATEID 1,true,2021-01-01,false,Completed,USD,New York,US,NY 2,false,2021-01-02,true,In Progress,EUR,Paris,FR,75 3,true,2021-01-03,false,Completed,GBP,London,GB,ENG 4,false,2021-01-04,true,In Progress,USD,Los Angeles,US,CA 5,true,2021-01-05,false,Completed,EUR,Berlin,DE,BE
File2.csv:
SALESORDERNUMBER,ISEXPORTSALE,CREATEDDATE,ISSELECTED,SALESORDERSTATUS,CURRENCYCODE,DELIVERYADDRESSCITY,DELIVERYADDRESSCOUNTRYREGIONID,DELIVERYADDRESSSTATEID 20000,true,2021-01-01,false,Completed,USD,New York,US,NY 20001,false,2021-01-02,true,In Progress,EUR,Paris,FR,75 20002,true,2021-01-03,false,Completed,GBP,London,GB,ENG 20003,false,2021-01-04,true,In Progress,USD,Los Angeles,US,CA 20004,true,2021-01-05,false,Completed,EUR,Berlin,DE,BE
File3.csv:
SALESORDERNUMBER,ISEXPORTSALE,CREATEDDATE,ISSELECTED,SALESORDERSTATUS,CURRENCYCODE,DELIVERYADDRESSCITY,DELIVERYADDRESSCOUNTRYREGIONID,DELIVERYADDRESSSTATEID 1000,true,2021-01-01,false,Completed,USD,New York,US,NY 1001,false,2021-01-02,true,In Progress,EUR,Paris,FR,75 1002,true,2021-01-03,false,Completed,GBP,London,GB,ENG 1003,false,2021-01-04,true,In Progress,USD,Los Angeles,US,CA 1004,true,2021-01-05,false,Completed,EUR,Berlin,DE,BE
SQL Server provides various methods to merge these files into a single output file. One approach is to use the BULK INSERT statement to load the data from each file into a temporary table, and then insert the records from the temporary table into the final output table.
Here is an example of how this can be achieved:
-- Create a temporary table to hold the merged data CREATE TABLE #MergedData ( SALESORDERNUMBER INT, ISEXPORTSALE BIT, CREATEDDATE DATE, ISSELECTED BIT, SALESORDERSTATUS VARCHAR(50), CURRENCYCODE VARCHAR(3), DELIVERYADDRESSCITY VARCHAR(50), DELIVERYADDRESSCOUNTRYREGIONID VARCHAR(2), DELIVERYADDRESSSTATEID VARCHAR(2) ) -- Load data from File1.csv into the temporary table BULK INSERT #MergedData FROM 'C:\Path\To\File1.csv' WITH ( FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2 -- Skip the header row ) -- Load data from File2.csv into the temporary table BULK INSERT #MergedData FROM 'C:\Path\To\File2.csv' WITH ( FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2 -- Skip the header row ) -- Load data from File3.csv into the temporary table BULK INSERT #MergedData FROM 'C:\Path\To\File3.csv' WITH ( FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2 -- Skip the header row ) -- Insert the merged data into the final output table INSERT INTO OutputTable SELECT * FROM #MergedData -- Clean up the temporary table DROP TABLE #MergedData
Conclusion:
In this article, we have discussed the steps to merge multiple files into a single file using SQL Server. By leveraging the BULK INSERT statement and temporary tables, we can efficiently consolidate data from different sources into a unified output. It is important to ensure that all files have a similar structure for the merge process to work properly.