With the increasing prevalence of “big data” in everyday scenarios, it has become necessary to find ways to import data faster and more efficiently. While there are cloud-based solutions available for importing large amounts of data, many organizations have policies that prevent them from using these solutions. However, if your company already has an existing SQL Server Integration Services (SSIS) implementation, you can leverage it to improve the speed of your data import process.
The Problem
When importing thousands or even millions of files that share the same structure, the traditional approach of processing each file sequentially can be very slow. Even if each file takes only a second to import, the total adds up quickly: 100,000 files at one second each is roughly 28 hours.
The Solution
To improve the speed of the import process, you can run multiple instances of your SSIS package in parallel, with each instance processing a portion of the total number of files. Here are the steps to implement this solution:
Step 1: Create an SSIS Solution
Create a new SSIS solution and add the following project-level parameters:
- SourceDirectory: The path to the directory containing the source files.
- PackageName: The full path of the package that will import the files.
- NoOfPackageInstances: The number of parallel instances of the package to run.
- FileMask: The filename filter in case you only want to process certain files within the folder.
Step 2: Create a Script Task
Create a new package and add a “Script Task” component to it.
Step 3: Configure the Script Task
Open the Script Task editor and add the project-level parameters to the ReadOnlyVariables list so they are available inside the script. Then click “Edit Script” to open the script editor.
Step 4: Write the Script
Declare variables and assign them the values from the project parameters. Also declare variables to keep track of the filenames to be imported and the groups they’ll be split into. The script loops through each file and assigns it to a group in round-robin fashion based on the number of package instances; with three instances, for example, the first file goes to one group, the second to the next, and so on. Finally, it loops through the file groups and launches an instance of the package for each one. The complete code is shown in the “Full Script Task Code” section below.
Step 5: Create a New Package
Create a new package and add the following variables:
- FileList: The list of files to process. The name must match the one used in the script from Step 4, the data type must be Object (it will hold a list of file paths), and the variable must not be read-only.
- FilePath: A String variable that holds the full path of the individual file currently being processed.
Step 6: Add a Foreach Loop Container
In the new package, add a “Foreach Loop Container” that contains a “Data Flow Task” inside it.
Step 7: Configure the Foreach Loop Container
In the Foreach Loop Container, select “Foreach From Variable Enumerator” and choose the “User::FileList” variable. Then, under “Variable Mappings”, map “User::FilePath” to index 0.
Step 8: Update the Data Flow
Update the Data Flow inside the Foreach Loop Container to include a source, destination, and any required transformations.
Step 9: Configure the Source Connection Manager
In the source connection manager, add a property expression that sets the ConnectionString property to @[User::FilePath]. You may also need to set DelayValidation to True on the connection manager so the package validates successfully before FilePath has been given a value.
Step 10: Configure the Destination Connection Manager
In the destination connection manager, make sure that “Table lock” is not selected. Otherwise, each package instance won’t be able to write to the table concurrently.
Full Script Task Code
Here is the full code for the Script Task:
// Please excuse my code if it isn't the neatest or most efficient way of doing things
using System;
using System.IO;
using System.Linq;
using System.Threading;
using System.Collections;
using System.Collections.Generic;
using Microsoft.SqlServer.Dts.Runtime;

namespace ST_e1209e3a____
{
    // ScriptMain is the entry point class of the script. Do not change the name, attributes, or parent of this class.
    [Microsoft.SqlServer.Dts.Tasks.ScriptTask.SSISScriptTaskEntryPointAttribute]
    public partial class ScriptMain : Microsoft.SqlServer.Dts.Tasks.ScriptTask.VSTARTScriptObjectModelBase
    {
        public void Main()
        {
            try
            {
                // Retrieve project-level parameters
                string sourceDirectory = Dts.Variables["$Project::SourceDirectory"].Value.ToString();
                string fileMask = Dts.Variables["$Project::FileMask"].Value.ToString();
                Int16 noOfPackageInstances = Convert.ToInt16(Dts.Variables["$Project::NoOfPackageInstances"].Value);
                string packageName = Dts.Variables["$Project::PackageName"].Value.ToString();

                Int32 fileCount = 0;
                Int32 fileGroup;
                List<string>[] allFiles = new List<string>[noOfPackageInstances];

                // Loop through each file and assign it to a group (round-robin) based on the number of package instances
                foreach (string file in Directory.GetFiles(sourceDirectory, fileMask))
                {
                    fileCount += 1;
                    fileGroup = fileCount % noOfPackageInstances;
                    if (allFiles[fileGroup] == null)
                    {
                        allFiles[fileGroup] = new List<string>();
                    }
                    allFiles[fileGroup].Add(file);
                }

                // Execute each instance of the package in a separate thread
                var Threads = new List<Thread>();
                for (Int16 i = 0; i < noOfPackageInstances; i++)
                {
                    Int16 j = i; // copy the loop variable so each closure captures its own value
                    if (allFiles[j] == null)
                    {
                        continue; // no files were assigned to this group (fewer files than instances)
                    }
                    Thread t = new Thread(() => ExecutePackage(allFiles[j], packageName))
                    {
                        IsBackground = false
                    };
                    t.Start();
                    Threads.Add(t);
                }

                // Wait for every thread to finish before reporting success
                foreach (Thread t in Threads)
                {
                    t.Join();
                }

                Dts.TaskResult = (int)ScriptResults.Success;
            }
            catch (Exception ex)
            {
                // Surface the error to the package so the failure is visible in the execution results
                Dts.Events.FireError(0, "Script Task", ex.Message, String.Empty, 0);
                Dts.TaskResult = (int)ScriptResults.Failure;
            }
        }

        // Load the child package and run it against one group of files
        public void ExecutePackage(object fileList, string dtsxPackage)
        {
            Application app = new Application();
            Package pkg = app.LoadPackage(dtsxPackage, null);
            Variables vars = pkg.Variables;
            vars["User::FileList"].Value = fileList;
            pkg.Execute(null, vars, null, null, null);
        }

        // Enum for script results
        enum ScriptResults
        {
            Success = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Success,
            Failure = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Failure
        };
    }
}
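One limitation of the script above is that Main reports success even if a child package fails, because the result of pkg.Execute is never checked. Here is a minimal sketch of one way to surface those failures; the failedGroups field is a hypothetical addition for illustration, not part of the original package:

        // Hypothetical counter, incremented by the worker threads when a child package fails
        private int failedGroups = 0;

        public void ExecutePackage(object fileList, string dtsxPackage)
        {
            Application app = new Application();
            Package pkg = app.LoadPackage(dtsxPackage, null);
            pkg.Variables["User::FileList"].Value = fileList;

            // Execute returns a DTSExecResult that tells us whether the child package succeeded
            if (pkg.Execute() != DTSExecResult.Success)
            {
                Interlocked.Increment(ref failedGroups);
            }
        }

        // ...and in Main(), after all the threads have been joined:
        // if (failedGroups > 0) { Dts.TaskResult = (int)ScriptResults.Failure; return; }

Because each worker only touches the shared counter through Interlocked, and the main thread only reads it after every thread has been joined, no extra locking is needed.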
By implementing this solution, you can significantly improve the speed of your data import process in SQL Server, even without using cloud-based solutions. Running multiple instances of your SSIS package in parallel allows for faster and more efficient data import, saving valuable time and resources.
Remember to test and optimize your solution based on your specific requirements and environment.
Thank you for reading!