Have you ever needed to gather information from a free-form source like a web page or a word processing document and load it into a structured format like a SQL Server staging table? Many of us have encountered this challenge, whether it’s extracting data from a public database or gathering information from the internet for reporting purposes. In this article, we will discuss a handy process consisting of three steps to accomplish this task.
Step 1: The Data Source
The first step is to identify the data source, which can be any free-form document such as a web page or a word processing document. It is important that the structure of the document remains consistent, with only the data field values changing. For example, let’s consider a scenario where we need to extract GPS data from web-based truck fleet reports into a SQL Server staging table. The reports are generated in HTML and have consistent formatting, margins, font sizes, and field placement. The only variable is the data field values that we require.
Step 2: The XML Virtual Printer
Next, we need to convert the data into a format that is readable by SQL Server Integration Services (SSIS). One approach is to print the data to an XML virtual printer, which creates a physical file on your drive. There are various tools available for this purpose, such as the XML Printer software created by Archae. This software allows you to parse the print job and extract only the necessary data fields, saving you from processing unnecessary data. You can define the data fields in a definition file using the XML Printer Composer application that comes with the software.
Step 3: The SSIS Package
Now comes the fun part – creating an SSIS package to transform and load the data into your SQL Server staging table. Here’s an overview of the process:
- Foreach Loop Container: Use the Foreach File Enumerator to iterate through the XML print job files. Create a user variable to store the full path to each parsed XML print job.
- XML Control Flow Task: Transform the parsed XML file using Extensible Stylesheet Language Transformations (XSLT) to convert each data element into attributes of the
<data>
element and add a<table>
parent element. Store the transformed XML in a user variable to be read by the XML Data Flow Source. - XML Data Flow Task: Point the task to the user variable containing the transformed XML data. Generate an XML Schema Definition (XSD) file to define the specific format that the XML Data Flow Task should expect for each print job.
- SQL Server Destination: Load the transformed data into your SQL Server staging table using the SQL Server Destination component. Perform any necessary data conversions before the task executes.
Depending on your requirements, you may also want to consider implementing auditing functionality in your SSIS package or setting up an OnError Event handler to receive email notifications in case of errors.
Conclusion
Extracting, transforming, and loading data fields from a free-form source into SQL Server can be a challenging task. However, by following the three steps outlined in this article, you can streamline the process and ensure the successful transfer of data. Remember to adapt these steps to your unique situation, and you will be well on your way to achieving your data integration goals.