In a previous article, we discussed using Fuzzy Grouping to find inexact duplicates within the same dataset. But what if you have two different datasets that you need to compare and find common records? This is where the Fuzzy Lookup Transformation in SQL Server Integration Services (SSIS) comes in handy.
Let’s say you have a dataset of 1244 patients from an outside study database and you want to check if they exist in your 1.7 million patient registration database. Simply looking for exact matches won’t be sufficient, as there might be misspellings or typos in the data. This is where the Fuzzy Lookup Transformation can help.
To get started, you’ll need to follow these steps:
- Install SQL Server service pack 2 or the memory leak fix to resolve any memory issues when working with large datasets.
- Create an OLE DB Source dataset containing the records you want to match. Make sure to limit the dataset to the fields that will be used for lookups and pass through to improve performance.
- Add the Fuzzy Lookup Transformation to your SSIS project and configure it by specifying the reference table, columns to lookup, and return.
- Define the mapping type, comparison flags, and minimum similarity threshold for each field in the Create Relationships editor.
- Configure any additional settings in the Advanced tab, such as the maximum number of matches to output per lookup and the similarity threshold for the entire lookup.
- Add a destination for the output and link it to the Fuzzy Lookup Transformation.
- Run the package and review the output, which will include similarity and confidence scores for each field.
The Fuzzy Lookup Transformation allows you to find and standardize lookup values, but with some customization, it can also be used for probabilistic linkage of different datasets. The quality of the matching is dependent on the fields you have and the quality of the data.
For example, let’s say you have two dummy tables with fake data: an input table and a reference table. By running the project with a maximum output matches per input value of 2, you can see that the Fuzzy Lookup Transformation returns two possible matches for a record named Mary Smith. The similarity and confidence scores can help determine which match to use.
Overall, the Fuzzy Lookup Transformation in SSIS is a powerful tool for comparing and finding common records in different datasets. It allows for flexibility in matching records with inexact duplicates and can be customized to meet specific requirements. By understanding its capabilities and properly configuring the transformation, you can improve data quality and accuracy in your SQL Server environment.