SQL Server 2008 introduced a new feature in SSIS called Data Profiler. This feature allows us to gain insights into the data stored in our tables, including the number of distinct values in each column and the distribution of data values. Data Profiler is particularly useful in data warehousing, both during the ETL process and when building cubes in SSAS.
When developing an ETL package, it is often necessary to understand the data values in a column. For example, we may need to determine if a column is mostly empty or contains valid data. Data Profiler can help us make informed decisions about whether to include or exclude certain values or patterns based on data quality requirements. It can also assist in identifying potential hierarchies in dimension tables.
Data Profiler is not only useful for ETL development but also for performance tuning. For example, when partitioning a fact table or a cube, understanding the data distribution can help us create similar-sized groups based on a specific column.
In the past, I used Informatica Data Explorer (IDE) to analyze the distribution of data values in data warehouse fact and dimension tables for building SSAS cubes. I often wondered when we would have a similar feature in SQL Server Management Studio, where we could simply right-click on any table and select “Analyze” to get the data profile. With the introduction of SSIS 2008 Data Profiler, we now have a starting point for this functionality.
Let’s take a closer look at the output of SSIS 2008 Data Profiler. In the left pane, we can see the tables that we have analyzed, along with the profiles that we have executed for each table. The right-hand pane displays the details of the selected profile. For example, the “Column Value Distribution” profile shows a list of columns and the number of distinct values in each column. Clicking on a column in this panel will display the distribution of data values, including the value itself, the number of rows containing that value, and the percentage of total rows.
In addition to analyzing the distribution of data values, SSIS 2008 Data Profiler can also analyze:
- The number of nulls in a column
- The pattern of data values
- The minimum and maximum values in each column
- The mean (average) and standard deviation of data values in each column
- Which column(s) are good candidates for a primary key
- Whether a column is fully or partially dependent on other column(s)
- Whether all values in a column exist in another table
To set up a Data Profiling Task in SSIS, open BIDS and create a new SSIS project. Drag the Data Profiling Task onto the design surface and double-click to edit it. Specify the destination file name and click on “Profile Requests” in the left pane. Select “Column Value Distribution” as the profile type and provide the connection details. Then, select the table and column(s) for analysis and click OK.
Once the SSIS package is run, the output file will be created. To view the output file, we need to use the Data Profile Viewer utility. Navigate to the appropriate folder (e.g., C:\Program Files\Microsoft SQL Server\100\DTS\Binn) and execute DataProfileViewer.exe. Open the output file generated by the SSIS Data Profiling Task, and you will see the data profile similar to the one shown in Figure 1.
In addition to the “Column Value Distribution” profile, we can choose other profiles such as “Column Pattern Profiles,” “Column Statistics Profiles,” “Functional Dependency Profiles,” “Candidate Key Profiles,” “Column Null Ratio Profiles,” and “Inclusion Profiles.” These profiles provide the seven analysis items mentioned earlier.
While the Data Profiling facility in SQL Server 2008 is quite basic and lacks user-friendliness, it is a good starting point. It is essential for an ETL tool to have a data profiling feature, whether it comes at an additional cost or not. We can expect the SSIS data profiling feature to improve significantly in future releases of SQL Server. Hopefully, the Data Profile Viewer utility will be integrated into BI Development Studio, and we will be able to right-click on any table in Management Studio and choose “Analyze” to get the data profile.
Stay tuned for more updates on SQL Server features and best practices!
Author: Vincent Rainardi
Date: 7th July 2008