When analyzing data, it is important to understand the data you are working with, which typically means running statistical functions and measures against it. However, data in a SQL Server database may contain unknown or missing values, outliers, and other undesired values. These issues can significantly distort your analysis and lead to false interpretations.
One way to address these issues is through data cleaning, which involves detecting and recoding these values. SQL Server provides several built-in statistical functions that can be used to explore the data. These include mathematical functions (ABS, COS, SIN, EXP, LOG, PI, SQRT, etc.), aggregate functions (MIN, MAX, AVG, SUM, COUNT, VAR, VARP, etc.), ranking functions (RANK, NTILE, DENSE_RANK, etc.), and string functions (SOUNDEX, DIFFERENCE, STUFF, REVERSE, etc.). These functions are simple to use and can be combined to construct more complex statistical measures.
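For instance, a quick profile of a numeric column can combine several of these aggregates in a single query (the table and column names below are only illustrative):

```sql
-- Hypothetical example: basic profile of a numeric column
SELECT
    COUNT(*)      AS NumberOfRows,
    COUNT(Amount) AS NonNullValues,  -- COUNT(column) skips NULLs
    MIN(Amount)   AS MinValue,
    MAX(Amount)   AS MaxValue,
    AVG(Amount)   AS Mean,
    VAR(Amount)   AS SampleVariance,
    STDEV(Amount) AS SampleStdDev
FROM dbo.SalesData;
```

Comparing NumberOfRows with NonNullValues is a quick way to spot missing values before going any further.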
In addition to the built-in functions, SQL Server also provides algorithms for data profiling within SSIS and other approaches for understanding data, such as fuzzy matching, deduplication, and outlier detection.
However, analyzing data using SQL Server alone can be time-consuming, especially when dealing with long T-SQL code and running statistical tests. To simplify the process and gain more advanced statistical capabilities, you can use the R programming language.
Introducing R
R is a language and software environment for statistical computing and graphics. It is an open-source program that is free to use and can be downloaded from the official R website. R is easily adaptable and has a wide range of libraries for performing various statistical analyses. It also has libraries for connecting to database engines, Excel, and other programs.
Using R for data exploration has several advantages. First, it is a powerful program that is gaining popularity among statisticians. Second, it is free to use, which can be a significant cost-saving factor compared to other statistical programs. Third, R has a growing community and many online resources available for learning.
While R is not a database engine like SQL Server, it can be used in conjunction with SQL Server to perform statistical analyses on your data. Traditionally, statisticians would export data from SQL Server to third-party programs like SAS, SPSS, or Excel to calculate statistics. This process was time-consuming and inefficient. To address this issue, a procedure called sp_getStatistics has been developed. This procedure allows you to feed data from your SQL Server table, perform statistical calculations in R, and return the results back to SQL Server.
Installing R and Preparing Your Environment
Installing R is a safe and straightforward process. You can download the executable R installation file from the official R website. Once installed, you will have the R engine on your computer.
To use the sp_getStatistics procedure, you will also need to run two SQL files that create the necessary SQL Server configuration, data sample, and the procedure itself. These files can be found in the resources section below.
Configuring SQL Server
Before using the sp_getStatistics procedure, you need to ensure that you have the necessary components on your client machine. These include an instance of SQL Server, the R program for statistical computing, and read/write permission on the working folder for the R files.
The procedure also requires enabling xp_cmdshell and Ole Automation Procedures on your SQL Server instance. Both settings can be enabled with the sp_configure command.
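Enabling both options follows the usual sp_configure pattern. Since they are advanced options, 'show advanced options' must be turned on first:

```sql
-- Enable xp_cmdshell and Ole Automation Procedures
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'xp_cmdshell', 1;
RECONFIGURE;
EXEC sp_configure 'Ole Automation Procedures', 1;
RECONFIGURE;
```

Keep in mind that xp_cmdshell widens the attack surface of the instance, so enable it only where your security policy allows it.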
Data Preparation
Before running the sp_getStatistics procedure, you need to query your data into a new table or create a view against an existing fact table. For example, you can create a new table using the AdventureWorks2012 database and run different statistics to explore the data.
Once the data is prepared, you can use a simple SELECT statement to verify that you have the correct data in your table.
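As a sketch, the preparation step could look like the following (the source table and columns from AdventureWorks2012 are only illustrative; any fact table or view with numeric columns will do):

```sql
-- Hypothetical example: stage a subset of AdventureWorks2012
-- data into a new table for analysis
SELECT
    SalesOrderID,
    SubTotal,
    TaxAmt,
    TotalDue
INTO dbo.DataForAnalysis
FROM Sales.SalesOrderHeader;

-- Verify that the staged data looks correct
SELECT TOP (10) * FROM dbo.DataForAnalysis;
```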
Procedure Parameters
The sp_getStatistics procedure accepts several input parameters and returns statistics on the desired columns/variables. The input parameters include:
- @TargetTable: The name of the table holding the data for analysis.
- @Variables: The column(s) or variable(s) from the table to be analyzed.
- @Statistics: The ID of the statistics to be calculated. These statistics are defined in the dbo.tableStatistics table.
- @ServerName: The name of the SQL Server running on the client machine.
- @DatabaseName: The name of the SQL Server database on the server.
- @WorkingDirectory: The name of the folder on the client machine where SQL Server will store the R scripts and where the R program will store its exports.
- @RPath: The path on the client machine to the R.exe file, which is the executable file for starting R.
The procedure includes validation checks for the input parameters to ensure that they are properly defined and valid.
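Putting the parameters together, a call might look like this (all values are illustrative; the available statistic IDs come from the dbo.tableStatistics table created by the setup scripts):

```sql
-- Hypothetical call: compute statistics 1 and 2 on two columns
EXECUTE dbo.sp_getStatistics
     @TargetTable      = 'dbo.DataForAnalysis'
    ,@Variables        = 'SubTotal,TotalDue'
    ,@Statistics       = '1,2'
    ,@ServerName       = 'MYSERVER\SQL2012'
    ,@DatabaseName     = 'AdventureWorks2012'
    ,@WorkingDirectory = 'C:\DataTK'
    ,@RPath            = 'C:\Program Files\R\R-3.0.2\bin';
```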
Generating R Script
The sp_getStatistics procedure generates an R script that performs the desired statistical calculations. The R script includes logic for checking the existence of libraries, opening an ODBC connection to the SQL Server, executing the SQL query to pull data into the R environment, and executing the specified statistics based on the input parameters.
The R script is then stored in an R file and executed using the R program. The results are stored in a text file and printed back in SQL Server Management Studio for easy access and analysis.
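In outline, the generated script resembles the following R sketch. The connection string, query, and file paths here are illustrative only; the real script is assembled by the procedure from its input parameters:

```r
# Sketch of the kind of script sp_getStatistics generates
# (server, database, query, and paths are illustrative)

# Check that the required library exists, then load it
if (!require(RODBC)) install.packages("RODBC")
library(RODBC)

# Open an ODBC connection to the SQL Server instance
ch <- odbcDriverConnect(
  "Driver=SQL Server;Server=MYSERVER;Database=AdventureWorks2012;Trusted_Connection=Yes")

# Execute the SQL query to pull the prepared data into R
df <- sqlQuery(ch, "SELECT SubTotal, TotalDue FROM dbo.DataForAnalysis")
close(ch)

# Run the requested statistics and write the results to a text file
sink("C:/DataTK/results.txt")
summary(df)
sd(df$TotalDue)
sink()
```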
Conclusion
The sp_getStatistics procedure provides a simple and effective way to perform statistical analyses on your SQL Server data using the R programming language. By combining the power of SQL Server and R, you can gain deeper insights into your data and make more informed decisions. Whether you are a statistician or a data analyst, this procedure can help you explore and understand your data without leaving the comfort of SQL Server Management Studio.
Author: Tomaz Kastrun
Twitter: @tomaz_tsql
Blog: http://tsqljokes.tumblr.com/