Published on

July 27, 2016

Exploring SQL Server Data with Simple Techniques

In the world of business intelligence and predictive analytics, there is often a focus on predicting the future and making decisions based on these predictions. However, prediction alone is not enough to truly understand the factors driving business trends. In this blog post, we will explore how simple exploratory techniques can help bridge the gap between traditional business intelligence and predictive analytics.

To demonstrate these techniques, we will be working with a time series dataset from AdventureWorksDW2012, a popular sample database for SQL Server. This dataset contains monthly sales data for four different products across three regions from 2005 to 2008.

Let’s start by visualizing the trend of product sales over time. Using R and the ggplot2 library, we can create a line chart that shows the sales for each product in each region over the years. This chart allows us to easily identify sales growth, declines, and overall trends for each product.

library(ggplot2)
ggplot(vTimeSeries, aes(x = factor(TimeIndex), y = Amount / 1000, group = ModelRegion)) +
  geom_point(colour = "darkblue") +
  geom_line(colour = "darkblue", alpha = 0.5) +
  facet_wrap(~ ModelRegion, ncol = 3) +
  theme_minimal() +
  theme(panel.grid = element_blank(), axis.text.x = element_blank()) +
  ylab("$ (x 1000)") +
  xlab("Time") +
  ggtitle("Product Sales by Region") +
  guides(group = "none", colour = "none")

From the chart, we can observe interesting trends such as sales growth for the M200 product line in Europe and North America, a decline in sales for the R250 product in North America, and steady sales increase for the R750 product in all three regions. These insights can help us understand which products are performing well and which are underperforming.

Next, we will use hierarchical clustering to identify groups of similar and dissimilar products. Hierarchical clustering builds a tree structure that groups similar products together. By using R’s heatmap function, we can visualize the clustering results and observe patterns in the data.

heatmap(byRegion, distfun = function (x) as.dist((1 - cor(t(x)))), Colv = NA)

The resulting heatmap shows the sales patterns for each product in each region. We can clearly see the drop in sales for the R250 product line, as well as the gradual increase in sales for the other products. Based on the dendrogram on the left, we can identify four groups of products: M200 with strong sales improvement, R250 with declining sales, R750 with steady sales increase, and T1000 with strong sales growth since its introduction.

While clustering and heatmaps are useful for analyzing larger and more complex datasets, they may not be the best way to communicate these trends to business stakeholders. Instead, we can use principal components analysis (PCA) to simplify the data and create a more visually appealing plot.

pca_regions <- prcomp(byRegion, scale = FALSE, center = FALSE)
plot_data <- data.frame(pca_regions$x)
plot_data$Region <- rownames(pca_regions$x)

library(ggrepel)
ggplot(plot_data, aes(x = PC1, y = -PC2)) +
  geom_point(aes(colour = -PC2), alpha = 0.4, size = -scale(plot_data$PC2, center = max(plot_data$PC2), scale = 17500)) +
  geom_text_repel(aes(label = Region), colour = "black") +
  xlim(-850000, -50000) +
  ylim(-800000, 650000) +
  theme_minimal() +
  theme(axis.text = element_blank()) +
  xlab("Severity of change") +
  ylab("Increase in sales") +
  guides(colour = "none", size = "none") +
  ggtitle("Sales behavior")

The resulting plot shows the separation of the four product groups based on the first two principal components. The y-axis represents the amount of improvement in sales over the three-year period, while the x-axis represents the severity of change in sales. This visualization allows us to easily identify the sales trends and compare the performance of different products and regions.

In conclusion, while predictive analytics and machine learning are powerful tools for predicting future sales, they often do not provide insights into the factors driving these trends. By using simple exploratory techniques like data visualization, clustering, and principal components analysis, we can gain a deeper understanding of our data and uncover valuable insights. These techniques can be especially useful for analyzing large and complex datasets. Remember, simple and clear communication of results is key when presenting these insights to business stakeholders.

Click to rate this post!
[Total: 0 Average: 0]

Let's work together

Send us a message or book free introductory meeting with us using button below.