Published on

July 6, 2017

Exploring Purchasing Behaviors in SQL Server

The integration of RevolutionR with SQL Server 2016 promises a wider range of analytical methods and greater flexibility for exploratory, predictive, and visual data analysis. In this post, we will use RevolutionR to analyze purchasing behaviors in the AdventureWorksDW2012 InternetSales schema.

The Data

Our main goal in this analysis is to identify purchasing patterns in the InternetSales schema of AdventureWorksDW2012. We create a view that draws information from the factInternetSales fact table and relevant dimensions (dimProduct, dimProductSubCategory, dimSalesTerritory):

use AdventureWorksDW2012;
go
create view [vw_RSalesAnalysis]
as
    select
        sales.ProductKey,
        p.ProductSubcategoryKey,
        ps.EnglishProductSubcategoryName,
        p.EnglishProductName,
        sales.SalesTerritoryKey,
        t.SalesTerritoryRegion,
        sales.CustomerKey
from dbo.FactInternetSales as sales
    inner join dbo.DimProduct as p on p.ProductKey = sales.ProductKey
    inner join dbo.DimProductSubcategory as ps on ps.ProductSubcategoryKey = p.ProductSubcategoryKey
    inner join dbo.DimSalesTerritory as t on t.SalesTerritoryKey = sales.SalesTerritoryKey;
go

From RStudio, we can access this view using an ODBC connection:

library(RODBC)
library(ggplot2)
library(FactoMineR)
library(reshape2)

## Open Connection: 
## Note: "RSQLAnalytics2016" is an existing ODBC Data Source
SQLconnection <- odbcConnect("RSQLAnalytics2016", uid="<username>", pwd="<password>")

## Retrieve Data
salesData <- sqlFetch(SQLconnection, "vw_RSalesAnalysis")

## Close connection
odbcClose(SQLconnection)

Above, we used the sqlFetch() function from R’s RODBC package to retrieve the data in the view. Alternatively, the sqlQuery() function allows you to retrieve the results of a query – this is particularly useful for adhoc analysis. We can view the first few rows using the head() function:

## Look at the first few rows
head(salesData)

Exploratory Visualizations

Let’s begin with some visualizations of the sales data. For example, we can look at the total number of sales across the regions as well as the total number of sales by ProductSubcategory. We will use R’s GGPLOT2 library for this.

sales_by_region <- ggplot(salesData, aes(x=SalesTerritoryRegion, y=..count..)) + 
    geom_bar(fill="steelblue") + 
    theme(axis.text = element_text(size=5, angle=0), panel.background = element_blank()) + 
    xlab("") +
    coord_flip() +
    ggtitle("Number of Sales by Territory Region")

sales_by_region

The most interesting thing about this plot is that there are very few sales for the Central, Northeast, and Southeast regions. The relative imbalance between these three regions and the rest means that we should be cautious about making any conclusions about these regions.

sales_by_cat <- ggplot(salesData, aes(x=EnglishProductSubcategoryName, y=..count..)) + 
    geom_bar(fill="steelblue") + 
    theme(axis.text = element_text(size=5, angle=0), panel.background = element_blank()) + 
    xlab("") +
    coord_flip() +
    ggtitle("Number of Sales by ProductSubCategory")

sales_by_cat

The plot above shows that some ProductSubcategories are far more popular than others, for example “Tires and Tubes” and “Bottles and Cages”. We can also see that Road Bikes and Mountain Bikes form a good portion of overall sales.

Since we are interested in the differences between regions, let’s view the relative proportions of products sold by region:

product_sales_by_region <- ggplot(salesData, aes(x=ProductKey, y=..density..)) + 
    geom_density(fill="steelblue", alpha=0.8) + 
    facet_wrap(~ SalesTerritoryRegion) +
    theme_bw()

product_sales_by_region

This plot hints at some interesting differences between the regions, but it also highlights the limitations of basic summaries to shed light on underlying patterns and trends. For example, we can see that there are two very popular products in Canada. Also obvious is the different profiles for the Northeast and Southeast – though as mentioned we won’t focus on these regions. Beyond this however, this plot fails to obviously distinguish the other regions, for example, it is very difficult to distinguish Australia, France, the UK, and Germany. Given that we cannot easily see differences in these regions, the question is whether there are any differences in purchasing behaviors?

Below, we will explore this using Principal Components Analysis (PCA). When plotted, the principal components should give us a reasonable indication of which regions are similar and those that are different. First, we will summarize the number of sales by product and by region:

by_region <- with(salesData, table(SalesTerritoryRegion, ProductKey))

This creates a table with 10 rows (one for each region) and 158 columns (one for each ProductKey). To visualize this, we will map the data onto two new dimensions using PCA. This results in a table that has 10 rows (the regions) and 2 columns which capture the purchasing behaviors in each region:

pca_regions <- prcomp(by_region, scale=TRUE)
pca_data <- data.frame(pca_regions$x)
pca_data$Region <- rownames(pca_data)

Although we haven’t shown this here, the first two principal components (which we will use for visualization) capture 89% of the variation in this data. This is a good result and suggests that the purchasing behaviors have been well captured by these new dimensions (visualized below):

From the above plot, we observe that Australia is quite unlike the rest of the regions. The European regions (UK, Germany, and France) are all very similar. Canada, the Northwest, and Southwest are all distinct. This plot hints at some interesting differences between the regions, which we dig into in more detail in the following section.

Purchasing Behaviors

We will use correspondence analysis (CA) to identify which products are most popular by region. CA is also known as ‘Market Basket Analysis’. First of all, let’s see if there is an association between ProductSubcategories and the various regions. We begin by creating another contingency table, a count of the number of sales by ProductSubcategory in each region:

by_region <- with(salesData, table(SalesTerritoryRegion, EnglishProductSubcategoryName))

Then we will use the CA() function in R’s FactoMineR package:

results <- CA(by_region, graph=FALSE)
plot.CA(results, col.row="red", col.col="grey", cex=0.6, title="Region : Subcategory")

From this plot, we can see that Dimension 1 and Dimension 2 explain 75% and 17% of the association in the dataset, respectively. Dimension 1 seems to be a contrast between purchasing preferences in Australia versus Canada. More specifically, we see:

  • Sales of bikes (road, mountain, and touring) contribute a greater proportion of the Australian sales than they do in other regions.
  • There is a clear association between the UK and “Tires and Tubes”, Helmets and Caps. Caps are also popular in Germany and France.
  • The sales of clothing (Socks, Gloves, Vests, and Jerseys) are particularly strong on the West Coast.

More generally, we might conclude that sales are predominantly centered on the actual bikes in Australia, and as we move from left to right (from Australia to Canada in the plot above), we observe an increase in the sales of clothing and accessories.

Let’s drill down further and look at the sales patterns by product. Again, we create a contingency table of the number of sales per product by region and then perform correspondence analysis:

by_region <- with(salesData, table(SalesTerritoryRegion, ProductKey))
results <- CA(by_region, graph=FALSE)
plot.CA(results, col.row="red", col.col="grey", cex=0.6, title="Region : Product")

This plot is a hot mess and very tough to interpret. But in general, we can again see the contrast between Canada and Australia across the first dimension, and a contrast between Europe and Australia in the second dimension. Beyond that, it is a bit tough. The distance between the Region and each product indicates how strongly that product is associated with that region. So we can see a number of products up in the top right and a bunch of products in the center which show weak associations.

We can improve this plot by removing weak associations and only focusing on the strong ones:

## get strength of associations, include the ProductKey in this
strength <- data.frame(results$col$contrib[, 1:2])
strength$ProductKey <- rownames(product_contributions

## find the strongest associations (values > 1.5)
## loosely corresponds to associations outside the 95% confidence interval
major_products <- strength[strength$value > 1.5, "ProductKey"]

## from the sales data, identify which rows belong to the 'strong' products
## and retrieve only these rows
idx_major <- which(salesData$ProductKey %in% major_products)
majorSalesData <- salesData[idx_major, ]

## repeat the Correspondence Analysis 
## (create contingency table and perform analysis)
by_region_filter <- with(majorSalesData, table(SalesTerritoryRegion, ProductKey))
results_filter <- CA(by_region_trimmed, graph=FALSE)

## now create a new data frame for visualization
## get the coordinates for both the products and regions
products <- data.frame(results_filter$col$coord)
regions <- data.frame(results_filter$row$coord)

## get the product names for visualization
product_names <- unique(salesData[, c("ProductKey", "EnglishProductName")])
get_product_names <- function (key) product_names$EnglishProductName[which(product_names$ProductKey==key)]

## Finally, visualize
ggplot(products, aes(x=Dim.1, y=Dim.2)) + 
    geom_text(data=products, aes(x=Dim.1, y=Dim.2), 
              label=sapply(rownames(products), get_product_names), 
              colour="darkblue", 
              size=2, 
              alpha=0.7) +
    geom_point(data=regions, aes(x=Dim.1, y=Dim.2), colour="red", shape=4) +
    geom_text(data=regions, aes(x=Dim.1, y=Dim.2), label=rownames(regions), colour="red", size=4, vjust=-1,alpha=0.4) +
    theme_bw() +
    ggtitle("Region : Product")

Finally, from this plot we can conclude:

  • The sales of women’s mountain shorts are particularly high in Canada and the West coast of the USA. In addition, the sales of “HL Mountain Tires”, “HL Road Tires”, and “Mountain Tire Tubes” are particularly high in the same regions.
  • The sale of “Road Tire Tube”, “LL Road Tires”, “Touring Tyres”, “Touring Tyre Tubes”, “AWC Logo Caps”, and “Long-sleeve Logo Jersey (M)” is particularly strong in Europe. This confirms our intuition that road racing is likely to be popular around France.
  • There are a number of mountain and road bikes that sell well in Australia. The “Short Sleeve Classic Jersey” also sells well in Australia.

Conclusions

We have managed to build an interesting picture of the cycling communities and their purchasing behaviors in Canada/USA compared to Europe and Australia. Our analysis suggests that mountain biking is particularly popular with women in Canada and the US, and that products relating to this sell well. This is clearly different from the UK, Germany, and France where road racing appears to be the most popular. In Australia, road biking and mountain biking are popular, with the sale of bikes forming a major part of their total sales. Information like this is invaluable for managing stock and for marketing departments. There is clearly no need to heavily invest in stock and marketing of mountain bikes in Europe or road bikes in the US.

So why use R when we could have performed a Market Basket Analysis using Analysis Services? For me, more than anything, it is the flexibility, convenience, and power of R. Based on Microsoft’s very clear investment in R (integration with Azure Machine Learning and SQL Server 2016), it looks to become a core part of Microsoft’s business intelligence and analytics stack.

Click to rate this post!
[Total: 0 Average: 0]

Let's work together

Send us a message or book free introductory meeting with us using button below.