Problem:
Many companies are faced with the challenge of dealing with big data. With the ever-growing amount of data being generated, it is important for IT professionals to find ways to gather and process data from various sources. In this article, we will discuss a solution to this problem using Azure SQL Database and Blob Storage.
Solution:
There are two schools of thought when dealing with big data: schema on write and schema on read. Schema on write is represented by the traditional relational database, where raw data is ingested through an ETL process and stored in tables for quick retrieval. Schema on read, on the other hand, is represented by technologies such as Hadoop or PolyBase, where data is processed as simple text files and the table definition is applied during the read operation.
In this article, we will focus on how to load data from Blob Storage into Azure SQL Database using two Transact SQL commands: BULK INSERT and OPENROWSET. These commands were updated with additional functionality in February 2017, allowing us to easily load data from Blob Storage into Azure SQL Database.
Business Problem:
Let’s consider a fictitious company named Big Jon Investments, which invests in the stock market via mutual funds. They currently invest in the S&P 500 mutual fund but are thinking of investing in individual funds starting in 2017. The investments department will need historical data to make informed choices when picking stocks. The boss has asked us to pull summarized daily trading data for the 505 stocks that make up the list. Last year’s historical data should be stored in comma delimited files and will be the start of a new data mart in Azure. Our task is to copy the comma delimited files to Azure Blob Storage and load an Azure SQL database table with the combined results.
Task Description:
- Use PowerShell to sign into Azure.
- Create and load an Azure blob container.
- Create and define an Azure SQL database.
- Define an external data source.
- Investigate the two Transact SQL commands: BULK INSERT and OPENROWSET.
Sign into Azure:
Before we can start working in the PowerShell ISE, we need to log in with a valid subscription owner. This requires entering valid credentials, such as a username and password. Once logged in, we can choose the correct subscription to work with.
Create Blob Container:
In Azure, most objects are contained inside a resource group. For this project, we will create a resource group, a storage account, and a blob container. The blob container will be used to store the comma delimited files. We can use PowerShell cmdlets to accomplish these tasks.
Create Azure SQL Server:
Almost everything in Azure is contained by a resource group. For this project, we will define a logical SQL database server inside the resource group. We will also define an administrator account and password for the server. Security is a very important concern, so Microsoft blocks all connections to the database by using a firewall. We will need to add a firewall rule to allow access to the database server.
Define Azure SQL Database:
We will use SQL Server Management Studio to manage our new Azure SQL server. We need to connect to the server to start crafting Transact SQL scripts. Once connected, we can create the necessary tables and define the schemas. In this case, we will create two user-defined schemas: ACTIVE and STAGE. We will also create the necessary tables within these schemas.
Define External Data Source:
In order for the BULK INSERT and OPENROWSET Transact SQL commands to access the Azure blob storage, we need to define an external data source. This involves creating a master key, obtaining a share access signature for the storage account, defining a database credential, and finally creating the external data source using the objects we created beforehand.
Use BULK INSERT and OPENROWSET:
The BULK INSERT command can read both text and binary files from Azure Blob Storage. We can use this command to load the comma delimited files into the staging table. The OPENROWSET command, on the other hand, can read in a CSV file from Azure Blob Storage. We can use this command to load the data from the staging table into the active table. Both commands require specifying the data source and format of the files.
Summary:
In this article, we discussed how to load data from Blob Storage into Azure SQL Database using the BULK INSERT and OPENROWSET Transact SQL commands. We also covered the steps required to set up the necessary infrastructure in Azure, such as creating a blob container, defining an Azure SQL server, and creating the necessary tables and schemas. By following these steps, we can easily load data from Blob Storage into Azure SQL Database and perform further analysis on the data.