Published on March 5, 2022

Working with Microsoft Excel Files in SQL Server

Microsoft Excel is a widely used spreadsheet application that lets users store, organize, and analyze data. While SQL Server is a powerful relational database management system, certain scenarios still require working with Excel files. In this article, we will explore how to manage Microsoft Excel files in SQL Server solutions using Apache Spark on Azure Databricks.

Sample Data Sets

Before we dive into the details, let's first gather the necessary sample data sets. For this demonstration, we will use the AdventureWorks files, which are commonly used to build a sample database. These files are pipe-delimited and can be found on GitHub. We will also use a Microsoft Excel worksheet called the "Sunshine List," which names public sector workers in Ontario, Canada who earn more than $100K per year. This file can be downloaded from a specific website.

Configure Spark Cluster

Before we can read Excel files with Spark, we need to configure the cluster. The openpyxl library, which reads and writes Excel files, must be installed on the Azure Databricks cluster. This can be done by opening the running cluster and adding the library under the cluster's Libraries tab.
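As an alternative to the cluster UI, the library can be installed for a single notebook session. This sketch assumes a Databricks notebook, where the %pip magic installs the package into the notebook's environment:

```shell
# install openpyxl for the current notebook session (Databricks %pip magic)
%pip install openpyxl
```

A cluster-level install survives notebook restarts; the %pip approach is handy for quick experiments.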

Read Excel Files

Once the cluster is configured, we can start reading Excel files. Two Python libraries support this: pyspark.pandas, the pandas API on Spark that ships with PySpark, and pandas itself, the popular data manipulation library. Both provide methods to read Excel files into dataframes.

Here is an example of reading an Excel file using the pandas API on Spark (pyspark.pandas):

# libraries
import pyspark.pandas as ps

# file path
path = '/mnt/datalake/bronze/sunshine/stage/sunshine-list-2021.xlsx'

# read the file into a pandas-on-Spark dataframe
pdf1 = ps.read_excel(io=path, sheet_name=0, header=0)

# show top 10 rows
pdf1.head(10)

And here is an example of reading an Excel file using pandas together with openpyxl:

# libraries
import pandas as pd
from openpyxl import load_workbook

# path to file (the /dbfs prefix is needed because openpyxl uses the local file system API)
path = '/dbfs/mnt/datalake/bronze/sunshine/stage/sunshine-list-2021.xlsx'

# open work book
wb = load_workbook(filename=path, read_only=True)

# choose worksheet
ws = wb['tbs-pssd-compendium-salary-disc']

# convert to data frame
pdf1 = pd.DataFrame(ws.values)

# close workbook
wb.close()

# show top 5 rows
pdf1.head(5)
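One caveat with the openpyxl route: ws.values yields raw tuples, so the header row lands in the dataframe as data row 0. A minimal sketch of promoting that row to column names, using hypothetical Sunshine List column names for illustration:

```python
import pandas as pd

# sample rows as openpyxl's ws.values would yield them;
# the first tuple holds the (hypothetical) column names
rows = [
    ("Employer", "Surname", "Salary"),
    ("City of Toronto", "Smith", 125000.00),
    ("City of Ottawa", "Jones", 110500.00),
]

# build the frame, then promote row 0 to the header
pdf1 = pd.DataFrame(rows)
pdf1.columns = pdf1.iloc[0]
pdf1 = pdf1.drop(index=0).reset_index(drop=True)

print(pdf1.columns.tolist())
```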

Analyze Excel Files

Once we have read the Excel files into dataframes, we can perform various analysis tasks on the data. This can include aggregations, filtering, and transformations. We can use SQL queries or dataframe operations to analyze the data.
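As a sketch of such analysis, the following filters and aggregates a small hand-made frame with hypothetical Sunshine List columns; the same calls also work on a pyspark.pandas dataframe:

```python
import pandas as pd

# hypothetical slice of the Sunshine List; real column names may differ
pdf1 = pd.DataFrame({
    "Employer": ["City of Toronto", "City of Toronto", "City of Ottawa"],
    "Salary":   [125000.00, 140000.00, 110500.00],
})

# filter: employees earning more than $120K
high = pdf1[pdf1["Salary"] > 120000]

# aggregate: average salary per employer
avg = pdf1.groupby("Employer")["Salary"].mean()

print(len(high))                  # 2
print(avg["City of Toronto"])     # 132500.0
```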

Write Excel Files

In addition to reading Excel files, we can also write data from SQL Server to Excel files. Both PySpark and Pandas libraries provide methods to write dataframes to Excel files.

Here is an example of writing a dataframe out with PySpark. Spark always writes a directory of part files, so the dataframe is first repartitioned to a single partition and saved as Parquet; a custom helper (defined elsewhere in the notebook) then moves the lone part file to its final name:

# output path
path = '/mnt/datalake/bronze/sunshine/raw/temp'

# write as a single part file (Parquet)
pdf1.to_spark().repartition(1).write.parquet(path)

# move the part file to its final name (custom helper)
unwanted_file_cleanup("/mnt/datalake/bronze/sunshine/raw/temp", "/mnt/datalake/bronze/sunshine/raw/sunshine-list-v001.parquet", "parquet")

And here is an example of writing multiple dataframes to one Excel workbook using pandas and its ExcelWriter:

# file path
file_path = "advwrks.xlsx"

# use excel writer for multiple sheets
# df_list (Spark dataframes) and ctrl_list (per-sheet metadata) are built earlier in the notebook
with pd.ExcelWriter(file_path, engine='openpyxl') as writer:
  for i in range(len(df_list)):
    df_list[i].toPandas().to_excel(writer, sheet_name=ctrl_list[i]["worksheet_name"], index=False)

Conclusion

In this article, we have explored how to work with Microsoft Excel files in SQL Server solutions using Apache Spark. We have learned how to read Excel files into dataframes, analyze the data, and write dataframes back to Excel files. By leveraging the power of Spark and the flexibility of Python libraries like PySpark and pandas, we can easily manage Excel files alongside SQL Server data.

