Published on November 30, 2020

Reading and Importing Data from a PDF File in SQL Server

In today’s digital world, data is available in many formats, such as Excel, CSV, PDF, HTML, JSON, XML, and TXT. While SQL Server Integration Services (SSIS) is commonly used for data imports and exports in SQL Server, it does not directly support importing or exporting data in PDF format. However, with the help of SQL Server Machine Learning Services and R scripts, we can read and import data from PDF files.

Environment Details

Before we begin, make sure you have the following environment set up:

  • SQL Server 2019
  • Machine Learning Services with R
  • SQL Server Launchpad and Database Engine services running (external scripts must be enabled; see the check below)
  • SQL Server Management Studio
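
If external scripts have not been enabled on the instance yet, the following T-SQL is one way to turn the feature on (a restart of the SQL Server service may be required before the setting takes effect):

-- Enable external script execution, required for sp_execute_external_script
EXEC sp_configure 'external scripts enabled', 1;
RECONFIGURE;

-- Check the current value of the setting
EXEC sp_configure 'external scripts enabled';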

Reading Data from PDF Files using R Scripts in SQL Server

To read data from PDF files in SQL Server, we can use external libraries such as “pdftools”. Start by launching the R console that ships with Machine Learning Services (run it as administrator) and installing the “pdftools” package:

install.packages("pdftools")

Once installed, you can use the “sp_execute_external_script” stored procedure to execute R scripts from SQL Server. For example, you can print the version of the “pdftools” package using the following script:

EXECUTE sp_execute_external_script
  @language = N'R',
  @script = N'library(pdftools)
              print(packageVersion("pdftools"))'

This will display the version of the “pdftools” package, confirming that you can use it for your scripts.

To read the content of a PDF file, use the “pdf_text()” function in the SQL Server R script. For example:

data <- pdf_text("MyData.pdf")
print(data)

This will output the contents of the PDF file. You can compare the output with the original PDF document to verify the contents.
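
The same call can be wrapped in “sp_execute_external_script” so that the extracted pages come back to SQL Server Management Studio as a result set. Below is a minimal sketch; it assumes the PDF is stored at a location the SQL Server service account can read (the path “C:/temp/MyData.pdf” is only a placeholder):

EXECUTE sp_execute_external_script
  @language = N'R',
  @script = N'library(pdftools)
              # pdf_text() returns one character string per page
              data <- pdf_text("C:/temp/MyData.pdf")
              OutputDataSet <- data.frame(PageText = data, stringsAsFactors = FALSE)'
WITH RESULT SETS ((PageText NVARCHAR(MAX)));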

Improving the Formatting of Extracted Text

By default, “pdf_text()” returns each page as one long character string, so the printed output is hard to read. To improve the formatting, you can use the “strsplit()” function to split the extracted text into separate lines at the new line character (“\n”). For example:

data <- pdf_text("MyData.pdf")
Splitdata <- strsplit(data, "\n")
print(Splitdata)

This will display the extracted text as separate lines, closely matching the layout of the original PDF document.
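
Note that “strsplit()” returns a list with one element per page. If a single flat vector of lines is more convenient (this is also useful when inserting the text into a table below), the list can be flattened with “unlist()”. A small sketch:

data <- pdf_text("MyData.pdf")
# strsplit() returns a list (one element per page); unlist() flattens it into one vector of lines
lines <- unlist(strsplit(data, "\n"))
print(head(lines))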

Inserting Data into SQL Server Table

If you want to import the extracted data into a SQL Server table, you can create a table with a varchar column to store the text. For example:

CREATE TABLE ExtractedData (
  id INT IDENTITY,
  ExtractedText VARCHAR(1000)
)

Then, you can use the “sp_execute_external_script” stored procedure with INSERT…EXECUTE to insert the data into the table. Because the table has an identity column, list the target column explicitly:

INSERT INTO ExtractedData (ExtractedText)
EXECUTE sp_execute_external_script
  @language = N'R',
  @script = N'library(pdftools)
              data <- pdf_text("MyData.pdf")
              Splitdata <- unlist(strsplit(data, "\n"))
              OutputDataSet <- data.frame(ExtractedText = Splitdata, stringsAsFactors = FALSE)'

This will insert the extracted data into the “ExtractedData” table.
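
You can then query the table to confirm that each extracted line was stored as a separate row:

SELECT id, ExtractedText
FROM ExtractedData;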

Text Mining using R Scripts in SQL Server

Once you have imported the data from a PDF file, you can perform text mining using SQL Server’s R script capabilities. For example, you can install the “tm” package (again from the administrative R console) and load it in your script:

install.packages("tm")
library("tm")

With the “tm” package, you can apply various transformations to the extracted data, such as converting the text to lowercase, removing common English stop words, removing punctuation marks, and removing numbers. Here’s an example:

data <- pdf_text("MyData.pdf")
extracteddata <- Corpus(VectorSource(data))
extracteddata <- tm_map(extracteddata, content_transformer(tolower))
extracteddata <- tm_map(extracteddata, removeWords, stopwords("english"))
extracteddata <- tm_map(extracteddata, removePunctuation)
extracteddata <- tm_map(extracteddata, removeNumbers)

Once you have preprocessed the data, you can create a term-document matrix to analyze the frequency of words. For example:

extracteddata_matrix <- TermDocumentMatrix(extracteddata)
m <- as.matrix(extracteddata_matrix)
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)
print(head(d, 3))

This will display the top 3 most frequently used words in the extracted data.
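
This word-frequency step can also be run end to end from SQL Server, so the results land in a result set (or in a table via INSERT…EXECUTE). The sketch below combines the earlier steps; as before, the file path is only a placeholder and the column types are one reasonable choice:

EXECUTE sp_execute_external_script
  @language = N'R',
  @script = N'library(pdftools)
              library(tm)
              data <- pdf_text("C:/temp/MyData.pdf")
              extracteddata <- Corpus(VectorSource(data))
              extracteddata <- tm_map(extracteddata, content_transformer(tolower))
              extracteddata <- tm_map(extracteddata, removeWords, stopwords("english"))
              extracteddata <- tm_map(extracteddata, removePunctuation)
              extracteddata <- tm_map(extracteddata, removeNumbers)
              # Build the term-document matrix and rank words by frequency
              v <- sort(rowSums(as.matrix(TermDocumentMatrix(extracteddata))), decreasing = TRUE)
              OutputDataSet <- data.frame(word = names(v), freq = as.integer(v), stringsAsFactors = FALSE)'
WITH RESULT SETS ((word NVARCHAR(200), freq INT));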

Conclusion

In this article, we have explored how to read and import data from a PDF file using R scripts in SQL Server, and we have seen an example of text mining on the extracted data. With SQL Server Machine Learning Services and R scripts, you can perform a variety of analysis tasks on PDF data and integrate the results into your SQL Server workflows.
