Problem: My company receives names in a variety of different formats from business partners. The formatting for names varies substantially across business partners. However, the names need to be parsed into one consistent format to help with tracking details associated with persons. What SQL Server built-in functions and T-SQL programming techniques are especially helpful for parsing names into one format that come from vendors in different formats?
Solution: It is not uncommon to encounter situations where a single agency accepts data on persons from many other business partners. For example, a single agency may attempt to collect previously unpaid loan balances for two or more loans originated by a single lender or different lenders. Another typical example of the need to coordinate data by a person’s name involves the collection of patient ratings for multiple treatment episodes at a hospital or other healthcare provider. In both these scenarios, it is not unusual for the same person’s name to arrive for processing in different formats.
One lender or healthcare provider can enter names so that the first, middle, and last names occur in that order, but another lender can enter names so that the last name is followed by a comma which precedes the first and middle names.
This blog post will discuss SQL Server functions and T-SQL programming techniques that are especially useful for parsing names in different formats and demonstrate basic name parsing programming tips with several examples.
Basic Functions for Parsing Names in SQL Server
In this tip, we will focus on basic demonstrations for how to use five SQL Server string functions for name parsing. These functions are:
- CHARINDEX
- SUBSTRING
- RIGHT
- LEN
- REPLACE
You can use these functions to learn about the contents of a name string containing name parts and then extract the parts so that you can enter them into a staging table for tracking persons within and across data feeds.
For this introductory tip, it is assumed that name parts are separated by just blanks, and the parts sometimes include a trailing comma. Our main focus will be handling strings that can vary by the number of name parts, such as first and last name only versus first name, middle name, last name and suffix. The goal is to equip you to devise one code base for handling name strings that correctly extracts name parts in as many different formats as your needs dictate.
One place to get started parsing name field values is to segment the values into name parts. Each string in a name field value can represent a part of a name. The CHARINDEX function is especially helpful for finding the location of delimiters for strings in a name field.
For a name field value with just two strings for first and last names separated by a space, the CHARINDEX function can find the location of the blank between the name parts. If a name field value contains more than two strings for several different parts, then you can nest CHARINDEX function calls within one another to find the location of delimiters between successive strings within a name field value.
Armed with information about the location of the space between the first and last name parts in a name string, you can use the SUBSTRING function to extract each string part. Then, you can assign the string parts to the first name and last name in a staging table. The SUBSTRING function extracts a sequence of characters from a starting point. When extracting the first string from a name field, you start at position one. When extracting the second or another subsequent string, you start at one position past the space just after the preceding string part. The number of characters to extract can be computed. The computed value can depend on the position of the blank space and/or the length of a name field value.
In my experience, it is typical to use an expression to compute the number of characters to extract from a name field value. The LEN function can report the number of characters in a name field value. When extracting all the characters after some delimiter, such the last name after the first name, you can use the RIGHT function in combination with the LEN function value to specify the number of characters to extract. It is sometimes handy when name parsing to be able to modify the selected characters within a name field value. For example, you can use the REPLACE function to remove commas from a name field value so that they are not included in a parsed name.
Parsing Just First Name and Last Name from a Name String in SQL
One especially easy name parsing example is to parse name field values when there are just two parts, such as a first name followed by a space and a second name. The only issue for this name parsing task is to dynamically adjust the lengths of strings extracted for the first and second names.
Here is an example of how to extract the first and last names from a name field value:
-- Code for parsing first name followed by last name
DECLARE @nameString as varchar(128)
SET @nameString = 'Jerome Sardo'
-- find the delimiter for the end of the first name
SELECT CHARINDEX(' ',@namestring,1) space_location
-- extract the first name
SELECT SUBSTRING(@nameString,1,CHARINDEX(' ',@namestring,1)-1) fname
-- extract the second name one way
SELECT SUBSTRING(@nameString, CHARINDEX(' ',@namestring,1)+1,
LEN(@namestring)-CHARINDEX(' ',@namestring,1)) lname_1
-- extract the second name a different way
SELECT RIGHT(@namestring, LEN(@namestring)-CHARINDEX(' ',@namestring,1)) lname_2
-- Bring the parts together as first name space last name
SELECT
SUBSTRING(@nameString,1,CHARINDEX(' ',@namestring,1)-1)
+ ' ' +
RIGHT(@namestring, LEN(@namestring)-CHARINDEX(' ',@namestring,1)) parsed_name_parts
The above code demonstrates how to extract the first name and last name from a name field value using the CHARINDEX, SUBSTRING, and RIGHT functions. The CHARINDEX function is used to find the position of the space between the first and last names. The SUBSTRING function is used to extract the first name, and the RIGHT function is used to extract the last name. The final SELECT statement combines the first name and last name to display the parsed name.
Parsing Names with One, Two, or Three Strings in SQL Server
The previous code samples are each especially designed to parse names arriving in a specific format. The first sample is especially tailored for names with two strings, and the second is tailored for names with three strings. Neither sample will handle a name with just one string, such as “Microsoft”. In addition, the sample for parsing three-part names does not successfully parse names with just two parts.
The following code handles name fields with one, two, or three name parts. The code achieves this flexibility by scanning the name string to keep track of how many spaces are in the name. Then, it uses that information to extract the strings for each name part in the variable.
-- Code for parsing a name with multiple parts
DECLARE @nameString as varchar(max),
@firstSpaceLoc as smallint,
@secondSpaceLoc as smallint,
@thirdSpaceLoc as smallint,
@firstString as varchar(max),
@secondString as varchar(max),
@thirdString as varchar(max)
SET @nameString = 'Robert Dobson, Jr.'
-- How many strings are in the name?
-- Is there one space in the name
SET @firstSpaceLoc = CHARINDEX(' ',@namestring,1)
-- Is there second space in the name
SET @secondSpaceLoc = CHARINDEX(' ', @namestring, CHARINDEX(' ',@nameString,1)+1)
-- Is there a third space in the name
SET @thirdSpaceLoc =
CASE
WHEN CHARINDEX(' ',
@namestring,
CHARINDEX(' ',@nameString,1)+1) = 0 THEN 0
WHEN CHARINDEX(' ',
@namestring,
CHARINDEX(' ',@nameString,1)+1) > 0 THEN
CHARINDEX(' ', @namestring,
CHARINDEX(' ', @namestring,
CHARINDEX(' ',@nameString,1)+1)+1)
END
-- extract and save strings
SELECT
@firstString =
CASE
WHEN @firstSpaceLoc > 0 THEN LEFT(@nameString,CHARINDEX(' ',@namestring,1)-1)
ELSE @nameString
END,
@secondString =
CASE
WHEN @firstSpaceLoc = 0 THEN ''
WHEN @secondSpaceLoc = 0 THEN
RIGHT(@namestring, LEN(@namestring)- CHARINDEX(' ',@namestring,1))
WHEN @secondSpaceLoc > 0 THEN
REPLACE (
SUBSTRING (
@nameString,
CHARINDEX(' ',@namestring,1)+1,
CHARINDEX(' ', @namestring,
CHARINDEX(' ',@nameString,1)+1)
- CHARINDEX(' ',@namestring,1)
),
',',
''
)
ELSE ''
END,
@thirdString =
CASE
WHEN @firstSpaceLoc = 0 OR @secondSpaceLoc = 0 THEN ''
WHEN @secondSpaceLoc > 0
AND @thirdSpaceLoc = 0 THEN
SUBSTRING (
@nameString,
CHARINDEX(' ', @namestring,
CHARINDEX(' ',@nameString,1)+1)+1,
LEN(@nameString)
)
ELSE RIGHT(@namestring,LEN(@namestring) - @secondSpaceLoc)
END
-- Report names
SELECT
@nameString sourceString,
@firstString [First string],
@secondString [Second string],
@thirdString [Third string]
SELECT
CASE
WHEN @thirdSpaceLoc > 0 THEN
@thirdString + ', ' + @firstString + ' ' + @secondString
WHEN @secondSpaceLoc > 0 AND @thirdSpaceLoc = 0 THEN
@secondString + ' ' + @thirdString + ', ' + @firstString
WHEN @firstSpaceLoc > 0 THEN
@secondString + ', ' + @firstString
WHEN @firstSpaceLoc = 0 THEN
@firstString
END [Reported Name]
The above code demonstrates how to parse a name field value with one, two, or three name parts using the CHARINDEX, SUBSTRING, and REPLACE functions. The code keeps track of the number of spaces in the name field value and extracts the corresponding name parts. The final SELECT statement reports the original name field value followed by its name parts.
By using these SQL Server functions and T-SQL programming techniques, you can parse names in different formats and extract the desired name parts. This allows you to standardize the format of names and track details associated with persons more effectively.
Stay tuned for the next part of this series, where we will explore more advanced techniques for parsing names in SQL Server.