Single-Table Queries in Microsoft SQL Server 2012
- 7/15/2012
Working with Character Data
In this section, I cover query manipulation of character data, including data types, collation, operators and functions, and pattern matching.
Data Types
SQL Server supports two kinds of character data types—regular and Unicode. Regular data types include CHAR and VARCHAR, and Unicode data types include NCHAR and NVARCHAR. Regular characters use one byte of storage for each character, whereas Unicode data requires two bytes per character, and in cases in which a surrogate pair is needed, four bytes are required. If you choose a regular character type for a column, you are restricted to only one language in addition to English. The language support for the column is determined by the column’s effective collation, which I’ll describe shortly. With Unicode data types, multiple languages are supported. So if you store character data in multiple languages, make sure that you use Unicode character types and not regular ones.
The two kinds of character data types also differ in the way in which literals are expressed. When expressing a regular character literal, you simply use single quotes: ‘This is a regular character string literal’. When expressing a Unicode character literal, you need to specify the character N (for National) as a prefix: N’This is a Unicode character string literal’.
Any data type without the VAR element (CHAR, NCHAR) in its name has a fixed length, which means that SQL Server preserves space in the row based on the column’s defined size and not on the actual number of characters in the character string. For example, when a column is defined as CHAR(25), SQL Server preserves space for 25 characters in the row regardless of the length of the stored character string. Because no expansion of the row is required when the strings are expanded, fixed-length data types are more suited for write-focused systems. But because storage consumption is not optimal with fixed-length strings, you pay more when reading data.
A data type with the VAR element (VARCHAR, NVARCHAR) in its name has a variable length, which means that SQL Server uses as much storage space in the row as required to store the characters that appear in the character string, plus two extra bytes for offset data. For example, when a column is defined as VARCHAR(25), the maximum number of characters supported is 25, but in practice, the actual number of characters in the string determines the amount of storage. Because storage consumption for these data types is less than that for fixed-length types, read operations are faster. However, updates might result in row expansion, which might result in data movement outside the current page. Therefore, updates of data having variable-length data types are less efficient than updates of data having fixed-length data types.
You can also define the variable-length data types with the MAX specifier instead of a maximum number of characters. When the column is defined with the MAX specifier, any value with a size up to a certain threshold (8,000 bytes by default) can be stored inline in the row (as long as it can fit in the row). Any value with a size above the threshold is stored external to the row as a large object (LOB).
Later in this chapter, in the “Querying Metadata” section, I explain how you can obtain metadata information about objects in the database, including the data types of columns.
Collation
Collation is a property of character data that encapsulates several aspects, including language support, sort order, case sensitivity, accent sensitivity, and more. To get the set of supported collations and their descriptions, you can query the table function fn_helpcollations as follows.
SELECT name, description FROM sys.fn_helpcollations();
For example, the following list explains the collation Latin1_General_CI_AS:
Latin1_General Code page 1252 is used. (This supports English and German characters, as well as characters used by most Western European countries.)
Dictionary sorting Sorting and comparison of character data are based on dictionary order (A and a < B and b).
You can tell that dictionary order is used because that’s the default when no other ordering is defined explicitly. More specifically, the element BIN doesn’t explicitly appear in the collation name. If the element BIN appeared, it would mean that sorting and comparison of character data was based on the binary representation of characters (A < B < a < b).
CI The data is case insensitive (a = A).
AS The data is accent sensitive (à <> ä).
In an on-premises SQL Server implementation, collation can be defined at four different levels: instance, database, column, and expression. The lowest effective level is the one that should be used. In Windows Azure SQL Database, collation can be indicated at the database, column, and expression levels.
The collation of the instance is chosen as part of the setup program. It determines the collations of all system databases and is used as the default for user databases.
When you create a user database, you can specify a collation for the database by using the COLLATE clause. If you don’t, the instance’s collation is assumed by default.
The database collation determines the collation of the metadata of objects in the database and is used as the default for user table columns. I want to emphasize the importance of the fact that the database collation determines the collation of the metadata, including object and column names. For example, if the database collation is case insensitive, you can’t create two tables called T1 and t1 within the same schema, but if the database collation is case sensitive, you can.
You can explicitly specify a collation for a column as part of its definition by using the COLLATE clause. If you don’t, the database collation is assumed by default.
You can convert the collation of an expression by using the COLLATE clause. For example, in a case-insensitive environment, the following query uses a case-insensitive comparison.
SELECT empid, firstname, lastname FROM HR.Employees WHERE lastname = N'davis';
The following query returns the row for Sara Davis, even though the casing doesn’t match, because the effective casing is insensitive.
empid firstname lastname ----------- ---------- -------------------- 1 Sara Davis
If you want to make the filter case sensitive even though the column’s collation is case insensitive, you can convert the collation of the expression.
SELECT empid, firstname, lastname FROM HR.Employees WHERE lastname COLLATE Latin1_General_CS_AS = N'davis';
This time the query returns an empty set because no match is found when a case-sensitive comparison is used.
Operators and Functions
This section covers string concatenation and functions that operate on character strings. For string concatenation, T-SQL provides the + operator and the CONCAT function. For other operations on character strings, T-SQL provides several functions, including SUBSTRING, LEFT, RIGHT, LEN, DATALENGTH, CHARINDEX, PATINDEX, REPLACE, REPLICATE, STUFF, UPPER, LOWER, RTRIM, LTRIM, and FORMAT. In the following sections, I describe these commonly used operators and functions.
String Concatenation (Plus Sign [+] Operator and CONCAT Function)
T-SQL provides the plus sign (+) operator and the CONCAT function (in SQL Server 2012) to concatenate strings. For example, the following query against the Employees table produces the fullname result column by concatenating firstname, a space, and lastname.
SELECT empid, firstname + N' ' + lastname AS fullname FROM HR.Employees;
This query produces the following output.
empid fullname ----------- ------------------------------- 1 Sara Davis 2 Don Funk 3 Judy Lew 4 Yael Peled 5 Sven Buck 6 Paul Suurs 7 Russell King 8 Maria Cameron 9 Zoya Dolgopyatova
Standard SQL dictates that a concatenation with a NULL should yield a NULL. This is the default behavior of SQL Server. For example, consider the query against the Customers table shown in Listing 2-7.
Listing 2-7. Query Demonstrating String Concatenation
SELECT custid, country, region, city, country + N',' + region + N',' + city AS location FROM Sales.Customers;
Some of the rows in the Customers table have a NULL in the region column. For those, SQL Server returns by default a NULL in the location result column.
custid country region city location ----------- --------------- ------ --------------- ------------------- 1 Germany NULL Berlin NULL 2 Mexico NULL México D.F. NULL 3 Mexico NULL México D.F. NULL 4 UK NULL London NULL 5 Sweden NULL Luleå NULL 6 Germany NULL Mannheim NULL 7 France NULL Strasbourg NULL 8 Spain NULL Madrid NULL 9 France NULL Marseille NULL 10 Canada BC Tsawassen Canada,BC,Tsawassen 11 UK NULL London NULL 12 Argentina NULL Buenos Aires NULL 13 Mexico NULL México D.F. NULL 14 Switzerland NULL Bern NULL 15 Brazil SP Sao Paulo Brazil,SP,Sao Paulo 16 UK NULL London NULL 17 Germany NULL Aachen NULL 18 France NULL Nantes NULL 19 UK NULL London NULL 20 Austria NULL Graz NULL ... (91 row(s) affected)
To treat a NULL as an empty string—or more accurately, to substitute a NULL with an empty string—you can use the COALESCE function. This function accepts a list of input values and returns the first that is not NULL. Here’s how you can revise the query from Listing 2-7 to programmatically substitute NULL marks with empty strings.
SELECT custid, country, region, city, country + COALESCE( N',' + region, N'') + N',' + city AS location FROM Sales.Customers;
SQL Server 2012 introduces a new function called CONCAT that accepts a list of inputs for concatenation and automatically substitutes NULL marks with empty strings. For example, the expression CONCAT(‘a’, NULL, ‘b’) returns the string ‘ab’.
Here’s how to use the CONCAT function to concatenate the customer’s location elements, replacing NULL marks with empty strings.
SELECT custid, country, region, city, CONCAT(country, N',' + region, N',' + city) AS location FROM Sales.Customers;
The SUBSTRING Function
The SUBSTRING function extracts a substring from a string.
Syntax
SUBSTRING(string, start, length)
This function operates on the input string and extracts a substring starting at position start that is length characters long. For example, the following code returns the output ‘abc’.
SELECT SUBSTRING('abcde', 1, 3);
If the value of the third argument exceeds the end of the input string, the function returns everything until the end without raising an error. This can be convenient when you want to return everything from a certain point until the end of the string—you can simply specify the maximum length of the data type or a value representing the full length of the input string.
The LEFT and RIGHT Functions
The LEFT and RIGHT functions are abbreviations of the SUBSTRING function, returning a requested number of characters from the left or right end of the input string.
Syntax
LEFT(string, n), RIGHT(string, n)
The first argument, string, is the string the function operates on. The second argument, n, is the number of characters to extract from the left or right end of the string. For example, the following code returns the output ‘cde’.
SELECT RIGHT('abcde', 3);
The LEN and DATALENGTH Functions
The LEN function returns the number of characters in the input string.
Syntax
LEN(string)
Note that this function returns the number of characters in the input string and not necessarily the number of bytes. With regular characters, both numbers are the same because each character requires one byte of storage. With Unicode characters, each character requires two bytes of storage (in most cases, at least); therefore, the number of characters is half the number of bytes. To get the number of bytes, use the DATALENGTH function instead of LEN. For example, the following code returns 5.
SELECT LEN(N'abcde');
The following code returns 10.
SELECT DATALENGTH(N'abcde');
Another difference between LEN and DATALENGTH is that the former excludes trailing blanks but the latter doesn’t.
The CHARINDEX Function
The CHARINDEX function returns the position of the first occurrence of a substring within a string.
Syntax
CHARINDEX(substring, string[, start_pos])
This function returns the position of the first argument, substring, within the second argument, string. You can optionally specify a third argument, start_pos, to tell the function the position from which to start looking. If you don’t specify the third argument, the function starts looking from the first character. If the substring is not found, the function returns 0. For example, the following code returns the first position of a space in ‘Itzik Ben-Gan’, so it returns the output 6.
SELECT CHARINDEX(' ','Itzik Ben-Gan');
The PATINDEX Function
The PATINDEX function returns the position of the first occurrence of a pattern within a string.
Syntax
PATINDEX(pattern, string)
The argument pattern uses similar patterns to those used by the LIKE predicate in T-SQL. I’ll explain patterns and the LIKE predicate later in this chapter, in “The LIKE Predicate.” Even though I haven’t explained yet how patterns are expressed in T-SQL, I include the following example here to show how to find the position of the first occurrence of a digit within a string.
SELECT PATINDEX('%[0-9]%', 'abcd123efgh');
This code returns the output 5.
The REPLACE Function
The REPLACE function replaces all occurrences of a substring with another.
Syntax
REPLACE(string, substring1, substring2)
The function replaces all occurrences of substring1 in string with substring2. For example, the following code substitutes all occurrences of a dash in the input string with a colon.
SELECT REPLACE('1-a 2-b', '-', ':');
This code returns the output: ‘1:a 2:b’.
You can use the REPLACE function to count the number of occurrences of a character within a string. To do this, you replace all occurrences of the character with an empty string (zero characters) and calculate the original length of the string minus the new length. For example, the following query returns, for each employee, the number of times the character e appears in the lastname attribute.
SELECT empid, lastname, LEN(lastname) - LEN(REPLACE(lastname, 'e', '')) AS numoccur FROM HR.Employees;
This query generates the following output.
empid lastname numoccur ----------- -------------------- ----------- 5 Buck 0 8 Cameron 1 1 Davis 0 9 Dolgopyatova 0 2 Funk 0 7 King 0 3 Lew 1 4 Peled 2 6 Suurs 0
The REPLICATE Function
The REPLICATE function replicates a string a requested number of times.
Syntax
REPLICATE(string, n)
For example, the following code replicates the string ‘abc’ three times, returning the string ‘abcabcabc’.
SELECT REPLICATE('abc', 3);
The next example demonstrates the use of the REPLICATE function, along with the RIGHT function and string concatenation. The following query against the Production.Suppliers table generates a 10-digit string representation of the supplier ID integer with leading zeros.
SELECT supplierid, RIGHT(REPLICATE('0', 9) + CAST(supplierid AS VARCHAR(10)), 10) AS strsupplierid FROM Production.Suppliers;
The expression producing the result column strsupplierid replicates the character 0 nine times (producing the string ‘ 000000000’) and concatenates the string representation of the supplier ID to form the result. The string representation of the supplier ID integer is produced by the CAST function, which is used to convert the data type of the input value. Finally, the expression extracts the 10 rightmost characters of the result string, returning the 10-digit string representation of the supplier ID with leading zeros. Here’s the output of this query, shown in abbreviated form.
supplierid strsupplierid ----------- ------------- 29 0000000029 28 0000000028 4 0000000004 21 0000000021 2 0000000002 22 0000000022 14 0000000014 11 0000000011 25 0000000025 7 0000000007 ... (29 row(s) affected)
Note that SQL Server 2012 introduces a new function called FORMAT that allows you to achieve such formatting needs much more easily. I’ll describe it later in this section.
The STUFF Function
The STUFF function allows you to remove a substring from a string and insert a new substring instead.
Syntax
STUFF(string, pos, delete_length, insertstring)
This function operates on the input parameter string. It deletes as many characters as the number specified in the delete_length parameter, starting at the character position specified in the pos input parameter. The function inserts the string specified in the insertstring parameter in position pos. For example, the following code operates on the string ‘ xyz’, removes one character from the second character, and inserts the substring ‘abc’ instead.
SELECT STUFF('xyz', 2, 1, 'abc');
The output of this code is ‘xabcz’.
If you just want to insert a string and not delete anything, you can specify a length of 0 as the third argument.
The UPPER and LOWER Functions
The UPPER and LOWER functions return the input string with all uppercase or lowercase characters, respectively.
Syntax
UPPER(string), LOWER(string)
For example, the following code returns ‘ITZIK BEN-GAN’.
SELECT UPPER('Itzik Ben-Gan');
The following code returns ‘itzik ben-gan’.
SELECT LOWER('Itzik Ben-Gan');
The RTRIM and LTRIM Functions
The RTRIM and LTRIM functions return the input string with leading or trailing spaces removed.
Syntax
RTRIM(string), LTRIM(string)
If you want to remove both leading and trailing spaces, use the result of one function as the input to the other. For example, the following code removes both leading and trailing spaces from the input string, returning ‘abc’.
SELECT RTRIM(LTRIM(' abc '));
The FORMAT Function
The FORMAT function allows you to format an input value as a character string based on a Microsoft .NET format string and an optional culture.
Syntax
FORMAT(input, format_string, culture)
There are numerous possibilities for formatting inputs using both standard and custom format strings. The MSDN article at http://go.microsoft.com/fwlink/?LinkId=211776 provides more information. But just as a simple example, recall the convoluted expression used earlier to format a number as a 10-digit string with leading zeros. By using FORMAT, you can achieve the same task with either the custom form string ‘0000000000’ or the standard one, ‘d10’. As an example, the following code returns ‘0000001759’.
SELECT FORMAT(1759, '000000000');
The LIKE Predicate
T-SQL provides a predicate called LIKE that allows you to check whether a character string matches a specified pattern. Similar patterns are used by the PATINDEX function described earlier. The following section describes the wildcards supported in the patterns and demonstrates their use.
The % (Percent) Wildcard
The percent sign represents a string of any size, including an empty string. For example, the following query returns employees where the last name starts with D.
SELECT empid, lastname FROM HR.Employees WHERE lastname LIKE N'D%';
This query returns the following output.
empid lastname ----------- -------------------- 1 Davis 9 Dolgopyatova
Note that often you can use functions such as SUBSTRING and LEFT instead of the LIKE predicate to represent the same meaning. But the LIKE predicate tends to get optimized better—especially when the pattern starts with a known prefix.
The _ (Underscore) Wildcard
An underscore represents a single character. For example, the following query returns employees where the second character in the last name is e.
SELECT empid, lastname FROM HR.Employees WHERE lastname LIKE N'_e%';
This query returns the following output.
empid lastname ----------- -------------------- 3 Lew 4 Peled
The [<List of Characters>] Wildcard
Square brackets with a list of characters (such as [ABC]) represent a single character that must be one of the characters specified in the list. For example, the following query returns employees where the first character in the last name is A, B, or C.
SELECT empid, lastname FROM HR.Employees WHERE lastname LIKE N'[ABC]%';
This query returns the following output.
empid lastname ----------- -------------------- 5 Buck 8 Cameron
The [<Character>-<Character>] Wildcard
Square brackets with a character range (such as [A-E]) represent a single character that must be within the specified range. For example, the following query returns employees where the first character in the last name is a letter in the range A through E.
SELECT empid, lastname FROM HR.Employees WHERE lastname LIKE N'[A-E]%';
This query returns the following output.
empid lastname ----------- -------------------- 5 Buck 8 Cameron 1 Davis 9 Dolgopyatova
The [^<Character List or Range>] Wildcard
Square brackets with a caret sign (^) followed by a character list or range (such as [^A-E]) represent a single character that is not in the specified character list or range. For example, the following query returns employees where the first character in the last name is not a letter in the range A through E.
SELECT empid, lastname FROM HR.Employees WHERE lastname LIKE N'[^A-E]%';
This query returns the following output.
empid lastname ----------- -------------------- 2 Funk 7 King 3 Lew 4 Peled 6 Suurs
The ESCAPE Character
If you want to search for a character that is also used as a wildcard, (such as %, _, [, or ]), you can use an escape character. Specify a character that you know for sure doesn’t appear in the data as the escape character in front of the character you are looking for, and specify the keyword ESCAPE followed by the escape character right after the pattern. For example, to check whether a column called col1 contains an underscore, use col1 LIKE ‘%!_%’ ESCAPE ‘!’.
For the wildcards %, _, and [ you can use square brackets instead of an escape character. For example, instead of col1 LIKE ‘%!_%’ ESCAPE ‘!’ you can use col1 LIKE ‘%[_]%’.