Single-Table Queries in Microsoft SQL Server 2012
- 7/15/2012
NULL Marks
As explained in Chapter 1, “Background to T-SQL Querying and Programming,” SQL supports the NULL mark to represent missing values and uses three-valued logic, meaning that predicates can evaluate to TRUE, FALSE, or UNKNOWN. T-SQL follows the standard in this respect. Treatment of NULL marks and UNKNOWN in SQL can be very confusing because intuitively people are more accustomed to thinking in terms of two-valued logic (TRUE and FALSE). To add to the confusion, different language elements in SQL treat NULL marks and UNKNOWN differently.
Let’s start with three-valued predicate logic. A logical expression involving only existing or present values evaluates to either TRUE or FALSE, but when the logical expression involves a missing value, it evaluates to UNKNOWN. For example, consider the predicate salary > 0. When salary is equal to 1,000, the expression evaluates to TRUE. When salary is equal to –1,000, the expression evaluates to FALSE. When salary is NULL, the expression evaluates to UNKNOWN.
SQL treats TRUE and FALSE in an intuitive and probably expected manner. For example, if the predicate salary > 0 appears in a query filter (such as in a WHERE or HAVING clause), rows or groups for which the expression evaluates to TRUE are returned, whereas those for which the expression evaluates to FALSE are filtered out. Similarly, if the predicate salary > 0 appears in a CHECK constraint in a table, INSERT or UPDATE statements for which the expression evaluates to TRUE for all rows are accepted, whereas those for which the expression evaluates to FALSE for any row are rejected.
SQL has different treatments for UNKNOWN in different language elements (and for some people, not necessarily the expected treatments). The correct definition of the treatment SQL has for query filters is “accept TRUE,” meaning that both FALSE and UNKNOWN are filtered out. Conversely, the definition of the treatment SQL has for CHECK constraints is “reject FALSE,” meaning that both TRUE and UNKNOWN are accepted. If SQL used two-valued predicate logic, there wouldn’t be a difference between the definitions “accept TRUE” and “reject FALSE.” But with three-valued predicate logic, “accept TRUE” rejects UNKNOWN (it accepts only TRUE, hence it rejects both FALSE and UNKNOWN), whereas “reject FALSE” accepts it (it rejects only FALSE, hence it accepts both TRUE and UNKNOWN). With the predicate salary > 0 from the previous example, a NULL salary would cause the expression to evaluate to UNKNOWN. If this predicate appears in a query’s WHERE clause, a row with a NULL salary will be filtered out. If this predicate appears in a CHECK constraint in a table, a row with a NULL salary will be accepted.
One of the tricky aspects of UNKNOWN is that when you negate it, you still get UNKNOWN. For example, given the predicate NOT (salary > 0), when salary is NULL, salary > 0 evaluates to UNKNOWN, and NOT UNKNOWN remains UNKNOWN.
What some people find surprising is that an expression comparing two NULL marks (NULL = NULL) evaluates to UNKNOWN. The reasoning for this from SQL’s perspective is that a NULL represents a missing or unknown value, and you can’t really tell whether one unknown value is equal to another. Therefore, SQL provides you with the predicates IS NULL and IS NOT NULL, which you should use instead of = NULL and <> NULL.
To make things a bit more tangible, I’ll demonstrate the aforementioned aspects of the three-valued predicate logic. The Sales.Customers table has three attributes called country, region, and city, where the customer’s location information is stored. All locations have existing countries and cities. Some have existing regions (such as country: USA, region: WA, city: Seattle), yet for some the region element is missing and inapplicable (such as country: UK, region: NULL, city: London). Consider the following query, which attempts to return all customers where the region is equal to WA.
SELECT custid, country, region, city FROM Sales.Customers WHERE region = N'WA';
This query generates the following output.
custid country region city ----------- --------------- --------------- --------------- 43 USA WA Walla Walla 82 USA WA Kirkland 89 USA WA Seattle
Out of the 91 rows in the Customers table, the query returns the three rows where the region attribute is equal to WA. The query returns neither rows in which the value in the region attribute is present and different than WA (the predicate evaluates to FALSE) nor those where the region attribute is NULL (the predicate evaluates to UNKNOWN).
The following query attempts to return all customers for whom the region is different than WA.
SELECT custid, country, region, city FROM Sales.Customers WHERE region <> N'WA';
This query generates the following output:
custid country region city ----------- --------------- --------------- --------------- 10 Canada BC Tsawassen 15 Brazil SP Sao Paulo 21 Brazil SP Sao Paulo 31 Brazil SP Campinas 32 USA OR Eugene 33 Venezuela DF Caracas 34 Brazil RJ Rio de Janeiro 35 Venezuela Táchira San Cristóbal 36 USA OR Elgin 37 Ireland Co. Cork Cork 38 UK Isle of Wight Cowes 42 Canada BC Vancouver 45 USA CA San Francisco 46 Venezuela Lara Barquisimeto 47 Venezuela Nueva Esparta I. de Margarita 48 USA OR Portland 51 Canada Québec Montréal 55 USA AK Anchorage 61 Brazil RJ Rio de Janeiro 62 Brazil SP Sao Paulo 65 USA NM Albuquerque 67 Brazil RJ Rio de Janeiro 71 USA ID Boise 75 USA WY Lander 77 USA OR Portland 78 USA MT Butte 81 Brazil SP Sao Paulo 88 Brazil SP Resende (28 row(s) affected)
If you expected to get 88 rows back (91 rows in the table minus 3 returned by the previous query), you might find the fact that this query returned only 28 rows surprising. But remember, a query filter “accepts TRUE,” meaning that it rejects both rows for which the logical expression evaluates to FALSE and those for which it evaluates to UNKNOWN. So this query returned rows in which a value was present in the region attribute and that value was different than WA. It returned neither rows in which the region attribute was equal to WA nor rows in which region was NULL. You will get the same output if you use the predicate NOT (region = N’WA’) because in the rows where region is NULL and the expression region = N’WA’ evaluates to UNKNOWN, NOT (region = N’WA’) evaluates to UNKNOWN also.
If you want to return all rows for which region is NULL, do not use the predicate region = NULL, because the expression evaluates to UNKNOWN in all rows—both those in which the value is present and those in which the value is missing (is NULL). The following query returns an empty set.
SELECT custid, country, region, city FROM Sales.Customers WHERE region = NULL; custid country region city ----------- --------------- --------------- --------------- (0 row(s) affected)
Instead, you should use the IS NULL predicate.
SELECT custid, country, region, city FROM Sales.Customers WHERE region IS NULL;
This query generates the following output, shown in abbreviated form.
custid country region city ----------- --------------- --------------- --------------- 1 Germany NULL Berlin 2 Mexico NULL México D.F. 3 Mexico NULL México D.F. 4 UK NULL London 5 Sweden NULL Luleå 6 Germany NULL Mannheim 7 France NULL Strasbourg 8 Spain NULL Madrid 9 France NULL Marseille 11 UK NULL London ... (60 row(s) affected)
If you want to return all rows for which the region attribute is not WA, including those in which the value is present and different than WA, along with those in which the value is missing, you need to include an explicit test for NULL marks, like this.
SELECT custid, country, region, city FROM Sales.Customers WHERE region <> N'WA' OR region IS NULL;
This query generates the following output, shown in abbreviated form.
custid country region city ----------- --------------- --------------- --------------- 1 Germany NULL Berlin 2 Mexico NULL México D.F. 3 Mexico NULL México D.F. 4 UK NULL London 5 Sweden NULL Luleå 6 Germany NULL Mannheim 7 France NULL Strasbourg 8 Spain NULL Madrid 9 France NULL Marseille 10 Canada BC Tsawassen ... (88 row(s) affected)
SQL also treats NULL marks inconsistently in different language elements for comparison and sorting purposes. Some elements treat two NULL marks as equal to each other and others treat them as different.
For example, for grouping and sorting purposes, two NULL marks are considered equal. That is, the GROUP BY clause arranges all NULL marks in one group just like present values, and the ORDER BY clause sorts all NULL marks together. Standard SQL leaves it to the product implementation as to whether NULL marks sort before present values or after. T-SQL sorts NULL marks before present values.
As mentioned earlier, query filters “accept TRUE.” An expression comparing two NULL marks yields UNKNOWN; therefore, such a row is filtered out.
For the purposes of enforcing a UNIQUE constraint, standard SQL treats NULL marks as different from each other (allowing multiple NULL marks). Conversely, in T-SQL, a UNIQUE constraint considers two NULL marks as equal (allowing only one NULL if the constraint is defined on a single column).
Keeping in mind the inconsistent treatment SQL has for UNKNOWN and NULL marks and the potential for logical errors, you should explicitly think of NULL marks and three-valued logic in every query that you write. If the default treatment is not what you want, you must intervene explicitly; otherwise, just ensure that the default behavior is in fact what you want.