Aggregating and Pivoting Data in Microsoft SQL Server 2008 T-SQL
- 3/25/2009
Tiebreakers
In this section, I want to introduce a new technique based on aggregates to solve tiebreaker problems, which I started discussing in Chapter 6. I’ll use the same example as I used there—returning the most recent order for each employee—using different combinations of tiebreaker attributes that uniquely identify an order for each employee. Keep in mind that the performance of the solutions that use subqueries depends very strongly on indexing. That is, you need an index on the partitioning column, sort column, and tiebreaker attributes. But in practice, you don’t always have the option of adding as many indexes as you like. The subquery-based solutions will greatly suffer in performance from a lack of appropriate indexes. Using aggregation techniques, you’ll see that the solution yields reasonable performance even when an optimal index is not in place—in fact, even when no good index is in place.
Let’s start by using MAX(orderid) as the tiebreaker. To recap, you’re after the most recent order for each employee, and if there’s a tie for most recent, choose the order with the largest ID. For each employee’s most recent order, you’re supposed to return the columns empid, orderdate, orderid, custid, and requireddate.
The aggregate technique to solve the problem applies the following logical idea, given here in pseudocode:
SELECT empid, MAX(orderdate, orderid, custid, requireddate) FROM Sales.Orders GROUP BY empid;
This idea can’t be expressed directly in T-SQL, so don’t try to run this query. The idea here is to select for each empid, the row with largest orderdate (most recent), then largest orderid—the tiebreaker—among orders with the most recent orderdate. Because the combination empid, orderdate, orderid is already unique, there will be no further ties to break, and the other attributes (custid and requireddate) are simply returned from the selected row. Because a MAX of more than one attribute does not exist in T-SQL, you must mimic it somehow. One way is by merging the attributes into a single input to the MAX function, then extracting back the individual elements in an outer query.
The question is this: What technique should you use to merge the attributes? The trick is to convert each attribute to a fixed-width string and concatenate the strings. You must convert the attributes to strings in a way that doesn’t change the sorting order. When dealing exclusively with nonnegative numbers, you can get by with an arithmetic calculation instead of concatenation. For example, say you have the numbers m and n, each with a valid range of 1 through 999. To merge m and n, use the following formula: m*1000 + n AS r. You can easily extract the individual pieces later: r divided by 1000 is m, and r modulo 1000 is n. However, in many cases you may have nonnumeric data to concatenate, so arithmetic wouldn’t be possible. You might want to consider converting all values to fixed-width character strings (CHAR(n) or NCHAR(n)) or to fixed-width binary strings (BINARY(n)).
Here’s an example of returning the most recent order for each employee, where MAX(orderid) is the tiebreaker, using binary concatenation:
SELECT empid, CAST(SUBSTRING(binstr, 1, 8) AS DATETIME) AS orderdate, CAST(SUBSTRING(binstr, 9, 4) AS INT) AS orderid, CAST(SUBSTRING(binstr, 13, 4) AS INT) AS custid, CAST(SUBSTRING(binstr, 17, 8) AS DATETIME) AS requireddate FROM (SELECT empid, MAX(CAST(orderdate AS BINARY(8)) + CAST(orderid AS BINARY(4)) + CAST(custid AS BINARY(4)) + CAST(requireddate AS BINARY(8))) AS binstr FROM Sales.Orders GROUP BY empid) AS D;
The derived table D contains the maximum concatenated string for each employee. Notice that each value was converted to the appropriate fixed-size string before concatenation based on its data type (DATETIME—8 bytes, INT—4 bytes, and so on).
The outer query uses SUBSTRING to extract the individual elements, and it converts them back to their original data types.
The real benefit in this solution is that it scans the data only once regardless of whether you have a good index. If you do, you’ll probably get an ordered scan of the index and a sort-based aggregate (a stream aggregate). If you don’t have a good index—as is the case here—you’ll probably get a hash-based aggregate, as you can see in Figure 8-1.
Figure 8-1 Execution plan for a tiebreaker query
Things get trickier when the sort columns and tiebreaker attributes have different sort directions within them. For example, suppose the tiebreaker was MIN(orderid). In that case, you would need to apply MAX to orderdate and MIN to orderid. There is a logical solution when the attribute with the opposite direction is numeric. Say you need to calculate the MIN value of a nonnegative integer column n, using only MAX, and you need to use binary concatenation. You can get the minimum by using <maxint> - MAX(<maxint> - n).
The following query incorporates this logical technique:
SELECT empid, CAST(SUBSTRING(binstr, 1, 8) AS DATETIME) AS orderdate, 2147483647 - CAST(SUBSTRING(binstr, 9, 4) AS INT) AS orderid, CAST(SUBSTRING(binstr, 13, 4) AS INT) AS custid, CAST(SUBSTRING(binstr, 17, 8) AS DATETIME) AS requireddate FROM (SELECT empid, MAX(CAST(orderdate AS BINARY(8)) + CAST(2147483647 - orderid AS BINARY(4)) + CAST(custid AS BINARY(4)) + CAST(requireddate AS BINARY(8))) AS binstr FROM Sales.Orders GROUP BY empid) AS D;
Another technique to calculate the minimum by using the MAX function is based on bitwise manipulation and works with nonnegative integers. The minimum value of a column n is equal to ~MAX(~n), where ~ is the bitwise NOT operator.
The following query incorporates this technique:
SELECT empid, CAST(SUBSTRING(binstr, 1, 8) AS DATETIME) AS orderdate, ~CAST(SUBSTRING(binstr, 9, 4) AS INT) AS orderid, CAST(SUBSTRING(binstr, 13, 4) AS INT) AS custid, CAST(SUBSTRING(binstr, 17, 8) AS DATETIME) AS requireddate FROM (SELECT empid, MAX(CAST(orderdate AS BINARY(8)) + CAST(~orderid AS BINARY(4)) + CAST(custid AS BINARY(4)) + CAST(requireddate AS BINARY(8))) AS binstr FROM Sales.Orders GROUP BY empid) AS D;
Of course, you can play with the tiebreakers you’re using in any way you like. For example, the following query returns the most recent order for each employee, using MAX(requireddate), MAX(orderid) as the tiebreaker:
SELECT empid, CAST(SUBSTRING(binstr, 1, 8) AS DATETIME) AS orderdate, CAST(SUBSTRING(binstr, 9, 8) AS DATETIME) AS requireddate, CAST(SUBSTRING(binstr, 17, 4) AS INT) AS orderid, CAST(SUBSTRING(binstr, 21, 4) AS INT) AS custid FROM (SELECT empid, MAX(CAST(orderdate AS BINARY(8)) + CAST(requireddate AS BINARY(8)) + CAST(orderid AS BINARY(4)) + CAST(custid AS BINARY(4)) ) AS binstr FROM Sales.Orders GROUP BY empid) AS D;