Aggregating and Pivoting Data in Microsoft SQL Server 2008 T-SQL
- 3/25/2009
In my solutions in this chapter, I’ll reuse techniques that I introduced earlier. I’ll also introduce new techniques for you to familiarize yourself with.
Logic will naturally be an integral element in the solutions. Remember that at the heart of every querying problem lies a logical puzzle.
OVER Clause
The OVER clause allows you to request window-based calculations—that is, calculations performed over a whole window of rows. In Chapter 6, I described in detail how you use the OVER clause with analytical ranking functions. Microsoft SQL Server also supports the OVER clause with scalar aggregate functions; however, currently you can provide only the PARTITION BY clause. Future versions of SQL Server will most likely also support the other ANSI elements of aggregate window functions, including the ORDER BY and ROWS clauses.
The purpose of using the OVER clause with scalar aggregates is to calculate, for each row, an aggregate based on a window of values that extends beyond that row—and to do all this without using a GROUP BY clause in the query. In other words, the OVER clause allows you to add aggregate calculations to the results of an ungrouped query. This capability provides an alternative to requesting aggregates with subqueries in case you need to include both base row attributes and aggregates in your results.
Remember that in Chapter 7, I presented a problem in which you were required to calculate two aggregates for each order row: the percentage the row contributed to the total value of all orders and the difference between the row’s order value and the average value over all orders. In my examples I used a table called MyOrderValues that you create and populate by running the following code:
SET NOCOUNT ON; USE InsideTSQL2008; IF OBJECT_ID('dbo.MyOrderValues', 'U') IS NOT NULL DROP TABLE dbo.MyOrderValues; GO SELECT * INTO dbo.MyOrderValues FROM Sales.OrderValues; ALTER TABLE dbo.MyOrderValues ADD CONSTRAINT PK_MyOrderValues PRIMARY KEY(orderid); CREATE INDEX idx_val ON dbo.MyOrderValues(val);
I showed the following optimized query in which I used a cross join between the base table and a derived table of aggregates instead of using multiple subqueries:
SELECT orderid, custid, val, CAST(val / sumval * 100. AS NUMERIC(5, 2)) AS pct, CAST(val - avgval AS NUMERIC(12, 2)) AS diff FROM dbo.MyOrderValues CROSS JOIN (SELECT SUM(val) AS sumval, AVG(val) AS avgval FROM dbo.MyOrderValues) AS Aggs;
This query produces the following output:
orderid custid val pct diff -------- ------- --------- ----- ------------- 10248 85 440.00 0.03 -1085.05 10249 79 1863.40 0.15 338.35 10250 34 1552.60 0.12 27.55 10251 84 654.06 0.05 -870.99 10252 76 3597.90 0.28 2072.85 10253 34 1444.80 0.11 -80.25 10254 14 556.62 0.04 -968.43 10255 68 2490.50 0.20 965.45 10256 88 517.80 0.04 -1007.25 ...
The motivation for calculating the two aggregates in a single derived table instead of as two separate subqueries stemmed from the fact that each subquery accessed the base table separately, while the derived table calculated the aggregates using a single scan of the data. SQL Server’s query optimizer didn’t use the fact that the two subqueries aggregated the same data into the same groups.
When you specify multiple aggregates with identical OVER clauses in the same SELECT list, however, the aggregates refer to the same window, as with a derived table, and SQL Server’s query optimizer evaluates them all with one scan of the source data. Here’s how you use the OVER clause to answer the same request:
SELECT orderid, custid, val, CAST(val / SUM(val) OVER() * 100. AS NUMERIC(5, 2)) AS pct, CAST(val - AVG(val) OVER() AS NUMERIC(12, 2)) AS diff FROM dbo.MyOrderValues;
Here, because I didn’t specify a PARTITION BY clause, the aggregates were calculated based on the whole input. Logically, SUM(val) OVER() is equivalent here to the subquery (SELECT SUM(val) FROM dbo.MyOrderValues). Physically, it’s a different story. As an exercise, you can compare the execution plans of the following two queries, each requesting a different number of aggregates using the same OVER clause:
SELECT orderid, custid, val, SUM(val) OVER() AS sumval FROM dbo.MyOrderValues; SELECT orderid, custid, val, SUM(val) OVER() AS sumval, COUNT(val) OVER() AS cntval, AVG(val) OVER() AS avgval, MIN(val) OVER() AS minval, MAX(val) OVER() AS maxval FROM dbo.MyOrderValues;
You’ll find the two plans nearly identical, with the only difference being that the single Stream Aggregate operator calculates a different number of aggregates. The query costs are identical. On the other hand, compare the execution plans of the following two queries, each requesting a different number of aggregates using subqueries:
SELECT orderid, custid, val, (SELECT SUM(val) FROM dbo.MyOrderValues) AS sumval FROM dbo.MyOrderValues; SELECT orderid, custid, val, (SELECT SUM(val) FROM dbo.MyOrderValues) AS sumval, (SELECT COUNT(val) FROM dbo.MyOrderValues) AS cntval, (SELECT AVG(val) FROM dbo.MyOrderValues) AS avgval, (SELECT MIN(val) FROM dbo.MyOrderValues) AS minval, (SELECT MAX(val) FROM dbo.MyOrderValues) AS maxval FROM dbo.MyOrderValues;
You’ll find that they have different plans, with the latter being more expensive because it rescans the source data for each aggregate.
Another benefit of the OVER clause is that it allows for shorter and simpler code. This is especially apparent when you need to calculate partitioned aggregates. Using OVER, you simply specify a PARTITION BY clause. Using subqueries, you have to correlate the inner query to the outer, making the query longer and more complex.
As an example of using the PARTITION BY clause, the following query calculates the percentage of the order value out of the customer total and the difference from the customer average:
SELECT orderid, custid, val, CAST(val / SUM(val) OVER(PARTITION BY custid) * 100. AS NUMERIC(5, 2)) AS pct, CAST(val - AVG(val) OVER(PARTITION BY custid) AS NUMERIC(12, 2)) AS diff FROM dbo.MyOrderValues ORDER BY custid;
This query generates the following output:
orderid custid val pct diff -------- ------- ------- ------ ------------ 10643 1 814.50 19.06 102.33 10692 1 878.00 20.55 165.83 10702 1 330.00 7.72 -382.17 10835 1 845.80 19.79 133.63 10952 1 471.20 11.03 -240.97 11011 1 933.50 21.85 221.33 10926 2 514.40 36.67 163.66 10759 2 320.00 22.81 -30.74 10625 2 479.75 34.20 129.01 10308 2 88.80 6.33 -261.94 ...
In short, the OVER clause allows for more concise and faster-running queries.
When you’re done, run the following code for cleanup:
IF OBJECT_ID('dbo.MyOrderValues', 'U') IS NOT NULL DROP TABLE dbo.MyOrderValues;