Aggregating and Pivoting Data in Microsoft SQL Server 2008 T-SQL
- 3/25/2009
- OVER Clause
- Tiebreakers
- Running Aggregations
- Pivoting
- Unpivoting
- Custom Aggregations
- Histograms
- Grouping Factor
- Grouping Sets
- Conclusion
Running Aggregations
Running aggregations are aggregations of data over a sequence (typically temporal). Running aggregate problems have many variations, and I’ll describe several important ones here.
In my examples, I’ll use a summary table called EmpOrders that contains one row for each employee and month, with the total quantity of orders made by that employee in that month. Run the following code to create the EmpOrders table and populate it with sample data:
USE tempdb; IF OBJECT_ID('dbo.EmpOrders') IS NOT NULL DROP TABLE dbo.EmpOrders; CREATE TABLE dbo.EmpOrders ( empid INT NOT NULL, ordmonth DATE NOT NULL, qty INT NOT NULL, PRIMARY KEY(empid, ordmonth) ); GO INSERT INTO dbo.EmpOrders(empid, ordmonth, qty) SELECT O.empid, DATEADD(month, DATEDIFF(month, 0, O.orderdate), 0) AS ordmonth, SUM(qty) AS qty FROM InsideTSQL2008.Sales.Orders AS O JOIN InsideTSQL2008.Sales.OrderDetails AS OD ON O.orderid = OD.orderid GROUP BY empid, DATEADD(month, DATEDIFF(month, 0, O.orderdate), 0);
Run the following query to get the contents of the EmpOrders table:
SELECT empid, CONVERT(VARCHAR(7), ordmonth, 121) AS ordmonth, qty FROM dbo.EmpOrders ORDER BY empid, ordmonth;
This query generates the following output, shown here in abbreviated form:
empid ordmonth qty ----------- -------- ----------- 1 2006-07 121 1 2006-08 247 1 2006-09 255 1 2006-10 143 1 2006-11 318 1 2006-12 536 1 2007-01 304 1 2007-02 168 1 2007-03 275 1 2007-04 20 ... 2 2006-07 50 2 2006-08 94 2 2006-09 137 2 2006-10 248 2 2006-11 237 2 2006-12 319 2 2007-01 230 2 2007-02 36 2 2007-03 151 2 2007-04 468 ...
I’ll discuss three types of running aggregation problems: cumulative, sliding, and year-to-date (YTD).
Cumulative Aggregations
Cumulative aggregations accumulate data from the first element within the sequence up to the current point. For example, imagine the following request: for each employee and month, return the total quantity and average monthly quantity from the beginning of the employee’s activity through the month in question.
Recall the techniques for calculating row numbers without using the built-in ROW_NUMBER function; using these techniques, you scan the same rows we need here to calculate the total quantities. The difference is that for row numbers you used the aggregate COUNT, and here you’ll use the aggregates SUM and AVG. I demonstrated two set-based solutions to calculate row numbers without the ROW_NUMBER function—one using subqueries and one using joins. In the solution using joins, I applied what I called an expand-collapse technique. To me, the subquery solution is much more intuitive than the join solution, with its artificial expand-collapse technique. So, when there’s no performance difference, I’d rather use subqueries. Typically, you won’t see a performance difference when only one aggregate is involved because the plans would be similar. However, when you request multiple aggregates, the subquery solution might result in a plan that scans the data separately for each aggregate. Compare this to the plan for the join solution, which typically calculates all aggregates during a single scan of the source data.
So my choice is usually simple—use a subquery for one aggregate and use a join for multiple aggregates. The following query applies the expand-collapse approach to produce the desired result:
SELECT O1.empid, CONVERT(VARCHAR(7), O1.ordmonth, 121) AS ordmonth, O1.qty AS qtythismonth, SUM(O2.qty) AS totalqty, CAST(AVG(1.*O2.qty) AS NUMERIC(12, 2)) AS avgqty FROM dbo.EmpOrders AS O1 JOIN dbo.EmpOrders AS O2 ON O2.empid = O1.empid AND O2.ordmonth <= O1.ordmonth GROUP BY O1.empid, O1.ordmonth, O1.qty ORDER BY O1.empid, O1.ordmonth;
This query generates the following output, shown here in abbreviated form:
empid ordmonth qtythismonth totalqty avgqty ----------- -------- ------------ ----------- ---------- 1 2006-07 121 121 121.00 1 2006-08 247 368 184.00 1 2006-09 255 623 207.67 1 2006-10 143 766 191.50 1 2006-11 318 1084 216.80 1 2006-12 536 1620 270.00 1 2007-01 304 1924 274.86 1 2007-02 168 2092 261.50 1 2007-03 275 2367 263.00 1 2007-04 20 2387 238.70 ... 2 2006-07 50 50 50.00 2 2006-08 94 144 72.00 2 2006-09 137 281 93.67 2 2006-10 248 529 132.25 2 2006-11 237 766 153.20 2 2006-12 319 1085 180.83 2 2007-01 230 1315 187.86 2 2007-02 36 1351 168.88 2 2007-03 151 1502 166.89 2 2007-04 468 1970 197.00 ...
Now let’s say that you are asked to return only one aggregate (say, total quantity). You can safely use the subquery approach:
SELECT O1.empid, CONVERT(VARCHAR(7), O1.ordmonth, 121) AS ordmonth, O1.qty AS qtythismonth, (SELECT SUM(O2.qty) FROM dbo.EmpOrders AS O2 WHERE O2.empid = O1.empid AND O2.ordmonth <= O1.ordmonth) AS totalqty FROM dbo.EmpOrders AS O1 GROUP BY O1.empid, O1.ordmonth, O1.qty;
As was the case for calculating row numbers based on subqueries or joins, when calculating running aggregates based on similar techniques, the N2 performance issues I discussed before apply once again. Because running aggregates typically are calculated on a fairly small number of rows per group, you won’t be adversely affected by performance issues, assuming you have appropriate indexes (keyed on grouping columns, then sort columns, and including covering columns).
Let p be the number of partitions involved (employees in our case), let n be the average number of rows per partition (months in our case), and let a be the number of aggregates involved. The total number of rows scanned using the join approach can be expressed as pn + p(n+n2)/2 and as pn + ap(n+n2)/2 using the subquery approach because with subqueries the optimizer uses a separate scan per subquery. It’s important to note that the N2 complexity is relevant to the partition size and not the table size. If the number of rows in the table grows by a factor of f but the partition size doesn’t change, the run time increases by a factor of f as well. If, on the other hand, the average partition size grows by a factor of f, the run time increases by a factor of f2. With small partitions (say, up to several dozen rows), this set-based solution provides reasonable performance. With large partitions, a cursor solution would be faster despite the overhead associated with row-by-row manipulation because a cursor scans the rows only once, and the per-row overhead is constant.
You might also be requested to filter the data—for example, return monthly aggregates for each employee only for months before the employee reached a certain target. Typically, you’ll have a target for each employee stored in a Targets table that you’ll need to join to. To make this example simple, I’ll assume that all employees have the same target total quantity—1,000. In practice, you’ll use the target attribute from the Targets table. Because you need to filter an aggregate, not an attribute, you must specify the filter expression (in this case, SUM(O2.qty) < 1000) in the HAVING clause, not the WHERE clause. The solution is as follows:
SELECT O1.empid, CONVERT(VARCHAR(7), O1.ordmonth, 121) AS ordmonth, O1.qty AS qtythismonth, SUM(O2.qty) AS totalqty, CAST(AVG(1.*O2.qty) AS NUMERIC(12, 2)) AS avgqty FROM dbo.EmpOrders AS O1 JOIN dbo.EmpOrders AS O2 ON O2.empid = O1.empid AND O2.ordmonth <= O1.ordmonth GROUP BY O1.empid, O1.ordmonth, O1.qty HAVING SUM(O2.qty) < 1000 ORDER BY O1.empid, O1.ordmonth;
This query generates the following output, shown here in abbreviated form:
empid ordmonth qtythismonth totalqty avgqty ----------- -------- ------------ ----------- ---------- 1 2006-07 121 121 121.00 1 2006-08 247 368 184.00 1 2006-09 255 623 207.67 1 2006-10 143 766 191.50 2 2006-07 50 50 50.00 2 2006-08 94 144 72.00 2 2006-09 137 281 93.67 2 2006-10 248 529 132.25 2 2006-11 237 766 153.20 3 2006-07 182 182 182.00 3 2006-08 228 410 205.00 3 2006-09 75 485 161.67 3 2006-10 151 636 159.00 3 2006-11 204 840 168.00 3 2006-12 100 940 156.67 ...
Things get a bit tricky if you also need to include the rows for those months in which the employees reached their target. If you specify SUM(O2.qty) <= 1000 (that is, write <= instead of <), you still won’t get the row in which the employee reached the target unless the total through that month is exactly 1,000. But remember that you have access to both the cumulative total and the current month’s quantity, and using these two values together, you can solve this problem. If you change the HAVING filter to SUM(O2.qty) – O1.qty < 1000, you get the months in which the employee’s total quantity, excluding the current month’s orders, had not reached the target. In particular, the first month in which an employee reached or exceeded the target satisfies this new criterion, and that month will appear in the results. The complete solution follows:
SELECT O1.empid, CONVERT(VARCHAR(7), O1.ordmonth, 121) AS ordmonth, O1.qty AS qtythismonth, SUM(O2.qty) AS totalqty, CAST(AVG(1.*O2.qty) AS NUMERIC(12, 2)) AS avgqty FROM dbo.EmpOrders AS O1 JOIN dbo.EmpOrders AS O2 ON O2.empid = O1.empid AND O2.ordmonth <= O1.ordmonth GROUP BY O1.empid, O1.ordmonth, O1.qty HAVING SUM(O2.qty) - O1.qty < 1000 ORDER BY O1.empid, O1.ordmonth;
This query generates the following output, shown here in abbreviated form:
empid ordmonth qtythismonth totalqty avgqty ----------- -------- ------------ ----------- ---------- 1 2006-07 121 121 121.00 1 2006-08 247 368 184.00 1 2006-09 255 623 207.67 1 2006-10 143 766 191.50 1 2006-11 318 1084 216.80 2 2006-07 50 50 50.00 2 2006-08 94 144 72.00 2 2006-09 137 281 93.67 2 2006-10 248 529 132.25 2 2006-11 237 766 153.20 2 2006-12 319 1085 180.83 3 2006-07 182 182 182.00 3 2006-08 228 410 205.00 3 2006-09 75 485 161.67 3 2006-10 151 636 159.00 3 2006-11 204 840 168.00 3 2006-12 100 940 156.67 3 2007-01 364 1304 186.29 ...
Suppose you’re interested in seeing results only for the specific month in which the employee reached the target of 1,000, without seeing results for preceding months. What’s true for only those rows in the output of the last query? You’re looking for rows where the total quantity is greater than or equal to 1,000. Simply add this criterion to the HAVING filter. Here’s the query followed by its output:
SELECT O1.empid, CONVERT(VARCHAR(7), O1.ordmonth, 121) AS ordmonth, O1.qty AS qtythismonth, SUM(O2.qty) AS totalqty, CAST(AVG(1.*O2.qty) AS NUMERIC(12, 2)) AS avgqty FROM dbo.EmpOrders AS O1 JOIN dbo.EmpOrders AS O2 ON O2.empid = O1.empid AND O2.ordmonth <= O1.ordmonth GROUP BY O1.empid, O1.ordmonth, O1.qty HAVING SUM(O2.qty) - O1.qty < 1000 AND SUM(O2.qty) >= 1000 ORDER BY O1.empid, O1.ordmonth; empid ordmonth qtythismonth totalqty avgqty ----------- -------- ------------ ----------- ---------- 1 2006-11 318 1084 216.80 2 2006-12 319 1085 180.83 3 2007-01 364 1304 186.29 4 2006-10 613 1439 359.75 5 2007-05 247 1213 173.29 6 2007-01 64 1027 171.17 7 2007-03 191 1069 152.71 8 2007-01 305 1228 175.43 9 2007-06 161 1007 125.88
Sliding Aggregations
Sliding aggregates are calculated over a sliding window in a sequence (again, typically temporal), as opposed to being calculated from the beginning of the sequence until the current point. A moving average—such as the employee’s average quantity over the last three months—is one example of a sliding aggregate.
The main difference between the solution for cumulative aggregates and the solution for sliding aggregates is in the join condition (or in the subquery’s filter in the case of the alternate solution using subqueries). Instead of using O2.ordmonth <= O1.current_month, you use O2.ordmonth > three_months_before_current AND O2.ordmonth <= O1.current_month. In T-SQL, this translates to the following query:
SELECT O1.empid, CONVERT(VARCHAR(7), O1.ordmonth, 121) AS tomonth, O1.qty AS qtythismonth, SUM(O2.qty) AS totalqty, CAST(AVG(1.*O2.qty) AS NUMERIC(12, 2)) AS avgqty FROM dbo.EmpOrders AS O1 JOIN dbo.EmpOrders AS O2 ON O2.empid = O1.empid AND (O2.ordmonth > DATEADD(month, -3, O1.ordmonth) AND O2.ordmonth <= O1.ordmonth) GROUP BY O1.empid, O1.ordmonth, O1.qty ORDER BY O1.empid, O1.ordmonth;
This query generates the following output, shown here in abbreviated form:
empid tomonth qtythismonth totalqty avgqty ----------- ------- ------------ ----------- ---------- 1 2006-07 121 121 121.00 1 2006-08 247 368 184.00 1 2006-09 255 623 207.67 1 2006-10 143 645 215.00 1 2006-11 318 716 238.67 1 2006-12 536 997 332.33 1 2007-01 304 1158 386.00 1 2007-02 168 1008 336.00 1 2007-03 275 747 249.00 1 2007-04 20 463 154.33 ... 2 2006-07 50 50 50.00 2 2006-08 94 144 72.00 2 2006-09 137 281 93.67 2 2006-10 248 479 159.67 2 2006-11 237 622 207.33 2 2006-12 319 804 268.00 2 2007-01 230 786 262.00 2 2007-02 36 585 195.00 2 2007-03 151 417 139.00 2 2007-04 468 655 218.33 ...
Note that this solution includes aggregates for three-month periods that don’t include three months of actual data. If you want to return only periods with three full months accumulated, without the first two periods that do not cover three months, you can add the criterion MIN(O2.ordmonth) = DATEADD(month, –2, O1.ordmonth) to the HAVING filter.
Year-to-Date (YTD)
YTD aggregates accumulate values from the beginning of a period based on some date and time unit (say, a year) until the current point. The calculation is very similar to the sliding aggregates solution. The only difference is the lower bound provided in the query’s filter, which is the calculation of the beginning of the year. For example, the following query returns YTD aggregates for each employee and month:
SELECT O1.empid, CONVERT(VARCHAR(7), O1.ordmonth, 121) AS ordmonth, O1.qty AS qtythismonth, SUM(O2.qty) AS totalqty, CAST(AVG(1.*O2.qty) AS NUMERIC(12, 2)) AS avgqty FROM dbo.EmpOrders AS O1 JOIN dbo.EmpOrders AS O2 ON O2.empid = O1.empid AND (O2.ordmonth >= CAST(CAST(YEAR(O1.ordmonth) AS CHAR(4)) + '0101' AS DATETIME) AND O2.ordmonth <= O1.ordmonth) GROUP BY O1.empid, O1.ordmonth, O1.qty ORDER BY O1.empid, O1.ordmonth;
This query generates the following output, shown here in abbreviated form:
empid ordmonth qtythismonth totalqty avgqty ----------- -------- ------------ ----------- ---------- 1 2006-07 121 121 121.00 1 2006-08 247 368 184.00 1 2006-09 255 623 207.67 1 2006-10 143 766 191.50 1 2006-11 318 1084 216.80 1 2006-12 536 1620 270.00 1 2007-01 304 304 304.00 1 2007-02 168 472 236.00 1 2007-03 275 747 249.00 1 2007-04 20 767 191.75 ... 2 2006-07 50 50 50.00 2 2006-08 94 144 72.00 2 2006-09 137 281 93.67 2 2006-10 248 529 132.25 2 2006-11 237 766 153.20 2 2006-12 319 1085 180.83 2 2007-01 230 230 230.00 2 2007-02 36 266 133.00 2 2007-03 151 417 139.00 2 2007-04 468 885 221.25 ...