Aggregating and Pivoting Data in Microsoft SQL Server 2008 T-SQL
- 3/25/2009
Grouping Sets
A grouping set is simply a set of attributes that you group by, such as in a query that has the following GROUP BY clause:
GROUP BY custid, empid, YEAR(orderdate)
You define a single grouping set—(custid, empid, YEAR(orderdate)). Traditionally, aggregate queries define a single grouping set, as demonstrated in the previous example. SQL Server supports features that allow you to define multiple grouping sets in the same query and return a single result set with aggregates calculated for the different grouping sets. The ability to define multiple grouping sets in the same query was available prior to SQL Server 2008 in the form of options called WITH CUBE and WITH ROLLUP and a helper function called GROUPING. However, those options were neither standard nor flexible enough. SQL Server 2008 introduces several new features that allow you to define multiple grouping sets in the same query. The new features include the GROUPING SETS, CUBE, and ROLLUP subclauses of the GROUP BY clause (not to be confused with the older WITH CUBE and WITH ROLLUP options) and the helper function GROUPING_ID. These new features are ISO compliant and substantially more flexible than the older, nonstandard ones.
Before I provide the technicalities of the grouping sets–related features, I’d like to explain the motivation for using those and the kind of problems that they solve. If you’re interested only in the technicalities, feel free to skip this section.
Consider a data warehouse with a large volume of sales data. Users of this data warehouse frequently need to analyze aggregated views of the data by various dimensions, such as customer, employee, product, time, and so on. When a user such as a sales manager starts the analysis process, the user asks for some initial aggregated view of the data—for example, the total quantities for each customer and year. This request translates in more technical terms to a request to aggregate data for the grouping set (custid, YEAR(orderdate)). The user then analyzes the data, and based on the findings the user makes the next request—say, to return total quantities for each year and month. This is a request to aggregate data for a new grouping set—(YEAR(orderdate), MONTH(orderdate)). In this manner the user keeps asking for different aggregated views of the data—in other words, to aggregate data for different grouping sets.
To address such analysis needs of your system’s users, you could develop an application that generates a different GROUP BY query for each user request. Each query would need to scan all applicable base data and process the aggregates. With large volumes of data, this approach is very inefficient, and the response time will probably be unreasonable.
To provide fast response time, you need to preprocess aggregates for all grouping sets that users might ask for and store those in the data warehouse. For example, you could do this every night. When the user requests aggregates for a certain grouping set, the aggregates will be readily available. The problem is that given n dimensions, 2n possible grouping sets can be constructed from those dimensions. For example, with 10 dimensions you get 1,024 grouping sets. If you actually run a separate GROUP BY query for each, it will take a very long time to process all aggregates, and you might not have a sufficient processing window for this.
This is where the new grouping features come into the picture. They allow you to calculate aggregates for multiple grouping sets without rescanning the base data separately for each. Instead, SQL Server scans the data the minimum number of times that the optimizer figures is optimal, calculates the base aggregates, and on top of the base aggregates calculates the super aggregates (aggregates of aggregates).
Note that the product Microsoft SQL Server Analysis Services (SSAS, or just AS) specializes in preprocessing aggregates for multiple grouping sets and storing them in a specialized multidimensional database. It provides very fast response time to user requests, which are made with a language called Multidimensional Expressions (MDX). The recommended approach to handling needs for dynamic analysis of aggregated data is to implement an Analysis Services solution. However, some organizations don’t need the scale and sophistication levels provided by Analysis Services and would rather get the most they can from their relational data warehouse with T-SQL. For those organizations, the new grouping features provided by SQL Server can come in very handy.
The following sections describe the technicalities of the grouping sets–related features supported by SQL Server 2008.
Sample Data
In my examples I will use the Orders table that you create and populate in tempdb by running the code provided earlier in Example 8-1. This code is provided here again for your convenience:
SET NOCOUNT ON; USE tempdb; IF OBJECT_ID('dbo.Orders', 'U') IS NOT NULL DROP TABLE dbo.Orders; GO CREATE TABLE dbo.Orders ( orderid INT NOT NULL, orderdate DATETIME NOT NULL, empid INT NOT NULL, custid VARCHAR(5) NOT NULL, qty INT NOT NULL, CONSTRAINT PK_Orders PRIMARY KEY(orderid) ); GO INSERT INTO dbo.Orders (orderid, orderdate, empid, custid, qty) VALUES (30001, '20060802', 3, 'A', 10), (10001, '20061224', 1, 'A', 12), (10005, '20061224', 1, 'B', 20), (40001, '20070109', 4, 'A', 40), (10006, '20070118', 1, 'C', 14), (20001, '20070212', 2, 'B', 12), (40005, '20080212', 4, 'A', 10), (20002, '20080216', 2, 'C', 20), (30003, '20080418', 3, 'B', 15), (30004, '20060418', 3, 'C', 22), (30007, '20060907', 3, 'D', 30);
The GROUPING SETS Subclause
SQL Server 2008 allows you to define multiple grouping sets in the same query by using the new GROUPING SETS subclause of the GROUP BY clause. Within the outermost pair of parentheses, you specify a list of grouping sets separated by commas. Each grouping set is expressed by a pair of parentheses containing the set’s elements separated by commas. For example, the following query defines four grouping sets:
SELECT custid, empid, YEAR(orderdate) AS orderyear, SUM(qty) AS qty FROM dbo.Orders GROUP BY GROUPING SETS ( ( custid, empid, YEAR(orderdate) ), ( custid, YEAR(orderdate) ), ( empid, YEAR(orderdate) ), () );
The first grouping set is (custid, empid, YEAR(orderdate)), the second is (custid, YEAR(orderdate)), the third is (empid, YEAR(orderdate)), and the fourth is the empty grouping set (), which is used to calculate grand totals. This query generates the following output:
custid empid orderyear qty ------ ----------- ----------- ----------- A 1 2006 12 B 1 2006 20 NULL 1 2006 32 C 1 2007 14 NULL 1 2007 14 B 2 2007 12 NULL 2 2007 12 C 2 2008 20 NULL 2 2008 20 A 3 2006 10 C 3 2006 22 D 3 2006 30 NULL 3 2006 62 B 3 2008 15 NULL 3 2008 15 A 4 2007 40 NULL 4 2007 40 A 4 2008 10 NULL 4 2008 10 NULL NULL NULL 205 A NULL 2006 22 B NULL 2006 20 C NULL 2006 22 D NULL 2006 30 A NULL 2007 40 B NULL 2007 12 C NULL 2007 14 A NULL 2008 10 B NULL 2008 15 C NULL 2008 20
As you can see in the output of the query, NULLs are used as placeholders in inapplicable attributes. You could also think of these NULLs as indicating that the row represents an aggregate over all values of that column. This way, SQL Server can combine rows associated with different grouping sets to one result set. So, for example, in rows associated with the grouping set (custid, YEAR(orderdate)), the empid column is NULL. In rows associated with the empty grouping set, the columns empid, custid, and orderyear are NULLs and so on.
Compared to a query that unifies the result sets of four GROUP BY queries, our query that uses the GROUPING SETS subclause requires much less code. It has a performance advantage as well. Examine the execution plan of this query shown in Figure 8-3.
Figure 8-3 Execution plan of query with GROUPING SETS subclause
Observe that even though the query defines four grouping sets, the execution plan shows only two scans of the data. In particular, observe that the first branch of the plan shows two Stream Aggregate operators. The Sort operator sorts the data by empid, YEAR(orderdate), custid. Based on this sorting, the first Stream Aggregate operator calculates the aggregates for the grouping set (custid, empid, YEAR(orderdate)); the second Stream Aggregate operates on the results of the first and calculates the aggregates for the grouping set (empid, YEAR(orderdate)) and the empty grouping set. The second branch of the plan sorts the data by YEAR(orderdate), custid to allow the Stream Aggregate operator that follows to calculate aggregates for the grouping set (custid, YEAR(orderdate)).
Following is a query that is logically equivalent to the previous one, except that this one actually invokes four GROUP BY queries—one for each grouping set—and unifies their result sets:
SELECT custid, empid, YEAR(orderdate) AS orderyear, SUM(qty) AS qty FROM dbo.Orders GROUP BY custid, empid, YEAR(orderdate) UNION ALL SELECT custid, NULL AS empid, YEAR(orderdate) AS orderyear, SUM(qty) AS qty FROM dbo.Orders GROUP BY custid, YEAR(orderdate) UNION ALL SELECT NULL AS custid, empid, YEAR(orderdate) AS orderyear, SUM(qty) AS qty FROM dbo.Orders GROUP BY empid, YEAR(orderdate) UNION ALL SELECT NULL AS custid, NULL AS empid, NULL AS orderyear, SUM(qty) AS qty FROM dbo.Orders;
The execution plan for this query is shown in Figure 8-4. You can see that the data is scanned four times.
Figure 8-4 Execution plan of code unifying four GROUP BY queries
SQL Server 2008 allows you to define up to 4,096 grouping sets in a single query.
The CUBE Subclause
SQL Server 2008 also introduces the CUBE subclause of the GROUP BY clause (not to be confused with the older WITH CUBE option). The CUBE subclause is merely an abbreviated way to express a large number of grouping sets without actually listing them in a GROUPING SETS subclause. CUBE accepts a list of elements as input and defines all possible grouping sets out of those, including the empty grouping set. In set theory, this is called the power set of a set. The power set of a set V is the set of all subsets of V. Given n elements, CUBE produces 2n grouping sets. For example, CUBE(a, b, c) is equivalent to GROUPING SETS( (a, b, c), (a, b), (a, c), (b, c), (a), (b), (c), () ).
The following query uses the CUBE option to define all four grouping sets that can be made of the elements custid and empid:
SELECT custid, empid, SUM(qty) AS qty FROM dbo.Orders GROUP BY CUBE(custid, empid);
This query generates the following output:
custid empid qty ------ ----------- ----------- A 1 12 B 1 20 C 1 14 NULL 1 46 B 2 12 C 2 20 NULL 2 32 A 3 10 B 3 15 C 3 22 D 3 30 NULL 3 77 A 4 50 NULL 4 50 NULL NULL 205 A NULL 72 B NULL 47 C NULL 56 D NULL 30
The following query using the GROUPING SETS subclause is equivalent to the previous query:
SELECT custid, empid, SUM(qty) AS qty FROM dbo.Orders GROUP BY GROUPING SETS ( ( custid, empid ), ( custid ), ( empid ), () );
Note that each of the elements in the list you provide to CUBE as input can be made of either a single attribute or multiple attributes. The previous CUBE expression used two single-attribute elements. To define a multi-attribute element, simply list the element’s attributes in parentheses. As an example, the expression CUBE( x, y, z ) has three single-attribute elements and defines eight grouping sets: (x, y, z), (x, y), (x, z), (y, z), (x), (y), (z), (). The expression CUBE( (x, y), z ) has one two-attribute element and one single-attribute element and defines four grouping sets: (x, y, z), (x, y), (z), ().
Prior to SQL Server 2008, you could achieve something similar to what the CUBE subclause gives you by using a WITH CUBE option that you specified after the GROUP BY clause, like so:
SELECT custid, empid, SUM(qty) AS qty FROM dbo.Orders GROUP BY custid, empid WITH CUBE;
This is an equivalent to our previous CUBE query, but it has two drawbacks. First, it’s not standard, while the new CUBE subclause is. Second, when you specify the WITH CUBE option, you cannot define additional grouping sets beyond the ones defined by CUBE, while you can with the new CUBE subclause.
The ROLLUP Subclause
The new ROLLUP subclause of the GROUP BY clause is similar to the CUBE subclause. It also allows defining multiple grouping sets in an abbreviated way. However, while CUBE defines all possible grouping sets that can be made of the input elements (the power set), ROLLUP defines only a subset of those. ROLLUP assumes a hierarchy between the input elements. For example, ROLLUP(a, b, c) assumes a hierarchy between the elements a, b, and c. When there is a hierarchy, not all possible grouping sets that can be made of the input elements make sense in terms of having business value. Consider, for example, the hierarchy country, region, city. You can see the business value in the grouping sets (country, region, city), (country, region), (country), and (). But as grouping sets, (city), (region), (region, city) and (country, city) have no business value. For example, the grouping set (city) has no business value because different cities can have the same name, and a business typically needs totals by city, not by city name. When the input elements represent a hierarchy, ROLLUP produces only the grouping sets that make business sense for the hierarchy. Given n elements, ROLLUP will produce n + 1 grouping sets.
The following query shows an example of using the ROLLUP subclause:
SELECT YEAR(orderdate) AS orderyear, MONTH(orderdate) AS ordermonth, DAY(orderdate) AS orderday, SUM(qty) AS qty FROM dbo.Orders GROUP BY ROLLUP(YEAR(orderdate), MONTH(orderdate), DAY(orderdate));
Out of the three input elements, ROLLUP defines four (3 + 1) grouping sets—(YEAR(orderdate), MONTH(orderdate), DAY(orderdate)), (YEAR(orderdate), MONTH(orderdate)), (YEAR(orderdate)), and (). This query generates the following output:
orderyear ordermonth orderday qty ----------- ----------- ----------- ----------- 2006 4 18 22 2006 4 NULL 22 2006 8 2 10 2006 8 NULL 10 2006 9 7 30 2006 9 NULL 30 2006 12 24 32 2006 12 NULL 32 2006 NULL NULL 94 2007 1 9 40 2007 1 18 14 2007 1 NULL 54 2007 2 12 12 2007 2 NULL 12 2007 NULL NULL 66 2008 2 12 10 2008 2 16 20 2008 2 NULL 30 2008 4 18 15 2008 4 NULL 15 2008 NULL NULL 45 NULL NULL NULL 205
This query is equivalent to the following query that uses the GROUPING SETS subclause to define the aforementioned grouping sets explicitly:
SELECT YEAR(orderdate) AS orderyear, MONTH(orderdate) AS ordermonth, DAY(orderdate) AS orderday, SUM(qty) AS qty FROM dbo.Orders GROUP BY GROUPING SETS ( ( YEAR(orderdate), MONTH(orderdate), DAY(orderdate) ), ( YEAR(orderdate), MONTH(orderdate) ), ( YEAR(orderdate) ), () );
Like with CUBE, each of the elements in the list you provide to ROLLUP as input can be made of either a single attribute or multiple attributes. As an example, the expression ROLLUP( x, y, z ) defines four grouping sets: (x, y, z), (x, y), (x), (). The expression ROLLUP( (x, y), z ) defines three grouping sets: (x, y, z), (x, y), ().
Similar to the WITH CUBE option that I described earlier, previous versions of SQL Server prior to SQL Server 2008 supported a WITH ROLLUP option. Following is a query that is equivalent to the previous ROLLUP query, except that it uses the older WITH ROLLUP option:
SELECT YEAR(orderdate) AS orderyear, MONTH(orderdate) AS ordermonth, DAY(orderdate) AS orderday, SUM(qty) AS qty FROM dbo.Orders GROUP BY YEAR(orderdate), MONTH(orderdate), DAY(orderdate) WITH ROLLUP;
Like the WITH CUBE option, the WITH ROLLUP option is nonstandard and doesn’t allow you to define further grouping sets in the same query.
Grouping Sets Algebra
One beautiful thing about the design of the grouping sets–related features implemented in SQL Server 2008 is that they support a whole algebra of operations that can help you define a large number of grouping sets using minimal coding. You have support for operations that you can think of as multiplication, division, and addition.
Multiplication
Multiplication means producing a Cartesian product of grouping sets. You perform multiplication by separating GROUPING SETS subclauses (or the abbreviated CUBE and ROLLUP subclauses) by commas. For example, if A represents a set of attributes a1, a2, . . ., an, and B represents a set of attributes b1, b2, . . ., bn, and so on, the product GROUPING SETS( (A), (B), (C) ), GROUPING SETS( (D), (E) ) is equal to GROUPING SETS ( (A, D), (A, E), (B, D), (B, E), (C, D), (C, E) ).
Consider the following query and try to figure out which grouping sets it defines:
SELECT custid, empid, YEAR(orderdate) AS orderyear, MONTH(orderdate) AS ordermonth, DAY(orderdate) AS orderday, SUM(qty) AS qty FROM dbo.Orders GROUP BY CUBE(custid, empid), ROLLUP(YEAR(orderdate), MONTH(orderdate), DAY(orderdate));
First, expand the CUBE and ROLLUP subclauses to the corresponding GROUPING SETS subclauses, and you get the following query:
SELECT custid, empid, YEAR(orderdate) AS orderyear, MONTH(orderdate) AS ordermonth, SUM(qty) AS qty FROM dbo.Orders GROUP BY GROUPING SETS ( ( custid, empid ), ( custid ), ( empid ), () ), GROUPING SETS ( ( YEAR(orderdate), MONTH(orderdate), DAY(orderdate) ), ( YEAR(orderdate), MONTH(orderdate) ), ( YEAR(orderdate) ), () );
Now apply the multiplication between the GROUPING SETS subclauses, and you get the following query:
SELECT custid, empid, YEAR(orderdate) AS orderyear, MONTH(orderdate) AS ordermonth, SUM(qty) AS qty FROM dbo.Orders GROUP BY GROUPING SETS ( ( custid, empid, YEAR(orderdate), MONTH(orderdate), DAY(orderdate) ), ( custid, empid, YEAR(orderdate), MONTH(orderdate) ), ( custid, empid, YEAR(orderdate) ), ( custid, empid ), ( custid, YEAR(orderdate), MONTH(orderdate), DAY(orderdate) ), ( custid, YEAR(orderdate), MONTH(orderdate) ), ( custid, YEAR(orderdate) ), ( custid ), ( empid, YEAR(orderdate), MONTH(orderdate), DAY(orderdate) ), ( empid, YEAR(orderdate), MONTH(orderdate) ), ( empid, YEAR(orderdate) ), ( empid ), ( YEAR(orderdate), MONTH(orderdate), DAY(orderdate) ), ( YEAR(orderdate), MONTH(orderdate) ), ( YEAR(orderdate) ), () );
Division
When multiple grouping sets in an existing GROUPING SETS subclause share common elements, you can separate the common elements to another GROUPING SETS subclause and multiply the two. The concept is similar to arithmetic division, where you divide operands of an expression by a common element and pull it outside the parentheses. For example, (5×3 + 5×7) can be expressed as (5)×(3 + 7). Based on this logic, you can sometimes reduce the amount of code needed to define multiple grouping sets. For example, see if you can reduce the code in the following query while preserving the same grouping sets:
SELECT custid, empid, YEAR(orderdate) AS orderyear, MONTH(orderdate) AS ordermonth, SUM(qty) AS qty FROM dbo.Orders GROUP BY GROUPING SETS ( ( custid, empid, YEAR(orderdate), MONTH(orderdate) ), ( custid, empid, YEAR(orderdate) ), ( custid, YEAR(orderdate), MONTH(orderdate) ), ( custid, YEAR(orderdate) ), ( empid, YEAR(orderdate), MONTH(orderdate) ), ( empid, YEAR(orderdate) ) );
Because YEAR(orderdate) is a common element to all grouping sets, you can move it to another GROUPING SETS subclause and multiply the two, like so:
SELECT custid, empid, YEAR(orderdate) AS orderyear, MONTH(orderdate) AS ordermonth, SUM(qty) AS qty FROM dbo.Orders GROUP BY GROUPING SETS ( ( YEAR(orderdate) ) ), GROUPING SETS ( ( custid, empid, MONTH(orderdate) ), ( custid, empid ), ( custid, MONTH(orderdate) ), ( custid ), ( empid, MONTH(orderdate) ), ( empid ) );
Note that when a GROUPING SETS subclause contains only one grouping set, it is equivalent to listing the grouping set’s elements directly in the GROUP BY clause. Hence, the previous query is logically equivalent to the following:
SELECT custid, empid, YEAR(orderdate) AS orderyear, MONTH(orderdate) AS ordermonth, SUM(qty) AS qty FROM dbo.Orders GROUP BY YEAR(orderdate), GROUPING SETS ( ( custid, empid, MONTH(orderdate) ), ( custid, empid ), ( custid, MONTH(orderdate) ), ( custid ), ( empid, MONTH(orderdate) ), ( empid ) );
You can reduce this form even further. Notice in the remaining GROUPING SETS subclause that three subsets of elements appear once with MONTH(orderdate) and once without. Hence, you can reduce this form to a multiplication between a GROUPING SETS subclause containing those three and another containing two grouping sets, (MONTH(orderdate)) and the empty grouping set, like so:
SELECT custid, empid, YEAR(orderdate) AS orderyear, MONTH(orderdate) AS ordermonth, SUM(qty) AS qty FROM dbo.Orders GROUP BY YEAR(orderdate), GROUPING SETS ( ( custid, empid ), ( custid ), ( empid ) ), GROUPING SETS ( ( MONTH(orderdate) ), () );
Addition
Recall that when you separate GROUPING SETS, CUBE, and ROLLUP subclauses by commas, you get a Cartesian product between the sets of grouping sets that each represents. But what if you have an existing GROUPING SETS subclause and you just want to add—not multiply—the grouping sets that are defined by a CUBE or ROLLUP subclause? This can be achieved by specifying the CUBE or ROLLUP subclause (or multiple ones) within the parentheses of the GROUPING SETS subclause.
For example, the following query demonstrates adding the grouping sets defined by a ROLLUP subclause to the grouping sets defined by the hosting GROUPING SETS subclause:
SELECT custid, empid, YEAR(orderdate) AS orderyear, MONTH(orderdate) AS ordermonth, SUM(qty) AS qty FROM dbo.Orders GROUP BY GROUPING SETS ( ( custid, empid ), ( custid ), ( empid ), ROLLUP(YEAR(orderdate), MONTH(orderdate), DAY(orderdate)) );
This query is a logical equivalent of the following query:
SELECT custid, empid, YEAR(orderdate) AS orderyear, MONTH(orderdate) AS ordermonth, SUM(qty) AS qty FROM dbo.Orders GROUP BY GROUPING SETS ( ( custid, empid ), ( custid ), ( empid ), ( YEAR(orderdate), MONTH(orderdate), DAY(orderdate) ), ( YEAR(orderdate), MONTH(orderdate) ), ( YEAR(orderdate) ), () );
Unfortunately, there is no built-in option to do subtraction. For example, you can’t somehow express the idea of CUBE( a, b, c, d ) minus GROUPING SETS ( (a, c), (b, d), () ). Of course, you can achieve this with the EXCEPT set operation and other techniques but not as a direct algebraic operation on grouping sets–related subclauses.
The GROUPING_ID Function
In your applications you may need to be able to identify the grouping set with which each result row of your query is associated. Relying on the NULL placeholders may lead to convoluted code, not to mention the fact that if a column is defined in the table as allowing NULLs, a NULL in the result will be ambiguous. SQL Server 2008 introduces a very convenient tool for this purpose in the form of a function called GROUPING_ID. This function accepts a list of attributes as input and constructs an integer bitmap where each bit represents the corresponding attribute (the rightmost bit represents the rightmost input attribute). The bit is 0 when the corresponding attribute is a member of the grouping set and 1 otherwise.
You provide the function with all attributes that participate in any grouping set as input, and you will get a unique integer representing each grouping set. So, for example, the expression GROUPING_ID( a, b, c, d ) would return 0 ( 0×8 + 0×4 + 0×2 + 0×1 ) for rows associated with the grouping set ( a, b, c, d ), 1 ( 0×8 + 0×4 + 0×2 + 1×1 ) for the grouping set ( a, b, c ), 2 ( 0×8 + 0×4 + 1×2 + 0×1 ) for the grouping set ( a, b, d ), 3 ( 0×8 + 0×4 + 1×2 + 1×1 ) for the grouping set ( a, b ), and so on.
The following query demonstrate the use of the GROUPING_ID function:
SELECT GROUPING_ID( custid, empid, YEAR(orderdate), MONTH(orderdate), DAY(orderdate) ) AS grp_id, custid, empid, YEAR(orderdate) AS orderyear, MONTH(orderdate) AS ordermonth, DAY(orderdate) AS orderday, SUM(qty) AS qty FROM dbo.Orders GROUP BY CUBE(custid, empid), ROLLUP(YEAR(orderdate), MONTH(orderdate), DAY(orderdate));
This query generates the following output:
grp_id custid empid orderyear ordermonth orderday qty ------- ------ ------ ---------- ----------- --------- ---- 0 C 3 2006 4 18 22 16 NULL 3 2006 4 18 22 0 A 3 2006 8 2 10 24 NULL NULL 2006 4 18 22 25 NULL NULL 2006 4 NULL 22 16 NULL 3 2006 8 2 10 24 NULL NULL 2006 8 2 10 25 NULL NULL 2006 8 NULL 10 0 D 3 2006 9 7 30 16 NULL 3 2006 9 7 30 ...
For example, the grp_id value 25 represents the grouping set ( YEAR(orderdate), MONTH(orderdate) ). These attributes are represented by the second (value 2) and third (value 4) bits. However, remember that the bits representing members that participate in the grouping set are turned off. The bits representing the members that do not participate in the grouping set are turned on. In our case, those are the first (1), fourth (8), and fifth (16) bits representing the attributes DAY(orderdate), empid and custid, respectively. The sum of the values of the bits that are turned on is 1 + 8 + 16 = 25.
The following query helps you see which bits are turned on or off in each integer bitmap generated by the GROUPING_ID function with five input elements:
SELECT GROUPING_ID(e, d, c, b, a) as n, COALESCE(e, 1) as [16], COALESCE(d, 1) as [8], COALESCE(c, 1) as [4], COALESCE(b, 1) as [2], COALESCE(a, 1) as [1] FROM (VALUES(0, 0, 0, 0, 0)) AS D(a, b, c, d, e) GROUP BY CUBE (a, b, c, d, e) ORDER BY n;
This query generates the following output:
n 16 8 4 2 1 ----------- ----------- ----------- ----------- ----------- ----------- 0 0 0 0 0 0 1 0 0 0 0 1 2 0 0 0 1 0 3 0 0 0 1 1 4 0 0 1 0 0 5 0 0 1 0 1 6 0 0 1 1 0 7 0 0 1 1 1 8 0 1 0 0 0 9 0 1 0 0 1 10 0 1 0 1 0 11 0 1 0 1 1 12 0 1 1 0 0 13 0 1 1 0 1 14 0 1 1 1 0 15 0 1 1 1 1 16 1 0 0 0 0 17 1 0 0 0 1 18 1 0 0 1 0 19 1 0 0 1 1 20 1 0 1 0 0 21 1 0 1 0 1 22 1 0 1 1 0 23 1 0 1 1 1 24 1 1 0 0 0 25 1 1 0 0 1 26 1 1 0 1 0 27 1 1 0 1 1 28 1 1 1 0 0 29 1 1 1 0 1 30 1 1 1 1 0 31 1 1 1 1 1
Remember—when the bit is off, the corresponding member is part of the grouping set.
As mentioned, the GROUPING_ID function was introduced in SQL Server 2008. You could produce a similar integer bitmap prior to SQL Server 2008, but it involved more work. You could use a function called GROUPING that accepts a single attribute as input and returns 0 if the attribute is a member of the grouping set and 1 otherwise. You could construct the integer bitmap by multiplying the GROUPING value of each attribute by a different power of 2 and summing all values. Here’s an example of implementing this logic in a query that uses the older WITH CUBE option:
SELECT GROUPING(custid) * 4 + GROUPING(empid) * 2 + GROUPING(YEAR(orderdate)) * 1 AS grp_id, custid, empid, YEAR(orderdate) AS orderyear, SUM(qty) AS totalqty FROM dbo.Orders GROUP BY custid, empid, YEAR(orderdate) WITH CUBE;
This query generates the following output:
grp_id custid empid orderyear totalqty ----------- ------ ----------- ----------- ----------- 0 A 1 2006 12 0 B 1 2006 20 4 NULL 1 2006 32 0 A 3 2006 10 0 C 3 2006 22 0 D 3 2006 30 4 NULL 3 2006 62 6 NULL NULL 2006 94 0 C 1 2007 14 4 NULL 1 2007 14 ...
Materialize Grouping Sets
Recall that before I started describing the technicalities of the grouping sets–related features, I explained that one of their uses is to preprocess aggregates for multiple grouping sets and store those in the data warehouse for fast retrieval. The following code demonstrates materializing aggregates for multiple grouping sets, including an integer identifier of the grouping set calculated with the GROUPING_ID function in a table called MyGroupingSets:
USE tempdb; IF OBJECT_ID('dbo.MyGroupingSets', 'U') IS NOT NULL DROP TABLE dbo.MyGroupingSets; GO SELECT GROUPING_ID( custid, empid, YEAR(orderdate), MONTH(orderdate), DAY(orderdate) ) AS grp_id, custid, empid, YEAR(orderdate) AS orderyear, MONTH(orderdate) AS ordermonth, DAY(orderdate) AS orderday, SUM(qty) AS qty INTO dbo.MyGroupingSets FROM dbo.Orders GROUP BY CUBE(custid, empid), ROLLUP(YEAR(orderdate), MONTH(orderdate), DAY(orderdate)); CREATE UNIQUE CLUSTERED INDEX idx_cl_grp_id_grp_attributes ON dbo.MyGroupingSets(grp_id, custid, empid, orderyear, ordermonth, orderday);
The index created on the table MyGroupingSets is defined on the grp_id column as the first key to allow efficient retrieval of all rows associated with a single grouping set. For example, consider the following query, which asks for all rows associated with the grouping set ( custid, YEAR(orderdate), MONTH(orderdate) ):
SELECT * FROM dbo.MyGroupingSets WHERE grp_id = 9;
This query generates the following output:
grp_id custid empid orderyear ordermonth orderday qty ----------- ------ ----------- ----------- ----------- ----------- ----------- 9 A NULL 2006 8 NULL 10 9 A NULL 2006 12 NULL 12 9 A NULL 2007 1 NULL 40 9 A NULL 2008 2 NULL 10 9 B NULL 2006 12 NULL 20 9 B NULL 2007 2 NULL 12 9 B NULL 2008 4 NULL 15 9 C NULL 2006 4 NULL 22 9 C NULL 2007 1 NULL 14 9 C NULL 2008 2 NULL 20 9 D NULL 2006 9 NULL 30
Figure 8-5 shows the plan for this query.
Figure 8-5 Execution plan of query that filters a single grouping set
This plan is very efficient. It scans only the rows that are associated with the requested grouping set because they reside in a consecutive section in the leaf of the clustered index.
Provided that you are using aggregates that are additive measures, like SUM, COUNT, and AVG, you can apply incremental updates to the stored aggregates with only the delta of additions since you last processed those aggregates. You can achieve this by using the new MERGE statement that was introduced in SQL Server 2008. Here I’m just going to show the code to demonstrate how this is done. For details about the MERGE statement, please refer to Chapter 10.
Run the following code to simulate another day’s worth of order activity (April 19, 2008):
INSERT INTO dbo.Orders (orderid, orderdate, empid, custid, qty) VALUES (50001, '20080419', 1, 'A', 10), (50002, '20080419', 1, 'B', 30), (50003, '20080419', 2, 'A', 20), (50004, '20080419', 2, 'B', 5), (50005, '20080419', 3, 'A', 15)
Then run the following code to incrementally update the stored aggregates with the new day’s worth of data:
WITH LastDay AS ( SELECT GROUPING_ID( custid, empid, YEAR(orderdate), MONTH(orderdate), DAY(orderdate) ) AS grp_id, custid, empid, YEAR(orderdate) AS orderyear, MONTH(orderdate) AS ordermonth, DAY(orderdate) AS orderday, SUM(qty) AS qty FROM dbo.Orders WHERE orderdate = '20080419' GROUP BY CUBE(custid, empid), ROLLUP(YEAR(orderdate), MONTH(orderdate), DAY(orderdate)) ) MERGE INTO dbo.MyGroupingSets AS TGT USING LastDay AS SRC ON (TGT.grp_id = SRC.grp_id) AND (TGT.orderyear = SRC.orderyear OR (TGT.orderyear IS NULL AND SRC.orderyear IS NULL)) AND (TGT.ordermonth = SRC.ordermonth OR (TGT.ordermonth IS NULL AND SRC.ordermonth IS NULL)) AND (TGT.orderday = SRC.orderday OR (TGT.orderday IS NULL AND SRC.orderday IS NULL)) AND (TGT.custid = SRC.custid OR (TGT.custid IS NULL AND SRC.custid IS NULL)) AND (TGT.empid = SRC.empid OR (TGT.empid IS NULL AND SRC.empid IS NULL)) WHEN MATCHED THEN UPDATE SET TGT.qty += SRC.qty WHEN NOT MATCHED THEN INSERT (grp_id, custid, empid, orderyear, ordermonth, orderday) VALUES (SRC.grp_id, SRC.custid, SRC.empid, SRC.orderyear, SRC.ordermonth, SRC.orderday);
The code in the CTE LastDay calculates aggregates for the same grouping sets as in the original query but filters only the last day’s worth of data. The MERGE statement then increments the quantities of groups that already exist in the target by adding the new quantities and inserts the groups that don’t exist in the target.
Sorting
Consider a request to calculate the total quantity aggregate for all grouping sets in the hierarchy order year > order month > order day. You can achieve this, of course, by simply using the ROLLUP subclause. However, a tricky part of the request is that you need to sort the rows in the output in a hierarchical manner, that is, days of a month, followed by the month total, months of a year followed by the yearly total, and finally the grand total. This can be achieved with the help of the GROUPING function as follows:
SELECT YEAR(orderdate) AS orderyear, MONTH(orderdate) AS ordermonth, DAY(orderdate) AS orderday, SUM(qty) AS totalqty FROM dbo.Orders GROUP BY ROLLUP(YEAR(orderdate), MONTH(orderdate), DAY(orderdate)) ORDER BY GROUPING(YEAR(orderdate)) , YEAR(orderdate), GROUPING(MONTH(orderdate)), MONTH(orderdate), GROUPING(DAY(orderdate)) , DAY(orderdate);
Remember that the GROUPING function returns 0 when the element is a member of a grouping set (representing detail) and 1 when the element isn’t (representing an aggregate). Because we want to present detail before aggregates, the GROUPING function is very convenient. We want to first see the detail of years and at the end the grand total. Within the detail of years, we want to sort by year. Within each year, we want to first see the detail of months and then the year total. Within the detail of months, we want to sort by month. Within the month we want to sort by the detail of days and then month total. Within the detail of days, we want to sort by day.
This query generates the following output:
orderyear ordermonth orderday totalqty ----------- ----------- ----------- ----------- 2006 4 18 22 2006 4 NULL 22 2006 8 2 10 2006 8 NULL 10 2006 9 7 30 2006 9 NULL 30 2006 12 24 32 2006 12 NULL 32 2006 NULL NULL 94 2007 1 9 40 2007 1 18 14 2007 1 NULL 54 2007 2 12 12 2007 2 NULL 12 2007 NULL NULL 66 2008 2 12 10 2008 2 16 20 2008 2 NULL 30 2008 4 18 15 2008 4 19 80 2008 4 NULL 95 2008 NULL NULL 125 NULL NULL NULL 285