Aggregating and Pivoting Data in Microsoft SQL Server 2008 T-SQL
- 3/25/2009
- OVER Clause
- Tiebreakers
- Running Aggregations
- Pivoting
- Unpivoting
- Custom Aggregations
- Histograms
- Grouping Factor
- Grouping Sets
- Conclusion
Custom Aggregations
Custom aggregations are aggregations that are not provided as built-in aggregate functions—for example, concatenating strings, calculating products, performing bitwise manipulations, calculating medians, and others. In this section, I’ll provide solutions to several custom aggregate requests. Some techniques that I’ll cover are generic, in the sense that you can use similar logic for other aggregate requests; other techniques are specific to one kind of aggregate request.
In my examples, I’ll use the generic Groups table, which you create and populate by running the following code:
USE tempdb; IF OBJECT_ID('dbo.Groups') IS NOT NULL DROP TABLE dbo.Groups; CREATE TABLE dbo.Groups ( groupid VARCHAR(10) NOT NULL, memberid INT NOT NULL, string VARCHAR(10) NOT NULL, val INT NOT NULL, PRIMARY KEY (groupid, memberid) ); GO INSERT INTO dbo.Groups(groupid, memberid, string, val) VALUES ('a', 3, 'stra1', 6), ('a', 9, 'stra2', 7), ('b', 2, 'strb1', 3), ('b', 4, 'strb2', 7), ('b', 5, 'strb3', 3), ('b', 9, 'strb4', 11), ('c', 3, 'strc1', 8), ('c', 7, 'strc2', 10), ('c', 9, 'strc3', 12); -- Show the contents of the table SELECT * FROM dbo.Groups;
This generates the following output:
groupid memberid string val ---------- ----------- ---------- ----------- a 3 stra1 6 a 9 stra2 7 b 2 strb1 3 b 4 strb2 7 b 5 strb3 3 b 9 strb4 11 c 3 strc1 8 c 7 strc2 10 c 9 strc3 12
The Groups table has a column representing the group (groupid), a column representing a unique identifier within the group (memberid), and some value columns (string and val) that need to be aggregated. I like to use such a generic form of data because it allows you to focus on the techniques and not on the data. Note that this is merely a generic form of a table containing data that you want to aggregate. For example, it could represent a Sales table where groupid stands for empid, val stands for qty, and so on.
Custom Aggregations Using Pivoting
One technique for solving custom aggregate problems is pivoting. You pivot the values that need to participate in the aggregate calculation; when they all appear in the same result row, you perform the calculation as a linear one across the columns. With a large number of elements you’ll end up with very lengthy query strings; therefore, this pivoting technique is limited to situations where each group has a small number of elements. Note that unless you have a sequencing column within the group, you need to calculate row numbers that will be used to identify the position of elements within the group. For example, if you need to concatenate all values from the string column per group, what do you specify as the pivoted attribute list (the spreading values)? The values in the memberid column are not known ahead of time, plus they differ in each group. Row numbers representing positions within the group solve this problem.
String Concatenation Using Pivoting
As the first example, the following query calculates an aggregate string concatenation over the column string for each group with a pivoting technique:
SELECT groupid, [1] + COALESCE(',' + [2], '') + COALESCE(',' + [3], '') + COALESCE(',' + [4], '') AS string FROM (SELECT groupid, string, ROW_NUMBER() OVER(PARTITION BY groupid ORDER BY memberid) AS rn FROM dbo.Groups AS A) AS D PIVOT(MAX(string) FOR rn IN([1],[2],[3],[4])) AS P;
This query generates the following output:
groupid string ---------- ------------------------- a stra1,stra2 b strb1,strb2,strb3,strb4 c strc1,strc2,strc3
The query that generates the derived table D calculates a row number within the group based on memberid order. The outer query pivots the values based on the row numbers, and it performs linear concatenation. I’m assuming here that each group has at most four rows, so I specified four row numbers. You need as many row numbers as the maximum number of elements you anticipate.
The COALESCE function is used to replace a NULL representing a nonexistent element with an empty string so as not to cause the result to become NULL. You don’t need the COALESCE function with the first element ([1]) because at least one element must exist in the group; otherwise, the group won’t appear in the table.
Aggregate Product Using Pivoting
In a similar manner, you can calculate the product of the values in the val column for each group:
SELECT groupid, [1] * COALESCE([2], 1) * COALESCE([3], 1) * COALESCE([4], 1) AS product FROM (SELECT groupid, val, ROW_NUMBER() OVER(PARTITION BY groupid ORDER BY memberid) AS rn FROM dbo.Groups AS A) AS D PIVOT(MAX(val) FOR rn IN([1],[2],[3],[4])) AS P;
This query generates the following output:
groupid product ---------- ----------- a 42 b 693 c 960
The need for an aggregate product is common in financial applications—for example, to calculate compound interest rates.
User Defined Aggregates (UDA)
SQL Server allows you to create your own user-defined aggregates (UDAs). You write UDAs in a .NET language of your choice (for example, C# or Visual Basic), and you use them in T-SQL. This book is dedicated to T-SQL and not to the common language runtime (CLR), so I won’t explain CLR UDAs at great length. Rather, I’ll provide you with a couple of examples with step-by-step instructions and, of course, the T-SQL interfaces involved. Examples are provided in both C# and Visual Basic.
Let’s start with a concrete implementation of two UDAs. The steps involved in creating a CLR-based UDA are as follows:
Define the UDA as a class in a .NET language.
Compile the class you defined to build a CLR assembly.
Register the assembly in SQL Server using the CREATE ASSEMBLY command in T-SQL.
Use the CREATE AGGREGATE command in T-SQL to create the UDA that references the registered assembly.
This section will provide examples for creating aggregate string concatenation and aggregate product functions in both C# and Visual Basic. You can find the code for the C# classes in Example 8-2 and the code for the Visual Basic classes in Example 8-3. You’ll be provided with the requirements for a CLR UDA alongside the development of a UDA.
Example 8-2. C# code for UDAs
using System; using System.Data; using Microsoft.SqlServer.Server; using System.Data.SqlTypes; using System.IO; using System.Text; using System.Runtime.InteropServices; [Serializable] [SqlUserDefinedAggregate( Format.UserDefined, // use user defined serialization IsInvariantToNulls = true, // NULLs don't matter IsInvariantToDuplicates = false, // duplicates matter IsInvariantToOrder = false, // order matters IsNullIfEmpty = false, // do not yield a NULL for a set of zero strings MaxByteSize = -1) // max size unlimited ] public struct StringConcat : IBinarySerialize { private StringBuilder sb; public void Init() { this.sb = new StringBuilder(); } //two arguments public void Accumulate(SqlString v, SqlString separator) { if (v.IsNull) { return; // ignore NULLs approach } this.sb.Append(v.Value).Append(separator.Value); } public void Merge(StringConcat other) { this.sb.Append(other.sb); } public SqlString Terminate() { string output = string.Empty; if (this.sb != null && this.sb.Length > 0) { // remove last separator output = this.sb.ToString(0, this.sb.Length - 1); } return new SqlString(output); } public void Read(BinaryReader r) { sb = new StringBuilder(r.ReadString()); } public void Write(BinaryWriter w) { w.Write(this.sb.ToString()); } } // end StringConcat [Serializable] [StructLayout(LayoutKind.Sequential)] [SqlUserDefinedAggregate( Format.Native, // use native serialization IsInvariantToNulls = true, // NULLs don't matter IsInvariantToDuplicates = false, // duplicates matter IsInvariantToOrder = false)] // order matters public class Product { private SqlInt64 si; public void Init() { si = 1; } public void Accumulate(SqlInt64 v) { if (v.IsNull || si.IsNull) // NULL input = NULL output approach { si = SqlInt64.Null; return; } if (v == 0 || si == 0) // to prevent an exception in next if { si = 0; return; } // stop before we reach max v if (Math.Abs(v.Value) <= SqlInt64.MaxValue / Math.Abs(si.Value)) { si = si * v; } else { si = 0; // if we reach too big v, return 0 } } public void Merge(Product Group) { Accumulate(Group.Terminate()); } public SqlInt64 Terminate() { return (si); } } // end Product
Example 8-3. Visual Basic code for UDAs
Imports System Imports System.Data Imports System.Data.SqlTypes Imports Microsoft.SqlServer.Server Imports System.Text Imports System.IO Imports System.Runtime.InteropServices <Serializable(), _ SqlUserDefinedAggregate( _ Format.UserDefined, _ IsInvariantToDuplicates:=False, _ IsInvariantToNulls:=True, _ IsInvariantToOrder:=False, _ IsNullIfEmpty:=False, _ MaxByteSize:=-1)> _ Public Structure StringConcat Implements IBinarySerialize Private sb As StringBuilder Public Sub Init() Me.sb = New StringBuilder() End Sub Public Sub Accumulate(ByVal v As SqlString, ByVal separator As SqlString) If v.IsNull Then Return End If Me.sb.Append(v.Value).Append(separator.Value) End Sub Public Sub Merge(ByVal other As StringConcat) Me.sb.Append(other.sb) End Sub Public Function Terminate() As SqlString Dim output As String = String.Empty If Not (Me.sb Is Nothing) AndAlso Me.sb.Length > 0 Then output = Me.sb.ToString(0, Me.sb.Length - 1) End If Return New SqlString(output) End Function Public Sub Read(ByVal r As BinaryReader) _ Implements IBinarySerialize.Read sb = New StringBuilder(r.ReadString()) End Sub Public Sub Write(ByVal w As BinaryWriter) _ Implements IBinarySerialize.Write w.Write(Me.sb.ToString()) End Sub End Structure <Serializable(), _ StructLayout(LayoutKind.Sequential), _ SqlUserDefinedAggregate( _ Format.Native, _ IsInvariantToOrder:=False, _ IsInvariantToNulls:=True, _ IsInvariantToDuplicates:=False)> _ Public Class Product Private si As SqlInt64 Public Sub Init() si = 1 End Sub Public Sub Accumulate(ByVal v As SqlInt64) If v.IsNull = True Or si.IsNull = True Then si = SqlInt64.Null Return End If If v = 0 Or si = 0 Then si = 0 Return End If If (Math.Abs(v.Value) <= SqlInt64.MaxValue / Math.Abs(si.Value)) _ Then si = si * v Else si = 0 End If End Sub Public Sub Merge(ByVal Group As Product) Accumulate(Group.Terminate()) End Sub Public Function Terminate() As SqlInt64 If si.IsNull = True Then Return SqlInt64.Null Else Return si End If End Function End Class
Use the following step-by-step instructions to create and deploy the assemblies in Visual Studio 2008.
Creating and Deploying an Assembly in Visual Studio 2008
In Visual Studio 2008, create a new C# or Visual Basic project based on your language preference. Use the Database folder and the SQL Server Project template.
In the New Project dialog box, specify the following information:
Name UDAs
Location C:\
Solution Name UDAs
When you’re done entering the information, confirm that it is correct.
At this point, you’ll be requested to specify a database reference. Create a new database reference to the tempdb database in the SQL Server instance you’re working with and choose it. The database reference you choose tells Visual Studio where to deploy the UDAs that you develop.
After confirming the choice of database reference, in the Solution Explorer window, right-click the UDAs project, select the menu items Add and Aggregate, and then choose the Aggregate template. If you’re using C#, rename the class Aggregate1.cs to UDAClasses.cs. If you’re using Visual Basic, rename Aggregate1.vb to UDAClasses.vb. Confirm.
Examine the code of the template. You’ll find that a UDA is implemented as a structure (struct in C#, Structure in Visual Basic). It can be implemented as a class as well. The first block of code in the template includes namespaces that are used in the assembly (lines of code starting with using in C# and with Imports in Visual Basic). Add three more statements to include the following namespaces: System.Text, System.IO, and System.Runtime.InteropServices. (You can copy those from Example 8-2 or Example 8-3.) You’ll use the StringBuilder class from the System.Text namespace, the BinaryReader and BinaryWriter classes from the System.IO namespace, and the StructLayout attribute from the System.Runtime.InteropServices namespace (in the second UDA).
Rename the default name of the UDA—which is currently the same name as the name of the class (UDAClasses)—to StringConcat.
You’ll find four methods that are already provided by the template. These are the methods that every UDA must implement. However, if you use the Class Library template for your project, you have to write them manually. Using the Aggregate template, all you have to do is fill them with your code. Following is a description of the four methods:
Init. This method is used to initialize the computation. It is invoked once for each group that the query processor is aggregating.
Accumulate. The name of the method gives you a hint at its purpose—accumulating the aggregate values, of course. This method is invoked once for each value (that is, for every single row) in the group that is being aggregated. It uses input parameters, and the parameters have to be of the data types corresponding to the native SQL Server data types of the columns you are going to aggregate. The data type of the input can also be a CLR UDT. In SQL Server 2005, UDAs supported no more than one input parameter. In SQL Server 2008, UDAs support multiple input parameters.
Merge. Notice that this method uses an input parameter with the type that is the aggregate class. The method is used to merge multiple partial computations of an aggregation.
Terminate. This method finishes the aggregation and returns the result.
Add an internal (private) variable—sb—to the class just before the Init method. You can do so by simply copying the code that declares it from Example 8-2 or Example 8-3, depending on your choice of language. The variable sb is of type StringBuilder and will hold the intermediate aggregate value.
Override the current code for the four methods with the code implementing them from Example 8-2 or Example 8-3. Keep in mind the following points for each method:
In the Init method, you initialize sb with an empty string.
The Accumulate method accepts two input parameters (new in SQL Server 2008)—v and separator. The parameter v represents the value to be concatenated, and the parameter separator is obviously the separator. If v is NULL, it is simply ignored, similar to the way built-in aggregates handle NULLs. If v is not NULL, the value in v and a separator are appended to sb.
In the Merge method, you are simply adding a partial aggregation to the current one. You do so by calling the Accumulate method of the current aggregation and adding the termination (final value) of the other partial aggregation. The input of the Merge function refers to the class name, which you revised earlier to StringConcat.
The Terminate method is very simple as well; it just returns the string representation of the aggregated value minus the superfluous separator at the end.
Delete the last two rows of the code in the class from the template; these are a placeholder for a member field. You already defined the member field you need at the beginning of the UDA.
Next, go back to the top of the UDA, right after the inclusion of the namespaces. You’ll find attribute names that you want to include. Attributes help Visual Studio in deployment, and they help SQL Server to optimize the usage of the UDA. UDAs have to include the Serializable attribute. Serialization in .NET means saving the values of the fields of a class persistently. UDAs need serialization for intermediate results. The format of the serialization can be native, meaning they are left to SQL Server or defined by the user. Serialization can be native if you use only .NET value types; it has to be user defined if you use .NET reference types. Unfortunately, the string type is a reference type in .NET. Therefore, you have to prepare your own serialization. You have to implement the IBinarySerialize interface, which defines just two methods: Read and Write. The implementation of these methods in our UDA is very simple. The Read method uses the ReadString method of the StringBuilder class. The Write method uses the default ToString method. The ToString method is inherited by all .NET classes from the topmost class, called System.Object.
Continue implementing the UDA by following these steps:
11.1.
Specify that you are going to implement the IBinarySerialize interface in the structure. If you’re using C#, you do so by adding a colon and the name of the interface right after the name of the structure (the UDA name). If you’re using Visual Basic, you do so by adding Implements IBinarySerialize after the name of the structure.
11.2.
Copy the Read and Write methods from Example 8-2 or Example 8-3 to the end of your UDA.
11.3.
Change the Format.Native property of the SqlUserDefinedAggregate attribute to Format.UserDefined. In SQL Server 2005, with user-defined serialization, your aggregate was limited to 8,000 bytes only. You had to specify how many bytes your UDA could return at maximum with the MaxByteSize property of the SqlUserDefinedAggregate attribute. SQL Server 2008 lifts this restriction and supports unlimited size (or more accurately, the maximum size supported by large object types like VARCHAR(MAX), which is currently 2 GB). A value of –1 in the MaxByteSize property indicates unlimited size.
You’ll find some other interesting properties of the SqlUserDefinedAggregate attribute in Example 8-2 and Example 8-3. Let’s explore them:
IsInvariantToDuplicates. This is an optional property. For example, the MAX aggregate is invariant to duplicates, while SUM is not.
IsInvariantToNulls. This is another optional property. It specifies whether the aggregate is invariant to NULLs.
IsInvariantToOrder. This property is reserved for future use. It is currently ignored by the query processor. Therefore, order is currently not guaranteed. If you want to concatenate elements in a certain order, you have to implement your own sorting logic either in the Accumulate or the Terminate methods. This naturally incurs extra cost and unfortunately cannot benefit from index ordering.
IsNullIfEmpty. This property indicates whether the aggregate returns a NULL if no values have been accumulated.
Add the aforementioned properties to your UDA by copying them from Example 8-2 or Example 8-3. Your first UDA is now complete!
Example 8-2 and Example 8-3 also have the code to implement a product UDA (Product). Copy the complete code implementing Product to your script. Note that this UDA involves handling of big integers only. Because the UDA internally deals only with value types, it can use native serialization. Native serialization requires that the StructLayoutAttribute be specified as StructLayout.LayoutKind.Sequential if the UDA is defined in a class and not a structure. Otherwise, the UDA implements the same four methods as your previous UDA. An additional check in the Accumulate method prevents out-of-range values.
Save all files by choosing the File menu item and then choosing Save All.
Create the assembly file in the project folder by building the solution. You do this by choosing the Build menu item and then choosing Build Solution.
Deploy the assembly in SQL Server.
Explicit deployment of the UDAs in SQL Server involves running the CREATE ASSEMBLY command to import the intermediate language code from the assembly file into the target database (tempdb in our case) and the CREATE AGGREGATE command to register each aggregate. If you used C# to define the UDAs, run the following code while connected to the tempdb database:
CREATE ASSEMBLY UDAs FROM 'C:\UDAs\UDAs\bin\Debug\UDAs.dll'; CREATE AGGREGATE dbo.StringConcat ( @value AS NVARCHAR(MAX), @separator AS NCHAR(1) ) RETURNS NVARCHAR(MAX) EXTERNAL NAME UDAs.StringConcat; CREATE AGGREGATE dbo.Product ( @value AS BIGINT ) RETURNS BIGINT EXTERNAL NAME UDAs.Product;
If you used Visual Basic, run the following code:
CREATE ASSEMBLY UDAs FROM 'C:\UDAs\UDAs\bin\UDAs.dll'; CREATE AGGREGATE dbo.StringConcat ( @value AS NVARCHAR(MAX), @separator AS NCHAR(1) ) RETURNS NVARCHAR(MAX) EXTERNAL NAME UDAs.[UDAs.StringConcat]; CREATE AGGREGATE dbo.Product ( @value AS BIGINT ) RETURNS BIGINT EXTERNAL NAME UDAs.[UDAs.Product];
The assembly should be cataloged at this point, and both UDAs should be created.
You can check whether the deployment was successful by browsing the sys.assemblies and sys.assembly_modules catalog views, which are in the tempdb database in our case. Run the following code to query those views:
SELECT * FROM sys.assemblies; SELECT * FROM sys.assembly_modules;
Note that to run user-defined assemblies in SQL Server, you need to enable the server configuration option ‘clr enabled’ (which is disabled by default). You do so by running the following code:
EXEC sp_configure 'clr enabled', 1; RECONFIGURE WITH OVERRIDE;
This requirement is applicable only if you want to run user-defined assemblies; this option is not required to be turned on if you want to run system-supplied assemblies.
That’s basically it. You use UDAs just like you use any built-in aggregate function—and that’s one of their great advantages compared to other solutions to custom aggregates. To test the new functions, run the following code, and you’ll get the same results returned by the other solutions to custom aggregates I presented earlier:
SELECT groupid, dbo.StringConcat(string, N',') AS string FROM dbo.Groups GROUP BY groupid; SELECT groupid, dbo.Product(val) AS product FROM dbo.Groups GROUP BY groupid;
Note that the StringConcat function expects a non-NULL separator as input and will fail if provided with a NULL. Of course, you can add logic to the function’s definition to use some default separator when a NULL is specified.
Specialized Solutions
Another type of solution for custom aggregates is developing a specialized, optimized solution for each aggregate. The advantage is usually the improved performance of the solution. The disadvantage is that you probably won’t be able to use similar logic for other aggregate calculations.
Specialized Solution for Aggregate String Concatenation
A specialized solution for aggregate string concatenation uses the PATH mode of the FOR XML query option. This beautiful (and extremely fast) technique was devised by Michael Rys, a program manager with the Microsoft SQL Server development team, and Eugene Kogan, a technical lead on the Microsoft SQL Server Engine team. The PATH mode provides an easier way to mix elements and attributes than the EXPLICIT directive. Here’s the specialized solution for aggregate string concatenation:
SELECT groupid, STUFF((SELECT ',' + string AS [text()] FROM dbo.Groups AS G2 WHERE G2.groupid = G1.groupid ORDER BY memberid FOR XML PATH('')), 1, 1, '') AS string FROM dbo.Groups AS G1 GROUP BY groupid;
The subquery basically returns an ordered path of all strings within the current group. Because an empty string is provided to the PATH clause as input, a wrapper element is not generated. An expression with no alias (for example, ‘,’ + string) or one aliased as [text()] is inlined, and its contents are inserted as a text node. The purpose of the STUFF function is simply to remove the first comma (by substituting it with an empty string).
Dynamic Pivoting
Now that you are familiar with a fast, specialized solution to string concatenation, you can put it to use to achieve dynamic pivoting. Recall from the “Pivoting” section that the static solutions for pivoting in SQL Server require you to explicitly list the spreading values (the values in the spreading element). Consider the following static query, which I covered earlier in the “Pivoting” section:
SELECT * FROM (SELECT custid, YEAR(orderdate) AS orderyear, qty FROM dbo.Orders) AS D PIVOT(SUM(qty) FOR orderyear IN([2006],[2007],[2008])) AS P;
Note that this query is against the dbo.Orders table that you created and populated earlier by running the code in Example 8-1. Here you have to explicitly list the order years in the IN clause. If you want to make this solution more dynamic, query the distinct order years from the table and use the FOR XML PATH technique to construct the comma-separated list of years. You can use the QUOTENAME function to convert the integer years to Unicode character strings and add brackets around them. Also, using QUOTENAME is critical to prevent SQL Injection if this technique is used for a nonnumeric spreading column. The query that produces the comma-separated list of years looks like this:
SELECT STUFF( (SELECT N',' + QUOTENAME(orderyear) AS [text()] FROM (SELECT DISTINCT YEAR(orderdate) AS orderyear FROM dbo.Orders) AS Years ORDER BY orderyear FOR XML PATH('')), 1, 1, '');
Note that this useful technique has some limitations, though not serious ones, because it’s XML based. For example, characters that have special meaning in XML, like ‘<’, will be converted to codes (like <), yielding the wrong pivot statement.
What’s left is to construct the whole query string, store it in a variable and use the sp_executesql stored procedure to execute it dynamically, like so:
DECLARE @sql AS NVARCHAR(1000); SET @sql = N'SELECT * FROM (SELECT custid, YEAR(orderdate) AS orderyear, qty FROM dbo.Orders) AS D PIVOT(SUM(qty) FOR orderyear IN(' + STUFF( (SELECT N',' + QUOTENAME(orderyear) AS [text()] FROM (SELECT DISTINCT YEAR(orderdate) AS orderyear FROM dbo.Orders) AS Years ORDER BY orderyear FOR XML PATH('')), 1, 1, '') + N')) AS P;'; EXEC sp_executesql @stmt = @sql;
Specialized Solution for Aggregate Product
Keep in mind that to calculate an aggregate product, you have to scan all values in the group. So the performance potential your solution can reach is to achieve the calculation by scanning the data only once, using a set-based query. In the case of an aggregate product, this can be achieved using mathematical manipulation based on logarithms. I’ll rely on the following logarithmic equations:
Equation 1: loga(b) = x if and only if ax = b
Equation 2: loga(v1 * v2 * . . . * vn) = loga(v1) + loga(v2) + . . . + loga(vn)
Basically, what you’re going to do here is a transformation of calculations. You have support in T-SQL for the LOG, POWER, and SUM functions. Using those, you can generate the missing product. Group the data by the groupid column, as you would with any built-in aggregate. The expression SUM(LOG10(val)) corresponds to the right side of Equation 2, where the base a is equal to 10 in our case, because you used the LOG10 function. To get the product of the elements, all you have left to do is raise the base (10) to the power of the right side of the equation. In other words, the expression POWER(10., SUM(LOG10(val))) gives you the product of elements within the group. Here’s what the full query looks like:
SELECT groupid, POWER(10., SUM(LOG10(val))) AS product FROM dbo.Groups GROUP BY groupid;
This is the final solution if you’re dealing only with positive values. However, the logarithm function is undefined for zero and negative numbers. You can use pivoting techniques to identify and deal with zeros and negatives as follows:
SELECT groupid, CASE WHEN MAX(CASE WHEN val = 0 THEN 1 END) = 1 THEN 0 ELSE CASE WHEN COUNT(CASE WHEN val < 0 THEN 1 END) % 2 = 0 THEN 1 ELSE -1 END * POWER(10., SUM(LOG10(NULLIF(ABS(val), 0)))) END AS product FROM dbo.Groups GROUP BY groupid;
The outer CASE expression first uses a pivoting technique to check whether a 0 value appears in the group, in which case it returns a 0 as the result. The ELSE clause invokes another CASE expression, which also uses a pivoting technique to count the number of negative values in the group. If that number is even, it produces a +1; if it’s odd, it produces a –1. The purpose of this calculation is to determine the numerical sign of the result. The sign (–1 or +1) is then multiplied by the product of the absolute values of the numbers in the group to give the desired product.
Note that NULLIF is used here to substitute zeros with NULLs. You might expect this part of the expression not to be evaluated at all if a zero is found. But remember that the optimizer can consider many different physical plans to execute your query. As a result, you can’t be certain of the actual order in which parts of an expression will be evaluated. By substituting zeros with NULLs, you ensure that you’ll never get a domain error if the LOG10 function ends up being invoked with a zero as an input. This use of NULLIF, together with the use of ABS, allows this solution to accommodate inputs of any sign (negative, zero, and positive).
You could also use a pure mathematical approach to handle zeros and negative values using the following query:
SELECT groupid, CAST(ROUND(EXP(SUM(LOG(ABS(NULLIF(val,0)))))* (1-SUM(1-SIGN(val))%4)*(1-SUM(1-SQUARE(SIGN(val)))),0) AS INT) AS product FROM dbo.Groups GROUP BY groupid;
This example shows that you should never lose hope when searching for an efficient solution. If you invest the time and think outside the box, in most cases you’ll find a solution.
Specialized Solutions for Aggregate Bitwise Operations
In this section, I’ll introduce specialized solutions for aggregating the T-SQL bitwise operations—bitwise OR (|), bitwise AND (&), and bitwise XOR (^). I’ll assume that you’re familiar with the basics of bitwise operators and their uses and provide only a brief overview. If you’re not, please refer first to the section “Bitwise Operators” in SQL Server Books Online.
Bitwise operations are operations performed on the individual bits of integer data. Each bit has two possible values, 1 and 0. Integers can be used to store bitmaps, or strings of bits, and in fact they are used internally by SQL Server to store metadata information—for example, properties of indexes (clustered, unique, and so on) and properties of databases (readonly, restrict access, autoshrink, and so on). You might also choose to store bitmaps yourself to represent sets of binary attributes—for example, a set of permissions where each bit represents a different permission.
Some experts advise against using such a design because it violates 1NF (first normal form, which requires attributes to be atomic). You might well prefer to design your data in a more normalized form, where attributes like this are stored in separate columns. I don’t want to get into a debate about which design is better. Here I’ll assume a given design that does store bitmaps with sets of flags, and I’ll assume that you need to perform aggregate bitwise activities on these bitmaps. I just want to introduce the techniques for cases where you do find the need to use them.
Bitwise OR (|) is usually used to construct bitmaps or to generate a result bitmap that accumulates all bits that are turned on. In the result of bitwise OR, bits are turned on (that is, have value 1) if they are turned on in at least one of the separate bitmaps.
Bitwise AND (&) is usually used to check whether a certain bit (or a set of bits) is turned on by ANDing the source bitmap and a mask. It’s also used to accumulate only bits that are turned on in all bitmaps. It generates a result bit that is turned on if that bit is turned on in all the individual bitmaps.
Bitwise XOR (^) is usually used to calculate parity or as part of a scheme to encrypt data. For each bit position, the result bit is turned on if it is on in an odd number of the individual bitmaps.
Aggregate versions of the bitwise operators are not provided in SQL Server, and I’ll provide solutions here to perform aggregate bitwise operations. I’ll use the same Groups table that I used in my other custom aggregate examples. Assume that the integer column val represents a bitmap. To see the bit representation of each integer, first create the function DecToBase by running the following code:
IF OBJECT_ID('dbo.DecToBase') IS NOT NULL DROP FUNCTION dbo.DecToBase; GO CREATE FUNCTION dbo.DecToBase(@val AS BIGINT, @base AS INT) RETURNS VARCHAR(63) AS BEGIN DECLARE @r AS VARCHAR(63), @alldigits AS VARCHAR(36); SET @alldigits = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ'; SET @r = ''; WHILE @val > 0 BEGIN SET @r = SUBSTRING(@alldigits, @val % @base + 1, 1) + @r; SET @val = @val / @base; END RETURN @r; END GO
The function accepts two inputs: a 64-bit integer holding the source bitmap and a base in which you want to represent the data. Use the following query to return the bit representation of the integers in the val column of Groups:
SELECT groupid, val, RIGHT(REPLICATE('0', 32) + CAST(dbo.DecToBase(val, 2) AS VARCHAR(64)), 32) AS binval FROM dbo.Groups;
This code generates the following output (only the 10 rightmost digits of binval are shown):
groupid val binval ---------- ----------- --------- a 6 00000110 a 7 00000111 b 3 00000011 b 7 00000111 b 3 00000011 b 11 00001011 c 8 00001000 c 10 00001010 c 12 00001100
The binval column shows the val column in base 2 representation, with leading zeros to create a string with a fixed number of digits. Of course, you can adjust the number of leading zeros according to your needs, which I did to produce the outputs I’ll show. To avoid distracting you from the techniques I want to focus on, however, the code for that adjustment is not in my code samples.
Aggregate Bitwise OR
Without further ado, let’s start with calculating an aggregate bitwise OR. To give tangible context to the problem, imagine that you’re maintaining application security in the database. The groupid column represents a user, and the val column represents a bitmap with permission states (either 1 for granted or 0 for not granted) of a role the user is a member of. You’re after the effective permissions bitmap for each user (group), which should be calculated as the aggregate bitwise OR between all bitmaps of roles the user is a member of.
The main aspect of a bitwise OR operation that I’ll rely on in my solutions is the fact that it’s equivalent to the arithmetic sum of the values represented by each distinct bit value that is turned on in the individual bitmaps. Within an integer, a bit represents the value 2(bit_pos-1). For example, the bit value of the third bit is 22 = 4. Take, for example, the bitmaps for user c: 8 (1000), 10 (1010), and 12 (1100). The bitmap 8 has only one bit turned on—the bit value representing 8; 10 has the bits representing 8 and 2 turned on; and 12 has the 8 and 4 bits turned on. The distinct bits turned on in any of the integers 8, 10, and 12 are the 2, 4, and 8 bits, so the aggregate bitwise OR of 8, 10, and 12 is equal to 2 + 4 + 8 = 14 (1110).
The following solution relies on the aforementioned logic by extracting the individual bit values that are turned on in any of the participating bitmaps. The extraction is achieved using the expression MAX(val & <bitval>). The query then performs an arithmetic sum of the individual bit values:
SELECT groupid, MAX(val & 1) + MAX(val & 2) + MAX(val & 4) + MAX(val & 8) -- ... + MAX(val & 1073741824) AS agg_or FROM dbo.Groups GROUP BY groupid;
This query generates the following output:
groupid agg_or binval ---------- ----------- -------- a 7 00000111 b 15 00001111 c 14 00001110
Note that I added a third column (binval) to the output showing the 10 rightmost digits of the binary representation of the result value. I’ll continue to do so with the rest of the queries that apply aggregate bitwise operations.
Similarly, you can use SUM(DISTINCT val & <bitval>) instead of MAX(val & <bitval>) because the only possible results are <bitval> and 0:
SELECT groupid, SUM(DISTINCT val & 1) + SUM(DISTINCT val & 2) + SUM(DISTINCT val & 4) + SUM(DISTINCT val & 8) -- ... + SUM(DISTINCT val & 1073741824) AS agg_or FROM dbo.Groups GROUP BY groupid;
Both solutions suffer from the same limitation—lengthy query strings—because of the need for a different expression for each bit value. In an effort to shorten the query strings, you can use an auxiliary table. You join the Groups table with an auxiliary table that contains all relevant bit values, using val & bitval = bitval as the join condition. The result of the join will include all bit values that are turned on in any of the bitmaps. You can then find SUM(DISTINCT <bitval>) for each group. You can easily generate the auxiliary table of bit values from the Nums table used earlier. Filter as many numbers as the bits that you might need and raise 2 to the power n–1. Here’s the complete solution:
SELECT groupid, SUM(DISTINCT bitval) AS agg_or FROM dbo.Groups JOIN (SELECT POWER(2, n-1) AS bitval FROM dbo.Nums WHERE n <= 31) AS Bits ON val & bitval = bitval GROUP BY groupid;
Aggregate Bitwise AND
In a similar manner, you can calculate an aggregate bitwise AND. In the permissions scenario, an aggregate bitwise AND represents the most restrictive permission set. Just keep in mind that a bit value should be added to the arithmetic sum only if it’s turned on in all bitmaps. So first group the data by groupid and bitval and filter only the groups where MIN(val & bitval) > 0, meaning that the bit value was turned on in all bitmaps. In an outer query, group the data by groupid and perform the arithmetic sum of the bit values from the inner query:
SELECT groupid, SUM(bitval) AS agg_and FROM (SELECT groupid, bitval FROM dbo.Groups, (SELECT POWER(2, n-1) AS bitval FROM dbo.Nums WHERE n <= 31) AS Bits GROUP BY groupid, bitval HAVING MIN(val & bitval) > 0) AS D GROUP BY groupid;
This query generates the following output:
groupid agg_and binval ---------- ----------- -------- a 6 00000110 b 3 00000011 c 8 00001000
Aggregate Bitwise XOR
To calculate an aggregate bitwise XOR operation, filter only the groupid, bitval groups that have an odd number of bits turned on, as shown in the following code, which illustrates an aggregate bitwise XOR using Nums:
SELECT groupid, SUM(bitval) AS agg_xor FROM (SELECT groupid, bitval FROM dbo.Groups, (SELECT POWER(2, n-1) AS bitval FROM dbo.Nums WHERE n <= 31) AS Bits GROUP BY groupid, bitval HAVING SUM(SIGN(val & bitval)) % 2 = 1) AS D GROUP BY groupid;
This query produces the following output:
groupid agg_xor binval ---------- ----------- -------- a 1 00000001 b 12 00001100 c 14 00001110
Median
As another example of a specialized custom aggregate solution, I’ll use the statistical median calculation. Suppose that you need to calculate the median of the val column for each group. There are two different definitions of median. Here we will return the middle value in case we have an odd number of elements and the average of the two middle values in case we have an even number of elements.
The following code shows a technique for calculating the median:
WITH Tiles AS ( SELECT groupid, val, NTILE(2) OVER(PARTITION BY groupid ORDER BY val) AS tile FROM dbo.Groups ), GroupedTiles AS ( SELECT groupid, tile, COUNT(*) AS cnt, CASE WHEN tile = 1 THEN MAX(val) ELSE MIN(val) END AS val FROM Tiles GROUP BY groupid, tile ) SELECT groupid, CASE WHEN MIN(cnt) = MAX(cnt) THEN AVG(1.*val) ELSE MIN(val) END AS median FROM GroupedTiles GROUP BY groupid;
This code generates the following output:
groupid median ---------- ---------- a 6.500000 b 5.000000 c 10.000000
The Tiles CTE calculates the NTILE(2) value within the group, based on val order. When you have an even number of elements, the first half of the values gets tile number 1, and the second half gets tile number 2. In an even case, the median is supposed to be the average of the highest value within the first tile and the lowest in the second. When you have an odd number of elements, remember that an additional row is added to the first group. This means that the highest value in the first tile is the median.
The second CTE (GroupedTiles) groups the data by group and tile number, returning the row count for each group and tile as well as the val column, which for the first tile is the maximum value within the tile and for the second tile is the minimum value within the tile.
The outer query groups the two rows in each group (one representing each tile). A CASE expression in the SELECT list determines what to return based on the parity of the group’s row count. When the group has an even number of rows (that is, the group’s two tiles have the same row count), you get the average of the maximum in the first tile and the minimum in the second. When the group has an odd number of elements (that is, the group’s two tiles have different row counts), you get the minimum of the two values, which happens to be the maximum within the first tile, which, in turn, happens to be the median.
Using the ROW_NUMBER function, you can come up with additional solutions to finding the median that are more elegant and somewhat simpler. Here’s the first example:
WITH RN AS ( SELECT groupid, val, ROW_NUMBER() OVER(PARTITION BY groupid ORDER BY val, memberid) AS rna, ROW_NUMBER() OVER(PARTITION BY groupid ORDER BY val DESC, memberid DESC) AS rnd FROM dbo.Groups ) SELECT groupid, AVG(1.*val) AS median FROM RN WHERE ABS(rna - rnd) <= 1 GROUP BY groupid;
The idea is to calculate two row numbers for each row: one based on val, memberid (the tiebreaker) in ascending order (rna) and the other based on the same attributes in descending order (rnd). Two sequences sorted in opposite directions have an interesting mathematical relationship that you can use to your advantage. The absolute difference between the two is smaller than or equal to 1 only for the elements that need to participate in the median calculation. Take, for example, a group with an odd number of elements; ABS(rna – rnd) is equal to 0 only for the middle row. For all other rows, it is greater than 1. Given an even number of elements, the difference is 1 for the two middle rows and greater than 1 for all others.
The reason for using memberid as a tiebreaker is to guarantee determinism of the row number calculations. Because you’re calculating two different row numbers, you want to make sure that a value that appears at the nth position from the beginning in ascending order appears at the nth position from the end in descending order.
Once the values that need to participate in the median calculation are isolated, you just need to group them by groupid and calculate the average per group.
You can avoid the need to calculate two separate row numbers by deriving the second from the first. The descending row numbers can be calculated by subtracting the ascending row numbers from the count of rows in the group and adding one. For example, in a group of four elements, the row that got an ascending row number 1 would get the descending row number 4–1+1 = 4. Ascending row number 2 would get the descending row number 4–2+1 = 3 and so on. Deriving the descending row number from the ascending one eliminates the need for a tiebreaker. You’re not dealing with two separate calculations; therefore, nondeterminism is not an issue anymore.
So the calculation rna – rnd becomes the following: rn – (cnt-rn+1) = 2*rn – cnt – 1. Here’s a query that implements this logic:
WITH RN AS ( SELECT groupid, val, ROW_NUMBER() OVER(PARTITION BY groupid ORDER BY val) AS rn, COUNT(*) OVER(PARTITION BY groupid) AS cnt FROM dbo.Groups ) SELECT groupid, AVG(1.*val) AS median FROM RN WHERE ABS(2*rn - cnt - 1) <= 1 GROUP BY groupid;
Here’s another way to figure out which rows participate in the median calculation based on the row number and the count of rows in the group: rn IN((cnt+1)/2, (cnt+2)/2). For an odd number of elements, both expressions yield the middle row number. For example, if you have 7 rows, both (7+1)/2 and (7+2)/2 equal 4. For an even number of elements, the first expression yields the row number just before the middle point, and the second yields the row number just after it. If you have 8 rows, (8+1)/2 yields 4, and (8+2)/2 yields 5. Here’s the query that implements this logic:
WITH RN AS ( SELECT groupid, val, ROW_NUMBER() OVER(PARTITION BY groupid ORDER BY val) AS rn, COUNT(*) OVER(PARTITION BY groupid) AS cnt FROM dbo.Groups ) SELECT groupid, AVG(1.*val) AS median FROM RN WHERE rn IN((cnt+1)/2, (cnt+2)/2) GROUP BY groupid;
Mode
The last specialized solution of a custom aggregate that I’ll cover is for the mode of a distribution. The mode is the most frequently occurring value. As an example of mode calculation, consider a request to return for each customer the ID of the employee who handled the most orders for that customer, according to the Sales.Orders table in the InsideTSQL2008 database. In case of ties, you need to determine what you want to do. One option is to return all tied employees; another option is to use a tiebreaker to determine which to return—for example, the one with the higher employee ID.
The first solution that I’ll present is based on ranking calculations. I’ll first describe a solution that applies a tiebreaker, and then I’ll explain the required revisions for the solution to return all ties.
You group the rows by customer ID and employee ID. You calculate a count of orders per group, plus a row number partitioned by customer ID, based on the order of count descending and employee ID descending. The rows with the employee ID that is the mode—with the higher employee ID used as a tiebreaker—have row number 1. What’s left is to define a table expression based on the query and in the outer query filter only the rows where the row number is equal to 1, like so:
USE InsideTSQL2008; WITH C AS ( SELECT custid, empid, COUNT(*) AS cnt, ROW_NUMBER() OVER(PARTITION BY custid ORDER BY COUNT(*) DESC, empid DESC) AS rn FROM Sales.Orders GROUP BY custid, empid ) SELECT custid, empid, cnt FROM C WHERE rn = 1;
This query generates the following output, shown here in abbreviated form:
custid empid cnt ----------- ----------- ----------- 1 4 2 2 3 2 3 3 3 4 4 4 5 3 6 6 9 3 7 4 3 8 4 2 9 4 4 10 3 4 11 6 2 12 8 2 ...
If you want to return all ties, simply use the RANK function instead of ROW_NUMBER and calculate it based on count ordering alone (without the employee ID tiebreaker), like so:
WITH C AS ( SELECT custid, empid, COUNT(*) AS cnt, RANK() OVER(PARTITION BY custid ORDER BY COUNT(*) DESC) AS rn FROM Sales.Orders GROUP BY custid, empid ) SELECT custid, empid, cnt FROM C WHERE rn = 1;
This time, as you can see in the following output, ties are returned:
custid empid cnt ----------- ----------- ----------- 1 1 2 1 4 2 2 3 2 3 3 3 4 4 4 5 3 6 6 9 3 7 4 3 8 4 2 9 4 4 10 3 4 11 6 2 11 4 2 11 3 2 12 8 2 ...
In case you do want to apply a tiebreaker, you can use another solution that is very efficient. It is based on the concatenation technique that I presented earlier in the chapter. Write a query that groups the data by customer ID and employee ID, and for each group, concatenate the count of rows and the employee ID to a single value (call it binval). Define a table expression based on this query. Have the outer query group the data by customer ID and calculate for each customer the maximum binval. This maximum value contains the max count and within it the maximum employee ID. What’s left is to extract the count and employee ID from the binary value by using the SUBSTRING function and convert the values to the original types. Here’s the complete solution query:
SELECT custid, CAST(SUBSTRING(MAX(binval), 5, 4) AS INT) AS empid, CAST(SUBSTRING(MAX(binval), 1, 4) AS INT) AS cnt FROM (SELECT custid, CAST(COUNT(*) AS BINARY(4)) + CAST(empid AS BINARY(4)) AS binval FROM Sales.Orders GROUP BY custid, empid) AS D GROUP BY custid;
As an exercise, you can test the solutions against a table with a large number of rows. You will see that this solution is very fast.