SQL Windowing
- 4/15/2012
- Background of Window Functions
- A Glimpse of Solutions Using Window Functions
- Elements of Window Functions
- Query Elements Supporting Window Functions
- Potential for Additional Filters
- Reuse of Window Definitions
- Summary
Query Elements Supporting Window Functions
Window functions aren’t supported in all query clauses; rather, they’re supported only in the SELECT and ORDER BY clauses. To help you understand the reason for this restriction, I first need to explain a concept called logical query processing. Then I’ll get to the clauses that support window functions, and finally I’ll explain how to circumvent the restriction with the other clauses.
Logical Query Processing
Logical query processing describes the conceptual way in which a SELECT query is evaluated according to the logical language design. It describes a process made of a series of steps, or phases, that proceed from the query’s input tables to the query’s final result set. Note that by “logical query processing,” I mean the conceptual way in which the query is evaluated—not necessarily the physical way SQL Server processes the query. As part of the optimization, SQL Server can make shortcuts, rearrange the order of some steps, and pretty much do whatever it likes. But that’s as long as it guarantees that it will produce the same output as the one defined by logical query processing applied to the declarative query request.
Each step in logical query processing operates on one or more tables (sets of rows) that serve as its input and returns a table as its output. The output table of one step then becomes the input table for the next step.
Figure 1-6 is a flow diagram illustrating the logical query processing flow in SQL Server 2012.
Figure 1-6. Logical query processing.
Note that when you write a query, the SELECT clause appears first in terms of the keyed-in order, but observe that in terms of the logical query processing order, it appears almost last—just before the ORDER BY clause is handled.
There’s much more to say about logical query processing, but the details are a topic for another book. For the purposes of our discussion, what’s important to note is the order in which the various clauses are evaluated. The following list shows the order (with the phases in which window functions are allowed shown in bold):
FROM
WHERE
GROUP BY
HAVING
SELECT
5-1. Evaluate Expressions
5-2. Remove Duplicates
ORDER BY
OFFSET-FETCH/TOP
Understanding logical query processing and the logical query processing order enables you to understand the motivation behind restricting window functions to only specific clauses.
Clauses Supporting Window Functions
As illustrated in Figure 1-6, only the query clauses SELECT and ORDER BY support window functions directly. The reason for the limitation is to avoid ambiguity by operating on (almost) the final result set of the query as the starting point for the window. If window functions are allowed in phases previous to the SELECT phase, their initial window could be different than that in the SELECT phase, and therefore, with some query forms, it could be very difficult to figure out the right result. I’ll try to demonstrate the ambiguity problem through an example. First run the following code to create the table T1 and populate it with sample data:
SET NOCOUNT ON; USE TSQL2012; IF OBJECT_ID('dbo.T1', 'U') IS NOT NULL DROP TABLE dbo.T1; GO CREATE TABLE dbo.T1 ( col1 VARCHAR(10) NOT NULL CONSTRAINT PK_T1 PRIMARY KEY ); INSERT INTO dbo.T1(col1) VALUES('A'),('B'),('C'),('D'),('E'),('F');
Suppose that window functions were allowed in phases prior to the SELECT—for example, in the WHERE phase. Consider then the following query, and try to figure out which col1 values should appear in the result:
SELECT col1 FROM dbo.T1 WHERE col1 > 'B' AND ROW_NUMBER() OVER(ORDER BY col1) <= 3;
Before you assume that the answer should obviously be the values C, D, and E, consider the all-at-once concept in SQL. The concept of all-at-once means that all expressions that appear in the same logical phase are conceptually evaluated at the same point in time. This means that the order in which the expressions are evaluated shouldn’t matter. With this in mind, the following query should be semantically equivalent to the previous one:
SELECT col1 FROM dbo.T1 WHERE ROW_NUMBER() OVER(ORDER BY col1) <= 3 AND col1 > 'B';
Now, can you figure out what the right answer is? Is it C, D, and E, or is it just C?
That’s an example of the ambiguity I was talking about. By restricting window functions to only the SELECT and ORDER BY clauses of a query, this ambiguity is eliminated.
Looking at Figure 1-6, you might have noticed that within the SELECT phase, it’s step 5-1 (Evaluate Expressions) that supports window functions, and this step is evaluated before step 5-2 (Remove Duplicates). If you wonder why it is important to know such subtleties, I’ll demonstrate why.
Following is a query returning the empid and country attributes of all employees from the Employees table:
SELECT empid, country FROM HR.Employees; empid country ----------- --------------- 1 USA 2 USA 3 USA 4 USA 5 UK 6 UK 7 UK 8 USA 9 UK
Next, examine the following query and see if you can guess what its output is before executing it:
SELECT DISTINCT country, ROW_NUMBER() OVER(ORDER BY country) AS rownum FROM HR.Employees;
Some expect to get the following output:
country rownum --------------- -------------------- UK 1 USA 2
But in reality you get this:
country rownum --------------- -------------------- UK 1 UK 2 UK 3 UK 4 USA 5 USA 6 USA 7 USA 8 USA 9
Now consider that the ROW_NUMBER function in this query is evaluated in step 5-1 where the SELECT list expressions are evaluated—prior to the removal of the duplicates in step 5-2. The ROW_NUMBER function assigns nine unique row numbers to the nine employee rows, and then the DISTINCT clause has no duplicates left to remove.
When you realize this and understand that it has to do with the logical query processing order of the different elements, you can think of a solution. For example, you can have a table expression defined based on a query that just returns distinct countries and have the outer query assign the row numbers after duplicates are removed, like so:
WITH EmpCountries AS ( SELECT DISTINCT country FROM HR.Employees ) SELECT country, ROW_NUMBER() OVER(ORDER BY country) AS rownum FROM EmpCountries;
Can you think of other ways to solve the problem, perhaps even simpler ways than this one?
The fact that window functions are evaluated in the SELECT or ORDER BY phase means that the window defined for the calculation—before applying further restrictions—is the intermediate form of rows of the query after all previous phases—that is, after applying the FROM with all of its table operators (for example, joins), and after the WHERE filtering, the grouping, and the filtering of the groups. Consider the following query as an example:
SELECT O.empid, SUM(OD.qty) AS qty, RANK() OVER(ORDER BY SUM(OD.qty) DESC) AS rnk FROM Sales.Orders AS O JOIN Sales.OrderDetails AS OD ON O.orderid = OD.orderid WHERE O.orderdate >= '20070101' AND O.orderdate < '20080101' GROUP BY O.empid; empid qty rnk ------ ----- --- 4 5273 1 3 4436 2 1 3877 3 8 2843 4 2 2604 5 7 2292 6 6 1738 7 5 1471 8 9 955 9
First the FROM clause is evaluated and the join is performed. Then only the rows where the order year is 2007 are filtered. Then the remaining rows are grouped by employee ID. Only then are the expressions in the SELECT list evaluated, including the RANK function, which is calculated based on ordering by the total quantity descending. If there were other window functions in the SELECT list, they would all use the same result set as their starting point. Recall from earlier discussions about alternative options to window functions (for example, subqueries) that they start their view of the data from scratch—meaning that you have to repeat all the logic you have in the outer query in each of your subqueries, leading to much more verbose code.
Circumventing the Limitations
I explained the reasoning behind disallowing the use of window functions in logical query processing phases that are evaluated prior to the SELECT clause. But what if you need to filter by or group by a calculation based on window functions? The solution is to use a table expression such as a CTE or a derived table. Have a query invoke the window function in its SELECT list, assigning the calculation an alias. Define a table expression based on that query, and then have the outer query refer to that alias where you need it.
Here’s an example showing how you can filter by the result of a window function using a CTE:
WITH C AS ( SELECT orderid, orderdate, val, RANK() OVER(ORDER BY val DESC) AS rnk FROM Sales.OrderValues ) SELECT * FROM C WHERE rnk <= 5; orderid orderdate val rnk -------- ----------------------- --------- ---- 10865 2008-02-02 00:00:00.000 16387.50 1 10981 2008-03-27 00:00:00.000 15810.00 2 11030 2008-04-17 00:00:00.000 12615.05 3 10889 2008-02-16 00:00:00.000 11380.00 4 10417 2007-01-16 00:00:00.000 11188.40 5
With modification statements, window functions are disallowed altogether because those don’t support SELECT and ORDER BY clauses. But there are cases where involving window functions in modification statements is needed. Table expressions can be used to address this need as well because T-SQL supports modifying data through table expressions. I’ll demonstrate this capability with an UPDATE example. First run the following code to create a table called T1 with columns col1 and col2 and populate it with sample data:
SET NOCOUNT ON; USE TSQL2012; IF OBJECT_ID('dbo.T1', 'U') IS NOT NULL DROP TABLE dbo.T1; GO CREATE TABLE dbo.T1 ( col1 INT NULL, col2 VARCHAR(10) NOT NULL ); INSERT INTO dbo.T1(col2) VALUES('C'),('A'),('B'),('A'),('C'),('B');
Explicit values were provided in col2, and NULLs were used as defaults in col1.
Suppose this table represents a situation with data-quality problems. A key wasn’t enforced in this table, and therefore it is not possible to uniquely identify rows. You want to assign unique col1 values in all rows. You’re thinking of using the ROW_NUMBER function in an UPDATE statement, like so:
UPDATE dbo.T1 SET col1 = ROW_NUMBER() OVER(ORDER BY col2);
But remember that this is not allowed. The workaround is to write a query against T1 returning col1 and an expression based on the ROW_NUMBER function (call it rownum); define a table expression based on this query; finally, have an outer UPDATE statement against the CTE assign rownum to col1, like so:
WITH C AS ( SELECT col1, col2, ROW_NUMBER() OVER(ORDER BY col2) AS rownum FROM dbo.T1 ) UPDATE C SET col1 = rownum;
Query T1, and observe that all rows got unique col1 values:
SELECT col1, col2 FROM dbo.T1; col1 col2 ----------- ---------- 5 C 1 A 3 B 2 A 6 C 4 B