T-SQL Querying: TOP and OFFSET-FETCH
- 3/11/2015
- The TOP and OFFSET-FETCH filters
- Optimization of filters demonstrated through paging
- Using the TOP option with modifications
- Top N per group
- Median
- Conclusion
Classic filters in SQL like ON, WHERE, and HAVING are based on predicates. TOP and OFFSET-FETCH are filters that are based on a different concept: you indicate order and how many rows to filter based on that order. Many filtering tasks are defined based on order and a required number of rows. It’s certainly good to have language support in T-SQL that allows you to phrase the request in a manner that is similar to the way you think about the task.
This chapter starts with the logical design aspects of the filters. It then uses a paging scenario to demonstrate their optimization. The chapter also covers the use of TOP with modification statements. Finally, the chapter demonstrates the use of TOP and OFFSET-FETCH in solving practical problems like top N per group and median.
The TOP and OFFSET-FETCH filters
You use the TOP and OFFSET-FETCH filters to implement filtering requirements in your queries in an intuitive manner. The TOP filter is a proprietary feature in T-SQL, whereas the OFFSET-FETCH filter is a standard feature. T-SQL started supporting OFFSET-FETCH with Microsoft SQL Server 2012. As of SQL Server 2014, the implementation of OFFSET-FETCH in T-SQL is still missing a couple of standard elements—interestingly, ones that are available with TOP. With the current implementation, each of the filters has capabilities that are not supported by the other.
I’ll start by describing the logical design aspects of TOP and then cover those of OFFSET-FETCH.
The TOP filter
The TOP filter is a commonly used construct in T-SQL. Its popularity probably can be attributed to the fact that its design is so well aligned with the way many filtering requirements are expressed—for example, “Return the three most recent orders.” In this request, the order for the filter is based on orderdate, descending, and the number of rows you want to filter based on this order is 3.
You specify the TOP option in the SELECT list with an input value typed as BIGINT indicating how many rows you want to filter. You provide the ordering specification in the classic ORDER BY clause. For example, you use the following query to get the three most recent orders.
USE TSQLV3;

SELECT TOP (3) orderid, orderdate, custid, empid
FROM Sales.Orders
ORDER BY orderdate DESC;
I got the following output from this query:
orderid     orderdate  custid      empid
----------- ---------- ----------- -----------
11077       2015-05-06 65          1
11076       2015-05-06 9           4
11075       2015-05-06 68          8
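The input to TOP does not have to be a constant. As a minimal sketch, the following variation passes the row count through a variable; the name @n is hypothetical and used only for illustration. When the input is a variable or another expression, the parentheses around it are required.

```sql
USE TSQLV3;

-- @n is a hypothetical variable name, typed as BIGINT to match the TOP input
DECLARE @n AS BIGINT = 3;

-- Same request as before: the @n most recent orders
SELECT TOP (@n) orderid, orderdate, custid, empid
FROM Sales.Orders
ORDER BY orderdate DESC;
```

With @n set to 3, this is the same request as the previous query, so the same caveats about ties in orderdate apply.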
Instead of specifying the number of rows you want to filter, you can use TOP to specify the percent (of the total number of rows in the query result). You do so by providing a value in the range 0 through 100 (typed as FLOAT) and adding the PERCENT keyword. For example, in the following query you request to filter one percent of the rows:
SELECT TOP (1) PERCENT orderid, orderdate, custid, empid
FROM Sales.Orders
ORDER BY orderdate DESC;
SQL Server rounds up the number of rows computed based on the input percent. For example, the result of 1 percent applied to 830 rows in the Orders table is 8.3. Rounding up this number, you get 9. Here’s the output I got for this query:
orderid     orderdate  custid      empid
----------- ---------- ----------- -----------
11074       2015-05-06 73          7
11075       2015-05-06 68          8
11076       2015-05-06 9           4
11077       2015-05-06 65          1
11070       2015-05-05 44          2
11071       2015-05-05 46          1
11072       2015-05-05 20          4
11073       2015-05-05 58          2
11067       2015-05-04 17          1
Note that to translate the input percent to a number of rows, SQL Server has to first figure out the count of rows in the query result, and this usually requires extra work.
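To make the rounding arithmetic concrete, here is a small sketch that computes the row count a given percent translates to. It only illustrates the calculation described above and is not meant to represent how the engine computes it internally. The variable @pct and the column aliases are names I made up for the example.

```sql
USE TSQLV3;

-- @pct is a hypothetical variable holding the input percent
DECLARE @pct AS FLOAT = 1;

SELECT
  COUNT(*) AS numrows,                            -- total rows in the query result
  CEILING(@pct / 100 * COUNT(*)) AS toprows       -- percent of the total, rounded up
FROM Sales.Orders;
```

With the 830 rows mentioned earlier, 1 percent is 8.3, and CEILING rounds it up to 9, which matches the number of rows the TOP (1) PERCENT query returned.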
Interestingly, ordering specification is optional for the TOP filter. For example, consider the following query:
SELECT TOP (3) orderid, orderdate, custid, empid
FROM Sales.Orders;
I got the following output from this query:
orderid     orderdate  custid      empid
----------- ---------- ----------- -----------
10248       2013-07-04 85          5
10249       2013-07-05 79          6
10250       2013-07-08 34          4
The selection of which three rows to return is nondeterministic. This means that if you run the query again, without the underlying data changing, theoretically you could get a different set of three rows. In practice, the row selection will depend on physical conditions like optimization choices, storage engine choices, data layout, and other factors. If you actually run the query multiple times, as long as those physical conditions don’t change, there’s some likelihood you will keep getting the same results. But it is critical to understand the “physical data independence” principle from the relational model, and remember that at the logical level you do not have a guarantee for repeatable results. Without ordering specification, you should consider the order as being arbitrary, resulting in a nondeterministic row selection.
Even when you do provide ordering specification, it doesn’t mean that the query is deterministic. For example, an earlier TOP query used orderdate DESC as the ordering specification. The orderdate column is not unique; therefore, the selection between rows with the same order date is nondeterministic. So what do you do in cases where you must guarantee determinism? There are two options: using WITH TIES or unique ordering.
The WITH TIES option causes ties to be included in the result. Here’s how you apply it to our example:
SELECT TOP (3) WITH TIES orderid, orderdate, custid, empid
FROM Sales.Orders
ORDER BY orderdate DESC;
Here’s the result I got from this query:
orderid     orderdate  custid      empid
----------- ---------- ----------- -----------
11077       2015-05-06 65          1
11076       2015-05-06 9           4
11075       2015-05-06 68          8
11074       2015-05-06 73          7
SQL Server filters the three rows with the most recent order dates, plus it includes all other rows that have the same order date as in the last row. As a result, you can get more rows than the number you specified. In this query, you specified you wanted to filter three rows but ended up getting four. What’s interesting to note here is that the row selection is now deterministic, but the presentation order between rows with the same order date is nondeterministic.
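One way to reason about which rows WITH TIES selects is to think of it in terms of ranking: a row qualifies if fewer than three rows sort strictly before it. The following sketch expresses that idea with the RANK window function. It is meant as a logical equivalent for illustration, not as a recommended replacement; the names C and rnk are arbitrary.

```sql
USE TSQLV3;

WITH C AS
(
  SELECT orderid, orderdate, custid, empid,
    -- RANK assigns the same value to rows that tie on orderdate
    RANK() OVER(ORDER BY orderdate DESC) AS rnk
  FROM Sales.Orders
)
SELECT orderid, orderdate, custid, empid
FROM C
WHERE rnk <= 3;   -- the first three rows plus any rows tied with the third
```

Against the sample data shown above, the four orders placed on 2015-05-06 all get rank 1 and the next order date gets rank 5, so this query selects the same four rows. As with the WITH TIES query, presentation order is not guaranteed here because the outer query has no ORDER BY clause.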
The second method to guarantee a deterministic result is to make the ordering specification unique by adding a tiebreaker. For example, you could add orderid DESC as the tiebreaker in our example. This means that, in the case of ties in the order date values, a row with a higher order ID value is preferred to a row with a lower one. Here’s our query with the tiebreaker applied:
SELECT TOP (3) orderid, orderdate, custid, empid
FROM Sales.Orders
ORDER BY orderdate DESC, orderid DESC;
This query generates the following output:
orderid     orderdate  custid      empid
----------- ---------- ----------- -----------
11077       2015-05-06 65          1
11076       2015-05-06 9           4
11075       2015-05-06 68          8
Use of unique ordering makes both the row selection and presentation ordering deterministic. The result set as well as the presentation ordering of the rows are guaranteed to be repeatable so long as the underlying data doesn’t change.
If you have a case where you need to filter a certain number of rows but truly don’t care about order, it could be a good idea to specify ORDER BY (SELECT NULL), like so:
SELECT TOP (3) orderid, orderdate, custid, empid
FROM Sales.Orders
ORDER BY (SELECT NULL);
This way, you let everyone know your choice of arbitrary order is intentional, which helps to avoid confusion and doubt.
As a reminder of what I explained in Chapter 1, “Logical query processing,” about the TOP and OFFSET-FETCH filters, presentation order is guaranteed only if the outer query has an ORDER BY clause. For example, in the following query, presentation ordering is not guaranteed:
SELECT orderid, orderdate, custid, empid
FROM ( SELECT TOP (3) orderid, orderdate, custid, empid
       FROM Sales.Orders
       ORDER BY orderdate DESC, orderid DESC ) AS D;
To provide a presentation-ordering guarantee, you must specify an ORDER BY clause in the outer query, like so:
SELECT orderid, orderdate, custid, empid
FROM ( SELECT TOP (3) orderid, orderdate, custid, empid
       FROM Sales.Orders
       ORDER BY orderdate DESC, orderid DESC ) AS D
ORDER BY orderdate DESC, orderid DESC;
The OFFSET-FETCH filter
The OFFSET-FETCH filter is a standard feature designed similarly to TOP but with an extra element: you can specify how many rows you want to skip before specifying how many rows you want to filter.
As you could have guessed, this feature can be handy in implementing paging solutions—that is, returning a result to the user one chunk at a time upon request when the full result set is too long to fit in one screen or web page.
The OFFSET-FETCH filter requires an ORDER BY clause to exist, and it is specified right after it. You start by indicating how many rows to skip in an OFFSET clause, followed by how many rows to filter in a FETCH clause. For example, based on the indicated order, the following query skips the first 50 rows and filters the next 25 rows:
SELECT orderid, orderdate, custid, empid
FROM Sales.Orders
ORDER BY orderdate DESC, orderid DESC
OFFSET 50 ROWS FETCH NEXT 25 ROWS ONLY;
In other words, the query filters rows 51 through 75. In paging terms, assuming a page size of 25 rows, this query returns the third page.
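The relationship between page number, page size, and the number of rows to skip can be written as an expression directly in the OFFSET clause. The following is a minimal sketch of that idea; the variable names @pagenum and @pagesize are hypothetical, and the sketch assumes page numbers start at 1.

```sql
USE TSQLV3;

-- Hypothetical paging inputs: page numbers are assumed to start at 1
DECLARE @pagenum AS BIGINT = 3, @pagesize AS BIGINT = 25;

SELECT orderid, orderdate, custid, empid
FROM Sales.Orders
ORDER BY orderdate DESC, orderid DESC
-- Skip the rows of all previous pages, then return one page
OFFSET (@pagenum - 1) * @pagesize ROWS FETCH NEXT @pagesize ROWS ONLY;
```

With the values shown, the offset evaluates to (3 - 1) * 25 = 50, so this is the same request as the query with the constants 50 and 25.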
To allow natural declarative language, you can use the keyword FIRST instead of NEXT if you like, though the meaning is the same. Using FIRST could be more intuitive if you’re not skipping any rows. Even if you don’t want to skip any rows, T-SQL still makes it mandatory to specify the OFFSET clause (with 0 ROWS) to avoid parsing ambiguity. Similarly, instead of using the plural form of the keyword ROWS, you can use the singular form ROW in both the OFFSET and the FETCH clauses. This is more natural if you need to skip or filter only one row.
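As a small illustration of these alternative keywords, the following sketch requests only the most recent order. It skips no rows, so it uses FIRST, and it returns a single row, so it uses the singular ROW in the FETCH clause:

```sql
USE TSQLV3;

SELECT orderid, orderdate, custid, empid
FROM Sales.Orders
ORDER BY orderdate DESC, orderid DESC
-- OFFSET is still mandatory even though no rows are skipped
OFFSET 0 ROWS FETCH FIRST 1 ROW ONLY;
```

Writing OFFSET 0 ROWS FETCH NEXT 1 ROWS ONLY would have the same meaning; the choice of keywords affects only readability.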
If you’re curious what the purpose of the keyword ONLY is, it means not to include ties. Standard SQL defines the alternative WITH TIES; however, T-SQL doesn’t support it yet. Similarly, standard SQL defines the PERCENT option, but T-SQL doesn’t support it yet either. These two missing options are available with the TOP filter.
As mentioned, the OFFSET-FETCH filter requires an ORDER BY clause. If you want to use arbitrary order, like TOP without an ORDER BY clause, you can use the trick with ORDER BY (SELECT NULL), like so:
SELECT orderid, orderdate, custid, empid
FROM Sales.Orders
ORDER BY (SELECT NULL)
OFFSET 0 ROWS FETCH NEXT 3 ROWS ONLY;
The FETCH clause is optional. If you want to skip a certain number of rows but not limit how many rows to return, simply don’t indicate a FETCH clause. For example, the following query skips 50 rows but doesn’t limit the number of returned rows:
SELECT orderid, orderdate, custid, empid
FROM Sales.Orders
ORDER BY orderdate DESC, orderid DESC
OFFSET 50 ROWS;
Concerning presentation ordering, the behavior is the same as with the TOP filter; namely, with OFFSET-FETCH also, presentation ordering is guaranteed only if the outermost query has an ORDER BY clause.
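To illustrate, here is a sketch that mirrors the earlier derived-table example for TOP, this time with OFFSET-FETCH. The inner ORDER BY clause serves only the filter; the outer ORDER BY clause is what provides the presentation-ordering guarantee. The derived table alias D is arbitrary.

```sql
USE TSQLV3;

SELECT orderid, orderdate, custid, empid
FROM ( SELECT orderid, orderdate, custid, empid
       FROM Sales.Orders
       -- This ORDER BY defines which rows the filter selects
       ORDER BY orderdate DESC, orderid DESC
       OFFSET 50 ROWS FETCH NEXT 25 ROWS ONLY ) AS D
-- This ORDER BY guarantees the order in which the rows are presented
ORDER BY orderdate DESC, orderid DESC;
```

If you remove the outer ORDER BY clause, the query still returns rows 51 through 75, but the order in which they are presented is no longer guaranteed.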