Temporary Tables and Table Variables in Microsoft SQL Server 2008 T-SQL Programming
- 9/23/2009
Summary Exercises
This section will introduce three scenarios in which you need to work with temporary objects. Based on the knowledge you’ve acquired in this chapter, you need to implement a solution with the appropriate temporary object type.
The scenarios involve querying Customers and Orders tables. To test the logical correctness of your solutions, use the Sales.Customers and Sales.Orders tables in the InsideTSQL2008 sample database. To test the performance of your solutions, use the tables that you create and populate in tempdb by running the code in Example 7-1.
Example 7-1. Code that creates large tables for summary exercises
SET NOCOUNT ON;
USE tempdb;

IF SCHEMA_ID('Sales') IS NULL EXEC('CREATE SCHEMA Sales');
IF OBJECT_ID('Sales.Customers', 'U') IS NOT NULL DROP TABLE Sales.Customers;
IF OBJECT_ID('Sales.Orders', 'U') IS NOT NULL DROP TABLE Sales.Orders;
GO

SELECT n AS custid
INTO Sales.Customers
FROM dbo.Nums
WHERE n <= 10000;

ALTER TABLE Sales.Customers ADD PRIMARY KEY(custid);

SELECT n AS orderid,
  DATEADD(day, ABS(CHECKSUM(NEWID())) % (4*365), '20060101') AS orderdate,
  1 + ABS(CHECKSUM(NEWID())) % 10000 AS custid,
  1 + ABS(CHECKSUM(NEWID())) % 40 AS empid,
  CAST('a' AS CHAR(200)) AS filler
INTO Sales.Orders
FROM dbo.Nums
WHERE n <= 1000000;

ALTER TABLE Sales.Orders ADD PRIMARY KEY(orderid);

CREATE INDEX idx_cid_eid ON Sales.Orders(custid, empid);
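Example 7-1 assumes that an auxiliary table of sequential integers called dbo.Nums already exists in tempdb. If you don't have one, the following is a minimal sketch that creates and populates it; the table and column names match what Example 7-1 expects, and the cross-joined-CTE population technique is one common approach, not necessarily the code the book uses elsewhere:

```sql
-- Hypothetical helper: build dbo.Nums with integers 1 through 1,000,000.
USE tempdb;
IF OBJECT_ID('dbo.Nums', 'U') IS NOT NULL DROP TABLE dbo.Nums;
CREATE TABLE dbo.Nums(n INT NOT NULL PRIMARY KEY);

DECLARE @n AS INT = 1000000;

WITH
  L0 AS (SELECT 1 AS c UNION ALL SELECT 1),               -- 2 rows
  L1 AS (SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),  -- 4 rows
  L2 AS (SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),  -- 16 rows
  L3 AS (SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),  -- 256 rows
  L4 AS (SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),  -- 65,536 rows
  L5 AS (SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),  -- > 4 billion rows
  Nums AS (SELECT ROW_NUMBER() OVER(ORDER BY (SELECT 0)) AS n FROM L5)
INSERT INTO dbo.Nums(n)
  SELECT TOP (@n) n FROM Nums ORDER BY n;
```

Because TOP limits the scan, only @n row numbers are actually materialized despite the large potential row count of the last CTE.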
Comparing Periods
The first exercise involves multiple references to the same intermediate result set of a query. The task is to query the Orders table, and return for each order year the number of orders placed that year, and the difference from the number of orders placed in the previous year. Here’s the desired output when you run your solution against InsideTSQL2008:
orderyear   numorders   diff
----------- ----------- -----------
2006        152         NULL
2007        408         256
2008        270         -138
You could use a table expression representing yearly counts of orders, and join two instances of the table expression to match to each current year the previous year, so that you can calculate the difference. Here’s an example for implementing such an approach using a CTE:
SET STATISTICS IO ON;

WITH YearlyCounts AS
(
  SELECT YEAR(orderdate) AS orderyear, COUNT(*) AS numorders
  FROM Sales.Orders
  GROUP BY YEAR(orderdate)
)
SELECT C.orderyear, C.numorders, C.numorders - P.numorders AS diff
FROM YearlyCounts AS C
  LEFT OUTER JOIN YearlyCounts AS P
    ON C.orderyear = P.orderyear + 1;
Remember that a table expression is nothing but a reflection of the underlying tables. When you query two occurrences of the YearlyCounts CTE, both get expanded behind the scenes. All the work of scanning the data and aggregating it happens twice. You can see this clearly in the query’s execution plan shown in Figure 7-4.
Figure 7-4. Execution plan for a solution to the “comparing periods” exercise (using table expressions)
Scanning the base data from the clustered index involves 28,807 reads. Because the data was scanned twice, STATISTICS IO reports 57,614 reads. Scanning and aggregating the base data twice is clearly unnecessary. This is a scenario where you should consider using a temporary table or a table variable. When choosing between the two, remember that one of the things to consider is the size of the intermediate result set that will be stored in the temporary object. Because the intermediate result set here has only one row per year, it's going to be tiny—probably only one or two pages. In this case, it makes sense to use a table variable and benefit from the fact that it will not cause plan-optimality-related recompiles.
Here’s the solution using a table variable:
DECLARE @YearlyCounts AS TABLE
(
  orderyear INT PRIMARY KEY,
  numorders INT
);

INSERT INTO @YearlyCounts(orderyear, numorders)
  SELECT YEAR(orderdate) AS orderyear, COUNT(*) AS numorders
  FROM Sales.Orders
  GROUP BY YEAR(orderdate);

SELECT C.orderyear, C.numorders, C.numorders - P.numorders AS diff
FROM @YearlyCounts AS C
  LEFT OUTER JOIN @YearlyCounts AS P
    ON C.orderyear = P.orderyear + 1;
The work that includes scanning the base data and aggregating it happens only once and the tiny result is stored in a table variable. Then the last query joins two instances of the tiny table variable to produce the desired output. The execution plan for this solution is shown in Figure 7-5.
Figure 7-5. Execution plan for a solution to the “comparing periods” exercise (using table variables)
Because the base data from the clustered index on the Orders table was scanned only once, STATISTICS IO reports only about half the number of reads (28,621) compared to the previous solution. It also reports a very small number of reads (11) from the table variable.
Recent Orders
The task in the second exercise is to query the Orders table, and return for each customer the orders with the most recent order date for the customer. Here’s the desired output when you run your solution against InsideTSQL2008, shown in abbreviated form:
orderid     orderdate               custid      empid
----------- ----------------------- ----------- -----------
11044       2008-04-23 00:00:00.000 91          4
11005       2008-04-07 00:00:00.000 90          2
11066       2008-05-01 00:00:00.000 89          7
10935       2008-03-09 00:00:00.000 88          4
11025       2008-04-15 00:00:00.000 87          6
11046       2008-04-23 00:00:00.000 86          8
10739       2007-11-12 00:00:00.000 85          3
10850       2008-01-23 00:00:00.000 84          1
10994       2008-04-02 00:00:00.000 83          2
10822       2008-01-08 00:00:00.000 82          6
...
There are many ways to solve this problem, some of which I'll present here. Most solutions benefit from the following index, with custid and orderdate as the keys and empid and orderid as included columns:
CREATE INDEX idx_cid_od_i_eid_oid
  ON Sales.Orders(custid, orderdate)
  INCLUDE(empid, orderid);
The first solution I’ll present is one where I use a CTE to calculate the maximum order date per customer, and then in the outer query join the Orders table with the CTE to return the orders with the maximum order date for each customer, like so:
WITH CustMax AS
(
  SELECT custid, MAX(orderdate) AS mx
  FROM Sales.Orders
  GROUP BY custid
)
SELECT O.orderid, O.orderdate, O.custid, O.empid
FROM Sales.Orders AS O
  JOIN CustMax AS M
    ON O.custid = M.custid
    AND O.orderdate = M.mx;
Here the fact that a table expression is not materialized—rather its definition gets expanded—is an advantage. You might expect SQL Server to scan the data twice—once to process the inner reference to the Orders table in the CTE query, and another for the outer reference to the Orders table. But the optimizer figured out a way to handle this query by scanning the data only once, which is truly admirable. Figure 7-6 shows the execution plan the optimizer produced for this query.
Figure 7-6. Execution plan for a solution to the “recent orders” exercise (using a CTE and join)
The Index Seek operator against the index you just created seeks the last entry in the leaf of the index, and then starts scanning the leaf level backward. The Segment operator segments the rows by customer, and the Top operator filters only the rows with the maximum order date per customer. This is a very efficient plan that requires scanning the index you created earlier only once, in order. STATISTICS IO reports 3,231 reads, which is close to the number of pages in the leaf of the index.
If you implement a similar solution using a temporary table instead of the table expression, the data will have to be scanned more than once—once to produce the aggregated information you store in the temporary table, and again to process the outer reference to Orders representing the base data that you join with the temporary table. Here's an implementation of this approach:
CREATE TABLE #CustMax
(
  custid INT NOT NULL PRIMARY KEY,
  mx DATETIME NOT NULL
);

INSERT INTO #CustMax(custid, mx)
  SELECT custid, MAX(orderdate) AS mx
  FROM Sales.Orders
  GROUP BY custid;

SELECT O.orderid, O.orderdate, O.custid, O.empid
FROM Sales.Orders AS O
  JOIN #CustMax AS M
    ON O.custid = M.custid
    AND O.orderdate = M.mx;

DROP TABLE #CustMax;
The execution plan for this solution is shown in Figure 7-7.
Figure 7-7. Execution plan for a solution to the “recent orders” exercise (using temporary tables)
The first plan is for the population of the temporary table, and here you can see the first scan of the index you created earlier, plus aggregation of the data, and storing the result in the temp table's clustered index. The second plan is for the join query, showing that the base data from the index on Orders is scanned again, as well as the data from the temporary table, and the two inputs are joined using a merge join algorithm. STATISTICS IO reports 3,231 logical reads against Orders for the first plan, and for the second plan another 3,231 logical reads against Orders plus 28 logical reads against the temporary table.
Clearly, in this case, the approach using the table expression was more efficient. By the way, this problem has other solutions using table expressions. For example, the following solution uses the CROSS APPLY operator and a derived table:
SELECT A.*
FROM Sales.Customers AS C
  CROSS APPLY
    (SELECT TOP (1) WITH TIES orderid, orderdate, custid, empid
     FROM Sales.Orders AS O
     WHERE O.custid = C.custid
     ORDER BY orderdate DESC) AS A;
Figure 7-8 shows the execution plan for this query.
Figure 7-8. Execution plan for a solution to the “recent orders” exercise (using a derived table and APPLY)
As you can see, the plan scans the clustered index on the Customers table, and for each customer uses a seek operation against the nonclustered index on Orders to pull the orders that the current customer placed on its maximum order date. With a low density of customers—as in our case—this plan is less efficient than the one shown in Figure 7-6 for the previous solution based on a CTE. STATISTICS IO reports 31,931 reads from Orders, and those are random reads, unlike the sequential ones you got from the plan in Figure 7-6. The solution based on the APPLY operator excels when the density of customers is very high, because the number of seek operations is then accordingly small.
Finally, another solution based on table expressions that you might want to consider is one that assigns ranks to orders partitioned by customer—ordered by order date descending—and then filter only the rows with a rank value equal to 1. For a partitioned ranking calculation, the optimizer will only use an index and avoid sorting if the key columns have the same sorting direction in the index as they do in the ranking calculation’s OVER clause. Create a nonclustered index with orderdate descending like so:
CREATE INDEX idx_cid_od_i_eid_oidD
  ON Sales.Orders(custid, orderdate DESC)
  INCLUDE(empid, orderid);
Then you will get an efficient plan from the following solution:
WITH OrderRanks AS
(
  SELECT orderid, orderdate, custid, empid,
    RANK() OVER(PARTITION BY custid ORDER BY orderdate DESC) AS rnk
  FROM Sales.Orders
)
SELECT *
FROM OrderRanks
WHERE rnk = 1;
The plan is shown in Figure 7-9.
Figure 7-9. Execution plan for a solution to the “recent orders” exercise (using ranks and a CTE)
The efficiency of this plan is quite similar to that of the one shown earlier in Figure 7-6. Here as well, the nonclustered index is scanned once, in order. STATISTICS IO reports 3,231 logical reads, as expected. This plan, like the one shown in Figure 7-6, excels when the density of customers is low.
When you’re done, run the following code for cleanup:
DROP INDEX Sales.Orders.idx_cid_od_i_eid_oid;
DROP INDEX Sales.Orders.idx_cid_od_i_eid_oidD;
Relational Division
For the last summary exercise, you’re given the following task: you need to determine which customers have orders handled by the same set of employees. The result set should contain one row for each customer, with two columns: the customer ID and a value that identifies the group of employees that handled orders for the customer. The latter is expressed as the minimum customer ID out of all customers that share the same group of employees. That is, if customers A, B, and D were handled by one group of employees (for example, 3, 7, 9), and customers C and E by a different group (for example, 3 and 7), the result set would contain {(A, A), (B, A), (D, A), (C, C), (E, C)}. It will be convenient to use NULL instead of the minimum customer ID to identify the group of no employees for customers without orders. Following is the desired result against the InsideTSQL2008 database, shown here in abbreviated form:
custid      grp
----------- -----------
22          NULL
57          NULL
1           1
2           2
78          2
3           3
81          3
4           4
5           5
34          5
...
You can observe, for example, that customers 2 and 78 were handled by the same group of employees, because for both customers grp is 2. Remember that you should use the sample tables in the InsideTSQL2008 database only to check the accuracy of your result. For performance estimations, use the tables you created earlier in tempdb by running the code in Example 7-1. As before, the performance measures I mention for this problem were measured against the tables in tempdb.
The first solution doesn’t make any use of temporary objects; rather, it implements a classic relational division approach applying reverse logic with subqueries:
SELECT custid,
  CASE
    WHEN EXISTS(SELECT * FROM Sales.Orders AS O
                WHERE O.custid = C1.custid) THEN
      COALESCE(
        (SELECT MIN(C2.custid)
         FROM Sales.Customers AS C2
         WHERE C2.custid < C1.custid
           AND NOT EXISTS
             (SELECT * FROM Sales.Orders AS O1
              WHERE O1.custid = C1.custid
                AND NOT EXISTS
                  (SELECT * FROM Sales.Orders AS O2
                   WHERE O2.custid = C2.custid
                     AND O2.empid = O1.empid))
           AND NOT EXISTS
             (SELECT * FROM Sales.Orders AS O2
              WHERE O2.custid = C2.custid
                AND NOT EXISTS
                  (SELECT * FROM Sales.Orders AS O1
                   WHERE O1.custid = C1.custid
                     AND O1.empid = O2.empid))),
        custid)
  END AS grp
FROM Sales.Customers AS C1
ORDER BY grp, custid;
The query invokes a CASE expression for every customer from the Customers table (C1). The CASE expression invokes the COALESCE function for customers who placed orders, and returns NULL for customers who placed no orders. If the customer placed orders, COALESCE will substitute a NULL returned by the input expression with the current custid. The input expression will return the result of the following:
Return the minimum custid from a second instance of Customers (C2)
Where C2.custid (cust2) is smaller than C1.custid (cust1)
And you cannot find an employee in cust1’s orders that does not appear in cust2’s orders
And you cannot find an employee in cust2’s orders that does not appear in cust1’s orders
Logically, you could do without filtering cust2 < cust1, but this expression is used to avoid wasting resources. Anyway, you need to return the minimum custid out of the ones with the same employee list. If customer A has the same employee group as customer B, both will end up with a grp value of A. For customer B, there’s a point in comparing it to customer A (smaller ID), but for customer A there’s no point in comparing it to customer B (higher ID). Naturally, the minimum custid with a given employee group will not have the same employee group as any customers with smaller IDs. In such a case, the expression will return NULL, and the outer COALESCE will substitute the NULL with the current custid. As for the rest, it’s a classical phrasing of relational division with reverse logic.
This solution is expensive because of the excessive scan count, which has to do with the large number of invocations of the correlated subqueries. To give you a sense of the cost, this solution ran for over an hour before I gave up waiting and stopped it. Most standard set-based solutions you can come up with for this problem that don't use temporary objects will typically be expensive.
If you devise a solution in which you generate an interim set that can benefit from an index, you might want to consider using temporary tables. For example, you can materialize the distinct list of custid, empid values; index the temporary table; and continue from there. The materialized data would substantially reduce the number of rows in the set you’ll query. Still, you won’t be dealing with a tiny set, and most probably your solution will access the table multiple times. You want efficient plans to be generated based on distribution statistics and accurate cardinality information. All this should lead you to use a local temporary table and not a table variable.
Here’s a solution that first creates the suggested local temporary table, indexes it, and then queries it:
SELECT DISTINCT custid, empid
INTO #CustsEmps
FROM Sales.Orders;

CREATE UNIQUE CLUSTERED INDEX idx_cid_eid ON #CustsEmps(custid, empid);
GO

WITH Agg AS
(
  SELECT custid, MIN(empid) AS MN, MAX(empid) AS MX,
    COUNT(*) AS CN, SUM(empid) AS SM, CHECKSUM_AGG(empid) AS CS
  FROM #CustsEmps
  GROUP BY custid
),
AggJoin AS
(
  SELECT A1.custid AS cust1, A2.custid AS cust2, A1.CN
  FROM Agg AS A1
    JOIN Agg AS A2
      ON A2.custid <= A1.custid
      AND A2.MN = A1.MN
      AND A2.MX = A1.MX
      AND A2.CN = A1.CN
      AND A2.SM = A1.SM
      AND A2.CS = A1.CS
),
CustGrp AS
(
  SELECT cust1, MIN(cust2) AS grp
  FROM AggJoin AS AJ
  WHERE CN = (SELECT COUNT(*)
              FROM #CustsEmps AS C1
                JOIN #CustsEmps AS C2
                  ON C1.custid = AJ.cust1
                  AND C2.custid = AJ.cust2
                  AND C2.empid = C1.empid)
  GROUP BY cust1
)
SELECT custid, grp
FROM Sales.Customers AS C
  LEFT OUTER JOIN CustGrp AS G
    ON C.custid = G.cust1
ORDER BY grp, custid;
GO

DROP TABLE #CustsEmps;
I also used CTEs here to build the solution in a modular approach. The first CTE (Agg) groups the data from the temporary table by custid, and returns several aggregates based on empid for each customer (MIN, MAX, COUNT, SUM, CHECKSUM_AGG).
The second CTE (AggJoin) joins two instances of Agg (A1 and A2)—matching each customer in A1 to all customers in A2 with a lower custid that have the same values for all the aggregates. The purpose of comparing aggregates is to identify pairs of customers that potentially share the same group of employees. The reasoning behind the use of less than or equal to (<=) in the filter is similar to the one in the previous solution. That is, comparing groups of employees between customers when A2.custid (cust2) is greater than A1.custid (cust1) is superfluous.
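To see why the aggregates serve only as a preliminary filter, consider the hypothetical employee sets {1, 2, 6, 7} and {1, 3, 5, 7}: both have a minimum of 1, a maximum of 7, a count of 4, and a sum of 16, yet they are different sets. The following small sketch (the table variable and sample values are made up for illustration) demonstrates the collision:

```sql
-- Two different employee sets with identical MIN/MAX/COUNT/SUM fingerprints.
DECLARE @T TABLE(custid INT, empid INT);
INSERT INTO @T(custid, empid) VALUES
  (1, 1), (1, 2), (1, 6), (1, 7),
  (2, 1), (2, 3), (2, 5), (2, 7);

SELECT custid, MIN(empid) AS MN, MAX(empid) AS MX,
  COUNT(*) AS CN, SUM(empid) AS SM, CHECKSUM_AGG(empid) AS CS
FROM @T
GROUP BY custid;
-- Both customers yield MN = 1, MX = 7, CN = 4, SM = 16.
-- In this case CHECKSUM_AGG happens to differ, but it can collide as well,
-- which is why the exact verification join in CustGrp is still required.
```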
The third CTE (CustGrp) filters from AggJoin only pairs of customers that actually share the same group of employees, by verifying that the count of matching employees in both groups is identical to the total count of employees in each group by itself. The query aggregates the filtered rows by cust1, returning the minimum cust2 for each cust1. At this point, CustGrp contains the correct grp value for each customer.
Finally, the outer query performs a left outer join that adds customers without orders.
This solution runs for eight seconds. Note that you could use a CTE with the set of distinct custid, empid combinations instead of the temporary table #CustsEmps. This way, you could avoid using temporary tables altogether. I tested such a solution and it ran for about 12 seconds—50 percent longer than the solution that utilizes a temporary table. The advantage of the temporary table approach is that you can index it.
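For reference, here is a sketch of the CTE-only variant mentioned above, which replaces the temporary table with a CTE holding the distinct custid, empid combinations. This is my reconstruction under that description, not necessarily the exact code that was timed:

```sql
WITH CustsEmps AS
(
  -- Replaces the indexed temporary table with an inline table expression.
  SELECT DISTINCT custid, empid
  FROM Sales.Orders
),
Agg AS
(
  SELECT custid, MIN(empid) AS MN, MAX(empid) AS MX,
    COUNT(*) AS CN, SUM(empid) AS SM, CHECKSUM_AGG(empid) AS CS
  FROM CustsEmps
  GROUP BY custid
),
AggJoin AS
(
  SELECT A1.custid AS cust1, A2.custid AS cust2, A1.CN
  FROM Agg AS A1
    JOIN Agg AS A2
      ON A2.custid <= A1.custid
      AND A2.MN = A1.MN AND A2.MX = A1.MX
      AND A2.CN = A1.CN AND A2.SM = A1.SM AND A2.CS = A1.CS
),
CustGrp AS
(
  SELECT cust1, MIN(cust2) AS grp
  FROM AggJoin AS AJ
  WHERE CN = (SELECT COUNT(*)
              FROM CustsEmps AS C1
                JOIN CustsEmps AS C2
                  ON C1.custid = AJ.cust1
                  AND C2.custid = AJ.cust2
                  AND C2.empid = C1.empid)
  GROUP BY cust1
)
SELECT custid, grp
FROM Sales.Customers AS C
  LEFT OUTER JOIN CustGrp AS G
    ON C.custid = G.cust1
ORDER BY grp, custid;
```

Because the CTE is expanded wherever it is referenced rather than materialized, the distinct list cannot be indexed, which accounts for the slower run time reported above.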
Considering the fastest solution we had so far—the one utilizing a temporary table—is this really the best you can get? Apparently not. You can use the FOR XML PATH option to concatenate all distinct empid values per customer. You can then group the data by the concatenated string, and return for each customer the minimum custid within the group using the OVER clause. The fast and nifty concatenation technique was originally devised by Michael Rys and Eugene Kogan. The PATH mode provides an easier way to mix elements and attributes than the EXPLICIT directive. Here’s the complete solution:
WITH CustGroups AS
(
  SELECT custid,
    (SELECT CAST(empid AS VARCHAR(10)) + ';' AS [text()]
     FROM (SELECT DISTINCT empid
           FROM Sales.Orders AS O
           WHERE O.custid = C.custid) AS D
     ORDER BY empid
     FOR XML PATH('')) AS CustEmps
  FROM Sales.Customers AS C
)
SELECT custid,
  CASE WHEN CustEmps IS NULL THEN NULL
       ELSE MIN(custid) OVER(PARTITION BY CustEmps)
  END AS grp
FROM CustGroups
ORDER BY grp, custid;
The solution is short and slick, doesn’t use temporary tables at all, and runs for six seconds!