Temporary Tables and Table Variables in Microsoft SQL Server 2008 T-SQL Programming
- 9/23/2009
Table Variables
Table variables are probably among the least understood T-SQL elements. Many myths and misconceptions surround them, and these are embraced even by experienced T-SQL programmers. One widespread myth is that table variables are memory-resident only, without physical representation. Another myth is that table variables are always preferable to temporary tables. In this section, I’ll dispel these myths and explain the scenarios in which table variables are preferable to temporary tables, as well as the scenarios in which they aren’t. I’ll do so by first going through the fundamentals of table variables, just as I did with temporary tables, and follow with tangible examples.
You create a table variable using a DECLARE statement, followed by the variable name and the table definition. You then refer to it as you do with permanent tables. Here’s a very basic example:
DECLARE @T1 TABLE(col1 INT);
INSERT @T1 VALUES(1);
SELECT * FROM @T1;
Note that the table-valued parameters that were added in SQL Server 2008 are implemented internally like table variables. So the performance discussions in this section regarding table variables apply to table-valued parameters as well. Table-valued parameters were discussed earlier in the book in Chapter 3.
Limitations
Many limitations apply to table variables but not to temporary tables. In this section, I’ll describe some of them, whereas others will be described in dedicated sections.
You cannot create explicit indexes on table variables, only PRIMARY KEY and UNIQUE constraints, which create unique indexes underneath the covers. You cannot create non-unique indexes. If you need an index on a non-unique column, you must add attributes that make the combination unique and create a PRIMARY KEY or UNIQUE constraint on the combination.
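Here’s a minimal sketch of the workaround: suppose you want an index whose leading column is the non-unique column col2; including the primary key column col1 in a UNIQUE constraint makes the combination unique, so the constraint (and its underlying index) is allowed:

DECLARE @T1 TABLE
(
  col1 INT NOT NULL PRIMARY KEY,  -- unique index created underneath the covers
  col2 INT NOT NULL,              -- not unique on its own; no explicit index allowed
  UNIQUE(col2, col1)              -- adding col1 makes the combination unique
);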
You cannot alter the definition of a table variable once it is declared. This means that everything you need in the table definition must be included in the original DECLARE statement. This fact is limiting on one hand, but it also results in fewer recompilations. Remember that one of the triggers of recompilations is DDL changes.
You cannot issue SELECT INTO against a table variable; rather, you have to use INSERT SELECT instead. Prior to SQL Server 2008, this limitation put table variables at a disadvantage compared to temporary tables because SELECT INTO could be done as a minimally logged operation, whereas INSERT SELECT couldn’t. SQL Server 2008 introduces improvements in minimally logged operations, including the ability to process INSERT SELECT with minimal logging. I’ll demonstrate this capability later in the chapter.
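As a quick illustration (using the dbo.Nums helper table referenced later in this chapter), the SELECT INTO form is rejected, while INSERT SELECT works:

-- SELECT INTO a table variable is not allowed:
-- SELECT n INTO @T FROM dbo.Nums WHERE n <= 10;   -- fails

-- Use INSERT SELECT instead:
DECLARE @T TABLE(n INT NOT NULL PRIMARY KEY);
INSERT INTO @T(n)
  SELECT n FROM dbo.Nums WHERE n <= 10;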
You cannot qualify a column name with a nondelimited table variable name (as in @T1.col1). This is especially an issue when referring to a table variable’s column in correlated subqueries with column name ambiguity. To circumvent this limitation, you have to delimit the table variable name (as in [@T1].col1, or "@T1".col1).
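Here’s a minimal illustration; the outer column reference in the correlated subquery is qualified with the delimited variable name:

DECLARE @T1 TABLE(col1 INT NOT NULL);
DECLARE @T2 TABLE(col1 INT NOT NULL);

INSERT INTO @T1 VALUES(1);
INSERT INTO @T2 VALUES(1);

-- @T1.col1 would fail here; [@T1].col1 works:
SELECT col1
FROM @T1
WHERE EXISTS
  (SELECT * FROM @T2 WHERE col1 = [@T1].col1);

Aliasing the table variable in the FROM clause (FROM @T1 AS T1) and qualifying columns with the alias sidesteps the issue as well.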
In queries that modify table variables, parallel plans will not be used. Queries that only read from table variables can be parallelized.
tempdb
To dispel what probably is the most widespread myth involving table variables, let me state that they do have physical representation in tempdb, very similar to temporary tables.
As proof, run the following code that shows which temporary tables currently exist in tempdb by querying metadata info, creating a table variable, and querying metadata info again:
SELECT TABLE_NAME FROM tempdb.INFORMATION_SCHEMA.TABLES
WHERE TABLE_NAME LIKE '#%';
GO
DECLARE @T TABLE(col1 INT);
INSERT INTO @T VALUES(1);

SELECT TABLE_NAME FROM tempdb.INFORMATION_SCHEMA.TABLES
WHERE TABLE_NAME LIKE '#%';
When I ran this code, the first batch returned no output, whereas the second returned #0CBAE877, which is the name of the temporary table in tempdb that represents the table variable @T. Of course, you will probably get a different name when you run this code. But the point is to show that a hidden temporary table is created behind the scenes. Just like temporary tables, a table variable’s pages reside in cache when the table is small enough and when SQL Server has enough memory to spare. So the discussion about aspects of working with temporary tables with regard to tempdb applies to table variables as well.
Scope and Visibility
The scope of a table variable is well defined: the current level, and within it the current batch only, just as with any other variable. That is, a table variable is not accessible to inner levels, nor even to other batches within the same level. In short, you can use it only within the batch in which it was declared. This scope is much more limited than that of a local temporary table and is typically an important factor in choosing a temporary object type.
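The following short demonstration shows the batch boundary in action; the second reference fails because @T goes out of scope as soon as the batch that declared it ends:

DECLARE @T TABLE(col1 INT);
INSERT INTO @T VALUES(1);
SELECT col1 FROM @T;   -- works: same batch
GO
SELECT col1 FROM @T;   -- fails: @T is out of scope in this new batch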
Transaction Context
Unlike a temporary table, a table variable is not part of an outer transaction; rather, its transaction scope is limited to the statement level, to support statement rollback capabilities only. If you modify a table variable and the modification statement is aborted, the changes of that particular statement are undone. However, if the statement is part of an outer transaction that is rolled back, changes made to the table variable by statements that completed will not be undone. Table variables are unique in this respect.
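A minimal sketch of this behavior follows; after the rollback, the row inserted into the table variable survives, while the row inserted into the temporary table does not:

CREATE TABLE #T(col1 INT);
DECLARE @T TABLE(col1 INT);

BEGIN TRAN;
  INSERT INTO @T VALUES(1);
  INSERT INTO #T VALUES(1);
ROLLBACK TRAN;

SELECT COUNT(*) AS table_variable_rows FROM @T;  -- 1: not undone by the rollback
SELECT COUNT(*) AS temp_table_rows     FROM #T;  -- 0: insert rolled back

DROP TABLE #T;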
You can rely on this behavior to your advantage. For example, suppose that you need to write an audit trigger that audits changes against some table. If some logical condition is met, you want to roll back the change; however, you still want to audit the attempted change. If you copy data from inserted/deleted to your audit tables, a rollback in the trigger also undoes the audit writes. If you first roll back the change and then try to audit it, deleted and inserted are empty.
To solve the problem, you first copy data from inserted/deleted to table variables, issue a rollback, and then in a new transaction within the trigger, copy the data from the table variables to your audit tables. This is the simplest way around the problem.
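The following outline sketches the pattern. The audited table dbo.T1, the audit table dbo.T1_Audit, the columns keycol and datacol, and the condition that triggers the rollback are all hypothetical placeholders:

CREATE TRIGGER trg_T1_audit_update ON dbo.T1 AFTER UPDATE
AS
BEGIN
  -- Copy the attempted change to table variables before rolling back
  DECLARE @Deleted  TABLE(keycol INT, datacol VARCHAR(20));
  DECLARE @Inserted TABLE(keycol INT, datacol VARCHAR(20));

  INSERT INTO @Deleted (keycol, datacol) SELECT keycol, datacol FROM deleted;
  INSERT INTO @Inserted(keycol, datacol) SELECT keycol, datacol FROM inserted;

  IF EXISTS(SELECT * FROM inserted WHERE datacol = 'disallowed') -- placeholder condition
  BEGIN
    ROLLBACK TRAN; -- undoes the change; the table variables keep their rows

    BEGIN TRAN;    -- new transaction within the trigger for the audit writes
      INSERT INTO dbo.T1_Audit(keycol, old_datacol, new_datacol, audit_dt)
        SELECT D.keycol, D.datacol, I.datacol, CURRENT_TIMESTAMP
        FROM @Deleted AS D
          JOIN @Inserted AS I ON I.keycol = D.keycol;
    COMMIT TRAN;
  END
END;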
The unique transaction context of table variables has performance advantages over temporary tables because less logging and locking are involved.
Statistics
As I mentioned earlier, SQL Server doesn’t create distribution statistics or maintain accurate cardinality information for table variables as it does for temporary tables. This is one of the main factors you should consider when choosing a type of temporary object for a given task. The downside is that you might get inefficient plans when the optimizer needs to consult histograms to determine selectivity. This is especially a problem with big tables, where you might end up with excessive I/O. The upside is that table variables, for the very same reason, involve far fewer recompilations. Before making your choice, you need to figure out which is more expensive in the particular task for which you’re designating the temporary object.
To explain the statistics aspect of table variables in a more tangible way, I’ll show you some queries, their execution plans, and their I/O costs.
Examine the following code, and request an estimated execution plan for it from SQL Server Management Studio (SSMS):
DECLARE @T TABLE
(
  col1 INT NOT NULL PRIMARY KEY,
  col2 INT NOT NULL,
  filler CHAR(200) NOT NULL DEFAULT('a'),
  UNIQUE(col2, col1)
);

INSERT INTO @T(col1, col2)
  SELECT n, (n - 1) % 10000 + 1
  FROM dbo.Nums
  WHERE n <= 100000;

SELECT * FROM @T WHERE col1 = 1;
SELECT * FROM @T WHERE col1 <= 50000;
SELECT * FROM @T WHERE col2 = 1;
SELECT * FROM @T WHERE col2 <= 2;
SELECT * FROM @T WHERE col2 <= 5000;
You can find the code to create and populate the Nums table in Chapter 2.
The estimated execution plans generated for these queries are shown in Figure 7-1.
Figure 7-1. Estimated execution plans for queries against a table variable
The code creates a table variable called @T with two columns. The values in col1 are unique, and each value in col2 appears 10 times. The code creates two unique indexes underneath the covers: one on col1, and one on (col2, col1).
The first important thing to notice in the estimated plans is the number of rows the optimizer expects each operator to return: one row in all five cases, even when filtering a non-unique value or a range. Unless you filter a unique column, the optimizer simply cannot estimate the selectivity of the query for lack of statistics, so it assumes one row. This hard-coded estimate reflects the assumption that table variables are used only with small sets of data.
As for the efficiency of the plans, the first two queries get a good plan (seek, followed by a partial scan in the second query). But that’s because you have a clustered index on the filtered column, and the optimizer doesn’t need statistics to figure out what the optimal plan is in this case. However, with the third and fourth queries you get a table scan (an unordered clustered index scan) even though both queries are very selective and would benefit from using the index on (col2, col1), followed by a small number of lookups. The fifth query would benefit from a table scan because it has low selectivity. Fortunately, it got an adequate plan, but that’s by chance. To analyze I/O costs, run the code after turning on the SET STATISTICS IO option. The amount of I/O involved with each of the last three queries is 2,713 reads, which is equivalent to the number of pages consumed by the table.
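For reference, turning the option on and off looks like this; rerun the batch that declares, populates, and queries @T between the two statements and check the logical reads reported in the Messages tab:

SET STATISTICS IO ON;
GO
-- Rerun the DECLARE / INSERT / SELECT batch from the previous listing here
GO
SET STATISTICS IO OFF;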
Next, go through the same analysis process with the following code, which uses a temporary table instead of a table variable:
SELECT n AS col1, (n - 1) % 10000 + 1 AS col2,
  CAST('a' AS CHAR(200)) AS filler
INTO #T
FROM dbo.Nums
WHERE n <= 100000;

ALTER TABLE #T ADD PRIMARY KEY(col1);
CREATE UNIQUE INDEX idx_col2_col1 ON #T(col2, col1);
GO

SELECT * FROM #T WHERE col1 = 1;
SELECT * FROM #T WHERE col1 <= 50000;
SELECT * FROM #T WHERE col2 = 1;
SELECT * FROM #T WHERE col2 <= 2;
SELECT * FROM #T WHERE col2 <= 5000;
The estimated execution plans generated for these queries are shown in Figure 7-2 and Figure 7-3.
Figure 7-2. Estimated execution plans for queries 1, 2, and 3 against a temporary table
Figure 7-3. Estimated execution plans for queries 4 and 5 against a temporary table
As an aside, in case you’re curious about the Missing Index messages, SSMS 2008 reports this information in the graphical execution plan. Both SQL Server 2005 and SQL Server 2008 may enter a phase in optimization where they report missing index info. In both versions this information is available in the XML form of the execution plan. The new feature in SSMS 2008 is that it exposes this info graphically with the green-colored messages, whereas SSMS 2005 didn’t.
Now that statistics are available, the optimizer can make educated estimations. You can see that the estimated number of rows returned from each operator is more reasonable. You can also see that high-selectivity queries 3 and 4 use the index on (col2, col1), and the low-selectivity query 5 does a table scan, as it should.
STATISTICS IO reports dramatically reduced I/O costs for queries 3 and 4. These are 32 and 62 reads, respectively, against the temporary table versus 2,713 for each of these queries against the table variable.
When you’re done, drop #T for cleanup:
DROP TABLE #T;
Minimally Logged Inserts
As mentioned earlier, you can use SELECT INTO with temporary tables but not with table variables. With table variables you have to use INSERT SELECT instead. Prior to SQL Server 2008, INSERT SELECT involved more logging than SELECT INTO. This was true even with the reduced logging that happens with inserts against objects in tempdb. SQL Server 2008 adds the INSERT SELECT statement to the list of insertion methods that can be performed in a minimally logged mode, just like SELECT INTO.
I’ll demonstrate this capability through an example. I’ll insert data into the temporary object using SELECT INTO and INSERT SELECT in both SQL Server 2005 and SQL Server 2008. To figure out the amount of logging involved with the operation, I’ll query the undocumented fn_dblog function before and after the operation, and calculate the differences in terms of number of log records, and total record lengths, like so:
CHECKPOINT;
GO
DECLARE @numrecords AS INT, @size AS BIGINT;

SELECT
  @numrecords = COUNT(*),
  @size       = COALESCE(SUM([Log Record Length]), 0)
FROM fn_dblog(NULL, NULL) AS D;

-- <operation>

SELECT
  COUNT(*) - @numrecords AS numrecords,
  CAST((COALESCE(SUM([Log Record Length]), 0) - @size) / 1024. / 1024.
    AS NUMERIC(12, 2)) AS size_mb
FROM fn_dblog(NULL, NULL) AS D;
The first test is with the SELECT INTO statement that is processed with minimal logging in both SQL Server 2005 and SQL Server 2008, provided that the recovery model of the database is not set to FULL. As a reminder, tempdb’s recovery model is SIMPLE and cannot be changed. Here’s the code I used for this test:
USE tempdb;
CHECKPOINT;
GO
DECLARE @numrecords AS INT, @size AS BIGINT;

SELECT
  @numrecords = COUNT(*),
  @size       = COALESCE(SUM([Log Record Length]), 0)
FROM fn_dblog(NULL, NULL) AS D;

SELECT n, CAST('a' AS CHAR(2000)) AS filler
INTO #TestLogging
FROM dbo.Nums
WHERE n <= 100000;

SELECT
  COUNT(*) - @numrecords AS numrecords,
  CAST((COALESCE(SUM([Log Record Length]), 0) - @size) / 1024. / 1024.
    AS NUMERIC(12, 2)) AS size_mb
FROM fn_dblog(NULL, NULL) AS D;
GO

DROP TABLE #TestLogging;
As you can see, the operation is a SELECT INTO statement populating the temporary table #TestLogging with 100,000 rows by querying the Nums table. The output I got in SQL Server 2005 and SQL Server 2008 was similar:
numrecords  size_mb
----------- --------
9560        0.63
The number of log records is far lower than the number of rows inserted because only changes in allocation bitmaps (GAM, SGAM, PFS, IAM) were recorded in the log. Also, the total size recorded in the log is very small.
Next, I used the following code to test an INSERT SELECT against a table variable populating it with the same sample data used in the SELECT INTO test:
USE tempdb;
CHECKPOINT;
GO
DECLARE @numrecords AS INT, @size AS BIGINT;

SELECT
  @numrecords = COUNT(*),
  @size       = COALESCE(SUM([Log Record Length]), 0)
FROM fn_dblog(NULL, NULL) AS D;

DECLARE @TestLogging AS TABLE(n INT, filler CHAR(2000));

INSERT INTO @TestLogging(n, filler)
  SELECT n, CAST('a' AS CHAR(2000))
  FROM dbo.Nums
  WHERE n <= 100000;

SELECT
  COUNT(*) - @numrecords AS numrecords,
  CAST((COALESCE(SUM([Log Record Length]), 0) - @size) / 1024. / 1024.
    AS NUMERIC(12, 2)) AS size_mb
FROM fn_dblog(NULL, NULL) AS D;
GO
Here’s the output I got in SQL Server 2005, indicating more logging activity than the corresponding SELECT INTO method:
numrecords  size_mb
----------- --------
184394      12.92
In SQL Server 2008 the output of the INSERT SELECT method was similar to the output I got for the corresponding SELECT INTO test, indicating minimal logging in both cases:
numrecords  size_mb
----------- --------
9539        0.63
This improvement in SQL Server 2008 means that temporary tables no longer have an advantage over table variables in terms of the amount of logging, whether you use SELECT INTO against a temporary table or INSERT SELECT against a table variable.