Deferred Deallocations of Pages in TempDb

2012-02-19 - General, SQL Server Internals, TempDb

The other day I ran into a question on msdn:Tempdb Deallocation (dm_db_session_space_usage). A person with the pseudonym DontPageMeBro noticed an odd behavior in his SQL Server installation: The sum of the reported usage of tempdb across all sessions was often bigger than tempdb itself. Something is clearly off here, so I figured it was worth digging a little deeper.

TempDb Usage

First let us look at how to get information about tempdb usage. SQL Server provides two DMVs that return information about the space used in tempdb:

sys.dm_db_session_space_usage

and

sys.dm_db_task_space_usage

sys.dm_db_task_space_usage reports the amount of pages taken and released by a task currently running. If a request uses parallelism you might see more than one row for that request returned by this DMV.
After each task finishes, it's numbers get rolled up into the sys.dm_db_session_space_usage. So here you can see how much space in tempdb a request used in total, but this information is only available after the request is completed.

Both DMVs report the number of pages in four columns:

user_objects_alloc_page_count, 
user_objects_dealloc_page_count, 
internal_objects_alloc_page_count, 
internal_objects_dealloc_page_count

The two "user" columns report on spaced used by user created objects like #temptables. The two "internal" columns report on objects SQL Server decided to create to support the execution of a query. Both pair are made up of a "alloc" and a "dealloc" page count column. The former is the total count of all pages that where allocated in the scope of the DMV whereas the latter contains the total count of all deallocated pages.

So, to get the current user object space allocation in tempdb for a given session, you need to subtract the "user_objects_dealloc_page_count" from the "user_objects_alloc_page_count" in both DMVs and then SUM all those values together. The following query shows how to do that for user as well as for internal space used:

[sql] SELECT session_id ,
SUM(user_objects_alloc_page_count - user_objects_dealloc_page_count) CurrentUserPages,
SUM(internal_objects_alloc_page_count - internal_objects_dealloc_page_count) CurrentInternalPages
FROM ( SELECT session_id ,
user_objects_alloc_page_count ,
user_objects_dealloc_page_count ,
internal_objects_alloc_page_count ,
internal_objects_dealloc_page_count
FROM sys.dm_db_session_space_usage
UNION ALL
SELECT session_id ,
user_objects_alloc_page_count ,
user_objects_dealloc_page_count ,
internal_objects_alloc_page_count ,
internal_objects_dealloc_page_count
FROM sys.dm_db_task_space_usage
) X
GROUP BY X.session_id
ORDER BY X.session_id ASC ;
[/sql]

DontPageMeBro is dealing with the problem that some users create large #temptables and then leave the session open without dropping those tables. If you have a lot of users like that your tempdb might grow to a significant size. Because it is always advisable to have tempdb on very fast storage, this problem can mean a significant cost to the business, so it makes sense to follow up and see how big the problem really is. The sys.dm_db_session_space_usage can give us that information. Every time there is a session that has more allocated than deallocated pages, that session is holding on to a temporary object. If the last request for that session has finished a while ago, that means, that space is not actually in use and those objects should have been dropped.

Counting Inaccuracies

So far the theory. However, as DontPageMeBro noticed, if a #temptable is big it seems that SQL Server sometimes forgets to count the page deallocation. So, even though the #temptable is gone, the pages are still being counted as in use by the session that created the table.

To understand what is going on here, we need to look at how SQL Server handles page deallocations after dropping an object.

Each Table consists of at least one allocation_unit. An allocation_unit contains for example all the data pages for a given table or index. There might also be additional allocation_units holding special pages for e.g. row overflow data. Every index in turn has it's own set of allocation_units.

When dropping an object, all allocation_units belonging to that object are deallocated and all their pages are marked as free. This can take quite some time for big objects and usually this cleanup work is done synchronously, which means that the session issuing the drop request has to wait. However if the allocation unit contains more than 1024 pages, this cleanup work is deferred. The allocation_unit is marked as "DROPPED" and the link to the table is removed at the time the drop statement executes. The actual work of freeing up all those pages is then done by a cleanup process later on. This process is the same in all databases, not only tempdb.

Lets look at an example to demonstrate this:

[sql] CREATE TABLE #temp
(
id INT IDENTITY ,
name VARCHAR(100)
);
GO
SELECT allocation_unit_id
INTO #aus
FROM tempdb.sys.system_internals_allocation_units u
JOIN tempdb.sys.system_internals_partitions p ON u.container_id = p.partition_id
WHERE p.object_id = OBJECT_ID('tempdb..#temp');
GO
INSERT INTO #temp
SELECT TOP ( 328168 )
'FillText'
FROM sys.system_internals_partition_columns a ,
sys.system_internals_partition_columns;
GO
SELECT 'A' A , *
FROM sys.dm_db_session_space_usage
WHERE session_id = @@SPID;

SELECT 'A' A , u.* , p.rows
FROM tempdb.sys.system_internals_allocation_units u
JOIN tempdb.sys.system_internals_partitions p ON u.container_id = p.partition_id
JOIN #aus t ON u.allocation_unit_id = t.allocation_unit_id;
GO
DROP TABLE #temp;
GO
SELECT 'B' B , *
FROM sys.dm_db_session_space_usage
WHERE session_id = @@SPID;

SELECT 'B' B , u.*
FROM tempdb.sys.system_internals_allocation_units u
JOIN #aus t ON u.allocation_unit_id = t.allocation_unit_id;
IF ( @@ROWCOUNT > 0 )
BEGIN
WAITFOR DELAY '00:00:06' ;
SELECT 'C' C , *
FROM sys.dm_db_session_space_usage
WHERE session_id = @@SPID;

SELECT 'C' C , *
FROM tempdb.sys.system_internals_allocation_units u
JOIN #aus t ON u.allocation_unit_id = t.allocation_unit_id;
END ;
GO

DROP TABLE #aus;
[/sql]

This script creates the table #temp, fills it with a few thousand rows and then drops it again. If you run this script you will get the following output back:

Screenshot showing a direct allocation_unit drop

The first result-set shows that the current session has 1019 pages currently reserved in tempdb. The second result-set tells us that the table #temp contains 328168 rows that are stored in a single allocation unit of type "IN_ROW_DATA" with 1018 pages.

The third and forth result-set were captured after the table was dropped. Here we can see that 1017 pages were deallocated in tempdb for the current connection and that the allocation_unit is now gone. (That not all pages of #temp show as being dropped in the count is due to another optimization that is not part of today's discussion.)

If we change the number of rows inserted (line 14 in above script) from 328168 to 328169 we get this output instead:

Screenshot showing a deferred allocation_unit drop

This time the allocation_unit has 1026 pages, so we would expect the deferred drop to take place. Result-set 4 now shows, directly after the drop table took place, that the allocation_unit is still there, but it's type changed to "DROPPED". The link to the table it once belonged to is also removed, as is visible in the container_id column.

If the above script sees that the allocation_unit still exists after the drop (see line 37) it waits for a few seconds and then looks again. In result-set 6 you can now see that the allocation_unit got removed. However the current session still shows 1027 allocated and no deallocated pages (result-set 5).

Cleaning Up

To see the cleanup process in action you can use the following script:

[sql] CREATE TABLE #temp
(
id INT IDENTITY ,
name VARCHAR(100)
);
GO
SELECT allocation_unit_id
INTO #aus
FROM tempdb.sys.system_internals_allocation_units u
JOIN tempdb.sys.system_internals_partitions p ON u.container_id = p.partition_id
WHERE p.object_id = OBJECT_ID('tempdb..#temp');
GO
INSERT INTO #temp
SELECT TOP ( 328169 )
'FillText'
FROM sys.system_internals_partition_columns a ,
sys.system_internals_partition_columns;
GO
SELECT 'A' A , *
FROM sys.dm_db_session_space_usage
WHERE session_id = @@SPID;

SELECT 'A' A , u.* , p.rows
FROM tempdb.sys.system_internals_allocation_units u
JOIN tempdb.sys.system_internals_partitions p ON u.container_id = p.partition_id
JOIN #aus t ON u.allocation_unit_id = t.allocation_unit_id;
GO
DROP TABLE #temp;

SELECT IDENTITY(INT,1,1) Id,SYSDATETIME()now, u.allocation_unit_id , data_pages , type_desc
INTO #siau
FROM tempdb.sys.system_internals_allocation_units u
JOIN #aus t ON u.allocation_unit_id = t.allocation_unit_id ;

SELECT IDENTITY(INT,1,1) Id,SYSDATETIME()now, *
INTO #ssu
FROM sys.dm_db_session_space_usage
WHERE user_objects_dealloc_page_count>0 OR session_id = @@SPID
ORDER BY session_id;

SELECT IDENTITY(INT,1,1) Id,SYSDATETIME()now, tsu.*,r.status,r.command
INTO #tsu
FROM sys.dm_db_task_space_usage tsu
JOIN sys.dm_exec_requests r
ON tsu.session_id = r.session_id
WHERE user_objects_dealloc_page_count>0
ORDER BY session_id;

DECLARE @AuId BIGINT = (SELECT TOP(1) allocation_unit_id FROM #aus);
DECLARE @Stop DATETIME2 = DATEADD(SECOND,10,SYSDATETIME());
WHILE (SYSDATETIME()<@Stop)
BEGIN
INSERT INTO #siau
SELECT SYSDATETIME()now, u.allocation_unit_id , data_pages , type_desc
FROM tempdb.sys.system_internals_allocation_units u
WHERE u.allocation_unit_id = @AuId;

INSERT INTO #ssu
SELECT SYSDATETIME()now, *
FROM sys.dm_db_session_space_usage
WHERE user_objects_dealloc_page_count>0 OR session_id = @@SPID
ORDER BY session_id;

INSERT INTO #tsu
SELECT SYSDATETIME()now, tsu.*,r.status,r.command
FROM sys.dm_db_task_space_usage tsu
JOIN sys.dm_exec_requests r
ON tsu.session_id = r.session_id
AND tsu.request_id = r.request_id
WHERE user_objects_dealloc_page_count>0
ORDER BY session_id;
END

SELECT MIN(now) MinNow,MAX(now) MaxNow,type_desc,data_pages,allocation_unit_id
FROM #siau
GROUP BY type_desc,data_pages,allocation_unit_id
ORDER BY MIN(Id) ASC;

SELECT MIN(now) MinNow,MAX(now) MaxNow,session_id,database_id,user_objects_alloc_page_count,user_objects_dealloc_page_count,internal_objects_alloc_page_count,internal_objects_dealloc_page_count
FROM #ssu
GROUP BY session_id,database_id,user_objects_alloc_page_count,user_objects_dealloc_page_count,internal_objects_alloc_page_count,internal_objects_dealloc_page_count
ORDER BY session_id ASC, MIN(Id) ASC;

SELECT MIN(now) MinNow,MAX(now) MaxNow,status,command,session_id,request_id,exec_context_id,database_id,user_objects_alloc_page_count,user_objects_dealloc_page_count,internal_objects_alloc_page_count,internal_objects_dealloc_page_count
FROM #tsu
GROUP BY status,command,session_id,request_id,exec_context_id,database_id,user_objects_alloc_page_count,user_objects_dealloc_page_count,internal_objects_alloc_page_count,internal_objects_dealloc_page_count
ORDER BY session_id ASC, MIN(Id) ASC;

GO
DROP TABLE #ssu;
DROP TABLE #tsu;
DROP TABLE #aus;
DROP TABLE #siau;
[/sql]

This script again creates the same table #temp that was used by the previous script, fills it and drops it. After dropping the table it records for ten seconds information from the allocation_unit DMV and the two space_usage DMVs. After those ten seconds it produces this output:

Screenshot showing a deferred allocation_unit drop in more detail

The first two result-sets show us the same information about the table that we have seen before. The other three result-sets contain grouped information of the three recorded DMVs. Each is grouped by all columns but the time of capture. That means, each row shows when those values where visible first and last, so you can see the progression of events.

The third result-set tells us that right after dropping the table the allocation_unit was marked as dropped. It at this point also shows a size of 0 pages. Just a little under a second later, at 17:18:06.1417576, the allocation_unit was seen last. At 17:18:06.1387573, just barely a millisecond after dropping the table, session 8 suddenly appears and starts to deallocate pages – pages it clearly did not allocate – until it, at 17:18:06.1417576, has deallocated all 1025 pages that belonged to our allocation_unit. That is exactly the same time when the allocation_unit itself was seen last. That this process started right after dropping the table is a coincidence and you might see several seconds pass before it starts, when you try this out yourself.

During the whole time of the recording from the table drop to ten seconds later, the information returned by sys.dm_db_session_space_usage DMV did not change at all. So, while the cleanup process can be seen to deallocate pages, that information never gets rolled up into the session space DMV. The session that originally created the table also never gets updated.

Versions

I have tested this in the following versions of SQL Server:

Microsoft SQL Server 2005 - 9.00.5057.00 (X64)
Microsoft SQL Server 2008 (SP3) - 10.0.5500.0 (X64)
Microsoft SQL Server 2008 R2 (SP1) - 10.50.2500.0 (X64)
Microsoft SQL Server 2012 RC0 - 11.0.1750.32 (X64)

All SQL Server versions listed above show the same behavior.

Conclusion

I consider this behavior a bug in SQL Server. As the last example showed, the allocation_unit gets cleaned up by a background process. At that point the link to the original table and with it the originating session is gone, so the background process cannot update the session space usage information. However, right at the time of the drop, when the allocation_unit is marked as DROPPED, it also shows a size of 0 pages. At this point the task space usage information for the request executing the drop should have been updated to also reflect the drop. The pages could at the same time appear in the allocated column of one of the "TASK MANAGER" requests that execute the background cleanup tasks to make sure that the total of all allocated and deallocated pages in those two DMVs always matches the current state of tempdb.

I created a Connect item with Microsoft for this behavior here:. If you agree that this should be addressed, please go there and vote for it.

Categories: General, SQL Server Internals, TempDb

Leave a Reply