Purging a Table based on a Fixed Number of Rows to Keep

2012-02-07 - Archiving and Purging, General, Performance

Purging old data out of the database usually is driven by a date. If the business does not need records that are older than a Year, and if your records have a date column, you can purge by deleting all records from that table that have a date that is older ("<") than DATEADD(year,-1,GETDATE())

Sometimes however the purging is driven by a fixed number of rows that need to be kept. In this post I am going to show how to do this type of purging most efficiently.

For this article I am assuming that the table that needs to be purged has a clustered index on an IDENTITY column. The final solution I am going to present does not rely on that fact. However if the order of rows for the purging process is not given by the clustered index, another method might be more efficient.

To start let us create a table and insert 10000 rows into it:

[sql] IF OBJECT_ID('dbo.ToBePurged') IS NOT NULL DROP TABLE dbo.ToBePurged;
CREATE TABLE dbo.ToBePurged(Id INT IDENTITY(1,1) PRIMARY KEY CLUSTERED, PlaceHolderForOtherColumns CHAR(400));
INSERT INTO dbo.ToBePurged(PlaceHolderForOtherColumns) SELECT TOP(10000) 'Fill' FROM sys.system_internals_partition_columns a,sys.system_internals_partition_columns b;
[/sql]

One of the ways to go about this purge is to find the maximum IDENTITY value in the table and subtract the number of rows to keep from it. That will give you the largest IDENTITY value that needs to be deleted:

[sql] DECLARE @NoRowsToKeep INT = 4864;
DELETE dbo.ToBePurged WHERE Id <= (SELECT MAX(Id) - @NoRowsToKeep FROM dbo.ToBePurged);
[/sql]

This looks neat, however there is a problem: If there is a gap in the IDENTITY values in the table you will end up deleting to many rows:

[sql] IF OBJECT_ID('dbo.ToBePurged') IS NOT NULL DROP TABLE dbo.ToBePurged;
CREATE TABLE dbo.ToBePurged(Id INT IDENTITY(13,5) PRIMARY KEY CLUSTERED, PlaceHolderForOtherColumns CHAR(400));
INSERT INTO dbo.ToBePurged(PlaceHolderForOtherColumns) SELECT TOP(10000) 'Fill' FROM sys.system_internals_partition_columns a,sys.system_internals_partition_columns b;
GO
DECLARE @NoRowsToKeep INT = 4864;
DELETE dbo.ToBePurged WHERE Id <= (SELECT MAX(Id) - @NoRowsToKeep FROM dbo.ToBePurged);
SELECT COUNT(1) Cnt, @NoRowsToKeep NoRowsToLeave FROM dbo.ToBePurged;
[/sql]

In this example I created the table with an IDENTITY step size greater than one. This might seem arbitrary, however keep in mind that the increment of the IDENTITY value for a table never gets rolled back. So if you ever had a failing insert you also have gaps.

The result of the last query in the above example shows that there are only 973 rows left in the table instead of the expected 4864. So we need to get a little more sophisticated:

[sql] DECLARE @NoRowsToKeep INT = 4864;
SET STATISTICS IO ON;
DELETE dbo.ToBePurged WHERE Id NOT IN (SELECT TOP(@NoRowsToKeep) Id FROM dbo.ToBePurged ORDER BY Id DESC);
SET STATISTICS IO OFF;
SELECT COUNT(1) Cnt, @NoRowsToKeep NoRowsToLeave FROM dbo.ToBePurged;
[/sql]

This, and all following examples assume that the table was recreated as in the previous example. That ensures that the numbers we are going to look at are comparable.

This solution reads a total of 21412 pages while using two Clustered Index Scan RelOps and a Sort:

Table 'ToBePurged'. Scan count 2, logical reads 11060, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 1, logical reads 10352, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Execution Plan 1 - click to enlarge

Execution Plan 1 - click to enlarge

We can replace the second Scan RelOp with a Seek RelOp:

[sql] DECLARE @NoRowsToKeep INT = 4864;
SET STATISTICS IO ON;
DELETE dbo.ToBePurged WHERE Id < (SELECT MIN(Id) FROM (SELECT TOP(@NoRowsToKeep) Id FROM dbo.ToBePurged ORDER BY Id DESC)X);
SET STATISTICS IO OFF;
SELECT COUNT(1) Cnt, @NoRowsToKeep NoRowsToLeave FROM dbo.ToBePurged;
[/sql]

This solution reads a total of 211156 pages while using a Clustered Index Scan and a Clustered Index Seek RelOp without a Sort:

Table 'ToBePurged'. Scan count 2, logical reads 10804, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 1, logical reads 10352, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Execution Plan 2 - click to enlarge

Execution Plan 2 - click to enlarge

If you look closely, the seek is still a scan, it just starts in the middle of the table instead of at the beginning. To find the starting point, SQL Server executes a seek operation, hence the Clustered Index Seek RelOp in the plan.

The biggest issue with both solutions is however the need of a spool. SQL Server uses an Eager Spool RelOp when it needs to delete rows from a table based on the results of a complex (enough) query. So how can we simplify this query?

When SQL Server executes the previous query, it actually does one complete table scan: The first Scan RelOp that is followed by a top reads through the first 4864 rows to determine the MIN(Id) value. The Seek RelOp then finds the first row that has an Id value smaller than that MIN value and continues to read through all remaining rows from there. The work performed is therefore one complete scan through the table.

It should reduce the complexity of this query significantly, if we could do both steps with one Clustered Index Scan RelOp. We basically want SQL Server to read through all the rows in one ordered swoop and start deleting once we are past the rows we need to keep. We can achieve that with the ROW_NUMBER() function:

[sql] DECLARE @NoRowsToKeep INT = 4864;
SET STATISTICS IO ON;
DELETE tbp
FROM (SELECT ROW_NUMBER() OVER(ORDER BY Id DESC) rn FROM dbo.ToBePurged) tbp
WHERE tbp.rn > @NoRowsToKeep;
SET STATISTICS IO OFF;
SELECT COUNT(1) Cnt, @NoRowsToKeep NoRowsToLeave FROM dbo.ToBePurged;
[/sql]

This solution now reads only a total of 10801 pages using only one Clustered Index Scan RelOp:

Table 'ToBePurged'. Scan count 1, logical reads 10801, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Execution Plan 3 - click to enlarge

Execution Plan 3 - click to enlarge

This query is making use of the fact, that SQL Server knows to delete the rows in the underlying table even so the FROM clause is referencing a sub-select. Because the sub-select does not reference any columns of the table, you do not need to worry about name collisions either. The above query works even, if the table contains a column with the name 'rn'.

Categories: Archiving and Purging, General, Performance

Leave a Reply