Purging old data out of the database usually is driven by a date. If the business does not need records that are older than a Year, and if your records have a date column, you can purge by deleting all records from that table that have a date that is older ("<") than DATEADD(year,-1,GETDATE())
Sometimes however the purging is driven by a fixed number of rows that need to be kept. In this post I am going to show how to do this type of purging most efficiently.
For this article I am assuming that the table that needs to be purged has a clustered index on an IDENTITY column. The final solution I am going to present does not rely on that fact. However if the order of rows for the purging process is not given by the clustered index, another method might be more efficient.
To start let us create a table and insert 10000 rows into it:
One of the ways to go about this purge is to find the maximum IDENTITY value in the table and subtract the number of rows to keep from it. That will give you the largest IDENTITY value that needs to be deleted:
This looks neat, however there is a problem: If there is a gap in the IDENTITY values in the table you will end up deleting to many rows:
In this example I created the table with an IDENTITY step size greater than one. This might seem arbitrary, however keep in mind that the increment of the IDENTITY value for a table never gets rolled back. So if you ever had a failing insert you also have gaps.
The result of the last query in the above example shows that there are only 973 rows left in the table instead of the expected 4864. So we need to get a little more sophisticated:
This, and all following examples assume that the table was recreated as in the previous example. That ensures that the numbers we are going to look at are comparable.
This solution reads a total of 21412 pages while using two Clustered Index Scan RelOps and a Sort:
Table 'ToBePurged'. Scan count 2, logical reads 11060, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'Worktable'. Scan count 1, logical reads 10352, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
We can replace the second Scan RelOp with a Seek RelOp:
This solution reads a total of 211156 pages while using a Clustered Index Scan and a Clustered Index Seek RelOp without a Sort:
Table 'ToBePurged'. Scan count 2, logical reads 10804, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'Worktable'. Scan count 1, logical reads 10352, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
If you look closely, the seek is still a scan, it just starts in the middle of the table instead of at the beginning. To find the starting point, SQL Server executes a seek operation, hence the Clustered Index Seek RelOp in the plan.
The biggest issue with both solutions is however the need of a spool. SQL Server uses an Eager Spool RelOp when it needs to delete rows from a table based on the results of a complex (enough) query. So how can we simplify this query?
When SQL Server executes the previous query, it actually does one complete table scan: The first Scan RelOp that is followed by a top reads through the first 4864 rows to determine the MIN(Id) value. The Seek RelOp then finds the first row that has an Id value smaller than that MIN value and continues to read through all remaining rows from there. The work performed is therefore one complete scan through the table.
It should reduce the complexity of this query significantly, if we could do both steps with one Clustered Index Scan RelOp. We basically want SQL Server to read through all the rows in one ordered swoop and start deleting once we are past the rows we need to keep. We can achieve that with the ROW_NUMBER() function:
This solution now reads only a total of 10801 pages using only one Clustered Index Scan RelOp:
Table 'ToBePurged'. Scan count 1, logical reads 10801, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
This query is making use of the fact, that SQL Server knows to delete the rows in the underlying table even so the FROM clause is referencing a sub-select. Because the sub-select does not reference any columns of the table, you do not need to worry about name collisions either. The above query works even, if the table contains a column with the name 'rn'.