Merge Wonders – Insert or Use

2013-02-05 - General, Hints, MERGE, Performance, T-SQL Statements

Introduction

When working with clients I often see code that follows the insert-or-use pattern. Let's assume we have a dbo.Product table:

[sql] CREATE TABLE dbo.Product
(
Id INT NOT NULL IDENTITY(1, 1) PRIMARY KEY CLUSTERED,
Name NVARCHAR(50) NOT NULL,
ProductNumber NVARCHAR(25)
);
[/sql]

Often the pattern is implemented like this:

[sql] CREATE PROCEDURE dbo.GetOrCreateProduct
@ProductName NVARCHAR(50),
@ProductNumber NVARCHAR(25),
@Id INT OUTPUT
AS
BEGIN
IF EXISTS(SELECT 1 FROM dbo.Product WHERE ProductNumber = @ProductNumber)
BEGIN
SELECT @Id = Id FROM dbo.Product WHERE ProductNumber = @ProductNumber;
END
ELSE
BEGIN
INSERT INTO dbo.Product(Name, ProductNumber)VALUES(@ProductName, @ProductNumber);
SELECT @id = @@IDENTITY;
END;
END;
[/sql]

There are several issues with this implementation. To start the table is accessed twice in all cases. If there is no index on ProductNumber, that will cause disastrous performance, but even with an index, we should avoid double work whenever possible. Even more important, in the time between the first and the second read, someone else could have inserted a row already. Depending on if there is a unique constraint on ProductNumber this will then either cause a duplicate row to be created or it will cause this code to fail. Also, there is a whole slew of issues with @@IDENTITYitself that I am not going to cover today. Refer back to my Identity Crisis post for more information.

The Merge Statement to the rescue

There are several options to deal with the problems described above. One of the more elegant ones is the use of the MERGE statement.

The MERGE statement that was introduced in SQL Server 2008 is intended to be used in an INSERT or UPDATE situation. So how does it help in this situation, where no UPDATE is involved?

Let's address the issues of the solution above one at a time to see how the MERGE statement can help. Because of the issues described in the Identity Crisis post, it is recommended to use the output clause to get the value of the identity column of a newly inserted row like this:

[sql] CREATE PROCEDURE dbo.GetOrCreateProduct
@ProductName NVARCHAR(50),
@ProductNumber NVARCHAR(25),
@Id INT OUTPUT
AS
BEGIN
DECLARE @Ids TABLE(Id INT);

INSERT INTO dbo.Product(Name, ProductNumber)
OUTPUT(INSERTED.IDENTITYCOL)INTO @Ids
VALUES(@ProductName, @ProductNumber);

SELECT @Id = Id FROM @Ids;
END;
[/sql]

The double step of storing the identity value in a table variable and then reading it back out to use it is necessary, because SQL Servers OUTPUT clause cannot directly be used to set the value of a variable.

This takes care of the insert portion, but we don't want to insert a new row if one exists already. That is where MERGE gets introduced:

[sql] CREATE PROCEDURE dbo.GetOrCreateProduct
@ProductName NVARCHAR(50),
@ProductNumber NVARCHAR(25),
@Id INT OUTPUT
AS
BEGIN
DECLARE @Ids TABLE(Id INT);

MERGE dbo.Product AS p
USING (VALUES(@ProductName, @ProductNumber))n(Name,ProductNumber)
ON p.ProductNumber = n.ProductNumber
WHEN NOT MATCHED THEN
INSERT(Name, ProductNumber)
VALUES(n.Name, n.ProductNumber)
OUTPUT(INSERTED.IDENTITYCOL)INTO @Ids
;

SELECT @Id = Id FROM @Ids;
END;
[/sql]

Now a new row is only inserted no matching row exists. If a matching row exists the MERGE statement does nothing. That however means that in that case the OUTPUT clause does not return any rows, so we won't be able to retrieve the correct Id. To get the OUTPUT clause to return the required information, we just need to add an UPDATE into the mix. If you have a column that you would like to be updated on ever access (like a column with the time of the last access) this is easily done. But what can we do if we do not actually want to change a column value? We can just assign a column to itself as in SET p.ProductNumber = p.ProductNumber. SQL Server is clever enough to not actually execute the write if no changes happened. But this is a little dangerous, because the next person looking at this code might think that this was an oversight and just take out the UPDATE that clearly isn't doing anything. So I like to be a little more obvious about this and instead use the ability of the UPDATE statement to set variables like this:

[sql] CREATE PROCEDURE dbo.GetOrCreateProduct
@ProductName NVARCHAR(50),
@ProductNumber NVARCHAR(25),
@Id INT OUTPUT
AS
BEGIN
DECLARE @Ids TABLE(Id INT);
DECLARE @dummy INT;

MERGE dbo.Product AS p
USING (VALUES(@ProductName, @ProductNumber))n(Name,ProductNumber)
ON p.ProductNumber = n.ProductNumber
WHEN MATCHED THEN
UPDATE SET @dummy = @dummy
WHEN NOT MATCHED THEN
INSERT(Name, ProductNumber)
VALUES(n.Name, n.ProductNumber)
OUTPUT(INSERTED.IDENTITYCOL)INTO @Ids
;

SELECT @Id = Id FROM @Ids;
END;
[/sql]

While it is still a little obfuscated what is happening here, at least it should be clear that the line of code with the UPDATE was not an accident. You could even be more verbose and write something like this: UPDATE SET @this_is_a_required_update = 1

Danger, Concurrency!

The biggest problem with the original code is that it did not prevent concurrent inserts of the same value. If two connections at the same time execute the procedure for the same product, they both will not find the product at about the same time and then insert the product also at the same time. If there is a unique constraint on the column this will cause unexpected errors. If there is no unique constraint this will cause row duplicates, which usually is even worse.

The MERGE statement was written for cases where a row might be inserted if it does not yet exists. So you would expect that it could handle this case automatically. However, under the covers MERGE is just a lookup potentially followed by an insert. This is probably due to the MERGE statement being targeted at whole table synchronizations. The name itself hints at that. In that case high frequency calls with the same values are highly unlikely and any measures to prevent the side-effects of such will unnecessarily take additional resources. That means that in our case we need to take one additional step:

[sql] CREATE PROCEDURE dbo.GetOrCreateProduct
@ProductName NVARCHAR(50),
@ProductNumber NVARCHAR(25),
@Id INT OUTPUT
AS
BEGIN
DECLARE @Ids TABLE(Id INT);
DECLARE @dummy INT;

MERGE dbo.Product WITH(HOLDLOCK) AS p
USING (VALUES(@ProductName, @ProductNumber))n(Name,ProductNumber)
ON p.ProductNumber = n.ProductNumber
WHEN MATCHED THEN
UPDATE SET @dummy = @dummy
WHEN NOT MATCHED THEN
INSERT(Name, ProductNumber)
VALUES(n.Name, n.ProductNumber)
OUTPUT(INSERTED.IDENTITYCOL)INTO @Ids
;

SELECT @Id = Id FROM @Ids;
END;
[/sql]

The only difference to the previous version is the WITH(HOLDLOCK) table hint right after the table name. The HODLOCK table hint causes SQL Server to access that one table as if the current isolation level was set to serializable. That does prevent any insert collisions effectively. However, HOLDLOCK makes use of range locks. Range locks require the search argument to be indexed, otherwise SQL Server will effectively have to take a table lock, as there is nothing of which a range could be taken. If you set the transaction isolation level to serializable and run a query that requires a table scan, SQL Server will actually take a table lock instead of many range locks. However, this is not the case when using the MERGE statement together with the HOLDLOCK table hint. SQL Server will instead take a range lock on every row. That has the same effect as a table lock but at a much higher cost. So it is imperative that you have appropriate indexes in place when using this solution. But that index is one that you should have anyway if you are executing this query often.

Summary

While not directly designed for this use case, the SQL Server MERGE statement gives us an easy and – if used correctly – safe way to deal with an Insert or Use scenario. In this scenario a row's primary key needs to be retrieved if the row exists, otherwise the row needs to be created and the newly generated identity value needs to be returned.

While there are many ways to solve this problem and to circumvent all the pitfalls associated with it, MERGE offers one of the most elegant solutions.

Categories: General, Hints, MERGE, Performance, T-SQL Statements

Leave a Reply