Home » Blog » A Join A Day – How to Test a Join

A Join A Day – How to Test a Join

2012-12-11 - A Join A Day, General, Series, T-SQL Tuesday, TDD, tSQLt

Introduction

This is the eleventh post in my A Join A Day series about SQL Server Joins. But it is also my own contribution to this month's T-SQL Tuesday. Make sure to let me know how I am doing or ask your burning join related questions by leaving a comment below.

Before diving into semi joins over the next few days I would like to take today to talk about a rarely written about topic: How to best test a query that includes a join.

When unit testing T-SQL code, two of the most burning questions are: "How do I identify the first test?" and "How do I know that I have written enough tests?"

These questions are in fact so important that often in my classes a student brings them up well before we have covered the material necessary to talk about them. That also means that this is not a beginner article. While it is not required, to get the most out of this article you should have tried unit-testing your database code before.

Heuristics

The real answer to either of the two questions is not "It depends". However, it is a similarly dissatisfying "With Experience". But there are a few guidelines that you can use while you are working on collecting that experience. Those guidelines are called test case heuristics. A few of them are introduced here: Test Case Heuristics. Today I am going to focus on how to use the 0-1-some heuristics in the special case of testing a join.

0-1-some

The "0-1-some" heuristic helps us with the selection of test cases when the code is dealing with sets of some sort. A table in a database is a prime example of a set. So every time you are dealing with data in a table, this is one of the heuristics to look at.

The 0-1-some heuristic tells us that we need to look at three cases: The empty set, a set containing one element and a set containing "some" elements. In the case of a database table a row would be considered an "element", so we need to have a test for an empty table, a table with one row and a table with a few rows.

When we are dealing with joins we have two sets as input, so we have to adapt the heuristic a little. Let's look at an example.

Join Testing

In this example I am going to follow Test Driven Design (TDD). If you are not bought into that type of testing yet, that is okay. You still can use what we discuss here today in other types of unit-testing. All examples of this article are using the tSQLt database unit testing framework. tSQLt is open source and you can download it at http://tSQLt.org. There you can also find the documentation and a quick start guide. For a more in-depth walk through check out Test Driven Database Development with tSQLt by Greg Lucas (B|T).

We are going to write a report view against the dbo.Customers and the dbo.Orders tables. To keep this exercise a little simpler I created my own tables instead of using the AdventureWorks provided ones:

[sql] CREATE TABLE dbo.Customers(
Id INT PRIMARY KEY CLUSTERED,
Name NVARCHAR(50)
);

CREATE TABLE dbo.Orders(
Id INT PRIMARY KEY CLUSTERED,
CustomerId INT CONSTRAINT [FK: dbo.Orders -> dbo.Customers] REFERENCES dbo.Customers(Id),
OrderDate DATETIME2,
TotalAmount DECIMAL(13,4)
);
[/sql]

The report is supposed to return a row for each customer together with the customer's order count and the total amount of all orders. If you want to follow along, make sure to install tSQLt into the database that you created above tables in first. You can get it at http://tSQLt.org.

The First Test

Getting started is always particularly difficult. Once you are in the flow of things, finding the next test becomes a lot easier. That is true even when you are experienced in TDD. So we should try to make that first step as easy as possible. One of the rules in TDD is to always do the next simplest thing. The easiest test to write is probably on that checks that the view exists. However, I tend to add just a little more juice to that first test and assert that I can use the view. The simplest way to do that is to test that the view is not returning a row when the tables are empty. This is incidentally also the first test that the "0-1-some" heuristic suggests. In tSQLt that test looks like this:

[sql] EXEC tSQLt.NewTestClass 'OrderReportTests';
GO
CREATE PROC OrderReportTests.[test: no rows are returned when tables are empty] AS
BEGIN
EXEC tSQLt.FakeTable 'dbo.Customers';
EXEC tSQLt.FakeTable 'dbo.Orders';

SELECT *
INTO #actual
FROM dbo.OrderReport;

SELECT TOP(0) * INTO #expected FROM #actual;

EXEC tSQLt.AssertEqualsTable '#expected','#actual';
END
GO
[/sql]

While you usually should avoid to use a SELECT *, in this test it is actually appropriate. Not relying on a fixed list of columns makes this test more robust, as it won't break just because we change a column name or remove a column.

Now we need to compile and execute this test. To execute all tests in a database using tSQLt you just have to execute EXEC tSQLt.RunAll;. Because you want to run the tests as often as possible with minimal disruption to your workflow it is a good idea to define a keyboard shortcut for this command:

Executing our first test produces an error:

The above test failure is expected, because the view does not exist at this point. Let's fix that:

[sql] CREATE VIEW dbo.OrderReport
AS
SELECT TOP(0) 0 CustomerId;
[/sql]

The test did not tell us anything about what columns that view is going to have, so I just picked one that seemed to make sense. The next test will most likely tell us if I was right with that guess, so I did not sweat about it too much.

With this simple test we have done a few important things. We have defined the name of the object and we have shown how we want to use it – we want to be able to write a select statement against it. So this simple test not only covers the first case of the "0-1-some" heuristic, by testing the object against empty tables. We also made some important design decisions while writing it.

After creating the view as shown above, we can rerun the test. This time it will pass:

The TDD cycle always goes through a failing test first. The second phase is to make the test pass with minimal work. The third phase is called refactoring. The word "to refactor" describes a process of cleaning up the design of software without (!) changing its functionality. The most important things to look out for here are to use good names and to reduce duplication.

So far things look pretty good, so I won't do any refactoring at this time. Let's move on to the next test instead.

One Customer

The "0-1-some" heuristic tells us that we need to have three test cases for every set we are dealing with. When dealing with multiple sets we need to test each combination of cases. We have two input sets in our case so we need to cover nine test scenarios.

The next smallest step is to design what should happen if there is one customer that has not ordered anything yet:

[sql] CREATE PROC OrderReportTests.[test: one row is returned for single customer without orders] AS
BEGIN
EXEC tSQLt.FakeTable 'dbo.Customers';
EXEC tSQLt.FakeTable 'dbo.Orders';

INSERT INTO dbo.Customers(Id,Name)VALUES(13,'John');

SELECT CustomerId, CustomerName, OrderCount, TotalAmount
INTO #actual
FROM dbo.OrderReport;

SELECT TOP(0) * INTO #expected FROM #actual;

INSERT INTO #expected(CustomerId, CustomerName, OrderCount, TotalAmount)
VALUES(13, 'John', NULL, NULL);

EXEC tSQLt.AssertEqualsTable '#expected','#actual';

END
[/sql]

Now we have defined that the view should have at least for columns. We also have defined that a customer without orders still will be returned with the order related columns all valued NULL.

However, this test does not even compile:

The reason is that SQL Server does see that the referenced view exists, however without the referenced columns. In any situation where we have a test that does not compile, we should take the smallest step that gets that test to compile without worrying about making it pass:

[sql] IF OBJECT_ID('dbo.OrderReport') IS NOT NULL DROP VIEW OrderReport;
GO
CREATE VIEW dbo.OrderReport
AS
SELECT TOP(0) 0 CustomerId, '' CustomerName, 0 OrderCount, 0 TotalAmount;
[/sql]

By just providing some columns with the referenced names we are now able to compile and run the test and see it failing for the right reason:

Now let's come up with the smallest step to make this new test pass while not breaking the previous one:

[sql] IF OBJECT_ID('dbo.OrderReport') IS NOT NULL DROP VIEW OrderReport;
GO
CREATE VIEW dbo.OrderReport
AS
SELECT Id AS CustomerId, Name AS CustomerName, NULL OrderCount, NULL TotalAmount
FROM dbo.Customers;
[/sql]

None of the existing tests forced us yet to reference the Orders table, so we should refrain from adding this functionality at this point. The above code will make both tests pass:

Now we have some duplication to clean up. Both tests fake the Orders as well as the Customers table. These two lines will most likely be in every other test we are going to write today. Moving that code into a helper method reduces the amount of work we have to do, if something around this task changes. If we have to add a third table to the view later on, we can just change the single helper method instead of having to change every single test. Let's call this helper method OrderReportTests.FakeAllTables:

[sql] EXEC tSQLt.NewTestClass 'OrderReportTests';
GO
CREATE PROCEDURE OrderReportTests.FakeAllTables
AS
BEGIN
EXEC tSQLt.FakeTable 'dbo.Customers';
EXEC tSQLt.FakeTable 'dbo.Orders';
END;
GO
CREATE PROC OrderReportTests.[test: no rows are returned when tables are empty] AS
BEGIN
EXEC OrderReportTests.FakeAllTables;

SELECT *
INTO #actual
FROM dbo.OrderReport;

SELECT TOP(0) * INTO #expected FROM #actual;

EXEC tSQLt.AssertEqualsTable '#expected','#actual';
END
GO
CREATE PROC OrderReportTests.[test: one row is returned for single customer without orders] AS
BEGIN
EXEC OrderReportTests.FakeAllTables;

INSERT INTO dbo.Customers(Id,Name)VALUES(13,'John');

SELECT CustomerId, CustomerName, OrderCount, TotalAmount
INTO #actual
FROM dbo.OrderReport;

SELECT TOP(0) * INTO #expected FROM #actual;

INSERT INTO #expected(CustomerId, CustomerName, OrderCount, TotalAmount)
VALUES(13, 'John', NULL, NULL);

EXEC tSQLt.AssertEqualsTable '#expected','#actual';

END
[/sql]

After every refactoring step we need to make sure, we did not break anything by running the tests:

That looks good, so let's move on to the next test. We covered the "1 customer 0 orders" case. The next one could either be "some - 0" or "1 - 1". I feel more like "1 - 1" for now:

[sql] CREATE PROC OrderReportTests.[test: Count and Total are correct for single order] AS
BEGIN
EXEC OrderReportTests.FakeAllTables;

INSERT INTO dbo.Customers(Id,Name)VALUES(13,'John');
INSERT INTO dbo.Orders(Id, CustomerId, TotalAmount)VALUES(1, 13, 23.98);

SELECT CustomerId, CustomerName, OrderCount, TotalAmount
INTO #actual
FROM dbo.OrderReport;

SELECT TOP(0) * INTO #expected FROM #actual;

INSERT INTO #expected(CustomerId, CustomerName, OrderCount, TotalAmount)
VALUES(13, 'John', 1, 23.98);

EXEC tSQLt.AssertEqualsTable '#expected','#actual';

END
GO
[/sql]

As always, the first step is to make sure that the new test is failing for the right reason:

It fails as expected, as we are not referencing the Orders table yet. So, again let's find the simples way to make all now three tests pass:

[sql] IF OBJECT_ID('dbo.OrderReport') IS NOT NULL DROP VIEW OrderReport;
GO
CREATE VIEW dbo.OrderReport
AS
SELECT c.Id AS CustomerId, c.Name AS CustomerName, o.Id OrderCount, o.TotalAmount
FROM dbo.Customers AS c
OUTER APPLY dbo.Orders AS o;
[/sql]

I opted for the outer apply command, as it allows me to not worry about the join condition yet. It acts like a cross join with the difference that the left side rows are always returned, even if the right side table is empty. That makes all three tests pass:

Currently all our method names look good and I don't see any duplication, so let's move on. With the direction we chose for the third test, it makes sense for the fourth on to cover the "1 - some" case:

[sql] CREATE PROC OrderReportTests.[test: aggregates over multiple orders for single customer] AS
BEGIN
EXEC OrderReportTests.FakeAllTables;

INSERT INTO dbo.Customers(Id,Name)VALUES(13,'John');
INSERT INTO dbo.Orders(Id, CustomerId, TotalAmount)VALUES
(113, 13, 23.98),
(213, 13, 14.52),
(313, 13, 11.48);

SELECT CustomerId, CustomerName, OrderCount, TotalAmount
INTO #actual
FROM dbo.OrderReport;

SELECT TOP(0) * INTO #expected FROM #actual;

INSERT INTO #expected(CustomerId, CustomerName, OrderCount, TotalAmount)
VALUES(13, 'John', 3, 49.98);

EXEC tSQLt.AssertEqualsTable '#expected','#actual';

END
[/sql]

As always, first check that it fails:

This forces us now to actually think about aggregation. But at this point that is not too hard to add:

[sql] IF OBJECT_ID('dbo.OrderReport') IS NOT NULL DROP VIEW OrderReport;
GO
CREATE VIEW dbo.OrderReport
AS
SELECT c.Id AS CustomerId, c.Name AS CustomerName, oo.OrderCount, oo.TotalAmount
FROM dbo.Customers AS c
OUTER APPLY (
SELECT COUNT(1) AS OrderCount, SUM(TotalAmount) AS TotalAmount
FROM dbo.Orders AS o
) AS oo;
[/sql]

Let' see if that worked.

Oops. Now the second test is failing because we expected a NULL in the count column but got a 0. But let's think about that for a second. Getting a count of 0 makes a lot more sense when there are 0 orders for a given customer than getting a count of NULL which would really mean: "I don't know if there are orders…"

So, the fourth test showed us that we made a bad design decision in test two. Let's correct that:

[sql] CREATE PROC OrderReportTests.[test: one row is returned for single customer without orders] AS
BEGIN
EXEC OrderReportTests.FakeAllTables;

INSERT INTO dbo.Customers(Id,Name)VALUES(13,'John');

SELECT CustomerId, CustomerName, OrderCount, TotalAmount
INTO #actual
FROM dbo.OrderReport;

SELECT TOP(0) * INTO #expected FROM #actual;

INSERT INTO #expected(CustomerId, CustomerName, OrderCount, TotalAmount)
VALUES(13, 'John', 0, NULL);

EXEC tSQLt.AssertEqualsTable '#expected','#actual';

END
[/sql]

With that change in place all tests are passing again:

Did you notice that we had to change only one test to correct our mistake? That is something you should always aim for when writing tests. If a business rule changes, you want the number of tests that need to change to be minimal. One way to achieve this goal is to only include the data in the test that is pertinent to the current test scenario.

After having all tests pass, let's look at the design again. It still looks fairly clean. There is maybe a better way to name the table aliases, but for me the pain is currently not big enough to change those. However, if you had columns that don't use a column alias yet, this would be the time to fix those. The bigger the query grows the harder it gets to deal with those "optical orphans" at least as a human.

Some Customers

Now that we have "0 - 0", "1 - 0", "1 - 1" and "1 - some" covered it is time to look at the "some customers" cases:

[sql] CREATE PROC OrderReportTests.[test: aggregates multiple customers correctly] AS
BEGIN
EXEC OrderReportTests.FakeAllTables;

INSERT INTO dbo.Customers(Id,Name)VALUES(13,'John'),(42,'Jules'),(57,'Lex');
INSERT INTO dbo.Orders(Id, CustomerId, TotalAmount)VALUES
(113, 13, 23.98),
(213, 13, 14.52),
(313, 13, 11.48),
(142, 42, 39.87);

SELECT CustomerId, CustomerName, OrderCount, TotalAmount
INTO #actual
FROM dbo.OrderReport;

SELECT TOP(0) * INTO #expected FROM #actual;

INSERT INTO #expected(CustomerId, CustomerName, OrderCount, TotalAmount) VALUES
(13, 'John', 3, 49.98),
(42, 'Jules', 1, 39.87),
(57, 'Lex', 0, NULL);

EXEC tSQLt.AssertEqualsTable '#expected','#actual';

END
[/sql]

This test creates tree customers and assigns 0, 1 and some orders to them respectively. This effectively combines all the "some - *" cases into one. This is a fairly big step and if you feel it is too big to take at once you can add smaller tests in between. However, this combination is still an important test to see that the view can handle such a mixed environment correctly.

Running the test shows it failing, as expected:

This is caused by the missing join condition that I did not want to think about earlier, so let's add that in:

With that in place all tests will pass again:

At this point you might feel the urge to improve the performance by replacing the APPLY with a join. However, I would not jump to conclusions just yet. This query can actually be faster than an equivalent join given the right circumstances, so I would give it a try in your performance test environment first. If it misbehaves, it is easy to change and all the tests will instantly reassure you that you did not inadvertently break anything.

Other Scenarios

There are two cases left that we did not cover yet: "0 - 1" and "0 - some"

We have a foreign key defined between the two tables so these two scenarios cannot happen in real life. However, if someone later on decides that they would like to allow the concept of an orphaned order and therefore removes the foreign key, it is not unlikely that you will overlook this reporting view while hunting down all the CRUD procedures that would be affected by the change. Therefore it is a good idea to document that we are relying on that foreign key to exist.

You however do not want that information hidden in some comment in the code or in some document that nobody will remember to look at the critical time. Instead you want to write an additional automated test that enforces the existence of the foreign key:

[sql] CREATE PROC OrderReportTests.[test: FK exists] AS
BEGIN
EXEC tSQLt.AssertObjectExists 'dbo.[FK: dbo.Orders -> dbo.Customers]',
'The dbo.OrderReport view relies on functionality provided by the FK, but ';
END
[/sql]

This test will pass right away:

This is actually a situation you want to avoid. Now you do not know if the test passed because the FK exists, or because the test did not look for the right thing. That is a mistake that is easy to make, so it is a good practice to make sure the test works by forcing a failure. In this case we can just change the expected name of the FK slightly for example by taking out the "s" at the end. That will produce the expected failure:

Don't forget to change the test back when you are done.

The above test only documents that the view might produce unwanted results when the foreign key is removed. It just acts as a reminder. As such it belongs to the group of tests covering the current view.

It does not test that the foreign key is working correctly. As this FK is an important part of your product, you do want to write tests for that. Those tests should probably live in their own schema. I am not going to cover how to test constraints in this article however. There are a few examples on the tSQLt.org site.

For completeness I am including here the full set of tests, all in one T-SQL script:

SELECT *
INTO #actual
FROM dbo.OrderReport;

SELECT TOP(0) * INTO #expected FROM #actual;

EXEC tSQLt.AssertEqualsTable '#expected','#actual';
END
GO
CREATE PROC OrderReportTests.[test: one row is returned for single customer without orders] AS
BEGIN
EXEC OrderReportTests.FakeAllTables;

INSERT INTO dbo.Customers(Id,Name)VALUES(13,'John');

SELECT CustomerId, CustomerName, OrderCount, TotalAmount
INTO #actual
FROM dbo.OrderReport;

SELECT TOP(0) * INTO #expected FROM #actual;

INSERT INTO #expected(CustomerId, CustomerName, OrderCount, TotalAmount)
VALUES(13, 'John', 0, NULL);

EXEC tSQLt.AssertEqualsTable '#expected','#actual';

END
GO
CREATE PROC OrderReportTests.[test: Count and Total are correct for single order] AS
BEGIN
EXEC OrderReportTests.FakeAllTables;

INSERT INTO dbo.Customers(Id,Name)VALUES(13,'John');
INSERT INTO dbo.Orders(Id, CustomerId, TotalAmount)VALUES(1, 13, 23.98);

SELECT CustomerId, CustomerName, OrderCount, TotalAmount
INTO #actual
FROM dbo.OrderReport;

SELECT TOP(0) * INTO #expected FROM #actual;

INSERT INTO #expected(CustomerId, CustomerName, OrderCount, TotalAmount)
VALUES(13, 'John', 1, 23.98);

EXEC tSQLt.AssertEqualsTable '#expected','#actual';

END
GO

CREATE PROC OrderReportTests.[test: aggregates over multiple orders for single customer] AS
BEGIN
EXEC OrderReportTests.FakeAllTables;

INSERT INTO dbo.Customers(Id,Name)VALUES(13,'John');
INSERT INTO dbo.Orders(Id, CustomerId, TotalAmount)VALUES
(113, 13, 23.98),
(213, 13, 14.52),
(313, 13, 11.48);

SELECT CustomerId, CustomerName, OrderCount, TotalAmount
INTO #actual
FROM dbo.OrderReport;

SELECT TOP(0) * INTO #expected FROM #actual;

INSERT INTO #expected(CustomerId, CustomerName, OrderCount, TotalAmount)
VALUES(13, 'John', 3, 49.98);

EXEC tSQLt.AssertEqualsTable '#expected','#actual';

END
GO
CREATE PROC OrderReportTests.[test: aggregates multiple customers correctly] AS
BEGIN
EXEC OrderReportTests.FakeAllTables;

SELECT CustomerId, CustomerName, OrderCount, TotalAmount
INTO #actual
FROM dbo.OrderReport;

SELECT TOP(0) * INTO #expected FROM #actual;

INSERT INTO #expected(CustomerId, CustomerName, OrderCount, TotalAmount) VALUES
(13, 'John', 3, 49.98),
(42, 'Jules', 1, 39.87),
(57, 'Lex', 0, NULL);

EXEC tSQLt.AssertEqualsTable '#expected','#actual';

END
GO

CREATE PROC OrderReportTests.[test: FK exists] AS
BEGIN
EXEC tSQLt.AssertObjectExists 'dbo.[FK: dbo.Orders -> dbo.Customers]',
'The dbo.OrderReport view relies on functionality provided by the FK, but ';
END
GO
[/sql]

Summary

In this article we walked through developing a view containing a single join. We followed Test-Driven-Design and throughout the example we saw how the "0-1-some" heuristic can guide us in selecting the right test cases. We ended up with only six test cases providing a complete coverage of the functionality of this join.