A Join A Day – A Self-Join Experience

2012-12-10 - A Join A Day, Fundamentals, General, Series

Introduction

This is the tenth post in my A Join A Day series about SQL Server Joins. Make sure to let me know how I am doing or ask your burning join related questions by leaving a comment below.

A self-join is not something that is provided by a special join command. It's just a term for the pretty mundane idea of joining a table to itself. But rest assured – there is more behind it…

Self-Join Example

A self-join is often used when dealing with hierarchical data. For example if you are tasked to write a report of all employees and you need to list the manager for each of them you have to join the table with all employees to itself. SQL Server 2008 introduced a new data type to handle hierarchies: HIERARCHYID. The Employees table in AdventureWorks2008R2 uses that data type to define the company structure. So, to identify any given employee's manager we have to use the GetAnchestor method on the OrganizationNode column. We can then use that value to join back to the Employee table like this:

SELECT  
  Empl.LoginID AS EmplLogin, Empl.OrganizationLevel AS EmplLevel, Empl.JobTitle AS EmplJob,
  Mngr.LoginID AS MngrLogin, Mngr.OrganizationLevel AS MngrLevel, Mngr.JobTitle AS MngrJob
FROM HumanResources.Employee AS Empl
INNER JOIN HumanResources.Employee AS Mngr
ON Empl.OrganizationNode.GetAncestor(1) = Mngr.OrganizationNode;

The Employee table does not contain the employees actual name. We could use the ufnGetContactInformation function to get that information using a CROSS APPLY (see A Join A Day – The Cross Apply for details). However, for this simple example let's assume that the employee's and manager's login names are enough. The result looks like this:

A self-join example

Self-Join Syntax

Any time you join a table to itself you are creating a self-join. There is no special syntax around this concept and you can use all the join commands we have talked about before. Additionally there is one type of self-join that does happen implicitly and does not require any join command to be spelled out at all. More about that in the next section.

Self-Join Operator

There is also no special operator that would handle self-joins. Any of the join operators we have seen before can be used in a self-join plan. The execution plan of above employee query looks like this:

Execution plan for self-join query on hierarchyid

You can clearly see a merge join operator retrieving rows from two scans of the Employee table. Because of the special properties of the HIERARCHYID data type, SQL Server decided that the cheapest way to execute this query is to sort both inputs on the OrganizationNode column and use the merge join algorithm. The four Compute Scalar operators are in the execution plan to calculate the OrganizationLevel which is actually a computed column.

The above example is a pretty straight forward self-join query with no surprises. However, there are two other types of self-joins. The first one you might have heard of or seen before. It happens when SQL Server decides to use an index that is not covering. A covering index is an index that contains all columns that the current query is asking for. If a non-covering index is used, SQL Server has to get the remaining columns from the base table. Let's look at an example:

SELECT * FROM Person.Person WHERE LastName = 'Blue'; 

The execution plan of this simple query is shown below.

A key-lookup execution plan

SQL Server is using the index on LastName to find the 30 Persons with a last name of "Blue". Those rows are then joined with a Nested Loops (Inner Join) operator to the base table to get the remaining columns for each of them. The operator to access the base table is a Key Lookup operator in this case. A key lookup is however just a seek into the clustered index. So under the covers this is a real join operation.

The other implicit join I would like to show you is rarely seen. If a query asks for a subset of the columns of a table and there is no single index that covers that query, but there are two indexes that together cover all the columns, SQL Server might decide to join those two indexes together instead of accessing the base table. To see an example of this, we first have to add an additional index to the Person table.

CREATE INDEX [IX_Person_LastName_Suffix_Title] ON Person.Person(LastName,Suffix,Title);

After that we rerun the above query again, slightly modified to include only columns from this and one other index:

SELECT  Title,
        FirstName,
        MiddleName,
        LastName,
        Suffix
FROM Person.Person WHERE LastName = 'Blue';

To not pollute your copy of AdventureWorks2008R2, drop the index after you are done.

DROP INDEX Person.Person.IX_Person_LastName_Suffix_Title;

This query uses the following execution plan:

An index to index join

SQL Server is accessing two non-clustered indexes and joins the results with a Hash Match (Inner Join) operator. The underlying clustered index is never touched.

Summary

A self-join is not a new command. It just describes a query that joins a table to itself. There are two forms of self-joins. The explicit self-join is often used with tables that contain some kind of hierarchy. The implicit self-join on the other hand is used by SQL Server to reduce query cost. It usually joins an index to the base table using a lookup operator. But it can also happen that two non-clustered indexes are joined with each other.

A Join A Day

This post is part of my December 2012 "A Join A Day" blog post series. You can find the table of contents with all posts published so far in the introductory post: A Join A Day – Introduction. Check back there frequently throughout the month.

Categories: A Join A Day, Fundamentals, General, Series

Leave a Reply