This is the second post in my A Join A Day series about SQL Server Joins. Make sure to let me know how I am doing or ask your burning join related questions by leaving a comment below.
In the SQL language, joins are used to combine the data from one table with the data of another table based on a condition. The most common join type is the inner join. An inner join can be used to retrieve all records from one table and for each of them all related records from another table.
Let's look at an example. Information about orders in the AdventureWorks2008R2 database is split into two tables: Sales.SalesOrderHeader and Sales.SalesOrderDetail. If you want to return for each order all order line items together with information about the order they are part of you need to pull related records, meaning records belonging to the same order, from both tables into one result set. That is exactly what the join operator is doing:
SELECT TOP (100) soh.SalesOrderID, soh.OrderDate, soh.ShipDate, sod.SalesOrderDetailID, sod.OrderQty, sod.ProductID, sod.UnitPrice FROM Sales.SalesOrderHeader AS soh INNER JOIN Sales.SalesOrderDetail AS sod ON soh.SalesOrderID = sod.SalesOrderID ORDER BY soh.SalesOrderID DESC, sod.SalesOrderDetailID;
This query takes each row from the SalesOrderHeader table and then finds all rows in the SalesOrderDetail table that match on the value in the SalesOrderId column. It then returns three columns from the OrderHeader table and four from the OrderDetail table for each matched pair:
In the result each row from the SalesOrderHeader table is repeated multiple times to be paired with each matching row from the SalesOrderDetail table.
Most often an inner join is used to link up tables that have some kind of parent child relationship. However you can join on any kind of column. To show what happens if non unique values are involved in a join condition I have created two tables:
CREATE TABLE dbo.TblA(TblA_Val INT, TblA_Desc VARCHAR(MAX)); CREATE TABLE dbo.TblB(TblB_Val INT, TblB_Desc VARCHAR(MAX)); INSERT INTO dbo.TblA(TblA_Val, TblA_Desc) VALUES (1,'A-1'), (2,'A-2'), (3,'A-3 (1st)'), (3,'A-3 (2nd)'), (4,'A-4 (1st)'), (4,'A-4 (2nd)'), (5,'A-5'); INSERT INTO dbo.TblB(TblB_Val, TblB_Desc) VALUES (2,'B-2'), (3,'B-3'), (4,'B-4 (1st)'), (4,'B-4 (2nd)'), (5,'B-5 (1st)'), (5,'B-5 (2nd)'), (6,'B-6');
Let's look at the result of the following inner join query:
SELECT * FROM dbo.TblA AS A INNER JOIN dbo.TblB AS B ON A.TblA_Val = B.TblB_Val;
The result is shown below:
As you can see, any row from TblA gets returned as often as there are matching rows in TblB. There is one row with TblA_Val = 5 in TblA. However, there are two rows with TblB_Val = 5 in TblB. That means there are two rows for these in the result with the values from TblA repeated.
The same is true in the other direction too as you can quickly see when looking at the rows with Val = 3. When multiple rows with the same value exist on both sides, every row from the one side will get matched with every row from the other side. Val = 4 shows an example of that.
The ON keyword specifies the condition that must be fulfilled for rows to be considered a match. It is comparable to the WHERE clause. Like the WHERE clause it specifies a condition to filter out rows in the final result. However, there is an important difference. The condition in the ON clause is validated during the time of the join operation whereas the condition in the where clause is validated at the end of the query execution. In the case of a simple inner join the difference is not significant. In more complex join scenarios like outer joins or semi joins however it can make a huge difference where the condition is placed. Check out the next few posts in this series to see examples.
The INNER keyword in above query is optional and most people leave it out. The ON clause allows to include multiple checks just like the WHERE clause so if you need to match rows based on the values of more than one column just use AND or OR to combine multiple logical conditions. Full details about possible syntax forms of joins can be found here: http://msdn.microsoft.com/en-us/library/ms177634.aspx.
The order of the two tables does not matter for an inner join. In fact, SQL Server might decide to change the order during creation of the execution plan if it thinks the result will use fewer resources during execution.
You can have as many join clauses in a single query as you like. So to get the name of the product in above query instead of the rather meaningless ProductId, you could use just that value to join to the Product table:
SELECT TOP (100) soh.SalesOrderID, soh.OrderDate, soh.ShipDate, sod.SalesOrderDetailID, sod.OrderQty, p.Name AS ProductName, sod.UnitPrice FROM Sales.SalesOrderHeader AS soh INNER JOIN Sales.SalesOrderDetail AS sod ON soh.SalesOrderID = sod.SalesOrderID INNER JOIN Production.Product AS p ON sod.ProductID = p.ProductID ORDER BY soh.SalesOrderID DESC, sod.SalesOrderDetailID;
The number of tables in a single select is not limited by SQL Server. However, it is limited by the available resources (see http://msdn.microsoft.com/en-us/library/ms143432.aspx). The more tables are involved in a query the harder it gets for SQL Server's optimizer to find a good execution plan, which can lead to suboptimal plans or in extreme cases even compilation errors. I often find queries at clients that join to some tables unnecessarily. Removing those tables from a query can greatly help with performance.
To execute joins, SQL Server has three operators at its disposal: The Nested Loop Join, the Hash Join and the Merge Join. Which one SQL Server picks for a query depends on its estimates for the number of rows coming from each of the tables. What the differences between these operators exactly are will be discussed in more detail later in this series. For now let's just look at the execution plan of the example query above:
The operator responsible for doing the heavy lifting in the join process is right in the middle, the "Nested Loops (Inner Join)" operator. It is pulling rows from both tables to combine them based on the join condition and then passing the combined rows on.
Here you can directly see which of the three algorithms is used (Nested Loops) and which logical operation was requested (Inner Join).
If you look at the execution plan for the query above that also joined to the product table, you will see that there is a separate join operator for each JOIN statement in the query:
The right most join operator takes the SalesOrderHeader table and joins it with the SalesOrderDetail table. The left join operator takes the result of the first and joins it with the Product table. There are two things to note about this. First, any join operator can only work with two tables as input. There is no join operator that can handle more than two tables. Second, the input of a join operator does not need to be a table. Instead any execution plan operator can act as a row source on either input of the join operator. For that reason I am going to talk about "row source" instead of "table" when appropriate for the rest of this series.
The INNER JOIN is the most commonly used join operation. It takes rows from two row sources and combines them into a single result set matching rows together based on a specified condition. SQL Server has three different join algorithms at its disposal. It selects one of those based on the operation requested (inner join) and the number of rows it estimates to come from each row source.
This post is part of my December 2012 "A Join A Day" blog post series. You can find the table of contents with all posts published so far in the introductory post: A Join A Day – Introduction. Check back there frequently throughout the month.