This is the twenty-seventh post in my A Join A Day series about SQL Server Joins. Make sure to let me know how I am doing or ask your burning join related questions by leaving a comment below.
The second join algorithm we are going to dissect further is the Merge Join algorithm: Today we are going to look at which situations can and which situations cannot be handled by the Merge Join operator.
As we did for the Loop Join algorithm, we are going to look at all nine logical joins, each as equi- and as nonequi-join. The following is a list of example queries for each combination together with their execution plans. We are also going to look only at plain equi-join or nonequi-join scenarios and not include mixed cases, as those are handled similarly to equi-joins.
Both Semi Nonequi-Join queries do not force a table order. Therefore the above two queries show that all four semi nonequi-join operations are not possible using the Merge Join operator.
The most obvious shortfall of the Merge Join algorithm is related to nonequi-joins. The Merge Join algorithm requires at least one equality column pair in the join condition, as otherwise sorting "in match order" isn't possible. As there is no equality column in the join condition of a cross join either (it doesn't have a join condition after all), it seems obvious that the Merge Join algorithm can't handle that case either. But hold on — SQL Server's Merge Join operator is able to deal with a many-to-many full outer merge join scenario. Therefore a full outer merge nonequi-join situation can be handled as well.
However, if you look at the properties for the Merge Join operator in the above image for the Full Merge Nonequi-join example query, you see that there is no join condition. The operator first does a complete cross join and then filters the rows afterwards using the residual condition. So that means, the Merge Join operator can handle a cross join too. But the optimizer team decided to not make it available in any other circumstance other than the many-to-many full outer join scenario. The reason for this decision is that it is a very expensive operation for the Merge Join operator to execute a cross join. Let's look at this example:
SET STATISTICS IO ON; SET STATISTICS TIME ON; GO SELECT COUNT_BIG(1),MAX(.1*sod.SalesOrderID+.1*soh.SalesOrderID) FROM Sales.SalesOrderHeader AS soh FULL MERGE JOIN Sales.SalesOrderDetail AS sod ON soh.SalesOrderID != sod.SalesOrderID; GO SET STATISTICS TIME OFF; SET STATISTICS IO OFF; GO SET STATISTICS IO ON; SET STATISTICS TIME ON; GO SELECT COUNT_BIG(1),MAX(.1*sod.SalesOrderID+.1*soh.SalesOrderID) FROM Sales.SalesOrderHeader AS soh FULL LOOP JOIN Sales.SalesOrderDetail AS sod ON soh.SalesOrderID != sod.SalesOrderID; GO SET STATISTICS TIME OFF; SET STATISTICS IO OFF;
It is the same query used in the examples above, just with an added aggregate to not have to return all those rows to the client. The first query is forcing the Merge Join operator; the second one is using the Loop Join operator:
These execution plans are actual plans captured with SQL Sentry Plan Explorer and show the actual row counts on the arrows connecting the operators.
The STATISTICS IO output is shown below:
Both algorithms use a Worktable, but they seem to differ in size. The Worktable access happens inside of the Merge Join operator in the first query. The second query exposes it through Table Spool operators in the execution plan. While we can't directly access Worktables, we can see the number of reads executed against them. The Merge Join worktable requires about 20% more reads.
The above queries did not only capture IO statistics but also time statistics:
That is a factor of almost 2.4 that the loop join query is faster. So why would you use the Merge Join operator at all to execute a full outer many-to-many join?
The big difference between the two plans is that the loop join plan has to be broken down into two separate sections, each of which has to read the data from all the tables. If that read is very expensive, for example because the tables reside on a remote server, it can be advantageous to not have to do it twice. In a case like that the optimizer might chose to use the Merge Join operator instead of two Loop Join operators.
This post is part of my December 2012 "A Join A Day" blog post series. You can find the table of contents with all posts published so far in the introductory post: A Join A Day – Introduction. Check back there frequently throughout the month.