This is the twelfth post in my A Join A Day series about SQL Server Joins. Make sure to let me know how I am doing or ask your burning join related questions by leaving a comment below.
Today's topic is the "left semi join". "Semi" is a Latin prefix that means "half". So, a semi join is a half join. Half in this context means, that the result contains only rows of one of the two row sources: The left semi join returns only data from the left row source. However, the "left" prefix does not mean that the left semi join is related to the left outer join. "Left Outer" and "Left Semi" are totally independent concepts. Let's look at an example.
The most common use case in which SQL Server's optimizer might decide to use a Left Semi Join operator is an EXISTS() query like this:
This query takes each row from the BusinessEntity table and checks if there is at least one address in the BusinessEntityAddress table. Only entities that have at least on address are included in the result. An entity that has two addresses is still only returned once. In fact, depending on the algorithm SQL Server chooses for the left semi join, it might even stop looking for additional matches to a left side row as soon as it has found the first one.
The query above query returns this result:
Of the 20777 business entities in the table 19579 have at least one address and are therefore included in the result. Even though 35 of them have two addresses, all entities are included only once.
Let's take a look at the same TblA – TblB example that we have seen in earlier articles like The Left Outer Join. Remember, the tables look like this:
Now let's run this EXISTS()-query against those tables:
That query produces the following result:
As you can see, every TblA row is included with one exception: TblA_Val = 1 is missing because a row with TblB_Val = 1 does not exist in TblB. You can also see that every TblA row is included only once even so there are multiple rows in TblB for values 4 and 5. Finally, even though the query is a SELECT * FROM query, only columns from TblA are included in the result.
Each row from the left row source is included at most once. However, if there are true duplicate rows, all of them would be included. A semi join does not eliminate existing duplicates.
Any join query that requests rows form the left row source based on the existence of rows in the right row source without including data from the right row source in the final result and without duplicating rows from the left row source is a logical left semi join. The EXISTS() clause we saw in the above examples is the most prominent way to write a left semi join query.
The EXISTS() clause is typically used in the WHERE clause of a query but you can use it in other places like a CASE statement too. You can place any valid sub-query inside the parenthesis. EXISTS() evaluates to TRUE if the sub-query returns at least on row.
As we will see in the next section, SQL Server does not always use a physical left semi join operator in these cases.
There is no command for the left semi join. The SQL Server optimizer has the Left Semi Join operator as one option to choose from, and when the query allows for its use, a left semi join might or might not get chosen. There is very little we can do to influence that choice.
The operator of the first example query above is indeed a Left Semi Join operator:
However, for the second example we get a totally different plan:
Instead of using a Left Semi Join operator, SQL Server decided to use the combination of a Stream Aggregate operator and a normal Inner Join operator to achieve the same result.
The typical use case for a logical left semi join is a query with an EXISTS() clause. However, even though SQL Server has a physical Left Semi Join operator, it might not always use it.
This post is part of my December 2012 "A Join A Day" blog post series. You can find the table of contents with all posts published so far in the introductory post: A Join A Day – Introduction. Check back there frequently throughout the month.