The Underestimated Complexity of a Table Compare Algorithm

2008-10-27 - General

One of the most important parts of a testing framework for SQL code is a table or resultset compare function. "That is easy to implement", you might think. But is it? Let's have a closer look.

Requirements

The requirements for a table compare function usable in a testing framework are easily stated:

  • Be able to accurately identify matching and non-matching tables.
  • Provide an output of the result that is helpful for identifying the problem.

The first one seems trivial, so let us start by looking at the second requirement. What should the output look like?

Assume we want to compare these two tables:

1 a aa
2 b bb
3 c cc
4 d dd

Tbl A

and

1 a aa
2 b xx
5 c cc
4 z dd

Tbl B

A possible output, highlighting the non-matching fields (marked here with brackets, standing in for the original color highlighting), is this:

1 a aa
2 b [xx]
[5] c cc
4 [z] dd

But it does not include the expected values. That can be done by showing each non-matching field as expected|actual, like this:

1 a aa
2 b bb|xx
3|5 c cc
4 d|z dd

This design does not easily handle missing or unexpected rows. One possible extension to accommodate these as well is to add an indicator to each row that clearly identifies whether it is matching (=), missing (-), unexpected (+), or non-matching (≠).

So if we compare Tbl A above with this one,

2 b bb
5 c cc
4 z dd
9 a aa

Tbl C

we get:

- 1 a aa
= 2 b bb
≠ 3|5 c cc
≠ 4 d|z dd
+ 9 a aa

This looks like a workable solution in theory. Let us have a look at how you would implement something like this.
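
As a starting point, here is a minimal sketch of a naive implementation that compares the two result sets positionally, producing the marker-per-row output described above with non-matching fields shown as expected|actual. The function and table names are my own, for illustration only:

```python
def compare_ordered(expected, actual):
    """Naively compare two result sets row by row, by position."""
    lines = []
    for i in range(max(len(expected), len(actual))):
        if i >= len(actual):                      # row only in expected
            lines.append("- " + " ".join(expected[i]))
        elif i >= len(expected):                  # row only in actual
            lines.append("+ " + " ".join(actual[i]))
        elif expected[i] == actual[i]:            # exact match
            lines.append("= " + " ".join(expected[i]))
        else:                                     # same position, some fields differ
            cells = [e if e == a else e + "|" + a
                     for e, a in zip(expected[i], actual[i])]
            lines.append("≠ " + " ".join(cells))
    return lines

tbl_a = [("1","a","aa"), ("2","b","bb"), ("3","c","cc"), ("4","d","dd")]
tbl_b = [("1","a","aa"), ("2","b","xx"), ("5","c","cc"), ("4","z","dd")]
for line in compare_ordered(tbl_a, tbl_b):
    print(line)
```

Run against Tbl A and Tbl B, this reproduces the output shown above. Note, however, that it compares rows strictly by position, which is exactly the hidden assumption examined next.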

Obstacles

The above examples all quietly make an assumption that we really cannot make in real life: they all assume the tables to be ordered.

The Relational Database Model, on which most modern DBMSs are based, talks about relations that are defined over domains. Each relation is a set of n-tuples of attributes. These are generally displayed as tables with n columns and a number of rows; the columns represent the domains, and the rows represent the tuples. This is a problematic representation, as writing a set down as a table imposes an arbitrary order on it. A set, in mathematics, is a collection of things with no order defined; it might not even be possible to define an order on a given set. That being said, if you ask your DBMS to return a relation, it will do so in the form of a resultset. This resultset consists of rows that are delivered to you in some order. You can influence that order by using an explicit ORDER BY clause. If you don't, however, the rows might come back in any order. This is true even if you have defined a clustered index on the table.

To cite Itzik Ben-Gan:
"The most important advice that I hope that you will carry after reading this [...] goes back to the standard. ANSI SQL says that a query without an ORDER BY clause is not guaranteed to return the data in any particular order. If you want to guarantee that the data will be returned in a particular order, specify an ORDER BY clause. Regardless of what you know or think you know about the internals of [your dbms] this should be your work-premise. There may be access methods that you're not aware of that are used in special circumstances, and new access methods might be introduced in future versions of [your dbms]"
(Quaere Verum - Clustered Index Scans - Part I, II, III).

What does that mean for us? Looking at the last example of comparing Tbl A and Tbl C, there are now four different equally likely outcomes (as well as several not so likely ones):

- 1 a aa
+ 9 a aa
= 2 b bb
- 3 c cc
+ 5 c cc
≠ 4 d|z dd

or

≠ 1|9 a aa
= 2 b bb
- 3 c cc
+ 5 c cc
≠ 4 d|z dd

or

- 1 a aa
+ 9 a aa
= 2 b bb
≠ 3|5 c cc
≠ 4 d|z dd

or

≠ 1|9 a aa
= 2 b bb
≠ 3|5 c cc
≠ 4 d|z dd

If you think about it further, you will see that you really do not know if the line with the "9" was meant to be the one containing the "1". The same is true for the "3"/"5" pair and even the "d"/"z" pair. So the solution really should look like this:

= 2 b bb
- 1 a aa
- 3 c cc
- 4 d dd
+ 5 c cc
+ 4 z dd
+ 9 a aa

After all this you might be tempted to say, let's just require the queries for comparison to contain an ORDER BY clause. Besides the obvious problems this causes -- think for example about comparing the result set of a stored procedure that does not have an ORDER BY clause specified -- this trick does not get us home free.

Ordered Problems

Let us compare the now well-known Tbl A with this one:

1 a aa
4 c bb
4 d dd

Tbl D

If we go strictly ordered, we get this result:

= 1 a aa
≠ 2|4 b|c bb
≠ 3|4 c|d cc|dd
- 4 d dd

This is clearly not what we want, as the row "4,d,dd" is marked as missing, even though it is there. To get to a better solution, we could take our ordered input and process it out of order, to look at matching rows first. But even then we still have some uncertainty. The possible outcomes are:

= 1 a aa
= 4 d dd
≠ 2|4 b|c bb
- 3 c cc

or

= 1 a aa
= 4 d dd
- 2 b bb
≠ 3|4 c cc|bb

Both are equally likely or unlikely. We really cannot say which one is the right one. That means the output should look like this:

= 1 a aa
= 4 d dd
- 2 b bb
- 3 c cc
+ 4 c bb

But this is exactly what we would have gotten had we not forced the ORDER BY in the first place.

Probabilistics

Based on what we know, or even more so what we do not know about the data we want to compare, the only correct solution is to look at rows atomically and report either a match, a missing, or an unexpected for each of them.
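
This "rows are atomic" comparison can be sketched with a multiset, so that row order is irrelevant and duplicate rows are counted correctly. A minimal version using Python's `collections.Counter` (the names are illustrative, not from the original article):

```python
from collections import Counter

def compare_unordered(expected, actual):
    """Compare two result sets as multisets of whole rows."""
    exp, act = Counter(expected), Counter(actual)
    matched = list((exp & act).elements())      # rows present in both (=)
    missing = list((exp - act).elements())      # expected but not found (-)
    unexpected = list((act - exp).elements())   # found but not expected (+)
    return matched, missing, unexpected

tbl_a = [("1","a","aa"), ("2","b","bb"), ("3","c","cc"), ("4","d","dd")]
tbl_c = [("2","b","bb"), ("5","c","cc"), ("4","z","dd"), ("9","a","aa")]
matched, missing, unexpected = compare_unordered(tbl_a, tbl_c)
# matched    -> only "2 b bb"
# missing    -> "1 a aa", "3 c cc", "4 d dd"
# unexpected -> "5 c cc", "4 z dd", "9 a aa"
```

Run against Tbl A and Tbl C, this produces exactly the safe output discussed above: one match, three missing rows, three unexpected rows, and no guesses about which row was "meant to be" which.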

But, looking back at the comparison between Tbl A and Tbl C the output

≠ 1|9 a aa
= 2 b bb
≠ 3|5 c cc
≠ 4 d|z dd

seems much more natural than the output

= 2 b bb
- 1 a aa
- 3 c cc
- 4 d dd
+ 5 c cc
+ 4 z dd
+ 9 a aa

So what would we have to do to get there?

First, we would have to give up the urge to always be right. Because we do not know which row was intended to be which, any assumption we make will be just a guess.

The second thing we would have to do is to find the most likely candidate. The most likely candidate is the one with the smallest overall distance. Distance here can be measured in several ways, depending on how much effort you want to expend. For the purpose of this discussion I will stay with distance measured in number of mismatching fields. But you could also include a measurement of the distance of the values in each field, for example, the absolute difference for numeric values.

To find the closest match, you first have to create a matrix that contains the distance (number of mismatching fields) from each expected row to each actual row. With that matrix you can then calculate the overall distance of all possible pairings and pick the one with the lowest number. If you end up finding more than one minimal solution, you have to pick one randomly.
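
A brute-force sketch of this idea, assuming for brevity that both tables have the same number of rows (function names are my own; for real use you would want a proper assignment-problem solver instead of trying every permutation):

```python
from itertools import permutations

def field_distance(row_a, row_b):
    """Distance between two rows = number of mismatching fields."""
    return sum(1 for x, y in zip(row_a, row_b) if x != y)

def best_pairing(expected, actual):
    """Try every pairing of expected to actual rows, keep the cheapest."""
    dist = [[field_distance(e, a) for a in actual] for e in expected]
    best_cost, best_perm = None, None
    for perm in permutations(range(len(actual))):
        cost = sum(dist[i][perm[i]] for i in range(len(expected)))
        if best_cost is None or cost < best_cost:
            best_cost, best_perm = cost, perm
    return best_cost, best_perm

tbl_a = [("1","a","aa"), ("2","b","bb"), ("3","c","cc"), ("4","d","dd")]
tbl_c = [("2","b","bb"), ("5","c","cc"), ("4","z","dd"), ("9","a","aa")]
cost, perm = best_pairing(tbl_a, tbl_c)
# cost -> 3: "2 b bb" matches exactly, the other three pairs differ in one field each
```

For Tbl A versus Tbl C this recovers exactly the "natural" pairing shown above: 1↔9, 2↔2, 3↔5, and d↔z.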

Building the distance matrix alone requires O(n²) row comparisons, and naively enumerating all possible pairings is factorial in n (finding the optimal pairing is an instance of the assignment problem, solvable in O(n³) with the Hungarian algorithm), so this approach might take a long time to complete. Test cases are generally intended to be executed as often as possible, and an extended execution time gets in the way of that.

There are several ways to improve the search speed of this algorithm if you give up the aim of always being correct. (Wait, didn't we give that up earlier?) They involve clever ways to minimize the number of comparisons you have to make, but they might lead you to a sub-optimal solution. But as we do not even know if the optimal solution is the right one, any sub-optimal solution is probably good enough as well.

The easiest way to go is to pick the closest match for each row without regard to the other rows. To make this work, you have to find all exactly matching rows first. Otherwise you might end up with rows without a match, even though they really did match. You also might throw the pairing off completely, as the following example shows.
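
The greedy shortcut might be sketched like this: peel off the exact matches first, then pair each remaining expected row with its closest remaining actual row, never reconsidering earlier choices. Names are illustrative, not from the original article:

```python
def field_distance(row_a, row_b):
    """Distance between two rows = number of mismatching fields."""
    return sum(1 for x, y in zip(row_a, row_b) if x != y)

def greedy_pairing(expected, actual):
    """Greedy closest-match pairing; exact matches are claimed first."""
    actual = list(actual)
    pairs, missing, leftover = [], [], []
    # pass 1: exact matches only, so near-misses cannot steal them
    for e in expected:
        if e in actual:
            actual.remove(e)
            pairs.append((e, e))
        else:
            leftover.append(e)
    # pass 2: for each remaining row, grab the closest unclaimed candidate
    for e in leftover:
        if not actual:
            missing.append(e)
            continue
        best = min(actual, key=lambda a: field_distance(e, a))
        actual.remove(best)
        pairs.append((e, best))
    return pairs, missing, actual   # leftover actual rows are unexpected

tbl_a = [("1","a","aa"), ("2","b","bb"), ("3","c","cc"), ("4","d","dd")]
tbl_e = [("1","b","bb"), ("2","c","cc"), ("3","d","dd"), ("4","e","ee")]
pairs, missing, unexpected = greedy_pairing(tbl_a, tbl_e)
```

Processing the expected rows top to bottom pairs the rows by their first column; processing them in a different order yields a different pairing, which is exactly the order dependence described below.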

Compare Tbl A with this one:

1 b bb
2 c cc
3 d dd
4 e ee

Tbl E

This can lead to either of these two outputs:

≠ 1 a|b aa|bb
≠ 2 b|c bb|cc
≠ 3 c|d cc|dd
≠ 4 d|e dd|ee

or

- 1 a aa
≠ 2|1 b bb
≠ 3|2 c cc
≠ 4|3 d dd
+ 4 e ee

Which one you get depends entirely on the order in which you process the rows. (There are actually more possible solutions, but these two are enough to get my point across.)

That means that using this shortcut algorithm further increases the risk of not finding the closest match.

Conclusion

We have seen that the safest solution is to not try to match mismatching rows together. We have also seen ways to get close to a perfect match.

In my experience it can be very helpful if the program tells you what the difference is when you have two almost identical rows with many columns. But it can also be very frustrating to be sent down the wrong track by a program that thought it could think without knowing all the parameters. It can be equally frustrating to wait a long time for a test suite to finish executing, especially if you know it is spending the time on an operation whose value is questionable in the first place. In my opinion the best solution is, as a standard behavior, to not match non-matching rows together, but to allow the user to request and maybe even influence a match later on. This additional calculation should happen outside of the test case evaluation so that it does not slow down test suite execution.
