Friday, June 26, 2015

Set operations in SQL Server - Part 1: Testing for equality

Sets and set operations are fundamental to relational databases -- both in theory and in practical day-to-day operation.  A table ("relation" in mathematical terms) represents a set (more properly, a multiset, or bag) and the rows of the table ("tuples") represent the members of a set.  It's no surprise then, that SQL contains many operators designed to operate on sets.  Some of these are familiar to anyone who recalls basic set theory from first-year college courses:
  • Union (UNION): Return a combination of two sets, eliminating duplicates
  • Intersection(INTERSECT): Return a set consisting of members present in both of two sets
  • Complement(EXCEPT): Return a set consisting of members present in one set but not the other
  • Symmetric difference (A EXCEPT B UNION B EXCEPT A): Union of the two complements
  • Cartesian product (A CROSS JOIN B): Set of all possible tuples from two sets
  • Cardinality (COUNT(*))
Note that I mentioned that in a relational database, sets are more properly called "multisets" or "bags".  This is because an RDBMS typically allows duplicate rows (that is, when a primary key is not defined).

Testing set equality

One interesting problem is how to determine the equality of two sets in an RDBMS while allowing for duplicate rows.  In set theory, two sets are equivalent if every member of set A is also a member of set B.  That means we must have:

A \ B = B \ A = {}, the empty set, where "\" is the operator for the complement operation.  We can also look at the cardinality:

|A\B| = |B\A| = 0

For tables in an RDBMS with a primary key (which can have no duplicate rows), we can check equality simply:

( SELECT * FROM A EXCEPT SELECT * FROM B ) UNION ( SELECT * FROM B EXCEPT SELECT * FROM A )

which you might recognize as computing the symmetric difference. If this query returns no rows, the two tables are equal. (The parentheses are needed to overcome the default operator precedence in SQL Server. See EXCEPT and INTERSECT (Transact-SQL) for details) For tables in an RDBMS without a primary key (which enforces uniqueness), we need to use the cardinality to account for cases where some rows are duplicated.  In SQL we can write:

SELECT ( SELECT count(*) FROM ( SELECT * FROM A EXCEPT SELECT * FROM B ) [A\B] ) AS [|A\B|] , ( SELECT count(*) FROM ( SELECT * FROM B EXCEPT SELECT * FROM A ) [B\A] ) AS [|B\A|]

but this won't work if the two tables have the same unique rows but different duplicated rows. For example, if we include the above logic in a WITH statement such as this:

WITH A AS ( SELECT * FROM (VALUES (1),(1),(2),(3),(4)) v(n) ) , B AS ( SELECT * FROM (VALUES (1),(2),(2),(3),(4)) v(n) ) -- query from previous example --

we can see that the results indicate equality. That is, the orders of the two complements is indeed 0, but the tables are not equal, since table A has two rows of 1's but table B has two rows of 2's. What is happening here? SQL removes duplicates as part of the set operations UNION, EXCEPT and INTERSECT!

To solve this problem, we need to compare the two tables row-by-row.  However, since there is no primary key, how should we match them up?  One simple way is to impose an ordering on the tables and assign a surrogate key.  In this case we simply use the ROW_NUMBER() function.  e.g.

WITH A(col1) AS ( SELECT * FROM (VALUES (1),(1),(2),(3),(4)) v(n) ) , B(col1) AS ( SELECT * FROM (VALUES (1),(2),(2),(3),(4)) v(n) ) ,[A'] AS ( SELECT *, ROW_NUMBER() OVER(ORDER BY col1) AS rn FROM A ) ,[B'] AS ( SELECT *, ROW_NUMBER() OVER(ORDER BY col1) AS rn FROM B ) SELECT ( SELECT count(*) FROM [A'] a INNER JOIN [B'] b ON a.rn = b.rn AND a.col1 = b.col1 ) AS [|A join B|] , ( SELECT COUNT(*) FROM A ) AS [|A|] , ( SELECT COUNT(*) FROM B ) AS [|B|]

This produces the results:

|A join B| |A| |B| 4 5 5

We can see that, even though A and B have the same cardinality, the JOIN produces a different cardinality, showing us that at some point, the two sets diverge. Hence the two sets are not equal.

No comments:

Post a Comment