Friday, June 26, 2015

Set operations in SQL Server - Part 1: Testing for equality

Sets and set operations are fundamental to relational databases -- both in theory and in practical day-to-day operation.  A table ("relation" in mathematical terms) represents a set (more properly, a multiset, or bag), and the rows of the table ("tuples") represent the members of that set.  It's no surprise, then, that SQL contains many operators designed to operate on sets.  Some of these are familiar to anyone who recalls basic set theory from first-year college courses:
  • Union (UNION): Return a combination of two sets, eliminating duplicates
  • Intersection (INTERSECT): Return a set consisting of members present in both of two sets
  • Complement (EXCEPT): Return a set consisting of members present in one set but not the other
  • Symmetric difference (A EXCEPT B UNION B EXCEPT A): Union of the two complements
  • Cartesian product (A CROSS JOIN B): Set of all possible tuples from two sets
  • Cardinality (COUNT(*)): Return the number of members in a set
Note that I mentioned that in a relational database, sets are more properly called "multisets" or "bags".  This is because an RDBMS typically allows duplicate rows (that is, when a primary key is not defined).
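For example, a heap with no primary key or unique constraint will happily accept exact duplicates (the table name here is just for illustration):

CREATE TABLE dbo.Bag (n int NOT NULL);        -- no primary key, no unique constraint
INSERT INTO dbo.Bag (n) VALUES (1), (1), (2); -- three rows, two of them identical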

Testing set equality

One interesting problem is how to determine the equality of two sets in an RDBMS while allowing for duplicate rows.  In set theory, two sets are equal if every member of set A is also a member of set B and, conversely, every member of B is also a member of A.  That means we must have:

A \ B = B \ A = {}, the empty set, where "\" is the operator for set difference (the relative complement).  We can also look at the cardinalities:

|A\B| = |B\A| = 0

For tables in an RDBMS with a primary key (and therefore no duplicate rows), we can check equality simply:

( SELECT * FROM A EXCEPT SELECT * FROM B )
UNION
( SELECT * FROM B EXCEPT SELECT * FROM A )

which you might recognize as computing the symmetric difference.  If this query returns no rows, the two tables are equal.  (The parentheses are needed to overcome the default operator precedence in SQL Server.  See EXCEPT and INTERSECT (Transact-SQL) for details.)  For tables without a primary key or any other uniqueness constraint, we need to use the cardinality to account for cases where some rows are duplicated.  In SQL we can write:

SELECT ( SELECT count(*)
         FROM ( SELECT * FROM A EXCEPT SELECT * FROM B ) [A\B] ) AS [|A\B|]
     , ( SELECT count(*)
         FROM ( SELECT * FROM B EXCEPT SELECT * FROM A ) [B\A] ) AS [|B\A|]

but this won't work if the two tables contain the same distinct rows but duplicate them differently.  For example, if we define A and B as common table expressions and follow them with the query from the previous example:

WITH A AS ( SELECT * FROM (VALUES (1),(1),(2),(3),(4)) v(n) )
   , B AS ( SELECT * FROM (VALUES (1),(2),(2),(3),(4)) v(n) )
-- query from previous example --

we can see that the results indicate equality.  That is, the cardinality of each complement is indeed 0, but the tables are not equal, since table A has two rows with the value 1 while table B has two rows with the value 2.  What is happening here?  SQL removes duplicates as part of the set operations UNION, EXCEPT and INTERSECT!
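We can see this duplicate removal directly with a stripped-down version of the same data (the column alias n is just for illustration): both differences come back empty because EXCEPT compares only distinct rows.

SELECT * FROM (VALUES (1),(1),(2),(3),(4)) A(n)
EXCEPT
SELECT * FROM (VALUES (1),(2),(2),(3),(4)) B(n);
-- returns no rows, even though the multisets differ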

To solve this problem, we need to compare the two tables row-by-row.  However, since there is no primary key, how should we match them up?  One simple way is to impose an ordering on the tables and assign a surrogate key.  In this case we simply use the ROW_NUMBER() function, e.g.

WITH A(col1) AS ( SELECT * FROM (VALUES (1),(1),(2),(3),(4)) v(n) )
   , B(col1) AS ( SELECT * FROM (VALUES (1),(2),(2),(3),(4)) v(n) )
   , [A'] AS ( SELECT *, ROW_NUMBER() OVER(ORDER BY col1) AS rn FROM A )
   , [B'] AS ( SELECT *, ROW_NUMBER() OVER(ORDER BY col1) AS rn FROM B )
SELECT ( SELECT count(*)
         FROM [A'] a
         INNER JOIN [B'] b ON a.rn = b.rn AND a.col1 = b.col1 ) AS [|A join B|]
     , ( SELECT COUNT(*) FROM A ) AS [|A|]
     , ( SELECT COUNT(*) FROM B ) AS [|B|]

This produces the results:

|A join B|   |A|   |B|
4            5     5

We can see that, even though A and B have the same cardinality, the join produces a smaller cardinality, showing us that at some point the two sets diverge.  Hence the two sets are not equal.
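If you want a single yes/no answer, the pieces above can be combined into one query.  Here is a minimal sketch using the same sample data; the CASE expression and the [A = B?] alias are my own additions:

WITH A(col1) AS ( SELECT * FROM (VALUES (1),(1),(2),(3),(4)) v(n) )
   , B(col1) AS ( SELECT * FROM (VALUES (1),(2),(2),(3),(4)) v(n) )
   , [A'] AS ( SELECT col1, ROW_NUMBER() OVER(ORDER BY col1) AS rn FROM A )
   , [B'] AS ( SELECT col1, ROW_NUMBER() OVER(ORDER BY col1) AS rn FROM B )
SELECT CASE
         WHEN ( SELECT COUNT(*) FROM [A'] a INNER JOIN [B'] b ON a.rn = b.rn AND a.col1 = b.col1 )
              = ( SELECT COUNT(*) FROM A )
          AND ( SELECT COUNT(*) FROM A ) = ( SELECT COUNT(*) FROM B )
         THEN 'Equal'
         ELSE 'Not equal'
       END AS [A = B?];

With this sample data the query returns 'Not equal'; with two identical multisets it returns 'Equal'.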

Saturday, June 20, 2015

Re-imagining Linked Servers

Introduction


Linked servers are an essential part of the SQL Server developer's tool kit.  They are often necessary for pulling/pushing data to/from remote servers, whether those servers run SQL Server, Oracle, DB2, PostgreSQL or any other RDBMS.  Linked servers are flexible and can adapt to many different types of "servers", including Excel workbooks and MS-Access databases, among others.

Creating a linked server is relatively easy.  There is both a GUI method in SSMS (under Server Objects/Linked Servers) and a T-SQL method, using the system stored procedures sp_addlinkedserver and sp_addlinkedsrvlogin.
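As a rough sketch of the T-SQL route (the server name, data source and logins below are made up):

EXEC sp_addlinkedserver
     @server     = N'REMOTESRV'            -- name of the linked server
   , @srvproduct = N''
   , @provider   = N'SQLNCLI'              -- OLE DB provider
   , @datasrc    = N'remotehost';          -- remote data source

EXEC sp_addlinkedsrvlogin
     @rmtsrvname  = N'REMOTESRV'
   , @useself     = N'FALSE'
   , @locallogin  = N'CORP\USER1'          -- local login being granted access
   , @rmtuser     = N'remote_login'        -- remote login it maps to
   , @rmtpassword = N'**********';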

Problem

Securing access to linked servers falls outside the normal SQL Server permissions controls.  For example, to give some user -- say CORP\USER1 -- access to a linked server, you need to add that login id as a linked server login and map it to some login on the linked server.  That means that, for every user who needs to access the linked server, you potentially need to add a new linked server login.

There are options available to mitigate this somewhat.  If the remote user is in the same AD domain as the local user, you can specify that the connection should be made using their credentials.  This cannot work across domains unless those domains are in a trust relationship.  You can also specify a catch-all remote user which is used for connections where there is no explicit mapping specified. Also, you can simply deny access for any users trying to connect that are not in the explicit login list. (I'm using the terms "user" and "login" rather loosely here.  Strictly speaking, linked servers only know about logins since they are a server-level object.)
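For reference, those mitigations are all variations on the login mapping (the server name below is made up):

-- All logins not otherwise mapped connect using their own (pass-through) credentials
EXEC sp_addlinkedsrvlogin @rmtsrvname = N'REMOTESRV', @useself = N'TRUE', @locallogin = NULL;

-- ...or unmapped logins fall through to a single catch-all remote login
EXEC sp_addlinkedsrvlogin @rmtsrvname = N'REMOTESRV', @useself = N'FALSE', @locallogin = NULL
   , @rmtuser = N'catchall_login', @rmtpassword = N'**********';

-- ...or the default mapping is removed entirely, so unmapped logins are refused
EXEC sp_droplinkedsrvlogin @rmtsrvname = N'REMOTESRV', @locallogin = NULL;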

Solution

I find these options deficient and inflexible.  I do not want to edit the linked server object every time I need to give a new login access or remove an old login.  I'd rather use SQL Server's roles and credentials.  Here's how I envision this working:


  1. When you create a new login for a linked server, you should have additional options for the local login, specifically:
    1. An AD group that is among the server logins.  This would mean that any member of the group that can connect to the server (that is, is not DENYed) can use the linked server.
    2. A SQL Server user-defined role.  This is the most flexible.  Any member of the role would have access to the linked server.  Note that you can add AD groups to roles if those groups are included in the server login list.
  2. In addition to specifying a specific remote login, you should have the ability to specify a Credential. 

These additional options, should they be added, would make the management of linked servers easier and more consistent with the SQL Server security model.  With these in place, adding logins to the list of those permitted to use the linked server would be a simple matter of adding them to the AD group or adding them directly to the server role.  Likewise, the remote login would be managed as a credential.
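To make the idea concrete, a login mapping under this scheme might look like the following.  This is purely imagined syntax -- neither the role/group mapping nor the @credential parameter exists today:

-- Imagined, not real T-SQL: map a user-defined server role to a credential
EXEC sp_addlinkedsrvlogin
     @rmtsrvname = N'REMOTESRV'
   , @locallogin = N'LinkedServerUsers'     -- hypothetical: a server role or AD group
   , @useself    = N'FALSE'
   , @credential = N'RemoteSrvCredential';  -- hypothetical: a SQL Server credential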

Conclusion

I'm convinced that it is time to align linked server management with the SQL Server roles and credentials security model.  Let's see if we can raise this issue for SQL Server Next!

Friday, June 12, 2015

Query performance scuttled by ARITHABORT OFF

Part of my daily routine is to have a look at jobs that ran overnight to see which ones ran longer than seems reasonable.  I know, "seems reasonable" is a little vague!  What I mean is: how long did the job run relative to previous runs and relative to the amount of data processed?

The other day while going through the overnight jobs, I found one that had been running between 90 minutes and 3.5 hours, yet only processed about 25,000 rows.  I suddenly got very interested in that job!  What was it doing that could possibly take so long?

Digging into SSIS

The job in question runs an SSIS package that loads a table into our data warehouse that is needed by many daily, weekly and monthly reports -- some of which go to C-level executives and above.  One might say this is important!  The job doesn't do anything very special:  it gets data -- previously staged -- from the same server it loads it to.  Sounds simple, right?  Well, actually it is simple.  So, why the long run time?

Digging into the package I discovered that the data flow uses an OLE DB source transformation, specifying Table or View as data access mode and referencing a view.  "OK," I thought, "What's up with that view?"  I opened the view definition in SSMS to look at it.  It involved a join of six tables but nothing suspicious.  Some of the tables could use extra indexes but the amount of data (25,000 rows) was small enough that I figured those indexes wouldn't make much difference.  It turned out that I was right and that the problem was something else entirely.

It's the plan, dummy!

Before trying to run the view in SSMS, I took a look at the estimated execution plan.  Essentially, it was a series of hash joins, which are usually a good choice.  With a little trepidation, I ran the code in the view.  To my surprise, it returned results in under 10 seconds!  So, where did the 3.5 hours come from?  Good question!

Next, I decided to rerun the package to see if it finished in 10 seconds, as my standalone query had.  I waited and waited and waited some more.  Finally I killed it.  Something else was going on.  I recalled that I had installed Adam Machanic's excellent sp_whoisactive stored procedure some time ago.  You can find the current release version here: sp_whoisactive

This procedure has a number of options to control its output.  I noticed a couple I thought I could use:

--If 1, gets the full stored procedure or running batch, when available
--If 0, gets only the actual statement that is currently running in the batch or procedure
@get_full_inner_text BIT = 0,

--Get associated query plans for running tasks, if available
--If @get_plans = 1, gets the plan based on the request's statement offset
--If @get_plans = 2, gets the entire plan based on the request's plan_handle
@get_plans TINYINT = 0,
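Using those options, the call looks something like this (the exact values shown are illustrative):

EXEC sp_whoisactive
     @get_plans           = 2   -- pull the entire plan from the request's plan_handle
   , @get_full_inner_text = 1;  -- show the full batch, not just the current statement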
So, I started up the package again, and while it was running, ran sp_whoisactive in a second session using those options.  The procedure returned a row for my running package with a link to the query plan.  Using that link, I opened the query plan to see what it looked like.  It didn't look at all like the plan from my first, interactive session.  Instead of hash joins the plan had nested loops!  Given the sizes of the tables involved, no wonder it ran for 3.5 hours!

So, why the nested loops?  Digging further, I discovered that the optimizer had timed out trying to find a good plan, leaving us with what it had at the moment.  Not only that, but the estimated row counts for two of the tables were 1 -- that's right, one!  I knew that those tables had thousands of rows, so nested loops would kill performance.  However, the optimizer had given up before finding a good plan. Why? For that matter, why did it not simply reuse the plan from my (fast) interactive query?

ARITHABORT OFF messes up the plan

The next thing to look at was the settings in use when both plans were generated.  They were the same except for one: ARITHABORT.  The plan that caused the optimizer to time out was generated with SET ARITHABORT OFF in effect.  Googling this setting led me to ARITHABORT and performance, where it is stated clearly:

Client applications setting ARITHABORT to OFF can receive different query plans making it difficult to troubleshoot poorly performing queries.
Sooooo, I knew the reason.  Now, what about the root cause?  It turns out that the default setting for ARITHABORT in the model database is -- you guessed it! -- OFF.  My database inherited that setting.  I could have changed the database setting (and I still might do that), but I did not want to break any queries that might depend on it, so I changed the package instead.
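For the record, the database-level change (should I ever make it) is a one-liner, and sys.databases shows where the setting currently stands.  The database name here is made up:

-- Check the setting inherited from model
SELECT name, is_arithabort_on FROM sys.databases WHERE name IN (N'model', N'MyWarehouseDB');

-- Turn it on at the database level
ALTER DATABASE [MyWarehouseDB] SET ARITHABORT ON;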

The fix is in!

I changed the data access mode to SQL command and wrote:

SET ARITHABORT ON;
SELECT * FROM dbo.myview;

I deployed the package to the server, deleted the old plans from the cache and reran the package.  It ran in under ten seconds!  Worrying that it might be a fluke, I monitored the package for the next several days and not once did it run longer than a minute.  Problem solved!

Summary

The moral of the story?  Don't always assume that the problem is the way the query is written or how the tables are indexed.  The problem can be subtler than that, and you need to look at the query plan and the options used to generate it.