Friday, June 12, 2015

Query performance scuttled by ARITHABORT OFF

Part of my daily routine is to have a look at jobs that ran overnight to see which ones ran longer than seems reasonable.  I know, "seems reasonable" is a little vague!  What I mean is that how long did the job run relative to previous runs and relative to the amount of data processed.

The other day while going through the overnight jobs, I found one that had been running between 90 minutes and 3.5 hours, yet only processed about 25000 rows.  I suddenly got very interested in that job!  What was it doing that could possible take so long?

Digging into SSIS

The job in question runs an SSIS package that loads a table into our data warehouse that is needed by many daily, weekly and monthly reports -- some of which go to C-level executives and above.  One might say this is important!  The job doesn't do anything very special:  it gets data -- previously staged -- from the same server it loads it to.  Sounds simple, right?  Well, actually it is simple.  So, why the long run time?

Digging into the package I discovered that the data flow uses an OLE DB source transformation, specifying Table or View as data access mode and referencing a view.  "OK," I thought, "What's up with that view?"  I opened the view definition in SSMS to look at it.  It involved a join of six tables but nothing suspicious.  Some of the tables could use extra indexes but the amount of data (25,000 rows) was small enough that I figured those indexes wouldn't make much difference.  It turned out that I was right and that the problem was something else entirely.

It's the plan, dummy!

Before trying the run the view in SSMS, I took a look at the estimated execution plan.  Essentially, it was a series of hash joins, which are usually a good choice.  With a little trepidation, I ran the code in the view.  To my surprise, it returned results in under 10 seconds!  So, where did the 3.5 hours come from?  Good question!

Next, I decided to rerun the package to see if it finished in 10 seconds, as my standalone query had.  I waited and waited and waited some more.  Finally I killed it.  Something else was going on.  I recalled that I had installed Adam Machanic's excellent sp_whoisactive stored procedure some time ago.  You can find the current release version here: sp_whoisactive

This procedure has a number  of options to control its output.  I noticed a couple I thought I could use:

--If 1, gets the full stored procedure or running batch, when available --If 0, gets only the actual statement that is currently running in the batch or procedure @get_full_inner_text BIT = 0, --Get associated query plans for running tasks, if available --If @get_plans = 1, gets the plan based on the request's statement offset --If @get_plans = 2, gets the entire plan based on the request's plan_handle @get_plans TINYINT = 0,
So, I started up the package again, and while it was running, ran sp_whoisactive in a second session using those options.  The procedure returned a row for my running package with a link to the query plan.  Using that link, I opened the query plan to see what it looked like.  It didn't look at all like the plan from my first, interactive session.  Instead of hash joins the plan had nested loops!  Given the sizes of the tables involved, no wonder it ran for 3.5 hours!

So, why the nested loops?  Digging further, I discovered that the optimizer had timed out trying to find a good plan, leaving us with what it had at the moment.  Not only that, but the estimated row counts for two of the tables were 1 -- that's right, one!  I knew that those tables had thousands of rows, so nested loops would kill performance.  However, the optimizer had given up before finding a good plan. Why? For that matter, why did it not simply reuse the plan from my (fast) interactive query?

ARITHABORT OFF messes up the plan

The next thing to look at was the settings in use when both plans were generated.  They were the same except for one: ARITHABORT.   The plan that caused the optimizer to time out was generated with SET ARITHABORT OFF in effect. Googling this setting led me to ARITHABORT and performance where it is stated clearly:

Client applications setting ARITHABORT to OFF can receive different query plans making it difficult to troubleshoot poorly performing queries.
Sooooo, I knew the reason.  Now, what about the root cause?  It turns out that the default setting for ARITHABORT in the model database is --- you guessed it! -- OFF.  My database inherited that setting.  I could have changed the database setting (and I still might do that) but I did not want to break any queries that might depend on it, so I changed the package instead.

The fix is in!

I changed the Data Access mode to SQL Statement and wrote:

SET ARITHABORT OFF; SELECT * from dbo.myview;

I deployed the package to the server, deleted the old plans from the cache and reran the package.  It ran in under ten seconds!  Worrying that it might be a fluke, I monitored the package for the next several days and not once did it run longer than a minute.  Problem solved!

Summary

The moral of the story?  Don't always assume that the problem is the way the query is written or how the tables are indexed.  The problem can be subtler than than and you need to look at the query plan and the options used to generate it.

No comments:

Post a Comment