So we had an interesting issue this week which was bugging the hell out of me and our technical services team, who deploy our software (developed by a third party).
The issue was a severe performance degradation on a certain search field in the application.
First off, the code for this site is around 5-6 years old, and was developed by "wet behind the ears" junior developers, just out of university, with no senior supervision. This was before my time with the company; the code got the job done, but by no means in an elegant manner.
So, the issue as reported was "When we search on this field, we get an error page after a while, and it used to be really fast" (or words to that effect).
My first thought was that there was an issue with the indexes after a rebuild... nope, rebuilt all indexes, and it was still causing issues.
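For reference, rebuilding everything on a table (and its full-text catalog) looks roughly like this; the table and catalog names here are placeholders, not the real ones from this system:

ALTER INDEX ALL ON dbo.MyTable REBUILD;    -- rebuild every index on the table
ALTER FULLTEXT CATALOG MyCatalog REBUILD;  -- rebuild the full-text catalog too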
My next thought was that there was an issue in the WebService layer that the site uses to communicate with the data layer... nope, worked fine on test.
Borked Upgrade? Nope, redid the upgrade (and it worked fine on the test system).
Additional Settings in the web.config causing timeouts? Nope, compared the configs.
DB Connections not being disposed of correctly? Nope, took it into single user mode and back again.
An issue with the sheer amount of data (test had a smaller, mismatched dataset)? Nope, restored the live database over the test database...
So, in hopes of replicating the issue in a place where I could step through the code, the live database was restored into an isolated location and the code base copied... nope, the search worked fine on this installation.
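For completeness, the isolated restore was nothing fancy; something along these lines, where the database name and file paths are placeholders:

RESTORE DATABASE MySite_Investigation
FROM DISK = N'D:\Backups\MySite_Live.bak'
WITH MOVE 'MySite_Data' TO 'D:\Data\MySite_Investigation.mdf',
     MOVE 'MySite_Log' TO 'D:\Data\MySite_Investigation.ldf';
-- the logical file names can be checked beforehand with:
-- RESTORE FILELISTONLY FROM DISK = N'D:\Backups\MySite_Live.bak';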
An issue with the database engine itself? Nope, restored the database onto the same DB server and app server, and the search worked fine on that installation too.
I'd like to mention that by this point, we're on the second day, and I'd dropped everything as I couldn't think about anything else... it was like a drug, and I was completely addicted...
I started to think of completely off-the-wall ideas like dropping all the data and re-inserting it, or restoring data from before the upgrade; I was getting desperate... So we decided that if we hadn't found a fix by 4pm (it was around 12pm at this point), we would stop the site, back up the database, restore it to another location, and repoint the site to it. This would solve the problem (as it's the same as what we'd done to test it), but was by no means a solution. This gave me a new burst of enthusiasm; I always feel that if you have a fallback plan and a deadline, the solution presents itself faster.
We started to think about the problem like Dr. Gregory House. I got a couple of people in a room, and we started to write down the symptoms and what we'd tried:
- Ruled out the data itself being the issue
- Ruled out the indexes (full text included)
- Ruled out the website itself
- Ruled out the code
The only thing left was the original database, in its original location.
This led to the revelation that the databases on the database server hadn't been offline EVER (well, since the new DB server was installed 2 years ago). Therefore, everything that was in memory and not forcibly removed by SQL Server was still in memory. So we asked the question, "What is kept in memory, and not recreated exactly, when you restore a database?"... The answer is a lot of things, but one of them is the cached query plan.
I found out that part of this update was the addition of over 100,000 postcodes to a table that had previously held fewer than 50. A query plan compiled and cached when the table held 50 rows can behave very badly indeed when run against 100,000.
So, we had a look at the plan cache, and found that the use count on the queries in question was massive (comparable to a whole month's worth of site usage); the same stale plan was being reused for every search. This gave me the ammunition to try something experimental on a production machine.
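If you want to have the same look at the plan cache yourself (SQL Server 2005 onwards), something like this works; the '%postcode%' filter is just illustrative:

SELECT cp.usecounts, cp.cacheobjtype, cp.objtype, st.text
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
WHERE st.text LIKE '%postcode%'  -- narrow down to the suspect queries
ORDER BY cp.usecounts DESC;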
The plan was to take the database into single-user mode, take it offline for a minute, then bring it back online and into multi-user mode again. This meant a small amount of downtime, which we could chalk up to connectivity issues if anyone noticed. We ran a test on another database, and worked out that it should take 2 minutes at most.
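For anyone wanting the exact steps, it's roughly this (the database name is a placeholder):

ALTER DATABASE MySite SET SINGLE_USER WITH ROLLBACK IMMEDIATE;  -- kick everyone off
ALTER DATABASE MySite SET OFFLINE;
ALTER DATABASE MySite SET ONLINE;
ALTER DATABASE MySite SET MULTI_USER;  -- open it back up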
For me, though, this didn't sit right... It felt like a workaround, and surely people have had to deal with this in the past... surely... So I set about speed-reading some articles on cached query plans and how they work. I came across an article with a DBCC command you can use to clear the plan cache at a per-database level on SQL Server 2005 (apparently on 2008 you can do this on a per-query basis).
The command is:
DBCC FLUSHPROCINDB (<dbid>)
And you can get the dbid from:
SELECT * FROM sys.databases
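Putting the two together, this is roughly what we ran (the database name is a placeholder, and bear in mind FLUSHPROCINDB is an undocumented command, so use it at your own risk):

DECLARE @dbid INT;
SET @dbid = DB_ID('MySite');  -- placeholder name; or look it up in sys.databases
DBCC FLUSHPROCINDB (@dbid);   -- clears cached plans for this database only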
As soon as the command was run, the search on the site returned, albeit quite slowly (~30 seconds), presumably while a fresh plan was compiled; the second time, it took less than a second.
So, if you have an issue with a slow part of a site after a large data load, and you can't replicate it after a DB restore, you could well be dealing with a similar issue.
Hope this helps someone...
Note: I do have an idea that updating the statistics may have helped too; however, I'm not 100% sure. From what I've been reading, it seems that once a query is optimized so far, it stops optimizing and just runs, so I'm not sure whether this would cause the cached plans to update themselves...
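If anyone fancies testing that theory, it would be something along these lines (the table name is a placeholder):

UPDATE STATISTICS dbo.Postcodes WITH FULLSCAN;  -- refresh stats on the suspect table
-- or, across the whole database:
EXEC sp_updatestats;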