‘Ingeniously simple tools’, the box
promised. Yeah, right – I’d heard that one before. Half-heartedly, I installed
yet another tool: ANTS
Performance Profiler from Red Gate. As the
progress bar filled up, I wondered if I’d be able to find a solution to my
problem, before time – or my sanity – ran out, or my CEO's patience.
My story begins three weeks earlier, in
Cape Town. As the IT manager and senior developer for a financial institution,
the buck stops with me when there are code problems. We have a small
development team, and as is the nature of small teams, we find ourselves
wearing many hats.
Although the team is small, the company
relies heavily on our technology and especially on our ability to make rapid
changes. This gives us an edge over our competitors who normally have longer
turnaround times with their technology changes.
Within our environment, we have several SQL
Server databases; we have 3rd party software that manages our debtors’ book,
accounts, and collection strategy; on top of that, we have an in-house ASP.NET application
that our 80+ users work with on a daily basis. The application performs all
manner of tasks, from CRM, to account maintenance, journal management, and accounting.
Each month, peak operation is around 100,000 database transactions a day.
Our most mission-critical application is in
daily use, and if it ever goes wrong, heads will roll. It’s a console
application written about eight years ago in .NET Framework 2. Several people
have worked on the code, and over time, quick fixes have been made to address
changing business needs.
The application uploads a collection file to
a bureau that deducts instalment amounts from debtors’ banks accounts and pays
over to our bank account. If payment doesn’t get taken from an account, the
chance of getting that money from the debtor decreases by 25% for every day not
collected, resulting in cash flow problems as well as increased work and
pressure for the Collections team. In a nutshell: don't mess this application up.
The software wakes up every morning at 10am
and collects debtors’ information from various databases. It then compiles
several files, connects to an external bureau via web services, and uploads the
files. The app then polls for a return file and a whole bunch of activity takes
place. The final step is a human process that must authorise the collection
files. All of this must happen before the 10:30 cut off. Miss the cut off, and,
well… you know what hits the fan.
One morning, the application collapsed with
an obscure error indicating a database timeout. And then it happened again. And
again. For weeks, the app would periodically crash, and we were powerless to do
anything – as it talks to three databases, we weren't even sure where the
timeout was occurring. The crashes started taking their toll on my team: in
order to make the tight deadline, we usually had to upload the files by email,
and then manually enter the data into the database.
We had to start the process earlier and
earlier to ensure we could make the cut-off, which in turn put pressure on the Collections
team, as they only had a few hours to make any adjustments and load accounts
for the daily run. The collections manager had a daily blue-purple fit and
ordered his team to come in earlier to cope.
As an IT manager who’s been with the
company over 20 years, it is a terrible thing to sit in an emergency management
meeting, in front of my leaders and peers, and have no choice but to say those words
I detest most in the world: "I don’t know."
To top it all, the largest and most critical
month end run was fast approaching. This run affects people's lives, as our
collectors work on commission. To think of those people not getting paid properly,
because I did not know, weighed heavily on my shoulders. I needed to find an
answer.
Like demons, we attacked and analysed every
single stored procedure and SQL statement, and we found nothing. There
were no bottlenecks in the database: the database loop was working just fine.
But then where in the application was the
database timing out? We went through the code until our eyes watered, but we
could not identify where the bottleneck was. Eventually, after a Google search,
I downloaded some tools to help identify bottlenecks. Most were too complex to
set up, some needed code changes, and some should have been implemented eight
years ago. One seemed to offer a bit more hope – Red
Gate’s .NET profiler called ANTS Performance Profiler.
I installed it and quickly loaded the
project. So far, so good – easy to use and no hiccups. I ran the project and
reviewed the analysis. The interface was awesome, and had a whole bunch of
information that for once made sense. Nice plus: it gave me performance data
for both my .NET code and SQL Server queries, all in one single profiling
session. Neat!
What gave me a jolt of excitement was
seeing something I hadn’t seen anywhere else: a spike in the wall-clock CPU
time, one minute into the process.
Fig 1: ANTS Performance Profiler showing
a spike in the wall-clock CPU time
A little further investigation, and bam –
there it was. The high hit counts and time outs led me straight to the real
culprit: basTools.
Fig 2: The performance profiler pointing
to the basTools.vb module as the root cause of the problem.
Inside the basTools module were some standard
basic tools we use in every project. It hadn’t crossed any of our minds to
check if they might cause a timeout problem in SQL Server.
The problem took all of two seconds to fix,
and to this day I sometimes bang my head against the wall, re-living the
moment.
And what was the problem? When an account
has an alert on it, an automatic email gets sent to an account manager. We had
discontinued one of the product types, and recently re-instated it. This
triggered an alert that called an old "discontinued" mail routine.
Inside the mail routine was – wait for it – a hard coded IP address for a proxy
server. The IP address belonged to a server that no longer existed. The
application was trying to resolve the proxy inside a SQL loop, failing, and
eventually giving up, reporting a timeout.
We pointed the IP address to a new server
(this time, storing the setting in the config file) and ran the project
successfully. To date, touch wood, we haven’t had a single timeout.
In conclusion, ANTS
Performance Profiler saved the day as well as my hide, and my team came out
looking like heroes. The error message looked as if the timeout was
coming from the database. We never once suspected faulty code. The .NET
profiler pinpointed the culprit in just a few minutes (a piece of legacy code
with a relative distant relationship with the database) and we were able to
improve application performance almost immediately.
ANTS Performance Profiler is indeed an ingeniously
simple tool. A tool that does what it promises to do, and does it well.
What I don’t like about ANTS Performance
Profiler though, is that we didn't install the darn thing years ago.