Introduction
Earlier in the year, the software company I work for released RightCalc, a solution designed to enable insurance companies to centrally manage their prices.
Prior to launch, we needed to be sure that the system could handle the anticipated traffic whilst still responding to the majority of requests in an acceptable time. Failure to do so could result in frustrated customers and lost sales for the insurance companies. This requirement brought about the development of ServiceMon.
Background
Before running any performance tests, it's crucial to understand what you are trying to achieve by building up a realistic picture of how the service will be used.
In the case of RightCalc, we were fortunate to get a consensus on performance targets
early on in the development of the system. These are the benchmarks we agreed:
1. Typical load testing
The system must handle a typical load of 3 requests per second while continuing to deliver respectable response times of 50ms or less for 99% or more of requests. We anticipated this would be sufficient for the first few months after the system was launched.
2. Soak testing
Ensure the system is capable of running for a prolonged period of 1 year, with a sustained load of 1 request/sec, without any manual intervention. The system should maintain a steady response rate throughout the duration of the test.
3. Spike testing
Ensure the system can handle brief, intense traffic spikes without causing a loss
of data or a dramatic performance hit. The system must handle occasional, intense
loads of 100 reqs/sec for 30 seconds whilst still
responding to 99% of requests within 100ms.
Introducing ServiceMon
We developed ServiceMon
specifically for the RightCalc project to help monitor critical web pages and web
services and to apply load and gather statistics that would prove the system was
capable of handling the three scenarios above.
When we were unable to find a suitable monitoring solution, we decided to build one that worked on Windows, was extensible, and was simple to configure. We needed an easily configurable solution that would let us develop ad hoc tests to investigate new avenues of research suggested by the results, while still being able to manage a standard, continuously running monitoring suite that could be version controlled, in a similar manner to source code, using standard tools such as Subversion.
Last week Kaleida made the decision to release
ServiceMon to the community as open-source, under the
GPL license, in the hope that other developers, QA and ops staff could
benefit from the tool.
ServiceMon works by executing
a very simple script containing operations like this:
http-get "http://www.google.com" must-contain "<title>Google</title>"
The script can contain any number of operations, but they all follow the structure of request followed by response-handler. The request part of the operation, http-get, is executed on the tick of a timer (which occurs every second by default) and then waits for a response to be received. When the response arrives it is given to the response-handler, in this case must-contain, which checks that the HTML contains the specified phrase.
Whilst an operation is waiting to receive a response, script execution continues and subsequent operations carry on being executed on each new timer tick.
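To make that execution model concrete, here is a minimal C# sketch (not ServiceMon's actual code) of a timer-driven loop: each tick starts the next request without waiting for earlier responses, and each response is checked by its handler as it arrives.

using System;
using System.Diagnostics;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Illustration only - not ServiceMon's implementation.
class TimerDrivenRunner
{
    static readonly HttpClient client = new HttpClient();

    static void Main()
    {
        // Fire one operation per tick; outstanding requests keep running in the background.
        using (var timer = new Timer(_ => RunOperationAsync(), null,
                                     TimeSpan.Zero, TimeSpan.FromSeconds(1)))
        {
            Console.ReadLine(); // keep the process alive while the timer ticks
        }
    }

    static async Task RunOperationAsync()
    {
        var stopwatch = Stopwatch.StartNew();
        string body = await client.GetStringAsync("http://www.google.com");
        stopwatch.Stop();

        // The response-handler part: check the body and record success or failure.
        bool ok = body.Contains("<title>Google</title>");
        Console.WriteLine("{0}ms {1}", stopwatch.ElapsedMilliseconds, ok ? "OK" : "FAIL");
    }
}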
If the request or response-handler is not successful, a failure
is recorded and the screen changes to reflect this:
There are a number of built-in request types which are useful for common testing scenarios, such as monitoring whether a web page is available or pinging a server.
To produce performance statistics for a proprietary SOAP web service, we'll need
to write a custom request type. We'll discuss how to do this later (don't worry,
it's very straightforward!) but first we'll take a closer look at the types of statistics
ServiceMon produces.
Response Time Statistics
Producing a response time distribution graph is an excellent method of building up a picture
of your web service's performance characteristics.
First, we make a large number of requests over a prolonged period while the server
is under a stable load. Then, having recorded the time each request took, we find
the appropriate "bucket" and add one to its item count.
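The bucketing itself is trivial; a short C# illustration (not part of ServiceMon) might look like this, assuming the response times have already been recorded in milliseconds:

using System;
using System.Collections.Generic;

// Illustration only: count recorded response times into 10ms-wide buckets.
class HistogramExample
{
    static void Main()
    {
        // Example timings in milliseconds; in practice these come from the test run.
        int[] responseTimesMs = { 23, 27, 35, 18, 42, 24, 29, 31 };
        int bucketWidthMs = 10;

        // Key is the lower edge of each bucket (0, 10, 20, ...); value is the item count.
        var buckets = new SortedDictionary<int, int>();
        foreach (int ms in responseTimesMs)
        {
            int bucketStart = (ms / bucketWidthMs) * bucketWidthMs;
            int count;
            buckets.TryGetValue(bucketStart, out count);
            buckets[bucketStart] = count + 1;
        }

        foreach (var bucket in buckets)
            Console.WriteLine("{0}-{1}ms: {2}", bucket.Key, bucket.Key + bucketWidthMs, bucket.Value);
    }
}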
Imagine we've made 100 requests and recorded their time taken in "buckets"
10ms wide. We may see results like this:
Response time (ms) | Count
-------------------|------
0-10               |     0
11-20              |     2
21-30              |    49
31-40              |    22
41-50              |    12
51-60              |     8
61-70              |     4
71-80              |     2
81-90              |     1
91-100             |     0
When these values are plotted, we'll often see this
characteristic curve:
The position of the peak shows the response time most often experienced; the further
to the left, the better.
The shape and scale are also important. A thin peak is ideal and shows that the web service delivers consistent response times - most visitors will have a similar wait. A large spread and an extended right-hand tail are sub-optimal, and further investigation is needed to determine why the response time is so variable.
The graph should ideally contain a single peak. The presence of a
secondary peak should be investigated as this may indicate a frequently
occurring background task affecting the processing of a significant proportion of
requests, such as a backup job or re-indexing task.
For some purposes, such as an SLA, it is useful to describe the curve quantitatively by specifying a number of "nines":
- 90% of requests take less than 60ms
- 99% of requests take less than 80ms
- 99.9% of requests take less than 90ms
- 99.99% of requests take less than 90ms
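There is no single definition of these figures, but a simple nearest-rank calculation over the recorded timings is usually good enough. Here is a C# sketch (not part of ServiceMon) of that approach:

using System;

// Illustration only: derive "nines" figures from recorded response times.
class NinesExample
{
    static void Main()
    {
        // Example timings in milliseconds; in practice use the full set of recorded responses.
        double[] timesMs = { 12, 18, 22, 25, 31, 44, 58, 61, 79, 140 };
        Array.Sort(timesMs);

        foreach (double percentile in new[] { 90.0, 99.0, 99.9, 99.99 })
        {
            // Nearest-rank method: the value at or below which 'percentile' per cent of samples fall.
            int index = (int)Math.Ceiling(percentile / 100.0 * timesMs.Length) - 1;
            Console.WriteLine("{0}% of requests took {1}ms or less", percentile, timesMs[index]);
        }
    }
}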
For websites, and web services indirectly consumed by humans, your target response times should consider how humans perceive time. Research shows that a response received within 100ms feels instant, one within 1,000ms (1 second) is tolerable, but anything longer than 10,000ms (10 seconds) is enough for the user to lose interest and do something else.
Viewing Statistics with ServiceMon
Now that we've looked at performance measures of a web site or web service, we'll
use ServiceMon to produce
these statistics.
For this example, we'll create a script with one operation - a simple HTTP GET request:
One word of warning: Don't run aggressive performance tests against
web sites or web services that you don't own. It is impolite to bombard
someone else's web server with thousands of requests - in many countries it is illegal
- and can be considered a
denial-of-service attack. At the very least, your IP address will probably
be throttled or blocked.
To begin monitoring, press the "Start" button on ServiceMon. The screen
will automatically change to the "10 foot" status display which is designed
to be viewed at a distance:
The green background and smiley face show that monitoring is active and no errors
have been detected. We can see more detail by viewing the "Responses"
tab which shows each individual response received. However, what is of real interest
is the response time distribution graph viewable through the "Statistics"
tab. This graph is updated every second and, after running for a while, will look
something like this:
This graph, which is now plotted on a logarithmic x-axis to reveal the detail of the hump, conforms to the expected shape. It shows us at a glance that the majority of requests were sent, processed, and received within approximately 10ms. This can be confirmed by looking at the "nines" values in the boxes to the right: 99% of requests were processed in 15ms; 99.9% in 25ms. Notice how the figure for 99.99% is significantly higher than the others at 191ms? This demonstrates the importance of obtaining a large enough sample size. This test run made only around 2,800 requests and, due to this low number, it only took one slow response to push the 99.99% figure skywards. We really need many tens of thousands of requests before we can obtain trustworthy figures.
This test was performed with no background load on the server and a test load of
2 requests a second (1 per 500ms). To build up a more comprehensive picture of the
performance profile we'll need to repeat the test with different background loads
and different test loads. It is possible to use
ServiceMon for both jobs by creating two scripts which are run in different
instances of the utility. One instance will be used to apply a background load and
then one or more extra instances of ServiceMon will apply the test load and produce
the actual statistics. We used this approach to good effect when testing RightCalc
prior to launch and continue to use it to assess the live service for performance
problems.
Creating a Custom ServiceMon Request Type
So far we've looked at how to build up a performance profile for a website, using
the built-in http-get
request type. But how do we test a SOAP web service
that requires more than just a URL to invoke?
A SOAP message will have a header, which may contain authentication tokens and routing information, and a body, which contains the request data. Every SOAP service and web method has its own specific requirements, which makes it difficult for a tool like ServiceMon to support them in a generic fashion.
Fortunately, ServiceMon is built in an extensible way which allows new request types and response-handlers to be easily plugged into its framework and then used in the same manner as the built-in operations.
We'll start by creating a new ServiceMon request which calls an example web service provided by W3Schools that simply converts temperatures between Fahrenheit and Celsius.
First we need to create a new Visual Studio Class Library solution:
Then, add a reference to the ServiceMon framework (this will be in the same location
as ServiceMon.exe
, usually at C:\Program Files (x86)\Kaleida\ServiceMon
):
Kaleida.ServiceMonitor.Framework.dll
Finally, create a new public class called GetCelsius
which derives
from PreparedRequest
:
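The exact members of PreparedRequest are documented with the framework; the minimal sketch below assumes it exposes an overridable GetResponse() method that returns the response as a string, so the precise signature may differ in your version.

using Kaleida.ServiceMonitor.Framework;  // namespace assumed to match the framework DLL

public class GetCelsius : PreparedRequest
{
    // A stub for now - we'll call the real web service later.
    // (Assumes PreparedRequest declares an overridable GetResponse() returning a string;
    // check the framework's documentation for the exact signature.)
    public override string GetResponse()
    {
        return "my response";
    }
}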
That's all we have to do to create a new request type. Obviously, it doesn't call
the web service yet, but there's enough here for us to test it in ServiceMon.
After building the DLL, copy it to the Operations folder that is located in the same place as your ServiceMon.exe (usually C:\Program Files (x86)\Kaleida\ServiceMon\Operations).
Restart ServiceMon and, to verify that the DLL was successfully discovered, view
the Help tab in ServiceMon:
The new request type will be listed with the name get-celsius, which is derived from the name of the class. Notice how it doesn't take any parameters yet or have a description. We'll get to that later.
You can now use this request as you would any of the built-in requests. Create a
new script and add this line:
get-celsius log-response
When you click the Start button and view the Responses tab you'll see something like this:
03-Oct-2012 13:05:24.329 0ms [no description present] and log response my response
Next we'll go back to our Visual Studio solution to finish things off.
A get-celsius
request isn't particularly useful unless a Fahrenheit
value can be specified. The way we do this is by adding a public constructor to
our class with a string parameter:
public class GetCelsius : PreparedRequest
{
    // The Fahrenheit value supplied in the script, stored for use by GetResponse.
    private string fahrenheit;

    public GetCelsius(string fahrenheit)
    {
        this.fahrenheit = fahrenheit;
    }
Now we need to add a reference to the web service and complete the implementation
of GetResponse
.
The web service proxy is built using the "Add Service Reference..." option
on the Project menu:
This is the code to add to GetResponse
to actually call the web service
with our Fahrenheit value:
// Build a basic (unsecured) HTTP binding and point it at the W3Schools TempConvert endpoint.
var binding = new BasicHttpBinding(BasicHttpSecurityMode.None);
var endpoint = new EndpointAddress("http://www.w3schools.com/webservices/tempconvert.asmx");

// Call the generated proxy with our Fahrenheit value and return the Celsius result.
var soapClient = new TempConvertSoapClient(binding, endpoint);
return soapClient.FahrenheitToCelsius(fahrenheit);
Finally, we'll override the Description
property. Here's the finished
class:
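To give an idea of how the pieces fit together, the finished class might look something like the sketch below. As before, the override of Description assumes PreparedRequest exposes an overridable string property of that name, and the description text shown is only illustrative.

using System.ServiceModel;
using Kaleida.ServiceMonitor.Framework;  // namespace assumed to match the framework DLL
// (plus the namespace of the generated TempConvert service reference)

public class GetCelsius : PreparedRequest
{
    // The Fahrenheit value supplied in the script.
    private string fahrenheit;

    public GetCelsius(string fahrenheit)
    {
        this.fahrenheit = fahrenheit;
    }

    // Shown in the Responses and Help tabs; assumed to be an overridable string property.
    public override string Description
    {
        get { return "get-celsius " + fahrenheit; }
    }

    public override string GetResponse()
    {
        // Call the W3Schools TempConvert service and return the Celsius result.
        var binding = new BasicHttpBinding(BasicHttpSecurityMode.None);
        var endpoint = new EndpointAddress("http://www.w3schools.com/webservices/tempconvert.asmx");
        var soapClient = new TempConvertSoapClient(binding, endpoint);
        return soapClient.FahrenheitToCelsius(fahrenheit);
    }
}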
And that's it! Copy the new DLL into the Operations folder and change the script
so that the Fahrenheit value is specified:
get-celsius "98.6" must-equal "37"
Click Start and ServiceMon will begin monitoring the tempconvert
web
service, alerting you the moment it fails to return the expected result.
To monitor RightCalc we have a number of
custom request types like this, one for each web service. Our scripts constantly
monitor the web pages and web services on the live system and the different staging
platforms to give us instant notification of any live problems, or potential regressions
in the development pipeline. Our system has been running for over six months and
ServiceMon has proved to be an invaluable tool.
We hope you find ServiceMon useful. Please get in touch if you have any questions or would like to help build new extensions.
The project's home page, latest downloads, and all documentation can be found here:
ServiceMon Home Page
History
First Draft: 6th September 2012
Updated to reflect changes in newer versions of ServiceMon: 17th September 2012
Changed title and updated screenshots taken from v1.0: 3rd October 2012
Changed license to GPLv3 and included ServiceMon v1.0 source code with article. Removed links to company: 8th October 2012