Background
Visually Similar is an optional service on the Bing.com shopping site: starting from a clothing or shoe item, it finds other items that look similar. If our response time exceeds 750 milliseconds, the request times out on the client side. The shopping team was investigating a problem where all requests against our service failed in one data center. We suspected that occasional timeouts in our service were causing their server to go haywire. They wrote a test program that repeatedly queries our service, and found that 13% of requests time out under a normal load of 7 QPS (queries per second). I did not want to tune our search algorithm, because it was developed by a teammate who has left the team. So I decided to enforce a timeout on the server side.
The following graph shows that the response time was occasionally very high.
Analysis
Architecture: Our service has the following architecture: WCF client at the shopping server -> WCF service/.NET remoting client -> fan-out connection to 3 .NET remoting servers, which perform the actual search.
The WCF client is beyond our control, so I could insert the timeout at the following places, in order of increasing complexity:
1. Add timeout configuration in the WCF service definition
2. Add timeout configuration in the .NET remoting client definition
3. Add custom timeout code in the WCF service
4. Add custom timeout code in the .NET remoting server
Attempt 1
I first tried #1, since it is the easiest and the most upstream in the call chain. All I had to do was add the following to the web.config file:
<bindings>
  <basicHttpBinding>
    <binding name="Binding1" ...
             receiveTimeout="00:00:01"
             sendTimeout="00:00:01" />
  </basicHttpBinding>
</bindings>
Unfortunately, this did not work. I found web forum discussions saying that the WCF service timeout will be 22 seconds even if the configured value is smaller.
Attempt 2
I then tried #2, because it is the next easiest. I just needed to add a global configuration to the .NET remoting client. When the timeout expires, the .NET remoting call throws an exception. The following is the code:
// Register a client-side TCP channel with a timeout (in milliseconds).
IDictionary t = new Hashtable();
SoapClientFormatterSinkProvider sc = new SoapClientFormatterSinkProvider();
t.Add("timeout", (uint)600);   // 600 ms
TcpClientChannel clientChannel = new TcpClientChannel(t, sc);
ChannelServices.RegisterChannel(clientChannel, false);
This is a very elegant programming model and simple to implement. However, the actual timeout was 485 milliseconds longer than the configured value. (I tested timeout settings of 200 and 600 milliseconds.) This is the response from the "remoting support" team: "I don't think it guarantees the exception will be raised at the exact time specified in the timeout value, only that it will be hit if the timeout value specified is exceeded." We need accurate timeout values, so I could not use this approach.
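For reference, this is roughly how I verified the gap; the snippet is a sketch rather than the original test code, reusing the remoted call from the code above with illustrative arguments:
Stopwatch sw = Stopwatch.StartNew();
try
{
    client.SearchVisualSimilarRange(prodID, ...);   // any remoted call will do
    Console.WriteLine("completed in {0} ms", sw.ElapsedMilliseconds);
}
catch (Exception e)   // the channel surfaces the expired timeout as an exception
{
    // With the 600 ms setting this reported about 1085 ms, not 600.
    Console.WriteLine("failed after {0} ms: {1}", sw.ElapsedMilliseconds, e.Message);
}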
Attempt 3
I then tried #3 and #4 with the following sample timeout code:
delegate QueryResults MethodDelegate2(Decimal prodID, ...);

public QueryResults SearchVisualSimilarRangeWithTimeout(Decimal prodID, ...)
{
    MethodDelegate2 dlgt = new MethodDelegate2(client.SearchVisualSimilarRange);
    IAsyncResult ar = dlgt.BeginInvoke(prodID, ..., null, null);
    if (!ar.AsyncWaitHandle.WaitOne(timeMsec))
    {
        // Caller sees a timeout; the underlying call keeps running in the background.
        throw new TimeoutException();
    }
    return dlgt.EndInvoke(ar);
}
I made the call asynchronous. When a call times out, I allow it to run to completion asynchronously, but the main call thread returns immediately after the timeout.
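One caveat with this fire-and-forget pattern: the CLR expects every BeginInvoke to be matched by an EndInvoke, and the timed-out path above never makes that call. The sketch below shows one way to keep the pair while still returning early; it is not the production code, the delegate is simplified to a single argument, and SearchWithTimeout and SearchCall are hypothetical names:
delegate QueryResults SearchCall(Decimal prodID);

QueryResults SearchWithTimeout(SearchCall search, Decimal prodID, int timeMsec)
{
    QueryResults result = null;
    Exception error = null;
    ManualResetEvent done = new ManualResetEvent(false);

    // The completion callback always calls EndInvoke, even when the caller
    // has long since timed out and moved on.
    search.BeginInvoke(prodID, delegate(IAsyncResult iar)
    {
        try { result = search.EndInvoke(iar); }
        catch (Exception e) { error = e; }
        done.Set();
    }, null);

    if (!done.WaitOne(timeMsec))
    {
        throw new TimeoutException();   // the background call still completes and cleans up
    }
    if (error != null) throw error;
    return result;
}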
#3 did not work, and #4 partially worked: the timeout rate decreased from 13% to 3%. Also, the problem got worse as the requests per second increased. The following is the response time reported by the test program:
Attempt 4
With #4 alone, a mess-up in the WCF service code could still cause timeouts, so I had to make #3 work. To troubleshoot why #3 did not work, I added timestamped tracing to the WCF service (a sketch of the tracing appears after the list below) and discovered the dynamics of the timeout scenario:
- A large part of the call time perceived by the client is spent while the request waits in the WCF request queue. My timeout code does not help in this situation.
- When the WCF service has several threads running, it may decide to hold more requests in the queue. This queuing behavior is not consistent, and WCF may decide to dispatch a large number of queued requests to the service code at once. This creates a wave of congestion in the .NET remoting servers and slows response time.
- These two factors compound into a more severe congestion problem.
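For illustration, the tracing amounted to little more than stamping entry and exit of the service method; the helper below is a sketch (the real service wrote to its own log, and the method name is taken from the code above):
static void Trace(string msg)
{
    // Timestamp with millisecond precision plus the worker thread id.
    Console.WriteLine("{0:HH:mm:ss.fff} [{1}] {2}",
        DateTime.Now, Thread.CurrentThread.ManagedThreadId, msg);
}

// Inside the service method:
Trace("enter SearchVisualSimilarRange");
...
Trace("exit SearchVisualSimilarRange");
Comparing the entry stamp with the client's send time showed how long each request sat in the WCF queue before the service code ever saw it.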
To solve this problem, I added code to the WCF service to drop new requests when the number of concurrent requests exceeds a threshold. Our search algorithm is CPU intensive and uses multiple threads, so letting more requests through while the system is already overloaded would cause a much larger number of requests to time out. Furthermore, the ripple effect of the queue discussed above can congest our entire system for a long period of time. So this should actually increase the overall success rate of our requests.
// Number of requests currently executing inside the service code.
volatile static int concurr = 0;

try
{
    // The increment happens first, so the count includes this request.
    Interlocked.Increment(ref concurr);
    if (concurr > InitState.Instance.maxConcurr)
    {
        // Over the threshold: drop this request with an empty response
        // rather than adding to the congestion.
        return new KEXResponse();
    }
    ...
}
finally
{
    // Runs whether the request was served or dropped.
    Interlocked.Decrement(ref concurr);
}
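Returning an empty KEXResponse means an overloaded request degrades to "no visually similar items" instead of a client-side timeout, which is why the result below counts dropped requests and timeouts together.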
Alternatively, I could have investigated different WCF queuing configurations, but the simple fix above did the trick, so there was no need to dig further into WCF for a solution.
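For completeness, the built-in WCF knob closest to this hand-rolled counter is the serviceThrottling behavior. The snippet below is an illustrative sketch, not a configuration I deployed; the behavior name and thresholds are made up:
<behaviors>
  <serviceBehaviors>
    <behavior name="ThrottleBehavior">
      <!-- Caps how many calls may execute inside the service code at once. -->
      <serviceThrottling maxConcurrentCalls="8"
                         maxConcurrentSessions="8"
                         maxConcurrentInstances="8" />
    </behavior>
  </serviceBehaviors>
</behaviors>
Note that throttled calls wait in the queue rather than being rejected, so this caps concurrency but does not shed load the way the custom drop-on-overload code does.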
Result
The result: after 2 days of work, the combined dropped-request and timeout rate observed by the client fell from 13% to below 1%, with the timeout rate alone dropping to 1 in 2,000. The system is also much more stable under high requests-per-second load. The feature came back online after the fix was deployed.
Lessons Learnt
- Never take the documented behaviors of WCF and .NET remoting for granted. Test them with a small program before applying them to production code.
- The ripple effect between WCF queuing and the backend service can amplify congestion; a good test program is needed to simulate real traffic and uncover the problem.
- A combination of a good test program, tracing, and problem analysis can sometimes solve problems faster than profiling tools and debugging.