Avoiding the pitfalls of third-party components in a large
data processing application
As a software engineer at an Australian defence contractor, I work on a wide variety of projects on a number of platforms – from C++/ASM signal processing deployed on underwater processing nodes to C#/WPF high-speed data visualisation for real-time shipboard systems.
The current project my team and I are working on involves porting our core acoustic analysis tool, written in Borland C++ in 2008, to C#/WPF. The application reads multi-channel acoustic data recorded underwater, applies frequency signature correction curves and visualises the result as a LOFAR (low-frequency analysis and recording) display.
The application works in near-real time,
allowing the operator to analyse a channel of acoustic data immediately after a
recording segment is completed. The faster the analysis process can be
completed, the faster the processed results can be given to the customer.
Each channel of acoustic data can be up to 1GB in size. The transition to a newer development platform has allowed us to move to 64-bit and to cache large amounts of data in memory at various processing stages. This vastly improves the responsiveness of the application and enables processing workflows that were not previously possible.
The initial development stage focused on
the core processing window (pictured above), which is also the part of the
application that performs all of the heavy computational work.
During testing on some real world data sets,
I noticed that, when zooming in on the LOFAR or changing any processing
parameters requiring a redraw of the LOFAR, the memory footprint of the
application would increase. After many zoom/processing changes, the application
would also begin to slow considerably.
The generation of the LOFAR is performance-intensive: it involves FFT processing the acoustic data, applying calibration transfer functions, and then decimating the result for display in the UI bitmap. The decimation step is key to ensuring that no visual information is lost in the translation from the raw FFT-resolution data (up to 24000x48000 pixels) to the displayed WriteableBitmap resolution (typically two thirds of the total desktop resolution).
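The idea behind the decimation is simple, even though our implementation lives inside the third-party library and operates on unmanaged buffers. A purely illustrative sketch (the method and parameter names are mine) of peak-preserving max decimation of a single FFT row down to the display width:

// Illustrative sketch only: the real decimation is performed by the
// third-party signal processing library on unmanaged buffers.
static float[] DecimateRow(float[] fftRow, int displayWidth)
{
    var pixels = new float[displayWidth];
    double binsPerPixel = (double)fftRow.Length / displayWidth;

    for (int px = 0; px < displayWidth; px++)
    {
        int start = (int)(px * binsPerPixel);
        int end = Math.Max(start + 1, (int)((px + 1) * binsPerPixel));

        // Keep the maximum value in each bucket so narrowband peaks are not
        // lost when many FFT bins collapse into a single display pixel.
        float max = float.MinValue;
        for (int bin = start; bin < end && bin < fftRow.Length; bin++)
        {
            max = Math.Max(max, fftRow[bin]);
        }
        pixels[px] = max;
    }
    return pixels;
}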
The FFT processing and decimation routines
perform signal processing on large arrays (both managed and unmanaged) using a third-party
C/assembler signal processing library. In addition to this, several moderate-sized
unmanaged arrays are created and destroyed along the way.
I assumed that this was where the problem was occurring – that we were not freeing our unmanaged memory each time. However, some simple debugging showed this was not the case. The good thing about working with unmanaged memory in C# is that you control exactly when and where it is allocated and freed!
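That control is easy to demonstrate. Our interop layer follows the usual explicit allocate/free pattern; the routine below is an illustrative stand-in rather than our actual code:

using System;
using System.Runtime.InteropServices;

// Illustrative only: unmanaged buffers are allocated and freed explicitly,
// so if we were leaking here we would know exactly which allocation to blame.
static void ProcessChannel(short[] samples)
{
    IntPtr buffer = Marshal.AllocHGlobal(samples.Length * sizeof(short));
    try
    {
        Marshal.Copy(samples, 0, buffer, samples.Length);
        // ... hand the buffer to the native signal processing routines ...
    }
    finally
    {
        // Deterministic clean-up: we decide exactly when this memory is freed.
        Marshal.FreeHGlobal(buffer);
    }
}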
As I’m relatively new to .NET and managed
memory, I then assumed that I was dealing with a garbage collection issue: all
of the unmanaged memory was being cleaned up, I could see that all our large
cached arrays were being appropriately reset, and references to them were not
stored anywhere. However, my lovely GC.Collect()
hack had minimal effect on the
problem.
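For reference, the hack was nothing more sophisticated than forcing a full, blocking collection between operations, along these lines:

// Diagnostic hack only: force a full, blocking garbage collection.
// Its minimal effect suggested the leaked objects were still reachable
// from somewhere, rather than simply awaiting collection.
GC.Collect();
GC.WaitForPendingFinalizers();
GC.Collect();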
The only other potential issue I could
think of was inadvertent lingering references, such as subscribing to Events
and not unsubscribing, or storing references to objects in lists and not
removing them when no longer required. Thus began a manual search through the
codebase…
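The shape of leak I was hunting for is easy to write and just as easy to miss. A hypothetical example (both class names here are invented for illustration):

using System;

// Hypothetical example of the pattern I was searching for: a long-lived
// publisher silently keeps every subscriber (and everything the subscriber
// references) alive until the handler is unsubscribed.
class DataLoadedPublisher                  // long-lived
{
    public event EventHandler DataLoaded;
}

class ResultsView                          // intended to be short-lived
{
    public ResultsView(DataLoadedPublisher publisher)
    {
        // "+=" with no matching "-=": the publisher now holds a reference to
        // this view, so the view (and anything it caches) is never collected.
        publisher.DataLoaded += OnDataLoaded;
    }

    private void OnDataLoaded(object sender, EventArgs e) { /* refresh */ }
}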
Cue lots of ‘Find All’ within Visual Studio,
looking for "+=", "Add", "AddRange", etc. The manual search was exhaustive,
time-consuming, and didn’t turn up any obvious problems. I was frustrated and
annoyed that I’d spent so much time on this without being any closer to finding
a solution, at which point I thought, "Bugger it – maybe I’ll try a memory
profiling tool and see what it comes up with."
A Google search gave me several options for
memory profiling tools. As I’d previously used Red Gate’s .NET Reflector, and
found that it was a great tool, I thought I’d give their memory profiler a go.
The ANTS
Memory Profiler installed easily. I’d not used a
memory profiler before, but I found it to be very intuitive. I simply selected
‘Profile Memory’ from the ANTS menu within Visual Studio.
With the application running, I loaded a
section of acoustic data. I took a baseline memory snapshot, then proceeded to
zoom in and out of the LOFAR display numerous times, taking other memory
snapshots along the way. Whilst the snapshots took a reasonable amount of time to record, I didn’t notice any difference in the application’s performance whilst profiling. After enough zoom in/out iterations to leak a significant amount of memory (500MB+), I stopped profiling and loaded the report.
Along the top of the report was a graph
detailing the application’s memory usage over time, along with the times at
which my memory snapshots were taken.
I selected snapshot 2, which was taken
after the initial load of the acoustic data and population of the LOFAR. The
‘Largest classes’ pie chart on the lower left showed that the majority of data
was of type Int16[] (acoustic data) and Single[] (FFT data), which was as
expected.
I then selected snapshot 6. It showed that
the size of the acoustic and FFT data remained the same, which was reassuring.
But what was using the additional 400MB of memory between the two snapshots?
The ‘Largest classes’ pie chart gave some indication, with classes of type EffectiveValueEntry[], Gripper, Thumb and Canvas alone taking up 200MB of additional memory!
To investigate this further, I set snapshot
5 as my baseline and snapshot 6 as my current. This section encompassed a
number of repeated zoom operations, and selecting the ‘Class List’ button revealed a wealth of interesting information.
As I was concerned about the volume of
memory used, I sorted by the ‘Size Diff’ column, which showed an unfamiliar object: EffectiveValueEntry.
Comparing the two snapshots, taken before and after a series of zooms, showed that, to my surprise, some 3.5 million objects had been constructed in a short period of time! Looking closely at the
number of individual classes being created, I saw that there was a common
denominator of approximately 108,160, with the other objects being a multiple
of this number. This indicated that they were all related.
And the class name that jumped out? Visiblox.Charts.Gripper. Visiblox is the third-party charting toolkit that we
use for visualising a number of our data sets. So why was the use of this class
leaking so much memory, and why were so many instances created in the first
place?
I selected the Visiblox.Charts.Gripper
entry and used the Instance Categorizer tool. Once loaded, it showed the
reference paths between the instances of the Gripper class and all other
classes, and it immediately showed the link between the LofarViewModel and the Visiblox.Charts.RectangleAnnotation classes. At this point, I had a fair idea of what could be causing the problem – something I had glossed over earlier during my lengthy (several hours) manual search.
I then selected the Instance Retention
Graph, which traced the reference paths visually across method calls for a
selected object (one I picked at random) and confirmed my suspicions.
When the SelectionDurationChanged event is fired from the SelectionModel, the topSelectionRectangle (among others), representing the top of the selected region, has new points assigned to it, reflecting the changed selection.
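// Original code: brand-new DataPoint instances are allocated for every
// corner of the selection rectangles, on every selection change.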
topSelectionRectangle.Points[0] = new DataPoint<double, double>(XRange.Minimum, YRange.Minimum);
topSelectionRectangle.Points[1] = new DataPoint<double, double>(XRange.Maximum, GetTopSelectedYValue());
bottomSelectionRectangle.Points[0] = new DataPoint<double, double>(XRange.Minimum, GetBottomSelectedYValue());
bottomSelectionRectangle.Points[1] = new DataPoint<double, double>(XRange.Maximum, YRange.Maximum);
The data points assigned to the selectionRectangles, which are ultimately bound to the Visiblox chart, are newed each time. It is probable that the chart either registers to a PropertyChanged event or stores a reference to each DataPoint.
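If that guess is right, the build-up of Gripper, Thumb and Canvas objects is consistent with the chart doing something roughly like the following internally. This is pure speculation on my part; I have no visibility into the Visiblox source, and every name below is invented:

using System.Collections.Generic;

// Speculative sketch only, not Visiblox code: if the annotation builds a
// fresh set of visuals for every DataPoint it is handed, and never discards
// the set belonging to the point that was just replaced, the visuals pile up.
class HypotheticalAnnotation
{
    private readonly List<object> _perPointVisuals = new List<object>();

    public void OnPointAssigned(object newDataPoint)
    {
        // The visuals created for the previous DataPoint are never removed.
        _perPointVisuals.Add(CreateGripperFor(newDataPoint));
    }

    private object CreateGripperFor(object point)
    {
        return new object();   // stands in for Gripper/Thumb/Canvas creation
    }
}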
To test my theory, I replaced the original point-assignment code with:
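// Fix: mutate the existing DataPoint instances in place instead of
// allocating new ones on every selection change.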
topSelectionRectangle.Points[0].X = XRange.Minimum;
topSelectionRectangle.Points[0].Y = YRange.Minimum;
topSelectionRectangle.Points[1].X = XRange.Maximum;
topSelectionRectangle.Points[1].Y = GetTopSelectedYValue();
bottomSelectionRectangle.Points[0].X = XRange.Minimum;
bottomSelectionRectangle.Points[0].Y = GetBottomSelectedYValue();
bottomSelectionRectangle.Points[1].X = XRange.Maximum;
bottomSelectionRectangle.Points[1].Y = YRange.Maximum;
And, amazingly, the memory leak issue was
resolved! However, I was still wondering how 108,160 of these objects could
have been created. It was as though the SelectionDurationChanged event was being fired many, many times.
Which, it turns out, it was. And ANTS Memory Profiler had even shown me where, without my realising it. In the LofarViewModel, the SelectionDurationChanged event was fired every time a row in the LOFAR was generated which, given the number of zoom in/out operations, meant it was fired approximately 5000 times!
The firing of the event from this location
was a bug, inserted early in development as a ‘belts and braces’ approach to
ensure the selection area was valid upon redraw. The redraw case was handled properly at a later stage, but the offending line was never removed!
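I won’t reproduce the real view model here, but the offending shape was essentially the following (simplified, with invented method names):

using System.Collections.Generic;

// Simplified illustration with invented names, not the real view model code:
// the selection-validity check was raised from inside the per-row generation
// loop, so a single redraw fired it once for every LOFAR row produced.
void RegenerateLofar(IEnumerable<short[]> recordingSegments)
{
    foreach (var segment in recordingSegments)
    {
        GenerateAndAppendRow(segment);       // FFT, calibrate, decimate, blit

        // Bug: this belonged in the redraw handler, not in the row loop.
        RaiseSelectionDurationChanged();
    }
}

void GenerateAndAppendRow(short[] segment) { /* ... */ }
void RaiseSelectionDurationChanged() { /* fires SelectionDurationChanged */ }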
At this stage, two obscure bugs had been found and fixed in less than an hour. And to top it off, the application was now redrawing the LOFAR ten times faster than before.
So what has this experience taught me?
- Be careful with third-party libraries. They may have implicit expectations about how they should be used that you are not aware of. In our case, we were lucky to have a bug that exposed this problem at a much faster rate than normal.
- In a C#/WPF world where references are passed around willy-nilly, you have to ensure that they are not inadvertently stored anywhere that may hang onto them and prevent objects from being cleaned up by garbage collection. This includes WPF bindings, unsubscribing from event handlers when no longer required, and removing items from collections when no longer needed (see the sketch after this list).
- Monitor your application’s vital statistics and profile your code regularly. I was surprised that no one else noticed or questioned why the application’s memory use was climbing; perhaps my particular test scenario was the only one to stress this particular failure case. Either way, it is advisable to keep an eye on resource utilisation and to profile regularly, perhaps before milestone releases, to ensure the application is running lean and not leaking memory.
- A dedicated profiling tool saved time. Despite having a fair idea of the type of issue I was looking for, the fact that it was an interaction with a third-party library meant that I could not manually dig through their code and locate the fault.
I spent four hours manually verifying that every
Collection we had was not holding onto references, and that every Event/Message
subscribed to was properly unsubscribed. Tools such as Process Explorer were
great for saying "Yup, you’ve got a problem", but were unable to point the
finger at any particular area. Considering the average hourly cost of
development time, you only need ANTS
Memory Profiler to save you a few hours in total to
get a serious return on investment.
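On the second point, the discipline is mundane but worth spelling out. A minimal sketch of pairing every subscription with an unsubscription (the SelectionModel body is a stand-in; only the names come from our code, and Dispose() is just one reasonable place to do the tear-down):

using System;

// Sketch only: the publisher body is a stand-in, and IDisposable is one of
// several reasonable ways to guarantee the matching unsubscribe happens.
class SelectionModel
{
    public event EventHandler SelectionDurationChanged;
}

class LofarViewModel : IDisposable
{
    private readonly SelectionModel _selection;

    public LofarViewModel(SelectionModel selection)
    {
        _selection = selection;
        _selection.SelectionDurationChanged += OnSelectionDurationChanged;
    }

    private void OnSelectionDurationChanged(object sender, EventArgs e) { /* ... */ }

    public void Dispose()
    {
        // Every "+=" gets a matching "-=" when the view model is torn down,
        // so the selection model cannot keep this object alive.
        _selection.SelectionDurationChanged -= OnSelectionDurationChanged;
    }
}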
And I’m sure this is not going to be an
isolated case. We’re software engineers: writing bugs is part of what we do. We
just need to arm ourselves with the right tools to ensure that they are found
and fixed in an expedient manner.