The Peeve
I like to download pictures from the web to collections I keep in different folders. Places like http://Images.Google.com/ and http://commons.wikimedia.org/wiki/Main_Page are great for finding large, beautiful images. It has long been relatively easy to write and run scripts or little apps after the fact to clean up duplicates by finding the MD5 digest of each file.
My peeve is this: Rarely, but just often enough to be peeved, I still have duplicates. A .JPG file can be
different from another by EXIF tags (Reference [01]). The images themselves though can be duplicates. A .PNG file can
be a duplicate of a .GIF file. The encodings are different, but the pixels you see can be the same.
DUIapp solves these two problems. First, duplicates are detected when downloading is attempted. Duplicates never
land on disk. Second, DUIapp almost certainly guarantees the pixel data in a downloaded file is unique regardless
of tags or image type. I say "almost certainly" because two different inputs can, in theory, produce the same MD5 digest.
In practice, an accidental collision between two images is vanishingly unlikely.
DUIapp creates and maintains an index on disk in each chosen download folder (file named zjqxkImgIdx.bin) and in memory.
This index is used to quickly check if a web image is unique for the given download folder.
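To make the idea concrete, here is a minimal sketch of digesting pixel data rather than file bytes. It is not DUIapp's code: PixelDigest and UniqueImageIndex are hypothetical names, and the sketch assumes the image has already been decoded into a byte array in one fixed pixel layout, so that a .PNG and a .GIF with the same visible pixels yield the same bytes.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;

// Hypothetical helper: digest the decoded pixels, not the encoded file.
public static class PixelDigest
{
    // 'pixels' is assumed to hold the fully decoded image in one fixed
    // layout (for example 32-bit BGRA, row by row).
    public static string Compute(int width, int height, byte[] pixels)
    {
        using (var md5 = MD5.Create())
        {
            // Hash the dimensions too, so a 2x8 and a 4x4 image with the
            // same byte sequence do not share a digest.
            byte[] header = new byte[8];
            BitConverter.GetBytes(width).CopyTo(header, 0);
            BitConverter.GetBytes(height).CopyTo(header, 4);
            md5.TransformBlock(header, 0, header.Length, null, 0);
            md5.TransformFinalBlock(pixels, 0, pixels.Length);
            return BitConverter.ToString(md5.Hash).Replace("-", "");
        }
    }
}

// Hypothetical in-memory index: one digest per known image in a folder.
public sealed class UniqueImageIndex
{
    private readonly HashSet<string> digests = new HashSet<string>();

    // Returns true when the image is new to this folder (and records it).
    public bool TryAdd(int width, int height, byte[] pixels)
    {
        return digests.Add(PixelDigest.Compute(width, height, pixels));
    }
}
```

Because EXIF tags and container format never reach the digest, two files that differ only in those respects collapse to a single index entry.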
But wait, I wanted more
When I started, I added three objectives beyond functionality.
I wanted a responsive UI that shows concurrent use of the image
index even before the index is complete. The lower portion
of the form shows this index is always available. If
you Choose Download Folder and select one on your PC with
a large number of image files (my test folder has 6671 images
taking 1.7GB) you will see an increasing number of "Known" or
"Indexed" files noted towards the bottom center of the form (See
'2222 Known Files' in the UI screenshot).
If you move the TrackBar slider while this number is increasing,
you will see different selected image file names in the bottom
TextBox. Clicking the First [ |<- ], Previous [ <- ], Next [ -> ],
or Last [ ->| ] navigation buttons will select different file names
in the index. When you click Last, the TrackBar slider will drift
towards the left as new files are added to the index.
I wanted the design to include usability features. I came
up with several besides the 'responsive UI' objective. One
feature was to use the application title bar as a status
line. I imagined using this app so that its title bar just
peeks above an overlying Internet Explorer window. When a
new unique web image is downloaded to the chosen folder, its name
from the web is shown in the title bar giving feedback to the user
while browsing. Images that are unique but 'Too small' to
meet minimum image width or height requirements are also noted in
the title bar. This is a usability feature to prevent me
from downloading itty-bitty images that don't meet my large
beautiful images criteria. The minimum width and height are
specified in NumericTextBox controls, a great usability control
submitted previously to CodeProject (see Reference [02]).
Finally I wanted a way to remove application changes to the
Registry and a supporting disk file. Clicking Disable
Extension does this. Enable Extension puts the
changes back. Implementing event handlers for these buttons
turned up threading and shutdown issues that were valuable to
solve for application reliability. In fact, completing
development of this application led to a novel means of
controlling one thread use pattern (well, novel to me), an event
handler pattern that was also new for me, and a non-trivial
mechanism to reliably complete application shutdown.
I would like to explain these further in the following sections.
Parts and Pieces
DUIapp uses an Internet Explorer context menu extension that is
well-covered in many places on the net. This link
is a good starting point. When DUIapp is running and a new
IE browser window or tab is started, this extension is used
in the following way:
Right click over an image being viewed in Internet
Explorer. See the "Save Picture As..." menu item.
We're sort of like that. Look further down and click the
"Download Unique Image" item:
Clicking "Download Unique Image" causes IE to execute a block of
JavaScript we have specified in our menu extension subkey.
This JavaScript file (TwineBakery.html) is first written to the
special folder "MyDocuments" and then specified in the
subkey. The JavaScript generates an easily-detected cookie
containing the download image URI which is used later by our app
to download the image file.
To see the implementation of this extension, look in the
Constants region of Form1.cs in the DUIapp project and at the
event handlers buttonEnable_Click and buttonDisable_Click.
This extension isn't common because of these characteristics:
- An extension for Internet Explorer
- Part of a CodeProject article
- Uses only C# and JavaScript
- Assumes the application user has sufficient privilege to edit
the Registry
Now here is a completely non-standard diagram of DUIapp:
On startup, DUIapp's Form1 creates the extension subkey and
additional registry values for the Download folder and minimum width
and height when the extension subkey is absent in the
registry. Otherwise the existing registry values are used to
populate fields in the UI. Form1's constructor ends by
starting an Index Initialize task, starting the thread member
within imgAdder (an instance of ImageQConsumer), and starting the
thread in watcher (a CookieWatcher):
    Task.Factory.StartNew(() => IndexInitializer(GetDLFolder()));
    imgAdder.Start(this);
    watcher.Start();
}
The ImageQConsumer and CookieWatcher elements began as a pattern
I've used before where a class has its own thread member to help
isolate thread code. This pattern is sometimes less
useful. CookieWatcher is hardly coupled to the UI so the
pattern works well. There are only two references to
Form1.infrequentCookieName that keep this class from being reused
without change. The linkage between ImageQConsumer and Form1
tells me in hindsight that a better solution, one that still doesn't
inflate the line count of Form1.cs, is worth further study.
The CookieWatcher is essentially a FileSystemWatcher that looks
for new cookies created by our JavaScript when our menu extension
is selected. The CookieWatcher contains a ConcurrentQueue
that is stuffed with image URIs gathered from new cookies and an
EventWaitHandle that is used to signal whatever consumer dequeues
the URIs (our ImageQConsumer). The blue arrow passing
through Form1 is meant to show the consumer's access to the
CookieWatcher queue via Form1. In Form1.cs, we have:
internal static CookieWatcher watcher = new CookieWatcher();
In CookieWatcher.cs, we have:
public EventWaitHandle qHandle = new AutoResetEvent(false);
public ConcurrentQueue<string> quri = new ConcurrentQueue<string>();
Then, in ImageQConsumer.cs, we do:
// Wait up to 250 ms for the watcher to signal a new URI.
Form1.watcher.qHandle.WaitOne(250);
if (Form1.watcher.quri.TryDequeue(out uri))
{
    // ... process the URI
}
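The fragments above can be put together into a small self-contained sketch of the same producer/consumer handshake. UriQueue and UriConsumer are hypothetical stand-ins for CookieWatcher and ImageQConsumer, and the ManualResetEventSlim stop flag is an assumption standing in for the application's shutdown flag. Unlike the fragment above, this variation drains the queue on every wake-up, since one AutoResetEvent signal can cover several enqueued items.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

// Hypothetical stand-in for the watcher side: a queue plus a wake-up signal.
public sealed class UriQueue
{
    public readonly EventWaitHandle Signal = new AutoResetEvent(false);
    public readonly ConcurrentQueue<string> Uris = new ConcurrentQueue<string>();

    public void Add(string uri)
    {
        Uris.Enqueue(uri);
        Signal.Set();   // wake the consumer if it is waiting
    }
}

// Hypothetical stand-in for the consumer side.
public static class UriConsumer
{
    // Collects URIs until 'stop' is set and the queue is empty.
    public static List<string> Run(UriQueue source, ManualResetEventSlim stop)
    {
        var processed = new List<string>();
        while (true)
        {
            // The timeout bounds how long a stop request can go unnoticed.
            source.Signal.WaitOne(250);

            string uri;
            // Several Set() calls can collapse into one signal, so drain fully.
            while (source.Uris.TryDequeue(out uri))
                processed.Add(uri);

            if (stop.IsSet && source.Uris.IsEmpty)
                return processed;
        }
    }
}
```

The 250 ms timeout is what lets the consumer notice a stop request even when no new URI ever arrives.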
The Crazed User
So what's this split Form1 bubble with a couple of Index
Initialize tasks on top of each other?
If you are like me, you like to stress test and corner case
components and methods as you develop. The question is
always 'what if' or 'could someone do'. Take a look again at
the user interface and notice the Restart Index Update
button. The motivation for this was "What if someone copied
or moved a bunch of image files to the download folder while the
index was being made?" They would want to restart making the
index to include the new files. Hence the button.
Ah, but then, what if a crazed user sits there and madly clicks
the Restart Index Update as fast as their little fingers
let them? Well, you would know or you would find out that
even though you set the button's Enabled property to false as the
first thing in the click event handler, the message pump to your
UI could still get two or three messages into the handler.
And each of these fires off an Index Initialize task to rebuild
the index.
What to do? If I can't prevent more than one message, i.e.,
task, at a time, then maybe I should learn to live with it.
I was just working with a queue used between two threaded
classes. What if I used a queue to record all these task
instances, but only acted fully on the last one (last until the
crazed user goes mad again)?
This is my novel thread handling solution. At checkpoints
during the IndexInitializer method we check if the task is at the
beginning of the queue and if there is more than one task in the
queue. If so, just leave. The successor to the current
task execution can handle making the index. Here is an
example of this "test and let successor handle indexing"
checkpoint code in IndexInitializer in Form1.cs:
lock (shutdown)
{
    if (shutdown.Bool || (initIdx.Count > 1 && initIdx.Peek() == Task.CurrentId))
    {
        initIdx.Dequeue();
        saveIfDirty(dlFolder);
        return;
    }
    idx = 0;
    ordinalFiles.Clear();
    digestFiles.Clear();
    sortedFiles.Clear();
}
...
When shutdown.Bool is true, we are trying to stop all threads
(tasks) to shut down the application. Of course we should
return for that. initIdx is our task queue. If there
is more than one entry in the queue (Count > 1) we have a
successor or we are a successor. We peek at the beginning of
the queue to see if it is the current execution. If so, we
know there is a successor that will build the index. We
dequeue our execution task ID. Then we check if the existing
index in memory needs to be written to disk (saveIfDirty) since
imgAdder may have added an image while we weren't looking.
Now we can return. Our successor (if not succeeded itself)
will clear the SortedLists and Dictionary comprising
the index and begin rebuilding.
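The checkpoint rule can be isolated from the form in a small sketch. RebuildQueue, Register and ShouldYield are hypothetical names, and plain integer run IDs stand in for Task.CurrentId; only the decision logic is meant to mirror the checkpoint above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch of the "let my successor handle it" checkpoint.
public sealed class RebuildQueue
{
    private readonly object gate = new object();
    private readonly Queue<int> pending = new Queue<int>();  // oldest run first

    public int Count
    {
        get { lock (gate) return pending.Count; }
    }

    // Called once on the way into each rebuild run.
    public void Register(int runId)
    {
        lock (gate) pending.Enqueue(runId);
    }

    // Checkpoint. Returns true (and removes the head entry) when the caller
    // should abandon its work: either shutdown was requested, or the caller
    // is at the head of the queue and a newer run is waiting behind it.
    public bool ShouldYield(int runId, bool shuttingDown)
    {
        lock (gate)
        {
            bool superseded = pending.Count > 1 && pending.Peek() == runId;
            if (shuttingDown || superseded)
            {
                if (pending.Count > 0)
                    pending.Dequeue();
                return true;
            }
            return false;
        }
    }
}
```

With runs 1 and 2 registered, ShouldYield(1, false) returns true and leaves run 2 as the sole entry, so run 2 then carries on to build the index.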
Good but there's one last problem. A successor can be
started to index a different folder (Choose Download Folder
performed). If that folder can be indexed very quickly
because it has no or few image files, the successor could 'run
past' a previous execution that is decoding a large image's data
pixels and creating a digest of them. By the time the task
queue is tested at the next checkpoint, the slower task could miss
the successor completely and continue building an index that is no
longer wanted. What is needed is 'hold up' code to keep all
successors waiting until an earlier execution exits. Here is
the code that does this. It precedes the first checkpoint
(shown above) in IndexInitializer:
while (true)
{
    lock (shutdown)
    {
        if (shutdown.Bool || initIdx.Peek() == Task.CurrentId)
        {
            lock (dirtyFlag)
            {
                if (dirtyFlag.ContainsKey(dlFolder) && dirtyFlag[dlFolder])
                    saveIndex(dlFolder);
                dirtyFlag[dlFolder] = false;
                break;
            }
        }
    }
    Thread.Sleep(2);
}
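Stripped of the index-saving details, the hold-up is a polling wait for "my turn". The sketch below shows only that part. TurnQueue, WaitForTurn and LeaveHead are hypothetical names, and the shuttingDown callback is an assumption standing in for shutdown.Bool.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical sketch: successors wait until every earlier run has left.
public sealed class TurnQueue
{
    private readonly object gate = new object();
    private readonly Queue<int> runs = new Queue<int>();  // oldest run first

    public void Register(int runId)
    {
        lock (gate) runs.Enqueue(runId);
    }

    // Removes the oldest run; a run calls this as it exits.
    public void LeaveHead()
    {
        lock (gate)
        {
            if (runs.Count > 0)
                runs.Dequeue();
        }
    }

    // Blocks until 'runId' is the oldest registered run, or until
    // shutdown is requested.
    public void WaitForTurn(int runId, Func<bool> shuttingDown)
    {
        while (true)
        {
            lock (gate)
            {
                if (shuttingDown() || (runs.Count > 0 && runs.Peek() == runId))
                    return;
            }
            // Sleep outside the lock so the head run can get in and leave.
            Thread.Sleep(2);
        }
    }
}
```

Sleeping outside the lock matters: a waiter that slept while holding the lock would keep the head run from ever removing itself.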
Using lock () in Control Event Handlers (or Not Rather)
It can be a solid deadlock situation for the UI thread to attempt
a lock statement in a control's event handler. If the lock
object is used widely to protect access to more than one resource,
a deadlock is almost guaranteed. (Wide-area lock objects are
another discussion.)
Easy fix though. Just spawn one of those System.Threading.Tasks
Task objects available in .NET 4.0 and let it lock for resource
access and change the control.
But wait, it's bad to be a non-UI thread and change anything in a
UI control. Microsoft says control changes are not
thread-safe and should be made only by the thread that created
the control. In fact, if you are executing your app in Visual
Studio, Visual Studio will raise a cross-thread exception
(InvalidOperationException) if you try.
What to do? You want to start from the UI thread and use a
critical resource to change a control's properties.
Easy fix. From the spawned task, just Invoke the change
back on the UI thread. Here is the generalized
pattern. "Disp" is Dispatcher.CurrentDispatcher saved during
form construction:
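private delegate void ControlChangeDelegate();

// Runs on the UI thread; uses the critical resource to update the control.
private void ControlChange()
{
}

// Runs on a worker task; holds the lock until the UI-thread call completes.
private void ControlChangeTask()
{
    lock (lockObj)
        Disp.Invoke(new ControlChangeDelegate(ControlChange));
}

// Regular event handler on the UI thread; never locks, just spawns the task.
private void control1_Change(object sender, EventArgs e)
{
    Task.Factory.StartNew(() => ControlChangeTask());
}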
Start at the bottom and move up. The regular event handler
(control1_Change) is fired by the user doing something. The
regular event handler spawns the asynchronous ControlChangeTask
and exits. ControlChangeTask locks for a critical resource
that will be used in ControlChange. The lock is held until
the synchronous Invoke on the UI thread's Dispatcher is
complete. ControlChange uses the critical resource to make
control changes.
The TrackBar and the navigation buttons at the bottom of DUIapp's
form all use this pattern.
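To try the pattern without a form, the sketch below swaps the real Dispatcher for a hypothetical MiniDispatcher: a single thread that runs queued actions, with a synchronous Invoke. FormModel, Shown and ShownOnThread are likewise invented for illustration. Only the shape of the hand-off (handler spawns a task, task locks, task invokes back) follows the pattern above.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for a UI thread: one thread that runs queued actions.
public sealed class MiniDispatcher : IDisposable
{
    private readonly BlockingCollection<Action> work = new BlockingCollection<Action>();
    private readonly Thread thread;

    public MiniDispatcher()
    {
        thread = new Thread(() =>
        {
            foreach (Action a in work.GetConsumingEnumerable())
                a();
        });
        thread.IsBackground = true;
        thread.Start();
    }

    public int ThreadId { get { return thread.ManagedThreadId; } }

    // Synchronous: returns only after 'action' has run on the dispatcher thread.
    public void Invoke(Action action)
    {
        var done = new TaskCompletionSource<bool>();
        work.Add(() =>
        {
            try { action(); done.SetResult(true); }
            catch (Exception ex) { done.SetException(ex); }
        });
        done.Task.Wait();
    }

    public void Dispose() { work.CompleteAdding(); }
}

// Hypothetical "form": a shared value plus a property only the UI thread sets.
public sealed class FormModel
{
    private readonly object lockObj = new object();
    private readonly MiniDispatcher disp;
    private int sharedValue;      // the critical resource
    public int Shown;             // stands in for a control property
    public int ShownOnThread;     // records which thread made the change

    public FormModel(MiniDispatcher disp) { this.disp = disp; }

    // Runs on the dispatcher thread while the worker still holds the lock.
    private void ControlChange()
    {
        Shown = sharedValue;
        ShownOnThread = Thread.CurrentThread.ManagedThreadId;
    }

    // Runs on a worker task: lock, update, then invoke back synchronously.
    private void ControlChangeTask()
    {
        lock (lockObj)
        {
            sharedValue++;
            disp.Invoke(ControlChange);
        }
    }

    // The event handler: never locks, just hands off and returns.
    public Task OnChange() { return Task.Factory.StartNew(ControlChangeTask); }
}
```

The handler returns immediately, so the dispatcher thread is never blocked on lockObj. That is the property that keeps the synchronous Invoke from deadlocking.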
So You Think Mom Will Always Be There
Sharing Dispatcher.CurrentDispatcher or even lock objects with
worker threads has always been problematic. It was not until I
read the first answer in Reference [03] that I came to grips with
losing Mom. I believe this override is necessary and is the
right solution. What may be novel here is that the override is
used twice in shutting down.
Let me back up.
During normal execution a worker thread (general term, not
specifically a BackgroundWorker) might use locks on form objects or
invoke methods on the main thread. This is a typical snippet
in IndexInitializer in Form1.cs:
lock (shutdown)
{
    if (ordinalFiles.Count == 1 && textBox2.Text.Length == 0)
        Disp.Invoke(new ActivateButtonsDelegate(ActivateButtons));
}
When the user clicks the big red X to exit an application, worker
threads have Mom taken away. Lock objects are lost and the UI
thread's Dispatcher is gone. The form will be gone from the
desktop but, using Task Manager, you will see the stranded
threads keep the application around.
The trick is to keep Mom around until we can all say goodbye. We override
OnFormClosing, the method that raises the FormClosing event, in Form1:
protected override void OnFormClosing(FormClosingEventArgs e)
{
    var rk = GetRegistryKey(false);
    if (null == rk && !shutdown.Bool)
    {
        var result = MessageBox.Show("Exit without Enabled Extension?",
            "Extension Disabled", MessageBoxButtons.OKCancel);
        if (result == System.Windows.Forms.DialogResult.Cancel)
        {
            e.Cancel = true;
            base.OnFormClosing(e);
            return;
        }
    }
    else if (rk != null)
        rk.Close();
    if (!shutdown.Bool)
    {
        watcher.EndWatcher();
        e.Cancel = true;
        base.OnFormClosing(e);
    }
    shutdown.Bool = true;
}
Also we define a delegate that can be used to send FormClosing
events:
internal delegate void FormCloseDelegate();
internal void FormClose()
{
    this.Close();
}
The handler needs a little explanation. First, I wanted to
give a warning if the user inadvertently decided to exit but really
did want to keep the extension in the registry and the JavaScript on
disk. If they say Cancel in the MessageBox, the FormClosing
event is cancelled, the override is exited, and the user gets to
click Enable Extension before closing the app again.
If the extension is enabled or if the user says OK to close with no
extension, we check if this is the first closing event. If it
is the first (shutdown.Bool is false), we end the watcher but keep
the form around by also cancelling the event here. This time
though we set shutdown.Bool = true before exiting the override.
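The two-pass decision can be modelled without any UI, which makes the three outcomes easy to see. CloseGate is a hypothetical name. The extensionEnabled flag, the userConfirms callback and the stopProducers callback are assumptions standing in for the registry check, the MessageBox answer and watcher.EndWatcher().

```csharp
using System;

// Hypothetical, UI-free model of the two-pass close.
public sealed class CloseGate
{
    private bool shuttingDown;

    public bool ShuttingDown { get { return shuttingDown; } }

    // Returns the value the handler would assign to e.Cancel.
    public bool OnClosing(bool extensionEnabled, Func<bool> userConfirms, Action stopProducers)
    {
        // The warning is only relevant before shutdown has begun.
        if (!extensionEnabled && !shuttingDown && !userConfirms())
            return true;            // user backed out: cancel, change nothing

        if (!shuttingDown)
        {
            stopProducers();        // first pass: stop new work arriving
            shuttingDown = true;    // tell the workers to wind down
            return true;            // keep the form alive in the meantime
        }

        return false;               // second pass: let the form close
    }
}
```

The first accepted close request only flips the flag and cancels. Whoever calls Close again after the workers have exited gets the uncancelled second pass.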
I needed to give some thread the responsibility of closing the form
a second time after all threads had seen shutdown.Bool == true.
But who? The UI thread isn't busy at this point
but it seemed clumsy to enter a loop after setting the shutdown flag
to check that both imgAdder and any Index Initialize tasks have
exited. Index Initialize tasks may or may not be around so
they are a bad choice. We want to keep our little
CookieWatcher pure and unaware of any shutdown dance. If
ImageQConsumer took care of this, it would only need to check that Index
Initialize tasks have exited and call the FormClose delegate.
It knows by its own exit that it is no longer around. So this is the code that
ImageQConsumer executes once it leaves its processing loop after seeing
shutdown.Bool == true:
int retries = 100;
while (Form1.initIdx.Count > 0 && retries-- > 0)
    Thread.Sleep(30);
lock (Form1.shutdown)
lock (Form1.dirtyFlag)
{
    foreach (string s in Form1.dirtyFlag.Keys)
    {
        if (Form1.dirtyFlag[s])
        {
            Form1.saveIndex(s);
            break;
        }
    }
    Form1.Disp.Invoke(new Form1.FormCloseDelegate(uiForm.FormClose));
}
Why do I wait so long? The choice of waiting up to 3 seconds (100 x 30 ms) was a
judgment call. I have a 44 MB test file
that takes about 18 seconds on my PC to decode and make a digest of
pixel data. I'm sure there are even bigger, longer images.
I think a user will put up with waiting for a window to go away for
up to 3 seconds before they start to worry or get annoyed.
Make it shorter or remove the while statement if you wish.
What we do next is save any index that needs saving on behalf of an Index
Initialize task that has failed to save in the allotted time (the
foreach statement). Then the FormClose delegate is
invoked, this time the FormClosing event isn't cancelled,
imgAdder goes away, and Mom sings Adieu. (Not to worry. She is only a
double click away.)
A Compressing Matter
With terabyte drives common today, disk usage is not such a
pressing matter as before. Still, one might want to reduce
the size of on-disk indexes that this application creates.
Even with the MD5 digests, a compressed index is about 60% the
size of the uncompressed one. As presented in this article,
DUIapp doesn't use compressed index files. If you would like
to use SevenZip compression, I'll give the steps to do that:
- SevenZip DLLs are included in the download demo folder.
But you may want to get the latest versions. If so...
- Download and run the SevenZip installer from http://www.7-zip.org/
There are two installers. One for 32-bit Windows and one for
64-bit Windows. You can uninstall this from the Control
Panel later, but for now this is the best way to get the latest
signed 7z.dll or 7z64.dll libraries. They install in
\Program Files\7-zip\. You might consider leaving SevenZip
installed. It has a good File Explorer menu extension of
its own.
- Download the SevenZipSharp.dll from http://sevenzipsharp.codeplex.com/
Place this DLL and the 7z.dll or 7z64.dll in the same directory
that has or will have DUIapp.exe, our application. Empty
bin\Debug and bin\Release folders are zipped in the download
source.
- Locate the dui.sln file in the dui folder of the download
source.
- Open dui.sln with Visual Studio 2010. VS2012 should be
OK too. I haven't tested in VS2008. In the Solution
Explorer right click the DUIapp project and select
Properties. In the Build tab, type in COMPRESS in the
'Conditional compilation symbols:' box. Rebuild the
solution. Remember to delete any previously made indexes
(zjqxkImgIdx.bin) in your download folders since DUIapp will now
try to read and will make compressed indexes.
- If you "deploy" DUIapp.exe anywhere, be sure to include the
SevenZipSharp.dll and the correct or both 7-zip signed libraries.
Jumping to Conditionals
Did some faint buzzer go off when I said the index could be used before it was complete? Wouldn't the whole index be needed to guarantee uniqueness? BZZZZZ. OK. You got me.
Imagine this scenario: DUIapp is started and begins creating a
new index in a folder with thousands of images, each needing
decoding and digesting to add an entry to the index. The
ImageQConsumer is running, waiting for image URIs. Since
creating a new index will take a few minutes, the user browses to a
page with an image and selects "Download Unique Image". The
ImageQConsumer retrieves the image to memory and finds its digest is
currently unique. Unique because the digest for some existing
file X has not yet been added. ImageQConsumer writes the
retrieved image to disk. A few moments later, the Index
Initialize task runs across a duplicate digest for file X.
What to do?
Or think of a simpler case: The user copies a bunch of
duplicate images to the download folder and starts DUIapp.
Even if we required the index to be fully initialized before
allowing any downloads, that doesn't solve this simpler case.
There are three resolutions that made sense to me:
- In a SILENT manner, simply delete new duplicates as they are
found.
- Only ADVISE the user of new duplicates in the title bar as
they are found.
- Ask the user if we should delete a new duplicate or keep it.
As is, DUIapp takes the last resolution. It seemed confusing
and messy to choose different resolutions at run time. Even a
command line switch or persisted visual startup mode selector is an
overcomplication. An intentional recompilation seemed
cleanest to me. If "SILENT" is added to the DUIapp project's
Build properties' Conditional compilation symbols, the first
resolution will always be taken. Use "ADVISE" to build with the
second resolution. Or leave as-is to have DUIapp ask.
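For readers who have not used conditional compilation symbols, here is a minimal sketch of how a build-time choice like this can be wired up. The SILENT and ADVISE symbols come from the text above. DuplicatePolicy, BuildOptions and Describe are hypothetical names, and the returned strings are placeholders rather than DUIapp's actual behavior.

```csharp
using System;

public enum DuplicatePolicy { Ask, Advise, Silent }

public static class BuildOptions
{
    // Exactly one of these declarations survives compilation, depending on
    // the project's "Conditional compilation symbols" setting.
#if SILENT
    public static readonly DuplicatePolicy OnLateDuplicate = DuplicatePolicy.Silent;
#elif ADVISE
    public static readonly DuplicatePolicy OnLateDuplicate = DuplicatePolicy.Advise;
#else
    public static readonly DuplicatePolicy OnLateDuplicate = DuplicatePolicy.Ask;
#endif

    // Hypothetical reaction to a duplicate found after it reached disk.
    public static string Describe(string fileName)
    {
        switch (OnLateDuplicate)
        {
            case DuplicatePolicy.Silent:
                return "deleted " + fileName;
            case DuplicatePolicy.Advise:
                return "duplicate noted: " + fileName;
            default:
                return "ask user about " + fileName;
        }
    }
}
```

Built with no symbols defined, OnLateDuplicate is DuplicatePolicy.Ask, which corresponds to the as-is behavior described above.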
Parts and Pieces, Part 2
This application is made of three pieces: one exe file and two
DLLs for NumericTextBox and ImageWebControl. A total of
five pieces are needed if you use the SevenZip compression.
All these are needed in one place for startup and loading. A
poor-man's deployment, like the demo download available above,
simply has all the pieces in one folder. To execute, you
double-click DUIapp.exe in that folder.
There are ways to package the pieces into one exe file (and add
obfuscation if the packager supports that). One popular packager
is Microsoft's ILMerge (Reference [06]). But please, also
check over References [07] and [08] as we bravely march on to .NET
4.5. One muses, "What will this look like in Metro"?
Summary
I presented an application that allows you to keep images in
files of a chosen folder unique. This app creates and uses
an index that can take a few minutes to create. The index is
used to check uniqueness.
Creating a full new index in memory for my test folder (6671
image files of 1.7 GB) takes about 10 minutes on my PC. If
the chosen folder is changed or the application is exited, the
index being created is saved to disk to the point of
interruption. Restarting resumes extending the partial
index. Starting the app with my test folder chosen and a
complete index resident on disk takes about 1.5 minutes to bring
the full index back into memory.
This application includes an Internet Explorer menu extension
that can be persisted or removed at will by two buttons (Enable
Extension, Disable Extension). When this
extension is present, the user can start an IE browser window or
tab and select "Download Unique Image" when right-clicking over a
web page image. Such images are retrieved but not stored to
disk unless their image data is [currently] unique. You can
strike "[currently]" when an index is complete. Vanilla and
SILENT builds note index completion in the title bar.
ADVISE builds don't report completion since this would cover the last
duplicate found.
Several threading issues were discussed.
DUIapp was tested on:
- XP, IE 7 and IE 8, 32-bit
- W7, IE 8, 32-bit
- W7, IE 9, 64-bit
- W8CP, IE 10CP, 64-bit
References
[01] http://www.codeproject.com/Articles/27242/ExifTagCollection-An-EXIF-metadata-extraction-libr - by Lev Danielyan, 24 Jun 2008
[02] http://www.codeproject.com/Articles/30812/Simple-Numeric-TextBox - by DaveyM69, 9 Nov 2008
[03] http://stackoverflow.com/questions/1731384/how-to-stop-backgroundworker-on-forms-closing-event/
[04] http://www.codethinked.com/net-40-and-systemthreadingtasks - a good blog on 4.0 Tasks
[05] http://www.codeproject.com/Articles/19682/A-Pure-NET-Single-Instance-Application-Solution - by Shy Agam, 17 Nov 2007
[06] http://www.microsoft.com/en-us/download/details.aspx?id=17630 - ILMerge
[07] http://research.microsoft.com/en-us/people/mbarnett/ilmerge.aspx - Mike Barnett on ILMerge
[08] http://www.mattwrock.com/post/2012/02/29/What-you-should-know-about-running-ILMerge-on-Net-45-Beta-assemblies-targeting-Net-40.aspx - Matt Wrock's referenced blog post
History
Submitted to CodeProject 17 Aug 2012.