Making a Multi-threaded C# 4.0 App to Download Unique Images

17 Aug 2012
This article covers some details and issues in making an application used with Internet Explorer to download only unique images to a chosen folder. DUIapp creates and maintains an index in the folder to include unique and exclude duplicate images selected from IE web pages.

Sample Image

The Peeve

I like to download pictures from the web to collections I have in different folders. Places like http://Images.Google.com/ and http://commons.wikimedia.org/wiki/Main_Page are great to find large beautiful images. It has long been relatively easy to write and run scripts or little apps after the fact to clean up duplicates by finding the MD5 digest of each file.

My peeve is this: Rarely, but just often enough to be peeved, I still have duplicates. A .JPG file can be different from another by EXIF tags (Reference [01]). The images themselves though can be duplicates. A .PNG file can be a duplicate of a .GIF file. The encodings are different, but the pixels you see can be the same.

DUIapp solves these two problems. First, duplicates are detected when a download is attempted, so duplicates never land on disk. Second, DUIapp almost certainly guarantees the pixel data in a downloaded file is unique regardless of tags or image type. I say "almost certainly" because MD5 digests are theoretically not unique: two different images can, in principle, produce the same digest. In practice, an accidental collision is so unlikely that the digests can be treated as unique.
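The article does not show how DUIapp forms its digest, so the following is only a minimal sketch of the idea: hash the decoded pixels rather than the file bytes, so that EXIF tags and the container format (JPG, PNG, GIF) drop out of the comparison. The class name PixelDigest and the assumed layout of 4 bytes per pixel are hypothetical. Decoding a real file into that byte array (for example with System.Drawing) is left out.

```csharp
using System;
using System.Security.Cryptography;

// Hypothetical helper: digests decoded pixel data, not the bytes of the image file.
public static class PixelDigest
{
    // argb is assumed to hold width * height * 4 bytes of decoded pixels.
    public static string Compute(int width, int height, byte[] argb)
    {
        if (argb == null) throw new ArgumentNullException("argb");
        if (argb.Length != width * height * 4)
            throw new ArgumentException("Expected 4 bytes per pixel.", "argb");

        using (MD5 md5 = MD5.Create())
        {
            // Hash the dimensions first so a 1x4 and a 4x1 image with the
            // same pixel byte stream do not share a digest.
            byte[] w = BitConverter.GetBytes(width);
            byte[] h = BitConverter.GetBytes(height);
            md5.TransformBlock(w, 0, w.Length, null, 0);
            md5.TransformBlock(h, 0, h.Length, null, 0);
            md5.TransformFinalBlock(argb, 0, argb.Length);

            // 16 digest bytes rendered as 32 hex characters.
            return BitConverter.ToString(md5.Hash).Replace("-", "");
        }
    }
}
```

Two files that decode to the same pixels then yield the same string, whatever their tags or encoding.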

DUIapp creates and maintains an index on disk in each chosen download folder (file named zjqxkImgIdx.bin) and in memory. This index is used to quickly check if a web image is unique for the given download folder.
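As a rough illustration of such an index, here is a sketch that maps each digest to a file name and can be written to and read back from a stream. The type name DigestIndex and the binary layout (a count followed by digest and file name pairs) are assumptions made for this example. The real layout of zjqxkImgIdx.bin is not described in this article.

```csharp
using System.Collections.Generic;
using System.IO;

// Hypothetical index type: digest -> file name, with a simple binary form.
public class DigestIndex
{
    private readonly Dictionary<string, string> digestToFile =
        new Dictionary<string, string>();

    public int Count { get { return digestToFile.Count; } }

    // Returns true when the digest is new and was recorded, false for a duplicate.
    public bool TryAdd(string digest, string fileName)
    {
        if (digestToFile.ContainsKey(digest)) return false;
        digestToFile.Add(digest, fileName);
        return true;
    }

    // Writes the entry count, then each digest and file name. The stream stays open.
    public void Save(Stream output)
    {
        BinaryWriter writer = new BinaryWriter(output);
        writer.Write(digestToFile.Count);
        foreach (KeyValuePair<string, string> pair in digestToFile)
        {
            writer.Write(pair.Key);
            writer.Write(pair.Value);
        }
        writer.Flush();
    }

    // Reads back what Save wrote.
    public static DigestIndex Load(Stream input)
    {
        BinaryReader reader = new BinaryReader(input);
        DigestIndex index = new DigestIndex();
        int count = reader.ReadInt32();
        for (int i = 0; i < count; i++)
        {
            string digest = reader.ReadString();
            string fileName = reader.ReadString();
            index.TryAdd(digest, fileName);
        }
        return index;
    }
}
```

A download would be kept only when TryAdd returns true for the digest of the retrieved image.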

But wait, I wanted more

When I started, I added three objectives beyond functionality.

I wanted a responsive UI that shows concurrent use of the image index even before the index is complete. The lower portion of the form shows this index is always available. If you Choose Download Folder and select one on your PC with a large number of image files (my test folder has 6671 images taking 1.7GB) you will see an increasing number of "Known" or "Indexed" files noted towards the bottom center of the form (See '2222 Known Files' in the UI screenshot).

If you move the TrackBar slider while this number is increasing, you will see different selected image file names in the bottom TextBox. Clicking the First [ |<- ], Previous [ <- ], Next [ -> ], or Last [ ->| ] navigation buttons will select different file names in the index. When you click Last, the TrackBar slider will drift towards the left as new files are added to the index.

I wanted the design to include usability features. I came up with several besides the 'responsive UI' objective. One feature was to use the application title bar as a status line. I imagined using this app so that its title bar just peeks above an overlying Internet Explorer window. When a new unique web image is downloaded to the chosen folder, its name from the web is shown in the title bar, giving feedback to the user while browsing. Images that are unique but 'Too small' to meet the minimum image width or height requirements are also noted in the title bar. This is a usability feature to prevent me from downloading itty-bitty images that don't meet my large beautiful images criteria. The minimum width and height are specified in NumericTextBox controls - a great usability control submitted previously to CodeProject (see Reference [02]).

Finally I wanted a way to remove application changes to the Registry and a supporting disk file. Clicking Disable Extension does this. Enable Extension puts the changes back. Implementing event handlers for these buttons turned up threading and shutdown issues that were valuable to solve for application reliability. In fact, completing development of this application led to a novel means of controlling one thread use pattern (well, novel to me), an event handler pattern that was also new for me, and a non-trivial mechanism to reliably complete application shutdown.

I would like to explain these further in the following sections.

Parts and Pieces

DUIapp uses an Internet Explorer context menu extension that is well-covered in many places on the net. This link is a good starting point. When DUIapp is running and a new IE browser window or tab is started, this extension is used in the following way:

Right click over an image being viewed in Internet Explorer. See the "Save Picture As..." menu item. We're sort of like that. Look further down and click the "Download Unique Image" item:

Using Download Unique Image menu item

Clicking "Download Unique Image" causes IE to execute a block of JavaScript we have specified in our menu extension subkey. This JavaScript file (TwineBakery.html) is first written to the special folder "MyDocuments" and then specified in the subkey. The JavaScript generates an easily-detected cookie containing the download image URI which is used later by our app to download the image file.
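For readers unfamiliar with IE context menu extensions, the registration can be sketched roughly as follows. This is not DUIapp's code (that lives in the Constants region and the two button handlers mentioned below). It assumes the commonly documented layout: a subkey under the MenuExt key named after the menu text, a default value pointing at the script file, and a Contexts DWORD of 0x2 so the item appears for images. The use of HKEY_CURRENT_USER and the class name MenuExtension are assumptions here.

```csharp
using System;
using Microsoft.Win32;

// Hypothetical helper for adding and removing an IE context menu extension.
public static class MenuExtension
{
    public const string MenuExtPath =
        @"Software\Microsoft\Internet Explorer\MenuExt";

    // Contexts flag that shows the menu item when an image is right-clicked.
    public const int ImageContext = 0x2;

    public static string SubKeyPath(string menuText)
    {
        return MenuExtPath + @"\" + menuText;
    }

    // Creates the subkey; its default value is the script IE will run.
    public static void Enable(string menuText, string scriptPath)
    {
        using (RegistryKey key =
            Registry.CurrentUser.CreateSubKey(SubKeyPath(menuText)))
        {
            key.SetValue("", scriptPath);
            key.SetValue("Contexts", ImageContext, RegistryValueKind.DWord);
        }
    }

    // Removes the subkey; does nothing if it is already gone.
    public static void Disable(string menuText)
    {
        Registry.CurrentUser.DeleteSubKey(SubKeyPath(menuText), false);
    }
}
```

Enable("Download Unique Image", pathToScript) would then correspond to Enable Extension and Disable("Download Unique Image") to Disable Extension, with pathToScript being wherever the script file was written.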

To see the implementation of this extension, look in the Constants region of Form1.cs in the DUIapp project and at the event handlers buttonEnable_Click and buttonDisable_Click. This extension isn't common because of these characteristics:

  • An extension for Internet Explorer
  • Part of a CodeProject article
  • Uses only C# and JavaScript
  • Assumes the application user has sufficient privilege to edit the Registry

Now here is a completely non-standard diagram of DUIapp:

Diagram of DUIapp

On startup, DUIapp's Form1 creates the extension subkey and additional registry values for the Download folder and minimum width and height when the extension subkey is absent in the registry. Otherwise the existing registry values are used to populate fields in the UI. Form1's constructor ends by starting an Index Initialize task, starting the thread member within imgAdder (an instance of ImageQConsumer), and starting the thread in watcher (a CookieWatcher):

    Task.Factory.StartNew(() => IndexInitializer(GetDLFolder()));
    imgAdder.Start(this); //imgAdder needs this to use delegates on main thread
    watcher.Start();
}

The ImageQConsumer and CookieWatcher elements began as a pattern I've used before, where a class has its own thread member to help isolate thread code. This pattern is sometimes less useful. CookieWatcher is hardly coupled to the UI, so the pattern works well there. Only two references to Form1.infrequentCookieName keep this class from being reused without change. ImageQConsumer, on the other hand, is tightly linked to Form1. In hindsight, a better solution that still doesn't inflate the lines of code in Form1.cs is worth further study.

The CookieWatcher is essentially a FileSystemWatcher that looks for new cookies created by our JavaScript when our menu extension is selected. The CookieWatcher contains a ConcurrentQueue that is stuffed with image URIs gathered from new cookies and an EventWaitHandle that is used to signal whatever consumer dequeues the URIs (our ImageQConsumer). The blue arrow passing through Form1 is meant to show the consumer's access to the CookieWatcher queue via Form1. In Form1.cs, we have:

/// <summary>
/// static CookieWatcher used to queue image Uri's.
/// Shares public queue with consumer
/// </summary>
internal static CookieWatcher watcher = new CookieWatcher();

In CookieWatcher.cs, we have:

/// <summary>
/// EventWaitHandle used to signal consumer of cookie file name change events
/// </summary>
public EventWaitHandle qHandle = new AutoResetEvent(false); // initially unsignalled

/// <summary>
/// ConcurrentQueue used to enqueue web image Uri's for a consumer
/// </summary>
public ConcurrentQueue<string> quri = new ConcurrentQueue<string>();

Then, in ImageQConsumer.cs, we do:

string uri;
// amounts to polling every 1/4 sec or whenever watcher signals
Form1.watcher.qHandle.WaitOne(250);
if (Form1.watcher.quri.TryDequeue(out uri))
{
    // ... process the URI
}
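To see the queue and the wait handle working together outside DUIapp, here is a small stand-alone sketch. The class UriQueue and its member names are made up for illustration. Only ConcurrentQueue, AutoResetEvent and the timed WaitOne come from the snippets above.

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

// Hypothetical stand-in for the watcher's queue plus signal.
public class UriQueue
{
    public readonly EventWaitHandle Signal = new AutoResetEvent(false);
    public readonly ConcurrentQueue<string> Items = new ConcurrentQueue<string>();

    // Producer side: enqueue first, then wake the consumer.
    public void Produce(string uri)
    {
        Items.Enqueue(uri);
        Signal.Set();
    }

    // Consumer side: wait for a signal or the timeout, then drain the queue.
    public List<string> ConsumeOnce(int timeoutMs)
    {
        Signal.WaitOne(timeoutMs);
        List<string> drained = new List<string>();
        string uri;
        while (Items.TryDequeue(out uri))
            drained.Add(uri);
        return drained;
    }
}
```

An AutoResetEvent does not count signals: two quick Set calls can release only one wait. Draining in a loop, or polling with a timeout as the consumer above does, keeps a second URI from sitting in the queue unnoticed.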

The Crazed User

So what's this split Form1 bubble with a couple of Index Initialize tasks on top of each other?

If you are like me, you like to stress test and corner case components and methods as you develop. The question is always 'what if' or 'could someone do'. Take a look again at the user interface and notice the Restart Index Update button. The motivation for this was "What if someone copied or moved a bunch of image files to the download folder while the index was being made?" They would want to restart making the index to include the new files. Hence the button.

Ah, but then, what if a crazed user sits there and madly clicks the Restart Index Update as fast as their little fingers let them? Well, you would know or you would find out that even though you set the button's Enabled property to false as the first thing in the click event handler, the message pump to your UI could still get two or three messages into the handler. And each of these fires off an Index Initialize task to rebuild the index.

What to do? If I can't prevent more than one message, i.e., task, at a time, then maybe I should learn to live with it. I was just working with a queue used between two threaded classes. What if I used a queue to record all these task instances, but only acted fully on the last one (last until the crazed user goes mad again)?

This is my novel thread handling solution. At checkpoints during the IndexInitializer method we check if the task is at the beginning of the queue and if there is more than one task in the queue. If so, just leave. The successor to the current task execution can handle making the index. Here is an example of this "test and let successor handle indexing" checkpoint code in IndexInitializer in Form1.cs:

// Check if we are supposed to shutdown or have a successor and return.
// Otherwise reset and rebuild the index.
lock (shutdown)
{
    if (shutdown.Bool || (initIdx.Count > 1 && initIdx.Peek() == Task.CurrentId))
    {
        initIdx.Dequeue();
        saveIfDirty(dlFolder);
        return;
    }

    idx = 0;
    ordinalFiles.Clear();
    digestFiles.Clear();
    sortedFiles.Clear();
}
...

shutdown.Bool true indicates we are trying to stop all threads (tasks) to shut down the application. Of course we should return for that. initIdx is our task queue. If there is more than one entry in the queue (Count > 1) we have a successor or we are a successor. We peek at the beginning of the queue to see if it is the current execution. If so, we know there is a successor that will build the index. We dequeue our execution task ID. Then we check if the existing index in memory needs to be written to disk (saveIfDirty) since imgAdder may have added an image while we weren't looking. Now we can return. Our successor (if not succeeded itself) will clear the SortedList's and Dictionary comprising the index and begin rebuilding.

Good, but there's one last problem. A successor can be started to index a different folder (Choose Download Folder performed). If that folder can be indexed very quickly because it has no or few image files, the successor could 'run past' a previous execution that is decoding a large image's pixel data and creating a digest of it. By the time the task queue is tested at the next checkpoint, the slower task could miss the successor completely and continue building an index that is no longer wanted. What is needed is 'hold up' code to keep all successors waiting until an earlier execution exits. Here is the code that does this. It precedes the first checkpoint (shown above) in IndexInitializer:

// Wait if we are a successor.  Prevent execution for very quick folders,
// folders with few or no images, from 'running past' execution for slow folders
while (true)
{
    lock (shutdown)
    {
        // let all tasks through at shutdown or just the task at beginning of queue
        if (shutdown.Bool || initIdx.Peek() == Task.CurrentId)
        {
            lock (dirtyFlag)
            {
                if (dirtyFlag.ContainsKey(dlFolder) && dirtyFlag[dlFolder])
                    saveIndex(dlFolder); // save any ImgQConsumer added files
                dirtyFlag[dlFolder] = false; // initialize or reset dirty flag
                break;
            }
        }
    }
    Thread.Sleep(2);
}
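Stripped of the index and the dirty flags, the checkpoint and hold-up logic can be sketched as a small gate class. The name SuccessorGate, the use of plain int IDs, and the Register and Complete methods are assumptions for this example. The article does not show where DUIapp enqueues or finally dequeues its task IDs.

```csharp
using System.Collections.Generic;

// Hypothetical gate: runs are queued in start order, only the head may work,
// and a head that has a successor gives up at its next checkpoint.
public class SuccessorGate
{
    private readonly object gate = new object();
    private readonly Queue<int> pending = new Queue<int>();

    public int Count
    {
        get { lock (gate) return pending.Count; }
    }

    // Called when a run is started (assumed to happen before it does any work).
    public void Register(int id)
    {
        lock (gate) pending.Enqueue(id);
    }

    // Hold-up test: a successor spins on this until earlier runs have left.
    public bool IsMyTurn(int id)
    {
        lock (gate) return pending.Count > 0 && pending.Peek() == id;
    }

    // Checkpoint: true means "I am at the head and have a successor";
    // the caller has been dequeued and should return.
    public bool ShouldYield(int id)
    {
        lock (gate)
        {
            if (pending.Count > 1 && pending.Peek() == id)
            {
                pending.Dequeue();
                return true;
            }
            return false;
        }
    }

    // Called by a run that finishes its work normally.
    public void Complete(int id)
    {
        lock (gate)
        {
            if (pending.Count > 0 && pending.Peek() == id)
                pending.Dequeue();
        }
    }
}
```

With runs 1, 2 and 3 registered in that order, run 1 yields at its first checkpoint, run 2 gets its turn and yields as well, and only run 3 does the full job.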

Using lock () in Control Event Handlers (or Not Rather)

It can be a solid deadlock situation for the UI thread to attempt a lock statement in a control's event handler. If the lock object is used widely to protect access to more than one resource, a deadlock is almost guaranteed. (Wide-area lock objects are another discussion.)

Easy fix though. Just spawn one of those System.Threading.Tasks Task's available in .NET 4.0 and let it lock for resource access and change the control.

But wait, it's bad to be a non-UI thread and change anything in a UI control. Microsoft says control changes are not thread-safe and should be made only by the thread that created the control. In fact, if you run your app under the Visual Studio debugger, a cross-thread exception (InvalidOperationException) is raised if you try.

What to do? You want to start from the UI thread and use a critical resource to change a control's properties.

Easy fix. From the spawned task, just Invoke the change back on the UI thread. Here is the generalized pattern. "Disp" is Dispatcher.CurrentDispatcher saved during form construction:

private delegate void ControlChangeDelegate();
private void ControlChange()
{
    //... use critical resource to make change in control.
}
private void ControlChangeTask()
{
    // lock is held until the synchronous Invoke on the UI thread returns
    lock (lockObj)
        Disp.Invoke(new ControlChangeDelegate(ControlChange)); // Invoke is synchronous
}
private void control1_Change(object sender, EventArgs e)
{
    Task.Factory.StartNew(() => ControlChangeTask());
}

Start at the bottom and move up. The regular event handler (control1_Change) is fired by the user doing something. The regular event handler spawns the asynchronous ControlChangeTask and exits. ControlChangeTask locks for a critical resource that will be used in ControlChange. The lock is held until the synchronous Invoke on the UI thread's Dispatcher is complete. ControlChange uses the critical resource to make control changes.

The TrackBar and the navigation buttons at the bottom of DUIapp's form all use this pattern.
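The pattern can be tried without a form. In the sketch below, MiniUiThread is a made-up stand-in for the UI thread and its Dispatcher: one thread pumps queued actions, and Invoke blocks until the action has run there. PatternDemo then follows the same three steps as the snippet above. None of these names come from DUIapp.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for a UI thread with a synchronous Invoke.
public class MiniUiThread : IDisposable
{
    private readonly BlockingCollection<Action> queue =
        new BlockingCollection<Action>();
    private readonly Thread pump;

    public MiniUiThread()
    {
        pump = new Thread(delegate()
        {
            foreach (Action work in queue.GetConsumingEnumerable())
                work();
        });
        pump.IsBackground = true;
        pump.Start();
    }

    public int ThreadId { get { return pump.ManagedThreadId; } }

    // Returns only after the action has run on the pump thread.
    // Must not be called from the pump thread itself.
    public void Invoke(Action action)
    {
        using (ManualResetEventSlim done = new ManualResetEventSlim(false))
        {
            queue.Add(delegate()
            {
                try { action(); }
                finally { done.Set(); }
            });
            done.Wait();
        }
    }

    public void Dispose()
    {
        queue.CompleteAdding();
        pump.Join();
    }
}

public class PatternDemo
{
    private readonly object lockObj = new object();
    private readonly MiniUiThread ui;
    public int ChangedOnThread; // which thread made the "control change"

    public PatternDemo(MiniUiThread ui) { this.ui = ui; }

    private void ControlChange()
    {
        ChangedOnThread = Thread.CurrentThread.ManagedThreadId;
    }

    private void ControlChangeTask()
    {
        lock (lockObj)                 // taken on a pool thread, not the UI thread
            ui.Invoke(ControlChange);  // held until the UI-side change is done
    }

    // Plays the role of the event handler: spawn the task and return at once.
    public Task Control1_Change()
    {
        return Task.Factory.StartNew(() => ControlChangeTask());
    }
}
```

The UI thread never waits on lockObj, so it stays free to run the invoked change. That is what removes the deadlock risk described at the start of this section.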

So You Think Mom Will Always Be There

Sharing Dispatcher.CurrentDispatcher or even lock objects with worker threads has always been problematic. It was not until I read the first answer in Reference [03] that I came to grips with losing Mom. I believe this override is necessary and is the right solution. What may be novel here is that the override is used twice in shutting down.

Let me back up.

During normal execution a worker thread (general term, not specifically a BackgroundWorker) might use locks on form objects or invoke methods on the main thread. This is a typical snippet in IndexInitializer in Form1.cs:

lock (shutdown)
{
    if (ordinalFiles.Count == 1 && textBox2.Text.Length == 0)
        Disp.Invoke(new ActivateButtonsDelegate(ActivateButtons));
}

When the user clicks the big red X to exit an application, worker threads have Mom taken away. Lock objects are lost and the UI thread's Dispatcher is gone. The form will be gone from the desktop but, using the TaskManager, you will see the stranded threads keep the application around.

application form gone, application still in TaskManager

The trick is to keep Mom around until we can all say goodbye. We override the FormClosing event handler for Form1:

protected override void OnFormClosing(FormClosingEventArgs e)
{
    var rk = GetRegistryKey(false);
    // Is the extension disabled and this FormClosing event the first?
    if (null == rk && !shutdown.Bool)
    {
        var result = MessageBox.Show("Exit without Enabled Extension?",
                        "Extension Disabled", MessageBoxButtons.OKCancel);
        if (result == System.Windows.Forms.DialogResult.Cancel)
        {
            e.Cancel = true;
            base.OnFormClosing(e);
            return;
        }
    }
    else if (rk != null)
        rk.Close();

    // Is this the first FormClosing event (Extension Enabled or user OK with Disabled)
    if (!shutdown.Bool)
    {
        watcher.EndWatcher(); // exit the CookieWatcher now
        // keep form around until non-UI threads see and use shutdown.Bool == true.
        e.Cancel = true;
        base.OnFormClosing(e);
    }
    // Set shutdown.Bool without lock.  UI thread is only setter, other threads will
    // get and see shutdown.Bool == true sooner or later
    shutdown.Bool = true; // imgAdder will raise last FormClosing event via delegate.
}

Also we define a delegate that can be used to send FormClosing events:

internal delegate void FormCloseDelegate();
internal void FormClose()
{
   this.Close();
}

The handler needs a little explanation. First, I wanted to give a warning if the user inadvertently decided to exit but really did want to keep the extension in the registry and the JavaScript on disk. If they say Cancel in the MessageBox, the FormClosing event is cancelled, the override is exited, and the user gets to click Enable Extension before closing the app again.

If the extension is enabled or if the user says OK to close with no extension, we check if this is the first closing event. If it is the first (shutdown.Bool is false), we end the watcher but keep the form around by also cancelling the event here. This time though we set shutdown.Bool = true before exiting the override.

I needed to give some thread the responsibility of closing the form a second time after all threads had seen shutdown.Bool == true. But who? The UI thread isn't busy at this point but it seemed clumsy to enter a loop after setting the shutdown flag to check that both imgAdder and any Index Initialize tasks have exited. Index Initialize tasks may or may not be around so they are a bad choice. We want to keep our little CookieWatcher pure and unaware of any shutdown dance. If ImageQConsumer took care of this, it would only need to check Index Initialize tasks have exited and call the FormClose delegate. It knows by its exit that it's no longer around. So this is the code that ImageQConsumer executes once it leaves its processing loop after seeing shutdown.Bool == true:

// while (true) loop broken by Form1.shutdown.Bool set true.
int retries = 100; // give any initializer tasks a chance to exit
while (Form1.initIdx.Count > 0 && retries-- > 0)
    Thread.Sleep(30);
// save a dirty index if one exists then send FormClosing event
lock (Form1.shutdown)
lock (Form1.dirtyFlag)
{
    foreach (string s in Form1.dirtyFlag.Keys)
    {
        if (Form1.dirtyFlag[s])
        {
            Form1.saveIndex(s);
            break;
        }
    }
    Form1.Disp.Invoke(new Form1.FormCloseDelegate(uiForm.FormClose));
}

Why do I wait so long? The choice of waiting up to 3 seconds (100 x 30 ms) was a judgment call. I have a 44 MB test file that takes about 18 seconds on my PC to decode and make a digest of pixel data. I'm sure there are even bigger, longer images.

I think a user will put up with waiting for a window to go away for up to 3 seconds before they start to worry or get annoyed. Make it shorter or remove the while statement if you wish.

What we do next is save any index that needs saving on behalf of an Index Initialize task that has failed to save in the allotted time (the foreach statement). Then the FormClose delegate is invoked, this time the FormClosing event isn't cancelled, imgAdder goes away, and Mom sings Adieu. (Not to worry. She is only a double click away.)
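The two-step close can be modelled in a few lines without WinForms. In this sketch, RequestClose plays the role of the OnFormClosing override and WorkerLoop the role of ImageQConsumer. The class and member names are invented, and the second close is called directly from the worker here, whereas DUIapp invokes it on the UI thread through the FormClose delegate.

```csharp
using System.Threading;

// Hypothetical model of "cancel the first close, let a worker send the second".
public class TwoPhaseClose
{
    private volatile bool shutdown;

    public bool Closed { get; private set; }
    public int CancelledCloses { get; private set; }

    // First call: raise the shutdown flag and cancel the close (returns false).
    // Second call: let the close go through (returns true).
    public bool RequestClose()
    {
        if (!shutdown)
        {
            shutdown = true;
            CancelledCloses++;
            return false;
        }
        Closed = true;
        return true;
    }

    // Worker: runs until it sees the flag, tidies up, then closes for real.
    public void WorkerLoop()
    {
        while (!shutdown)
            Thread.Sleep(5); // stands in for the worker's normal processing

        // ... save any unsaved state here ...
        RequestClose();
    }
}
```

The owner of the "form" stays alive between the two calls, so the worker still has valid lock objects and a dispatcher while it winds down.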

A Compressing Matter

With terabyte drives common today, disk usage is not such a pressing matter as before. Still, one might want to reduce the size of the on-disk indexes that this application creates. Even with the MD5 digests, a compressed index is about 60% of the size of the uncompressed one. As presented in this article, DUIapp doesn't use compressed index files. If you would like to use SevenZip compression, here are the steps to do that:

  • SevenZip DLLs are included in the download demo folder.  But you may want to get the latest versions. If so...
  • Download and run the SevenZip installer from http://www.7-zip.org/. There are two installers: one for 32-bit Windows and one for 64-bit Windows. You can uninstall this from the Control Panel later, but for now this is the best way to get the latest signed 7z.dll or 7z64.dll libraries. They install in \Program Files\7-zip\. You might consider leaving SevenZip installed. It has a good File Explorer menu extension of its own.
  • Download the SevenZipSharp.dll from http://sevenzipsharp.codeplex.com/. Place this DLL and the 7z.dll or 7z64.dll in the same directory that has or will have DUIapp.exe, our application. Empty bin\Debug and bin\Release folders are zipped in the download source.
  • Locate the dui.sln file in the dui folder of the download source.
  • Open dui.sln with Visual Studio 2010. VS2012 should be OK too. I haven't tested in VS2008. In the Solution Explorer right click the DUIapp project and select Properties. In the Build tab, type in COMPRESS in the 'Conditional compilation symbols:' box. Rebuild the solution. Remember to delete any previously made indexes (zjqxkImgIdx.bin) in your download folders since DUIapp will now try to read and will make compressed indexes.
  • If you "deploy" DUIapp.exe anywhere, be sure to include the SevenZipSharp.dll and the correct or both 7-zip signed libraries.
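To show how a COMPRESS symbol can switch the index file format at build time, here is a minimal sketch. It deliberately uses GZipStream from the .NET Framework instead of SevenZipSharp, so it is not what DUIapp does. The class IndexStreams and its methods are hypothetical.

```csharp
using System.IO;
using System.IO.Compression;

// Hypothetical wrapper: the build symbol decides whether index bytes are compressed.
public static class IndexStreams
{
    // Wraps the raw file stream for writing. The raw stream is left open.
    public static Stream OpenWrite(Stream raw)
    {
#if COMPRESS
        return new GZipStream(raw, CompressionMode.Compress, true);
#else
        return raw;
#endif
    }

    // Wraps the raw file stream for reading. The raw stream is left open.
    public static Stream OpenRead(Stream raw)
    {
#if COMPRESS
        return new GZipStream(raw, CompressionMode.Decompress, true);
#else
        return raw;
#endif
    }
}
```

Because the two builds read and write different bytes, an index written by one build cannot be read by the other, which is why previously made indexes have to be deleted after switching.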

Jumping to Conditionals

Did some faint buzzer go off when I said the index could be used before it was complete? Wouldn't the whole index be needed to guarantee uniqueness? BZZZZZ. OK. You got me.

Imagine this scenario: DUIapp is started and begins creating a new index in a folder with thousands of images, each needing decoding and digesting to add an entry to the index. The ImageQConsumer is running, waiting for image URIs. Since creating a new index will take a few minutes, the user browses to a page with an image and selects "Download Unique Image". The ImageQConsumer retrieves the image to memory and finds its digest is currently unique. It is unique only because the digest for some existing file X has not yet been added. ImageQConsumer writes the retrieved image to disk. A few moments later, the Index Initialize task runs across a duplicate digest for file X. What to do?

Or think of a simpler case: The user copies a bunch of duplicate images to the download folder and starts DUIapp.  Even if we required the index to be fully initialized before allowing any downloads, that doesn't solve this simpler case.

There are three resolutions that made sense to me:

  • In a SILENT manner, simply delete new duplicates as they are found.
  • Only ADVISE the user of new duplicates in the title bar as they are found.
  • Ask the user if we should delete a new duplicate or keep it.

As is, DUIapp takes the last resolution. It seemed confusing and messy to choose different resolutions at run time. Even a command line switch or a persisted visual startup mode selector is an overcomplication. An intentional recompilation seemed cleanest to me. If "SILENT" is added to the DUIapp project's Build properties' Conditional compilation symbols, the first resolution will always be taken. Use "ADVISE" to build with the second resolution. Or leave as-is to have DUIapp ask.
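The compile-time choice amounts to something like the following sketch. The enum DuplicatePolicy and the class BuildOptions are invented for illustration. Only the symbol names SILENT and ADVISE come from DUIapp.

```csharp
// Hypothetical names; the build symbols select one of three behaviors.
public enum DuplicatePolicy
{
    Silent, // delete new duplicates without a word
    Advise, // only note them in the title bar
    Ask     // prompt the user to delete or keep
}

public static class BuildOptions
{
    // Fixed when the project is compiled; there is no run-time switch.
    public static DuplicatePolicy Policy
    {
        get
        {
#if SILENT
            return DuplicatePolicy.Silent;
#elif ADVISE
            return DuplicatePolicy.Advise;
#else
            return DuplicatePolicy.Ask;
#endif
        }
    }
}
```

Code that finds a duplicate can then branch on BuildOptions.Policy, or the #if blocks can be placed directly at that spot so the unused branches are not compiled at all.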

Parts and Pieces, Part 2

This application is made of three pieces: one exe file and two DLLs for NumericTextBox and ImageWebControl. A total of five pieces are needed if you use the SevenZip compression.  All these are needed in one place for startup and loading. A poor-man's deployment, like the demo download available above, simply has all the pieces in one folder. To execute, you double-click DUIapp.exe in that folder.

There are ways to package the pieces into one exe file (and add obfuscation if the packager supports that). One popular packager is Microsoft's ILMerge (Reference [06]). But please, also check over References [07] and [08] as we bravely march on to .NET 4.5. One muses, "What will this look like in Metro"?

Summary

I presented an application that allows you to keep images in files of a chosen folder unique. This app creates and uses an index that can take a few minutes to create. The index is used to check uniqueness.

Creating a full new index in memory for my test folder (6671 image files of 1.7 GB) takes about 10 minutes on my PC. If the chosen folder is changed or the application is exited, the index being created is saved to disk to the point of interruption. Restarting resumes extending the partial index. Starting the app with my test folder chosen and a complete index resident on disk takes about 1.5 minutes to bring the full index back into memory.

This application includes an Internet Explorer menu extension that can be persisted or removed at will by two buttons (Enable Extension, Disable Extension). When this extension is present, the user can start an IE browser window or tab and select "Download Unique Image" when right-clicking over a web page image. Such images are retrieved but not stored to disk unless their image data is [currently] unique. You can strike "[currently]" when an index is complete. Vanilla and SILENT builds note index completion in the title bar. ADVISE builds don't report completion since this would cover the last duplicate found.

Several threading issues were discussed.

DUIapp was tested on:

  • XP, IE 7 and IE 8, 32-bit
  • W7, IE 8, 32-bit
  • W7, IE 9, 64-bit
  • W8CP, IE 10CP, 64-bit

References

  1. http://www.codeproject.com/Articles/27242/ExifTagCollection-An-EXIF-metadata-extraction-libr by Lev Danielyan 24 Jun 2008
  2. http://www.codeproject.com/Articles/30812/Simple-Numeric-TextBox by DaveyM69 9 Nov 2008
  3. http://stackoverflow.com/questions/1731384/how-to-stop-backgroundworker-on-forms-closing-event/
  4. http://www.codethinked.com/net-40-and-systemthreadingtasks - a good blog on 4.0 Tasks
  5. http://www.codeproject.com/Articles/19682/A-Pure-NET-Single-Instance-Application-Solution by Shy Agam 17 Nov 2007
  6. http://www.microsoft.com/en-us/download/details.aspx?id=17630 - ILMerge
  7. http://research.microsoft.com/en-us/people/mbarnett/ilmerge.aspx - Mike Barnett on ILMerge
  8. http://www.mattwrock.com/post/2012/02/29/What-you-should-know-about-running-ILMerge-on-Net-45-Beta-assemblies-targeting-Net-40.aspx - Matt Wrock's referenced blog post

History

Submitted to CodeProject 17 Aug 2012.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.