Fault Tolerance for Large Files on Cranky Hardware

13 Sep 2012 | CPOL | 8 min read
WPF and Form utilities that do a better job copying and comparing large files on cranky hardware. Plea for software manufacturers to consider fault tolerance.

Verify Copy screenshot

Introduction

If you do not have any problems moving, copying, or using large files on externally connected disks, this article may not be of interest to you. If you suspect your external hardware sometimes has tiny hiccups when using large files, these fault-tolerant utilities should help you copy or compare files to/from/on external disks.

Background

I use Norton Ghost (http://www.symantec-norton.com/) to back up my hard drives as single-file images. Norton Ghost is a great product and I highly recommend it. But now and then I need to shuffle backups or "restore points" around, or actually restore disk volumes. That's when I see the following sorts of problems. The following screenshot shows the Symantec Recovery Point Browser's Verify Recovery Point... dialog:

recovery point check dialog

WILLM_C_Drive002.v2i was on USB-connected drive T:. This screenshot shows a failed verification.

failed integrity verification

And this shows a successful verification that was made on the same file immediately after the failed run above:

successful integrity check

This can also happen when I compare large Virtual Machine images or video files with Beyond Compare (http://ScooterSoftware.com). Beyond Compare is a great product and I highly recommend it. Since Beyond Compare offers to show you differences between compared files, it must scan the entirety of both files before reporting a failed compare. These are partial screenshots showing a failed CRC-based file compare in Beyond Compare:

Beyond Compare screenshot LHS

Beyond Compare screenshot RHS

This shows the result of comparing these files with Sequence Comparer, the second utility in this article:

Sequence Comparer successful compare

What I wanted was a way to copy large files and know, when a copy is made, that it is a reliable duplicate. I also wanted a way to prove files are duplicates if a copy was made by simple copy and paste using the File Explorer.

Some Insight

Digital machines are supposedly reliable in large part because they are digital - no analog vagary in a digital bit. Why then can there be different results in repeating a verification or comparison with these applications?

Enter my technical term, "Cranky Hardware", which means hardware that is close to being digitally ideal but, in extended use, is occasionally error-prone. This is a diagram of the cranky hardware I believe is involved in making a bad large file copy or comparison:

RAM to Hard Disk diagram

System RAM is part of cranky hardware because memory can be affected by voltage transients or high temperature.

Disk Cache is part of System RAM. The little rectangles shown in the Disk Cache represent disk sectors for recently read data or write data that is being collected to write more efficiently in groups or blocks. The use and replacement of these little rectangles may have rare race conditions that only show up when dealing with large files on external disks.

Most Disk Controllers have their own cache for sector data which can be subject to the same rare problems as the System RAM Disk Cache.

The physical Hard Disk is part of cranky hardware because, well, what could be more cranky than detecting the magnetic state of spinning specks of rust? Yes, there are CRC and ECC mechanisms to detect and correct bad bits, but what keeps these mechanisms from being cranky?

And then there is the USB Controller. Here is a diagram of one of those:

USB Controller diagram

Roughly, if the seeing of the flow isn't sufficiently springy, a badly colored bit can be selected. Or something like that. I may not have the details exact, but I suspect external disk controllers (USB, eSATA, etc.) are the crankiest part of hardware that can't handle large files consistently. I don't remember any cranky behavior when only internal IDE or SATA disks are involved. So, what to do?

Using the code

The most basic means of fault tolerance is to retry operations that fail to produce the expected results. For file copy, this means reading back the data just written, comparing it to the original data, and retrying the write, read, and compare operation if a difference is found. The special magic secret I found was to start any segment retry at a slightly smaller file offset than the previous try. I believe this helps keep the Disk Cache from short-circuiting the write-then-read sequence, so that the read-back gets real Hard Disk bits.
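The write, read-back, compare, and retry idea can be sketched as follows. This is a minimal illustration, not the actual Verify Copy source: the class name VerifiedSegmentWriter, the method CopySegment, and the maxRetries and backoff parameters are all hypothetical names chosen for this example. It works on any seekable Stream so it can be tried with in-memory streams.

```csharp
using System;
using System.IO;

// Hypothetical helper for illustration; not taken from the Verify Copy source.
public static class VerifiedSegmentWriter
{
    // Reads exactly 'count' bytes starting at 'position', or throws if the stream ends early.
    static void ReadExact(Stream stream, long position, byte[] buffer, int count)
    {
        stream.Seek(position, SeekOrigin.Begin);
        int total = 0;
        while (total < count)
        {
            int n = stream.Read(buffer, total, count - total);
            if (n == 0) throw new EndOfStreamException();
            total += n;
        }
    }

    static bool SameBytes(byte[] a, byte[] b, int count)
    {
        for (int i = 0; i < count; i++)
            if (a[i] != b[i]) return false;
        return true;
    }

    // Copies one segment, reads it back, and compares it with the original.
    // Each retry starts 'backoff' bytes earlier than the previous try (assumed policy).
    public static bool CopySegment(Stream source, Stream dest, long offset, int length,
                                   int maxRetries, int backoff)
    {
        for (int attempt = 0; attempt <= maxRetries; attempt++)
        {
            long start = Math.Max(0, offset - (long)attempt * backoff);
            int count = (int)(offset + length - start);
            byte[] original = new byte[count];
            byte[] readBack = new byte[count];

            ReadExact(source, start, original, count);   // read original
            dest.Seek(start, SeekOrigin.Begin);
            dest.Write(original, 0, count);              // write copy
            dest.Flush();
            ReadExact(dest, start, readBack, count);     // read copy

            if (SameBytes(original, readBack, count)) return true;
        }
        return false; // retries exhausted
    }
}
```

Re-writing a few bytes before the segment is harmless because they are rewritten with the same source data. When dest is a FileStream, calling its Flush(true) overload instead of Flush() asks the operating system to flush its buffers to disk as well. Whether that is enough to defeat every cache in the chain is something I would not assume.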

Verify Copy, shown at the start of this article, is a WPF application that does file copy with retries. It uses System.Windows.Controls.Primitives elements and System.Windows.Input.MediaCommands to "Start", "Pause", and "Stop" a file copy operation like a media player. When Verify Copy is minimized, progress is shown in the task bar icon, and hovering over the icon lets you use the Start, Pause, and Stop controls. Of course you don't have to minimize this utility to use it. This is just a convenience for this kind of WPF app.

This is a screenshot of Verify Copy from the task bar:

Verify Copy from the Task Bar

The cursor is not shown but it was hovering over the Pause button in the mini display.

Verify Copy uses a 3-pass strategy. Pass 1 does the basic read original, write copy, read copy and compare sequence with up to 2 retries. Verify Copy fails the file copy if retries are exhausted in this pass.

Pass 2 is a read original, read copy, and compare phase that verifies the copied file after it is fully written, with up to 2 retries per segment. Disk Cache short circuits should be rare in this pass. If retries are exhausted, the segment's position in the file is noted.

Finally, Pass 3 attempts to re-write any segments in the copied file that failed retries in Pass 2. Note the "Repairs needed 2" and "Repairs made 1" displayed at the bottom right in the screenshot of Verify Copy at the top of this article. This was taken in the middle of Pass 3. Also note the Dark Goldenrod Rectangle displayed immediately below the progress bar. Verify Copy changes the color of this Rectangle each time a retry is made.
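Passes 2 and 3 can be sketched roughly as below. This is an assumption-laden illustration rather than the Verify Copy implementation: CopyVerifier, FindBadSegments, and RepairSegments are hypothetical names, the segment size is a plain parameter, and the "slightly smaller offset" retry trick is left out to keep the sketch short.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical sketch of Pass 2 (verify) and Pass 3 (repair); names are illustrative.
public static class CopyVerifier
{
    // Reads up to 'count' bytes at 'pos' and returns how many were actually read.
    static int ReadAt(Stream s, long pos, byte[] buf, int count)
    {
        s.Seek(pos, SeekOrigin.Begin);
        int total = 0, n;
        while (total < count && (n = s.Read(buf, total, count - total)) > 0)
            total += n;
        return total;
    }

    // True when both streams hold identical bytes in [pos, pos + count).
    static bool SegmentsMatch(Stream a, Stream b, long pos, int count)
    {
        byte[] x = new byte[count], y = new byte[count];
        if (ReadAt(a, pos, x, count) != count || ReadAt(b, pos, y, count) != count)
            return false;
        for (int i = 0; i < count; i++)
            if (x[i] != y[i]) return false;
        return true;
    }

    // Pass 2: compare segment by segment and note the offsets that still differ after retries.
    public static List<long> FindBadSegments(Stream original, Stream copy,
                                             int segmentSize, int maxRetries)
    {
        var bad = new List<long>();
        for (long pos = 0; pos < original.Length; pos += segmentSize)
        {
            int count = (int)Math.Min(segmentSize, original.Length - pos);
            bool ok = false;
            for (int attempt = 0; attempt <= maxRetries && !ok; attempt++)
                ok = SegmentsMatch(original, copy, pos, count);
            if (!ok) bad.Add(pos);
        }
        return bad;
    }

    // Pass 3: re-write each noted segment and count the ones that now compare equal.
    public static int RepairSegments(Stream original, Stream copy,
                                     List<long> badOffsets, int segmentSize)
    {
        int repaired = 0;
        byte[] buf = new byte[segmentSize];
        foreach (long pos in badOffsets)
        {
            int count = (int)Math.Min(segmentSize, original.Length - pos);
            if (ReadAt(original, pos, buf, count) != count) continue;
            copy.Seek(pos, SeekOrigin.Begin);
            copy.Write(buf, 0, count);
            copy.Flush();
            if (SegmentsMatch(original, copy, pos, count)) repaired++;
        }
        return repaired;
    }
}
```

In this sketch the length of the list returned by FindBadSegments plays the role of "Repairs needed", and the return value of RepairSegments plays the role of "Repairs made".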

If you find Verify Copy is using too much CPU for you to do other work, try Pause to suspend copying the file. Depending on the buffer size used, Verify Copy may take a little while to respond. Clicking "Resume" (shown once Pause is in effect) will resume the file copy.
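One plausible reason for the delayed response is that a pause request can only be honored between segments, so a larger buffer means a longer wait. The sketch below shows that pattern with ManualResetEventSlim. It is an assumption about how such a pause could be built, not a description of Verify Copy's code, and PausableWorker and its members are hypothetical names.

```csharp
using System;
using System.Threading;

// Hypothetical sketch: pause and stop requests are only checked between segments.
public class PausableWorker
{
    // Signaled while the worker is allowed to run; reset while paused.
    private readonly ManualResetEventSlim _running = new ManualResetEventSlim(true);
    private volatile bool _stopRequested;

    public void Pause()  { _running.Reset(); }
    public void Resume() { _running.Set(); }

    // Stop also releases a paused worker so it can notice the request and exit.
    public void Stop()   { _stopRequested = true; _running.Set(); }

    // Processes segments one at a time and returns how many were completed.
    public int Run(int segmentCount, Action<int> processSegment)
    {
        int done = 0;
        for (int i = 0; i < segmentCount; i++)
        {
            _running.Wait();             // blocks here while paused
            if (_stopRequested) break;
            processSegment(i);           // a large segment delays the next check
            done++;
        }
        return done;
    }
}
```

Run would be called on a background thread, with Pause, Resume, and Stop called from the UI thread in response to the buttons.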

If you build Verify Copy from the source code and solution, there are two conditional compilation symbols you can define. If DEFEAT_CACHE is defined, Verify Copy uses large read/write buffers that are limited only by available memory, and adds a step that pre-writes the destination file before a copy write. This should completely eliminate Disk Cache short circuits on most systems, but it costs additional time in making the verified copy. MEMORY can be defined in addition to DEFEAT_CACHE. It adds a CRC-32 check on the read original buffer before and after its use to verify that the buffer hasn't changed during use. This also costs additional time, and I can report that this switch never showed my read original memory buffer (RAM) to be cranky.
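The before-and-after buffer check can be illustrated like this. It is a sketch under stated assumptions: BufferGuard and UnchangedDuring are hypothetical names, the CRC-32 shown is the common table-driven form (reflected polynomial 0xEDB88320) rather than the HashAlgorithm class listed in the references, and the #if MEMORY guard that would wrap the check in a conditional build is omitted so the code runs as-is.

```csharp
using System;

// Hypothetical sketch of checking that a buffer did not change while it was in use.
public static class BufferGuard
{
    static readonly uint[] Table = BuildTable();

    // Builds the standard 256-entry CRC-32 lookup table.
    static uint[] BuildTable()
    {
        var table = new uint[256];
        for (uint i = 0; i < 256; i++)
        {
            uint c = i;
            for (int k = 0; k < 8; k++)
                c = (c & 1) != 0 ? 0xEDB88320u ^ (c >> 1) : c >> 1;
            table[i] = c;
        }
        return table;
    }

    // Computes the CRC-32 of the first 'count' bytes of 'data'.
    public static uint Crc32(byte[] data, int count)
    {
        uint crc = 0xFFFFFFFFu;
        for (int i = 0; i < count; i++)
            crc = Table[(crc ^ data[i]) & 0xFF] ^ (crc >> 8);
        return crc ^ 0xFFFFFFFFu;
    }

    // Runs 'use' and reports whether the buffer's CRC-32 is the same before and after.
    public static bool UnchangedDuring(byte[] buffer, int count, Action use)
    {
        uint before = Crc32(buffer, count);
        use();
        return Crc32(buffer, count) == before;
    }
}
```

Here the use delegate stands in for the write, read-back, and compare steps that consume the read original buffer.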

Finally, if you want to build and run the Sequence Comparer in the Visual Studio debugger, you will have to change the Startup project in the Solution Explorer. Here is a picture of that menu being selected:

Set Startup Menu

Philosophy and a Plea

How do you deal with cranky hardware if you are a software manufacturer? I think there are two philosophies one can take.

  1. Assume and demand the user has perfect hardware. Few users actually have problem hardware, and attempting any fault tolerance for less-than-perfect hardware can call the product's trustworthiness into question. If cranky hardware is accounted for, maybe the software will be viewed as possibly cranky too.
  2. Try to give the user every chance of completing the task they want to do with the product. To avoid such implications, make fault tolerance an option whose intent is obvious, and provide it to users who are desperate enough to need it.

My plea is that software manufacturers consider the second philosophy.

Added Words

I've tested the demo code on Windows 7 x86 and Windows 8 x64. A bug in the reported 'repairs needed' count in Verify Copy was fixed, and the destination file's Create and Last Write DateTime values are now set from the source on copy completion. Sequence Comparer now remembers the last folders for File 1 and File 2, which helps when multiple files will be compared between two folders. I found SequenceCompare was faster than CRC32 classic, slice-by-4, or slice-by-8 checks, so the option to select "Binary" or "Crc" comparison was not added.
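Carrying the source file's timestamps over to the copy takes only a couple of System.IO.File calls. The following is a minimal sketch of one way to do it. TimestampCopier and CopyTimestamps are hypothetical names, and the use of the UTC variants is my assumption for this example, not necessarily what Verify Copy does.

```csharp
using System;
using System.IO;

// Hypothetical sketch: copy creation and last-write times from source to destination.
public static class TimestampCopier
{
    public static void CopyTimestamps(string sourcePath, string destPath)
    {
        // The destination file must be closed before its timestamps are set.
        File.SetCreationTimeUtc(destPath, File.GetCreationTimeUtc(sourcePath));
        File.SetLastWriteTimeUtc(destPath, File.GetLastWriteTimeUtc(sourcePath));
    }
}
```

Setting the timestamps last, after all write and repair passes have finished, keeps later writes from overwriting the Last Write value.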

I would like to highly recommend the LaCie eSATA PCI Card (Design by SISMO) for PCI-based systems with cranky hardware. I've found the flow seeing to be extremely springy and have only seen one or two compare retries since using this hardware in place of the USB adapters I've used so far. I also noticed read/write access to eSATA-connected disks is now faster than my internal EIDE disks. USB-connected disks were about 1/5 the performance of internal disks (cranky indeed).

References

  1. http://elegantcode.com/2011/03/01/wpfwindows-7-task-bar-part-threeoverlay-icons/ - 3rd part of good series on Win 7 Task Bar features.
  2. http://www.codeproject.com/Articles/96498/WPF-Integrating-our-application-with-the-Windows-7 - 2010 article on using Icon Overlays and the ProgressBar in .NET 3.5 and 4.0
  3. http://msdn.microsoft.com/en-us/library/ms752293.aspx - I needed this to come up with the Geometry for the PauseImage overlay.
  4. http://msdn.microsoft.com/en-us/library/system.windows.window.icon%28v=vs.100%29.aspx - specifying the task bar icon in XAML
  5. http://www.codeproject.com/Articles/2380/Cyclic-Redundancy-Check-CRC32-HashAlgorithm - I added public UInt32 ComputeHash(byte[] byteArray, bool AsUInt32)
  6. http://stackoverflow.com/questions/778678/how-to-change-the-color-of-progressbar-in-c-sharp-net-3-5 - Future work.
  7. http://technet.microsoft.com/en-us/library/bb742613.aspx - File Cache Performance and Tuning. Deeper than I can wade right now.
  8. http://www.blackwasp.co.uk/ExtensionMethods.aspx - One intro on Extension methods.
  9. http://geekswithblogs.net/NewThingsILearned/archive/2008/08/25/refresh--update-wpf-controls.aspx - blog explaining workings of Refresh extension.
  10. http://www.pinvoke.net/default.aspx/kernel32.getdiskfreespace - GetDiskFreeSpace used in GetDiskStuff.cs

History

Submitted to CodeProject 4 September, 2012.
First update 13 September, 2012.  See "Added Words"

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)