Introduction
I am currently writing a tool (Transfer) that will copy all files from a specified source directory (and its subdirectories) to a specified target directory, maintaining the source directory structure. The tool will be published on CodeProject when it is completed.
The tool differs significantly from XCOPY in that:
- Whether or not to overwrite an existing target file is decided by comparing (see the sketch after this list):
  - the source and target file lengths, and, if equal,
  - the contents of the two files.
- Subdirectories can be excluded, either en masse (like /bin) or by specific selection.
- Copying files can be limited to specific extensions.
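To make the overwrite test concrete, here is a minimal sketch; the helper name ShouldOverwrite and the byte-by-byte comparison are my illustrative assumptions, not the Transfer tool's actual code:

using System.IO;

// Sketch of the overwrite decision: lengths first, contents only if
// the lengths match. Hypothetical helper; not the Transfer tool's code.
static bool ShouldOverwrite ( string source_path, string target_path )
{
    FileInfo source = new FileInfo ( source_path );
    FileInfo target = new FileInfo ( target_path );
                                    // different lengths: contents must differ
    if ( source.Length != target.Length )
    {
        return ( true );
    }
                                    // equal lengths: compare contents byte by byte
    using ( FileStream source_stream = source.OpenRead ( ) )
    using ( FileStream target_stream = target.OpenRead ( ) )
    {
        int source_byte;

        do
        {
            source_byte = source_stream.ReadByte ( );
            if ( source_byte != target_stream.ReadByte ( ) )
            {
                return ( true );    // contents differ: overwrite
            }
        } while ( source_byte != -1 );
    }
    return ( false );               // files are identical: skip the copy
}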
While researching the most efficient way to traverse the source directory, I tested four algorithms. The algorithms, along with their test methods, were:
Algorithm        Test Method
GetFiles         retrieve_GetFiles_files
FileSystemInfo   retrieve_FileSystemInfo_files
Stack            retrieve_Stack_files
Recursive        retrieve_Recursive_files
The Project
Although valid objections may be raised regarding my choice of algorithms, I limited myself to well-known implementations. For timing purposes, I tried to limit each method to the minimum code required.
Before I go too far, the reader should know that this project was developed only to determine which algorithm, based on execution time, I should use in the Transfer tool project. That tool requires the collection of directory and subdirectory names and the collection of all of the names of files that are at or below a specified topmost directory.
Timing Template
Each of the four algorithms was executed a user-specified number of times. The template that I used to time each algorithm was:
using System.Diagnostics;

TimeSpan  elapsed;
Stopwatch stopwatch = new Stopwatch ( );
int       time_ms;

stopwatch.Start ( );
for ( int i = 0; ( i < iterations ); i++ )
{
                                    // precondition: empty the output list
    if ( [ target file ].Count > 0 )
    {
        [ target file ].Clear ( );
    }
    [ execute an algorithm from the topmost_directory ]
}
stopwatch.Stop ( );

elapsed = stopwatch.Elapsed;
                                    // convert the elapsed time to milliseconds
time_ms = elapsed.Hours   * 3600000 +
          elapsed.Minutes * 60000 +
          elapsed.Seconds * 1000 +
          elapsed.Milliseconds;
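Incidentally, the hours/minutes/seconds arithmetic can be replaced by the Stopwatch class's own counter, which also avoids losing whole days on very long runs (TimeSpan.Hours is only the hours component):

// Equivalent conversion: Stopwatch tracks the total elapsed
// milliseconds directly, so no component arithmetic is needed.
time_ms = ( int ) stopwatch.ElapsedMilliseconds;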
Because two of the algorithms have a recursive component, all of the algorithms required:
- use of a List < string > structure, declared at the global level, and
- preconditioning (clearing) of their output lists.
So, even though a specific algorithm might not require it, all output lists were cleared at the beginning of each iteration.
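For reference, the four output lists (the names match those used in the test methods below) were declared at the class level, presumably along these lines:

using System.Collections.Generic;

// Output lists, one per algorithm, declared at the class (global)
// level so that the recursive methods can accumulate results across calls.
List < string > GetFiles_files       = new List < string > ( );
List < string > FileSystemInfo_files = new List < string > ( );
List < string > Stack_files          = new List < string > ( );
List < string > Recursive_files      = new List < string > ( );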
GetFiles Test Method
The output of each algorithm was required to be saved in a List < string > structure. This requirement increased the length of some algorithms whose returned values were in a string [ ] format. For instance, the GetFiles algorithm:
void retrieve_GetFiles_files ( string root )
{
    string [ ] entries;
                                    // GetFiles does all of the work;
                                    // the loop filters and converts
    entries = Directory.GetFiles (
        root,
        "*.*",
        SearchOption.AllDirectories );
    foreach ( string entry in entries )
    {
        FileAttributes attributes = File.GetAttributes (
            entry );

        if ( ( attributes & FileAttributes.Directory ) !=
             FileAttributes.Directory )
        {
                                    // not a directory; record the filename
            GetFiles_files.Add ( entry );
        }
    }
}
All of the work is performed in the GetFiles invocation but, because the output was required to be a List < string > of files, the GetFiles test method required additional code to:
- eliminate directories, and
- convert the output from string [ ] to List < string >.
This conversion added time to the GetFiles algorithm.
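For comparison, were the directory filter unnecessary, the string [ ] to List < string > conversion would collapse to a single constructor call (a sketch, not the code that was timed):

// Sketch: the conversion alone; the List < string > constructor
// accepts any IEnumerable < string >, including a string [ ].
GetFiles_files = new List < string > (
    Directory.GetFiles ( root, "*.*", SearchOption.AllDirectories ) );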
FileSystemInfo Test Method
void retrieve_FileSystemInfo_files ( string root )
{
    DirectoryInfo      directory_info    = null;
    FileSystemInfo [ ] directory_entries = null;

    directory_info    = new DirectoryInfo ( root );
    directory_entries = directory_info.GetFileSystemInfos ( );
    foreach ( FileSystemInfo entry in directory_entries )
    {
        if ( entry is DirectoryInfo )
        {
                                    // a directory; recurse into it
            retrieve_FileSystemInfo_files ( entry.FullName );
        }
        else if ( entry is FileInfo )
        {
                                    // a file; record its full name
            FileSystemInfo_files.Add ( entry.FullName );
        }
    }
}
As with the GetFiles invocation, the output from the GetFileSystemInfos invocation is composed of both directories and files. In addition, it is necessary to recurse through the directories. As a result, the output list cannot be declared locally within the retrieve_FileSystemInfo_files method; it must be global so that the recursive invocations can share it. This is one of the methods that forced the clearing of the output List < string > before invoking the method.
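One way to keep that precondition next to the invocation is a small wrapper; the wrapper name here is my own, not part of the test project:

// Hypothetical wrapper: clears the global list, runs the recursive
// traversal, and hands back the accumulated filenames.
List < string > collect_FileSystemInfo_files ( string root )
{
    FileSystemInfo_files.Clear ( );
    retrieve_FileSystemInfo_files ( root );
    return ( FileSystemInfo_files );
}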
Stacking and Recursion
In searching for algorithms, I came across the Microsoft Developer Network (MSDN) article How to: Iterate Through a Directory Tree (C# Programming Guide). What I found particularly interesting was an implicit disavowal of the GetFiles and FileSystemInfo methods, for within the article appeared the following (paraphrased) warning:
The weakness in this approach ( root.GetDirectories ( "*.*", System.IO.SearchOption.AllDirectories ) ) is that if any one of the subdirectories under the specified root causes a DirectoryNotFoundException or UnauthorizedAccessException, the whole method fails and returns no directories. The same is true when you use the GetFiles method. If you have to handle these exceptions on specific subfolders, you must manually walk the directory tree ...
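In other words, the all-at-once form is all-or-nothing: one unreadable subdirectory discards every result. A sketch of the failure mode (root stands for the chosen topmost directory):

try
{
                                    // one call walks the entire tree ...
    string [ ] all_files = Directory.GetFiles (
        root,
        "*.*",
        SearchOption.AllDirectories );
}
catch ( UnauthorizedAccessException )
{
                                    // ... but a single inaccessible
                                    // subdirectory lands here, and no
                                    // files at all are returned
}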
With this warning in mind, I implemented the Stack and Recursive methods, as provided in the MSDN article.
Stack Test Method
I think stacks are the forgotten data structure. I was implementing them in C in the 1970s, used them in Pascal during the 1980s, and then, for some reason, stopped using them. But here, the stack proves its worth.
void retrieve_Stack_files ( string root )
{
    Stack < string > directories = new Stack < string > ( 20 );

    directories.Push ( root );
    while ( directories.Count > 0 )
    {
        string     current_directory = directories.Pop ( );
        string [ ] subdirectories;

        try
        {
                                    // retrieve this directory's subdirectories
            subdirectories = Directory.GetDirectories (
                current_directory );
        }
        catch ( UnauthorizedAccessException )
        {
            continue;
        }
        catch ( DirectoryNotFoundException )
        {
            continue;
        }

        try
        {
                                    // record this directory's files
            foreach ( string file in Directory.GetFiles (
                                         current_directory ) )
            {
                Stack_files.Add ( file );
            }
        }
        catch ( UnauthorizedAccessException )
        {
            continue;
        }
        catch ( DirectoryNotFoundException )
        {
            continue;
        }

        try
        {
                                    // queue the subdirectories for later visits
            foreach ( string subdirectory in subdirectories )
            {
                directories.Push ( subdirectory );
            }
        }
        catch ( UnauthorizedAccessException )
        {
            continue;
        }
        catch ( DirectoryNotFoundException )
        {
            continue;
        }
    }
}
Yes, the source code is somewhat longer than that of the two preceding algorithms, but I think it may be more readable. Far more important is its robustness. If a particular directory cannot be read, only that directory is skipped; the files in every accessible directory are still returned - we do not fail without returning something. If, however, the reader prefers to throw an exception, that code can easily be added to each of the catch blocks.
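For example, extracting the first try-catch into a helper that records the failure and rethrows might read as follows (a sketch; the helper name is mine):

// Sketch: record the failure and rethrow, rather than silently
// skipping the directory with continue.
string [ ] get_subdirectories ( string current_directory )
{
    try
    {
        return ( Directory.GetDirectories ( current_directory ) );
    }
    catch ( UnauthorizedAccessException e )
    {
        Console.WriteLine ( e.Message );  // record the failure
        throw;                            // then propagate it to the caller
    }
}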
Although it is tempting to collapse the three try-catch blocks into one, experimentation showed that doing so increases the execution time. I do not know why this is the case.
Recursive Test Method
void retrieve_Recursive_files ( DirectoryInfo root )
{
    FileInfo [ ]      files          = null;
    DirectoryInfo [ ] subdirectories = null;

    try
    {
        files = root.GetFiles ( "*.*" );
    }
    catch ( UnauthorizedAccessException )
    {
                                    // skip directories we cannot read
    }
    catch ( DirectoryNotFoundException )
    {
                                    // skip directories that have vanished
    }

    if ( files != null )
    {
                                    // record this directory's files
        foreach ( FileInfo fi in files )
        {
            Recursive_files.Add ( fi.FullName );
        }
                                    // recurse into each subdirectory
        subdirectories = root.GetDirectories ( );
        foreach ( DirectoryInfo dirInfo in subdirectories )
        {
            retrieve_Recursive_files ( dirInfo );
        }
    }
}
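Note that, unlike the other three methods, retrieve_Recursive_files takes a DirectoryInfo rather than a string, so the caller converts the topmost directory first:

// Seed the recursion with a DirectoryInfo, not a path string.
Recursive_files.Clear ( );
retrieve_Recursive_files ( new DirectoryInfo ( topmost_directory ) );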
Timing Results
I ran the ComparingDirectoryTraversalMethods project tool for one and for ten iterations, traversing a chosen directory using each of the four algorithms. The results, in milliseconds, are:
Iterations   GetFiles   FileSystemInfo   Stack   Recursive
         1       3062             4158    5139        6164
        10      31098            41895   51453       61823
Conclusion
This article has presented timings for four popular directory-traversal algorithms.
Considering the warning in the MSDN article referenced earlier, and the fact that the execution times do not differ by an order of magnitude, I believe that serious consideration should be given to the Stack and Recursive algorithms for traversing directory structures.
I have chosen the Stack implementation for the Transfer tool. It will be executed in a thread, separate from the UI thread.
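A minimal sketch of that arrangement, using a plain Thread (BackgroundWorker or the Task Parallel Library would serve equally well; the wiring here is my illustration, not the Transfer tool's actual code):

using System.Threading;

// Run the stack-based traversal off the UI thread; the UI remains
// responsive while the directory tree is walked.
Thread worker = new Thread ( ( ) =>
    {
        retrieve_Stack_files ( topmost_directory );
        // marshal results back to the UI thread here, e.g., via
        // Control.Invoke in a WinForms application
    } );
worker.IsBackground = true;
worker.Start ( );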
History
- 11/9/2012 - Original article.