Introduction
I've found myself in need of a piece of code a few times to search a drive for files. I am currently thinking of writing a WCF service for deploying on all my machines to ease their management. One of the methods it will expose shall retrieve the complete list of files and folders present on the drives of the host machine. The idea is to populate a Database with all these files and to write a tool to analyze this data. I currently have a little bit more than one terabyte of data in online storage and I know that there are many duplicate files, so...
Quite recently, I wrote a piece of code that looks for particular files and folders in order to delete them. It's a tiny tool that I wrote to clean a development folder, which I need to do quite regularly when I copy or exchange code. It searches for folders such as ".svn", "bin", "obj" and "Backup" and deletes them together with their complete contents. Other folders are searched for files with names like "*.pdb", "*.suo" or "Thumbs.db", and these files are deleted as well. Such an engine differs a bit from the first one in that it must be able to search some folders for specific matches while simply taking everything in other folders.
Besides that, I am dreaming of a "line counter" tool for the future to count the number of lines of code in a project, which would presumably require getting the collection of all "*.cs" and "*.vb" files. This is a simple search for matching files, but something must be done with the results at the end because the files must be read.
Coming back to my WCF service idea, I would already be glad to have all files and folders listed in my database, but it would definitely be better if I could avoid storing uninteresting files and folders, like everything under the "Windows" folder, everything under the "Program Files" folder, or other particular files anywhere else. I thought I could add a new capability to my previous implementation: the engine should be able to include or exclude matches. I wanted to be able to exclude particular files and even complete folders, and I really wanted to re-use my existing piece of code.
The problems really started when I thought a little further about my idea. Ideally, a search engine should be able to consider more than just the name of a file or folder, and what I ultimately wished for was a bit more than just excluding a "Windows" or a "Program Files" folder. I thought that adding the possibility to exclude files having one or several specific attributes would definitely do the trick; I would even be able to exclude System or Hidden files, for example.
But all that meant writing a new search engine. And if tomorrow I need to look inside files or analyze any other aspect of a file or folder that I can't even think of today, I would probably be forced to rewrite or transform my engine again. So I thought it was a good time to think in terms of genericity.
If we think in terms of genericity, a search engine doesn't have to know how to process objects. For a desktop search engine that reads inside files, we will certainly need different readers for the various file formats, but the task of a search engine is to locate files or folders and to apply search filters to these objects, not to contain the knowledge of these filters. Moreover, it is not desirable to couple the search engine to the many file types that one could think of.
So, if we think in terms of search filters, we can think of a search filter as an object capable of doing something with a File System Object and returning a result indicating whether the File System Object meets particular criteria. The "something" in question can be anything: looking at the name, looking at the attributes, or reading the file if the File System Object is a file. The File System Object is simply either a file or a folder.
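To make the idea concrete, here is a minimal sketch of such a filter contract. The names IFilterSketch and ExtensionFilterSketch are hypothetical and exist only for this illustration; the engine's real contract is ISearchFilter<T>, whose Matches() method also takes an error handler.

```csharp
using System;
using System.IO;

// Hypothetical contract: a filter inspects one File System Object
// and says whether it meets its criteria.
public interface IFilterSketch<T> where T : FileSystemInfo
{
    bool Matches(T item);
}

// One possible "something": look only at the file extension.
public sealed class ExtensionFilterSketch : IFilterSketch<FileInfo>
{
    private readonly string extension;

    public ExtensionFilterSketch(string extension)
    {
        this.extension = extension;
    }

    public bool Matches(FileInfo item)
    {
        // The file does not need to exist: only its name is inspected.
        return item != null && string.Equals(
            item.Extension, this.extension, StringComparison.OrdinalIgnoreCase);
    }
}
```

The search engine only ever sees the interface, so a filter that reads attributes or file contents can be swapped in without touching the browsing code.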
For reading a file, we can for instance think of a search filter that is a generic reader able to request a "provider" for the file type it is looking at, which could be served by a plugin system. Such a "provider" generally has to be bought unless you wish to acquire knowledge of proprietary binary file formats. That wasn't the scope of my article, so I implemented a simple search filter that just locates ASCII strings. I just wanted to illustrate that this "File System Search Engine" is generic enough to be used as a Desktop Search Engine, but that wasn't my focus and I implemented the filters that I really needed. Naturally, I am not comparing my piece of code with "real" desktop search engines.
The Search Engine in Brief
The engine has the following capabilities:
- Folders and files are treated separately.
- Multiple filters can be combined.
- The introspection logic can be controlled: you can ask for the list of all files and folders under a particular root folder, including or excluding the whole tree under the child folders, but you can also look only at child folders matching a particular filter. When folder matches are found, you can apply the same search filter inside them, or you can specify that the whole contents of a matching folder should be kept. This can be useful in a “mass-delete” scenario, where you want to completely delete folders matching some specific criteria and possibly some other files matching other criteria in folders that shouldn’t be deleted entirely.
- The engine is able to find files and folders by name using wildcards or Regular Expressions.
- The engine is able to find files and folders having (or not) one or several System.IO.FileAttributes attribute(s).
- The engine is able to find files containing a particular ASCII string.
- The engine can be extended with additional filters to permit virtually any other kind of search.
- Positive matches can be used for inclusion or exclusion.
The FileSystemInfoNameSearchFilter class, which compares file/folder names with wildcards or Regular Expressions, makes use of the excellent Wildcard class written by Reinux. Many thanks to Reinux for this elegant solution.
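For readers unfamiliar with the technique, a wildcard pattern can be turned into a Regular Expression with very little code. The sketch below is not Reinux's Wildcard class, whose source is not reproduced here; WildcardSketch and ToRegex are hypothetical names, and the conversion assumes that only "*" and "?" have a special meaning.

```csharp
using System.Text.RegularExpressions;

public static class WildcardSketch
{
    // Converts a wildcard pattern such as "*.pdb" into an anchored Regex.
    public static Regex ToRegex(string pattern)
    {
        // Escape everything first so that "." and friends become literals,
        // then turn the escaped wildcards back into regex constructs.
        string escaped = Regex.Escape(pattern)
            .Replace(@"\*", ".*")   // "*" matches any run of characters
            .Replace(@"\?", ".");   // "?" matches exactly one character

        return new Regex("^" + escaped + "$", RegexOptions.IgnoreCase);
    }
}
```

Anchoring with "^" and "$" matters: without it, "*.pdb" would also match a name like "App.pdb.bak".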
The Search Engine in Detail
The engine consists of 13 classes built on top of 4 interfaces:
IDirectoryBrowser
ISearchFilter<T>
IDirectoryBrowserSearchFilters
IDirectoryBrowsingResults
In short, the engine consists of a class implementing IDirectoryBrowser, the DirectoryBrowser class, which makes use of objects implementing ISearchFilter<T> to build filtered lists of folders and files. It returns these lists encapsulated in an object implementing IDirectoryBrowsingResults, the DirectoryBrowsingResults class.
The DirectoryBrowser can use IDirectoryBrowserSearchFilters arguments (implemented by the DirectoryBrowserSearchFilters class). This interface defines one property for an object implementing ISearchFilter<FileInfo> and one property for an object implementing ISearchFilter<DirectoryInfo>; this is how we pass separate filters for folders and files.
The DirectoryBrowsingResults class encapsulates a reference to the DirectoryBrowser object that created the results, plus the files and folders as two separate collections. The DirectoryBrowser contains a list of any exceptions that occurred during processing, so this reference is useful when you use one of the DirectoryBrowser static methods instead of instantiating a browser yourself.
ISearchFilter<T> is implemented by the abstract class SearchFilter<T>, which is nothing more than a base class containing the plumbing for a "Decorator" pattern implementation, along with a Template Method pattern implementation to keep the algorithm consistent across derived types. I chose the "Decorator" pattern to "stack" search filters, which makes it possible to pass a single "Decorator" search filter to the browser object and to centralize the logic for combining results along with some exception handling. That is simply how I wished to do it, but it could be done differently: we could pass a list of filters to the browser, we could write a new class that maintains and exposes multiple filters, or we could do many other elegant things.
On top of this SearchFilter<T> base class, three classes provide built-in search filters:
- FileSystemInfoNameSearchFilter compares the name of a System.IO.FileSystemInfo object (which can be a System.IO.FileInfo or a System.IO.DirectoryInfo object) with a list of wildcards or a list of Regular Expressions to find matches.
- FileSystemInfoAttributeSearchFilter differs a little bit, as it compares System.IO.FileAttributes with the Attributes property of the System.IO.FileSystemInfo object.
- FileInfoContentsSearchFilter looks at the contents of a file and searches for a specific ASCII string. This third one is there for the sole purpose of illustrating the concept and the flexibility of decorated ISearchFilter<T> implementations; it is extremely basic and limited, and it will probably perform poorly.
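The source of FileInfoContentsSearchFilter is not shown in this article, so the following is only a sketch of what a basic ASCII search could look like. ContentsSearchSketch and ContainsAscii are hypothetical names, and the approach (load the whole file, then scan it naively) is an assumption chosen for brevity rather than speed.

```csharp
using System.IO;
using System.Text;

public static class ContentsSearchSketch
{
    // Returns true when the file contains the given text encoded as ASCII.
    public static bool ContainsAscii(string path, string text)
    {
        byte[] needle = Encoding.ASCII.GetBytes(text);
        if (needle.Length == 0)
            return true;

        // Reads the whole file into memory: fine for a demo,
        // unsuitable for very large files.
        byte[] haystack = File.ReadAllBytes(path);

        // Naive scan: try every start position in turn.
        for (int i = 0; i <= haystack.Length - needle.Length; i++)
        {
            int j = 0;
            while (j < needle.Length && haystack[i + j] == needle[j])
                j++;
            if (j == needle.Length)
                return true;
        }
        return false;
    }
}
```

Working on raw bytes means the search also runs against binary files, but it will miss text stored in UTF-16 or in a compressed format.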
It is possible to forbid or allow the analysis of child folders, but also to allow it only for child folders that matched the provided SearchFilter<DirectoryInfo>. The contents of matching folders can be added to the results in their entirety, but they can also be searched normally using the search filters. All this is defined in the DirectoryBrowserSearchBehavior class via the BrowserMode and DirectoryMatchRules properties.
It would be a little bit tedious to present each class. Instead, I’ll focus on the major points of interest.
As you probably guessed, all the magic occurs in the DirectoryBrowser class. At the root folder level, the BrowseDirectory() method actually has only two concerns: determining whether its child folders should be inspected (and how), and providing its own list of direct child files and folders.
public IDirectoryBrowsingResults BrowseDirectory(
    DirectoryBrowserSearchBehavior searchBehavior,
    IDirectoryBrowserSearchFilters searchFilters)
{
    // Fall back to the default behavior when none is supplied.
    DirectoryBrowserSearchBehavior behavior =
        searchBehavior ?? new DirectoryBrowserSearchBehavior();
    this.Results.ClearResults();
    this.ExceptionsList.Clear();
    this.HasBrowsed = true;
    if (this.RootFolderExists)
    {
        IEnumerable<DirectoryInfo> allDirectories;
        // Direct children of the root folder that satisfy the filters.
        IEnumerable<DirectoryInfo> matchingDirectories =
            this.GetDirectories(searchFilters);
        IEnumerable<FileInfo> files = this.GetFiles(searchFilters);
        if (behavior.BrowserMode == DirectoryBrowserModes.SearchOnlyMatchingDirectories)
        {
            if (behavior.DirectoryMatchRules ==
                DirectoryMatchRules.IncludeCompleteDirectoryContents)
            {
                // No filters: keep everything found under matching folders.
                this.Results.AddResults(this.BrowseChildDirectories(matchingDirectories,
                    searchBehavior, null));
            }
            else
            {
                this.Results.AddResults(this.BrowseChildDirectories(matchingDirectories,
                    searchBehavior, searchFilters));
            }
        }
        else if (behavior.BrowserMode == DirectoryBrowserModes.SearchAllChildDirectories)
        {
            allDirectories = this.GetDirectories(null);
            if (behavior.DirectoryMatchRules ==
                DirectoryMatchRules.IncludeCompleteDirectoryContents)
            {
                // Non-matching folders are still searched with the filters...
                this.Results.AddResults(this.BrowseChildDirectories(
                    allDirectories.Except(matchingDirectories,
                        new DirectoryInfoEqualityComparerByFullName()),
                    searchBehavior, searchFilters));
                // ...while matching folders are taken in their entirety.
                this.Results.AddResults(this.BrowseChildDirectories(
                    matchingDirectories, searchBehavior, null));
            }
            else
            {
                this.Results.AddResults(this.BrowseChildDirectories(allDirectories,
                    searchBehavior, searchFilters));
            }
        }
        // Finally, add this folder's own direct matches.
        this.Results.AddDirectories(matchingDirectories);
        this.Results.AddFiles(files);
    }
    return this.Results;
}
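The branching in BrowseDirectory() boils down to one decision per child folder. The sketch below restates it as a small pure function. The two enumerations are redeclared here only to keep the sample self-contained (their member order is an assumption), and ChildAction, BrowseDecisionSketch and Decide are hypothetical names that do not exist in the engine.

```csharp
// Redeclared for the sample; the engine ships its own versions.
public enum DirectoryBrowserModes
{
    ExcludeChildDirectories,
    SearchOnlyMatchingDirectories,
    SearchAllChildDirectories
}

public enum DirectoryMatchRules
{
    IncludeMatchingDirectoryContents,
    IncludeCompleteDirectoryContents
}

// Hypothetical: what happens to a given child folder.
public enum ChildAction { Skip, BrowseWithFilters, BrowseWithoutFilters }

public static class BrowseDecisionSketch
{
    public static ChildAction Decide(
        DirectoryBrowserModes mode,
        DirectoryMatchRules rule,
        bool childMatchesFilter)
    {
        if (mode == DirectoryBrowserModes.ExcludeChildDirectories)
            return ChildAction.Skip;

        if (!childMatchesFilter)
        {
            // Non-matching folders are only visited in "search all" mode.
            return mode == DirectoryBrowserModes.SearchAllChildDirectories
                ? ChildAction.BrowseWithFilters
                : ChildAction.Skip;
        }

        // Matching folders are either taken whole or filtered as usual.
        return rule == DirectoryMatchRules.IncludeCompleteDirectoryContents
            ? ChildAction.BrowseWithoutFilters
            : ChildAction.BrowseWithFilters;
    }
}
```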
When child directories need to be searched, the operation is nothing more than instantiating a new DirectoryBrowser object for each child folder, running a new search and appending the results to our own results.
private IDirectoryBrowsingResults BrowseChildDirectories(
    IEnumerable<DirectoryInfo> directories,
    DirectoryBrowserSearchBehavior searchBehavior,
    IDirectoryBrowserSearchFilters searchFilters)
{
    IDirectoryBrowsingResults results = new DirectoryBrowsingResults(this,
        this.RootFolder);
    if (directories != null)
    {
        foreach (DirectoryInfo dir in directories)
        {
            // One browser per child folder; its results are merged into ours.
            IDirectoryBrowser browser = CreateBrowser(dir.FullName);
            results.AddResults(browser.BrowseDirectory(searchBehavior, searchFilters));
            if (browser.HasErrors)
            {
                // Bubble the child's errors up to this browser.
                this.HandleExceptions(browser.Errors);
            }
        }
    }
    return results;
}
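Stripped of the engine's types, the same recursion can be sketched with nothing but System.IO. RecursiveWalkSketch and Walk are hypothetical names; the point of the sample is that a failure in one folder (for example an access-denied error) is recorded and the walk continues, much like the error list kept by the DirectoryBrowser.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class RecursiveWalkSketch
{
    // Collects the files under root that pass fileFilter (null = keep all).
    // Exceptions are appended to errors instead of aborting the walk.
    public static List<FileInfo> Walk(
        DirectoryInfo root,
        Func<FileInfo, bool> fileFilter,
        List<Exception> errors)
    {
        var results = new List<FileInfo>();
        try
        {
            foreach (FileInfo file in root.GetFiles())
            {
                if (fileFilter == null || fileFilter(file))
                    results.Add(file);
            }

            // Each child folder is searched the same way and merged in.
            foreach (DirectoryInfo child in root.GetDirectories())
                results.AddRange(Walk(child, fileFilter, errors));
        }
        catch (Exception ex)
        {
            // Keep whatever was collected so far and remember the problem.
            errors.Add(ex);
        }
        return results;
    }
}
```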
The operation of collecting the child folders of a folder is a separate responsibility. We delegate the filtering to a LINQ extension method using a lambda expression that itself also delegates the job of testing the match to the search filter.
protected IEnumerable<DirectoryInfo> GetDirectories(
    IDirectoryBrowserSearchFilters searchFilters)
{
    try
    {
        if (searchFilters == null || searchFilters.DirectoryFilter == null ||
            !searchFilters.DirectoryFilter.ContainsFilterCriterias)
            return this.RootFolderInfo.GetDirectories();
        return this.RootFolderInfo.GetDirectories().Where(dirInfo =>
            searchFilters.DirectoryFilter.Matches(dirInfo, this.HandleException));
    }
    catch (Exception ex)
    {
        this.HandleException(ex);
        return new List<DirectoryInfo>();
    }
}
It is naturally pretty much the same for files.
protected IEnumerable<FileInfo> GetFiles(IDirectoryBrowserSearchFilters searchFilters)
{
    try
    {
        if (searchFilters == null || searchFilters.FileFilter == null ||
            !searchFilters.FileFilter.ContainsFilterCriterias)
            return this.RootFolderInfo.GetFiles();
        return this.RootFolderInfo.GetFiles().Where(fileInfo =>
            searchFilters.FileFilter.Matches(fileInfo, this.HandleException));
    }
    catch (Exception ex)
    {
        this.HandleException(ex);
        return new List<FileInfo>();
    }
}
Besides the DirectoryBrowser class, another class deserves our attention: the SearchFilter<T> abstract class, which holds the "Decorator" pattern implementation. A "Decorator" pattern implementation starts by defining, one way or another, an “inner” object implementing the same interface, that is to say a “decorated ISearchFilter<T>”. Here, I chose to make use of a second constructor.
public abstract class SearchFilter<T>
    : ISearchFilter<T>
{
    protected SearchFilter()
        : this(null)
    {
    }
    protected SearchFilter(ISearchFilter<T> innerFilter)
    {
        this.InnerFilter = innerFilter;
    }
}
The next step is to implement the Matches() method that is defined by the ISearchFilter<T> interface. This is where we execute the code of our decorated ISearchFilter<T>, if any. This is also where we control how we actually want to combine the results. If I had used a “Decorator” approach for the DirectoryBrowser object itself, I would be appending DirectoryBrowsingResults here.
To ensure that the decorated filter will always be called, the logic skeleton must reside in a non-virtual method; the IsMatch() method that we are calling here is actually an abstract method. Search filter classes deriving from SearchFilter<T> just have to override this IsMatch() method and provide at least similar constructors; all the rest is done safely thanks to this base class.
public Boolean Matches(T item, ExceptionHandler errorHandler)
{
    // The decorated filter runs first; a single "no" ends the chain.
    if (this.HasInnerFilter && !this.InnerFilter.Matches(item, errorHandler))
    {
        return false;
    }
    try
    {
        return this.IsMatch(item);
    }
    catch (Exception ex)
    {
        // A failing filter is reported and counts as "no match".
        if (errorHandler != null)
        {
            errorHandler(ex);
        }
        return false;
    }
}
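To show how little a derived filter has to provide, here is a self-contained sketch. FilterBaseSketch<T> and ErrorHandlerSketch are simplified, hypothetical stand-ins for SearchFilter<T> and ExceptionHandler, and MinimumSizeFilterSketch is an invented filter that is not part of the engine.

```csharp
using System;
using System.IO;

public delegate void ErrorHandlerSketch(Exception ex);

// Simplified stand-in for the article's SearchFilter<T> base class.
public abstract class FilterBaseSketch<T>
{
    private readonly FilterBaseSketch<T> inner;

    protected FilterBaseSketch(FilterBaseSketch<T> inner)
    {
        this.inner = inner;
    }

    // Non-virtual skeleton: the decorated filter always runs first.
    public bool Matches(T item, ErrorHandlerSketch onError)
    {
        if (this.inner != null && !this.inner.Matches(item, onError))
            return false;
        try
        {
            return this.IsMatch(item);
        }
        catch (Exception ex)
        {
            if (onError != null) onError(ex);
            return false;
        }
    }

    protected abstract bool IsMatch(T item);
}

// A custom filter only supplies constructors and IsMatch().
public sealed class MinimumSizeFilterSketch : FilterBaseSketch<FileInfo>
{
    private readonly long minimumBytes;

    public MinimumSizeFilterSketch(long minimumBytes)
        : this(null, minimumBytes)
    {
    }

    public MinimumSizeFilterSketch(FilterBaseSketch<FileInfo> inner, long minimumBytes)
        : base(inner)
    {
        this.minimumBytes = minimumBytes;
    }

    protected override bool IsMatch(FileInfo item)
    {
        // FileInfo.Length throws when the file is missing; the base class
        // turns that into "no match" plus a call to the error handler.
        return item.Length >= this.minimumBytes;
    }
}
```

Because the try/catch lives in the base class, the derived filter can stay free of defensive code.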
Here is the actual implementation that can be found in the FileSystemInfoNameSearchFilter built-in filter. The object of type T will be a System.IO.FileSystemInfo object (either a System.IO.FileInfo or a System.IO.DirectoryInfo object), and you can just code the algorithm of your choice to retain this object or not. Notice the use of the excellent Wildcard class by Reinux.
protected override sealed Boolean IsMatch(T item)
{
    Boolean result = false;
    String text = item == null ? null : item.Name;
    if (item == null || String.IsNullOrEmpty(text))
        return false;
    // Without any pattern, every named item is considered a match.
    if (!this.ContainsFilterCriterias)
        return true;
    foreach (String pattern in this.Patterns)
    {
        if (this.CreateMatchObject(pattern).IsMatch(text))
        {
            result = true;
            break;
        }
    }
    // In exclusion mode, a positive match removes the item from the results.
    return (this.SearchOption == SearchOptions.IncludeMatches ? result : !result);
}
private Regex CreateMatchObject(String pattern)
{
    if (this.MatchMode == FileSystemInfoNameMatchMode.UseRegEx)
        return new Regex(pattern, RegexOptions.IgnoreCase);
    return new Wildcard(pattern, RegexOptions.IgnoreCase);
}
Finally, a quick hint from the DirectoryInfoEqualityComparerByFullName class: if you don’t want to write an algorithm to generate a hash code, you can always delegate this task to a String object and just concentrate on making a string that uniquely identifies the object.
public int GetHashCode(DirectoryInfo obj)
{
    if (obj != null)
        return obj.FullName.GetHashCode();
    return String.Empty.GetHashCode();
}
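A hash code is only half of an equality comparer: Equals() must agree with it, otherwise Except() gives inconsistent results. The article does not show the Equals() side, so the sketch below is an assumption of what a matching pair could look like; DirectoryByFullNameComparerSketch is a hypothetical name, and the case-sensitive comparison is chosen only because it agrees with String.GetHashCode().

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public sealed class DirectoryByFullNameComparerSketch : IEqualityComparer<DirectoryInfo>
{
    public bool Equals(DirectoryInfo x, DirectoryInfo y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x == null || y == null) return false;

        // Ordinal comparison: two equal strings always share a hash code.
        return string.Equals(x.FullName, y.FullName, StringComparison.Ordinal);
    }

    public int GetHashCode(DirectoryInfo obj)
    {
        // Delegate hashing to the string that identifies the folder.
        return obj == null ? 0 : obj.FullName.GetHashCode();
    }
}
```

With such a comparer, two DirectoryInfo instances created separately for the same path are treated as the same folder, which is what a call like allDirectories.Except(matchingDirectories, comparer) relies on.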
Using the Code
Defining a complex search can be somewhat heavy. Typically, a DirectoryBrowser object needs to be instantiated. Then you probably need to define and combine multiple filters for files and for folders, which may first require creating a couple of lists of patterns or another “complex” construct. Then you have to create a DirectoryBrowserSearchFilters object to store your filters, and finally you call the appropriate BrowseDirectory() method, providing the DirectoryBrowserSearchFilters object.
This actually makes sense only if you really need to do all of that, but most of the time you don’t. Therefore, a couple of static “shortcut”/“helper” methods have been added to the DirectoryBrowser and DirectoryBrowserSearchFilters classes. The DirectoryBrowserSearchFilters class encapsulates some functionality to ease the creation of built-in filters (like taking patterns as a comma-separated string to build the list for you), and the DirectoryBrowser class reflects these shortcuts by offering multiple overloads that can create filters for you with a minimal set of arguments.
The four samples below illustrate simple and more elaborate searches. As complexity goes up, they show what happens behind the scenes when filters are constructed automatically, because we are then forced to build custom filters ourselves. I am well aware that I could have added even more overloads to make it even easier. The reason I didn’t do so for the search behavior, for example, is that it wouldn’t really increase readability: I would need to put the BrowseDirectory() method arguments on multiple lines anyway because the enumeration names are long.
The easiest case, no filters:
public IDirectoryBrowsingResults ListAllFilesAndFolders()
{
    DirectoryBrowserSearchBehavior searchBehavior;
    searchBehavior = new DirectoryBrowserSearchBehavior
    {
        BrowserMode =
            DirectoryBrowserModes.SearchAllChildDirectories,
        DirectoryMatchRules =
            DirectoryMatchRules.IncludeCompleteDirectoryContents
    };
    return DirectoryBrowser.BrowseDirectory(@"C:\", searchBehavior);
}
A simple search based on wildcards:
public IDirectoryBrowsingResults SimpleMatches()
{
    DirectoryBrowserSearchBehavior searchBehavior;
    searchBehavior = new DirectoryBrowserSearchBehavior
    {
        BrowserMode =
            DirectoryBrowserModes.SearchAllChildDirectories,
        DirectoryMatchRules =
            DirectoryMatchRules.IncludeCompleteDirectoryContents
    };
    return DirectoryBrowser.BrowseDirectory(@"C:\", searchBehavior,
        "bin,obj,.svn",
        SearchOptions.IncludeMatches,
        "*.pdb,*.dll",
        SearchOptions.IncludeMatches);
}
A search based on file contents rather than file name or attributes:
public IDirectoryBrowsingResults SearchFilesContents()
{
    DirectoryBrowserSearchBehavior searchBehavior;
    DirectoryBrowserSearchFilters searchFilters;
    FileSystemInfoNameSearchFilter<DirectoryInfo> directoryFilter;
    FileInfoContentsSearchFilter fileFilter;
    searchBehavior = new DirectoryBrowserSearchBehavior
    {
        BrowserMode = DirectoryBrowserModes.ExcludeChildDirectories,
        DirectoryMatchRules = DirectoryMatchRules.IncludeMatchingDirectoryContents
    };
    directoryFilter = new FileSystemInfoNameSearchFilter<DirectoryInfo>
        { SearchOption = SearchOptions.ExcludeMatches };
    directoryFilter.Patterns.Add("*");
    fileFilter = new FileInfoContentsSearchFilter
    {
        SearchOption = SearchOptions.IncludeMatches,
        TextToFind = "I love C#",
    };
    searchFilters = new DirectoryBrowserSearchFilters(directoryFilter, fileFilter);
    return DirectoryBrowser.BrowseDirectory(@"C:\", searchBehavior, searchFilters);
}
And finally, a completely custom search combining multiple filters:
public IDirectoryBrowsingResults CombineMultipleSearchFilters()
{
    DirectoryBrowserSearchBehavior searchBehavior;
    DirectoryBrowserSearchFilters searchFilters;
    FileSystemInfoAttributeSearchFilter<FileInfo> excludeSystemFiles;
    FileSystemInfoAttributeSearchFilter<FileInfo> excludeHiddenFiles;
    FileSystemInfoNameSearchFilter<FileInfo> excludeConfigFiles;
    FileSystemInfoNameSearchFilter<DirectoryInfo> folderExclusions;
    // Each file filter decorates the previous one, forming a chain.
    excludeSystemFiles = new FileSystemInfoAttributeSearchFilter<FileInfo>(
        SearchOptions.ExcludeMatches, FileAttributes.System);
    excludeHiddenFiles = new FileSystemInfoAttributeSearchFilter<FileInfo>(
        excludeSystemFiles, SearchOptions.ExcludeMatches, FileAttributes.Hidden);
    excludeConfigFiles = new FileSystemInfoNameSearchFilter<FileInfo>(excludeHiddenFiles,
        SearchOptions.ExcludeMatches, new List<String>(1) { "*.config" });
    folderExclusions = DirectoryBrowserSearchFilters.CreateDirectoryInfoNameSearchFilter(
        "bin,obj,.svn", SearchOptions.ExcludeMatches);
    // Only the outermost file filter is handed to the browser.
    searchFilters = new DirectoryBrowserSearchFilters(folderExclusions,
        excludeConfigFiles);
    searchBehavior = new DirectoryBrowserSearchBehavior
    {
        BrowserMode =
            DirectoryBrowserModes.SearchOnlyMatchingDirectories,
        DirectoryMatchRules =
            DirectoryMatchRules.IncludeMatchingDirectoryContents
    };
    return DirectoryBrowser.BrowseDirectory(@"C:\", searchBehavior, searchFilters);
}
You will notice while combining multiple filters that a folder/file has to satisfy ALL filters to become part of the results (unless it is in a folder that is supposed to be added to the results in its entirety). This behavior is by design, but it can be changed in the SearchFilter<T> abstract class. The Matches() method automatically returns false if it contains a decorated SearchFilter<T> that returned false. Control over this behavior could easily be granted by adding an additional property to the DirectoryBrowserSearchBehavior class; it didn’t make much sense to implement it when I was writing the engine.
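As a rough idea of what such a switch could look like, the sketch below adds a combination mode to a stripped-down decorator. None of this exists in the engine: CombineModeSketch and CombinableFilterSketch<T> are hypothetical, and a delegate replaces the abstract IsMatch() method to keep the sample short.

```csharp
using System;

// Hypothetical switch: must all stacked filters agree, or is one enough?
public enum CombineModeSketch { All, Any }

public sealed class CombinableFilterSketch<T>
{
    private readonly Func<T, bool> test;
    private readonly CombinableFilterSketch<T> inner;
    private readonly CombineModeSketch mode;

    public CombinableFilterSketch(Func<T, bool> test)
        : this(test, null, CombineModeSketch.All)
    {
    }

    public CombinableFilterSketch(
        Func<T, bool> test,
        CombinableFilterSketch<T> inner,
        CombineModeSketch mode)
    {
        this.test = test;
        this.inner = inner;
        this.mode = mode;
    }

    public bool Matches(T item)
    {
        if (this.inner == null)
            return this.test(item);

        bool innerResult = this.inner.Matches(item);

        // All: behaves like the engine today (a single "no" rejects the item).
        if (this.mode == CombineModeSketch.All)
            return innerResult && this.test(item);

        // Any: one positive filter in the chain is enough.
        return innerResult || this.test(item);
    }
}
```

With the Any mode, a chain of two name tests could accept files ending in ".cs" or ".vb", which the current all-or-nothing chaining cannot express with separate filters.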
Points of Interest
The SearchFilter<T> class is hopefully interesting because it makes use of two design patterns that are probably not among the most widely used. It is pretty discreet, but it illustrates what can be done with a "Decorator" pattern, which is generally best implemented in a base class together with a Template Method pattern to keep the algorithm logic intact, as I did.
Besides that, I initially wanted to make sparing use of a few of the newest additions to the .NET Framework, like LINQ, Lambda Expressions and Automatic Properties. In the end, I only make use of LINQ extension methods in exactly 3 tiny locations, two with a lambda expression and the third with a comparer, but at least it makes sense where it is used. Automatic properties are used a bit more widely, wherever declaring a corresponding field was completely useless, but that is hardly worth mentioning. All this means that the engine can be adapted to work with the .NET Framework 2.0 quite easily.
History
01-01-2009: initial post.