Recursive patterned File Globbing

27 Aug 2002
Class and application to recursively or non-recursively match files or directories based on a wildcard pattern.

Introduction

One of my favorite features of Perforce is its recursive wildcard syntax. I use it every day. Instead of navigating trees in a GUI or directories at the Command Prompt looking for a particular file to perform an operation, I merely use the Perforce '...' recursive syntax to find the file and check it out:

p4 edit ...MyFile.cpp

The '...' syntax also accepts extensions:

p4 edit ....h

This checks out all *.h files across every directory under the current one.

I wanted similar functionality in my own applications, so I began searching the Internet for existing code implementing this behavior. I came across DJGPP's glob() function (also in OpenBSD). Coupled with the POSIX fnmatch() function, it made a great match and worked almost exactly how I wanted. The only catch was the license. The GPL is not conducive to many environments, and although I release source code for most of my products, I can't afford to have all the code I write fall under the GPL.

So, it was back to searching. Next, I found Matthias Wandel's MyGlob code, from his Exif Jpeg camera setting parser and thumbnail remover application (see the Credits for URL). Its behavior was mostly what I wanted, and it didn't fall under some license where you have to sell your soul. In fact, it falls under no license and is freely usable.

After modifying a version of his code to my liking, I emailed him, told him what I had done, and received his permission to post my modified version here. In any case, Matthias wrote the original implementation. I slapped on a bunch of new features, and I'll discuss the product as a whole below.

The newest versions of this code may be found at http://workspacewhiz.com/ in the Misc. Code section.

Globbing?

Frankly, I was surprised, too. My Internet search would have gone more quickly had I known a "glob" was exactly what I was looking for. I'd always thought a glob was that gooey stuff my roommates in college made for dinner, but I guess I was sorely mistaken. :)

A file glob, in this case, is zero or more file names matched via a pattern, possibly with wildcards embedded within.

Patterns and Wildcards

Without a path specified, matching (er, globbing) of files starts in the current directory.

Wildcard Description
? Matches any single character of the file name or directory name.
* Matches 0 or more characters of the file name or directory name.
/ at end of pattern Any pattern with a closing slash will start a directory search, instead of the default file search.
** Matches zero or more directory levels (recursive search). I originally wanted a Perforce-style '...' recursive syntax, but Matthias brought up an important point: individuals using 4DOS (http://www.jpsoft.com/) are used to '...' meaning '..\..\'. After some thought, I believe Matthias's original '**' syntax for recursion is a far better solution.
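
The ? and * rules above can be sketched in a few lines of C++. This is a minimal illustration and not the WildMatch() routine shipped in the archive; the names WildMatchSketch and SameChar are hypothetical. It assumes the pattern is applied to a single file or directory name, so '/' gets no special treatment, and it handles ASCII case folding only.

```cpp
#include <cassert>
#include <cctype>

// Compares two characters, optionally ignoring ASCII case.
static bool SameChar(char a, char b, bool caseSensitive)
{
    if (caseSensitive)
        return a == b;
    return std::tolower(static_cast<unsigned char>(a)) ==
           std::tolower(static_cast<unsigned char>(b));
}

// Hypothetical helper (not the article's WildMatch()): returns true when
// `text` matches `pattern`, where '?' matches exactly one character and
// '*' matches zero or more characters.
bool WildMatchSketch(const char* pattern, const char* text, bool caseSensitive = true)
{
    assert(pattern != nullptr && text != nullptr);

    const char* starPattern = nullptr;  // pattern position just after the last '*'
    const char* starText = nullptr;     // text position where that '*' stops matching

    while (*text != '\0')
    {
        if (*pattern == '*')
        {
            // Remember the star and first try letting it match nothing.
            starPattern = ++pattern;
            starText = text;
        }
        else if (*pattern == '?' || SameChar(*pattern, *text, caseSensitive))
        {
            ++pattern;
            ++text;
        }
        else if (starPattern != nullptr)
        {
            // Mismatch: let the last '*' swallow one more character and retry.
            pattern = starPattern;
            text = ++starText;
        }
        else
        {
            return false;
        }
    }

    // Only trailing '*' characters may remain once the text is exhausted.
    while (*pattern == '*')
        ++pattern;
    return *pattern == '\0';
}
```

With this sketch, WildMatchSketch( "File?.txt", "File1.txt" ) succeeds while WildMatchSketch( "File?.txt", "File.txt" ) fails, because ? must consume exactly one character.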

Some examples follow:

Example Pattern Description
File.txt Matches a file or directory called File.txt.
File*.txt Matches any file or directory starting with File and ending with a .txt extension.
File?.txt Matches any file or directory starting with File, followed by exactly one more character, and ending with a .txt extension.
F??e*.txt Matches a file or directory starting with F, followed by any two characters, followed by e, then any number of characters up to the extension .txt.
File* Matches a file or directory starting with File and ending with or without an extension.
* Matches all files (non-recursive).
*/ Matches all directories (non-recursive).
A*/ Matches any directory starting with A (non-recursive).
**/* Matches all files (recursive).
** Shortened form of above. Matches all files (recursive). Internally, expands to **/*
**/ Matches all directories (recursive).
**{filename chars} Matches {filename chars} recursively. Internally, expands to **/*{filename chars}.
{dirname chars}** Expands to {dirname chars}*/**.
{dirname chars}**{filename chars} Expands to {dirname chars}*/**/*{filename chars}.
**.h Matches all *.h files recursively. Expands to **/*.h.
**resource.h Matches all *resource.h files recursively. Expands to **/*resource.h.
BK** Matches all files in any directory starting with BK, recursively. Expands to BK*/**.
BK**.h Matches all *.h files in any directory starting with BK, recursively. Expands to BK*/**/*.h.
c:/Src/**/*.h Matches all *.h files recursively, starting at c:/Src/.
c:/Src/**/*Grid/ Recursively matches all directories under c:/Src/ that end with Grid.
c:/Src/**/*Grid*/ Recursively matches all directories under c:/Src/ that contain Grid.
c:/Src/**/*Grid*/**/ABC/**/Readme.txt Recursively matches all directories under c:/Src/ that contain Grid. From the found directory, recursively matches directories until ABC/ is found. From there, the file Readme.txt is searched for recursively.
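
The shorthand expansions listed in the table can be sketched as a small string rewrite. This is only an illustration of the rules as the table states them, not the code in the archive; ExpandRecursiveShorthand is a hypothetical name, and the sketch assumes forward-slash separators and at most one ** per path component.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: rewrites shorthand such as "**.h" or "BK**" into the
// long form shown in the table ("**/*.h", "BK*/**"). A "**" that already
// stands alone as a path component is left untouched, except when it is the
// final component, where it becomes "**/*".
std::string ExpandRecursiveShorthand(const std::string& pattern)
{
    // Assumption of this sketch: only '/' is used as a separator.
    assert(pattern.find('\\') == std::string::npos);

    std::string result;
    std::string::size_type start = 0;
    while (true)
    {
        const std::string::size_type slash = pattern.find('/', start);
        const bool last = (slash == std::string::npos);
        std::string part =
            pattern.substr(start, last ? std::string::npos : slash - start);

        const std::string::size_type stars = part.find("**");
        if (stars != std::string::npos)
        {
            const std::string before = part.substr(0, stars);   // {dirname chars}
            const std::string after = part.substr(stars + 2);   // {filename chars}
            if (before.empty() && after.empty())
            {
                part = last ? "**/*" : "**";
            }
            else
            {
                part.clear();
                if (!before.empty())
                    part = before + "*/";
                part += "**";
                if (!after.empty())
                    part += "/*" + after;
            }
        }

        result += part;
        if (last)
            break;
        result += '/';
        start = slash + 1;
    }
    return result;
}
```

For example, ExpandRecursiveShorthand( "BK**.h" ) yields BK*/**/*.h, while a pattern such as c:/Src/**/*Grid*/ passes through unchanged because its ** is already a complete path component.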

Finally, a couple of flags are available. Flags are appended at the end of the pattern line. Each flag begins with an @ character. Spaces should not be inserted between flags unless they are intended as part of the string literal.

Flags and Other Expansions Description
@-pattern Adds pattern to the ignore list. Any file matching a pattern in the ignore list is discounted from the search.
@=pattern Adds pattern to the exclusive file list. Any file not matching a pattern in the exclusive file list is automatically removed from the search.
More than two periods Goes up parent directories. Similar to 4DOS, each period beyond the second goes up one additional parent directory. So, a four-period path (....) expands to ../../../.
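
The period rule is simple arithmetic: a component of n periods (n greater than two) climbs n - 1 parent directories. A minimal sketch, assuming forward-slash separators and using the hypothetical name ExpandParentDots (this is not the archive's code):

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: replaces every path component made only of three or
// more periods with the equivalent chain of ".." components.
std::string ExpandParentDots(const std::string& path)
{
    std::string result;
    std::string::size_type start = 0;
    while (true)
    {
        const std::string::size_type slash = path.find('/', start);
        const bool last = (slash == std::string::npos);
        std::string part =
            path.substr(start, last ? std::string::npos : slash - start);

        const bool allDots =
            !part.empty() && part.find_first_not_of('.') == std::string::npos;
        if (allDots && part.size() >= 3)
        {
            // n periods climb n - 1 directory levels.
            const std::string::size_type levels = part.size() - 1;
            assert(levels >= 2);
            part = "..";
            for (std::string::size_type i = 1; i < levels; ++i)
                part += "/..";
        }

        result += part;
        if (last)
            break;
        result += '/';
        start = slash + 1;
    }
    return result;
}
```

Under these assumptions, ExpandParentDots( "..../Src/*.h" ) produces ../../../Src/*.h, and ordinary . and .. components are left alone.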

And a few examples:

Example Pattern Description
Src/**/@-SCCS/@-BitKeeper/ Recursively lists all directories under Src/, but directories called SCCS/ and BitKeeper/ are filtered.
Src/**@=*.lua@=README Recursively lists all files under Src/ which match *.lua or README. All other files are ignored.
Src/**/@-SCCS/@-BitKeeper/@=*.lua@=README Recursively lists all files under Src/ which match *.lua or README. The versions of those files that may exist in SCCS/ or BitKeeper/ are ignored.
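
One way to picture how a pattern line with flags breaks apart is sketched below. The ParsedPattern struct and SplitFlags function are hypothetical names introduced for illustration; the archive's own parser may work differently. The sketch assumes that everything before the first @ is the glob pattern and that each later @ starts a new flag.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical result type: the glob pattern plus the two flag lists.
struct ParsedPattern
{
    std::string base;                    // pattern text before the first '@'
    std::vector<std::string> ignore;     // collected from @-pattern flags
    std::vector<std::string> exclusive;  // collected from @=pattern flags
};

// Hypothetical helper: splits "Src/**@=*.lua@=README" style lines.
ParsedPattern SplitFlags(const std::string& line)
{
    ParsedPattern parsed;
    std::string::size_type at = line.find('@');
    parsed.base = line.substr(0, at);  // whole line when no '@' is present

    while (at != std::string::npos)
    {
        assert(line[at] == '@');
        const std::string::size_type next = line.find('@', at + 1);
        const std::string flag = line.substr(
            at + 1, next == std::string::npos ? std::string::npos : next - at - 1);

        if (!flag.empty())
        {
            if (flag[0] == '-')
                parsed.ignore.push_back(flag.substr(1));
            else if (flag[0] == '=')
                parsed.exclusive.push_back(flag.substr(1));
            // Any other flag character is not described in the article and
            // is silently skipped by this sketch.
        }
        at = next;
    }
    return parsed;
}
```

Applied to the last example in the table, this yields the base pattern Src/**/, the ignore list SCCS/ and BitKeeper/, and the exclusive list *.lua and README.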

Matching Files

The class FileGlobBase is the base class for all glob operations. It is not possible to instantiate FileGlobBase. There is a single abstract function which must be overridden called FoundMatch(). Any time a match is found, FoundMatch() is called with the matched name.

Should we want to print the names to stdout as they are received, we would create a derived class like this:

class FileGlobPrintStdout : public FileGlobBase
{
    virtual void FoundMatch( const char* name )
    {
        printf( "%s\n", name );
    }
};

Next, we instantiate the object:

FileGlobPrintStdout fileGlob;

To begin the matching process, the function FileGlobBase::MatchPattern() is called with the requested pattern.

fileGlob.MatchPattern( "**" );

By the time MatchPattern() exits, all files existing in the current directory and below will have been passed to FileGlobPrintStdout::FoundMatch() and printed to stdout.

Ignoring Files and Directories

Several source control systems add extra directories within the working copy. CVS, for example, adds a directory called CVS/ to every directory in the working copy. BitKeeper adds a directory called BitKeeper/ to the root of the working copy and directories called SCCS/ to every directory contained in the working copy under source control.

The file globbing class provides an easy way to keep these directories out of the results. FileGlobBase::AddIgnorePattern() may be called with a pattern (wildcarded or not), and any file or directory matching the pattern is simply ignored. This functionality directly corresponds to a file pattern's @- flag, described above.

Directory ignore patterns are specified as Dir/. The closing slash must be present. Dir/ and Dir are two different patterns, the first referring to directories and the second to files.

To remove all the CVS directories from the recursive list, we call:

fileGlob.AddIgnorePattern( "CVS/" );
fileGlob.MatchPattern( "**" );

This approach works equally well with files. If the desired file list should contain MP3 files but no WAV files, we add the wildcard *.wav as an ignore pattern.

fileGlob.AddIgnorePattern( "*.wav" );
fileGlob.MatchPattern( "**" );

Obviously, matching the pattern *.wav while ignoring the pattern *.wav will result in no files being listed.

Forcing Only Certain Files

FileGlobBase implements a function called AddExclusivePattern(). Providing exclusive patterns ensures your application only receives files through FoundMatch() which match any exclusive patterns registered. This functionality directly corresponds to the file pattern flag @=, described above.

fileGlob.AddExclusivePattern( "*.lua" );
fileGlob.AddExclusivePattern( "*.c" );
fileGlob.MatchPattern( "**" );

When recursively matching the ** pattern, only files matching *.lua or *.c are considered.
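
The interplay of ignore and exclusive patterns can be sketched as a single filter applied to each candidate name. This is an illustration under stated assumptions, not FileGlobBase's actual logic: the names Match, PatternApplies, and ShouldReport are hypothetical, matching is case sensitive, a pattern with a closing slash applies to directories only, and exclusive patterns are assumed to restrict files but not directories.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Tiny recursive wildcard matcher: '?' matches one character, '*' any run.
static bool Match(const char* p, const char* s)
{
    assert(p != nullptr && s != nullptr);
    if (*p == '\0')
        return *s == '\0';
    if (*p == '*')
        return Match(p + 1, s) || (*s != '\0' && Match(p, s + 1));
    return *s != '\0' && (*p == '?' || *p == *s) && Match(p + 1, s + 1);
}

// "Dir/" refers to directories and "Dir" to files, as described above.
static bool PatternApplies(const std::string& pattern, const std::string& name,
                           bool isDirectory)
{
    const bool dirPattern = !pattern.empty() && pattern.back() == '/';
    if (dirPattern != isDirectory)
        return false;
    const std::string bare =
        dirPattern ? pattern.substr(0, pattern.size() - 1) : pattern;
    return Match(bare.c_str(), name.c_str());
}

// Hypothetical filter: should `name` be reported to the caller?
bool ShouldReport(const std::string& name, bool isDirectory,
                  const std::vector<std::string>& ignore,
                  const std::vector<std::string>& exclusive)
{
    for (const std::string& pattern : ignore)
        if (PatternApplies(pattern, name, isDirectory))
            return false;  // anything on the ignore list is dropped

    // Assumption: exclusive patterns only restrict files.
    if (isDirectory || exclusive.empty())
        return true;

    for (const std::string& pattern : exclusive)
        if (PatternApplies(pattern, name, isDirectory))
            return true;   // matching any one exclusive pattern is enough
    return false;
}
```

Note that a file needs to match only one exclusive pattern to pass, which is why registering both *.lua and *.c keeps files of either type.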

Class Details

The classes included in the archive are documented as per Doxygen conventions. A fair amount of documentation exists in the source and header files.

Class: FileGlobBase

FileGlobBase is the base class of all file glob access. It is not possible to instantiate a FileGlobBase class. A derived class, implementing FoundMatch(), must be used.

Class: FileGlobList

FileGlobList is derived from FileGlobBase and mixes in a std::list< std::string > container. FileGlobList provides an implementation for FoundMatch() and stores the matched file list in the std::list<> container. All basic STL operations for std::list<> may be used directly on FileGlobList.

For each file sent to FoundMatch(), the container is searched from the beginning using a case-insensitive comparison. If the file is already in the container, it is ignored. Otherwise, it is appended at the end. In this manner, the MatchPattern() function may be called multiple times to accumulate a large list of unique files. It should be noted that subsequent calls to MatchPattern() may insert files out of sorted order into the container.
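
The append-if-unique behavior described above can be sketched as follows. This is not FileGlobList's source; AppendUnique and EqualsNoCase are hypothetical names, the comparison folds ASCII case only, and the sketch assumes names are never empty.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <list>
#include <string>

// Case-insensitive (ASCII) string equality.
static bool EqualsNoCase(const std::string& a, const std::string& b)
{
    return a.size() == b.size() &&
           std::equal(a.begin(), a.end(), b.begin(),
                      [](unsigned char x, unsigned char y)
                      { return std::tolower(x) == std::tolower(y); });
}

// Hypothetical helper: appends `name` unless an equal name (ignoring case)
// is already stored. Returns true when the name was added.
bool AppendUnique(std::list<std::string>& names, const std::string& name)
{
    assert(!name.empty());  // assumption: matches always carry a name

    const auto found = std::find_if(
        names.begin(), names.end(),
        [&name](const std::string& existing)
        { return EqualsNoCase(existing, name); });
    if (found != names.end())
        return false;       // duplicate: keep the first spelling seen

    names.push_back(name);  // new entries go at the end, so order is arrival order
    return true;
}
```

Because every insertion scans the whole list, the cost grows quadratically with the number of matches; for very large result sets a sorted container or a hash set keyed on the lowercased name would be a natural alternative.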

Example: Glob Application

A sample application showing off the globbing capabilities is included. The sample Visual Studio .NET solution, Glob.sln, builds the executable Glob.exe. Glob.exe is a simplistic command-line interface to the file glob code.

Glob.exe may be run without arguments (or with -?) to see its usage.

For applications supporting it, Glob.exe's output may be piped into another application. For example, to edit all *.cpp and *.h files under BitKeeper, the user would run:

glob -i SCCS/ -i BitKeeper/ **.cpp **.h | bk edit -

The -i command-line option specifies an ignore pattern. Here, the contents of the SCCS/ and BitKeeper/ directories are excluded from the results.

Finally, exclusive patterns may be specified via command-line flag -e. These are similar to the @= flag entries described above.

glob -e *.cpp -e *.h **

Wish List

  • It would be fantastic to find a free implementation of the POSIX fnmatch() function and replace the WildMatch() routine that is used now. fnmatch() provides greater "regular expression" style matching capabilities.

Known Bugs

There are many possible pattern combinations. I've tried a great many of them, and those seem to work, but it would not be surprising for cases to pop up that don't work as desired. If you run into an issue, don't hesitate to contact me at jjensen@workspacewhiz.com.

Credits

  • Perforce (http://www.perforce.com/), for giving me the idea in the first place. For those who don't know, Perforce provides a free two user license for their source control software. I currently use Perforce at home for all my source control needs. It's not perfect, but it gets the job done better than most.
  • Matthias Wandel, for the MyGlob() code, which is the basis of the algorithm driving the file globber. The original C implementation exists in myglob.c from his Exif Jpeg camera setting parser and thumbnail remover application at http://www.sentex.net/~mwandel/jhead/.
  • Jack Handy, for his article at http://www.codeproject.com/string/wildcmp.asp. wildcmp() was expanded into WildMatch(), allowing case sensitive and insensitive comparisons.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.
