This article discusses the implementation of wildcard matching, classic globbing, and modern gitignore-style globbing algorithms using star-loop iterations instead of recursion. The algorithms are efficient, typically running in linear time, and avoid exponential blow-up in CPU usage for worst-case scenarios.
Introduction
Jack Handy’s wildcard string compare efficiently compares a string to a pattern containing *
and ?
wildcards. The algorithm is fast and safe, but does not support gitignore-style globbing rules. In this article, I will illustrate that classic globbing and modern gitignore-style globbing algorithms can be fast too. I will also explain what’s wrong with recursive implementations that blow up exponentially and why some freely available source code should not be used.
Background
Wildcard string matching and globbing isn’t as trivial as it may seem at first. In fact, mistakes in the past resulted in serious vulnerabilities such as denial of service. Simple patches, such as limiting the number of matches to limit the CPU time, have been applied to fix implementations that suffer exponential blow-up in execution time. More disconcerting is that buggy globbing implementations can be easily located on the Web: just search for wildmat.c to find copies of implementations that may crash when a character class range is incomplete, such as [a-
.
The Trouble with Recursion
To understand how globbing implementations may cause denial of service, we will take a quick look at some examples. Recursion is often used in simple implementations of wildcard matching with *
and ?
. The idea is to scan the pattern and the string from left to right while pairwise matching the characters. When a star is encountered in the pattern, the do-match function is recursively called to compare the rest of the pattern to the rest of the string. If the recursive do-match call succeeds, then we’re done. If the recursive call fails however, we go back where we left off before the recursive call, advance one character further in the string
, and repeat the recursive call. This loop is the “star-loop” that consumes none, one, or more characters in place of a star in the pattern as shown in the example below:
int naive_recursive_match(const char *text, const char *wild)
{
while (*text != '\0')
{
if (*wild == '*')
{
while (*++wild == '*')
continue;
if (*wild == '\0')
return TRUE;
while (naive_recursive_match(text, wild) == FALSE && *text != '\0')
text++;
return *text != '\0' ? TRUE : FALSE;
}
if (*wild != '?' && *wild != *text)
return FALSE;
text++;
wild++;
}
while (*wild == '*')
wild++;
return *wild == '\0' ? TRUE : FALSE;
}
This algorithm is fairly simple and needs little explanation: recursive calls are made for stars until matching or the end of the string
of text is reached. A ?
matches any character. Otherwise, the current pattern character is matched with the current character in the text. We move up in the text and in the pattern by one character, and repeat until the end of the text is reached. Note that any trailing stars can be ignored in the pattern when we get to the end of the text.
This approach works well for short strings to match and patterns with few stars. However, it is easy to find examples that result in an explosive execution time blow-up due to excessive backtracking on stars. Cases such as a*a*a*a*a*a*a*a*b
and a string aaa…aaa
of 100 a
’s take minutes to terminate with a failure to match. Just try the following two shell commands:
$ touch `python3 -c 'print(100*"a")'`
$ ls a*a*a*a*a*a*a*a*b
It takes bash-3.2 about 8 minutes to terminate on my 2.9GHz i7 machine with decent performance. It takes tcsh-6.18 over an hour. The explosive 2n states for n stars is to blame for this behavior.
Using ABORT
One of the first globbing algorithms in history written by Rich Salz wildmat.c (original) had this undesirable behavior, but was updated by Lars Mathiesen in the early 2000s with an ABORT
state. Their comments in the updated wildmat.c source code still remain there to this day and are quite insightful:
Quote:
Once the control of one instance of DoMatch enters the star-loop, that instance will return either TRUE or ABORT, and any calling instance will therefore return immediately after (without calling recursively again). In effect, only one star-loop is ever active. It would be possible to modify the code to maintain this context explicitly, eliminating all recursive calls at the cost of some complication and loss of clarity (and the ABORT stuff seems to be unclear enough by itself).
Apparently, the exponential blow-up problem was already well known at that time, as we find the following comment in wildmat.c demonstrating the issue:
pattern: -*-*-*-*-*-*-12-*-*-*-m-*-*-*
text 1: -adobe-courier-bold-o-normal--12-120-75-75-m-70-iso8859-1
text 2: -adobe-courier-bold-o-normal--12-120-75-75-X-70-iso8859-1
Text 1 matches with 51 calls, while text 2 fails with 54 calls.
Without the ABORT, then it takes 22310 calls to fail. Ugh.
The powerful insight here is that only the last star-loop should be active, because advancing any of the previous star-loops is not productive when the last star-loop fails. This clever change made globbing execute in linear time for the typical case. Only in the worst case, the algorithm takes quadratic time in the length of the pattern and string.
Let’s apply this improvement to our example recursive matcher:
enum { FALSE = 0, TRUE = 1, ABORT = 2 };
int recursive_match(const char *text, const char *wild)
{
while (*text != '\0')
{
if (*wild == '*')
{
int ret;
while (*++wild == '*')
continue;
if (*wild == '\0')
return TRUE;
while ((ret = recursive_match(text, wild)) == FALSE && *text != '\0')
text++;
if (ret != TRUE)
return ABORT;
return *text != '\0' ? TRUE : FALSE;
}
if (*wild != '?' && *wild != *text)
return FALSE;
text++;
wild++;
}
while (*wild == '*')
wild++;
return *wild == '\0' ? TRUE : FALSE;
}
This improvement avoids exponential blow-up, but still makes a recursive call for each star in the pattern. This is not necessary. We can save that state of the matcher so that we can restore it from our backup to execute another iteration of the last star-loop until we are done.
Non-Recursive Wildcard String Matching
To replace recursion with iteration, we need two backup positions: the current position in the pattern right after the star and the current position in the text string. To iterate the star-loop, we restore the backed-up positions, advance one character in the text string, and try again. We keep iterating until the match succeeds or we run out of text. If a star occurs after our last backup position, we effectively enter a new star-loop and remove the one we were iterating on previously:
int match(const char *text, const char *wild)
{
const char *text_backup = NULL;
const char *wild_backup = NULL;
while (*text != '\0')
{
if (*wild == '*')
{
text_backup = text;
wild_backup = ++wild;
}
else if (*wild == '?' || *wild == *text)
{
text++;
wild++;
}
else
{
if (wild_backup == NULL)
return FALSE;
text = ++text_backup;
wild = wild_backup;
}
}
while (*wild == '*')
wild++;
return *wild == '\0' ? TRUE : FALSE;
}
The execution time of this algorithm is linear in the length of the pattern and string for typical cases. In the worst case, this algorithm takes quadratic time (to see why, consider pattern *aaa…aaab
with ½n a
’s and string aaa…aaa
with n a
’s, which requires ¼n2 comparisons before failing.)
In case you're asking if there is a linear worst-case algorithm, the answer is yes: by constructing a deterministic finite automaton (DFA) from the pattern for matching. This is beyond the scope of this article and requires taking into account the (high) cost of DFA construction.
Non-Recursive Glob Matching
Now that we reviewed the background and issues with string matching and globbing, let's move on to the main part of this article. Globbing differs from string matching with respect to a subtle meaning of the *
and ?
wildcards in globs: these wildcards do not match pathname separators, i.e., do not match /
(and do not match \
in Windows). For example, foo*.h
does not match foo/bar.h
.
To adjust our previous match algorithm to perform glob matching, we add some checks for the /
pathname separator such that *
and ?
cannot match it:
int globly_match(const char *text, const char *glob)
{
const char *text_backup = NULL;
const char *glob_backup = NULL;
while (*text != '\0')
{
if (*glob == '*')
{
text_backup = text;
glob_backup = ++glob;
}
else if ((*glob == '?' && *text != '/') || *glob == *text)
{
text++;
glob++;
}
else
{
if (glob_backup == NULL || *text_backup == '/')
return FALSE;
text = ++text_backup;
glob = glob_backup;
}
}
while (*glob == '*')
glob++;
return *glob == '\0' ? TRUE : FALSE;
}
For practical glob matching applications however, this glob-like ("globly") algorithm is not yet complete. It lacks support for character classes, such as [a-zA-Z]
to match letters and [^0-9]
to match any character except a digit. It also lacks a means to escape the special meaning of the *
, ?
, and [
meta characters with a backslash. Adding these new features requires a redesign of the glob meta character tests in the main loop, using a switch
to select glob meta characters and by using continue to commence the main loop. Breaking from the switch
moves control to the star-loop:
#define PATHSEP '/' // pathname separator, we should define '\\' for Windows instead
int glob_match(const char *text, const char *glob)
{
const char *text_backup = NULL;
const char *glob_backup = NULL;
while (*text != '\0')
{
switch (*glob)
{
case '*':
text_backup = text;
glob_backup = ++glob;
continue;
case '?':
if (*text == PATHSEP)
break;
text++;
glob++;
continue;
case '[':
{
int lastchr;
int matched = FALSE;
int reverse = glob[1] == '^' || glob[1] == '!' ? TRUE : FALSE;
if (*text == PATHSEP)
break;
if (reverse)
glob++;
for (lastchr = 256; *++glob != '\0' && *glob != ']'; lastchr = *glob)
if (lastchr < 256 && *glob == '-' && glob[1] != ']' && glob[1] != '\0' ?
*text <= *++glob && *text >= lastchr :
*text == *glob)
matched = TRUE;
if (matched == reverse)
break;
text++;
if (*glob != '\0')
glob++;
continue;
}
case '\\':
glob++;
default:
if (*glob != *text && !(*glob == '/' && *text == PATHSEP))
break;
text++;
glob++;
continue;
}
if (glob_backup == NULL || *text_backup == PATHSEP)
return FALSE;
text = ++text_backup;
glob = glob_backup;
}
while (*glob == '*')
glob++;
return *glob == '\0' ? TRUE : FALSE;
}
The PATHSEP
character is either the conventional /
or the Windows \
used in the string text to separate pathnames. Note that traditional Unix uses !
for character class inversion, for example [!0-9]
. Here, we offer a choice of !
and the more conventional ^
to invert a character class, for example [^0-9]
.
The execution time of this algorithm is linear in the length of the pattern and string for typical cases and takes quadratic time in the worst case.
Non-recursive gitignore-style glob Matching
As we arrive at the second main part of this article, let's take a look at modern gitignore-style globbing rules and design an efficient non-recursive algorithm.
Gitignore-style globbing applies the following rules to determine file and directory pathname matches:
* | matches anything except a / |
? | matches any one character except a / |
[a-z] | matches one character in the selected range of characters |
[^a-z] | matches one character not in the selected range of characters |
[!a-z] | matches one character not in the selected range of characters |
/ | when used at the begin of a glob, matches if pathname has no / |
**/ | matches zero or more directories |
/** | when at the end of a glob, matches everything after the / |
\? | matches a ? (or any character specified after the backslash) |
We also make the following assumption: when a glob contains a path separator /
, the full pathname is matched. Otherwise, the basename of a file or directory is matched. For example, *.h
matches foo.h
and bar/foo.h
, bar/*.h
matches bar/foo.h
but not foo.h
and not bar/bar/foo.h
. A leading /
in the glob can be used to force /*.h
to match foo.h
but not bar/foo.h
. Furthermore, a leading ./
or /
is ignored in pathnames, so ./foo/bar
is matched as foo/bar
as we would expect.
Examples:
* | matches a , b , x/a , x/y/b | |
a | matches a , x/a , x/y/a | but not b , x/b , a/a/b |
/* | matches a , b | but not x/a , x/b , x/y/a |
/a | matches a | but not x/a , x/y/a |
a?b | matches axb , ayb | but not a , b , ab , a/b |
a[xy]b | matches axb , ayb | but not a , b , azb |
a[a-z]b | matches aab , abb , acb , azb | but not a , b , a3b , aAb , aZb |
a[^xy]b | matches aab , abb , acb , azb | but not a , b , axb , ayb |
a[^a-z]b | matches a3b , aAb , aZb | but not a , b , aab , abb , acb , azb |
a/*/b | matches a/x/b , a/y/b | but not a/b , a/x/y/b |
**/a | matches a , x/a , x/y/a | but not b , x/b |
a/**/b | matches a/b , a/x/b , a/x/y/b | but not x/a/b , a/b/x |
a/** | matches a/x , a/y , a/x/y | but not a , b/x |
a\?b | matches a?b | but not a , b , ab , axb , a/b |
To implement gitignore-style globbing, we need two star-loops: one for single star, the “*
-loop”, and another for double star, the “**
-loop”. The **
-loop overrules the *
-loop because there is no point in backtracking on a single star when we encounter a double star after it in the glob pattern. The converse is not true: we should backtrack on a double star when a single star that comes after it in the glob does not match:
int gitignore_glob_match(const char *text, const char *glob)
{
const char *text1_backup = NULL;
const char *glob1_backup = NULL;
const char *text2_backup = NULL;
const char *glob2_backup = NULL;
if (*glob == '/')
{
while (*text == '.' && text[1] == PATHSEP)
text += 2;
if (*text == PATHSEP)
text++;
glob++;
}
else if (strchr(glob, '/') == NULL)
{
const char *sep = strrchr(text, PATHSEP);
if (sep)
text = sep + 1;
}
while (*text != '\0')
{
switch (*glob)
{
case '*':
if (*++glob == '*')
{
if (*++glob == '\0')
return TRUE;
if (*glob != '/')
return FALSE;
text1_backup = NULL;
glob1_backup = NULL;
text2_backup = text;
glob2_backup = ++glob;
continue;
}
text1_backup = text;
glob1_backup = glob;
continue;
case '?':
if (*text == PATHSEP)
break;
text++;
glob++;
continue;
case '[':
{
int lastchr;
int matched = FALSE;
int reverse = glob[1] == '^' || glob[1] == '!' ? TRUE : FALSE;
if (*text == PATHSEP)
break;
if (reverse)
glob++;
for (lastchr = 256; *++glob != '\0' && *glob != ']'; lastchr = *glob)
if (lastchr < 256 && *glob == '-' && glob[1] != ']' && glob[1] != '\0' ?
*text <= *++glob && *text >= lastchr :
*text == *glob)
matched = TRUE;
if (matched == reverse)
break;
text++;
if (*glob != '\0')
glob++;
continue;
}
case '\\':
glob++;
default:
if (*glob != *text && !(*glob == '/' && *text == PATHSEP))
break;
text++;
glob++;
continue;
}
if (glob1_backup != NULL && *text1_backup != PATHSEP)
{
text = ++text1_backup;
glob = glob1_backup;
continue;
}
if (glob2_backup != NULL)
{
text = ++text2_backup;
glob = glob2_backup;
continue;
}
return FALSE;
}
while (*glob == '*')
glob++;
return *glob == '\0' ? TRUE : FALSE;
}
The execution time of this algorithm is linear in the length of the pattern and string for typical cases and cubic in the worst case, when both a **
-loop and a *
-loop are active.
Refinements
Files with names starting with a dot (.
) are treated differently by shell globs, meaning that a dot beginning a name must be matched explicitly and cannot be matched by a wildcard. To replicate this behavior, we add the following:
int glob_match(const char *text, const char *glob)
{
int nodot = 1; ...
while (*text != '\0')
{
switch (*glob)
{
case '*':
if (nodot && *text == '.')
break;
...
case '?':
if (nodot && *text == '.')
break;
...
case '[':
{
...
if (nodot && *text == '.')
break;
...
}
...
default:
if (*glob != *text && !(*glob == '/' && *text == PATHSEP))
break;
nodot = *glob == '/';
...
Case-insensitive matching in ASCII can be achieved with tolower()
and toupper()
as follows:
if (tolower(*glob) != tolower(*text) && !(*glob == '/' && *text == PATHSEP))
break;
int chr;
for (lastchr = 256; *++glob != '\0' && *glob != ']'; lastchr = *glob)
if (lastchr < 256 && *glob == '-' && glob[1] != ']' && glob[1] != '\0' ?
(tolower(*text) <= (chr = *++glob) && tolower(*text) >= lastchr) ||
(toupper(*text) <= chr && toupper(*text) >= lastchr)
: tolower(*text) == tolower(*glob))
matched = TRUE;
Unicode matching can be supported in two ways: by using wide strings, i.e., wchar_t
or std::wstring
or by using UTF-8 encoded strings. The wide string option requires no changes to the algorithms. The UTF-8 version requires a few changes for ?
and character classes, with everything else staying the same:
case '?':
if (*text == PATHSEP)
break;
utf8(&text);
glob++;
continue;
case '[':
{
int chr = utf8(&text);
int lastchr = 0x10ffff;
int matched = FALSE;
int reverse = glob[1] == '^' || glob[1] == '!' ? TRUE : FALSE;
if (chr == PATHSEP)
break;
if (reverse)
glob++;
glob++;
while (*glob != '\0' && *glob != ']')
if (lastchr < 0x10ffff && *glob == '-' && glob[1] != ']' && glob[1] != '\0' ?
glob++, chr <= utf8(&glob) && chr >= lastchr :
chr == (lastchr = utf8(&glob)))
matched = TRUE;
if (matched == reverse)
break;
if (*glob)
glob++;
continue;
}
where utf8()
returns the wide character and advances by one UTF-8 character in the glob:
int utf8(const char **s)
{
int c1, c2, c3, c = (unsigned char)**s;
if (c != '\0')
(*s)++;
if (c < 0x80)
return c;
c1 = (unsigned char)**s;
if (c < 0xC0 || (c == 0xC0 && c1 != 0x80) || c == 0xC1 || (c1 & 0xC0) != 0x80)
return 0xFFFD;
if (c1 != '\0')
(*s)++;
c1 &= 0x3F;
if (c < 0xE0)
return (((c & 0x1F) << 6) | c1);
c2 = (unsigned char)**s;
if ((c == 0xE0 && c1 < 0x20) || (c2 & 0xC0) != 0x80)
return 0xFFFD;
if (c2 != '\0')
(*s)++;
c2 &= 0x3F;
if (c < 0xF0)
return (((c & 0x0F) << 12) | (c1 << 6) | c2);
c3 = (unsigned char)**s;
if (c3 != '\0')
(*s)++;
if ((c == 0xF0 && c1 < 0x10) || (c == 0xF4 && c1 >= 0x10) ||
c >= 0xF5 || (c3 & 0xC0) != 0x80)
return 0xFFFD;
return (((c & 0x07) << 18) | (c1 << 12) | (c2 << 6) | (c3 & 0x3F));
}
Conclusions
Wildcard matching, classic globbing, and modern gitignore-style globbing can be fast when implemented with star-loop iterations instead of recursion. These algorithms typically run in linear time in the length of the string and pattern and in the worst case, may run in quadratic or cubic time in the length of the pattern and string, without blowing up CPU usage exponentially.
Extended globbing features such as shell brace expansion can be added to our matcher. However, brace expansion requires recursion with backtracking, which may make the exponential blow-up problem worse if we're not careful. Consider for example a*{b,a*{b,a*{b,a*{b,a*{b,a*{b,a*{b,a*b}}}}}}}
. There is no easy way to limit the execution time for brace expansion, except perhaps by limiting the number of braces or by converting an extended glob to a regex for matching although regex matching can be costly too.
I wrote this article after implementing gitignore-style globbing for ugrep (fast universal grep), due to the lack of available and usable open source code. The versions of wildmat.c that I found in official repositories all had a nasty bug that is actually documented in the source code: "Might not be robust in face of malformed patterns; e.g., "foo[a-" could cause a segmentation violation." Ugh.
History
- 3rd August, 2019: First draft
- 5th August, 2019: First publication
- 8th August, 2019: Updated
- 12th September, 2019: Added Java download sources
- 19th September, 2019: Added shell execution example;
[]
does not match /
; dotglob refinement; new and updated download sources - 17th October, 2019: Minor update to improve gitignore globbing by removing a leading
/
from the pathname, when present - 23rd January, 2020: Improved character class globbing to match shell globbing when
-
(dash) is the last character e.g. [a-]
, fixed UTF-8 character class range matching (i.e. []
with dash) in section Refinements. - 22nd July, 2023: Minor update to case-insensitive globbing code refinement