Regular expressions (regex) are a fundamental tool in software development, used for pattern matching and data manipulation across various programming languages. However, their complexity, especially with complex patterns, can lead to inefficiency and performance issues. This paper introduces a regex optimization tool that systematically decomposes regular expressions into their core components—such as quantifiers, groups, alternation, and character classes—while applying advanced techniques for simplification and optimization. By leveraging an Abstract Syntax Tree (AST)-based approach, the tool enhances the readability and efficiency of regex patterns, helping developers avoid common pitfalls like excessive backtracking and redundant operations. In addition, this paper outlines practical strategies for optimizing regex, providing developers with the tools and knowledge to write more efficient, maintainable, and effective regular expressions.
Introduction
Regular expressions (regex) are powerful tools for pattern matching and text manipulation across various programming languages. They offer a concise way to search, extract, and modify text by specifying patterns for matching sequences of characters. However, while expressive, regex patterns can quickly become inefficient or overly complex, leading to performance issues, especially in large-scale data processing.
Regex optimization involves systematically simplifying and improving regex patterns to enhance efficiency, readability, and maintainability without altering their intended behavior. Refining complex expressions reduces the overhead caused by excessive backtracking and redundant operations, ensuring faster execution times and lower resource consumption.
This paper presents an approach to regex optimization that relies on a formal structural understanding of regular expressions. By breaking down regex into its fundamental components—such as literals, character classes, quantifiers, groups, and alternations—and applying optimization techniques, we can significantly streamline and simplify regex patterns. Using tools like Abstract Syntax Trees (ASTs) to represent and manipulate regex structures programmatically allows for a deeper analysis, paving the way for automated optimization techniques.
The following sections outline the core concepts of regex optimization, introduce a structured BNF (Backus-Naur Form) to represent regular expression syntax, and describe methods for transforming inefficient patterns into optimized versions. This paper aims to provide a robust framework for developers and automated tools to optimize and manage complex regular expressions effectively.
Background
This is a followed up article from the tool for decomposing a regular expressions.
Table of Contents
- Abstract
- Introduction
- Core Concept of Regex Optimization
- Optimizing regular expression
- Optimizing Quantifiers
- Leverage Character Classes
- Optimize OR Conditions (|)
- Utilize Non-Capturing Groups
- Implement Anchors Strategically
- Avoid Greedy Quantifiers
- Minimize Backtracking
- Use Lookaheads and Lookbehinds with Care
- Lookahead
- Lookbehind
- Practical Example of lookbehind
- The Regex optimizing tool
- Why Use an AST?
- Example of optimization
- Who is the optimizer tool for?
- Further Improvement?
- Conclusion
- Other Useful online tools
- 1. Regex101
- 2. RegExr
- 3. RegexBuddy
- 4. Regex Pal
- 5. Regex Tester
- 6. Regular expression tester
- Reference
- Appendix and Download of JavaScript source code
- Function ASTtoString()
- Function ASTtoRegex()
- Function parseRegex2AST()
- Function optimizeAST()
Changes
9 October 2024. Formatted the code section and added a download button after contents.
Core Concept of Regex Optimization
Regex optimization involves dissecting a regular expression into its basic elements and structures, such as literal characters, character classes, quantifiers, and groups. This process is crucial for developers who must debug or modify complex regex patterns. It clarifies what each part of the regex does and how it affects the matching behavior. Since regular expression syntax is complex and not always logical, it helped me by first establishing a formal BNF description of the syntax.
BNF syntax for regular expression
<regexp> ::= "/" <pattern> "/" [<flags>]
<pattern> ::= <concatenation> ("|" <concatenation>)*
<concatenation> ::= <element>*
<element> ::= <group>
| <character-class>
| <quantified>
| <anchor>
| <escaped-character>
| <character>
<group> ::= "(" <pattern> ")"
| "(?<identifier>" <pattern> ")"
| "(?:" <pattern> ")"
| "(?=" <pattern> ")"
| "(?!" <pattern> ")"
| "(?<=" <pattern> ")"
| "(?<!" <pattern> ")"
<character-class> ::= "[" [ "^" ] <character-class-body> "]"
<character-class-body> ::= <character-range> | <character>*
<character-range> ::= <character> "-" <character>
<quantified> ::= <element> <quantifier> [ <lazy-modifier> ]
<quantifier> ::= "*" | "+" | "?" | "{" <digits> [ "," <digits> ] "}"
<lazy-modifier> ::= "?"
<anchor> ::= "^" | "$" | "\b" | "\B"
<escaped-character> ::= "\\" <character>
<character> ::= any single character except special characters
| <escaped-character>
<identifier> ::= <letter> (<letter> | <digit> | "_")*
<flags> ::= <flag>*
<flag> ::= "g" | "i" | "m" | "s" | "u" | "y"
It is now easier to see how a regular expression can be broken down into simpler components. The JavaScript function I developed for the breakdown is listed in the appendix.
The main group for decomposition consists of:
- Escaped characters are those preceded by a backslash \, which changes the usual meaning of the character following it. For example, \n represents a new line, and \d matches any digit. They are used to enable the inclusion of special characters in the regex pattern as literal characters or to signify special regex functions (like \b for word boundaries).
- Character sets, enclosed within square brackets [], match any one character from a set of characters. For instance, [a-z] matches any lowercase letter, and [^a-z] matches any character that is not a lowercase letter. They allow the regex to be more flexible and concise by enabling the matching of characters from a defined set, enhancing pattern-matching capabilities within strings.
- Quantifiers determine how many instances of a preceding element (like a character, group, or character class) are needed for a match. Standard quantifiers include * (0 or more), + (1 or more), and ? (0 or 1). Quantifiers are critical for specifying the number of occurrences in the string that match the preceding element, allowing regex patterns to match varying text lengths. In its core form, the quantifiers are greedy, meaning they try to consume as much input as possible for matching. Its counterpart, lazy, means it will consume the least input to fulfill the match. Laziness is indicated by adding an extra ? to the beforementioned quantifiers.
- Braces are used as quantifiers to specify the number of times a pattern element must appear. They can define a fixed number {n}, a range {n,m}, or a minimum number of times {n,} a pattern must occur. This type of quantifier helps match a precise number of repetitions in a pattern, allowing for precise control over the occurrences of a component. This is particularly useful in matching specific formats of data like dates, serial numbers, or parts of codes. The simple quantifiers can also be written using braces as: {0,} for *, {1,} for +, and {0,1} for ?
- Alternation, denoted by the pipe symbol |, acts like a boolean OR. It matches the pattern before or after the |. For instance, cat|dog matches cat or dog. Alternation allows for matching one of several possible parts, making it worthwhile to include multiple options within a single regex pattern, enhancing flexibility and scope.
- Groups in regular expressions treat multiple characters as a single unit. They are enclosed in parentheses (). There are two forms of groups: capturing and non-capturing groups.
- Capturing groups save the part of the string matched by the part of the regex inside the parentheses. This allows the user to extract information from strings or to re-use parts of the pattern in the same regex (backreferences). There are two capturing groups: unnamed groups (…) and named groups (?<name>…). The named group was an addition to the unnamed groups that could only be backreference via their relative position in the regex. The named group allows us to give a group a more self-explanatory name referenced by that name. Any regex inside plain parentheses () is used for capturing. For instance, (abc) captures the sequence abc, and (?<year>\d{4}) captures four digits as a year.
- There are five non-capturing groups. The first one is (?:…), which matches the group, but they do not save the text by the group. They are used when you need the grouping functionality without the overhead of capturing. Prefix the group with ?:, like (?:abc) to group it without capturing. It treats "abc" as a single unit without remembering the match.
The second and third ones are the two lookahead groups (positive and negative lookahead). Positive Lookahead (?= ... ) matches a group after the main expression without including it in the result. Negative Lookahead (?! ... ) asserts that the group specified does not follow the main expression.
The fourth and fifth are the two look-behind groups (positive and negative look-behind). Positive Look-behind (?<= ... ) asserts that the group precedes the main expression and must be matched, but it does not consume any characters. Negative Lookbehind (?<! ... ) asserts that the specified group does not precede the main expression.
Each component is crucial in constructing complex and efficient regular expressions for pattern matching and text processing tasks.
Optimizing regular expression
Regular expressions are great tools for pattern matching and text processing, but they can quickly become inefficient if not carefully crafted. Optimizing your regular expressions boosts performance and enhances their readability and maintainability. Here's a breakdown of some essential best practices and tips for optimizing your regular expressions. First, let's look at the syntax for Regex. From a top perspective, you can break it down into the following list.
Symbol
| Naming
|
^
| Anchor Start
|
$
| Anchor end
|
[…]
| Char class
|
\d,\D,\w,|W,\s,\S,\b,\B
| Predefined class
|
Abc…
| Literals
|
+,*,?,{…},+?,*?,??,{…}?
| Quantifiers
|
(…),(?<name>…)
| Capturing groups
|
(?:…),(?=…),(?!...),(?<=…),(?<!...)
| Non-capturing groups
|
Cat|dog|bird
| Alternation
|
Let's start by looking at the Quantifiers.
Optimizing Quantifiers
Quantifiers appear after a <pattern> where any quantifiers can be followed by an optional ? to indicate a lazy quantifier. E.g.
<pattern>+ meaning one or more occurrences of the pattern
<pattern>* meaning zero or more occurrences of the pattern
<pattern>? meaning zero of one occurrence of the pattern.
A quantifier can also be a bracket quantified. e.g. {n} or {n,} or {n,m}. A bracket quantifier can also be followed by an optional ? indicating a lazy bracket.
<pattern>{n} n occurrence of the <pattern>
<pattern> {n,} n or more occurrences of the <pattern>
<pattern> {n,m} from n to m occurrence of the <pattern>
Efficiency can often be achieved by simplifying how repetitions are handled:
Use + instead of {1,} to match one or more occurrences.
Use * instead of {0,} for zero or more occurrences.
Replace {0,1} with a ? to match zero or one occurrence.
Simplify {n,n} to {n}
As I mentioned, a quantifier can be followed by the lazy quantifier symbol ? e.g.
<pattern>+?, <pattern>*?, <pattern>?? Or pattern>{…}?
We can also optimize the regex under certain conditions with the lazy quantifier.
If we encounter <pattern>+? where the lazy quantifier is applied, it is redundant to keep the +? because the pattern itself will be sufficient to capture the required behavior. We can, therefore, optimize it to:
<pattern>+? → <pattern>
For example: \d+? → \d
You could extend this optimization to other lazy quantifiers if needed, such as:
<pattern>*? → potentially equivalent to an empty pattern in some cases.
<pattern>?? → reduces to just the pattern itself if applicable.
However, these optimizations would need to be carefully applied depending on the context of the regex. Here is a classic example where it does not work. Consider the regex:
/<.*?>/
This regex is commonly used to match HTML tags like <div>, <span>, etc. The pattern
.*? matches zero or more characters lazily, which matches as few characters as possible until the closing > is found. E.g. <div><span>Hello</span></div>. Matches <div>, <span>, </span>, </div>
Here, the *? doesn't simply match an empty string; instead, it matches the shortest possible sequence of characters (between < and >) inside an HTML tag. If we did not put the lazy quantifier ? then it will match the entire string: <div><span></span></div> which is not what you want to do.
There are some specific scenarios where a lazy quantifier follows the pattern *? (i.e., <pattern>*?) can be simplified by removing it. These cases occur when the pattern is redundant because of the surrounding context or because it doesn't contribute to the overall match. However, detecting these cases generally requires analyzing the structure of the regex and its context.
- When the Pattern is at the End and Followed by an Anchor or Another Character That Enforces a Match. If the *? pattern is placed just before an anchor (^, $, \b) or a mandatory literal, and it’s non-contributory. The pattern can be removed. The lazy quantifier allows zero pattern occurrences to be matched, and the anchor or literal will determine the match.
Example 1: a*?$ Since the *? allows zero matches and the $ enforces the end of the string. We can simplify this to just: $ In this case, the pattern a*? can be safely removed because the match will always end at the end of the string ($), whether or not there are any ‘a’ characters.
Example 2: \w*?\b Again, we can simplify; since \w*? can match zero characters, the word boundary will always be satisfied. This simplifies to just: \b The \w*? here is redundant because the word boundary will match even if there are no preceding word characters. - When the Pattern is part of a larger optional sequence. It can be removed if the pattern *? is part of an already optional larger sequence (e.g., due to an alternation or another quantifier).
Example: (a|b*?). Simplification can be done to (a|) and simplified to a?. Here, b*? can be safely removed because it doesn't contribute anything different from the empty alternative in the alternation. - When the Pattern is Repeated Within an Already Optional Group If *? is applied within a group already repeated with a quantifier (like ? or {0,}), the *? can be redundant and may be removed.
Example 1: (a*?)? The outer ? already marks the group optional, so *? Is redundant, and the regex simplifies to (a)? or just a?
Example 2: (a*)*? Where the outer *? makes the whole group redundant because a* already allows zero matches. This simplifies to a* - In Alternation with an Empty Pattern. If *? appears in alternation (|) with an empty pattern or another pattern allowing zero matches, the *? can be removed.
Example: (a*?|b), which can be simplified to just (a|b).
When simplifying patterns like <pattern>*? (i.e., zero or more occurrences of a pattern lazily), you can apply the optimization by removing the <pattern>*? if it's followed by certain anchors that make the lazy quantifier redundant. Anchors typically don't require any match themselves and dictate positions in the input (like the start or end of a string), so *? can be safely removed if the anchor sufficiently restricts the match. The following anchors can be used to simplify. ^, $, \b, \B, lookahead assertion: (?=…), Negative lookahead assertion: (?!...), Lookbehind assertion: (?<=…) and Negative lookbehind assertion (?<!...)
Leverage Character Classes
Predefined character classes can significantly simplify your expressions:
- \d can be used instead of [0-9], matching any digit.
- \w is a substitute for [a-zA-Z0-9_], matching any alphanumeric character plus underscore.
- \s effectively matches any whitespace character, streamlining patterns and avoiding lengthy alternatives. \s substitutes [ \t\n\t\f\v]
- A charset using multiple predefined character classes like [\d\w] can be simplified to [\w] since \d is a subset of \w.
Optimize OR Conditions (|)
When using alternation (|), placing the most likely conditions first can save time, especially if subsequent conditions are subsets of previous ones. Also, simplifying alternations by extracting common prefixes or suffixes can reduce complexity. For instance, we perform the following optimization of the OR conditions.
Transforming patterns like:
abc|abd → ab[cd]
abc|ab → ab[c]?
[0-9]|[a-z]|[A-Z]|[_] → \w
a|b|c → [abc]
a|b| → [ab]?
Utilize Non-Capturing Groups
Non-capturing groups (?:...) are preferable when you do not need to save the captured group for later use. This approach reduces the overhead since the regex engine does not need to track the contents of these groups. Sometimes, a non-capturing grouping is unnecessary or doesn’t change the behavior. For example, when a single pattern (a literal, char class, or predefined escapes) follows the non-capturing group, then the non-capturing can be removed. Also, if a non-capture group precedes by ^ (ANHOR_START) or $(ANCHOR_END), then the non-capturing group is redundant.
(?:a) → a
(?:a|b|c)? → [abc]?
^(?:abc) → ^abc
(?:abc)$ → abc$
(?:abc)+ → abc+
(?:[a-z])+ → [a-z]+
Implement Anchors Strategically
Using anchors like ^ (ANCHOR_START) for the start of a string and $ (ANCHOR_END) for the end of a string can prevent unnecessary matching attempts across the entire string when the pattern is known to be fixed or limited to specific positions.
Avoid Greedy Quantifiers
Greedy quantifiers (*, +) attempt to match as much text as possible, which can lead to performance issues. Opt for their lazy counterparts (*?, +?), which match the smallest possible string, enhancing efficiency. However, it depends on what you are trying to match.
Minimize Backtracking
To prevent excessive backtracking, which can slow down your regex, utilize atomic groups (?>...) and possessive quantifiers like *+ or ++. These constructs tell the regex engine to avoid revisiting previously matched characters, improving performance in complex patterns.
Since JavaScript does not support atomic groups or possessive quantifiers, you often need to rely on other methods to prevent inefficient backtracking:
- Non-Capturing Groups with Greedy Quantifiers: You can sometimes rewrite parts of your regex using non-capturing groups (?:...) combined with greedy quantifiers, being careful about the overall pattern design to minimize backtracking.
- Lookahead Assertions: Use lookahead assertions to check for conditions without actually consuming characters, which can sometimes help manage backtracking.
- Specific Character Exclusions: Instead of using a . (dot), which matches any character (often leading to excessive backtracking), specify more explicitly which characters to match or not match. For example, use something like [^,]+ to match strings up to a comma without including it.
For example, if you’re trying to match a sequence that ends with a specific suffix, instead of writing something like .*suffix, which can lead to excessive backtracking, you might specify what shouldn't be matched before the suffix, like [^ ]*suffix (assuming the suffix is preceded by spaces).
Use Lookaheads and Lookbehinds with Care
Lookaheads and look behinds are potent tools for assertions in regex, but they should be used sparingly. If overused, they can impact performance. Employ these when necessary to assert conditions without consuming characters.
Lookaheads and lookbehinds are advanced regex features that allow you to specify additional conditions for a match without including those in the match itself. These "zero-width assertions" do not consume any characters on the string; they just assert whether a match is possible.
Lookahead
Positive Lookahead (?=) asserts that what immediately follows the current position in the string must match the pattern specified inside the parentheses but does not include that in the match.
Example: X(?=Y) matches X only if X is followed by Y. In the string XYP, X would match because it's followed by Y.
Optimizing positive lookahead by removing redundant constructions.
(?=abc)abc → abc
(?:abc|def)(?=ghi) → (?:abc|def)
Negative Lookahead (?!) asserts that what immediately follows the current position in the string must not match the pattern specified inside the parentheses.
Example: X(?!Y) matches X only if X is not followed by Y. In the string XZP, X would match because Y does not follow it.
Negative lookaheads (?!...) can sometimes be rewritten more efficiently, especially when applied to simple character classes or patterns. E.g.
(?!a)b → [^a]b
This pattern matches b, but only if not preceded by a. Depending on the context, this can often be replaced with a negated character class for simplicity.
Lookbehind
Positive Lookbehind (?<=) asserts that what immediately precedes the current position in the string must match the pattern specified in the parentheses but not include that in the match.
Example: (?<=Y)X matches X only if Y precedes X. In the string YXP, X would match because Y precedes it.
Negative Lookbehind (?<!) asserts that what immediately precedes the current position in the string must not match the pattern specified inside the parentheses.
Example: (?<!Y)X matches X only if X is not preceded by Y. In the string ZXP, X would match because it's not preceded by Y.
Practical Example of lookbehind
Consider a scenario where you need to find dollar amounts not preceded by the word Refund. The regex could use a negative lookbehind to ensure the amounts we match aren't part of a refund: /(?<!Refund )\$[0-9]+/ with the target string “Refund $100, Earned $200" will match $200 because it is not preceded by "Refund ".
Lookaheads and lookbehinds are incredibly powerful for conditions where you need to check the context of a substring without including that context in the final matched result. They are often used in data validation, parsing complex string formats, and conditional processing within lines of text.
It can be challenging to implement any optimization of regular expressions. A few can be done without knowing the purpose of the regular expression. However, most are based on how you want to extract information from the regular expression. For example, changing a capturing grouping to a non-capturing requires that you don’t plan to use the grouping later on. It will be impossible for an optimizer to know that; therefore, it can only be suggested that any grouping you don’t plan to refer back to can be changed to a non-capturing group.
The Regex optimizing tool
Now, with all the optimization potential in place, we can turn our attention to the actual coding for the tool. At first, I was planning to analyze the regular expression to find a pattern that can convert to a simpler version or is redundant. However, I quickly ran into limitations using that approach. Instead, I realized that to get through, I needed to build an abstract syntax tree (AST) for a regular expression and then use compiler techniques to rearrange, simplify, and remove redundant elements in the regular expression.
An Abstract Syntax Tree (AST) represents the syntactic structure of a source code or other formal language, such as a regular or mathematical expression. The AST abstracts away the actual syntax used (like parentheses, brackets, or commas), focusing instead on the structural and logical elements of the code. Each node in the tree represents a construct occurring in the code.
Key Characteristics of a General AST for Programming Language is nodes, edges, abstraction and hierarchy:
- Each node represents a syntactic construct, such as an operator, literal, function, or code block. Nodes are often categorized by their “type” (e.g., EXPRESSION, STATEMENT, FUNCTION_CALL, etc.).
- The edges represent relationships between nodes. For example, a function call might be a parent node with child nodes representing the function name and arguments.
- An AST simplifies the source code, abstracting away details like punctuation or specific symbols that are not necessary to understand the code’s logic.
- Hierarchy reflects the program's hierarchical structure, showing which constructs are “inside” or “related to” others.
Example:
Consider a simple arithmetic expression found in countless books about compilers:
2 + 3 * 4
The AST for this expression would look something like this:
+
/ \
2 *
/ \
3 4
- The root node is the + operator because addition is the main operation.
- The * operator is a child of the +, showing that multiplication happens before addition due to precedence rules.
- The numbers 2, 3, and 4 are leaf nodes (the operands of the operations).
Why Use an AST?
- Compilers and interpreters use ASTs to analyze and manipulate code, such as for syntax checking, optimization, or translation into machine code.
- During compilation, an AST analyzes the structure and dependencies between operations to identify areas where the code can be optimized.
- Once an AST is constructed, a compiler can easily traverse the tree to generate the equivalent machine code or bytecode.
- Many code analysis tools (like linters or formatters) work by manipulating ASTs because it’s easier to analyze a structured tree than raw source code.
The purists would properly argue that for my simple regex project, it is more like a parse tree than an AST?
An AST is often confused with a parse tree (or concrete syntax tree), but they are different:
- A parse tree contains all the details of the grammar and syntax of the source code, including all tokens, parentheses, and punctuation.
- An AST abstracts unnecessary grammar details and focuses on the semantic structure.
We're working with a parse tree if your structure retains all the details (like distinguishing + from +?. It would be more accurately called an AST if it simplifies or optimizes such cases. So, we have a parse tree that we optimized into an AST. For simplicity, I will call it an AST.
For our regular expression project, we have Nodes like (only a subset shown below):
ANCHOR_START ^
ANCHOR_END $
ALTERNATION |
CAPTURING GROUP ( … ), (?<name> … )
NON-CAPTURING Group (?: … ), and others
LITERAL. A…Z , a…z, 0… 9
QUANTIFIER. +,*,?, {…}
For our Project, we need four different functions. (JavaScript code in Appendix)
1. Convert a regular expression into an AST. ParseRegextoAST()
2. Convert it back to a regular expression. ASTtoRegex()
3. Optimized the regular expression. OptimizeAST()
4. Decoded the AST for debugging purposes. ASTtoString()
All four different functions include recursive functions to descend into the tree and optimize it.
Who is the optimizer tool for?
In optimizing regular expressions, it’s important to recognize the different needs of users at various skill levels. An optimizer provides significant advantages for beginners and intermediate users by automating best practices and avoiding common inefficiencies. Novices often write patterns prone to excessive backtracking or redundancies, while intermediate users may overlook subtle optimizations like simplifying quantifiers or removing unnecessary capturing groups. An optimizer helps bridge this gap, ensuring performance improvements, reducing errors, and serving as a learning tool by demonstrating more efficient patterns. In contrast, advanced users are typically as proficient as an optimizer in crafting highly optimized regex. Their deeper understanding of regex engines allows them to make fine-tuned optimizations based on specific context, balancing performance with readability and maintainability. While advanced users may not always require automated optimizations, they can still benefit from an optimizer's consistency and convenience, particularly for catching repetitive or easily overlooked improvements.
Example of optimization
Using the regular expression /[a-zA-Z_][0-9a-zA-Z_]*/g
you can optimize it to /[A-Z_a-z]\w*/
The details of the optimization is listed below.
Original 1st Regex:
/[a-zA-Z_][0-9a-zA-Z_]*/g
Prior Optimization:
1: ROOT: Child[2]
2: CHAR_CLASS: Value=[a-zA-Z_], Parent=1
3: QUANTIFIER: Value=*: Child[1], Parent=1
4: CHAR_CLASS: Value=[0-9a-zA-Z_], Parent=3
Suggested optimization:
[a-zA-Z_] to [A-Z_a-z] to simplify regex
Replaced [0-9a-zA-Z_] with \w to simplify regex
New Optimize regular expression:
/[A-Z_a-z]\w*/
After Optimization
1: ROOT: Child[2]
2: CHAR_CLASS: Value=[A-Z_a-z], Parent=1
3: QUANTIFIER: Value=*: Child[1], Parent=1
4: PREDEFINED_CLASS: Value=\w, Parent=3
Further Improvement?
After finishing this project, I realized I was using simple peephole compiler techniques to optimize a regex. The drawback is that every optimization step has to be coded. The next step is to look at a global analysis of the regex structure using rule-based optimization, where we look at patterns or sequences of patterns that can transform into a more optimized version using a rule-based approach. This will enable more optimization techniques to be added without a need to hand code every new optimization. However, this will be a future project, version 2 of the optimizing regular expressions tool.
Conclusion
The four tools listed are a crucial part of optimizing a regular expression through decomposition, and optimization not only aids in debugging and development but also enhances the regular expression to be the most optimal version and improves one's ability to write more efficiently and error-free regexes. The Optimizing tool set plays an important role in this educational process. The Author's regular expression tool is the sixth tool mentioned. It has a unique feature that allows you to test two regular expressions simultaneously, decompose a regular expression, or optimize a regular expression. This becomes useful when trying to build a regular expression for a given problem, and then dynamically, the expression can be built to match more of the needed pattern successively.
Other Useful online tools
There are several online tools for testing and debugging regular expression. Here are some widely-used regex tools and their relevant details that you can reference in your article:
1. Regex101
- Website: Regex101
- Description: Regex101 is a powerful online tool for testing and debugging regular expressions. It supports multiple programming languages, including JavaScript, Python, PHP, and Go. The tool provides a detailed explanation of each regex part as you type, along with a quick reference guide and a library of user-submitted regex patterns.
- Key Features:
- Real-time regex parsing and testing.
- Detailed explanations of regex constructions.
- Code generator for various languages.
- User pattern library.
2. RegExr
- Website: RegExr
- Description: RegExr is another popular online tool to learn, build, and test regular expressions. It offers a clean interface and provides real-time visual feedback on your regex pattern matching.
- Key Features:
- Real-time results and highlighting.
- Extensive community patterns and examples.
- Detailed help and cheat sheets.
- History of your regex tests for easy backtracking.
3. RegexBuddy
- Website: RegexBuddy
- Description: RegexBuddy is a downloadable tool for Windows that acts as your regex assistant. It helps you create and understand complex regexes and implements them in source code.
- Key Features:
- Detailed analysis of regular expressions.
- Test regexes against sample texts.
- Integration with various programming environments.
- Regex building blocks for easier assembly.
4. Regex Pal
- Website: Regex Pal
- Description: Regex Pal is a straightforward, web-based tool for testing JavaScript regular expressions quickly. It provides immediate visual feedback but is more straightforward and less feature-rich than Regex101 or RegExr.
- Key Features:
- Quick testing with real-time highlighting.
- Minimalistic and fast.
- Sidebar with regex tokens and short descriptions.
5. Regex Tester
- Website: Regex Tester and Debugger Online - Javascript, PCRE, PHP
- Description: Regex Tester from RegexPlanet supports testing and debugging regex for multiple programming languages, including Java, .NET, and Ruby. It offers a unique environment to test regex in different programming contexts.
- Key Features:
- Support for several programming languages.
- Regex library and community examples.
- Advanced options for regex testing and results.
6. Regular expression tester
- Website: Testing Regular expression for matching specific text patterns (hvks.com)
- Description: Regular Expression Tester supports testing and debugging regex for JavaScript programming languages. It offers a unique environment to test regex in different programming contexts plus an optimizer
- Key Features:
- Support for most programming languages.
- Samples of regular expression for most common day coding problems
- Offer decomposing of regular expression for easier debugging and understanding.
- Offer optimization of regular expression
- Offer multiline matching.
- Can check and test two regular expressions simultaneously.
- Offers printing and emailing of results.
Reference
- J. Goyvaerts & S. Levithan, Regular Expression Cookbook, O’Reilly May 2009
- Regular Expression Tester. Testing Regular expression for matching specific text patterns (hvks.com)
- Decoding regular expression. Decomposing regular expression (hvks.com)
Appendix
All code fragments in this appendix are based on JavaScript and can be downloaded as well.
Function ASTtoString()
function ASTtoString(rootNode) {
if (!rootNode) return 'Invalid AST';
function ASTtoStringInternal(node, indent = 0, line = 1, parentLine = null) {
const indentation = ' '.repeat(indent);
let result = [];
result.push(`${line}: ${indentation}${node.type}`);
if (node?.value) result.push(`: Value=${node.value}`);
const childCount = node.children.length;
if (childCount > 0) result.push(`: Child[${childCount}]`);
if (parentLine !== null) result.push(`, Parent=${parentLine}`);
result.push('\n');
let currentLine = line + 1;
for (const child of node.children) {
const childResult = ASTtoStringInternal(child, indent + 1, currentLine, line);
result.push(childResult.result);
currentLine = childResult.line;
}
return { result: result.join(''), line: currentLine };
}
return ASTtoStringInternal(rootNode).result;
}
Function ASTtoRegex()
function ASTtoRegex(node) {
let regex = '';
switch (node.type) {
case 'ROOT':
case 'GROUP':
case 'NON_CAPTURING':
case 'CAPTURING':
case 'NAMED_CAPTURING':
case 'LOOKAHEAD':
case 'NEGATIVE_LOOKAHEAD':
case 'LOOKBEHIND':
case 'NEGATIVE_LOOKBEHIND':
const groupPrefixes = {
'NON_CAPTURING': '(?:',
'NAMED_CAPTURING': `(?<${node.value}>`,
'CAPTURING': '(',
'LOOKAHEAD': '(?=',
'NEGATIVE_LOOKAHEAD': '(?!',
'LOOKBEHIND': '(?<=',
'NEGATIVE_LOOKBEHIND': '(?<!'
}
if (groupPrefixes[node.type])
regex += groupPrefixes[node.type]
node.children.forEach((child) =>
regex += ASTtoRegex(child)
)
if (['NON_CAPTURING', 'CAPTURING', 'NAMED_CAPTURING','LOOKAHEAD', 'NEGATIVE_LOOKAHEAD', 'LOOKBEHIND', 'NEGATIVE_LOOKBEHIND'].includes(node.type))
regex += ')'
break
case 'ALTERNATION':
const needsGrouping = node.parent && node.parent.type == 'GROUP'
if (needsGrouping) regex += '('
node.children.forEach((child, childIndex) => {
if (childIndex > 0)
regex += '|'
regex += ASTtoRegex(child)
})
if (needsGrouping) regex += ')'
break
case 'QUANTIFIER':
if (node.children.length > 0)
regex += ASTtoRegex(node.children[0])
regex += node.value
break
case 'CHAR_CLASS':
case 'PREDEFINED_CLASS':
case 'LITERAL':
case 'ANCHOR_START':
case 'ANCHOR_END':
case 'ESCAPED_CHAR':
regex += node.value
break
default:
throw new Error(`Unknown node type: ${node.type} (Node value: ${node.value})`)
}
return regex
}
Function parseRegextoAST()
class RegexNode {
constructor(type, value, parent=null) {
this.type = type;
this.value = value;
this.parent= parent;
this.children = [];
}
addChild(node) {
node.parent = this;
this.children.push(node);
}
}
function parseRegextoAST(regex) {
const regexString = regex instanceof RegExp ? regex.source : regex
function parseRegex2AST(regexString, capturingGroupCount = { count: 0 }, parent = null) {
const rootNode = new RegexNode('ROOT', null, parent)
let i = 0
function handleQuantifier(rootNode, regex, currentIndex, quantifier) {
let isLazy = (currentIndex + quantifier.length < regex.length && regex[currentIndex + quantifier.length] === '?')
let quantifierValue = quantifier + (isLazy ? '?' : '')
const lastNode = rootNode.children.pop()
const quantifierNode = new RegexNode('QUANTIFIER', quantifierValue, rootNode)
quantifierNode.addChild(lastNode)
rootNode.addChild(quantifierNode)
}
function handleGroup(type, regex, startIndex, name = '') {
let depth = 1
let i = startIndex
let groupNode = new RegexNode(type, name, rootNode)
while (i < regex.length && depth > 0) {
if (regex[i] === '(' && !(regex[i - 1] === '\\')) depth++
if (regex[i] === ')' && !(regex[i - 1] === '\\')) depth--
i++
}
let content = regex.substring(startIndex, i - 1)
let parsedContent = parseRegex2AST(content, capturingGroupCount, groupNode)
if (parsedContent.type === 'ROOT') parsedContent.type = 'GROUP'
if (parsedContent.type === 'GROUP') {
if (parsedContent.children.length === 1)
groupNode.addChild(parsedContent.children[0])
else if (parsedContent.children.length != 0)
groupNode.addChild(parsedContent)
} else
groupNode.addChild(parsedContent)
return { node: groupNode, newPosition: i }
}
while (i < regexString.length) {
const char = regexString[i]
switch (char) {
case '^':
rootNode.addChild(new RegexNode('ANCHOR_START', '^', rootNode))
break
case '$':
rootNode.addChild(new RegexNode('ANCHOR_END', '$', rootNode))
break
case '\\':
i++
if (['b', 'B', 'd', 'D', 'w', 'W', 's', 'S'].includes(regexString[i]))
rootNode.addChild(new RegexNode('PREDEFINED_CLASS', '\\' + regexString[i], rootNode))
else
rootNode.addChild(new RegexNode('ESCAPED_CHAR', regexString[i], rootNode))
break
case '[':
let endIdx = regexString.indexOf(']', i + 1)
rootNode.addChild(new RegexNode('CHAR_CLASS', regexString.substring(i, endIdx + 1), rootNode))
i = endIdx
break
case '(':
let result
if (regexString.substring(i, i + 3) === '(?:') {
i += 3
result = handleGroup('NON_CAPTURING', regexString, i)
} else if (regexString.substring(i, i + 3) === '(?<' && regexString.indexOf('>', i) !== -1) {
let nameEndIdx = regexString.indexOf('>', i)
let name = regexString.substring(i + 3, nameEndIdx)
i = nameEndIdx + 1
result = handleGroup('NAMED_CAPTURING', regexString, i, name)
} else if (regexString.substring(i, i + 3) === '(?=' || regexString.substring(i, i + 3) === '(?!') {
let type = regexString.substring(i + 2, i + 3) === '=' ? 'LOOKAHEAD' : 'NEGATIVE_LOOKAHEAD'
i += 3
result = handleGroup(type, regexString, i)
} else if (regexString.substring(i, i + 4) === '(?<=' || regexString.substring(i, i + 4) === '(?<!') {
let type = regexString.substring(i + 3, i + 4) === '=' ? 'LOOKBEHIND' : 'NEGATIVE_LOOKBEHIND'
i += 4
result = handleGroup(type, regexString, i)
} else {
i++
let currentCaptureCount = ++capturingGroupCount.count
result = handleGroup('CAPTURING', regexString, i)
result.node.value = currentCaptureCount
}
rootNode.addChild(result.node)
i = result.newPosition - 1
break
case '|':
let endIndex
if (rootNode.type !== 'ALTERNATION') {
const altNode = new RegexNode('ALTERNATION', null, rootNode)
const leftSideGroup = new RegexNode('GROUP', null, rootNode)
leftSideGroup.children = [...rootNode.children]
rootNode.children = []
altNode.addChild(leftSideGroup)
rootNode.addChild(altNode)
endIndex = regexString.length
const alternationContent = regexString.substring(i + 1)
const rightSideAST = parseRegex2AST(alternationContent, capturingGroupCount, altNode)
if (rightSideAST.children[0] && rightSideAST.children[0].type === 'ALTERNATION')
rightSideAST.children[0].children.forEach(child => altNode.addChild(child))
else {
const rightSideGroup = new RegexNode('GROUP', null, altNode)
rightSideGroup.children = [...rightSideAST.children]
altNode.addChild(rightSideGroup)
}
} else {
endIndex = regexString.length
const alternationContent = regexString.substring(i + 1)
const additionalAST = parseRegex2AST(alternationContent, capturingGroupCount, rootNode)
if (additionalAST.children[0] && additionalAST.children[0].type === 'ALTERNATION')
additionalAST.children[0].children.forEach(child => rootNode.addChild(child))
else {
const additionalGroup = new RegexNode('GROUP', null, rootNode)
additionalGroup.children = [...additionalAST.children]
rootNode.addChild(additionalGroup)
}
}
i = endIndex - 1
break
case '*':
case '+':
case '?':
handleQuantifier(rootNode, regexString, i, char)
if (regexString[i + 1] === '?')
i++
break
case '{':
let closeBraceIndex = regexString.indexOf('}', i)
if (closeBraceIndex !== -1) {
handleQuantifier(rootNode, regexString, i, regexString.substring(i, closeBraceIndex + 1))
if (regexString[closeBraceIndex + 1] === '?')
++closeBraceIndex
i = closeBraceIndex
}
break
default:
rootNode.addChild(new RegexNode('LITERAL', char, rootNode))
break
}
i++
}
return rootNode
}
function combineLiteralNodes(astNode) {
for (let i = 0; i < astNode.children.length; i++) {
let node = astNode.children[i]
if (node.type === 'LITERAL') {
while (i + 1 < astNode.children.length && astNode.children[i + 1].type === 'LITERAL') {
node.value += astNode.children[i + 1].value
astNode.children.splice(i + 1, 1)
}
}
if (node.children && node.children.length > 0)
combineLiteralNodes(node)
}
}
function removeUnnecessaryGroups(astNode) {
for (let i = 0; i < astNode.children.length; i++) {
let node = astNode.children[i]
if (node.type === 'GROUP' && node.value == null && node.children.length === 1) {
let child = node.children[0]
astNode.children[i] = child
child.parent = astNode
}
if (node.children && node.children.length > 0)
removeUnnecessaryGroups(node)
}
}
let ast = parseRegex2AST(regexString)
combineLiteralNodes(ast)
removeUnnecessaryGroups(ast)
return ast
}
Function optimizeAST()
function optimizeAST(node, changes = new Set()) {
let changeDesc = '';
function nextSibling(currentNode) {
if (!currentNode.parent)
return null;
const siblings = currentNode.parent.children;
const index = siblings.indexOf(currentNode);
if (index >= 0 && index < siblings.length - 1)
return siblings[index + 1];
return null;
}
function previousSibling(node) {
if (!node.parent)
return null;
const siblings = node.parent.children;
const index = siblings.indexOf(node);
if (index > 0)
return siblings[index - 1];
return null;
}
function CompareNodes(node1, node2) {
if (!node1 && !node2) return true;
if (!node1 || !node2) return false;
if (node1.type !== node2.type || node1.value !== node2.value)
return false;
if (node1.children.length !== node2.children.length)
return false;
for (let i = 0; i < node1.children.length; i++) {
if (!CompareNodes(node1.children[i], node2.children[i]))
return false;
}
return true;
}
function normalizeCharacterClass(value) {
function preprocessCharacterClass(value) {
const hasWordChar = /\\w/.test(content);
const hasNonWhitespace = /\\S/.test(content);
const hasDigit = /\\d/.test(content);
content = content.replace(/(\\w)+/g, hasWordChar ? '\\w' : '')
.replace(/(\\S)+/g, hasNonWhitespace ? '\\S' : '')
.replace(/(\\d)+/g, hasDigit ? '\\d' : '');
if (hasNonWhitespace)
content = content.replace(/\\d|\\w/g, '');
else if (hasWordChar)
content = content.replace(/\\d/g, '');
content = content.replace(/\\d/g, '0-9');
content = content.replace(/\\w/g, 'a-zA-Z0-9_');
content = content.replace(/\\D/g, '^0-9');
content = content.replace(/\\W/g, '^a-zA-Z0-9_');
content = content.replace(/\[\\D\]/g, '^0-9');
content = content.replace(/\[\\W\]/g, '^a-zA-Z0-9_');
content = content.replace(/\[\\s\]/g, ' \t\n\r\f\v');
content = content.replace(/\[\\S\]/g, '^ \t\n\r\f\v');
return content;
}
function expandCharacterRange(range) {
const result = [];
const start = range.charCodeAt(0);
const end = range.charCodeAt(2);
for (let i = start; i <= end; i++) {
result.push(String.fromCharCode(i));
}
return result;
}
function postProcessCharacterSet(characters) {
const digitRange = '0123456789';
const wordRange = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_';
const spaceRange = ' \t\n\r\v\f';
let charSet = new Set(characters);
function isFullRangeCovered(range) {
return [...range].every(char => charSet.has(char));
}
function createCharacterRanges(chars) {
let ranges = [];
let lastChar = chars[0];
let rangeStart = lastChar;
for (let i = 1; i < chars.length; i++) {
let currentChar = chars[i];
if (currentChar.charCodeAt(0) !== lastChar.charCodeAt(0) + 1) {
if (rangeStart === lastChar)
ranges.push(rangeStart);
else
ranges.push(`${rangeStart}-${lastChar}`);
rangeStart = currentChar;
}
lastChar = currentChar;
}
if (rangeStart === lastChar)
ranges.push(rangeStart);
else
ranges.push(`${rangeStart}-${lastChar}`);
return ranges;
}
let result = [];
let processedChars = new Set();
if (isFullRangeCovered(spaceRange)) {
result.push('\\s');
[...spaceRange].forEach(char => processedChars.add(char));
}
if (isFullRangeCovered(wordRange)) {
result.push('\\w');
[...wordRange].forEach(char => processedChars.add(char));
}
else
{
if (isFullRangeCovered(digitRange)) {
result.push('\\d');
[...digitRange].forEach(char => processedChars.add(char));
}
}
const remainingChars = [...charSet].filter(char => !processedChars.has(char));
if (remainingChars.length > 0) {
remainingChars.sort();
let ranges = createCharacterRanges(remainingChars);
result.push(ranges.join(''));
}
return result.join('');
}
let content = value.replace(/^\[|\]$/g, '');
content = preprocessCharacterClass(content);
const isNegated = value.startsWith('^');
let characters = new Set();
let ranges = content.match(/[^\\]-[^\\]/g) || [];
let withoutRanges = content.replace(/[^\\]-[^\\]/g, '');
let remainingParts = withoutRanges.match(/\\.|[^\\-]+|-/g) || [];
let parts = [...ranges, ...remainingParts];
parts.forEach(part => {
if (part.length === 3 && part[1] === '-' && !part.startsWith('\\')) {
expandCharacterRange(part).forEach(char => characters.add(char));
} else if (part === '\\w') {
'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789_'.split('').forEach(char => characters.add(char));
} else {
part.split('').forEach(char => characters.add(char));
}
});
let normalized = postProcessCharacterSet(Array.from(characters).sort().join(''));
return { normalized, isNegated };
}
node.children.forEach(child => optimizeAST(child, changes));
switch (node.type) {
case 'CHAR_CLASS':
const { normalized, isNegated } = normalizeCharacterClass(node.value);
switch (normalized) {
case '\\d':
case '0123456789':
node.type = 'PREDEFINED_CLASS';
changeDesc = `Replaced ${node.value} with`;
node.value = isNegated ? '\\D' : '\\d';
changes.add(`${changeDesc} ${node.value} to simplify regex`);
break;
case '\\w':
case 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789_':
node.type = 'PREDEFINED_CLASS';
changeDesc = `Replaced ${node.value} with`;
node.value = isNegated ? '\\W' : '\\w';
changes.add(`${changeDesc} ${node.value} to simplify regex`);
break;
case '\\s':
case ' \t\n\r\f\v':
node.type = 'PREDEFINED_CLASS';
changeDesc = `Replaced ${node.value} with`;
node.value = isNegated ? '\\S' : '\\s';
changes.add(`${changeDesc} ${node.value} to simplify regex`);
break;
default:
let content = node.value;
if (content.replace(/^\[|\]$/g, '') != normalized) {
changes.add(`${node.value} to [${normalized}] to simplify regex`);
node.value = '[' + normalized + ']';
}
}
break;
case 'QUANTIFIER':
let optimizationDone = false;
function optimizeBracketQuantifier(quantifier) {
const regex = /\{(\d+)(?:,(\d*))?\}(\?)?/;
const match = regex.exec(quantifier);
let optimizedQuantifier = quantifier;
let changeDesc = '';
if (match) {
const n = parseInt(match[1], 10);
const m = match[2] ? parseInt(match[2], 10) : null;
const isLazy = !!match[3];
if (m === null) {
if (match[0].includes(',')) {
if (n === 0) {
optimizedQuantifier = isLazy ? '*?' : '*';
changeDesc = `Optimized {0,} to ${optimizedQuantifier}`;
} else if (n === 1) {
optimizedQuantifier = isLazy ? '+?' : '+';
changeDesc = `Optimized {1,} to ${optimizedQuantifier}`;
}
} else {
optimizedQuantifier = `{${n}}`;
if(isLazy)
changeDesc = `Simplified {${n}}${isLazy ? '?' : ''} to {${n}}`;
}
} else {
if (n === m) {
optimizedQuantifier = `{${n}}`;
changeDesc = `Simplified {${n},${m}}${isLazy ? '?' : ''} to {${n}}`;
} else if (n === 0 && m === 1) {
optimizedQuantifier = isLazy ? '??' : '?';
changeDesc = `Optimized {0,1} to ${optimizedQuantifier}`;
} else {
optimizedQuantifier = `{${n},${m}}${isLazy ? '?' : ''}`;
changeDesc = `Retained quantifier {${n},${m}}${isLazy ? '?' : ''}`;
}
}
}
return { optimizedQuantifier, changeDesc };
}
if (node.children.length === 0) {
node.parent.children = node.parent.children.filter(child => child !== node);
changes.add(`Removed empty Quantifier`);
break;
}
let bracket = optimizeBracketQuantifier(node.value);
node.value = bracket.optimizedQuantifier;
if (bracket.changeDesc !== '') changes.add(bracket.changeDesc);
let isLazy = node.value.slice(-1) === '?' && node.value.slice(-2, -1) !== '?';
let baseValue = node.value;
let newQuantifier = '';
changeDesc = '';
switch (baseValue) {
case '{1}':
case '{1}?':
newQuantifier = '';
if (node.children.length > 0) {
const childNode = node.children[0];
node.type = childNode.type;
node.value = childNode.value;
node.children = childNode.children;
if (node.children.length > 0)
node.children.forEach(child => child.parent = node);
changes.add(`Removed redundant quantifier ${baseValue} and simplified to its child node ${childNode.type}`);
} else {
node.type = '';
node.value = '';
node.children = [];
changes.add(`Removed redundant quantifier ${baseValue} (no children to move up). THAT SHOULD NOT BE POSSIBLE`);
}
optimizationDone = true;
break;
case '{0}':
case '{0}?':
newQuantifier = '';
let removeNode = node.children.pop();
if (removeNode) {
node.parent.children = node.parent.children.filter(child => child !== node);
changes.add(`Removed node with zero quantifier ${baseValue}`);
}
optimizationDone = true;
break;
case '?':
case '*':
case '+':
break;
case '+?':
{
let childNode = node.children[0];
node.parent.children = node.parent.children.map(child => child === node ? childNode : child);
childNode.parent = node.parent;
changes.add(`Removed lazy +? quantifier and replaced with ${childNode.type}`);
optimizationDone = true;
}
break;
case '*?':
{
const optimizationTable = {
'null-null-literal': 'removeLazy',
'null-null-nonLiteral': 'keep',
'null-literal-literal': 'removeLazy',
'null-literal-nonLiteral': 'keep',
'null-nonLiteral-literal': 'removeLazy',
'null-nonLiteral-nonLiteral': 'keep',
'null-assertion-literal': 'removeLazy',
'null-assertion-nonLiteral': 'removeLazy',
'literal-null-literal': 'removeLazy',
'literal-null-nonLiteral': 'keep',
'literal-literal-literal': 'keep',
'literal-literal-nonLiteral': 'keep',
'literal-nonLiteral-literal': 'removeLazy',
'literal-nonLiteral-nonLiteral': 'keep',
'literal-assertion-literal': 'removeLazy',
'literal-assertion-nonLiteral': 'removeLazy',
'nonLiteral-null-literal': 'removeLazy',
'nonLiteral-null-nonLiteral': 'keep',
'nonLiteral-literal-literal': 'keep',
'nonLiteral-literal-nonLiteral': 'keep',
'nonLiteral-nonLiteral-literal': 'removeLazy',
'nonLiteral-nonLiteral-nonLiteral': 'keep',
'nonLiteral-assertion-literal': 'removeLazy',
'nonLiteral-assertion-nonLiteral': 'removeLazy'
};
function classifyNode(node) {
if (node === null) return 'null';
if (node.type === 'LITERAL' ) return 'literal';
if (node.type === 'PREDEFINED' && node.value != '\b' && node.value != '\B' ) return 'literal';
if (node.type === 'CHAR_CLASS' ) return 'literal';
if (node.type === 'ANCHOR_START' || Node.type === 'ANCHOR_END') return 'assertion';
if (node.type === 'PREDEFINED'&& (node.value === '\\b' || node.value === '\\B' ) ) return 'assertion';
return 'nonLiteral';
}
const nextNode = nextSibling(node);
const prevNode = previousSibling(node);
const prevNodeType = classifyNode(prevNode);
const nextNodeType = classifyNode(nextNode);
const quantifierNodeType = classifyNode(node.children[0]);
const decisionKey = `${prevNodeType}-${nextNodeType}-${quantifierNodeType}`;
const action = optimizationTable[decisionKey];
const isAnchorOrLookaround = nextNode && ['ANCHOR_START', 'ANCHOR_END', 'LOOKAHEAD', 'NEGATIVE_LOOKAHEAD', 'LOOKBEHIND', 'NEGATIVE_LOOKBEHIND'].includes(nextNode.type);
const isWordBoundary = nextNode && nextNode.type === 'PREDEFINED_CLASS' && (nextNode.value === '\\b' || nextNode.value === '\\B');
switch (action) {
case 'remove':
let childNode = node.children[0];
node.type = childNode.type;
node.value = childNode.value;
node.children = childNode.children;
node.children.forEach(child => child.parent = node);
changes.add(`Removed redundant lazy quantifier entirely followed by ${childNode.type}`);
optimizationDone = true;
break;
case 'removeLazy':
node.value = '*';
changes.add(`Removed lazy quantifier and kept greedy * for scenario: ${decisionKey}`);
optimizationDone = true;
break;
case 'keep':
changes.add(`Kept lazy quantifier *? for scenario: ${decisionKey}`);
optimizationDone = true;
break;
default:
changes.add(`No action for scenario: ${decisionKey}`);
break;
}
}
break;
case '??':
let childNode = node.children[0];
node.parent.children = node.parent.children.map(child => child === node ? childNode : child);
childNode.parent = node.parent;
changes.add(`Removed lazy ?? quantifier and replaced with ${childNode.type}`);
optimizationDone = true;
break;
}
if (!optimizationDone && (newQuantifier )) {
node.value = newQuantifier + (isLazy ? '?' : '');
changes.add(changeDesc + node.value);
}
break;
case 'NON_CAPTURING':
if (node.children.length === 0) {
node.parent.children = node.parent.children.filter(child => child !== node);
changes.add(`Removed empty non-capture`);
}
if (node.children.length === 1) {
let child = node.children[0];
if (['LITERAL', 'PREDEFINED_CLASS', 'CHAR_CLASS','GROUP'].includes(child.type) || (child.type === 'QUANTIFIER' && !/[\{\}]/.test(child.value))) {
node.type = child.type;
node.value = child.value;
node.children = child.children;
node.children.forEach(c => c.parent = node);
changes.add(`Removed unnecessary non-capturing group `+(child.value?`around ${child.value}`:``));
} else if (node.children.every(child => child.type === 'ALTERNATION') &&
(nextSibling(node)==null||nextSibling(node).type==='ANCHOR_END')) {
node.type = 'ALTERNATION';
node.value = '';
node.children = node.children[0].children;
node.children.forEach(c => c.parent = node);
changes.add(`Removed unnecessary non-capturing group around alternation.`);
}
}
if (node.parent && ['ANCHOR_START', 'ANCHOR_END'].includes(node.parent.type)) {
node.type = node.children[0].type;
node.value = node.children[0].value;
node.children = node.children[0].children;
node.children.forEach(c => c.parent = node);
changes.add(`Removed unnecessary non-capturing group at start(^)/end($) of regex.`);
}
break;
case 'ALTERNATION': {
function commonPrefix(a, b) {
let minLength = Math.min(a.length, b.length);
for (let i = 0; i < minLength; i++) {
if (a[i] !== b[i])
return a.substring(0, i);
}
return a.substring(0, minLength);
}
function commonSuffix(a, b) {
let minLength = Math.min(a.length, b.length);
for (let i = 0; i < minLength; i++) {
if (a[a.length - 1 - i] !== b[b.length - 1 - i])
return a.substring(a.length - i);
}
return a.substring(a.length - minLength);
}
function identifyAlternationScenario(children) {
const isCharClassOrPredefined = children.every(child => {
return ['CHAR_CLASS', 'PREDEFINED_CLASS'].includes(child.type) ||
(child.type === 'LITERAL' && child.value.length === 1);
});
const isLiteralGroups = children.every(child =>
child.type === 'LITERAL'
);
const isSingleLiteralGroups = children.every(child =>
(child.type === 'LITERAL' && child.value.length === 1)
);
return { isCharClassOrPredefined, isLiteralGroups, isSingleLiteralGroups };
}
const { isCharClassOrPredefined, isLiteralGroups, isSingleLiteralGroups } = identifyAlternationScenario(node.children);
if (isCharClassOrPredefined) {
const mergedClasses = node.children.map(child => child.value.replace(/^\[|\]$/g, '')).join('');
const { normalized, isNegated } = normalizeCharacterClass(mergedClasses);
node.value = `[${isNegated ? '^' : ''}${normalized}]`;
node.children = [];
node.type = 'CHAR_CLASS';
changes.add(`Merged character classes in alternation: ${mergedClasses}`);
break;
}
if (isLiteralGroups && !isSingleLiteralGroups) {
const values = node.children.map(literal => literal.value);
const hasEmptyAlternation = values[values.length - 1] === '';
const prefix = values.reduce(commonPrefix);
const suffix = values.reduce(commonSuffix, values[0]);
if (prefix.length > 0 && suffix.length === 0) {
const groupNode = new RegexNode('GROUP', '', node.parent);
groupNode.addChild(new RegexNode('LITERAL', prefix, groupNode));
const remaining = values.map(value => value.substring(prefix.length));
if (remaining.every(part => part.length === 1)) {
const charClass = `[${remaining.join('')}]`;
groupNode.addChild(new RegexNode('CHAR_CLASS', charClass, groupNode));
changes.add(`Packed into character class after prefix: ${prefix}${charClass}`);
} else {
const alternationNode = new RegexNode('ALTERNATION', '', groupNode);
remaining.forEach(part => {
alternationNode.addChild(new RegexNode('LITERAL', part, alternationNode));
});
groupNode.addChild(alternationNode);
changes.add(`Factored out common prefix with alternation: ${prefix}`);
}
node.type = groupNode.type;
node.value = groupNode.value;
node.children = groupNode.children;
node.children.forEach(child => child.parent = node);
break;
}
if (suffix.length > 1) {
const remainingPrefixes = values.map(value => value.substring(0, value.length - suffix.length));
if (remainingPrefixes.every(prefix => prefix.length === 1)) {
const charClass = `[${remainingPrefixes.join('')}]`;
const groupNode = new RegexNode('GROUP', '', node.parent);
groupNode.addChild(new RegexNode('CHAR_CLASS', charClass, groupNode));
for (let i = 0; i < suffix.length; i++) {
groupNode.addChild(new RegexNode('LITERAL', suffix[i], groupNode));
}
node.type = groupNode.type;
node.value = groupNode.value;
node.children = groupNode.children;
node.children.forEach(child => child.parent = node);
changes.add(`Factored out common suffix in alternation: ${suffix}`);
break;
}
}
}
if (isSingleLiteralGroups) {
const literalValues = node.children.map(child => child.value);
const hasEmptyAlternation = literalValues[literalValues.length - 1] === '';
if (literalValues.length > 0) {
let charClass = `[${literalValues.join('')}]`;
const { normalized, isNegated } = normalizeCharacterClass(charClass);
charClass = `[${isNegated ? '^' : ''}${normalized}]`;
const charClassNode = new RegexNode('CHAR_CLASS', charClass, node.parent);
if (hasEmptyAlternation) {
const quantifierNode = new RegexNode('QUANTIFIER', '?', node.parent);
quantifierNode.addChild(charClassNode);
node.type = quantifierNode.type;
node.value = quantifierNode.value;
node.children = quantifierNode.children;
} else {
node.type = 'CHAR_CLASS';
node.value = charClass;
node.children = [];
}
changes.add(`Optimized alternation of single literals into character class: ${charClass}`);
break;
}
}
break;
}
case 'LOOKAHEAD':
case 'NEGATIVE_LOOKAHEAD':
if (node.children.length === 0) {
node.parent.children = node.parent.children.filter(child => child !== node);
changes.add(`Removed empty lookahead`);
break;
}
let lookaheadContent = node.children.map(child => child.value || '').join('');
let parent = node.parent;
if (parent && parent.children && parent.children.length > 0) {
let nextLiterals = parent.children.slice(parent.children.indexOf(node) + 1)
.filter(child => child.type === 'LITERAL')
.map(child => child.value)
.join('');
if (lookaheadContent === nextLiterals) {
parent.children = parent.children.filter(child => child !== node);
changes.add(`Removed redundant lookahead: (?=${lookaheadContent})`);
}
}
if (parent && parent.children.filter(c => c.type.endsWith('LOOKAHEAD')).length > 1) {
let lookaheadSiblings = parent.children.filter(c => c.type.endsWith('LOOKAHEAD'));
let combinedContent = lookaheadSiblings.map(la => la.children.map(c => c.value).join('')).join('|');
let combinedLookahead = new RegexNode(node.type, null, parent);
let combinedAlternationNode = new RegexNode('ALTERNATION', combinedContent, combinedLookahead);
combinedLookahead.addChild(combinedAlternationNode);
parent.children = parent.children.filter(c => !lookaheadSiblings.includes(c));
parent.addChild(combinedLookahead);
changes.add(`Combined multiple lookaheads into: (?=${combinedContent})`);
}
if (node.children.some(child => child.type === 'QUANTIFIER')) {
node.children.forEach(child => {
if (child.type === 'QUANTIFIER') {
let quantifierValue = child.value;
let quantifiedNode = child.children[0];
changes.add(`Detected quantifier (${quantifierValue}) in lookahead for: ${quantifiedNode.value}`);
}
});
}
break;
case 'LOOKBEHIND':
case 'NEGATIVE_LOOKBEHIND': {
let parent = node.parent;
if (node.children.length === 0) {
node.parent.children = node.parent.children.filter(child => child !== node);
changes.add(`Removed empty lookbehind`);
break;
}
if (parent && parent.children.filter(c => c.type.endsWith('LOOKBEHIND')).length > 1) {
let lookbehinds = parent.children;
let consecutiveLookbehinds = [];
let previousIsLookbehind = false;
lookbehinds.forEach((child, index) => {
if (child.type.endsWith('LOOKBEHIND')) {
if (previousIsLookbehind) {
consecutiveLookbehinds.push(child);
} else {
consecutiveLookbehinds = [child];
}
previousIsLookbehind = true;
} else {
if (consecutiveLookbehinds.length > 1) {
combineLookbehinds(consecutiveLookbehinds, parent);
changes.add(`Combined consecutive lookbehinds into: (?<=${getCombinedContent(consecutiveLookbehinds)})`);
}
previousIsLookbehind = false;
consecutiveLookbehinds = [];
}
});
if (consecutiveLookbehinds.length > 1) {
combineLookbehinds(consecutiveLookbehinds, parent);
changes.add(`Combined consecutive lookbehinds into: (?<=${getCombinedContent(consecutiveLookbehinds)})`);
}
}
function combineLookbehinds(lookbehinds, parent) {
let combinedLookbehind = new RegexNode(lookbehinds[0].type, null, parent);
let alternationNode = new RegexNode('ALTERNATION', '', combinedLookbehind);
lookbehinds.forEach(lb => {
lb.children.forEach(child => {
let literalNode = new RegexNode('LITERAL', child.value, alternationNode);
alternationNode.addChild(literalNode);
});
});
combinedLookbehind.addChild(alternationNode);
let indexOfFirstLookbehind = parent.children.indexOf(lookbehinds[0]);
parent.children = parent.children.filter(c => !lookbehinds.includes(c));
parent.children.splice(indexOfFirstLookbehind, 0, combinedLookbehind);
}
function getCombinedContent(lookbehinds) {
return lookbehinds.map(lb => lb.children.map(c => c.value).join('|')).join('|');
}
let lookbehindContent = node.children.map(child => child.value || '').join('');
if (parent && parent.children.length > 0) {
let lookbehindIndex = parent.children.indexOf(node);
let prevLiterals = parent.children.slice(lookbehindIndex-1, lookbehindIndex)
.filter(child => child.type === 'LITERAL')
.map(child => child.value)
.reverse()
.join('');
if (lookbehindContent === prevLiterals) {
parent.children = parent.children.filter(child => child !== node);
changes.add(`Removed redundant lookbehind: (?<=${lookbehindContent})`);
}
}
if (node.children.some(child => child.type === 'QUANTIFIER')) {
node.children.forEach(child => {
if (child.type === 'QUANTIFIER') {
let quantifierValue = child.value;
let quantifiedNode = child.children[0];
changes.add(`Detected quantifier (${quantifierValue}) in lookbehind for: ${quantifiedNode.value}`);
}
});
}
break;
}
}
return Array.from(changes).join('\n');
}