Regex+ - Extended Regular Expression Engine

Mirzakhmet Syzdykov

5.00/5 (2 votes)

17 Feb 2021MIT2 min read

6.1K

Description of the newly available Regex+ engine for Java

Recent work in the field of the development of regular expression engines doesn't adhere to the extended regular expressions with the support of intersection, subtraction and complement operators in linear time and memory. New development idea and elaborate efforts makes it possible to use the extended operators in regular expression.

Introduction

The modern regular expression engines don't support extended operators like intersection, subtraction and complement. With our new idea of the development and elaborate efforts, we make it possible to use the extended operators in regular expression.

Extended Regular Expression Engine

As stated before, there's more of theoretical background to study to build the fully POSIX-compliant regular expression engine supporting intersection, subtraction and complement. For this purpose, we override the states with counter-state using the decision making when choosing whether to branch for further evaluation of the matching process.

For parsing purposes, we use the following notation for intersection - "&", subtraction - "-" and complement - "~". Thus, to write the expression for subtraction, for example, we can set "(Hello|world)-world".

The present time full support of our engine for the extended regular expressions is as follows in BNF-notation:

Java

R =
    (A | ["~"] /* complement */ (R) | "." | "[" A* "]") ["*", "+", "?"],
    R1 R2,
    R1 | R2,
    R1 & R2 /* intersection */
    R1 - R2 /* subtraction */

Using the Code

The code usage is defined by first parsing the expression by class Parser. As the parsed data are ready, it's time to build automaton for the abstract node produced by parser. The streaming is vital and the author has implemented the streaming abstract classes and interfaces so that the user can use custom-defined stream, for example, by networking.

Let's see the code usage which can be found in tests prepared for Regex+ engine:

Java

import com.regexplus.automaton.model.Automaton;
import com.regexplus.automaton.model.StringStream;
import com.regexplus.match.common.IMatch;

String pattern = "~(((a+|a*)-aa)&aaa)";
String string = "ba";

Automaton automaton = new Automaton();

automaton.build(new StringStream(pattern));

if (automaton.matches(new StringStream(string))) {
	List<imatch> matches = automaton.match(new StringStream(string));
	
	System.out.println("Start: " + matches.get(0).start() + ", 
                        Length: " + matches.get(0).length());
}</imatch>

Basic Principles

Our basic principle is to create the INode-interface and Node-class along the Parser-class as the parsing engine. Thus, we have to re-define the parsing technique in context-free manner and propose the new node class by overriding the defaults (like NodePaired or Node), if we have to create the new branch for our BNF-grammar.

For the purpose of the non-deterministic finite automata, there's the set of classes and interfaces in the package com.regexplus.automaton like IState, State. Thus to create the new state, just override the defaults or re-implement the interface IState.

There's only matching algorithm in the engine, so we can re-develop the existing one.

Step by Step Walk-throughs

As we have researched the topic of ERE, there's an underlying problem before our revolutionary solution which includes the quadratic time and space to solve the ERE-matching problem. Our engine works linearly dealing with decision-making counter overridden states which operate in O(1). Moreover, there's a precise concept behind that.

Conclusion and Points of Interest

So the intersection, subtraction and complement in regular expression engine is possible in linear polynomial time with our revolutionary approach.

The GitHub repository is wide open for our community to be supported and extended by you. So please feel free to drop me a comment or join my project.

History

26^th September, 2020: Initial version

License

This article, along with any associated source code and files, is licensed under The MIT License