Introduction
SR2JLIB is a Netbeans project supplying a Java library for grammar-guided Symbolic Regression (SR) [Koza93], powered by Genetic Programming (GP) [Koz94, WHM+97], and more specifically Grammar-Guided Genetic Programming (GGGP) [GCA+08].
The approach realized in this library differs from standard GP in that the population is placed on a 2D grid. The latter defines a habitat for the individuals which are reproducing in parallel and thus there is no notion of staging. The reproduction process is ensured by multiple parallel threads randomly choosing individuals from the current population and allowing them to reproduce trying to settle the offsprings in the pre-defined neighborhood of their ancestors. Each time a new individual is created, it is attempted to be placed in the neighboring cells based on tournament or probabilistic tournament selection.
It is important to note that one can also limit the life-time of an individual. This is done by defining the minimum and maximum number of reproductions. The actual number of reproductions is then chosen based on the individual's fitness but within the pre-specified bounds.
The realized GP procedure allows to fit (multi-dimensional) vector functions to data. SR can be done for vector function sub-components simultaneously or in parallel. The functions allowed by the library are arbitrary Java numeric expressions and allow using conditional expressions of the form:
<boolean expression> ? <numeric expression> : <numeric expression>
Moreover, numeric expressions allow for using any of the java.lang.Math
class function primitives. Our strong belief is that cross over operation when breeding function mostly makes little sense. Therefore, the only genetic operations supported are mutations which can be of the next two types:
- Mutating an existing expression into a completely new one of the same type
- Mutating the given function into another function with the same arity and with the same arguments
Another feature of the library is that it allows for Just In Time compilation (JIT) of the individuals in a form Java classes and dynamic loading thereof into JVM. The latter is done seamlessly for the library's user and allows one to compute individual's fitness from Java code or using Java Native Interface (JNI) in order to compute fitness from, e.g., C or C++ code.
Last but not least, is that the grammar provided for building expressions is probabilistic, each grammatical expression can be given a weight that defines its likelihood to be taken when choosing between expressions of the same type.
Using Software
Using software requires Netbeans IDE 8.2 or later and JDK v1.8 or later. Sample projects using the library are provided by:
- https://github.com/ivan-zapreev/SR2JLIB_EX
- https://github.com/ivan-zapreev/SCOTS2SR
The former is a Java-only project giving a basic example of how the library can be used, it contains multiple comments and is self-explanatory. The latter is a more involved JNI based project for symbolic fitting of SCOTSv2.0 https://gitlab.lrz.de/matthias/SCOTSv0.2 BDD controllers [Run16]. We also suggest taking a thorough look at the Java Doc of the library stored in the ./api/ folder of the project.
It is a known issue with the Java 8 JIT compiler that its caches can become full thus preventing further individuals' compilation and potentially preventing the library to function properly. Therefore, in case a JIT compilation is to be used, see the section on library interfaces further in the text, we suggest supplying Java with the following command line arguments:
-XX:InitialCodeCacheSize=1024m
-XX:ReservedCodeCacheSize=2048m
-XX:+UseCodeCacheFlushing
These are to be provided as Java command line parameters of the application using SR2JLIB
. If needed, the cache sizes can be increased even further.
Main Concepts
This section explains how the library can be used by specifying:
- The way the grammars are to be specified
- The library interfaces to be used
- The observers and listeners to be implemented
Grammar
The grammar
is needed to define valid numeric and boolean expressions to be used as vector function components. It is defined as a set of new-line-separated grammar entries:
<grammar> := <entry> | <grammar> \n <entry>
Each grammar entry
defines an expression type
and consists of the type name and one or more semicolon-separated expressions of this type:
<entry> := (<type> := <expression>) | <entry> ; <expression>
Each type
is a string
and there are several-predefined (and thus reserved) types available by default:
R - real expression
B - boolean expression
L - logical constant
V - variable
D - double constant
Note that, L
, V
, and D
are build-in and R
with B
are to be specified by the user. Boolean expressions (B
) only require specification if are implicitly or explicitly used to define R
. Specifying R
is compulsory, as any genetically breed function will be of type R
.
Each expression
is defined by:
- a function (some numeric/boolean expression);
- its argument types;
- an optional
weight
- a positive double defining the probability distribution over expressions of the same type (if omitted, the default value is 1.0):
<expression> := [<function>](<types>) |
[<function>](<types>)@<weight>
Here, function
is a valid Java expression (boolean or numerical depending on the context) which uses parameters named x
followed by indexes starting with 1
(i.e. x1, x2, x3). Function parameters define function arguments and their types are given by the types
entry in the definition above. The latter is a comma-separated list:
<types> := <type> | <types>, <type>
where the number of elements in the list must corresponds to the number of distinct function
parameters. Moreover, the parameter index matches the corresponding type
position in the list.
Any function
may use the java.lang.Math
class by referencing it with the $
sign. For instance $sin(x1)
will be interpreted as Math.sin(x1)
. Consider the following valid expression examples:
[x1 - $floor(x1)](D)
[$sin(x1)*x2](V,D)
[$sin(x2)*x1](D,V)
Note that here, [$sin(x1)*x2](V,D)
and [$sin(x2)*x1](D,V)
are equivalent and both result in a variable argument sinus multiplied with a real constant.
The last thing to note is that the textual grammar description allows for comments which should start with //
and last until the end of the current line.
For completeness, below we give an example grammar which assumes uniform probability distribution over all expressions except for [$abs(x1)](R)@0.1
and x1](L) @0.01
which are given a smaller chance to be used (by setting the weights to 0.1 and 0.01 respectively):
R := [x1](D); [x1](V); [$abs(x1)](R)@0.1; [$sin(x1)](R); [$cos(x1)](R);
[$sinh(x1)](R); [$cosh(x1)](R); [$tan(x1)](tR); [$tanh(x1)](R); [$acos(x1)](aR);
[$asin(x1)](aR); [$sqrt(x1)](npR); [$cbrt(x1)](R); [$ceil(x1)](R); [$floor(x1)](R);
[$log(x1)](lR); [$log10(x1)](lR); [$max(x1,x2)](R,R) ; [$min(x1,x2)](R, R) ;
[$pow(x1,x2)](R,pI); [$signum(x1)](R); [-x1](R); [x1/x2](R,nR); [x1*x2](R,R);
[x1+x2](R,R); [x1-x2](R,R) ; [x1 ? x2 : x3](B,R,R)
B := [x1](L) @0.01; [x1!=x2](R,R); [x1==x2](R,R); [x1<x2](R,R); [x1<=x2](R,R);
[x1>x2](R,R); [x1>=x2](R,R); [!x1](B) ; [x1&&x2](B,B); [x1||x2](B,B)
//User defined by demand:
aR := [x1 - $floor(x1)](R) //domain for acos/asin
lR := [(x1<1.0e-4)?1.0e-4:x1](pR) //domain for log/log10
nR := [(x1==0.0)?1.0e-10:x1](R) //non-zero real
pR := [$abs(x1)](nR) //positive real
npR := [$abs(x1)](R) //non-negative real
pI := [$floor(1.0/x1)](nR) //positive integer
tR := [x1 - $floor(x1/$PI)*$PI](R) //domain for tangent
//Form: [function](arguments)@weight
//"$" - is replaced with "Math."
//Arguments are comma separated
//The variable arguments for the java code
//are "args[idx]" where idx begins with 0
//Standard argument types:
//R - real expression
//B - boolean expression
//L - logical constant
//V - variable
//D - double constant
Main Interfaces
SR2JLIB
allows to run multiple SR processes in parallel on separate grids. Each of such processes is encapsulated inside an instance of a nl.tudelft.dcsc.sr2jlib.ProcessManager
class. A single process manager can breed multi-dimensional vector functions. Each of the vector function dimension functions is defined by some grammar. Such a grammar can be individual per dimension or shared with some other dimension within the same or between different instances of ProcessManager
. A grammar is defined by an instance of the nl.tudelft.dcsc.sr2jlib.grammar.Grammar
class. Each instance thereof is to be supplied with a configuration object of type nl.tudelft.dcsc.sr2jlib.grammar.GrammarConfig
. A general pattern for instantiating and setting up grammars is given below by means of an example:
Grammar.clear_grammars();
final GrammarConfig g_cfg00 = new GrammarConfig(...);
final Grammar grammar00 = Grammar.create_grammar(g_cfg00);
Grammar.register_grammar(0, 0, grammar00);
final GrammarConfig g_cfg01 = new GrammarConfig(...);
final Grammar grammar01 = Grammar.create_grammar(g_cfg01);
Grammar.register_grammar(0, 1, grammar01);
Grammar.register_grammar(1, 0, grammar01);
Grammar.prepare_grammars();
Here grammar00
and grammar01
are used to breed two-dimensional vector functions in an instance of a ProcessManager
with a unique identifier 0. In addition, grammar01
is used to breed single-dimensional vector functions in an instance of a ProcessManager
with a unique identifier 1.
Once the grammars are instantiated and registered, one is to instantiate process managers. These are to be configured with instances of the nl.tudelft.dcsc.sr2jlib.ProcessManagerConfig
class. Further, each process manager can be individually started - to begin its symbolic regression; and stopped - to stop the genetic breeding process.
final ProcessManagerConfig config = new ProcessManagerConfig(...)
final ProcessManager manager = new ProcessManager(config);
manager.start();
manager.stop(true);
As one can see, all of the Grammar
and ProcessManager
class parameters are encapsulated in the corresponding configuration objects. The latter, including the suggested and default values will be discussed in the next section. Also note that stopping the process manager is typically done once a sufficiently fit individual is found. Doing that, will be discussed in the subsequent section on listeners, observers and computers.
Configuration Objects
Let us consider the two configuration objects GrammarConfig
and ProcessManagerConfig
.
The GrammarConfig
is used for grammar configuration and has the following constructor:
public GrammarConfig(final String grammar,
final int max_ts, final double ch_vs_rep,
final int num_vars, final double min_node_grow,
final double max_node_grow, final boolean is_prop_pnodes,
final int max_gd, final double tm_vs_ntm) {...}
Here, we suggest the following default values:
ch_vs_rep = 0.5;
min_node_grow = 0.8;
max_node_grow = 1.2;
is_prop_pnodes = false;
tm_vs_ntm = 0.5;
Choosing the value for max_ts
depends on a concrete grammar
in a sense that a grammar using many Math
class methods will result in hard-to-compute functions which will likely have a negative impact on the performance of fitness computations. In such cases, the tree size may be chosen to be smaller than in case of, e.g., a grammar defining polynomials. In general, the tree size defines the number of nodes in the tree. Each expression node adds one to the tree size.
Similarly, max_gd
- maximum grammar depth (MGD) is also grammar
specific. It should be set to an upper bound of the MGD which is defined as the maximum (over all functions) of minimum (over all function instances) expression sizes. For simplicity, one can simply set max_gd
to some rather high value to, e.g. 1000
, and try setting up the grammar. If the actual MGD turns out to be larger, which is very unlikely, a corresponding error will be reported. Setting the value of max_gd
above the actual MGD does not result in any performance overheads. It is possible however to create a grammars with unbounded MGD, such grammars are not supported as they are not feasible, e.g.:
R := [x1+x2](R,R)
Clearly, R
is defined recursively but without explicit or implicit use of any terminal expressions. To fix this grammar and make it bounded, one needs to simply add a terminal expression, such as:
R := [x1](D); [x1+x2](R,R)
The maximum grammar depth in this case will be 3 and as a rule of thumb: "A grammar will have a bounded MGD if each grammar expression can be instantiated as a finite expression tree".
The ProcessManagerConfig
is used to configure a process manager and has the following constructor:
public ProcessManagerConfig(
final int mgr_id,
final double init_pop_mult,
final int num_workers,
final long max_num_reps,
final int num_dofs,
final int size_x, final int size_y,
final int ch_sp_x, final int ch_sp_y,
final SelectionType sel_type,
final boolean is_allow_dying,
final int min_chld_cnt,
final int max_chld_cnt,
final GridObserver observer,
final FinishedCallback done_cb){...}
The default suggested values for this configuration class are as follows:
init_pop_mult = 0.1;
num_workers = 20;
max_num_reps = Long.MAX_VALUE;
size_x = 30;
size_y = 30;
ch_sp_x = 1;
ch_sp_y = 1;
is_allow_dying = false;
min_chld_cnt = 0;
max_chld_cnt = 0;
sel_type = SelectionType.VALUE;
The value of mgr_id
depends on the number of managers, we recommend using continuous (in the non-negative integer domain) manager ids starting from 0
. The value of num_dofs
is problem specific.
Last but not least, done_cb
and observer
provide the call-back objects. The former is just an object realizing a functional interface to be called once the process manager has finished the SR procedure. The latter is a grid observing object allowing to monitor all of the population changes. We shall discuss these and other interfaces in the next section.
Listeners, Observers, and Computers
There are several main interfaces that are to be realized by the application using SR2JLIB
, the first three of them are:
nl.tudelft.dcsc.sr2jlib.FinishedCallback
- a functional interface to be used by the ProcessManager
to notify the user that the SR is finished nl.tudelft.dcsc.sr2jlib.grid.GridObserver
- an interface allowing to monitor the symbolic regression process, i.e., its start, stop and adding/killing new/old individuals on the grid nl.tudelft.dcsc.sr2jlib.ErrorListener
- an optional error listener to monitor exceptions and errors occurring while SR. For instance, a broken grammar can result in non-compilable individual classes and these exceptions will be reported through an instance of this interface.
Objects of classes implementing FinishedCallback
and GridObserver
are provided as arguments for the ProcessManagerConfig
class constructor. An object implementing ErrorListener
is to be set explicitly from the code through calling the:
public ErrorListener set_listener(final ErrorListener new_el){...}
method of the nl.tudelft.dcsc.sr2jlib.err.ErrorManager
class.
It is important to note that since GridObserver
provides an interface for monitoring the population manager's individuals, it is a custom to check for individuals' fitness in order to stop the corresponding process manager from within this interface implementation.
Other interfaces, at least one of which needs to be implemented, are used for individual's fitness computations. These are given by the abstract
classes located in the nl.tudelft.dcsc.sr2jlib.fitness
package and require implementing their overloaded:
public abstract Fitness compute_fitness(final int mgr_id, ...);
methods. The concrete abstract
class to inherit from, and thus the compute_fitness
method to be implemented, depends on user preferences. We classify them based on what is provided as an argument for the compute_fitness
method:
FitnessComputerExpression
- the individual's vector function expression trees FitnessComputerString
- the individual's vector function serialized as Java expression strings FitnessComputerClass
- the individual's vector function compiled into a java class FitnessComputerInstance
- the individual's vector function compiled and instantiated as a java object
In order to set the fitness computer class as the one to be used, it is required using the:
public static void set_inst(final FitnessComputerExpression inst) {...}
method of the FitnessManager
class.
Expression Trees
Each individual's vector function dimension is represented in a form of a Java numeric expression. The latter is initially stored in a form of a tree where non-terminal nodes correspond to functions (numeric or boolean expressions) and terminal nodes correspond to numerical or boolean constants, or free variables. The classes used to form expression trees are stored in the nl.tudelft.dcsc.sr2jlib.grammar.expr
package. Each tree node is an instance of the Expression
class. Non-terminal nodes are instances of the FunctExpr
class and terminal ones of the TermExpr
class. The latter has three child classes:
BConstExpr
- a boolean constant expression NConstExpr
- a numerical constant expression VarExpr
- a double free variable expression
All of these have interface functions in order to traverse through the expression tree and get information of each of its nodes. For more detail consider reading the Java Doc.
Licensing
This is a free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This software is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License version 3 (GPL3) for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.
Literature
- [Koza93] John R. Koza; Martin A. Keane; James P. Rice (1993). "Performance improvement of machine learning via automatic discovery of facilitating functions as applied to a problem of symbolic system identification" (PDF). IEEE International Conference on Neural Networks. San Francisco: IEEE. pp. 191–198.
- [Koz94] John R. Koza. Genetic programming as a means for programming computers by natural selection. Statistics and Computing, 4(2):87–112, 1994.
- [WHM+97] M. J. Willis, H. G. Hiden, P. Marenbach, B. McKay, and G. A. Montague. Genetic programming: an introduction and survey of applications. In Second International Conference On Genetic Algorithms In Engineering Systems: Innovations And Applications, pages 314–319, Sep 1997.
- [GCA+08] Manrique Gamo, Daniel and Ríos Carrión, Juan and Rodríguez-Patón Aradas, Alfonso (2008). Grammar-Guided Genetic Programming. In: "Encyclopedia of Artificial Intelligence". Information Science Reference, EEUU, pp. 767-773.
- [Run16] Rungger, M. and Zamani, M. (2016). SCOTS: A tool for the synthesis of symbolic controllers. In Proceedings of the 19th International Conference on Hybrid Systems: Computation and Control, HSCC, 99–104.