This article introduces an embedded AWK interpreter that can be called from a C/C++ program.
Introduction
AWK is a language of considerable influence and prestigious lineage. For the shortest (and funniest) of introductions, check the above comic of fellow Montrealer Julia Evans. AWK is an easy to use scripting language created by Alfred Aho (of Dragon Book fame), Peter Weinberger, and Brian Kernighan (of K&R fame). With such illustrious parents, it's not surprising that AWK quickly became one of the most popular scripting languages. It also promoted innovative concepts, like associative arrays, way before they became mainstream.
AWK is at its best when you have some data that you have to "massage" to fit into a database. Of course, you could run the data through AWK before passing it to the database engine but if you want to use a small embedded database engine like SQLITE and you want to have everything in the same program, it would be nice to have the scripting language embedded in your C/C++ program. I couldn't find an AWK interpreter that could be embedded in C so I decided to make my own. This article describes the resulting code. A passing familiarity with AWK helps in following along but it's not required. If you want to brush-off your AWK skills, you can check this tutorial.
Sample Usage
Let's start with a short example as an "executive summary" of what you can do with the co.
The popular Unix wc program counts the lines, words and bytes in a file. Here is a stripped-down implementation using my AWK library:
#include <awklib.h>
int main (int argc, char **argv)
{
AWKINTERP* interp = awk_init (NULL); awk_setprog (interp, "{wc += NF; bc += length($0)}\n"
"END {print NR, wc, bc, ARGV[1]}");
awk_compile (interp); if (argc > 1)
awk_addarg (interp, argv[1]); awk_exec (interp); awk_end (interp); }
The program outlines the basic steps to have to follow to use the AWK library:
- First, you must create an interpreter object. All the other API functions take an interpreter as the first argument.
- You pass a program to the interpreter. As we will see, there are more ways of passing programs to the interpreter, but this is the simplest one.
- The program must be "compiled". The interpreter doesn't really compile it, but it produces a syntax tree that will be used in the execution phase.
- You can add input files to be processed by the AWK program. In this case, we add
argv[1]
, the first argument that was received by the C program. - The AWK program gets executed. By default, the output is sent to
stdout
. - In the end, the interpreter object is deleted.
Design Decisions
The first question I had to answer was what version of AWK code should I use. There are many variants of awk: gawk, mawk, etc. In the end, I decided to use Dr. Kernighan's original onetrueawk. The code can be found at https://github.com/onetrueawk/awk and it gives the sensation of being in a computer science museum. Just an example: one of the test files seems to be the etc/passwd file of a Unix system. It goes like this:
/dev/rrp3:
17379 mel
16693 bwk me
16116 ken him someone else
...
8895 dmr
Assuming bwk stands for Brian W. Kernighan, I'll let you figure who ken and dmr might be 😊. For added antique flavor, /dev/rrp3: indicates the file was residing on a DEC RP03, RP04 or RP06. The RP06 were huge disks for their time with 128MB of data. Yay!
Given the historical quality of the code, I would have liked to keep it as a pure C project. Unfortunately, this was not possible mainly due to error handling issues. For a stand-alone program, it is perfectly acceptable to bail out in case of error; an embedded interpreter doesn't have this option. My solution was to wrap large portions of code in try
...catch
blocks. This idea of keeping it as a C product, at least on the outside, is the reason why I didn't organize the API as a C++ object. One part that had to be completely rewritten was the function call mechanism.
The API is not very large; I've tried to keep it to a minimum and the plan is to add new functions only if really needed.
API Description
The API is structured around an opaque AWKINTERP
object representing the AWK language interpreter. Its life cycle follows a series of irreversible state transitions: initialization, program compilation, program execution and destruction.
Access to interpreter's variables is done through an awksymb
structure:
struct awksymb {
const char *name; const char *index; unsigned int flags; double fval; char *sval; };
The same structure is used to pass parameters to an AWK callable C function (see awk_setfunc
API call).
In AWK, variables don't have a defined type or, better said, they are all strings and sometimes get converted to numbers if they are needed for a numerical operation. In the awksymb
structure, the flags
member indicates if the symbol is a string
, in which case sval
member is valid, or a number with the value in fval
.
Arrays are also special in AWK. As I said before, all arrays are associative and are "indexed" by a character string. If the flag AWK_ARRAY
is set in the flags
member, the variable is an array and the index
member represents the array index.
Following is a brief description of each API function.
awk_init
AWKINTERP* awk_init (const char **vars);
The function initializes a new AWK interpreter object. It takes an array of variable definitions with the same format as the -v
command-line arguments of stand-alone AWK interpreter. The array is terminated with a NULL string
.
awk_setprog
int awk_setprog (AWKINTERP* pi, const char *prog);
Set the program text for an interpreter. This function can be called only once for an interpreter.
awk_addprogfile
int awk_addprogfile (AWKINTERP* pi, const char *progfile);
Adds the content of a file as AWK program. The functionality is equivalent with the -f
switch on the command line of stand-alone interpreter. Just like the -f
switch, this function can be called repeatedly to add multiple programs.
awk_compile
int awk_compile (AWKINTERP* pi);
Compiles the AWK language program(s) that have been specified using awk_setprog
or awk_addprogfile
functions.
awk_addarg
int awk_addarg (AWKINTERP* pi, const char *arg);
Add a new argument to the interpreter. The argument can be an input file name or a variable definition, if it has the syntax var=value
. Arguments can be added at any time before starting execution of the AWK program.
Example
AWKINTERP *pi = awk_init (NULL);
awk_setprog (pi, "{print pass+1 \"-\" NR, $0}");
awk_compile (pi);
awk_addarg (pi, "infile.txt");
awk_addarg (pi, "pass=1");
awk_addarg (pi, "infile.txt");
The output is (assuming infile.txt has 25 lines):
1 - 25
2 - 25
awk_exec
int awk_exec (AWKINTERP* pi);
Execute a compiled program. The function returns the value returned by exit statement or a negative error code if something went wrong. If a program terminates without an exit statement, the returned value is 0
. Small negative values should be considered reserved for error conditions.
Example
AWKINTERP *pi = awk_init (NULL);
awk_setprog (pi, "{print NR, $0}");
awk_compile (pi);
awk_addarg (pi, "infile.txt);
awk_exec (pi);
awk_run
int awk_run (AWKINTERP* pi, const char *progfile);
This function combines in one call the calls to awk_setprog
, awk_compile
and awk_exec
functions.
If a program terminates without an exit statement, the returned value is 0
. Otherwise, the function returns the value specified in the exit statement. Small negative values should be considered reserved for error conditions. If the program requires any arguments, they can be added using awk_addarg
function before calling awk_run
.
Example
AWKINTERP *pi = awk_init (NULL);
awk_addarg (pi, "infile.txt");
awk_run (pi, "{print NR, $0}");
awk_end
void awk_end (AWKINTERP* pi);
Releases all memory allocated by the interpreter object.
awk_setinput
int awk_setinput (AWKINTERP* pi, const char *fname);
Forces interpreter to read input from a file. By default, an interpreter reads from stdin
. This function redirects the input to another file.
awk_infunc
Change the input function with a user-defined function.
void awk_infunc (AWKINTERP* pi, inproc fn);
Replaces the input function, by default getc
or fgetc
, with a user defined function. The inproc
function has the same signature as getc
:
typedef int (*inproc)();
It returns the next character or EOF if there are no more characters.
Here is an example of how to use AWK to process some in-memory data:
std::istrstream instr{
"Record 1\n"
"Record 2\n"
};
AWKINTERP *pi = awk_init (NULL);
awk_setprog (pi, "{print NR, $0}");
awk_compile (pi);
awk_infunc (pi, []()->int {return instr.get (); });
awk_setoutput
int awk_setoutput (AWKINTERP* pi, const char *fname);
Redirect interpreter output to a file. By default, the interpreter output goes to stdout
. Using this function, you can redirect it to a different file.
Example
AWKINTERP *pi = awk_init (NULL);
awk_setprog (pi, "BEGIN {print \"Output redirected\"}");
awk_compile (pi);
awk_setoutput (pi, "results.txt");
awk_exec (pi);
awk_outfunc
void awk_outfunc (AWKINTERP* pi, outproc fn);
Change the output function with a user-defined function. The outproc
function signature is:
typedef int (*outproc)(const char *buf, size_t len);
Example
std::ostringstream out;
int strout (const char *buf, size_t sz)
{
out.write (buf, sz);
return out.bad ()? - 1 : 1;
}
...
AWKINTERP *pi = awk_init (NULL);
awk_setprog (pi, "BEGIN {print \"Output redirected\"}");
awk_compile (pi);
awk_outfunc (pi, strout);
awk_getvar
int awk_getvar (AWKINTERP *pi, awksymb* var);
Retrieves the value of an AWK variable.
The function returns 1
if successful or a negative error code otherwise.
If the variable is an array and the index
member is NULL
, the function returns AWK_ERR_ARRAY
error code.
For string
variables, the AWKSYMB_STR
flag is set and the function allocates the memory needed for the string
by calling malloc
. The user has to release the memory by calling free
.
Example
AWKINTERP *pi = awk_init (NULL);
awksymb var{ "NR" };
awk_setprog (pi, "{print NR, $0}\n");
awk_compile (pi);
awk_getvar (pi, &var);
awk_setvar
int awk_setvar (AWKINTERP *pi, awksymb* var);
Changes the value of an AWK variable. The function takes a pointer to an awksymb
structure with information about the variable. The user must set the flags
member of the awksymb
structure to indicate which values are valid (string or numerical). In addition, for array members, the user must specify the index and set the `AWKSYMB_ARR
flag.
If the variable does not exist, it is created.
Example
AWKINTERP *pi = awk_init (NULL);
awksymb v{ "myvar", NULL, AWKSYMB_NUM, 25.0 };
awk_setprog (pi, "{myvar++; print myvar}\n");
awk_compile (interp);
awk_compile (pi);
awk_setvar (pi, &v);
awk_exec (pi);
awk_addfunc
Add a user defined function to the interpreter.
int awk_addfunc (AWKINTERP *pi, const char *name, awkfunc fn, int nargs);
Parameters:
pi
- pointer to an interpreter object name
- function name fn
- pointer to functype nargs
- number of function arguments
The function returns 1
if successful or a negative error code otherwise.
External user-defined functions can be called from AWK code just like any AWK user-defined function. The nargs
parameter specifies the expected number of parameters but, like with any AWK function, the number of actual arguments can be different. The interpreter will provide null
values for any missing parameters. The function prototype is:
typedef void (*awkfunc)(AWKINTERP *pinter, awksymb* ret, int nargs, awksymb* args);
The function can return a value by setting it into the ret
variable and setting the appropriate flags. String values must be allocated using malloc
.
It should be called only after the AWK program has been compiled.
Example
void fact (AWKINTERP *pi, awksymb* ret, int nargs, awksymb* args)
{
int prod = 1;
for (int i = 2; i <= args[0].fval; i++)
prod *= i;
ret->fval = prod;
ret->flags = AWKSYMB_NUM;
}
...
awk_setprog (pi, " BEGIN {n = factorial(3); print n}");
awk_compile (pi);
awk_addfunc (pi, "factorial", fact, 1);
awk_exec (pi);
Final Thoughts
The source code has been compiled with Visual Studio 2017. There is also a small makefile for gcc. The syntax analyzer uses YACC so you will need a YACC compiler if you want to do a full rebuild. I have included however the files generate by YACC (ytab.cpp and ytab.h) so you can build it even if you don't have a YACC compiler.
This concludes the presentation of my embedded AWK interpreter. It can be easily incorporated into a C/C++ program and has a good communication with host program. The host can access any interpreter variable and the interpreter can call external functions defined by host program. Size-wise, the interpreter is very small. You can expect an overhead of about 100 KB which is a decent number when compared with other interpreters (Lua takes about twice as much).
I will continue to improve the embedded AWK interpreter. If you want to contribute to this project or just get the latest version, you can find it at https://github.com/neacsum/awk.
History
- 05-Apr-2020 - Initial version