
Adaptable Syntax Assembler for a Stack Machine

18 Jul 2021, GPL3
A context sensitive language
A brief about an infix assembler language with an adaptable syntax for a calculator-like stack machine.

The purpose of this paper is to briefly describe an assembler language with an adaptable syntax that runs on a stack machine. Initially, all instructions are characters; in other words, the machine language instructions are character codes. The focus of this paper is on differences between this and other stack machines and between this language and others, because most readers already know stack machines and languages that run on them.

Some characters, such as 1, have a natural meaning, and others, such as A, do not. Since some characters do not have a natural meaning, they can be defined to give them meaning. This machine assigns a meaning to some characters, for example Ð means define. The language is called Ameba; it runs on a stack machine called the calculator.

The inherent meaning of a character is its calculator definition as an operator or operand. The natural meaning of a character is its cultural connotation, or the common understanding of it among people. Where possible, the calculator defines the inherent meaning of a character to be its natural meaning. For example, the character 1 has the numeric value 1, and + means add; each has the same meaning in many cultures. By contrast, G does not have a specific and consistent meaning.

The calculator's inherent symbols are single characters with inherent meaning. Additional single- or multi-character symbols can be declared or defined. A declared symbol exists in the symbol table but has no definition. Symbol definitions are expressions made of defined symbols. In other words, once a symbol is defined, it may be used to define other symbols. As more and more symbols are defined, the body of definitions grows. Eventually, a calculator symbol may represent a whole system.

Use the define operator, Ð, to declare or define single- or multi-character symbols. A declared symbol exists in the symbol table but has no methods or properties to define it; a defined symbol has them. The form of a define expression is Ðsym defð, where sym is the symbol being defined, terminated by a space, and def is an expression, that is, a string of defined symbols. For example, Ðfifteen 9+6ð defines the symbol fifteen. Symbols must be declared before use.

A define may be nested inside another define, as in Ðfifteen Ðsix 6ð 9+sixð. The scope of the inner define (here, six) is local to the outer define (here, fifteen). A calculator implementation may or may not support threads; if it does, each thread will have a local symbol table.
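The scoping rule for nested defines can be illustrated with a small Python model. This is only a sketch under stated assumptions: the symbol table is modelled as a `collections.ChainMap`, and the function name `define_fifteen` is hypothetical; nothing here is taken from an actual Ameba implementation.

```python
from collections import ChainMap

# The outer symbol table (a hypothetical stand-in for the calculator's table).
outer = ChainMap()

def define_fifteen(table):
    """Model of  Ðfifteen Ðsix 6ð 9+sixð  with six local to fifteen."""
    local = table.new_child()             # scope that exists only inside the define
    local["six"] = 6                      # inner define: Ðsix 6ð
    table["fifteen"] = 9 + local["six"]   # body of the outer define: 9+six

define_fifteen(outer)
print(outer["fifteen"])   # 15
print("six" in outer)     # False: six is not visible outside fifteen
```

The inner symbol lives in a child scope, so looking it up from the outer table fails, which matches the local-scope behaviour described above.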

Ameba evaluates expressions in the order they are written, with no operator precedence and an infix-like syntax. For example, a+b$c means add a to b and save the result in c. There are no lines or statements, just expressions, including parenthetical subexpressions, loops and conditionals. The loop syntax uses the characters £, ¤ and Ø. The conditional syntax uses the characters ?, :, ; and ¿. The calculator may partially evaluate an expression, in which case the result is itself an expression, e.g., 5+6+B -> 11+B.
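To make the no-precedence rule concrete, here is a minimal Python sketch of strict left-to-right evaluation. It assumes single-character operands, a small hypothetical operator table (`OPS`), and that `$` leaves the saved value on the stack; none of these details are specified by the text.

```python
import operator

# Hypothetical operator table; Ameba's actual operator set is not given here.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def operand_value(ch, variables):
    """A digit is its numeric value; any other character is looked up as a variable."""
    return int(ch) if ch.isdigit() else variables[ch]

def evaluate_left_to_right(text, variables):
    """Evaluate single-character operands and operators strictly in written order."""
    stack = []
    pos = 0
    while pos < len(text):
        ch = text[pos]
        if ch in OPS:
            # The leading operand is on the stack; take the trailing one from the text.
            right = operand_value(text[pos + 1], variables)
            stack.append(OPS[ch](stack.pop(), right))
            pos += 2
        elif ch == "$":
            # Save the value on top of the stack in the named variable.
            variables[text[pos + 1]] = stack[-1]
            pos += 2
        else:
            stack.append(operand_value(ch, variables))
            pos += 1
    return stack[-1]

variables = {"a": 2, "b": 3}
print(evaluate_left_to_right("a+b$c", variables), variables["c"])  # 5 5
print(evaluate_left_to_right("2+3*4", {}))                         # 20, not 14
```

The second call shows the difference from conventional precedence: the addition is performed first simply because it is written first.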

The subexpression form is as follows: (e), where e is an expression. Ameba calls itself to evaluate subexpressions, including subexpressions in loops and conditionals.

The loop form is £ a ØB c ¤, where £ starts the loop and ¤ terminates it, B is a Boolean operand-subexpression, and a and c are subexpressions evaluated on each iteration. The Ø operator tests its operand B and exits the loop when B is true, so c is evaluated one time fewer than a.

The conditional form is ?B:a;c¿, where a and c are subexpressions and B is a Boolean operand-subexpression. The value of the conditional is obtained by evaluating a if B is true and c otherwise. In other words, it reads: if B then a else c.
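The control flow of the loop and conditional forms can be modelled in a few lines of Python. This is a sketch of the semantics as described above, with the subexpressions a, B and c passed in as callables; the function names `run_loop` and `conditional` are hypothetical.

```python
def run_loop(a, b, c):
    """Model of  £ a ØB c ¤ : evaluate a, exit when b() is true, else evaluate c and repeat."""
    while True:
        a()
        if b():      # the Ø test sits between a and c
            return
        c()

def conditional(b, a, c):
    """Model of  ?B:a;c¿ : the value is a() if b() is true, otherwise c()."""
    return a() if b() else c()

# Count how often each part of the loop is evaluated.
counts = {"a": 0, "c": 0}

def body_a():
    counts["a"] += 1

def body_c():
    counts["c"] += 1

run_loop(body_a, lambda: counts["a"] >= 3, body_c)
print(counts)  # {'a': 3, 'c': 2}: c runs one time fewer than a
print(conditional(lambda: False, lambda: "then", lambda: "else"))  # else
```

Because the exit test sits in the middle of the loop body, the form covers both test-first and test-last loops by leaving a or c empty.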

As the calculator translates text-code into token-code, a kind of p-code, it makes a symbol table. A token is an integer that points to a symbol. A symbol is a language element such as a literal, variable, expression, conditional or loop. Given the example from above, a+b$c, the symbols are a, +, b, $, c, +b, $c, and a+b$c. The symbol a+b$c contains three tokens, the tokens assigned to a, +b and $c. These three tokens translate into the character string a+b$c.

Symbols are strings of tokens, i.e., token code, with various lengths; thus, the start and end addresses of symbols are irregular in memory. Token-strings are terminated by a Null token (i.e., Null token = 0). A vector of symbol addresses is indexed by token. The token for a character-symbol is its character code (e.g., “0” is token #30, hexadecimal). Reverse translating token-code yields text-code, which means token-code is suitable for evaluation, partial evaluation, and programmatic static analysis.

Symbols cannot contain a leading or trailing space; rather, the space terminates the symbol. The exception is the space character and the other whitespace characters, which as symbols necessarily begin and end with whitespace. Symbols not separated by spaces may or may not be identified correctly, because an ambiguity is possible when a longer symbol contains a shorter one. For example, assume the defined symbol anyway contains the defined symbols a, an, any and way. Searching for the text anyway finds the symbol anyway instead of any and way. In other words, the search algorithm prefers a longer symbol to shorter ones. To force the search to find any and way, separate them with a space, e.g., any way.
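The longest-symbol preference can be sketched as a greedy search. The following Python is an illustration only: the symbol table is reduced to a plain set of names, and the function name `split_into_symbols` is hypothetical; the article does not describe the actual search algorithm beyond its preference for longer symbols.

```python
def split_into_symbols(text, symbols):
    """Split text into defined symbols, preferring the longest match at each position."""
    found = []
    pos = 0
    while pos < len(text):
        if text[pos] == " ":
            pos += 1                      # a space only terminates the previous symbol
            continue
        end = text.find(" ", pos)         # never search past the next space
        if end == -1:
            end = len(text)
        for stop in range(end, pos, -1):  # try the longest candidate first
            if text[pos:stop] in symbols:
                found.append(text[pos:stop])
                pos = stop
                break
        else:
            raise ValueError(f"no defined symbol at position {pos}")
    return found

symbols = {"a", "an", "any", "way", "anyway"}
print(split_into_symbols("anyway", symbols))   # ['anyway']
print(split_into_symbols("any way", symbols))  # ['any', 'way']
```

Inserting the space limits how far the search may extend, which is why it forces the shorter symbols to be found.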

The symbol anyway is assigned a token (e.g., 9037). It is made from two symbols, any and way, which have their own tokens, for example 582 for any and 620 for way. Thus, the multi-token string 582 620 is stored in the symbol table instead of the characters “anyway.” In turn, any and way are also in the symbol table as token strings. For example, way consists of three tokens that are the character codes for the characters w, a and y, which are hexadecimal 77, 61 and 79.
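The storage scheme described above (Null-terminated token strings, located through a table indexed by token) can be modelled as follows. This is a sketch with assumptions: a Python list stands in for memory, a dict stands in for the vector of symbol addresses, and any token without a stored entry is treated as a character code. The token numbers are the example values from the text.

```python
NULL = 0          # Null token that terminates every token string

memory = []       # flat token memory holding all token strings back to back
address_of = {}   # token -> start address (a dict stands in for the address vector)

def store(token, token_string):
    """Append a Null-terminated token string and record where it starts."""
    address_of[token] = len(memory)
    memory.extend(token_string)
    memory.append(NULL)

def to_text(token):
    """Reverse translate a token into text by expanding it recursively."""
    if token not in address_of:
        return chr(token)     # assumed: a character-symbol's token is its character code
    chars = []
    addr = address_of[token]
    while memory[addr] != NULL:
        chars.append(to_text(memory[addr]))
        addr += 1
    return "".join(chars)

store(582, [0x61, 0x6E, 0x79])   # any = a n y
store(620, [0x77, 0x61, 0x79])   # way = w a y
store(9037, [582, 620])          # anyway = any way

print(to_text(9037))  # anyway
print(memory)         # [97, 110, 121, 0, 119, 97, 121, 0, 582, 620, 0]
```

Direct access by token is all that reverse translation needs, which is why the start addresses may be irregular without slowing lookup.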

The symbol table is a syntax tree that captures the essence of expressions made of subexpressions that are also made of subexpressions … made of character-strings that are made of characters. Every expression, subexpression, character-string and character is a unique symbol represented in the symbol table. If a server manages the symbol table, many people can access it, and it becomes an object repository, with security and possibly version control. Such a server can help maximize code reuse because its symbols are shared by all who access the server, however big the organization.   

Symbol table access is either direct by token or search by symbol. Translation of token-code into text-code requires direct access. Translation of text-code into token-code requires search access. Evaluating token-code requires direct access.

When an infix operator (e.g., +) is called, its leading operand has already been evaluated and the value is on top of the stack. Its trailing operand is not yet parsed or evaluated. The operator must get and evaluate its trailing operand by calling the Ç function, which is named cull-operand. Cull-operand gets the next symbol past the operator that calls it. In the expression 1+2, the + operator calls Ç to get its trailing operand, 2. Culling moves parsing into the method of the operator itself; in other words, cull-operand puts syntax processing together with semantic processing in method code. Because this gives programmers control of syntax, it makes Ameba a meta language.

The function Ç uses the return address on the stack, which is the location of the trailing operand, to set the program counter of the calculator. It recursively calls the calculator to evaluate the trailing operand, which may be as simple as a literal or as complex as a subexpression. Cull-operand also advances the return address to point past the trailing operand it just culled, which means an operator may cull several operands.

Ameba treats syntax forms such as literals, comments, subexpressions, loops and conditionals as self-declaring symbols. The first character of each of these forms is an operator that may cull several trailing operands. For example, the leading digit of an integer is an operator that culls trailing digits to make an integer, which it declares in the symbol table. Processing a subexpression is more complex, because culling between the parentheses requires calling the calculator to process the symbols within them.
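The culling idea can be sketched in Python as a calculator with no central parser: each symbol's method consumes whatever text it needs. This is a simplified model under stated assumptions. A text position stands in for the return address, the class and method names (`Calculator`, `step`, `cull_operand`, `number`, `subexpression`, `infix`) are hypothetical, symbols are not declared in a symbol table, and an unclosed parenthesis is not handled.

```python
import operator

class Calculator:
    # Hypothetical infix operator table.
    INFIX = {"+": operator.add, "-": operator.sub, "*": operator.mul}

    def __init__(self, text):
        self.text, self.pos, self.stack = text, 0, []

    def step(self):
        """Run the method of the symbol at the current position."""
        ch = self.text[self.pos]
        self.pos += 1
        if ch.isdigit():
            self.number(ch)
        elif ch == "(":
            self.subexpression()
        elif ch in self.INFIX:
            self.infix(self.INFIX[ch])
        else:
            raise ValueError(f"undefined symbol {ch!r}")

    def cull_operand(self):
        """Stand-in for Ç: evaluate one trailing operand and return its value."""
        self.step()               # recursive call into the calculator
        return self.stack.pop()

    def number(self, first):
        """The leading digit acts as an operator that culls the trailing digits."""
        value = int(first)
        while self.pos < len(self.text) and self.text[self.pos].isdigit():
            value = value * 10 + int(self.text[self.pos])
            self.pos += 1
        self.stack.append(value)

    def subexpression(self):
        """The ( operator evaluates symbols until the matching ), then skips it."""
        while self.text[self.pos] != ")":
            self.step()
        self.pos += 1

    def infix(self, fn):
        """The leading operand is already on the stack; cull the trailing one."""
        right = self.cull_operand()
        self.stack.append(fn(self.stack.pop(), right))

    def evaluate(self):
        while self.pos < len(self.text):
            self.step()
        return self.stack[-1]

print(Calculator("1+2").evaluate())         # 3
print(Calculator("12+(2*3)-4").evaluate())  # 14
```

Adding a new syntax form in this model means adding one more method and one more dispatch entry, which is the sense in which the syntax is adaptable.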

In summary, this is a sketch of Ameba: an introduction that is not intended to address all nuances of its design. The important issue is how to make a language with an adaptable syntax by eliminating a monolithic rules-based parser. The calculator can evaluate text-code and simultaneously translate it into token-code. It can also evaluate token-code independently of text-code, which gives better performance. The adaptable syntax makes Ameba a meta language, capable of compiling other languages, e.g., C, into token-code. In addition, Ameba can enhance its own syntax and should provide partial evaluation for specialization and optimization. It needs an Integrated Development Environment (IDE) with an integral test manager, a static analysis system, and CASE extensions. It is limited only by processor resources, time and memory, and by our imagination.

I wish to thank my wife, Anna, for proofreading and making this document much better.

(c) Edwin E. Ross II July 2021

License

This article, along with any associated source code and files, is licensed under The GNU General Public License (GPLv3)