Introduction
A scanner breaks a stream of characters into a sequence of tokens. This is comparable to a human reader who groups characters into words, numbers and punctuation, thereby reaching a higher level of abstraction. The text:
0-201-05866-9, cancelled, "Parallel Program Design"
could, for example, be translated into the tokens:
T_ISBN T_COMMA T_CANCELLED T_COMMA T_TITLE
where T_ISBN, T_COMMA and so on are integer constants.
There are two approaches for implementing general scanners:
- The scanner is an object of a class.
The tokens to search for are specified via calls to member functions.
- The scanner is automatically generated from regular expressions.
This is a two-phase approach: first you specify the scanner, then you run the generator, which outputs C source code.
The first approach is better suited to a project with frequent changes. The second approach gives superior performance, but has the disadvantage that the generated C code is nearly unreadable for human beings and therefore should not be edited by hand.
The scanner and scanner generator presented in this article combine both approaches and provide one interface for both implementation strategies.
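To make the idea of grouping characters into tokens concrete, here is a minimal hand-written sketch. It is not part of REXI; the names TokenKind, Token and Tokenize are invented for this illustration, and it only knows words, numbers and single punctuation characters.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical token kinds for this sketch (not part of REXI).
enum TokenKind { T_WORD, T_NUMBER, T_PUNCT };

struct Token {
    TokenKind kind;
    std::string text;
};

// Groups characters into words, numbers and punctuation; whitespace is skipped.
std::vector<Token> Tokenize(const std::string& src)
{
    std::vector<Token> tokens;
    std::size_t i = 0;
    while (i < src.size()) {
        const unsigned char c = static_cast<unsigned char>(src[i]);
        if (std::isspace(c)) { ++i; continue; }
        const std::size_t start = i;
        TokenKind kind;
        if (std::isalpha(c)) {
            kind = T_WORD;
            while (i < src.size() && std::isalpha(static_cast<unsigned char>(src[i]))) ++i;
        } else if (std::isdigit(c)) {
            kind = T_NUMBER;
            while (i < src.size() && std::isdigit(static_cast<unsigned char>(src[i]))) ++i;
        } else {
            kind = T_PUNCT;   // any other character is a one-character token
            ++i;
        }
        assert(i > start);    // every token consumes at least one character
        tokens.push_back(Token{kind, src.substr(start, i - start)});
    }
    return tokens;
}
```

With this sketch, Tokenize("cancelled, 42") yields a word, a punctuation token and a number. An ISBN such as 0-201-05866-9 falls apart into seven separate tokens, which is one reason why a scanner driven by regular expressions is more convenient for real input.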
Interface for the scanner class
The scanner REXI_Scan is based on the regular expression facility already presented in the article 'Fast regular expressions'. To use it, you
- specify a regular expression for each token to recognize.
- set the source string.
- call Scan repeatedly until it returns REXI_Scan::eEos.
class REXI_Scan : public REXI_Base
{
public:
REXI_Scan(char cLineBreak= '\n');
REXI_DefErr AddSymbolDef (string strRegExp,int nIdAnswer);
REXI_DefErr AddHelperRegDef (string strName,string strRegExp);
REXI_DefErr SetToSkipRegExp (string strRegExp= "[ \r\n\v\t]*");
inline void SetSource (const char* pszSource);
int Scan ();
inline string GetTokenString ()const;
inline void SkipChars (int nOfChars=1);
inline int GetLastSymbol ()const;
inline int GetNofLines ()const;
};
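For readers who want to experiment with this style of interface without the REXI sources, the following sketch mimics a small subset of it on top of std::regex. It is a stand-in under stated assumptions, not the REXI implementation: the class name MiniScan, the values of eEos and eIllegal, and the longest-match rule are my own choices, and the patterns use std::regex (ECMAScript) syntax rather than REXI syntax.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for illustration only; it is not REXI_Scan.
class MiniScan {
public:
    enum { eEos = -1, eIllegal = -2 };   // assumed values, chosen for this sketch

    void AddSymbolDef(const std::string& regExp, int idAnswer) {
        m_defs.emplace_back(std::regex(regExp), idAnswer);
    }
    void SetToSkipRegExp(const std::string& regExp) { m_skip = std::regex(regExp); }
    void SetSource(const std::string& source) { m_src = source; m_pos = 0; }
    std::string GetTokenString() const { return m_token; }

    // Returns the id of the longest match at the current position,
    // eIllegal for an unmatched character, or eEos at the end of the source.
    int Scan() {
        assert(m_pos <= m_src.size());
        const auto flags = std::regex_constants::match_continuous;
        std::smatch m;
        if (std::regex_search(Here(), m_src.cend(), m, m_skip, flags))
            m_pos += static_cast<std::size_t>(m.length(0));
        if (m_pos >= m_src.size()) return eEos;

        int best = eIllegal;
        std::size_t bestLen = 0;
        for (const auto& def : m_defs) {
            if (std::regex_search(Here(), m_src.cend(), m, def.first, flags)
                && static_cast<std::size_t>(m.length(0)) > bestLen) {
                bestLen = static_cast<std::size_t>(m.length(0));
                best = def.second;   // on ties the earlier definition wins
            }
        }
        if (bestLen == 0) bestLen = 1;   // consume one character of illegal input
        m_token = m_src.substr(m_pos, bestLen);
        m_pos += bestLen;
        return best;
    }

private:
    std::string::const_iterator Here() const {
        return m_src.cbegin() + static_cast<std::string::difference_type>(m_pos);
    }
    std::vector<std::pair<std::regex, int>> m_defs;
    std::regex m_skip = std::regex("[ \\t\\v\\r]*");
    std::string m_src, m_token;
    std::size_t m_pos = 0;
};
```

The calling pattern is the same three steps as above: add one definition per token, set the source, then call Scan in a loop until it returns eEos.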
Example Usage
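#include <cassert>
#include <iostream>
#include <string>
#include <vector>
// also include the REXI header that declares REXI_Scan and REXI_DefErr
using namespace std;
enum ESymbol{T_ERR,T_AVAILABLE,T_CANCELLED,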
T_COMMA,T_LINEBREAK,T_ISBN,T_TITLE};
struct Info{
Info():m_eKey(T_ERR){}
string m_sISBN; ESymbol m_eKey; string m_sTitle;
};
int main(int argc,char* argv[])
{
const int ncOk= REXI_DefErr::eNoErr;
const char szTestSrc[]=
"3-8272-5737-9,AVAILABLE, \"XML praxis und referenz\"\r\n"
"0-201-05866-9,cancelled, \"Parallel Program Design\"\r\n";
REXI_Scan scanner;
REXI_DefErr err;
err= scanner.AddSymbolDef ("(AVAILABLE)\\i",T_AVAILABLE);
assert(err.eErrCode==ncOk);
err= scanner.AddSymbolDef ("(CANCELLED)\\i",T_CANCELLED);
assert(err.eErrCode==ncOk);
err= scanner.AddSymbolDef (",",T_COMMA);
assert(err.eErrCode==ncOk);
err= scanner.AddSymbolDef ("\\n",T_LINEBREAK);
assert(err.eErrCode==ncOk);
err= scanner.AddHelperRegDef("$Int_","[0-9]+\\-");
assert(err.eErrCode==ncOk);
err= scanner.AddSymbolDef ("$Int_ $Int_ $Int_ [0-9]+", T_ISBN);
assert(err.eErrCode==ncOk);
err= scanner.AddSymbolDef (" \"( [^\"\\n] | \\\"] )* \"", T_TITLE);
assert(err.eErrCode==ncOk);
err= scanner.SetToSkipRegExp("[ \\t\\v\\r]*");
assert(err.eErrCode==ncOk);
scanner.SetSource(szTestSrc);
int nNofLines=0;
int nRes;
Info info;
vector<Info> vecInfos;
while( (nRes=scanner.Scan())!=REXI_Scan::eEos ){
switch(nRes){
case T_AVAILABLE:
case T_CANCELLED:
info.m_eKey= (enum ESymbol)nRes;
break;
case T_TITLE:
info.m_sTitle= scanner.GetTokenString();
break;
case T_ISBN:
info.m_sISBN= scanner.GetTokenString();
break;
case T_LINEBREAK:
vecInfos.push_back(info); info= Info();
break;
case REXI_Scan::eIllegal:
cout << "Illegal:" <<
scanner.GetTokenString() << endl;
// skip the rest of the malformed record
while( (nRes=scanner.Scan())!=REXI_Scan::eEos
&& nRes!= T_LINEBREAK);
info= Info();
break;
}
}
cout << "Number of correctly read records: "
<< vecInfos.size() << endl;
char c; cin >> c;
return 0;
}
Interface for the scanner generator
The scanner generator is a very simple GUI program. It allows you to specify and run a test scanner and finally generates the source code for the specified scanner. The generated code uses a scanner derived from REXI_Scan and contains two alternative code parts. Controlled by the conditional directive #ifdef REXI_STATIC_SCANNER, either an efficient hard-coded scanner or a scanner working like the one described above is activated.
The specification for the scanner to be generated uses regular expressions and supports four different ways to specify a token, which are shown below.
if	#T_Quote= '[^']'	$Int= [0-9]+	##T_FLOAT= $Int (\. $Int)?
It is important that you separate the token definitions with tabs. Now let's see what the four definitions above mean.
1. Token: if
   The scanner searches for exactly the word 'if' and automagically creates a constant T_if for the token.
2. Token: #T_Quote= '[^']'
   The leading # means that the next identifier up to the = is the name of the token constant; the token definition follows.
3. Helper: $Int= [0-9]+
   Defines a helper definition, which can be used in later definitions.
4. Token: ##T_FLOAT= $Int (\. $Int)?
   The leading ## means the same as #, but postprocessing is done after this token is recognized.
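The four forms can be told apart mechanically. The sketch below splits a specification at tabs and classifies each definition according to the rules just described. It only illustrates the syntax; the generator's real parser is not shown in this article, and SpecEntry, Trim and ParseSpec are names invented here.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical representation of one definition (an assumption of this sketch).
struct SpecEntry {
    enum Kind { eToken, eHelper } kind;
    std::string name;     // constant name (e.g. T_if) or helper name (e.g. $Int)
    std::string regExp;   // regular expression, or the literal word for form 1
    bool postProcess;     // true only for the ## form
};

// Removes leading and trailing blanks and line breaks.
static std::string Trim(const std::string& s)
{
    const char* ws = " \r\n";
    const std::size_t b = s.find_first_not_of(ws);
    if (b == std::string::npos) return "";
    return s.substr(b, s.find_last_not_of(ws) - b + 1);
}

// Splits the specification at tabs and classifies every definition.
std::vector<SpecEntry> ParseSpec(const std::string& spec)
{
    std::vector<SpecEntry> entries;
    std::istringstream in(spec);
    std::string def;
    while (std::getline(in, def, '\t')) {
        def = Trim(def);
        if (def.empty()) continue;
        SpecEntry e{SpecEntry::eToken, "", "", false};
        const std::size_t eq = def.find('=');
        if (def[0] == '#' && eq != std::string::npos) {
            // forms 2 and 4: "#Name= regexp" or "##Name= regexp"
            const std::size_t hashes = (def.size() > 1 && def[1] == '#') ? 2 : 1;
            e.postProcess = (hashes == 2);
            e.name = Trim(def.substr(hashes, eq - hashes));
            e.regExp = Trim(def.substr(eq + 1));
        } else if (def[0] == '$' && eq != std::string::npos) {
            // form 3: helper definition "$Name= regexp"
            e.kind = SpecEntry::eHelper;
            e.name = Trim(def.substr(0, eq));
            e.regExp = Trim(def.substr(eq + 1));
        } else {
            // form 1: a plain word, the constant name is derived from it
            e.name = "T_" + def;
            e.regExp = def;
        }
        entries.push_back(e);
    }
    return entries;
}
```

Applied to the example line above, ParseSpec returns four entries: the tokens T_if and T_Quote, the helper $Int, and the token T_FLOAT with its postprocessing flag set.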
Fragment of a generated scanner
int Simple::Scan()
{
#ifdef REXI_STATIC_SCANNER
int nRes= FastScan();
#else
int nRes= REXI_Scan::Scan();
#endif
switch(nRes){
case eIllegal:{
m_sIllegal= GetTokenString();
return nRes;
}
case T_PRICE:{
return nRes;
}
default: return nRes;
}
}
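To give an impression of the difference between the two code parts, here is a sketch of the same switch for a single token, the number pattern [0-9]+ (\.[0-9]+)?. It is not output of the generator. FastScanFloat, RegexScanFloat and ScanFloat are hypothetical names, and std::regex stands in for the interpreting REXI_Scan path. Only the macro name REXI_STATIC_SCANNER is taken from the generated code.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical token ids for this sketch.
enum { T_ILLEGAL = 0, T_FLOAT = 1 };

// Hard-coded recognizer for  [0-9]+ (\.[0-9]+)?  starting at 'pos'.
// On success 'pos' is advanced past the token; otherwise it is unchanged.
int FastScanFloat(const std::string& src, std::size_t& pos)
{
    auto isDigitAt = [&src](std::size_t k) {
        return k < src.size() && std::isdigit(static_cast<unsigned char>(src[k])) != 0;
    };
    std::size_t i = pos;
    if (!isDigitAt(i)) return T_ILLEGAL;
    while (isDigitAt(i)) ++i;                 // integer part
    if (i < src.size() && src[i] == '.' && isDigitAt(i + 1)) {
        ++i;                                  // the '.' counts only if a digit follows
        while (isDigitAt(i)) ++i;             // fraction digits
    }
    pos = i;
    return T_FLOAT;
}

// The same token recognized by interpreting a regular expression at run time.
int RegexScanFloat(const std::string& src, std::size_t& pos)
{
    static const std::regex re("[0-9]+(\\.[0-9]+)?");
    std::smatch m;
    const auto from = src.cbegin() + static_cast<std::string::difference_type>(pos);
    if (!std::regex_search(from, src.cend(), m, re,
                           std::regex_constants::match_continuous))
        return T_ILLEGAL;
    pos += static_cast<std::size_t>(m.length(0));
    return T_FLOAT;
}

// Compile-time selection in the style of the generated Scan() function.
int ScanFloat(const std::string& src, std::size_t& pos)
{
#ifdef REXI_STATIC_SCANNER
    return FastScanFloat(src, pos);
#else
    return RegexScanFloat(src, pos);
#endif
}
```

Both functions accept the same input and leave pos at the same place, so callers do not notice which variant was compiled in. That is the property the #ifdef in the generated fragment relies on.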
Intended Use
Potential application areas include scanning a comma-separated file, implementing a pretty printer for C++ source code, or building a scanner for an interpreter. There are also quite a lot of freely available scanner generators (lex, flex) out there, but as far as I know, none of them generates scanners with such a neat interface as this one.