
A scanner and scanner generator

9 Apr 2001
Supports both common approaches to scanners in one object.

[Figure: sample image (ScanGen.gif)]

Introduction

A scanner breaks a stream of characters into a sequence of tokens. This is comparable to a human reader who groups characters into words, numbers and punctuation, thereby reaching a higher level of abstraction. The text:

0-201-05866-9, cancelled,    "Parallel Program Design"

could, for example, be translated into the tokens:

T_ISBN T_COMMA T_CANCELLED T_COMMA T_TITLE 

where T_ISBN, T_COMMA and so on are integer constants. There are two approaches to implementing general-purpose scanners:

  • The scanner is an object of a class.

    The searched tokens are specified via calls to member functions.

  • The scanner is automatically generated from regular expressions.

    This is a two-phase approach. First, you specify the scanner, and then you run the generator, which outputs C source code.

The first approach is better suited to a project that changes frequently. The second approach gives superior performance, but has the disadvantage that the generated C code is nearly unreadable for humans and therefore should not be edited by hand.
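The first approach can be sketched with nothing but the standard library. The class below is a hypothetical stand-in written for illustration (MiniScanner and its internals are assumptions, not the REXI_Scan implementation); it uses std::regex and, among all registered definitions, picks the one whose match at the current position is longest.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical illustration of a class-based scanner; this is NOT REXI_Scan.
class MiniScanner {
public:
    // Reserved return values; token ids passed to AddSymbolDef should be >= 0.
    enum { eEos = -1, eIllegal = -2 };

    // Register one token: a regular expression plus the id returned for it.
    void AddSymbolDef(const std::string& regExp, int id) {
        m_defs.emplace_back(std::regex(regExp), id);
    }

    void SetSource(const std::string& source) {
        m_source = source;
        m_pos = 0;
    }

    // Skip blanks and tabs, then return the id of the definition whose match
    // at the current position is longest (the first definition wins ties).
    int Scan() {
        while (m_pos < m_source.size() &&
               (m_source[m_pos] == ' ' || m_source[m_pos] == '\t'))
            ++m_pos;
        if (m_pos >= m_source.size()) return eEos;

        int bestId = eIllegal;
        std::size_t bestLen = 0;
        for (const auto& def : m_defs) {
            std::smatch m;
            const bool found = std::regex_search(
                m_source.cbegin() + static_cast<std::ptrdiff_t>(m_pos),
                m_source.cend(), m, def.first,
                std::regex_constants::match_continuous);
            if (found && static_cast<std::size_t>(m.length(0)) > bestLen) {
                bestLen = static_cast<std::size_t>(m.length(0));
                bestId = def.second;
            }
        }
        if (bestId == eIllegal) bestLen = 1;  // consume one unknown character
        m_token = m_source.substr(m_pos, bestLen);
        m_pos += bestLen;
        return bestId;
    }

    const std::string& GetTokenString() const { return m_token; }

private:
    std::vector<std::pair<std::regex, int>> m_defs;
    std::string m_source, m_token;
    std::size_t m_pos = 0;
};
```

Registering the four patterns `[0-9]+-[0-9]+-[0-9]+-[0-9]+`, `,`, `cancelled` and `"[^"]*"` and scanning the sample line from the introduction yields the five tokens listed there, followed by eEos. Trying every regular expression at every position is simple but slow, which is the trade-off the generated-scanner approach avoids.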

The scanner and scanner generator presented in this article combine both approaches and provide one interface for both implementation strategies.

Interface for the scanner class

The scanner REXI_Scan is based on the regular expression facility already presented in the article 'Fast regular expressions'. To use it, you:

  • specify a regular expression for each token to recognize.
  • set the source string.
  • call Scan repeatedly until it returns REXI_Scan::eEos.

class REXI_Scan : public REXI_Base
{
public:
    REXI_Scan(char cLineBreak = '\n');  // related function: 'GetNofLines'

    /* STEP 1: initialize the scanner with symbol definitions */
    REXI_DefErr AddSymbolDef   (string strRegExp, int nIdAnswer);
    REXI_DefErr AddHelperRegDef(string strName, string strRegExp);
    REXI_DefErr SetToSkipRegExp(string strRegExp = "[ \r\n\v\t]*");

    /* STEP 2: set the source */
    inline void SetSource(const char* pszSource);

    /* STEP 3: read the next token and return its symbol id
       ('nIdAnswer' from 'AddSymbolDef') */
    int Scan();

    /* retrieve or set information after a call to 'Scan' */
    inline string GetTokenString() const;
    inline void   SkipChars(int nOfChars = 1);
    inline int    GetLastSymbol() const;
    inline int    GetNofLines() const;
};

Example Usage

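#include <cassert>
#include <iostream>
#include <string>
#include <vector>
// also include the header from the article's download that declares REXI_Scan
using namespace std;

enum ESymbol{T_ERR,T_AVAILABLE,T_CANCELLED,
            T_COMMA,T_LINEBREAK,T_ISBN,T_TITLE};
struct Info{
        Info():m_eKey(T_ERR){}
        string  m_sISBN;    ESymbol m_eKey;  string  m_sTitle;
};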
int main(int argc,char* argv[])
{
    const int ncOk= REXI_DefErr::eNoErr;
    const char szTestSrc[]= 
    "3-8272-5737-9,AVAILABLE,    \"XML praxis und referenz\"\r\n"
    "0-201-05866-9,cancelled,    \"Parallel Program Design\"\r\n";

    REXI_Scan scanner;
    REXI_DefErr err;
/* STEP 1: initialize scanner with symbol definitions */
    err= scanner.AddSymbolDef ("(AVAILABLE)\\i",T_AVAILABLE);
    assert(err.eErrCode==ncOk);
    err= scanner.AddSymbolDef ("(CANCELLED)\\i",T_CANCELLED);
    assert(err.eErrCode==ncOk);
    err= scanner.AddSymbolDef (",",T_COMMA);
    assert(err.eErrCode==ncOk);
    err= scanner.AddSymbolDef ("\\n",T_LINEBREAK);
    assert(err.eErrCode==ncOk);
    err= scanner.AddHelperRegDef("$Int_","[0-9]+\\-");
    assert(err.eErrCode==ncOk);
    err= scanner.AddSymbolDef ("$Int_ $Int_ $Int_ [0-9]+", T_ISBN);
    assert(err.eErrCode==ncOk);
    err= scanner.AddSymbolDef (" \"( [^\"\\n] | \\\"] )* \"", T_TITLE);
    assert(err.eErrCode==ncOk);
    err= scanner.SetToSkipRegExp("[ \\t\\v\\r]*");
    assert(err.eErrCode==ncOk);
/* STEP 2 : set source */
    scanner.SetSource(szTestSrc);
    int nNofLines=0;
    int nRes;
    Info info;
    vector<Info> vecInfos;
/* STEP 3: read until eos */
    while( (nRes=scanner.Scan())!=REXI_Scan::eEos ){
        switch(nRes){
        case T_AVAILABLE: 
        case T_CANCELLED:
            info.m_eKey= static_cast<ESymbol>(nRes);
            break;
        case T_TITLE:
            info.m_sTitle= scanner.GetTokenString();
            break;
        case T_ISBN:
            info.m_sISBN=  scanner.GetTokenString();
            break;
        case T_LINEBREAK:
            vecInfos.push_back(info); info= Info();
            break;
        case REXI_Scan::eIllegal:
            cout    <<  "Illegal:"    <<
                scanner.GetTokenString()  <<  endl;
            /* error recovery: skip the rest of the malformed line */
            while( (nRes=scanner.Scan())!=REXI_Scan::eEos
                                && nRes!= T_LINEBREAK){}
            info= Info();
            break;
        }
    }
    cout   << "Number of correctly read records: "
           <<  vecInfos.size() <<  endl;
    char c; cin >> c;   // wait for input before the console window closes
    return 0;
}

Interface for the scanner generator

The scanner generator is a very simple GUI program. It lets you specify and run a test scanner and then generates the source code for the specified scanner. The generated code uses a scanner derived from REXI_Scan and contains two alternative code paths: depending on whether REXI_STATIC_SCANNER is defined (#ifdef REXI_STATIC_SCANNER), either an efficient hard-coded scanner or a scanner working like the one described above is activated.

The specification for the scanner to be generated uses regular expressions and supports four different ways to specify a token, which are shown below.

if    #T_Quote= '[^']'    $Int= [0-9]+    ##T_FLOAT= $Int (\. $Int)?

It is important that you separate the token definitions by tabs. Now, let's see what the four definitions above mean.

1. Token: if
   The scanner searches for exactly the word 'if' and automatically creates a constant T_if for the token.
2. Token: #T_Quote= '[^']'
   The leading # means that the identifier up to the = is the name of the token constant; the token definition follows it.
3. Helper: $Int= [0-9]+
   Defines a helper definition, which can be used in later definitions.
4. Token: ##T_FLOAT= $Int (\. $Int)?
   The leading ## means the same as #, but with postprocessing after this token has been recognized.
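To make the four forms concrete, here is a minimal sketch of how a tab-separated specification line could be split and classified. It is an illustration only, not the generator's actual parser: DefKind, SpecDef, Trim, ParseDef and ParseSpecLine are hypothetical names, and treating any field without a leading # or $ as a keyword is an assumption.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical model of one definition in a scanner specification.
enum class DefKind { Keyword, NamedToken, Helper, PostprocessedToken };

struct SpecDef {
    DefKind kind;
    std::string name;    // token constant (T_...) or helper name ($...)
    std::string regExp;  // text after '=', or the keyword itself
};

// Remove leading and trailing blanks.
std::string Trim(const std::string& s) {
    const std::size_t first = s.find_first_not_of(' ');
    if (first == std::string::npos) return "";
    return s.substr(first, s.find_last_not_of(' ') - first + 1);
}

// Classify a single field by its prefix: "##", "#", "$" or none.
SpecDef ParseDef(const std::string& field) {
    const std::size_t eq = field.find('=');
    const bool hasPrefix =
        !field.empty() && (field[0] == '#' || field[0] == '$');
    if (!hasPrefix || eq == std::string::npos)
        return {DefKind::Keyword, "T_" + field, field};  // if -> T_if

    const std::string regExp = Trim(field.substr(eq + 1));
    if (field[0] == '$')
        return {DefKind::Helper, Trim(field.substr(0, eq)), regExp};
    if (field.compare(0, 2, "##") == 0)
        return {DefKind::PostprocessedToken,
                Trim(field.substr(2, eq - 2)), regExp};
    return {DefKind::NamedToken, Trim(field.substr(1, eq - 1)), regExp};
}

// Split a specification line at tabs and classify every non-empty field.
std::vector<SpecDef> ParseSpecLine(const std::string& line) {
    std::vector<SpecDef> defs;
    std::size_t start = 0;
    while (start <= line.size()) {
        std::size_t end = line.find('\t', start);
        if (end == std::string::npos) end = line.size();
        const std::string field = Trim(line.substr(start, end - start));
        if (!field.empty()) defs.push_back(ParseDef(field));
        start = end + 1;
    }
    return defs;
}
```

Applied to the example line above (with real tab characters between the definitions), ParseSpecLine returns four entries: a keyword named T_if, the named token T_Quote, the helper $Int, and the postprocessed token T_FLOAT.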

Fragment of a generated scanner

int    Simple::Scan()
{
#ifdef REXI_STATIC_SCANNER
    int nRes= FastScan();
#else
    int nRes= REXI_Scan::Scan();
#endif
    switch(nRes){
        case eIllegal:{
            m_sIllegal= GetTokenString();
            return nRes;
        }
        case T_PRICE:{
            // add your postprocessing code here

            return nRes;
        }
        default: return nRes;
    }
}
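The body of FastScan is not shown in the fragment. To give a rough idea of what "hard-coded" means in contrast to interpreting regular expressions at run time, the following hand-written sketch recognizes the pattern $Int (\. $Int)? directly in code. It is an assumption for illustration, not the generator's real output; MatchFloat is a hypothetical name, and blanks in the pattern are taken to be insignificant, as the article's examples suggest.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical hand-coded recognizer for the pattern  [0-9]+ (\. [0-9]+)?
// Returns the length of the match starting at 'pos', or 0 if there is none.
std::size_t MatchFloat(const std::string& src, std::size_t pos) {
    // True if index i is inside the string and refers to a decimal digit.
    auto isDigit = [&src](std::size_t i) {
        return i < src.size() &&
               std::isdigit(static_cast<unsigned char>(src[i])) != 0;
    };

    std::size_t i = pos;
    if (!isDigit(i)) return 0;   // at least one leading digit is required
    while (isDigit(i)) ++i;      // integer part
    if (i < src.size() && src[i] == '.' && isDigit(i + 1)) {
        ++i;                     // accept '.' only when a digit follows
        while (isDigit(i)) ++i;  // fractional part
    }
    return i - pos;
}
```

Because every decision is a plain comparison on the current character, such code needs no regular expression objects at run time, which is where the performance advantage of the static variant would come from.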

Intended Use

Scanning a comma-separated file, implementing a pretty printer for C++ source code, or building a scanner for an interpreter are potential application areas. There are also quite a lot of freely available scanner generators (such as lex and flex), but as far as I know, none of them generates scanners with such a neat interface as this one.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.
