BEncode Lexing in C++

airlobster

4.67/5 (3 votes)

21 Jun 2009CPOL2 min read

24.5K

204

A very efficient BEncode Lexer in C++.

Download source code - 1.76 KB

Introduction

BENCODE is a data encoding method which is strongly associated with .torrent files. This article introduces a C++ class that can decode streams in that format.

Background

Actually, I needed a BENCODE lexing code in a project I'm working on, and no matter how extensive my web searches were, all I could come up with was either non-C/C++ code, the code was too dependant on other files which I couldn't locate, or the code took care of things I didn't want it to. So, as usual, the only sensible solution for me was to spend couple of hours and do it myself, the way I want it.

Using the code

Before we go to the actual usage part which is totally straightforward, I would like to emphasis a few points about the bencode_lexer class:

The bencode_lexer is stream-aware, which means you can feed it arbitrarily-sized chunks of data in the order they are read in by your code.
The class acts as the parsing context as well. In other words, a single bencode_lexer instance cannot process more than one content stream at a time.
The class is not thread-safe, but since it's not supposed to run on parallel streams, that is perfectly acceptable in my opinion.

Here is a simple code that dumps the content of a bencode stream onto the standard output device. It might not be too useful in the real world, but it shows how to use class:

C++

#include <stdio.h>
#include "bencode.h"

int g_iIdent = 0;

bool Callback(bencode_lexer::Event e, const void* pv, int cb, long lParam)
{
 switch( e )
 {
  case bencode_lexer::eBeginDictionary:
  {
   for(int i=0; i < g_iIdent; ++i) fputc('\t', stdout);
   fputs("dictionary {\n", stdout);
   ++g_iIdent;
   break;
  }
  case bencode_lexer::eEndDictionary:
  {
   --g_iIdent;
   for(int i=0; i < g_iIdent; ++i) fputc('\t', stdout);
   fputs("}\n", stdout);
   break;
  }
  case bencode_lexer::eBeginList:
  {
   for(int i=0; i < g_iIdent; ++i) fputc('\t', stdout);
   fputs("list {\n", stdout);
   ++g_iIdent;
   break;
  }
  case bencode_lexer::eEndList:
  {
   --g_iIdent;
   for(int i=0; i < g_iIdent; ++i) fputc('\t', stdout);
   fputs("}\n", stdout);
   break;
  }
  case bencode_lexer::eInteger:
  {
   for(int i=0; i < g_iIdent; ++i) fputc('\t', stdout);
   fprintf(stdout, "integer=%lu\n", *(int*)pv);
   break;
  }
  case bencode_lexer::eString:
  {
   for(int i=0; i < g_iIdent; ++i) fputc('\t', stdout);
   fprintf(stdout, "string=%s\n", (const char*)pv);
   break;
  }
 }
 return true;
}

int main(int argc, char* argv[])
{
 if( argc < 2 )
  return 1;

 FILE* pf = fopen(argv[1], "rb");
 if( pf )
 {
  bencode_lexer bed(Callback);
  char chBuf[1024];

  for(;;)
  {
   int cbRead = fread(chBuf, 1, sizeof(chBuf), pf);
   if( cbRead < 1 )
    break;
   int cbParsed = bed.lex_chunk(chBuf, cbRead, 0);
   if( cbParsed < cbRead )
    break;
  }

  fclose(pf);
 }

 return 0;
}

Again, since bencode_lexer doesn't create any data-structures, it is up to you to construct the extracted data according to your project's needs.

Points of Interest

The bencode_lexer uses a hand-written stack class for managing its internal state machine. This class mimics the STL's stack class in a lot more limited but efficient way. In a real project, you will have to replace the hard-coded breakpoints I put into the ASSERT calls.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)