Background
Before I wrote this utility program, I was trying to post some C++ source code to my blog, and the result was almost disastrous. The editor I was using only accepted plain HTML source that was copied and pasted in. I immediately realized it would be a cool thing to have a transformation program that converts C++ source code to HTML source with the correct syntax highlighting.
I tried at first to create this program without consulting any outside sources. It didn't get me very far. Then I searched CodeProject. And there it was, a brilliant article written by q123456789, called "A C++ to HTML conversion utility". This program had everything I needed, but I didn't like it because it was not extendable. What I really wanted was a GUI-based program with two edit boxes: I paste the C++ code into one of the edit boxes and click "Convert", and in the second edit box I see the C++ code transformed into HTML source, ready to be pasted into my blog. I decided to change q123456789's wonderful code into a simple GUI program done in C#.
Using the code
You don't need to take the "Programming Language" class to understand this code. The original article explains how it works very well. The idea, let me summarize it for you, is to go through the C++ source code one character at a time, identify what the character is, and store it in a buffer until you hit a character that does not belong to the type of string you are attempting to recognize. Then you put the character back to the stream. Finally, you identify what type of string it is. The types can be:
- code
- comment
- CPP string
- keyword
- preprocessor
- don't know what it is.
They are defined in my C# code as follows:
public enum TokenType { code, comment, cppstring, keyword, pp, none };
Comments are basically any string that starts with the characters "//" and runs to the point where you find a "\n", or any string enclosed in "/*" and "*/".
A CPP string is any string that is enclosed in double quotes. For example:
"This is a string."
"This is a string with character\"double quotes\" in it."
Keywords are C++ keywords. It is a long list so I am not going to list them here.
I used "pp" as the type name for preprocessor directives. These are things like:
#include
#ifndef
#else
#endif
...
Anything that doesn't fit into the four categories above, I call code.
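The keyword and preprocessor checks boil down to looking a token up in a fixed list. Here is a minimal sketch of how that lookup could be done. The class name TokenTables and its two methods are hypothetical names made up for this illustration, not the helper methods from the download, and both lists are deliberately partial.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only; not the article's own code.
public static class TokenTables
{
    // A partial keyword list. The real C++ list is much longer.
    private static readonly HashSet<string> Keywords =
        new HashSet<string>(StringComparer.Ordinal)
        {
            "bool", "break", "case", "char", "class", "const", "else", "for",
            "if", "int", "namespace", "return", "struct", "void", "while"
        };

    // Directive names without the leading '#'. Also partial.
    private static readonly HashSet<string> Directives =
        new HashSet<string>(StringComparer.Ordinal)
        {
            "define", "elif", "else", "endif", "if",
            "ifdef", "ifndef", "include", "pragma", "undef"
        };

    // C++ keywords are case sensitive, so the comparison is ordinal.
    public static bool IsKeyword(string token)
    {
        return token != null && Keywords.Contains(token);
    }

    // Accepts "#include" as well as "#  include": whitespace between
    // the '#' and the directive name is ignored.
    public static bool IsPreprocessor(string token)
    {
        if (string.IsNullOrEmpty(token) || token[0] != '#')
            return false;
        string name = token.Substring(1).Trim();
        return Directives.Contains(name);
    }
}
```

With this sketch, TokenTables.IsKeyword("int") is true while TokenTables.IsKeyword("Int") is false, and TokenTables.IsPreprocessor("#  ifndef") is true.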
When I started working on this, the biggest problem I faced was that the original article (written by q123456789) uses an I/O stream to get one character at a time (using basic_istream::get()) as it advances through the code and parses for tokens or bits of code. It also rewinds the stream (basic_istream::unget()) from time to time. When I tried to do the same thing in C#, I was kind of stunned and didn't know what to do for a moment. Then I realized that in C#, there is a class called StringReader, located in the System.IO namespace. Once I realized that, I was able to go on coding.
Then I had to stop a second time. The reason was that the StringReader class does not provide an unget() method. Well, I was not going to be stopped by these small problems. After some reading, I realized that the way q123456789 had done this in his program, getting a character from the stream and then rewinding the stream, was not going to work with the StringReader class.
The alternative is pretty simple: the program peeks at the character but does not move forward. If the character is a keeper, the program advances one character in the stream. When I looked into the StringReader class, I found the methods I wanted to use, Peek() and Read(). The Peek() method returns an integer that represents the character at the current position, or -1 when there are no more characters, but it does not move to the next character. The Read() method returns the same integer, then moves to the next character. My program peeks to get the integer value of the character at the current position, makes sure it is not -1, then uses Convert.ToChar(int) to convert the integer to a character. If the program likes this character, it calls Read() to move to the next character in the stream, then appends the character to a buffer. If it finds a character it doesn't like, it does not call Read(), so the position in the stream stays where it is.
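To make the peek-then-read idea concrete, here is a minimal sketch that reads one word from a reader. The class PeekReadDemo and its method ReadWord are names invented for this illustration. It uses only Peek(), Read(), and Convert.ToChar(int), the calls described above.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical demo class; not part of the article's download.
public static class PeekReadDemo
{
    // Reads the longest run of letters, digits and underscores.
    // The first character that does not belong is only peeked at,
    // never consumed, so it is still there for the next caller.
    public static string ReadWord(TextReader reader)
    {
        StringBuilder buffer = new StringBuilder();
        int value = reader.Peek();
        while (value != -1)
        {
            char c = Convert.ToChar(value);
            if (!(Char.IsLetterOrDigit(c) || c == '_'))
                break;          // not a keeper: leave it in the stream
            reader.Read();      // a keeper: now actually consume it
            buffer.Append(c);
            value = reader.Peek();
        }
        return buffer.ToString();
    }
}
```

For example, after calling PeekReadDemo.ReadWord() on a StringReader over "int x;", the method returns "int" and the next Peek() on the reader still sees the space. No unget() is needed because the space was never read.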
Here it is, the C++ source code parsing method I wrote:
public bool ParseCode(StringReader orig_code_stream)
{
    int iCharVal = orig_code_stream.Peek();
    if (iCharVal != -1)
    {
        orig_code_stream.Read();
        char c = Convert.ToChar(iCharVal);
        switch (c)
        {
            case '/':
                // Look at the character after '/' to tell a comment from a division.
                iCharVal = orig_code_stream.Peek();
                c = (iCharVal == -1) ? '\0' : Convert.ToChar(iCharVal);
                if (c == '*')
                {
                    // Block comment: read up to and including the closing "*/".
                    orig_code_stream.Read();
                    token_val.Append("/*");
                    this.token_type = TokenType.comment;
                    iCharVal = orig_code_stream.Peek();
                    while (iCharVal != -1)
                    {
                        orig_code_stream.Read();
                        c = Convert.ToChar(iCharVal);
                        token_val.Append(c);
                        // Requiring four characters keeps "/*/" from
                        // counting as a complete comment.
                        if (c == '/' && token_val.Length >= 4
                            && token_val[token_val.Length - 2] == '*')
                        {
                            return true;
                        }
                        iCharVal = orig_code_stream.Peek();
                    }
                    // The input ended inside the comment: keep what was read.
                    return true;
                }
                else if (c == '/')
                {
                    // Line comment: read up to and including the newline.
                    orig_code_stream.Read();
                    token_val.Append("//");
                    this.token_type = TokenType.comment;
                    iCharVal = orig_code_stream.Peek();
                    while (iCharVal != -1 &&
                        (c = Convert.ToChar(iCharVal)) != '\n')
                    {
                        orig_code_stream.Read();
                        token_val.Append(c);
                        iCharVal = orig_code_stream.Peek();
                    }
                    if (c == '\n')
                    {
                        orig_code_stream.Read();
                        token_val.Append(c);
                    }
                    return true;
                }
                // A lone '/' is the division operator, which is plain code.
                token_val.Append("/");
                this.token_type = TokenType.code;
                return true;
            case '#':
                this.token_val.Append('#');
                // Unless a directive name follows, the '#' is treated as plain code.
                this.token_type = TokenType.code;
                iCharVal = orig_code_stream.Peek();
                if (iCharVal == -1)
                    return true;
                c = Convert.ToChar(iCharVal);
                // Whitespace may sit between '#' and the directive name.
                while (c == ' ' || c == '\r' || c == '\n' || c == '\t')
                {
                    this.token_val.Append(c);
                    orig_code_stream.Read();
                    iCharVal = orig_code_stream.Peek();
                    if (iCharVal == -1)
                        return true;
                    c = Convert.ToChar(iCharVal);
                }
                // Directive names are made of lowercase letters only.
                while (Char.IsLetter(c) && Char.IsLower(c))
                {
                    this.token_val.Append(c);
                    orig_code_stream.Read();
                    iCharVal = orig_code_stream.Peek();
                    if (iCharVal == -1)
                        break;
                    c = Convert.ToChar(iCharVal);
                }
                if (IsTokenPrePorcessor(this.token_val.ToString()))
                {
                    this.token_type = TokenType.pp;
                }
                return true;
            case '\'':
            case '\"':
                {
                    // String or character literal: read up to the matching,
                    // unescaped closing quote.
                    char q = c;
                    token_val.Append(q);
                    this.token_type = TokenType.cppstring;
                    while (true)
                    {
                        iCharVal = orig_code_stream.Peek();
                        if (iCharVal == -1)
                            return true; // unterminated literal: keep what was read
                        c = Convert.ToChar(iCharVal);
                        if (c == q)
                        {
                            // The quote is escaped only when an odd number of
                            // backslashes comes immediately before it.
                            int slashes = 0;
                            for (int i = token_val.Length - 1;
                                i > 0 && token_val[i] == '\\'; i--)
                            {
                                slashes++;
                            }
                            if (slashes % 2 == 0)
                            {
                                token_val.Append(q);
                                orig_code_stream.Read();
                                return true;
                            }
                        }
                        token_val.Append(c);
                        orig_code_stream.Read();
                    }
                }
            // Only the letters that a C++ keyword can start with are listed here.
            case 'a':
            case 'b':
            case 'c':
            case 'd':
            case 'e':
            case 'f':
            case 'g':
            case 'i':
            case 'l':
            case 'm':
            case 'n':
            case 'o':
            case 'p':
            case 'r':
            case 's':
            case 't':
            case 'u':
            case 'v':
            case 'w':
            case 'x':
                // Read the whole identifier, then check it against the keyword list.
                token_val.Append(c);
                iCharVal = orig_code_stream.Peek();
                while (iCharVal != -1)
                {
                    c = Convert.ToChar(iCharVal);
                    if (!(Char.IsLetter(c) || Char.IsDigit(c) || c == '_'))
                        break;
                    token_val.Append(c);
                    orig_code_stream.Read();
                    iCharVal = orig_code_stream.Peek();
                }
                // Reaching the end of the input still yields the identifier read so far.
                if (IsTokenKeyword(token_val.ToString()))
                {
                    this.token_type = TokenType.keyword;
                }
                else
                {
                    this.token_type = TokenType.code;
                }
                return true;
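            default:
                token_val.Append(c);
                this.token_type = TokenType.code;
                iCharVal = orig_code_stream.Peek();
                if (Char.IsLetter(c) || c == '_')
                {
                    // An identifier that cannot be a keyword. Read all of it so
                    // that a keyword inside it, such as the "int" in "Lint",
                    // is not highlighted by mistake.
                    while (iCharVal != -1)
                    {
                        c = Convert.ToChar(iCharVal);
                        if (!(Char.IsLetterOrDigit(c) || c == '_'))
                            break;
                        token_val.Append(c);
                        orig_code_stream.Read();
                        iCharVal = orig_code_stream.Peek();
                    }
                    return true;
                }
                // Operators, punctuation, digits and whitespace: read until a
                // character that could start another kind of token.
                while (iCharVal != -1)
                {
                    c = Convert.ToChar(iCharVal);
                    if (c == '/' || c == '#' || c == '\'' || c == '\"'
                        || Char.IsLetter(c) || c == '_')
                        break;
                    token_val.Append(c);
                    orig_code_stream.Read();
                    iCharVal = orig_code_stream.Peek();
                }
                return true;
        }
    }
    // Nothing left to read.
    return false;
}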
How do you use this method? Here is the C# code block that you will find in the Form class:
private string ConvertCppToHtml(string orig_line_val)
{
    StringReader strm = new StringReader(orig_line_val);
    StringBuilder bld = new StringBuilder();
    bld.Append("<table border=1" +
        " bordercolor=#000000 bordercolordark=#000000>");
    bld.Append("<tr>");
    bld.Append("<td>");
    bld.Append("<font size=2 face=\'Courier New\'>");
    while (true)
    {
        Tokenizer tok = new Tokenizer();
        if (tok.ParseCode(strm))
        {
            string outputVal = ChangeCharToHtml(tok.Value);
            switch (tok.Type)
            {
                case Tokenizer.TokenType.comment:
                    bld.Append("<font size=2 color=\'#008000\'>");
                    bld.Append(outputVal);
                    bld.Append("</font>");
                    break;
                case Tokenizer.TokenType.cppstring:
                    bld.Append("<font size=2 color=\'#800000\'>");
                    bld.Append(outputVal);
                    bld.Append("</font>");
                    break;
                case Tokenizer.TokenType.keyword:
                case Tokenizer.TokenType.pp:
                    bld.Append("<font size=2 color=\'#0000ff\'>");
                    bld.Append(outputVal);
                    bld.Append("</font>");
                    break;
                default:
                    bld.Append(outputVal);
                    break;
            }
        }
        else
            break;
    }
    bld.Append("</font>");
    bld.Append("</td>");
    bld.Append("</tr>");
    bld.Append("</table>");
    return bld.ToString();
}
This code uses a while loop that creates a new Tokenizer object on each pass. Each object parses one piece of C++ code. Depending on the type of the token, the loop highlights it using HTML tags, and it stops when ParseCode() returns false. When it is done, the result is a simple table that displays the C++ source code with pretty syntax highlighting.
One last thing I need to tell you about is the method called ChangeCharToHtml(). This method changes some characters to their HTML equivalents:
- "&" changes to "&amp;"
- "<" changes to "&lt;"
- ">" changes to "&gt;"
- "\"" changes to "&quot;"
- " " changes to "&nbsp;"
- "\t" changes to a run of "&nbsp;" entities
- "\r" changes to ""
- "\n" changes to "<br/>"
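To replace these characters, I used the Regex.Replace() method. Here is something you need to know: the replacements must be applied in the order listed above. "&" has to go first, because every entity produced afterwards contains an "&" that must not be escaped again, and "\n" has to come after "<" and ">", or the "<br/>" tag itself would be escaped. Here is the code: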
private string ChangeCharToHtml(string orig_val)
{
    string retVal = orig_val;
    retVal = Regex.Replace(retVal, "&",
        "&amp;", RegexOptions.IgnoreCase);
    retVal = Regex.Replace(retVal, "<",
        "&lt;", RegexOptions.IgnoreCase);
    retVal = Regex.Replace(retVal, ">",
        "&gt;", RegexOptions.IgnoreCase);
    retVal = Regex.Replace(retVal, "\"",
        "&quot;", RegexOptions.IgnoreCase);
    retVal = Regex.Replace(retVal, " ",
        "&nbsp;", RegexOptions.IgnoreCase);
    // Four non-breaking spaces per tab here; adjust to your preferred tab width.
    retVal = Regex.Replace(retVal, "\t",
        "&nbsp;&nbsp;&nbsp;&nbsp;", RegexOptions.IgnoreCase);
    retVal = Regex.Replace(retVal, "\r", "",
        RegexOptions.IgnoreCase);
    retVal = Regex.Replace(retVal, "\n",
        "<br/>", RegexOptions.IgnoreCase);
    return retVal;
}
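To see why the order of the replacements matters, here is a small sketch that escapes the same input in two different orders. The class EscapeOrderDemo is a made-up name for this illustration, and it uses String.Replace() rather than Regex.Replace() to keep the example short.

```csharp
using System;

// Hypothetical demo class; not part of the article's download.
public static class EscapeOrderDemo
{
    // Escapes '&' first and '<' second, the same order as the list above.
    public static string AmpersandFirst(string s)
    {
        return s.Replace("&", "&amp;").Replace("<", "&lt;");
    }

    // Escapes '<' first. The later '&' pass then rewrites the '&'
    // that belongs to the freshly produced "&lt;" entity.
    public static string AmpersandLast(string s)
    {
        return s.Replace("<", "&lt;").Replace("&", "&amp;");
    }
}
```

For the input "a<b", AmpersandFirst() produces "a&lt;b", which a browser shows as a<b. AmpersandLast() produces "a&amp;lt;b", which a browser shows as the literal text a&lt;b.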
Points of Interest
I hope you've enjoyed this little tutorial. After you download the source code and get it compiled, you should take a look at the code. You will find that there are a lot of improvements that can be made to the Tokenizer class. For example, if you are really into refactoring, you should clean up the ParseCode() method. Each of the cases can be separated into a private or protected method so that you can unit test each parsing method before integrating it into ParseCode(). If you have any questions, leave them in the forum below. Thank you for visiting.
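As a rough idea of what such a refactoring could look like, here is a sketch that pulls the "//" comment case out into its own method. The names CommentParsers and ReadLineComment are hypothetical, and the sketch assumes the caller has already consumed both slashes. It is written as a public static method only so that it can be called directly from a test.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical refactoring sketch; not part of the article's download.
public static class CommentParsers
{
    // Reads the rest of a "//" comment, up to and including the newline.
    // Assumes both slashes have already been consumed by the caller.
    public static string ReadLineComment(TextReader reader)
    {
        StringBuilder token = new StringBuilder("//");
        int value = reader.Peek();
        while (value != -1)
        {
            char c = Convert.ToChar(value);
            reader.Read();
            token.Append(c);
            if (c == '\n')
                break;          // the newline ends the comment
            value = reader.Peek();
        }
        return token.ToString();
    }
}
```

A unit test can now feed the method a StringReader over " hi\nint x;" and check two things: that it returns "// hi\n", and that the next Peek() on the reader sees the 'i' of "int".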
History
- 1/28/2006 -- First draft.