Introduction
My scanners/tokenizers all take IEnumerable<char> as their streaming character source. This gives them a simple, ubiquitous streaming interface that is no-frills but adaptable. It can take a string or a char[] out of the box, or you can provide your own source. This library contains sources for files, URLs, console input and a generic TextReader.
All of my major Unicode support is UTF-32 internally, so that a surrogate pair is treated as a single character. Text is then represented as whole Unicode code points rather than as individual UTF-16 code units. This is critical for proper Unicode support all the way up through the 21-bit range that Unicode provides. This library provides a simple enumerator for converting an IEnumerator<char> to an IEnumerator<int> of UTF-32 code units.
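To make the surrogate pair handling concrete, here is a minimal sketch of that kind of conversion. It is not the library's Utf32Enumerable: the Utf32Sketch class and its ToUtf32 method are hypothetical names used only for illustration, and passing unpaired surrogates through unchanged is an assumption about how malformed input might be handled.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only; not part of the library.
static class Utf32Sketch
{
    // Yields one int per code point, combining each valid surrogate pair.
    public static IEnumerable<int> ToUtf32(IEnumerable<char> source)
    {
        char pendingHigh = '\0';
        bool hasPending = false;
        foreach (char c in source)
        {
            if (hasPending)
            {
                hasPending = false;
                if (char.IsLowSurrogate(c))
                {
                    // A complete pair becomes a single UTF-32 value.
                    yield return char.ConvertToUtf32(pendingHigh, c);
                    continue;
                }
                // Assumption: an unpaired high surrogate is passed through as-is.
                yield return pendingHigh;
            }
            if (char.IsHighSurrogate(c))
            {
                // Hold the high surrogate until the next char arrives.
                pendingHigh = c;
                hasPending = true;
            }
            else
            {
                yield return c;
            }
        }
        // A high surrogate at the very end of the input has no partner.
        if (hasPending)
            yield return pendingHigh;
    }
}

static class Program
{
    static void Main()
    {
        // "A" followed by U+1F600, which UTF-16 stores as two chars.
        foreach (int cp in Utf32Sketch.ToUtf32("A\U0001F600"))
            Console.WriteLine("U+{0:X4}", cp);
        // Prints U+0041, then U+1F600
    }
}
```

The string in the example holds three UTF-16 code units, but only two values come out, which is the behavior a tokenizer working on code points needs.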
Using the Code
The code is straightforward to use, apart from a couple of gotchas, which I'll cover after the example.
// stream a file one character at a time
var fr = new FileReaderEnumerable(@"..\..\Program.cs");
foreach (var ch in fr)
    Console.Write(ch);
Console.WriteLine();
Console.WriteLine();

// stream a URL, stopping after the first 79 characters
var ur = new UrlReaderEnumerable(@"http://www.google.com");
var i = 0;
foreach (var ch in ur)
{
    if (79 == i)
    {
        Console.Write("...");
        break;
    }
    Console.Write(ch);
    ++i;
}
Console.WriteLine();
Console.WriteLine();

// convert UTF-16 chars to UTF-32 values, then back to strings for display
var test = "This is a test \U0010FFEE";
var uni = new Utf32Enumerable(test);
foreach (var uch in uni)
{
    var str = char.ConvertFromUtf32(uch);
    Console.Write(str);
}
Console.WriteLine();
Console.WriteLine();

// wrap any TextReader; note the factory method instead of a constructor
var reader = new StringReader("This is a demo of TextReaderEnumerable");
foreach (char ch in TextReaderEnumerable.FromReader(reader))
    Console.Write(ch);
Console.WriteLine();
Console.WriteLine();
The gotchas are as follows. TextReaderEnumerable must be created via FromReader(), unlike the others, which are created via a constructor. It also cannot be reset and can only be enumerated once (no seeking). Attempting to enumerate it a second time will throw. Furthermore, it's important to call Dispose() on the IEnumerator<> instances if you are using them manually. foreach does this for you automatically. Finally, ConsoleReaderEnumerable doesn't know when to stop until you type Ctrl-Z into the console. It's usually used for file piping.
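The single-pass and Dispose() points are easier to see in code. The following is a minimal sketch of how a one-shot, reader-backed source might behave. OneShotReaderEnumerable is a hypothetical stand-in, not the library's TextReaderEnumerable, and the InvalidOperationException on the second pass is an assumption, since the text above only says that it throws.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.IO;

// Hypothetical single-pass wrapper; not the library's TextReaderEnumerable.
sealed class OneShotReaderEnumerable : IEnumerable<char>
{
    readonly TextReader _reader;
    bool _used;

    OneShotReaderEnumerable(TextReader reader) { _reader = reader; }

    // Factory method, mirroring the FromReader() style of creation.
    public static OneShotReaderEnumerable FromReader(TextReader reader)
    {
        if (reader == null) throw new ArgumentNullException(nameof(reader));
        return new OneShotReaderEnumerable(reader);
    }

    public IEnumerator<char> GetEnumerator()
    {
        // A reader cannot be rewound, so a second pass is refused.
        // Assumption: the exception type here is chosen for illustration.
        if (_used)
            throw new InvalidOperationException("This source can only be enumerated once.");
        _used = true;
        return Read();
    }

    IEnumerator<char> Read()
    {
        try
        {
            int ch;
            while ((ch = _reader.Read()) != -1)
                yield return (char)ch;
        }
        finally
        {
            // Runs when enumeration completes or the enumerator is disposed.
            _reader.Dispose();
        }
    }

    IEnumerator IEnumerable.GetEnumerator() { return GetEnumerator(); }
}

static class Program
{
    static void Main()
    {
        var source = OneShotReaderEnumerable.FromReader(new StringReader("abc"));

        // Manual enumeration: the using block guarantees Dispose() is called,
        // which is what foreach would otherwise do for you.
        using (IEnumerator<char> e = source.GetEnumerator())
        {
            while (e.MoveNext())
                Console.Write(e.Current);
        }
        Console.WriteLine();

        try
        {
            source.GetEnumerator(); // second pass
        }
        catch (InvalidOperationException ex)
        {
            Console.WriteLine(ex.Message);
        }
    }
}
```

In this sketch the finally block is what releases the underlying reader, so an enumerator that is abandoned without Dispose() would leave the reader open. That is the reason manual enumeration should always sit inside a using block.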
The nice thing about these interfaces is that it's super easy to load them into List<char> and List<int> instances if you need to stash them, and it's easy to write adapters for them. Utf32Enumerable is one such adapter, but you can do pretty much anything you can do with an enumerator. You can also use LINQ on file streams this way.
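As a rough illustration of stashing and querying, the sketch below uses a plain string as a stand-in for any IEnumerable<char> source. The library's sources should slot in the same way, assuming they implement IEnumerable<char> as described above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class Program
{
    static void Main()
    {
        // A string stands in for any IEnumerable<char> source here.
        IEnumerable<char> source = "Hello, World 42";

        // Stash the characters in a list for repeated or random access.
        var chars = new List<char>(source);

        // LINQ operators work directly on the character stream.
        int letters = source.Count(char.IsLetter);
        string digits = new string(source.Where(char.IsDigit).ToArray());

        Console.WriteLine(chars.Count); // 15
        Console.WriteLine(letters);     // 10
        Console.WriteLine(digits);      // 42
    }
}
```

Note that this example enumerates the source three times. With a single-pass source, load it into the list first and run the queries against the list instead.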
Have fun!
History
- 2nd February, 2020 - Initial submission