Introduction
Recently I needed an application which can generate random, human-readable names. My searches lead me to Markov Chains, and how they can be built and used for random words or names generation.
I found an old post exposing a C++ implementation, on which I based my approach. The map-building implementation mainly inherits from this C++ implementation, but I completely changed the random generation part, which I considered as unoptimized (relying on huge strings manipulations), at least for a .NET managed solution.
This article describes the resulting implementation, and provides a basic Winforms application around it.
Background
I will not describe here the basic concept behind Markov chains; this has already been done by people who have much more expertise than me on the subject, including here on CodeProject. You will find in the Links section below a list of relevant articles about them.
Let's just explain the general principle of the random generation:
- A list of known/acceptable words has to be provided first so that a map can be built.
- A map is built, referencing each letters-sequence (to a given depth) and how frequently it appears in the samples, as well as the place in the words a sequence can appear at (at start, in the middle, or at the end of the word).
- This map is then used to generate random words. A former list of already generated and/or forbidden words can be provided, so that the algorithm does not generate the same word twice.
What does a map look like?
A map associates a sequence of one or more letters to a list of possible next characters; for each of these possible following characters, a counter is maintained about the number of occurrences as well as the position in the string.
Let's see a depth-1 map built on the single word 'test'. This would look like:
Sequence Next First Inner Last
-------- ---- ----- ----- ----
t e 1 0 0
e s 0 1 0
s t 0 0 1
If we increase the depth of the map to two characters, the following occurrences will be added to the above list:
Sequence Next First Inner Last
-------- ---- ----- ----- ----
te s 1 0 0
es t 0 0 1
How is the map used to generate random words?
- First, as sequence is randomly chosen amongst all available sequences.
- Then, a list of all following characters (for all sequences, not only for the one randomly chosen) is built; if a character belongs to the possible following characters of chosen sequence, its frequency is taken from the one in the map; if it does not belong to possible following characters for this sequence, it is given a minimal frequency instead (which can be chosen). So that unavailbale combinations in the samples still have a slight chance to appear in the results.
- Finally, a random character is chosen from above list, and its chance to be chosen is proportional to its frequency.
Implementation
The Position enumeration
The Position
enumeration is used to query a map for a specific frequency value.
public enum Position : byte
{
First,
Inner,
Last
}
The StateFrequency class
The StateFrequency
class is quite simple, and just holds three integer variables to store frequency information about a given state. It also has a few methods allowing to increment internal fields.
public class StateFrequency : IEquatable<StateFrequency>
{
public int FirstCount { get; private set; }
public int InnerCount { get; private set; }
public int LastCount { get; private set; }
public StateFrequency()
: this(0, 0, 0) { }
private StateFrequency(int firstCount, int innerCount, int lastCount)
{
FirstCount = firstCount;
InnerCount = innerCount;
LastCount = lastCount;
}
public void Add(StateFrequency other)
{
IncrementFirst(other.FirstCount);
IncrementInner(other.InnerCount);
IncrementLast(other.LastCount);
}
public bool Equals(StateFrequency other)
{
if (other == null)
{
return false;
}
if (ReferenceEquals(this, other))
{
return true;
}
return ((other.FirstCount == FirstCount) && (other.InnerCount == InnerCount) && (other.LastCount == LastCount));
}
public override bool Equals(object obj)
{
return (obj is StateFrequency) ? Equals((StateFrequency)obj) : false;
}
public overrride bool GetHashCode()
{
int hash = 351;
hash += FirstCount.GetHashCode();
hash *= 13;
hash += InnerCount.GetHashCode();
hash *= 13;
return hash + LastCount.GetHashCode();
}
public int GetValue(Position position)
{
switch (position)
{
case Position.First:
return FirstCount;
case Position.Last:
return LastCount;
case Position.Inner:
default:
return InnerCount;
}
}
public void Increment(int count = 1)
{
IncrementFirst(count);
IncrementInner(count);
IncrementLast(count);
}
public void IncrementFirst(int count = 1)
{
FirstCount += count;
}
public void IncrementInner(int count = 1)
{
InnerCount += count;
}
public void IncrementLast(int count = 1)
{
LastCount += count;
}
public override string ToString()
{
return $"First: {FirstCount}; Inner: {InnerCount}; Last: {LastCount}";
}
}
The Map class
The Map
class inherits Dictionary<TKey, TValue>
class, and exposes a single method allowing to add an element to the dictionary without taking care of a preexisting key. This will later make operations a little simpler to write. Both classes presented below will inherit this base Map
class.
public class Map<TKey, TValue> : Dictionary<TKey, TValue>
{
public bool EventuallyAdd(TKey key, TValue value)
{
bool add = !ContainsKey(key);
if (add)
{
Add(key, value);
}
return add;
}
}
The FrequencyMap class
The FrequencyMap<T>
class is defined as a Map<T, StateFrequency>
. T
being the type of states we will work on (here, string
).
It presents a Merge
method which allows to merge a FrequencyMap<T>
with any IDictionary<T, StateFrequency>
, and returns the number of pairs that have actually been added.
public class FrequencyMap<T> : Map<T, StateFrequency>
{
public int Merge(IDictionary<T, StateFrequency> dictionary)
{
int count = 0;
T key;
StateFrequency value;
foreach (var pair in dictionary)
{
key = pair.Key;
value = pair.Value;
if (!EventuallyAdd(key, value))
{
this[key].Add(value);
}
else
{
count++;
}
}
return count;
}
}
The MarkovChainMap class
The MarkovChainMap<T>
class is defined as a Map<T, FrequencyMap<T>>
. It is the map class to which I refered earlier to describe the whole random generation process.
It implements the ISerializable
interface so that it can be saved/loaded to/from a file.
It also exposes (not shown here) methods to:
- dump a map to a string.
- dump a map to a string and save it to a text file.
- get the total number of frequencies recorded in the map.
- get the average number of frequency per entry in the map.
- load and save a map from/to a binary file.
- save a map to a
byte
array.
public sealed class MarkovChainMap<T> : Map<T, FrequencyMap<T>>, ISerializable
{
public const int DefaultDepth = 1;
public int Depth { get; private set; }
public MarkovChainMap(int depth = DefaultDepth)
: base()
{
Depth = depth;
}
private MarkovChainMap(SerializationInfo info, StreamingContext context)
: base(info, context)
{
Depth = info.GetInt32(nameof(Depth));
}
public List<T> GetNextStates()
{
HashSet<T> states = new HashSet<T>();
T key;
foreach (var pair in this)
{
key = pair.Key;
foreach (var pair2 in pair.Value)
{
states.Add(pair2.Key);
}
}
return states.ToList();
}
public override void GetObjectData(SerializationInfo info, StreamingContext context)
{
base.GetObjectData(info, context);
info.AddValue(nameof(Depth), Depth);
}
public List<T> GetStartStates()
{
return Keys.ToList();
}
public void Merge(MarkovChainMap<T> other)
{
T key;
foreach (var pair in other)
{
key = pair.Key;
EventuallyAdd(key, new FrequencyMap<T>());
this[key].Merge(pair.value);
}
}
}
Delegates
In order to build a frequency map for a given type, we will need two functions which will allow to:
- partition a
T
value (a sequence) into individual, single parts. - aggregate two or more
T
values to form a sequence.
Both these methods are defined by two delegates, respectively:
public delegate T[] PartitionFunc<T>(T value);
public delegate T AggregationFunc<T>(IEnumerable<T> values, int start, int count);
The MarkovChainMapBuilder and MarkovChainStringMapBuilder classes
The MarkovChainMapBuilder
class is a generic class which holds the delegates defined above. The depth of the desired map can also be specified.
public class MarkovChainMapBuilder<T>
{
private AggregationFunc<T> _aggregate;
private PartitionFunc<T> _partition;
private int _depth;
protected AggregationFunc<T> Aggregate
{
get { return _aggregate; }
set
{
if (value == null)
{
throw new ArgumentNullException(nameof(value));
}
_aggregate = value;
}
}
protected PartitionFunc<T> Partition
{
get { return _partition; }
set
{
if (value == null)
{
throw new ArgumentNullException(nameof(value));
}
_partition = value;
}
}
public virtual int Depth
{
get { return _depth; }
set
{
if (value < 1)
{
throw new ArgumentException(nameof(value));
}
_depth = value;
}
}
public MarkovChainMapBuilder(int depth, PartitionFunc<T> partition, AggregationFunc<T> aggregate)
{
if (depth < 1)
{
throw new ArgumentException(nameof(depth));
}
if (partition == null)
{
throw new ArgumentNullException(nameof(partition));
}
if (aggregate == null)
{
throw new ArgumentNullException(nameof(aggregate));
}
_depth = depth;
_partition = partition;
_aggregate = aggregate;
}
}
It also exposes a public Train
method which returns a MarkovChainMap<T>
from a list of samples, and a protected Train
method to qualify an existing MarkovChainMap<T>
with a given, single sample.
public MarkovChainMap<T> Train(IEnumerable<T> samples)
{
MarkovChainMap<T> map = new MarkovChainMap<T>(_depth);
foreach (T sample in samples)
{
if (sample != null)
{
Train(map, sample);
}
}
return map;
}
protected virtual void Train(MarkovChainMap<T> map, T sample)
{
T[] array = _partition(sample);
int length = array.Length;
int depth = 1;
FrequencyMap<T> frequencyMap;
StateFrequency stateFrequency;
T key, value;
int position, current, limit;
while (depth <= _depth)
{
if (depth < length)
{
limit = length - depth;
for (position = 0; (current = position + depth) < length; position++)
{
key = _aggregate(array, position, depth);
value = array[position + depth];
map.EventuallyAdd(key, new FrequencyMap<T>());
frequencyMap = map[key];
frequencyMap.EventuallyAdd(value, new StateFrequency());
stateFrequency = frequencyMap[value];
if (position == 0)
{
stateFrequency.IncrementFirst();
}
else if (current < limit)
{
stateFrequency.IncrementInner();
}
else
{
stateFrequency.IncrementLast();
}
}
}
depth++;
}
}
The MarkovChainStringMapBuilder
class is a subclass of the MarkovChainMapBuilder
specific to string
type.
It provides basic PartitionFunc
and AggregationFunc
functions dedicated to string
type.
It also allows to specify the way we want to handle whitespaces in samples. Whether we want to skip the whitespaces (i.e., split the sample into several on the whitespace character) or not (i.e., a whitespace is a valid character to be considered in the final output).
It provides a TrainFromFile
method which allows to build a map by providing the path to the samples text file. A comment specifier can also be specified; lines beginning with this string will not be considered as a sample and will be skipped.
The PartitionFunc
and AggregationFunc
functions are implemented as static methods.
It finally provides a ReadWord
method from a StreamReader
; this allows to split a given sample when it contains whitespace(s) and the SkipWhitespace
property is set to true
.
A static Lazy<StringBuilder>
is also present, preventing to have to create and destroy a StringBuilder
anytime we need one in ReadWord
method.
public class MarkovChainStringMapBuilder : MarkovChainMapBuilder<string>
{
private static readonly Lazy<StringBuilder> _sb = new Lazy<StringBuilder>(() => new StringBuilder(16));
public bool SkipWhitespace { get; set; }
public MarkovChainStringMapBuilder(int depth, bool skipWhitespace = true)
: this(depth, (v) => ToStringArray(v), (v, s, c) => Join(v, s, c), skipWhitespace) { }
public MarkovChainStringMapBuilder(int depth, PartitionFunc<string> partition, AggregationFunc<string> aggregate, bool skipWhitespace = true)
: base(depth, partition, aggregate)
{
SkipWhitespace = skipWhitespace;
}
public MarkovChainMap<string> TrainFromFile(string path, string commentSpecifier = null)
{
commentSpecifier = commentSpecifier ?? string.Empty;
List<string> values = new List<string>();
string value;
using (StreamReader reader = new StreamReader(path))
{
while (!reader.EndOfStream)
{
value = (SkipWhitespace) ? ReadWord(reader) : reader.ReadLine();
if (!string.IsNullOrWhiteSpace(value))
{
if (string.IsNullOrEmpty(commentSpecifier) || !value.StartsWith(commentSpecifier))
{
values.Add(value);
}
}
}
}
return Train(values);
}
private static string Join(IEnumerable<string> slices, int start, int count)
{
string result = string.Empty;
switch (slices.Count())
{
case 0:
break;
case 1:
result = slices.First();
break;
default:
string[] values = slices.Skip(start).Take(count).ToArray();
result = string.Join(string.Empty, values);
break;
}
return result;
}
private static string ReadWord(StreamReader reader)
{
StringBuilder sb = _sb.Value;
sb.Clear();
char c;
int read;
while ((read = reader.Read()) != -1)
{
c = (char)read;
if (char.IsWhiteSpace(c))
{
break;
}
sb.Append(c);
}
return sb.ToString();
}
private static string[] ToStringArray(string input)
{
List<string> characters = new List<string>(input.Length);
foreach (char c in input)
{
characters.Add(c.ToString());
}
return characters.ToArray();
}
}
The Sieve class
The Sieve<T>
class is responsible for returning a T
random value from a specified MarkovChainMap<T>
and a Position
.
It does that by creating an int
array, which contains as many elements as there are in the map, and in which each element is the cumulative sum of frequencies in the map (minored by the MinFrequency
property).
It then choose a random number between zero and the value at the last index in the array, and returns the T
value which corresponds to this random number.
public sealed class Sieve<T>
{
public const int DefaultMinFrequency = 1;
private int _minFrequency;
private Random _random;
public int MinFrequency
{
get { return _minFrequency; }
set { _minFrequency = Math.Max(0, value); }
}
public Random Random
{
get { return _random; }
set { _random = value ?? new Random(); }
}
public Sieve(int minFrequency = DefaultMinFrequency, Random random = null)
{
Random = random;
MinFrequency = minFrequency;
}
public int[] GetBuckets(FrequencyMap<T> map, Position position)
{
int[] buckets = new int[map.Count];
int sum = 0;
int index = 0;
StateFrequency frequency;
int value;
foreach (var pair in map)
{
frequency = pair.Value;
value = Math.Max(_minFrequency, frequency.GetValue(position));
sum += value;
buckets[index++] = sum;
}
return buckets;
}
public T GetValue(FrequencyMap<T> map, Position position)
{
int[] buckets = GetBuckets(map, position);
int length = buckets.Length;
int total = buckets[length - 1];
int value = _random.Next(0, total);
int index;
for (index = 0; index < length; index++)
{
if (value < buckets[index])
{
break;
}
}
return map.Keys.ToList()[index];
}
}
The MarkovChainsNameGenerator class
The MarkovChainsNameGenerator
class is the one that is responsible to generate random words using above described objects. It is the only one that has to be used for someone who is only interested in the functionality.
It allows to specify:
- whether we want to capitalize generated names;
- an already built
MarkovChainMap<string>
to use, eventually. - the depth of the map.
- the minimum and maximum number of characters of desired outputs.
- the minimum frequency to take into account while generating random names.
- a
Random
object to use, eventually. - whether we want to skip whitespaces in samples while building the map.
Again, the code presented here is a summarized version of the actual one in the solution. Property-setters actually validate provided values, to prevent some common inconsistencies:
- The
MapDepth
property is set whenever a valid map is provided to the setter of the Map
property. - The
MapDepth
property cannot be lower than one. - The
MinLength
property is lower-bounded to one. - The
MaxLength
property is lower-bounded to MinLength
property. - The
MinFrequency
property is lower-bounded to zero. - The
Random
property initializes a new default Random
object whenever a null value is provided to the setter. - Setting the
Map
property also sets the MapDepth
property accordingly. - Setting the
MinFrequency
property sets the corresponding property of the internal Sieve<string>
object. - Setting the
SkipWhitespace
property sets the corresponding property of the internal MarkovChainStringMapBuilder
object.
public sealed class MarkovChainsNameGenerator
{
private const int DefaultDepth = MarkovChainMap<string>.DefaultDepth;
private const int DefaultMinFrequency = Sieve<string>.DefaultMinFrequency;
private const int DefaultMaxLength = 9;
private const int DefaultMinLength = 3;
private const int Maximum = int.MaxValue - 1;
private readonly HashSet<string> _blacklistedNames;
private readonly MarkovChainStringMapBuilder _mapBuilder;
private readonly StringBuilder _sb;
private readonly Sieve<string> _sieve;
public bool Capitalize { get; set; }
public MarkovChainMap<string> Map { get; set; }
public int MapDepth { get; set; }
public int MaxLength { get; set; }
public int MinFrequency { get; set; }
public int MinLength { get; set; }
public Random Random { get; set; }
public bool SkipWhitespace { get; set; }
public int Seed
{
set { Random = new Random(value); }
}
private double RangeLength
{
get { return MaxLength - MinLength + 1d; }
}
public MarkovChainsNameGenerator(
Random random = null,
int mapDepth = DefaultDepth,
int minFrequency = DefaultMinFrequency,
int minLength = DefaultMinLength,
int maxLength = DefaultMaxLength,
bool capitalize = true,
bool skipWhitespace = true,
IEnumerable<string> blacklistedNames = null)
{
Random = random ?? new Random(Environment.TickCount);
MapDepth = mapDepth;
MinFrequency = Math.Max(0, minFrequency);
MinLength = Math.Max(1, minLength);
MaxLength = Math.Max(MinLength, maxLength);
Capitalize = capitalize;
SkipWhitespace = skipWhitespace;
_mapBuilder = new MarkovChainStringMapBuilder(MapDepth, SkipWhitespace);
_sieve = new Sieve<string>(MinFrequency, Random);
_sb = new StringBuilder(_maxLength);
_blacklistedNames = new HashSet<string>(blacklistedNames) ?? new HashSet<string>();
}
public MarkovChainsNameGenerator(
MarkovChainMap<string> map,
Random random = null,
int mapDepth = DefaultDepth,
int minFrequency = DefaultMinFrequency,
int minLength = DefaultMinLength,
int maxLength = DefaultMaxLength,
bool capitalize = true,
bool skipWhitespace = true,
IEnumerable<string> blacklistedNames = null)
: this(random, map.Depth, minFrequency, minLength, maxLength, capitalize, skipWhitespace, blacklistedNames)
{
_map = map;
}
public string GetName()
{
StringBuilder sb = _sb;
sb.Clear();
List<string> startLetters = Map.GetStartStates();
List<string> nextCharacters = Map.GetNextStates();
int startLetterCount = startLetters.Count;
int nextCharacterCount = nextCharacters.Count;
int totalLength, currentLength, selectionLength;
string initial, key, result;
do
{
totalLength = (int)(MinLength + RangeLength * Random.Next() / Maximum);
initial = startLetters[(int)(startLetterCount * (double)Random.Next() / Maximum)];
sb.Append(initial);
currentLength = initial.Length;
while (currentLength < totalLength)
{
selectionLength = Random.Next(1, Math.Min(currentLength, MapDepth) + 1);
do
{
key = sb.ToString().Substring(currentLength - selectionLength, selectionLength);
} while (!Map.ContainsKey(key) && (--selectionLength > 0));
if (selectionLength == 0)
{
key = nextCharacters[(int)(nextCharacterCount * (double)Random.Next() / Maximum)];
}
sb.Append(_sieve.GetValue(Map[key], (currentLength == 1) ? Position.First : (currentLength == totalLength - 1) ? Position.Last : Position.Inner));
currentLength++;
}
result = sb.ToString();
if (Capitalize)
{
result = MakeProperCase(result);
}
} while (_blacklistedNames.Contains(result));
_blacklistedNames.Add(result);
return result;
}
public IEnumerable<string> GetNames(int count)
{
if (count < 1)
{
yield break;
}
for (int i = 0; i < count; i++)
{
yield return GetName();
}
}
public void TrainMapBuilder(string path, string commentSpecifier = null)
{
Map = _mapBuilder.TrainFromFile(path, commentSpecifier);
}
private static string MakeProperCase(string input)
{
string initial = input.Substring(0, 1).ToUpper(Thread.CurrentThread.CurrentCulture);
return $"{initial}{input.Substring(1, input.Length - 1)}";
}
}
Using the code
The MarkovChainsNameGenerator
class is the only class that has to be used.
First, one has to add a reference to the RandomNameGenerator.Core.dll library.
Then, the library can be used this way:
using RandomNameGenerator.Core;
int Main()
{
MarkovChainsNameGenerator generator = new MarkovChainsNameGenerator();
generator.TrainMapBuilder(@"C:\path\to\the\samples\file");
IEnumerable<string> names = generator.GetNames(100);
}
Using the provided Windows Forms application
Here is how application's main screen looks like:
- The ListView where generated names will be displayed. List's items can be checked or not.
- Here you can specify in which directory you wish to save the lists of accepted and rejected names. Only accepted names' file name can be specified, rejected names' file name being built relative to the former.
- Here you have a list of already built and saved maps, as well as buttons allowing to manage them. A specific map has to be activated (with corresponding button) before names can begin being generated.
- Here you can configure how you want the names to be generated. Batch size corresponds to the number of generated names when you click on 'Generate Names' and 'Commit' buttons below. Total corresponds to the total number of names you finally need. Other parameters already have been explained above.
- When you click on this button, random names are generated and populate the list. Previous displayed list is discarded, and nothing is added to the accepted and rejected names lists.
- When you click on this button, a window is opened and displays the list of accepted names so far.
- When you click on this button, a window is opened and displays the list of rejected names so far.
- When you click on this button, checked names in the list are added to accepted names list, and unchecked ones are added to rejected names list. Finally the list is populated with a new batch of random names, which cannot already be in accepted and rejected names lits.
- The status bar displaying a few counters: the total number of names finally needed, the number of names that have been generated so far (along with the ratio related to total number needed), the remaining number of names to be accepted (along with the ratio related to total number needed), the number of accepted names so far (along with the ratio related to the number of names already generated), the number of rejected names so far (along with the ratio related to the number of names already generated), the time spent so far building maps, and the time spent so far generating names.
There is also an Options screen:
It allows to configure:
- the extension of maps save files.
- the extension of maps string dump files.
- whether maps should be dumped upon generation.
- an eventual comment specifier for samples files.
- whether map generation should skip whitespaces.
- the extension of output files (accepted and rejected names).
- the flag added to file name for rejected names.
- the extension of backup files when accepted and rejected lists are being reset.
- the font of main screen's names list.
In this screen you can also reset lists and counters to eventually get back to initial state of application.
And here is the map creation screen:
Dictionary relates to sample names file.
Finally, here is the credits screen:
Blue items in the list are clickable links, and the link they refer to is displayed as a tooltip.
Notes
- The generic solution provided can theorically be used on other types than strings. Though, I did not try to, nor do I plan to do it someday. If you want to use/adapt it to another type of sequences, you have to provide corresponding delegates (
PartitionFunc<T>
and AggregationFunc<T>
) for given type. - The code presented in this article is a shortened version of the one provided in the solution. Nothing has been changed in the core implementation, but code in the solution is (almost) fully documented, at least for the reusable part. Windows Forms project is not documented, though, as it is not the subject of this article.
- I used auto-implemented properties for the code presented in this article, for the sake of simplicity. In downloadable solution, though, I am using private fields to store properties' values.
- Thanks to Phillip Piper [^] for his awesome ObjectListView [^], which proves itself more valuable every day.
- I included two samples files if you do not want to spend time searching for one. They are located in the Samples folder in the root folder of the solution. Though, I did not include the generated maps related to these samples files. You will have to generate the map from the application.
Possible future improvements
- Being able to provide functions to limit specific characters combinations (for example, discard a result if it contains more than two following consonants, or if it contains specific characters combinations).
- Provide already built maps in the Winforms application, for given names, family names, possibly related to a specific culture.
- Implement a solution to generate random full names (given names + family names), possibly related to a specific culture.
- Implement
async
versions of GetName
and GetNames
methods.
Links
History
- 2016/10/29: Version 1.0.0.0 - First publication