Overview
This article presents a maximum entropy modeling library called SharpEntropy and discusses its usage, first by way of a simple example of predicting outcomes, and secondly by presenting a way of splitting English sentences into constituent tokens (useful for natural language processing). Please note that because most of the code is a conversion based on original Java libraries published under the LGPL license, the source code available for download with this article is also released under the LGPL license. This means it can be used freely in software released under any sort of license, but if you make changes to the library itself and distribute them, you must release the source code for those changes.
A second article, Statistical parsing of English sentences, shows how SharpEntropy can be used to perform sophisticated natural language processing tasks. A small additional library, SharpEntropySqlite, downloadable with this article, provides a simple means of storing maximum entropy models created by SharpEntropy in a Sqlite database.
Introduction
SharpEntropy is a C# port of the MaxEnt toolkit available from SourceForge. MaxEnt is a mature library written in Java, originally created by Jason Baldridge, Tom Morton, and Gann Bierner. SharpEntropy is based upon the latest version (2.3.0) of the MaxEnt toolkit, released in August 2004.
Maximum entropy modeling is a general-purpose machine learning technique originally developed for statistical physics, but which has been employed in a wide variety of fields, including computer vision and natural language processing. It can be usefully applied whenever there is a complex process whose internals are unknown but which must be modeled by the computer. Like any statistical modeling technique, it relies on the existence of a data sample that shows, for given sets of input, what output the process generates. This sample is analysed, and from it, a model is generated, encapsulating all the rules about the process that could be inferred from the sample. This model is then used to predict the output of the process, when supplied with sets of input not found in the sample data.
What technique should be used to generate the model from the sample data? The maximum entropy method follows the principle of "maximizing entropy"; that is, it chooses the model that takes account of all the facts available in the sample data but otherwise preserves as much uncertainty as possible. The principle could be stated as: "Don't assume anything about your probability distribution other than what you have observed".
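For readers who want a little more detail, the principle can be stated formally (this is the standard textbook formulation, not anything specific to SharpEntropy). The method chooses the distribution $p$ that maximizes the entropy

$$H(p) = -\sum_{x} p(x) \log p(x)$$

subject to the constraint that, for each feature $f_i$, the expected value of $f_i$ under the model equals its observed average in the training sample. The solution always takes the log-linear form

$$p(o \mid c) = \frac{1}{Z(c)} \exp\left(\sum_{i} \lambda_i f_i(o, c)\right)$$

where $o$ is an outcome, $c$ is a context, and $Z(c)$ is a normalizing constant; training algorithms such as GIS exist simply to estimate the weights $\lambda_i$.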
The more mathematically inclined may wish to follow some of the references at the end of the article to find out more about the maximum entropy method.
A Simple Example
Suppose you wish to predict whether on a given day, a certain man will leave the house with his umbrella or not. The first thing you do is gather some sample data by observing over a number of days what the man does:
Day 1 Warm Dry No_Umbrella
Day 2 Cold Dry No_Umbrella
Day 3 Cold Rainy Umbrella
Day 4 Cold Dry Umbrella
Day 5 Warm Dry No_Umbrella
After five days, you realise that there may be an additional factor at work: whether the man is late in the morning or not. So, you collect some more sample data:
Day 6 Cold Dry Early Umbrella
Day 7 Cold Rainy Early Umbrella
Day 8 Cold Dry Late No_Umbrella
Day 9 Warm Rainy Late No_Umbrella
Day 10 Warm Dry Late No_Umbrella
Each of these rows of data represents a training event. Each training event has an outcome (in this example, either "Umbrella" or "No_Umbrella") and a context, which consists of one or more predicates ("Cold", "Dry", etc.). The context is the combination of predicates that leads to the outcome of the event. Note that the order of the predicates is not significant.
The training data for our simple umbrella example can be found in the file named data.txt in the same folder as the SimpleExample executable. This is a text file with one training event per line. In each line, the contextual predicates and the outcome are separated by single spaces (the outcome is always assumed to be the last string on the line). This training data format is chosen because the SharpEntropy library already has a class called SharpEntropy.PlainTextByLineDataReader that can read text files delimited in this way. The following code, taken from the source of the SimpleExample application, shows how this training data file can be loaded and used to generate a maximum entropy model:
System.IO.StreamReader trainingStreamReader =
    new System.IO.StreamReader(trainingDataFile);
SharpEntropy.ITrainingEventReader eventReader =
    new SharpEntropy.BasicEventReader(
        new SharpEntropy.PlainTextByLineDataReader(trainingStreamReader));

SharpEntropy.GisTrainer trainer = new SharpEntropy.GisTrainer();
trainer.TrainModel(eventReader);
mModel = new SharpEntropy.GisModel(trainer);
The GisTrainer class trains a maximum entropy model using the GIS (Generalized Iterative Scaling) method. There are a number of other methods for training MaxEnt models, many of which outperform GIS. Currently, however, GIS is the only method that the Java MaxEnt library supports, and therefore the only method whose algorithm was converted for use in SharpEntropy. The trainer class expects to be fed with a stream of events provided by a class that implements the ITrainingEventReader interface. The PlainTextByLineDataReader and BasicEventReader classes together provide the means to supply this stream of events from the delimited text file, whose path is stored in the string variable trainingDataFile.
The result is an instance of the GisModel class, the only class currently coded that implements the IMaximumEntropyModel interface. Once the model has been trained, it can be used to predict future outcomes. The SimpleExample application lets you choose a combination of predicates using dropdown lists. When the Take Umbrella? button is clicked, the chosen features are added to an array of strings and passed into the Evaluate method of the GisModel object:
private void btnOutcome_Click(object sender, System.EventArgs e)
{
    ArrayList context = new ArrayList();

    if (cboTemperature.Text != "Unknown")
    {
        context.Add(cboTemperature.Text);
    }
    if (cboPrecipitation.Text != "Unknown")
    {
        context.Add(cboPrecipitation.Text);
    }
    if (cboTiming.Text != "Unknown")
    {
        context.Add(cboTiming.Text);
    }

    double[] probabilities = mModel.Evaluate(
        (string[])context.ToArray(typeof(string)));
    lblOutcome.Text = probabilities[mUmbrellaOutcomeId].ToString("N5");
}
The Evaluate method returns an array of doubles containing the probabilities of all the possible outcomes. Together, these probabilities always add up to 1. In this simple example, there are only two possible outcomes, Umbrella and No_Umbrella, so it is easy enough to just display the probability of taking the umbrella. The IMaximumEntropyModel interface contains a number of other useful methods for making sense of the results of a prediction, the details of which can be found in the NDoc-generated library help file. There is also an overloaded version of the Evaluate method that takes an array of doubles as a parameter: in more complicated scenarios, where the Evaluate method may be called many hundreds or thousands of times, performance is improved by continually recycling an array of doubles that is already pre-sized to match the number of possible outcomes. That array of doubles can then be passed, if required, to the GetBestOutcome method, to obtain the name of the most likely outcome, or to the GetAllOutcomes method, which returns a formatted string listing all the possible outcomes together with their probabilities.
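As a minimal sketch of how these methods fit together (the OutcomeCount member used to pre-size the array is my assumption about the API's naming; the library help file lists the exact members):

// Pre-size the probability array once (OutcomeCount is an assumed
// member name; check the library help file for the exact API).
double[] probabilities = new double[mModel.OutcomeCount];

// Reuse the same array across many calls to Evaluate.
string[] context = new string[] { "Cold", "Rainy", "Early" };
mModel.Evaluate(context, probabilities);

// Name of the single most likely outcome, e.g. "Umbrella".
string best = mModel.GetBestOutcome(probabilities);

// Formatted list of all outcomes with their probabilities.
string allOutcomes = mModel.GetAllOutcomes(probabilities);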
Reading and Writing Models
In the simple example above, the model is built from the training data each time the application runs. This is acceptable with such a small training data set, but in a real-world application, the training data file is likely to be much, much larger, and it would be unrealistic to require the overhead of rebuilding a model from training data at application load time. Instead, the model would be built once, saved to persistent storage, and then read back from storage when needed. The SharpEntropy library provides three file formats in the SharpEntropy.IO namespace that may be used to store the model, and three sets of Reader and Writer classes to access them:
- Plain text format (PlainTextGisModelReader and PlainTextGisModelWriter). This format has the twin advantages that it is (more or less) human readable, so you can check the results of the data training, and that it is compatible with the plain text model files generated by the Java MaxEnt library. It produces large files, however, and would normally be used only during development, not in production.
- "Java compatible" binary format (JavaBinaryGisModelReader and JavaBinaryGisModelWriter). The file sizes are smaller than plain text, and the binary files are interoperable with the Java MaxEnt library. This interoperability comes at some performance cost, caused by the need to reverse the byte order of data items (the Java library preferring big-endian data formats) and by the fact that the MaxEnt format is less than optimal.
- Binary format (BinaryGisModelReader and BinaryGisModelWriter). This is the format you would normally choose. The files produced are slightly smaller than the Java compatible format, and they are faster to read in.
The code to write a model to one of these file formats, and to read it back, is simple:
string modelDataFile = @"C:\model.nbin";

SharpEntropy.IO.BinaryGisModelWriter writer =
    new SharpEntropy.IO.BinaryGisModelWriter();
writer.Persist(model, modelDataFile);

SharpEntropy.IO.BinaryGisModelReader reader =
    new SharpEntropy.IO.BinaryGisModelReader(modelDataFile);
SharpEntropy.IMaximumEntropyModel readModel = new SharpEntropy.GisModel(reader);
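The plain text format can be written in the same way. Assuming the plain text writer exposes the same Persist overload as the binary writer shown above (the file name here is invented for the example), this is handy during development when you want to inspect the trained model by eye:

SharpEntropy.IO.PlainTextGisModelWriter textWriter =
    new SharpEntropy.IO.PlainTextGisModelWriter();
textWriter.Persist(model, @"C:\model.txt");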
It is possible to implement new formats, and depending on the similarity of your chosen format to the existing ones, it may be useful to implement the IGisModelReader interface or to inherit from the GisModelReader and GisModelWriter classes. I have working implementations of Reader/Writer pairs that use SQL Server and Sqlite as backing stores. You can download the Sqlite implementation from the link at the top of the article. However, I am still looking for an optimally efficient read-only file format.
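For instance, a custom format might be sketched by inheriting from GisModelWriter. Note that this is purely illustrative: the overridden member names below (WriteString, WriteInt, WriteDouble) are my assumptions about the abstract methods the base class declares, so check the GisModelWriter source for the real signatures before writing anything like this:

// Purely illustrative sketch of a custom format writer. The abstract
// members overridden here (WriteString/WriteInt/WriteDouble) are
// assumptions; consult the GisModelWriter source for the real ones.
public class LineTextGisModelWriter : SharpEntropy.IO.GisModelWriter
{
    private System.IO.StreamWriter mWriter;

    public LineTextGisModelWriter(System.IO.StreamWriter writer)
    {
        mWriter = writer;
    }

    protected override void WriteString(string data)
    {
        mWriter.WriteLine(data);
    }

    protected override void WriteInt(int data)
    {
        mWriter.WriteLine(data);
    }

    protected override void WriteDouble(double data)
    {
        mWriter.WriteLine(data);
    }
}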
A More Complex Example - Tokenizing English Sentences
The Java MaxEnt library is used by another LGPL-licensed open source Java library, called OpenNLP, which provides a number of natural language processing tools based on maximum entropy models. A C# port of the OpenNLP library is available in my second article, Statistical parsing of English sentences. To provide a more complex SharpEntropy example, as well as to give a flavour of the OpenNLP code, here is a small application that tokenizes English sentences, based on a MaxEnt model created by the Java OpenNLP team.
Tokenizing occurs in the MaxentTokenizer.Tokenize method:
public string[] Tokenize(string input)
{
    ArrayList tokens = new ArrayList();
    string[] candidateTokens = input.Split(mWhitespaceChars);

    foreach (string candidateToken in candidateTokens)
    {
        if (candidateToken.Length < 2)
        {
            tokens.Add(candidateToken);
        }
        else if (mAlphaNumericRegex.IsMatch(candidateToken))
        {
            tokens.Add(candidateToken);
        }
        else
        {
            int startPos = 0;
            int endPos = candidateToken.Length;
            for (int currentPos = startPos + 1; currentPos < endPos; currentPos++)
            {
                double[] probabilities = mModel.Evaluate(
                    GenerateContext(candidateToken, currentPos));
                string bestOutcome = mModel.GetBestOutcome(probabilities);
                if (bestOutcome == "T")
                {
                    tokens.Add(candidateToken.Substring(startPos,
                        currentPos - startPos));
                    startPos = currentPos;
                }
            }
            tokens.Add(candidateToken.Substring(startPos, endPos - startPos));
        }
    }
    return (string[])tokens.ToArray(typeof(string));
}
It first splits the input at each whitespace character into candidate tokens. Each candidate token is then examined, and if it is less than two characters long, or consists only of alphanumeric characters, then it is accepted as a token. Otherwise, each character position in the candidate token is examined in turn to see if it should be split into more than one token at that position. This is done by building up a set of predicates related to the possible split position in the GenerateContext method. Various features, such as the characters before and after the split, and whether the characters on each side are letters or numbers, are used to generate this set of predicates. This set of predicates is then evaluated against the MaxEnt model. The model has two possible outcomes, "T" for a split and "F" for a non-split, so if the best outcome is "T", then the characters to the left of the split position are separated off into a new token.
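To make the feature-building step concrete, here is an illustrative sketch of what a method like GenerateContext might produce. This is not the actual OpenNLP implementation, and the predicate names are invented for the example; the real method uses its own, richer feature set:

// Illustrative sketch only: the real OpenNLP GenerateContext method
// uses its own predicate naming scheme and more features. This version
// shows the general idea of turning a candidate split position into a
// set of string predicates.
private string[] GenerateContext(string token, int position)
{
    ArrayList predicates = new ArrayList();
    char before = token[position - 1];
    char after = token[position];

    // The characters immediately to each side of the possible split.
    predicates.Add("before=" + before);
    predicates.Add("after=" + after);

    // Whether each neighbouring character is a letter or a digit.
    if (char.IsLetter(before)) predicates.Add("before_letter");
    if (char.IsDigit(before)) predicates.Add("before_digit");
    if (char.IsLetter(after)) predicates.Add("after_letter");
    if (char.IsDigit(after)) predicates.Add("after_digit");

    return (string[])predicates.ToArray(typeof(string));
}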
The sample application simply allows you to type in an input sentence, tokenizes it, and lists the resulting tokens, each on a separate line, in a read-only textbox.
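Calling the tokenizer from your own code then looks something like the following sketch (how the MaxentTokenizer instance is constructed depends on how the sample application loads its model file):

// Given an initialized MaxentTokenizer instance named tokenizer
// (construction details depend on how the model file is loaded):
string[] tokens = tokenizer.Tokenize("It wasn't Mr. Smith's umbrella, was it?");
foreach (string token in tokens)
{
    System.Console.WriteLine(token);
}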
A Sqlite Model Reader/Writer
SharpEntropy, like its Java original, holds all of the maximum entropy model data in memory at once, and complex models (such as those used by OpenNLP) may easily consume hundreds of MB. I have been looking into the possibility of keeping most of the model in permanent storage and only accessing it when necessary. An early result of this is the SharpEntropySqlite library, which stores model data in a Sqlite database, and only accesses it on demand when the GIS model is queried. The ADO.NET Sqlite provider used by this library is the public domain one created by Robert Simpson and available on SourceForge.
Notes On The Java Conversion
Lastly, I thought it may be interesting to provide some explanation of how the SharpEntropy library developed. I do not have a mathematical background, and my interest lies largely in the possibility of using maximum entropy for practical purposes, including statistical natural language processing.
Though I started off with a copy of the MaxEnt source code and Microsoft's Java Language Conversion Assistant (JLCA), the SharpEntropy code has subsequently been through many iterations to reach its current state. I have endeavoured to make the Java origins of the code as unnoticeable as possible, rather than leaving a library that resembles Java written in a foreign language. Some of the major phases of coding I went through (some revisited iteratively) were:
- Conversion with the JLCA, and fixing of errors in the generated code. The result of this phase was compilable C# code with a heavy Java flavour. The MaxEnt library uses the open source Trove collections library to handle many of the MaxEnt model data structures, so this phase also involved dutifully converting a large chunk of Trove into C# as well.
- Removal of Java-like constructs from the code and replacing them with C# versions, particularly where the C# version is known to be more efficient. Some Java-like remnants remain, but hopefully they are few in number.
- Changing a large number of identifiers in the code, ranging all the way from variables with unclear names to classes and files, particularly those with endings that clash with .NET framework conventions (EventStream, for example).
- Running FxCop and fixing as many of the issues raised as possible.
- Tuning for performance. I did a lot of work trying to reduce SharpEntropy's memory footprint. My optimizations have now been back-ported into the Java MaxEnt library.
References
History
- Third version (3rd May 2006). Added .NET 2.0 versions of the code, plus Sqlite Model Reader/Writer.
- Second version (16th November 2005). Added links to OpenNLP article; minor edits to the text.
- Initial version (24th July 2005).