Context-sensitive spelling correction is the task of fixing spelling errors that result in valid words, such as I’d like to eat desert, where dessert was typed when desert was intended.
These errors will go undetected by conventional spell checkers, which only flag words that are not found in a word list.
Context-sensitive spelling correction involves learning to characterize the linguistic contexts in which different words, such as dessert and desert, tend to occur. The problem is that there is a multitude of features one might use to characterize these contexts: features that test for the presence of a particular word near the target word; features that test the pattern of parts of speech around the target word; and so on. In general, the number of features will range from a few hundred to over ten thousand.
This raises another problem: how do we determine which features to actually consider? How do we separate the wheat from the chaff?
Note that I'm not talking about feature selection here: even after filtering out noisy features, you will still end up with a large feature space, while the target concept (in this case, a context in which desert can occur) depends only on a small subset of it.
For this problem, we will employ the Winnow Algorithm.
This article follows the work of Andrew R. Golding and Dan Roth, A Winnow-Based Approach to Context-Sensitive Spelling Correction. (License)
Fork or clone the source from Bitbucket.
Put simply, Winnow is a machine learning technique for learning a disjunction (i.e., an ORing function) from labelled examples.
Concretely, this means that given m labelled samples, each with a vector of n Boolean features x = (x1, . . . , xn), Winnow learns which features to OR together to produce the correct label.
Take email spam filtering as an example: at the most basic level, we take the bodies of many emails and tokenize them into arrays of words; these words form our feature vector.
Each feature (word) is set to 1 if it occurs in the example and 0 if it doesn't.
Now you will end up with this:
| x1 ($$$) | x2 (100% free) | x3 (viagra) | x4 (the) | x5 (weight loss) | ... | xn | spam? f(x) |
|----------|----------------|-------------|----------|-----------------|-----|----|------------|
| 0        | 1              | 0           | 1        | 0               | ... |    | ?          |
Now you see the problem: the feature set can be very large, with many features set to true (the, for example, occurs in almost every email), yet the correct labelling depends only on a small subset of the features.
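The encoding above can be sketched in a few lines; here is a quick Python illustration (the vocabulary is made up for the example):

```python
# Encoding an email body as a Boolean feature vector.
# The vocabulary below is invented purely for illustration.
vocabulary = ["$$$", "100% free", "viagra", "the", "weight loss"]

def to_feature_vector(text, vocabulary):
    lowered = text.lower()
    # 1 if the term occurs in the email body, 0 otherwise
    return [1 if term in lowered else 0 for term in vocabulary]

print(to_feature_vector("Get 100% free samples of the product", vocabulary))
# [0, 1, 0, 1, 0]
```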
- Initialize the weights w1 = w2 = . . . = wn = 1 on the n variables (the feature vector).
- Given an example x = (x1, . . . , xn), output 1 if Σᵢ wᵢxᵢ ≥ θ, else output 0, where θ is a threshold value we set at our discretion.
- If the algorithm makes a mistake:
  - If it predicts 0 when f(x) = 1, then for each xi equal to 1, increase the value of wi (the value was too low; promote the contributing features).
  - If it predicts 1 when f(x) = 0, then for each xi equal to 1, decrease the value of wi (the value was too high; demote the contributing features).
Winnow gives us a mistake bound of O(r log n), where r is the number of relevant features and n is the total number of features.
Set θ = 6 (the feature count). All weights start at 1; on a mistake, wᵢ is doubled (promotion) or halved (demotion) for every active xᵢ.

| x1 | x2 | x3 | x4 | x5 | x6 | Σᵢ wᵢxᵢ | prediction of f(x) (Σ ≥ 6?) | weights after update |
|----|----|----|----|----|----|---------|------------------------------|----------------------|
| 1  | 0  | 0  | 0  | 0  | 0  | 1       | 0                            | w = (1, 1, 1, 1, 1, 1) |
| 0  | 1  | 0  | 1  | 1  | 1  | 4       | 0 (mistake: double wᵢ for xᵢ = 1) | w = (1, 2, 1, 2, 2, 2) |
| 0  | 0  | 0  | 0  | 0  | 1  | 2       | 0                            | w = (1, 2, 1, 2, 2, 2) |
| 0  | 0  | 0  | 1  | 1  | 1  | 6       | 1 (mistake: halve wᵢ for xᵢ = 1) | w = (1, 2, 1, 1, 1, 1) |
| 0  | 0  | 0  | 0  | 1  | 1  | 2       | 0                            | w = (1, 2, 1, 1, 1, 1) |
| ...| ...| ...| ...| ...| ...| ...     | ...                          | ...                  |

(Mistakes occur on examples 2 and 4; the target is f(x) = x2 ∨ x3, n = 6, r = 2.)
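The trace above can be replayed in a few lines. Here is a minimal Python sketch of Winnow, using θ = 6, doubling on promotion, and halving on demotion, as in the example:

```python
# Minimal Winnow sketch replaying the worked example above.
# Assumptions: threshold 6, promotion factor 2, demotion factor 0.5.
class Winnow:
    def __init__(self, n, threshold=6.0, promotion=2.0, demotion=0.5):
        self.w = [1.0] * n
        self.threshold = threshold
        self.promotion = promotion
        self.demotion = demotion
        self.mistakes = 0

    def predict(self, x):
        # Output 1 iff the weighted sum of active features reaches the threshold.
        return sum(w for w, xi in zip(self.w, x) if xi) >= self.threshold

    def train(self, x, label):
        prediction = self.predict(x)
        if prediction != label:
            self.mistakes += 1
            # Promote on a false negative, demote on a false positive.
            factor = self.promotion if label else self.demotion
            self.w = [w * factor if xi else w for w, xi in zip(self.w, x)]

target = lambda x: bool(x[1] or x[2])  # f(x) = x2 OR x3 (0-indexed)
learner = Winnow(6)
examples = [
    (1, 0, 0, 0, 0, 0),
    (0, 1, 0, 1, 1, 1),
    (0, 0, 0, 0, 0, 1),
    (0, 0, 0, 1, 1, 1),
    (0, 0, 0, 0, 1, 1),
]
for x in examples:
    learner.train(x, target(x))

print(learner.mistakes)  # 2
print(learner.w)         # [1.0, 2.0, 1.0, 1.0, 1.0, 1.0]
```

The final weights match the last row of the table: only w2 was promoted and kept its boost.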
public class Sample
{
public bool[] Features { get; set; }
public bool Class { get; set; }
public Sample ToggleClass()
{
var invertedSample = (Sample)this.MemberwiseClone();
invertedSample.Class = !invertedSample.Class;
return invertedSample;
}
public override string ToString()
{
StringBuilder sb = new StringBuilder("(" + Class + ")[");
for (int i = 0; i < Features.Length; i++)
{
sb.Append(Convert.ToInt16(Features[i]));
if (i < Features.Length - 1)
{
sb.Append(',');
}
}
sb.Append("]");
return sb.ToString();
}
}
public class Winnow : ISupervisedLearning
{
#region Member Variables
private readonly double[] _weights;
private readonly double _threshold;
private readonly double _promotion;
private readonly double _demotion;
#endregion
public int Mistakes { get; private set; }
#region Constructor
public Winnow(int featuresCount, double threshold, double promotion, double demotion, double initialWeight)
{
this._threshold = threshold;
this._promotion = promotion;
this._demotion = demotion;
this._weights = new double[featuresCount];
for (int i = 0; i < _weights.Length; i++)
{
_weights[i] = initialWeight;
}
}
#endregion
#region ISupervisedLearning
public void Train(IEnumerable<Sample> samples)
{
Train(samples.ToArray());
}
public void Train(params Sample[] samples)
{
lock (this)
{
foreach (var s in samples)
{
bool prediction = Predict(s);
if (prediction != s.Class)
{
Mistakes++;
if (!prediction && s.Class)
{
AdjustWeights(s, _promotion);
}
else
{
AdjustWeights(s, _demotion);
}
}
}
}
}
public bool Predict(Sample sample)
{
double sum = GetScore(sample);
return sum >= _threshold;
}
public double GetScore(Sample sample)
{
double sum = 0;
for (int i = 0; i < _weights.Length; i++)
{
if (sample.Features[i])
{
sum += _weights[i];
}
}
return sum;
}
#endregion
#region Public Methods
public override string ToString()
{
var sb = new StringBuilder();
foreach (var item in _weights)
{
sb.Append(item);
sb.Append(",");
}
return sb.ToString();
}
#endregion
#region Private Methods
private void AdjustWeights(Sample s, double adjustment)
{
for (int i = 0; i < _weights.Length; i++)
{
if (s.Features[i])
{
_weights[i] *= adjustment;
}
}
}
#endregion
}
Right, so now that we have an idea of how Winnow works, let's get back to the problem at hand: context-sensitive spelling correction as a word disambiguation task.
We will need to feed Winnow two things: a vector of Boolean features and a label; together these comprise a single training example.
The label here would be one of two or more confused words; together these words form a confusion set.
A confusion set C = {W1, ..., Wn} means that each word Wi in the set is ambiguous with every other word in the set.
Concretely, if C = {hear, here}, then the output label will be either hear or here.
Obviously we have many confusion sets, each with its own distinct feature vector:
C1 = {hear, here}
C2 = {by, buy}
C3 = {to, two, too}
and so on. In the code I used some sets from Words Commonly Confused.
So much for the labels; what about the other component, the features?
Given a sentence, we want to extract a particular linguistic pattern in the context of the target word.
We'll use two types of features: context words and collocations.
Context-word features test for the presence of a particular word within ±k words of the target word.
Collocation features test for a pattern of up to ℓ contiguous words and/or part-of-speech tags around the target word. In the code, k was set to 10 and ℓ to 2.
Examples of useful features for the confusion set {weather, whether} include:
(1) cloudy within ±10 words
(2) __ to VERB
Feature (1) is a context-word feature that tends to imply weather. Feature (2) is a collocation that checks for the pattern “to VERB” immediately after the target word, and tends to imply whether (as in I don’t know whether to laugh or cry).
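To make the two feature types concrete, here is a simplified Python sketch. The tokenized words and POS tags below are hand-made assumptions (the article's code uses a real POS tagger), and `_` marks the target position in a collocation pattern:

```python
# Simplified extractors for the two feature types:
# context words within +/-k positions, and collocation patterns
# of up to l contiguous POS tags before/after the target.
def context_word_features(words, target, k=10):
    lo, hi = max(0, target - k), min(len(words), target + k + 1)
    return {w.lower() for i, w in enumerate(words) if lo <= i < hi and i != target}

def collocation_features(tags, target, l=2):
    feats = set()
    for c in range(1, l + 1):
        before = tags[max(0, target - c):target]   # up to c tags before the target
        after = tags[target + 1:target + 1 + c]    # up to c tags after the target
        if before:
            feats.add(" ".join(before) + " _")
        if after:
            feats.add("_ " + " ".join(after))
    return feats

# Hand-tagged toy sentence (tags are assumptions, not real tagger output).
words = ["I", "don't", "know", "whether", "to", "laugh", "or", "cry"]
tags  = ["PRP", "VBP", "VB", "IN", "TO", "VB", "CC", "VB"]
target = words.index("whether")

print(context_word_features(words, target))
print(collocation_features(tags, target))  # includes "_ TO VB", the "to VERB" pattern
```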
High Level diagram
The following operations take place for a single confusion set.
Breakdown
The corpus is first split into complete sentences; each sentence is tokenized into a list of words and POS tags. Richard Northedge's POS tagger was used.
public class Sentence
{
public string[] Words { get; set; }
public string[] POSTags { get; set; }
}
private IEnumerable<Sentence> PreProcessCorpora(IEnumerable<string> corpora)
{
var sentences = new ConcurrentBag<Sentence>();
Parallel.ForEach(corpora, phrase =>
{
string[] tokens = SplitIntoWords(phrase);
string[] posTags;
lock (_lock)
{
posTags = _posTagger.Tag(tokens);
}
sentences.Add(new Sentence
{
Words = tokens,
POSTags = posTags
});
});
return sentences;
}
Combing through the training corpus (i.e., for each sentence), each time a word in the confusion set occurs, we extract all possible features for the context; namely, one context-word feature for every distinct word within ±k words...
public class ContextFeaturesExtractor : AbstractExtractor
{
public ContextFeaturesExtractor(int k)
: base(k)
{
}
public override HashSet<string> Extract(string[] tokens, int targetPosition)
{
var features = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
int backward = targetPosition - 1;
int forward = targetPosition + 1;
for (int counter = 0; counter < n; counter++)
{
if (backward <= -1 && forward >= tokens.Length)
{
break;
}
if (backward > -1)
{
AddFeature(features, tokens[backward]);
backward--;
}
if (forward < tokens.Length)
{
AddFeature(features, tokens[forward]);
forward++;
}
}
return features;
}
}
...and two collocation features (one for the pattern occurring before the word, one for the pattern after) for every way of expressing a pattern of up to ℓ contiguous elements.
public class CollocationFeaturesExtractor : AbstractExtractor
{
public CollocationFeaturesExtractor(int l)
: base(l)
{
}
public override HashSet<string> Extract(string[] posTags, int targetPosition)
{
var features = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
for (int c = 0; c < n; c++)
{
var preceding = string.Empty;
int backward = targetPosition - (c + 1);
while (backward < targetPosition && backward > -1)
{
preceding += posTags[backward] + " ";
backward++;
}
if (!string.IsNullOrEmpty(preceding))
{
preceding += "_";
AddFeature(features, preceding);
}
string succeeding = "_ ";
int forward = targetPosition + 1;
for (int j = 0; j <= c && forward < posTags.Length; j++, forward = targetPosition + j + 1)
{
succeeding += posTags[forward] + " ";
}
succeeding = succeeding.TrimEnd();
if (succeeding != "_")
{
AddFeature(features, succeeding.TrimEnd());
}
}
return features;
}
}
Statistics of occurrence of the features are collected in the process as well; namely, for each feature F we collect for target word W, we record:

- N11: the number of documents (sentences) containing both F and W.
- N10: the number of documents containing F only.
- N01: the number of documents containing W only.
- N00: the number of documents containing neither W nor F.
- N: the total number of documents.
After these steps, an exhaustive list of all features found in the training corpus is generated. At this point, pruning criteria may be applied to eliminate unreliable or uninformative features. Two criteria are used (both making use of the aforementioned statistics of occurrence):
(1) How much information the presence or absence of a feature F contributes to making the correct classification decision on W (mutual information).
(2) Whether the presence of the feature is significantly correlated with the identity of the target word (determined by a chi-square test at the 0.05 significance level); features that fail the test are pruned.
public class MutualInformation : IFeatureSelector
{
#region Member Variables
private readonly int k;
#endregion
#region Constructor
public MutualInformation(int k)
{
this.k = k;
}
#endregion
#region IFeatureSelector
public IDictionary<string, Stats> Select(IDictionary<string, Stats> terms)
{
var features = new Dictionary<string, double>(terms.Count, StringComparer.OrdinalIgnoreCase);
foreach (var term in terms)
{
var value = Calculate(n: term.Value.N,
n00: term.Value.N00,
n01: term.Value.N01,
n10: term.Value.N10,
n11: term.Value.N11);
features.Add(term.Key, value);
}
var orderedFeatures = features
.OrderByDescending(s => s.Value)
.Select(s => s.Key);
var prunedFeatures = orderedFeatures
.Take(k)
.ToDictionary(key => key, key => terms[key]);
return prunedFeatures;
}
#endregion
#region Private Methods
private double Calculate(int n, double n00, double n01, double n10, double n11)
{
double n1_ = n10 + n11;
double n_1 = n01 + n11;
double n0_ = n00 + n01;
double n_0 = n10 + n00;
double sum = 0;
sum += (n11 / n) * Math.Log((n * n11) / (n1_ * n_1), 2);
sum += (n01 / n) * Math.Log((n * n01) / (n0_ * n_1), 2);
sum += (n10 / n) * Math.Log((n * n10) / (n1_ * n_0), 2);
sum += (n00 / n) * Math.Log((n * n00) / (n0_ * n_0), 2);
return sum;
}
#endregion
}
public class ChiSquaredTest : IFeatureSelector
{
#region Member Variables
private readonly double _significanceLevel;
private readonly Dictionary<double, double> _distributionTable;
#endregion
#region Constructor
public ChiSquaredTest(double significanceLevel)
{
this._significanceLevel = significanceLevel;
_distributionTable = new Dictionary<double, double> {
{0.1,2.71},
{0.05,3.84},
{0.01,6.63},
{0.005,7.88},
{0.001,10.83}
};
}
#endregion
#region IFeatureSelector
public IDictionary<string, Stats> Select(IDictionary<string, Stats> features)
{
var prunedFeatures = new Dictionary<string, Stats>(StringComparer.OrdinalIgnoreCase);
foreach (var feature in features)
{
var featureValue = Calculate(feature.Value.N, feature.Value.N00, feature.Value.N01, feature.Value.N10, feature.Value.N11);
if (featureValue > _distributionTable[_significanceLevel])
{
prunedFeatures.Add(feature.Key, feature.Value);
}
}
return prunedFeatures;
}
#endregion
#region Private Methods
private double Calculate(int n, double n00, double n01, double n10, double n11)
{
var numerator = n * Math.Pow(((n11 * n00) - (n10 * n01)), 2);
var denominator = (n11 + n01) *
(n11 + n10) *
(n10 + n00) *
(n01 + n00);
var value = numerator / denominator;
return value;
}
#endregion
}
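As a sanity check of the two selection formulas, here is a small Python sketch that evaluates them on a made-up contingency table (the counts are purely illustrative, not taken from the corpus):

```python
# Numeric check of the mutual information and chi-square formulas used
# for feature pruning. The contingency counts below are invented.
import math

def mutual_information(n, n00, n01, n10, n11):
    n1_, n_1 = n10 + n11, n01 + n11   # marginals: feature present / word present
    n0_, n_0 = n00 + n01, n10 + n00
    total = 0.0
    for nij, row, col in [(n11, n1_, n_1), (n01, n0_, n_1),
                          (n10, n1_, n_0), (n00, n0_, n_0)]:
        if nij > 0:  # guard against log(0); the C# version would yield NaN here
            total += (nij / n) * math.log2((n * nij) / (row * col))
    return total

def chi_squared(n, n00, n01, n10, n11):
    numerator = n * (n11 * n00 - n10 * n01) ** 2
    denominator = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    return numerator / denominator

# e.g. feature "cloudy" vs target word "weather" across 100 sentences
n, n11, n10, n01, n00 = 100, 8, 2, 10, 80
chi = chi_squared(n, n00, n01, n10, n11)
mi = mutual_information(n, n00, n01, n10, n11)
print(round(chi, 2))  # well above the 3.84 critical value at the 0.05 level
print(round(mi, 4))
```

With these counts the feature passes the chi-square test, so it would survive pruning.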
A training example consists of a sentence, represented as a set of active features, together with the word Wc in the confusion set that is correct for that sentence.
private class RoughSample
{
public HashSet<string> Features { get; set; }
public string Word { get; set; }
}
Now that we have a list of proper training data, we train a classifier for each word in our confusion set.
var cloudClassifiers = new Dictionary<string, ISupervisedLearning[]>(confusionSet.Count());
foreach (var word in confusionSet)
{
var classifiers = new ISupervisedLearning[1];
int i = 0;
for (double beta = 0.5; beta < 1 && i < classifiers.Length; beta += 0.1, i++)
{
classifiers[i] = new Winnow.Winnow(features.Length, threshold: 1, promotion: 1.5, demotion: beta, initialWeight: 1);
}
cloudClassifiers[word] = classifiers;
}
Training proceeds in an on-line manner; each example is treated as a positive example for the classifiers for Wc, and as a negative example for the classifiers for the other words in the confusion set (one-vs-all).
Parallel.ForEach(trainingData, sample =>
{
var positive = new Sample
{
Class = true,
Features = CreateFeaturesVector(sample.Features, features)
};
Sample negative = positive.ToggleClass();
foreach (var cloud in cloudClassifiers)
{
var example = cloud.Key == sample.Word ? positive : negative;
foreach (var classifier in cloud.Value)
{
classifier.Train(example);
}
}
});
When a sentence is presented, the system checks the classifiers of each confusion set: if the sentence contains a word from a confusion set, features are extracted and run against the corresponding classifiers, and the prediction with the highest score is returned.
public Dictionary<int, string> Predict(string phrase)
{
string[] tokens = SplitIntoWords(phrase);
var correctWords = new Dictionary<int, string>();
foreach (var comparator in _comparators)
{
foreach (var confusedWord in comparator.ConfusionSet)
{
for (int i = 0; i < tokens.Length; i++)
{
if (!tokens[i].Equals(confusedWord, StringComparison.OrdinalIgnoreCase))
{
continue;
}
string[] posTags;
lock (_lock)
{
posTags = _posTagger.Tag(tokens);
}
var features = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
features.UnionWith(_contextFeaturesExtractor.Extract(tokens, i));
features.UnionWith(_collocationtFeaturesExtractor.Extract(posTags, i));
var sample = new Sample
{
Features = CreateFeaturesVector(features, comparator.Features)
};
var predictions = new Dictionary<string, double>(comparator.Cloud.Count);
foreach (var classifier in comparator.Cloud.Keys)
{
predictions[classifier] = comparator
.Cloud[classifier]
.Sum(c => c.GetScore(sample));
}
string correctedWord = predictions.Aggregate((a, b) => a.Value > b.Value ? a : b).Key;
if (!tokens[i].Equals(correctedWord, StringComparison.OrdinalIgnoreCase))
{
correctWords.Add(i, correctedWord);
}
}
}
}
return correctWords;
}
In the experiments that follow, the training and test sets were drawn from the written part of the Open American National Corpus (OANC), about 11 million words, split at random 80/20% by sentence, with 28 confusion sets.
Results

| Confusion Set | Accuracy % |
|---|---|
| where-were | 89 |
| to-too-two | 95 |
| hour-our | 95.4 |
| by-buy-bye | 98.7 |
| their-there | 90 |
| knew-new | 95.98 |
| later-latter | 77.9 |
| weather-whether | 95 |
| loose-lose | 87 |
| hear-here | 89 |
| threw-through | 98.8 |
| brake-break | 95.83 |
| peace-piece | 86.7 |
| cite-site-sight | 88.79 |
| passed-past | 83 |
| sea-see | 92 |
| quiet-quit-quite | 79 |
| weak-week | 88 |
| coarse-course | 96.4 |
| fourth-forth | 92.9 |
| council-counsel | 89 |
| principal-principle | 70 |
| vain-vane-vein | 79 |
| rain-reign-rein | 80 |
| desert-dessert | 89.79 |
| complement-compliment | 92 |
| plaine-plane | 100 |
| waist-waste | 87 |
Overall accuracy is 94%.
Feature extraction, training, and testing together took 59 minutes on average on my i7 (3.4 GHz, 8 GB RAM) machine.
The above results were achieved with pruning disabled; when enabled, accuracy went down and running time increased.