
Context Sensitive Spelling Correction using Winnow

Implementation of a sophisticated spell checker that makes use of a language model to consider the context in which a word occurs



Introduction

Context-sensitive spelling correction is the task of fixing spelling errors that result in valid words, such as I’d like to eat desert, where desert was typed when dessert was intended.
These errors will go undetected by conventional spell checkers, which only flag words that are not found in a word list.
Context-sensitive spelling correction involves learning to characterize the linguistic contexts in which different words, such as dessert and desert, tend to occur. The problem is that there is a multitude of features one might use to characterize these contexts: features that test for the presence of a particular word near the target word; features that test the pattern of parts of speech around the target word; and so on. In general, the number of features will range from a few hundred to over ten thousand.

Now comes another problem: how do we determine which features to actually consider? How do we separate the wheat from the chaff?

Note that I'm not talking about feature selection, because even after filtering out noisy features, you will still end up with a large feature space, while the target concept (in this case, a context in which desert can occur) depends only on a small subset.

For this problem, we will employ the Winnow Algorithm.

This article follows the work of Andrew R. Golding and Dan Roth, "A Winnow-Based Approach to Context-Sensitive Spelling Correction".

Download

Fork or clone from bitbucket.

Winnow in a nutshell

Put simply, Winnow is a machine learning algorithm for learning a disjunction (i.e., an ORing function) from labelled examples.
Concretely, this means that given m labelled samples, each with a vector of n Boolean features x = (x1, . . . , xn), Winnow learns which features to OR together to produce the correct label.

Take the example of email spam filtering. At the most basic level, we take the bodies of many emails and tokenize them into arrays of words; these words form our feature vector.
Each feature (word) is set to 1 if it occurs in the example and 0 if it doesn't.

Now you will end up with this:

          x1     x2      x3     x4       x5     ...   xn             spam? (yes/no)
word      $$$    100%    free   viagra   the    ...   weight loss    f(x)
value     0      1       0      1        0      ...                  ?

Now you see the problem: the feature set can be very large, with many features set to true (the word "the", for example, occurs in almost every email), yet the correct labelling depends only on a small subset of r relevant features.

How Winnow works

  • Initialize the weights w1 = w2 = . . . = wn = 1 on the n variables (features vector).
  • Given an example x = (x1, . . . , xn),
    output 1 if Σi wi·xi ≥ θ, else output 0.
    θ is a threshold value we set at our discretion.
  • If the algorithm makes a mistake:
    • If it predicts 0 when f(x) = 1, then for each xi equal to 1, increase the value of wi (the total was too low; promote the contributing features).
    • If it predicts 1 when f(x) = 0, then for each xi equal to 1, decrease the value of wi (the total was too high; demote the contributing features).

Winnow gives us a mistake bound of O(r log n), where:
r: number of relevant features
n: total number of features
 

Winnow in Action

Set θ = 6 (the number of features) and initialize w1 = ... = w6 = 1.
The target concept is f(x) = x2 ∨ x3, so n = 6 and r = 2.

example x      Σi xi·wi ≥ 6?   prediction   f(x)   action                           weights afterwards
1 0 0 0 0 0    1 ≥ 6? no       0            0      correct, no change               (1, 1, 1, 1, 1, 1)
0 1 0 1 1 1    4 ≥ 6? no       0            1      mistake: double wi for xi = 1    (1, 2, 1, 2, 2, 2)
0 0 0 0 0 1    2 ≥ 6? no       0            0      correct, no change               (1, 2, 1, 2, 2, 2)
0 0 0 1 1 1    6 ≥ 6? yes      1            0      mistake: halve wi for xi = 1     (1, 2, 1, 1, 1, 1)
0 0 0 0 1 1    2 ≥ 6? no       0            0      correct, no change               (1, 2, 1, 1, 1, 1)
...

(Mistakes occur on the second and fourth examples; the target is f(x) = x2 ∨ x3, n = 6, r = 2.)

C#
/// <summary>
/// A single labelled example: a vector of Boolean features plus its class label
/// </summary>
public class Sample
    {
        public bool[] Features { get; set; }
        public bool Class { get; set; }

        public Sample ToggleClass()
        {
            var invertedSample = (Sample)this.MemberwiseClone();
            invertedSample.Class = !invertedSample.Class;
            return invertedSample;
        }

        public override string ToString()
        {
            StringBuilder sb = new StringBuilder("(" + Class + ")[");

            for (int i = 0; i < Features.Length; i++)
            {
                sb.Append(Convert.ToInt16(Features[i]));

                if (i < Features.Length - 1)
                {
                    sb.Append(',');
                }
            }

            sb.Append("]");

            return sb.ToString();
        }
    }
C#
/// <summary>
/// Algorithm for learning monotone disjunction function from labeled examples
/// </summary>
public class Winnow : ISupervisedLearning
{
    #region Member Variables

    private readonly double[] _weights;
    private readonly double _threshold;
    private readonly double _promotion;
    private readonly double _demotion;

    #endregion

    public int Mistakes { get; private set; }

    #region Constructor

    public Winnow(int featuresCount, double threshold, double promotion, double demotion, double initialWeight)
    {
        this._threshold = threshold;
        this._promotion = promotion;
        this._demotion = demotion;
        this._weights = new double[featuresCount];

        for (int i = 0; i < _weights.Length; i++)
        {
            _weights[i] = initialWeight;
        }
    }

    #endregion

    #region ISupervisedLearning

    public void Train(IEnumerable<Sample> samples)
    {
        Train(samples.ToArray());
    }

    public void Train(params Sample[] samples)
    {
        lock (this)
        {
            foreach (var s in samples)
            {
                bool prediction = Predict(s);

                if (prediction != s.Class)//prediction was wrong
                {
                    Mistakes++;

                    if (!prediction && s.Class)
                    {
                        AdjustWeights(s, _promotion);
                    }
                    else
                    {
                        AdjustWeights(s, _demotion);
                    }
                }
            }
        }
    }

    public bool Predict(Sample sample)
    {
        double sum = GetScore(sample);

        return sum >= _threshold;
    }

    public double GetScore(Sample sample)
    {
        double sum = 0;

        for (int i = 0; i < _weights.Length; i++)
        {
            if (sample.Features[i])
            {
                sum += _weights[i];
            }
        }

        return sum;
    }

    #endregion

    #region Public Methods

    public override string ToString()
    {
        var sb = new StringBuilder();

        foreach (var item in _weights)
        {
            sb.Append(item);
            sb.Append(",");
        }

        return sb.ToString();
    }

    #endregion

    #region Private Methods

    private void AdjustWeights(Sample s, double adjustment)
    {
        for (int i = 0; i < _weights.Length; i++)
        {
            if (s.Features[i])
            {
                _weights[i] *= adjustment;
            }
        }
    }

    #endregion

}
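To see the class in action, here is a minimal sketch that trains the Winnow class above on the toy target f(x) = x2 ∨ x3 from the trace table; the feature vectors are illustrative, not taken from the article's corpus.

C#
// Toy run of the Winnow class above: learn f(x) = x2 OR x3 with θ = 6,
// promotion = 2 and demotion = 0.5 (the same trace as the table above).
var winnow = new Winnow(featuresCount: 6, threshold: 6, promotion: 2, demotion: 0.5, initialWeight: 1);

var samples = new[]
{
    new Sample { Features = new[] { true,  false, false, false, false, false }, Class = false },
    new Sample { Features = new[] { false, true,  false, true,  true,  true  }, Class = true  },
    new Sample { Features = new[] { false, false, false, false, false, true  }, Class = false },
    new Sample { Features = new[] { false, false, false, true,  true,  true  }, Class = false },
    new Sample { Features = new[] { false, false, false, false, true,  true  }, Class = false }
};

winnow.Train(samples);

Console.WriteLine(winnow);          // weights after training: 1,2,1,1,1,1,
Console.WriteLine(winnow.Mistakes); // 2 (the second and fourth examples)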

Problem formulation

Right, so now that we have an idea how Winnow works, let's get back to the problem at hand: context-sensitive spelling correction as a word-disambiguation task.
We will need to feed Winnow two things: a vector of Boolean features and a label; together these comprise a single training example.
The label here would be one of two or more confused words; together, these words form a confusion set.
A confusion set C = {W1, ..., Wn} means that each word Wi in the set is ambiguous with each other word.
Concretely, if C = {hear, here}, then the output label will be either hear or here.
Obviously, we have many confusion sets, each with its own distinct feature vector:
C1 = {hear, here}
C2 = {by, buy}
C3 = {to, two, too}
and so on... In the code, I used some sets from Words Commonly Confused.
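In code, a confusion set can be represented as nothing more than a list of words; a minimal sketch (the variable name is illustrative):

C#
// Illustrative only: confusion sets as plain word lists.
var confusionSets = new List<string[]>
{
    new[] { "hear", "here" },
    new[] { "by", "buy" },
    new[] { "to", "two", "too" }
};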

So much for the labels; how about the other component, the features?
Given a sentence, we want to extract a particular linguistic pattern in the context of the target word.

We'll use two types of features: context words and collocations.
Context-word features test for the presence of a particular word within ±k words of the target word.
Collocation features test for a pattern of up to ℓ contiguous words and/or part-of-speech tags around the target word. In the code, k was set to 10 and ℓ to 2.

Examples of useful features for the confusion set {weather, whether} include:

(1) cloudy within ±10 words
(2) __ to VERB

Feature (1) is a context-word feature that tends to imply weather. Feature (2) is a collocation feature that checks for the pattern "to VERB" immediately after the target word, and tends to imply whether (as in I don’t know whether to laugh or cry).

High Level diagram

The following operation takes place for a single confusion set

[High-level diagram of the pipeline for a single confusion set]

Breakdown

Corpus Preprocessing

The corpus is first split into complete sentences; each sentence is then tokenized into a list of words and POS tags. Richard Northedge's POS tagger was used.

C#
public class Sentence
    {
        public string[] Words { get; set; }
        public string[] POSTags { get; set; }
    }
C#
private IEnumerable<Sentence> PreProcessCorpora(IEnumerable<string> corpora)
        {
            var sentences = new ConcurrentBag<Sentence>();

            Parallel.ForEach(corpora, phrase =>
            {
                string[] tokens = SplitIntoWords(phrase);
                string[] posTags;

                lock (_lock)
                {
                    posTags = _posTagger.Tag(tokens);
                }

                sentences.Add(new Sentence
                {
                    Words = tokens,
                    POSTags = posTags
                });
            });

            return sentences;
        }

Getting context features

Combing through the training corpus (i.e., for each sentence), each time a word in the confusion set occurs, we extract all possible features for the context—namely, one context-word feature for every distinct word within ±k words of the target.

C#
public class ContextFeaturesExtractor : AbstractExtractor
    {
        public ContextFeaturesExtractor(int k)
            : base(k)
        {

        }

        public override HashSet<string> Extract(string[] tokens, int targetPosition)
        {
            var features = new HashSet<string>(StringComparer.OrdinalIgnoreCase);

            int backward = targetPosition - 1;
            int forward = targetPosition + 1;

            for (int counter = 0; counter < n; counter++)
            {
                if (backward <= -1 && forward >= tokens.Length)
                {
                    break;
                }

                if (backward > -1)
                {
                    AddFeature(features, tokens[backward]);
                    backward--;
                }

                if (forward < tokens.Length)
                {
                    AddFeature(features, tokens[forward]);
                    forward++;
                }
            }

            return features;
        }
    }
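As a rough illustration (assuming the AbstractExtractor base class stores k in the protected field n and AddFeature simply adds the token to the set), extracting context-word features for the target word whether might look like this:

C#
// Illustrative only: context-word features are the distinct words within ±k tokens of the target.
string[] tokens = { "I", "do", "not", "know", "whether", "to", "laugh", "or", "cry" };
int targetPosition = 4; // "whether"

var extractor = new ContextFeaturesExtractor(k: 10);
HashSet<string> contextFeatures = extractor.Extract(tokens, targetPosition);

// contextFeatures now contains: I, do, not, know, to, laugh, or, cry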

Getting collocation features

...and two collocation features (one for the pattern occurring before and one after the target word) for every way of expressing a pattern of up to ℓ contiguous elements.

C#
public class CollocationFeaturesExtractor : AbstractExtractor
    {
        public CollocationFeaturesExtractor(int l)
            : base(l)
        {

        }

        public override HashSet<string> Extract(string[] posTags, int targetPosition)
        {
            var features = new HashSet<string>(StringComparer.OrdinalIgnoreCase);

            for (int c = 0; c < n; c++)
            {
                var preceding = string.Empty;
                int backward = targetPosition - (c + 1);

                while (backward < targetPosition && backward > -1)
                {
                    preceding += posTags[backward] + " ";
                    backward++;
                }
                if (!string.IsNullOrEmpty(preceding))
                {
                    preceding += "_";
                    AddFeature(features, preceding);
                }

                string succeeding = "_ ";

                int forward = targetPosition + 1;
                for (int j = 0; j <= c && forward < posTags.Length; j++, forward = targetPosition + j + 1)
                {
                    succeeding += posTags[forward] + " ";
                }
                succeeding = succeeding.TrimEnd();
                if (succeeding != "_")
                {
                    AddFeature(features, succeeding.TrimEnd());
                }
            }

            return features;
        }
    }
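Similarly, a hypothetical call over the POS tags of the same sentence (with ℓ = 2; the tags themselves are made up for illustration) produces collocation patterns on either side of the target, where "_" marks the target position:

C#
// Illustrative only: collocation features are POS-tag patterns of up to l elements
// immediately before and after the target word ("_" marks the target).
string[] posTags = { "PRP", "VBP", "RB", "VB", "IN", "TO", "VB", "CC", "VB" };
int targetPosition = 4;

var extractor = new CollocationFeaturesExtractor(l: 2);
HashSet<string> collocationFeatures = extractor.Extract(posTags, targetPosition);

// collocationFeatures now contains: "VB _", "_ TO", "RB VB _", "_ TO VB"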

Statistics collection

Statistics of occurrence of the features are collected in the process as well; namely, for each feature F collected for a target word W, we record:
N11 : Number of documents (sentences) containing both F and W .
N10 : Number of documents containing F only.
N01 : Number of documents containing W only.
N00 : Number of documents where neither W nor F were found.
N   : Total number of documents.
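The Stats type used by the feature selectors below is not shown in this excerpt; a minimal sketch of what it presumably holds (simply the contingency counts listed above) could be:

C#
// Hypothetical sketch of the Stats record consumed by the feature selectors:
// the contingency counts for one (feature F, target word W) pair.
public class Stats
{
    public int N11 { get; set; } // sentences containing both F and W
    public int N10 { get; set; } // sentences containing F but not W
    public int N01 { get; set; } // sentences containing W but not F
    public int N00 { get; set; } // sentences containing neither F nor W
    public int N { get; set; }   // total number of sentences
}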

Feature pruning

After these steps, an exhaustive list of all features found in the training corpus is generated. At this point, pruning criteria may be applied to eliminate unreliable or uninformative features. Two criteria are used (both make use of the aforementioned statistics of occurrence):
(1) Mutual information: how much information the presence/absence of a feature F contributes to making the correct classification decision on W.
(2) Chi-square test: a feature is pruned if its presence is not significantly correlated with the identity of the target word (determined by a chi-square test at the 0.05 significance level).

C#
public class MutualInformation : IFeatureSelector
    {
        #region Member Variables

        private readonly int k;

        #endregion

        #region Constructor

        public MutualInformation(int k)
        {
            this.k = k;
        }

        #endregion

        #region IFeatureSelector

        public IDictionary<string, Stats> Select(IDictionary<string, Stats> terms)
        {
            var features = new Dictionary<string, double>(terms.Count, StringComparer.OrdinalIgnoreCase);

            foreach (var term in terms)
            {
                var value = Calculate(n: term.Value.N,
                    n00: term.Value.N00,
                    n01: term.Value.N01,
                    n10: term.Value.N10,
                    n11: term.Value.N11);
                features.Add(term.Key, value);
            }

            var orderedFeatures = features
               .OrderByDescending(s => s.Value)
               .Select(s => s.Key);
            var prunedFeatures = orderedFeatures
                .Take(k)
               .ToDictionary(key => key, key => terms[key]);

            return prunedFeatures;
        }

        #endregion

        #region Private Methods

        private double Calculate(int n, double n00, double n01, double n10, double n11)
        {
            double n1_ = n10 + n11;
            double n_1 = n01 + n11;
            double n0_ = n00 + n01;
            double n_0 = n10 + n00;

            double sum = 0;
            sum += (n11 / n) * Math.Log((n * n11) / (n1_ * n_1), 2);
            sum += (n01 / n) * Math.Log((n * n01) / (n0_ * n_1), 2);
            sum += (n10 / n) * Math.Log((n * n10) / (n1_ * n_0), 2);
            sum += (n00 / n) * Math.Log((n * n00) / (n0_ * n_0), 2);

            return sum;
        }

        #endregion
    }
C#
 

public class ChiSquaredTest : IFeatureSelector
    {
        #region Member Variables

        private readonly double _significanceLevel;
        private readonly Dictionary<double, double> _distributionTable;

        #endregion

        #region Constructor

        public ChiSquaredTest(double significanceLevel)
        {
            this._significanceLevel = significanceLevel;
            _distributionTable = new Dictionary<double, double> { 
                {0.1,2.71},
                {0.05,3.84},
                {0.01,6.63},
                {0.005,7.88},
                {0.001,10.83}
            };
        }

        #endregion

        #region IFeatureSelector

        public IDictionary<string, Stats> Select(IDictionary<string, Stats> features)
        {
            var prunedFeatures = new Dictionary<string, Stats>(StringComparer.OrdinalIgnoreCase);

            foreach (var feature in features)
            {
                var featureValue = Calculate(feature.Value.N, feature.Value.N00, feature.Value.N01, feature.Value.N10, feature.Value.N11);

                if (featureValue > _distributionTable[_significanceLevel])
                {
                    prunedFeatures.Add(feature.Key, feature.Value);
                }
            }

            return prunedFeatures;
        }

        #endregion

        #region Private Methods

        private double Calculate(int n, double n00, double n01, double n10, double n11)
        {
            var numerator = n * Math.Pow(((n11 * n00) - (n10 * n01)), 2);
            var denominator = (n11 + n01) *
                (n11 + n10) *
                (n10 + n00) *
                (n01 + n00);
            var value = numerator / denominator;
            return value;
        }

        #endregion
    }
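A usage sketch chaining the two selectors, assuming a dictionary of per-feature statistics keyed by the feature string (the CollectStatistics helper and the cut-off of 1000 features are hypothetical):

C#
// Illustrative only: drop features that fail the chi-squared test at the 0.05 level,
// then keep the top 1000 features ranked by mutual information.
IDictionary<string, Stats> featureStats = CollectStatistics(); // hypothetical helper

IFeatureSelector chiSquared = new ChiSquaredTest(significanceLevel: 0.05);
IFeatureSelector mutualInformation = new MutualInformation(k: 1000);

IDictionary<string, Stats> prunedFeatures = mutualInformation.Select(chiSquared.Select(featureStats));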

 

Adding labelled samples

A training example consists of a sentence, represented as a set of active features, together with the word Wc in the confusion set that is correct for that sentence. 

C#
   private class RoughSample
        {
            public HashSet<string> Features { get; set; }
            public string Word { get; set; }
        }

Training

Now that we have a list of proper training data, we train a classifier for each word in our confusion set.

C#
var cloudClassifiers = new Dictionary<string, ISupervisedLearning[]>(confusionSet.Count());

foreach (var word in confusionSet)
{
    var classifiers = new ISupervisedLearning[1];
    int i = 0;

    for (double beta = 0.5; beta < 1 && i < classifiers.Length; beta += 0.1, i++)
    {
        classifiers[i] = new Winnow.Winnow(features.Length, threshold: 1, promotion: 1.5, demotion: beta, initialWeight: 1);
    }

    cloudClassifiers[word] = classifiers;
}

Training proceeds in an online manner: each example is treated as a positive example for the classifiers for Wc, and as a negative example for the classifiers for the other words in the confusion set (one vs. all).

C#
Parallel.ForEach(trainingData, sample =>
{
    var positive = new Sample
    {
        Class = true,
        Features = CreateFeaturesVector(sample.Features, features)
    };

    Sample negative = positive.ToggleClass();

    foreach (var cloud in cloudClassifiers)
    {
        var example = cloud.Key == sample.Word ? positive : negative;

        foreach (var classifier in cloud.Value)
        {
            classifier.Train(example);
        }
    }
});
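The CreateFeaturesVector helper is not shown in this excerpt; a minimal sketch of what it presumably does (projecting the feature strings found in one sentence onto the confusion set's global, ordered feature list) is:

C#
// Hypothetical sketch: turn the set of features active in one sentence into a
// Boolean vector aligned with the confusion set's global feature list.
private static bool[] CreateFeaturesVector(HashSet<string> activeFeatures, string[] allFeatures)
{
    var vector = new bool[allFeatures.Length];

    for (int i = 0; i < allFeatures.Length; i++)
    {
        vector[i] = activeFeatures.Contains(allFeatures[i]);
    }

    return vector;
}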

Prediction

When a sentence is presented, the system checks it against each confusion set; if the sentence contains a word from a confusion set, features are extracted and run against the corresponding classifiers, and the word whose classifiers give the highest score is returned as the prediction.


C#
        /// <summary>
        /// Checks every word in the given sentence for contextual correctness
        /// </summary>
        /// <param name="phrase">sentence to check</param>
        /// <returns>a Dictionary with the position of each wrong word in the sentence as Key and its correction as Value</returns>
        public Dictionary<int, string> Predict(string phrase)
        {
            string[] tokens = SplitIntoWords(phrase);
            var correctWords = new Dictionary<int, string>();

            foreach (var comparator in _comparators)
            {
                foreach (var confusedWord in comparator.ConfusionSet)
                {
                    for (int i = 0; i < tokens.Length; i++)
                    {
                        if (!tokens[i].Equals(confusedWord, StringComparison.OrdinalIgnoreCase))
                        {
                            continue;
                        }

                        string[] posTags;

                        lock (_lock)
                        {
                            posTags = _posTagger.Tag(tokens);
                        }

                        var features = new HashSet<string>(StringComparer.OrdinalIgnoreCase);

                        features.UnionWith(_contextFeaturesExtractor.Extract(tokens, i));
                        features.UnionWith(_collocationtFeaturesExtractor.Extract(posTags, i));

                        var sample = new Sample
                        {
                            Features = CreateFeaturesVector(features, comparator.Features)
                        };

                        var predictions = new Dictionary<string, double>(comparator.Cloud.Count);

                        foreach (var classifier in comparator.Cloud.Keys)
                        {
                            predictions[classifier] = comparator
                                                 .Cloud[classifier]
                                                 .Sum(c => c.GetScore(sample));
                        }

                        string correctedWord = predictions.Aggregate((a, b) => a.Value > b.Value ? a : b).Key;

                        if (!tokens[i].Equals(correctedWord, StringComparison.OrdinalIgnoreCase))
                        {
                            correctWords.Add(i, correctedWord);
                        }
                    }

                }
            }

            return correctWords;
        }
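Assuming corrector is a trained instance of the class containing the Predict method above, a hypothetical call might look like this:

C#
// Illustrative only: check a sentence and print any suggested corrections.
Dictionary<int, string> corrections = corrector.Predict("I would like to eat desert after dinner");

foreach (var correction in corrections)
{
    // Key = position of the suspect word in the sentence, Value = suggested replacement
    Console.WriteLine("Word #" + correction.Key + " should probably be '" + correction.Value + "'");
}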

Evaluation

In the experiments that follow, the training and test sets were drawn from the written part of the American National Corpus (OANC), about 11 million words, split randomly 80-20% by sentence, and 28 confusion sets were used.
 

 
Results
Confusion Set Accuracy %
where-were 89
to-too-two 95
hour-our 95.4
by-buy-bye 98.7
their-there 90
knew-new 95.98
later-latter 77.9
weather-whether 95
loose-lose 87
hear-here 89
threw-through 98.8
brake-break 95.83
peace-piece 86.7
cite-site-sight 88.79
passed-past 83
sea-see 92
quiet-quit-quite 79
weak-week 88
coarse-course 96.4
fourth-forth 92.9
council-counsel 89
principal-principle 70
vain-vane-vein 79
rain-reign-rein 80
desert-dessert 89.79
complement-compliment 92
plain-plane 100
waist-waste 87


Overall accuracy is 94%.

Feature extraction + training + testing took on average 59 minutes on my i7 3.4 GHz, 8 GB RAM machine.

The above results were achieved with pruning disabled; when enabled, accuracy went down and running time increased.
 

Areas of improvement

References

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)